Generating PDF/A compliant PDFs from pdftex

From STMDocs
Revision as of 01:42, 27 November 2008 by Thanh (talk | contribs) (→‎Annotation)

Introduction

This page describes the steps necessary to create PDF/A compliant PDFs with pdftex, and related issues. When we compile a latex document with pdftex, a few issues can prevent the result from being pdf/a compliant, such as:

  • problems with fonts:
    • font files are not embedded,
    • mismatch of character widths,
    • characters of zero widths,
    • fonts don't have a ToUnicode mapping
  • problems with metadata:
    • XMP data not included,
    • XMP data don't match the info in pdfInfo catalog.
  • problem with interword spacing: pdftex doesn't use space characters to separate words in pdf output.
  • problem with color data.

The usual way to verify if a pdf file is pdf/a compliant is to use a validating tool. There are a few pdf/a checking tools; the most common one is the Preflight tool in Acrobat Professional version 8 or newer. Beware that these checking tools can give very different results for a given pdf: a pdf file that passes pdf/a compliance checking in acrobat 8 can still fail a check by another tool. In this document, we assume the following:

  • input are latex documents
  • tex live 2008 (which includes pdftex version 1.40.9) is used for latexing
  • Acrobat 8.0 for pdf/a validation

We start with a minimal example, and then move to more complex ones, to illustrate the issues one may encounter when trying to achieve pdf/a compliance. All input files needed for the examples described in this wiki page are included in this zip.

A minimal example

Let's have a minimal document hello.tex that looks as follows: <geshi lang="latex">
\documentclass{report}
\begin{document}
Hello, world!
\end{document}
</geshi>

When we compile it with pdflatex and check for pdf/a compliance, we will get a report like this:
Report of checking hello.pdf

So it looks like our pdf is missing metadata. To fix this, we make a copy of hello.tex named hello-pdfa-1b.tex that looks as follows: <geshi lang="latex">
\documentclass{report}

%****************
% define metadata
%________________
\def\Title{An Example Document}
\def\Author{Some Name}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}

%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
   \getYear
}

{\catcode`\D=12
\gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
   \edef\xTZm{#1#2}%
   \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate

%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}

%*********
% XMP data
%_________
\usepackage{xmpincl}
\includexmp{pdfa-1b}

%********
% pdfInfo
%________
\pdfinfo{%
   /Title    (\Title)
   /Author   (\Author)
   /Subject  (\Subject)
   /Keywords (\Keywords)
   /ModDate  (\pdfcreationdate)
   /Trapped  /False
}

\begin{document}
Hello, world!
\end{document}
</geshi>
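The date macros above expect the time zone offset in \pdfcreationdate to start with a plus sign, so a date with a negative offset would not be parsed. A minimal sketch of a variant that also accepts a minus sign is given below. The names \getTZsign and \xTZsign are our own and are not part of the original example, and a date that ends in Z instead of a numeric offset is still not handled. These definitions would replace \getSec, \getTZh and \getTZm before the line \expandafter\convertDate\pdfcreationdate: <geshi lang="latex">
% sign-aware variant (sketch): \getTZsign and \xTZsign are hypothetical names
\def\getSec#1#2{\edef\xSec{#1#2}\getTZsign}
% grab the sign character of the offset, which is either + or -
\def\getTZsign#1{\edef\xTZsign{#1}\getTZh}
\def\getTZh#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
   \edef\xTZm{#1#2}%
   \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec\xTZsign\xTZh:\xTZm}%
}
</geshi>

With these definitions, D:20080419103507-05'00' should be converted to 2008-04-19T10:35:07-05:00.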

Some notes on the example:

  • it uses the latex package xmpincl to include XMP data to the pdf;
  • it assumes there is a file pdfa-1b.xmp in the current directory.

When we check the resulting pdf using acrobat 8, we get this report:
Report of checking hello-pdfa-1b.pdf

With a little more effort, we can make our example pass pdf/a-1a checking:

    • replace

<geshi lang="latex"> \includexmp{pdfa-1b} </geshi> by <geshi lang="latex"> \includexmp{pdfa-1a} </geshi>

    • add the following code:

<geshi lang="latex">
%*************************
% explicit interword space
%_________________________
\pdfmapline{+dummy-space <dummy-space.pfb}
\pdfgeninterwordspace=1
</geshi>

Compile the file with a pdftex patched to support \pdfgeninterwordspace, and we should get this report from checking:
Report of checking hello-pdfa-1a.pdf

Another trivial example

Let's apply what we did above to another example: small2e.tex, which is part of the standard latex distribution.

  • We put all the additional latex code in a file called pdfa-supp.tex, which looks as follows:

<geshi lang="latex">
%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
   \getYear
}

{\catcode`\D=12
\gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
   \edef\xTZm{#1#2}%
   \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate

%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}

%********
% pdfInfo
%________
\pdfinfo{%
   /Title    (\Title)
   /Author   (\Author)
   /Subject  (\Subject)
   /Keywords (\Keywords)
   /ModDate  (\pdfcreationdate)
   /Trapped  /False
}

%*************************
% explicit interword space
%_________________________
\expandafter\ifx\csname pdfgeninterwordspace\endcsname\relax
   \message{\string\pdfgeninterwordspace\space not supported by this version of pdftex}
\else
   \pdfmapline{+dummy-space <dummy-space.pfb}
   \pdfgeninterwordspace=1
\fi
</geshi>
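Note that pdfa-supp.tex uses \Title, \Author, \Subject and \Keywords, so they must be defined before the file is input. A minimal sketch of a check that could be placed at the top of pdfa-supp.tex is given below. The helper \requireMeta is a hypothetical name of our own, written with the same \csname test that is used for \pdfgeninterwordspace above: <geshi lang="latex">
% \requireMeta is a hypothetical helper, not part of the original files:
% it raises an error if the metadata macro named #1 is not defined
\def\requireMeta#1{%
   \expandafter\ifx\csname #1\endcsname\relax
      \errmessage{Please define \expandafter\string\csname #1\endcsname
         \space before \string\input{pdfa-supp}}%
   \fi
}
\requireMeta{Title}
\requireMeta{Author}
\requireMeta{Subject}
\requireMeta{Keywords}
</geshi>

This way a missing definition is reported at the point where pdfa-supp.tex is loaded.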

  • let's add these lines to small2e.tex (after the line containing \documentclass{article}) to make it pass the pdf/a-1b check:

<geshi lang="latex">
\def\Title{An Example Document}
\def\Author{Leslie Lamport}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
\input{pdfa-supp}
\usepackage{xmpincl}
\includexmp{pdfa-1b}
</geshi>

  • to pass the pdf/a-1a check, we change

<geshi lang="latex"> \includexmp{pdfa-1b} </geshi> to <geshi lang="latex"> \includexmp{pdfa-1a} </geshi> and compile the file by pdftex with the patch mentioned above.

The checking results should be the same as for the minimal example.

A less trivial example

Now let's move on to sample2e.tex, another sample that is part of the latex distribution. Again, let's start with the pdf/a-1b check: we add the same lines as we did for small2e.tex above, and the result is the same. However, with the pdf/a-1a check we are not as lucky as before:
Report of checking sample2e-pdfa-1a.pdf

To fix this, we need to add these magic lines: <geshi lang="latex">
\input glyphtounicode.tex
\input glyphtounicode-cmr.tex
\pdfgentounicode=1
</geshi>

The above code causes pdftex to generate a ToUnicode mapping for all embedded Type1 fonts.
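The glyphtounicode files consist of \pdfglyphtounicode entries, each of which maps a glyph name to a Unicode code point. If a font uses a glyph name that is not covered by these files, an entry can be added by hand. A minimal sketch is given below. The glyph name myarrowright is a made-up placeholder, and the real name has to be taken from the font in question: <geshi lang="latex">
\input glyphtounicode.tex
\pdfgentounicode=1
% hypothetical glyph name, mapped here to U+2192 (rightwards arrow)
\pdfglyphtounicode{myarrowright}{2192}
</geshi>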

A slightly more complex example

Let's continue with the example at http://www.tug.org/pracjourn/2006-2/eglen/. Based on what we did before, our first attempt is to add these lines to the preamble: <geshi lang="latex">
\def\Title{An Example Document}
\def\Author{Stephen Eglen}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
\input{pdfa-supp}
\usepackage{xmpincl}
\includexmp{pdfa-1b}
</geshi>

But this time we get more errors than previously:
Report from checking intro2-pdfa-1b.pdf

There are 2 problems:

  • some fonts are not embedded. This is caused by the pdf figure included in the example. We fix this by loading the pdf in Inkscape and saving it again with all text converted to curves. Not the ideal approach, but it's a fast solution for the problem we are facing: how to get rid of non-embedded fonts from the pdf figure.
  • the Color profile is not defined.

So we fix the pdf figure, and add these lines to the preamble: <geshi lang="latex">
% /N is the number of color components in the ICC profile (3 for an RGB profile such as sRGB)
\immediate\pdfobj stream attr{/N 3} file{sRGBIEC1966-2.1.icm}
\pdfcatalog{%
 /OutputIntents [ <<
 /Type /OutputIntent
 /S/GTS_PDFA1
 /DestOutputProfile \the\pdflastobj\space 0 R
 /OutputConditionIdentifier (sRGB IEC61966-2.1)
 /Info(sRGB IEC61966-2.1)
>> ]
}
</geshi>

After this step, the output should pass the pdf/a-1b check. For the pdf/a-1a check, we replace \includexmp{pdfa-1b} by \includexmp{pdfa-1a} and compile with the patched pdftex.
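If the same output intent is needed in several documents, the lines could be wrapped in a macro, for example in pdfa-supp.tex. A minimal sketch is given below. The macro name \pdfaOutputIntent and the order of its arguments are our own invention for illustration. The second argument is the number of color components of the ICC profile (3 for an RGB profile, 4 for a CMYK profile): <geshi lang="latex">
% hypothetical helper macro, not part of the original files
% #1 = ICC profile file, #2 = number of color components, #3 = identifier
\def\pdfaOutputIntent#1#2#3{%
   \immediate\pdfobj stream attr{/N #2} file{#1}%
   \pdfcatalog{%
      /OutputIntents [ <<
      /Type /OutputIntent
      /S /GTS_PDFA1
      /DestOutputProfile \the\pdflastobj\space 0 R
      /OutputConditionIdentifier (#3)
      /Info (#3)
      >> ]
   }%
}
\pdfaOutputIntent{sRGBIEC1966-2.1.icm}{3}{sRGB IEC61966-2.1}
</geshi>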

Further notes

It's hard to ensure pdf/a compliance for an arbitrary latex document. Here is a brief summary of issues we have experienced:

Zero charwidth

Some fonts have glyphs with zero character width, which Acrobat reports as "Width information incomplete". Examples of such fonts are:

  • cmsy
  • euler
  • MathTime Plus
  • stmary
  • xypic

Usually we fix the problem with these steps:

  • convert the original tfm to pl by tftopl
  • fix chars with zero width in the pl (0.0 -> 0.001)
  • convert the pl back to tfm by pltotf
  • convert the original pfb to txt by t1disasm
  • fix chars with zero width in the *.txt (0 hsbw -> 1 hsbw)
  • convert the txt back to pfb by t1asm
  • for each original pl, create a vpl which simply maps each char to itself
  • convert the vpl to vf/tfm

Example: let's have mtsy.tfm and mtsy.pfb where a few characters have zero widths. Applying the above steps, we get:

  • mtsy2.tfm = original mtsy.tfm, except that chars with zero widths are replaced by 0.001
  • mtsy2.pfb = mtsy.pfb, except that chars with zero widths (0 hsbw) are replaced by 1 hsbw
  • mtsy.vf: map each char to the same char in mtsy2
  • mtsy.tfm: identical to original mtsy.tfm

The reason for using virtual fonts is that we don't want to make any change to mtsy.tfm, so that our workaround doesn't change the layout at all. TeX still sees the same font.
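Since mtsy.vf refers to mtsy2, pdftex also needs a map entry that connects mtsy2 to the fixed font file. A minimal sketch is given below, written in the same form as the dummy-space map line used earlier. We assume that mtsy2.tfm and mtsy2.pfb are installed where pdftex can find them: <geshi lang="latex">
% map entry for the fixed font: tfm name, then the Type1 file to embed
\pdfmapline{+mtsy2 <mtsy2.pfb}
</geshi>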

Character width mismatch

Sometimes a Type1 font uses the div operator to specify the charwidth (hsbw) with better precision, and the result doesn't match the value in the tfm. In such cases, we bite the bullet and fix both the tfm and the type1 font by converting them to text format, processing the text with a script, and converting them back to tfm/type1. Luckily such fonts are not very common (so far we have encountered only one, xycirc10 from xypic).

Color

to be added....

Annotation

to be added....