Generating PDF/A compliant PDFs from pdftex

From STMDocs

Introduction

This page describes the steps needed to create pdf/a compliant PDFs with pdftex, and related issues. When we compile a latex document with pdftex, a few issues can prevent the result from being pdf/a compliant, such as:

  • problems with fonts:
    • font files are not embedded,
    • mismatch of character widths,
    • characters of zero widths,
    • fonts don't have a ToUnicode mapping
  • problems with metadata:
    • XMP data not included,
    • XMP data do not match the information in the pdfInfo dictionary.
  • problem with interword spacing: pdftex does not use space characters to separate words in the pdf output.
  • problem with color data.

The usual way to verify whether a pdf file is pdf/a compliant is to use a validating tool. There are a few pdf/a checking tools; the most common one is the Preflight tool in Acrobat Professional version 8 or newer. Beware that these checking tools can give very different results for the same pdf: a file that passes the pdf/a compliance check in acrobat 8 can still fail a check by another tool. In this document, we assume the following:

  • the input is a latex document
  • tex live 2008 (which includes pdftex version 1.40.9) is used for latexing
  • Acrobat 8.0 is used for pdf/a validation

We start with a minimal example and then move to more complex ones, to illustrate the issues one may encounter when trying to achieve pdf/a compliance.

A minimal example

Let's have a minimal document hello.tex that looks as follows:
<geshi lang="latex">
\documentclass{report}
\begin{document}
Hello, world!
\end{document}
</geshi>

When we compile it with pdflatex and check for pdf/a compliance, we get a report like this:
Report of checking hello.pdf

So it looks like our pdf is missing metadata. To fix this, we make a copy of hello.tex named hello-pdfa-1b.tex that looks as follows: <geshi lang="latex">
\documentclass{report}

%****************
% define metadata
%________________
\def\Title{An Example Document}
\def\Author{Some Name}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}

%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
   \getYear
}

{\catcode`\D=12
\gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
   \edef\xTZm{#1#2}%
   \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate

%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}

%*********
% XMP data
%_________
\usepackage{xmpincl}
\includexmp{pdfa-1b}

%********
% pdfInfo
%________
\pdfinfo{%
   /Title    (\Title)
   /Author   (\Author)
   /Subject  (\Subject)
   /Keywords (\Keywords)
   /ModDate  (\pdfcreationdate)
   /Trapped  /False
}

\begin{document}
Hello, world!
\end{document}
</geshi>

Some notes on the example:

  • it uses the latex package xmpincl to include XMP data to the pdf;
  • it assumes there is a file pdfa-1b.xmp in the current directory. That file is included in this zip
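
One limitation of the date conversion above is worth noting: \getTZh is delimited by a + sign, so it only matches time zone offsets written with a plus sign. A minimal sketch of a variant that also accepts a negative offset such as -05'00' could replace the definitions of \getSec, \getTZh and \getTZm as follows. The names \getTZsign and \xTZsign are introduced here for illustration only, and the sketch still assumes that \pdfcreationdate ends with a signed offset of the form shown in the comment above \convertDate.

<geshi lang="latex">
\def\getSec#1#2{\edef\xSec{#1#2}\getTZsign}
% pick up the sign (+ or -) of the time zone offset (hypothetical helper)
\def\getTZsign#1{\edef\xTZsign{#1}\getTZh}
\def\getTZh#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
   \edef\xTZm{#1#2}%
   \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec\xTZsign\xTZh:\xTZm}%
}
</geshi>

With this variant, D:20080419103507-05'00' would be converted to 2008-04-19T10:35:07-05:00.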

When we check the resulting pdf using acrobat 8, we get this report:
Report of checking hello-pdfa-1b.pdf

With a little more effort, we can make our example pass the pdf/a-1a check:

  • replace

<geshi lang="latex"> \includexmp{pdfa-1b} </geshi> by <geshi lang="latex"> \includexmp{pdfa-1a} </geshi>

  • add the following code:

<geshi lang="latex">
%*************************
% explicit interword space
%_________________________
\pdfmapline{+dummy-space <dummy-space.pfb}
\pdfgeninterwordspace=1
</geshi>

Compile the file with a patched pdftex, and we should get this report from the check:
Report of checking hello-pdfa-1a.pdf

Another trivial example

Let's apply what we did above to another example: small2e.tex, which is part of the standard latex distribution.

  • We put all the additional latex code in a file called pdfa-supp.tex, which looks as follows:

<geshi lang="latex">
%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
   \getYear
}

{\catcode`\D=12
\gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
   \edef\xTZm{#1#2}%
   \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate

%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}

%********
% pdfInfo
%________
\pdfinfo{%
   /Title    (\Title)
   /Author   (\Author)
   /Subject  (\Subject)
   /Keywords (\Keywords)
   /ModDate  (\pdfcreationdate)
   /Trapped  /False
}

%*************************
% explicit interword space
%_________________________
\expandafter\ifx\csname pdfgeninterwordspace\endcsname\relax
   \message{\string\pdfgeninterwordspace\space not supported by this version of pdftex}
\else
   \pdfmapline{+dummy-space <dummy-space.pfb}
   \pdfgeninterwordspace=1
\fi
</geshi>

  • to make small2e.tex pass the pdf/a-1b check, we add these lines to it (after the line containing \documentclass{article}):

<geshi lang="latex">
\def\Title{An Example Document}
\def\Author{Leslie Lamport}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
\input{pdfa-supp}
\usepackage{xmpincl}
\includexmp{pdfa-1b}
</geshi>

  • to pass the pdf/a-1a check, we change

<geshi lang="latex"> \includexmp{pdfa-1b} </geshi> to <geshi lang="latex"> \includexmp{pdfa-1a} </geshi> and compile the file by pdftex with the patch mentioned above.

The check reports should be the same as for the minimal example.

A less trivial example

Now let's move on to sample2e.tex, another sample that is part of the latex distribution. Again, let's start with the pdf/a-1b check: we add the same lines as we did for small2e.tex above, and the result is the same. However, with the pdf/a-1a check we are not as lucky as before:
Report of checking sample2e-pdfa-1a.pdf

To fix this, we need to add these magic lines:
<geshi lang="latex">
\input glyphtounicode.tex
\input glyphtounicode-cmr.tex
\pdfgentounicode=1
</geshi>

The above code makes pdftex generate a ToUnicode mapping for all embedded Type1 fonts.
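
As a hedged sketch of how the mapping can be extended: glyphtounicode.tex consists of \pdfglyphtounicode lines, each of which maps a glyph name to one or more Unicode code points, so a glyph that the checker still reports as unmapped can be given its own entry in the preamble. The glyph name myarrow below is hypothetical and chosen only for illustration; the real name has to be looked up in the encoding of the font in question.

<geshi lang="latex">
\input glyphtounicode.tex
\input glyphtounicode-cmr.tex
% hypothetical glyph name, mapped to U+2192 (rightwards arrow)
\pdfglyphtounicode{myarrow}{2192}
\pdfgentounicode=1
</geshi>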

A slightly more complex example

Let's continue with the example at http://www.tug.org/pracjourn/2006-2/eglen/. Based on what we did before, our first attempt is to add these lines to the preamble:
<geshi lang="latex">
\def\Title{An Example Document}
\def\Author{Stephen Eglen}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
\input{pdfa-supp}
\usepackage{xmpincl}
\includexmp{pdfa-1b}
</geshi>

But this time we get more errors than before:
Report from checking intro2-pdfa-1b.pdf

There are two problems:

  • some fonts are not embedded. This is caused by the pdf figure included in the example. We fix this by loading the figure in Inkscape and saving it again with all text converted to curves. This is not the ideal approach, but it is a fast solution to the problem at hand: how to get rid of non-embedded fonts in the pdf figure.
  • the color profile is not defined.

So we fix the pdf figure, and add these lines to the preamble: <geshi lang="latex">
% /N is the number of color components of the profile; sRGB has 3
\immediate\pdfobj stream attr{/N 3} file{sRGBIEC1966-2.1.icm}
\pdfcatalog{%
 /OutputIntents [ <<
 /Type /OutputIntent
 /S/GTS_PDFA1
 /DestOutputProfile \the\pdflastobj\space 0 R
 /OutputConditionIdentifier (sRGB IEC61966-2.1)
 /Info(sRGB IEC61966-2.1)
>> ]
}
</geshi>
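
Since the \pdfobj line reads the color profile from disk, it can help to check for the file first and stop with a readable message. A minimal sketch of such a guard, to be placed before the lines above, assuming the profile is looked up under the same file name; the label pdfa-supp in the message is just a hypothetical tag, not an existing package:

<geshi lang="latex">
\IfFileExists{sRGBIEC1966-2.1.icm}{}{%
   \PackageError{pdfa-supp}%
      {Color profile sRGBIEC1966-2.1.icm not found}%
      {Copy the profile into the current directory or the TeX search path.}%
}
</geshi>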