Generating PDF/A compliant PDFs from pdftex

Revision as of 23:30, 23 November 2008

Introduction

This page describes the steps needed to create PDF/A compliant PDFs with pdftex, and related issues. When we compile a latex document with pdftex, several issues can prevent the result from being pdf/a compliant, such as:

  • problems with fonts:
    • font files are not embedded,
    • mismatch of character widths,
    • characters of zero widths,
    • fonts don't have a ToUnicode mapping
  • problems with metadata:
    • XMP data not included,
    • XMP data do not match the info in the pdfInfo catalog.
  • problem with interword spacing: pdftex does not use a space character to separate words in the pdf output.
  • problem with color data.
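
The color data issue is not worked through on this page. As a rough sketch only: PDF/A validators generally expect an output intent with an embedded ICC profile when device-dependent color is used, and with pdftex this is commonly set up along the following lines. The file name sRGB.icc and the identifier strings are assumptions made for illustration; substitute the profile you actually have.

<geshi lang="latex">
% Sketch only: sRGB.icc is a hypothetical file name for an RGB ICC profile
% that must be findable by pdftex (for example in the current directory).
\immediate\pdfobj stream attr{/N 3} file{sRGB.icc}
% Reference the profile object just written from the document catalog.
\pdfcatalog{%
  /OutputIntents [ <<
    /Type /OutputIntent
    /S /GTS_PDFA1
    /DestOutputProfile \the\pdflastobj\space 0 R
    /OutputConditionIdentifier (sRGB IEC61966-2.1)
    /Info (sRGB IEC61966-2.1)
  >> ]
}
</geshi>

Here /N 3 states that the profile has three color components, which fits an RGB profile; a different kind of profile would need a different value.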

The usual way to verify whether a pdf file is pdf/a compliant is to use a validating tool. There are a few pdf/a checking tools; the most common one is the Preflight tool in Acrobat Professional version 8 or newer. Beware that these checking tools can give very different results for the same pdf: a file that passes the pdf/a compliance check in acrobat 8 can still fail a check by another tool. In this document, we assume the following:

  • the input files are latex documents
  • tex live 2008 (which includes pdftex version 1.40.9) is used for latexing
  • Acrobat 8.0 is used for pdf/a validation
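
Since everything below relies on pdftex primitives, it can help to have the document itself report when it is compiled with something else. The following is a minimal sketch, not part of the original setup; the message texts are made up, and the threshold 140 simply mirrors the pdftex 1.40.x assumption stated above.

<geshi lang="latex">
% Sketch: report when the engine is not pdftex or is older than 1.40.
% \pdftexversion holds 140 for pdftex 1.40.x.
\expandafter\ifx\csname pdftexversion\endcsname\relax
  \message{This document expects to be compiled with pdftex}
\else
  \ifnum\pdftexversion<140
    \message{pdftex 1.40 or newer is assumed by these examples}
  \fi
\fi
</geshi>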

We start with a minimal example and then move on to more complex ones, to illustrate the issues one may encounter when trying to achieve pdf/a compliance.

A minimal example

Let's have a minimal document hello.tex that looks as follows:
<geshi lang="latex">
\documentclass{report}
\begin{document}
Hello, world!
\end{document}
</geshi>

When we compile it with pdflatex and check for pdf/a compliance, we get a report like this:
[Image: Report of checking hello.pdf]

So it looks like our pdf is missing metadata. To fix this, we make a copy of hello.tex named hello-pdfa-1b.tex that looks as follows:
<geshi lang="latex">
\documentclass{report}

%****************
% define metadata
%________________
\def\Title{An Example Document}
\def\Author{Some Name}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}

%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
    \getYear
}

{\catcode`\D=12
 \gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate

%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}

%*********
% XMP data
%_________
\usepackage{xmpincl}
\includexmp{pdfa-1b}

%********
% pdfInfo
%________
\pdfinfo{%
    /Title    (\Title)
    /Author   (\Author)
    /Subject  (\Subject)
    /Keywords (\Keywords)
    /ModDate  (\pdfcreationdate)
    /Trapped  /False
}

\begin{document}
Hello, world!
\end{document}
</geshi>

Some notes on the example:

  • it uses the latex package xmpincl to include XMP data in the pdf;
  • it assumes there is a file pdfa-1b.xmp in the current directory. That file is included in this zip
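
As a quick sanity check of the date macros, we can feed \convertDate the fixed date string from its own comment and print the result to the log. This is only a sketch: it assumes the macros from hello-pdfa-1b.tex are already defined, and that the e-TeX primitive \detokenize is available (it is in the pdftex-based latex formats of tex live).

<geshi lang="latex">
% \detokenize turns every character, including D, into a catcode-12
% "other" character, which is the form \getYear expects because
% \pdfcreationdate expands to such characters.
\expandafter\convertDate\detokenize{D:20080419103507+02'00'}
\message{converted date: \convDate}
% writes to the log: converted date: 2008-04-19T10:35:07+02:00
</geshi>

Note that the macros as written expect a time zone of the form +hh'mm'. A date string whose offset is written differently, for example with a minus sign, would not match the delimited arguments of \getTZh, so the macros would need to be extended for that case.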

When we check the resulting pdf using acrobat 8, we get this report:
[Image: Report of checking hello-pdfa-1b.pdf]

With a little more effort, we can make our example pass the pdf/a-1a check:

  • replace

<geshi lang="latex">
\includexmp{pdfa-1b}
</geshi>
by
<geshi lang="latex">
\includexmp{pdfa-1a}
</geshi>

  • add the following code:

<geshi lang="latex">
%*************************
% explicit interword space
%_________________________
\pdfmapline{+dummy-space <dummy-space.pfb}
\pdfgeninterwordspace=1
</geshi>

Compile the file with the patched pdftex, and we should get this report from the check:
[Image: Report of checking hello-pdfa-1a.pdf]

Another trivial example

Let's apply what we did above to another example: small2e.tex, which is part of the standard latex distribution.

  • We put all the additional latex code in a file called pdfa-supp.tex, which looks as follows:

<geshi lang="latex">
%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
    \getYear
}

{\catcode`\D=12
 \gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate

%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}

%********
% pdfInfo
%________
\pdfinfo{%
    /Title    (\Title)
    /Author   (\Author)
    /Subject  (\Subject)
    /Keywords (\Keywords)
    /ModDate  (\pdfcreationdate)
    /Trapped  /False
}

%*************************
% explicit interword space
%_________________________
\expandafter\ifx\csname pdfgeninterwordspace\endcsname\relax
    \message{\string\pdfgeninterwordspace\space not supported by this version of pdftex}
\else
    \pdfmapline{+dummy-space <dummy-space.pfb}
    \pdfgeninterwordspace=1
\fi
</geshi>

  • let's add these lines to the preamble of small2e.tex to make it pass the pdf/a-1b check:

<geshi lang="latex">
\def\Title{An Example Document}
\def\Author{Leslie Lamport}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
\input{pdfa-supp}
\usepackage{xmpincl}
\includexmp{pdfa-1b}
</geshi>

  • to pass the pdf/a-1a check, we change

<geshi lang="latex">
\includexmp{pdfa-1b}
</geshi>
to
<geshi lang="latex">
\includexmp{pdfa-1a}
</geshi>
and compile the file with pdftex with the patch mentioned above.

The result should be the same.

A less trivial example

Now let's move on to sample2e.tex, which is another sample that is part of the latex distribution. Again, let's start with the pdf/a-1b check: we add the same lines as we did for small2e.tex above, and the result is the same. However, with the pdf/a-1a check we are not as lucky as before:
[[Image:Sample2e-pdfa-1a-report.png|Report of checking sample2e-pdfa-1a.pdf]]

To fix this, we need to add these magic lines:
<geshi lang="latex">
\input glyphtounicode.tex
\input glyphtounicode-cmr.tex
\pdfgentounicode=1
</geshi>
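
For context, a file such as glyphtounicode.tex consists of \pdfglyphtounicode entries, each mapping a glyph name to one or more Unicode code points written in hexadecimal; \pdfgentounicode=1 then tells pdftex to use these entries to generate ToUnicode mappings for the fonts. If a font uses a glyph name that is not covered, an extra entry can be added by hand. The sketch below is illustrative only: the two glyph names are examples, and whether any extra entries are needed depends on the fonts in use.

<geshi lang="latex">
\input glyphtounicode.tex
\pdfgentounicode=1
% Illustrative extra mappings: glyph name, then Unicode value(s) in hex.
\pdfglyphtounicode{ff}{0066 0066}   % "ff" ligature maps to two letters f
\pdfglyphtounicode{fi}{0066 0069}   % "fi" ligature maps to f followed by i
</geshi>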