Generating PDF/A compliant PDFs from pdftex

Revision as of 23:17, 23 November 2008

Introduction

This page describes the steps needed to create PDF/A compliant PDFs with pdftex, along with related issues. When we compile a latex document with pdftex, a few issues can prevent the result from being pdf/a compliant, such as:

  • problems with fonts:
    • font files are not embedded,
    • mismatch of character widths,
    • characters of zero widths,
    • fonts don't have a ToUnicode mapping
  • problems with metadata:
    • XMP data not included,
    • XMP data don't match the info in pdfInfo catalog.
  • problem with interword spacing: pdftex doesn't use a space character to separate words in the pdf output.
  • problem with color data.
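
As a rough sketch of how two of the items above are commonly addressed in pdftex (these lines are not part of the examples on this page): a ToUnicode mapping can be generated with \pdfgentounicode, and color data is usually handled by declaring an output intent in the pdf catalog. The file name sRGB.icc is a placeholder for an ICC profile that you supply yourself, and the availability of glyphtounicode.tex in your distribution is an assumption.

<geshi lang="latex">
% Assumption: glyphtounicode.tex is installed with the TeX distribution.
\input{glyphtounicode}
\pdfgentounicode=1

% Assumption: sRGB.icc is an RGB ICC profile in the current directory
% (placeholder name, not a file provided by this page).
\immediate\pdfobj stream attr{/N 3} file{sRGB.icc}
\pdfcatalog{%
    /OutputIntents [ <<
        /Type /OutputIntent
        /S /GTS_PDFA1
        /DestOutputProfile \the\pdflastobj\space 0 R
        /OutputConditionIdentifier (sRGB IEC61966-2.1)
        /Info (sRGB IEC61966-2.1)
    >> ]
}
</geshi>

Whether these lines are needed depends on the fonts and colors a document uses; the validation report shows which of the issues actually apply.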

The usual way to verify whether a pdf file is pdf/a compliant is to use a validating tool. There are a few pdf/a checking tools; the most common one is the Preflight tool in Acrobat Professional version 8 or newer. Beware that these checking tools can give very different results for the same pdf: a file that passes pdf/a compliance checking in acrobat 8 can still fail a check by another tool. In this document, we assume the following:

  • the input files are latex documents
  • tex live 2008 (which includes pdftex version 1.40.9) is used for latexing
  • Acrobat 8.0 is used for pdf/a validation

We start with a minimal example and then move on to more complex ones, to illustrate the issues one may encounter when trying to achieve pdf/a compliance.

A minimal example

Let's have a minimal document hello.tex that looks as follows:

<geshi lang="latex">
\documentclass{report}
\begin{document}
Hello, world!
\end{document}
</geshi>

When we compile it with pdflatex and check for pdf/a compliance, we get a report like this:
Report of checking hello.pdf

So it looks like our pdf is missing metadata. To fix this, we make a copy of hello.tex named hello-pdfa-1b.tex that looks as follows:

<geshi lang="latex">
\documentclass{report}

%****************
% define metadata
%________________
\def\Title{An Example Document}
\def\Author{Some Name}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}

%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
    \getYear
}

{\catcode`\D=12
\gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate

%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}

%*********
% XMP data
%_________
\usepackage{xmpincl}
\includexmp{pdfa-1b}

%********
% pdfInfo
%________
\pdfinfo{%
    /Title    (\Title)
    /Author   (\Author)
    /Subject  (\Subject)
    /Keywords (\Keywords)
    /ModDate  (\pdfcreationdate)
    /Trapped  /False
}

\begin{document}
Hello, world!
\end{document}
</geshi>

Some notes on the example:

  • it uses the latex package xmpincl to include XMP data in the pdf;
  • it assumes there is a file pdfa-1b.xmp in the current directory. That file is included in this zip

When we check the resulting pdf using acrobat 8, we get this report:
Report of checking hello-pdfa-1b.pdf

With a little more effort, we can make our example pass pdf/a-1a checking:

  • replace

<geshi lang="latex">
\includexmp{pdfa-1b}
</geshi>

by

<geshi lang="latex">
\includexmp{pdfa-1a}
</geshi>

  • add the following code:

<geshi lang="latex">
%*************************
% explicit interword space
%_________________________
\pdfmapline{+dummy-space <dummy-space.pfb}
\pdfgeninterwordspace=1
</geshi>

Compile the file with a patched pdftex (one that supports \pdfgeninterwordspace), and we should get this report from checking:
Report of checking hello-pdfa-1a.pdf

Another trivial example

Let's apply what we did above to another example: small2e.tex, which is part of the standard latex distribution.

  • We put all the additional latex code in a file called pdfa-supp.tex, which looks as follows:

<geshi lang="latex">
%***************************************************************************
% \convertDate converts D:20080419103507+02'00' to 2008-04-19T10:35:07+02:00
%___________________________________________________________________________
\def\convertDate{%
    \getYear
}

{\catcode`\D=12
\gdef\getYear D:#1#2#3#4{\edef\xYear{#1#2#3#4}\getMonth}
}
\def\getMonth#1#2{\edef\xMonth{#1#2}\getDay}
\def\getDay#1#2{\edef\xDay{#1#2}\getHour}
\def\getHour#1#2{\edef\xHour{#1#2}\getMin}
\def\getMin#1#2{\edef\xMin{#1#2}\getSec}
\def\getSec#1#2{\edef\xSec{#1#2}\getTZh}
\def\getTZh +#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec+\xTZh:\xTZm}%
}

\expandafter\convertDate\pdfcreationdate


%**************************
% get pdftex version string
%__________________________
\newcount\countA
\countA=\pdftexversion
\advance \countA by -100
\def\pdftexVersionStr{pdfTeX-1.\the\countA.\pdftexrevision}


%********
% pdfInfo
%________
\pdfinfo{%
    /Title    (\Title)
    /Author   (\Author)
    /Subject  (\Subject)
    /Keywords (\Keywords)
    /ModDate  (\pdfcreationdate)
    /Trapped  /False
}


%*************************
% explicit interword space
%_________________________

\expandafter\ifx\csname pdfgeninterwordspace\endcsname\relax
    \message{\string\pdfgeninterwordspace\space not supported by this version of pdftex}
\else
    \pdfmapline{+dummy-space <dummy-space.pfb}
    \pdfgeninterwordspace=1
\fi
</geshi>

  • We add these lines to small2e.tex to make it pass the pdf/a-1b check:

<geshi lang="latex">
\def\Title{An Example Document}
\def\Author{Leslie Lamport}
\def\Subject{An Example Document}
\def\Keywords{LaTeX,Example,Document}
\input{pdfa-supp}
\usepackage{xmpincl}
\includexmp{pdfa-1b}
</geshi>

  • To pass the pdf/a-1a check, we change

<geshi lang="latex">
\includexmp{pdfa-1b}
</geshi>

to

<geshi lang="latex">
\includexmp{pdfa-1a}
</geshi>

and compile the file with pdftex with the patch mentioned above.

The result should be the same.
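
One caveat about pdfa-supp.tex: \getTZh expects a literal + before the time zone hours, so a creation date written with a negative offset would not match its definition. The following is a minimal sketch of one way to keep the sign, assuming the offset is always written as a sign followed by hh'mm'. The name \xTZsign is introduced here and is not part of the original file. The three definitions would replace the original ones, before the line that calls \convertDate.

<geshi lang="latex">
% Sketch only: variants of three macros from pdfa-supp.tex that store the
% sign of the time zone offset instead of assuming it is always +.
% \xTZsign is a hypothetical name, not used elsewhere on this page.
\def\getSec#1#2#3{\edef\xSec{#1#2}\edef\xTZsign{#3}\getTZh}
\def\getTZh#1#2{\edef\xTZh{#1#2}\getTZm}
\def\getTZm '#1#2'{%
    \edef\xTZm{#1#2}%
    \edef\convDate{\xYear-\xMonth-\xDay T\xHour:\xMin:\xSec\xTZsign\xTZh:\xTZm}%
}
</geshi>

With these definitions, a date such as D:20080419103507-05'00' would be converted to 2008-04-19T10:35:07-05:00. A date that carries no numeric offset at all is still not handled by this sketch.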

A less trivial example

Now let's move on to sample2e.tex, which is another sample that is part of the latex distribution.