Has anyone tried PdfPig against the collection of PDFs provided by the PDF Association? #895

Greybird · 2024-09-01T11:29:57Z


Hi,

First, I'd like to thank every maintainer and contributor of PdfPig for the awesome library!

For my company, I have been working on an internal corpus of around 3 million documents (sadly all containing personal information, and thus not shareable) to validate our ability to extract some document information using PdfPig.
I have already contributed a few PRs to fix some PDF parsing issues I found, and I am now at a point where fewer than 3000 documents lead to an error, and fewer than 500 of them lead to an error not encountered by the itext library, which we aim to replace.

Out of curiosity, has anyone already tried running PdfPig against one of the PDF datasets provided by the PDF Association?
For example: https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
I think the whole list of datasets is available here: https://pdfa.org/resources/#Test_suites_and_protocols.

BobLd · 2024-09-01T12:04:11Z

Maintainer

Hi @Greybird, thanks a lot for your contributions. I haven't used these datasets systematically to check for bugs on my side, but I remember sharing some links to this kind of dataset in here a while ago. The one I mainly use for personal tests is https://pdfa.org/stressful-pdf-corpus/

Another source of documents could be (I haven't tested it) https://github.com/GerHobbelt/Evil-PDF-Library-for-Qiqqa

6 replies

BobLd Sep 3, 2024
Maintainer

By the way, if you did any benchmarking of PdfPig vs itext, feel free to share the results

Greybird Sep 4, 2024
Author

Sure,

My tests were conducted on 3 026 716 PDF files sent by applicants to personal loan offers in France, Italy, Spain, Portugal and Germany.
The tested operation is the extraction of metadata from documents (i.e. XmlMetadata and Document Information).

The results are as follows, comparing itext 7.2.2 (not the latest) with the PdfPig main branch.

| Results | Count |
| --- | ---: |
| Fully Identical | 2 614 517 |
| Identical with charset issue fixed | 409 259 |
| More keywords extracted | 128 |
| Differences | 512 |
| Not readable in itext but readable by PdfPig | 244 |
| Readable in itext but not readable by PdfPig | 256 |
| Not readable in both | 2 312 |

To me these are really good results, even if they cover only a very small set of features.
Overall, the extraction is of better quality.
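
To put the counts in proportion, here is a minimal Python sketch that turns each row of the table into a share of the stated 3 026 716 files. The numbers are copied from the table; the names `results`, `TOTAL_FILES` and `share` are illustrative only. Note that the rows add up to 3 027 228, slightly more than the stated total, so the shares are computed against the stated file count and need not sum to exactly 100%.

```python
# Counts copied from the results table above.
TOTAL_FILES = 3_026_716  # corpus size as stated in the post

results = {
    "Fully identical": 2_614_517,
    "Identical with charset issue fixed": 409_259,
    "More keywords extracted": 128,
    "Differences": 512,
    "Not readable in itext but readable by PdfPig": 244,
    "Readable in itext but not readable by PdfPig": 256,
    "Not readable in both": 2_312,
}


def share(count: int, total: int = TOTAL_FILES) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


if __name__ == "__main__":
    for label, count in results.items():
        print(f"{label:<46} {count:>9} {share(count):8.4f}%")
```

Read this way, the first two rows cover the overwhelming majority of the corpus, and every other category stays well below one file in a thousand.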

Qualitatively:

The differences seem to be caused mainly by charset decoding issues in malformed PDFs, where itext extracts better strings. On the other hand, itext does worse in this area on more than 400k files.
For files readable by one library and not the other, I would say that parsing of malformed xref tables works a little better in itext: on the PdfPig side I see incorrectly reported xref loops, or failures to find xref tables in the last part of the file. Files with startxref written as startref (!), or with missing whitespace in xref tables (xref0 5), seem to be handled better by itext.
itext seems more lenient with broken encryption, as it agrees to decrypt truncated byte arrays. I'm not sure this is a very good idea, though, as the resulting data is garbage.
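
As an illustration of the `xref0 5` case mentioned above, the following Python sketch shows what a lenient match of the xref section header could look like. It is not PdfPig or itext code: the function name `parse_xref_header`, the `lenient` flag and both regular expressions are assumptions made for this example, and Python's `\s` is only an approximation of PDF whitespace.

```python
import re

# Strict form: the keyword "xref", at least one whitespace character,
# then "<first object number> <entry count>".
# The lookbehind keeps the pattern from matching inside "startxref".
STRICT_XREF_HEADER = re.compile(rb"(?<!start)xref\s+(\d+)\s+(\d+)")

# Lenient form: the whitespace after "xref" is optional, so "xref0 5" matches.
LENIENT_XREF_HEADER = re.compile(rb"(?<!start)xref\s*(\d+)\s+(\d+)")


def parse_xref_header(data: bytes, lenient: bool = True):
    """Return (first_object_number, entry_count), or None if no header matches."""
    pattern = LENIENT_XREF_HEADER if lenient else STRICT_XREF_HEADER
    match = pattern.search(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    broken = b"xref0 5\n0000000000 65535 f \n"
    print(parse_xref_header(broken, lenient=False))
    print(parse_xref_header(broken, lenient=True))
```

With the strict pattern the broken header is not found at all, while the lenient pattern recovers the first object number and the entry count.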

Hope this can be of use

BobLd Sep 8, 2024
Maintainer

FYI, I've flagged an open issue that seems to have the xref0 5 problem. Not sure how best to handle that yet.

Regarding the startxref written as startref issue, the solution might be as simple as adding the mapping to the OperatorToken's Create() method:

Not sure though as I don't have a document example at hand

BobLd Sep 8, 2024
Maintainer

@Greybird just saw your answer to the issue, I'll have a look now

Greybird Sep 8, 2024
Author

@BobLd, thanks, I'll check whether this solves the parsing issue for startref, and see if I can produce a document with the issue.
I did it another way (https://github.com/Greybird/PdfPig/tree/feature/startref), but your proposal seems much more elegant. I suppose we could add a leniency option check to accept this syntax.
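
To make the idea of a leniency option concrete, here is a small Python sketch of accepting `startref` as an alias for `startxref` only when leniency is enabled. It is neither the code from the linked branch nor PdfPig's `OperatorToken.Create()`: `find_startxref_offset`, the `lenient` flag and the `tail_size` parameter are hypothetical names chosen for this example.

```python
import re


def find_startxref_offset(data: bytes, lenient: bool = False, tail_size: int = 1024):
    """Return the last xref offset announced in the file tail, or None.

    In lenient mode the misspelled keyword "startref" is accepted as an
    alias for "startxref".
    """
    keywords = [b"startxref"]
    if lenient:
        keywords.append(b"startref")  # misspelling seen in malformed files

    # Only the end of the file is scanned, where the keyword is expected.
    tail = data[-tail_size:]
    pattern = re.compile(rb"(?:" + b"|".join(keywords) + rb")\s+(\d+)")
    matches = list(pattern.finditer(tail))
    if not matches:
        return None
    return int(matches[-1].group(1))


if __name__ == "__main__":
    malformed = b"trailer\n<< /Size 5 >>\nstartref\n116\n%%EOF\n"
    print(find_startxref_offset(malformed))                # strict mode
    print(find_startxref_offset(malformed, lenient=True))  # lenient mode
```

Keeping the alias behind a flag means strict parsing is unchanged by default, and the misspelling is tolerated only when the caller asks for it.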
