-
Hi @Greybird, thanks a lot for your contributions. I haven't used these datasets systematically to check for bugs on my side, but I remember sharing some links to this kind of dataset in here a while ago. The one I mainly use for personal testing is https://pdfa.org/stressful-pdf-corpus/ Another possible source of documents (I haven't tested it) is https://github.com/GerHobbelt/Evil-PDF-Library-for-Qiqqa
-
Hi,
First, I'd like to thank every maintainer and contributor of PdfPig for the awesome library!
I have been working for my company on an internal corpus of around 3 million documents (sadly all containing personal information, and thus not shareable) to validate our ability to extract some document information using PdfPig.
I have already contributed a few PRs to fix some PDF parsing issues I found, and I'm now at a point where fewer than 3000 documents lead to an error, and fewer than 500 of them lead to an error not encountered by the iText library, which we aim to replace.
Out of curiosity, has anyone already tried running PdfPig against one of the PDF datasets provided by the PDF Association?
For example: https://pdfa.org/new-large-scale-pdf-corpus-now-publicly-available/
I think the whole list of datasets is available here: https://pdfa.org/resources/#Test_suites_and_protocols.