Content Comparison

...

Install Python 2.4 or newer. (Python 3 is not supported.)
Download the PDFMiner source.
Unpack it.
Run setup.py to install:
```
# python setup.py install
```

Do the following test:

$ pdf2txt.py samples/simple1.pdf
Hello

World

Hello

World

H e l l o

W o r l d

H e l l o

W o r l d

Done!
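The version requirement in the first installation step can be checked from a script before running setup.py. The following is a minimal sketch: the helper name `is_supported_python` is hypothetical and not part of PDFMiner, and it encodes only the rule stated above (Python 2.4 or newer, Python 3 not supported).

```python
import sys


def is_supported_python(version_info=None):
    """Return True for Python 2.4 or newer within the 2.x series.

    Hypothetical helper, not part of PDFMiner. It accepts any
    (major, minor, ...) tuple and defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    # Python 3 is excluded explicitly, so only the 2.x series qualifies.
    return major == 2 and minor >= 4
```

Passing the version tuple as an argument keeps the check testable on any interpreter; calling it with no argument inspects the interpreter that is actually running.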

...

$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file named output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract Japanese text as an HTML file in vertical writing; a CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)
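When pdf2txt.py is driven from another script, the command lines above can be assembled programmatically. The sketch below is illustrative only: `build_pdf2txt_command` is a hypothetical helper, not part of PDFMiner, and it uses only the flags that appear in the examples above (`-o`, `-V`, `-c`, `-P`).

```python
def build_pdf2txt_command(pdf_path, output=None, password=None,
                          codec=None, vertical=False):
    """Build an argument list for pdf2txt.py.

    Hypothetical helper, not part of PDFMiner. Only the flags shown
    in the examples above are supported.
    """
    args = ["pdf2txt.py"]
    if vertical:
        args.append("-V")                # vertical writing
    if codec is not None:
        args.extend(["-c", codec])       # output codec, e.g. "euc-jp"
    if password is not None:
        args.extend(["-P", password])    # password for an encrypted PDF
    if output is not None:
        args.extend(["-o", output])      # output file name
    args.append(pdf_path)                # the input PDF comes last
    return args
```

The returned list can be passed to `subprocess.call`, assuming pdf2txt.py is on the PATH. Building a list rather than a single string avoids shell quoting problems with passwords or file names that contain spaces.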

Options

...

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. The program is intended mainly for debugging, but it can also extract some meaningful content, such as images.

...

$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
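Whether the object dumped in the last example really holds JPEG data depends on the PDF file, so it can be worth checking the result before using it. The sketch below relies on one fact about the JPEG format, namely that every JPEG stream begins with the start-of-image marker FF D8. The function names are hypothetical, and the code assumes Python 2.6 or later for the `b"..."` literal and the `with` statement.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker that opens a JPEG stream


def looks_like_jpeg(data):
    """Return True if the byte string starts with the JPEG SOI marker."""
    return data[:2] == JPEG_SOI


def file_looks_like_jpeg(path):
    """Read the first two bytes of a file and test them for the SOI marker."""
    with open(path, "rb") as f:
        return looks_like_jpeg(f.read(2))
```

For example, `file_looks_like_jpeg("pic.jpeg")` can be called after the redirect above. This checks only the leading marker; it does not validate the rest of the image.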

...

Version	Old Version 2	New Version Current
Changes made by	Daniel Alabi	Daniel Alabi
Saved on	Mar 20, 2013	Mar 20, 2013
