...
- Install Python 2.4 or newer. (Python 3 is not supported.)
- Download the PDFMiner source.
- Unpack it.
- Run setup.py to install:
# python setup.py install
- Do the following test:
$ pdf2txt.py samples/simple1.pdf
Hello
World
Hello
World
H e l l o
W o r l d
H e l l o
W o r l d
- Done!
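The version requirement in the first step can be checked before installing. The following is a minimal sketch, not part of PDFMiner: the function name `is_supported` is hypothetical, and the range it tests (2.4 up to, but not including, 3.0) is simply a reading of the requirement stated above. It avoids newer syntax so that it should also run on older interpreters.

```python
import sys


def is_supported(version_info):
    """Return True for Python 2.4 through 2.x, the range the install notes state."""
    # Compare only (major, minor); hypothetical helper, not a PDFMiner API.
    major_minor = tuple(version_info[:2])
    return (2, 4) <= major_minor < (3, 0)


if __name__ == "__main__":
    if is_supported(sys.version_info):
        print("This interpreter matches the stated requirement.")
    else:
        print("This interpreter is outside the stated range (2.4 <= version < 3).")
```

Running it with the interpreter you intend to use for `python setup.py install` reports whether that interpreter falls inside the stated range.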
...
$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file named output.html)
$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing; CMap is required)
$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)
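When these commands are run from a script rather than typed by hand, it can help to assemble the argument list in one place. The sketch below is an illustration under assumptions: the helper names `build_pdf2txt_command` and `run_pdf2txt` are hypothetical, only the flags that appear in the examples above (`-o`, `-V`, `-c`, `-P`) are used, and it assumes `pdf2txt.py` is on the PATH and directly executable.

```python
import subprocess


def build_pdf2txt_command(pdf_path, output_path=None, password=None,
                          codec=None, vertical=False):
    """Assemble an argument list using only the flags shown in the examples."""
    cmd = ["pdf2txt.py"]
    if vertical:
        cmd.append("-V")                  # vertical writing
    if codec is not None:
        cmd.extend(["-c", codec])         # output codec, e.g. "euc-jp"
    if password is not None:
        cmd.extend(["-P", password])      # password for an encrypted PDF
    if output_path is not None:
        cmd.extend(["-o", output_path])   # output file name
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py and return its exit status (assumes it is on the PATH)."""
    return subprocess.call(build_pdf2txt_command(pdf_path, **options))


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(" ".join(build_pdf2txt_command(
        "samples/jo.pdf", output_path="output.html",
        codec="euc-jp", vertical=True)))
    # prints: pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
```

Passing a list to `subprocess.call` avoids shell quoting problems with file names and passwords that contain spaces.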
Options
...
dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it can also extract some meaningful content (such as images).
...
$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)
$ dumppdf.py -T foo.pdf
(dump the table of contents)
$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
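The last example relies on shell redirection (`>`) to save the image. From a script, the same effect can be had by pointing the child process's standard output at a file opened in binary mode. This is a rough sketch, not PDFMiner code: the helper names are hypothetical, and the parameter names `raw` and `object_id` reflect a reading of the example above, where `-r -i6` appears to request the raw data of object 6. It assumes `dumppdf.py` is on the PATH and directly executable.

```python
import subprocess


def build_dumppdf_command(pdf_path, dump_all=False, toc=False,
                          raw=False, object_id=None):
    """Assemble an argument list using only the flags shown in the examples."""
    cmd = ["dumppdf.py"]
    if dump_all:
        cmd.append("-a")                  # dump all headers and contents
    if toc:
        cmd.append("-T")                  # dump the table of contents
    if raw:
        cmd.append("-r")                  # assumed meaning: raw stream data
    if object_id is not None:
        cmd.append("-i%d" % object_id)    # assumed meaning: object to dump
    cmd.append(pdf_path)
    return cmd


def dump_to_file(pdf_path, out_path, **options):
    """Counterpart of `dumppdf.py ... > out_path`; returns the exit status."""
    out = open(out_path, "wb")            # binary mode, since images are not text
    try:
        return subprocess.call(build_dumppdf_command(pdf_path, **options),
                               stdout=out)
    finally:
        out.close()
```

For instance, `dump_to_file("foo.pdf", "pic.jpeg", raw=True, object_id=6)` would correspond to the JPEG example, provided object 6 is in fact the image stream in that file.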
...