...
$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file named output.html)
$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing; CMap is required)
$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)
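When pdf2txt.py is driven from a script rather than typed by hand, it can help to assemble the command line in one place. The following is a minimal sketch, not part of the tool itself: the function names build_pdf2txt_args and run_pdf2txt are hypothetical, only the flags that appear in the examples above (-o, -V, -c, -P) are used, and pdf2txt.py is assumed to be on the PATH.

```python
import subprocess
from typing import List, Optional


def build_pdf2txt_args(pdf_path: str, out_path: str,
                       password: Optional[str] = None,
                       codec: Optional[str] = None,
                       vertical: bool = False) -> List[str]:
    """Assemble a pdf2txt.py argument list (hypothetical helper).

    Only flags shown in the examples above are emitted.
    """
    args = ["pdf2txt.py"]
    if vertical:
        args.append("-V")            # vertical writing
    if codec is not None:
        args += ["-c", codec]        # output codec, e.g. "euc-jp"
    if password is not None:
        args += ["-P", password]     # password for an encrypted PDF
    args += ["-o", out_path, pdf_path]
    return args


def run_pdf2txt(pdf_path: str, out_path: str, **options) -> None:
    """Run pdf2txt.py; assumes the script is on the PATH."""
    subprocess.run(build_pdf2txt_args(pdf_path, out_path, **options),
                   check=True)
```

For instance, run_pdf2txt("secret.pdf", "output.txt", password="mypassword") would issue the same command as the third example above.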
Options
...
dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it can also extract some meaningful content (such as images).
...
$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)
$ dumppdf.py -T foo.pdf
(dump the table of contents)
$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
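The last example relies on shell redirection to capture binary output. The same thing can be done from Python, as in the rough sketch below. The helper names are hypothetical, dumppdf.py is assumed to be on the PATH, and -i6 is read here as "object 6", which the example suggests but this section does not spell out.

```python
import subprocess
from typing import List


def build_dumppdf_raw_args(pdf_path: str, object_id: int) -> List[str]:
    """Argument list mirroring `dumppdf.py -r -i6 foo.pdf` (hypothetical helper)."""
    # Assumption: the number after -i is the ID of the object to dump.
    return ["dumppdf.py", "-r", "-i%d" % object_id, pdf_path]


def extract_raw_object(pdf_path: str, object_id: int, out_path: str) -> None:
    """Write the raw dump to out_path, like `> pic.jpeg` in the shell."""
    # Binary mode matters: the output is image data, not text.
    with open(out_path, "wb") as out:
        subprocess.run(build_dumppdf_raw_args(pdf_path, object_id),
                       stdout=out, check=True)
```

Calling extract_raw_object("foo.pdf", 6, "pic.jpeg") would correspond to the shell command above. Which object number holds an image depends on the PDF file.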
...