
...

  1. Install Python 2.4 or newer. (Python 3 is not supported.)
  2. Download the PDFMiner source.
  3. Unpack it.
  4. Run setup.py to install:
    # python setup.py install
    
  5. Run the following test to check the installation:
    $ pdf2txt.py samples/simple1.pdf
    Hello
    
    World
    
    Hello
    
    World
    
    H e l l o
    
    W o r l d
    
    H e l l o
    
    W o r l d
    
  6. Done!
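
The version requirement in step 1 can be checked before running setup.py. The following is a minimal sketch, not part of PDFMiner: the function name `is_supported_python` is hypothetical, and it only encodes the rule stated above (Python 2.4 or newer, Python 3 not supported).

```python
import sys


def is_supported_python(version_info):
    """Return True for Python 2.4 through 2.x, False for anything else.

    Hypothetical helper, not part of PDFMiner. It only encodes the
    requirement stated in step 1 of the installation list.
    """
    major, minor = version_info[0], version_info[1]
    return major == 2 and minor >= 4


if __name__ == "__main__":
    # sys.version_info describes the interpreter running this script.
    if is_supported_python(sys.version_info):
        print("This interpreter meets the stated requirement.")
    else:
        print("This interpreter does not meet the stated requirement.")
```

Taking the version as an argument, rather than reading `sys.version_info` inside the function, keeps the check easy to exercise with arbitrary version tuples.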

...

$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract the text as an HTML file named output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract Japanese text as an HTML file in vertical writing; a CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)
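
When pdf2txt.py is driven from a script, the command lines above can be assembled programmatically. This is a rough sketch under stated assumptions: `build_pdf2txt_command` and `run_pdf2txt` are hypothetical names, pdf2txt.py is assumed to be on the PATH, and only the flags that appear in the examples above (-o, -V, -c, -P) are used.

```python
import subprocess


def build_pdf2txt_command(pdf_path, output=None, codec=None,
                          vertical=False, password=None):
    """Build an argument list for pdf2txt.py.

    Hypothetical helper. It uses only the flags shown in the examples:
    -o (output file), -V (vertical writing), -c (codec), -P (password).
    """
    command = ["pdf2txt.py"]
    if vertical:
        command.append("-V")
    if codec is not None:
        command.extend(["-c", codec])
    if password is not None:
        command.extend(["-P", password])
    if output is not None:
        command.extend(["-o", output])
    # The input PDF comes last, as in the examples.
    command.append(pdf_path)
    return command


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py and return its exit status (assumes it is on PATH)."""
    return subprocess.call(build_pdf2txt_command(pdf_path, **options))
```

For example, `build_pdf2txt_command("secret.pdf", output="output.txt", password="mypassword")` yields the same arguments as the encrypted-file example. Passing a list to `subprocess.call` avoids shell quoting problems when a password or path contains spaces.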

Options

...

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in a pseudo-XML format. It is intended primarily for debugging, but it can also extract some meaningful content (such as images).

...

$ dumppdf.py -a foo.pdf
(dump all the headers and contents, except stream objects)

$ dumppdf.py -T foo.pdf
(dump the table of contents)

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)
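
Whether the redirected output really is a JPEG depends on which object the PDF stores under the given ID. A quick sanity check, sketched here with hypothetical helper names that are not part of PDFMiner, is to confirm that the file begins with the two-byte JPEG start-of-image marker (0xFF 0xD8).

```python
# Every JPEG file begins with the start-of-image (SOI) marker.
JPEG_START_OF_IMAGE = b"\xff\xd8"


def looks_like_jpeg(data):
    """Return True if the bytes begin with the JPEG start-of-image marker."""
    return data[:2] == JPEG_START_OF_IMAGE


def file_looks_like_jpeg(path):
    """Read the first two bytes of a file and test them (hypothetical helper)."""
    handle = open(path, "rb")
    try:
        return looks_like_jpeg(handle.read(2))
    finally:
        handle.close()
```

Calling `file_looks_like_jpeg("pic.jpeg")` after the extraction command gives an early warning if the chosen object was not a JPEG stream. It checks only the leading marker, so it does not prove the image is complete or decodable.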

...