PDFMiner
Python PDF parser and analyzer
For the full documentation on PDFMiner, see http://unixuser.org/~euske/python/pdfminer/index.html
What's It?
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
Features
- Written entirely in Python. (for version 2.4 or newer)
- Parse, analyze, and convert PDF documents.
- PDF-1.7 specification support. (well, almost)
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Basic encryption (RC4) support.
- PDF to HTML conversion (with a sample converter web app).
- Outline (TOC) extraction.
- Tagged contents extraction.
- Reconstruct the original layout by grouping text chunks.
PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.
Online Demo: (pdf -> html conversion webapp)
http://pdf2html.tabesugi.net:8080/
Download
Source distribution:
http://pypi.python.org/pypi/pdfminer/
github:
https://github.com/euske/pdfminer/
Where to Ask
Questions and comments:
http://groups.google.com/group/pdfminer-users/
How to Install
- Install Python 2.4 or newer. (Python 3 is not supported.)
- Download the PDFMiner source.
- Unpack it.
- Run setup.py to install:
    # python setup.py install
- Do the following test:
    $ pdf2txt.py samples/simple1.pdf
    Hello World Hello World H e l l o W o r l d H e l l o W o r l d
- Done!
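The test step can also be scripted. The following is a minimal sketch, not part of PDFMiner: the names `EXPECTED`, `looks_like_simple1`, and `check_install` are made up for illustration, and it assumes that pdf2txt.py is on the PATH and that the script runs from the top of the unpacked source tree. Because the exact line breaks in the output may vary, the comparison ignores whitespace layout.

```python
import subprocess

# Expected words from the sample run above, joined by single spaces.
EXPECTED = "Hello World Hello World H e l l o W o r l d H e l l o W o r l d"


def looks_like_simple1(output):
    """Return True if output matches EXPECTED, ignoring whitespace layout."""
    return " ".join(output.split()) == EXPECTED


def check_install(run=subprocess.check_output):
    """Run pdf2txt.py on the bundled sample and compare (hypothetical helper)."""
    out = run(["pdf2txt.py", "samples/simple1.pdf"])
    if isinstance(out, bytes):
        # Decode leniently; the sample text is plain ASCII.
        out = out.decode("utf-8", "replace")
    return looks_like_simple1(out)
```

Calling `check_install()` should return True on a working installation, provided the sample output matches the one shown in the test step.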
For CJK languages
In order to process CJK languages, you need to take an additional step during installation:
    # make cmap
    python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt cp950 big5
    reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
    writing 'CNS1_H.py'...
    ...
    (this may take several minutes)
    # python setup.py install
On Windows machines which don't have the make command, paste the following commands at a command prompt:
    python tools\conv_cmap.py pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt cp950 big5
    python tools\conv_cmap.py pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt cp936 gb2312
    python tools\conv_cmap.py pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt cp932 euc-jp
    python tools\conv_cmap.py pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt cp949 euc-kr
    python setup.py install
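The four conversions above differ only in the registry name, the CID table file, and the pair of codecs, so they can be driven from a small table on any platform. The following is a minimal sketch, not part of PDFMiner: `CMAP_TABLE`, `cmap_commands`, and `build_cmaps` are names made up for illustration, and the script is assumed to run from the top of the unpacked source tree.

```python
import os
import subprocess

# (registry, codecs) pairs copied from the commands listed above.
CMAP_TABLE = [
    ("Adobe-CNS1", ["cp950", "big5"]),
    ("Adobe-GB1", ["cp936", "gb2312"]),
    ("Adobe-Japan1", ["cp932", "euc-jp"]),
    ("Adobe-Korea1", ["cp949", "euc-kr"]),
]


def cmap_commands(python="python"):
    """Return one argv list per conv_cmap.py invocation (hypothetical helper)."""
    commands = []
    for registry, codecs in CMAP_TABLE:
        # For example, Adobe-CNS1 maps to cid2code_Adobe_CNS1.txt.
        table = "cid2code_%s.txt" % registry.replace("-", "_")
        commands.append([
            python,
            os.path.join("tools", "conv_cmap.py"),
            os.path.join("pdfminer", "cmap"),
            registry,
            os.path.join("cmaprsrc", table),
        ] + codecs)
    return commands


def build_cmaps(python="python", run=subprocess.check_call):
    """Run every conversion in order; stops at the first failing command."""
    for argv in cmap_commands(python):
        run(argv)
```

After `build_cmaps()` finishes, `python setup.py install` still has to be run as in the listing above. Using `os.path.join` lets the same script produce backslash paths on Windows and forward-slash paths elsewhere.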
Command Line Tools
PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.
pdf2txt.py
pdf2txt.py extracts text contents from a PDF file. It extracts all the text that is to be rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents when their access is restricted. You cannot extract any text from a PDF document which does not have extraction permission.
Note: Not all characters in a PDF can be safely converted to Unicode.
Examples
    $ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
    (extract text as an HTML file whose filename is output.html)
    $ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
    (extract a Japanese HTML file in vertical writing, CMap is required)
    $ pdf2txt.py -P mypassword -o output.txt secret.pdf
    (extract a text from an encrypted PDF file)
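When pdf2txt.py is called from another Python script, it can help to build the argument list in one place. The following is a minimal sketch that covers only the options used in the examples above (-o, -V, -c, -P). The helper names `pdf2txt_command` and `run_pdf2txt` are made up for illustration and are not part of PDFMiner, and pdf2txt.py is assumed to be on the PATH.

```python
import subprocess


def pdf2txt_command(pdf_path, output=None, password=None, codec=None,
                    vertical=False):
    """Build an argv list for pdf2txt.py (hypothetical helper)."""
    argv = ["pdf2txt.py"]
    if vertical:
        argv.append("-V")           # vertical writing, as in the jo.pdf example
    if codec:
        argv += ["-c", codec]       # output codec, e.g. euc-jp
    if password:
        argv += ["-P", password]    # password for an encrypted document
    if output:
        argv += ["-o", output]      # output file name
    argv.append(pdf_path)
    return argv


def run_pdf2txt(pdf_path, run=subprocess.check_call, **options):
    """Run pdf2txt.py; raises CalledProcessError on a non-zero exit status."""
    return run(pdf2txt_command(pdf_path, **options))
```

For instance, `run_pdf2txt("secret.pdf", output="output.txt", password="mypassword")` issues the same command as the third example.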
dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (such as images).
Examples
    $ dumppdf.py -a foo.pdf
    (dump all the headers and contents, except stream objects)
    $ dumppdf.py -T foo.pdf
    (dump the table of contents)
    $ dumppdf.py -r -i6 foo.pdf > pic.jpeg
    (extract a JPEG image)
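The last example relies on shell redirection to save the image. The same thing can be done from Python by pointing the child process's standard output at a file opened in binary mode. The following is a minimal sketch: `dumppdf_raw_command` and `extract_object` are names made up for illustration, dumppdf.py is assumed to be on the PATH, and the object number 6 is specific to the example file, so the right number for another PDF has to be found first (for instance by inspecting the output of the -a dump).

```python
import subprocess


def dumppdf_raw_command(pdf_path, object_id):
    """argv for dumping one object with -r -i, as in the example above."""
    return ["dumppdf.py", "-r", "-i%d" % object_id, pdf_path]


def extract_object(pdf_path, object_id, out_path, run=subprocess.check_call):
    """Save the dump of one object to out_path, like `> pic.jpeg` in a shell."""
    # Binary mode matters: the dumped data is image bytes, not text.
    with open(out_path, "wb") as out:
        run(dumppdf_raw_command(pdf_path, object_id), stdout=out)
```

Calling `extract_object("foo.pdf", 6, "pic.jpeg")` corresponds to the third example.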