PDF to structured data

Sample 1: Student Admission List

This project used Python scripts to convert tabular data in a PDF into a CSV file. The Tabula Python package, together with pandas, was used to create the output CSV file shown below. This approach is useful in a number of applications, such as exporting PDF data to a database.
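As a rough illustration of the approach described above, a minimal sketch might look like the following. It is not the project's actual source code. It assumes the tabula-py distribution of Tabula (which needs a Java runtime), and the function names `combine_tables` and `pdf_to_csv` as well as the file names in the usage note are hypothetical.

```python
import pandas as pd


def combine_tables(tables):
    """Stack the tables found in the PDF into one DataFrame.

    `tables` is a list of DataFrames, one per detected table.
    Empty tables and rows that are entirely blank are discarded.
    """
    frames = [t for t in tables if not t.empty]
    if not frames:
        return pd.DataFrame()
    combined = pd.concat(frames, ignore_index=True)
    return combined.dropna(how="all").reset_index(drop=True)


def pdf_to_csv(pdf_path, csv_path):
    """Extract every table from the PDF and write them to a single CSV file."""
    # Imported here so the helper above can be used without Java installed.
    import tabula  # assumption: the tabula-py package

    # With multiple_tables=True, read_pdf returns a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    combined = combine_tables(tables)
    combined.to_csv(csv_path, index=False)
    return combined
```

Calling `pdf_to_csv("admission_list.pdf", "admission_list.csv")` (hypothetical file names) would write one CSV row per extracted table row. Real PDFs often need extra cleanup, such as removing header rows that repeat on every page.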

Get details and source code here

PDF containing the data to be converted to CSV

Final output in CSV file

Sample 2: PDF to Word and HTML

A PDF report was converted to HTML and Word file formats using the Tabula and docx Python packages. The resulting data can be used for analysis such as NLP, or for converting PDF model manuals to HTML/Word documents.
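The conversion step could be sketched roughly as below, assuming the report's tables have already been extracted into a pandas DataFrame (for example with Tabula). This is an illustrative sketch rather than the project's code. It assumes the python-docx distribution of the docx package, and the function names `table_to_html` and `table_to_docx` are hypothetical.

```python
import html

import pandas as pd


def table_to_html(df, title="Report"):
    """Wrap a DataFrame in a minimal standalone HTML page."""
    table_markup = df.to_html(index=False)
    return (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset=\"utf-8\">"
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body>\n{table_markup}\n</body></html>\n"
    )


def table_to_docx(df, docx_path, heading="Report"):
    """Write a DataFrame as a table in a Word document."""
    # Imported here so the HTML helper works without python-docx installed.
    from docx import Document  # assumption: the python-docx package

    doc = Document()
    doc.add_heading(heading, level=1)

    # First row holds the column names.
    table = doc.add_table(rows=1, cols=len(df.columns))
    for cell, name in zip(table.rows[0].cells, df.columns):
        cell.text = str(name)

    # One additional row per DataFrame row.
    for row in df.itertuples(index=False):
        cells = table.add_row().cells
        for cell, value in zip(cells, row):
            cell.text = str(value)

    doc.save(docx_path)
```

This sketch covers tables only. Running text from the report would need a separate text extraction step, which is not shown here.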

Explore more here

Section of the original PDF report