PDF to Word

Converting PDF to Word

Overview

Goal: This project illustrates how to use Python tools to convert PDF documents to Word and HTML file formats.

Data: The PDF used in this project has 41 pages, with a mix of text and tables on various pages. This PDF can be downloaded from here. The entire PDF needs to be converted into Word and HTML documents.

Tools: This activity was carried out in Python. The required packages are pandas, tabula (installed from PyPI as tabula-py, which also needs a Java runtime), and docx (installed from PyPI as python-docx). The steps are listed in Parts 1 to 4 below.
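Because the import names differ from the names used with pip, a quick check before running the steps can save some confusion. The following is a minimal sketch, not part of the original project: the helper name `missing_packages` and the `REQUIRED` mapping are illustrative assumptions.

```python
from importlib.util import find_spec

# Import names used in this walkthrough, mapped to their PyPI names.
# (This mapping and the helper below are illustrative, not from the project.)
REQUIRED = {"pandas": "pandas", "tabula": "tabula-py", "docx": "python-docx"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, install it with pip before moving on to Part 1.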

# Part 1: Import packages

import pandas as pd

from tabula import read_pdf

from docx import Document

# Part 2: Set the path to the PDF report

pdf_ = 'report.pdf'

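# Part 3: Read the PDF report with tabula

# pages=4 reads only page 4; pass pages='all' to read the entire document.
# read_pdf returns a list of DataFrames.
df = read_pdf(pdf_, pages=4, stream=False, guess=False, pandas_options={'header': None})

# Merge all the DataFrames into one

df2 = pd.concat(df)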

# Part 4: Export to word and html

parag = list(df2[0])  # Get the lines of text to write (first column)

# Mark blank rows (NaN) as paragraph breaks; convert everything else to text so it can be joined

parag = ['blankspace' if str(x) == 'nan' else str(x) for x in parag]

# Join all the text and split the text at the end of each paragraph

parag_jn = ' '.join(parag).split('blankspace')

# Export to html

par_df = pd.DataFrame({'Paragraphs': parag_jn})

par_df.to_html('word.html', border=0, header=False, index=False)

# Export to word

document = Document()

# Add heading to the document

document.add_heading('Pdf to Word example', 0)

# Add each paragraph to the Word document

for pg in parag_jn:
    document.add_paragraph(str(pg))

# Saving the word document

document.save('Pdftoword.docx')
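The paragraph-splitting step in Part 4 can be tried on its own, without a PDF. The sketch below applies the same idea (replace blank rows with a marker, join the lines, split on the marker) to a small hand-made list. The function name `split_paragraphs` and the sample rows are assumptions for illustration; unlike the script above, this version also trims whitespace and drops empty paragraphs.

```python
import math


def split_paragraphs(lines, marker="blankspace"):
    """Join text lines into paragraphs, treating NaN entries as paragraph breaks."""
    # Replace NaN (blank rows) with the marker; convert everything else to text.
    marked = [marker if isinstance(x, float) and math.isnan(x) else str(x)
              for x in lines]
    # Join all lines, then split wherever a blank row was found.
    chunks = " ".join(marked).split(marker)
    # Trim surrounding spaces and drop empty paragraphs.
    return [c.strip() for c in chunks if c.strip()]


# Hand-made rows standing in for the first column of the tabula output.
rows = ["First line of", "paragraph one.", float("nan"), "Paragraph two."]
print(split_paragraphs(rows))
# → ['First line of paragraph one.', 'Paragraph two.']
```

One limitation of this approach is that any text containing the marker word itself would also be split, so a marker that cannot occur in the document is the safer choice.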



Outputs:

Here are the final Word and HTML documents:

HTML Output

Word Output file

pdf_to_word.docx