PDF to Word
Conversion of PDF to Word
Overview
Goal: This project illustrates how to use Python tools to convert PDF documents to Word and HTML file formats.
Data: The PDF used in this project has 41 pages, with a combination of text and tables on various pages. This PDF can be downloaded from here. The entire PDF needs to be converted into Word and HTML documents.
Tools: This activity was carried out in Python. The required Python packages are pandas, tabula (installed as tabula-py, which also needs a Java runtime), and docx (installed as python-docx). Follow the steps below.
# Part 1: Import packages
import pandas as pd
from tabula import read_pdf
from docx import Document
# Part 2: Load pdf report
pdf_ = 'report.pdf'
# Part 3: read pdf report with tabula
# pages=4 reads only page 4; pass pages='all' to process the entire document
df = read_pdf(pdf_, pages=4, stream=False, guess=False, pandas_options={'header': None})
# Merge all the dataframes
df2 = pd.concat(df)
# Part 4: Export to word and html
parag = list(df2[0])  # Get the paragraphs to write (first column)
# Mark blank lines (NaN) so that each one starts a new paragraph
parag = ['blankspace' if pd.isna(x) else str(x) for x in parag]
# Join all the text and split the text at the end of each paragraph
parag_jn = ' '.join(parag).split('blankspace')
# Export to html
par_df = pd.DataFrame({'Paragraphs': parag_jn})
par_df.to_html('word.html', border=0, header=False, index=False)
# Export to word
document = Document()
# Add heading to the document
document.add_heading('Pdf to Word example', 0)
# Add each paragraph to the document
for pg in parag_jn:
    document.add_paragraph(str(pg))
# Saving the word document
document.save('Pdftoword.docx')
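The paragraph-splitting step in Part 4 can be tried without a PDF. The sketch below is an illustration, not the script's own method: the helper name `group_into_paragraphs` and the sample data are hypothetical. It groups lines directly instead of joining on a sentinel word, so a document that happens to contain the text 'blankspace' is not split by accident, and empty paragraphs are dropped.

```python
import math


def group_into_paragraphs(lines):
    """Join consecutive text lines; a missing value (None or NaN) ends a paragraph."""
    paragraphs, current = [], []
    for line in lines:
        is_missing = line is None or (isinstance(line, float) and math.isnan(line))
        if is_missing:
            # A blank row closes the paragraph collected so far, if any
            if current:
                paragraphs.append(' '.join(current))
                current = []
        else:
            current.append(str(line).strip())
    # Flush the last paragraph when the input does not end with a blank row
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


# Hypothetical sample, shaped like the first column of df2
sample = ['First line of', 'paragraph one.', float('nan'), 'Paragraph two.', float('nan')]
paragraphs = group_into_paragraphs(sample)
print(paragraphs)
```

In the script, a call such as `group_into_paragraphs(list(df2[0]))` could stand in for the two lines that build `parag_jn`, assuming the first column holds the text lines.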
Outputs:
Here are the final Word and HTML documents.
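The HTML export in the script writes the paragraphs as rows of a borderless table. If plain paragraph markup is preferred, one possible alternative is sketched below using only the standard library. The function name `paragraphs_to_html` and the sample strings are hypothetical and not part of the original script.

```python
from html import escape


def paragraphs_to_html(paragraphs, title='Pdf to Word example'):
    """Build a minimal HTML page with one <p> element per non-empty paragraph."""
    body = '\n'.join(
        f'<p>{escape(str(p).strip())}</p>'
        for p in paragraphs
        if str(p).strip()  # skip paragraphs that contain only whitespace
    )
    return (
        '<!DOCTYPE html>\n'
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f'<body>\n<h1>{escape(title)}</h1>\n{body}\n</body></html>\n'
    )


# Hypothetical sample paragraphs; special characters are escaped for HTML
html_text = paragraphs_to_html(
    ['Text with <angle> brackets & ampersands.', '  ', 'Second paragraph.']
)
print(html_text)
```

The returned string can then be written to disk, for example with `open('word.html', 'w', encoding='utf-8')`, in place of the `to_html` call.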