Admission List
Conversion of Admission List to CSV and HTML
Overview:
Goal: The main goal of this project is to illustrate the use of Python tools to convert tabular data in PDF files to CSV and HTML formats.
Data: The PDF used in this project has 1770 pages, each containing the names of students admitted to different programs at Makerere University. This PDF can be downloaded from here. All of this information needs to be converted to a structured format that can be uploaded to a database or stored for future analysis.
Tools: This activity was carried out in Python. The required packages are pandas and tabula (published on PyPI as tabula-py, which also needs a Java runtime). The steps are as follows.
# Part 1: Import packages
import pandas as pd
from tabula import read_pdf
# Part 2: Get pdf file
pdf_file = 'MAK Admission List.pdf'
# Part 3: Read the pdf content using tabula
# Read the first page only (pass pages='all' to read the entire document)
df = read_pdf(pdf_file, pages=1, stream=False, guess=False, pandas_options={'header': None})
# Part 4: Merge all the dataframes
# read_pdf returns a list of dataframes (one per table it finds), so combine them into one
df2 = pd.concat(df)
# Part 5: Export to csv and html.
# Export to html
df2.to_html('filename.html', index=False, header=False)
# Export to csv
df2.to_csv('filename.csv', index=False, header=False)
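Reading all 1770 pages in a single call may be slow or use a lot of memory, so one option is to work through the PDF in page batches. The following is a minimal sketch and not part of the original script: the helper names page_batches and convert_in_batches and the batch size of 100 are assumptions made for illustration. It relies on read_pdf accepting a page-range string such as '1-100' for its pages parameter.

```python
def page_batches(total_pages, batch_size=100):
    """Yield page-range strings such as '1-100', '101-200', ... for read_pdf."""
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield f"{start}-{end}"


def convert_in_batches(pdf_file, total_pages, csv_file, batch_size=100):
    """Write the tables from each page batch to one CSV file (hypothetical helper)."""
    # Imported here so that page_batches can be used without pandas, tabula or Java.
    import pandas as pd
    from tabula import read_pdf

    mode = 'w'  # overwrite the file on the first write, append afterwards
    for pages in page_batches(total_pages, batch_size):
        tables = read_pdf(pdf_file, pages=pages, stream=False, guess=False,
                          pandas_options={'header': None})
        if not tables:  # no table was detected in this batch
            continue
        pd.concat(tables).to_csv(csv_file, mode=mode, index=False, header=False)
        mode = 'a'
```

For this document, convert_in_batches('MAK Admission List.pdf', 1770, 'filename.csv') would process pages 1 to 100, then 101 to 200, and so on up to 1701 to 1770.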
Result: Running the above Python code generates two output files:
(1) a CSV file containing the data extracted from the PDF, and
(2) an HTML file with the same structure as the CSV file. See the figure below. For convenience, I am showing only the first page of the PDF.
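One quick way to check the exported CSV is to count its rows and see whether every row has the same number of columns, since table extraction can sometimes split or merge cells. This is a small sketch using only the standard library; summarize_csv is a hypothetical helper name, and it assumes the file is UTF-8 encoded.

```python
import csv
from collections import Counter


def summarize_csv(path):
    """Return (row_count, Counter mapping column count -> number of rows)."""
    widths = Counter()
    with open(path, newline='', encoding='utf-8') as handle:
        for row in csv.reader(handle):
            widths[len(row)] += 1
    return sum(widths.values()), widths
```

Calling summarize_csv('filename.csv') on the file produced above returns the total number of rows and a tally of row widths. More than one distinct width suggests that some rows were parsed inconsistently and are worth inspecting by hand.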