Admission List

Conversion of Admission List to CSV and HTML

Overview:

Goal: The main goal of this project is to illustrate how Python tools can be used to convert tabular data inside PDF files to CSV and HTML formats.

Data: The PDF used in this project has 1770 pages, each listing the names of students admitted to different programs at Makerere University. This PDF can be downloaded from here. All of this information needs to be converted to a structured format that can be uploaded to a database or stored for future analysis.

Tools: This activity was carried out in Python. Two packages need to be installed: pandas and tabula-py (imported as tabula). Because tabula-py is a wrapper around tabula-java, a Java runtime must also be available. Here are the simple steps that you need to follow.


# Part 1: Import packages

import pandas as pd

from tabula import read_pdf


# Part 2: Get pdf file

pdf_file = 'MAK Admission List.pdf'


# Part 3: Read the pdf content using tabula

# Read the first page (pass pages='all' to read every page)

df = read_pdf(pdf_file, pages=1, stream=False, guess=False, pandas_options={'header': None})
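With 1770 pages, reading the whole document in one call may be slow or memory-hungry. The sketch below shows one possible way to read it in chunks of pages instead. The helper names `page_ranges` and `read_all_pages` and the chunk size of 50 are assumptions made for illustration, not part of the original project. It relies on `read_pdf` accepting a page-range string such as '1-50' and returning a list of DataFrames, which is how the call above is used.

```python
def page_ranges(total_pages, chunk_size):
    """Yield page-range strings such as '1-50' that together cover 1..total_pages."""
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A chunk holding a single page is written as '7' rather than '7-7'.
        yield str(start) if start == end else f"{start}-{end}"


def read_all_pages(pdf_file, total_pages, chunk_size=50):
    """Hypothetical helper: read the PDF chunk by chunk into a list of DataFrames."""
    # Imported here so that page_ranges can be used without tabula installed.
    from tabula import read_pdf

    frames = []
    for pages in page_ranges(total_pages, chunk_size):
        # Same options as the single-page call above, applied to one chunk.
        frames.extend(
            read_pdf(pdf_file, pages=pages, stream=False, guess=False,
                     pandas_options={'header': None})
        )
    return frames
```

Under these assumptions, `df = read_all_pages(pdf_file, 1770)` could stand in for the single-page call, and the list it returns can be merged in the same way in Part 4.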


# Part 4: Merge all the dataframes

df2 = pd.concat(df)
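As written, pd.concat keeps each table's own row index, so row labels repeat across tables, and it raises an error if the list is empty. The sketch below is one possible variant that renumbers the rows and handles an empty list. The helper name `merge_tables` and the sample rows are placeholders invented for illustration, not data from the admission list.

```python
import pandas as pd


def merge_tables(tables):
    """Concatenate tables into one DataFrame with a fresh 0..n-1 row index."""
    # Skip tables with no rows, e.g. from pages where nothing was detected.
    tables = [table for table in tables if not table.empty]
    if not tables:
        # pd.concat raises ValueError on an empty list, so return early.
        return pd.DataFrame()
    return pd.concat(tables, ignore_index=True)


# Placeholder rows standing in for two extracted pages.
page_1 = pd.DataFrame([["001", "Student A"], ["002", "Student B"]])
page_2 = pd.DataFrame([["003", "Student C"]])
merged = merge_tables([page_1, page_2, pd.DataFrame()])
```

Since the exports in Part 5 pass index=False, the renumbering only matters when the merged DataFrame is inspected or filtered before export.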

# Part 5: Export to csv and html.

# Export to html

df2.to_html('filename.html', index=False, header=False)


# Export to csv

df2.to_csv('filename.csv', index=False, header=False)
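Before uploading the CSV anywhere, it can be worth checking that every row has the same number of columns, since table extraction from PDFs sometimes splits or merges cells. The sketch below is a minimal check using only the standard library. The function name `summarize_rows` and the sample text are assumptions made for illustration.

```python
import csv
import io


def summarize_rows(lines):
    """Return (row_count, set_of_column_counts) for an iterable of CSV lines."""
    row_count = 0
    widths = set()
    for row in csv.reader(lines):
        row_count += 1
        widths.add(len(row))
    return row_count, widths


# Placeholder CSV text; the last row is deliberately one column short.
sample = io.StringIO("001,Student A,Program X\n002,Student B\n")
rows, widths = summarize_rows(sample)
```

To check the exported file, open it with `open('filename.csv', newline='')` and pass the file object to `summarize_rows`. More than one value in the returned set means some rows have a different number of columns and may need manual review.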


Result: Running the Python code above generates two output files:

(1) a CSV file containing the data extracted from the PDF, and

(2) an HTML file with the same structure as the CSV file. See the figures below. For convenience, only the first page of the PDF is shown.

CSV Output file - First page only

HTML Output file - First page only