A note to the reader
This post is a legacy post. The legacy posts that are available on this website were written many years ago. These posts are made available here for archival purposes only.
They reflect the age I was, and the level of knowledge that I had when I wrote them, and they may contain outdated information, so please keep that in mind as you proceed to read this article.
Introduction
Last week, I scraped the school website clean looking for report cards. I wanted the data to do some statistics (the website is very spartan) and, much to my pleasure, I was able to get all the data I needed.
The Problem
I need the report card data in JSON format. The format given is, quite unsurprisingly, bad HTML.
Problem 1: Bad HTML
The website is built with ASP.NET. What better could I expect? Tables nested inside tables.
That's an example of the stuff I had to deal with.
Problem 2: Inconsistent ID Numbers
Come on. Who would expect students in Classes 11 and 12 to have joined the school in 2014?
The Solution
Solution 1: XPath
XPath is what saved me. True, it took some analysis, but I ended up with a working query.
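To give a rough idea of how XPath cuts through nested tables, here is a minimal sketch using lxml. The markup, the `marks` id, and the `extract_marks` helper are all invented for illustration; the real pages were laid out differently and the original queries are not reproduced here.

```python
from lxml import html

# Hypothetical markup (an assumption, not the real page): a layout table
# wrapping the data table, with one unclosed <tr> to mimic sloppy HTML.
SAMPLE = """
<table id="layout">
  <tr><td>
    <table id="marks">
      <tr><th>Subject</th><th>Marks</th></tr>
      <tr><td>Physics</td><td>78</td></tr>
      <tr><td>Chemistry</td><td>84</td>
    </table>
  </td></tr>
</table>
"""


def extract_marks(page_source):
    """Return {subject: marks} from the inner table of the sample layout."""
    tree = html.fromstring(page_source)
    # Select only rows of the inner table that contain <td> cells,
    # which skips the header row and the outer layout table.
    rows = tree.xpath('//table[@id="marks"]//tr[td]')
    result = {}
    for row in rows:
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        if len(cells) == 2:
            result[cells[0]] = int(cells[1])
    return result


if __name__ == "__main__":
    print(extract_marks(SAMPLE))
```

The point of the predicate `tr[td]` is that one expression can reach straight into the inner table, so the code never has to walk the layout tables by hand.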
Solution 2: Python + lxml
Python was used to write the brute-forcing script. It made the tedious job much easier to work with.
lxml was used to evaluate the XPath expressions and extract the data.
Finally, some helper scripts split the data into yearly chunks, segregated it by class, and organized it into nice self-contained JSON files.
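The chunking step could look something like the sketch below. The record fields (`year`, `class`, `id`) and the file naming scheme are assumptions made up for this example, not the layout of the actual data.

```python
import json
from collections import defaultdict
from pathlib import Path


def write_chunks(records, out_dir):
    """Group records by (year, class) and write one JSON file per group.

    The "year" and "class" keys are hypothetical field names.
    Returns the list of paths written, in sorted group order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    groups = defaultdict(list)
    for record in records:
        groups[(record["year"], record["class"])].append(record)

    written = []
    for (year, klass), items in sorted(groups.items()):
        path = out_dir / f"{year}-class-{klass}.json"
        path.write_text(json.dumps(items, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

Each output file is self-contained, so a single class and year can be loaded on its own without touching the rest of the data.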
Solution 3: Koding
My internet connection is too slow for the script to be practical. A VPS from Koding helped me get the job done in minutes instead of weeks.
The entire extraction took 15 minutes on the VPS. There were more than 2,500 records; my connection can manage 25.1 records in 15 minutes.
What went Right
The extraction and processing.
What went Wrong
A lot of things:
- Firstly, I could have used threading. It would have made everything much, much faster.
- Secondly, I should have used Scrapy.
- Thirdly, I should have done my homework on the subjects and classes.
- Fourthly, I wasted a lot of requests crawling useless years like 2000 and 1998.
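The first and fourth points can be sketched together: restrict the candidate IDs to plausible years, then fetch them from a thread pool. This is only an illustration of the idea, not the script that was actually run. The ID scheme (year followed by a zero-padded serial), the `fetch` callback, and `fake_fetch` are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def candidate_ids(years, serials):
    """Build IDs under an assumed scheme: four-digit year + 4-digit serial."""
    return [f"{year}{serial:04d}" for year in years for serial in serials]


def crawl(ids, fetch, max_workers=8):
    """Call fetch(student_id) concurrently; keep only IDs that gave a record."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, ids)  # results come back in input order
        return {sid: rec for sid, rec in zip(ids, results) if rec is not None}


def fake_fetch(student_id):
    # Stand-in for the real HTTP request: pretend only IDs ending in "1" exist.
    return {"id": student_id} if student_id.endswith("1") else None


if __name__ == "__main__":
    # Only crawl years that can plausibly hold records, not 1998 or 2000.
    ids = candidate_ids(range(2013, 2015), range(1, 4))
    print(crawl(ids, fake_fetch))
```

Threads suit this job because almost all the time is spent waiting on the network, so several requests can be in flight at once even on a single core.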
Downloads
OK. Here you go: the data download link has since been removed.