A note to the reader
This post is a legacy post. The legacy posts that are available on this website were written many years ago. These posts are made available here for archival purposes only.
They reflect the age I was, and the level of knowledge that I had when I wrote them, and they may contain outdated information, so please keep that in mind as you proceed to read this article.
Last week, I scraped the school website clean looking for report cards. I wanted the data to run some statistics (the website itself is very spartan) and, much to my pleasure, I was able to get all the data I needed.
I needed the report card data in JSON format. The format it came in was, quite unsurprisingly, bad HTML.
Problem 1: Bad HTML
The website is built with ASP.NET. What better could I expect? Tables nested in tables.
That’s an example of the stuff I had to deal with
Problem 2: Inconsistent ID Numbers
Come on. Who would expect people from Class 11 and 12 to have joined the school in 2014?
Solution 1: XPath
XPath is what saved me. True, it took some analysis of the page structure, but I got a working set of expressions ready.
Solution 2: Python + lxml
Python was used to write the brute-forcing script. It made the tedious job much easier to work with.
lxml was used to evaluate the XPath expressions and extract the data.
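To give a feel for the approach, here is a minimal sketch of pulling marks out of a table nested inside another table with lxml and XPath. The markup, the `marks` id and the `extract_marks` helper are all made up for illustration; the real pages and the expressions I actually used are not reproduced here.

```python
from lxml import html

# Hypothetical nested-table markup, standing in for the real report card page.
SAMPLE = """
<html><body>
<table id="outer">
  <tr><td>
    <table id="marks">
      <tr><th>Subject</th><th>Marks</th></tr>
      <tr><td>Physics</td><td>78</td></tr>
      <tr><td>Chemistry</td><td>84</td></tr>
    </table>
  </td></tr>
</table>
</body></html>
"""


def extract_marks(page_source):
    """Return {subject: marks} from the inner table (assumed layout)."""
    tree = html.fromstring(page_source)
    marks = {}
    # tr[td] keeps only rows that contain <td> cells, which skips the
    # header row made of <th> cells.
    for row in tree.xpath('//table[@id="marks"]//tr[td]'):
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        if len(cells) == 2:
            marks[cells[0]] = int(cells[1])
    return marks


print(extract_marks(SAMPLE))
```

Anchoring the expression on the inner table means the outer layout tables can be ignored entirely, which is most of what makes tables-in-tables bearable.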
Finally, some helper scripts split the data into yearly chunks, segregated it by class, and organized it into nice self-contained JSON files.
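The helper step can be sketched roughly like this. The record fields (`year`, `class`), the file naming and both function names are assumptions for the example, not the layout of the original scripts.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_records(records):
    """Group records by (year, class). Field names are assumed."""
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["year"], record["class"])].append(record)
    return dict(grouped)


def write_chunks(records, out_dir):
    """Write one self-contained JSON file per (year, class) group."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for (year, class_name), chunk in sorted(group_records(records).items()):
        # Hypothetical naming scheme, e.g. 2014_class11.json
        path = out_dir / f"{year}_class{class_name}.json"
        path.write_text(json.dumps(chunk, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

Calling `write_chunks(records, "out")` on a flat list of scraped records would leave one file per year and class in `out/`, each readable on its own.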
Solution 3: Koding
My internet connection is too slow for the script to be practical. A VPS from Koding got the job done in minutes instead of weeks.
The entire extraction took 15 minutes on the VPS. There were a little over 2,500 records; my own connection manages about 25.1 records in 15 minutes.
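As a rough back-of-the-envelope check, the two figures above can be compared directly. This counts only the records that were actually found and treats 2,500 as a lower bound; the brute-force run also made plenty of requests for IDs that did not exist, so the real time at home would have been longer than this estimate.

```python
RECORDS = 2500                  # "a little over 2,500", so a lower bound
HOME_RECORDS_PER_15_MIN = 25.1  # figure quoted for the home connection


def home_estimate_hours(records, records_per_15_min):
    """Hours needed at home to fetch `records` successful records."""
    intervals = records / records_per_15_min  # number of 15-minute slots
    return intervals * 15 / 60


speedup = RECORDS / HOME_RECORDS_PER_15_MIN
hours = home_estimate_hours(RECORDS, HOME_RECORDS_PER_15_MIN)
print(f"speed-up: about {speedup:.0f}x, home estimate: about {hours:.0f} hours")
```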
What went Right
The extraction and processing
What went Wrong
A lot of things
- Firstly, I could have used threading. It would have made everything much, much faster.
- Secondly, I should have used Scrapy.
- Thirdly, I should have done my homework about the subjects and classes.
- Fourthly, I did a lot of useless year-crawling, trying years like 2000 and 1998.
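The first and last points could look something like the sketch below: only generate IDs for plausible years, and fetch them from a thread pool. The `YYYYNNNN` ID format, the function names and the `fake_fetch` stand-in are all assumptions for the example; the real ID scheme and request code are not shown in this post.

```python
from concurrent.futures import ThreadPoolExecutor


def candidate_ids(years, max_serial):
    """Yield hypothetical IDs of the form YYYYNNNN for the given years only."""
    for year in years:
        for serial in range(1, max_serial + 1):
            yield f"{year}{serial:04d}"


def fetch_all(ids, fetch, workers=8):
    """Call fetch(id) from a thread pool; keep IDs that returned a record."""
    ids = list(ids)  # materialize so the IDs can be paired with results
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch, ids)  # results arrive in input order
        return {i: r for i, r in zip(ids, results) if r is not None}


def fake_fetch(student_id):
    """Stand-in for the real HTTP request; returns None for unknown IDs."""
    known = {"20140002": {"name": "example"}}
    return known.get(student_id)


found = fetch_all(candidate_ids(range(2012, 2015), 3), fake_fetch)
print(found)
```

Threads suit this job because nearly all the time is spent waiting on the network, and restricting `years` up front avoids ever asking for 1998 in the first place.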
OK. Here you go: the data download link has now been removed.