Web scraping 101 – scraping standard pages


I have recently been experimenting with web scraping and open data. Automation is often crucial here to avoid repetitive (and useless) manual tasks.

A lot of companies and governments open an API for people to use, and these can be queried with Python. For instance, I was pleasantly surprised to find that Copenhagen has opened its data at http://data.kk.dk/, a compilation of municipal data ranging from toilet locations to biodiversity measures and satisfaction surveys. DSB also has an API: http://www.dsb.dk/dsb-labs/. Lots to play with! With a little more time, that would make for very interesting maps.
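To give an idea of what querying such an API from Python can look like, here is a minimal sketch using only the standard library. It assumes the endpoint returns a JSON list of flat records. The URL in the comment and the helper names `fetch_json` and `records_to_csv` are hypothetical, made up for illustration; the real endpoints and response shapes are described in each portal's own documentation.

```python
import csv
import io
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def records_to_csv(records):
    """Turn a list of flat dicts into CSV text, one column per key."""
    if not records:
        return ""
    # Collect every key in first-seen order so no field is silently dropped.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


# Hypothetical usage (placeholder URL, not a real endpoint):
# print(records_to_csv(fetch_json("https://example.org/api/toilets.json")))
```

Keeping the download and the conversion in separate functions makes it easy to try the CSV step on a small hand-written list before pointing the script at a live API.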

In the meantime, I found a few tools that make standard web scraping easy, for instance kimono and https://import.io/. kimono is easier to customize, while import.io can be used for quick and dirty scraping.


import.io

As easy as Ctrl+C, Ctrl+V. It makes a spreadsheet out of lists on web pages, which you can then export as CSV.



kimono

You have to download an extension for Chrome, sign in, then bookmark it and use it. It is a little more complex than import.io, but so much more flexible. You can also use it on multiple pages. For me, the highlight of this application is its help section (see below). If only all applications had such a help page, we would spend even less time discovering new pages.


Google Spreadsheet

If you only want to scrape a table or a list from a web page, there is an even easier way: just use Google Spreadsheet. I found this very good tutorial: http://www.mulinblog.com/basic-web-scraping-data-visualization-using-google-spreadsheets/ that explains how to use the function importHtml. It takes three parameters: the URL of the page (a Wikipedia page in the tutorial), the type of data to scrape (table or list), and the occurrence number of that table or list in the page. You only need to enter the function in a cell, for instance the top left cell of your sheet, and fill in the parameters. If you want to modify the spreadsheet afterwards, don’t forget to copy and paste the result as values, otherwise it’s impossible to add columns in the middle of the table created by the function. Apparently it’s possible to have 5 of these functions in the same spreadsheet.
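In the sheet itself, the call looks like `=IMPORTHTML("https://example.org/page", "table", 1)`, where the URL is only a placeholder. For the curious, the same idea can be sketched in Python with the standard library. This is only a rough illustration of what "the n-th table of a page" means, not how Google implements it. The names `TableExtractor` and `import_html_table` are my own, and the sketch assumes reasonably well-formed HTML without nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far
        self._row = None   # row being filled, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def import_html_table(html, index):
    """Return the index-th table (counting from 1, like importHtml)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[index - 1]


page = (
    "<table><tr><th>Name</th><th>Country</th></tr>"
    "<tr><td>Tango</td><td>Argentina, Uruguay</td></tr></table>"
)
print(import_html_table(page, 1))
```

To run it on a live page, the HTML string could come from `urllib.request.urlopen(url).read().decode("utf-8")`. Real-world pages are often messier than this parser expects, which is a good reason to stick with the spreadsheet function for quick jobs.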

I recently used this to create a list of the intangible cultural heritage of humanity from the Wikipedia article. More later on how to turn that into a shapefile / GIS data!

