S2DS – 3 weeks


I’m three weeks (out of five) into the S2DS course and it already feels like we haven’t done anything and won’t have enough time. The truth is, we’ve been super busy, working more than full time on the project. Thankfully – although we didn’t think it was a good thing at first – we have a mid-project presentation on Tuesday, which forced us to sit back and reflect. What have we done? What is our plan? Well, we have actually done a lot and we still have a lot of options. We are also in a much better position to advise the NGO we work with on what we can do, what we need, and what we can’t do just yet.
For one thing, we’ve finally understood the structure of the NGO, and of the data. I recently re-read the description of the data and realised that almost everything was there. But during the first week it was just impossible to take in, as too many things were new: the team, the data, and the NGO itself. Now we know much better what we have in our hands, and I’m glad to be working for an NGO, as I understand better the complexity and specificity of how they operate. And the data is mostly clean! It took so long to get there; the data was such a mess (hello, email addresses stored with phone numbers, O instead of 0, inconsistent data split across many sheets..). The trouble is, making a machine read what was stored to be read by a human is not straightforward. Many old companies and charities have legacy databases that were imported wholesale into newer, very permissive databases. That’s where our cleaning began.
Then, obviously, the data is not stored the way we need it. Not just for the analysis, but also for understanding what we can do. For too long in this project we looked at the mess of data we had without being able to question it. But now we’re ready, and we still have time to see what we can do!

What have I learned so far?

Our project so far has mainly consisted of data cleaning, data sourcing, database handling, and data prepping. Others in the group are currently doing some natural language processing or spatial analysis.

Data sourcing

I’ve managed to source data from a Facebook page (using their Graph API), from Twitter (with the Twitter Search API and the main Twitter API), and from a forum (using Scrapy).
Interestingly, a Facebook page has the same structure as a forum: multiple posts, each of them having comments; a bit like a tree. At first I found it hard to get the hang of the order of steps and the loops needed to walk through each post and its comments before moving on to the next post. At one point I was reopening the file storing the comments and erasing all the data in it! Anyway, I managed something mostly satisfactory and hope I’ve learned something for next time.
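For the curious, here is a minimal sketch of that walk, assuming the Graph API’s usual posts/comments endpoints and paging format (the page ID, token, API version, and field names are placeholders, not our actual setup):

```python
import csv
import requests

GRAPH = "https://graph.facebook.com/v2.5"  # API version is illustrative
TOKEN = "YOUR_ACCESS_TOKEN"                # placeholder

def paged(url, params=None):
    """Yield items from a Graph API endpoint, following paging.next."""
    while url:
        payload = requests.get(url, params=params).json()
        yield from payload.get("data", [])
        url = payload.get("paging", {}).get("next")
        params = None  # the 'next' URL already carries the query string

# Open the output file ONCE, outside the loops: reopening it in 'w' mode
# for every post is exactly the mistake that erased my data.
with open("comments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["post_id", "comment_id", "message"])
    for post in paged(GRAPH + "/SOME_PAGE_ID/posts", {"access_token": TOKEN}):
        for comment in paged(GRAPH + "/" + post["id"] + "/comments",
                             {"access_token": TOKEN}):
            writer.writerow([post["id"], comment["id"],
                             comment.get("message", "")])
```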
With Twitter I learned a lot about ‘waiting’, since the rate limits don’t let you pull that much data at once: it took my computer two days to scrape 50k followers.
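A sketch of how that waiting can be handled, assuming the tweepy library (the method names here follow tweepy 3.x; newer versions renamed followers_ids to get_follower_ids, and the account name and credentials are placeholders):

```python
import tweepy

# Placeholder credentials from a Twitter developer account
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

# wait_on_rate_limit makes tweepy sleep until the rate-limit window
# resets instead of raising an error: hence the days of waiting.
api = tweepy.API(auth, wait_on_rate_limit=True)

follower_ids = []
for page in tweepy.Cursor(api.followers_ids,
                          screen_name="some_account").pages():
    follower_ids.extend(page)  # ids arrive in batches per page

print(len(follower_ids), "follower ids collected")
```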
With the forum, I sharpened my XPath skills again, which is really useful for extracting information from structured pages such as HTML and XML documents.
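For illustration, a minimal Scrapy spider using XPath selectors; the forum URL and class names are made up, and .get() is the modern spelling of extract_first():

```python
import scrapy

class ForumSpider(scrapy.Spider):
    name = "forum"
    start_urls = ["https://forum.example.com/threads"]  # hypothetical forum

    def parse(self, response):
        # Pull the title and link of each thread with XPath
        for thread in response.xpath('//div[@class="thread"]'):
            yield {
                "title": thread.xpath('.//a[@class="title"]/text()').get(),
                "url": thread.xpath('.//a[@class="title"]/@href').get(),
            }
        # Follow the pagination link, if there is one
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```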
I’m also becoming a master at using the command line to merge CSV files, merge Excel sheets into one, and generally shuffle things around. The day I sent a file into a non-existent folder is way behind me (what happened to that file? It went into nothingness, apparently..)
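When the command line isn’t enough, pandas does the same merging job; a sketch with made-up file names:

```python
import pandas as pd
from pathlib import Path

# Read every sheet of a workbook into one DataFrame
sheets = pd.read_excel("members.xlsx", sheet_name=None)  # dict: name -> DataFrame
workbook = pd.concat(sheets.values(), ignore_index=True)

# Merge a folder of CSV exports into a single file
parts = sorted(Path("exports").glob("*.csv"))
merged = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
merged.to_csv("merged.csv", index=False)
```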

Data cleaning

We had data in different formats, especially text, but also dates and phone numbers. Some dates were not entered in a recognised format, so we had to transform them: e.g. 20061203.0 for 3rd December 2006. I managed to clean them up so pandas could read them as dates. The phone numbers were largely inconsistent, not just because some used a space, a dash, or dots, but also in the country code: +44, +44 (0), or 0044. The worst was “Lancs 12345” to indicate a Lancashire prefix followed by 12345. Huh. Seriously. Many were stored this way.
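Roughly what the cleaning looked like, as a sketch (the column names and the exact normalisation rules are illustrative, not our real pipeline):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "joined": [20061203.0, None, 20110415.0],
    "phone": ["+44 (0)1234 567890", "0044-1234-567890", "01234.567890"],
})

# Floats like 20061203.0 are really YYYYMMDD dates in disguise;
# assignment aligns on the index, so missing rows become NaT.
df["joined"] = pd.to_datetime(
    df["joined"].dropna().astype(int).astype(str), format="%Y%m%d"
)

def clean_phone(raw):
    s = re.sub(r"[^\d+]", "", str(raw))   # keep only digits and '+'
    s = re.sub(r"^0044", "+44", s)        # 0044...    -> +44...
    s = re.sub(r"^\+440", "+44", s)       # +44 (0)... -> +44...
    s = re.sub(r"^0(?=[1-9])", "+44", s)  # domestic 0 -> +44
    return s

df["phone"] = df["phone"].apply(clean_phone)
# All three examples now come out as +441234567890
```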

Database handling

So far I had mainly used MySQL to store data and interact with it from PHP and SQL. This time we used SQLite, and we query it from Python. That’s really handy, since we have 2 GB CSV files. We also used draw.io to draw the database structure collaboratively. The pattern looks roughly like the sketch below.
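A sketch with hypothetical file, table, and column names: load the big CSV into SQLite in chunks, then query it straight back with pandas.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("ngo.db")  # hypothetical database file

# Stream the big CSV in chunks so the 2 GB never sits in memory at once
for chunk in pd.read_csv("donations.csv", chunksize=100000):
    chunk.to_sql("donations", conn, if_exists="append", index=False)

# Then query straight back into a DataFrame
top_donors = pd.read_sql_query(
    "SELECT donor_id, SUM(amount) AS total "
    "FROM donations GROUP BY donor_id "
    "ORDER BY total DESC LIMIT 10",
    conn,
)
print(top_donors)
```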
Now that the data is mostly clean, we can finally query it!