S2DS – done!

The project with S2DS is finally done! I’m glad it’s finished: group projects can be challenging, especially remote ones. We nevertheless managed to deliver what we hope will be useful to the organisation. Since the project was mainly exploratory, we set off in many directions that we never fully exploited. It’s a shame, but I personally learned a lot during these 5 weeks.

My main contribution to the project was building the consolidated database (during the project I referred to it as ‘the world’, and it did become MY world for a good week...), which I eventually renamed ‘the network’. It contains everyone who has ever appeared in the datasets that the organisation holds. The quality of the data was inconsistent, which made it difficult to put together. It was also only meant as an exploration, to understand how people approach the organisation and how long they stay with it, so it is not something to build on. From this network, Francesco was able to produce some statistics on people’s presence across datasets. He then looked for common behaviour among certain users of the services, which was quite successful.

I had already dealt a fair bit with databases. I had designed my own, especially to store survey results, and made many mistakes in doing so (like storing checkbox results as a list in a single field, which is a pain when analysing). I had even stored geographic data (points as latitude and longitude, but also lines, using the geocoding algorithm from Google). However, I had never dealt with so many datasets in such an inconsistent state, nor had I ever tried to consolidate databases. It was therefore a huge step for me to do it with Python, not just once but 5 times, since most of the datasets actually listed individuals many times. In fact, one of the databases was not supposed to contain duplicated individuals at all, but I discovered when making the network that 10% of it was duplicated.

I used the email and phone to match individuals. This required cleaning many phone fields (it is incredible what people write in a ‘phone’ entry) and email fields (same here, and these were more difficult to clean in an automated way). I tried using people’s names, but this created some spurious duplicates, even for names that I didn’t think were common.
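For illustration, here is a minimal sketch of this kind of matching. It is not the project code: the field names (`email`, `phone`), the cleaning rules, and the 7-digit minimum for a plausible phone number are all assumptions made for the example. Records are grouped whenever they share a cleaned email or a cleaned phone number, using a small union-find so that matches chain together.

```python
import re
from collections import defaultdict


def clean_phone(raw):
    """Keep digits only; return None if too few remain (assumed minimum: 7)."""
    digits = re.sub(r"\D", "", raw or "")
    return digits if len(digits) >= 7 else None


def clean_email(raw):
    """Trim and lowercase; return None unless it looks roughly like an email."""
    value = (raw or "").strip().lower()
    return value if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) else None


def match_individuals(records):
    """Group record indices that share a cleaned email or phone number."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, cleaned value) -> index of first record with it
    for i, rec in enumerate(records):
        keys = [
            ("email", clean_email(rec.get("email"))),
            ("phone", clean_phone(rec.get("phone"))),
        ]
        for key in keys:
            if key[1] is None:
                continue  # unusable value: never match on it
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])  # merge the two groups
            else:
                first_seen[key] = i

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Hypothetical records, invented for this example.
records = [
    {"email": "Ann@Example.org ", "phone": "020 7946 0000"},
    {"email": "", "phone": "(020) 7946-0000"},
    {"email": "ann@example.org", "phone": "n/a"},
    {"email": "bob@example.org", "phone": "07700 900123"},
]
groups = match_individuals(records)
```

Here the first three records end up in one group and the fourth stays on its own. The sketch deliberately ignores harder cases, such as the same number written with and without an international prefix, which would need extra normalisation.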

In the end, we produced an interesting picture of which services are used by which users. This is quite helpful for understanding how people use the different services, and how efficient these services are at pulling users from one to the other.