Topic Modelling of Research in the Arts and Humanities


We just published a report on “Topic Modelling of Research in the Arts and Humanities”. The project was actually completed before I joined Digital Science, but to familiarise myself with topic modelling I redid the work and wrote the report with my colleague.

The project was really interesting: the Arts and Humanities have not been much studied using Natural Language Processing tools (essentially due to a lack of data, since the subject does not produce as many peer-reviewed articles as the Sciences), and we had access to a large collection of funded and unfunded grant applications from the British national funding agency for arts and humanities, the Arts and Humanities Research Council (AHRC). These applications had been submitted to pre-defined panels, but what would a data-driven categorisation look like?

We used Python to perform topic modelling (see Wikipedia) and uncover latent topics in the text. The hardest parts of topic modelling are cleaning the data (how much should you remove to ensure that no irrelevant links appear, as would happen, for instance, if half of your documents contained a “Copyright” notice?) and selecting the number of topics. To clean the data, after removing artefacts from the dataset we used Tf-idf (term frequency–inverse document frequency) to filter out words that were too frequent or too infrequent. Then came the selection of the number of topics. Some researchers have devised a stability measure that helps select a number of topics giving a stable model. The right number of topics depends on the homogeneity of the dataset, as well as on the purpose of the exercise.

The analysis we wanted to do implied an in-depth description of a relatively homogeneous set of academic subjects (arts and humanities only), so we needed somewhere between 150 and 250 topics. We ran our model for each number of topics from 150 to 250 and, looking for a high stability, settled on 185 topics. It's difficult to tell what the impact of adding or removing a single topic is. I had done some visualisations earlier this year, and changing the number doesn't alter topics consistently: most remain the same and a few change (not one topic dividing into two, but rather a few splitting into more), so it's hard to say what the ideal number of topics is. In the end, a topic model is only a representation of a collection of documents, so the best approach is to pick a number that suits your analysis and represents your dataset well.

The press release describing the report can be found here: