The accessibility of text data is changing how scientists study the social world. Whereas researchers once needed deep area knowledge, big budgets, trained assistants, and years of careful hand labeling of text to get results, today an individual researcher can approximate the same process in a matter of days.
A crucial step in this change is the availability of huge collections of digitized textual data, a product of the big data revolution. But a second essential ingredient is the means to automate the extraction of patterns from a collection of text documents. Northwestern postdoctoral researcher Martin Gerlach and his collaborators Tiago Peixoto of the University of Bath and Eduardo Altmann of the University of Sydney recently published a new, principled way to approach this complex problem. Their paper, “A network approach to topic models,” appears in the general science journal Science Advances.
The practice of extracting topics from a collection of documents is known as topic modeling and is widely used in academic and industry research. The dominant form of topic modeling, latent Dirichlet allocation (or just LDA), is simple in its intuitions. We assume that all the documents in the collection were generated by topics that the writers had in mind. Topics are general concepts that are conveyed by the collection of words used to write or talk about them.
A climate change topic has words like temperature, increase, effects, and costs. A breakfast topic might have words like coffee, bagel, muffin, and quick. While human readers can learn to identify broad themes like climate change or breakfast directly, computers can’t. Instead, automatic approaches try to infer them from the associations found between words in the text. The result of this inference process is a model of latent topics that plausibly generates the collection of actual text.
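To make that inference step concrete, here is a minimal sketch of fitting an LDA model with scikit-learn. The four toy documents are invented for illustration and stand in for a real corpus; everything else is the library’s standard API.

```python
# Toy LDA example with scikit-learn; the documents are invented
# stand-ins for a real corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "temperature increase effects carry economic costs",
    "coffee and a quick bagel for breakfast",
    "a muffin and coffee make a quick morning meal",
    "climate effects raise costs as temperature increases",
]

# Convert raw text into a document-word count matrix.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()

# Note that LDA demands the number of topics up front (n_components).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Each row of components_ is an inferred topic: a weight for every word.
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[::-1][:4]]
    print(f"topic {k}: {top_words}")
```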
While the intuitions of topic modeling are straightforward, the mechanics of the required statistical inference are not, and it turns out that the LDA technique has some shortcomings. For one, it cannot automatically identify the number of topics in the collection, requiring the researcher to guess. The number of topics is immensely consequential in determining the outcome of LDA, so the researcher is often left to trial and error. Another, less visible problem is that the topics LDA identifies tend to have similarly shaped word distributions, something that is very likely untrue of real topics. These and other issues have researchers thinking about approaches that can improve on LDA.
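In practice, that guess becomes a loop: fit a model for each candidate topic count and compare a fit score such as perplexity. A sketch of what that trial-and-error looks like, using a random synthetic count matrix purely as a placeholder for real data:

```python
# Sketch of the trial-and-error LDA imposes: fit several candidate
# topic counts and compare perplexity (lower is better; ideally this
# is scored on held-out documents). The random count matrix below is
# a placeholder for a real document-word matrix.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(200, 100))  # 200 "documents", 100 "words"

for n_topics in (2, 5, 10, 20):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    print(f"{n_topics:>2} topics -> perplexity {lda.perplexity(X):.1f}")
```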
For inspiration in addressing these problems, Gerlach turned to research on community detection in networks. Networks are a way of describing a set of relationships: who knows whom, which airports have service between them, which existing research new scientific articles reference. As the Six Degrees of Kevin Bacon parlor game helps to show, networks rarely have complete breaks; there is almost always a path from one person, airport, or article to any other. This means it is not always visually apparent which elements form a functional group. In social life, communities are one such functional group, and the methods of community detection grew out of a desire to find meaningful groups within the thicket of massive social networks.
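For a feel of what community detection does, here is a small example on Zachary’s karate club, a classic benchmark network bundled with the networkx library. The modularity-based heuristic used here is just one of many detection methods, and not the one the paper builds on:

```python
# Community detection on a classic benchmark network using one of
# networkx's built-in methods (a modularity-based heuristic).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # 34 club members; edges record who interacted
for i, members in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(members)}")
```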
The basic problem of community detection has a lot in common with topic modeling. We often don’t know the groups that individuals belong to, but we might know with whom they associate and use that information to infer the groups. Crucially, the inference techniques developed in community detection research have already addressed some of the same problems that undermine the usefulness of the LDA topic modeling technique. Gerlach and his collaborators capitalized on these advances and adapted elements of community detection to the problem of topic modeling in what they call the hierarchical stochastic block model approach to topic modeling, or hSBM.
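The authors’ implementation builds on Peixoto’s graph-tool library. A rough sketch of the core idea, constructing a bipartite document-word network and fitting a nested (hierarchical) stochastic block model to it, might look like the following; this is an illustration of the approach under those assumptions, not the authors’ released code.

```python
# Sketch of the hSBM idea with graph-tool: build a bipartite network
# linking documents to the words they contain, then fit a nested SBM.
# Illustrative only; the authors' released implementation does more.
import graph_tool.all as gt

docs = [
    ["temperature", "increase", "effects", "costs"],
    ["coffee", "bagel", "muffin", "quick"],
]

g = gt.Graph(directed=False)
kind = g.new_vertex_property("int")  # 0 = document node, 1 = word node
g.vp["kind"] = kind

word_vertex = {}
for words in docs:
    d = g.add_vertex()
    kind[d] = 0
    for w in words:
        if w not in word_vertex:
            v = g.add_vertex()
            kind[v] = 1
            word_vertex[w] = v
        g.add_edge(d, word_vertex[w])

# Fit the nested SBM: the number of groups (topics and document
# clusters alike) and the depth of the hierarchy are inferred from
# the data rather than prespecified. The clabel/pclabel constraint
# keeps document nodes and word nodes in separate groups.
state = gt.minimize_nested_blockmodel_dl(
    g, state_args=dict(clabel=kind, pclabel=kind)
)
state.print_summary()
```

Because documents and words live in the same fitted model, the groups of document nodes double as the document communities described below.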
The results speak for themselves. The approach doesn’t need a prespecified number of topics, and tests of its ability to recover known ground-truth topics come down decisively in favor of hSBM. As a bonus, the method also groups documents into communities, letting researchers simultaneously identify both what is being discussed and who is discussing it.
Gerlach says natural language processing is a fast-moving research area, so it’s not yet clear how others will put the method to use or what insights it might ultimately lead to, but he’s hopeful it can shed light on important dynamics like language change.