There is a controversy over “Computational linguistics and literary scholarship.” A paper published at the Association for Computational Linguistics David Bamman, Brendan O’Connor, & Noah A. Smith, “Learning Latent Personas of Film Characters“) has come under criticims by Hannah Alpert-Abrams with Dan Garrette, “Some thoughts on the relationship between computational linguistics and literary scholarship“. The primary criticism has been that the ACL paper fails to cite any literary scholarship past the middle of the last century.
Brendan O’Connor (a friend of mine, I should say) responded by saying the authors of the ACL paper were not trying to do literary theory, but “developing a computational linguistic research method of analyzing characters in stories.” O’Connor states that this is a paper about method, not literary content; and compares it to the early days of quantitative models in social science and economics (with which he is more familiar).
There is a slight irony in this controversy because a common trope on the Language Log blog (where Alpert-Adams and Garrette published their comments) is about physicists expounding on linguistics without consulting any linguists (for example, “Word String frequency distributions” and Joshua Goodman’s wonderful rant, “Language Trees and Zipping.”) This idea of poaching on other people’s grounds – or, better, trying to do other people’s work for them without a proper understanding of just what that work is – is something linguists, computational or otherwise, might have a sensitivity to.
However, I think this overlooks the real criticism of the latent personas work. To quote from Alpert-Abrams and Garrette:
When we look at Wikipedia entries about film, for example, we would not expect to find universal, latent character personas. This is because Wikipedia entries are not simply transcripts of films: they are written by a community that talks about film in a specific way. The authors are typically male, young, white, and educated; their descriptive language is informed by their cultural context. In fact, to generalize from the language of this community is to make the same mistake as Campbell and Jung by treating the worldview of an empowered elite as representative of the world at large.
I am no literary scholar, but it’s common knowledge that much of post-modern literary scholarship concerns itself with power relationships. One might even say that much of what it means to be post-modern is to study those relationships, especially to disillusion elites and recover disempowered voices. Alpbert-Abrams and Garrette’s criticism is, I believe, not so much about poaching or about ignoring recent advances in the field of literary criticism, but ignoring the realities of power in the Wikipedia data.
This affects the work of Bamman, O’Connor and Smith in at least two ways. First, it suggests that the data upon which they base their conclusions is skewed in a way they did not recognize. It is reasonable for O’Connor to say that they were more interested in method, and not criticism. But it is very important for any machine learning programme that one understands the data that models are being run on. To ignore the quality of the data will produce likely produce erroneous results (in fact, this is one of the points that Mark Liberman makes in his friendly criticism of the word string frequency paper mentioned above). Second, and this is related to the first issue, given the skewed nature of Wikipedia contributors towards young educated males, the results based on this data will likely be towards perpetuating models reflecting their views, and further disempowering women, elders, and non-educated people. The Wikipedia project states that the top quartile of Wikipedia participants starts at 30 years old, and half are under 22. It is easy to imagine how the comparative lack of data about films from people in their 30s, 40s, 50s, etc, can skew the results, and the same is easily said for women (12% of contributors) and educational level, co-varies with the young age of the average participant. Actually, the Wikipedia study did not provide data on race/ethnicity (only nationality), although it seems prima facie likely that contributors tend to be white, heterosexual, and cis-gendered.
As a single example of this, Bamman, O’Connor and Smith list the topics in a 50-topic model in their table 3. Two of the fifty topics are WOMAN (woman, friend wife, sister, husband) and WITCH (witch, villager, kid, boy, mom) – the label name is, as is typical for this research, the word with the highest PMI (pointwise mutual information) for the topic. There is no topic for MAN (because it is totally unmarked?) or ethnicity except for monsters – FAIRY, WEREWOLF, ALIEN. MAYA is a topic, but the other high PMI words in this topic are (monster, monk, goon, dragon). Granted that movies skew towards the interests of young males, these topics are not surprising, but it is likely that the makeup of contributors even further skews these results.
I am grateful for Alpert-Abrams and Garrette’s note. Wikipedia is increasingly being seen as a source of ground truth for any number of machine-learned models and systems. Its breadth and depth make it natural to use Wikipedia titles as a kind of list of things of interest (see, for example, the Wikifier project, which does entity extraction based on Wikipedia titles and content). It is a good reminder that the of the contributors to Wikipedia will bias results based on Wikipedia in certain systematic (and political) ways; understanding these is, as Alpert-Abrams and Garrette note, an important project in its own right.