From time to time, the OTW will be hosting guest posts on our OTW News accounts. These guests will be providing an outside perspective on the OTW or aspects of fandom where our projects may have a presence. The posts express each author’s personal views and do not necessarily reflect the views of the OTW or constitute OTW policy. We welcome suggestions from fans for future guest posts, which can be left as a comment here or sent to us directly.
Smitha Milli is a 4th-year undergraduate at UC Berkeley whose research interests lie in artificial intelligence and cognitive science. Today, Smitha talks about her research using natural language processing to reveal patterns in fanfiction texts, the results of which are available online.
How did you come to work with fanfiction in your research?
At the time I started this project, my main research focus was natural language processing (NLP). Natural language processing is a subfield of artificial intelligence that is concerned with creating algorithms to process and understand language. If you’ve ever used Google Translate or Siri, you’ve used products that depend on NLP research!
In addition to having many commercial applications, NLP can also be used as a tool to explore literature. People have automatically tracked dynamic relationships between characters, created computational models of literary character, and analyzed the change in emotional content over the course of a story. However, a bottleneck to improving algorithms in the literary domain was the lack of a large-scale dataset of modern literature. I originally started looking into fanfiction as a source for this kind of data. As I looked further, I found that the structure of fanfiction also made it possible to define interesting, new problems for NLP and I became interested in computationally analyzing social science questions about fanfiction.
Could you explain how your study was done and what you found?
The goal of this work was to share what fanfiction has to offer to the fields of natural language processing, computational social science, and digital humanities. Towards this end, we collected a large dataset of fanfiction from fanfiction.net that consists of about 6 million stories written by around 1 million authors. To characterize the interaction between authors and readers, we analyzed the network structure of the community. We found that 52% of the authors in our dataset had reviewed another author’s story. On average, each of these authors had reviewed 13 other stories. We did exploratory data analysis to investigate the content of these reviews. In particular, we ran a statistical model called “latent Dirichlet allocation” to extract the different topics underlying the reviews. Probably unsurprisingly to most of you, the reviews consisted mostly of positive author encouragement (“please update!!!”) or emotional reactions to the story (“aww cute”).
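As a rough illustration of the two review statistics above (the share of authors who reviewed someone else's story, and the average number of such stories per reviewing author), here is a minimal sketch. The data layout, the names `authorship`, `reviews`, and `review_stats`, and the toy records are all assumptions made for this example; the interview does not describe how the dataset is stored or exactly how the figures were computed.

```python
from collections import defaultdict

def review_stats(authorship, reviews):
    """Return (share of authors who reviewed another author's story,
    mean number of distinct such stories per reviewing author).

    authorship: dict mapping story_id -> author name (hypothetical layout)
    reviews:    iterable of (reviewer name, story_id) pairs (hypothetical layout)
    """
    all_authors = set(authorship.values())
    stories_reviewed = defaultdict(set)
    for reviewer, story_id in reviews:
        story_author = authorship.get(story_id)
        # Count only reviews left by an author on a story written by someone else.
        if (reviewer in all_authors
                and story_author is not None
                and story_author != reviewer):
            stories_reviewed[reviewer].add(story_id)
    share = len(stories_reviewed) / len(all_authors)
    if stories_reviewed:
        mean_stories = (sum(len(s) for s in stories_reviewed.values())
                        / len(stories_reviewed))
    else:
        mean_stories = 0.0
    return share, mean_stories

# Toy data, invented for illustration only.
authorship = {"s1": "ana", "s2": "ben", "s3": "cat", "s4": "dev"}
reviews = [
    ("ana", "s2"), ("ana", "s3"),  # ana reviews two stories by other authors
    ("ben", "s1"),                 # ben reviews one
    ("cat", "s3"),                 # self-review, ignored
    ("guest", "s1"),               # reviewer who is not an author, ignored
]

share, mean_stories = review_stats(authorship, reviews)
print(f"{share:.0%} of authors reviewed another author's story")
print(f"Each reviewing author covered {mean_stories:.1f} stories on average")
```

On this toy data, two of the four authors qualify, and together they reviewed three stories by other authors.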
We also investigated differences between fanfiction and canon. Specifically, we compared ten canons present in the Gutenberg corpus to their fanfiction counterparts. (We used canons from Gutenberg so that we would have access to the text of the original stories. The canons we looked at were Les Miserables, Sherlock Holmes, Pride and Prejudice, Peter Pan, Alice in Wonderland, Anne of Green Gables, Jane Eyre, Little Women, The Scarlet Pimpernel, and The Secret Garden.)
In both fanfiction and canon, we found that female characters were mentioned less frequently than male characters. However, we did find that fanfiction had a slight, but highly statistically significant, increase in the frequency of female character mentions. In fanfiction, 42.4% of character mentions were female, while in the canons 40.1% of character mentions were female. We also analyzed how the number of times specific characters were mentioned differs between canon and fanfiction. For example, in Pride and Prejudice fanfiction, Mr. Darcy receives a large increase in mentions, while nearly every other character is mentioned less often than in canon.
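A difference of about two percentage points can be slight and still statistically significant when the number of mentions is large. The sketch below shows this with a two-proportion z-test. The interview does not say which test was used, and the mention counts are hypothetical, chosen only so that the proportions match the reported 42.4% and 40.1%. The test also treats each mention as independent, which real text does not guarantee.

```python
import math

def two_proportion_z(hits_a, total_a, hits_b, total_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (hits_a + hits_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: only the proportions (42.4% and 40.1%) come from the text.
fanfic_female, fanfic_total = 42_400, 100_000
canon_female, canon_total = 4_010, 10_000

z, p_value = two_proportion_z(fanfic_female, fanfic_total,
                              canon_female, canon_total)
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

With smaller hypothetical counts, the same two-point gap would produce a much larger p-value, which is why dataset size matters for this kind of claim.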
In addition to analyzing the fanfiction itself, the fact that the dataset had reader reviews on a chapter-by-chapter basis allowed us to pose a new, challenging NLP task: predicting reader sentiment towards characters. The goal of the task is to create an algorithm that, when given any character in a story, can predict whether a reader will like the character or not. To create labeled data for this task, we had annotators on Mechanical Turk label sentences in reader reviews as containing positive or negative sentiment towards a character. We then trained a simple machine learning model to classify characters as positive or negative based on the text of the fanfiction. Our simple model finds plausible features. For example, it picks up on the fact that characters that “hiss”, “sneer”, or “shove” tend to be disliked.
Despite this, it does not achieve high performance on the task. We believe that is because you need a much higher-level abstraction of characters to understand why a character is liked or disliked.
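To give a feel for how a simple model can surface features like “hiss” or “sneer”, here is a minimal sketch that scores words by smoothed log-odds between disliked and liked characters. This is not the model from the study, which the interview does not specify. The function names, the word lists, and the labels are all invented for illustration.

```python
import math
from collections import Counter

def log_odds(liked_docs, disliked_docs, alpha=1.0):
    """Score each word: positive means more typical of disliked characters.

    Each doc is a list of words associated with one character
    (a hypothetical representation chosen for this sketch).
    alpha is an add-alpha smoothing constant to avoid log(0).
    """
    liked = Counter(w for doc in liked_docs for w in doc)
    disliked = Counter(w for doc in disliked_docs for w in doc)
    vocab = set(liked) | set(disliked)
    n_liked = sum(liked.values()) + alpha * len(vocab)
    n_disliked = sum(disliked.values()) + alpha * len(vocab)
    return {
        w: math.log((disliked[w] + alpha) / n_disliked)
           - math.log((liked[w] + alpha) / n_liked)
        for w in vocab
    }

def predict_disliked(words, weights):
    """Sum the word scores; a positive total predicts 'disliked'."""
    return sum(weights.get(w, 0.0) for w in words) > 0

# Toy training data, invented for illustration only.
liked_docs = [["smile", "help", "laugh"], ["help", "hug"]]
disliked_docs = [["sneer", "hiss", "shove"], ["sneer", "laugh"]]

weights = log_odds(liked_docs, disliked_docs)
for word, score in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}: {score:+.2f}")
print(predict_disliked(["hiss", "laugh"], weights))
```

A word-level scorer like this has no notion of plot or motive, which fits the point above: recognizing why readers dislike a character likely needs a richer representation than individual words.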
Are you planning further research on fanfiction and, if so, what are your goals?
I’m not sure, but I do think that there’s lots of room to do interesting computational work on fanfiction, and I encourage others to consider looking into it! Fun fact: at the conference where I presented this work, I found a surprisingly large number of natural language processing researchers who were fans themselves.
How did you hear about Transformative Works and Cultures and what part did it play in your research?
At the start of this research, I began reading existing studies done by fan scholars, many of which were published in Transformative Works and Cultures. These were very helpful in narrowing down which questions to look into for my own research.
What about your research findings has inspired you the most?
I actually find it inspiring how poorly our baseline model did on the task of predicting reader sentiment. We have a long way to go before computers can come even close to the story understanding that humans do naturally.