What is sentiment analysis?
Many texts contain an opinion about something: say, a movie or product review, a description of the service at a restaurant, a political position or candidate, or how things are going at a company. Sentiment analysis attempts to classify those opinions: did the writer have a positive opinion? A negative one? Or are they neutral? Sentiment analysis might also suggest how positive or negative an opinion is.
Machine-learned approaches to sentiment analysis
If the goal is to confidently predict whether an opinion is positive or negative, then sentiment analysis is a binary classification problem. If the goal is to predict whether an opinion is positive, negative, or neutral, then it is a multiclass classification problem. If the goal is to predict how positive or negative the opinion is, it is a regression problem.
Sentiment analysis as a classification problem
Any binary or multiclass classification technique will probably do a reasonable job of sentiment analysis, given enough data and features. It is probably better to treat sentiment analysis as a multiclass classification problem, because a lot of text, even text that is supposed to hold opinions, can be relatively neutral, and omitting a neutral class will introduce modeling errors. Consider “McDonald’s food is horrible,” “McDonald’s food is amazing,” and “McDonald’s food is neither good nor bad.” Trying to shoehorn that last statement into either positive or negative is problematic. Still, it could be that “negative” means “not positive,” in which case a binary classification might be just fine; McDonald’s would surely be unhappy with a “neither good nor bad” opinion.
The simplest sentiment analysis technique uses word lists annotated with valences (measurements of how positive, neutral, or negative each word is). For example, the researchers Peter Sheridan Dodds, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth collected crowdsourced valence values for over 10,000 English words. Simply averaging the valences over a text (treating unknown words as neutral) and applying cutoffs provides a reasonable first model. I created such a system, called a sentimenticon, callable from Python; it is available at http://github.com/willf/sentimenticon. This code scores words from +1.0 to -1.0, so reasonable cutoffs are +0.5 and -0.5.
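As a minimal sketch of this valence-averaging approach: the tiny word list below is hypothetical, standing in for a real crowdsourced lexicon of thousands of entries, and the ±0.5 cutoffs follow the convention mentioned above.

```python
# Hypothetical miniature valence lexicon; a real system would load
# thousands of crowdsourced word valences in the range [-1.0, +1.0].
VALENCES = {
    "horrible": -0.9,
    "bad": -0.6,
    "good": 0.6,
    "amazing": 0.9,
}

def sentiment_score(text):
    """Average valence over all words; unknown words count as 0.0 (neutral)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(VALENCES.get(w, 0.0) for w in words) / len(words)

def classify(text, cutoff=0.5):
    """Map an average valence to a label using symmetric cutoffs."""
    score = sentiment_score(text)
    if score > cutoff:
        return "positive"
    if score < -cutoff:
        return "negative"
    return "neutral"
```

Note that counting unknown words as 0.0 pulls long texts toward neutral, so in practice the cutoffs may need tuning, or the average can be taken over known words only.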
Beyond this, any model that can handle a large number of features will do, I suspect, for most practical purposes. My first choice for problems like this is maximum entropy (log-linear) models, because they are relatively resistant to non-independence among features, usually train quickly, and are relatively easy to debug.
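A minimal sketch of such a model, using scikit-learn’s LogisticRegression (a log-linear model) over bag-of-words counts; the toy training set and labels here are illustrative only, and a real model would need far more data.

```python
# Maxent-style sentiment classifier sketch: bag-of-words features fed
# into a multiclass logistic regression, via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "the food was amazing",
    "great service and great food",
    "the food was horrible",
    "terrible service and cold food",
    "the food was neither good nor bad",
    "it was an average meal",
]
train_labels = ["pos", "pos", "neg", "neg", "neutral", "neutral"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

print(model.predict(["amazing food", "horrible service"]))
```

With so little data the predictions are not reliable, but the pipeline shape is the same one a production model would use.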
Recently, the deep learning revolution has begun to address sentiment analysis. For example, NLP researchers at Stanford have applied “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.” In particular, they are interested in how sentiment valences compose within a text (see their website: http://nlp.stanford.edu/sentiment). This might be more architecture than an ordinary working data scientist cares about. On the other hand, it can handle more nuanced opinions such as “There are slow and repetitive parts, but it has just enough spice to keep it interesting” (their example). It can also handle negated positive polarities (e.g., “This movie is not good.”), which are sometimes a problem for bag-of-words models (though perhaps less of one for bag-of-n-grams models).
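The negation point can be illustrated concretely with scikit-learn’s CountVectorizer: under unigram features, “not good” decomposes into the separate tokens “not” and “good,” so a negative sentence still carries the positive cue “good”; adding bigrams produces “not good” as a feature in its own right, which a classifier can learn is negative.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["this movie is good", "this movie is not good"]

# Unigrams only: both sentences share the feature "good".
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(texts)

# Unigrams plus bigrams: "not good" becomes its own feature.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(texts)

print(sorted(unigrams.vocabulary_))
print(sorted(bigrams.vocabulary_))
```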
Sentiment analysis as a regression problem
A simple method has already been suggested: use a sentimenticon and average its valences over the text, treating unknown words as neutral. This can be adequate for many uses; when it is not, a proper regression model is required.
It is likely, however, that the measured sentiments of texts have a roughly sigmoid-shaped distribution; this is certainly true of individual words (see the distribution of the crowdsourced words below). That is to say, a few texts will be very, very positive or very, very negative, with most texts spread across the moderate middle: compare “Obama RULES!!!!” and “Obama SUX!!!!!” with most people’s milder positive or negative opinions. So logistic regression models are likely to be good fits.
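One simple way to get predictions with this shape, sketched below with illustrative function names of my own choosing, is to pass an unbounded weighted feature sum through a rescaled logistic function: moderate inputs land near the middle of the range, and only extreme inputs approach the +1 or -1 endpoints.

```python
import math

def logistic(x):
    """Standard logistic function; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bounded_sentiment(feature_sum):
    """Rescale the logistic output to a sentiment score in (-1, 1)."""
    return 2.0 * logistic(feature_sum) - 1.0
```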
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). http://nlp.stanford.edu/sentiment