Saturday, January 11, 2014

Sentiment Analysis using Classification


In the Introduction to Data Science course I took on Coursera last year, one of our Programming Assignments was to do sentiment analysis by aggregating the positivity and negativity of the words in a text against the AFINN word list, a list of words manually annotated with positive and negative valences that represent the sentiment each word indicates.
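
A minimal sketch of the word list approach, to make the aggregation concrete. The tiny dictionary and its scores are made up for illustration and are not taken from the AFINN file (AFINN valences are integers between -5 and +5), and the helper name wordlist_sentiment is hypothetical:

```python
import re

# Hypothetical mini word list in the spirit of AFINN; the scores are invented.
TOY_VALENCES = {"good": 3, "great": 3, "love": 3, "bad": -3, "awful": -3, "slow": -1}

def wordlist_sentiment(text, valences=TOY_VALENCES):
  """Sum the valences of all known words in the text; unknown words count as 0."""
  tokens = re.findall(r"[a-z']+", text.lower())
  return sum(valences.get(tok, 0) for tok in tokens)

print(wordlist_sentiment("Great food, but the service was slow."))  # 3 + (-1) = 2
print(wordlist_sentiment("Awful place, bad coffee."))               # -3 + (-3) = -6
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment; the quality of the result depends entirely on how well the word list covers the domain.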

At the time I wondered whether the word list approach was too labor intensive, since one must manually identify and score positive and negative words for each domain. I figured it might be better to treat it as a classification problem: manually label documents (instead of words) as positive or negative, then use them to train a classifier that can predict the sentiment of unseen documents. But then I got busy with other things and forgot about this until a few days ago, when I came across this post, which describes using classification for sentiment analysis.

The author, Andy Bromberg, describes using NLTK and Python to classify movie reviews as positive or negative. He also refers to a previous attempt using R and the AFINN polarity wordlist, similar to the Programming Assignment I described earlier. In addition, the post describes how feature selection was used to increase the accuracy of the classifier.

As a learning exercise, I decided to do something similar with Scikit-Learn. I used the review training data from the Yelp Recruiting Competition on Kaggle, which I had entered as part of the Peer Assessments in the Intro to Data Science course. Part of the data consisted of 229,907 restaurant reviews in JSON format, each with votes by users indicating the usefulness, funniness and coolness of the review. I treated the text as a bag of words and considered a review to be useful, funny or cool if it had more than 0 votes for that attribute. These labels were used to train 3 binary classifiers that can predict these attributes in new reviews. Following along with Andy's post, I then used the Chi-squared metric to find the most useful features and measured accuracy, precision and recall of the classifiers for different feature sizes.
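
A small sketch of the labeling and bag-of-words steps on two made-up records. The records are invented and only mimic the "text" and "votes" fields used here; stop word removal is left out to keep the vocabulary easy to follow:

```python
import json
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two invented records shaped like the fields used in this post.
lines = [
  '{"text": "Great tacos and friendly staff", "votes": {"useful": 2, "funny": 0, "cool": 1}}',
  '{"text": "Tacos were cold", "votes": {"useful": 0, "funny": 0, "cool": 0}}',
]
records = [json.loads(line) for line in lines]
texts = [rec["text"] for rec in records]

# One binary label per attribute: 1 if the review has more than 0 votes for it.
labels = np.array([[1 if rec["votes"][k] > 0 else 0
                    for k in ("useful", "funny", "cool")]
                   for rec in records])

# Each distinct word becomes one column of a sparse count matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
print(labels)
```

Each column of labels then serves as the target vector for one of the three binary classifiers.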

The code to build and test each classifier using 10-fold cross validation is shown below. We first read the review file, parsing each line into a JSON object, then extracting the text and the useful, funny and cool votes. We then convert the text into a sparse matrix where each word is a feature. In our first pass, we use every word (116,713 unique words) in our text. For each of the useful, funny and cool attributes, we use the matrix and the binarized vote vector for that attribute to construct a Naive Bayes classifier. We then test the classifier and compute accuracy, precision and recall. Next we calculate the most informative features for the attribute using the Chi-squared test, build models for the top 1000, 3000, 10000, 30000, and 100000 features, and calculate their accuracy, precision and recall.

# Source: src/yelp_ufc/build_classifier.py
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
import json
import numpy as np
import operator

def read_data(fname):
  # Each line of the input file is one JSON-encoded review.
  texts = []
  ys = []
  with open(fname, 'r', encoding="utf-8") as f:
    for line in f:
      rec = json.loads(line.strip())
      texts.append(rec["text"])
      # Binarize the votes: 1 if the review has any votes for the attribute.
      ys.append([
        1 if int(rec["votes"]["useful"]) > 0 else 0,
        1 if int(rec["votes"]["funny"]) > 0 else 0,
        1 if int(rec["votes"]["cool"]) > 0 else 0])
  return texts, np.array(ys)

def vectorize(texts, vocab=None):
  # Bag of words; if a vocabulary is given, only those words become features.
  if vocab:
    vectorizer = CountVectorizer(stop_words="english", vocabulary=vocab)
  else:
    vectorizer = CountVectorizer(stop_words="english")
  X = vectorizer.fit_transform(texts)
  return vectorizer.vocabulary_, X

def cross_validate(ufc_val, X, y, nfeats):
  # 10-fold cross validation; prints one CSV row of fold-averaged metrics.
  kfold = KFold(n_splits=10)
  scores = []
  for train, test in kfold.split(X):
    Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
    clf = MultinomialNB()
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    accuracy = accuracy_score(ytest, ypred)
    precision = precision_score(ytest, ypred)
    recall = recall_score(ytest, ypred)
    scores.append((accuracy, precision, recall))
  print(",".join([ufc_val, str(nfeats),
    str(np.mean([x[0] for x in scores])),
    str(np.mean([x[1] for x in scores])),
    str(np.mean([x[2] for x in scores]))]))

def sorted_features(ufc_val, V, X, y, topN):
  # Invert the vocabulary (word -> column index) to look up words by index.
  iv = {v:k for k, v in V.items()}
  chi2_scores = chi2(X, y)[0]
  # Tuples of (score, word, column index), highest score first.
  top_features = [(x[1], iv[x[0]], x[0])
    for x in sorted(enumerate(chi2_scores),
    key=operator.itemgetter(1), reverse=True)]
  print("TOP %d FEATURES FOR: %s" % (topN, ufc_val))
  for top_feature in top_features[0:topN]:
    print("%7.3f  %s (%d)" % (top_feature[0], top_feature[1], top_feature[2]))
  return [x[1] for x in top_features]

def main():
  ufc = {0:"useful", 1:"funny", 2:"cool"}
  texts, ys = read_data("../../data/yelp_ufc/yelp_training_set_review.json")
  print(",".join(["attrtype", "nfeats", "accuracy", "precision", "recall"]))
  for ufc_idx, ufc_val in ufc.items():
    y = ys[:, ufc_idx]
    # First pass uses every word; nfeats == -1 marks this run in the output.
    V, X = vectorize(texts)
    cross_validate(ufc_val, X, y, -1)
    sorted_feats = sorted_features(ufc_val, V, X, y, 10)
    # Rebuild the matrix with only the top-scoring words and re-evaluate.
    for nfeats in [1000, 3000, 10000, 30000, 100000]:
      V, X = vectorize(texts, sorted_feats[0:nfeats])
      cross_validate(ufc_val, X, y, nfeats)

if __name__ == "__main__":
  main()

The top 10 features for each classifier (i.e., the words that have the highest "polarity" for that particular attribute) are shown below. The first column is the Chi-squared score for the word, the second column is the word itself, and the third column is the index of the word in the sparse matrix.

TOP 10 FEATURES FOR: useful

5170.064  like (60636)
4835.884  just (56649)
2595.147  don (32684)
2456.476  know (58199)
2346.778  really (84130)
2083.032  time (104423)
2063.618  people (76776)
2039.718  place (78659)
1873.081  think (103835)
1858.230  little (61092)

TOP 10 FEATURES FOR: funny

9087.141  like (60636)
6049.875  just (56649)
4848.157  know (58199)
4664.542  don (32684)
3361.983  people (76776)
2649.594  think (103835)
2505.478  oh (72420)
2415.325  ll (61174)
2349.312  really (84130)
2345.851  bar (11472)

TOP 10 FEATURES FOR: cool

6675.123  like (60636)
4616.683  just (56649)
3173.775  know (58199)
3010.526  really (84130)
2847.494  bar (11472)
2715.794  little (61092)
2670.838  don (32684)
2300.151  people (76776)
2217.659  place (78659)
2216.888  ve (110157)
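
The Chi-squared score measures how strongly a word's counts depend on the label, not whether the word pushes toward the positive or the negative class, which may be one reason the same frequent words ("like", "just", "know") lead all three lists. A minimal sketch on a made-up four-document corpus (texts and labels are invented for illustration) shows the ranking step in isolation:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Invented toy corpus: the first two documents are labeled 1, the rest 0.
texts = ["funny joke funny", "funny story", "plain menu", "plain food menu"]
y = np.array([1, 1, 0, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# chi2 returns one score (and one p-value) per column of X.
scores, pvalues = chi2(X, y)

# List the words in column order, then rank them by score, highest first.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranked = sorted(zip(scores, terms), reverse=True)
for score, term in ranked:
  print("%7.3f  %s" % (score, term))
```

Here "funny" ranks first because all three of its occurrences fall in one class; a word that occurred equally often in both classes would score 0.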

We also plot some graphs for each classifier showing how the accuracy, precision and recall vary with the number of features. The horizontal lines represent the accuracy, precision and recall achieved using the full feature set. As can be seen, the metrics improve as more features are added but tend to flatten out eventually.
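
The classifier code above ranks the features once on the full data set and then cross-validates, so each test fold has already influenced which words were kept. One variation, which the code in this post does not use, is to put the selection step inside a scikit-learn Pipeline so that it is re-fitted on every training fold. The sketch below assumes a made-up toy corpus, 5 folds instead of 10, and a hypothetical helper named sweep:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 10 "funny" and 10 "plain" reviews, interleaved so that
# every fold contains both classes. tipN / dishN are per-document noise words.
texts, y = [], []
for i in range(10):
  texts.append("joke joke lol tip%d" % i)
  y.append(1)
  texts.append("menu menu ok dish%d" % i)
  y.append(0)
y = np.array(y)

def sweep(texts, y, feature_counts, n_splits=5):
  """Return {k: (accuracy, precision, recall)}, each averaged over the folds."""
  results = {}
  for k in feature_counts:
    # Vectorizing and chi-squared selection are fitted on the training fold only.
    model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=k), MultinomialNB())
    cv = cross_validate(model, texts, y, cv=KFold(n_splits=n_splits),
                        scoring=("accuracy", "precision", "recall"))
    results[k] = tuple(float(np.mean(cv["test_" + m]))
                       for m in ("accuracy", "precision", "recall"))
  return results

results = sweep(texts, y, feature_counts=[2, 4])
for k, (accuracy, precision, recall) in results.items():
  print(k, accuracy, precision, recall)
```

On real data this is slower, because the vectorizer and the selector are re-fitted for every fold and every feature count, but the reported metrics then come from words chosen without looking at the test fold.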




The code to build these graphs out of the metrics printed out by our classifier training code uses Pandas dataframe plotting functionality and is shown below:

# Source: src/yelp_ufc/plot_results.py
import matplotlib.pyplot as plt
import pandas as pd
import sys

def main():
  # The single argument is the attribute to plot: useful, funny or cool.
  assert(len(sys.argv) == 2)
  df = pd.read_csv("all.csv")
  adf = df.loc[df.attrtype == sys.argv[1]]
  # nfeats == -1 marks the run that used every feature.
  adf_all = adf.loc[adf.nfeats < 0]
  adf_rest = adf.loc[adf.nfeats > 0]
  print(adf_all)
  print(adf_rest)
  adf_rest = adf_rest.drop(columns="attrtype")
  adf_rest = adf_rest.set_index("nfeats")
  # Constant columns are drawn as horizontal reference lines.
  adf_rest["accuracy_all"] = adf_all[["accuracy"]].values[0][0]
  adf_rest["precision_all"] = adf_all[["precision"]].values[0][0]
  adf_rest["recall_all"] = adf_all[["recall"]].values[0][0]
  adf_rest.plot(title=sys.argv[1])
  plt.show()

if __name__ == "__main__":
  main()

That's all I have for today. Many thanks to Andy Bromberg for posting his analysis, without which my analysis would not have happened. The code for this blog can also be found on my GitHub.

14 comments (moderated to prevent spam):

Anonymous said...

Sir, I am looking to implement a similar project in Java: Identification of Wishes in User Reviews. I am kind of new to Java coding. Could you please help me get started? I have both NetBeans and Eclipse installed on my system.
Below is the link to the paper. https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CC8QFjAA&url=http%3A%2F%2Fwww.aclweb.org%2Fanthology%2FW10-0207&ei=a3jXUuePCMOHrQf83oCIBQ&usg=AFQjCNFGtK6b2V7SdJVhMw3TDhOVHGddAg&sig2=0FL9qb81BedzcGYHwMmy2g&bvm=bv.59568121,d.bmk

Sujit Pal said...

Hi Anonymous, thanks for the link to the paper, very interesting approach (for the benefit of others who haven't read it - the authors used lexicons of words representing product attributes, negative sentiment and positive sentiment to bootstrap their training set, then manually built rules to decide if a sentence represents a "wish"). I guess you could use a similar strategy for your system? I think you need a good grasp of formal English grammar (something I am a bit fuzzy on unfortunately) for the manual rule building part. Regarding Java, I would suggest using a language you are familiar with unless you have time to learn it as you go - for example, if you know Python or R, they can get the work done just as easily (and readers of your paper probably don't care about the programming language?).

Anonymous said...

Thanks a ton for your reply, sir.
I am interested to implement this in Java. I am facing some trouble trying to install opennlp and link it with my netbeans. I am not getting any clear step by step process which will help me do the same without causing errors.

Sujit Pal said...

I would suggest building a standard Java/Maven project using mvn archetype:create, then including the OpenNLP dependency in it. Maven can help you automatically build the IDE's descriptor - Netbeans actually automatically recognizes the Maven project structure. Here is a code snippet showing how to use OpenNLP to extract noun phrases from some text; hopefully that will help you get started. Although if you are following the paper, this is not the approach they advocate.

Anonymous said...

Sir, the following is the error message I received while trying to create a new Maven project.
cd C:\Users\Ragesh\Documents\NetBeansProjects; "JAVA_HOME=C:\\Program Files (x86)\\Java\\jdk1.7.0_10" "\"C:\\Program Files (x86)\\NetBeans 7.0\\java\\maven\\bin\\mvn.bat\"" -DarchetypeVersion=1.1 -Darchetype.interactive=false -DgroupId=com.mycompany -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeRepository=http://repo1.maven.org/maven2/ -Dversion=1.0-SNAPSHOT -DarchetypeGroupId=org.apache.maven.archetypes -Dbasedir=C:\\Users\\Ragesh\\Documents\\NetBeansProjects -Dpackage=com.mycompany.mavenproject1 -DartifactId=mavenproject1 --batch-mode archetype:generate
Scanning for projects...
Downloading: http://repo1.maven.org/maven2/org/apache/maven/plugins/maven-clean-plugin/2.4.1/maven-clean-plugin-2.4.1.pom
Failed to retrieve plugin descriptor for org.apache.maven.plugins:maven-clean-plugin:2.4.1: Plugin org.apache.maven.plugins:maven-clean-plugin:2.4.1 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-clean-plugin:jar:2.4.1

Sujit Pal said...

Not sure what the problem is, I am guessing you are using some tool to do this which seems to add many extraneous parameters which leads to the error - I typically use something like this from the command line.

Anonymous said...

Sir,
I want to do a project to enrich an ontology with concepts extracted from text. Can you help me find the semantic similarity between a concept in a text document and a concept in the ontology?

Sujit Pal said...

I am not entirely sure what you are looking to do, but it is generally sufficient in my experience that a phrase in the text matches a phrase in the ontology. In cases where a phrase in text matches multiple phrases in the ontology, only then you have to disambiguate. You can do this in various ways - one way could be to use Wordnet path_similarity of other phrases and this one in a context (usually 1-3 sentences), like I describe here, or better, compute something similar for your ontology.

Anonymous said...

Sir, I have successfully been able to kick-start my work on the wish identification paper. What I have done so far is as follows:
1. Given a paragraph as input, I separate it into sentences.
2. Given a sentence as input, I tokenise it and perform POS tagging on it.
NOTE: as you may have noticed, I am trying to use OpenNLP, as it provides all of the above functionality.
3. I have written a simple string comparison program that checks the sentences against the manually built rules to decide whether they are wishes.
Could you please suggest if my approach is in the right direction? I want to stick to the method proposed in the paper as far as possible.

Sujit Pal said...

Yes I guess that sounds right given what I read. One way you can extend the paper's approach is to use your work so far to bootstrap a training set - ie, the "wish" sentences you have identified could be used as a training set (with a downsampled set of "non-wish" sentences, assuming the former is a much smaller set) to build a model that can predict if a sentence is a "wish" sentence or not (that perhaps your rules can't detect because you haven't considered these cases).

Anonymous said...

Sir, I have completed implementing the rules given in the paper. These rules can now be used to label a set of unlabelled sentences as wish / non-wish sentences. I wanted to use active learning (http://en.wikipedia.org/wiki/Active_learning_(machine_learning)).
Could you please suggest how I can implement the querying function for active learning?

Sujit Pal said...

Perhaps use the rules to label sentences and then train various classifiers on it, and from that choose the ones which most classifiers agree is one or the other? Alternatively try feature reduction, using info gain or chi-square (or some other measure), then label using the reduced set of features.
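
A rough sketch of the first idea, under stated assumptions: the sentences and labels are made up to stand in for rule-labeled data, and the choice of three classifiers is arbitrary. Sentences on which every classifier agrees keep the agreed label, and the rest are set aside as candidates for manual labeling:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Invented sentences standing in for rule-labeled data (1 = wish, 0 = not a wish).
labeled = [
  ("i wish the battery lasted longer", 1),
  ("i wish it had a bigger screen", 1),
  ("would be nice if it had a louder speaker", 1),
  ("the battery lasts all day", 0),
  ("the screen is bright and sharp", 0),
  ("shipping was fast", 0),
]
unlabeled = ["i wish the speaker was louder", "the speaker is fine", "nice screen"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform([sentence for sentence, _ in labeled])
y = [label for _, label in labeled]
X_new = vectorizer.transform(unlabeled)

# Train each committee member on the same data and collect its predictions.
committee = [MultinomialNB(), BernoulliNB(), LogisticRegression()]
votes = [clf.fit(X, y).predict(X_new) for clf in committee]

agreed, disputed = [], []
for i, sentence in enumerate(unlabeled):
  labels = {int(v[i]) for v in votes}
  if len(labels) == 1:
    agreed.append((sentence, labels.pop()))  # unanimous: keep the label
  else:
    disputed.append(sentence)                # disagreement: ask a human

print("agreed:", agreed)
print("disputed:", disputed)
```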

Anonymous said...

Hello Sir,

I am back after a short break on my work "Identification on User Reviews". I am really excited about the idea you provided me to use the rules to label the sentences and then train multiple classifiers on the labelled set. Could you please help me with how to convert my sentences into feature vectors? What process should I follow to obtain feature vectors for training the classifiers?

Sujit Pal said...

Check out my vectorize() method - it uses scikit-learn's CountVectorizer to vectorize the text into a feature vector of words. Or take a look at scikit-learn's tutorial for more information.