Tuesday, November 26, 2013

Making Sense of Data with Python Pandas


In the Introduction to Recommender Systems course from Coursera that I am taking, there was one entire module dedicated to evaluation of different recommender systems. The course is built around the very comprehensive Lenskit recommender library, which provides a Groovy DSL to perform large scale evaluations. The assignment for this module consisted of evaluating 9 algorithms, each rated against 6 metrics either individually or across 10 different user neighborhood sizes (where the performance of the algorithm is dependent on neighborhood size), then answering some questions based on the data collected.

This sounds like a formidable task, but it's really just a single Groovy DSL script, and most of it was already provided. The real work was to analyze the data and answer the questions. The recommended tool for the analysis was R, but I am more comfortable with Python, and I wanted to check out Pandas, a Python library that provides R-like abstractions for data analysis. I didn't know Pandas, but figured this assignment would be a good opportunity to learn it. What follows is my attempt to use Pandas to analyze the data from the evaluation.

To learn about Pandas, I watched this 3-hour-long video on YouTube, in which Wes McKinney, the creator of Pandas, presents the library at PyCon 2013 in Santa Clara. It's a bit of a time investment, but I thought it was well worth it, because it gives you a feel for the library and what you can do with it through a bunch of case studies. There is also a Pandas book, also by Wes McKinney, and a reference guide on the Pandas site (which I also used to figure out things I got stuck on).

Before we start on the analysis, we need to understand how the data is structured, so we do a bit of exploratory analysis at the Python shell. As shown below, there are 15 columns in our dataset, of which the last 11 are metrics. Of these 11 metrics, we are going to be concerned (because of the questions that follow) with only 6 - TestTime, RMSE.ByRating, RMSE.ByUser, nDCG, TopN.nDCG and TagEntropy@10.

>>> import pandas as pd
>>> df = pd.read_csv("eval-results.csv")
>>> df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 270 entries, 0 to 269
Data columns (total 15 columns):
Algorithm        270  non-null values
Partition        270  non-null values
DataSet          270  non-null values
NNbrs            250  non-null values
BuildTime        270  non-null values
TestTime         270  non-null values
NUsers           270  non-null values
NAttempted       270  non-null values
NGood            270  non-null values
Coverage         265  non-null values
RMSE.ByRating    265  non-null values
RMSE.ByUser      265  non-null values
nDCG             265  non-null values
TopN.nDCG        265  non-null values
TagEntropy@10    265  non-null values
dtypes: float64(7), int64(6), object(2)

We see that there are 9 algorithms being tested. Of these, some are tested across user neighborhoods varying in size from 5 to 100, and the others are not evaluated with user neighborhoods at all (as can be seen from the nan in the result of df.NNbrs.unique()). This latter group contains GlobalMean, Popular, ItemMean and PersMean, which are not affected by changes in the size of the user neighborhood. We also see that the evaluation uses 5-fold cross validation (from the result of df.Partition.unique()) and that the DataSet column has only 1 value and can be safely ignored.

>>> df.Algorithm.unique()
array(['GlobalMean', 'Popular', 'ItemMean', 'PersMean', 'Lucene',
       'LuceneNorm', 'UserUserCosine', 'UserUser', 'UserUserNorm'], 
      dtype=object)
>>> df.NNbrs.unique()
array([  nan,    5.,   10.,   15.,   20.,   25.,   30.,   40.,   50.,
         75.,  100.])
>>> df.ix[pd.isnull(df.NNbrs)]["Algorithm"].unique()
array(['GlobalMean', 'Popular', 'ItemMean', 'PersMean'], dtype=object)
>>> df.ix[-pd.isnull(df.NNbrs)]["Algorithm"].unique()
array(['Lucene', 'LuceneNorm', 'UserUserCosine', 'UserUser', 'UserUserNorm'], 
      dtype=object)
>>> df.Partition.unique()
array([0, 1, 2, 3, 4])
>>> df.DataSet.unique()
array(['RecSysMOOC'], dtype=object)

Now that we have some sense of how the data is set up, let's move on to answering the questions! I plan on using as much built-in Pandas functionality as possible (or as much of it as I know how to use). To keep things simple, each question is answered with analysis that stands on its own, at the cost of some code duplication. This is also how I did the actual analysis, localizing each analysis to its own function and calling them one by one from the main() method.

Q2: How many neighbors are needed for normalized Lucene to beat personalized mean on RMSE?

Here "RMSE" means "RMSE.ByUser" (this was clarified later by the course staff). Also, "normalized Lucene" is called LuceneNorm and "personalized mean" is called PersMean in the data. As we noted earlier, LuceneNorm was run once per fold for each of the NNbrs (number of neighbors) values, while PersMean was run only once per fold. So if we plot the mean RMSE.ByUser across the folds for each value of NNbrs for LuceneNorm, along with a horizontal line representing the mean RMSE.ByUser across the folds for PersMean, the intersection gives us the answer to our question.

The code is shown below. We first pull out only the rows and columns we care about for LuceneNorm and PersMean, group the LuceneNorm data by NNbrs and compute the mean RMSE.ByUser per group, and compute the overall mean RMSE.ByUser for PersMean. We then set them up in a new "plotting" DataFrame so we can use Pandas' DataFrame.plot() method to draw a chart with all the default bells and whistles (much easier than doing this manually via Matplotlib).

# imports used by all the analysis functions below
import matplotlib.pyplot as plt
import pandas as pd

def q2():
  df = pd.read_csv("eval-results.csv")
  # select out LuceneNorm rows and only NNbrs and RMSE.ByUser columns
  df_lucenenorm = df.ix[df.Algorithm == "LuceneNorm", 
                        ["NNbrs", "RMSE.ByUser"]]
  # select out PersMean rows and only NNbrs and RMSE.ByUser columns
  # in case of PersMean, neighborhood size does not matter so no 
  # experiments were performed with varying NNbrs, so our plot will
  # be a horizontal line across all NNbrs values.
  df_persmean = df.ix[df.Algorithm == "PersMean", ["RMSE.ByUser"]]
  # we want average RMSE.ByUser values per NNbr in our plot so we
  # create a new DataFrame based on the mean RMSE.ByUser value per NNbr
  df_plot = pd.DataFrame(df_lucenenorm.groupby(["NNbrs"]).mean())
  df_plot.rename(columns={"RMSE.ByUser" : "RMSE.ByUser.LuceneNorm"}, 
    inplace=True)
  persmean_mean = df_persmean["RMSE.ByUser"].mean()
  # add this column to the df_plot
  df_plot["RMSE.ByUser.PersMean"] = persmean_mean
  df_plot.plot()
  plt.show()

The graph generated by this method is shown below. As you can see, the answer is a neighborhood size of approximately 43.


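If we prefer a numeric answer over eyeballing the chart, a small sketch such as the one below can estimate the crossover by linear interpolation. This is not part of the original script; it assumes the same eval-results.csv schema and that the gap between the two curves shrinks monotonically with NNbrs.

import numpy as np
import pandas as pd

# Sketch: estimate the NNbrs value at which LuceneNorm's mean RMSE.ByUser drops
# below PersMean's, by interpolating where the difference crosses zero.
df = pd.read_csv("eval-results.csv")
lucenenorm = df[df.Algorithm == "LuceneNorm"].groupby("NNbrs")["RMSE.ByUser"].mean()
persmean = df[df.Algorithm == "PersMean"]["RMSE.ByUser"].mean()
diff = lucenenorm - persmean   # positive while LuceneNorm is worse (higher RMSE)
# np.interp needs increasing x values, hence the negated difference
crossover = np.interp(0.0, -diff.values, diff.index.values)
print("LuceneNorm beats PersMean at roughly NNbrs = %.1f" % crossover)
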
Q3: How many neighbors are needed for normalized Lucene to beat personalized mean on prediction nDCG?

This is similar to Q2, except that we are asked to consider a different metric. The code is also almost identical, as shown below:

def q3():
  df = pd.read_csv("eval-results.csv")
  df_lucenenorm = df.ix[df.Algorithm == "LuceneNorm", ["NNbrs", "nDCG"]]
  df_persmean = df.ix[df.Algorithm == "PersMean", ["nDCG"]]
  df_plot = pd.DataFrame(df_lucenenorm.groupby(["NNbrs"]).mean())
  df_plot.rename(columns={"nDCG" : "nDCG.LuceneNorm"}, inplace=True)
  persmean_mean = df_persmean["nDCG"].mean()
  df_plot["nDCG.PersMean"] = persmean_mean
  df_plot.plot()
  plt.show()

The graph for this analysis is shown below. As you can see, the answer is a neighborhood size of approximately 73 users.


Q4: When is Lucene CBF the best algorithm? Choices are: (a) Never; (b) On nDCG, when it is normalized and has many neighbors; (c) On RMSE with a medium number of neighbors.

There are really two sub-questions here, one for the RMSE.ByUser metric and another for nDCG, and data for all algorithms has to be analyzed for each metric. If we plot the metric value against NNbrs for all the algorithms, one plot per metric, we will have our answer. Building up this plot is a bit more complex, but it is basically just an extension of the approach used in Q2 and Q3.

For convenience, we move the plotting code into its own function, because subsequent questions require similar functionality. We first split the algorithms into the ones that are "NNbrs-aware" and the ones that are not. We happen to know that the NNbrs-aware ones are always run for the same set of NNbrs values. We loop through the NNbrs-aware algorithms, pulling out their rows and the "NNbrs" and metric columns (RMSE.ByUser in the first case, nDCG in the second), grouping by NNbrs and computing the mean across all folds, then joining each algorithm's means to a "plotting" DataFrame by NNbrs. We then loop through the non-NNbrs-aware algorithms, compute their mean across all folds, and add a column containing this constant mean value to our plotting DataFrame. At each stage, we rename the metric column by suffixing the algorithm name to it, and finally we just call plot() on the resulting DataFrame. Here is the code.

def plot_all_algos(metric, ref_algos=None, report_file=None):
  df = pd.read_csv("eval-results.csv")
  # Before starting, split algos into neighbor aware and neighbor unaware
  non_nnbr_algos = df.ix[pd.isnull(df.NNbrs)]["Algorithm"].unique()
  nnbr_algos = df.ix[-pd.isnull(df.NNbrs)]["Algorithm"].unique()
  df_plot = None
  for algo in nnbr_algos:
    if ref_algos is not None and algo not in ref_algos: continue
    dfa = df.ix[df.Algorithm == algo, ["NNbrs", metric]]
    if df_plot is None:
      df_plot = pd.DataFrame(dfa.groupby(["NNbrs"]).mean())
    else:
      df_plot = df_plot.join(dfa.groupby(["NNbrs"]).mean())
    df_plot.rename(columns={metric : "%s.%s" % (metric, algo)}, inplace=True)
  # now add in the columns representing the non-nnbr algos
  for algo in non_nnbr_algos:
    if ref_algos is not None and algo not in ref_algos: continue
    algo_mean = df.ix[df.Algorithm == algo, [metric]].mean()[0]
    df_plot["%s.%s" % (metric, algo)] = algo_mean
  if report_file is not None:
    f = open(report_file, 'wb')
    f.write(df_plot.transpose().to_html())
    f.close()
  df_plot.plot(title=metric)
  plt.show()

For question 4, we can create two plots, one for RMSE.ByUser and one for nDCG, using the code below:

def q4():
  for metric in ["RMSE.ByUser", "nDCG"]:
    plot_all_algos(metric)

The plots are shown below. Inspecting them tells us that the correct answer is (b), i.e., on nDCG, when it is normalized and has many neighbors.






Q5: What algorithm produces the most diverse Top-10 lists (by our entropy metric)? (a) Popular Movies; (b) Unnormalized Lucene; (c) Item/Pers. Mean; (d) Normalized Lucene; (e) User-User CF.

The metric of interest in this question is TagEntropy@10; the most diverse lists correspond to its highest values. To answer the question, we can call the plot_all_algos() function we built earlier. The plot shows that the most diverse spot is claimed by two algorithms whose horizontal lines are overlaid on each other, so the plot alone is not enough to tell us which algorithm produces the most diverse top-10 lists. We therefore add an extra parameter to our plot_all_algos() function to optionally write out an HTML table of the values just before plotting them. For readability, we transpose the table, so each algorithm is represented by a row.

def q5():
  plot_all_algos("TagEntropy@10", report_file="/tmp/entropy.html")


NNbrs 5.0 10.0 15.0 20.0 25.0 30.0 40.0 50.0 75.0 100.0
TagEntropy@10.Lucene 167.0 167.1 169.3 171.4 175.0 178.5 183.9 185.9 186.6 186.5
TagEntropy@10.LuceneNorm 214.9 220.6 223.9 225.2 225.6 225.8 225.6 224.8 223.9 223.9
TagEntropy@10.UserUserCosine 214.7 220.3 223.3 225.0 226.3 227.2 228.5 229.2 230.7 231.4
TagEntropy@10.UserUser 209.9 217.9 222.0 224.1 225.7 227.0 228.2 229.5 230.9 231.7
TagEntropy@10.UserUserNorm 207.1 215.1 219.6 222.0 223.9 225.2 226.8 228.2 230.0 231.1
TagEntropy@10.GlobalMean NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
TagEntropy@10.Popular 193.7 193.7 193.7 193.7 193.7 193.7 193.7 193.7 193.7 193.7
TagEntropy@10.ItemMean 237.8 237.8 237.8 237.8 237.8 237.8 237.8 237.8 237.8 237.8
TagEntropy@10.PersMean 237.8 237.8 237.8 237.8 237.8 237.8 237.8 237.8 237.8 237.8

From the table, it's quite clear that the correct answer is (c) Item/Pers. Mean.

Q6: Does increasing neighborhood size generally increase or decrease the tag entropy of top-10 lists? (a) Increase; (b) Decrease.

This can be answered from the graph for the previous question. Where the curves are not horizontal, the general trend is an increase in TagEntropy@10 as NNbrs increases, so the answer is (a) Increase.
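
A quick numeric sanity check (again a sketch, not part of the original script) is to average TagEntropy@10 per neighborhood size across all the NNbrs-aware algorithms and look at the net change:

import pandas as pd

# Sketch: mean TagEntropy@10 per neighborhood size, across all NNbrs-aware
# algorithms, to confirm the generally increasing trend seen in the Q5 chart.
df = pd.read_csv("eval-results.csv")
nnbr_rows = df[pd.notnull(df.NNbrs)]
trend = nnbr_rows.groupby("NNbrs")["TagEntropy@10"].mean()
print(trend)
print("Net change from NNbrs=5 to NNbrs=100: %.1f" % (trend.iloc[-1] - trend.iloc[0]))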

Q8: In practice, recommenders cost money to run and it isn't worthwhile to run recommenders that take a lot of computational power and provide little benefit. Based on this experiment, what algorithm would be best to deploy for recommending items (in ranked lists) from this data set? (a) Popular; (b) Normalized Lucene; (c) User-user CF; (d) Mean Rating; (e) Unnormalized Lucene.

We want to find recommenders that are not too expensive to run, so we use TestTime as our metric. Generating the plot of TestTime across all algorithms is once again simply a matter of calling plot_all_algos(). As in Q5, however, the algorithms with low runtimes are all clumped together in the plot, making them difficult to differentiate, so as before we also generate the corresponding HTML table. The calling code, chart and data are shown below:

def q8():
  plot_all_algos("TestTime", report_file="/tmp/testtime.html")


NNbrs 5.0 10.0 15.0 20.0 25.0 30.0 40.0 50.0 75.0 100.0
TestTime.Lucene 5931.8 4896.0 5178.2 4994.0 5493.2 5626.0 6005.6 6070.2 6252.4 6358.6
TestTime.LuceneNorm 6465.4 7224.0 7190.6 7106.4 7550.8 7624.0 7927.4 7755.4 7921.2 8172.4
TestTime.UserUserCosine 197427.8 199182.4 203132.0 204139.2 203278.0 205560.0 209793.6 209759.2 212834.6 206995.0
TestTime.UserUser 216341.6 219634.6 222249.0 223269.4 222814.6 224616.6 226187.8 229578.2 231668.0 233883.6
TestTime.UserUserNorm 228488.8 230338.8 233950.4 236373.4 234727.8 235640.0 240503.6 241874.4 245372.0 239330.6
TestTime.GlobalMean 329.2 329.2 329.2 329.2 329.2 329.2 329.2 329.2 329.2 329.2
TestTime.Popular 5512.8 5512.8 5512.8 5512.8 5512.8 5512.8 5512.8 5512.8 5512.8 5512.8
TestTime.ItemMean 7355.8 7355.8 7355.8 7355.8 7355.8 7355.8 7355.8 7355.8 7355.8 7355.8
TestTime.PersMean 7569.4 7569.4 7569.4 7569.4 7569.4 7569.4 7569.4 7569.4 7569.4 7569.4

From the data above, it is clear that the winner on TestTime is GlobalMean. However, GlobalMean is more of a baseline, and we should choose the next best one, which is (a) Popular.
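
The same conclusion can be reached programmatically; here is a minimal sketch (not part of the original script) that picks out the cheapest algorithms by mean TestTime:

import pandas as pd

# Sketch: find the cheapest algorithm by mean TestTime, then the cheapest one
# once the GlobalMean baseline is excluded (expected to be Popular, per the table above).
df = pd.read_csv("eval-results.csv")
means = df.groupby("Algorithm")["TestTime"].mean()
print("Fastest overall:", means.idxmin())
print("Fastest excluding the baseline:", means.drop("GlobalMean").idxmin())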

Q9: Ignoring entropy, what user-user configuration generally performs the best? (a) Normalized w/ Cosine; (b) Unnormalized; (c) Normalized

To answer this question, we should look at 4 metrics - RMSE.ByRating, RMSE.ByUser, nDCG and TopN.nDCG - for the UserUserCosine, UserUser and UserUserNorm algorithms. We add another parameter to plot_all_algos() so that it plots only a restricted set of algorithms when such a set is passed in. After this, we can plot the four metrics using the following code:

def q9():
  ref_algos=["UserUserCosine", "UserUser", "UserUserNorm"]
  plot_all_algos("RMSE.ByRating", ref_algos=ref_algos)
  plot_all_algos("RMSE.ByUser", ref_algos=ref_algos)
  plot_all_algos("nDCG", ref_algos=ref_algos)
  plot_all_algos("TopN.nDCG", ref_algos=ref_algos)

From the set of 4 charts generated by this code, UserUserCosine, aka (a) Normalized w/ Cosine, is the clear winner.










Q10: One algorithm has increasing entropy for low neighborhood sizes, and then entropy starts going down. Which is it? (a) Unnormalized Lucene; (b) Lucene with Normalization; (c) User-user CF with Cosine; (d) Unnormalized User-User CF

This question can be answered from the graph in Q5. It is LuceneNorm, aka (b) Lucene with Normalization, represented by the green line with TagEntropy@10 between 220 and 230.

Q11: What algorithms beat Popular Items in recommendation list entropy? (a) Normalized Lucene; (b) Item/Pers. Mean; (c) User-user CF; (d) Unnormalized Lucene.

Once again, this can be answered from the graph and data for Q5. All the algorithms except Lucene (aka Unnormalized Lucene) have higher TagEntropy@10 than Popular, so the answer is (a), (b) and (c).
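
This can also be verified directly from the data with a short sketch (again not part of the original script):

import pandas as pd

# Sketch: mean TagEntropy@10 per algorithm, keeping only those that beat Popular.
df = pd.read_csv("eval-results.csv")
entropy = df.groupby("Algorithm")["TagEntropy@10"].mean()
print(entropy[entropy > entropy["Popular"]])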

Q12: What is the Top-N nDCG of Popular?

To answer this question, we isolate the TopN.nDCG column for the Popular algorithm and take the mean. The code is shown below; the answer is 0.475202.

def q12():
  df = pd.read_csv("eval-results.csv")
  print("TopN.nDCG(Popular) = %f" % 
    (df.ix[df.Algorithm == "Popular", ["TopN.nDCG"]].mean()[0]))

The code is structured as a single Python script, and can be found, along with the data, at my GitHub page for this project.

I had lots of fun applying my newly acquired knowledge of Pandas to some real-world (well real-world-ish) data, and I hope you found this interesting as well. Being able to analyze data and extract information from it is an important skill, and Pandas/R can help you do that easily and efficiently. If you are reasonably proficient with Python, then learning Pandas may be easier than learning R.


Wednesday, November 20, 2013

Using Graph Centrality Metrics for Crime Fighting


I haven't had much time to post anything lately. I've been taking quite a few courses at Coursera simultaneously (I am a bit embarrassed to say exactly how many, since it reflects a complete lack of judgment on my part regarding my ability to do justice to them). My laptop crashing and taking all my data with it didn't help much either. In any case, one of the courses is the excellent Social Network Analysis taught by Prof Lada Adamic of the University of Michigan.

Social Network Analysis is just one of the practical applications of Graph Theory. My interest in Graph Theory stems from the fact that our semantic search algorithms use our medical taxonomy as one of their inputs, and the taxonomy is represented as an in-memory graph. However, our use of this graph is limited mostly to navigation. I took the course hoping to see how Graph Theory is used in the real world and whether I could apply what I learned to do some analytics on our taxonomy graph. The course did introduce me to two excellent Python graph libraries, NetworkX and IGraph, as well as visualization tools like Gephi, which I hope to use for this in the future.

Meanwhile, I did some interesting empirical network analysis on the Enron Email Dataset as part of an open-ended programming project for this course. Although it's not that complicated (the libraries make things easy), it's quite powerful, which is why I thought I'd share it. Specifically, the objective is to use graph centrality metrics to narrow down the search for "people of interest" in the Enron scandal.

We check our results against lists of "key players". The list of key players, based on reports from here and here, includes the following individuals - Kenneth Lay (kenneth.lay@enron.com), Jeffrey Skilling (jeff.skilling@enron.com), Andrew Fastow (andrew.fastow@enron.com), Richard Causey (richard.causey@enron.com), Michael Kopper (michael.kopper@enron.com), Rex T Smiley (smiley@flash.net), Ben Glisan (ben.glisan@enron.com), Greg Whalley (greg.whalley@enron.com), Mark Koenig (mark.koenig@enron.com), Lou Lung Pai (lou.pai@enron.com), Kenneth D Rice (ken.rice@enron.com), and Rebecca Mark (rebecca.mark@enron.com).

The first step is to convert the data into something these graph libraries can consume. The dataset itself is available as a tarred gzipped directory that expands into a collection of 150 subdirectories, each subdirectory representing a single Enron employee's emails. Within each subdirectory is a set of subdirectories representing different mailboxes (inbox, deleted, sent, etc), within which reside multiple email messages in RFC-5322 format. The code below parses out the From and To addresses from a single message.

import email.parser
import os
import re
import sys

def remove_special_chars(s):
  return re.sub(r"[<>\"' ]", "", s)

fname = sys.argv[1]
if os.path.isfile(fname) and fname.endswith("."):
  fin = open(sys.argv[1], 'rb')
  parser = email.parser.HeaderParser()
  msg = parser.parse(fin, headersonly=True)
  fin.close()
  try:
    from_value = msg["From"]
    to_values = msg["To"].replace("\r\n", "").replace("\t", "").split(", ")
    if from_value != None and to_values != None:
      from_value = remove_special_chars(from_value)
      for to_value in to_values:
        to_value = remove_special_chars(to_value)
        print("%s\t%s" % (from_value, to_value))
  except AttributeError:
    pass

It is fed the output of the Unix find command piped through GNU Parallel to take advantage of my multi-CPU laptop. The output is a tab-separated file of (from,to) tuples.

sujit@tsunami:~$ find /path/to/enron_mail_20110402 -name "*" -print | \
        parallel python mail_parser.py {} > from_to.txt

The dataset contains 517,425 email messages, and results in 3,130,296 tuples, which is then processed by the following code to build a weighted directed graph using NetworkX and written to a file in Graph Modelling Language (GML) format for data analysis. The resulting weighted directed graph contains 79,736 nodes and 311,212 edges. We also generate the node IDs for our test set for further analysis.

import networkx as nx
import collections

TUPLE_FILE = "from_to.txt"

def to_ascii(s):
  return unicode(s, "ascii", "ignore")

# collect all nodes from tuples
vertices = set()
fin = open(TUPLE_FILE, 'rb')
for line in fin:
  from_email, to_email = line.strip().split("\t")
  vertices.add(to_ascii(from_email))
  vertices.add(to_ascii(to_email))
fin.close()
print "#-vertices:", len(vertices)

# collect all edges as vertex index tuples
vertex_idx = {x[1] : x[0] for x in enumerate(vertices)}
edges = collections.defaultdict(int)
fin = open(TUPLE_FILE, 'rb')
for line in fin:
  from_email, to_email = line.strip().split("\t")
  edge = (vertex_idx[to_ascii(from_email)], vertex_idx[to_ascii(to_email)])
  edges[edge] += 1
fin.close()
print "#-edges:", len(edges)

# build graph and write as GML file
G = nx.DiGraph()
for vertex in vertices:
  G.add_node(vertex_idx[vertex], label=vertex)
for edge in edges.keys():
  G.add_edge(edge[0], edge[1], weight=edges[edge])
nx.write_gml(G, "enron.gml")

# generate list of test node ids
known_guilty = ["kenneth.lay@enron.com", "jeff.skilling@enron.com",
    "andrew.fastow@enron.com", "richard.causey@enron.com",
    "michael.kopper@enron.com", "smiley@flash.net", "ben.glisan@enron.com",
    "greg.whalley@enron.com", "mark.koenig@enron.com", "lou.pai@enron.com",
    "ken.rice@enron.com", "rebecca.mark@enron.com"]
known_guilty_vertices = [vertex_idx[to_ascii(x)] for x in known_guilty]
print known_guilty_vertices


The underlying intuition for the analysis is that the centrality of a vertex measures its relative importance within a graph. Since important people are usually those with power, and power has the potential to corrupt, people with power are more likely to commit crimes. Therefore a good first pass at finding "people of interest" in the Enron scandal would be to look at the nodes ordered by one or more centrality measures.

The intuition seems to be borne out in the visualization of the top 100 nodes (by degree) in the Enron graph. The red nodes represent individuals in our test set. As you can see, there are 3 in the top 100, giving us a probability of 0.03, which is orders of magnitude higher than the probability of finding them in the entire graph (0.00015). We select the top 100 because the full network takes forever to plot and the plot becomes unreadable anyway. Plotting code is available here.
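
For reference, here is a minimal sketch of how such a plot could be produced (the actual plotting code linked above may differ). It is meant to be appended to the graph-building script above, reusing its G and known_guilty_vertices variables, and simply draws the subgraph induced by the 100 highest-degree nodes with the test-set nodes colored red.

import matplotlib.pyplot as plt

# Sketch, reusing G and known_guilty_vertices from the graph building script:
# draw the subgraph induced by the 100 highest-degree nodes, test set in red.
deg = dict(G.degree())
top_nodes = sorted(deg, key=deg.get, reverse=True)[:100]
H = G.subgraph(top_nodes)
colors = ["red" if n in set(known_guilty_vertices) else "lightblue"
          for n in H.nodes()]
nx.draw_networkx(H, node_color=colors, with_labels=False, node_size=50)
plt.savefig("top100_by_degree.png")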


In the analysis that follows, we test various centrality measures against our test set. In all cases, we measure the size of the intersection between the top 1,000 nodes (by that centrality measure) and our test set. While 1,000 nodes may sound like a lot, note that it is only about 1.25% of the entire network.

At this point, we switch to using the IGraph library, because the resulting graph is too large for NetworkX to handle. IGraph is basically a C library with Python (and R) bindings, and is thus faster and able to handle larger datasets.

We apply 4 centrality metrics (Degree Centrality, Closeness Centrality, Betweenness Centrality and Eigenvector Centrality) and 2 other measures of importance (PageRank and HITS Authority). For each metric we calculate the Recall of the top 1,000 vertices by that metric against our test set. The code to do the analysis is shown below:

import igraph as ig
import matplotlib.pyplot as plt
import operator
import pandas as pd

KNOWN_GUILTY = set([7682, 49535, 8324, 12433, 60113, 43955, 
                    47466, 4820, 10960, 47384, 49213, 64249])

def cumulative_hits(nodes, metric, n, filter_nodes):
  pairs = zip(nodes, metric)
  if filter_nodes is not None:
    pairs = [p for p in pairs if p[0] in filter_nodes]
  sorted_metric = sorted(pairs, key=operator.itemgetter(1), 
    reverse=True)[0:n]
  top_nodes = [x[0] for x in sorted_metric]
  cum_hits = []
  for i in range(0, len(top_nodes)):
    top_node_set = set(top_nodes[0:i])
    match_set = top_node_set.intersection(KNOWN_GUILTY)
    cum_hits.append(len(match_set))
  return cum_hits

def run_all_hypotheses(G, topn, filter_nodes):
  df = pd.DataFrame(index=range(0,topn))
  nodes = [v.index for v in G.vs]
  df["Degree_Centrality"] = cumulative_hits(
    nodes, G.degree(), topn, filter_nodes)
  df["Closeness_Centrality"] = cumulative_hits(
    nodes, G.closeness(cutoff=3), topn, filter_nodes)
  df["Betweenness_Centrality"] = cumulative_hits(
    nodes, G.betweenness(cutoff=3), topn, filter_nodes)
  df["Eigenvector_Centrality"] = cumulative_hits(
    nodes, G.eigenvector_centrality(), topn, filter_nodes)
  df["PageRank"] = cumulative_hits(
    nodes, G.pagerank(directed=True, damping=0.85), topn, filter_nodes)
  df["HITS_Authority"] = cumulative_hits(
    nodes, G.authority_score(), topn, filter_nodes)
  df.plot()
  plt.show()

def prune_enron_only(G):
  enron = set([v.index for v in G.vs if v["label"].endswith("@enron.com")])
  return enron

def prune_with_nonenron_collaborators_only(G):
  # find list of non enron nodes
  not_enron = set([v.index for v in G.vs 
    if not v["label"].endswith("@enron.com")])
  # find nodes with non enron collaborators
  nnecs = set()
  for v in G.vs:
    if v["label"].endswith("@enron.com"):
      nvs = set([int(nv["id"]) for nv in v.neighbors()])
      if len(nvs.intersection(not_enron)) > 0:
        nnecs.add(v.index)
  return nnecs

def main():
  G = ig.read("enron.gml")
  run_all_hypotheses(G, 1000, None)
  run_all_hypotheses(G, 1000, prune_enron_only(G))
  run_all_hypotheses(G, 1000, prune_with_nonenron_collaborators_only(G))

if __name__ == "__main__":
  main()

The growth of the intersection set over the top 1,000 nodes sorted by each metric is summarized in the chart below. As you can see, all of these measures tend to have similar effectiveness.


As noted before, these metrics are only useful for identifying "important" nodes in a graph, i.e., narrowing down the list for further investigation, rather than being absolute predictors. This is because centrality is an indicator of importance, and importance in a corporate environment implies power, which can be a catalyst for criminal behavior but does not automatically imply it.

Anyway, I am finding the course quite interesting, and I hope to do more with graphs in the future. All the source code in this post can be found on my GitHub page.

Update 2013-11-27: I added a network visualization diagram (the first one) at the suggestion of Vinh Ton (thanks Vinh!), a classmate at the Social Network Analysis course at Coursera who was kind enough to provide feedback. It definitely helps to formalize the intuition that the work is based on.

Update 2013-12-07: Based on suggestions from Prof Adamic (the course instructor for the Coursera SNA course), I have replaced the 6 charts representing each metric with a single one. This allows you to compare the effectiveness of each individual metric against each other.

She also suggested that I may want to filter out non-Enron email addresses from the graph, the intuition being that since the scandal was a company matter, non-employees are less likely to be involved than employees. The resulting chart is shown below. As you can see, effectiveness measured against the top 1,000 nodes is about the same, but effectiveness is higher at lower numbers of nodes.


At this point, I had changed my code enough that it could easily handle another idea I originally had when doing the project. The idea is that people with power in a company typically interact more with the outside world than those without. For this, we retain nodes with Enron email addresses whose neighbors include at least one non-Enron email address. The resulting chart is shown below. As you can see, effectiveness at 1,000 nodes is the same, but effectiveness grows even faster than with the approach of retaining only Enron employees.