I haven't had much time to post anything lately. I've been taking quite a few courses at Coursera simultaneously (I am a bit embarrassed to say exactly how many, since it reflects a complete lack of judgment on my part regarding my ability to do justice to these courses). My laptop crashing and taking all my data with it didn't help much either. In any case, one of the courses is the excellent Social Network Analysis taught by Prof Lada Adamic of the University of Michigan.
Social Network Analysis is just one of the practical applications of Graph Theory. My interest in Graph Theory stems from the fact that our semantic search algorithms use our medical taxonomy as one of their inputs, and the taxonomy is represented as an in-memory graph. However, our use of this graph is limited mostly to navigation. I took the course hoping to see how Graph Theory is used in the real world, and to see if I could apply the learning to do some analytics on our taxonomy graph. The course did introduce me to two excellent Python graph libraries, NetworkX and IGraph, as well as visualization tools like Gephi, which I hope to use for this in the future.
Meanwhile, I did some interesting empirical network analysis on the Enron Email Dataset as part of an open-ended programming project for this course. Although it's not that complicated (the libraries make things easy), it's quite powerful, which is why I thought I'd share. Specifically, the objective is to use graph centrality metrics to narrow down the search for "people of interest" in the Enron Scandal.
We check our results against lists of "key players". The list of key players, based on reports from here and here, includes the following individuals: Kenneth Lay (kenneth.lay@enron.com), Jeffrey Skilling (jeff.skilling@enron.com), Andrew Fastow (andrew.fastow@enron.com), Richard Causey (richard.causey@enron.com), Michael Kopper (michael.kopper@enron.com), Rex T Smiley (smiley@flash.net), Ben Glisan (ben.glisan@enron.com), Greg Whalley (greg.whalley@enron.com), Mark Koenig (mark.koenig@enron.com), Lou Lung Pai (lou.pai@enron.com), Kenneth D Rice (ken.rice@enron.com), and Rebecca Mark (rebecca.mark@enron.com).
The first step is to convert the data into something these graph libraries can consume. The dataset itself is available as a tarred gzipped directory that expands into a collection of 150 subdirectories, each subdirectory representing a single Enron employee's emails. Within each subdirectory is a set of subdirectories representing different mailboxes (inbox, deleted, sent, etc), within which reside multiple email messages in RFC-5322 format. The code below parses out the From and To addresses from a single message.
```python
import email.parser
import os
import re
import sys

def remove_special_chars(s):
    return re.sub(r"[<>\"' ]", "", s)

fname = sys.argv[1]
# message files in the Enron dataset are named "1.", "2.", etc
if os.path.isfile(fname) and fname.endswith("."):
    fin = open(fname, 'rb')
    parser = email.parser.HeaderParser()
    msg = parser.parse(fin, headersonly=True)
    fin.close()
    try:
        from_value = msg["From"]
        to_values = msg["To"].replace("\r\n", "").replace("\t", "").split(", ")
        if from_value is not None and to_values is not None:
            from_value = remove_special_chars(from_value)
            for to_value in to_values:
                to_value = remove_special_chars(to_value)
                print("%s\t%s" % (from_value, to_value))
    except AttributeError:
        pass
```
It is fed the output of the Unix find command piped through GNU Parallel to take advantage of my multi-CPU laptop. The output is a tab-separated file of (from,to) tuples.
```
sujit@tsunami:~$ find /path/to/enron_mail_20110402 -name "*" -print | \
    parallel python mail_parser.py {} > from_to.txt
```
The dataset contains 517,425 email messages and yields 3,130,296 (from,to) tuples, which are then processed by the following code to build a weighted directed graph using NetworkX, written to a file in Graph Modelling Language (GML) format for analysis. The resulting weighted directed graph contains 79,736 nodes and 311,212 edges. We also generate the node IDs for our test set for further analysis.
```python
import networkx as nx
import collections

TUPLE_FILE = "from_to.txt"

def to_ascii(s):
    return unicode(s, "ascii", "ignore")

# collect all nodes from tuples
vertices = set()
fin = open(TUPLE_FILE, 'rb')
for line in fin:
    from_email, to_email = line.strip().split("\t")
    vertices.add(to_ascii(from_email))
    vertices.add(to_ascii(to_email))
fin.close()
print "#-vertices:", len(vertices)

# collect all edges as (from, to) vertex index tuples, accumulating
# the number of messages along each edge as its weight
vertex_idx = {x[1]: x[0] for x in enumerate(vertices)}
edges = collections.defaultdict(int)
fin = open(TUPLE_FILE, 'rb')
for line in fin:
    from_email, to_email = line.strip().split("\t")
    edge = (vertex_idx[to_ascii(from_email)], vertex_idx[to_ascii(to_email)])
    edges[edge] += 1
fin.close()
print "#-edges:", len(edges)

# build graph and write as GML file
G = nx.DiGraph()
for vertex in vertices:
    G.add_node(vertex_idx[vertex], label=vertex)
for edge in edges.keys():
    G.add_edge(edge[0], edge[1], weight=edges[edge])
nx.write_gml(G, "enron.gml")

# generate list of test node ids
known_guilty = ["kenneth.lay@enron.com", "jeff.skilling@enron.com",
    "andrew.fastow@enron.com", "richard.causey@enron.com",
    "michael.kopper@enron.com", "smiley@flash.net", "ben.glisan@enron.com",
    "greg.whalley@enron.com", "mark.koenig@enron.com", "lou.pai@enron.com",
    "ken.rice@enron.com", "rebecca.mark@enron.com"]
known_guilty_vertices = [vertex_idx[to_ascii(x)] for x in known_guilty]
print known_guilty_vertices
```
The underlying intuition for the analysis is that the centrality of a vertex measures its relative importance within a graph. Since important people are usually those with power, and power has the potential to corrupt, people with power are more likely to commit crimes. Therefore a good first pass at finding "people of interest" in the Enron scandal would be to look at the nodes ordered by one or more centrality measures.
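As a toy illustration of this intuition (using made-up addresses, not the actual Enron data), degree centrality simply counts the connections touching each node, so the "hub" of an organization rises to the top of the ranking:

```python
import collections

# Hypothetical (from, to) email tuples, standing in for from_to.txt
tuples = [
    ("ceo@corp.com", "cfo@corp.com"),
    ("ceo@corp.com", "vp@corp.com"),
    ("cfo@corp.com", "ceo@corp.com"),
    ("vp@corp.com", "ceo@corp.com"),
    ("intern@corp.com", "vp@corp.com"),
]

# degree of a node = number of directed edges touching it
degree = collections.Counter()
for frm, to in tuples:
    degree[frm] += 1
    degree[to] += 1

# the "CEO" sits at the center of this toy network
ranked = [addr for addr, _ in degree.most_common()]
print(ranked[0])  # ceo@corp.com
```

The real analysis below does the same thing at scale, with library-computed centrality scores replacing the raw degree counts.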
The intuition seems to be borne out in the visualization of the top 100 nodes (by degree) in the Enron graph. The red nodes represent individuals in our test set. As you can see, there are 3 in the top 100, giving us a probability of 0.03, which is orders of magnitude higher than the probability of finding them in the entire graph (0.00015). We select the top 100 because the full network takes forever to plot and the plot becomes unreadable anyway. Plotting code is available here.
In the analysis that follows, we test various centrality measures against our test set. In all cases, we find the size of the intersection between the top 1,000 nodes (by that centrality measure) and our test set. While 1,000 nodes may seem like a lot, note that it constitutes only about 1.25% of the entire network.
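The evaluation itself is simple: walk down the ranked node list, counting how many test-set members have been seen so far. A minimal sketch with made-up node IDs (not the real vertex IDs used later):

```python
# Hypothetical test set of "guilty" node ids
test_set = {3, 7, 42}

# Hypothetical node ids, sorted by descending centrality score
ranked_nodes = [7, 1, 3, 9, 8, 42, 5, 2]

def cumulative_hits(ranked, targets):
    # hits[i] = number of test-set members among the top i+1 nodes
    hits, found = [], 0
    for node in ranked:
        if node in targets:
            found += 1
        hits.append(found)
    return hits

hits = cumulative_hits(ranked_nodes, test_set)
print(hits)  # [1, 1, 2, 2, 2, 3, 3, 3]
recall_at_5 = hits[4] / float(len(test_set))  # 2 of 3 found in the top 5
```

A metric is more effective the earlier its cumulative-hits curve rises, which is exactly what the charts below compare.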
At this point, we switch to the IGraph library because the resulting graph is too large for NetworkX to handle. IGraph is basically a C library with Python (and R) bindings, and is therefore faster and able to handle larger datasets.
We apply 4 centrality metrics (Degree Centrality, Closeness Centrality, Betweenness Centrality and Eigenvector Centrality) and 2 other measures of importance (PageRank and HITS Authority). For each metric we calculate the Recall of the top 1,000 vertices by that metric against our test set. The code to do the analysis is shown below:
```python
import igraph as ig
import matplotlib.pyplot as plt
import operator
import pandas as pd

KNOWN_GUILTY = set([7682, 49535, 8324, 12433, 60113, 43955,
                    47466, 4820, 10960, 47384, 49213, 64249])

def cumulative_hits(nodes, metric, n, filter_nodes):
    pairs = zip(nodes, metric)
    if filter_nodes is not None:
        pairs = [p for p in pairs if p[0] in filter_nodes]
    sorted_metric = sorted(pairs, key=operator.itemgetter(1),
                           reverse=True)[0:n]
    top_nodes = [x[0] for x in sorted_metric]
    cum_hits = []
    for i in range(0, len(top_nodes)):
        # slice is inclusive of the i-th node so every node is counted
        top_node_set = set(top_nodes[0:i + 1])
        match_set = top_node_set.intersection(KNOWN_GUILTY)
        cum_hits.append(len(match_set))
    return cum_hits

def run_all_hypotheses(G, topn, filter_nodes):
    df = pd.DataFrame(index=range(0, topn))
    nodes = [v.index for v in G.vs]
    df["Degree_Centrality"] = cumulative_hits(
        nodes, G.degree(), topn, filter_nodes)
    df["Closeness_Centrality"] = cumulative_hits(
        nodes, G.closeness(cutoff=3), topn, filter_nodes)
    df["Betweenness_Centrality"] = cumulative_hits(
        nodes, G.betweenness(cutoff=3), topn, filter_nodes)
    df["Eigenvector_Centrality"] = cumulative_hits(
        nodes, G.eigenvector_centrality(), topn, filter_nodes)
    df["PageRank"] = cumulative_hits(
        nodes, G.pagerank(directed=True, damping=0.85), topn, filter_nodes)
    df["HITS_Authority"] = cumulative_hits(
        nodes, G.authority_score(), topn, filter_nodes)
    df.plot()
    plt.show()

def prune_enron_only(G):
    enron = set([v.index for v in G.vs if v["label"].endswith("@enron.com")])
    return enron

def prune_with_nonenron_collaborators_only(G):
    # find set of non-Enron nodes
    not_enron = set([v.index for v in G.vs
                     if not v["label"].endswith("@enron.com")])
    # find Enron nodes with at least one non-Enron collaborator
    nnecs = set()
    for v in G.vs:
        if v["label"].endswith("@enron.com"):
            nvs = set([int(nv["id"]) for nv in v.neighbors()])
            if len(nvs.intersection(not_enron)) > 0:
                nnecs.add(v.index)
    return nnecs

def main():
    G = ig.read("enron.gml")
    run_all_hypotheses(G, 1000, None)
    run_all_hypotheses(G, 1000, prune_enron_only(G))
    run_all_hypotheses(G, 1000, prune_with_nonenron_collaborators_only(G))

if __name__ == "__main__":
    main()
```
The growth of the intersection set across the top 1,000 nodes sorted by each metric is summarized in the chart below. As you can see, all of these measures tend to have similar effectiveness.
As noted before, these metrics are only useful for identifying "important" nodes in a graph, i.e., for narrowing down the list for further investigation, rather than serving as absolute predictors. This is because centrality is an indicator of importance, and importance in a corporate environment implies power, which can be a catalyst for criminal behavior but does not automatically imply it.
Anyway, I am finding the course quite interesting, and I hope to do more with graphs in the future. All the source code in this post can be found on my GitHub page.
Update 2013-11-27: I added a network visualization diagram (the first one) at the suggestion of Vinh Ton (thanks Vinh!), a classmate at the Social Network Analysis course at Coursera who was kind enough to provide feedback. It definitely helps to formalize the intuition that the work is based on.
Update 2013-12-07: Based on suggestions from Prof Adamic (the course instructor for the Coursera SNA course), I have replaced the 6 charts representing each metric with a single one. This allows you to compare the effectiveness of each individual metric against each other.
She also suggested that I may want to filter out non-Enron email addresses from the graph, the intuition being that since the scandal is a company matter, non-employees are less likely to be involved than employees. The resulting chart is shown below. As you can see, effectiveness measured against the top 1,000 nodes is about the same, but the effectiveness is higher at lower number of nodes.
At this point, I had changed my code enough to easily try out another idea I had originally considered when doing the project: people with power in a company typically interact more with the outside world than those without. For this, we retain nodes with Enron email addresses whose neighbors include at least one non-Enron email address. The resulting chart is shown below. As you can see, effectiveness at 1,000 nodes is the same, but effectiveness grows even faster than with the approach of retaining only Enron employees.
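The filtering rule can be sketched in a few lines of plain Python (this is a toy adjacency map with made-up addresses, not the igraph code above): keep an Enron address only if at least one of its neighbors is outside the company.

```python
# Toy adjacency map: address -> set of correspondents
neighbors = {
    "a@enron.com": {"b@enron.com", "x@flash.net"},
    "b@enron.com": {"a@enron.com"},
    "x@flash.net": {"a@enron.com"},
}

def is_enron(addr):
    return addr.endswith("@enron.com")

# retain Enron addresses with at least one non-Enron collaborator
kept = set(
    addr for addr, nbrs in neighbors.items()
    if is_enron(addr) and any(not is_enron(n) for n in nbrs)
)
print(sorted(kept))  # only a@enron.com has an outside collaborator
```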