The Sherlock Holmes short stories, written by Sir Arthur Conan Doyle, used to be quite popular back when I was a teenager. The stories describe the exploits of Sherlock Holmes, as narrated by his friend and sidekick Dr Watson. Not only do the stories share their central characters, they also have characters dropping in and out across stories, much like a long-running TV series. Recently I found some of these stories on Project Gutenberg, so I thought it might be interesting to write some code to discover relationships among the various characters. This post describes what I did. I chose the following stories:
- pg108.txt: The Return of Sherlock Holmes
- pg1661.txt: The Adventures of Sherlock Holmes
- pg2097.txt: The Sign of the Four
- pg2343.txt: The Adventure of Wisteria Lodge
- pg2344.txt: The Adventure of the Cardboard Box
- pg2345.txt: The Adventure of the Red Circle
- pg2346.txt: The Adventure of the Bruce-Partington Plans
- pg2347.txt: The Adventure of the Dying Detective
- pg2348.txt: The Disappearance of Lady Frances Carfax
- pg2349.txt: The Adventure of the Devil's Foot
- pg2350.txt: His Last Bow
- pg244.txt: A Study In Scarlet
- pg2852.txt: The Hound of the Baskervilles
- pg3289.txt: The Valley of Fear
- pg834.txt: Memoirs of Sherlock Holmes
Each of these files is available as plain text. The first step is to remove the Project Gutenberg headers and footers, then convert the remaining text to paragraphs. From that point on, I used the Stanford CoreNLP library to split the paragraphs into sentences, then extract named entities from those sentences. Since I was interested in identifying relationships between characters, I focused on the PERSON entities only. The next step is to disambiguate the entities, i.e., to determine that "Mr. Holmes", "Holmes" and "Sherlock Holmes" all refer to the same person. Finally, I compute co-occurrence across various text blocks, derive a weighted co-occurrence score between entities, and build a graph (for the top entities) for visualization.
The driver code is shown below. It calls out to various components which are described in more detail below.
// Source: src/main/scala/com/mycompany/scalcium/sherlock/SherlockMain.scala
package com.mycompany.scalcium.sherlock
import java.io.File
import java.io.FileWriter
import java.io.PrintWriter
import scala.Array.canBuildFrom
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import org.apache.commons.io.FileUtils
object SherlockMain extends App {
val dataDir = new File("data/sherlock/")
val metadataRemover = new MetadataRemover()
val paragraphSplitter = new ParagraphSplitter()
val sentenceSplitter = new SentenceSplitter()
val nerFinder = new NERFinder()
val entityDisambiguator = new EntityDisambiguator()
val graphDataGenerator = new GraphDataGenerator()
preprocess(new File(dataDir, "gutenberg"),
new File(dataDir, "output.tsv"))
wordCounts(new File(dataDir, "gutenberg"))
disambiguatePersons(new File(dataDir, "output.tsv"),
new File(dataDir, "output-PER-disambig.tsv"))
generateGraphData(new File(dataDir, "output-PER-disambig.tsv"),
new File(dataDir, "PER-graph-verts.tsv"),
new File(dataDir, "PER-graph-edges.tsv"))
def preprocess(indir: File, outfile: File): Unit = {
val writer = new PrintWriter(new FileWriter(outfile), true)
val gutenbergFiles = indir.listFiles()
gutenbergFiles.zipWithIndex.map(tf => {
val text = FileUtils.readFileToString(tf._1, "UTF-8")
(tf._2, text)
})
.map(ft => (ft._1, metadataRemover.removeMetadata(ft._2)))
.flatMap(ft => paragraphSplitter.split(ft))
.flatMap(fpt => sentenceSplitter.split(fpt, false))
.flatMap(fpst => nerFinder.find(fpst))
.foreach(fpstt => save(writer, fpstt.productIterator))
writer.flush()
writer.close()
}
def save(outfile: PrintWriter, line: Iterator[_]): Unit =
outfile.println(line.toList.mkString("\t"))
def wordCounts(indir: File): Unit = {
val gutenbergFiles = indir.listFiles()
val numWordsInFile = ArrayBuffer[Int]()
val numWordsInPara = ArrayBuffer[Int]()
val numWordsInSent = ArrayBuffer[Int]()
gutenbergFiles.zipWithIndex.foreach(tf => {
val text = metadataRemover.removeMetadata(
FileUtils.readFileToString(tf._1, "UTF-8"))
numWordsInFile += text.split(" ").size
val paras = paragraphSplitter.split((tf._2, text))
paras.foreach(para => {
numWordsInPara += para._3.split(" ").size
val sents = sentenceSplitter.split(para, false)
sents.foreach(sent =>
numWordsInSent += sent._4.split(" ").size)
})
})
Console.println("Average words/file: %.3f".format(
mean(numWordsInFile.toArray)))
Console.println("Average words/para: %.3f".format(
mean(numWordsInPara.toArray)))
Console.println("Average words/sent: %.3f".format(
mean(numWordsInSent.toArray)))
}
def mean(xs: Array[Int]): Double =
xs.foldLeft(0)(_ + _).toDouble / xs.size
def disambiguatePersons(infile: File, outfile: File): Unit = {
val personEntities = entityDisambiguator.filterEntities(
new File(dataDir, "output.tsv"), "PERSON")
val uniquePersonEntities = personEntities.distinct
val personSims = entityDisambiguator.similarities(uniquePersonEntities)
val personSyns = entityDisambiguator.synonyms(personSims, 0.6) ++
// add few well-known ones that escaped the 0.6 dragnet
Map(("Holmes", "Sherlock Holmes"),
("MR. HOLMES", "Sherlock Holmes"),
("MR. SHERLOCK HOLMES", "Sherlock Holmes"),
("Mycroft", "Mycroft Holmes"),
("Brother Mycroft", "Mycroft Holmes"))
val writer = new PrintWriter(new FileWriter(outfile), true)
Source.fromFile(infile).getLines
.map(line => line.split("\t"))
.filter(cols => cols(4).equals("PERSON"))
.map(cols => (cols(0), cols(1), cols(2),
personSyns.getOrElse(cols(3), cols(3))))
.foreach(cs => writer.println("%d\t%d\t%d\t%s".format(
cs._1.toInt, cs._2.toInt, cs._3.toInt, cs._4)))
writer.flush()
writer.close()
}
def generateGraphData(infile: File, vfile: File, efile: File): Unit = {
val vwriter = new PrintWriter(new FileWriter(vfile), true)
val vertices = graphDataGenerator.vertices(infile, 0)
vertices.toList.sortWith((a, b) => a._2 > b._2)
.foreach(vert => vwriter.println("%s\t%d".format(
vert._1, vert._2)))
vwriter.flush()
vwriter.close()
// consider minFreq = 32 (top 20), create exclude Set
val excludeVertices = vertices.filter(vert => vert._2 < 32)
.map(vert => vert._1)
.toSet
val ewriter = new PrintWriter(new FileWriter(efile), true)
val edges = graphDataGenerator.edges(infile, 0.1D, excludeVertices)
edges.foreach(edge => ewriter.println("%s\t%s\t%.3f".format(
edge._1, edge._2, edge._3)))
ewriter.flush()
ewriter.close()
}
}
Preprocessing
This phase consists of multiple operations, each with an associated class. I had initially set this up so that each step read the output of the previous step and wrote its own output; later I modeled it as a pipeline so it can be readily adapted into a Spark pipeline.
The first step is to remove the Gutenberg header and footer blocks from the files, since we don't want to extract entities from that portion of the text. We also apply some heuristics to remove paragraphs which can be easily identified as being outside the story.
// Source: src/main/scala/com/mycompany/scalcium/sherlock/MetadataRemover.scala
package com.mycompany.scalcium.sherlock
class MetadataRemover {
def removeMetadata(text: String): String = {
val lines = text.split("\n")
val bounds = lines.zipWithIndex
.filter(li =>
li._1.startsWith("*** START OF THIS PROJECT") ||
li._1.startsWith("*** END OF THIS PROJECT"))
.map(li => li._2)
.toList
lines.slice(bounds(0) + 1, bounds(1))
.filter(line => looksLikeStoryLine(line))
.mkString("\n")
}
def looksLikeStoryLine(line: String): Boolean = {
(!line.startsWith("Produced by") &&
!line.startsWith("By") &&
!line.startsWith("by") &&
!line.startsWith("End of the Project") &&
!line.startsWith("End of Project") &&
!line.contains("Arthur Conan Doyle") &&
!(line.trim.size > 0 && line.toUpperCase().equals(line)))
}
}
The input at this stage consists of lines that wrap at around the 80th character, with one sentence potentially split across multiple lines. Paragraphs are separated by multiple newlines. Our code converts this into paragraphs of text, where each paragraph consists of multiple sentences on a single line. We also eliminate paragraphs shorter than 40 characters, as these tend to be titles or section headings.
// Source: src/main/scala/com/mycompany/scalcium/sherlock/ParagraphSplitter.scala
package com.mycompany.scalcium.sherlock
import scala.Array.canBuildFrom
import scala.collection.mutable.Stack
class ParagraphSplitter {
def split(fileText: (Int, String)): List[(Int, Int, String)] = {
val filename = fileText._1
val lines = fileText._2.split("\n")
val nonEmptyLineNbrPairs = lines
.zipWithIndex
.filter(li => li._1.trim.size > 0)
.map(li => li._2)
.sliding(2)
.map(poss => (poss(0), poss(1)))
.toList
val spans = Stack[(Int,Int)]()
var inSpan = false
nonEmptyLineNbrPairs.foreach(pair => {
if (spans.isEmpty) {
if (pair._1 + 1 == pair._2) {
spans.push(pair)
inSpan = true
} else {
spans.push((pair._1, pair._1))
spans.push((pair._2, pair._2))
}
} else {
val last = spans.pop
if (pair._1 + 1 == pair._2) {
spans.push((last._1, pair._2))
inSpan = true
} else {
if (inSpan) {
spans.push((last._1, pair._1 + 1))
spans.push((pair._2, pair._2))
inSpan = false
} else {
spans.push(last)
spans.push((pair._2, pair._2))
}
}
}
})
val lastSpan = spans.pop
spans.push((lastSpan._1, lastSpan._2 + 1))
// extract paragraphs in order and add extra sequence info
spans.reverse.toList
.map(span => {
if (span._1 == span._2) lines(span._1)
else lines.slice(span._1, span._2).mkString(" ")
})
.filter(line => !line.startsWith(" ") &&
line.length > 40) // remove TOCs and titles
.zipWithIndex
.map(paraIdx => (filename, paraIdx._2, paraIdx._1))
}
}
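The span-tracking logic above can be approximated much more simply. Here is a minimal Python sketch of the same idea (split on blank lines, join wrapped lines, drop short paragraphs); the 40-character threshold matches the Scala code, but the helper name is mine:

```python
import re

def split_paragraphs(text, min_len=40):
    """Join hard-wrapped lines into one-line paragraphs and drop
    short ones, which tend to be titles or section headings."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):  # blank lines separate paragraphs
        para = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if len(para) > min_len:
            paragraphs.append(para)
    return paragraphs

text = ("THE EMPTY HOUSE\n\n"
        "It was in the spring of the year 1894 that\n"
        "all London was interested in the murder.\n\nII.")
print(split_paragraphs(text))
# → ['It was in the spring of the year 1894 that all London was interested in the murder.']
```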
Once we have paragraphs, we use the Stanford CoreNLP library to split each one into sentences. Along with the sentences themselves, we also record the file number, paragraph number and sentence number. We will need these coordinates for computing the co-occurrence scores between entities later.
// Source: src/main/scala/com/mycompany/scalcium/sherlock/SentenceSplitter.scala
package com.mycompany.scalcium.sherlock
import java.util.Properties
import scala.collection.JavaConversions.asScalaBuffer
import scala.collection.JavaConversions.propertiesAsScalaMap
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
class SentenceSplitter {
val props = new Properties()
props("annotators") = "tokenize, ssplit"
val pipeline = new StanfordCoreNLP(props)
def split(fileParaText: (Int, Int, String), doPadding: Boolean):
List[(Int, Int, Int, String)] = {
val fileId = fileParaText._1
val paraId = fileParaText._2
val paraText = if (!doPadding) fileParaText._3
else "<START>. " + fileParaText._3
val annot = new Annotation(paraText)
pipeline.annotate(annot)
annot.get(classOf[SentencesAnnotation])
.map(sentence => sentence.toString())
.zipWithIndex
.map(sentWithId =>
(fileId, paraId, sentWithId._2, sentWithId._1))
.toList
}
}
Finally, we use Stanford CoreNLP again to extract named entities from the sentences. It can extract around 7 classes of entities, but we are only interested in the PERSON entity class.
// Source: src/main/scala/com/mycompany/scalcium/sherlock/NERFinder.scala
package com.mycompany.scalcium.sherlock
import java.util.Properties
import scala.collection.JavaConversions.asScalaBuffer
import scala.collection.JavaConversions.propertiesAsScalaMap
import scala.collection.mutable.Stack
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
class NERFinder {
val props = new Properties()
props("annotators") = "tokenize, ssplit, pos, lemma, ner"
props("ssplit.isOneSentence") = "true"
val pipeline = new StanfordCoreNLP(props)
def find(sent: (Int, Int, Int, String)):
List[(Int, Int, Int, String, String)] = {
val fileId = sent._1
val paraId = sent._2
val sentId = sent._3
val sentText = sent._4
val annot = new Annotation(sentText)
pipeline.annotate(annot)
val tokTags = annot.get(classOf[SentencesAnnotation])
.head // only one sentence in input
.get(classOf[TokensAnnotation])
.map(token => {
val begin = token.beginPosition()
val end = token.endPosition()
val nerToken = sentText.substring(begin, end)
val nerTag = token.ner()
(nerToken, nerTag)
})
// consolidate NER for multiple tokens into one, so
// for example: Ronald/PERSON Adair/PERSON becomes
// "Ronald Adair"/PERSON
val spans = Stack[(String, String)]()
tokTags.foreach(tokTag => {
if (spans.isEmpty) spans.push(tokTag)
else if (tokTag._2.equals("O")) spans.push(tokTag)
else {
val prevEntry = spans.pop
if (prevEntry._2.equals(tokTag._2))
spans.push((Array(prevEntry._1, tokTag._1).mkString(" "),
tokTag._2))
else {
spans.push(prevEntry)
spans.push(tokTag)
}
}
})
spans.reverse
.filter(tokTag => !tokTag._2.equals("O"))
.map(tokTag => (fileId, paraId, sentId, tokTag._1, tokTag._2))
.toList
}
}
The output of this pipeline is a single tab-separated file containing 17,437 entities, which looks like this. The first three columns are the file number, the paragraph number in the file, and the sentence number in the paragraph respectively. The next column is the span of text representing the entity, and the last column is the entity type that CoreNLP assigned to this text.
0 0 0 the spring of the year 1894 DATE
0 0 0 London LOCATION
0 0 0 Ronald Adair PERSON
0 0 2 now DATE
0 0 2 the end of nearly ten years DURATION
0 0 4 now DATE
0 0 4 once DATE
0 0 5 first ORDINAL
0 0 5 third ORDINAL
0 0 5 last month DATE
...
Entity Disambiguation
Since my objective is to figure out the relationships between characters in the stories, I focus only on the PERSON entities going forward. However, even among the PERSON entities, there is potential for ambiguity - for example, "Sherlock Holmes" can be variously referred to as "Sherlock", "Holmes", "Mr. Holmes", etc. So I needed a way to "normalize" these mentions to a single entity name.
My disambiguation code first computes the Levenshtein distance between a pair of entity names. If the distance is 0, the names are identical and nothing more needs to be done. If they are not identical, I calculate a measure of word overlap between the two strings - the size of the word intersection divided by the average number of words in the two strings - and consider the two entities equal if the score exceeds a certain threshold (in my case 0.6, chosen by quickly eyeballing the generated pairwise scores).
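As a quick sketch of the overlap score (in Python rather than the Scala used in this post; the function name is mine):

```python
def name_similarity(a, b):
    """Word-overlap score: intersection size divided by the average
    word count of the two names. Identical names (Levenshtein
    distance 0) score 1.0."""
    if a == b:
        return 1.0
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / ((len(words_a) + len(words_b)) / 2.0)

print(name_similarity("Holmes", "Sherlock Holmes"))  # 1 / 1.5 ≈ 0.667, above the 0.6 cutoff
print(name_similarity("Watson", "Sherlock Holmes"))  # 0.0, no shared words
```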
This list of entity pairs is converted into a dictionary of synonyms where the shorter entity points to the longer one. I then pass the file generated above through this filter: any text span found in the dictionary is transformed into the longer entity, otherwise it is kept as-is. The end product is a file of PERSON entities with the entity names mostly disambiguated. I did notice a few major ones that were not addressed by the code above, so I added a set of manual overrides as well. The original list of 6,428 PERSON entities came down to 701 entities after disambiguation. Here is the code for the disambiguator.
// Source: src/main/scala/com/mycompany/scalcium/sherlock/EntityDisambiguator.scala
package com.mycompany.scalcium.sherlock
import java.io.File
import scala.io.Source
import org.apache.commons.lang3.StringUtils
class EntityDisambiguator {
def filterEntities(infile: File, entityType: String): List[String] = {
Source.fromFile(infile).getLines
.filter(line => line.split("\t")(4).equals(entityType))
.map(line => line.split("\t")(3))
.toList
}
def similarities(uniqueEntities: List[String]):
List[(String, String, Double)] = {
val entityPairs = for (e1 <- uniqueEntities;
e2 <- uniqueEntities)
yield (e1, e2)
// the filter removes exact duplicates, eg (Holmes, Holmes)
// as well as makes the LHS shorter than RHS, eg (Holmes,
// Sherlock Holmes)
entityPairs.filter(ee => ee._1.length < ee._2.length)
.map(ee => (ee._1, ee._2, similarity(ee._1, ee._2)))
}
def similarity(entityPair: (String, String)): Double = {
val levenshtein = StringUtils.getLevenshteinDistance(
entityPair._1, entityPair._2)
if (levenshtein == 0.0D) 1.0D
else {
val words1 = entityPair._1.split(" ").toSet
val words2 = entityPair._2.split(" ").toSet
words1.intersect(words2).size.toDouble /
((words1.size + words2.size) * 0.5)
}
}
def synonyms(sims: List[(String, String, Double)],
minSim: Double): Map[String, String] = {
sims.filter(ees => (!ees._1.equals(ees._2) && ees._3 > minSim))
// remove duplicate mappings, eg Peter => Peter Carey and
// Peter => Peter Jones. Like the unique cases which have
// no suspected candidates, these will resolve to themselves
.groupBy(ee => ee._1)
.filter(eeg => eeg._2.size == 1)
.map(eeg => (eeg._1, eeg._2.head._2))
.toMap
}
}
Scoring
I then took pairwise sets of the entities (along with their "co-ordinates" given by the first 3 columns in the file - the file ID, paragraph ID and sentence ID) and computed pairwise distances between the entities. The final representation is a graph, with edges representing connections between entities and edge weights representing the degree of connectedness. Each line in the input file represents an instance of one of the entities.
The pairwise distance is computed based on the three coordinates. If two entity instances co-occur in the same sentence, they contribute the highest to the edge weight between the two entities. Less so if they co-occur in the same paragraph and even less if they occur in the same file. The actual weights are the reciprocals of the average sentence length, average paragraph length and the average file length in words. The intuition here is that it is more likely to have a pair of entity instances co-occur in the same paragraph than the same sentence, and the likelihood depends on the number of words being considered.
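In Python, this weighting scheme looks like the sketch below. The constants are the reciprocal average lengths hardcoded in GraphDataGenerator; the function name is mine:

```python
# reciprocals of the average file, paragraph and sentence lengths in words,
# as computed from this corpus (values taken from GraphDataGenerator)
P_FILE = 1.0 / 30579
P_PARA = 1.0 / 56.534
P_SENT = 1.0 / 15.629

def edge_weight(loc1, loc2):
    """Each loc is (file_id, para_id, sent_id). The tighter the
    co-occurrence, the larger the contribution to the edge weight."""
    if loc1[0] != loc2[0]:
        return 0.0        # different files: no contribution
    if loc1[1] != loc2[1]:
        return P_FILE     # same file, different paragraphs
    if loc1[2] != loc2[2]:
        return P_PARA     # same paragraph, different sentences
    return P_SENT         # same sentence: strongest signal

# same-sentence co-occurrence outweighs same-paragraph co-occurrence
print(edge_weight((0, 3, 1), (0, 3, 1)) > edge_weight((0, 3, 1), (0, 3, 2)))  # True
```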
Just like term frequencies, entity frequencies also seem to behave in a Zipfian manner. So it makes sense to discard entities with low frequencies since they don't model anything interesting. Also, removing them at the outset results in fewer cycles when we compute their pairwise weights. In our case we compute frequencies for all entities, but only consider entities whose frequencies are higher than 32 (leaving us with around 20 top entities in our entity relationship graph). Here is the code to generate the entity frequency and the entity relation data.
// Source: src/main/scala/com/mycompany/scalcium/sherlock/GraphDataGenerator.scala
package com.mycompany.scalcium.sherlock
import java.io.File
import scala.io.Source
import java.io.PrintWriter
import java.io.FileWriter
class GraphDataGenerator {
def pFile = 1.0D / 30579
def pPara = 1.0D / 56.534
def pSent = 1.0D / 15.629
def vertices(infile: File, minFreq: Int = 0): Map[String, Int] = {
Source.fromFile(infile).getLines
.map(line => line.split("\t")(3))
.toList
.groupBy(name => name)
.map(ng => (ng._1, ng._2.size))
.filter(ns => ns._2 > minFreq)
}
def edges(infile: File, minSim: Double = 0.01D,
exclude: Set[String] = Set()):
List[(String,String,Double)] = {
val personData = Source.fromFile(infile).getLines
.map(line => line.split("\t"))
.filter(cols => !exclude.contains(cols(3)))
.toList
val personPairs = for (p1 <- personData; p2 <- personData)
yield (p1, p2)
personPairs.filter(pp => pp._1(3).length < pp._2(3).length)
.map(pp => (pp._1(3), pp._2(3), edgeWeight(pp._1, pp._2)))
.toList
.groupBy(ppw => (ppw._1, ppw._2))
.map(ppwg => (ppwg._1._1, ppwg._1._2, ppwg._2.map(_._3).sum))
.filter(ppwg => ppwg._3 > minSim)
.toList
}
def edgeWeight(p1: Array[_], p2: Array[_]): Double = {
val p1Locs = p1.slice(0, 3).map(_.asInstanceOf[String].toInt)
val p2Locs = p2.slice(0, 3).map(_.asInstanceOf[String].toInt)
if (p1Locs(0) == p2Locs(0)) { // same file
if (p1Locs(1) == p2Locs(1)) { // same para
if (p1Locs(2) == p2Locs(2)) pSent // same sentence
else pPara // same para but not same sentence
} else pFile // same file but not same para
} else 0.0D // not same file
}
}
This outputs two files, one with entities and their frequency in the corpus, and another with pairs of entities and their edge weights. They are shown below:
# PER-graph-verts.tsv
Sherlock Holmes 1818
Watson 609
Messrs. Lestrade 200
Henry 129
Miss Stapleton 91
Mortimer 86
Tobias Gregson 66
Charles 66
Barrymore 64
Hopkins 41
...
# PER-graph-edges.tsv
Mary Sherlock Holmes 0.442
Jones Sherlock Holmes 0.583
Barrymore Sherlock Holmes 0.603
Hopkins Sherlock Holmes 1.688
Drebber Jefferson Hope 0.216
Hopkins Messrs. Lestrade 0.111
Watson Charles 0.257
Henry Watson 0.633
Watson Sherlock Holmes 10.432
James Sherlock Holmes 0.643
...
Visualizations
We use the files generated in the previous step to create a frequency chart for the top 20 entities and a graph showing the relationships between these entities. Code (in Python) to generate the frequency distribution for the top 20 entities is shown below:
# Source: src/main/python/sherlock/per_freqs.py
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
fin = open("data/sherlock/PER-graph-verts.tsv", 'rb')
xs = []
ys = []
num_entries = 20
curr_entries = 0
for line in fin:
    if curr_entries >= num_entries:
        break
    cols = line.strip().split("\t")
    xs.append(cols[0])
    ys.append(int(cols[1]))
    curr_entries += 1
fin.close()
plt.barh(range(len(xs[::-1])), ys[::-1], align="center", height=1)
plt.yticks(range(len(xs)), xs[::-1])
plt.xlabel("Occurrence Count")
plt.grid()
plt.show()
Below is the code to generate the graph of relationships among the top 20 entities. Edges with higher weights are drawn thicker and are also color coded: red represents the highest weights, followed by blue, green, orange and magenta.
# Source: src/main/python/sherlock/per_rels.py
# -*- coding: utf-8 -*-
from __future__ import division
import networkx as nx
import matplotlib.pyplot as plt
import math
G = nx.Graph()
fin = open("data/sherlock/PER-graph-edges.tsv", 'rb')
for line in fin:
    cols = line.strip().split("\t")
    G.add_edge(cols[0], cols[1], weight=int(1.0 + math.log(10.0 * float(cols[2]))))
graph_pos = nx.shell_layout(G)
nx.draw_networkx_nodes(G, graph_pos, node_size=5000, alpha=0.3,
                       node_color="blue")
nx.draw_networkx_labels(G, graph_pos, font_size=12, font_family="sans-serif")
# sort the distinct weights so the color assignment is deterministic
# (zipping an unordered set against the color list is not)
edge_widths = sorted(set(map(lambda x: x[2]["weight"], G.edges(data=True))))
colors = ["magenta", "orange", "green", "blue", "red"]
for edge_width, color in zip(edge_widths, colors):
    edge_subset = [(u, v) for (u, v, d) in G.edges(data=True)
                   if d["weight"] == edge_width]
    nx.draw_networkx_edges(G, graph_pos, edgelist=edge_subset, width=edge_width,
                           alpha=0.3, edge_color=color)
fig = plt.gcf()
fig.set_size_inches(15, 15)
plt.xticks([])
plt.yticks([])
plt.show()
Not surprisingly, Sherlock Holmes is the most connected, and he is connected most strongly to Watson (red) and Lestrade (blue). The graph is quite densely connected, which is also not that surprising, since we are considering the most frequent entities, i.e., the popular characters in the series. Unfortunately, I was unable to identify any of the characters other than the top 3. I also thought that some characters such as Mycroft Holmes (the 25th most frequent entity) and Prof Moriarty (58th) should have ranked higher. But that probably just means that I wasn't enough of a Holmes fan to know about the other characters :-).
Ideas for Improvement
There are some things I could have done that might have given better results (or at least results more in line with my ideas of who the main characters should be). But I deliberately skipped them because I wanted to see some results quickly with comparatively little effort.
The first would be to manually split the stories. As it stands, each file can either be a full novel (for example, The Hound of the Baskervilles) or a collection of short stories (for example, The Adventures of Sherlock Holmes). So it perhaps makes more sense to split the text up according to story rather than by file for weighting purposes. This may also mean revisiting the concept of average file length, since story lengths for short stories vs novels are likely to be quite different, so we may have to use local weights corresponding to the current story or do away with this measure altogether.
Another thing I plan to try in the future is pronoun resolution. Currently, entities are found only from direct mentions of people's names. I did try CoreNLP's Coreference Resolution module to capture the pronouns, but it generates all sorts of coreference candidates; it may be worth building something custom that works only on pronouns and associates them with PERSON mentions close enough in the past.
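Such a custom heuristic could start as simply as the Python sketch below: attach each pronoun to the most recently seen PERSON mention. The function and its inputs are hypothetical, not part of the code in this post, and a real version would need to handle gender, dialogue and sentence boundaries:

```python
PRONOUNS = {"he", "him", "his", "she", "her", "hers"}

def resolve_pronouns(tokens, person_at):
    """Naive heuristic: map each pronoun token to the most recently
    seen PERSON mention. person_at maps token index -> entity name."""
    resolved = []
    last_person = None
    for i, tok in enumerate(tokens):
        if i in person_at:
            last_person = person_at[i]
        elif tok.lower() in PRONOUNS and last_person is not None:
            resolved.append((i, last_person))
    return resolved

tokens = "Holmes smiled . He lit his pipe".split()
print(resolve_pronouns(tokens, {0: "Sherlock Holmes"}))
# → [(3, 'Sherlock Holmes'), (5, 'Sherlock Holmes')]
```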
All the Scala code and Python code are available at these links on GitHub.