Some time ago, as part of a discussion I don't remember much about anymore, I was referred to this somewhat old (Jan/Feb 2018) set of articles about Deutsche Bank and its involvement in money laundering activities.
- DEUTSCHE BANK: A GLOBAL BANK FOR OLIGARCHS — AMERICAN & RUSSIAN, PART 1
- DEUTSCHE BANK: A GLOBAL BANK FOR OLIGARCHS — AMERICAN & RUSSIAN, PART 2
- DEUTSCHE BANK: A GLOBAL BANK FOR OLIGARCHS — AMERICAN & RUSSIAN, PART 3
Now I know as much about money laundering as the average person on the street, which is to say not much, so it was a fascinating and tedious read at the same time. Fascinating because of the scale of operations and the many big names involved, and tedious because there were so many players that I had a hard time keeping track of them as I read through the articles. In any case, I had just finished some work on my fork of the open source NERDS toolkit for training Named Entity Recognition models, and it occurred to me that identifying the entities in this set of articles and connecting them up into a graph might help to make better sense of it all. Sort of like how people draw mind-maps when trying to understand complex information. Except our process is going to be (mostly) automated, and our mind-map will have entities instead of concepts.
Skipping to the end, here is the entity graph I ended up building, shown as a screenshot from the Neo4j web console. Red nodes represent persons, green nodes represent organizations, and yellow nodes represent geo-political entities. The edges are shown as directed, but of course co-occurrence relationships are bidirectional (or, equivalently, undirected).
The basic idea is to find Named Entities in the text using off-the-shelf Named Entity Recognizers (NERs), and to connect a pair of entities if they co-occur in the same sentence. The transformation from unstructured text to entity graph is mostly automated, except for one step in the middle where we manually refine the entities and their synonyms. The graph data was ingested into a Neo4j graph database, and I used Cypher and Neo4j graph algorithms to generate insights from the graph. In this post I describe the steps to convert the unstructured article text to an entity graph. The code is provided on GitHub, and so is the data for this example, so you can use them to glean other interesting insights from this data, as well as rerun the pipeline to create entity graphs for your own text.
I structured the code as a sequence of Python scripts and Jupyter notebooks that are applied to the data. Each script or notebook reads the data files already available and writes new data files for the next stage. Scripts are numbered to indicate the sequence in which they should be run. I describe these steps below.
As mentioned earlier, the input is the text from the three articles listed above. I screen-scraped the text into a local text file (I selected and copied the article text, pasted it into a local text editor, and saved it into the file db-article.txt). The text is organized into paragraphs, with an empty line delimiting each paragraph. The first article also provided a set of acronyms and their expansions, which I captured similarly into the file db-acronyms.txt.
- 01-preprocess-data.py -- this script reads the paragraphs and converts them to a list of sentences. For each sentence, it checks whether any token is an acronym, and if so, replaces the token with its expansion. The script uses the SpaCy sentence segmentation model to segment the paragraph text into sentences, and the English tokenizer to tokenize sentences into tokens. Output of this step is a list of 585 sentences in the sentences.txt file (a sketch of this step appears after this list).
- 02-find-entities.py -- this script uses the SpaCy pre-trained NER to find instances of Person (PERSON), Organization (ORG), GeoPolitical (GPE), Nationality (NORP), and other entity types. Output is written to the entities.tsv file, one entity per line (see the NER sketch after this list).
- 03-cluster-entity-mentions.ipynb -- in this Jupyter notebook, we do simple rule-based entity disambiguation, so that similar entity spans found in the previous step are clustered under the same entity -- for example, "Donald Trump", "Trump", and "Donald J. Trump" are all clustered under the same PERSON entity for "Donald J. Trump". The disambiguation finds similar spans of text (using Jaccard token similarity) and considers spans above a certain threshold to refer to the same entity (see the clustering sketch after this list). The most frequent entity types found are ORG, PERSON, GPE, DATE, and NORP. This step writes out each cluster as a key-value pair, with the key being the longest span in the cluster, and the value a pipe-separated list of the other spans. Outputs from this stage are the files person_syns.csv, org_syns.csv, and gpe_syns.csv.
- 04-generate-entity-sets.py -- This is part of the manual step mentioned above. The *_syns.csv files contain clusters that are mostly correct, but because the clusters are based solely on lexical similarity, they still need some manual editing. For example, I found "US Justice Department" and "US Treasury Department" in the same cluster, but "Treasury" in a different one. Similarly, "Donald J. Trump" and "Donald Trump, Jr." appeared in the same cluster. This script re-adjusts the clusters, removing duplicate synonyms across clusters and assigning the longest span as the main entity name. It is designed to be run with arguments so you can version the *_syns.csv files. The repository contains my final manually updated files as gpe_syns-updated.csv, org_syns-updated.csv, and person_syns-updated.csv.
- 05-find-corefs.py -- As is typical in most writing, people and places are introduced in the article and are henceforth referred to as "he/she/it", at least while the context is available. This script uses the SpaCy neuralcoref extension to resolve pronoun coreferences; we restrict the coreference context to the paragraph in which the pronoun occurs (see the neuralcoref sketch after this list). Input is the original text file db-articles.txt and the output is a file of coreference mentions, corefs.tsv. Note that we don't yet attempt to update the sentences in place as we did with the acronyms, because the resulting sentences are too weird for the SpaCy sentence segmenter to segment accurately.
- 06-find-matches.py -- In this script, we use the *_syns.csv files to construct an Aho-Corasick Automaton object (from the PyAhoCorasick module), basically a Trie structure against which the sentences can be streamed. Once the Automaton is created, we stream the sentences against it, allowing it to identify spans of text that match entries in its dictionary. Because we want to match any pronouns as well, we first replace any coreferences found in the sentence with the appropriate entity, then run the updated sentence against the Automaton (see the matching sketch after this list). Output at this stage is matched_entities.tsv, a structured file of 998 entities containing the paragraph ID, sentence ID, entity ID, entity display name, and entity span start and end positions.
- 07-create-graphs.py -- We use the keys of the Aho-Corasick Automaton dictionary that we created in the previous step to write out a CSV file of graph nodes, and the matched_entities.tsv file to construct entity pairs within the same sentence and write out a CSV file of graph edges. The CSV files are in the format required by the neo4j-admin command, which is used to import the graph into a Neo4j 3.5 community edition database (see the CSV sketch after this list).
- 08-explore-graph.ipynb -- We have three kinds of nodes in the graph: PERson, ORGanization, and Geo-Political (GPE) nodes. In this notebook, we compute PageRank on each type of node to find the top people, organizations, and geo-political entities we should look at. From there, we select a few top people and find their neighbors. Another feature we built is a search-like functionality: once two nodes are selected, we show a list of sentences in which the two entities co-occur. And finally, we compute the shortest path between a pair of nodes. The notebook shows the different queries, the associated Cypher queries (including calls to the Neo4j Graph Algorithms procedures), as well as the outputs of these queries; it's probably easier for you to click through and take a look yourself than for me to describe it (a sketch of a PageRank query from Python appears after this list).
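To make some of these steps more concrete, here are a few minimal sketches. First, the preprocessing in 01-preprocess-data.py boils down to sentence segmentation plus acronym substitution. The sketch below assumes the acronym file has one tab-separated acronym/expansion pair per line and uses the small English SpaCy model; the real script may differ in these details.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# assumed format of db-acronyms.txt: one "ACRONYM<TAB>expansion" pair per line
acronyms = {}
with open("db-acronyms.txt") as f:
    for line in f:
        if line.strip():
            acro, expansion = line.strip().split("\t")
            acronyms[acro] = expansion

# paragraphs are delimited by empty lines
with open("db-article.txt") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

with open("sentences.txt", "w") as fout:
    for para in paragraphs:
        doc = nlp(para)
        for sent in doc.sents:
            # replace any token that is a known acronym with its expansion
            tokens = [acronyms.get(t.text, t.text) for t in sent]
            fout.write(" ".join(tokens) + "\n")
```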
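The entity finding in 02-find-entities.py is essentially a loop over the sentences calling the pre-trained SpaCy NER. The column layout of entities.tsv below is my assumption:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

with open("sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

with open("entities.tsv", "w") as fout:
    # assumed columns: sentence ID, entity text, entity label, start char, end char
    for sent_id, sent in enumerate(sentences):
        doc = nlp(sent)
        for ent in doc.ents:
            fout.write("\t".join([str(sent_id), ent.text, ent.label_,
                                  str(ent.start_char), str(ent.end_char)]) + "\n")
```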
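The clustering in 03-cluster-entity-mentions.ipynb is driven by Jaccard similarity over the token sets of the entity spans. Here is one way the idea can be implemented, as a greedy single-pass clustering; the actual notebook may organize this differently, and the threshold here is just a plausible value.

```python
def jaccard(a, b):
    # Jaccard similarity between the token sets of two spans
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_spans(spans, threshold=0.5):
    # greedy clustering: attach a span to the first cluster whose representative
    # (its longest member) is similar enough, otherwise start a new cluster
    clusters = []
    for span in spans:
        for cluster in clusters:
            rep = max(cluster, key=len)
            if jaccard(span, rep) >= threshold:
                cluster.add(span)
                break
        else:
            clusters.append({span})
    return clusters

spans = ["Donald Trump", "Trump", "Donald J. Trump", "Deutsche Bank", "Deutsche Bank AG"]
for cluster in cluster_spans(spans):
    key = max(cluster, key=len)                  # longest span becomes the cluster key
    print(key, "->", "|".join(sorted(cluster - {key})))
```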
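The coreference resolution in 05-find-corefs.py uses the neuralcoref extension, which plugs into a SpaCy 2.x pipeline. A minimal sketch on a made-up paragraph; the exact columns written to corefs.tsv are my assumption:

```python
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")   # neuralcoref works with SpaCy 2.x pipelines
neuralcoref.add_to_pipe(nlp)

# made-up paragraph for illustration
para = "Deutsche Bank was fined by regulators. It had failed to flag suspicious trades."
doc = nlp(para)

if doc._.has_coref:
    for cluster in doc._.coref_clusters:
        main = cluster.main                      # representative mention, e.g. "Deutsche Bank"
        for mention in cluster.mentions:
            if mention.text != main.text:
                # in the real script this row would go into corefs.tsv along with
                # the paragraph ID and the mention's character offsets
                print(mention.start_char, mention.end_char, mention.text, "->", main.text)
```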
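The matching in 06-find-matches.py streams sentences through a PyAhoCorasick Automaton. A sketch with a hypothetical synonym dictionary standing in for the *_syns.csv files:

```python
import ahocorasick

# hypothetical synonym dictionary; in the pipeline this comes from the
# person_syns, org_syns and gpe_syns CSV files
syns = {
    "Donald J. Trump": ["Donald Trump", "Trump"],
    "Deutsche Bank": ["Deutsche Bank AG"],
}

automaton = ahocorasick.Automaton()
for main, variants in syns.items():
    for span in [main] + variants:
        automaton.add_word(span, main)   # payload is the entity display name
automaton.make_automaton()

sentence = "Donald Trump borrowed heavily from Deutsche Bank AG."
for end_pos, display_name in automaton.iter(sentence):
    # iter() reports every match, including overlapping ones ("Trump" inside
    # "Donald Trump"); a real pipeline would typically keep only the longest
    print(display_name, "matched ending at", end_pos)
```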
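The CSVs written by 07-create-graphs.py follow the neo4j-admin import header conventions. The property names, labels, and relationship type below are my own choices for illustration, not necessarily those used in the repository:

```python
import csv
from itertools import combinations

# hypothetical matched entities grouped by sentence ID:
# sentence ID -> list of (entity ID, display name, label)
matches = {
    42: [(1, "Donald J. Trump", "PER"), (2, "Deutsche Bank", "ORG")],
}

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["entityId:ID", "name", ":LABEL"])   # neo4j-admin import header
    seen = set()
    for ents in matches.values():
        for ent_id, name, label in ents:
            if ent_id not in seen:
                writer.writerow([ent_id, name, label])
                seen.add(ent_id)

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for sent_id, ents in matches.items():
        # connect every pair of entities that co-occur in the same sentence
        for (id1, _, _), (id2, _, _) in combinations(ents, 2):
            writer.writerow([id1, id2, "CO_OCCURS_WITH"])

# the files can then be bulk loaded with something like:
# neo4j-admin import --nodes=nodes.csv --relationships=edges.csv
```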
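Finally, the queries in 08-explore-graph.ipynb go through the Python neo4j driver. A sketch of a PageRank query over the person nodes, assuming the older Graph Algorithms plugin (algo.* procedures) and my own guesses for the node label, relationship type, and credentials; on newer installs the equivalent procedure is gds.pageRank.stream:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# PageRank over person nodes connected by co-occurrence relationships
query = """
    CALL algo.pageRank.stream('PER', 'CO_OCCURS_WITH',
                              {iterations: 20, dampingFactor: 0.85})
    YIELD nodeId, score
    MATCH (p) WHERE id(p) = nodeId
    RETURN p.name AS name, score
    ORDER BY score DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["name"], record["score"])

driver.close()
```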
There are obviously many other things that can be done with the graph, limited only by your imagination (and possibly by your domain expertise on the subject at hand). For me, the exercise was fun because I was able to use off-the-shelf NLP components (as opposed to having to train my own component for my domain) to solve a problem I was facing. Using the power of NERs and graphs allows us to gain insights that would normally not be possible solely from the text.
This is an absolutely amazing way to read articles. I would like to extend this approach to reading non-fiction books - management, AI, architecture.
Thank you. WRT your idea about extending this to reading non-fiction books, you could probably look up the book's index and construct a page-level co-occurrence graph directly, or at least use the entries in the index as the list of entities.
I have done it for the Kaggle CORD-19 competition, write-up here: https://medium.com/@alex.mikhalev/building-knowledge-graph-from-covid-medical-literature-kaggle-cord19-competition-f0178d2a19bd
Hi Alex, thanks for the article, this is very cool. I am on a similar path with CORD-19 dataset, but maybe many steps behind you. I too am using Dask (also learning it in parallel) and SciSpacy models, specifically the one with UMLS concepts (kind of takes me back to my Healthline days where we used these concepts extensively). For the Aho-Corasick, have you considered using map_partitions() and giving each worker its own copy of the automaton? That would fix the single-threaded issue (assuming one thread per worker).
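To make the map_partitions() suggestion above concrete, here is a minimal sketch using a Dask bag, with a placeholder term list standing in for the UMLS dictionary; each partition builds its own Automaton inside the worker:

```python
import ahocorasick
import dask.bag as db

terms = ["fever", "cough", "acute respiratory distress"]   # placeholder term list

def match_partition(sentences):
    # build the automaton inside the worker so each partition has its own copy
    automaton = ahocorasick.Automaton()
    for term in terms:
        automaton.add_word(term, term)
    automaton.make_automaton()
    results = []
    for sent in sentences:
        for end_pos, term in automaton.iter(sent):
            results.append((sent, term, end_pos))
    return results

sentences = db.from_sequence(["patient presented with fever and cough"], npartitions=2)
print(sentences.map_partitions(match_partition).compute())
```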
Hello Sujit,
I already built a better, Redis-based version:
https://github.com/AlexMikhalev/cord19redisknowledgegraph
SciSpacy is great but memory-consuming.
Do you want to join our team on Discord? https://discord.gg/UcqQRTB
I will try map_partitions, but Aho-Corasick is super fast, so it's not a problem to run it in a single thread. I loaded the full UMLS into Aho-Corasick in under a few minutes, and the matching can be parallelized.
Regards,
Alex
Dask distributed is actually really difficult to use with SciSpacy - it's trying to serialise and deserialise 9 GB of RAM, and I had all sorts of issues with it - this is why I built a Redis-based pipeline, chopping off things which can be processed quickly and differently.
Thank you for the invite, but I don't think I will be able to move at the pace required for Kaggle competitions. Thanks for the link to your Redis-based pipeline, I will check it out; from a cursory look it seems similar to what I did with an external HTTP server, but Redis is probably a better choice. I have described my use of Dask with SciSpacy in my latest blog post; hopefully you will find something helpful in there.