Saturday, October 11, 2014

Clustering Section Titles with FuzzyWuzzy and DBSCAN


Clinical notes are generated at different points of a patient's interaction with medical services, and generally consist of free-form text grouped into sections. Medical schools train doctors and nurses to take notes in a certain way, so the section titles from a family of notes look remarkably similar. However, over time, doctors and nurses adapt the template they were taught to one that works for their particular case, so there are variations in section titles. For example, one doctor may group Laboratory Tests requested under the section heading "LABORATORY TESTS", while someone else may call it "LAB TESTS" or even "LABS".

For pipelines that process this sort of document, section titles can help provide hints for entity disambiguation - for example, words such as Allergy or Headache can be recognized as drug brand names in the MEDICATIONS section, but as symptoms in HISTORY AND PHYSICAL EXAM sections. The document can also be split into different sections and each section sent through a different pipeline.

The basic problem is thus to identify the type of section from its name. Because section names vary, it is necessary to group the variations into clusters. The SecTag project from the Vanderbilt Biomedical Language Processing Lab has developed a very detailed terminology of section headers. The project is described in this paper.

I wondered if instead I could use Machine Learning techniques to group these titles, and decided to use the DBSCAN clustering algorithm from the scikit-learn project. For the similarity implementation I decided to use the fuzzy string matching algorithms from the FuzzyWuzzy project. I also tried breaking up the titles into character 5-gram shingles and using the Jaccard similarity between the shingles as a similarity metric but did not get very good results.
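
For reference, the character-shingle comparison mentioned above can be sketched roughly as follows. This is a minimal illustration written for Python 3, not the code used in the experiment; the function names `shingles` and `jaccard_similarity`, and the choice to treat a title shorter than five characters as a single shingle, are assumptions made for the example.

```python
def shingles(text, k=5):
    """Return the set of character k-grams in a lowercased title.

    Assumption for this sketch: a title no longer than k characters is
    treated as one shingle, so it still has a non-empty set.
    """
    text = text.lower()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(a, b, k=5):
    """Jaccard similarity: |intersection| / |union| of the shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    for left, right in [("LAB TESTS", "LABORATORY TESTS"),
                        ("LABS", "LABORATORY TESTS")]:
        print(left, "|", right, "|", jaccard_similarity(left, right))
```

Under these assumptions, LAB TESTS and LABORATORY TESTS share only 2 of their 15 distinct shingles, and LABS shares none with either, which may be one reason short titles are hard for this metric.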

I have a small corpus of 2k+ clinical notes that I use for my experiments. Here is the code to extract the section titles from it. The clinical notes are in plaintext format, with section headers usually in all-caps, either starting a line or on a line of their own. A header that starts a line is separated from the rest of the text by either a colon or a hyphen. Obviously this is very tied to my data and not very interesting, but I include it here for completeness (and no, unfortunately I cannot share the data).

# Source: extract_stl.py
import os
import nltk
import string
import re
from operator import itemgetter

INPUT_DIR = "/path/to/dir/containing/input/texts"
OUTPUT_DIR = "/path/to/intermediate/files"
PUNCTUATIONS = set([c for c in string.punctuation])
DIGITS = set([c for c in string.digits])
BULLETS = re.compile(r"[0-9IVXA-Za-z]{0,3}\.")
PUNCTS = re.compile("[" + re.escape(string.punctuation) + "]")

def find_first(line, cs):
    idxs = []
    for c in cs:
        c_index = line.find(c)
        if c_index > -1:
            # if this occurs after an existing punctuation, then discard
            prev_chars = set(line[:c_index])
            if len(PUNCTUATIONS.intersection(prev_chars)) > 0:
                return -1
            # make sure this position is either EOL or followed by space
            if c_index + 1 == len(line) or line[c_index + 1] == ' ':
                idxs.append(c_index)
    if len(idxs) == 0:
        return -1
    else:
        return min(idxs)
        
stfd = nltk.FreqDist()
for filename in os.listdir(INPUT_DIR):
    f = open(os.path.join(INPUT_DIR, filename), 'r')
    for line in f:
        line = line.strip()
        if len(line) == 0:
            continue
        # Isolate section titles from text. Titles are leading phrases 
        # terminated by colon or hyphen. Usually all-caps but can be in
        # mixed-case also
        sec_title = None
        corh = find_first(line, [":", "-"])
        if corh > -1:
            sec_title = line[0:corh]
        # Alternatively, if the line is all caps, then it is also a
        # section title
        if sec_title is None and line.upper() == line:
            sec_title = line
        if sec_title is not None: 
            # Remove retrieved titles that start with a bullet: up to 3
            # arabic-number, roman-numeral or alpha characters and a period
            if re.match(BULLETS, sec_title) is not None:
                continue
            # Remove titles that look like dates (all digits once
            # punctuation is removed)
            if re.sub(PUNCTS, "", sec_title).isdigit():
                continue
            # if retrieved title is mixed case remove any that have > 4 words
            if sec_title != sec_title.upper() and len(sec_title.split()) > 4:
                continue
            # if retrieved title contains special chars, remove
            if "," in sec_title:
                continue
            # replace "diagnoses" with "diagnosis"
            sec_title = re.sub("DIAGNOSES", "DIAGNOSIS", sec_title)
            stfd[sec_title] += 1
    f.close()
    
# output the frequency distribution
fout = open(os.path.join(OUTPUT_DIR, "stitles.txt"), 'w')
for k, v in sorted(stfd.items(), key=itemgetter(1), reverse=True):
    fout.write("%s\t%d\n" % (k, v))
fout.close()

The code above produces an output file of 2,093 section titles and their counts, one per line, with the title and count separated by a tab. The following code consumes this file to build a square matrix of pairwise distances between section titles. To keep the distance matrix size reasonable, I decided to only consider titles that appear at least twice, which gives me a distance matrix of size 841x841 (instead of 2093x2093). FuzzyWuzzy scores range from 0 to 100, so the distance is calculated as 1 minus one-hundredth of the highest score among its ratio, token_sort_ratio and token_set_ratio methods. The code below writes the matrix in Matrix Market format.

# Source: fuzz_similarity.py
from __future__ import division
import os
import numpy as np
from scipy.io import mmwrite
from fuzzywuzzy import fuzz

OUTPUT_DIR = "/path/to/intermediate/files"

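def compute_similarity(s1, s2):
    # Despite the name, this returns a distance in [0, 1]: FuzzyWuzzy scores
    # run from 0 to 100, so scale the best score to [0, 1] and subtract from 1.
    return 1.0 - (0.01 * max(
        fuzz.ratio(s1, s2),
        fuzz.token_sort_ratio(s1, s2),
        fuzz.token_set_ratio(s1, s2)))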
        

cutoff = 2
stitles = []
fin = open(os.path.join(OUTPUT_DIR, "stitles.txt"), 'r')
for line in fin:
    stitle, count = line.strip().split("\t")
    if int(count) < cutoff:
        continue
    stitles.append(stitle)
fin.close()

X = np.zeros((len(stitles), len(stitles)))
for i in range(len(stitles)):
    if i > 0 and i % 10 == 0:
        print("Processed %d/%d rows of data" % (i, X.shape[0]))
    for j in range(len(stitles)):
        if X[i, j] == 0.0:        
            X[i, j] = compute_similarity(stitles[i].lower(), stitles[j].lower())
            X[j, i] = X[i, j]

# write to Matrix Market format for passing to DBSCAN
mmwrite(os.path.join(OUTPUT_DIR, "stitles.mtx"), X)
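
To show what the three scores in the distance function roughly measure, here is a standard-library stand-in built on `difflib.SequenceMatcher`. It is only a sketch under stated assumptions, not FuzzyWuzzy itself: the real library also normalizes the strings and returns integer scores from 0 to 100, and its exact values may differ. The function names below mirror the FuzzyWuzzy ones purely for readability, and `distance` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Plain edit-based similarity in [0, 1] (stand-in for fuzz.ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a, b):
    """Similarity after sorting the tokens, so word order is ignored."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def token_set_ratio(a, b):
    """Similarity built around the tokens the two strings have in common."""
    ta, tb = set(a.split()), set(b.split())
    common = " ".join(sorted(ta & tb))
    # shared tokens followed by the tokens unique to each string
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, full_a),
               ratio(common, full_b),
               ratio(full_a, full_b))


def distance(a, b):
    """1 minus the best of the three similarities, on lowercased input."""
    a, b = a.lower(), b.lower()
    return 1.0 - max(ratio(a, b), token_sort_ratio(a, b), token_set_ratio(a, b))


if __name__ == "__main__":
    for left, right in [("MEDICATIONS", "CURRENT MEDICATIONS"),
                        ("IMPRESSION AND PLAN", "PLAN AND IMPRESSION"),
                        ("LAB DATA", "LAB FINDINGS")]:
        print(left, "|", right, "|", round(distance(left, right), 3))
```

In this sketch MEDICATIONS and CURRENT MEDICATIONS end up at distance 0, because the token-set comparison matches the shared token against itself. Taking the maximum of the three scores therefore tends to pull together titles that contain one another's words.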

Finally, I use Scikit-Learn's DBSCAN implementation to cluster the titles. This is done by passing in the distance matrix generated above and setting the metric to precomputed.

# Source: cluster_titles.py
import os
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.io import mmread

OUTPUT_DIR = "/path/to/intermediate/files"

X = mmread(os.path.join(OUTPUT_DIR, "stitles.mtx"))
clust = DBSCAN(eps=0.1, min_samples=5, metric="precomputed")
clust.fit(X)

# print cluster report
stitles = []
ftitles = open(os.path.join(OUTPUT_DIR, "stitles.txt"), 'r')
for line in ftitles:
    stitles.append(line.strip().split("\t")[0])
ftitles.close()

preds = clust.labels_
clabels = np.unique(preds)
for i in range(clabels.shape[0]):
    if clabels[i] < 0:
        continue
    cmem_ids = np.where(preds == clabels[i])[0]
    cmembers = []
    for cmem_id in cmem_ids:
        cmembers.append(stitles[cmem_id])
    print("Cluster#%d: %s" % (i, ", ".join(cmembers)))
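
To make the roles of eps and min_samples more concrete, the following is a minimal pure-Python sketch of the DBSCAN idea on a precomputed distance matrix. It is not scikit-learn's implementation; the function name `dbscan_precomputed` and the toy matrix are made up for illustration. As in scikit-learn, a point's neighborhood includes the point itself and noise points are labeled -1.

```python
def dbscan_precomputed(dist, eps, min_samples):
    """Label points given a square distance matrix (list of lists)."""
    n = len(dist)
    # Neighborhood of i: every point within eps of i, including i itself.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    # A core point has at least min_samples points in its neighborhood.
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n  # -1 means noise / not yet assigned
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Hypothetical distances: points 0-2 are mutually close, point 3 is close
# only to point 2, and point 4 is far from everything.
toy = [
    [0.00, 0.05, 0.05, 0.50, 0.90],
    [0.05, 0.00, 0.05, 0.50, 0.90],
    [0.05, 0.05, 0.00, 0.08, 0.90],
    [0.50, 0.50, 0.08, 0.00, 0.90],
    [0.90, 0.90, 0.90, 0.90, 0.00],
]

if __name__ == "__main__":
    print(dbscan_precomputed(toy, eps=0.1, min_samples=3))
```

In the toy matrix, point 3 joins the cluster through point 2 even though it is far from points 0 and 1. A title only needs to be close to one core member, not to every member, to be pulled into a cluster.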

The output of the clustering is shown below. As you can see, the section titles have been grouped into 26 clusters. In most cases it is easy to see why a title is in a group, i.e., the members share a common word, as with MEDICATIONS, CURRENT MEDICATIONS and DISCHARGE MEDICATIONS. In other cases it is not so obvious, such as HX being in the same cluster as PATIENT HISTORY. In still others there is scope for improvement, such as LAB FINDINGS and LAB DATA appearing in different clusters. Overall, though, it seems to have created fairly reasonable clusters.

Cluster#1: ALLERGIES, MEDICATIONS, CURRENT MEDICATIONS, DISCHARGE MEDICATIONS, 
        MEDICATION, Medications, Allergies, HOME MEDICATIONS, 
        MEDICATIONS AT HOME, MEDICATIONS PRESCRIBED, MEDICATION HISTORY, 
        ALLERGIES TO MEDICINES, ALLERGIES TO MEDICATIONS, 
        MEDICATIONS AT THE TIME OF STUDY, DRUG ALLERGIES, PRESENT MEDICATIONS, 
        MEDICATIONS PRIOR TO ADMISSION, CURRENT MEDICATIONS AT HOME, 
        PREOPERATIVE MEDICATIONS, MEDICATIONS ON TRANSFER, 
        MEDICATIONS ON DISCHARGE, CURRENT MEDICATION, MEDICATION CHANGES
Cluster#2: PREOPERATIVE DIAGNOSIS, POSTOPERATIVE DIAGNOSIS, IMPRESSION, 
        PLAN, DIAGNOSIS, DISCHARGE DIAGNOSIS, ASSESSMENT AND PLAN, 
        FINAL DIAGNOSIS, ADMITTING DIAGNOSIS, ADMISSION DIAGNOSIS, 
        POSTOP DIAGNOSIS, PREOP DIAGNOSIS, SUMMARY, SECONDARY DIAGNOSIS, 
        DIAGNOSTIC IMPRESSION, ASSESSMENT/PLAN:, PREPROCEDURE DIAGNOSIS, 
        POSTPROCEDURE DIAGNOSIS, IMPRESSION AND PLAN, POSTOPERATIVE PLAN, 
        TREATMENT PLAN, IMPRESSION/PLAN:, PROBLEMS/DIAGNOSIS:, 
        PRINCIPAL DIAGNOSIS, DIFFERENTIAL DIAGNOSIS, FINAL IMPRESSION, 
        DIAGNOSIS ON DISCHARGE, ENDOSCOPIC IMPRESSION, INITIAL DIAGNOSIS, 
        CLINICAL IMPRESSION, DIAGNOSIS ON ADMISSION, CURRENT DIAGNOSIS, 
        POSTOPERATIVE DX, DIAGNOSIS AT ADMISSION, 
        DIAGNOSTIC SUMMARY AND IMPRESSION, PROBLEMS DIAGNOSIS, 
        RECOMMENDATIONS AND PLAN, DISCHARGE PLAN, PLAN OF CARE, 
        DISCHARGE SUMMARY, IN SUMMARY, SUMMARY OF TREATMENT PLANNING, 
        IMPRESSION DIAGNOSIS, PREOPERATIVE DIAGNOSIS (ES):, 
        POSTOPERATIVE DIAGNOSIS (ES):, VIDEO EEG DIAGNOSIS, Impression, 
        PRIMARY DIAGNOSIS, DSM IV DIAGNOSIS, PLAN AND RECOMMENDATIONS, 
        DIAGNOSTICS, POST PROCEDURE DIAGNOSIS, SUMMARY OF PROCEDURE, 
        DIAGNOSIS AT DISCHARGE, INITIAL IMPRESSION
Cluster#3: PAST MEDICAL HISTORY, SOCIAL HISTORY, FAMILY HISTORY, 
        PAST SURGICAL HISTORY, HISTORY OF PRESENT ILLNESS, HX, 
        PSYCHIATRIC, HISTORY, PAST PSYCHIATRIC HISTORY, 
        FAMILY MEDICAL HISTORY, SURGICAL HISTORY, PERSONAL HISTORY, 
        MEDICAL HISTORY, Psychiatric, PSYCHIATRIC HISTORY, Social History, 
        LEGAL HISTORY, FAMILY SOCIAL HISTORY, FAMILY PSYCHIATRIC HISTORY, 
        Past Medical History, Family History, PERSONAL AND SOCIAL HISTORY, 
        SOCIAL AND DEVELOPMENTAL HISTORY, INTERVAL HISTORY, PAST HISTORY, 
        BIRTH HISTORY, Past Surgical History, 
        PAST MEDICAL AND SURGICAL HISTORY,
        SUBSTANCE ABUSE HISTORY, CLINICAL HISTORY, 
        SUBSTANCE AND ALCOHOL HISTORY, SOCIAL HX, NUTRITIONAL HISTORY, 
        PAST MEDICAL HX, PRIMARY MEDICAL HISTORY, DEVELOPMENTAL HISTORY, 
        IMMUNIZATION HISTORY, GENETIC PSYCHIATRIC HISTORY, PAST SURGICAL HX, 
        DEVELOPMENTAL MILESTONES, FAMILY AND SOCIAL HISTORY, TRAVEL HISTORY, 
        OBSTETRICAL HISTORY, DEVELOPMENTAL ASSESSMENT, BRIEF HISTORY, 
        BIRTH AND DEVELOPMENTAL HISTORY, GYNECOLOGICAL HISTORY, 
        PSYCHOLOGICAL HISTORY, WORK HISTORY, DEVELOPMENTAL, 
        FAMILY HISTORY AND SOCIAL HISTORY, GYN HISTORY, PAST MEDICAL HISTORY, 
        BIRTH HX, PREVIOUS MEDICAL HISTORY
Cluster#4: AXIS III, AXIS II, AXIS V, AXIS I, AXIS IV, Axis I, Axis II, 
        Axis III, Axis V, Axis IV, AXIS  IV, V, AXIS   II, AXIS  III, 
        AXIS   V, II, AXIS    I
Cluster#5: HOSPITAL COURSE, COURSE, TREATMENT, EMERGENCY DEPARTMENT COURSE, 
        Hospital Course, EMERGENCY ROOM COURSE, HOSPITAL COURSE AND TREATMENT, 
        COURSE IN HOSPITAL, COURSE IN THE ED, TREATMENT RECOMMENDATIONS, 
        COURSE IN THE HOSPITAL, ED COURSE
Cluster#6: PROCEDURE, PROCEDURE PERFORMED, PROCEDURES, 
        DESCRIPTION OF PROCEDURE, PROCEDURE IN DETAIL, PROCEDURES PERFORMED, 
        OPERATIVE PROCEDURE, INDICATIONS, INDICATION, PROCEDURE DETAILS, 
        INDICATIONS FOR PROCEDURE, TITLE OF PROCEDURE, 
        OPERATIVE PROCEDURE IN DETAIL, INDICATION FOR SURGERY, 
        TITLE OF PROCEDURES, OPERATIVE PROCEDURES, PROCEDURE DETAIL, 
        PROCEDURE NOTE, PROCEDURE DONE, INDICATIONS FOR SURGERY, 
        DESCRIPTION OF THE PROCEDURE, DESCRIPTION OF OPERATION, 
        POST PROCEDURE INSTRUCTIONS, DESCRIPTION OF THE OPERATION, 
        INDICATIONS FOR OPERATION, REPORTED PROCEDURE, NAME OF PROCEDURE, 
        PROCEDURES DURING THIS HOSPITALIZATION, 
        PROCEDURES PLANNED AND PERFORMED, INDICATIONS FOR THE PROCEDURE, 
        DESCRIPTION, DETAILS OF THE PROCEDURE, PROCEDURES AND IMMUNIZATIONS, 
        INDICATION FOR STUDY, GROSS DESCRIPTION, PRINCIPAL PROCEDURE, 
        DETAILS OF PROCEDURE, REASON FOR PROCEDURE, 
        CONDITION OF THE PATIENT AT THE END OF THE PROCEDURE, 
        SURGICAL PROCEDURE, INDICATION FOR PROCEDURE, 
        LOCATION OF PROCEDURE, TECHNICAL PROCEDURE, CLINICAL INDICATIONS, 
        PROCEDURES DONE, PROCEDURES UNDERTAKEN, DETAILS OF THE OR, 
        PROCEDURES DURING HOSPITALIZATION, PROCEDURE DESCRIPTION, 
        DESCRIPTION OF FINDINGS, MICROSCOPIC DESCRIPTION, 
        DESCRIPTION OF PROCEDURE IN DETAIL, OPERATIVE PROCEDURE PERFORMED, 
        OPERATION AND PROCEDURE, CLINICAL INDICATION, PROCEDURE REPORT
Cluster#7: NECK, Neck, EYES, HEAD, ROS, Eyes, HEAD AND NECK, Head, 
        ROS Gastrointestinal, ROS Respiratory, ROS Cardiovascular, 
        ROS Head and Eyes, ROS General, Head and Eyes, ROS Musculoskeletal
Cluster#8: PHYSICAL EXAMINATION, REVIEW OF SYSTEMS, GENERAL, NEUROLOGIC, 
        General, Neurological, NEUROLOGICAL, Neurologic, GENERAL APPEARANCE, 
        NEUROLOGICAL EXAMINATION, UROLOGICAL, EXAMINATION, General Appearance, 
        Appearance, MENTAL STATUS EXAMINATION, GENERAL EVALUATION, 
        REASON FOR EXAMINATION, Physical Examination, 
        GENERAL COGNITIVE ABILITY, Review of systems, NEUROLOGIC EXAMINATION, 
        CLINICAL/PHYSICAL EXAMINATION:, FUNCTIONAL EXAMINATION, 
        DISCHARGE PHYSICAL EXAMINATION, SYSTEMS REVIEW, NEUROLOGICAL EXAM, 
        Neurologic examination, MALE PHYSICAL EXAMINATION, General Exam, 
        MICROSCOPIC EXAMINATION, GENERAL PHYSICAL EXAMINATION, 
        FEMALE PHYSICAL EXAMINATION, REASON FOR NEUROLOGICAL CONSULTATION, 
        NEUROLOGICALLY, GENERAL REVIEW OF SYSTEMS
Cluster#9: EXAM, PHYSICAL EXAM, REASON FOR EXAM, GEN, Gen Exam, GEN EXAM, 
        MENTAL STATUS EXAM, Gen exam, Physical Exam, EXAM:MRI LEFT SHOULDER, 
        CARDIOVASCULAR EXAM, Exam Neck, EXAM:MRI LEFT KNEE WITHOUT CONTRAST, 
        Cranial Nerve Exam, RECTAL EXAM, Sensory Exam, MULTISYSTEM EXAM, 
        Motor Exam, COMPARISON EXAM, Gen
Cluster#10: FINDINGS, OPERATIVE FINDINGS, INTRAOPERATIVE FINDINGS, 
        GROSS FINDINGS, ANGIOGRAPHIC FINDINGS, ABNORMAL FINDINGS, 
        MAJOR FINDINGS, GROSS OPERATIVE FINDINGS, 
        GROSS INTRAOPERATIVE FINDINGS, LABORATORY FINDINGS, 
        FINDINGS AT THE TIME OF SURGERY, FINDING, EVALUATION FINDINGS, 
        ENDOSCOPIC FINDINGS, FINDINGS AT OPERATION, DIAGNOSTIC FINDINGS, 
        PHYSICAL FINDINGS
Cluster#11: LABORATORY DATA, DIAGNOSTIC DATA, DEVICE DATA, 
        MEASURED INTRAOPERATIVE DATA, HEMODYNAMIC DATA, LAB DATA, DATA
Cluster#12: SURGERY, DURATION OF SURGERY, 
        CONDITION OF THE PATIENT AFTER SURGERY
Cluster#13: FLUIDS, FLUIDS RECEIVED, IV FLUIDS, INTRAVENOUS FLUIDS, 
        FLUIDS GIVEN, FLUID
Cluster#14: SKIN, Skin, LYMPHATIC, LYMPHATICS, Lymphatic, Lymphatics, 
        Skin and Lymphatics
Cluster#15: LEFT MAIN CORONARY ARTERY, LEFT ANTERIOR DESCENDING ARTERY, 
        LEFT, LEFT LOWER EXTREMITY, LEFT CIRCUMFLEX ARTERY, LEFT VENTRICULOGRAM
Cluster#16: S, S , ICD9 CODE(S):, CANDIDATE'S MOTIVATION TO DONATE:, 
        CPT CODE(S):, CPT4 CODE(S):
Cluster#17: ESTIMATED BLOOD LOSS, BLOOD LOSS, BLOOD, Blood pressure, 
        BLOOD TRANSFUSIONS
Cluster#18: TECHNIQUE, OPERATIVE TECHNIQUE, TECHNIQUE IN DETAIL, 
        IMAGE TECHNIQUE, STRESS TECHNIQUE
Cluster#19: TEST RESULTS, RESULTS, LABORATORY RESULTS, STRESS ECG RESULTS, 
        NUCLEAR RESULTS
Cluster#20: NOSE, Nose, THROAT, Throat, NOSE AND THROAT
Cluster#21: ASSESSMENT, Cognitive Assessment, ASSESSMENTS, 
        ASSESSMENT AND EVALUATION, ASSESSMENT AND RECOMMENDATIONS
Cluster#22: RECOMMENDATIONS, RECOMMENDATION
Cluster#23: CONDITION, DISCHARGE CONDITION, CONDITION ON DISCHARGE, 
        CONDITION UPON DISPOSITION, CONTESTED CONDITION, 
        CONDITION UPON DISCHARGE
Cluster#24: OPERATION PERFORMED, OPERATION, TITLE OF OPERATION, OPERATIONS, 
        NAME OF OPERATION, TITLE OF THE OPERATION, DETAILS OF THE OPERATION, 
        OPERATION IN DETAIL
Cluster#25: CARDIOVASCULAR, Cardiovascular, CARDIOVASCULAR SYSTEM
Cluster#26: LABORATORY STUDIES, LABORATORY, LABORATORY TESTS, 
        LABORATORY VALUES, LABORATORY EVALUATION, LABORATORY WORK

The clustering above is strictly lexical. Even the groupings that appear somewhat magical are rooted in lexical similarity with one or more elements in the cluster. This may not always be what we want. For example, one would probably want to separate FAMILY HISTORY and PAST MEDICAL HISTORY from current patient history, but at the moment these appear in the same cluster. Still, perhaps we can use the output of this clustering as a first step toward fine-tuning the cluster memberships.

4 comments (moderated to prevent spam):

Anonymous said...

Hi Sujit:

I am a young data analyst; very eye openning and good learning stuff for me; regarding to this blog, I have a confusion, forgive me if I made any stupid mistakes since I am very new to python and text clustering coding;

stitles.txt stores medical terms you extracted from notes;

stitles.mtx: you filtered stitles terms if count >2; then calculated similarity ratios and saved into this matrix from X;

in your last code; you are trying to extract and print out medical terms which with the same clust.labels_ number FROM stitles.txt again? while stitles.mtx is filtered result from original stitles.txt; why you try to pull result from stitles.txt instead of write filtered results into a new text file and read, and extract and print from there?

Regards
Hao

HAODING88@GMAIL.COM

Sujit Pal said...

Thank you for the kind words Hao, I am glad you found the post helpful. Your approach is also correct, I could have written out the filtered list and looked up against that. However, since my list is sorted by frequency to begin with, chopping off the low frequency tail still retains the (id, title) mapping that I am using to relate the matrix rows in the mtx file to the actual titles.

Zhang Mengdi said...

thx for the blogging. Very helpful.

Sujit Pal said...

Thank you Zhang, glad you found it helpful.