Saturday, May 31, 2014

Outlier Detection on Medical Claims


I have been analysing some (anonymized) Medicare/Mediclaim data. In this post, I describe how I used this data to build a model that could be used to detect fraudulent/abnormal claims.

A claim consists of a set of medical codes. Each code can refer to a procedure performed, drug administered, diagnosis made, or item provided by the provider (your doctor or hospital) to the patient (you) during an encounter (a doctor's visit). Each code has a price that is pre-negotiated between the insurance company and the service provider. The claim represents the invoice the provider sends to your insurance company after deducting your co-pay. One way a service provider could defraud the insurance company is to "accidentally" drop in high-priced procedure codes that don't "belong".

The Medicare/Mediclaim data provides a large number of outpatient claim records, which can be used to build a model of a "normal" claim. The intuition is as follows - some codes naturally go together, such as those for a blood test and a disposable blood-drawing kit, or a diagnosis of a fracture and splints. While the codes used in each encounter can vary with age, gender or pre-existing conditions even for the same service, computing this "similarity" between different codes can help us find codes that don't belong.

In our model, the similarity sim_ij between two codes i and j is the number of claims in the dataset in which the two codes co-occur. From this we derive a distance d_ij = 1 / sim_ij. We then treat each claim as a cluster of codes and reduce it to an average cluster diameter: the square root of the sum of squared distances over all code pairs (i, j) with i < j in the claim, divided by the number of pairs, i.e. sqrt(Σ d_ij²) / N, where N is the number of pairs.
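
As a quick sanity check of these definitions, here is a toy sketch in plain Python (the claims and codes below are made up, not from the CMS data) that computes the pairwise similarities and the resulting cluster diameters:

from collections import Counter
from itertools import combinations
import math

# made-up claims, each modeled as a set of codes
claims = [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}, {"A", "B"}]

# sim_ij = number of claims in which codes i and j co-occur
sim = Counter()
for claim in claims:
    for i, j in combinations(sorted(claim), 2):
        sim[(i, j)] += 1

EPSILON = 0.1  # fallback for code pairs that never co-occur (the code below uses the same trick)

def diameter(claim):
    # sqrt of the sum of squared pairwise distances, divided by the number of pairs
    sq_dists = []
    for i, j in combinations(sorted(claim), 2):
        s = sim.get((i, j), 0) or EPSILON
        sq_dists.append((1.0 / s) ** 2)  # d_ij = 1 / sim_ij
    return math.sqrt(sum(sq_dists)) / len(sq_dists) if sq_dists else float("nan")

print(diameter({"A", "B"}))  # small: A and B co-occur in 3 of the 4 claims
print(diameter({"A", "Z"}))  # large: Z never co-occurs with A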

This is more of an exploration to see if such a model makes sense. I used Python and its data science libraries (numpy, scipy, matplotlib, scikit-learn, pandas, etc) to work on a subset (approximately 5%) of the data. I first read the data into a pandas DataFrame, then extract only the codes from each claim. Then I use scikit-learn's CountVectorizer to create a binarized document-term matrix (a claim-code matrix in this case).

from __future__ import division
import itertools
import numpy as np
import os.path
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def join_codes(row):
    # concatenate the non-null code values in a claim row into a "bag of codes" string
    return " ".join([str(v) for i, v in row.iteritems() if pd.notnull(v)])
    
DATA_DIR = "/path/to/claims/data"
EPSILON = 0.1

# extract codes as bag of codes from input
opdf = pd.read_csv(
    os.path.join(DATA_DIR, "DE1_0_2008_to_2010_Outpatient_Claims_Sample_1.csv"),
    low_memory=False)
opdf.head()

colnames = [colname for colname in opdf.columns if "_CD_" in colname]
bcdf = opdf.ix[:, colnames].apply(join_codes, axis=1)

# build a code-document matrix out of the codes
vec = CountVectorizer(min_df=1, binary=True)
X = vec.fit_transform(bcdf)

This creates a 790790x17023 sparse matrix in which 0.04% of the elements are non-zero. Taking a cue from text mining, we can compute the similarity between codes as the matrix product of the transpose of X with X (X.T * X). Further, since our matrix is binarized, the matrix product contains the raw co-occurrence counts. While this approach seems a bit unintuitive (unless you have done this before), and even a bit wasteful since we will use only half of what we compute (the upper or lower triangle), it takes advantage of numpy's vectorized operations and is orders of magnitude faster than looping through the data manually to calculate the similarities.

sim = X.T * X
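
To see why this works, here is a tiny toy check (the 3x4 matrix below is made up): for a binarized claim-code matrix, entry (i, j) of the product counts the claims in which codes i and j both appear, and the diagonal counts how many claims each code appears in.

import numpy as np
from scipy.sparse import csr_matrix

# 3 toy "claims" over 4 "codes"; a 1 means the code appears in the claim
X_toy = csr_matrix(np.array([[1, 1, 0, 0],
                             [1, 1, 1, 0],
                             [0, 1, 0, 1]]))
cooc = (X_toy.T * X_toy).toarray()
print(cooc)   # cooc[0, 1] == 2: codes 0 and 1 co-occur in two claims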

The similarity matrix is a 17023x17023 sparse square matrix (slightly denser than X, with 2.2% of the elements non-zero). We use the similarities to compute the average diameter of the code cluster in each claim, thereby reducing each claim to a single number. This is done within a for-loop that performs an all-pairs join over the codes of each record, which is O(n²) in the number of codes per claim, and takes a while (about 6 hours in my case) to run.

fout = open(os.path.join(DATA_DIR, "clusters.txt"), 'wb')
for row in range(0, X.shape[0]):
    # column indices of the codes present in this claim
    codes = X[row, :].nonzero()[1]
    dists = []
    for i, j in itertools.product(codes, codes):
        if i < j:
            sim_ij = sim[i, j]
            if sim_ij == 0:
                sim_ij = EPSILON
            dists.append(1 / (sim_ij ** 2))  # squared distance, d_ij = 1 / sim_ij
    # single-code claims have no pairs; 0/0 is written out as "nan"
    fout.write("%f\n" % (np.sqrt(sum(dists)) / len(dists)))
fout.close()
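
As an aside, one possible way to speed this up (a sketch, not benchmarked against the loop above) is to pull out the small dense submatrix of similarities for the codes in each claim and compute all the pairwise distances at once, instead of looking up sim[i, j] one element at a time:

import numpy as np

sim_csr = sim.tocsr()

def claim_diameter(code_idxs, epsilon=0.1):
    # average code-cluster diameter for one claim, given the column indices of its codes
    n = len(code_idxs)
    if n < 2:
        return float("nan")  # single-code claims have no pairs
    sub = sim_csr[code_idxs, :][:, code_idxs].toarray().astype(float)
    iu = np.triu_indices(n, k=1)   # all pairs with i < j
    sims = sub[iu]
    sims[sims == 0] = epsilon      # same fallback as above
    sq_dists = 1.0 / (sims ** 2)   # d_ij^2 = (1 / sim_ij)^2
    return np.sqrt(sq_dists.sum()) / len(sq_dists)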

The file contains a series of numbers, each representing the code cluster diameter for a single claim. This is now a univariate distribution, and we can use the techniques described in this paper on univariate outliers (PDF) to detect outliers. Essentially, for normal distributions, elements farther than 2 standard deviations from the mean (roughly the 95th percentile level) can be considered mild outliers, and those farther than 3 standard deviations from the mean (roughly the 99th percentile level) can be considered strong outliers. Mean-based measures are sensitive to the presence of outliers, however, so one can also use a median-based measure. In that case, mild outliers are those outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR) and strong outliers are those outside the range (Q1 - 3*IQR, Q3 + 3*IQR), where the interquartile range IQR = Q3 - Q1.
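
If the diameters fit in memory, both rules are only a few lines of numpy. Here is a minimal sketch, assuming the clusters.txt file written above is available (the "nan" lines from single-code claims are dropped first):

import numpy as np
import os.path

d = np.genfromtxt(os.path.join(DATA_DIR, "clusters.txt"))
d = d[~np.isnan(d)]               # drop single-code claims

mu, sigma = d.mean(), d.std()
mild_mean   = (mu - 2 * sigma, mu + 2 * sigma)   # ~95% band if the data were normal
strong_mean = (mu - 3 * sigma, mu + 3 * sigma)   # ~99% band

q1, q3 = np.percentile(d, [25, 75])
iqr = q3 - q1
mild_iqr   = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
strong_iqr = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)

print(mild_mean, strong_mean, mild_iqr, strong_iqr)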

Since the data (even at 5% of the total volume) is still quite large, I initially thought I would have to implement streaming algorithms myself - Welford's method for the streaming mean and standard deviation, and the P-square algorithm for streaming quantiles - but the Python library livestats already implements both, so I just used that.
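
For reference, the Welford update itself is tiny; the sketch below (a bare-bones version, not the livestats implementation) shows the running mean/variance recurrence. The P-square quantile update is more involved and omitted here.

def welford(stream):
    # single-pass (streaming) mean and variance
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return mean, variance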

The code below generates some visualizations using the cluster diameter data we just generated, from which we can extract cutoffs for prediction.

from livestats import livestats
import math
import matplotlib.pyplot as plt
import numpy as np
import os.path

EPSILON = 0.0001

def summary(stats):
    summary = {}
    summary["mean"] = stats.mean()
    summary["stdev"] = math.sqrt(stats.variance())
    q1, q2, q3 = [q[1] for q in stats.quantiles()]
    summary["q1"] = q1
    summary["q2"] = q2
    summary["q3"] = q3
    return summary    
    
def norm(x, mu, sigma):
    return (1 / (sigma * np.sqrt(2 * np.pi)) *
        np.exp(-(x - mu)**2 / (2 * sigma**2)))

lns = 0
fin = open(os.path.join(DATA_DIR, "clusters.txt"), 'rb')
stats = livestats.LiveStats([0.25, 0.5, 0.75])
xs = []
for line in fin:
    line = line.strip()
    lns += 1
    x = EPSILON if line == "nan" else float(line)
    x = math.log(x, math.e)
    xs.append(x)
    # add a mirror image to make it approximately normal
#    xs.append(-x)
    stats.add(x)
#    stats.add(-x)
fin.close()

# plot data for visualization
mu = stats.mean()
sigma = math.sqrt(stats.variance())
count, bins, ignored = plt.hist(xs, bins=100, normed=True)
plt.plot(bins, [norm(x, mu, sigma) for x in bins], linewidth=2, color='r')
max_y = 0.5
#max_y = 10

# mean +/- (2*sigma or 3*sigma)
lb2 = mu - (2 * sigma)
ub2 = mu + (2 * sigma)
lb3 = mu - (3 * sigma)
ub3 = mu + (3 * sigma)
plt.plot([lb2, lb2], [0, max_y], 'r--')
plt.plot([ub2, ub2], [0, max_y], 'r--')
plt.plot([lb3, lb3], [0, max_y], 'r-')
plt.plot([ub3, ub3], [0, max_y], 'r-')

# median based (interquartile range based outlier measure)
q1, q2, q3 = [q[1] for q in stats.quantiles()]
iqr = q3 - q1
lb2 = q1 - (1.5 * iqr)
ub2 = q3 + (1.5 * iqr)
lb3 = q1 - (3.0 * iqr)
ub3 = q3 + (3.0 * iqr)
plt.plot([lb2, lb2], [0, max_y], 'g--')
plt.plot([ub2, ub2], [0, max_y], 'g--')
plt.plot([lb3, lb3], [0, max_y], 'g-')
plt.plot([ub3, ub3], [0, max_y], 'g-')

print(summary(stats))

plt.show()

Most of the cluster diameters are close to 0, and the distribution resembles a Zipf distribution. I tried to fake a normal distribution by adding a mirror image of the data (the commented-out xs.append(-x) and stats.add(-x) lines in the code above). I also tried to fit a normal curve of area 1 to the distribution, but as you can see, the fitted normal curve is shorter and fatter than the actual distribution.


The vertical red lines (solid and dotted) indicate the strong and mild outlier cutoffs based on the mean, and the vertical green lines (solid and dotted) indicate the strong and mild outlier cutoffs based on the median (IQR). However, because the distribution is clearly not normal, we cannot draw any conclusions from this.

I then tried a log transform, which fits a normal distribution much better, except that it is skewed to the left. Here is what the log of the diameters looks like:


As you can see, there is no outlier data at all on the right-hand side. We are not really interested in outliers on the left, since they represent super-tight clusters of codes. So we abandon this approach and instead look for the 95th and 99th percentile points in the original Zipf-like distribution.

def compute_cutoff(level, cys):
    # index of the first bin, scanning from the right, whose cumulative count
    # drops below the requested level
    for i in range(len(cys), 0, -1):
        if cys[i-1] < level:
            return i
    return -1

counts, bins, ignored = plt.hist(xs, bins=100)
cumsums = np.cumsum(counts)
plt.plot(bins[:-1], cumsums, color='red')

max_cy = len(xs)
# convert the bin index into an approximate x position (this assumes the
# diameters span roughly [0, 1]; bins[index] would give the exact bin edge)
strong_xcut = compute_cutoff(0.99 * max_cy, cumsums) / len(bins)
mild_xcut = compute_cutoff(0.95 * max_cy, cumsums) / len(bins)

print (strong_xcut, mild_xcut)

plt.plot([strong_xcut, strong_xcut], [0, max_cy], 'g-')
plt.plot([mild_xcut, mild_xcut], [0, max_cy], 'g--')

plt.show()

This produces the output shown below. The red curve plots the cumulative frequency distribution, and the green solid and dotted lines represent the 99th and 95th percentile cutoffs respectively.


Although there are some items beyond the 99th percentile mark, inspection reveals that these are all single rarely occurring codes (codes that occur only once in the dataset), which is not interesting. So we look at the mild outliers (beyond the 95th percentile mark) ordered by descending cluster diameter, ignoring diameters of exactly 1 (which, as noted before, point to single rare codes).

26
import operator

strong_outlier_cutoff = 0.9801980198019802
mild_outlier_cutoff = 0.36633663366336633

fin = open(os.path.join(DATA_DIR, "clusters.txt"), 'rb')
outliers = []
idx = 0
for line in fin:
    line = line.strip()
    x = EPSILON if line == "nan" else float(line)
    # keep mild outliers, ignoring diameters of exactly 1.0 (single rare codes)
    if x > mild_outlier_cutoff and x < 1.0:
        outliers.append((idx, x))
    idx += 1
fin.close()

# find corresponding claim ids and claims for verification,
# taking the top 10 by descending cluster diameter
outliers_sorted = sorted(outliers, key=operator.itemgetter(1),
                         reverse=True)[0:10]
colnames = ["CLM_ID"]
colnames.extend([colname for colname in opdf.columns if "_CD_" in colname])
for idx, score in outliers_sorted:
    claim_id = opdf.ix[idx, colnames[0]]
    codes = opdf.ix[idx, colnames[1:]]
    names = ["_".join(x.split("_")[0:2]) for x in colnames[1:]]
    code_names = [":".join([x[0], x[1]]) for x in zip(names, codes.values) 
                                         if pd.notnull(x[1])]
    print("%s %6.4f %s" % (claim_id, score, str(code_names)))

This returns the following result. I am not conversant enough with medical codes to tell if these are true anomalies, but I would be willing to bet that these are relatively rare.

542382281346896 0.5000 ['ICD9_DGNS:99673', 'HCPCS_CD:11042']
542962281495106 0.4787 ['ICD9_DGNS:99657', 'HCPCS_CD:11042', 'HCPCS_CD:94375']
542332281612480 0.4715 ['ICD9_DGNS:7921', 'ICD9_DGNS:5533', 'HCPCS_CD:A5061']
542482280853055 0.5774 ['ICD9_DGNS:V622', 'ICD9_DGNS:V653', 'HCPCS_CD:90717']
542142281445697 0.3727 ['ICD9_DGNS:6119', 'ICD9_DGNS:71949', 'HCPCS_CD:76645']
542352281341634 0.4787 ['ICD9_DGNS:7948', 'ICD9_DGNS:5739', 'ICD9_DGNS:V7649']
542142281439193 0.3745 ['ICD9_DGNS:7260', 'ICD9_DGNS:7231', 'HCPCS_CD:20550']
542492281574929 0.3727 ['ICD9_DGNS:6279', 'ICD9_DGNS:7822', 'HCPCS_CD:77080']
542362281032984 0.4718 ['ICD9_DGNS:07799', 'ICD9_DGNS:2724', 'HCPCS_CD:87252']
542072280995842 0.3727 ['ICD9_DGNS:496', 'ICD9_DGNS:V601', 'HCPCS_CD:97110']

Another good introduction to outliers in general is this teaser chapter from the book Outlier Analysis by Charu Aggarwal. I got to read it towards the tail end of this work, and found that proximity-based outlier detection is a recognized approach. It felt great to have figured this out independently :-).

I also started using the Spyder IDE (part of the Anaconda distribution that I use) for this work, and I loved it! It's a bit like RStudio: it lets you run scripts or code blocks and then inspect the values as you go. It's a huge timesaver (and more convenient than importing scripts into the Python shell) when you are dealing with large amounts of data. There is scope for improvement - the key mappings are a bit non-intuitive for me and key remapping does not work, and large jobs tend to hang the IDE - but it's still a huge convenience to be able to look at your data and try various things with it without having to rerun the entire script from scratch.

I would also like to thank my colleagues Bijal Shah, who confirmed my hypothesis that the practice of switching codes in insurance claims does exist, and Roger Yeung who pointed me to the Medicare/Mediclaim dataset.

All the code for this post is in my GitHub repository. Would appreciate hearing from you if you know of better ways of finding outliers.

12 comments (moderated to prevent spam):

Al Krinker said...

Excellent post. I am new to Python (mainly use Java at work) and it was great to learn about all these libraries that I can use vs implementing my own logic.

Question: Can you share your data or at least small sample of it? I would like to run the examples locally to get a better understanding of it.

Thank you
Al

Sujit Pal said...

Thanks Al, glad it helped. You can download the data yourself using this link (this is the same as the very first link in the post).

Weisse said...

This is a great contribution, thanks for sharing. I work with claims data regularly and have previously been in charge of identifying fraud within a state Medicaid program using techniques far less sophisticated than what you've shown here. I'm excited to see your resolution to the ETG algorithm, I'd love to try something similar using python and state 'all-payer' datasets that are more complete within a geography.

Sujit Pal said...

Thanks for the kind words, Weisse. Regarding the ETG algorithm described here, the implementation of DBSCAN is incorrect; since you are working with Python, maybe use the one available in Scikit-Learn.

sreejith said...

Hi Sujit,

Excellent post. Can you please give some clues on "clusters.txt" file which you have used in the code.

Thanks,
Srijith

Sujit Pal said...

Thanks Sreejith. The clusters.txt is basically a list of similarities between all pairs of claims, where each claim is modeled as a set of procedure/diagnostic codes. I treat this list as a univariate distribution for outlier analysis.

Anonymous said...

Hi Sujit,
Excellent work and thanks for sharing. Can you please tell me how you get the files clusters.txt and vocab.csv?
Thanks
Sammy

Sujit Pal said...

Hi Sammy, thanks, glad you liked it. I don't see vocab.csv anywhere in this post (perhaps this is from another post?), but clusters.txt is just the "cluster size" of the procedure codes present in each claim record. I have already calculated the similarity between individual procedures across all claims, so now I just calculate the cluster size of the procedure as the square root of the sum of all code pair similarities squared within the claim.

Anonymous said...

Thanks for the prompt response. The file vocab.csv was in the code on GitHub linked in this post (https://github.com/sujitpal/mlia-examples/blob/master/src/proc_outliers/predict.py). I have used the code in the current post and it worked fine on CMS data. I am trying to understand it better and see if it can be useful in my case and with the data that I have. I see that the CMS data from which this code reads is already organized into Episodes of care. The data that I have is not in Episode format. It is just a sequence of claims, one per procedure code, with date of service, NPI, .....
Do you have any algorithm/code to make these episodes? I would appreciate it if you could share it, if you have any, or point me to any other blog.
Thanks a lot
Sammy

Sujit Pal said...

I looked at the code, looks like I may have created a {code:integer} mapping hoping to use it later, but ended up never using it (if you look for "vocab" you will see it gets created but never used). May I know how you saw that the CMS data is already organized into Episodes of Care? I wasn't able to figure this out, so I built something to cluster the claims data, treating each claim as a bag of procedure codes, but I think there was a bug and I haven't had a chance to look at the code since. But if they are already grouped, perhaps this is not even necessary in my case?

One other idea that I had suggested to someone working with data similar to yours for figuring out Episode boundaries was to construct a timeline of expense by beneficiary. Dips in the expense would signal episode boundaries. This is based on the intuition that people are mostly not sick, so their medical expenses are flat most of the time except when they are sick and are in the process of being treated. Never got around to testing this idea with this data though.

Anonymous said...

Hi Sujit,
Thank you for the response and feedback. I assumed that the CMS data is already organized into Episodes of care because each claim has many HCPCS_CD_X fields (X = 1,...,45). The data that I have has only one HCPCS_CD per claim.
Do you think that your current code will work on data that is organized with one HCPCS_CD per claim?
Best regards,
Sammy

Sujit Pal said...

Thanks for the hint, and sorry, I mixed up Episodes and Episode Groups. Of course you are right: in the CMS data each claim record represents a single episode, and the suggestions I had before about clustering and computing expense time series were about finding Episode Groups rather than Episodes. And no, my code won't work with your format; it's built around the idea that each claim is a bag of codes and I am trying to find bags of codes that are very unlike most of the other bags, so your data would need to be grouped into episodes first. One possibility could be to first group by patient ID, then within each patient, sort by timestamp and cluster records so that the current record is within 60 minutes (or some other suitable interval) of the previous record.