I have been analysing some (anonymized) Medicare/Mediclaim data. In this post, I describe how I used this data to build a model that could be used to detect fraudulent/abnormal claims.

A claim consists of a set of medical codes. Each code can refer to a procedure performed, drug administered, diagnosis made, or item provided by the provider (your doctor or hospital) to the patient (you) during an encounter (a doctor's visit). Each code has a price that is pre-negotiated between the insurance company and the service provider. The claim represents the invoice the provider sends to your insurance company after deducting your co-pay. One way a service provider could defraud the insurance company is to "accidentally" drop in high-priced procedure codes that don't "belong".

The Medicare/Mediclaim data provides a large number of outpatient claim records, which can be used to build a model of a "normal" claim. The intuition is as follows - some codes naturally go together, such as that for a blood test and a disposable blood-drawing kit, or a diagnosis of a fracture and splints. While the codes used in each encounter can vary depending on age, gender or pre-existing condition even for the same service, computing this "similarity" between different codes can help us find codes that don't belong.

In our model, the similarity sim

_{ij}between two codes i and j is calculated by counting how many times they co-occur across claims in the dataset. We then compute the average distance d

_{ij}as 1/sim

_{ij}. We then consider each claim as a cluster of codes, and compute the average diameter of the cluster as the square root of the sum of squared distances between all pairs of codes in the claim, ie √(Σ

_{i,j}d

_{ij}

^{2}).

This is more of an exploration to see if such a model makes sense. I used Python and its data science libraries (numpy, scipy, matplotlib, scikits-learn, pandas, etc) to work on a subset (approximately 5%) of the data. I first read the data into a pandas DataFrame, then extract only the codes from each claim. Then I use scikit-learn's CountVectorizer to create a binarized term-document matrix (or code-claim matrix in this case).

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | ```
from __future__ import division
import itertools
import numpy as np
import os.path
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
def join_codes(row):
return " ".join([str(v) for i, v in row.iteritems() if pd.notnull(v)])
DATA_DIR = "/path/to/claims/data"
EPSILON = 0.1
# extract codes as bag of codes from input
opdf = pd.read_csv(
os.path.join(DATA_DIR, "DE1_0_2008_to_2010_Outpatient_Claims_Sample_1.csv"),
low_memory=False)
opdf.head()
colnames = [colname for colname in opdf.columns if "_CD_" in colname]
bcdf = opdf.ix[:, colnames].apply(join_codes, axis=1)
# build a code-document matrix out of the codes
vec = CountVectorizer(min_df=1, binary=True)
X = vec.fit_transform(bcdf)
``` |

This creates a 790790x17023 sparse matrix on which 0.04% elements are non-zero. Taking a cue from text mining, we can compute the similarity between codes as X

^{T}*X. Further, since our matrix is binarized, the matrix product will contain the co-occurrence counts. While this approach seems a bit unintuitive (unless you have done this before), and even a bit wasteful since we will use only half of what we compute (the top or bottom triangle), this takes advantages of numpy's vector optimizations and is orders of magnitude faster than looping through the data manually to calculate the similarity.

1 | ```
sim = X.T * X
``` |

The similarity matrix is a 17023x17023 sparse square matrix (slightly denser than X, with 2.2% elements being non-zero). We use the similarities to compute the average diameter of the code cluster in each claim, thereby reducing each claim to a single number. This is done within a for-loop and does an all-pairs join on each record with a O(n

^{2}) complexity, and takes a while (6 hours in my case) to generate.

1 2 3 4 5 6 7 8 9 10 11 12 | ```
fout = open(os.path.join(DATA_DIR, "clusters.txt"), 'wb')
for row in range(0, X.shape[0]):
codes = [code for code in X[row, :].nonzero()][1]
dists = []
for i, j in itertools.product(codes, codes):
if i < j:
sim_ij = sim.getrow(i).todense()[:, j][0]
if sim_ij == 0:
sim_ij = EPSILON
dists.append(1 / (sim_ij ** 2))
fout.write("%f\n" % (np.sqrt(sum(dists)) / len(dists)))
fout.close()
``` |

The file contains a series of numbers, each number representing the code cluster diameter for a single claim. This is now a univariate distribution, and we can use techniques described in this paper on univariate outliers (PDF) to detect outliers. Essentially, for normal distributions, elements farther than 2 standard deviations from the mean (at the 95 percentile level) can be considered as mild outliers, and those that are farther than 3 standard deviations from the mean (at the 99 percentile level) can be considered as strong outliers. Mean based measures are sensitive to the presence of outliers, however, so one can also use the median based measure. In this case, mild outliers are those outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR) and strong outliers are those outside the range (Q1 - 3*IQR, Q3 + 3*IQR), where the interquartile range IQR = Q3 - Q1.

Since the data (even at 5% of the total volume) is still quite large, I initially thought that I would have to compute mean and median using streaming algorithms - Welford's method for streaming standard deviation (and mean) and the P-square algorithm for streaming quantiles - the Python library livestats implements these algorithms, so I just used that:

The code below generates some visualizations using the cluster diameter data we just generated, from which we can extract cutoffs for prediction.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 | ```
from livestats import livestats
import math
import os.path
EPSILON = 0.0001
def summary(stats):
summary = {}
summary["mean"] = stats.mean()
summary["stdev"] = math.sqrt(stats.variance())
q1, q2, q3 = [q[1] for q in stats.quantiles()]
summary["q1"] = q1
summary["q2"] = q2
summary["q3"] = q3
return summary
def norm(x, mu, sigma):
return (1 / (sigma * np.sqrt(2 * np.pi)) *
np.exp(-(x - mu)**2 / (2 * sigma**2)))
lns = 0
fin = open(os.path.join(DATA_DIR, "clusters.txt"), 'rb')
stats = livestats.LiveStats([0.25, 0.5, 0.75])
xs = []
for line in fin:
line = line.strip()
lns += 1
x = EPSILON if line == "nan" else float(line)
x = math.log(x, math.e)
xs.append(x)
# add a mirror image to make it approximately normal
# xs.append(-x)
stats.add(x)
# stats.add(-x)
fin.close()
# plot data for visualization
mu = stats.mean()
sigma = math.sqrt(stats.variance())
count, bins, ignored = plt.hist(xs, bins=100, normed=True)
plt.plot(bins, [norm(x, mu, sigma) for x in bins], linewidth=2, color='r')
max_y = 0.5
#max_y = 10
# mean +/- (2*sigma or 3*sigma)
lb2 = mu - (2 * sigma)
ub2 = mu + (2 * sigma)
lb3 = mu - (3 * sigma)
ub3 = mu + (3 * sigma)
plt.plot([lb2, lb2], [0, max_y], 'r--')
plt.plot([ub2, ub2], [0, max_y], 'r--')
plt.plot([lb3, lb3], [0, max_y], 'r-')
plt.plot([ub3, ub3], [0, max_y], 'r-')
# median based (interquartile range based outlier measure)
q1, q2, q3 = [q[1] for q in stats.quantiles()]
iqr = q3 - q1
lb2 = q1 - (1.5 * iqr)
ub2 = q3 + (1.5 * iqr)
lb3 = q1 - (3.0 * iqr)
ub3 = q3 + (3.0 * iqr)
plt.plot([lb2, lb2], [0, max_y], 'g--')
plt.plot([ub2, ub2], [0, max_y], 'g--')
plt.plot([lb3, lb3], [0, max_y], 'g-')
plt.plot([ub3, ub3], [0, max_y], 'g-')
print summary(stats)
plt.show()
``` |

The maximum number of clusters peak close to 0 diameter, the distribution resembles a Zipf distribution. I tried to fake a normal distribution by adding a mirror image (commented out lines in the xs.append(-x) and stats.add(-x) in the code above). I also tried to fit a normal curve of area 1 to the distribution, but as you can see, the normal distribution for the standard deviation is shorter and fatter than the actual distribution.

The vertical red lines (full and dotted) indicate the strong and weak outliers based on the mean, and the vertical green lines (full and dotted) indicate the strong and weak outliers based on the median. However, because the distribution is clearly not normal, we cannot draw any conclusions from this.

I then tried to do a log transform, which fits much better to a normal distribution, except that it is skewed to the left. Here is what the log of the diameters looks like:

As you can see, there is absolutely no outlier data on the right hand side. We are not really interested in outliers on the left, since they represent super-tight clusters of codes. So we abandon this approach and try to find the 95% and 99% percentile points in the Zipf distribution.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | ```
def compute_cutoff(level, cys):
for i in range(len(cys), 0, -1):
if cys[i-1] < level:
return i
return -1
counts, bins, ignored = plt.hist(xs, bins=100)
cumsums = np.cumsum(counts)
plt.plot(bins[:-1], cumsums, color='red')
max_cy = len(xs)
strong_xcut = compute_cutoff(0.99 * max_cy, cumsums) / len(bins)
mild_xcut = compute_cutoff(0.95 * max_cy, cumsums) / len(bins)
print (strong_xcut, mild_xcut)
plt.plot([strong_xcut, strong_xcut], [0, max_cy], 'g-')
plt.plot([mild_xcut, mild_xcut], [0, max_cy], 'g--')
plt.show()
``` |

This produces the output as shown below. The red curve plots the cumulative frequency distribution, and the green solid and dotted lines represent the 99 and 95 percentile cutoffs respectively.

Although there are some items beyond the 99 percentile mark, inspection reveals that these are all single rarely occurring codes (occurs once in the dataset), which is not interesting. So we look at the mild outliers (95 percentile mark) ordered by descending cluster diameter, ignoring diameters of 1 (which point to single rare codes which are not interesting as noted before).

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | ```
strong_outlier_cutoff = 0.9801980198019802
mild_outlier_cutoff = 0.36633663366336633
fin = open(os.path.join(DATA_DIR, "clusters.txt"), 'rb')
outliers = []
idx = 0
for line in fin:
line = line.strip()
x = EPSILON if line == "nan" else float(line)
if x > mild_outlier_cutoff and x < 1.0:
outliers.append((idx, x))
idx += 1
fin.close()
# find corresponding claim ids and claims for verification
outliers_sorted = sorted(outliers, key=operator.itemgetter(0),
reverse=True)[0:10]
colnames = ["CLM_ID"]
colnames.extend([colname for colname in opdf.columns if "_CD_" in colname])
for idx, score in outliers_sorted:
claim_id = opdf.ix[idx, colnames[0]]
codes = opdf.ix[idx, colnames[1:]]
names = ["_".join(x.split("_")[0:2]) for x in colnames[1:]]
code_names = [":".join([x[0], x[1]]) for x in zip(names, codes.values)
if pd.notnull(x[1])]
print("%s %6.4f %s" % (claim_id, score, str(code_names)))
``` |

This returns the following result. I am not conversant enough with medical codes to tell if these are true anomalies, but I would be willing to bet that these are relatively rare.

1 2 3 4 5 6 7 8 9 10 | ```
542382281346896 0.5000 ['ICD9_DGNS:99673', 'HCPCS_CD:11042']
542962281495106 0.4787 ['ICD9_DGNS:99657', 'HCPCS_CD:11042', 'HCPCS_CD:94375']
542332281612480 0.4715 ['ICD9_DGNS:7921', 'ICD9_DGNS:5533', 'HCPCS_CD:A5061']
542482280853055 0.5774 ['ICD9_DGNS:V622', 'ICD9_DGNS:V653', 'HCPCS_CD:90717']
542142281445697 0.3727 ['ICD9_DGNS:6119', 'ICD9_DGNS:71949', 'HCPCS_CD:76645']
542352281341634 0.4787 ['ICD9_DGNS:7948', 'ICD9_DGNS:5739', 'ICD9_DGNS:V7649']
542142281439193 0.3745 ['ICD9_DGNS:7260', 'ICD9_DGNS:7231', 'HCPCS_CD:20550']
542492281574929 0.3727 ['ICD9_DGNS:6279', 'ICD9_DGNS:7822', 'HCPCS_CD:77080']
542362281032984 0.4718 ['ICD9_DGNS:07799', 'ICD9_DGNS:2724', 'HCPCS_CD:87252']
542072280995842 0.3727 ['ICD9_DGNS:496', 'ICD9_DGNS:V601', 'HCPCS_CD:97110']
``` |

Another good introduction to outliers in general is this teaser chapter in the book Outlier Analysis by Charu Aggarwal. I got to read this towards the tail end of this work, and found that proximity based outlier detection is a valid approach. Felt great to have figured this out independently :-).

I also started using the Spyder IDE (part of the Anaconda distribution that I use) for this work, and I loved it! Its a bit like R-Studio, allows you to run scripts or code blocks, then inspect the values as you run. Its a huge timesaver (and more convenient than importing scripts into the python shell) when you are dealing with large amounts of data. There is scope for improvement - key mappings are a bit non-intuitive for me and key remapping does not work, and large jobs tend to hang the IDE, but its still a huge convenience to be able to look at your data and try various things with it without having to rerun the entire script from scratch.

I would also like to thank my colleagues Bijal Shah, who confirmed my hypothesis that the practice of switching codes in insurance claims does exist, and Roger Yeung who pointed me to the Medicare/Mediclaim dataset.

All the code for this post is in my GitHub repository. Would appreciate hearing from you if you know of better ways of finding outliers.