## Friday, July 18, 2014

### Clustering Medical Procedure Codes with Scalding

A colleague pointed out that there exists an inverse use-case to finding outliers in medical claims. This one is to group procedure codes into clusters, or what health insurance companies call Episode Treatment Groups (ETG). Essentially, an ETG is a way to cluster a group of services (procedures) into a medically relevant unit.

The CMS.gov dataset provides slightly under 16 million anonymized outpatient claims for Medicare/Medicaid patients. Each outpatient record can have up to 6 ICD-9 procedure codes, up to 10 ICD-9 diagnosis codes, and up to 45 HCPCS codes. So, just as in the outlier case, we can derive a measure of similarity between a pair of codes as the average co-occurrence within claims across the dataset.
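To make the co-occurrence idea concrete, here is a small sketch in plain Python (the post's actual implementation is in Scalding). The claims below are made-up toy examples, not real CMS data:

```python
from itertools import combinations
from collections import Counter

# Each claim is modeled as the set of codes that appeared together
# on one outpatient record. (Toy data for illustration only.)
claims = [
    {"HCPCS:85025", "HCPCS:36415", "HCPCS:93005"},
    {"HCPCS:85025", "HCPCS:36415", "HCPCS:80053"},
    {"HCPCS:85025", "HCPCS:80053"},
]

# Count how often each unordered pair of codes co-occurs within a claim.
cooccur = Counter()
for claim in claims:
    for a, b in combinations(sorted(claim), 2):
        cooccur[(a, b)] += 1

# HCPCS:36415 and HCPCS:85025 appear together in 2 of the 3 claims.
print(cooccur[("HCPCS:36415", "HCPCS:85025")])
```

Sorting each claim before pairing ensures a pair is always counted under one canonical key, regardless of the order the codes appear in.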

I decided to use a variant of the DBSCAN clustering algorithm. I found a post that provides some tips on how to implement DBSCAN in a distributed manner, and I used its ideas to develop my implementation. The intuition behind my clustering algorithm goes something like this.

We calculate the similarity s(A,B) between a pair of codes A and B as the number of times they co-occur in the corpus. Clustering algorithms need a distance measure, so we treat the distance d(A,B) as the reciprocal of their similarity, i.e. 1/s(A,B). The DBSCAN clustering algorithm works by selecting, around each point, the other points that lie within a specified distance ε. Candidate cluster centroids are those that have at least MinPoints codes within this distance ε. My algorithm deviates from DBSCAN at this point - instead of finding density-reachable codes I just find the Top-N densest clusters. Density is calculated as the number of codes within a circular area of the mean radius, i.e. N² / (π Σᵢ dᵢ²), where the sum runs over the distances to the N neighboring codes. We then take the top N densest code clusters - these are our derived ETGs.
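The density formula above can be expressed as a small Python function (a sketch of the formula as I read it from the text, not the Scalding implementation):

```python
import math

def cluster_density(distances):
    """Density of a candidate cluster given the distances d_i = 1/s_i
    from the centroid code to each of its N neighbors, computed as
    N^2 / (pi * sum(d_i^2)) per the formula in the post."""
    n = len(distances)
    return n * n / (math.pi * sum(d * d for d in distances))

# Toy example: distances are reciprocals of co-occurrence counts.
sims = [4, 2, 2]                    # hypothetical co-occurrence counts
dists = [1.0 / s for s in sims]     # d = 1/s
print(cluster_density(dists))
```

Note that a tightly co-occurring neighborhood (high similarities, hence small distances) yields a high density, which is what lets us rank candidate centroids.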

The Scalding code below does just this. We simplify slightly by dropping constant factors such as π, but otherwise the code is quite faithful to the algorithm described above.
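For readers who don't use Scalding, the end-to-end pipeline can be sketched in plain Python. This is a hypothetical illustration of the steps described above, not the Scalding job itself; the parameter names `eps`, `min_points`, and `top_n` are my own:

```python
import math
from itertools import combinations
from collections import Counter, defaultdict

def top_etgs(claims, eps=1.0, min_points=2, top_n=3):
    """Derive candidate ETGs: co-occurrence counts -> distances ->
    eps-neighborhoods -> density ranking, as described in the text."""
    # Step 1: pairwise co-occurrence counts across all claims.
    cooccur = Counter()
    for claim in claims:
        for a, b in combinations(sorted(set(claim)), 2):
            cooccur[(a, b)] += 1

    # Step 2: neighborhoods of codes within distance eps, where
    # distance is the reciprocal of the co-occurrence count.
    neighbors = defaultdict(list)
    for (a, b), s in cooccur.items():
        d = 1.0 / s
        if d <= eps:
            neighbors[a].append((b, d))
            neighbors[b].append((a, d))

    # Step 3: density N^2 / (pi * sum(d^2)) for each candidate
    # centroid that has at least min_points neighbors.
    scored = []
    for code, nbrs in neighbors.items():
        if len(nbrs) < min_points:
            continue
        n = len(nbrs)
        dens = n * n / (math.pi * sum(d * d for _, d in nbrs))
        scored.append((dens, code, sorted(b for b, _ in nbrs)))

    # Step 4: the Top-N densest candidate clusters are the derived ETGs.
    return sorted(scored, reverse=True)[:top_n]
```

In the distributed Scalding version, steps 1 and 2 map naturally onto group-by operations over (claim, code) tuples, while steps 3 and 4 are a grouped aggregation followed by a global sort.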

I ran this locally with 1 million claims (out of the 16 million claims in my dataset) and got results like this:

| Code | Nearest codes |
|------|---------------|
| HCPCS:00142 | HCPCS:85025, HCPCS:36415, HCPCS:93005, HCPCS:80053, ... |
| HCPCS:00300 | HCPCS:85025, HCPCS:80053, HCPCS:36415, HCPCS:99284, ... |
| HCPCS:00400 | HCPCS:85025, HCPCS:93005, HCPCS:80053, HCPCS:36415, ... |
| HCPCS:00532 | HCPCS:85025, HCPCS:36415, HCPCS:80048, HCPCS:G0378, ... |
| HCPCS:0073T | HCPCS:77417, HCPCS:77336, HCPCS:77427, HCPCS:77280, ... |
| HCPCS:00740 | HCPCS:85025, HCPCS:36415, HCPCS:93005, HCPCS:85610, ... |
| HCPCS:00750 | HCPCS:36415, HCPCS:85025, HCPCS:85610, HCPCS:J3010, ... |
| HCPCS:00790 | HCPCS:36415, HCPCS:85025, HCPCS:80048, HCPCS:80053, ... |
| HCPCS:00810 | HCPCS:36415, HCPCS:85025, HCPCS:93005, HCPCS:80053, ... |
| HCPCS:00830 | HCPCS:36415, HCPCS:85025, HCPCS:80048, HCPCS:93005, ... |

[Edit: 07/22/2014: This approach does not produce proper clusters. Notice in the data that the same code HCPCS:85025 is part of the first 3 clusters, which is obviously not wanted. I will implement the last part of the DBSCAN algorithm and update this page when I am done.]

And that's all I have for today. I'd like to point out a new book on Scalding, Programming MapReduce with Scalding by Antonios Chalkiopoulos. I was quite impressed by this book; you can read my review on Amazon if you are interested.

#### 2 comments (moderated to prevent spam):

Anonymous said...

Great post. Why are you using the reciprocal of the similarity as a distance measure? Can you provide any more information on this approach? Why could you not just use the similarity / co-occurrence count directly?

Sujit Pal said...

Thank you. I am using the reciprocal because DBSCAN needs a distance measure to cluster, so it would try to cluster points with smaller distances between them. Similarity or co-occurrence count would be higher the closer the points are, so it wouldn't work directly.