Data Science and ICD-10 Team Up to Benefit Healthcare

Posted by Daniel D. Gutierrez on Mon, Mar 31, 2014


Switching to a new medical coding system won’t be easy, but when combined with data science and machine learning, ICD-10 presents enormous potential benefits for both the financial and the clinical sides of healthcare.

Part of why the healthcare industry is such a notorious laggard in jumping on the Big Data bandwagon is that every attempted change sets off a huge domino effect, rendering many good ideas useless until everyone, and everything, is ready. One big step in the right direction, however, is an important upgrade to the computerized codes used in electronic medical records (EMR), which will take hold in the next year or two. These codes, known as ICD or International Classification of Diseases, record what ailments patients have and help determine how much they and their insurers should pay for a treatment. The current set of codes, called ICD-9, is scheduled to give way to its 10th revision this fall (though there may be a year-long delay). The updated codes, called ICD-10, improve on the previous standard by adding more descriptive capabilities that will help healthcare professionals better categorize and keep track of patient disorders and treatments. Through the use of machine learning and other data science techniques, this increased granularity is expected to open up patient treatment analytics and improve the ability to monitor public health threats.


The Predictive Power of ICD-10

In the past, a patient record might say only “injury to right arm,” whereas with ICD-10 it will say something like “crushing injury to left upper arm” or “fracture to right lower arm.” The specificity will help with documentation, measurement, and databases, as well as with discovery using machine learning.

Expanded ICD codes, starting with the new ICD-10 codes and beyond, will pave the way for many new benefits for the healthcare industry, from fraud detection to treatment diagnostics. By all indications, this evolution will yield an unprecedented increase in the predictive power of machine learning algorithms designed to make sense of the Big Data generated by the healthcare industry. For example, machine learning can help predict certain health issues by finding patterns and correlations across patient data such as age, blood pressure, and specific lab results. Applied to the updated codes, healthcare data analytics should surface insights that were previously out of reach.

To get a feeling for the extent of the expansion, the old ICD-9 standard had 14,000 diagnosis codes, whereas the new ICD-10 codes number more than 67,000. Similarly, procedure codes have increased from 13,000 to 85,000. As a result, the new codes can document far more complex diagnoses and procedures than was previously possible. The potentially larger feature vector (the set of variables fed to a predictive machine learning algorithm) can lead to long-awaited answers to important healthcare questions. Further insights can be gleaned by combining ICD-10 codes with demographic information collected from patient EMRs; combining data sets in this fashion can have a significant effect on the predictive power of a data science solution. Other types of machine learning aim at knowledge discovery instead of prediction. With these solutions, known as clustering techniques, you may find groupings in the data that nobody thought to look for.
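To make the idea of a feature vector concrete, here is a minimal sketch of how diagnosis codes and one demographic field might be combined into numeric vectors. The patient records, field names, and helper functions are hypothetical and invented for illustration; a real pipeline would draw on EMR data and a far larger code vocabulary.

```python
# Hypothetical patient records: the ages and code assignments are toy data.
patients = [
    {"age": 54, "codes": ["I10", "E11.9"]},
    {"age": 37, "codes": ["J45.909"]},
    {"age": 61, "codes": ["I10"]},
]


def build_vocabulary(records):
    """Give every distinct code seen in the records a fixed column index."""
    codes = sorted({code for record in records for code in record["codes"]})
    return {code: index for index, code in enumerate(codes)}


def to_feature_vector(record, vocab):
    """Return [age, indicator_0, indicator_1, ...] for one patient.

    Column 0 holds the age; each remaining column is 1.0 if the patient
    has the corresponding code and 0.0 otherwise.
    """
    vector = [float(record["age"])] + [0.0] * len(vocab)
    for code in record["codes"]:
        if code in vocab:
            vector[1 + vocab[code]] = 1.0
    return vector


vocab = build_vocabulary(patients)
vectors = [to_feature_vector(p, vocab) for p in patients]

for patient, vector in zip(patients, vectors):
    print(patient["codes"], "->", vector)
```

Each distinct code adds one column, so a richer code set directly widens the vector that a predictive or clustering algorithm receives.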


Data Science Can Improve the Coding Process and Shorten the Learning Curve

Data science technology can also help with the ICD-10 coding process. When reviewing claims, the healthcare industry uses a standard code set to denote specific conditions and treatments. Codes describe the information contained in physicians’ notes and lab test results. They give machine learning algorithms a more succinct representation than free text, and they also help streamline billing processes (insurance companies may not issue payment if services are not coded). The caveat is that encoding medical data can be expensive and labor-intensive, since it requires human skill and expertise. ICD-10 makes the process even more challenging with its increased complexity. Fortunately, the data science field provides technology such as text analytics coupled with AI methods involving natural language processing (NLP) to translate human-entered narratives into their encoded equivalents.

Today, understanding the unstructured text of clinical notes, which contain a huge supply of information, and then mapping it to the fine-grained ICD-10 coding scheme requires a combination of NLP, linguistic analysis, machine learning, and graph database technology. In this case, technology has kept pace with healthcare industry advancements.
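As a rough illustration of the mapping step, the sketch below suggests codes for a free-text note using plain phrase matching. This is a deliberately simple stand-in for the NLP and machine learning stack described above, not a description of any production coder. The lookup table, the function name suggest_codes, and the sample note are all assumptions made for this example, and the table covers only three ICD-10-CM codes.

```python
import re

# Tiny illustrative lookup: phrase -> (ICD-10-CM code, description).
# A real system would need the full code set plus clinical context.
PHRASE_TO_CODE = {
    "high blood pressure": ("I10", "Essential (primary) hypertension"),
    "hypertension": ("I10", "Essential (primary) hypertension"),
    "type 2 diabetes": ("E11.9", "Type 2 diabetes mellitus without complications"),
    "asthma": ("J45.909", "Unspecified asthma, uncomplicated"),
}


def suggest_codes(note):
    """Return sorted (code, description) pairs whose phrases appear in the note."""
    text = re.sub(r"\s+", " ", note.lower())
    found = {}
    for phrase, (code, description) in PHRASE_TO_CODE.items():
        # Word boundaries keep a phrase from matching inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found[code] = description
    return sorted(found.items())


note = "Pt reports high blood pressure and a history of asthma. No chest pain."
suggestions = suggest_codes(note)

for code, description in suggestions:
    print(code, "-", description)
```

Phrase matching like this ignores negation ("no history of asthma" would still match), abbreviations, and misspellings, which is exactly why the linguistic analysis and learned models mentioned above are needed in practice.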

ICD-10 Will Temporarily Hamper Productivity and Be Vulnerable to Fraud

The adoption of the new codes will bring its own challenges. Detecting healthcare insurance fraud relies on complex algorithms designed to detect suspicious outliers in claims data. For example, a fraud classifier might look for discrepancies in feature variables such as average dollars paid per patient, average cost of medical procedures, and average number of visits per year. Machine learning algorithms known as “classifiers” label healthcare events (e.g., performing unnecessary services, billing for services never provided, billing for services at inflated costs, and reporting false diagnoses) as one of two classes: “fraud” or “not fraud.” These predictive methods can also be expanded to multi-class situations, so that multiple types of fraud can be distinguished. The detection algorithms become more accurate as they are continually retrained with newly available data. Models trained on ICD-9 claims, however, have no history of ICD-10 codes to learn from, so their accuracy is likely to dip until enough new data accumulates.
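The outlier idea can be sketched with a single feature. The example below screens providers by average dollars paid per patient and flags values that sit far from the median. Note that this is an unsupervised screen swapped in for simplicity, not the trained classifier described above. The provider names, dollar amounts, and the 3.5 cutoff are assumptions chosen for illustration.

```python
from statistics import median

# Made-up averages of dollars paid per patient, one value per provider.
avg_paid_per_patient = {
    "provider_A": 410.0,
    "provider_B": 395.0,
    "provider_C": 430.0,
    "provider_D": 405.0,
    "provider_E": 1250.0,
}


def robust_scores(values):
    """Score each value by its distance from the median, in MAD units.

    The median absolute deviation (MAD) is used instead of the standard
    deviation so that one extreme value does not hide itself by inflating
    the spread. The 0.6745 factor rescales the result so that it is
    comparable to a z-score when the data are roughly normal.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return [0.0 for _ in values]
    return [0.6745 * (v - center) / mad for v in values]


THRESHOLD = 3.5  # assumed cutoff; tune it against real claims data

names = list(avg_paid_per_patient)
scores = robust_scores([avg_paid_per_patient[n] for n in names])
flagged = [n for n, s in zip(names, scores) if abs(s) > THRESHOLD]

print("Flagged for review:", flagged)
```

A flag here means only that a provider deserves a closer look. Deciding between "fraud" and "not fraud" still calls for labeled examples and a classifier trained on many features.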

After the transition period (which, admittedly, could take a year), the algorithms will have caught up with the new requirements of ICD-10. At that point, the greater specificity of the medical codes and the increased amount and quality of data will give the algorithms better insight into the patterns of legitimate billing, refining their capabilities even further and reducing the opportunities for unscrupulous providers to seek extra funds.


ICD-10 represents a bold and necessary move for the healthcare industry. The hope is that this more precise information detailing patient maladies and treatments will fuel the effort to utilize the many benefits brought by Big Data, machine learning, and advanced analytics, including prediction, classification, and the discovery of hidden knowledge waiting to be found in healthcare’s immense data stores.

Opera Solutions is at the forefront of this movement for healthcare providers and payers alike. Our Hospital Revenue Leakage solution, the offering most directly affected by ICD-10, uses pattern-based machine-learning algorithms to find missing charges in patient invoices. We also work with the Centers for Medicare & Medicaid Services, detecting fraud and helping with the Exchanges, and we offer a different fraud solution to state Medicare and Medicaid Services, as well as to private insurers.

A new data sheet on FWA Radar, Opera Solutions' fraud-detection solution, is available for download.




Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics.

