Data Science and ICD-10 Team Up to Benefit Healthcare

Posted by Daniel D. Gutierrez on Mon, Mar 31, 2014

Switching to a new medical coding system won’t be easy, but when combined with data science and machine learning, ICD-10 presents enormous potential benefits for both the financial and the clinical sides of healthcare.

Part of why the healthcare industry is such a notorious laggard in jumping on the Big Data bandwagon is that every attempted change faces a huge domino effect, rendering many good ideas useless until everyone, and everything, is ready. One big step in the right direction, however, is an important upgrade to the computerized codes used for electronic medical records (EMR), which will take hold in the next year or two. These codes, known as ICD or International Classification of Diseases, record what ailments patients have and help determine how much they and their insurers should pay for a treatment. The current set of codes, called ICD-9, is scheduled to give way to the 10th revision, ICD-10, this fall (but there may be a year-long delay). The new ICD-10 codes allow for much greater detail than the existing codes in describing illnesses, injuries, and treatment procedures. Through the use of data science techniques, this increased granularity is expected to allow for improved tracking of public health threats and trends as well as better analysis of treatments.

The Predictive Power of ICD-10

In the past, a patient record might say “injury to right arm,” whereas with ICD-10 it will say “crushing injury to left upper arm” or “fracture to right lower arm.” The specificity will help with documentation, measurement, and databases, as well as discovery using methods of supervised and unsupervised machine learning. (Supervised machine learning assigns predefined labels to new records based on patterns learned from a training set of labeled examples; unsupervised machine learning looks for structure in the data without any labeled examples.)

The U.S. Department of Health and Human Services (HHS) hopes the move to a new era of ultra-specific codes will help the industry identify more billing fraud, allow more thorough quality reporting by healthcare providers, and enable refinements in reimbursement models through more detailed diagnostic and procedure data. HHS also expects the conversion will improve patient care and outcomes through new insights that may be uncovered in analysis of the more detailed clinical data. By all indications, this evolution will yield an unprecedented increase in the predictive power of machine-learning algorithms designed to make sense of Big Data generated by the healthcare industry. For example, machine learning can help predict certain health issues by finding patterns and correlations among patient attributes such as age, blood pressure, and specific lab results.

The transition from ICD-9 to ICD-10 involves expanding medical diagnosis codes from the current number of 14,000 to more than 67,000 and procedure codes from 13,000 to 85,000. As a result, the new codes can be used to document far more complex diagnoses and procedures than was previously possible. The potentially larger feature vector (the set of variables used in a machine-learning algorithm) can lead to long-awaited answers to important healthcare questions. Further, more questions can be answered at greater speed by combining patient demographic data with ICD-10 diagnosis codes, all of which are being collected within patients’ EMRs. The larger, more detailed data set that ICD-10 provides gives clinical analysts more depth to drill into and finer detail in the answers they find. Sometimes it’s what you find that you weren’t looking for that brings about answers as well, a gain made possible with unsupervised machine-learning techniques such as clustering.
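To make the idea of a feature vector concrete, here is a minimal Python sketch that combines one demographic field (age) with a 0/1 indicator for each diagnosis code. The patient records, the field names, and the helper functions are all hypothetical and chosen only for illustration; a real pipeline would draw on the full code set and many more fields from the EMR.

```python
def build_vocabulary(patients):
    """Collect every distinct diagnosis code, sorted for a stable column order."""
    return sorted({code for p in patients for code in p["codes"]})


def to_feature_vector(patient, vocabulary):
    """Return age followed by one 0/1 indicator per code in the vocabulary."""
    codes = set(patient["codes"])
    return [patient["age"]] + [1 if c in codes else 0 for c in vocabulary]


# Made-up patients; the codes are illustrative examples of ICD-10 diagnosis codes.
patients = [
    {"age": 54, "codes": ["I10", "E11.9"]},
    {"age": 31, "codes": ["J45.909"]},
    {"age": 67, "codes": ["I10"]},
]

vocabulary = build_vocabulary(patients)
vectors = [to_feature_vector(p, vocabulary) for p in patients]

for vector in vectors:
    print(vector)
```

Each additional code in the vocabulary adds one column, which is why a jump from thousands to tens of thousands of codes can enlarge the feature vector so much. Vectors like these are the kind of input a clustering algorithm would group.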

Data Science Can Improve the Coding Process and Shorten the Learning Curve

Data science technology also can help with the ICD-10 coding process. In healthcare, coding refers to the process of converting the narrative description of the treatments a patient received, including doctors’ notes, lab tests, and medical images, into a series of alphanumeric codes that are used for billing purposes. Medical coding is both very important (if the treatments a patient receives aren’t coded, the insurance company won’t pay for them) and very expensive (medical coders need a lot of training and skill to do the job well). The new ICD-10 makes the process even more challenging due to increased complexity. Finding ways to use natural language processing (NLP) to help coders be more efficient is a great problem for a data scientist to tackle: It has a quantifiable impact on the bottom line, and there is a strong potential for data analysis and modeling to make a meaningful difference.

The first systems for performing computer-assisted coding (circa 1994) were similar to the first spam classifiers: They relied almost exclusively on a static set of rules to make coding decisions. They also primarily targeted medical coding applications for outpatient treatments instead of the more complex coding required for inpatient treatments. These early systems weren’t very robust, but they were useful enough that they could gather feedback from the medical coders on when they failed to identify a code or included one that was not relevant for the problem. In addition, as more data was gathered, the static rules could be augmented with statistical models that were capable of adjusting to new information and improving over time.
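A toy version of such a rule-based coder might look like the Python sketch below. The rules, the sample note, and the function names are hypothetical and are not taken from any actual coding product; the sketch only shows how static keyword rules suggest codes and how a coder's corrections can be captured as feedback.

```python
import re

# Hypothetical static rules: a keyword pattern mapped to a diagnosis code.
RULES = [
    (re.compile(r"\bhypertension\b", re.IGNORECASE), "I10"),
    (re.compile(r"\btype 2 diabetes\b", re.IGNORECASE), "E11.9"),
    (re.compile(r"\basthma\b", re.IGNORECASE), "J45.909"),
]


def suggest_codes(note):
    """Return the code of every rule whose pattern appears in the note."""
    return [code for pattern, code in RULES if pattern.search(note)]


def feedback(suggested, confirmed):
    """Compare the system's suggestions with the codes the coder kept."""
    suggested, confirmed = set(suggested), set(confirmed)
    return {
        "missed": sorted(confirmed - suggested),
        "irrelevant": sorted(suggested - confirmed),
    }


note = "Patient with long-standing hypertension; denies asthma symptoms."
suggested = suggest_codes(note)
report = feedback(suggested, confirmed=["I10"])
print(suggested)
print(report)
```

The sample note shows why static rules are brittle: the word "asthma" triggers a code even though the patient denies the symptoms. Feedback records like `report` are the sort of data that could later be used to train a statistical model.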

Today, systems that interpret the unstructured text of clinical notes, which contain a huge supply of information, and then map it to fine-grained ICD-10 coding schemes use a combination of NLP, linguistic analysis, machine learning, and semantic Web technologies. In this case, technology has kept pace with healthcare industry advancements.

ICD-10 Will Temporarily Hamper Productivity and Be Vulnerable to Fraud

Adopting the new codes will also bring challenges. Detecting healthcare insurance fraud relies on complex algorithms designed to detect suspicious outliers in claims data. For example, a fraud classifier might look for discrepancies in certain feature variables such as average dollars paid per patient, average cost of medical procedures, and average number of visits per year. The imminent arrival of ICD-10 presents an unprecedented challenge to these algorithms, rendering them less effective and blurring the line between genuine coding mistakes and actual healthcare fraud.
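As a rough illustration of outlier detection on one such feature, the Python sketch below flags providers whose average dollars paid per patient sits far from the median. It uses a modified z-score based on the median absolute deviation, which is one common technique and an assumption here, not necessarily what production fraud systems use. The provider IDs, dollar amounts, and the 3.5 threshold are all made up for the example.

```python
from statistics import median


def flag_outliers(values_by_provider, threshold=3.5):
    """Flag providers whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    values = list(values_by_provider.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread around the median, so the score is undefined; flag nothing.
        return []
    return sorted(
        provider
        for provider, value in values_by_provider.items()
        if 0.6745 * abs(value - center) / mad > threshold
    )


# Hypothetical average dollars paid per patient, keyed by provider ID.
avg_paid_per_patient = {
    "P01": 120, "P02": 135, "P03": 128, "P04": 131, "P05": 125,
    "P06": 122, "P07": 133, "P08": 127, "P09": 129, "P10": 480,
}

flagged = flag_outliers(avg_paid_per_patient)
print(flagged)
```

A flagged provider is only a candidate for review. As the paragraph above notes, an unusual value may reflect a genuine coding mistake rather than fraud, and a change of code set can shift what counts as normal.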

The most common types of fraud, according to the National Healthcare Anti-Fraud Association (NHCAA), involve billing for services that were never provided or were more expensive than actual services provided, performing unnecessary services, providing false diagnoses, overbilling a patient’s copay, and accepting kickbacks. The machine-learning algorithms designed to detect these types of fraud attempt to identify trends, becoming smarter and more accurate as more data is used to train them. But ICD-10 is a brand-new set of codes. The upheaval will stymie the algorithms at least temporarily as they struggle to make sense of the new input, potentially letting fraud slip through undetected.

After the transition period (which, admittedly, could take a year), when the algorithms have caught up with the new requirements involved in ICD-10, the greater specificity in medical codes and increased amount and quality of data will give the algorithms better insight into the patterns of legitimate billing, refining their capabilities even further and reducing the opportunities for unscrupulous providers to seek extra funds.


ICD-10 represents a bold and necessary move for the healthcare industry. The hope is that this more precise information detailing patient maladies and treatments will fuel the effort to utilize the many benefits brought by Big Data, machine learning, and advanced analytics, including prediction, classification, and the discovery of hidden knowledge waiting to be found in healthcare’s immense data stores.

Opera Solutions is at the forefront of this movement for healthcare providers and payers alike. We offer Hospital Revenue Leakage, which is most directly impacted by ICD-10. It uses pattern-based machine-learning algorithms to find missing charges in patient invoices. We also work with the Centers for Medicare & Medicaid Services, detecting fraud and helping with the Exchanges. We offer a different fraud solution to state Medicare and Medicaid Services, as well as private insurers. FWA Radar is Opera Solutions’ new fraud-detection solution.


Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics. He’s been involved with data science and Big Data since long before they came into vogue, so imagine his delight when the Harvard Business Review deemed “data scientist” the sexiest profession of the 21st century. He is also a recognized Big Data journalist and is working on a new machine-learning book due out later this year.

Topics: Healthcare, Big Data, Data Science, Machine Learning