Data Science As the Panacea for Healthcare Fraud, Waste, and Abuse

Posted by Daniel D. Gutierrez on Thu, Jun 12, 2014

Edited by Yan Zhang

Many pundits speaking about the state of the nation’s healthcare infrastructure routinely point to fraud, waste, and abuse (FWA) as major reasons for many of the problems the public witnesses every day – increased healthcare costs and the resulting increase in insurance premiums. The net effect is an annual loss of billions of dollars, and these losses affect the public in very real ways. 

The size of the healthcare sector, the enormous amount of money involved, and the lack of surveillance and monitoring mechanisms across the healthcare ecosystem make it an attractive target for FWA. According to the Office of Management and Budget, in 2010, about 9%, or $47.9 billion, was lost to fraud in Medicare alone. It is therefore imperative to develop effective FWA technologies and solutions for reducing the costs associated with our healthcare system.

Data science and its primary enabling methodology, machine learning, represent the country’s best chance for detecting FWA and avoiding extraordinary losses. Data science has the tools to make a significant difference to healthcare industry budgets and to their impact on the public. Opera Solutions is the industry leader in applying Big Data technologies to the most challenging and significant business problems. We are the company charged with developing the analytics to identify fraud for the Centers for Medicare & Medicaid Services (CMS) on the health insurance exchanges. Here’s a look into just how big this challenge is, and some of the approaches we’re taking to overcome one of the costliest burdens in America.

Types of Healthcare Fraud

According to David Turcotte, Global Industry Director, Public Sector at Microsoft, the FWA situation can be described in this way:

Waste and abuse are characterized by careless practices that do not conform to good clinical practice and divert money away from treatment, but are not necessarily carried out with criminal intent. Fraud, on the other hand, is an intentional deception or misrepresentation made with the purpose of collecting unauthorized benefits.

Turcotte states that the six most common types of healthcare fraud are billing for services or items not furnished, upcoding (using more costly codes), unbundling (fragmented billing), providing unnecessary services, doctor shopping (when patients visit multiple doctors to obtain multiple doses of controlled substances), and medical identity theft. Without the proper detection and mitigation, many parties end up losing, including healthcare providers, insurers, and consumers.


Data Science Healthcare Fraud Detection Strategies

Data science is poised to become the saving grace in healthcare fraud detection. The technology possesses both prediction and knowledge discovery capabilities to stage an advance against this serious problem. In the two sections below, we’ll survey some ways that machine learning can be used to detect and prevent FWA. We’ll consider both supervised and unsupervised learning methods. Regardless of the method, one important consideration with using machine learning for healthcare FWA is the need to interface with disparate computer systems. There are many active data standards in the healthcare industry, so it’s critical that FWA applications be able to access and utilize a wide cross section of data sources.

Supervised Methods

Supervised machine learning is an important segment of data science because it allows for prediction. There are two main types of supervised learning models: regression and classification. Both types of models are useful for mitigating the risks of FWA. Let’s consider some examples.

Predictive models may be used to detect trends in large volumes of Medicaid and Medicare claims data (Big Data is actively gaining traction in healthcare since analyzing more data generally leads to more accurate predictions).

Classification models may be deployed to categorize certain medical event sequences as either “normal” or “abnormal.” This would lead to the detection of nonsensical sequences of medical procedures, often a precursor to fraud. Similar techniques could be used to make a probabilistic comparison of individual claims, a process that can lead to detecting fraud. Classification can also be used to analyze unstructured social media data in an effort to detect the work of fraudulent actors before any adverse activity has taken place, or to take action after the fact. Classifiers, for instance, can use text analytics techniques to classify groups of Tweets as indicating either fraudulent or non-fraudulent activity.
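To make the idea of labeling event sequences concrete, here is a minimal sketch of a Naive Bayes classifier over pairs of consecutive procedures. It is an illustration only, not a description of any production system: the procedure names, the labels, and the tiny training set are all made up, and a real model would need far more data and careful validation.

```python
import math
from collections import Counter

def bigrams(seq):
    """Pairs of consecutive events, e.g. ('consult', 'xray')."""
    return list(zip(seq, seq[1:]))

def train(labeled_sequences):
    """Count bigrams per label (multinomial Naive Bayes)."""
    counts, totals, vocab = {}, Counter(), set()
    for seq, label in labeled_sequences:
        counts.setdefault(label, Counter()).update(bigrams(seq))
        totals[label] += 1
        vocab.update(bigrams(seq))
    return counts, totals, vocab

def classify(model, seq):
    """Return the label with the highest log-probability for this sequence."""
    counts, totals, vocab = model
    n = sum(totals.values())
    best_label, best_score = None, -math.inf
    for label in counts:
        score = math.log(totals[label] / n)  # class prior
        denom = sum(counts[label].values()) + len(vocab)
        for bg in bigrams(seq):
            score += math.log((counts[label][bg] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, hand-labeled training sequences (illustrative only).
training = [
    (["consult", "xray", "cast"], "normal"),
    (["consult", "bloodwork", "prescription"], "normal"),
    (["consult", "xray", "prescription"], "normal"),
    (["surgery", "consult", "surgery"], "abnormal"),
    (["cast", "xray", "cast"], "abnormal"),
]
model = train(training)
print(classify(model, ["consult", "xray", "cast"]))
print(classify(model, ["surgery", "consult", "surgery"]))
```

The same structure carries over to text: replace procedure bigrams with word counts and the classifier becomes a simple text classifier of the kind mentioned above.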

Unsupervised Methods

In contrast to supervised learning, unsupervised techniques provide for knowledge discovery. In this case, there is no prediction involved, but rather the goal is to detect previously unknown patterns in the data. The main vehicle for unsupervised learning is clustering.

In the healthcare FWA domain, clustering can be used to group claimants into clusters that analysts can then review as either “potentially fraudulent” or “not potentially fraudulent.” Clustering can also be used to find patterns in the medical billing records generated by physicians and healthcare providers, which in turn supports FWA detection.
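As a rough illustration of clustering billing records, the sketch below runs a plain k-means on a handful of made-up provider profiles. The two features (claims per month and average billed amount in hundreds of dollars), the data, and the starting centroids are all assumptions chosen for readability. Note that the algorithm only groups similar providers; deciding that a small, unusual cluster deserves review is still a human judgment.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Each centroid becomes the mean of its members (unchanged if it has none).
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical provider profiles: (claims per month, average billed amount in $100s).
providers = [(20, 1.1), (22, 1.0), (19, 1.2), (21, 0.9), (95, 4.8), (90, 5.1)]

# Starting centroids are picked by hand here to keep the example deterministic.
centroids, clusters = kmeans(providers, centroids=[providers[0], providers[4]])
for i, members in enumerate(clusters):
    print(f"cluster {i}: {members}")
```

In practice the features would be scaled first so that one large-valued column does not dominate the distance calculation.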


Opera Solutions' FWA Radar

Opera Solutions’ FWA Radar utilizes its industry-leading expertise in Big Data predictive analytics and extensive experience in developing healthcare-focused predictive modeling solutions to provide a comprehensive FWA offering. With the integration of multiple data sources and advanced machine-learning techniques — which produce results in real time — our solution identifies not only known but also newly emerging FWA patterns through a self-learning module, which incorporates user feedback.

Fraud detection methods based on data science can be roughly divided into two categories: supervised and unsupervised machine learning. Supervised learning requires all cases in the training data set to be labeled by domain experts. Unsupervised learning does not have this requirement, as the objective is to find outliers in the cases. Examples of the algorithms that have been applied to Medicare fraud detection include neural networks, decision trees, association rules, Bayesian networks, and genetic algorithms, among others. As a result of applying these methods, some fraud behaviors can now be detected, including home/hospital stay conflict, hospital stay with no associated physician inpatient visit, excessive lab/radiology services per client per day, x-ray duplicate billing, fragmented lab and x-ray procedures, lab/x-ray interpretation with no associated technical portion, and ambulance trips with no associated medical service.
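Several of the behaviors listed above can be expressed as simple checks once the claims are in a consistent format. The sketch below shows two of them, duplicate x-ray billing and excessive lab services per client per day. The record layout, the field names, and the threshold of two lab services per day are hypothetical; real claim data and real thresholds would differ.

```python
from collections import Counter

# Hypothetical claim records; field names are illustrative only.
claims = [
    {"claim_id": 1, "patient": "P1", "provider": "D1", "code": "XRAY", "date": "2014-03-01"},
    {"claim_id": 2, "patient": "P1", "provider": "D1", "code": "XRAY", "date": "2014-03-01"},
    {"claim_id": 3, "patient": "P2", "provider": "D2", "code": "LAB", "date": "2014-03-02"},
    {"claim_id": 4, "patient": "P2", "provider": "D2", "code": "LAB", "date": "2014-03-02"},
    {"claim_id": 5, "patient": "P2", "provider": "D3", "code": "LAB", "date": "2014-03-02"},
    {"claim_id": 6, "patient": "P3", "provider": "D1", "code": "XRAY", "date": "2014-03-05"},
]

def duplicate_billing(claims, code):
    """Flag repeat claims for the same patient, provider, service, and day."""
    seen, flagged = set(), []
    for c in claims:
        if c["code"] != code:
            continue
        key = (c["patient"], c["provider"], c["date"])
        if key in seen:
            flagged.append(c["claim_id"])
        else:
            seen.add(key)
    return flagged

def excessive_services(claims, code, max_per_day):
    """Flag (patient, date) pairs with more than max_per_day services of one type."""
    per_day = Counter((c["patient"], c["date"]) for c in claims if c["code"] == code)
    return sorted(key for key, n in per_day.items() if n > max_per_day)

print(duplicate_billing(claims, "XRAY"))
print(excessive_services(claims, "LAB", max_per_day=2))
```

Checks like these only cover patterns someone has already thought of, which is why they are usually paired with the learned models discussed in this section.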

One example is neural networks, which have been used extensively in fraud detection in general thanks to their ability to handle complex data structures and nonlinear variable relationships. However, one common concern with neural networks is overfitting, which produces a relatively small error on the training data set but a much larger error when new data is presented to the network. Overfitting is especially prominent with skewed data such as healthcare claims, which have many more legitimate cases than fraudulent ones. Fortunately, a number of strategies have been devised to address the overfitting problem. One is adding a weight decay term to the error function, which penalizes large weights. Another is “early stopping,” which splits the available data into two sets: a training set used to update the weights and biases, and a validation set used to stop training when the network begins to overfit. Both improve the generalization of the neural network from the training data set to the test data set.
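The two remedies are easier to see in code. The sketch below applies them to the simplest possible network, a single logistic unit trained by gradient descent, rather than a full multilayer model. The data, learning rate, decay strength, and patience value are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, data):
    """Mean cross-entropy; data is a list of (features, label) pairs."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(train_set, val_set, lr=0.5, decay=0.01, max_epochs=500, patience=10):
    weights = [0.0] * len(train_set[0][0])
    best_weights, best_val, stale = list(weights), log_loss(weights, val_set), 0
    for _ in range(max_epochs):
        grad = [0.0] * len(weights)
        for x, y in train_set:
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi / len(train_set)
        # Weight decay: decay * w is the gradient of the penalty (decay / 2) * ||w||^2.
        weights = [w - lr * (g + decay * w) for w, g in zip(weights, grad)]
        # Early stopping: watch the validation loss, not the training loss.
        val = log_loss(weights, val_set)
        if val < best_val:
            best_weights, best_val, stale = list(weights), val, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_weights, best_val

# Hypothetical claims: features are [bias, scaled amount, scaled visit count]; label 1 = fraud.
train_set = [([1, 0.1, 0.2], 0), ([1, 0.2, 0.1], 0), ([1, 0.3, 0.3], 0),
             ([1, 0.2, 0.4], 0), ([1, 0.9, 0.8], 1), ([1, 0.8, 0.9], 1)]
val_set = [([1, 0.15, 0.25], 0), ([1, 0.85, 0.9], 1), ([1, 0.25, 0.2], 0)]

best_weights, best_val = train(train_set, val_set)
print(best_weights, best_val)
```

The function returns the weights from the epoch with the lowest validation loss, not the weights from the last epoch, which is the whole point of early stopping.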

Applying supervised learning to Medicare fraud detection often involves combining several supervised algorithms (as an ensemble) to improve classification performance. Here are some examples of fraud-detection learning ensembles:

  • A Bayesian network whose weights are refined by a rule generator
  • A k-nearest neighbor algorithm whose distance metric is optimized by a genetic algorithm, used to detect two types of fraud: inappropriate practice by service providers and “doctor shopping” (soliciting multiple physicians under a variety of false pretenses to receive prescriptions for controlled substances)
  • A model that combines fuzzy-set theory and a Bayesian classifier to detect suspicious claims
  • Association rules combined with a neural segmentation algorithm
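The ensembles listed above are considerably more elaborate than anything that fits in a blog post, but the underlying idea of combining several detectors can be sketched with a simple majority vote. The three detectors below, their thresholds, and the claim fields are hypothetical stand-ins for trained models.

```python
# Three hypothetical detectors; in a real ensemble each would be a trained model.
def high_amount(claim):
    return claim["amount"] > 5000

def many_procedures(claim):
    return claim["procedures"] > 8

def short_visit(claim):
    return claim["minutes"] < 10

def ensemble_flag(claim, detectors, min_votes=2):
    """Flag a claim when at least min_votes detectors consider it suspicious."""
    votes = sum(1 for detect in detectors if detect(claim))
    return votes >= min_votes

detectors = [high_amount, many_procedures, short_visit]
claim_a = {"amount": 7200, "procedures": 11, "minutes": 30}  # two detectors fire
claim_b = {"amount": 7200, "procedures": 3, "minutes": 30}   # only one fires

print(ensemble_flag(claim_a, detectors))
print(ensemble_flag(claim_b, detectors))
```

Requiring agreement between detectors trades some sensitivity for fewer false alarms; lowering min_votes shifts that balance the other way.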

Most unsupervised methods are combined with supervised methods in Medicare fraud detection. For example, a clustering algorithm can be applied to divide all insurance subscriber profiles into groups. Then a decision tree can be built for each group and converted into a set of rules. When an unsupervised method is followed by a supervised method, the objective is usually to discover knowledge in a hierarchical way.
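Here is a minimal sketch of that two-stage idea, under heavy simplifications: the clustering step is a one-dimensional k-means on visits per year, and the decision tree for each group is reduced to a single split on billed amount. The subscriber profiles, field names, and starting centers are invented for illustration.

```python
def cluster_1d(values, centers, iterations=10):
    """Unsupervised step: one-dimensional k-means on a single feature."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def best_stump(rows, feature):
    """Supervised step: a depth-1 decision tree, i.e. the threshold with the fewest errors."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        errors = sum((r[feature] > t) != r["fraud"] for r in rows)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

# Hypothetical subscriber profiles with expert-provided fraud labels.
profiles = [
    {"visits": 2, "amount": 100, "fraud": False},
    {"visits": 3, "amount": 120, "fraud": False},
    {"visits": 4, "amount": 900, "fraud": True},
    {"visits": 20, "amount": 800, "fraud": False},
    {"visits": 22, "amount": 900, "fraud": False},
    {"visits": 25, "amount": 3000, "fraud": True},
    {"visits": 21, "amount": 850, "fraud": False},
]

centers = cluster_1d([p["visits"] for p in profiles], centers=[0.0, 30.0])
groups = {}
for p in profiles:
    g = min(range(len(centers)), key=lambda i: abs(p["visits"] - centers[i]))
    groups.setdefault(g, []).append(p)

# One rule per group: "flag if amount > threshold".
rules = {g: best_stump(rows, "amount") for g, rows in groups.items()}
for g, threshold in sorted(rules.items()):
    print(f"group {g}: flag if amount > {threshold}")
```

In this toy data, a $900 bill is flagged for low-usage subscribers but not for high-usage ones, which is the kind of group-specific knowledge the hierarchical approach is meant to surface.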

Opera Solutions’ healthcare FWA capability also includes a proprietary Ensemble scoring methodology that combines multiple models at various granularity levels. Each model has a unique ability to address a particular aspect of the problem. This allows the solution to capture the complicated structure of procedure and diagnosis codes at the visit level and maximize performance in predicting outliers.



The development of a safe, high-quality, and cost-effective healthcare system like Medicare requires effective ways to detect and mitigate FWA. The application of machine learning to healthcare FWA is necessary but highly challenging. The overall objective is to develop FWA detection methods and algorithms that are scalable, accurate, and fast. Ultimately, they need to be able to handle the immense volume of healthcare data in real time while keeping costs and error rates low.

If you’d like to learn more about how we detect and prevent healthcare fraud, download our FWA Radar sheet.




Daniel D. Gutierrez is a Los Angeles–based data scientist. He is also a recognized Big Data journalist and is working on a new machine-learning book due out later this year.

Yan Zhang is an Opera Solutions data scientist based in San Diego. He leads Opera Solutions’ healthcare analytics team.

