Ensuring Predictive Analytics Success with Data Preparation & Quality

Posted by Daniel D. Gutierrez on Fri, Mar 24, 2017

If you’re in the business of pretty much anything, you’ve got a lot of important data coming in from a lot of different places, both internal and external. What you might be lacking are some best practices that could help you access or see all of that data and be in a position to extract important insights that could nudge your business into new competitive directions.

But what data is relevant to your business and where is it? Can you access it when you want to? Do you know that it’s accurate, current, clean, and complete? Can you easily pull all the data together, no matter what format it’s in or how often it changes?

Basically, is your data ready to support analytics?

Data Preparation

Many companies have done analytics on data that wasn’t prepared for analytics. Their data might have been incomplete, or maybe they were working from duplicate, corrupt, or outdated data.

Until companies find a better way to manage their data, the results of their analytics are going to be far less than optimal. So how difficult is it to manage unfiltered data and get it ready for analytics? Ask a data scientist. Many of them spend 50 to 80 percent of their time on data transformation.

Here are four steps data scientists take to prep data for analytics:

  1. Simplify access to traditional and emerging data. Enterprise thought leaders need to steer their companies toward the adoption of tools that offer native data access capabilities, which make it easy to work with a variety of data from an ever-increasing number of sources, formats, and structures.

  2. Strengthen your data science team’s arsenal with analytic techniques. It’s useful to have sophisticated statistical analysis capabilities inside of the extract, transform, load (ETL) flow. For example, frequency analysis helps identify outliers and missing values that can skew other statistical measures. Summary statistics help analysts understand the distribution and variance of the data. Correlation shows which variables or combinations of variables will be most useful based on the strength of their predictive capability.

  3. Cleanse data to build quality into existing processes. Up to 40 percent of all strategic processes fail because of poor data. With a data quality platform designed around data best practices, your team can incorporate data cleansing right into your company’s data pipeline.

  4. Shape data using flexible manipulation techniques. Preparing data for analytics requires merging, transforming, denormalizing, and sometimes aggregating your source data from multiple tables into one big, flat file. In addition, your team may need to use other reshaping transformations such as frequency analysis, appending data, partitioning data, and combining data, as well as multiple summarization techniques.
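
To make the steps above a little more concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not a description of any particular product: the sample customer and order records, the column names, and the helper functions (frequency, deduplicate, flatten, pearson) are all hypothetical. It runs a frequency analysis to surface missing values, removes exact duplicates, merges and aggregates two tables into one flat table, and then computes summary statistics and a correlation.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample data (not from the survey or any real system).
customers = [
    {"customer_id": 1, "region": "West"},
    {"customer_id": 2, "region": "East"},
    {"customer_id": 2, "region": "East"},  # exact duplicate row
    {"customer_id": 3, "region": None},    # missing value
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 50.0},
    {"customer_id": 3, "amount": 300.0},
]


def frequency(rows, column):
    """Frequency analysis: count each value in a column, including None."""
    return Counter(row[column] for row in rows)


def deduplicate(rows):
    """Cleansing: drop exact duplicate rows, keeping the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def flatten(customer_rows, order_rows):
    """Shaping: merge orders onto customers, one flat row per customer."""
    totals, counts = Counter(), Counter()
    for order in order_rows:
        totals[order["customer_id"]] += order["amount"]
        counts[order["customer_id"]] += 1
    return [
        {
            **customer,
            "order_count": counts[customer["customer_id"]],
            "total_amount": totals[customer["customer_id"]],
        }
        for customer in customer_rows
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (sx * sy)


region_counts = frequency(customers, "region")  # a None key flags missing data
clean = deduplicate(customers)
flat = flatten(clean, orders)

amounts = [row["total_amount"] for row in flat]
summary = {"mean": mean(amounts), "stdev": pstdev(amounts)}
r = pearson([row["order_count"] for row in flat], amounts)

print(region_counts)
print(flat)
print(summary, r)
```

In practice a team would more likely run these steps inside an ETL tool or a data frame library; the point of the sketch is only that profiling, cleansing, and shaping are separate, testable stages.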

Data Quality

Data preparation is only half the process. Once the data is cleaned and normalized, ensuring data quality is next. Data quality management (DQM) is often cited as a critical determining factor in delivering business value from data. A 2016 survey of 200 enterprise decision makers and influencers helps determine the real and perceived risks of compromised data quality and integrity in enterprises. It also identifies future initiatives that will impact the growth of data quality management use.

The figure below shows that enterprises manage data quality in a variety of ways. Of the respondents, 44.5% said they find errors using reports and then act on those errors, which isn’t a particularly proactive method. So it makes sense that 37.5% said they use a manual data cleansing process. Surprisingly, 8.5% of respondents avoid any kind of data quality management completely, indicating they use a “hope for the best” approach.

Figure 2: Means for Managing Data Quality


Source: 451 Research, January 2016 

The next figure shows the causes of poor data quality. Human error, as expected, ranked as the number-one offender. IT-related practices such as migration efforts, system changes, and system errors were also frequently cited. However, 38% of respondents cited their customers as the cause of data quality issues. Customer data entry usually involves an interaction with a Web-based application, Web service, or mobile app, all of which typically have integrated data validation to address error-prone data at the point of entry. Further, errors from external data sources are likely to increase as more organizations accelerate the sharing of data and services via API integration with their customers and supplier partners.
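
As a rough illustration of what validation at the point of entry can look like, here is a minimal Python sketch. The field names (name, email, signup_date), the rules, and the validate_entry function are assumptions made for this example; a real application would apply whatever rules its own data model requires, and the email pattern here is deliberately simplistic.

```python
import re
from datetime import date

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_entry(record):
    """Return a list of problems found in one submitted record.

    An empty list means the record passed every check.
    """
    errors = []

    name = (record.get("name") or "").strip()
    if not name:
        errors.append("name is required")

    email = (record.get("email") or "").strip()
    if not EMAIL_PATTERN.match(email):
        errors.append("email is not well formed")

    try:
        signup = date.fromisoformat(record.get("signup_date"))
    except (TypeError, ValueError):
        # TypeError covers a missing value, ValueError a non-ISO string.
        errors.append("signup_date must be an ISO date such as 2017-03-24")
    else:
        if signup > date.today():
            errors.append("signup_date is in the future")

    return errors


good = {"name": "Ada", "email": "ada@example.com", "signup_date": "2017-03-24"}
bad = {"name": "  ", "email": "not-an-email", "signup_date": "03/24/2017"}

print(validate_entry(good))
print(validate_entry(bad))
```

Rejecting or flagging a record when the list is non-empty keeps bad values out of downstream systems, which is cheaper than cleansing them later.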

Figure 3: Causes for Poor Data Quality


Source: 451 Research, January 2016 

The survey reports that most respondents had made investments in data quality technology. But even though they have something in place, people still have data quality on their minds, as 24% of those surveyed are currently evaluating or planning to evaluate tools within the next 12 months. Quite a diverse mix of tools has been employed or is under evaluation. Big Data, master data management (MDM), and data cleansing tools were the most common. One issue is that the breadth of tools can create complexity in DQM execution. Additionally, personnel responsible for managing data quality may not understand the conditions under which certain tools should be used. Moreover, they may not have a solid understanding of how data quality tools fit into the overall architecture, which combines machine learning and predictive analytics technologies. The right orchestration of DQM with ETL and analytics platforms can positively impact the work of data scientists and analytics teams.

In terms of ROI, over 25% of organizations reported a high return on investment from DQM. Also of note is the fact that 80% of surveyed organizations believe data quality is of high importance and warrants investment. This is a good indication that enterprises are taking data quality seriously.

Data Quality’s Contribution to Analytics

The value of business execution and resulting outcomes can be directly linked to solid analytics models and the quality of the data used for making decisions and directing operations across the enterprise. Here are three strategic benefits of high-quality data:

  1. It generates precise analytics that can be used to create business value through increased revenue and profit growth.
  2. It improves productivity by reducing time spent reconciling data.
  3. It improves quality through the reduction of errors.

Poor data quality can considerably diminish productivity and the quality of analytics results. It also creates a range of problems, such as lost productivity and the misuse of data scientists’ time when they are burdened with reconciling data and verifying their models. Of course, if analytics are not based on accurate data, the results will be misguided and lead to wrong decisions and economic losses.

Data is at the heart of every organization, and the quality of that data is critical to sustained business success. Having clean, consistent, and error-free data helps drive growth and enhance critical business processes. But it’s not easy to achieve a high level of data quality. The growing volume and variety of information has created dirty data that lives deep within every kind of system. Even a small amount of dirty data can create countless problems, wreaking havoc as it flows during the course of business processes, permeating many different information sources.

While senior management is generally shielded from the intricate details of the layers of technology and data supporting their decisions, they do have an appreciation for getting objective and reliable data and insights. It is therefore important, especially when addressing C-level decision makers, to present the business case for Big Data and analytics projects as one that encompasses data quality programs, sound ETL infrastructure, and a predictive analytics platform, and to tie that case to larger business objectives. To leverage data as a strategic asset, businesses must build confidence in the data they already have.


Achieving high-quality data can seem an elusive feat because it requires a focus on business benefits, a partnership between business people, data scientists, and IT, and a continual rediscovery of the ways in which people, processes, and technology help a company organize, govern, and share data. And finding the right combinations of tools to reduce the amount of time spent on data preparation can be more overwhelming than the data itself. To create a single source of truth that encompasses all relevant data and democratizes its key insights across the enterprise, bringing decision makers in every area of the company data-driven guidance, organizations need to take a disciplined approach and build an architecture that addresses the entire flow, from data capture to the extraction of valuable insights.

Want to learn more about how to get more from your data? Download our Signal Hub Technical Brief, which explains how your company can more efficiently extract analytic insights to deliver faster, deeper business impact across the enterprise. 


Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics.

