5 Obstacles to Achieving Scalable Data Science, and How to Overcome Them

The struggle is real, and it’s becoming increasingly apparent to companies that have dipped their toes into popular data science tools. As enterprises test the limits of their new tools, old technology, and data scientists’ time, their infrastructure is starting to show its cracks. Read on to see how these issues are revealing themselves and, more importantly, to gather some ideas on what to do about them.

Over the past year, I have been averaging 2–3 customer meetings per week, resulting in over 100 customer and partner conversations around Big Data, analytics, and data science for the enterprise. From these conversations, I have found one key recurring theme: scale. Large enterprises no longer want to build one model quickly or implement just one use case in production. They all struggle with a large backlog of ideas. They need a way to rapidly turn these many ideas into real use cases that deliver tangible business value.

However, many companies simply can’t find a pathway to make this happen. Across my numerous conversations, I noticed very similar patterns and identified 5 common obstacles that can prevent companies from achieving scale for data science.

While the symptoms of revenue growth challenges are clear, the causal factors can be less obvious. Forensic analysis can generally identify the causal factors, but success depends on the skill of the analysts, and actionable analysis may require substantial time to complete. In many cases, Big Data analytics can play a role in accelerating the analysis, uncovering previously unsuspected causes and improving accuracy of the findings. Additionally, Big Data analytics can help identify and validate a course of action most likely to effectively address revenue growth challenges.

Ultimately, the answer to revenue growth challenges lies in identifying specific actions to take and then precisely executing these actions. Businesses that have already embraced Big Data analytics as a tool to support their overall corporate strategy can drive additional value by making Big Data analytics the cornerstone of their strategy for driving profitable growth.

1. Too many tools, too many technologies. Hardly a week goes by without a new Apache open-source project announcement or a commercial release in the Big Data space. Moreover, this space is evolving very quickly. First, MapReduce was hot, then Mahout, then Storm, and now Spark, with even more projects on the horizon (Flink or Apex, anyone?). Innovation is taking place across the whole stack, including in-memory distributed databases, management layers, and new hardware. Enterprises are afraid to commit to a particular technology because they are wary of its longevity. For example, one company I worked with made an early bet on Hadoop a few years ago and wrote a lot of what is now legacy MapReduce code (using Java, Hive, and Pig). The company now faces a significant migration effort to more modern technologies such as Spark.

2. Data, data everywhere, but no easy way to derive value from it. I often hear companies proudly talking about their nicely organized data warehouses that are chock-full of millions of variables. In my opinion, this should be a cause for concern, because with that much data, only deeply skilled data specialists are going to be able to make sense of it all. Furthermore, any meaningful analysis will require time-consuming joining, cleaning, and processing of this data. Simply organizing raw data in a warehouse will not give companies the highly processed, useful variables they need to drive business value.

3. Artisan approach: each use case developed from scratch. Today, enterprise data scientists are like the artisans of the olden days: highly skilled but hampered by a production process that kills their efficiency. For example, here’s what happens when a request comes into the data science team: They first have to become “data plumbers,” figuring out data sources, how to bring data into the application, transformation steps, and variables to be created. Then, and only then, can they finally begin to build and test models and eventually package their code. Another month goes by, another request comes in, and the process repeats itself, with only limited reuse of the work completed in the previous project. Additionally, the people who do this work must be experts in many areas. They are simultaneously part data scientist, part software engineer, and part database administrator. This combination of skills is rarely found in the marketplace.

4. Operationalizing data science: an often-overlooked challenge. After moving to Hadoop, one enterprise said that “life was great, for the first six months.” Costs were low, and data scientists and analysts were having a blast playing around with the many open-source and commercial tools. However, over time, they created hundreds of models, launched more and more jobs on the cluster, and massively duplicated data. They soon found themselves buying more nodes every week, and suddenly Hadoop did not seem so low-cost anymore. The painful lesson: Creating a new model is easy; managing hundreds of them in production, all with different parameters, workflows, and invocation times, is hard.

5. Lots of experimentation but limited adoption. The most frequent word I hear from our customers is “experimentation.” For all the talk about machine learning, artificial intelligence, and cool new software and hardware innovations, most enterprises still run the bulk of their production workloads on their trusted legacy systems and tools such as Teradata and SAS. Data science teams tend to use new technologies such as Hadoop and new machine learning libraries only for experimentation. The unwillingness to jump on a new horse and ride it means that companies aren’t getting the real benefits from Big Data and better analytics.
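
The limited reuse described in obstacle 3 can be made concrete with a minimal sketch. It assumes a small in-process registry in which each derived variable is defined once and then shared across use cases. The names `VariableRegistry`, `register`, `compute`, and the sample customer fields are hypothetical and chosen only for illustration; they are not the API of any product mentioned in this post.

```python
from typing import Any, Callable, Dict, List

# One raw input row, e.g. a customer with a list of purchase amounts.
Record = Dict[str, Any]


class VariableRegistry:
    """Hypothetical registry: define a derived variable once, reuse it everywhere."""

    def __init__(self) -> None:
        self._variables: Dict[str, Callable[[Record], float]] = {}

    def register(self, name: str) -> Callable:
        """Decorator that stores a transformation under a shared name."""
        def wrap(func: Callable[[Record], float]) -> Callable[[Record], float]:
            if name in self._variables:
                raise ValueError(f"variable already registered: {name}")
            self._variables[name] = func
            return func
        return wrap

    def compute(self, names: List[str], record: Record) -> Dict[str, float]:
        """Build the inputs for one use case from already-defined variables."""
        return {name: self._variables[name](record) for name in names}


registry = VariableRegistry()


@registry.register("total_spend")
def total_spend(record: Record) -> float:
    return float(sum(record["purchases"]))


@registry.register("avg_purchase")
def avg_purchase(record: Record) -> float:
    purchases = record["purchases"]
    return float(sum(purchases)) / len(purchases) if purchases else 0.0


# Illustrative sample data, not taken from any real system.
customer = {"id": "c-1", "purchases": [20.0, 30.0, 10.0]}

# Two different use cases draw on the same definitions instead of rebuilding them.
churn_inputs = registry.compute(["total_spend", "avg_purchase"], customer)
upsell_inputs = registry.compute(["total_spend"], customer)
```

In this sketch the second use case needs no new "data plumbing": it simply names a variable that the first project already defined. A production setup would also need storage, versioning, and scheduling, which are out of scope here.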

The Big Data and advanced analytics industry needs to change its mindset from “building tools to enable artisan data scientists” to “industrializing data science for the enterprise.” Vendors that are able to crack this problem will be the ones that are ultimately successful in the marketplace.

Opera Solutions’ Signal Hub addresses the challenges described here. To learn more, download the “Signal Hub Technical Brief.”

Anatoli Olkhovets is Vice President of Product Management at Opera Solutions.
