The Relentless Progress of Big Data and Machine Learning Technology

Posted by Anatoli Olkhovets on Tue, Oct 04, 2016

“Didn’t you just go to a similar Big Data conference recently?” my wife asked me. “How much could have changed in a few months?” I was hesitating about attending another conference in a short time span. My wife is right about most things, but in this case, I am glad I didn’t listen and went anyway. I learned about many new advances in both commercial and open source tools and across the whole technology stack: new hardware, in-memory databases, and new-and-improved tools.

Just a few years ago, the world was simple for enterprise buyers: You set up your trusty data warehouse, ran analyses with SQL queries, and used SAS for advanced statistics and models. But then along came Hadoop, an open-source, enterprise-grade distributed computing framework, and the whole data infrastructure stack was basically reinvented. Many leading-edge companies became early adopters. Initially, MapReduce was the only way to process data on Hadoop. One company taught all of its data scientists how to write MapReduce code in Java and how to use Hive and Pig (Hadoop tools that let users extract data and run analytics jobs). But many other frameworks quickly appeared: Mahout for machine learning, Apache Storm for streaming, and now Spark, which is all the rage because of its ability to process both batch and streaming data as well as its ever-improving machine learning library. Industry heavyweights such as IBM and Microsoft are announcing their Spark support. So the early-adopter clients who jumped on MapReduce are now facing a major (and costly) migration effort to the latest frameworks.

Beyond Spark, even more technologies are on the horizon. Apache Flink and Apache Apex are emerging, with some advantages over Spark (e.g., a better streaming engine). Vendors are also innovating across the whole stack: building specialized hardware (e.g., Google introduced the Tensor Processing Unit, a specialized chip for accelerating deep learning neural networks) and developing promising in-memory systems such as Alluxio. And the introduction of new technologies seems to be accelerating.

As an enterprise buyer, how do you navigate the volatile, fast-evolving, and constantly shifting technology landscape? How can you stay on the leading edge while also creating stable production applications? Broadly, there seem to be two ways to do it:

Option 1: Embrace the complexity. Keep a close eye on the evolving landscape, adopt new technologies, and build production-level applications. The upside to this strategy is that you stay on the leading edge of machine learning performance and accuracy and reap the competitive advantages that come from being there. You may also gain flexibility by mixing and matching from a variety of technologies. However, to succeed, you must essentially commit to becoming a software company and make the necessary investments. You’ll have to put a tremendous amount of effort into testing new technologies, stitching them together into an overall application, troubleshooting problems, and then maintaining the solution, especially if you are relying heavily on open source tools. Plus, you have to worry about cluster tuning and pipeline parallelization and be ready to regularly recode the solution to avoid obsolescence. Last but not least, you need to hire additional personnel who can bridge coding skills and data science. Some companies have found that any projected cost savings from open source options quickly get eaten up by higher people costs.

Option 2: Move up a level of abstraction. Let data scientists worry about model features and parameters, as opposed to pipeline parallelization. Let data engineers worry about which tables to join, as opposed to how best to partition them for efficient cluster performance. Let somebody else worry about swapping in a new data execution engine if the current one is falling behind in features or performance. This means adopting a platform that hides the underlying execution layer and allows users to focus on the logic and use cases. To do this, scientists need quick and reliable access to the company’s most relevant insights, or Signals. Signals are mathematical transformations of data that have proven to be valuable for solving a specific business problem.
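
To make the idea of an abstraction layer concrete, here is a minimal sketch in Python. It is an illustration only, not Signal Hub's actual API: the names Signal, LocalEngine, ThreadedEngine, and compute_signals are hypothetical, and a thread pool stands in for a real cluster engine. The point is that the signal logic is written once and the execution engine can be swapped without touching it.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

Row = Dict[str, float]


@dataclass(frozen=True)
class Signal:
    """Hypothetical: a named transformation from one data row to one number."""
    name: str
    transform: Callable[[Row], float]


class Engine(Protocol):
    """Hypothetical execution-layer interface that the platform hides from users."""
    def map_rows(self, fn: Callable[[Row], float], rows: List[Row]) -> List[float]: ...


class LocalEngine:
    """Runs the transformation sequentially in the current process."""
    def map_rows(self, fn: Callable[[Row], float], rows: List[Row]) -> List[float]:
        return [fn(row) for row in rows]


class ThreadedEngine:
    """Stand-in for a parallel engine; pool.map keeps results in input order."""
    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def map_rows(self, fn: Callable[[Row], float], rows: List[Row]) -> List[float]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, rows))


def compute_signals(signals: List[Signal], rows: List[Row], engine: Engine) -> Dict[str, List[float]]:
    """Evaluate every signal on every row using whichever engine is supplied."""
    return {s.name: engine.map_rows(s.transform, rows) for s in signals}


# Made-up example data and one made-up signal.
rows = [{"spend": 120.0, "visits": 4.0}, {"spend": 45.0, "visits": 3.0}]
spend_per_visit = Signal("spend_per_visit", lambda r: r["spend"] / r["visits"])

local_result = compute_signals([spend_per_visit], rows, LocalEngine())
threaded_result = compute_signals([spend_per_visit], rows, ThreadedEngine(workers=2))
print(local_result)  # {'spend_per_visit': [30.0, 15.0]}
```

In this sketch, the data scientist only ever edits the Signal definitions. Replacing LocalEngine with ThreadedEngine (or, in a real platform, a newer cluster framework) changes how the work runs but not what is computed.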

Both options are valid, but few companies will have the desire, talent, or deep pockets to embrace the complexity (Option 1). At Opera Solutions, we believe that most enterprises will be much better served with the alternative option, which is to implement an abstraction layer (Option 2). There’s no need for an enterprise to solve a problem that somebody else already solved; use your precious analytics and data resources wisely. In other words, enterprises should let their data scientists do data science and leave software engineering to companies like Opera Solutions. This is what we set out to do by implementing a Signal-oriented software solution (Signal Hub) and enabling enterprises to rapidly build analytics at an abstraction level we call the Signal Layer.

For more information on Opera Solutions’ analytics platform, download our white paper “Delivering Big Data Success with the Signal Hub Platform.”


Anatoli Olkhovets is Vice President of Product Management at Opera Solutions.

Topics: Big Data, Data Science, Machine Learning, Hadoop, Signal Hubs, Spark