The Hadoop Summit conference, hosted by Hortonworks and Yahoo, has become a must-see Big Data event. The Hadoop distributed computing architecture is now an integral part of what it means to be a data scientist, and a few days of concentrated effort each year is enough to get a clear sense of where the industry is headed. The Hadoop Summit serves this purpose well by providing thought-provoking technical sessions, keynote addresses, and a vendor exhibition that brings many of the major players in the Hadoop ecosystem together under one roof.
But the conference is not just about the Hadoop platform; there’s plenty of focus on tools and methodologies for the data scientist. Attending is a great way to take the pulse of what other data scientists are doing. Here are some of the most helpful highlights from the show.
Apache Pig for Data Scientists
Apache Pig is a high-level scripting language for operating on large datasets inside Hadoop. Pig compiles the scripting language into MapReduce operations and provides optimizations to minimize the number of MapReduce jobs required. Pig is extensible via user-defined functions and loaders for customized data processing and formats.
Casey Stella, principal architect at Hortonworks, gave a talk describing some popular libraries for Apache Pig geared for use by data scientists. Stella discussed how you can integrate these libraries with Pig to provide a robust environment to do data science. One example is Apache DataFu for Pig, which computes quantiles of data and performs various data sampling methods.
Pig can be seen as a data aggregation tool as well as a platform for evaluating machine-learning models at scale. Machine learning at scale in Hadoop generally follows one of two strategies: build a single model on all (or almost all) of the data, or sample the large dataset down and build the model on that sample. Pig can assist in intelligently sampling the large data down into a training set that you can use with machine-learning algorithms.
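Stella’s sampling examples were, of course, Pig and DataFu operations; purely as an illustration of the underlying idea, here is a minimal reservoir-sampling sketch in Python (the function name and data are my own, not part of DataFu or Pig) that keeps a fixed-size uniform sample of a stream too large to hold in memory.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    This mirrors the idea behind sampling a huge dataset down to a
    training set without materializing the whole dataset in memory.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a randomly chosen element of the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Example: sample 5 records from a (simulated) large stream of log lines
if __name__ == "__main__":
    big_stream = (f"record-{n}" for n in range(1_000_000))
    print(reservoir_sample(big_stream, k=5, seed=42))
```

In Pig, this kind of logic lives inside user-defined functions such as those DataFu supplies, so the sampling runs as part of the MapReduce job rather than on a single machine.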
Gaining Support for Hadoop in a Large Enterprise Environment
Jennifer Lim, technology architect at Sprint, a major telecommunications company, presented how Sprint went from issuing a research challenge to enabling the entire enterprise in the area of analytics. Faced with competitive pressure to apply data science to business problems, Sprint repurposed an existing team and started with an initial proof of concept on Hadoop. Lim explained how Sprint is now in the midst of setting up a multi-petabyte enterprise system, supported by Hadoop, for multiple funded projects, all of which augment Sprint’s research facilities, and the company has a long list of use-case trials in the works. Gaining management support began with the “what” and quickly advanced to the “how”: clearly articulating how analytics success could be realized through a contemporary Big Data architecture based on Hadoop. Lim’s talk had the benefit of hindsight to show how Big Data initiatives can progress steadily inside a large company.
Practical Anomaly Detection
Ted Dunning, chief application architect for MapR, gave a well-crafted presentation on anomaly detection, or as he put it, “the art of automating surprise.” The first step is to define what is meant by “normal” and to recognize what it means to deviate from it. The basic idea behind anomaly detection is straightforward data science: you construct a model and look for data points that don’t match it. Although the mathematical foundations of this process can be formidable, contemporary techniques in data science provide ways to solve the problem in many common situations. Dunning described these techniques with particular emphasis on several real-world use cases:
- Rate shifts — determine trends in Web traffic, purchases, or the rate of progress of a given process
- Time series data — generated by machines (such as RFID scanning tags) or biomedical measurements, such as those taken from ultrasounds; facilitates time series analysis and forecasting
- Topic spotting — determine when new topics appear in a content stream such as Twitter
- Network flow anomalies — determine when systems with defined inputs and outputs act strangely
Dunning also pointed out that in building a practical anomaly-detection system, you have to deal with concrete details, including algorithm selection, data flow architecture, anomaly alerting, user interfaces, and visualizations.
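The core recipe Dunning described, model what is normal and flag what deviates from it, can be made concrete in a few lines. The sketch below is my own simplified illustration, not code from the talk: it treats event counts per time bucket as the data, uses a rolling window as the model of “normal,” and flags buckets that sit far outside it, in the spirit of the rate-shift use case above.

```python
from collections import deque
from math import sqrt

def detect_rate_anomalies(counts, window=30, threshold=4.0):
    """Flag indices whose event count deviates sharply from the recent baseline.

    counts    : sequence of event counts per time bucket (e.g., requests/minute)
    window    : how many recent buckets define "normal"
    threshold : how many standard deviations count as "surprising"
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, c in enumerate(counts):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((x - mean) ** 2 for x in history) / window
            std = sqrt(var) or 1.0  # avoid dividing by zero on perfectly flat traffic
            if abs(c - mean) / std > threshold:
                anomalies.append(i)
        history.append(c)
    return anomalies

# Example: steady traffic around 100/minute, then a sudden rate shift
traffic = [100 + (i % 7) for i in range(200)] + [400] * 5
print(detect_rate_anomalies(traffic))  # flags the buckets where the spike begins
```

A production system would replace the rolling Gaussian baseline with a model suited to the data, but the structure, fit “normal” and alert on departures from it, stays the same.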
Bayesian Networks with R and Hadoop
Data scientists know that when modeling intelligent systems for real-world applications, one inevitably has to deal with uncertainty. Ofer Mendelevitch, director of data science at Hortonworks, delivered a probing session on how Bayesian networks are well established as a modeling tool for expert systems in domains with uncertainty, mainly because of their powerful yet simple representation of probabilistic models as a network or graph. Bayesian networks are widely used in fields such as healthcare for medical diagnoses, genetic modeling, crime pattern analysis, document classification, predicting credit defaults, and gaming. But working with large-scale Bayesian networks is a computationally intensive endeavor.
Mendelevitch described his experience using R and Hadoop to implement large-scale Bayesian networks (hundreds to thousands of nodes), combining manual and automated methods: he started by defining the nodes, seeded the network with known edges based on expert knowledge, and then augmented it with automated structure learning (bnlearn algorithms such as hc, tabu, and rsmax2). He also gave a quick overview of Bayesian network theory and offered example applications in healthcare, security, education, finance, and tech support. He explained how to build a Bayesian network in R with the bnlearn package and how to perform inference on it.
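The session itself was built around R and bnlearn; as a language-agnostic toy, the following pure-Python sketch (the network, node names, and probabilities are invented for illustration) shows the two mechanics Mendelevitch covered: factoring a joint distribution along the graph and answering a query by summing out the unobserved variable.

```python
# A toy two-edge network: Flu -> Fever, Flu -> Cough.
# The conditional probability tables below are made up for illustration only.
P_flu = {True: 0.05, False: 0.95}
P_fever_given_flu = {True: 0.90, False: 0.10}   # P(fever=True | flu)
P_cough_given_flu = {True: 0.80, False: 0.25}   # P(cough=True | flu)

def joint(flu, fever, cough):
    """P(flu, fever, cough) factored along the network structure."""
    p = P_flu[flu]
    p *= P_fever_given_flu[flu] if fever else 1 - P_fever_given_flu[flu]
    p *= P_cough_given_flu[flu] if cough else 1 - P_cough_given_flu[flu]
    return p

def posterior_flu(fever, cough):
    """P(flu=True | observed symptoms) by brute-force enumeration."""
    num = joint(True, fever, cough)
    den = sum(joint(flu, fever, cough) for flu in (True, False))
    return num / den

print(posterior_flu(fever=True, cough=True))   # belief in flu given both symptoms
print(posterior_flu(fever=False, cough=True))  # cough alone is weaker evidence
```

Brute-force enumeration obviously does not scale to networks with thousands of nodes, which is exactly why the talk paired bnlearn’s structure learning and inference with Hadoop for the heavy lifting.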
How to Determine Which Algorithms Really Matter
Ted Dunning of MapR gave another talk of interest to data scientists. The session focused on how difficult it can be to figure out what really matters in data science. The set of algorithms that matters theoretically is very different from the one that matters commercially. Commercial importance often hinges on ease of deployment, robustness against perverse data, and conceptual simplicity; often, even accuracy can be traded off against these goals. Commercial systems also tend to live in highly interacting environments, so off-line evaluations may have limited applicability. Dunning discussed how to tell which algorithms really matter and went on to describe several commercially important algorithms, such as Thompson sampling (a.k.a. Bayesian Bandits), result dithering, and online clustering (e.g., k-means), and why they are important in an industrial setting.
The session provided several examples of how to determine which algorithm is best suited to a task. First, Dunning discussed how to improve recommendations by drawing on the most important algorithmic advances of the past ten years, namely result dithering (reordering recommendation results) and anti-flood. Next, he discussed Bayesian Bandits, possibly the best-known solution to the exploration/exploitation trade-off. Lastly, he discussed online clustering and how k-means clustering is useful for feature extraction or compression.
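Thompson sampling is simple to state concretely: keep a posterior over each option’s payoff, draw one sample per option from those posteriors, and play the option whose draw wins. The Beta-Bernoulli sketch below is a minimal illustration of that idea in Python (the simulated click-through rates are invented, and this is not code from the talk).

```python
import random

def thompson_sampling(true_rates, n_rounds=10_000, seed=7):
    """Beta-Bernoulli Thompson sampling over a set of options (e.g., banner variants).

    true_rates : hidden click-through rate of each option (used only to simulate feedback)
    Returns how many times each option was played.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [1] * k     # Beta(1, 1) uniform prior for each option
    losses = [1] * k
    plays = [0] * k
    for _ in range(n_rounds):
        # Draw one sample per option from its posterior and play the best draw.
        draws = [rng.betavariate(wins[i], losses[i]) for i in range(k)]
        choice = max(range(k), key=lambda i: draws[i])
        reward = rng.random() < true_rates[choice]   # simulated user click
        plays[choice] += 1
        wins[choice] += reward
        losses[choice] += 1 - reward
    return plays

# The option with the highest hidden rate should accumulate most of the traffic.
print(thompson_sampling([0.04, 0.05, 0.11]))
```

The appeal in an industrial setting is exactly what Dunning emphasized: the method is a few lines of code, is robust to messy feedback, and balances exploration and exploitation without hand-tuned schedules.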
Conclusion
This year’s edition of the Hadoop Summit in Silicon Valley did not disappoint. It showcased cutting-edge technology trends and much more. The Hadoop ecosystem is alive and well, with a growing number of data science and statistical learning solutions designed to provide actionable business intelligence for the enterprise.
Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics. He’s been involved with data science and Big Data since long before they came into vogue, so imagine his delight when the Harvard Business Review deemed “data scientist” the sexiest profession of the 21st century. He is also a recognized Big Data journalist and is working on a new machine-learning book due out later this year.