Big data moves fast, but you can prime your company to always have the latest technology and ultimately grow with its data.
When you bring up the subject of Big Data, the conversation can go one of two ways: into the past or into the future. If the first question to come up is, “What kind of data warehouse do you use?” you know you’re talking to someone living in the past. The new reality is that the demand for data storage is increasing much faster than the cost of data storage is decreasing, meaning enterprises will run out of affordable storage soon, if they haven’t already. A more forward-looking question might be, “What’s your plan to deal with the growing volume of data?”
Chances are, they’ll respond with some form of in-database or in-memory approach, courtesy of the mainstream database vendors. These vendors are jumping onto the Big Data, Fast Data trend, with a particular emphasis on conducting analytics within their databases. The premise of in-database analytics is simple: the closer the analytics processing is to the data itself, the faster it can run. The mainstream vendors have suggested two specific approaches along this theme:
- Pure in-memory. Early adopters are drawn to in-memory databases that promise remarkable performance and scalability, albeit at a steep cost. The basis for in-memory databases is the placement of the underlying data stores into DRAM, which is substantially faster and closer to the processor than solid state drives or traditional disks. This model has been adopted most prominently by SAP with its HANA in-memory database. A pure in-memory route rests on a critical assumption, however: that the cost of storage will decrease faster than the demand for additional data increases. This is a dangerous assumption in the current environment, as the variety and volume of data are growing at a substantially higher rate than costs are expected to decline.
- Hybrid/tiered. To optimize costs against performance, several vendors, particularly Oracle and Teradata, have proposed smart algorithms that push critical data from a data warehouse/lake environment to the in-memory database prior to processing. Hence, there’s an ongoing replication of the warehouse within DRAM, based on upcoming processing jobs, with the data being transferred back to drives following the computation cycles (a simplified sketch of this tiering pattern follows this list). If done well, performance can be similar to pure in-memory at a fraction of the cost, but the underlying optimization will become increasingly complex as the volume and variety of data expand.
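To make the tiering idea concrete, here is a deliberately simplified sketch of a two-tier store that promotes the partitions a job needs into an in-memory tier and demotes the least recently used partition when capacity runs out. The `TieredStore` class, its methods, and the LRU policy are our own illustrative assumptions, not any vendor’s actual algorithm or API.

```python
# Conceptual sketch only: a toy two-tier store that promotes the partitions a job
# needs into memory before processing and demotes cold partitions when full.
from collections import OrderedDict

class TieredStore:
    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity   # max partitions held in DRAM
        self.disk = {}                           # stand-in for warehouse storage
        self.memory = OrderedDict()              # LRU-ordered in-memory tier

    def write(self, partition_id, rows):
        self.disk[partition_id] = rows           # all data lands on disk first

    def promote(self, partition_id):
        """Pull a partition into memory, evicting the least recently used if full."""
        if partition_id in self.memory:
            self.memory.move_to_end(partition_id)
            return self.memory[partition_id]
        if len(self.memory) >= self.memory_capacity:
            cold_id, cold_rows = self.memory.popitem(last=False)
            self.disk[cold_id] = cold_rows       # demote cold data back to disk
        self.memory[partition_id] = self.disk[partition_id]
        return self.memory[partition_id]

    def run_job(self, partition_ids, compute):
        """Promote the partitions a job needs, then run the computation in memory."""
        hot = [row for pid in partition_ids for row in self.promote(pid)]
        return compute(hot)

store = TieredStore(memory_capacity=2)
store.write("2024-01", [10, 20, 30])
store.write("2024-02", [5, 15])
print(store.run_job(["2024-01", "2024-02"], compute=sum))   # 80
```

Even in this toy form, the trade-off is visible: the promote/demote bookkeeping is where the complexity lives, and it grows with the variety of data and jobs that must be anticipated.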
A third alternative, not dependent on a specific legacy architecture, also exists. We call this approach a Signal Hub. Signal Hubs extract, store, and maintain the Signals, or predictive patterns, found in the underlying data, drastically reducing storage and ongoing processing requirements. A Signal Hub can sift through a multi-petabyte data warehouse, along with vast external data sources flowing in, and extract millions of entities with thousands of Signals, at a small fraction of the storage and processing cost of the other two alternatives.
The magic behind the Signal Hub is a series of proprietary Signal schemas that enable the extraction of Signals from large data sets in a fairly short time frame. This process is computationally intensive but is conducted only once, on the historical data.
These extracted Signals are tested for predictive value and then stored, updated, and maintained as new data flows into the Signal Hub. The ongoing maintenance and updating of Signals is performed strictly on data that is net new to the system, cutting processing requirements considerably. The resulting analytics applications are built on these stored Signals, which further reduces the processing required for any individual model or query.
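To illustrate the “compute once, then update only on net new data” idea, here is a minimal sketch that keeps per-entity sufficient statistics so new batches can be folded in without re-reading history. The `SignalStore` class and its toy notion of a Signal (a running average and recency per customer) are invented for illustration; Opera Solutions’ actual Signal schemas are proprietary and not represented here.

```python
# Minimal sketch: maintain Signals incrementally from net new data only.
from collections import defaultdict

class SignalStore:
    def __init__(self):
        # Sufficient statistics per entity, so updates never re-read history.
        self.stats = defaultdict(lambda: {"count": 0, "total": 0.0, "last_seen": None})

    def ingest(self, transactions):
        """Fold a batch of (customer_id, amount, date) records into stored stats."""
        for customer_id, amount, date in transactions:
            s = self.stats[customer_id]
            s["count"] += 1
            s["total"] += amount
            s["last_seen"] = max(date, s["last_seen"] or date)

    def signals(self, customer_id):
        """Derive Signals from stored stats; no raw history is touched."""
        s = self.stats[customer_id]
        return {
            "avg_spend": s["total"] / s["count"] if s["count"] else 0.0,
            "purchase_count": s["count"],
            "last_seen": s["last_seen"],
        }

store = SignalStore()
store.ingest([("c1", 40.0, "2024-01-03"), ("c1", 60.0, "2024-02-10")])  # one-time historical pass
store.ingest([("c1", 20.0, "2024-03-01")])                              # later: net new data only
print(store.signals("c1"))  # {'avg_spend': 40.0, 'purchase_count': 3, 'last_seen': '2024-03-01'}
```

The point of the pattern is that downstream models and queries read the compact, already-maintained Signals rather than rescanning the raw warehouse each time.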
This approach has several benefits over the pure in-memory or tiered approaches applied by the in-database movement:
- Infrastructure agnostic. The database vendors are clearly trying to retain market share with their in-database approaches, as smaller, open source players attack their RDBMS businesses with lower-cost NoSQL and NewSQL alternatives. Because Signal Hubs can work alone or with existing systems, enterprises gain the flexibility to change and adapt their warehouse and database strategies over time without having to completely start over.
- Truly scalable. As enterprises begin to add advanced analytics to their repertoire, they will inevitably seek to add new internal, external, structured, and unstructured data to both existing analytics applications and new complex applications. With a Signal Hub, new data sources can be incorporated into pre-existing Signals and models and, depending on the historical record of those data sources, be added all at once or gradually. In either case, new hardware is not necessarily required, nor does the new data need to be stored prior to use, which is a big time and money saver.
- Optimization not required. Moving the analytics engine from a data warehouse to the Signal Hub represents a dramatic shift in the resources required to maintain and update models and Signals, allowing for shorter, more focused processing and fewer trade-offs. Whereas a tiered database approach must identify, from within the entire warehouse, the data likely to require more frequent or complex processing, the Signal Hub approach builds that prioritization into its design. Enterprises can thus choose the real-time or batch route based purely on economic value, knowing that the batch route will not leave them days or weeks behind the curve.
- Signals: shortcut to results. One thing we haven’t touched on in this post as much as in others is the sheer value of Signals versus cleaned or linked data. Nearly all other vendors in the space suggest building analytics on the raw data, on the assumption that enterprises don’t necessarily know what to look for in their data. This is an interesting and comforting argument for many enterprises: they have watched their industries change and adapt to keep up with global competition and with advances in technology, and they do not believe they can foresee what may come next.
The Signal extraction process is designed purely to identify anomalies and patterns, however subtle, as they emerge in any data environment, in any context. This is the beauty of the machine: it seeks out correlations and discrepancies in the data, regardless of whether they fit a logical design built by a human. This is how unnatural combinations such as beer and diapers find their way into grocery promotions; they’re based on correlations in purchase behavior, not human instinct.
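As a toy illustration of how such correlations can surface from the data itself, the sketch below scores item pairs from purchase baskets by lift, i.e., how much more often two items appear together than independence would predict. The `pair_lift` function and the sample baskets are invented for this example and are not part of the Signal Hub methodology.

```python
# Illustrative only: score item pairs by lift (observed co-occurrence vs. chance).
from itertools import combinations
from collections import Counter

def pair_lift(baskets):
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(pair for basket in baskets
                          for pair in combinations(sorted(set(basket)), 2))
    lift = {}
    for (a, b), together in pair_counts.items():
        expected = (item_counts[a] / n) * (item_counts[b] / n) * n  # chance co-occurrence
        lift[(a, b)] = together / expected
    return lift

baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "chips"],
]
scores = pair_lift(baskets)
print(round(scores[("beer", "diapers")], 2))  # 1.33: bought together more often than chance
```

A lift above 1 flags a stronger-than-chance association, whether or not a human would have thought to look for it.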
The Signal Hub approach assumes that the fundamental business model of a particular enterprise will not change frequently, and therefore that the Signals powering that business will not change radically all that often (this is not to say that the direction or values of Signals will not change day to day, only that the Signals themselves do not). The underlying business questions asked of these Signals revolve around common business themes, namely increasing revenue or lowering costs, using levers that are already fairly well known to management. Signals, and the models built on them, enable management to extract value from their business entities more efficiently and rapidly than open-ended sifting through linked data.
The Signal Hub approach discussed here is Opera Solutions’ proprietary offering, helping enterprises transition from traditional methods of data management to a flexible method that allows for both data and technological growth. Click here to learn more about Signal Hubs.
To learn how Signal Hubs can impact marketing efforts, download our white paper “Retailing: Increasing Revenue and Profit with Advanced Analytics.”
Ben Weiss is a director in the Office of the Chairman at Opera Solutions.