Why Warehousing Data and Mining It Later Is No Longer Sufficient

Posted by William Dunne on Thu, Mar 27, 2014

Imagine trying to fill a thimble with water from a fire hose. Unfortunately, if you’re still storing data with the once-good intention of mining it later, you might as well give it up. In this analogy, the wasted water represents not only the gushing, wasted data but also the enormous waste of time, money, and resources spent maintaining an outdated concept. Sure, your data warehouse is full, but full of what? If you’re only capturing a fraction of what is available, what are the chances that what you have is really valuable? And what valuable information are you letting go to waste?

It was only two years ago that Target Corporation promised to send shoppers coupons for store items before the shoppers even knew they wanted those items. Target could see the rate at which data was flowing into its organization and knew that mining it later meant a lot of missed opportunities. And whether your company is bigger or smaller than Target, or just plain different, the point is the same: If you’re still warehousing and mining data, you need to completely change your thinking. Here are five areas that need to be seriously reconsidered:

Area #1: Data Warehouses — The problem with data warehouses isn’t only that they’re too small. It’s also that they’re too structured. They accept very specific types of data and sort incoming data into preset formats. A place for everything, and everything in its place.

This is where size comes in: The growth of digital data in the world is accelerating faster than the capabilities of conventional IT methodologies. Back in 2005, when Target was in the middle of its torrid advance, the total of digitized data in the world, according to International Data Corporation (IDC), was 130 exabytes. An exabyte is one billion gigabytes. By 2020 those 130 exabytes are projected to have grown to 40,000 exabytes, or 40 zettabytes. That would be an increase of roughly 30,000% (about 308-fold) in a decade and a half.
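
The arithmetic behind these figures is easy to check. The short Python sketch below assumes decimal (SI) storage units and uses only the 2005 and 2020 figures quoted above; the variable names are illustrative.

```python
# Decimal (SI) storage units, expressed in bytes (an assumption;
# binary units such as GiB would give slightly different ratios).
GIGABYTE = 10**9
EXABYTE = 10**18
ZETTABYTE = 10**21

# Figures quoted in the text (attributed there to IDC).
data_2005_eb = 130
data_2020_eb = 40_000

gigabytes_per_exabyte = EXABYTE // GIGABYTE
data_2020_zb = data_2020_eb * EXABYTE / ZETTABYTE
growth_factor = data_2020_eb / data_2005_eb
percent_increase = (growth_factor - 1) * 100

print(f"1 exabyte = {gigabytes_per_exabyte:,} gigabytes")
print(f"2020 projection = {data_2020_zb:g} zettabytes")
print(f"Growth factor = {growth_factor:.0f}x")
print(f"Percent increase = {percent_increase:,.0f}%")
```

The growth factor works out to a little under 308, which is an increase of just over 30,600%.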

Area #2: Unstructured Data — The vast majority of the growth mentioned above is in the form of unstructured and semi-structured data flowing from billions of connected smart devices. Unstructured and semi-structured data is any data that doesn’t arrive in a neat, orderly fashion, the way it does when we fill in a field on a Web site or scan a customer loyalty card at the supermarket. So news stories, blog articles, cell phone pings, images, video, audio, text documents, and much more fall into this category. A vast amount of value exists in this data, but if your company has no way to interpret it, it’s useless.

Area #3: Lead Time — The acceptable amount of time between attaining information and acting upon insights gained from that information has shrunk dramatically. New insights have to be developed and acted upon in near-real time. Not only do your customers expect this, but your competitors are likely already doing it. (Case in point: Even small Web sellers can react in seconds to a competitor’s price drop.)

Area #4: Silo vs. Lake — In the traditional data warehouse architecture, data is pre-categorized at the point of entry. This siloing of data is a natural consequence of the conventional corporate structure, with different departments and divisions having different IT requirements. Searches and queries, accordingly, are limited in scope and can be complicated to set up.

A "data lake," however, is much simpler, architecturally speaking. It accepts data coming in from any and all Internet-connected devices and pours all of it into the same reservoir. Data lakes are an acknowledgment that Big Data searches are omnivorous, looking for connections and patterns among any and all data types. Pre-categorization and special handling are largely unnecessary. This simplified storage technology is the antithesis of siloing and is ultimately much more affordable.
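
To make the silo-versus-lake contrast concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not a real warehouse or lake product: the schema, the function names, and the sample records are all hypothetical. The warehouse loader accepts only records that match a preset format, while the lake keeps everything in raw form and applies structure only when a question is asked.

```python
import json

# Hypothetical warehouse table: only these fields, fixed in advance.
WAREHOUSE_SCHEMA = ("customer_id", "sku", "amount")

def load_into_warehouse(record, table):
    """Schema-on-write: accept a record only if it matches the preset format."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        return False  # anything that does not fit is dropped at the door
    table.append(tuple(record[field] for field in WAREHOUSE_SCHEMA))
    return True

def load_into_lake(record, lake):
    """Data lake: keep every record in raw form and interpret it later."""
    lake.append(json.dumps(record))

def query_lake(lake, predicate):
    """Schema-on-read: structure is applied only when a question is asked."""
    return [rec for rec in map(json.loads, lake) if predicate(rec)]

# Hypothetical incoming records: a sale, a phone ping, a customer comment.
incoming = [
    {"customer_id": 7, "sku": "A-100", "amount": 19.99},
    {"device": "phone-42", "event": "ping", "lat": 44.98, "lon": -93.27},
    {"customer_id": 7, "text": "Love the new store layout!"},
]

warehouse, lake = [], []
for rec in incoming:
    load_into_warehouse(rec, warehouse)
    load_into_lake(rec, lake)

print(len(warehouse), "of", len(incoming), "records fit the warehouse")
print(len(lake), "of", len(incoming), "records kept in the lake")
print(query_lake(lake, lambda r: r.get("customer_id") == 7))
```

In this toy run, only the sale fits the warehouse, while the lake can later connect that sale to the same customer's free-text comment.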

Area #5: Data Science and Advanced Analytics — Another departure from traditional thinking comes with recognition of the limitations of the more rudimentary ideas about data analytics: the idea, for example, that visualization is the be-all and end-all of the Big Data revolution. Or that various forms of textual analyses are all that can be applied to document management. These concepts are not in the same realm as the analytics designed for plumbing the depths of complex, multivariate data sets and calculating the risks involved in decision making. This is where data scientists begin to take on the most ambitious challenges. 

Mathematicians, statisticians, and analysts first do the usual, which is to ferret out and assess all of the potentially meaningful (in a business sense) information, such as average dollars spent on a particular type of transaction. Then they do the unusual. They apply special testing and processing to distill the data down to a smaller, richer, and more manageable set. It’s not unusual for this operation to reduce, say, a 50-terabyte data volume down to a few gigabytes, which vastly simplifies the subsequent operations. 
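
As a rough illustration of the "usual" first step, the Python sketch below reduces a list of raw transactions to one summary row per transaction type. Simple aggregation stands in here for the special testing and processing described above, which the text does not specify; the sample data and the `distill` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw transactions: (transaction_type, dollars_spent).
raw_transactions = [
    ("grocery", 42.50),
    ("grocery", 57.50),
    ("pharmacy", 12.00),
    ("electronics", 300.00),
    ("pharmacy", 18.00),
]

def distill(transactions):
    """Collapse raw rows into a count and average spend per transaction type."""
    totals = defaultdict(lambda: [0, 0.0])  # type -> [count, dollar sum]
    for tx_type, dollars in transactions:
        totals[tx_type][0] += 1
        totals[tx_type][1] += dollars
    return {
        tx_type: {"count": count, "avg_dollars": dollar_sum / count}
        for tx_type, (count, dollar_sum) in totals.items()
    }

summary = distill(raw_transactions)
for tx_type, stats in summary.items():
    print(tx_type, stats)
```

Five raw rows become three summary rows. The same idea, applied at scale and with far more selective processing, is what lets a very large data volume shrink to a much smaller working set.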

Scientists then work on the distilled data set to further refine the material to include data mechanisms called Signals. Signals are decision-support indicators that have proven to be of value for solving a particular problem. 

The result, overall, is a far more sophisticated level of extraction, combined with domain expertise and machine-learning techniques, to deliver genuine guidance on decisions and actions to the end user. Such guidance can come in the form of a prioritized to-do list, a score that indicates risk, opportunity, or urgency, or other actions such as pricing, partnerships, and more.
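
One way to picture a score that feeds a prioritized to-do list is sketched below in Python. This is purely illustrative and is not the method described above: the signal names, the weights, and the weighted-sum scoring are all assumptions made for the example.

```python
# Hypothetical signal values per account, each scaled to the range 0..1.
accounts = {
    "acct-001": {"churn_risk": 0.9, "upsell_opportunity": 0.2},
    "acct-002": {"churn_risk": 0.1, "upsell_opportunity": 0.8},
    "acct-003": {"churn_risk": 0.5, "upsell_opportunity": 0.5},
}

# Hypothetical weights; in practice these would come from domain
# expertise or a trained model rather than being hard-coded.
WEIGHTS = {"churn_risk": 0.7, "upsell_opportunity": 0.3}

def urgency_score(signals):
    """Combine signal values into a single score using a weighted sum."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

# Highest score first: this ordering is the prioritized to-do list.
todo = sorted(accounts, key=lambda a: urgency_score(accounts[a]), reverse=True)

for account in todo:
    print(account, round(urgency_score(accounts[account]), 2))
```

With these made-up numbers, the account with the highest churn risk lands at the top of the list.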

Whether you’re an old hand at IT or the new CEO of your own startup, understanding both the old ways of data management and how far the world has come in the ways we think about data will help drive your business forward.

Learn more about how data science and advanced analytics can prevent you from making costly mistakes and help you make the right business decisions going forward in our whitepaper, “Intelligence Technology: Harnessing the Value in Big Data.”


Topics: Big Data, Signals