As a relatively new term, “data science” can mean different things to different people, due in part to the hype surrounding the field. We hear just as much about “Big Data,” often in the same breath, and how it is changing the way companies interact with their customers. This raises the question: how are the two related? Unfortunately, the hype can cloud our understanding of how these technologies are shaping our increasingly data-driven society. There truly is a paradigm shift under way in how society uses data, but the hype isn’t helping to clarify data science’s exact role in Big Data. In this article, we try to put to rest many of the misunderstandings surrounding data science.
Data Science Defined
Let’s start with a short and concise definition found on Wikipedia:
Data science is the study of the generalizable extraction of knowledge from data.
This high-level definition suggests that data science can be applied to the solution of business problems for organizations large and small, across all industries, for-profit as well as not-for-profit. The common goal is to increase the value of enterprise data assets. Data science is part of a tiered collection of related technologies — Big Data is facilitated by data science, which in turn is facilitated by machine learning. Bear in mind that data science is not just for massive data stores; small companies can take advantage of data science as well.
From the excitement in the industry today, you might conclude that data science and machine learning are brand-new disciplines and that Big Data is a new concept. On the contrary, data science is largely a new name for a combination of disciplines that have been around for decades: computer science, software engineering, statistics, and applied mathematics. What is new today is the amount of data being collected, coupled with a quickly evolving set of compute resources to process it all.
In addition, there’s a lot of rebranding going on. Machine learning, for example, used to be called “data mining,” or “knowledge discovery in databases,” and “business intelligence” has morphed into “Big Data analytics.” But rapid changes in terminology are not all that’s at play here. There are many profound changes happening, and it’s important to understand the context in which data science is working today to solve business problems in ways never considered possible.
Characterizing Data Science
Data science has two main manifestations: activities based on theoretical foundations and activities with a basis in engineering. As an example of the former, someone with a Ph.D. in applied statistics who works on the “science of data” can be thought of as a data scientist. But there’s more to data science than statistics; computer science and pure mathematics play an important role, too. The combination of these disciplines, data science, is truly a discipline of its own. As an example of the latter, a software engineer with knowledge of machine learning principles can also correctly be thought of as a data scientist. And yet not everyone who specializes in computer science is a data scientist.
So what is unique about data science? Data science is a confluence of disciplines including but not limited to computer science, mathematical statistics, probability theory, machine learning, software engineering, distributed computer architectures, and data visualization. All of these areas have been around for decades, but coupled with Big Data volumes of information, these fields have evolved quickly to embrace the demands of unique combinations of data volume (quantity of data), velocity (speed at which data is collected), and variety (disparate types of data being collected). These are the commonly acknowledged three V’s of Big Data.
Machine learning is an integral part of data science. Supervised machine learning is typically associated with prediction: for each observation of the predictor measurements (also known as feature variables), there is an associated response measurement (also known as the label or, in classification problems, the class label). In supervised learning, a model is trained on labeled data, where the response variables are known, and then run against new data to predict response variables that are unknown. Many classical learning algorithms, such as linear regression and logistic regression, operate in the supervised domain.
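To make the train-then-predict idea concrete, here is a minimal sketch of supervised learning using simple linear regression with one predictor, fitted by ordinary least squares. The data (years of experience versus salary) and the function names `fit_line` and `predict` are made up for illustration; a real project would normally rely on an established statistics or machine learning library rather than hand-written code.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) model to a new observation."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical labeled training data (assumption, not from the article):
# feature = years of experience, response = salary in thousands.
train_x = [1, 2, 3, 4, 5]
train_y = [45, 50, 55, 60, 65]

# Training: the responses are known, so the model can learn from them.
model = fit_line(train_x, train_y)

# Prediction: the responses for these new observations are unknown.
new_x = [6, 7.5]
predictions = [predict(model, x) for x in new_x]
print(predictions)
```

The training data here happens to lie exactly on a line, so the fit is perfect; with real data the fitted line would only approximate the responses.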
Unsupervised machine learning is a more open-ended style of statistical learning. Instead of using labeled data sets, unsupervised learning is a set of statistical tools intended for applications where there are no labels (response variables). Instead we just find patterns in the set of observations. In this case, prediction is not the goal because the data set is unlabeled, i.e. there is no associated response variable that can supervise the analysis. Rather, the goal is to discover interesting things about the measurements based on the feature variables. For example, you might find an informative way to visualize the data or discover subgroups among the variables or the observations.
There are a number of regularly used unsupervised learning methods. Probably the most commonly used is k-means clustering, which groups data points according to a distance measure (e.g. Euclidean distance) so that the resulting clusters highlight patterns in the data. Another unsupervised method is principal component analysis (PCA). This method reduces the dimension of a data set by replacing the original predictors with a smaller number of new variables, the principal components, each a linear combination of the originals chosen to preserve as much of the variance in the data as possible. This is a form of feature engineering, since the principal components can then be used as predictor variables.
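As an illustration of clustering without labels, the following is a minimal sketch of k-means with Euclidean distance (the standard assign-then-update iteration). The points, the choice of k = 2, and the function name `kmeans` are assumptions made for the example; production work would typically use a tested library implementation with more careful initialization.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical unlabeled observations with two feature variables each.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]

centroids, assignments = kmeans(points, k=2)
print(sorted(centroids))
print(assignments)
```

Note that no response variable appears anywhere: the two groups emerge purely from the distances between observations. Which group receives index 0 or 1 depends on the random starting centroids.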
What is a Data Scientist?
Data science draws on four specialties: mathematics/statistics, computer science, software engineering, and domain expertise. Each of these areas requires dramatically different skills, education, and training, with very little overlap. For example, suppose a company wants to identify outliers in millions of insurance claims in order to detect fraud. It would need someone with experience in the insurance industry, a computer scientist who can manipulate large data sets, a statistician who specializes in sampling and multivariate statistical analysis, and a software engineer to write the code and create the application that delivers results to end users. They are all considered data scientists, and yet most of them won’t have “data scientist” on their business cards.
The diagram below is not meant to be a precise description of the overlap, but a graphic guideline. Notice that the unicorn — a mythical creature not actually found in nature — represents a very rare intersection of all the disciplines.
The Data Science Process
The data science workflow is summarized in the figure below. Here are the critical steps:
- Ask the right question. A recipe for disaster is to think that value can be derived from data without knowing what you’re looking for.
- Locate and secure the appropriate data sets for the problem being solved.
- Perform “data munging” or cleaning/transforming data, exploratory data analysis, and feature engineering. This task can often account for 80% of the time required for a data science project, and there is some back and forth within this step.
- Model the data: choose an appropriate algorithm, fit the model, and validate the model.
- Prepare visualizations that clearly illustrate key discoveries.
- Communicate the results effectively enough to provide actionable business intelligence. This is often referred to as “data storytelling.”
- Repeat the process with new insights.
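The middle steps of this workflow (munging, modeling, validating, and reporting) can be sketched in a few lines. Everything below is hypothetical: the question (how do sales respond to advertising spend?), the raw records, and the helper names `clean`, `fit`, and `mean_abs_error` are assumptions chosen for illustration, and holding out the last two rows stands in for the random splits or cross-validation a real project would use.

```python
from statistics import mean

# Hypothetical raw records (advertising spend, sales), including the
# kind of messy entries that make data munging necessary.
raw = [
    ("10", "26"), ("20", "44"), ("", "30"), ("30", "66"),
    ("40", "n/a"), ("50", "104"), ("60", "126"), ("70", "144"),
]


def clean(records):
    """Data munging: keep only rows where both fields parse as numbers."""
    rows = []
    for x, y in records:
        try:
            rows.append((float(x), float(y)))
        except ValueError:
            continue  # drop rows with missing or malformed values
    return rows


def fit(rows):
    """Model the data: least-squares line, returned as (intercept, slope)."""
    xs, ys = zip(*rows)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in rows)
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return y_bar - slope * x_bar, slope


def mean_abs_error(model, rows):
    """Validate the model: average absolute prediction error."""
    intercept, slope = model
    return mean(abs(y - (intercept + slope * x)) for x, y in rows)


rows = clean(raw)                    # clean and transform
train, holdout = rows[:-2], rows[-2:]  # keep some data back for validation
model = fit(train)                   # fit the model
mae = mean_abs_error(model, holdout)   # validate on unseen rows

# Communicate the result in terms a business audience can act on.
print(f"Each extra unit of spend adds about {model[1]:.2f} units of sales "
      f"(validation error: {mae:.2f}).")
```

If the validation error were unacceptably large, the final step of the list applies: go back, revisit the question or the features, and repeat.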
Data science is rapidly growing in importance as a way for business enterprises to increase the value of their data assets. As a facilitator of so-called Big Data, data science supplies the technologies, in particular machine learning, needed to meet the demands of growing data sets. Data science is not a passing fad but rather an important evolution of a tried and proven lineage of complementary technologies.
Daniel D. Gutierrez is a Los Angeles–based data scientist working for a broad range of clients through his consultancy AMULET Analytics.