Everywhere you turn, both business and IT talk about data science. But there’s also trepidation about how to get started, especially in the context of attaining an organization’s business goals and objectives beyond the realm of lab or departmental experimentation.
Also hampering widespread adoption is the lack of one clear definition of data science. This is partly because it touches so many fields: computer science, applied statistics, probability theory, mathematics, operations research, machine learning (algorithms that can learn from data), software engineering, and data visualization. As a result, executives in different fields have different understandings of what using data science means.
Our recommendation: stop overanalyzing it, and just get going! You are likely to find previously hidden knowledge and generate business impact even if your approach is not textbook-perfect. The following blueprint outlines the steps to get started applying data science in your organization and avoid common mistakes.
Step 1: Determine whether you even need data science.
The first question to ask yourself is straightforward: “Do we need data science to solve our business problems, and do we have enough problems to justify the expense?” The idea is to evaluate whether it is worth the company’s time and resources to engage in data science. Often, simple rules-based or statistical approaches get you part of the way there, but if your needs still aren’t being met, it might be time to upgrade to a data science solution. Remember, it’s about achieving business results with minimal effort, not about using the latest technology.
Step 2: Locate all the data.
Naturally, data is the raw material of data science. Generally, the more data you have and the more varied your data sources are, the better. So go beyond traditional transactional systems, and be on the lookout for interesting and potentially relevant new data sources, such as system logs, social media data, and sensor data.
Note, though, that preparing this breadth of data, assessing its quality, looking for and correcting errors, and performing exploratory data analysis to determine its relevance can take up a large portion of data scientists’ time. Therefore, we recommend creating an intermediate “processed data” layer, so data scientists don’t have to return to raw data and repeat the same join, clean, and derive steps for each project.
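To make the “processed data” layer concrete, here is a minimal sketch in Python using only the standard library. The record layouts, field names, and function names are hypothetical, chosen purely for illustration. A real layer would more likely live in a database or a data platform, but the idea is the same: clean, join, and derive once, then let every project read the result.

```python
from datetime import date

# Hypothetical raw records, as they might arrive from two source systems.
RAW_CUSTOMERS = [
    {"customer_id": "1", "name": " Ada Lovelace ", "signup": "2015-01-10"},
    {"customer_id": "2", "name": "Alan Turing", "signup": ""},  # missing date
]
RAW_ORDERS = [
    {"customer_id": "1", "amount": "120.50"},
    {"customer_id": "1", "amount": "30.00"},
    {"customer_id": "2", "amount": "not-a-number"},  # bad value
]


def clean_customer(row):
    """Clean one raw customer row: trim text and parse types."""
    signup = row["signup"].strip()
    return {
        "customer_id": int(row["customer_id"]),
        "name": row["name"].strip(),
        "signup": date.fromisoformat(signup) if signup else None,
    }


def parse_amount(text):
    """Return the amount as a float, or None when the value is unusable."""
    try:
        return float(text)
    except ValueError:
        return None


def build_processed_layer(raw_customers, raw_orders):
    """Join, clean, and derive once; return the rows and a reject count."""
    totals = {}
    rejected = 0
    for order in raw_orders:
        amount = parse_amount(order["amount"])
        if amount is None:
            rejected += 1  # keep a count so data quality stays visible
            continue
        cid = int(order["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + amount

    processed = []
    for row in raw_customers:
        customer = clean_customer(row)
        # Derived field: total spend per customer across all valid orders.
        customer["total_spend"] = totals.get(customer["customer_id"], 0.0)
        processed.append(customer)
    return processed, rejected


processed, rejected = build_processed_layer(RAW_CUSTOMERS, RAW_ORDERS)
```

Returning the reject count alongside the cleaned rows is a deliberate choice in this sketch: it keeps data-quality problems visible to whoever consumes the layer instead of silently dropping them.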
Step 3: Make collaboration a priority.
Collaboration is a vital step in getting up to speed with data science. A typical project may involve data scientists, business analysts, data analysts, statisticians, and software engineers. Look for software solutions and platforms that allow collaboration among these disparate groups. Also, encourage collaboration through organizational constructs, such as explicitly forming a team with representatives from each of the groups and outlining roles and responsibilities.
Step 4: Expand the impact of data science through visualization.
The key goal with data science is to replace uninformed, emotional decision making based on subjective theories with decision processes that are supported by evidence and analytics. To do this, analysts need data visualization, which helps interpret data science results and shows why they are significant. A good visualization communicates accurately what has been found, where a table of raw numbers could easily be misinterpreted. So in addition to developing the core data science capabilities, companies must also place a good deal of emphasis on developing strong data visualization expertise.
Step 5: Establish confidence through data governance.
Enterprises need a consistent level of data governance to protect sensitive customer and other information. But data scientists need access to as much data as possible to do their job. How then can the two seemingly opposing needs coexist?
The answer lies in giving appropriate access. Data governance is not about restricting data access. It’s about providing access consistent with enterprise and regulatory policies. Enterprises need to have the following abilities:
- Control access to data
- Protect sensitive data via infrastructure and application-level access controls
- Generate data lineage (e.g., document variables going into models to ensure compliance with appropriate regulations)
- Generate usage reports and audit trails
- View metadata, such as data source, last refresh time stamp, and key stats, to give data scientists confidence in the data source
- Automatically disguise sensitive fields (e.g., name, address)
With these capabilities in place, data governance does not have to be an impediment to data science.
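As an illustration of two of the abilities listed above, controlling access and generating audit trails and usage reports, here is a minimal Python sketch. The policy table, role names, dataset names, and function names are all hypothetical. A production system would rely on the access controls of its data platform and on durable, append-only log storage rather than an in-memory list.

```python
from datetime import datetime, timezone

# Hypothetical policy: which roles may read which datasets.
ACCESS_POLICY = {
    "transactions": {"data_scientist", "analyst"},
    "customer_pii": {"compliance"},
}

# In-memory stand-in for an audit trail (assumption for this sketch).
AUDIT_LOG = []


def request_access(user, role, dataset):
    """Check the policy and record the decision in the audit trail."""
    allowed = role in ACCESS_POLICY.get(dataset, set())
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "dataset": dataset,
        "allowed": allowed,
    })
    return allowed


def usage_report(log):
    """Count granted and denied requests per dataset."""
    report = {}
    for entry in log:
        counts = report.setdefault(
            entry["dataset"], {"granted": 0, "denied": 0}
        )
        counts["granted" if entry["allowed"] else "denied"] += 1
    return report
```

Every request is logged whether it is granted or denied, and unknown datasets are denied by default. That way the same log serves both the audit trail and the usage report.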
Step 6: Remember data security.
Data science platforms must provide the following modern capabilities to ensure data security:
- Data masking — the ability to disguise customer-identifiable information such as name, address, and phone number
- Data encryption in motion (when transferring between systems) and at rest (on the file system or database)
- Multiple options to ensure appropriate user access: application-level access permissions, file-level permissions, logical separation (e.g., different instances of the application for different projects), and in extreme cases, physical environment separation (instances of the platform on different physical environments)
With the appropriate data security strategy and enough planning, enterprises can develop an appropriate architecture. But this must be done before starting the project, as changing the architecture midstream is far more difficult than designing it correctly up front.
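To illustrate the data masking capability listed above, here is a minimal Python sketch that replaces sensitive fields with keyed tokens using the standard library's hmac and hashlib modules. The field list, the token format, and the function names are assumptions made for this example. Keyed hashing is only one possible masking technique, and it is not a substitute for encryption in motion and at rest.

```python
import hashlib
import hmac

# Assumed set of customer-identifiable fields for this sketch.
SENSITIVE_FIELDS = {"name", "address", "phone"}


def mask_value(value, key):
    """Replace a value with a keyed HMAC-SHA256 token (12 hex chars)."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
    return "masked_" + digest.hexdigest()[:12]


def mask_record(record, key, sensitive_fields=SENSITIVE_FIELDS):
    """Return a copy of the record with the sensitive fields masked."""
    return {
        field: mask_value(value, key) if field in sensitive_fields else value
        for field, value in record.items()
    }
```

Because the same input and key always produce the same token, masked records can still be joined and counted by data scientists, while the original values stay hidden from anyone who does not hold the key. The key itself (a bytes value here) would need to be stored and rotated under the same governance policies as the data.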
Step 7: Emphasize data storytelling skills.
It is not enough to build an accurate model and create a compelling visualization of its performance. Enterprises that get value from data science typically point to effective data storytelling as a key ingredient in their success. Insights provided by data science are of little value unless the results can be articulated in terms of what the findings say and why they are significant to business goals. This is not always easy, especially if the story behind the findings calls into question C-level gut-feeling assumptions about business strategy or suggests that established business processes are outdated or ineffective.
However, it’s rare to find people who are great at both data science and making a compelling executive-level presentation. Pairing data scientists with business analysts, especially those with consulting experience, is one way to address this need.
Many enterprises are quick to jump on hot new technology without fully preparing for the process, and they often find themselves disappointed with the results. By implementing the seven steps discussed here, enterprises can ensure that their efforts generate business outcomes, which is ultimately the key metric on which they will be judged, and that they build a strong organizational foundation in data science for the long term.
Anatoli Olkhovets is Vice President of Product Management at Opera Solutions.