Data Science Revolves Around Measurable Results, not Data Lakes
Companies still find it difficult to achieve concrete results with data science and big data. One common mistake, due to a lack of knowledge and experience in the field, is that businesses focus on the infrastructure instead of concrete applications. Read on to find out why you should set aside your Data Envy, start small, and fail fast in order to become successful with data science.
By Rolf Lange, director data science technology at ORTEC
Over the past 15 years, not much has changed in how most organizations deal with their data. They use the data to calculate the optimal sales prices, plan their inventories or delivery routes, and schedule their staff. But advanced data analysis has become so affordable that it is now possible to come up with increasingly creative analyses and complex algorithms. In addition, there is ever more data available to analyze. These developments offer the potential for new insights and significant added value, but many companies simply don’t get that far. Following the hype and an irresistible sense of Big Data Envy, they can’t wait to build the ultimate big data infrastructure, which will eventually produce all sorts of business wisdom through the magic of machine learning. But that’s unrealistic. Moreover, you should never start with the infrastructure. You have to start with the problem that you want to solve, or the value that you want to create for the organization or for your clients. Even then, there are still some hurdles to clear along the way.
1. Big Data Envy
According to Forbes, companies invested more than 42 billion dollars in big data in 2018. That amount is expected to rise to 103 billion by 2027. In some cases, the investments were made simply out of Big Data Envy. This form of jealous herd behavior leads companies to set out on the data science path without a clear road map. It often begins with the comment ‘we should do something with data science’, which quickly leads to expensive projects to combine all of the data in a data lake or data warehouse. This is a common pitfall that trips up many companies. As a rule of thumb, a company holding less than a petabyte (1,000 terabytes) of data doesn’t need a big data infrastructure. So don’t get sucked into the collective frenzy; instead, start with a clear plan.
2. Start small, but relevant
I’m a fan of ‘fail fast’: discover which parts of your plan or idea work, and which ones don’t, as soon as possible. That means you shouldn’t make your data science projects too complicated. Start with a small project based on a practical use case. Pick something that significantly affects the company’s operations and decision-making, rather than a major project that requires a data lake. The key is to achieve results that are useful for the organization, and that produce measurable improvements.
Also consider that much of the data within the company is simply structured information. Many analyses don’t even need a lot of data, and big data infrastructures, machine learning, and enormous volumes of data are then simply overkill. The most important goal you should pursue is to help your company make better decisions. So start with a concrete use case, and start small.
3. Linking instead of integrating
Four years ago, the classic approach to big data was to save all available data and work out later what value it might have. Unfortunately, that’s mainly a way to spend years wasting time and money setting up a costly infrastructure full of ‘dark data’, which nobody will ever look at.
It’s much better to consider which problems you want to solve first, and which insights you want to gain. Then you can look for the sources of the information you need, and an efficient way to analyze the data. In most cases, this won’t be by structurally bringing all of the data together in a single system. That’s an illusion.
It’s more realistic to understand that more and more data comes from outside the organization. In such a situation, it’s practically impossible to copy, synchronize, and manage all of the data. It’s much better to use a system that links to external sources, for example via APIs, and retrieves only the specific information you need. And it’s just fine if that isn’t done via a central hub. Don’t try to centralize everything; instead, consider a decentralized approach. Link different ecosystems together instead of combining them all into a single database, because a single database results in a static data infrastructure that becomes increasingly difficult to scale. Moreover, there are no static requirements in data science, because the entire landscape is in constant flux.
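To make the idea of linking rather than copying more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular product: the names `LinkedSources`, `weather_api`, and `traffic_api` are hypothetical, and the two source functions return made-up values so that the example runs without a network connection. In a real setup, each function would call an external party's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


# Hypothetical stand-ins for external systems. The values are invented
# for illustration; a real link would call the owner's API instead.
def weather_api(city: str) -> Dict[str, Any]:
    forecasts = {"Rotterdam": {"rain_mm": 4.2}, "Utrecht": {"rain_mm": 0.0}}
    return forecasts[city]


def traffic_api(city: str) -> Dict[str, Any]:
    delays = {"Rotterdam": {"delay_min": 12}, "Utrecht": {"delay_min": 3}}
    return delays[city]


@dataclass
class LinkedSources:
    """Holds links to sources; no data is copied or synchronized."""

    sources: Dict[str, Callable[[str], Dict[str, Any]]]

    def lookup(self, city: str, wanted: List[str]) -> Dict[str, Any]:
        # Query only the sources named in `wanted`, at the moment of asking.
        return {name: self.sources[name](city) for name in wanted}


links = LinkedSources({"weather": weather_api, "traffic": traffic_api})
print(links.lookup("Rotterdam", ["weather"]))
```

Adding a new source in this sketch means registering one more function, and the data stays with its owner until a specific question asks for it.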
4. Focus on the last mile
Data science is relatively new in organizations, so its successful integration into business processes can be extremely difficult. It starts by understanding the true complexity of these kinds of projects. After a hackathon or a proof of concept, many people will get excited and think: ‘This is fun! Can it go live in production next week?’ But that’s completely unrealistic, because it takes 10 to 50 times more effort to get something to work in production than in a simple proof of concept. After all, in production you have to deal with actual company data, technical and legal requirements, and the actual business processes. So try to see the solution within its full context, and adjust your expectations accordingly. The last mile is always the longest: the last 20 percent of a project often requires 80 percent of the effort.
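The rule of thumb above can be turned into a quick back-of-the-envelope estimate. The Python sketch below assumes that ‘10 to 50 times more effort’ means multiplying the proof-of-concept effort by a factor between 10 and 50; the function name `production_effort_range` and the five-day proof of concept are hypothetical examples, not figures from a real project.

```python
def production_effort_range(poc_days, low_factor=10, high_factor=50):
    """Return the (low, high) production effort in days for a given PoC effort.

    Assumes the 10x to 50x rule of thumb is a plain multiplier.
    """
    return poc_days * low_factor, poc_days * high_factor


# Hypothetical example: a proof of concept that took five working days.
low_days, high_days = production_effort_range(5)
print(f"Expect roughly {low_days} to {high_days} working days to reach production.")
```

Seen this way, a one-week hackathon result implies months of work, not a go-live next week.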
It should be clear by now that a successful data science project starts with a clear plan, and not with building a monolithic data infrastructure. Don’t be guided by Big Data Envy, but start with a small project that will significantly benefit the organization. Then look for all of the relevant sources of data, and try not to integrate them if possible, but rather link them together in a flexible manner. And if the project eventually proves to have tangible potential, then you should realize that you’re only getting started. Getting the project from the test phase into production will require a lot more effort. But you should still share your preliminary results, because it helps employees and management to understand the benefits of data science, which will in turn create the necessary support to eventually create a data-driven organization.