Without the Scientific Method, Data Science is a Lottery
There are lots of ambiguous terms describing technology these days, and one of the more ambiguous, from my perspective, is “data lake.” There are some good definitions and practices out there, some of which even differentiate it from a data warehouse.
What is a Data Lake?
At a high level, it’s a repository for all types of data, structured and unstructured. It might be partially curated or not. It might have a schema or not. It’s aimed more at data scientists than at business analysts.
I had a discussion recently with a divisional CIO from a large global manufacturer. He described to me how they were going to begin using their workforce management data. When we got into the details it became apparent to me that they were simply going to make a copy of the transactional database and move it into their data lake. The data scientists would take it from there.
I may have oversimplified their approach, but I have seen that strategy at many businesses, whether the data lands in a data lake or simply sits in a warehouse.
Here’s my challenge with the “data scientists will figure it out” approach
It takes a lot of time to plow through all the different types of data, wrangle them into a usable form, and apply different statistical and visual techniques. The upside is that you become familiar with the different types of data available, but I’m not sure that benefit is worth the cost, since you can gain that knowledge in more efficient ways. If you do elect that strategy, your data scientists had better be well versed in the nuances of the company’s business. Otherwise, they might not understand the impact of a relationship they do find, or they might spend their time chasing false leads and issues of secondary priority.
So, what’s a better way?
Understand the business problem you are trying to solve first, then apply the scientific method. I’m not going to define it here; that’s easy enough to look up. Let’s simply look at the benefits of applying the scientific method to data analysis.
First, the data scientist can collaborate with a small group of people to get a solid understanding of the business opportunity and begin to find the available data that relates to that opportunity. Will they get it right the first time? Probably not. It’s an iterative process, but each iteration will be more productive, and it’s a lot easier when you have a focused set of data to experiment with. The cross-functional team required is also smaller and will likely collaborate more effectively.
Second, you can fail faster. Not every business problem can be solved with the data the business has today. For example, workforce data is often noisy and incomplete. Many behaviors can’t be accurately predicted today because so much of what influences an individual takes place outside of work. The scientific method allows us to find those edges much more quickly. From there, you can either put a plan together to obtain the missing data or acknowledge the limits and move on to your next opportunity.
Overall, your data scientists will be more productive and happier (they love to solve problems and see those insights put into action). Don’t worry, you will still get those unexpected “aha” moments that you might think you’d miss if you don’t have all the data. Looking at the existing data in a different way, or connecting existing sets of data that have already been wrangled for other purposes, will still yield fresh, valuable insights.
Did it work for us?
Yes. Recently we were asked to look at fatigue in the workforce within a hospital. By speaking with our clinicians, we began to understand that fatigue sets in in different ways. It’s not simply a matter of working hundreds of consecutive days. It might be working double shifts, or working 6 days a week for months on end. It might be working highly variable shifts, or it might depend on the type of work performed. By spending more of our time understanding the business problem and looking at the data that represented what our clinicians hypothesized we would find, we were able to identify a number of these situations. We coined the term “fatigue intensity” to reflect this more subtle and complex set of situations, in contrast to the blunter approach of defining fatigue by consecutive days worked or hours per workday.
The scientific method was established around the 17th century. There’s a reason we still use it today: it works.
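To make those patterns a little more concrete, here is a minimal Python sketch of how the clinicians’ hypotheses could be turned into measurable indicators from shift records. Everything in it is an illustrative assumption: the Shift class, the fatigue_indicators function, the 16-hour double-shift threshold and the 6-day week threshold are hypothetical, and this is not the actual definition of “fatigue intensity” described above. It leaves out the type of work performed and does not combine the indicators into a single score.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import pstdev


@dataclass(frozen=True)
class Shift:
    """One worked shift for a single person (hypothetical record layout)."""
    start: datetime
    end: datetime

    @property
    def hours(self) -> float:
        return (self.end - self.start).total_seconds() / 3600


def fatigue_indicators(shifts, double_shift_hours=16.0, long_week_days=6):
    """Summarise one person's shifts into simple fatigue indicators.

    The thresholds are assumptions chosen for illustration only.
    """
    shifts = sorted(shifts, key=lambda s: s.start)

    # Total hours worked per calendar day, keyed by the day the shift started.
    hours_by_day = {}
    for s in shifts:
        day = s.start.date()
        hours_by_day[day] = hours_by_day.get(day, 0.0) + s.hours

    # Hypothesis 1: double shifts (many hours worked in a single day).
    double_shift_days = sum(
        1 for h in hours_by_day.values() if h >= double_shift_hours
    )

    # Hypothesis 2: weeks with six or more days worked (ISO year and week).
    days_by_week = {}
    for day in hours_by_day:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        days_by_week[week] = days_by_week.get(week, 0) + 1
    long_weeks = sum(1 for n in days_by_week.values() if n >= long_week_days)

    # The blunt baseline measure: longest run of consecutive days worked.
    longest_run = run = 0
    previous = None
    for day in sorted(hours_by_day):
        is_next_day = previous is not None and day - previous == timedelta(days=1)
        run = run + 1 if is_next_day else 1
        longest_run = max(longest_run, run)
        previous = day

    # Hypothesis 3: highly variable shifts, measured as the spread of start hours.
    start_hours = [s.start.hour + s.start.minute / 60 for s in shifts]
    start_spread = pstdev(start_hours) if len(start_hours) > 1 else 0.0

    return {
        "double_shift_days": double_shift_days,
        "long_weeks": long_weeks,
        "longest_run_days": longest_run,
        "start_hour_spread": start_spread,
    }


# Usage: six day shifts in one week, plus a second shift on the third day.
week = [
    Shift(datetime(2024, 1, d, 7), datetime(2024, 1, d, 15)) for d in range(1, 7)
]
week.append(Shift(datetime(2024, 1, 3, 15), datetime(2024, 1, 3, 23)))
print(fatigue_indicators(week))
```

The point of the sketch is the order of work: each indicator exists because someone who understands the business proposed it, so the data needed is narrow and the result is easy to check with the same people.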