Top 3 Data Science Challenges to Operationalize Machine Learning: Insider Stories


Sep 18

Want to add data science and machine learning capabilities to your arsenal (or just make more data-driven decisions)? If so, I commend you! While the rewards can be plentiful, quite often so are the roadblocks to achieving this goal. To help with your journey, I wrote this article to give you a preview of the top challenges you will likely face, so that you can immediately start to make adjustments to your business in order to facilitate this effort.


To be clear, these data science and machine learning challenges have not been gleaned from conversations with a list of cherry-picked technology vendors or opt-in surveys of their customers. What follows are insights mostly gained from deep, first-hand, extended involvement with data science and machine learning projects with numerous Fortune 500 enterprises across a wide range of industries (industrial, transportation, media, software, financial services, healthcare) over the past several years.

In 2017 I was brought on to help Dataiku establish a customer presence in the Western half of the US. For a brief description, Dataiku provides an “end-to-end” data science and machine learning platform that, also in 2017, made its debut on the Gartner Data Science & Machine Learning Magic Quadrant as the “Most Visionary” vendor (see here for a unique perspective on the MQ). This event led some of the most recognizable brands in the world to invite us to participate in the development of their data science and machine learning practices. And because of the wide geographical area that I was focused on for Dataiku during this time, I was grateful to become directly involved in a large and diverse set of these projects.

These projects were multifaceted, and usually involved multiple customer stakeholders spanning multiple lines of business, and at times included third-party consultants and systems integrators. In order to best serve our customers, I had the pleasure of aligning our exceptional team of data scientists, architects, project managers, and engineers to help. They added immeasurable depth, subject matter expertise, and perspective to our engagements, for which I’m immensely grateful. With one of them in particular, Kenneth Sanford, I’ve gone on to create our consulting practice, Datagrom.

This experience has provided me with a unique perspective, not found by many industry analysts and consulting firms out there, that I hope you’ll find useful in this article and in future posts.

Business Context

Most of these enterprises had already attempted to make headway building out their practice, and had hit some speed bumps along the way, causing them to reach out to us to collaborate and determine whether or not we could help, typically over the course of a six to nine month engagement. This is a brief summary of the top challenges I discovered that seemed to apply to at least most of these companies, and could quite possibly apply to yours as well.

And for clarification, in the title “Operationalize Machine Learning”, what is meant is that once business stakeholders have a clear vision for how they want to implement machine learning models to achieve specific business outcomes, they have the capability to convert that vision into reality on a repeatable basis. The presence of such a vision should by no means be taken for granted, and will likely be a topic for a future post. For the purposes of this article, the domain of discussion is focused on the tactical challenges of implementing some predefined business objective.

For each of these challenges, I’ll include a brief real-life account of what this actually looked like at one of the large enterprises I worked with (without naming any names of course). I’ll also include some brief recommendations to mitigate these challenges. If you’d like to continue the conversation in more depth, please contact us!

Top 3 Data Science Challenges to Operationalize Machine Learning

Data Access: Getting this part right is foundational to any attempt at becoming truly data-driven across any organization. I found this challenge to be incredibly common from one company to the next, to the point of being quite predictable. Perhaps this comes as little surprise in a world exploding with, and being consumed by, data.
Data Prep: Much has been written about this portion of the data science lifecycle, which is often the most time-consuming stage of a machine learning project. I’ll add my two cents.
Production Deployment: Challenges to operationalize machine learning models in production were also common, though somewhat less frequent than challenges related to data access and data prep. This is likely because there just aren’t that many advanced analytics teams out there (outside of Bay Area companies) that have matured their data science practice to the point where they actually have production-ready models.

1. Data Access

Organization: Large pharmaceutical company

Challenge: “It can take up to 10 weeks to get access to the data I need in order to perform my analysis”

Why: Relevant data is decentralized and scattered across multiple lines of business, each of which “owns” its own siloed data sets. It takes many weeks from the time data access requests are submitted as IT tickets to the time access is approved and provided. IT is understandably overloaded, and each request is placed in a lengthy task priority queue. Data owners can be guarded about sharing information that could be used to reflect negatively on the line of business receiving the request, and there can be regulatory implications involved. All of these factors contributed to lengthy data access approval times.

Recommendation: In today’s business climate, data should quite often be considered a strategic business asset (ignore this at your peril of being disrupted). This hasn’t always been the case. Today, and for the foreseeable future, it is. What this means for organizations established before this change is some re-imagineering. It requires top-down executive leadership and an intentional strategy for data storage and access processes across the enterprise.

The larger and older your organization is, the tougher this challenge will be. You may be tempted to start sucking copies of all your data from across the enterprise, do a bunch of formatting to it in anticipation of what you think you might want to use it for in the future, and then dump it into a centralized “data hub” or EDW. Yes, you could do that! Instead of letting the data just sit there, it’s more fun to hire some people to start moving it all over the place. And spend piles of cash on technology and storage to make that scheme work. You’ll then have new issues to deal with, like “I wonder if I can trust whether the data in this central hub is the most up-to-date and accurate data possible? After all, it’s been moved around so much, it’s hard to know what’s happened to it in the process!” Or perhaps, “I really hope our EDW admin never leaves. It’s so complicated, no one else knows how it works.” And let’s not forget, moving data around takes time. Implement something like an EDW if you must (and depending on the use case it could make sense). Otherwise, let that data chill where it is. Where it’s happy and untouched. Reach out and grab it when you need it!

For a simpler approach, I might recommend deploying a centralized, governed access portal where any of your team members can log in and get self-service access to any data set they are allowed to see, in its original format, no matter where the data lives in your organization.
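To make the idea of a governed access portal a little more concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a description of how any particular product works: the `Dataset` and `AccessCatalog` names, the group names, and the example locations are all hypothetical. The point it shows is that the catalog records where each data set already lives and who may see it, and hands back a pointer to the data instead of a copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    location: str              # where the data already lives; nothing is copied
    allowed_groups: frozenset  # groups permitted to read this data set


class AccessCatalog:
    """Hypothetical catalog: one place to ask 'what can I see, and where is it?'"""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        self._datasets[dataset.name] = dataset

    def visible_to(self, user_groups):
        """Names of every data set at least one of the user's groups may read."""
        groups = set(user_groups)
        return sorted(
            d.name for d in self._datasets.values() if d.allowed_groups & groups
        )

    def resolve(self, name, user_groups):
        """Return the data set's original location, or refuse access."""
        dataset = self._datasets[name]
        if not dataset.allowed_groups & set(user_groups):
            raise PermissionError(f"no access to {name}")
        return dataset.location


# Example usage with made-up data sets and groups.
catalog = AccessCatalog()
catalog.register(Dataset("trial_results", "oracle://research/trials", frozenset({"research"})))
catalog.register(Dataset("ad_clicks", "hdfs://marketing/clicks", frozenset({"marketing", "research"})))

print(catalog.visible_to(["marketing"]))
print(catalog.resolve("ad_clicks", ["marketing"]))
```

In a real deployment the permission rules would come from your governance process, and the locations would be handed to whatever query layer sits in front of the source systems.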

And to support an initiative to become data-driven at an even larger scale across your organization, it would certainly help if anyone in your company could access all of your data, regardless of their skill set. They could simply log in, see a visual interface, and start working with data from HDFS, or Mongo, or Oracle, etc… as if they were working with something that looked similar to Excel, without ever writing a single line of code or SQL.

In case building something out like that sounds like a pain in the ass to you (it is), here are some software platforms to take a look at that can greatly simplify the data access and governance process: Alation, Dremio, and Immuta are among the technologies focused on this area. And of course, contact us, we’d be happy to help evaluate and recommend the best solution to meet your specific needs!

2. Data Prep

Organization: Large media company

Challenge: “Our analysts can’t do anything with this data because they’re used to working with visual tools like Excel, and have no idea how to write code”

Why: Database technologies have changed and evolved much faster than most analysts have.

Recommendation: In this case, you’re often better off changing your technology approach to accommodate the humans, instead of the other way around.

Take a look at tools like Alteryx, Trifacta, KNIME, and Dataiku to help, or contact us.

3. Production Deployment

Organization: Large transportation company

Challenge: “Once we’ve built a model, it takes six months from the time we pass it to DevOps to the time it gets deployed into production”

Why: Deploying machine learning models into production is more difficult than deploying application code for several reasons, including:

The model might need to be refactored into production code, e.g. from Python to Java
Models may need to be re-trained on a regular basis as model prediction feedback (or ground truth) comes in. This feedback may come in instantly, or it could take months or more. And there has to be a plan to actually capture this data. This all needs to be contemplated and configured in advance of deployment.
Models require ongoing monitoring to make sure, for instance, that they aren’t making biased predictions because they were trained on biased historical data, or that the quality of their predictions doesn’t rapidly deteriorate because the world they learned to operate within (training data) has suddenly and completely changed (COVID-19 for example). Again, this must be carefully considered prior to deployment.
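As one illustration of what monitoring for a changed world can look like, here is a minimal Python sketch of a drift check. It uses the Population Stability Index (PSI), a common way to compare a feature's distribution in training data against what the model sees in production. This technique is my choice for the example and is not something the companies above necessarily used. The function names are hypothetical, and the 0.25 alert threshold is a widely quoted rule of thumb, not a standard, so treat it as an assumption to tune.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the range of `expected` (the training sample);
    production values outside that range are clipped into the edge bins.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def needs_review(psi_value, threshold=0.25):
    """Assumed rule of thumb: a PSI above 0.25 signals a major shift."""
    return psi_value > threshold


# Example usage with synthetic data: production values drift upward.
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(2000)]
production = [x + 1.5 for x in training]

score = psi(training, production)
print(f"PSI = {score:.2f}, review needed: {needs_review(score)}")
```

A check like this would typically run on a schedule against each important input feature and the model's output scores, with alerts routed to whoever owns the model in production.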

Recommendation: Create a machine learning operations (MLOps) team and strategy. DevOps works for software. You need a different approach to operationalize machine learning models.

This is an area where data science platforms are also trying to keep up. Dataiku has been adding capabilities in this area. And a new crop of startups like Truera are emerging to address model explainability and bias reduction, while startups like Superwise.ai are focusing on model monitoring and management in production.

If you have any questions or comments, please leave them below!


Estevan McCalley

Estevan is an AI strategist with a passion for helping clients amplify their human potential with AI technologies and make better data-driven decisions. Estevan is a military veteran with an academic background in aerospace engineering, and has held big data and AI roles at Oracle, Pivotal, Dataiku, Databricks, and Thermo Fisher Scientific. He has also been a featured speaker at several industry conferences. He has worked with many of the Fortune 100 companies to help convert their vision for AI into operational reality.
