Tuesday, December 31, 2024

Packaging ML Pipelines from Experiment to Deployment

As ML Engineers, we are generally tasked with solving some business problem with technology. Typically it involves leveraging data assets that your organization already owns or can acquire. Generally, unless it is a very simple problem, there will be more than one ML model involved, maybe different types of models depending on the sub-task, maybe other supporting tools such as a Search Index or Bloom Filter or a third-party API. In such cases, these different models and tools are organized into an ML Pipeline, where they cooperate to produce the desired solution.

My general (very high level, very hand-wavy) process is to first convince myself that my proposed solution will work, then convince my project owners / peers, and finally to deploy the pipeline as an API to convince the application team that the solution solves the business problem. Of course, generating the initial proposed solution is a task in itself, and may need to be composed of multiple sub-solutions, each of which needs to be tested individually as well. So very likely the initial "proposed solution" is a partial bare-bones pipeline to begin with, and improves through successive iterations of feedback from the project and application teams.

In the past, I have treated these phases as largely disjoint, with each phase built (mostly) from scratch with a lot of copy-pasting of code from the previous phase. That is, I would start with notebooks (on Visual Studio Code of course) for the "convince myself" phase, copy-paste a lot of the functionality into a Streamlit application for the "convince project owners / peers" phase, and finally do another round of copy-pasting to build the backend for a FastAPI application for the "convince application team" phase. While this works in general, folding iterative improvements into each phase gets messy, time-consuming, and potentially error-prone.

Inspired by some of my fellow ML Engineers who are more steeped in Software Engineering best practices than I am, I decided to optimize the process by making it DRY (Don't Repeat Yourself). My modified process is as follows:

Convince Yourself -- continue using a combination of notebooks and short code snippets to test out sub-task functionality and compose sub-tasks into candidate pipelines. The focus is on exploring different options, in terms of pre-trained third-party models and supporting tools, fine-tuning candidate models, understanding the behavior of the individual components and the pipeline on small subsets of data, etc. There is no change here; the process can be as organized or chaotic as you like. If it works for you, it works for you.

Convince Project Owners -- in this phase, your audience is a set of people who understand the domain very well, are generally interested in how you are solving the problem, and want to see how your solution behaves in weird edge cases (that they have seen in the past and that you may not have imagined). They could run your notebooks in a pinch, but they would prefer an application-like interface with lots of debug information to show them how your pipeline is doing what it is doing.

Here the first step is to extract and parameterize functionality from my notebook(s) into functions. Functions represent individual steps in the multi-step pipeline, and should be able to return additional debug information when given a debug parameter. There should also be a function representing the entire pipeline, composed of calls to the individual steps. This is also the function that deals with optional / new functionality across multiple iterations through feature flags. These functions should live in a central model.py file that is called from all subsequent clients. Functions should have associated unit tests (unittest or pytest).
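To make this concrete, here is a minimal sketch of what such a model.py might look like. The step names (retrieve_candidates, rerank_candidates), the use_reranker feature flag, and the shape of the return values are all hypothetical placeholders for whatever your pipeline actually does; the point is the structure: small step functions with a debug parameter, and a single run_pipeline function that composes them.

    # model.py -- minimal sketch; step names, flags, and return shapes are placeholders
    from typing import Any

    def retrieve_candidates(query: str, debug: bool = False) -> tuple[list[str], dict]:
        """First pipeline step: look up candidate results for the query."""
        candidates = [f"candidate for: {query}"]          # placeholder logic
        debug_info = {"num_candidates": len(candidates)} if debug else {}
        return candidates, debug_info

    def rerank_candidates(candidates: list[str], debug: bool = False) -> tuple[list[str], dict]:
        """Second pipeline step: reorder candidates by some score."""
        reranked = sorted(candidates)                     # placeholder logic
        debug_info = {"reranked_order": reranked} if debug else {}
        return reranked, debug_info

    def run_pipeline(query: str,
                     debug: bool = False,
                     use_reranker: bool = True) -> dict[str, Any]:
        """The entire pipeline as a single call; feature flags (here use_reranker)
        gate optional / new functionality across iterations."""
        debug_info: dict[str, Any] = {}
        candidates, dbg = retrieve_candidates(query, debug=debug)
        debug_info.update(dbg)
        if use_reranker:
            candidates, dbg = rerank_candidates(candidates, debug=debug)
            debug_info.update(dbg)
        result: dict[str, Any] = {"query": query, "answers": candidates}
        if debug:
            result["debug"] = debug_info
        return result

Each of these functions then gets a small pytest test of its own, for example asserting that run_pipeline returns an answers key and only includes the debug output when asked for it.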

The Streamlit application should call the function representing the entire pipeline with debug turned on. This ensures that as the pipeline evolves, no changes need to be made to the Streamlit client. Streamlit provides its own unit testing functionality in the form of the AppTest class, which can be used to run a few inputs through the app. The focus is more on ensuring that the app does not fail when run non-interactively, so the test can be run on a schedule (perhaps by a GitHub Action).
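As a sketch (again, file and widget names are only illustrative), the Streamlit client can stay as thin as this:

    # app.py -- thin Streamlit client over the pipeline
    import streamlit as st
    from model import run_pipeline

    st.title("Pipeline Demo")
    query = st.text_input("Enter a query")
    if st.button("Run"):
        result = run_pipeline(query, debug=True)   # debug=True so owners can see the internals
        st.json(result)

and the corresponding AppTest smoke test, runnable with pytest, only needs to push one input through and assert that nothing blew up:

    # test_app.py -- smoke test using Streamlit's AppTest class
    from streamlit.testing.v1 import AppTest

    def test_app_runs_without_exception():
        at = AppTest.from_file("app.py")
        at.run()
        at.text_input[0].input("some test query")
        at.button[0].click().run()
        assert not at.exception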

Convince Project Team -- while this is similar to the previous step, I think of it as having the pipeline evaluated by domain experts in the project team against a larger dataset than what was achievable through the Streamlit application. We don't need as much intermediate / debugging information to illustrate how the process works. The focus here is on establishing that the solution generalizes to a sufficiently large and diverse set of data. This phase should be able to leverage the functions in the model.py module we built in the previous phase. The output expected for this stage is a batch report, where you call the function representing the pipeline (with debug set to False this time) and format the returned value(s) into a file.
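A sketch of what that batch report could look like, assuming a plain text file with one input per line and a TSV output (the file names and formats are placeholders):

    # report.py -- run a larger evaluation set through the pipeline and
    # write the outputs to a TSV file for the project team to review
    from model import run_pipeline

    def generate_report(input_path: str = "eval_queries.txt",
                        output_path: str = "eval_report.tsv") -> None:
        with open(input_path, "r") as fin, open(output_path, "w") as fout:
            fout.write("query\tanswers\n")
            for line in fin:
                query = line.strip()
                if not query:
                    continue
                result = run_pipeline(query, debug=False)   # no debug info in batch mode
                fout.write(f"{query}\t{'|'.join(result['answers'])}\n")

    if __name__ == "__main__":
        generate_report()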

Convince Application Team -- this would expose a self-describing API that the application team can call to integrate your work into the application solving the business problem. This is again just a wrapper for your function call to the pipeline with debug set to False. Having this up as early as possible allows the application team to start working, provides you valuable feedback around inputs and outputs, and surfaces edge cases where your pipeline might produce incorrect or inconsistent results.
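A minimal FastAPI wrapper might look like the sketch below; the /predict endpoint name and the request / response schemas are illustrative. The self-describing part comes for free, since FastAPI generates interactive OpenAPI docs at /docs from these type annotations.

    # api.py -- FastAPI wrapper around the pipeline function
    from fastapi import FastAPI
    from pydantic import BaseModel

    from model import run_pipeline

    app = FastAPI(title="Pipeline API")

    class PipelineRequest(BaseModel):
        query: str

    class PipelineResponse(BaseModel):
        query: str
        answers: list[str]

    @app.post("/predict", response_model=PipelineResponse)
    def predict(request: PipelineRequest) -> PipelineResponse:
        result = run_pipeline(request.query, debug=False)
        return PipelineResponse(query=result["query"], answers=result["answers"])

This can be served locally with something like uvicorn api:app --port 8000.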

I also used the requests library to build unit tests for the API; the objective is just to be able to verify from the command line that it doesn't fail.
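Something along these lines, assuming the API from the previous step is already running locally on port 8000:

    # test_api.py -- smoke test for the running API; run with pytest
    import requests

    API_URL = "http://localhost:8000/predict"

    def test_predict_does_not_fail():
        resp = requests.post(API_URL, json={"query": "some test query"}, timeout=30)
        assert resp.status_code == 200
        body = resp.json()
        assert "answers" in body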

There is likely to be a feedback loop back to the Convince Yourself phase from each of these phases as inconsistencies are spotted and edge cases are uncovered. These may result in additional components being added to or removed from the pipeline, or their functionality being changed. These changes should ideally only affect the model.py file, unless we need to add additional inputs, in which case they would also affect the Streamlit app.py and the FastAPI api.py.

Finally, I orchestrated all these using Snakemake, which I learned about at the recent PyData Global conference I attended. This means I don't have to remember all the commands associated with running the Streamlit and FastAPI clients, running the different kinds of unit tests, etc., if I have to come back to the application after a while.
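The Snakefile is essentially a set of named wrappers around those commands, something like the sketch below (rule names and file names are illustrative). Individual rules can then be invoked by name, for example snakemake -c1 test_model.

    # Snakefile -- named wrappers around the commands for each client and test suite
    rule app:
        shell: "streamlit run app.py"

    rule api:
        shell: "uvicorn api:app --port 8000 --reload"

    rule report:
        shell: "python report.py"

    rule test_model:
        shell: "python -m pytest test_model.py"

    rule test_app:
        shell: "python -m pytest test_app.py"

    rule test_api:
        shell: "python -m pytest test_api.py"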

I implemented this approach on a small project recently. The process was not as clear-cut as I described; there was a fair amount of refactoring as I moved from the "Convince Project Owners" phase to the "Convince Application Team" phase. However, it feels less like a chore than it did when I had to fold in iterative improvements using the copy-paste approach. I think it is a step in the right direction, at least for me. What do you think?