Software engineering for Machine Learning – Part II: An overeager implementation

This is part of a series of posts in which we discuss the challenges and strategies for the productionization of machine learning code. In our previous post we introduced the problem and arrived at four pillars that production ML code should have:

  1. Fast and easy exploration
  2. Declarative and intention-revealing
  3. Useful checkpoints
  4. Seamless tracking and monitoring

We will attempt to productionize the playground code we wrote earlier and show, step by step, the rationale behind each of our decisions. We will see how some attempts to implement an exploration-is-production approach eventually fall back to deploy-as-a-service when we disregard the typical exploration workflow. To follow the evolution of the code, you can check the repository. Remember that we began with the original playground file.

The first thing we realize with this code is that there is no way to put it into production: the model only lives within the scope of this Python file. We need to persist it somehow, and for that we choose pickle. Instead of calculating the accuracy score, once we train the model we save it (check the code at that commit in the repo):
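The repo has the actual code for this step. As a minimal sketch of what it could look like, assuming the iris dataset as a stand-in for our training data and a hypothetical `model.pkl` written to a temporary directory:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assumption: iris stands in for the training data used in the series.
X, y = load_iris(return_X_y=True)

# Select the 2 best features, then classify with a logistic regression.
model = Pipeline([
    ("select", SelectKBest(k=2)),
    ("classify", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# Hypothetical location; a real script would use a stable, agreed-upon path.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with model_path.open("wb") as f:
    pickle.dump(model, f)
```

Note that the call to `pickle.dump` sits right in the middle of the training script, which is what we will come back to shortly.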

We can then use this saved model to predict on a completely new dataset. In this scenario we simplify things by reading the unseen dataset from disk, but keep in mind that the same saved model could be served over any API with, for example, FastAPI.
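As a rough sketch of the prediction side, assuming a hypothetical `predict_from_saved_model` helper and using held-out iris rows in place of an unseen dataset read from disk (the first half of the block only exists so that the example runs on its own):

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


def predict_from_saved_model(model_path, features):
    """Load a pickled model from disk and predict on unseen rows."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    return model.predict(features)


# Setup so the sketch is self-contained: train on half of iris and pickle it.
X, y = load_iris(return_X_y=True)
X_train, y_train = X[::2], y[::2]
X_unseen = X[1::2]  # assumption: stands in for the new dataset on disk

model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with model_path.open("wb") as f:
    pickle.dump(LogisticRegression(max_iter=1000).fit(X_train, y_train), f)

# The prediction script itself only needs the path and the new rows.
predictions = predict_from_saved_model(model_path, X_unseen)
print(predictions[:5])
```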

From the exploration-is-production perspective (see Part I), we could say our work is done: we have a model and it’s going into production. However, there are some things that can be improved in this code. Take, for example, the calls to pickle. If we eventually wanted to save the model in any other format, we’d have to manually change every file that calls it, and the storage format is something that data scientists are usually not interested in. Essentially, implementation details are being exposed in the code. We know how to solve this: we abstract the implementation details away into functions! So we create a module that is in charge of knowing how to persist and load the model (check the diff from the previous step in the repo):
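A minimal sketch of such a module could look like the following. The function names `save_model` and `load_model` are our assumption here, and a plain dictionary stands in for the fitted model so that the example stays small:

```python
import pickle
import tempfile
from pathlib import Path


# --- persistence module: the only place that knows about pickle ---
def save_model(model, path):
    """Persist a model to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously persisted model from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# --- calling code: no mention of pickle anywhere ---
model = {"weights": [0.1, 0.2]}  # stand-in for a fitted estimator
model_path = Path(tempfile.mkdtemp()) / "model.pkl"

save_model(model, model_path)
restored = load_model(model_path)
```

Swapping pickle for another serializer would now mean editing the bodies of these two functions and nothing else.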

Notice that, by creating this new file, we have also split the code into two spaces: the new file is “library” code, while the scripts that train the model and predict with it are “application” code. Now if, for some reason, we decided to change the tool we use to save the model (for example, joblib instead of pickle), we’d only have to change the library code, and users of the application code would not be affected at all. This distinction is important for understanding how and when to abstract implementation details, and we’ll use it throughout this post.

Now we turn our attention to the model fitting itself, this section of the file:

In a similar fashion to what we’ve done before, we will abstract the fitting of the model into a function. And, spoiler alert: this will turn out badly right away. When we abstract the fitting into a fit_model function, the library file and the application file end up looking like this (diff):
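The diff in the repo is the reference. As an illustrative sketch of where this leaves us, with `fit_model` taken from the text, `save_model` and `load_model` as assumed names, and iris as stand-in data:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


# --- library code ---
def fit_model(features, target):
    """Build the select-then-classify pipeline and fit it."""
    pipeline = Pipeline([
        ("select", SelectKBest(k=2)),
        ("classify", LogisticRegression(max_iter=1000)),
    ])
    return pipeline.fit(features, target)


def save_model(model, path):
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# --- application code ---
X, y = load_iris(return_X_y=True)  # assumption: iris as stand-in data
model = fit_model(X, y)
```

Every library function takes or returns the same thing, the model, and says so in its name.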

Notice that _model is a repeating suffix in all of the functions. As Sandi Metz put it in her wonderful talk All the Little Things, when we see functions that share a repeating prefix or suffix, there’s a tortured object in there screaming to get out. So we do a very quick refactor and “upgrade” the collection of _model functions to a well-formed Model object, which you can also check in the repo.
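One way such a Model object could look is sketched below. The method names and the choice to pickle only the inner pipeline are our assumptions, not necessarily what the repo does:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class Model:
    """Groups what used to be the *_model functions around one object."""

    def __init__(self):
        self._pipeline = Pipeline([
            ("select", SelectKBest(k=2)),
            ("classify", LogisticRegression(max_iter=1000)),
        ])

    def fit(self, features, target):
        self._pipeline.fit(features, target)
        return self

    def predict(self, features):
        return self._pipeline.predict(features)

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self._pipeline, f)

    @classmethod
    def load(cls, path):
        model = cls()
        with open(path, "rb") as f:
            model._pipeline = pickle.load(f)
        return model


# Application code (iris is a stand-in dataset).
X, y = load_iris(return_X_y=True)
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
Model().fit(X, y).save(model_path)
loaded = Model.load(model_path)
```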

Emboldened by this success and by how clean the Model object looks, we now turn our attention to the dataset loading in the application code.

Similarly, we can build the Dataset object that abstracts the information about the features and the target:
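As a hedged sketch of this idea, a Dataset could be a small dataclass that keeps features and target together. The `from_csv` constructor, the column names and the tiny CSV file below are all hypothetical and only serve the illustration:

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Dataset:
    features: list  # one list of floats per row
    target: list    # one label per row

    @classmethod
    def from_csv(cls, path, target_column):
        """Read a CSV file and split it into features and target."""
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        features = [
            [float(value) for name, value in row.items() if name != target_column]
            for row in rows
        ]
        target = [row[target_column] for row in rows]
        return cls(features=features, target=target)


# Tiny hypothetical file so the sketch runs on its own.
csv_path = Path(tempfile.mkdtemp()) / "train.csv"
csv_path.write_text(
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "7.0,3.2,versicolor\n"
)

dataset = Dataset.from_csv(csv_path, target_column="species")
```

The application code no longer needs to know which columns are features and which one is the target.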

We can also leverage this new object to modify slightly the signature of the fit method in Model and use this new structure:

With these small changes, the file now is much more concise (check the full code in the repo):
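The full code is in the repo. As a compact sketch of how the pieces could fit together, assuming a `fit` method that takes a Dataset directly and iris as stand-in data:

```python
import pickle
import tempfile
from dataclasses import dataclass
from pathlib import Path

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


# --- library code ---
@dataclass(frozen=True)
class Dataset:
    features: np.ndarray
    target: np.ndarray


class Model:
    def __init__(self):
        self._pipeline = Pipeline([
            ("select", SelectKBest(k=2)),
            ("classify", LogisticRegression(max_iter=1000)),
        ])

    def fit(self, dataset):
        self._pipeline.fit(dataset.features, dataset.target)
        return self

    def predict(self, features):
        return self._pipeline.predict(features)

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self._pipeline, f)


# --- application code: three intention-revealing lines ---
dataset = Dataset(*load_iris(return_X_y=True))  # assumption: iris stand-in
model = Model().fit(dataset)
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
model.save(model_path)
```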

It is now only a handful of lines, and it is intention-revealing: each of the lines in the application code is clear about what it’s doing. But, as we said earlier, this story doesn’t end well: immersed in our search for abstracting implementation details, we simply went too far.

Remember that one of our goals is to have fast and easy exploration. Suppose that a data scientist in our team wants to try out (even completely offline!) how the model would perform if we select 3 instead of 2 features in the SelectKBest. It’s OK, we know how to solve it: we get the hardcoded 2 in the __init__ method and parametrize it away! (diff)
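A sketch of that change, trimmed to the parts of Model that matter here (the persistence methods are left out, and iris is again a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class Model:
    def __init__(self, number_of_selected=2):
        # The hardcoded 2 becomes a parameter with 2 as its default.
        self._pipeline = Pipeline([
            ("select", SelectKBest(k=number_of_selected)),
            ("classify", LogisticRegression(max_iter=1000)),
        ])

    def fit(self, features, target):
        self._pipeline.fit(features, target)
        return self

    def predict(self, features):
        return self._pipeline.predict(features)


X, y = load_iris(return_X_y=True)
model = Model(number_of_selected=3).fit(X, y)

# Boolean mask over the input columns showing which ones were kept.
kept_columns = model._pipeline.named_steps["select"].get_support()
```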

Done! We live to see another day. But on that “other day”, the data scientists exploring the model realize that sometimes they don’t want to select any features at all and would rather go directly to the logistic regression step. Rushed by their need to explore this quickly, we come up with a questionable (but very widespread in Python) design decision: we use None as a special value of number_of_selected, in which case we skip the selection step (diff):
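In sketch form, and again trimmed to the relevant parts of Model, the special case could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class Model:
    def __init__(self, number_of_selected=2):
        steps = []
        # None is the special value: it means "skip feature selection".
        if number_of_selected is not None:
            steps.append(("select", SelectKBest(k=number_of_selected)))
        steps.append(("classify", LogisticRegression(max_iter=1000)))
        self._pipeline = Pipeline(steps)

    def fit(self, features, target):
        self._pipeline.fit(features, target)
        return self

    def predict(self, features):
        return self._pipeline.predict(features)


X, y = load_iris(return_X_y=True)  # assumption: iris as stand-in data
model_without_selection = Model(number_of_selected=None).fit(X, y)
step_names = list(model_without_selection._pipeline.named_steps)
```

The constructor now has a branch whose only purpose is to support one exploration scenario.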

Phew! Bullet dodged: we are not super happy with the solution, but it works. But right away (remember, the data scientists’ job here is to explore different solutions) people want to see what would happen if we replaced the logistic regression classifier with a fancy multilayer perceptron. Reluctantly, we do whatever-gets-this-thing-solved, and we come up with this idea (diff):
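A sketch of this quick fix follows. The accepted strings `"logistic"` and `"mlp"`, the error message and the MLP settings are our assumptions; only the classifier_type parameter name comes from the text:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline


class Model:
    def __init__(self, number_of_selected=2, classifier_type="logistic"):
        steps = []
        if number_of_selected is not None:
            steps.append(("select", SelectKBest(k=number_of_selected)))

        # String comparison decides which classifier gets built.
        if classifier_type == "logistic":
            classifier = LogisticRegression(max_iter=1000)
        elif classifier_type == "mlp":
            classifier = MLPClassifier(max_iter=1000, random_state=0)
        else:
            raise ValueError(f"Unknown classifier_type: {classifier_type}")

        steps.append(("classify", classifier))
        self._pipeline = Pipeline(steps)

    def fit(self, features, target):
        self._pipeline.fit(features, target)
        return self

    def predict(self, features):
        return self._pipeline.predict(features)


X, y = load_iris(return_X_y=True)  # assumption: iris as stand-in data
mlp_model = Model(classifier_type="mlp").fit(X, y)
```

A typo such as `classifier_type="MLP"` only surfaces at runtime, and nothing in the signature tells the user which values are valid.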

This was super fast to implement, but it’s clearly not good code: we are comparing against a string, raising an error that can be hard to understand, and the user needs to remember the possible values of the classifier_type parameter, among other problems. We can see how this is a slippery slope: we made a lot of modifications to our production code for things that belong to the exploration step and might never get into production. The price we pay is both a degradation of the quality of the production code and an exploration workflow with a lot of friction. At this point, we are extremely tempted to tell the data scientists to try whatever modifications they want on their own and, if one turns out to be production-worthy, we will code it for production, falling back to the deploy-as-a-service approach.

We finish this post on a sad note, but the solution is right around the corner. As a hint: notice that our code is tightly coupled to the original solution and that we actually don’t know which model we want. In the following posts, we will provide an alternative path to solve this problem.

