ML Ops Specialization

6 minute read

I’ve been having a really good time working on projects that involve more of an ML Ops component lately. So, I’ve been spending some of my learning time working through the Machine Learning Engineering for Production (MLOps) Specialization on Coursera. This blog covers my notes from the first course, Introduction to Machine Learning in Production, as well as a bit on how I’ve been applying the concepts in my work.

Overview of the ML Lifecycle

  1. Scoping: This includes defining metrics (throughput, latency, etc.), resources needed (compute, budget, time) and timeline.
  2. Data Definition: Expected inputs, ground truth, and consistent labeling are all factors here.
  3. Modeling: Key considerations are the choice of algorithm, hyperparameters, and data.
  4. Deployment: Details of the actual deployment, as well as monitoring.
  5. Iteration: It is often only after the initial deployment in production and working on real traffic that you find out things need to change.

Project Scoping

The course proposes the following procedure for scoping projects:

  • Brainstorm business problems
  • Brainstorm ML solutions to those problems
  • Check project feasibility
  • Determine milestones
  • Budget for resources

Data Definition and Baseline

Once you’ve scoped your project, spending time at the beginning to specifically and clearly define the data can make a big difference both in the success of the final model and in how challenging the project is (I have certainly learned this the hard way in my own work).

As a first step, think about the model input. This includes the features you’ll use as well as details about those features. For example, if you’re working on a computer vision project, what kind of images will you be using? Consider things like lighting, contrast, and resolution. If the input data you have isn’t great, it often makes sense to spend time improving the data before moving further along.
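As a concrete (hypothetical) example, a quick scan over the raw images can surface problems with resolution or lighting before any modeling starts. This is just a sketch, not anything from the course; the directory name and the cutoffs are placeholders you’d adjust for your own data:

```python
from pathlib import Path

from PIL import Image, ImageStat

# Hypothetical folder of raw images; "images/" and the cutoffs below are placeholders.
MIN_SIDE = 224        # flag images smaller than the model's expected input size
MIN_BRIGHTNESS = 40   # flag very dark images (mean pixel value, 0-255)

for path in Path("images").glob("*.jpg"):
    with Image.open(path) as img:
        width, height = img.size
        brightness = ImageStat.Stat(img.convert("L")).mean[0]
        if min(width, height) < MIN_SIDE or brightness < MIN_BRIGHTNESS:
            print(f"{path.name}: {width}x{height}, mean brightness {brightness:.0f} -- worth reviewing")
```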

Next, define the target. An important consideration here is label consistency, particularly for smaller data sets. I’ve had first-hand experience with how important this is, and how deceptively tricky it can be. For example, in a recent project we were gathering labels for which skills were taught in a particular course. But there is a lot of potential ambiguity here: if a skill is only briefly mentioned, should it be tagged? If it’s a class on deep learning, would you also tag machine learning? And so on. We didn’t notice these issues until later in the project than would have been ideal, and it required going back and re-doing some work.

The course recommends a few ways to improve label consistency:

  • Have multiple labelers label the same example, or have labelers re-label the same examples after some time has passed (a quick way to check agreement is sketched after this list).
  • Standardize labels (for example, if you’re doing an annotation project, make sure variant spellings of filler words like “um”, “ummm”, and “um…” are all categorized the same way).
  • Merge classes in cases where having too many classes is confusing and not helpful (in the course-skill tagging example, that might mean a single label for whether the skill is taught at all, instead of separate labels for “taught a lot,” “taught a little,” etc.).
  • Have a class for uncertainty. This one is critical so that you can trust your negative labels in particular.
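On the first point, a simple way to quantify how well two labelers agree is something like Cohen’s kappa. The course doesn’t prescribe a specific metric, so treat this as a made-up example, with hypothetical labels for the skill-tagging scenario above:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same eight courses
# (1 = "skill is taught in this course", 0 = "not taught").
labeler_a = [1, 0, 1, 1, 0, 1, 0, 0]
labeler_b = [1, 0, 0, 1, 0, 1, 1, 0]

kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below 1.0 suggest the labeling instructions need tightening
```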

We often use human-level performance (or HLP) as a benchmark for ML model performance. When the ground truth is externally defined (e.g. a biopsy shows cancer or not), HLP represents irreducible/Bayes error. However, when it is defined by a human, then HLP is just measuring how well two humans agree. This can be a big issue when using human-labeled data.

Improving the ground truth and label definitions can raise HLP when HLP is really measuring human agreement. Therefore, be cautious of a model that naively beats HLP before HLP has been maximized. Often there is room for improvement in HLP, which raises the benchmark for the model. A higher HLP means a higher bar, but the application will be more useful because cleaner, more consistent data leads to better performance.

Generally, it makes sense to spend about the same amount of time on data collection, model training, and error analysis. Giving yourself a deadline for gathering data can help speed up the development/iteration loop (data collection -> model training -> error analysis), rather than spending a very long time gathering data before prototyping anything. An exception is if you have enough experience with the problem to know roughly how much data it takes to work well; in that case, gather that amount first.

Model Selection and Training

The course recommends first conducting a literature review, and then finding an open-source implementation of the model you’re interested in trying out. This can be a good option for a baseline even if it wouldn’t be possible to deploy (perhaps because it’s not efficient enough for your use case). Often it is better to spend the majority of your time on data prep and feature engineering, and to use an established model rather than creating a custom one.
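As a (hypothetical) illustration of the “use an established model” point, even a couple of lines of scikit-learn give you a trivial baseline and an off-the-shelf model to compare against before anything custom gets built; the dataset here is just a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in dataset; swap in your own features and labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trivial baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Established, off-the-shelf model with near-default settings.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("majority-class baseline:", baseline.score(X_test, y_test))
print("off-the-shelf model:    ", model.score(X_test, y_test))
```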

When getting started, a good way to sanity-check the code and algorithm is to try to overfit a small dataset, or even a single example, and make sure that works before adding regularization or training on the whole training set.
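Here’s a minimal sketch of that sanity check (not the course’s code), using a toy PyTorch model and a single random batch as placeholders; if the training loop is wired up correctly, the loss should drop close to zero:

```python
import torch
from torch import nn

# Toy model and one small batch of fake data, standing in for the real thing.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
x = torch.randn(16, 20)          # a single batch of 16 examples
y = torch.randint(0, 2, (16,))   # random labels, just for the sanity check

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# The model should be able to memorize this tiny batch almost perfectly.
accuracy = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"final loss: {loss.item():.4f}, accuracy on the batch: {accuracy:.2f}")
```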

The key issue for model selection and training is to try to do well on business metrics and meet the project goals. In particular, you need to make sure that the model doesn’t just perform well on average, but also on particularly important examples, subsets of the data (especially when working with data that contains protected classes like race or gender), and rare cases.
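In practice, that usually means slicing the evaluation set and checking metrics per slice instead of only looking at the overall number. A tiny, made-up example of what that might look like (the groups and values are placeholders):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical evaluation results; "group" could be any important slice
# (a protected attribute, a rare-but-critical category, etc.).
results = pd.DataFrame({
    "group":  ["a", "a", "b", "b", "b", "c", "c", "c"],
    "y_true": [1,   0,   1,   1,   0,   1,   0,   1],
    "y_pred": [1,   0,   0,   1,   0,   0,   0,   1],
})

print("overall accuracy:", accuracy_score(results["y_true"], results["y_pred"]))
for group, df in results.groupby("group"):
    # A slice that lags far behind the average deserves attention,
    # even if the overall metric looks fine.
    print(f"group {group}: accuracy {accuracy_score(df['y_true'], df['y_pred']):.2f}")
```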

Error analysis can be very useful for making sure the above criteria are met, as well as identifying ways to improve the model. The course describes an approach where you manually analyze correct and incorrect predictions, and classify why the incorrect ones were incorrect. I’ve personally used this on most of my projects, and found some really interesting things that helped improve the model.

Next, you can prioritize which categories of bad predictions to work on improving in a few ways (a rough calculation combining the first two is sketched after the list):

  • Those with the biggest gap to HLP.
  • Those that comprise the biggest chunk of all the bad predictions.
  • Those that are easiest to improve.
  • Those that are part of the most important category for the business.
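One rough way to combine the first two criteria (with hypothetical numbers) is to estimate how much overall accuracy you could gain by closing each category’s gap to HLP, weighted by how much of the data that category represents:

```python
# Hypothetical error-analysis summary:
# (category, model accuracy, HLP, fraction of the data in that category)
categories = [
    ("category A", 0.94, 0.95, 0.60),
    ("category B", 0.89, 0.93, 0.10),
    ("category C", 0.87, 0.89, 0.30),
]

for name, model_acc, hlp, frac in categories:
    # Rough ceiling on the overall accuracy gained by closing this category's gap to HLP.
    potential_gain = (hlp - model_acc) * frac
    print(f"{name}: potential overall gain ~{potential_gain:.3f}")
```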

Deployment

Once you have a model you’re happy with, you’re finally ready to deploy. This can be done in a few different ways:

  • Start with a small amount of traffic and gradually ramp it up. This often makes the most sense for new products.
  • Shadow mode deployment, where the ML algorithm initially runs in parallel with a human inspector who has the final say. This provides an opportunity to gather data about the model’s performance and verify it is working as expected before deciding whether to let the model make decisions independently. This works well for automating manual tasks.
  • Canary deployment, where you start by rolling out to a small group (maybe 5% or less of total traffic), monitor the response, and then roll out a bit more over time if everything looks OK. This is a good choice when replacing an existing ML system (a toy routing sketch follows this list).
  • Blue-green deployment, where the old version of the software is “blue” and the new version is “green.” You have the old version running, spin up the new one, and have the router switch over to the new version, either all at once or little by little. The advantage here is that rollbacks are easy.
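As a toy sketch of how a canary split might look in code (the models here are just stubs; a real system would also log which version served each request so the two can be compared):

```python
import random

CANARY_FRACTION = 0.05  # start by sending ~5% of traffic to the new model


def current_model(features):
    # Stub standing in for the existing production model.
    return "prediction from the stable model"


def candidate_model(features):
    # Stub standing in for the new model being canaried.
    return "prediction from the candidate model"


def route_request(features):
    """Randomly route a small slice of traffic to the candidate model."""
    if random.random() < CANARY_FRACTION:
        return candidate_model(features), "canary"
    return current_model(features), "stable"


prediction, version = route_request({"some": "features"})
print(version, prediction)
```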

After deployment, monitoring is essential. In order to catch issues, brainstorm all the things that could possibly go wrong, and a few metrics that would catch them. Often it makes sense to start with a lot of options and narrow down to the best ones. Then, decide on some thresholds to trigger alerts.
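A bare-bones version of that might look something like this; the metric names and thresholds are placeholders, and in practice you’d wire this into whatever monitoring and alerting stack you already use:

```python
# Placeholder metrics and thresholds; tune these for your own system.
THRESHOLDS = {
    "fraction_null_outputs": 0.01,  # alert if more than 1% of responses are NULL
    "p95_latency_ms": 500,
    "fraction_low_confidence": 0.20,
}


def check_metrics(metrics: dict) -> list:
    """Return an alert message for every metric that exceeds its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name} = {value} exceeds threshold {limit}")
    return alerts


print(check_metrics({"fraction_null_outputs": 0.03, "p95_latency_ms": 420}))
```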

Beyond functional issues (like returning NULLs, latency spikes, etc.), it’s also important to consider drift:

  • Data drift: The input data distribution changes (a simple check for this is sketched below).
  • Concept drift: The desired mapping from x -> y changes (like user online shopping behavior changing at the beginning of COVID-19, or inflation changing the mapping from house size to price).
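For data drift on a single numeric feature, one simple check (just a sketch; real monitoring would cover every important feature and track model performance too) is to compare the training-time distribution against recent production data, for example with a two-sample KS test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical numeric feature: values seen at training time vs. in production.
train_values = rng.normal(loc=0.0, scale=1.0, size=1000)
prod_values = rng.normal(loc=0.4, scale=1.0, size=1000)  # distribution has shifted

stat, p_value = ks_2samp(train_values, prod_values)
if p_value < 0.01:  # placeholder threshold for raising an alert
    print(f"possible data drift: KS statistic {stat:.3f}, p-value {p_value:.2g}")
```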