Ever since teaching TensorFlow for Deep Learning Research, I’ve known that I love teaching and want to do it again.

In early 2019, I started talking with Stanford’s CS department about the possibility of coming back to teach. After almost two years in development, the course has finally taken shape. I’m excited to let you know that I’ll be teaching CS 329S: Machine Learning Systems Design at Stanford in January 2021.

The course wouldn’t have been possible without the help of many people, including Christopher Ré, Jerry Cain, Mehran Sahami, Michele Catasta, and Mykel J. Kochenderfer.

Here’s a short description of the course. You can find the (tentative) syllabus below.

This project-based course covers the iterative process for designing, developing, and deploying machine learning systems. It focuses on systems that require massive datasets and compute resources, such as large neural networks. Students will learn about the different layers of the data pipeline; approaches to model selection, training, and scaling; and how to deploy, monitor, and maintain ML systems. In the process, students will learn about important issues including privacy, fairness, and security.

Pre-requisites: At least one of the following: CS229, CS230, CS231N, CS224N, or equivalent. Students should have a good understanding of machine learning algorithms and should be familiar with at least one framework such as TensorFlow, PyTorch, or JAX.

If you’re a Stanford student interested in taking the course, you can fill in the application here. Grades will be based on one final project (at least 50%), three short assignments, and class participation.

For those outside Stanford, I’ll try to make as many of the course materials available as possible. I’ll post updates about the course on Twitter, or you can check back here from time to time.

Since these are all new materials, I’m hoping to get early feedback. If you’re interested in becoming a reviewer for the course materials, please shoot me an email. Thank you!

Tentative syllabus

Week 1: Overview of machine learning systems design

  • When to use ML
  • ML in research vs. ML in production
  • ML systems vs. traditional software
  • ML production myths
  • ML applications
  • Case studies

Week 2: Iterative process

  • Principles of a good ML system
  • Iterative process
  • Scoping the project

Week 3: Data management

  • Challenges of real-world data
  • How to collect, store, and handle massive data
  • Different layers of the data pipeline
  • Data processor & monitor
  • Data controller
  • Data storage
  • Data ingestion: database engines

Week 4: Creating training datasets

  • Feature engineering
  • Data labeling
  • Data leakage
  • Data partitioning, slicing, and sampling

Week 5: Building and training machine learning models

  • Baselines
  • Model selection
  • Training, debugging, and experiment tracking
  • Distributed training
  • Evaluation and benchmarking
  • AutoML

Week 6: Deployment

  • Inference constraints
  • Model compression and optimization
  • Training vs. serving skew
  • Concept drift
  • Server-side ML vs. client-side ML
  • Releasing strategies
  • Deployment evaluation

Week 7: Project milestone and discussion

  • Ethical concerns

Week 8: Monitoring and maintenance

  • What to monitor
  • Metrics, logging, tags, alerts
  • Updates and rollbacks
  • Iterative improvement

Week 9: Hardware & infrastructure

  • Architectural choices
  • Hardware design
  • Edge devices
  • Clouds vs. private data centers
  • Future of high-performance computing

Week 10: Integrating ML into business

  • Model performance vs. business goals vs. user experience
  • Team structure
  • Why ML projects fail
  • Best practices
  • State of ML production

This blog post was edited by the wonderful Andrey Kurenkov.