7.2 Sampling and creating training data
- [E] If you have 6 shirts and 4 pairs of pants, how many ways are there to choose 2 shirts and 1 pair of pants?
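A quick sanity check with the multiplication rule: the shirt choice and the pant choice are independent, so the count is C(6, 2) · C(4, 1):

```python
import math

# Choose 2 of 6 shirts, then 1 of 4 pairs of pants; the choices are
# independent, so multiply the binomial coefficients.
shirts = math.comb(6, 2)   # C(6, 2) = 15
pants = math.comb(4, 1)    # C(4, 1) = 4
outfits = shirts * pants
print(outfits)             # 60
```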
- [M] What is the difference between sampling with vs. without replacement? Give an example of when you would use one rather than the other.
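The distinction is easy to see with the standard library: sampling with replacement can return the same item twice (e.g. bootstrapping a dataset), while sampling without replacement never can (e.g. carving out a test split):

```python
import random

random.seed(0)
population = list(range(10))

# With replacement: the same item may appear more than once
# (e.g. bootstrap resampling).
with_repl = random.choices(population, k=5)

# Without replacement: every drawn item is unique
# (e.g. selecting rows for a held-out split).
without_repl = random.sample(population, k=5)

print(len(set(without_repl)))  # 5 — duplicates are impossible here
```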
- [M] Explain Markov chain Monte Carlo sampling.
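As a concrete anchor for the MCMC question, here is a minimal random-walk Metropolis sampler (the simplest MCMC variant) targeting a standard normal; it only needs the density up to a normalizing constant:

```python
import math
import random

def metropolis_normal(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler targeting a standard normal.

    Accept a proposed move with probability
    min(1, p(proposal) / p(current)); since the proposal is symmetric,
    no proposal-density correction is needed.
    """
    rng = random.Random(seed)
    log_p = lambda v: -0.5 * v * v          # log N(0, 1) up to a constant
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)   # symmetric proposal
        if rng.random() < math.exp(min(0.0, log_p(proposal) - log_p(x))):
            x = proposal                    # accept; otherwise keep x
        samples.append(x)
    return samples

draws = metropolis_normal(20000)
mean = sum(draws) / len(draws)              # close to 0 for a long chain
```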
- [M] If you need to sample from high-dimensional data, which sampling method would you choose?
- [H] Suppose we have a classification task with many classes. An example is when you have to predict the next word in a sentence -- the next word can be one of many, many possible words. If we have to calculate the probabilities for all classes, it’ll be prohibitively expensive. Instead, we can calculate the probabilities for a small set of candidate classes. This method is called candidate sampling. Name and explain some of the candidate sampling algorithms.
Hint: check out this great article on candidate sampling by the TensorFlow team.
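Algorithms like sampled softmax, noise contrastive estimation (NCE), and negative sampling all share one step: score the true class against a small set of sampled negatives instead of the full output layer. A minimal sketch of that shared sampling step using uniform negatives (the function name and sizes are hypothetical; real word-level samplers typically draw from a skewed distribution such as log-uniform):

```python
import random

def sample_candidates(true_class, num_classes, num_sampled, seed=0):
    """Uniformly sample negative candidate classes, excluding the true one.

    The loss is then computed over [true_class] + negatives rather
    than over all num_classes outputs.
    """
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < num_sampled:
        c = rng.randrange(num_classes)
        if c != true_class:
            negatives.add(c)
    return [true_class] + sorted(negatives)

cands = sample_candidates(true_class=42, num_classes=50_000, num_sampled=20)
# 21 candidates to score instead of 50,000 softmax outputs
```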
Suppose you want to build a model to classify whether a Reddit comment violates the website’s rules. You have 10 million unlabeled comments from 10K users over the last 24 months and you want to label 100K of them.
- [M] How would you sample 100K comments to label?
- [M] Suppose you get back 100K labeled comments from 20 annotators and you want to look at some labels to estimate the quality of the labels. How many labels would you look at? How would you sample them?
Hint: This article on different sampling methods and their use cases might help.
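One reasonable answer to the 100K-comment question is stratified sampling, e.g. by user, so that a handful of prolific posters doesn't dominate the labeled set. A toy sketch (the data layout and per-user budget are hypothetical):

```python
import random
from collections import defaultdict

def stratified_sample(comments, n, seed=0):
    """Sample n comments with a roughly even budget per user.

    `comments` is a list of (user_id, text) pairs standing in for
    the 10M-comment pool.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, text in comments:
        by_user[user].append(text)
    per_user = max(1, n // len(by_user))     # even budget across users
    sample = []
    for user, texts in by_user.items():
        k = min(per_user, len(texts))
        sample.extend((user, t) for t in rng.sample(texts, k))
    return sample[:n]

pool = [(u, f"comment-{u}-{i}") for u in range(100) for i in range(50)]
picked = stratified_sample(pool, n=500)     # at most 5 comments per user
```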
- [M] Suppose you work for a news site that historically has translated only 1% of all its articles. Your coworker argues that you should translate more articles into Chinese because translations help increase readership. On average, your translated articles have twice as many views as your non-translated articles. What might be wrong with this argument?
Hint: think about selection bias.
- [M] How to determine whether two sets of samples (e.g. train and test splits) come from the same distribution?
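One classic tool here is the two-sample Kolmogorov–Smirnov test. A self-contained sketch of its statistic (in practice you'd reach for `scipy.stats.ks_2samp`, which also gives a p-value):

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs. Values near 0 are compatible with a shared
    distribution; large values suggest the samples differ."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:                      # the ECDFs only jump at sample points
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

rng = random.Random(0)
same = ks_statistic([rng.gauss(0, 1) for _ in range(1000)],
                    [rng.gauss(0, 1) for _ in range(1000)])
shifted = ks_statistic([rng.gauss(0, 1) for _ in range(1000)],
                       [rng.gauss(2, 1) for _ in range(1000)])
# `shifted` comes out far larger than `same`
```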
- [H] How do you know you’ve collected enough samples to train your ML model?
- [M] How to determine outliers in your data samples? What to do with them?
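A common starting point for the outlier question is Tukey's IQR fences: flag anything outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 10, 95]    # 95 is an obvious outlier
print(iqr_outliers(data))                  # [95]
```

Whether to drop, cap, or keep the flagged points then depends on whether they are measurement errors or genuine rare events.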
- Sample duplication
- [M] When should you remove duplicate training samples? When shouldn’t you?
- [M] What happens if we accidentally duplicate every data point in your train set or in your test set?
- Missing data
- [H] In your dataset, two out of 20 variables have more than 30% missing values. What would you do?
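A first step for the missing-values question is simply measuring missingness per variable before deciding to drop, impute, or add a missingness-indicator feature. A toy sketch with row-oriented records (the column names are hypothetical):

```python
def missing_fraction(rows, columns):
    """Fraction of None values per column in row-oriented records."""
    fractions = {}
    for col in columns:
        vals = [r.get(col) for r in rows]
        fractions[col] = sum(v is None for v in vals) / len(vals)
    return fractions

rows = [{"age": 34, "income": None}, {"age": None, "income": None},
        {"age": 29, "income": 50000}, {"age": 31, "income": None}]
frac = missing_fraction(rows, ["age", "income"])

# Columns above a threshold (e.g. 30%) are candidates for dropping, or
# for a missingness indicator if the missingness itself is informative.
to_drop = [c for c, f in frac.items() if f > 0.3]
```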
- [M] How might techniques that handle missing data make selection bias worse? How do you handle this bias?
- [M] Why is randomization important when designing experiments (experimental design)?
- Class imbalance.
- [E] How would class imbalance affect your model?
- [E] Why is it hard for ML models to perform well on data with class imbalance?
- [M] Imagine you want to build a model to detect skin lesions from images. In your training dataset, only 1% of your images show signs of lesions. After training, your model seems to make a lot more false negatives than false positives. What are some of the techniques you'd use to improve your model?
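One simple remedy for the false-negative problem is to rebalance the training data, e.g. by randomly duplicating minority-class samples (class-weighted losses and threshold tuning are alternatives). A minimal sketch:

```python
import random

def oversample_minority(X, y, seed=0):
    """Duplicate random minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    pos = [(x, label) for x, label in zip(X, y) if label == 1]
    neg = [(x, label) for x, label in zip(X, y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [label for _, label in balanced]

X = list(range(100))
y = [1] * 2 + [0] * 98          # 2% positives, as in the lesion example
Xb, yb = oversample_minority(X, y)
# positives now match negatives: 98 of each
```

Note that this must be done only on the training split, after splitting, to avoid leaking duplicates into the test set.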
- Training data leakage.
- [M] Imagine you're working with a binary task where the positive class accounts for only 1% of your data. You decide to oversample the rare class then split your data into train and test splits. Your model performs well on the test split but poorly in production. What might have happened?
- [M] You want to build a model to classify whether a comment is spam or not spam. You have a dataset of a million comments over the period of 7 days. You decide to randomly split all your data into the train and test splits. Your co-worker points out that this can lead to data leakage. How?
Hint: You might want to clarify what oversampling here means. Oversampling can be as simple as duplicating samples from the rare class.
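The oversample-then-split failure mode can be made concrete. In this toy setup (sample contents are hypothetical), the rare sample's duplicates are guaranteed to land on both sides of the split, so the test set is no longer unseen data:

```python
import random

rng = random.Random(0)
common = [("common", i) for i in range(40)]
rare = [("rare", 0)]                       # a single rare sample

# Leaky order: oversample first, then split. The rare sample is
# duplicated 160 times, so in any 150/50 split of the 200 rows,
# copies of it must appear in both train and test.
oversampled = common + rare * 160
rng.shuffle(oversampled)
train, test = oversampled[:150], oversampled[150:]

leaked = ("rare", 0) in train and ("rare", 0) in test
print(leaked)  # True — identical rare copies sit in both splits,
               # so test metrics overstate real performance
```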
- [M] How does data sparsity affect your models?
Hint: Sparse data is different from missing data.
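The hint's distinction in code: sparse entries are *known* zeros (e.g. bag-of-words counts), not values we failed to observe. Storing only the non-zeros saves memory, but each example activates very few features, which makes learning harder:

```python
# A dense bag-of-words vector and its sparse (index -> count) form.
dense = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
sparse = {i: v for i, v in enumerate(dense) if v != 0}

density = len(sparse) / len(dense)   # only 20% of entries are active
print(sparse)                        # {2: 3, 6: 1}
```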
- [E] What are some causes of feature leakage?
- [E] Why does normalization help prevent feature leakage?
- [M] How do you detect feature leakage?
- [M] Suppose you want to build a model to classify whether a tweet spreads misinformation. You have 100K labeled tweets over the last 24 months. You decide to randomly shuffle your data and pick 80% to be the train split, 10% to be the valid split, and 10% to be the test split. What might be the problem with this way of partitioning?
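For time-sensitive data like tweets, a random shuffle lets the model "see the future": later posts about an event land in train while earlier ones sit in test. A time-based split avoids this (toy data; the timestamp field is hypothetical):

```python
# Sort by time, then cut chronologically instead of shuffling.
tweets = [{"ts": t, "text": f"tweet-{t}"} for t in range(100)]
tweets.sort(key=lambda r: r["ts"])

n = len(tweets)
train = tweets[: int(0.8 * n)]
valid = tweets[int(0.8 * n): int(0.9 * n)]
test = tweets[int(0.9 * n):]

# Every training example precedes every validation/test example in time.
assert max(r["ts"] for r in train) < min(r["ts"] for r in valid)
```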
- [M] You’re building a neural network and you want to use both numerical and textual features. How would you process those different features?
- [H] Your model has been performing fairly well using just a subset of features available in your data. Your boss decided that you should use all the features available instead. What might happen to the training error? What might happen to the test error?
Hint: Think about the curse of dimensionality: the more dimensions we use to describe our data, the sparser the space becomes and the farther data points are from each other.
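The hint can be verified numerically: mean pairwise distance between random points in the unit cube grows roughly like the square root of the dimension, so a fixed-size training set covers the space ever more thinly:

```python
import math
import random

def mean_pairwise_distance(dim, n_points=50, seed=0):
    """Mean Euclidean distance between random points in the unit cube."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return sum(dists) / len(dists)

low, high = mean_pairwise_distance(2), mean_pairwise_distance(200)
print(low, high)   # high is roughly an order of magnitude larger
```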