Glossary term
Governance and Compliance
Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:
coverage bias: The population represented in the dataset doesn't match the population that the machine learning model is making predictions about.
sampling bias: Data is not collected randomly from the target group.
non-response bias (also called participation bias): Users from certain groups opt out of surveys at different rates than users from other groups.
For example, suppose you are creating a machine learning model that predicts people's enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:
coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
non-response bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.
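The movie survey example can be sketched as a small simulation. This is only an illustration under invented assumptions: the audience size, the 1 to 10 enjoyment scale, the rule that front rows enjoy the movie slightly more, and the response probabilities are all hypothetical, as are the function names. It compares a front-row sample with a random sample, then shows how optional responses can over-represent strong opinions.

```python
import random
import statistics

NEUTRAL = 5.5  # midpoint of the hypothetical 1-10 enjoyment scale


def simulate_audience(n_rows=20, seats_per_row=50, seed=0):
    """Return (row, enjoyment) pairs for a hypothetical audience.

    Assumption for illustration only: average enjoyment drops slightly
    with each row farther from the screen.
    """
    rng = random.Random(seed)
    audience = []
    for row in range(n_rows):
        for _ in range(seats_per_row):
            base = 7.0 - 0.15 * row
            score = min(10.0, max(1.0, rng.gauss(base, 1.5)))
            audience.append((row, score))
    return audience


def random_sample(audience, k, seed=1):
    """Draw k enjoyment scores uniformly at random from the whole audience."""
    rng = random.Random(seed)
    return [score for _, score in rng.sample(audience, k)]


def survey_respondents(scores, seed=2):
    """Keep each score with a probability that grows with opinion strength.

    Assumption: people far from the neutral midpoint answer more often.
    """
    rng = random.Random(seed)
    kept = []
    for score in scores:
        p_respond = min(1.0, 0.1 + 0.2 * abs(score - NEUTRAL))
        if rng.random() < p_respond:
            kept.append(score)
    return kept


def mean_abs_dev(scores):
    """Average distance from the neutral midpoint (opinion strength)."""
    return statistics.mean([abs(s - NEUTRAL) for s in scores])


audience = simulate_audience()
all_scores = [score for _, score in audience]

# Sampling bias: only the front row (row 0) is surveyed.
front_scores = [score for row, score in audience if row == 0]

true_mean = statistics.mean(all_scores)
front_mean = statistics.mean(front_scores)
random_mean = statistics.mean(random_sample(audience, k=100))

# Non-response bias: only some audience members return the optional survey.
responded = survey_respondents(all_scores)

print(f"whole audience mean:   {true_mean:.2f}")
print(f"front-row sample mean: {front_mean:.2f}")
print(f"random sample mean:    {random_mean:.2f}")
print(f"opinion strength, everyone:    {mean_abs_dev(all_scores):.2f}")
print(f"opinion strength, respondents: {mean_abs_dev(responded):.2f}")
```

Under these assumptions, the front-row mean should sit above the whole-audience mean while the random sample tracks it more closely, and the respondents should show stronger opinions on average than the audience as a whole. Coverage bias is not simulated here, because everyone in the simulated population already chose to see the movie.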
Created for this library
A bank's risk team flags selection bias because its training data only covers approved applicants, missing declined ones.
A health-tech startup documents selection bias when its training data comes from a single hospital network rather than a representative population.
A research lab labels its dataset card with selection bias notes so downstream users understand the limits of generalization.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License