Glossary term
Glossary term
Foundations
Representing categorical data as a vector in which:
One element is set to 1.
All other elements are set to 0.
One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:
"Denmark"
"Sweden"
"Norway"
"Finland"
"Iceland"
One-hot encoding could represent each of the five values as follows:
Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.
Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:
"Denmark" is 0
"Sweden" is 1
"Norway" is 2
"Finland" is 3
"Iceland" is 4
With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.
See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.
Created for this library
A retail analytics team uses one-hot encoding on payment method as a feature for its cart-abandonment model.
A churn team uses one-hot encoding on plan tier so the production model treats each tier as an independent feature.
A search-quality team uses one-hot encoding on country code as a feature in its ranker alongside continuous engagement signals.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License