Encoding high-cardinality string categorical variables

Cerda, Patricio; Varoquaux, Gaël

doi:10.1109/TKDE.2020.2992529

arXiv:1907.01860 (cs)

[Submitted on 3 Jul 2019 (v1), last revised 18 May 2020 (this version, v5)]

Authors: Patricio Cerda (PARIETAL), Gaël Varoquaux (NEUROSPIN)

Abstract: Statistical models usually require vector representations of categorical variables, using for instance one-hot encoding. This strategy breaks down when the number of categories grows, as it creates high-dimensional feature vectors. Additionally, for string entries, one-hot encoding does not capture information in their representation. Here, we seek low-dimensional encoding of high-cardinality string categorical variables. Ideally, these should be: scalable to many categories; interpretable to end users; and facilitate statistical analysis. We introduce two encoding approaches for string categories: a Gamma-Poisson matrix factorization on substring counts, and the min-hash encoder, for fast approximation of string similarities. We show that min-hash turns set inclusions into inequality relations that are easier to learn. Both approaches are scalable and streamable. Experiments on real and simulated data show that these methods improve supervised learning with high-cardinality categorical variables. We recommend the following: if scalability is central, the min-hash encoder is the best option as it does not require any data fit; if interpretability is important, the Gamma-Poisson factorization is the best alternative, as it can be interpreted as one-hot encoding on inferred categories with informative feature names. Both models enable autoML on the original string entries as they remove the need for feature engineering or data cleaning.
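The abstract's claim that min-hash turns set inclusions into inequality relations can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`char_ngrams`, `minhash_encode`), the use of character 3-grams, and the choice of salted BLAKE2b as the hash family are all assumptions made for illustration. The idea shown is only that each output dimension is the minimum of one hash function over the string's substrings, so no fitting step is needed.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`.

    Assumption for this sketch: strings shorter than n are treated
    as a single gram so that the set is never empty.
    """
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _salted_hash(gram, seed):
    # One deterministic hash function per seed (illustrative choice of BLAKE2b).
    digest = hashlib.blake2b(
        gram.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, dim=8, n=3):
    """Encode `text` as `dim` integers: per hash function, the min over its n-grams."""
    grams = char_ngrams(text, n)
    return [min(_salted_hash(g, seed) for g in grams) for seed in range(dim)]


# Every 3-gram of "senior" also occurs in "senior manager", so each
# coordinate of the longer string's encoding is <= the shorter string's.
short_code = minhash_encode("senior")
long_code = minhash_encode("senior manager")
print(all(b <= a for a, b in zip(short_code, long_code)))
```

The inequality holds by construction: a minimum taken over a superset can never exceed the minimum over a subset. A downstream model that splits on thresholds could in principle pick up such containment relations directly from the encoded coordinates.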

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1907.01860 [cs.LG]
	(or arXiv:1907.01860v5 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1907.01860
Journal reference:	IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers, pp.1-1
Related DOI:	https://doi.org/10.1109/TKDE.2020.2992529

Submission history

From: Patricio Cerda [via CCSD proxy]
[v1] Wed, 3 Jul 2019 11:36:07 UTC (4,266 KB)
[v2] Thu, 11 Jul 2019 12:32:56 UTC (6,005 KB)
[v3] Wed, 17 Jul 2019 10:17:59 UTC (3,233 KB)
[v4] Thu, 5 Sep 2019 08:54:06 UTC (3,233 KB)
[v5] Mon, 18 May 2020 15:18:45 UTC (4,884 KB)
