Computer Science > Machine Learning
[Submitted on 3 Jul 2019 (this version), latest version 18 May 2020 (v5)]
Title: Encoding high-cardinality string categorical variables
Abstract: Statistical analysis usually requires a vector representation of categorical variables, using for instance one-hot encoding. This encoding strategy is not practical when the number of different categories grows, as it creates high-dimensional feature vectors. Additionally, the corresponding entries in the raw data are often represented as strings, which carry additional information not captured by one-hot encoding. Here, we seek low-dimensional vector encodings of string categorical variables with high cardinality. Ideally, these should i) be scalable to a very large number of categories, ii) be interpretable to the end user, and iii) facilitate statistical analysis. We introduce two new encoding approaches for string categories: a Gamma-Poisson matrix factorization on character-level substring counts, and a min-hash encoder, based on min-wise random permutations for fast approximation of the Jaccard similarity between strings. Both approaches are scalable and suitable for streaming settings. Extensive experiments on real and simulated data show that these encoding methods improve prediction performance for real-life supervised-learning problems with high-cardinality string categorical variables and work as well as standard approaches with clean, low-cardinality ones. We recommend the following: i) if scalability is the main concern, the min-hash encoder is the best option, as it does not require any fitting to the data; ii) if interpretability is important, the Gamma-Poisson factorization is a good alternative, as it can be interpreted as one-hot encoding, giving each encoding dimension a feature name that summarizes the substrings captured. Both models remove the need for hand-crafting features and data cleaning of string columns in databases and can be used for feature engineering in online autoML settings.
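To make the min-hash idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes strings are compared through their sets of character 3-grams, and it uses seeded BLAKE2b hashes as a stand-in for min-wise random permutations. The function names (`char_ngrams`, `minhash_encode`, `estimated_jaccard`), the n-gram length, and the encoding dimension are all illustrative assumptions. The fraction of coordinates on which two encodings agree approximates the Jaccard similarity of the two n-gram sets, and no fitting to the data is needed.

```python
import hashlib


def char_ngrams(s, n=3):
    """Set of character n-grams of s (the whole string if it is shorter than n)."""
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def _seeded_hash(gram, seed):
    # Assumption: a seeded 64-bit hash plays the role of one random
    # permutation of the n-gram universe.
    digest = hashlib.blake2b(
        gram.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(s, dim=64, n=3):
    """Encode a string as `dim` minimum hash values, one per seeded hash."""
    grams = char_ngrams(s, n)
    return [min(_seeded_hash(g, seed) for g in grams) for seed in range(dim)]


def jaccard(a, b, n=3):
    """Exact Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def estimated_jaccard(a, b, dim=64, n=3):
    """Fraction of matching min-hash coordinates, an estimate of jaccard(a, b)."""
    ea, eb = minhash_encode(a, dim, n), minhash_encode(b, dim, n)
    return sum(x == y for x, y in zip(ea, eb)) / dim


if __name__ == "__main__":
    a, b, c = "police officer", "police officer ii", "firefighter"
    print("exact:", jaccard(a, b), "estimate:", estimated_jaccard(a, b, dim=256))
    print("exact:", jaccard(a, c), "estimate:", estimated_jaccard(a, c, dim=256))
```

Because each coordinate depends only on the string itself, new categories can be encoded as they arrive, which is the property that makes this kind of encoder attractive in streaming settings. Raising `dim` tightens the Jaccard estimate at the cost of a wider feature vector.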
Submission history
From: Patricio Cerda [via CCSD proxy]
[v1] Wed, 3 Jul 2019 11:36:07 UTC (4,266 KB)
[v2] Thu, 11 Jul 2019 12:32:56 UTC (6,005 KB)
[v3] Wed, 17 Jul 2019 10:17:59 UTC (3,233 KB)
[v4] Thu, 5 Sep 2019 08:54:06 UTC (3,233 KB)
[v5] Mon, 18 May 2020 15:18:45 UTC (4,884 KB)