Streamlining Data Science Workflows with a Feature Catalog
The document discusses using a feature catalog to streamline data science workflows. A feature catalog provides a centralized place to define and organize feature logic and computation. It aims to make feature engineering more efficient, collaborative, reusable and consistent. Key benefits include having a single source of truth for features, increased iteration speed, and automated documentation. While a feature store can further optimize feature computation, a basic feature catalog is often sufficient initially to organize work. An example template is provided to help structure a feature catalog.
7. Feature Catalog: The Solution to Organized and Efficient Feature Computation
My definition: a way to structure and centralize your feature logic code, preferably with these goals in mind:
• User-friendly (easy to extend and to use)
• Group and reuse logic
• Balance flexibility and speed
• Autogenerate docs and diagrams
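To make the definition concrete, here is a minimal sketch of such a centralized catalog: a registry that maps feature names to their computation logic and dependencies, from which one-line docs can be autogenerated. All names (`CATALOG`, `feature`, `docs`) are hypothetical and not the template's actual API.

```python
# Hypothetical minimal feature catalog: one central registry holding
# feature logic, dependencies, and documentation.
CATALOG = {}

def feature(name, depends_on=()):
    """Register a feature function in the central catalog."""
    def decorator(fn):
        CATALOG[name] = {"fn": fn, "depends_on": tuple(depends_on), "doc": fn.__doc__}
        return fn
    return decorator

@feature("total_spend")
def total_spend(row):
    """Sum of all purchase amounts for a customer."""
    return sum(row["purchases"])

@feature("avg_spend", depends_on=("total_spend",))
def avg_spend(row):
    """Average purchase amount, reusing total_spend."""
    return CATALOG["total_spend"]["fn"](row) / len(row["purchases"])

def docs():
    """Autogenerate one-line docs from the registered docstrings."""
    return {name: spec["doc"] for name, spec in CATALOG.items()}

row = {"purchases": [10.0, 20.0, 30.0]}
print(CATALOG["total_spend"]["fn"](row))  # 60.0
print(CATALOG["avg_spend"]["fn"](row))    # 20.0
```

Because every feature lives in one registry with its dependencies and docstring, the catalog itself becomes the single source of truth and the documentation source.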
10. Benefits of a Feature Catalog
• Single source of truth
• Iteration speed
• Efficient computation
• Quality
• Collaboration
• Re-usable documentation
• Consistency between PoC and PROD
15. A Feature Store is the possible next step
Feature Catalog:
• Easy to integrate on any platform
• Features computed on demand (slow)
• Only compute what is required (cheap)
• Single use (no caching by the catalog itself)
Feature Store:
• Requires a more complex architecture
• Features precomputed (quick)
• Compute everything (expensive)
• Multiple use (cheap)
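The trade-off between the two can be sketched in a few lines: the catalog computes a feature each time it is requested (slow per request, but only for what is asked), while the store precomputes features for all entities up front (fast lookups, but it pays for everything). The names here are purely illustrative.

```python
def expensive_feature(x):
    # Stand-in for a costly feature computation.
    return x * x

# Catalog-style: computed on demand, no caching by the catalog itself.
def catalog_get(x):
    return expensive_feature(x)

# Store-style: precompute for all known entities, then serve cheap lookups.
store = {x: expensive_feature(x) for x in range(5)}

def store_get(x):
    return store[x]

print(catalog_get(3))  # 9, computed now
print(store_get(3))    # 9, precomputed earlier
```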
17. Kickstart your Feature Catalog with this template
Simple to use. Define features once and use them on multiple aggregation levels.
Feature groups can build on top of each other without redefining or recomputing.
Don't worry about loading all necessary tables; that is done for you. Only specify the feature names of interest.
https://xebia.ai/catalog-code
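The "define once, use on multiple aggregation levels" idea can be sketched as follows. The actual template uses Spark; this plain-Python version only illustrates the principle, and the function names (`total_amount`, `aggregate`) are hypothetical, not the template's API.

```python
from collections import defaultdict

transactions = [
    {"customer": "a", "country": "NL", "amount": 10.0},
    {"customer": "a", "country": "NL", "amount": 20.0},
    {"customer": "b", "country": "BE", "amount": 5.0},
]

# The feature logic is defined exactly once...
def total_amount(rows):
    return sum(r["amount"] for r in rows)

def aggregate(rows, level, feature_fn):
    """Apply a feature definition at any aggregation level."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[level]].append(r)
    return {key: feature_fn(grp) for key, grp in groups.items()}

# ...and reused per customer or per country without redefinition.
print(aggregate(transactions, "customer", total_amount))  # {'a': 30.0, 'b': 5.0}
print(aggregate(transactions, "country", total_amount))   # {'NL': 30.0, 'BE': 5.0}
```

In the Spark template the same separation holds: feature definitions are independent of the grouping keys, so one definition serves every aggregation level.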
19. There are different tools out there, what to use?
Feature Catalog template: https://xebia.ai/catalog-code
An example of how to structure your feature catalog using Spark.
+ flexible
- only a starting point (you still need to do the work)
Featuretools: https://github.com/alteryx/featuretools
A Python library for automated feature engineering.
+ lots of functionality out of the box
- no complex features (will only fit a limited set of use cases)
dbt Semantic Layer: https://www.getdbt.com/product/semantic-layer
Designed for core business metrics where consistency and precision are of key importance.
+ lots of functionality out of the box
- focus on metrics, not features