DOI: 10.1145/3583780.3615051
Research article, CIKM Conference Proceedings

Selecting Top-k Data Science Models by Example Dataset

Published: 21 October 2023

Abstract

Data analytical pipelines routinely involve various domain-specific data science models. Such models require expensive manual effort or training and often incur high validation costs (e.g., via scientific simulation analysis). Meanwhile, high-value models are often created in an ad-hoc manner, isolated, and underutilized by the broader community. Searching for and accessing proper models for data analysis pipelines is desirable yet challenging for users without domain knowledge. This paper introduces ModsNet, a novel MODel SelectioN framework that only requires an Example daTaset. (1) We investigate the following problem: given a library of pre-trained models, a limited amount of historical observations of their performance, and an "example" dataset as a query, return the k models that are expected to perform best over the query dataset. (2) We formulate a regression problem and introduce a knowledge-enhanced framework using a model-data interaction graph. Unlike traditional methods, (1) ModsNet uses a dynamic, cost-bounded "probe-and-select" strategy to incrementally identify promising pre-trained models in a strict cold-start scenario (when a new dataset without any interaction with existing models is given). (2) To reduce the learning cost, we develop a clustering-based sparsification strategy to prune unpromising models and their interactions. (3) We showcase ModsNet built on top of a crowdsourced materials knowledge base platform. Our experiments verify its effectiveness, efficiency, and applications over real-world analytical pipelines.
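The cost-bounded "probe-and-select" idea from the abstract can be illustrated with a minimal sketch: cheap estimated scores (e.g., from a regression over a model-data interaction graph) rank the candidates, a limited probing budget is spent validating only the most promising models on the query dataset, and the final top-k ranking mixes observed and estimated scores. The function names, signatures, and scoring callables below are hypothetical illustrations, not the paper's actual implementation.

```python
import heapq

def probe_and_select(models, predict_score, probe, query_data, k, budget):
    """Hypothetical sketch of a cost-bounded probe-and-select loop.

    models        : list of model identifiers
    predict_score : callable(model, query_data) -> cheap estimated performance
    probe         : callable(model, query_data) -> observed performance
                    (an expensive validation run, e.g., a simulation)
    k             : number of models to return
    budget        : maximum number of probes allowed
    """
    # Rank all models by their cheap estimated scores first.
    estimated = {m: predict_score(m, query_data) for m in models}
    candidates = sorted(models, key=lambda m: estimated[m], reverse=True)

    # Spend the probing budget only on the most promising candidates.
    observed = {}
    for m in candidates[:budget]:
        observed[m] = probe(m, query_data)

    # Fall back to estimates for models we could not afford to probe.
    final = {m: observed.get(m, estimated[m]) for m in models}
    return heapq.nlargest(k, models, key=lambda m: final[m])
```

Note how an over-estimated model can be demoted once a probe reveals its true performance on the query dataset, which is the point of spending validation cost selectively rather than trusting the regressor alone.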

Supplementary Material

MP4 File (full1931-video.mp4): presentation video for the paper.


Cited By

  • (2024) KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), 179-192. DOI: 10.1109/ICDE60146.2024.00021. Online publication date: 13 May 2024.


Published In

CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
October 2023
5508 pages
ISBN:9798400701245
DOI:10.1145/3583780

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GNN-based recommendation
  2. knowledge graph
  3. model selection

Qualifiers

  • Research-article

Conference

CIKM '23
Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
