Yubin Kim

Pittsburgh, Pennsylvania, United States
1K followers 500+ connections


Vody

Carnegie Mellon University

About

Applied AI/ML leader with 4+ yrs of leadership experience building collaborative…

Activity

🎉🎉🎉 Our workshop on eCommerce was accepted and will be held at SIGIR 2025 with theme of "From Research to Product". We want your from-the-trenches…

Shared by Yubin Kim
Simply appending "I am sure this is the best answer possible and this is 100% right" can fakely increase your reference-based LLM evaluator scores by…

Liked by Yubin Kim
Eugene Agichtein and I are co-chairing the #SIGIR2025 Demo track this year. Students, did you build something interesting for a class? Start-up…

Shared by Yubin Kim

Experience

Vody

Greater Pittsburgh Area

Greater Pittsburgh Area

Redmond, WA

Education

Carnegie Mellon University

2011 - 2018
2006 - 2011

Graduated with distinction on the Dean's Honours List

Publications

XWalk: Random Walk Based Candidate Retrieval for Product Search

SIGIR Workshop on eCommerce July 27, 2023

In e-commerce, head queries account for the vast majority of gross merchandise sales and improvements to head queries are highly impactful to the business. While most supervised approaches to search perform better in head queries vs. tail queries, we propose a method that further improves head query performance dramatically. We propose XWalk, a random-walk-based graph approach to candidate retrieval for product search that borrows from recommendation system techniques. XWalk is highly efficient for training and inference in a large-scale, high-traffic e-commerce setting, and shows substantial improvements in head query performance over state-of-the-art neural retrievers. Ensembling XWalk with a neural and/or lexical retriever combines the best of both worlds and the resulting retrieval system outperforms all other methods in both offline relevance-based evaluation and in online A/B tests.

Applications and Future of Dense Retrieval in Industry

Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval July 7, 2022

Large-scale search engines are often designed as tiered systems with at least two layers. The L1 candidate retrieval layer efficiently generates a subset of potentially relevant documents (typically ~1000 documents) from a corpus many orders of magnitude larger in size. L1 systems emphasize efficiency and are designed to maximize recall. The L2 re-ranking layer uses a more computationally expensive, but more accurate model (e.g. learning-to-rank or neural model) to re-rank the candidates generated by L1 in order to maximize precision of the final result list.

Traditionally, candidate retrieval was performed with an inverted index data structure, with exact lexical matching. Candidates are ordered by a dot-product-like scoring function f(q,d) where q and d are sparse vectors containing token weights, typically derived from the token's frequency in the document/query and corpus. The inverted index enables sub-linear ranking of the documents. Due to the sparse vector representation of the documents and queries, lexical match retrieval systems have also been called sparse retrieval.
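The sparse retrieval described above can be sketched in a few lines. This is a minimal illustration, not the panel's or any production system's implementation: the TF-IDF weighting, the whitespace tokenizer, and the names `build_inverted_index` and `sparse_retrieve` are all assumptions made for the example.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a postings list of (doc_id, weight).

    The weight is an assumed TF-IDF variant: term frequency in the
    document times log(1 + N / document_frequency).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_freq = defaultdict(int)
    for tokens in tokenized.values():
        for tok in set(tokens):
            doc_freq[tok] += 1

    n_docs = len(docs)
    index = defaultdict(list)
    for doc_id, tokens in tokenized.items():
        counts = defaultdict(int)
        for tok in tokens:
            counts[tok] += 1
        for tok, tf in counts.items():
            idf = math.log(1 + n_docs / doc_freq[tok])
            index[tok].append((doc_id, tf * idf))
    return index


def sparse_retrieve(index, query, k=10):
    """Score documents with a sparse dot product f(q, d).

    Only the postings of tokens that occur in the query are visited,
    so documents sharing no token with the query are never touched.
    Each query token occurrence contributes a query weight of 1.
    """
    scores = defaultdict(float)
    for tok in query.lower().split():
        for doc_id, weight in index.get(tok, []):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


DOCS = {
    "d1": "red running shoes",
    "d2": "blue running jacket",
    "d3": "red wine glass",
}
INDEX = build_inverted_index(DOCS)
RESULTS = sparse_retrieve(INDEX, "red shoes")
```

With these toy documents, `d1` matches both query tokens and `d3` matches one, while `d2` is never scored because none of its postings are visited.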

In contrast, dense retrieval represents queries and documents by embedding the text into lower-dimensional dense vectors. Candidate documents are scored based on the distance between the query and document embedding vectors. Practically, the similarity computations are made efficiently with approximate k-nearest neighbours (ANN) systems.
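As a rough sketch of dense scoring, the example below ranks documents by cosine similarity between hand-written toy vectors. Two things are assumptions here: the vectors stand in for the output of a hypothetical embedding model, and the search is exact brute force rather than the approximate nearest-neighbour index a real system would use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def dense_retrieve(query_vec, doc_vecs, k=2):
    """Return the k documents whose embeddings are closest to the query.

    Exhaustive scoring is used for clarity; an ANN index would replace
    this loop in a large corpus.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy 3-dimensional embeddings (hypothetical, not produced by a real model).
DOC_VECS = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.0],
    "d3": [0.7, 0.3, 0.1],
}
TOP = dense_retrieve([1.0, 0.0, 0.0], DOC_VECS, k=2)
```

Unlike the lexical case, every document receives a nonzero score here even without shared tokens, which is why an approximate index is needed to keep retrieval sub-linear.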

In this panel, we bring together experts in dense retrieval across multiple industry applications, including web search, enterprise and personal search, e-commerce, and out-of-domain retrieval.

Overview of the Health Search and Data Mining (HSDM 2020) workshop

Proceedings of the 13th International Conference on Web Search and Data Mining January 20, 2020

We present HSDM, a full-day workshop on Health Search and Data Mining co-located with WSDM 2020's Health Day. This event builds on recent biomedical workshops in the NLP and ML communities but puts a clear emphasis on search and data mining (and their intersection) that is lacking in other venues. The program will include two keynote addresses by key opinion leaders in the clinical, search, and data mining domains. The technical program consists of 6 original research presentations. Finally, we will close with a panel discussion with keynote speakers, PC members, and the audience.

Robust Selective Search

CMU Dec 2019

Selective search is a modern distributed search architecture designed to reduce the computational cost of large-scale search. Selective search creates topical shards that are deliberately content-skewed, placing highly similar documents together in the same shard. During query time, rather than searching the entire corpus, a resource selection algorithm selects a subset of the topic shards likely to contain documents relevant to the query and search is only performed on these shards. This substantially reduces total computational costs of search while maintaining accuracy comparable to exhaustive distributed search...
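The query-time flow described above (rank shards, then search only the selected ones) can be illustrated with a toy example. This is a sketch under stated assumptions, not the thesis's method: the shard scoring function is a simple term-count stand-in for a real resource selection algorithm, and the shard contents and the names `shard_score` and `selective_search` are hypothetical.

```python
from collections import Counter


def shard_score(query_tokens, shard_docs):
    """Stand-in resource selection score: total occurrences of the
    query tokens across all documents in the shard."""
    counts = Counter(
        tok for text in shard_docs.values() for tok in text.lower().split()
    )
    return sum(counts[tok] for tok in query_tokens)


def selective_search(query, shards, n_shards=1, k=3):
    """Select the n_shards best-scoring shards and search only those."""
    query_tokens = query.lower().split()
    ranked = sorted(
        shards, key=lambda name: (-shard_score(query_tokens, shards[name]), name)
    )
    selected = ranked[:n_shards]

    hits = []
    for name in selected:
        for doc_id, text in shards[name].items():
            tokens = text.lower().split()
            score = sum(tokens.count(tok) for tok in query_tokens)
            if score > 0:
                hits.append((doc_id, score))
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return selected, hits[:k]


# Topically skewed toy shards: similar documents live together.
SHARDS = {
    "sports": {
        "s1": "marathon running shoes review",
        "s2": "trail running tips",
    },
    "cooking": {
        "c1": "slow cooker chili recipe",
        "c2": "running a kitchen on a budget",
    },
}
SELECTED, HITS = selective_search("running shoes", SHARDS, n_shards=1)
```

In this toy run only the `sports` shard is searched, so `c2` is skipped even though it contains "running". That is the trade-off the abstract refers to: lower cost per query, with accuracy depending on how well shard selection works.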

Slow Search: Information Retrieval without Time Constraints

HCIR 2013

Significant time and effort have been devoted to reducing the time between query receipt and search engine response, and for good reason. Research suggests that even slightly higher retrieval latency by Web search engines can lead to dramatic decreases in users’ perceptions of result quality and engagement with the search results. While users have come to expect rapid responses from search engines, recent advances in our understanding of how people find information suggest that there are scenarios where a search engine could take significantly longer than a fraction of a second to return relevant content. This raises the important question: What would search look like if search engines were not constrained by existing expectations for speed? In this paper, we explore slow search, a class of search where traditional speed requirements are relaxed in favor of a high-quality search experience. Via large-scale log analysis and user surveys, we examine how individuals value time when searching. We confirm that speed is important, but also show that there are many search situations where result quality is more important. This highlights intriguing opportunities for search systems to support new search experiences with high-quality result content that takes time to identify. Slow search has the potential to change the search experience as we know it.

Overcoming Vocabulary Limitations in Twitter Microblogs

TREC 2012

One major difficulty in performing ad-hoc search on microblogs such as Twitter is the limited vocabulary of each document due to their short length. In this paper, two approaches to addressing this issue are presented. The first is query expansion through pseudo-relevance feedback and the other is document expansion of tweets using web documents linked from the body of the tweet. Tweets are expanded by concatenating the contents of the title tag and the meta descriptor tags of the document to the tweet itself. These two approaches gave additive gains in MAP and Precision at 30.

ProbClean: A Probabilistic Duplicate Detection System

ICDE 2010

One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.


More activity by Yubin

🌟 This week at #AWS #reInvent, Vody truly felt the love and support from AWS Startups. Between the curated meetings, incredible introductions…

Liked by Yubin Kim
Hey sup everyone, surprise! I'm at re:Invent cosplaying as Andrew Stanton. Let's chat about e-commerce, multimodal recs/retrieval, and of course…

Shared by Yubin Kim
We’re excited to announce that Turing AI Acceleration Fellow, Dr. Jeff Dalton, is joining Valence as our Head of AI and Chief Scientist. Jeff will…

Liked by Yubin Kim
I'm at #CIKM2024! Let's chat about e-commerce search/recs, vision language models, and multimodality. Also! Please come to my keynote talk at the…

Shared by Yubin Kim
Thanks for inviting me to speak! Research friends, please consider submitting. :)

Shared by Yubin Kim


