Yubin Kim
Pittsburgh, Pennsylvania, United States
1K followers
500+ connections
About
Applied AI/ML leader with 4+ yrs of leadership experience building collaborative…
Activity
-
Eugene Agichtein and I are co-chairing the #SIGIR2025 Demo track this year. Students, did you build something interesting for a class? Start-up…
Shared by Yubin Kim
-
🌟 This week at #AWS #reInvent, Vody truly felt the love and support from AWS Startups. Between the curated meetings, incredible introductions…
Liked by Yubin Kim
-
Hey sup everyone, surprise! I'm at re:Invent cosplaying as Andrew Stanton. Let's chat about e-commerce, multimodal recs/retrieval, and of course…
Shared by Yubin Kim
Experience
Education
Publications
-
XWalk: Random Walk Based Candidate Retrieval for Product Search
SIGIR Workshop on eCommerce
In e-commerce, head queries account for the vast majority of gross merchandise sales and improvements to head queries are highly impactful to the business. While most supervised approaches to search perform better in head queries vs. tail queries, we propose a method that further improves head query performance dramatically. We propose XWalk, a random-walk based graph approach to candidate retrieval for product search that borrows from recommendation system techniques. XWalk is highly efficient to train and to run inference with in a large-scale, high-traffic e-commerce setting, and shows substantial improvements in head query performance over state-of-the-art neural retrievers. Ensembling XWalk with a neural and/or lexical retriever combines the best of both worlds and the resulting retrieval system outperforms all other methods in both offline relevance-based evaluation and in online A/B tests.
-
Applications and Future of Dense Retrieval in Industry
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Large-scale search engines are often designed as tiered systems with at least two layers. The L1 candidate retrieval layer efficiently generates a subset of potentially relevant documents (typically ~1000 documents) from a corpus many orders of magnitude larger in size. L1 systems emphasize efficiency and are designed to maximize recall. The L2 re-ranking layer uses a more computationally expensive, but more accurate model (e.g. learning-to-rank or neural model) to re-rank the candidates generated by L1 in order to maximize precision of the final result list.
Traditionally, candidate retrieval was performed with an inverted index data structure, with exact lexical matching. Candidates are ordered by a dot-product-like scoring function f(q,d) where q and d are sparse vectors containing token weights, typically derived from the token's frequency in the document/query and corpus. The inverted index enables sub-linear ranking of the documents. Due to the sparse vector representation of the documents and queries, lexical match retrieval systems have also been called sparse retrieval.
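The sparse scoring described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: the TF-IDF weighting, the whitespace tokenizer, and the names `build_index` and `sparse_search` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: token -> list of (doc_id, weight) postings.

    The weight is a simple TF-IDF value (an assumed weighting scheme).
    """
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(text.lower().split()))
    n_docs = len(docs)
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for token, tf in Counter(text.lower().split()).items():
            idf = math.log(1 + n_docs / doc_freq[token])
            index[token].append((doc_id, tf * idf))
    return index


def sparse_search(index, query, k=3):
    """Score documents by a sparse dot product f(q, d).

    Only postings for tokens that appear in the query are visited, so
    documents sharing no token with the query are never touched.
    """
    scores = defaultdict(float)
    for token, q_weight in Counter(query.lower().split()).items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

With three toy documents, a query such as "red shoes" only walks the posting lists for "red" and "shoes", which is why the inverted index avoids scoring the whole corpus.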
In contrast, dense retrieval represents queries and documents by embedding the text into lower-dimensional dense vectors. Candidate documents are scored based on the distance between the query and document embedding vectors. In practice, the similarity computations are made efficient with approximate k-nearest neighbours (ANN) systems.
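As a rough sketch of dense scoring, the example below ranks documents by cosine similarity between embedding vectors using an exhaustive scan. The vectors are hand-written toy values standing in for the output of an embedding model, cosine similarity is one assumed choice of distance, and `dense_search` is a hypothetical name. A production system would replace the exhaustive loop with an ANN index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_search(doc_vectors, query_vector, k=2):
    """Exact k-nearest-neighbour search by brute force.

    Every document is scored, so the cost is linear in corpus size;
    ANN systems trade a little accuracy to avoid this full scan.
    """
    scored = [
        (doc_id, cosine(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:k]
```

Unlike the sparse case, a document can score well here without sharing any token with the query, since matching happens in the embedding space.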
In this panel, we bring together experts in dense retrieval across multiple industry applications, including web search, enterprise and personal search, e-commerce, and out-of-domain retrieval.
-
Overview of the Health Search and Data Mining (HSDM 2020) workshop
Proceedings of the 13th International Conference on Web Search and Data Mining
We present HSDM, a full-day workshop on Health Search and Data Mining co-located with WSDM 2020's Health Day. This event builds on recent biomedical workshops in the NLP and ML communities but puts a clear emphasis on search and data mining (and their intersection) that is lacking in other venues. The program will include two keynote addresses by key opinion leaders in the clinical, search, and data mining domains. The technical program consists of 6 original research presentations. Finally, we will close with a panel discussion with keynote speakers, PC members, and the audience.
-
Robust Selective Search
CMU
Selective search is a modern distributed search architecture designed to reduce the computational cost of large-scale search. Selective search creates topical shards that are deliberately content-skewed, placing highly similar documents together in the same shard. During query time, rather than searching the entire corpus, a resource selection algorithm selects a subset of the topic shards likely to contain documents relevant to the query and search is only performed on these shards. This substantially reduces total computational costs of search while maintaining accuracy comparable to exhaustive distributed search...
-
Slow Search: Information Retrieval without Time Constraints
HCIR
Significant time and effort has been devoted to reducing the time between query receipt and search engine response, and for good reason. Research suggests that even slightly higher retrieval latency by Web search engines can lead to dramatic decreases in users’ perceptions of result quality and engagement with the search results. While users have come to expect rapid responses from search engines, recent advances in our understanding of how people find information suggest that there are scenarios where a search engine could take significantly longer than a fraction of a second to return relevant content. This raises the important question: What would search look like if search engines were not constrained by existing expectations for speed? In this paper, we explore slow search, a class of search where traditional speed requirements are relaxed in favor of a high quality search experience. Via large-scale log analysis and user surveys, we examine how individuals value time when searching. We confirm that speed is important, but also show that there are many search situations where result quality is more important. This highlights intriguing opportunities for search systems to support new search experiences with high quality result content that takes time to identify. Slow search has the potential to change the search experience as we know it.
-
Overcoming Vocabulary Limitations in Twitter Microblogs
TREC
One major difficulty in performing ad-hoc search on microblogs such as Twitter is the limited vocabulary of each document due to their short length. In this paper, two approaches to addressing this issue are presented. The first is query expansion through pseudo-relevance feedback and the other is document expansion of tweets using web documents linked from the body of the tweet. Tweets are expanded by concatenating the contents of the title tag and the meta descriptor tags of the document to the tweet itself. These two approaches gave additive gains in MAP and Precision at 30.
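The document expansion step can be illustrated with a short sketch. It assumes the linked page's HTML has already been fetched, and it treats `<meta name="description">` as the meta tag of interest, which is a guess at what the abstract means by "meta descriptor tags". The names `TitleMetaParser` and `expand_tweet` are hypothetical.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collect the <title> text and <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.meta_parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            content = attributes.get("content")
            if content:
                self.meta_parts.append(content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data.strip())


def expand_tweet(tweet, linked_html):
    """Append the linked page's title and meta description to the tweet."""
    parser = TitleMetaParser()
    parser.feed(linked_html)
    extra = [part for part in parser.title_parts + parser.meta_parts if part]
    return " ".join([tweet] + extra)
```

The expanded text would then be indexed in place of the original tweet, giving the retrieval model more vocabulary to match against.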
-
ProbClean: A Probabilistic Duplicate Detection System
ICDE
One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
More activity by Yubin
-
We’re excited to announce that Turing AI Acceleration Fellow, Dr. Jeff Dalton, is joining Valence as our Head of AI and Chief Scientist. Jeff will…
Liked by Yubin Kim
-
I'm at #CIKM2024! Let's chat about e-commerce search/recs, vision language models, and multimodality. Also! Please come to my keynote talk at the…
Shared by Yubin Kim
-
Thanks for inviting me to speak! Research friends, please consider submitting. :)
Shared by Yubin Kim
-
So I just realized that I spent the whole of #SIGIR2024 without up-to-date employment on LinkedIn! Now is a good time as any to announce that I've…
Shared by Yubin Kim