Yubin Kim

Pittsburgh, Pennsylvania, United States
1K followers 500+ connections

About

Applied AI/ML leader with 4+ yrs of leadership experience building collaborative…


Experience

  • Vody
  • Greater Pittsburgh Area
  • Greater Pittsburgh Area
  • Redmond, WA

Education

Publications

  • XWalk: Random Walk Based Candidate Retrieval for Product Search

    SIGIR Workshop on eCommerce

    In e-commerce, head queries account for the vast majority of gross merchandise sales and improvements to head queries are highly impactful to the business. While most supervised approaches to search perform better in head queries vs. tail queries, we propose a method that further improves head query performance dramatically. We propose XWalk, a random-walk based graph approach to candidate retrieval for product search that borrows from recommendation system techniques. XWalk is highly efficient to train and run inference in a large-scale, high-traffic e-commerce setting, and shows substantial improvements in head query performance over state-of-the-art neural retrievers. Ensembling XWalk with a neural and/or lexical retriever combines the best of both worlds, and the resulting retrieval system outperforms all other methods in both offline relevance-based evaluation and online A/B tests.

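    (A toy code sketch of the random-walk retrieval idea appears after this publications list.)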
  • Applications and Future of Dense Retrieval in Industry

    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

    Large-scale search engines are often designed as tiered systems with at least two layers. The L1 candidate retrieval layer efficiently generates a subset of potentially relevant documents (typically ~1000 documents) from a corpus many orders of magnitude larger in size. L1 systems emphasize efficiency and are designed to maximize recall. The L2 re-ranking layer uses a more computationally expensive, but more accurate model (e.g. learning-to-rank or neural model) to re-rank the candidates generated by L1 in order to maximize precision of the final result list.

    Traditionally, candidate retrieval was performed with an inverted index data structure, with exact lexical matching. Candidates are ordered by a dot-product-like scoring function f(q,d) where q and d are sparse vectors containing token weights, typically derived from the token's frequency in the document/query and corpus. The inverted index enables sub-linear ranking of the documents. Due to the sparse vector representation of the documents and queries, lexical match retrieval systems have also been called sparse retrieval.

    To contrast, dense retrieval represents queries and documents by embedding the text into lower dimensional dense vectors. Candidate documents are scored based on the distance between the query and document embedding vectors. Practically, the similarity computations are made efficiently with approximate k-nearest neighbours (ANN) systems.

    In this panel, we bring together experts in dense retrieval across multiple industry applications, including web search, enterprise and personal search, e-commerce, and out-of-domain retrieval.

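    (A toy sketch contrasting sparse and dense scoring appears after this publications list.)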
  • Overview of the Health Search and Data Mining (HSDM 2020) workshop

    Proceedings of the 13th International Conference on Web Search and Data Mining

    We present HSDM, a full-day workshop on Health Search and Data Mining co-located with WSDM 2020's Health Day. This event builds on recent biomedical workshops in the NLP and ML communities but puts a clear emphasis on search and data mining (and their intersection) that is lacking in other venues. The program will include two keynote addresses by key opinion leaders in the clinical, search, and data mining domains. The technical program consists of 6 original research presentations. Finally, we will close with a panel discussion with keynote speakers, PC members, and the audience.

  • Robust Selective Search

    CMU

    Selective search is a modern distributed search architecture designed to reduce the computational cost of large-scale search. Selective search creates topical shards that are deliberately content-skewed, placing highly similar documents together in the same shard. During query time, rather than searching the entire corpus, a resource selection algorithm selects a subset of the topic shards likely to contain documents relevant to the query and search is only performed on these shards. This substantially reduces total computational costs of search while maintaining accuracy comparable to exhaustive distributed search...

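    (A toy sketch of a simple resource-selection step appears after this publications list.)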
  • Slow Search: Information Retrieval without Time Constraints

    HCIR

    Significant time and effort has been devoted to reducing the time between query receipt and search engine response, and for good reason. Research suggests that even slightly higher retrieval latency by Web search engines can lead to dramatic decreases in users’ perceptions of result quality and engagement with the search results. While users have come to expect rapid responses from search engines, recent advances in our understanding of how people find information suggest that there are scenarios where a search engine could take significantly longer than a fraction of a second to return relevant content. This raises the important question: What would search look like if search engines were not constrained by existing expectations for speed? In this paper, we explore slow search, a class of search where traditional speed requirements are relaxed in favor of a high quality search experience. Via large-scale log analysis and user surveys, we examine how individuals value time when searching. We confirm that speed is important, but also show that there are many search situations where result quality is more important. This highlights intriguing opportunities for search systems to support new search experiences with high quality result content that takes time to identify. Slow search has the potential to change the search experience as we know it.

  • Overcoming Vocabulary Limitations in Twitter Microblogs

    TREC

    One major difficulty in performing ad-hoc search on microblogs such as Twitter is the limited vocabulary of each document due to its short length. In this paper, two approaches to addressing this issue are presented. The first is query expansion through pseudo-relevance feedback and the other is document expansion of tweets using web documents linked from the body of the tweet. Tweets are expanded by concatenating the contents of the title tag and the meta descriptor tags of the document to the tweet itself. These two approaches gave additive gains in MAP and Precision at 30.

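    (A toy sketch of this document-expansion step appears after this publications list.)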
  • ProbClean: A Probabilistic Duplicate Detection System

    ICDE

    One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.

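The XWalk abstract above describes a random-walk approach to candidate retrieval over an engagement graph, but no implementation details are given here. As a rough illustration of the general idea only (not the paper's actual method), the Python sketch below runs short weighted random walks from a query node over a made-up bipartite query-product click graph and returns the most-visited products as candidates; the graph, node names, and walk parameters are all invented.

    import random
    from collections import Counter

    # Hypothetical bipartite engagement graph: queries <-> products.
    # Edge weights stand in for click/purchase counts.
    GRAPH = {
        "q:running shoes": {"p:shoe_a": 10, "p:shoe_b": 4},
        "q:trail shoes":   {"p:shoe_b": 6, "p:shoe_c": 3},
        "p:shoe_a": {"q:running shoes": 10},
        "p:shoe_b": {"q:running shoes": 4, "q:trail shoes": 6},
        "p:shoe_c": {"q:trail shoes": 3},
    }

    def weighted_step(node: str) -> str:
        """Take one weighted random step from `node`."""
        neighbors = GRAPH[node]
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        return random.choices(nodes, weights=weights, k=1)[0]

    def random_walk_candidates(query: str, n_walks: int = 1000,
                               walk_len: int = 3, top_k: int = 10):
        """Run many short walks from the query node and count product visits."""
        visits = Counter()
        for _ in range(n_walks):
            node = query
            for _ in range(walk_len):
                node = weighted_step(node)
                if node.startswith("p:"):
                    visits[node] += 1
        return visits.most_common(top_k)

    print(random_walk_candidates("q:running shoes"))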
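
The dense-retrieval panel abstract above contrasts sparse lexical scoring, a dot-product-like f(q, d) over token-weight vectors, with dense retrieval over low-dimensional embeddings. The toy sketch below scores an invented two-document corpus both ways; in practice the sparse path would use an inverted index and the dense path an ANN library rather than this brute-force loop, and every vector here is made up for illustration.

    import math

    # Sparse (lexical) retrieval: dot product of token-weight vectors.
    def sparse_score(q: dict, d: dict) -> float:
        """f(q, d) = sum over shared tokens of query_weight * doc_weight."""
        return sum(w * d[t] for t, w in q.items() if t in d)

    # Dense retrieval: similarity of low-dimensional embedding vectors.
    def dense_score(q: list, d: list) -> float:
        """Cosine similarity between query and document embeddings."""
        dot = sum(a * b for a, b in zip(q, d))
        norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
        return dot / norm if norm else 0.0

    # Invented toy data: token weights and 4-d embeddings per document.
    docs = {
        "doc1": {"sparse": {"red": 1.2, "shoe": 0.8}, "dense": [0.1, 0.9, 0.0, 0.2]},
        "doc2": {"sparse": {"blue": 1.0, "shirt": 0.7}, "dense": [0.8, 0.1, 0.3, 0.0]},
    }
    query = {"sparse": {"red": 1.0, "shoes": 0.9}, "dense": [0.2, 0.8, 0.1, 0.1]}

    # Brute-force candidate ranking by each scorer (ANN would replace this).
    for name, d in docs.items():
        print(name,
              round(sparse_score(query["sparse"], d["sparse"]), 3),
              round(dense_score(query["dense"], d["dense"]), 3))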
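
The selective search abstract above relies on a resource-selection step that routes each query to a few topical shards. As an illustration only, the sketch below uses a simple term-overlap heuristic over invented shards, not the resource-selection algorithms studied in the publication.

    from collections import Counter

    # Hypothetical topical shards: each shard holds similar documents.
    shards = {
        "sports":  ["game score team", "season playoffs team"],
        "finance": ["stock market price", "bond yield price"],
    }

    def term_profile(texts):
        """Aggregate term counts for a shard (a crude shard representation)."""
        counts = Counter()
        for t in texts:
            counts.update(t.split())
        return counts

    profiles = {name: term_profile(docs) for name, docs in shards.items()}

    def select_shards(query: str, k: int = 1):
        """Rank shards by overlap between query terms and shard term profiles,
        then search only the top-k shards instead of the whole corpus."""
        q_terms = query.split()
        scored = {
            name: sum(profile[t] for t in q_terms)
            for name, profile in profiles.items()
        }
        return sorted(scored, key=scored.get, reverse=True)[:k]

    print(select_shards("playoffs score"))  # -> ['sports']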
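
The microblog abstract above expands tweets by concatenating the title and meta description tags of linked web pages. A small sketch of that expansion step using Python's standard html.parser, with an invented tweet and page, could look like this:

    from html.parser import HTMLParser

    class TitleMetaParser(HTMLParser):
        """Collects <title> text and <meta name="description"> content values."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""
            self.descriptions = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self.in_title = True
            elif tag == "meta" and attrs.get("name", "").lower() == "description":
                self.descriptions.append(attrs.get("content", ""))

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    def expand_tweet(tweet_text: str, linked_html_pages: list) -> str:
        """Concatenate linked pages' title and meta description text to the tweet."""
        parts = [tweet_text]
        for html in linked_html_pages:
            parser = TitleMetaParser()
            parser.feed(html)
            parts.append(parser.title)
            parts.extend(parser.descriptions)
        return " ".join(p for p in parts if p)

    # Invented example: a short tweet plus the HTML of one page it links to.
    page = "<html><head><title>Marathon training guide</title>" \
           "<meta name='description' content='Tips for first-time marathon runners'>" \
           "</head><body>...</body></html>"
    print(expand_tweet("Great read before race day!", [page]))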
