Top-k and Clustering with Noisy Comparisons

Authors:

Susan Davidson,

Sanjeev Khanna,

Sudeepa Roy

ACM Transactions on Database Systems (TODS), Volume 39, Issue 4

Article No.: 35, Pages 1 - 39

https://doi.org/10.1145/2684066

Published: 30 December 2014

Abstract

We study the problems of max/top-k and clustering when the comparison operations are performed by oracles whose answers may be erroneous. Comparisons may be either of type or of value: given two data elements, the answer to a type comparison is “yes” if the elements have the same type and therefore belong to the same group (cluster); the answer to a value comparison orders the two data elements. We give efficient algorithms that are guaranteed to achieve correct results with high probability, analyze the cost of these algorithms in terms of the total number of comparisons (i.e., using a fixed-cost model), and show that they are essentially the best possible. We also show that fewer comparisons are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases. Finally, we examine another important class of cost functions, concave functions, which balance the number of rounds of interaction with the oracle against the number of questions asked of the oracle. The results of this article form an important first step in providing a formal basis for max/top-k and clustering queries in crowdsourcing applications, that is, when the oracle is implemented using the crowd. We explain what simplifying assumptions are made in the analysis, what results carry over to a generalized crowdsourcing setting, and what extensions are required to support a full-fledged model.
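To make the setting concrete, the following is a minimal sketch of finding the maximum with a noisy value-comparison oracle under a fixed-cost model (cost = number of comparisons). It is not the algorithm from the article: it is a plain single-elimination tournament in which each match is decided by a majority vote over repeated questions. The function names (`make_noisy_oracle`, `majority_greater`, `noisy_max`), the `repeats` parameter, and the error model (each answer flipped independently with a fixed probability below 1/2) are assumptions made for illustration only.

```python
import random
from typing import Callable, Sequence, Tuple

Oracle = Callable[[float, float], bool]


def make_noisy_oracle(error_prob: float, rng: random.Random) -> Oracle:
    """Build a hypothetical value-comparison oracle for 'is a > b?'.

    Assumption: each answer is flipped independently with probability
    error_prob, regardless of which elements are compared.
    """
    def oracle(a: float, b: float) -> bool:
        truth = a > b
        return (not truth) if rng.random() < error_prob else truth
    return oracle


def majority_greater(a: float, b: float, oracle: Oracle, repeats: int) -> bool:
    """Ask the same question `repeats` times and return the majority answer."""
    yes = sum(1 for _ in range(repeats) if oracle(a, b))
    return 2 * yes > repeats


def noisy_max(items: Sequence[float], oracle: Oracle,
              repeats: int = 15) -> Tuple[float, int]:
    """Single-elimination tournament with majority-voted matches.

    Returns (winner, total number of oracle calls). The call count is the
    cost under a fixed-cost model: (len(items) - 1) * repeats.
    """
    if not items:
        raise ValueError("items must be non-empty")
    survivors = list(items)
    calls = 0
    while len(survivors) > 1:
        next_round = []
        # Pair up neighbours; the winner of each pair advances.
        for i in range(0, len(survivors) - 1, 2):
            a, b = survivors[i], survivors[i + 1]
            calls += repeats
            next_round.append(a if majority_greater(a, b, oracle, repeats) else b)
        # With an odd number of survivors, the last one gets a bye.
        if len(survivors) % 2 == 1:
            next_round.append(survivors[-1])
        survivors = next_round
    return survivors[0], calls


if __name__ == "__main__":
    rng = random.Random(0)
    oracle = make_noisy_oracle(error_prob=0.2, rng=rng)
    winner, cost = noisy_max(list(range(100)), oracle, repeats=15)
    print(winner, cost)
```

In this sketch, raising `repeats` lowers the chance that any single match is decided wrongly, at a proportional increase in the number of comparisons; the article's algorithms and bounds address how to make this trade-off efficiently and are not reproduced here.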

Published In

ACM Transactions on Database Systems Volume 39, Issue 4

Invited Articles Issue, SIGMOD 2013, PODS 2013 and ICDT 2013

December 2014

341 pages

ISSN:0362-5915

EISSN:1557-4644

DOI:10.1145/2691190

Editor:
Christian S. Jensen
Aalborg University, Denmark

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 December 2014

Accepted: 01 October 2014

Revised: 01 June 2014

Received: 01 October 2013

Published in TODS Volume 39, Issue 4

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 29
Total Downloads: 395
Downloads (Last 12 months): 17
Downloads (Last 6 weeks): 1

Reflects downloads up to 22 Sep 2024
