
Top-k and Clustering with Noisy Comparisons

Published: 30 December 2014

Abstract

We study the problems of max/top-k and clustering when the comparison operations may be performed by oracles whose answer may be erroneous. Comparisons may either be of type or of value: given two data elements, the answer to a type comparison is “yes” if the elements have the same type and therefore belong to the same group (cluster); the answer to a value comparison orders the two data elements. We give efficient algorithms that are guaranteed to achieve correct results with high probability, analyze the cost of these algorithms in terms of the total number of comparisons (i.e., using a fixed-cost model), and show that they are essentially the best possible. We also show that fewer comparisons are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases. Finally, we examine another important class of cost functions, concave functions, which balances the number of rounds of interaction with the oracle with the number of questions asked of the oracle. Results of this article form an important first step in providing a formal basis for max/top-k and clustering queries in crowdsourcing applications, that is, when the oracle is implemented using the crowd. We explain what simplifying assumptions are made in the analysis, what results carry to a generalized crowdsourcing setting, and what extensions are required to support a full-fledged model.
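To make the setting in the abstract concrete, the following is a minimal sketch of a noisy value-comparison oracle and a naive way to find the maximum with it. It is not the article's algorithm, only a simple baseline under assumptions that the abstract does not state: each answer is wrong independently with a fixed probability below 1/2, and repeating a question yields a fresh answer. The names `make_noisy_value_oracle`, `majority_greater`, and `noisy_max` are hypothetical and chosen for illustration.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical oracle type: answers "is a greater than b?"
ValueOracle = Callable[[float, float], bool]


def make_noisy_value_oracle(error_prob: float, rng: random.Random) -> ValueOracle:
    """Build an oracle that gives the wrong answer with probability error_prob.

    Assumption: errors are independent across calls (a fixed-error model).
    """
    def oracle(a: float, b: float) -> bool:
        truth = a > b
        return (not truth) if rng.random() < error_prob else truth
    return oracle


def majority_greater(a: float, b: float, oracle: ValueOracle, repeats: int) -> bool:
    """Ask the same question `repeats` times and return the majority answer."""
    yes_votes = sum(1 for _ in range(repeats) if oracle(a, b))
    return 2 * yes_votes > repeats


def noisy_max(items: Sequence[float], oracle: ValueOracle, repeats: int) -> Tuple[float, int]:
    """Linear scan that keeps a champion and challenges it with every other item.

    Returns the final champion and the total number of oracle calls,
    which is the cost under a fixed-cost-per-comparison model.
    """
    champion = items[0]
    comparisons = 0
    for challenger in items[1:]:
        if majority_greater(challenger, champion, oracle, repeats):
            champion = challenger
        comparisons += repeats
    return champion, comparisons


if __name__ == "__main__":
    rng = random.Random(7)
    oracle = make_noisy_value_oracle(error_prob=0.2, rng=rng)
    data = [rng.random() for _ in range(50)]
    best, cost = noisy_max(data, oracle, repeats=31)
    print("reported max:", best, "true max:", max(data), "oracle calls:", cost)
```

Under the independence assumption, a standard Chernoff-style argument shows that a number of repeats logarithmic in n/δ makes all n - 1 majority votes correct with probability at least 1 - δ, so this baseline spends on the order of n log(n/δ) comparisons. The abstract's claim of essentially optimal algorithms concerns more careful strategies than this repeat-everything scan.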



Published In

ACM Transactions on Database Systems  Volume 39, Issue 4
Invited Articles Issue, SIGMOD 2013, PODS 2013 and ICDT 2013
December 2014
341 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2691190

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 December 2014
Accepted: 01 October 2014
Revised: 01 June 2014
Received: 01 October 2013
Published in TODS Volume 39, Issue 4


Author Tags

  1. Top-k
  2. algorithm
  3. approximation
  4. clustering
  5. crowdsourcing

Qualifiers

  • Research-article
  • Research
  • Refereed
