Query Refinement for Diverse Top-k Selection

Authors: Felix S. Campbell, Alon Silberstein, Julia Stoyanovich, Yuval Moskovitch

Proceedings of the ACM on Management of Data, Volume 2, Issue 3

Article No.: 166, Pages 1 - 27

https://doi.org/10.1145/3654969

Published: 30 May 2024

Abstract

Database queries are often used to select and rank items as decision support for many applications. As automated decision-making tools become more prevalent, there is a growing recognition of the need to diversify their outcomes. In this paper, we define and study the problem of modifying the selection conditions of an ORDER BY query so that the result of the modified query closely fits some user-defined notion of diversity while simultaneously maintaining the intent of the original query. We show the hardness of this problem and propose a mixed-integer linear programming (MILP) based solution. We further present optimizations designed to enhance the scalability and applicability of the solution in real-life scenarios. We investigate the performance characteristics of our algorithm and show its efficiency and the usefulness of our optimizations.
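
To make the problem statement concrete, here is a minimal sketch of refining a single selection condition. It is not the MILP-based method the abstract describes: it simply brute-forces candidate thresholds for one numeric predicate and keeps the one closest to the original whose top-k meets a lower-bound diversity constraint. The toy table, the column names, the constraint form (at least `min_count` members of a group in the top-k), and the distance measure (absolute change in the threshold) are all illustrative assumptions.

```python
from typing import Optional

# Hypothetical toy relation: (name, gpa, group, score).
APPLICANTS = [
    ("a", 3.9, "M", 95),
    ("b", 3.8, "M", 93),
    ("c", 3.7, "M", 90),
    ("d", 3.4, "F", 92),
    ("e", 3.3, "F", 88),
    ("f", 3.1, "M", 85),
]


def top_k(rows, min_gpa, k):
    """Emulate: SELECT * FROM rows WHERE gpa >= min_gpa ORDER BY score DESC LIMIT k."""
    selected = [r for r in rows if r[1] >= min_gpa]
    return sorted(selected, key=lambda r: r[3], reverse=True)[:k]


def refine_threshold(rows, original_min_gpa, k, group, min_count) -> Optional[float]:
    """Return the threshold closest to the original whose top-k contains at
    least min_count members of `group`; None if no candidate threshold works."""
    # Only thresholds equal to a value in the data (or the original) can
    # change which rows are selected, so they are the only candidates tried.
    candidates = sorted({r[1] for r in rows} | {original_min_gpa})
    feasible = []
    for t in candidates:
        result = top_k(rows, t, k)
        in_group = sum(1 for r in result if r[2] == group)
        if len(result) == k and in_group >= min_count:
            feasible.append(t)
    if not feasible:
        return None
    # Assumed notion of "maintaining intent": smallest change to the threshold.
    return min(feasible, key=lambda t: (abs(t - original_min_gpa), t))


if __name__ == "__main__":
    new_threshold = refine_threshold(APPLICANTS, 3.7, k=3, group="F", min_count=1)
    print(new_threshold, [r[0] for r in top_k(APPLICANTS, new_threshold, 3)])
```

With the original condition `gpa >= 3.7`, the top-3 in this toy data contains no member of group `F`; relaxing the threshold to 3.4 is the smallest change that admits one. Exhaustive search like this grows quickly with the number of predicates, which is one reason an optimization-based formulation is attractive.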

Index Terms

Query Refinement for Diverse Top-k Selection
1. Information systems
  1. Data management systems
2. Social and professional topics
  1. Professional topics
    1. Computing and business
      1. Socio-technical systems

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 3

SIGMOD

June 2024

1953 pages

EISSN: 2836-6573

DOI: 10.1145/3670010

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Frankel Center for Computer Science, BGU
National Science Foundation
Israel Science Foundation
