Homogeneity in Web Search Results: Diagnosis and Mitigation

Authors: Rakesh Agrawal, Behzad Golshan, Evangelos E. Papalexakis

ACM Transactions on Intelligent Systems and Technology (TIST), Volume 8, Issue 5

Article No.: 66, Pages 1 - 35

https://doi.org/10.1145/3057731

Published: 12 July 2017

Abstract

Access to diverse perspectives nurtures an informed citizenry. Google and Bing have emerged as the duopoly that largely arbitrates which English-language documents are seen by web searchers. We present an empirical study of the search results produced by Google and Bing that shows a large overlap between them. Thus, citizens may not gain different perspectives by probing both engines with the same query. Fortunately, our study also shows that by mining Twitter data, one can obtain search results that are quite distinct from those produced by Google, Bing, and Bing News. Additionally, users found those results to be quite informative.
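To make the notion of "overlap" concrete, the sketch below computes the Jaccard similarity of the top-k URLs returned by two engines for one query. This is only one plausible measure; the abstract does not say which measure the study used, and the function name `jaccard_overlap`, the cutoff `k`, and the placeholder URLs are assumptions made for illustration.

```python
def jaccard_overlap(results_a, results_b, k=10):
    """Jaccard similarity of the top-k URLs from two ranked result lists."""
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    union = top_a | top_b
    if not union:
        # Two empty lists share nothing; report zero overlap.
        return 0.0
    return len(top_a & top_b) / len(union)


# Hypothetical results for a single query (placeholder URLs, not study data).
engine_a = ["u1", "u2", "u3", "u4"]
engine_b = ["u2", "u3", "u5", "u6"]
print(jaccard_overlap(engine_a, engine_b, k=4))
```

A value near 1 would indicate that the two engines return nearly the same documents, while a value near 0 would indicate largely distinct results. Note that this measure ignores rank order within the top k.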

We also present two novel tools we designed for this study. One uses tensor analysis to derive a low-dimensional, compact representation of search results and to study their behavior over time. The other uses machine learning and quantifies the similarity of results between two search engines by framing it as a prediction problem. Although these tools have different underpinnings, the analytical results obtained using them corroborate each other, which reinforces the confidence one can place in them for finding meaningful insights from big data.
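As a rough illustration of how a tensor decomposition yields a compact representation with a time component, the sketch below fits a rank-1 CP (PARAFAC) model to a small 3-way array using alternating updates. It is not the authors' tool: the mode layout (query x feature x day), the function name `rank1_cp`, and the toy values are all assumptions, and a real analysis would use more than one component.

```python
import math


def _unit(v):
    """Return v scaled to unit Euclidean length, plus its original length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v], norm


def rank1_cp(X, n_iter=25):
    """Approximate X[i][j][k] by w * a[i] * b[j] * c[k] via alternating updates.

    Assumes X is a nonnegative, nonzero 3-way array given as nested lists.
    """
    I, J, K = len(X), len(X[0]), len(X[0][0])
    a, b, c = [1.0] * I, [1.0] * J, [1.0] * K
    w = 0.0
    for _step in range(n_iter):
        # Update each mode's factor while holding the other two fixed.
        a, _norm = _unit([
            sum(X[i][j][k] * b[j] * c[k] for j in range(J) for k in range(K))
            for i in range(I)
        ])
        b, _norm = _unit([
            sum(X[i][j][k] * a[i] * c[k] for i in range(I) for k in range(K))
            for j in range(J)
        ])
        # The length of the last update is the weight of the component.
        c, w = _unit([
            sum(X[i][j][k] * a[i] * b[j] for i in range(I) for j in range(J))
            for k in range(K)
        ])
    return w, a, b, c


# Hypothetical tensor: 2 queries x 3 result features x 2 days (toy values).
query_strength = [1.0, 2.0]
feature_strength = [1.0, 0.0, 1.0]
day_strength = [3.0, 1.0]
X = [[[q * f * d for d in day_strength] for f in feature_strength]
     for q in query_strength]

w, a, b, c = rank1_cp(X)
print("weight:", round(w, 6))
print("temporal profile:", [round(x, 3) for x in c])
```

Under these assumptions, the factor `c` summarizes how strongly the component is expressed on each day, which is the kind of low-dimensional temporal view the paragraph above describes.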

Cited By

Yao C, Mo Y, Zhang Z (2022). A study on interplatform competition based on a Lotka–Volterra competition model focusing on network externality. Electronic Commerce Research and Applications 56:C. Online publication date: 1-Nov-2022.
https://dl.acm.org/doi/10.1016/j.elerap.2022.101201
Thompson D (2020). Recommended Reads for Visual Literacy. Art Documentation: Journal of the Art Libraries Society of North America 39:2 (239-246). Online publication date: 1-Sep-2020.
https://doi.org/10.1086/711151
Thompson D, Beene S (2020). Uniting the field: using the ACRL Visual Literacy Competency Standards to move beyond the definition problem of visual literacy. Journal of Visual Literacy (1-17). Online publication date: 4-May-2020.
https://doi.org/10.1080/1051144X.2020.1750809

Index Terms

Homogeneity in Web Search Results: Diagnosis and Mitigation
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Topic modeling
    2. Machine learning approaches
      1. Factorization methods
2. Information systems
  1. Information systems applications
    1. Data mining
  2. World Wide Web
    1. Web searching and information discovery
      1. Web search engines

Published In

ACM Transactions on Intelligent Systems and Technology Volume 8, Issue 5

September 2017

261 pages

ISSN:2157-6904

EISSN:2157-6912

DOI:10.1145/3120923

Editor:
Yu Zheng
Microsoft Research, China

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 July 2017

Accepted: 01 February 2017

Revised: 01 January 2017

Received: 01 August 2016

Published in TIST Volume 8, Issue 5

Qualifiers

Research-article
Research
Refereed

Funding Sources

Microsoft Research in Silicon Valley
