Homogeneity in Web Search Results: Diagnosis and Mitigation

Published: 12 July 2017

Abstract

Access to diverse perspectives nurtures an informed citizenry. Google and Bing have emerged as the duopoly that largely arbitrates which English-language documents web searchers see. We present an empirical study of the search results produced by Google and Bing, which shows a large overlap between them. Citizens therefore may not gain different perspectives by issuing the same query to both engines. Fortunately, our study also shows that by mining Twitter data, one can obtain search results that are quite distinct from those produced by Google, Bing, and Bing News. Additionally, users found those results to be quite informative.
We also present two novel tools we designed for this study. One uses tensor analysis to derive a compact, low-dimensional representation of search results and to study their behavior over time. The other uses machine learning and quantifies the similarity of results between two search engines by framing it as a prediction problem. Although these tools have different underpinnings, the analytical results obtained with them corroborate each other, which reinforces the confidence one can place in them for finding meaningful insights from big data.
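
To make the notion of "overlap" between two result lists concrete, the sketch below computes a Jaccard overlap over normalized URLs. This is only an illustrative baseline, not either of the two tools described in the abstract (tensor analysis and the prediction-based measure). The function names, the crude normalization rule, and the example URLs are all assumptions made for this illustration.

```python
from typing import Iterable


def normalize_url(url: str) -> str:
    """Crude normalization (an assumption of this sketch):
    lowercase, drop the http/https scheme, drop trailing slashes."""
    url = url.strip().lower()
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    return url.rstrip("/")


def jaccard_overlap(results_a: Iterable[str], results_b: Iterable[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two result lists, ignoring rank."""
    set_a = {normalize_url(u) for u in results_a}
    set_b = {normalize_url(u) for u in results_b}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists: define the overlap as zero
    return len(set_a & set_b) / len(union)


# Hypothetical top-3 results from two engines for the same query.
engine_a = ["https://example.com/a", "https://example.com/b", "https://example.org/c"]
engine_b = ["http://example.com/a/", "https://example.com/b", "https://example.net/d"]

overlap = jaccard_overlap(engine_a, engine_b)
print(f"Jaccard overlap: {overlap:.2f}")
```

A set-based measure like this ignores ranking and result content, which is one reason richer representations, such as the tools the abstract describes, may be preferable in practice.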

Cited By

  • (2022) A study on interplatform competition based on a Lotka–Volterra competition model focusing on network externality. Electronic Commerce Research and Applications 56:C. DOI: 10.1016/j.elerap.2022.101201. Online publication date: 1-Nov-2022
  • (2020) Recommended Reads for Visual Literacy. Art Documentation: Journal of the Art Libraries Society of North America 39:2 (239-246). DOI: 10.1086/711151. Online publication date: 1-Sep-2020
  • (2020) Uniting the field: using the ACRL Visual Literacy Competency Standards to move beyond the definition problem of visual literacy. Journal of Visual Literacy (1-17). DOI: 10.1080/1051144X.2020.1750809. Online publication date: 4-May-2020

Published In

ACM Transactions on Intelligent Systems and Technology  Volume 8, Issue 5
September 2017
261 pages
ISSN:2157-6904
EISSN:2157-6912
DOI:10.1145/3120923
Editor: Yu Zheng
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 July 2017
Accepted: 01 February 2017
Revised: 01 January 2017
Received: 01 August 2016
Published in TIST Volume 8, Issue 5

Author Tags

  1. Bing
  2. Google
  3. Web search
  4. big data
  5. prediction
  6. search engine
  7. search result comparison
  8. social media search
  9. tensor

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Microsoft Research in Silicon Valley
