Studying the clustering paradox and scalability of search in highly distributed environments

Published: 17 May 2013

Abstract

With the ubiquitous production, distribution, and consumption of information, today's digital environments such as the Web are increasingly large and decentralized. It is hardly possible to obtain central control over information collections and systems in these environments. Searching these information spaces has raised problems beyond the traditional boundaries of information retrieval (IR) research. This article addresses one important aspect of the scalability challenges facing information retrieval models and investigates a decentralized, organic view of information systems pertaining to search in large-scale networks. Drawing on observations from earlier studies, we conduct a series of experiments on decentralized searches in large-scale networked information spaces. Results show that the way distributed systems interconnect is crucial to retrieval performance and search scalability. In particular, across various experimental settings and retrieval tasks, we find a consistent phenomenon, the Clustering Paradox, in which the level of network clustering (semantic overlay) imposes a scalability limit. Scalable searches are well supported by a specific, balanced level of network clustering emerging from local system interconnectivity. Departure from that level, toward either stronger or weaker clustering, degrades search performance, and the degradation is dramatic in large-scale networks.
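The effect described in the abstract can be explored with a toy model. The sketch below is not the article's experimental setup: it assumes a Kleinberg-style ring network in which every node knows its two ring neighbours plus one long-range contact, drawn with probability proportional to distance^(-alpha). The exponent `alpha` stands in for the clustering level (larger values keep contacts local), and `build_network`, `greedy_search`, and `mean_hops` are hypothetical names introduced only for illustration. Each node forwards a query using local knowledge alone, handing it to whichever contact is closest to the target.

```python
import random


def ring_distance(a, b, n):
    """Shortest distance between positions a and b on a ring of n nodes."""
    d = abs(a - b)
    return min(d, n - d)


def build_network(n, alpha, rng):
    """Give each node its two ring neighbours plus one long-range contact.

    The contact is drawn with probability proportional to
    distance ** (-alpha): alpha = 0 yields uniformly random contacts
    (weak clustering), large alpha keeps contacts nearby (strong clustering).
    This construction is an assumption made for illustration.
    """
    links = {}
    for u in range(n):
        others = [v for v in range(n) if v != u]
        weights = [ring_distance(u, v, n) ** (-alpha) for v in others]
        shortcut = rng.choices(others, weights=weights, k=1)[0]
        links[u] = {(u - 1) % n, (u + 1) % n, shortcut}
    return links


def greedy_search(links, source, target, n):
    """Forward the query to the contact closest to the target; return hops.

    A ring neighbour always reduces the distance by one, so the walk
    terminates after at most ring_distance(source, target, n) hops.
    """
    hops = 0
    current = source
    while current != target:
        current = min(links[current], key=lambda v: ring_distance(v, target, n))
        hops += 1
    return hops


def mean_hops(n, alpha, trials, seed):
    """Average greedy path length over randomly chosen source/target pairs."""
    rng = random.Random(seed)
    links = build_network(n, alpha, rng)
    total = 0
    for _ in range(trials):
        source, target = rng.sample(range(n), 2)
        total += greedy_search(links, source, target, n)
    return total / trials


if __name__ == "__main__":
    # Sweep the clustering exponent and compare average search length.
    for alpha in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(f"alpha={alpha}: {mean_hops(400, alpha, 200, seed=1):.1f} hops")
```

Sweeping `alpha` on networks of different sizes gives a rough feel for how search length responds to weaker or stronger clustering. Single runs are noisy, so averaging over several seeds is advisable before reading anything into the numbers.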

Published In

ACM Transactions on Information Systems  Volume 31, Issue 2
May 2013
180 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/2457465
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 May 2013
Accepted: 01 January 2013
Revised: 01 August 2012
Received: 01 November 2011
Published in TOIS Volume 31, Issue 2

Author Tags

  1. Information network
  2. decentralized search
  3. efficiency
  4. information retrieval
  5. large-scale distributed system
  6. loose coupling
  7. network clustering
  8. scalability
  9. self-organization

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2022) Collaboration, Self-Reflection, and Adaptation in Robot Communities: Using Multi-Agent Distributed Learning for Coordination Planning. 2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI), 69-73. DOI: 10.1109/CogMI56440.2022.00020. Online publication date: Dec-2022
  • (2018) Visual Analysis of Distributed Search Traffic in a Peer-to-peer Network. 2018 10th International Conference on Communication Software and Networks (ICCSN), 189-194. DOI: 10.1109/ICCSN.2018.8488222. Online publication date: Jul-2018
  • (2018) Visual Analysis of Distributed Search Traffic in a Peer-to-peer Network. 2018 13th APCA International Conference on Control and Soft Computing (CONTROLO), 189-194. DOI: 10.1109/CONTROLO.2018.8439791. Online publication date: Jun-2018
  • (2017) Distributed Search Efficiency and Robustness in Service oriented Multi-agent Networks. Proceedings of the 2017 International Conference on Management Engineering, Software Engineering and Service Sciences, 9-18. DOI: 10.1145/3034950.3034975. Online publication date: 14-Jan-2017
  • (2016) Scalability analysis of distributed search in large peer-to-peer networks. 2016 IEEE International Conference on Big Data (Big Data), 909-914. DOI: 10.1109/BigData.2016.7840686. Online publication date: Dec-2016
