Abstract
When I was younger and spent most of my time playing in the field of (more) theoretical computer science, I used to think of data mining as an uninteresting kind of game: I thought that area was a wild jungle of ad hoc techniques with no flesh to sink my teeth into. The truth is, I immediately become kind of skeptical when I see a lot of money flying around: my communist nature pops out and I start seeing flaws everywhere.
I was an idealist back then, which is good. But in that specific case, I was simply wrong. You may say that I am trying to convince myself just because my soul has been sold already (and they didn’t even give me the thirty pieces of silver they promised, btw). Nonetheless, I will try to offer you evidence that there are some gems, out there in the data miner’s cave, that you yourself may appreciate.
Who knows? Maybe you will decide to sell your soul to the devil too, after all.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Boldi, P. (2014). Algorithmic Gems in the Data Miner’s Cave. In: Ferro, A., Luccio, F., Widmayer, P. (eds) Fun with Algorithms. FUN 2014. Lecture Notes in Computer Science, vol 8496. Springer, Cham. https://doi.org/10.1007/978-3-319-07890-8_1
DOI: https://doi.org/10.1007/978-3-319-07890-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07889-2
Online ISBN: 978-3-319-07890-8
eBook Packages: Computer Science, Computer Science (R0)