Better size estimation for sparse matrix products

Amossen, Rasmus Resen; Campagna, Andrea; Pagh, Rasmus


arXiv:1006.4173 (cs)

[Submitted on 21 Jun 2010 (v1), last revised 22 Feb 2011 (this version, v2)]



Abstract: We consider the problem of fast and reliable estimation of the number of non-zero entries in a sparse boolean matrix product. This problem has applications in databases and computer algebra. Let n denote the total number of non-zero entries in the input matrices, and let z denote the number of non-zero entries in the product. We show how to compute a 1 ± epsilon approximation of z (with small probability of error) in expected time O(n) for any epsilon > 4/z^{1/4}. The previously best estimation algorithm, due to Cohen (JCSS 1997), uses time O(n/epsilon^2). We also present a variant using O(sort(n)) I/Os in expectation in the cache-oblivious model. In contrast to these results, the currently best algorithms for computing a sparse boolean matrix product use time omega(n^{4/3}) (resp. omega(n^{4/3}/B) I/Os), even if the result matrix has only z = O(n) non-zero entries. Our algorithm combines the size estimation technique of Bar-Yossef et al. (RANDOM 2002) with a particular class of pairwise independent hash functions that allows the sketch of a set of the form A × C to be computed in expected time O(|A|+|C|) and O(sort(|A|+|C|)) I/Os. We then describe how sampling can be used to maintain (independent) sketches of matrices that allow estimation to be performed in time o(n) if z is sufficiently large. This gives a simpler alternative to the sketching technique of Ganguly et al. (PODS 2005), and matches a space lower bound shown in that paper. Finally, we present experiments on real-world data sets that show the accuracy of both our methods to be significantly better than the worst-case analysis predicts.
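To make the estimation idea concrete, here is a minimal Python sketch of a k-minimum-values distinct-count estimator applied to the output pairs of a boolean product. It is an illustration under stated assumptions, not the algorithm of the paper: it enumerates every candidate pair, so it runs in time proportional to the number of elementary products rather than O(n). The function name `estimate_product_nnz`, the parameters `k` and `seed`, the combined hash form (h1(i) - h2(l)) mod 1, and the estimator k/v are all choices made for this example, and Python's `random` module stands in for a pairwise independent hash family.

```python
import heapq
import random
from collections import defaultdict


def estimate_product_nnz(a_entries, c_entries, k=256, seed=0):
    """Estimate z, the number of non-zero entries of the boolean product A*C.

    a_entries: iterable of (i, j) pairs with A[i][j] = 1
    c_entries: iterable of (j, l) pairs with C[j][l] = 1
    Hypothetical helper for illustration; naive pair enumeration, not O(n).
    """
    rng = random.Random(seed)
    h1, h2 = {}, {}  # lazily drawn values in [0, 1), standing in for hash functions

    def pair_hash(i, l):
        # Assumed combined hash of an output position (i, l).
        if i not in h1:
            h1[i] = rng.random()
        if l not in h2:
            h2[l] = rng.random()
        return (h1[i] - h2[l]) % 1.0

    # Group A by column index j and C by row index j; the non-zero positions
    # of the product are the union over j of (rows of A) x (columns of C).
    a_cols = defaultdict(list)
    for i, j in a_entries:
        a_cols[j].append(i)
    c_rows = defaultdict(list)
    for j, l in c_entries:
        c_rows[j].append(l)

    heap = []        # negated values: a max-heap holding the k smallest hashes
    members = set()  # the same values, used to skip repeated output pairs
    for j, rows in a_cols.items():
        for i in rows:
            for l in c_rows.get(j, ()):
                v = pair_hash(i, l)
                if v in members:
                    continue
                if len(heap) < k:
                    heapq.heappush(heap, -v)
                    members.add(v)
                elif v < -heap[0]:
                    evicted = -heapq.heappushpop(heap, -v)
                    members.discard(evicted)
                    members.add(v)

    if len(heap) < k:
        return len(heap)        # fewer than k distinct pairs: count is exact
    return k / (-heap[0])       # k divided by the k-th smallest hash value


a = [(0, 0), (1, 0), (1, 1)]
c = [(0, 0), (0, 1), (1, 1)]
print(estimate_product_nnz(a, c))  # fewer than k distinct pairs, so exact: 4
```

A larger `k` lowers the variance of the estimate at the cost of a larger sketch. The speedup claimed in the abstract comes from finding the smallest hash values of each set A × C without enumerating its pairs, which this sketch does not attempt.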

Comments:	Corrected a number of mistakes and typos in the first version (also present in the version published at RANDOM 2010). Most importantly, the lower bound on the error epsilon is now a function of z rather than n
Subjects:	Data Structures and Algorithms (cs.DS); Databases (cs.DB)
Cite as:	arXiv:1006.4173 [cs.DS]
	(or arXiv:1006.4173v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1006.4173

Submission history

From: Rasmus Pagh
[v1] Mon, 21 Jun 2010 20:47:46 UTC (350 KB)
[v2] Tue, 22 Feb 2011 19:26:15 UTC (350 KB)
