Improved ESP-index: a practical self-index for highly repetitive texts

Takabatake, Yoshimasa; Tabei, Yasuo; Sakamoto, Hiroshi


arXiv:1404.4972 (cs)

[Submitted on 19 Apr 2014 (v1), last revised 28 Apr 2014 (this version, v2)]


Abstract: While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real-world repetitive texts remains a challenge. ESP-index is a grammar-based self-index built on the notion of edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees upper bounds on parsing discrepancies between different appearances of the same subtext in a text. Although ESP-index performs efficient top-down searches of query texts, it has a serious drawback: it relies on binary searches to find appearances of variables for a query text, which slows down query searches. We present an improved ESP-index (ESP-index-I) that leverages the idea behind succinct data structures for large alphabets. While ESP-index-I keeps the same types of efficiencies as ESP-index for top-down searches, it avoids the binary searches by using fast rank/select operations. We experimentally test ESP-index-I on its ability to search query texts and extract subtexts from large-scale real-world repetitive texts, and we show that ESP-index-I performs better than other possible approaches.
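
The abstract does not spell out how rank/select operations replace a binary search, so the following is only a minimal sketch of the general idea and is not claimed to be the data structure used in ESP-index-I. It assumes a sorted array of small integer symbols and encodes the run lengths in unary in a bit vector; the range of entries equal to a symbol is then obtained with two select and two rank calls. The names `BitVector`, `encode_sorted`, and `equal_range` are hypothetical, and the bit vector here uses plain precomputed tables rather than a succinct representation.

```python
from itertools import accumulate


class BitVector:
    """Plain (non-succinct) bit vector answering rank/select from precomputed tables."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of 1 bits in bits[0:i]
        self.prefix = [0] + list(accumulate(self.bits))
        # ones[k - 1] = position of the k-th 1 bit (k is 1-based)
        self.ones = [i for i, b in enumerate(self.bits) if b]

    def rank1(self, i):
        """Number of 1 bits in bits[0:i]."""
        return self.prefix[i]

    def rank0(self, i):
        """Number of 0 bits in bits[0:i]."""
        return i - self.prefix[i]

    def select1(self, k):
        """Position of the k-th 1 bit, k >= 1."""
        return self.ones[k - 1]


def encode_sorted(values, universe):
    """Encode a sorted list over {0, ..., universe - 1}.

    For each symbol v in increasing order, emit one 0 per occurrence of v,
    followed by a single 1 that terminates the run for v.
    """
    counts = [0] * universe
    for v in values:
        counts[v] += 1
    bits = []
    for c in counts:
        bits.extend([0] * c)
        bits.append(1)
    return BitVector(bits)


def equal_range(bv, v):
    """Half-open index range [start, end) of entries equal to v, without binary search."""
    # Zeros before the terminator of v - 1 count the entries smaller than v.
    start = 0 if v == 0 else bv.rank0(bv.select1(v))
    # Zeros before the terminator of v count the entries smaller than or equal to v.
    end = bv.rank0(bv.select1(v + 1))
    return start, end


values = [2, 2, 5, 5, 5, 9]
bv = encode_sorted(values, universe=10)
print(equal_range(bv, 5))  # (2, 5)
```

With a succinct bit vector that supports constant-time rank and select, the same lookup would avoid the logarithmic cost of a binary search over the sorted array, which is the kind of saving the abstract refers to.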

Comments:	This is the full version of a proceeding accepted to the 11th International Symposium on Experimental Algorithms (SEA2014)
Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:1404.4972 [cs.DS]
	(or arXiv:1404.4972v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1404.4972

Submission history

From: Yasuo Tabei
[v1] Sat, 19 Apr 2014 17:08:29 UTC (1,170 KB)
[v2] Mon, 28 Apr 2014 03:08:52 UTC (1,398 KB)
