Abstract
Trimmomatic is a de facto standard trimmer for Illumina sequencing data. However, its sub-optimal implementation prevents it from fully exploiting the computational power of common multi-core platforms. We therefore propose RabbitTrim, a highly optimized implementation of Trimmomatic based on efficient I/O strategies, parallel (de)compression engines, block-based memory pools, bitwise operations, and vectorization techniques. On a 48-core Intel server, RabbitTrim achieves speedups of 1.5x to 3.3x when processing plain FASTQ files and 3.7x to 8.0x when processing gzip-compressed FASTQ files. Overall, RabbitTrim processes 101 GB of gzip-compressed sequencing data in only 5 min, whereas Trimmomatic requires at least 21 min. The source code is available at https://github.com/RabbitBio/RabbitTrim.
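To make the "bitwise operations" item above more concrete, the following is a minimal sketch of one way bitwise operations can speed up sequence comparison: bases are packed two bits each into an integer, and mismatches between two equal-length sequences are counted with an XOR followed by a population count. The 2-bit encoding and the names `CODE`, `pack`, and `mismatches` are assumptions made for illustration; they are not taken from the RabbitTrim or Trimmomatic source code, and the sketch ignores ambiguous bases such as N.

```python
# Hypothetical 2-bit base encoding, chosen for illustration only.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> int:
    """Pack a DNA string (A/C/G/T only) into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    # XOR leaves a non-zero 2-bit slot wherever the bases differ.
    diff = pack(a) ^ pack(b)
    # Mask that selects the low bit of every 2-bit slot: 0b0101...01.
    low_mask = int("01" * len(a), 2) if a else 0
    # Fold each slot's high bit onto its low bit, then keep one bit per slot.
    folded = (diff | (diff >> 1)) & low_mask
    # Population count gives the number of mismatching positions.
    return bin(folded).count("1")


if __name__ == "__main__":
    print(mismatches("ACGT", "ACGA"))
```

The point of the sketch is that the comparison touches all positions of the packed sequences with a handful of integer operations instead of a per-character loop; a native implementation would typically use fixed-width machine words and a hardware population-count instruction.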
Acknowledgement
This work is partially supported by NSFC Grant 62102231; the Shandong Provincial Natural Science Foundation (ZR2021QF089); and the Engineering Research Center of Digital Media Technology, Ministry of Education, China.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, M. et al. (2024). RabbitTrim: Highly Optimized Trimming of Illumina Sequencing Data on Multi-core Platforms. In: Peng, W., Cai, Z., Skums, P. (eds) Bioinformatics Research and Applications. ISBRA 2024. Lecture Notes in Computer Science, vol 14955. Springer, Singapore. https://doi.org/10.1007/978-981-97-5131-0_3
DOI: https://doi.org/10.1007/978-981-97-5131-0_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5130-3
Online ISBN: 978-981-97-5131-0
eBook Packages: Computer Science, Computer Science (R0)