XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples
Authors:
Shorya Consul,
John Robertson,
Haris Vikalo
Abstract:
It is estimated that approximately 15% of cancers worldwide can be linked to viral infections. The viruses that can cause or increase the risk of cancer include human papillomavirus, hepatitis B and C viruses, Epstein-Barr virus, and human immunodeficiency virus, to name a few. The computational analysis of the massive amounts of tumor DNA data, whose collection is enabled by recent advancements in sequencing technologies, has allowed studies of the potential association between cancers and viral pathogens. However, the high diversity of oncoviral families makes reliable detection of viral DNA difficult and thus renders such analysis challenging. In this paper, we introduce XVir, a data pipeline that relies on a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. In particular, XVir is trained on genomic sequencing reads from viral and human genomes and may be used with tumor sequence information to find evidence of viral DNA in human cancers. Results on semi-experimental data demonstrate that XVir is capable of achieving high detection accuracy, generally outperforming state-of-the-art competing methods while being more compact and less computationally demanding.
Submitted 28 August, 2023;
originally announced August 2023.
A Compressed Sensing Approach to Pooled RT-PCR Testing for COVID-19 Detection
Authors:
Sabyasachi Ghosh,
Rishi Agarwal,
Mohammad Ali Rehan,
Shreya Pathak,
Pratyush Agrawal,
Yash Gupta,
Sarthak Consul,
Nimay Gupta,
Ritika,
Ritesh Goenka,
Ajit Rajwade,
Manoj Gopalkrishnan
Abstract:
We propose `Tapestry', a novel approach to pooled testing, applied here to COVID-19 testing with quantitative Reverse Transcription Polymerase Chain Reaction (RT-PCR), that can shorten testing time and conserve reagents and testing kits. Tapestry combines ideas from compressed sensing and combinatorial group testing with a novel noise model for RT-PCR, which is used to generate synthetic data. Unlike Boolean group testing algorithms, the input is a quantitative readout from each test and the output is a list of viral loads for each sample relative to the pool with the highest viral load. While other pooling techniques require a second confirmatory assay, Tapestry obtains individual sample-level results in a single round of testing, at clinically acceptable false positive or false negative rates. We also propose designs for pooling matrices that facilitate good prediction of the infected samples while remaining practically viable. When testing $n$ samples out of which $k \ll n$ are infected, our method needs only $O(k \log n)$ tests, with high probability, when using random binary pooling matrices. However, we also use deterministic binary pooling matrices based on combinatorial design ideas of Kirkman Triple Systems to balance good reconstruction properties against matrix sparsity for ease of pooling. In practice, we have observed that such matrices need fewer tests than random pooling matrices. This makes Tapestry capable of very large savings at low prevalence rates, while simultaneously remaining viable even at prevalence rates as high as 9.5\%. Empirically, we find that single-round Tapestry pooling improves over two-round Dorfman pooling by almost a factor of 2 in the number of tests required. We validate Tapestry in simulations and wet lab experiments with oligomers in quantitative RT-PCR assays. Lastly, we describe use-case scenarios for deployment.
Submitted 29 April, 2021; v1 submitted 16 May, 2020;
originally announced May 2020.
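The pooling idea in the abstract above can be illustrated with a small sketch. This is not the Tapestry algorithm: it assumes a noiseless readout equal to the sum of the viral loads in each pool (the paper uses its own RT-PCR noise model), and it shows only the simple group-testing elimination step, in which any sample that appears in a pool with zero readout is declared negative. All function names and parameters here are hypothetical.

```python
import random


def random_pooling_matrix(num_pools, num_samples, inclusion_prob, rng):
    """Random binary pooling matrix: entry [i][j] is 1 if sample j goes into pool i."""
    return [
        [1 if rng.random() < inclusion_prob else 0 for _ in range(num_samples)]
        for _ in range(num_pools)
    ]


def pool_readouts(matrix, viral_loads):
    """Assumed noiseless quantitative readout: total viral load in each pool."""
    return [sum(a * x for a, x in zip(row, viral_loads)) for row in matrix]


def rule_out_negatives(matrix, readouts):
    """Return indices of samples that cannot be cleared by a zero-readout pool.

    A sample that takes part in any pool with zero readout must be negative,
    so only the remaining samples are candidate positives.
    """
    cleared = set()
    for row, readout in zip(matrix, readouts):
        if readout == 0:
            cleared.update(j for j, a in enumerate(row) if a)
    return [j for j in range(len(matrix[0])) if j not in cleared]


if __name__ == "__main__":
    rng = random.Random(0)
    n, pools = 40, 16            # 40 samples tested with 16 pooled assays
    loads = [0.0] * n
    loads[3], loads[17] = 2.5, 0.8   # two infected samples (illustrative values)
    A = random_pooling_matrix(pools, n, 0.2, rng)
    y = pool_readouts(A, loads)
    print("candidate positives:", rule_out_negatives(A, y))
```

In this noiseless setting the true positives always survive the elimination step, because every pool containing an infected sample has a nonzero readout. Deciding which of the surviving candidates are actually infected, and estimating their relative viral loads, is where a compressed sensing recovery step would come in.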