DOI: 10.1145/3535508.3545556
research-article
Open access

Haplotype-aware variant selection for genome graphs

Published: 07 August 2022

    Abstract

    Graph-based genome representations have proven to be a powerful tool in genomic analysis because they encode variations found in multiple haplotypes and capture population genetic diversity. Such graphs also unavoidably contain paths that switch between haplotypes (i.e., recombinant paths) and thus do not fully match any of the constituent haplotypes. The number of such recombinant paths increases combinatorially with path length and causes inefficiencies and false positives when mapping reads. In this paper, we study the problem of finding reduced haplotype-aware genome graphs that incorporate only a selected subset of variants, yet contain paths corresponding to all α-long substrings of the input haplotypes (i.e., non-recombinant paths) with at most δ mismatches. Solving this problem optimally, i.e., minimizing the number of variants selected, is known to be NP-hard [14]. Here, we first establish several inapproximability results for finding haplotype-aware reduced variation graphs of optimal size. We then present an integer linear programming (ILP) formulation for solving the problem, and experimentally demonstrate that this is a computationally feasible approach for real-world problems and provides far superior reduction compared to prior approaches.
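
The selection problem stated in the abstract can be illustrated on a toy instance. The sketch below is not the paper's ILP formulation; it is a brute-force search under simplifying assumptions made here for illustration: all variants are SNPs, each haplotype is given as the set of positions where it carries the alternate allele, and dropping a variant costs one mismatch for every haplotype that carries it. The function names `is_feasible` and `min_variant_selection` are hypothetical.

```python
from itertools import combinations


def is_feasible(haplotypes, selected, genome_len, alpha, delta):
    """Return True if every alpha-long window of every haplotype contains
    at most delta alternate alleles whose variant was not selected."""
    for alt_positions in haplotypes:
        # Alternate alleles of this haplotype that the reduced graph lacks.
        dropped = [p for p in alt_positions if p not in selected]
        for start in range(genome_len - alpha + 1):
            mismatches = sum(1 for p in dropped if start <= p < start + alpha)
            if mismatches > delta:
                return False
    return True


def min_variant_selection(haplotypes, genome_len, alpha, delta):
    """Exhaustively find a smallest set of variant positions to keep.

    Exponential in the number of variants, so only usable on toy inputs;
    the exact problem is NP-hard, which is why an ILP solver is used in
    practice.
    """
    variants = sorted(set().union(*haplotypes))
    for k in range(len(variants) + 1):
        for subset in combinations(variants, k):
            if is_feasible(haplotypes, set(subset), genome_len, alpha, delta):
                return set(subset)
    # Keeping every variant is always feasible, so this is never reached.
    raise AssertionError("unreachable")


# Two toy haplotypes over a genome of length 10 (assumed data).
haps = [{1, 3, 8}, {3, 4}]
print(min_variant_selection(haps, genome_len=10, alpha=4, delta=1))  # {3}
```

With δ = 1, keeping only the variant at position 3 suffices, because no remaining window of length 4 contains two dropped alternate alleles of the same haplotype. With δ = 0 every variant must be kept, and with δ = 2 none is needed.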

    References

    [1]
    Xian Chang, Jordan Eizenga, Adam M Novak, Jouni Sirén, and Benedict Paten. 2020. Distance indexing and seed clustering in sequence graphs. Bioinformatics 36, Supplement_1 (2020), i146--i153.
    [2]
    1000 Genomes Project Consortium et al. 2015. A global reference for human genetic variation. Nature 526, 7571 (2015), 68--74.
    [3]
    Computational Pan-Genomics Consortium. 2018. Computational pan-genomics: status, promises and challenges. Briefings in bioinformatics 19, 1 (2018), 118--135.
    [4]
    Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A Albers, Eric Banks, Mark A DePristo, Robert E Handsaker, Gerton Lunter, Gabor T Marth, Stephen T Sherry, et al. 2011. The variant call format and VCFtools. Bioinformatics 27, 15 (2011), 2156--2158.
    [5]
    Charlotte A Darby, Ravi Gaddipati, Michael C Schatz, and Ben Langmead. 2020. Vargas: heuristic-free alignment for assessing linear and graph read aligners. Bioinformatics 36, 12 (2020), 3712--3718.
    [6]
    Irit Dinur, Venkatesan Guruswami, Subhash Khot, and Oded Regev. 2005. A New Multilayered PCP and the Hardness of Hypergraph Vertex Cover. SIAM J. Comput. 34, 5 (2005), 1129--1146.
    [7]
    Hannes P Eggertsson, Hakon Jonsson, Snaedis Kristmundsdottir, Eirikur Hjartarson, Birte Kehr, Gisli Masson, Florian Zink, Kristjan E Hjorleifsson, Aslaug Jonasdottir, Adalbjorg Jonasdottir, et al. 2017. Graphtyper enables population-scale genotyping using pangenome graphs. Nature genetics 49, 11 (2017), 1654.
    [8]
    Jordan M Eizenga, Adam M Novak, Jonas A Sibbesen, Simon Heumos, Ali Ghaffaari, Glenn Hickey, Xian Chang, Josiah D Seaman, Robin Rounthwaite, Jana Ebler, et al. 2020. Pangenome Graphs. Annual Review of Genomics and Human Genetics 21 (2020).
    [9]
    Uriel Feige. 1998. A Threshold of ln n for Approximating Set Cover. J. ACM 45, 4 (1998), 634--652.
    [10]
    Erik Garrison, Jouni Sirén, Adam M Novak, Glenn Hickey, Jordan M Eizenga, Eric T Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F Lin, et al. 2018. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature biotechnology 36, 9 (2018), 875--879.
    [11]
    Ali Ghaffaari and Tobias Marschall. 2019. Fully-sensitive seed finding in sequence graphs using a hybrid index. Bioinformatics 35, 14 (2019), i81--i89.
    [12]
    Gurobi Optimization, LLC. 2022. Gurobi Optimizer Reference Manual. https://www.gurobi.com
    [13]
    Guillaume Holley, Roland Wittler, and Jens Stoye. 2016. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms for Molecular Biology 11, 1 (2016), 1--9.
    [14]
    Chirag Jain, Neda Tavakoli, and Srinivas Aluru. 2021. A variant selection framework for genome graphs. Bioinformatics 37, Supplement_1 (2021), i460--i467.
    [15]
    Chirag Jain, Haowen Zhang, Yu Gao, and Srinivas Aluru. 2020. On the Complexity of Sequence-to-Graph Alignment. Journal of Computational Biology 27, 4 (2020), 640--654.
    [16]
    Daehwan Kim, Joseph Paggi, and Steven L Salzberg. 2018. Hisat-genotype: Next generation genomic analysis platform on a personal computer. BioRxiv (2018), 266197.
    [17]
    Daehwan Kim, Joseph M Paggi, Chanhee Park, Christopher Bennett, and Steven L Salzberg. 2019. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology 37, 8 (2019), 907--915.
    [18]
    Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. 2020. Efficient construction of a complete index for pangenomics read alignment. Journal of Computational Biology 27, 4 (2020), 500--513.
    [19]
    Anna Kuosmanen, Topi Paavilainen, Travis Gagie, Rayan Chikhi, Alexandru Tomescu, and Veli Mäkinen. 2018. Using minimum path cover to boost dynamic programming on DAGs: co-linear chaining extended. In International Conference on Research in Computational Molecular Biology. Springer, 105--121.
    [20]
    Christopher Lee, Catherine Grasso, and Mark F Sharlow. 2002. Multiple sequence alignment using partial order graphs. Bioinformatics 18, 3 (2002), 452--464.
    [21]
    Heng Li. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 21 (2011), 2987--2993.
    [22]
    Heng Li, Xiaowen Feng, and Chong Chu. 2020. The design and construction of reference pangenome graphs with minigraph. Genome Biology 21, 1 (2020), 1--19.
    [23]
    Yucheng Liu, Huilong Du, Pengcheng Li, Yanting Shen, Hua Peng, Shulin Liu, Guo-An Zhou, Haikuan Zhang, Zhi Liu, Miao Shi, et al. 2020. Pan-genome of wild and cultivated soybeans. Cell 182, 1 (2020), 162--176.
    [24]
    Bastien Llamas, Giuseppe Narzisi, Valerie Schneider, Peter A Audano, Evan Biederstedt, Lon Blauvelt, Peter Bradbury, Xian Chang, Chen-Shan Chin, Arkarachai Fungtammasan, et al. 2019. A strategy for building and using a human reference pangenome. F1000Research 8, 1751 (2019), 1751.
    [25]
    Shoshana Marcus, Hayan Lee, and Michael C Schatz. 2014. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics 30, 24 (2014), 3476--3483.
    [26]
    Tom Mokveld, Jasper Linthorst, Zaid Al-Ars, Henne Holstege, and Marcel Reinders. 2020. CHOP: haplotype-aware path indexing in population graphs. Genome biology 21, 1 (2020), 1--16.
    [27]
    Benedict Paten, Adam M Novak, Jordan M Eizenga, and Erik Garrison. 2017. Genome graphs and the evolution of genome inference. Genome research 27, 5 (2017), 665--676.
    [28]
    David Peleg, Gideon Schechtman, and Avishai Wool. 1993. Approximating Bounded 0-1 Integer Linear Programs. In Second Israel Symposium on Theory of Computing Systems, ISTCS 1993, Natanya, Israel, June 7--9, 1993, Proceedings. IEEE Computer Society, 69--77.
    [29]
    Jacob Pritt, Nae-Chyun Chen, and Ben Langmead. 2018. FORGe: prioritizing variants for graph genomes. Genome biology 19, 1 (2018), 1--16.
    [30]
    Gunnar Rätsch and Martin Vechev. 2020. AStarix: Fast and Optimal Sequence-to-Graph Alignment. In Research in Computational Molecular Biology: 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10--13, 2020, Proceedings, Vol. 12074. Springer, 104.
    [31]
    Mikko Rautiainen and Tobias Marschall. 2020. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biology 21, 1 (2020), 1--28.
    [32]
    Rachel M Sherman and Steven L Salzberg. 2020. Pan-genomics in the human genome era. Nature Reviews Genetics 21, 4 (2020), 243--254.
    [33]
    Jouni Sirén. 2017. Indexing variation graphs. In 2017 Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, 13--27.
    [34]
    Jouni Sirén, Erik Garrison, Adam M Novak, Benedict Paten, and Richard Durbin. 2020. Haplotype-aware graph indexes. Bioinformatics 36, 2 (2020), 400--407.
    [35]
    Jouni Sirén, Niko Välimäki, and Veli Mäkinen. 2014. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11, 2 (2014), 375--388.
    [36]
    Ravi Vijaya Satya, Nela Zavaljevski, and Jaques Reifman. 2012. A new strategy to reduce allelic bias in RNA-Seq readmapping. Nucleic acids research 40, 16 (2012), e127--e127.
    [37]
    Laurence A. Wolsey. 1982. An analysis of the greedy algorithm for the submodular set covering problem. Comb. 2, 4 (1982), 385--393.
    [38]
    David Zuckerman. 2007. Linear Degree Extractors and the Inapproximability of Max Clique and Chromatic Number. Theory Comput. 3, 1 (2007), 103--128.

    Cited By

    • (2024) GraphSlimmer: Preserving Read Mappability with the Minimum Number of Variants. Journal of Computational Biology. DOI: 10.1089/cmb.2024.0601. Online publication date: 11-Jul-2024

    Published In

    BCB '22: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
    August 2022
    549 pages
    ISBN: 9781450393867
    DOI: 10.1145/3535508
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. ILP-based optimization
    2. SNPs
    3. haplotype-aware
    4. variant selection
    5. variation graphs

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation

    Conference

    BCB '22

    Acceptance Rates

    Overall Acceptance Rate 254 of 885 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 125
    • Downloads (last 6 weeks): 15
    Reflects downloads up to 27 Jul 2024
