Next-generation data filtering in the genomics era

Hemstrom, William; Grummer, Jared A.; Luikart, Gordon; Christie, Mark R.

doi:10.1038/s41576-024-00738-6

Review Article
Published: 14 June 2024

William Hemstrom (ORCID: orcid.org/0000-0002-2408-9535)¹,
Jared A. Grummer²,
Gordon Luikart² &
Mark R. Christie (ORCID: orcid.org/0000-0001-7285-5364)¹,³

Nature Reviews Genetics (2024)

Abstract

Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering (removing sequencing bases, reads, genetic variants and/or individuals from a dataset) to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy–Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima's D value, population differentiation (F_ST), nucleotide diversity (π) and effective population size (N_e).
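
As a purely illustrative sketch (not the authors' simulation code), the following Python shows how two of the filters named above, minor allele frequency and per-locus missing data, might be applied to a small genotype matrix. The function names, the 0/1/2 genotype coding with `None` for missing calls, the toy data and the thresholds are all assumptions made for this example.

```python
from typing import List, Optional

# Hypothetical coding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of missing genotype calls at one locus."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency at one biallelic locus, ignoring missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))  # diploid: two alleles per individual
    return min(alt_freq, 1.0 - alt_freq)


def filter_loci(
    loci: List[List[Genotype]], min_maf: float, max_missing: float
) -> List[int]:
    """Return the indices of loci that pass both the missingness and MAF filters."""
    return [
        i
        for i, calls in enumerate(loci)
        if missing_rate(calls) <= max_missing
        and minor_allele_frequency(calls) >= min_maf
    ]


# Toy data (assumed): 4 loci (rows) x 5 diploid individuals (columns).
toy_loci = [
    [0, 1, 2, 1, 0],        # common variant, no missing data
    [0, 0, 0, 0, 1],        # singleton: one alternate allele among ten
    [0, None, None, 1, 2],  # two of five calls missing
    [2, 2, 2, 2, 2],        # monomorphic: no minor allele
]

# Two arbitrary threshold choices, to show that the retained set depends on them.
lenient = filter_loci(toy_loci, min_maf=0.0, max_missing=0.5)
strict = filter_loci(toy_loci, min_maf=0.15, max_missing=0.2)

print("lenient thresholds keep loci:", lenient)
print("strict thresholds keep loci:", strict)
```

With these toy values, the lenient settings retain all four loci and the stricter settings retain only the first, which is the kind of threshold sensitivity the review examines at much larger scale.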

**Fig. 1: Pre-variant filtering: challenges and potential solutions related to filtering before variant calling.**

**Fig. 2: Post-variant filtering: challenges associated with four common filters after variant discovery.**

**Fig. 3: Flow chart to facilitate thoughtful, systematic and reproducible filtering for representative studies and questions using genomic DNA.**

Data availability

Information on the empirical and simulated data used for the analyses shown in this review is available in the Supplementary Information.

Code availability

The simulation code is available on GitHub at: https://github.com/ChristieLab/filtering_simulation_paper.

References

Allendorf, F. W., Hohenlohe, P. A. & Luikart, G. Genomics and the future of conservation genetics. Nat. Rev. Genet. 11, 697â709 (2010).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Athanasopoulou, K., Boti, M. A., Adamopoulos, P. G., Skourou, P. C. & Scorilas, A. Third-generation sequencing: the spearhead towards the radical transformation of modern genomics. Life 12, 30 (2022).
ArticleÂ CASÂ Google ScholarÂ
Fiedler, P. L. et al. Seizing the moment: the opportunity and relevance of the California Conservation Genomics Project to state and federal conservation policy. J. Hered. 113, 589â596 (2022).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Hu, T., Chitnis, N., Monos, D. & Dinh, A. Next-generation sequencing technologies: an overview. Hum. Immunol. 82, 801â811 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Pompanon, F., Bonin, A., Bellemain, E. & Taberlet, P. Genotyping errors: causes, consequences and solutions. Nat. Rev. Genet. 6, 847â859 (2005). This review summarizes the sources of many common types of sequencing errors and provides some laboratory and bioinformatic ways to mitigate them.
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Stoler, N. & Nekrutenko, A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom. Bioinform. 3, lqab019 (2021).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Fountain, E. D., Pauli, J. N., Reid, B. N., PalsbÃ¸ll, P. J. & Peery, M. Z. Finding the right coverage: the impact of coverage and sequence quality on single nucleotide polymorphism genotyping error rates. Mol. Ecol. Resour. 16, 966â978 (2016).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
OâLeary, S. J., Puritz, J. B., Willis, S. C., Hollenbeck, C. M. & Portnoy, D. S. These arenât the loci youâre looking for: principles of effective SNP filtering for molecular ecologists. Mol. Ecol. 27, 3193â3206 (2018). This helpful review discusses the effects of missing data, MAC and other filters on genotyping error rates for RADseq data.
ArticleÂ Google ScholarÂ
Rochette, N. C., Rivera-ColÃ³n, A. G. & Catchen, J. M. Stacks 2: analytical methods for paired-end sequencing improve RADseq-based population genomics. Mol. Ecol. 28, 4737â4754 (2019).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Ahrens, C. W. et al. Regarding the F-word: the effects of data filtering on inferred genotypeâenvironment associations. Mol. Ecol. Resour. 21, 1460â1474 (2021).
ArticleÂ PubMedÂ Google ScholarÂ
Andrews, K. R. & Luikart, G. Recent novel approaches for population genomics data analysis. Mol. Ecol. 23, 1661â1667 (2014).
ArticleÂ PubMedÂ Google ScholarÂ
Shafer, A. B. A. et al. Bioinformatic processing of RAD-seq data dramatically impacts downstream population genetic inference. Methods Ecol. Evol. 8, 907â917 (2017). This study demonstrates the effects of different filtering and alignment choices on several downstream statistics and demographic reconstruction in RADseq data.
ArticleÂ Google ScholarÂ
Larson, W. A., Isermann, D. A. & Feiner, Z. S. Incomplete bioinformatic filtering and inadequate age and growth analysis lead to an incorrect inference of harvested-induced changes. Evol. Appl. 14, 278â289 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Nazareno, A. G. & Knowles, L. L. There is no ârule of thumbâ: genomic filter settings for a small plant population to obtain unbiased gene flow estimates. Front. Plant Sci. 12, 677009 (2021). This comprehensive analysis of empirical data demonstrates how missing data and MAF thresholds affect estimates of gene flow.
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Sethuraman, A. et al. Continued misuse of multiple testing correction methods in population genetics â a wake-up call? Mol. Ecol. Resour. 19, 23â26 (2019).
ArticleÂ PubMedÂ Google ScholarÂ
Allendorf, F. W. et al. Conservation and the Genomics of Populations (Oxford Univ. Press, 2022).
Gervais, L. et al. RAD-sequencing for estimating genomic relatedness matrix-based heritability in the wild: a case study in roe deer. Mol. Ecol. Resour. 19, 1205â1217 (2019).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Crow, J. F. & Kimura, M. An Introduction to Population Genetics Theory (Scientific Publishers, 2017).
Van Etten, J., Stephens, T. G. & Bhattacharya, D. A k-mer-based approach for phylogenetic classification of taxa in environmental genomic data. Syst. Biol. 72, 1101â1118 (2023).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Todd, E. V., Black, M. A. & Gemmell, N. J. The power and promise of RNA-seq in ecology and evolution. Mol. Ecol. 25, 1224â1241 (2016).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 13 (2016).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Olofsson, D., PreuÃner, M., Kowar, A., Heyd, F. & Neumann, A. One pipeline to predict them all? On the prediction of alternative splicing from RNA-seq data. Biochem. Biophys. Res. Commun. 653, 31â37 (2023).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Upton, R. N. et al. Design, execution, and interpretation of plant RNA-seq analyses. Front. Plant Sci. 14, 1135455 (2023).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Rehn, J. et al. RaScALL: rapid (Ra) screening (Sc) of RNA-seq data for prognostically significant genomic alterations in acute lymphoblastic leukaemia (ALL). PLOS Genet. 18, e1010300 (2022).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Boshuizen, H. C. & te Beest, D. E. Pitfalls in the statistical analysis of microbiome amplicon sequencing data. Mol. Ecol. Resour. 23, 539â548 (2023).
ArticleÂ PubMedÂ Google ScholarÂ
Combrink, L. et al. Best practice for wildlife gut microbiome research: a comprehensive review of methodology for 16S rRNA gene investigations. Front. Microbiol. 14, 1092216 (2023).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Cheng, Z. et al. Transcriptomic analysis of circulating leukocytes obtained during the recovery from clinical mastitis caused by Escherichia coli in Holstein dairy cows. Animals 12, 2146 (2022).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Yang, L. & Chen, J. Benchmarking differential abundance analysis methods for correlated microbiome sequencing data. Brief. BioinformaticsÂ 24, bbac607 (2023).
ArticleÂ PubMedÂ Google ScholarÂ
Patin, N. V. & Goodwin, K. D. Capturing marine microbiomes and environmental DNA: a field sampling guide. Front. Microbiol. 13, 1026596 (2023).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Ruppert, K. M., Kline, R. J. & Rahman, M. S. Past, present, and future perspectives of environmental DNA (eDNA) metabarcoding: a systematic review in methods, monitoring, and applications of global eDNA. Glob. Ecol. Conserv. 17, e00547 (2019).
Google ScholarÂ
Deyneko, I. V. et al. Modeling and cleaning RNA-seq data significantly improve detection of differentially expressed genes. BMC BioinformaticsÂ 23, 488 (2022).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Giusti, A., Malloggi, C., Magagna, G., Filipello, V. & Armani, A. Is the metabarcoding ripe enough to be applied to the authentication of foodstuff of animal origin? A systematic review. Compr. Rev. Food Sci. Food Saf. 23, 1â21 (2024).
ArticleÂ Google ScholarÂ
da Fonseca, R. R. et al. Next-generation biology: sequencing and data analysis approaches for non-model organisms. Mar. Genomics 30, 3â13 (2016).
ArticleÂ PubMedÂ Google ScholarÂ
Zhao, M. et al. Exploring conflicts in whole genome phylogenetics: a case study within manakins (Aves: Pipridae). Syst. Biol. 72, 161â178 (2023).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med 12, 91 (2020).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Giani, A. M., Gallo, G. R., Gianfranceschi, L. & Formenti, G. Long walk to genomics: history and current approaches to genome sequencing and assembly. Comput. Struct. Biotechnol. J. 18, 9â19 (2020).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Kumar, K. R., Cowley, M. J. & Davis, R. L. Next-generation sequencing and emerging technologies. Semin. Thromb. Hemost. 45, 661â673 (2019).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Shendure, J. et al. DNA sequencing at 40: past, present and future. Nature 550, 345â353 (2017).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Lou, R. N., Jacobs, A., Wilder, A. P. & Therkildsen, N. O. A beginnerâs guide to low-coverage whole genome sequencing for population genomics. Mol. Ecol. 30, 5966â5993 (2021). This reviews discusses the production and analysis of low-coverage WGS data.
ArticleÂ PubMedÂ Google ScholarÂ
Olson, N. D. et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat. Rev. Genet. 24, 464â483 (2023).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Rochette, N. C. & Catchen, J. M. Deriving genotypes from RAD-seq short-read data using Stacks. Nat. Protoc. 12, 2640â2659 (2017).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Paris, J. R., Stevens, J. R. & Catchen, J. M. Lost in parameter space: a road map for stacks. Methods Ecol. Evol. 8, 1360â1373 (2017).
ArticleÂ Google ScholarÂ
Ceballos, F. C., Joshi, P. K., Clark, D. W., Ramsay, M. & Wilson, J. F. Runs of homozygosity: windows into population history and trait architecture. Nat. Rev. Genet. 19, 220â234 (2018).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Heller, R. et al. A reference-free approach to analyse RADseq data using standard next generation sequencing toolkits. Mol. Ecol. Resour. 21, 1085â1097 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Bohling, J. Evaluating the effect of reference genome divergence on the analysis of empirical RADseq datasets. Ecol. Evol. 10, 7585â7601 (2020).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Valiente-Mullor, C. et al. One is not enough: on the effects of reference genome for the mapping and subsequent analyses of short-reads. PLOS Comput. Biol. 17, e1008678 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Hendricks, S. et al. Recent advances in conservation and population genomics data analysis. Evol. Appl. 11, 1197â1211 (2018).
ArticleÂ PubMed CentralÂ Google ScholarÂ
Vaux, F., Dutoit, L., Fraser, C. I. & Waters, J. M. Genotyping-by-sequencing for biogeography. J. Biogeogr. 50, 262â281 (2023).
ArticleÂ Google ScholarÂ
Jackson, B. C., Campos, J. L. & Zeng, K. The effects of purifying selection on patterns of genetic differentiation between Drosophila melanogaster populations. Heredity 114, 163â174 (2015).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Luikart, G., England, P. R., Tallmon, D., Jordan, S. & Taberlet, P. The power and promise of population genomics: from genotyping to genome typing. Nat. Rev. Genet. 4, 981â994 (2003).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Benestan, L. et al. Sex matters in massive parallel sequencing: evidence for biases in genetic parameter estimation and investigation of sex determination systems. Mol. Ecol. 26, 6767â6783 (2017).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Yang, Z. et al. Multi-omics provides new insights into the domestication and improvement of dark jute (Corchorus olitorius). Plant J. 112, 812â829 (2022).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Zeng, L. et al. Whole genomes and transcriptomes reveal adaptation and domestication of pistachio. Genome Biol. 20, 79 (2019).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhernakova, D. V. et al. Genome-wide sequence analyses of ethnic populations across Russia. Genomics 112, 442â458 (2020).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357â359 (2012).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Li, H. & Durbin, R. Fast and accurate short read alignment with BurrowsâWheeler transform. Bioinformatics 25, 1754â1760 (2009).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Pfeifer, S. P. From next-generation resequencing reads to a high-quality variant data set. Heredity 118, 111â124 (2017).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Lefouili, M. & Nam, K. The evaluation of BCFtools mpileup and GATK HaplotypeCaller for variant calling in non-human species. Sci. Rep. 12, 11331 (2022).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Chen, N.-C., Solomon, B., Mun, T., Iyer, S. & Langmead, B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 22, 8 (2021).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
GÃ¼nther, T. & Nettelblad, C. The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLOS Genet. 15, e1008302 (2019).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737â746 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat. Rev. Genet. 21, 171â189 (2020).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Singh, A. K. et al. Detecting copy number variation in next generation sequencing data from diagnostic gene panels. BMC Med. Genomics 14, 214 (2021).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Willis, S. C., Hollenbeck, C. M., Puritz, J. B., Gold, J. R. & Portnoy, D. S. Haplotyping RAD loci: an efficient method to filter paralogs and account for physical linkage. Mol. Ecol. Resour. 17, 955â965 (2017).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Rochette, N. C. et al. On the causes, consequences, and avoidance of PCR duplicates: towards a theory of library complexity. Mol. Ecol. Resour. 23, 1299â1318 (2023).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Van der Auwera, G. A. & OâConnor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (OâReilly Media, 2020).
Korneliussen, T. S., Albrechtsen, A. & Nielsen, R. ANGSD: analysis of next generation sequencing data. BMC Bioinformatics 15, 356 (2014).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Eaton, D. A. R. & Overcast, I. ipyrad: interactive assembly and analysis of RADseq datasets. Bioinformatics 36, 2592â2594 (2020).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Mona, S., Benazzo, A., Delrieu-Trottin, E. & Lesturgie, P. Population genetics using low coverage RADseq data in non-model organisms: biases and solutions. Preprint at Authorea https://doi.org/10.22541/au.168252801.19878064/v1 (2023).
Nielsen, R., Korneliussen, T., Albrechtsen, A., Li, Y. & Wang, J. SNP calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data. PLoS ONE 7, e37558 (2012).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Warmuth, V. M. & Ellegren, H. Genotype-free estimation of allele frequencies reduces bias and improves demographic inference from RADseq data. Mol. Ecol. Resour. 19, 586â596 (2019).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Wright, B. et al. From reference genomes to population genomics: comparing three reference-aligned reduced-representation sequencing pipelines in two wildlife species. BMC Genomics 20, 453 (2019).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Huang, H. & Knowles, L. L. Unforeseen consequences of excluding missing data from next-generation sequences: simulation study of RAD sequences. Syst. Biol. 65, 357â365 (2016).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Duntsch, L., Whibley, A., Brekke, P., Ewen, J. G. & Santure, A. W. Genomic data of different resolutions reveal consistent inbreeding estimates but contrasting homozygosity landscapes for the threatened Aotearoa New Zealand hihi. Mol. Ecol. 30, 6006â6020 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Kardos, M. & Waples, R. S. Low-coverage sequencing and Wahlund effect severely bias estimates of inbreeding, heterozygosity, and effective population size in North American wolves. Mol. Ecol. https://doi.org/10.1111/mec.17415 (2024). This study reports biases that could affect management decisions caused by next-generation sequencing filtering choices, low-coverage data and the sampling strategy.
Schmidt, T. L., Jasper, M.-E., Weeks, A. R. & Hoffmann, A. A. Unbiased population heterozygosity estimates from genome-wide sequence data. Methods Ecol. Evol. 12, 1888â1898 (2021).
ArticleÂ Google ScholarÂ
Sopniewski, J. & Catullo, R. A. Estimates of heterozygosity from single nucleotide polymorphism markers are context-dependent and often wrong. Mol. Ecol. Resour. 24, e13947 (2024).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945â959 (2000).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Waples, R. S. Testing for HardyâWeinberg proportions: have we lost the plot? J. Hered. 106, 1â19 (2015).
ArticleÂ PubMedÂ Google ScholarÂ
Gautier, M. et al. The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Mol. Ecol. 22, 3165â3178 (2013).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
McKinney, G. J., Waples, R. K., Seeb, L. W. & Seeb, J. E. Paralogs are revealed by proportion of heterozygotes and deviations in read ratios in genotyping-by-sequencing data from natural populations. Mol. Ecol. Resour. 17, 656â669 (2017).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Bitarello, B. D., Brandt, D. Y. C., Meyer, D. & AndrÃ©s, A. M. Inferring balancing selection from genome-scale data. Genome Biol. Evol. 15, evad032 (2023).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Pearman, W. S., Urban, L. & Alexander, A. Commonly used HardyâWeinberg equilibrium filtering schemes impact population structure inferences using RADseq data. Mol. Ecol. Resour. 22, 2599â2613 (2022). This study demonstrates the impact of pooling or splitting sample-groups when applying HWP filters to F_ST and other population structure inferences.
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Linderoth, T. P. Identifying population histories, adaptive genes, and genetic duplication from population-scale next generation sequencing. Genome Res. 20, 291â300 (2018).
Google ScholarÂ
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 57, 289â300 (1995).
ArticleÂ Google ScholarÂ
Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 65â70 (1979).
Google ScholarÂ
Graffelman, J., Jain, D. & Weir, B. A genome-wide study of HardyâWeinberg equilibrium with next generation sequence data. Hum. Genet. 136, 727â741 (2017).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Larson, W. A. et al. Genotyping by sequencing resolves shallow population structure to inform conservation of Chinook salmon (Oncorhynchus tshawytscha). Evol. Appl. 7, 355â369 (2014).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Waples, R. K., Larson, W. A. & Waples, R. S. Estimating contemporary effective population size in non-model species using linkage disequilibrium across thousands of loci. Heredity 117, 233â240 (2016).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Gattepaille, L. M., Jakobsson, M. & Blum, M. G. Inferring population size changes with sequence and SNP data: lessons from human bottlenecks. Heredity 110, 409â419 (2013).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Tajima, F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585 LPâ585595 (1989).
ArticleÂ Google ScholarÂ
Arantes, L. S. et al. Scaling-up RADseq methods for large datasets of non-invasive samples: lessons for library construction and data preprocessing. Mol. Ecol. Resour. https://doi.org/10.1111/1755-0998.13859 (2023).
Cubry, P., Vigouroux, Y. & FranÃ§ois, O. The empirical distribution of singletons for geographic samples of DNA sequences. Front. Genet. 8, 139 (2017).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Linck, E. & Battey, C. J. Minor allele frequency thresholds strongly affect population structure inference with genomic data sets. Mol. Ecol. Resour. 19, 639â647 (2019). This study demonstrates how MAF thresholds affect population structure inferences using both simulated and empirical data.
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Andersson, B. A., Zhao, W., Haller, B. C., BrÃ¤nnstrÃ¶m, Ã. & Wang, X.-R. Inference of the distribution of fitness effects of mutations is affected by single nucleotide polymorphism filtering methods, sample size and population structure. Mol. Ecol. Resour. 23, 1589â1603 (2023).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
DÃaz-Arce, N. & RodrÃguez-Ezpeleta, N. Selecting RAD-seq data analysis parameters for population genetics: the more the better? Front. Genet. 10, 533 (2019).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Holsinger, K. E. & Weir, B. S. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat. Rev. Genet. 10, 639â650 (2009).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Roesti, M., Salzburger, W. & Berner, D. Uninformative polymorphisms bias genome scans for signatures of selection. BMC Evol. Biol. 12, 94 (2012).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Yin, X. et al. Rapid, simultaneous increases in the effective sizes of adaptively divergent yellow perch (Perca flavescens) populations. Preprint at bioRxiv https://doi.org/10.1101/2024.04.21.590447 (2024).
Visscher, P. M. et al. 10âyears of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5â22 (2017).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64â69 (2012).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Dementieva, N. V. et al. Assessing the effects of rare alleles and linkage disequilibrium on estimates of genetic diversity in the chicken populations. Animal 15, 100171 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
De MeeÃ»s, T. Revisiting F_IS, F_ST, Wahlund effects, and null alleles. J. Hered. 109, 446â456 (2018).
ArticleÂ PubMedÂ Google ScholarÂ
Levy-Sakin, M. et al. Genome maps across 26 human populations reveal population-specific patterns of structural variation. Nat. Commun. 10, 1025 (2019).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhang, H., Yin, L., Wang, M., Yuan, X. & Liu, X. Factors affecting the accuracy of genomic selection for agricultural economic traits in maize, cattle, and pig populations. Front. Genet. 10, 189 (2019).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Anderson, E. C. & Garza, J. C. The power of single-nucleotide polymorphisms for large-scale parentage inference. Genetics 172, 2567â2582 (2006).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Dussault, F. M. & Boulding, E. G. Effect of minor allele frequency on the number of single nucleotide polymorphisms needed for accurate parentage assignment: a methodology illustrated using Atlantic salmon. Aquac. Res. 49, 1368â1372 (2018).
ArticleÂ Google ScholarÂ
Thompson, E. The estimation of pairwise relationships. Ann. Hum. Genet. 39, 173â188 (1975).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Goubert, C. et al. A beginnerâs guide to manual curation of transposable elements. Mob. DNA 13, 7 (2022).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Storer, J. M., Hubley, R., Rosen, J. & Smit, A. F. A. Curation guidelines for de novo generated transposable element families. Curr. Protoc. 1, e154 (2021).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Hemstrom, W. B., Freedman, M. G., Zalucki, M. P., RamÃrez, S. R. & Miller, M. R. Population genetics of a recent range expansion and subsequent loss of migration in monarch butterflies. Mol. Ecol. 31, 4544â4557 (2022).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Escoda, L., GonzÃ¡lez-Esteban, J., GÃ³mez, A. & Castresana, J. Using relatedness networks to infer contemporary dispersal: application to the endangered mammal Galemys pyrenaicus. Mol. Ecol. 26, 3343â3357 (2017).
ArticleÂ PubMedÂ Google ScholarÂ
Brown, A. V. et al. Ten quick tips for sharing open genomic data. PLOS Comput. Biol. 14, e1006472 (2018).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhang, D. et al. PhyloSuite: an integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies. Mol. Ecol. Resour. 20, 348â355 (2020).
ArticleÂ PubMedÂ Google ScholarÂ
Tanjo, T., Kawai, Y., Tokunaga, K., Ogasawara, O. & Nagasaki, M. Practical guide for managing large-scale human genome data in research. J. Hum. Genet. 66, 39â52 (2021).
ArticleÂ PubMedÂ Google ScholarÂ
Del Fabbro, C., Scalabrin, S., Morgante, M. & Giorgi, F. M. An extensive evaluation of read trimming effects on illumina NGS data analysis. PLoS ONE 8, e85024 (2013).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Yang, S.-F., Lu, C.-W., Yao, C.-T. & Hung, C.-M. To trim or not to trim: effects of read trimming on the de novo genome assembly of a widespread East Asian passerine, the rufous-capped babbler (Cyanoderma ruficeps Blyth). Genes 10, 737 (2019).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Hotaling, S. et al. Demographic modelling reveals a history of divergence with gene flow for a glacially tied stonefly in a changing post-Pleistocene landscape. J. Biogeogr. 45, 304â317 (2018).
ArticleÂ Google ScholarÂ
Cumer, T. et al. Double-digest RAD-sequencing: do pre- and post-sequencing protocol parameters impact biological results? Mol. Genet. Genomics 296, 457â471 (2021).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Mastretta-Yanes, A. et al. Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference. Mol. Ecol. Resour. 15, 28â41 (2015).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Ebbert, M. T. W. et al. Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches. BMC Bioinformatics 17, 239 (2016).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Euclide, P. T. et al. Attack of the PCR clones: rates of clonality have little effect on RAD-seq genotype calls. Mol. Ecol. Resour. 20, 66â78 (2020).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Flanagan, S. P. & Jones, A. G. Substantial differences in bias between single-digest and double-digest RAD-seq libraries: a case study. Mol. Ecol. Resour. 18, 264â280 (2018).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Martins, F. B. et al. A semi-automated SNP-based approach for contaminant identification in biparental polyploid populations of tropical forage grasses. Front. Plant Sci. 12, 737919 (2021).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Deo, T. G. et al. High-resolution linkage map with allele dosage allows the identification of regions governing complex traits and apospory in guinea grass (Megathyrsus maximus). Front. Plant Sci. 11, 15 (2020).
ArticleÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Zhang, F. et al. Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Res. 30, 185â194 (2020).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Christie, M. R., Marine, M. L., Fox, S. E., French, R. A. & Blouin, M. S. A single generation of domestication heritably alters the expression of hundreds of genes. Nat. Commun. 7, 10676 (2016).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Lou, R. N. & Therkildsen, N. O. Batch effects in population genomic studies with low-coverage whole genome sequencing data: causes, detection and mitigation. Mol. Ecol. Resour. 22, 1678â1692 (2022).
ArticleÂ CASÂ PubMedÂ Google ScholarÂ
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156â2158 (2011).
ArticleÂ CASÂ PubMedÂ PubMed CentralÂ Google ScholarÂ
Mirchandani, C. D. et al. A fast, reproducible, high-throughput variant calling workflow for population genomics. Mol. Biol. Evol. 41, msad270 (2024).
Peñalba, J. V., Peters, J. L. & Joseph, L. Sustained plumage divergence despite weak genomic differentiation and broad sympatry in sister species of Australian woodswallows (Artamus spp.). Mol. Ecol. 31, 5060–5073 (2022).
Thompson, N. F. et al. A complex phenotype in salmon controlled by a simple change in migratory timing. Science 370, 609–613 (2020).
Howe, K. et al. Significantly improving the quality of genome assemblies through curation. Gigascience 10, giaa153 (2021).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Michael, T. P. & VanBuren, R. Building near-complete plant genomes. Genome Stud. Mol. Genet. 54, 26–33 (2020).
Tettelin, H. & Medini, D. The Pangenome: Diversity, Dynamics and Evolution of Genomes (Springer, 2020).
Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
Hemstrom, W. Thirty-Four Kilometers and Fifteen Years: Rapid Adaptation at a Novel Chromosomal Inversion in Recently Introduced Deschutes River Three-Spined Stickleback. Thesis, Oregon State Univ. (2016).
Halvorsen, S., Korslund, L., Mattingsdal, M. & Slettan, A. Estimating number of European eel (Anguilla anguilla) individuals using environmental DNA and haplotype count in small rivers. Ecol. Evol. 13, e9785 (2023).
Whitlock, M. C. & Lotterhos, K. E. Reliable detection of loci responsible for local adaptation: inference of a null model through trimming the distribution of F_ST. Am. Nat. 186, S24–S36 (2015).
vonHoldt, B. M. et al. Demographic history shapes North American gray wolf genomic diversity and informs species’ conservation. Mol. Ecol. 33, e17231 (2024).
Alonso-Blanco, C. et al. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
Maruki, T., Ye, Z. & Lynch, M. Evolutionary genomics of a subdivided species. Mol. Biol. Evol. 39, msac152 (2022).
Kessler, C., Wootton, E. & Shafer, A. B. A. Speciation without gene-flow in hybridizing deer. Mol. Ecol. 32, 1117–1132 (2023).
Martchenko, D. & Shafer, A. B. A. Contrasting whole-genome and reduced representation sequencing for population demographic and adaptive inference: an alpine mammal case study. Heredity 131, 273–281 (2023).
Lowy-Gallego, E. et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res. 4, 50 (2019).
Schweizer, R. M. et al. Broad concordance in the spatial distribution of adaptive and neutral genetic variation across an elevational gradient in deer mice. Mol. Biol. Evol. 38, 4286–4300 (2021).
Kardos, M. et al. Inbreeding depression explains killer whale population dynamics. Nat. Ecol. Evol. 7, 675–686 (2023).
Malison, R. L. et al. Landscape connectivity and genetic structure in a mainstem and a tributary stonefly (Plecoptera) species using a novel reference genome. J. Hered. 113, 453–471 (2022).
Robinson, J. M. et al. Traditional ecological knowledge in restoration ecology: a call to listen deeply, to engage with, and respect Indigenous voices. Restor. Ecol. 29, e13381 (2021).
Lynch, M. The Origins of Genome Architecture (Sinauer Associates, 2007).
Lynch, M. & O’Hely, M. Captive breeding and the genetic fitness of natural populations. Conserv. Genet. 2, 363–378 (2001).


Acknowledgements

The authors thank E. Anderson, A. Leaché, M. Kardos and the reviewers for their helpful comments that greatly improved this manuscript. The authors also thank M. Exposito-Alonso and the 1001 Genomes Consortium, the 1000 Genomes Project, B. Hand, M. Freedman, M. Kardos, C. Kessler, M. Lynch, R. Malison, D. Martchenko, M. Miller, R. Schweizer, A.B.A. Shafer and X. Yin for allowing their datasets to be reviewed and re-filtered. M.R.C. was funded, in part, by NSF DEB-1856710 and OCE-1924505. G.L. was funded, in part, by NSF-DOB-M66230.

Author information

These authors contributed equally: William Hemstrom, Jared A. Grummer.

Authors and Affiliations

Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
William Hemstrom & Mark R. Christie
Flathead Lake Biological Station, Wildlife Biology Program and Division of Biological Sciences, University of Montana, Missoula, MT, USA
Jared A. Grummer & Gordon Luikart
Department of Forestry and Natural Resources, Purdue University, West Lafayette, IN, USA
Mark R. Christie

Authors

William Hemstrom
Jared A. Grummer
Gordon Luikart
Mark R. Christie

Contributions

All authors conceptualized, wrote and edited the manuscript. W.H. and J.A.G. conducted the simulations and analyses in Box 2.

Corresponding authors

Correspondence to William Hemstrom or Mark R. Christie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Genetics thanks Mark Ravinet and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notebook 1

Supplementary Notebook 2

Glossary

Alignment: The mapping of sequencing reads and/or contigs either to each other (pairwise/multiple alignment) or to a reference. Alignments can vary in the strength of the evidence that supports them. Most alignment tools return mapping quality (mapQ) scores, the derivation and meaning of which vary by program. Filtering thresholds based on this score must consider the specific aligner used.
Base quality score: The value in a logarithmic, Phred scale given to each base on a sequencing read that indicates a quantitative degree of confidence in the nucleotide called from the sequencing instrument.
Contigs: Contiguous sequences of DNA assembled from many overlapping sequence reads, representing a fragment of a chromosome.
De novo assembly: The reference-free alignment of sequencing reads into overlapping stacks or contigs for subsequent use in variant discovery and genotyping.
F_IS: A measure of inbreeding; the degree of subpopulation divergence from Hardy–Weinberg proportions (the correlation between alleles at specific loci within individuals relative to the subpopulation).
F_ST: A measure of population differentiation; the proportion of the total genetic variance due to differences in allele frequencies between subpopulations.
Genetic variants: Differences in DNA sequence compared with a reference sequence or other individuals within a population. The term includes short variants (single-nucleotide polymorphisms (SNPs) or insertions and deletions) and structural variants (chromosomal inversions and copy number variations (CNVs)). In the context of this Review, used interchangeably with ‘locus’.
Genome-wide association studies (GWAS): Tests for statistical relationships between a phenotype (including disease) and the allelic/genotypic state of an (ideally) large cohort of individuals across the entire set of sequenced loci.
Genotyping: Also referred to as genotype or variant calling. Calling allelic states at a locus (for example, A/A, A/C or C/C at a biallelic single-nucleotide polymorphism (SNP) in a diploid organism) or loci from sequence data. Genotyping algorithms often consist of multiple steps during which filtering can occur.
Haplotype phase: The complete sequence of variants that occur in a region along a single chromatid.
Hardy–Weinberg proportions (HWP): The expected frequencies of the genotypes at a given locus under Hardy–Weinberg equilibrium. Filtering on HWP is often executed via an exact test, with loci that deviate significantly from HWP removed from subsequent analyses.
Imputation: The filling in of missing data for specific genotypes and/or loci by leveraging linkage disequilibrium (LD) between missing genotypes and genotypes called at other loci or samples. Imputation can use reference panels of well-described haplotypes to improve performance when available, usually in well-studied model organisms.
Linkage disequilibrium (LD): The non-random association of alleles at different loci within a population or sample-group. This association can either be caused by physical linkage, when alleles are co-inherited due to non-independent assortment caused by close physical proximity, or occur across chromosomes when inbreeding, paralogy, genetic drift or other factors make certain alleles at different loci more likely to co-occur.
Low-coverage whole-genome sequencing: Whole-genome sequencing (WGS) with small numbers of reads covering most genomic loci (low coverage); the number of reads constituting low coverage varies widely depending on the discipline, methodology and research question. Low-coverage WGS often requires genotype likelihood-based methods.
Mapping quality: The score given to a read or other DNA sequence indicating the uniqueness of the alignment to a reference sequence; mapping quality score interpretations vary across alignment programs.
Minor allele count (MAC): The number of gene copies or individuals carrying the minor (that is, least frequent) allele at a locus.
Minor allele frequency (MAF): The proportion (frequency) of the least common allele at a locus across a study or sample-group; in this Review, we refer to filtering out loci with MAFs below a given threshold as MAF filtering.
Missing data: Missing genotype calls at a specific locus or individual. Missing data can be caused by many factors, such as the absence of a sufficient number of reads covering a locus to call a genotype in an individual with any degree of confidence.
N50 or L50 scores: In a genome assembly after sorting contigs or scaffolds by length, either the length of the contig/scaffold that reaches 50% of the cumulative genome length (N50) or the number of contigs needed to reach 50% of the cumulative genome length (L50); used to evaluate the assembly quality.
Paralogues: Duplicated genomic regions that have arisen via either the duplication of that specific region or the duplication of the entire genome. A type of homologue (loci identical by descent) distinct from orthologues, which arise due to speciation events.
PCR duplicates: Technical duplicates resulting in spurious, usually identical read copies caused by sequencing the same piece of template DNA multiple times.
Population structure: Also known as population subdivision. Non-independence among individuals in a study area/region caused by spatial, temporal, behavioural or other forms of reproductive isolation. Population structure is characterized by divergent allele frequencies across loci.
Read depth: The number of reads that cover a given or fixed genomic position. Also referred to as ‘coverage’.
Reference bias: The propensity for reads containing the non-reference allele (the allele not in the reference genome) to have lower mapping quality scores or map to the wrong location compared with those containing the allele present in the reference genome.
Runs of homozygosity: Contiguous homozygous regions of the genome caused by the inheritance of identical haplotypes from both parents (for example, identical by descent). Useful for estimating inbreeding and population demographics.
Sample-group: A group of samples that are not independent due to natural causes (such as geographic or temporal separation) and/or experimental treatments.
Single-nucleotide polymorphisms (SNPs): Genetic variants where the allelic state of the population varies at a single base pair.
Singletons: Alleles that appear only once in a sample of individuals. Sometimes alternatively defined as an allele sequenced in only one individual (which may be homozygous for that allele).
Site-frequency spectra (SFS): The distributions of allele frequencies across loci within a study or sample-group. Can be either an ‘unfolded’ or ‘polarized’ derived allele frequency spectrum, which describes the frequency distribution of derived alleles, or a ‘folded’ or ‘unpolarized’ minor allele frequency (MAF) spectrum, which describes the frequency distribution of the minor alleles. Also known as the allele frequency distribution.
Structural variation: Genetic variation in the order, number and/or arrangement of loci.
Study-wide filtering: Applying a filtering threshold ‘globally’ (simultaneously across all samples in the entire dataset) rather than separately within each sample-group.
VCF file: A file in the variant call format, which contains genotype calls (or likelihoods, posteriors) alongside a flexible suite of metadata such as filtering and processing history and quality information.
Wahlund effect: A reduction in observed heterozygosity (H_O) relative to the expected heterozygosity (H_e) under Hardy–Weinberg proportions (HWP) (that is, H_O < H_e) at many/most loci caused by the underlying population structure. When multiple (sub)populations are included in a sample, any differences in allele frequency between (sub)populations will cause there to be considerably more homozygous individuals at those loci than would be expected under HWP (causing an elevated F_IS, the fixation index in individuals relative to a subpopulation).
Within-group filtering: Applying a filtering threshold within each sample-group separately rather than across all individuals simultaneously (for example, study wide or globally).
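
Several of the glossary terms above (minor allele count and frequency, Hardy–Weinberg proportions, F_IS, the Wahlund effect, and study-wide versus within-group filtering) can be illustrated together with a toy calculation. The Python sketch below is not taken from the Review or its supplementary notebooks; the function names, the genotype counts and the MAF threshold of 0.2 are hypothetical choices made only for illustration, and expected heterozygosity is computed without any small-sample correction.

```python
from collections import Counter
from math import nan


def allele_counts(genotypes):
    """Count gene copies of each allele across diploid genotypes such as ("A", "C")."""
    return Counter(allele for genotype in genotypes for allele in genotype)


def minor_allele_stats(genotypes):
    """Return (MAC, MAF) at one biallelic locus, counting gene copies."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())
    # A monomorphic locus has no minor allele, so its MAC is 0.
    mac = min(counts.values()) if len(counts) > 1 else 0
    return mac, mac / total


def heterozygosity(genotypes):
    """Return (H_O, H_e): observed and Hardy-Weinberg expected heterozygosity."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())
    h_obs = sum(a != b for a, b in genotypes) / len(genotypes)
    # Uncorrected expectation: 1 minus the sum of squared allele frequencies.
    h_exp = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return h_obs, h_exp


def f_is(genotypes):
    """Return F_IS = 1 - H_O / H_e, or NaN for a monomorphic locus."""
    h_obs, h_exp = heterozygosity(genotypes)
    return nan if h_exp == 0 else 1.0 - h_obs / h_exp


def make_group(n_aa, n_ac, n_cc):
    """Build a list of diploid genotypes from counts of A/A, A/C and C/C."""
    return [("A", "A")] * n_aa + [("A", "C")] * n_ac + [("C", "C")] * n_cc


# Hypothetical sample-groups, each in Hardy-Weinberg proportions on its own
# but with very different allele frequencies (p(A) = 0.9 versus 0.1).
group_1 = make_group(81, 18, 1)
group_2 = make_group(1, 18, 81)
pooled = group_1 + group_2

MAF_THRESHOLD = 0.2  # arbitrary illustrative threshold, not a recommendation

results = {}
for name, genos in [("group_1", group_1), ("group_2", group_2), ("study-wide", pooled)]:
    mac, maf = minor_allele_stats(genos)
    h_obs, h_exp = heterozygosity(genos)
    results[name] = {
        "MAC": mac,
        "MAF": maf,
        "H_O": h_obs,
        "H_e": h_exp,
        "F_IS": f_is(genos),
        "passes_maf": maf >= MAF_THRESHOLD,
    }
    print(name, results[name])
```

In this toy dataset each sample-group is at Hardy–Weinberg proportions on its own (F_IS close to 0), whereas the pooled sample has H_O = 0.18 against H_e = 0.5 and an F_IS of about 0.64, which is the signature of a Wahlund effect. The same locus also fails the illustrative MAF threshold within each group (MAF = 0.1) but passes it study wide (MAF = 0.5), showing how the two filtering scopes can disagree.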

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Hemstrom, W., Grummer, J.A., Luikart, G. et al. Next-generation data filtering in the genomics era. Nat Rev Genet (2024). https://doi.org/10.1038/s41576-024-00738-6


Accepted: 25 April 2024
Published: 14 June 2024
DOI: https://doi.org/10.1038/s41576-024-00738-6