Comparison methodology
The precisionFDA system includes an initial VCF comparison framework which was used in the first precisionFDA challenge. We asked participants to use that framework to compare their HG001 VCFs against the GiaB/NIST HG001 v2.19 truth data while the challenge was still ongoing, to ensure that their files were in good shape. However, for the final evaluation of the results of this challenge, we partnered with the Benchmarking Task team of the GA4GH Data Working Group, to employ an updated comparison framework.
The GA4GH benchmarking team has been discussing the difficulties of conducting proper VCF comparisons for several months now, and has devised metrics definitions, initial specifications, early software prototypes, and other benchmarking resources for implementing VCF comparison best practices. We approached the chair of the benchmarking group, Dr. Justin Zook, to consult with him on how to best use these resources in the context of this precisionFDA challenge.
The outcome of our collaboration is the utilization of a particular methodology for comparing VCFs and counting results, which was finalized and tailored for the needs of this challenge, and which we hope will form the basis of other comparisons in the future. It is available on precisionFDA as an app, titled Vcfeval + Hap.py Comparison. (Given the departure from the previous comparison technique, and until we have received enough feedback for this new method, the "Comparison" feature of precisionFDA has not yet been updated.)
Software used
This particular comparison framework consists of a specific version of Real Time Genomics' vcfeval, used for VCF comparison, and a specific version of Illumina's hap.py, used for quantification; together they form the prototype GA4GH benchmarking workflow, as illustrated here.
The vcfeval tool generates an intermediate VCF which is further quantified by hap.py. The "quantify" tool from hap.py counts and stratifies variants. It counts SNPs and INDELs separately and provides additional counts for subtypes of each variant (indels of various lengths, het / hom, stratification regions). In order to determine if a variant specifies SNPs, insertions or deletions, "quantify" performs an alignment of the corresponding REF and ALT alleles. This improves handling of MNPs/complex variants and provides results that are comparable between different methods.
Metrics calculated
The quantification process calculates the following metrics, in harmony with "Comparison Method 3" as defined in this upcoming GA4GH benchmarking spec update:
Metric | Definition |
TRUTH.TP | True positives, from the perspective of the truth data, i.e. the number of sites in the Truth Call Set for which there are paths through the Query Call Set that are consistent with all of the alleles at this site, and for which there is an accurate genotype call for the event. |
QUERY.TP | True positives, from the perspective of the query data, i.e. the number of sites in the Query Call Set for which there are paths through the Truth Call Set that are consistent with all of the alleles at this site, and for which there is an accurate genotype call for the event. |
TRUTH.FN | False negatives, i.e. the number of sites in the Truth Call Set for which there is no path through the Query Call Set that is consistent with all of the alleles at this site, or sites for which there is an inaccurate genotype call for the event. Sites with correct variant but incorrect genotype are counted here. |
QUERY.FP | False positives, i.e. the number of sites in the Query Call Set for which there is no path through the Truth Call Set that is consistent with this site. Sites with correct variant but incorrect genotype are counted here. |
FP.gt | The number of false positives where the non-REF alleles in the Truth and Query Call Sets match (i.e. cases where the truth is 1/1 and the query is 0/1 or similar). |
Recall | TRUTH.TP / (TRUTH.TP + TRUTH.FN) |
Precision | QUERY.TP / (QUERY.TP + QUERY.FP) |
F-score | Harmonic mean of Recall and Precision |
Note that recall uses TRUTH.TP whereas precision uses QUERY.TP. For recall, TRUTH.TP counts the number of truth variants reproduced in the truth set representation, which is the same for all callers and should also be consistent with the confident regions (especially important around confident region boundaries and where variants can move between categories depending on how a variant caller decides to represent them). Using QUERY.TP for recall would have introduced problems where different methods create different representations (e.g. systematically calling MNPs as insertions + deletions can skew indel numbers in the query). For calculating precision, the tool uses QUERY.TP (i.e. the precision measures the relative number of wrong calls for all query calls).
Granular reporting and region stratification
The quantification process reports results with additional granularity, according to a few different dimensions, as outlined in the four tabs of the following table:
Value | Meaning |
SNP | SNP or MNP variants |
INDEL | Indels and complex variants |
Value | Meaning |
* | Aggregate numbers of all subtypes |
ti | SNPs that constitute transitions |
tv | SNPs that constitute transversions |
I1_5 | Insertions of length 1-5 |
I6_15 | Insertions of length 6-15 |
I16_PLUS | Insertions of length 16 or more |
D1_5 | Deletions of length 1-5 |
D6_15 | Deletions of length 6-15 |
D16_PLUS | Deletions of length 16 or more |
C1_5 | Complex variants of length 1-5 |
C6_15 | Complex variants of length 6-15 |
C16_PLUS | Complex variants of length 16 or more |
Value | Meaning |
* | Aggregate numbers of all genotypes |
het | Only heterozygous variant calls (0/1 or similar genotypes) |
homalt | Only homozygous alternative variant calls (1/1 or similar genotypes) |
het | Only heterozygous alternative variant calls (1/2 or similar genotypes) |
Value | Meaning |
* | Aggregate numbers not limited to any subset (but still within confident regions) |
func_cds | Coding exons from RefSeq |
map_l100_m2_e1 | Regions in which 100bp reads map to >1 location with up to 2 mismatches and up to 1 indel |
map_l150_m0_e0 | Regions in which 150bp reads map to >1 location with up to 0 mismatches and up to 0 indels |
map_l150_m2_e1 | Regions in which 150bp reads map to >1 location with up to 2 mismatches and up to 1 indel |
map_l150_m2_e0 | Regions in which 150bp reads map to >1 location with up to 2 mismatches and up to 0 indels |
map_l100_m1_e0 | Regions in which 100bp reads map to >1 location with up to 1 mismatch and up to 0 indels |
map_l125_m1_e0 | Regions in which 125bp reads map to >1 location with up to 1 mismatch and up to 0 indels |
map_l250_m2_e1 | Regions in which 250bp reads map to >1 location with up to 2 mismatches and up to 1 indel |
map_l250_m0_e0 | Regions in which 250bp reads map to >1 location with up to 0 mismatches and up to 0 indels |
map_l150_m1_e0 | Regions in which 150bp reads map to >1 location with up to 1 mismatch and up to 0 indels |
map_l125_m2_e0 | Regions in which 125bp reads map to >1 location with up to 2 mismatches and up to 0 indels |
map_l100_m0_e0 | Regions in which 100bp reads map to >1 location with up to 0 mismatches and up to 0 indels |
map_l250_m1_e0 | Regions in which 250bp reads map to >1 location with up to 1 mismatch and up to 0 indels |
map_l125_m2_e1 | Regions in which 125bp reads map to >1 location with up to 2 mismatches and up to 1 indel |
map_l250_m2_e0 | Regions in which 250bp reads map to >1 location with up to 2 mismatches and up to 0 indels |
map_l125_m0_e0 | Regions in which 125bp reads map to >1 location with up to 0 mismatches and up to 0 indels |
map_l100_m2_e0 | Regions in which 100bp reads map to >1 location with up to 2 mismatches and up to 0 indels |
map_siren | Regions considered difficult to map by amplab SiRen |
tech_badpromoters | 1000 promoter regions with lowest relative coverage in Illumina, relatively GC-rich |
lowcmp_SimpleRepeat_homopolymer_gt10 | Homopolymers >10bp in length |
lowcmp_SimpleRepeat_triTR_11to50 | Exact 3bp tandem repeats 11-50bp in length |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRgt6_lt51bp_gt95identity_merged | Tandem repeats with >6bp unit size and <51bp in length and >95% identity |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_all_merged | All tandem repeats from TRDB with adjacent repeats merged |
lowcmp_SimpleRepeat_quadTR_11to50 | Exact 4bp tandem repeats 11-50bp in length |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRlt7_gt200bp_gt95identity_merged | Tandem repeats with 1-6bp unit size and >200bp in length and >95% identity |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRlt7_lt101bp_gt95identity_merged | Tandem repeats with 1-6bp unit size and <101bp in length and >95% identity |
lowcmp_SimpleRepeat_quadTR_gt200 | Exact 4bp tandem repeats >200bp in length |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRgt6_lt101bp_gt95identity_merged | Tandem repeats with >6bp unit size and <101bp in length and >95% identity |
lowcmp_AllRepeats_gt200bp_gt95identity_merged | All perfect and imperfect tandem repeats >200bp in length |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRgt6_51to200bp_gt95identity_merged | Tandem repeats with >6bp unit size and 51-200bp in length and >95% identity |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_all_gt95identity_merged | All tandem repeats from TRDB with >95% identity |
lowcmp_SimpleRepeat_triTR_gt200 | Exact 3bp tandem repeats >200bp in length |
lowcmp_SimpleRepeat_triTR_51to200 | Exact 3bp tandem repeats 51-200bp in length |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRlt7_lt51bp_gt95identity_merged | Tandem repeats with 1-6bp unit size and <51bp in length and >95% identity |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRlt7_51to200bp_gt95identity_merged | Tandem repeats with 1-6bp unit size and 51-200bp in length and >95% identity |
lowcmp_AllRepeats_51to200bp_gt95identity_merged | All perfect and imperfect tandem repeats 51-200bp in length |
lowcmp_Human_Full_Genome_TRDB_hg19_150331 | All tandem repeats from TRDB |
lowcmp_SimpleRepeat_quadTR_51to200 | Exact 4bp tandem repeats 51-200bp in length |
lowcmp_AllRepeats_lt51bp_gt95identity_merged | All perfect and imperfect tandem repeats <51bp in length |
lowcmp_SimpleRepeat_homopolymer_6to10 | All homopolymers 6-10bp in length |
lowcmp_SimpleRepeat_diTR_gt200 | Exact 2bp tandem repeats >200bp in length |
lowcmp_SimpleRepeat_diTR_11to50 | Exact 2bp tandem repeats 11-50bp in length |
lowcmp_SimpleRepeat_diTR_51to200 | Exact 2bp tandem repeats 51-200bp in length |
lowcmp_Human_Full_Genome_TRDB_hg19_150331_TRgt6_gt200bp_gt95identity_merged | Tandem repeats with >6bp unit size and >200bp in length and >95% identity |
segdup | Segmental duplications in GRCh37 of all sizes, not including ALTs from GRCh37 |
segdupwithalt | Segmental duplications in GRCh37 >10kb, including ALTs from GRCh37 |
decoy | Regions in GRCh37 to which decoy sequences in hs37d5 map |
HG001compoundhet | Compound heterozygous regions in which there are 2 different variants when phased variants within 50bp are combined in HG001/NA12878 |
HG002compoundhet | Compound heterozygous regions in which there are 2 different variants when phased variants within 50bp are combined in HG002/NA24385 |
HG001complexvar | Complex variant regions in which there are 2 or more variants within 50bp in HG001/NA12878 |
HG002complexvar | Complex variant regions in which there are 2 or more variants within 50bp in HG002/NA24385 |
These dimensions help slice and dice the results according to all the different ways in which someone may want to look at them. The special value of '*' (for Subtype, Genotype, Subset) corresponds to summary statistics (i.e. as calculated across the whole genome in the confident regions, without further stratification).
Compilation of results
We ran hap.py (and specifically the HAP-207 version, with the engine set to a GA4GH-specific version of vcfeval) to compare HG001 and HG002 submissions against the GiaB/NIST v3.2.2 HG001 and HG002 truth data respectively. Our executions excluded the entry labeled ccogle-snppet as it did not call variants in the whole genome. The entries labeled ghariani-varprowl, jpowers-varprowl and qzeng-custom contained few VCF lines which were deemed incompatible by this new comparison framework (which performs stricter checks). Upon closer inspection, this was due to incompatible REF columns (such as the strings "nan" or "AC-7GATAGAA") or non-diploid genotypes (such as 0/1/2, or even 0/1/2/3). These were a small fraction, so we decided to exclude these offending lines from this comparison.
We have made available on precisionFDA complete archives of all the input files (including both the original and adjusted files, for the entries which we had to remove offending VCF lines), and the results (including the annotated VCF as output by hap.py, and extended statistics in CSV format). These are recapitulated in a precisionFDA post.
We would like to thank the benchmarking group of the Global Alliance for Genomics and Health for their excellent collaboration throughout this precisionFDA effort.