The CASTLE panel is a multi-technology whole-genome sequencing dataset of six commercially available tumor/normal cell line pairs (HCC1954, HCC1937, H1437, H2009, Hs578T and HCC1395). Genomic sequencing currently includes PacBio, Oxford Nanopore, Illumina and Pore-C, in most cases sequenced from the same DNA extraction or cell line passage. We generated this panel to motivate benchmarking and development of new short- and long-read tools for cancer genomics.
The panel also includes a set of confident somatic structural variation (SV) and single nucleotide variation (SNV) calls, supported by an ensemble of orthogonal methods.
Raw sequencing data is available through NCBI SRA BioProject PRJNA1086849 and a Google Cloud mirror. Methylation calls for Nanopore and PacBio are also available via the Google Cloud mirror (or alternatively through the SRA in the Cloud service).
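As a rough illustration of fetching a single run from SRA, the sketch below builds the SRA Toolkit commands (`prefetch`, then `fasterq-dump`) for one accession without executing them. It assumes the SRA Toolkit is installed and on `PATH`; the helper name `sra_download_commands` and the `reads` output directory are hypothetical. Runs of this size may exceed the default download size limit of `prefetch`, so check the toolkit documentation before running.

```python
import shlex


def sra_download_commands(accession, outdir="reads"):
    """Return the SRA Toolkit commands for one run, without executing them.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not accession.startswith("SRR"):
        raise ValueError(f"expected an SRR run accession, got {accession!r}")
    return [
        # Download the run into ./<accession>/ in the current directory.
        ["prefetch", accession],
        # Convert the downloaded run to FASTQ files under outdir.
        ["fasterq-dump", accession, "--outdir", outdir],
    ]


# Two accessions taken from the HCC1954 rows of this README.
for acc in ["SRR28305164", "SRR28305161"]:
    for cmd in sra_download_commands(acc):
        print(shlex.join(cmd))
```

To actually run the commands, each list can be passed to `subprocess.run(cmd, check=True)`.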
For details on how the sequencing was performed, please see the DeepSomatic preprint.
Recently, we generated additional ultra-long ONT and Pore-C data for a subset of cell lines to improve chromosome-scale phasing and de novo assembly.
Sample | T/N | Technology | Size (Gb) | Reads N50 (kb) | SRA accession |
---|---|---|---|---|---|
HCC1954 | T | ONT R10 | 253 | 28 | SRR28305164 |
HCC1954 | T | HiFi | 195 | 17 | SRR28305163 |
HCC1954 | T | Illumina | 232 | - | SRR28305162 |
HCC1954BL | N | ONT R10 | 104 | 28 | SRR28305161 |
HCC1954BL | N | HiFi | 193 | 17 | SRR28305160 |
HCC1954BL | N | Illumina | 436 | - | SRR28305159 |
Sample | T/N | Technology | Size (Gb) | Reads N50 (kb) | SRA accession |
---|---|---|---|---|---|
HCC1937 | T | ONT R10 | 355 | 37 | SRR28305186 |
HCC1937 | T | ONT UL E821 | 572 | 48 | SRR31537484 |
HCC1937 | T | HiFi | 184 | 15 | SRR28305185 |
HCC1937 | T | Illumina | 740 | - | SRR28305184 |
HCC1937BL | N | ONT R10 | 79 | 41 | SRR28305183 |
HCC1937BL | N | ONT UL E821 | 172 | 45 | SRR31537483 |
HCC1937BL | N | HiFi | 172 | 16 | SRR28305182 |
HCC1937BL | N | Illumina | 218 | - | SRR28305181 |
HCC1937BL | N | Pore-C | 77 | - | SRR31537477 |
Sample | T/N | Technology | Size (Gb) | Reads N50 (kb) | SRA accession |
---|---|---|---|---|---|
H1437 | T | ONT R10 | 242 | 38 | SRR28305180 |
H1437 | T | ONT UL E821 | 550 | 48 | SRR31537476 |
H1437 | T | HiFi | 198 | 17 | SRR28305179 |
H1437 | T | Illumina | 595 | - | SRR28305178 |
BL1437 | N | ONT R10 | 151 | 42 | SRR28305177 |
BL1437 | N | ONT UL E821 | 203 | 39 | SRR31537475 |
BL1437 | N | HiFi | 218 | 18 | SRR28305175 |
BL1437 | N | Illumina | 203 | - | SRR28305174 |
BL1437 | N | Pore-C | 79 | - | SRR31537474 |
Sample | T/N | Technology | Size (Gb) | Reads N50 (kb) | SRA accession |
---|---|---|---|---|---|
H2009 | T | ONT R10 | 329 | 27 | SRR28305173 |
H2009 | T | ONT UL E821 | 516 | 65 | SRR31537473 |
H2009 | T | HiFi | 201 | 16 | SRR28305172 |
H2009 | T | Illumina | 669 | - | SRR28305171 |
BL2009 | N | ONT R10 | 92 | 37 | SRR28305170 |
BL2009 | N | ONT UL E821 | 166 | 64 | SRR31537472 |
BL2009 | N | HiFi | 209 | 16 | SRR28305169 |
BL2009 | N | Illumina | 171 | - | SRR28305168 |
BL2009 | N | Pore-C | 79 | - | SRR31537471 |
Sample | T/N | Technology | Size (Gb) | Reads N50 (kb) | SRA accession |
---|---|---|---|---|---|
Hs578T | T | ONT R10 | 261 | 36 | SRR31537470 |
Hs578T | T | HiFi | 172 | 18 | SRR31537482 |
Hs578T | T | Illumina | 616 | - | SRR31537481 |
Hs578Bst | N | ONT R10 | 113 | 40 | SRR31537480 |
Hs578Bst | N | HiFi | 84 | 12 | SRR31537479 |
Hs578Bst | N | Illumina | 102 | - | SRR31537478 |
Sample | T/N | Technology | Size (Gb) | Reads N50 (kb) | SRA accession |
---|---|---|---|---|---|
HCC1395 | T | ONT R10 | 246 | 10 | SRR28305167 |
HCC1395BL | N | ONT R10 | 90 | 11 | SRR28305166 |
Matching PacBio and Illumina data for these cell lines is available from other sequencing projects.
COLO829/COLO829BL is a popular benchmarking dataset for cancer somatic SV evaluation. We generated additional ONT R9 and Illumina data from the same culture to support the evaluation.
Sample | T/N | Technology | Size (Gb) | Reads N50 (kb) | SRA accession |
---|---|---|---|---|---|
COLO829 | T | ONT R9 | 361 | 40 | SRR28305188 |
COLO829 | T | Illumina | 162 | - | SRR28305187 |
COLO829BL | N | ONT R9 | 419 | 38 | SRR28305176 |
COLO829BL | N | Illumina | 395 | - | SRR28305165 |
Matching PacBio and ONT R10 data for these cell lines is available from other sequencing projects.
For the cell line sequencing, the Institutional Review Board of National Institutes of Health considers patient-derived cell lines as non-human subjects, thus approval was not required. We note that because commercially available cell lines were derived prior to establishing the research use consent mechanism, no such consent was received. The cell lines used in this study are anonymized, therefore the risks of identifying original donors or their immediate family members are considered low. There are substantial benefits of openly releasing the cell line data for other research developing new methods for detecting somatic variants, which is crucial for precision cancer therapies. Given that existing data for these cell lines is already publicly available via NCBI SRA, we concluded that the benefits of releasing additional long-read data outweigh the marginal risks. This decision follows the practices previously established by the NCI and NHGRI in the TCGA tumor cell line data release.
The sequencing data was generated and analyzed in a joint initiative by the following institutions/groups:
- National Cancer Institute (Kolmogorov lab)
- UC Santa Cruz (Paten and Miga labs)
- Children's Mercy Hospital (Farooqi lab)
- Google Health (DeepVariant group)
- Oxford Nanopore Technologies (Applications team)
- New York Genome Center (Robine and Narzisi labs)
- The most relevant citation for data generation and small variant analysis is the DeepSomatic preprint.
- To refer to structural variation analysis, please cite the Severus preprint.
Since no ground-truth SV calls are available, we created an ensemble of confident calls by merging the results from different methods and sequencing technologies using Minda. We used Severus, nanomonsv, SAVANA, Sniffles2, SvABA, GRIDSS, and Manta to generate the ensembles. A call was considered confident if it was supported by at least two (out of three) technologies and at least 4 (out of 11) callers. This assumes that singleton calls are false positives and that calls supported by multiple tools are more reliable. For more details on the method and statistics, please see the Severus preprint.
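The confidence rule above can be sketched as a simple filter. This is a minimal illustration, not the Minda implementation: it assumes each merged SV already carries a list of `(technology, caller)` pairs that reported it. The function name, the SV identifiers, and the technology and caller labels are all hypothetical, and how the 11 callers are enumerated is left to the user.

```python
from typing import Iterable, Tuple

MIN_TECHNOLOGIES = 2  # "at least two (out of three) technologies"
MIN_CALLERS = 4       # "at least 4 (out of 11) callers"


def is_confident(support: Iterable[Tuple[str, str]]) -> bool:
    """Apply the two-threshold rule to one merged SV.

    support: (technology, caller) pairs that reported the SV.
    """
    pairs = list(support)
    technologies = {tech for tech, _ in pairs}
    callers = {caller for _, caller in pairs}
    return len(technologies) >= MIN_TECHNOLOGIES and len(callers) >= MIN_CALLERS


# Hypothetical merged calls; all labels are illustrative only.
merged_calls = {
    # 3 technologies, 4 callers: passes both thresholds.
    "sv_1": [("ONT", "caller_a"), ("ONT", "caller_b"),
             ("HiFi", "caller_c"), ("Illumina", "caller_d")],
    # 4 callers but a single technology: fails the technology threshold.
    "sv_2": [("ONT", "caller_a"), ("ONT", "caller_b"),
             ("ONT", "caller_e"), ("ONT", "caller_f")],
    # 2 technologies but only 2 callers: fails the caller threshold.
    "sv_3": [("ONT", "caller_a"), ("Illumina", "caller_d")],
}

confident = [sv_id for sv_id, support in merged_calls.items() if is_confident(support)]
print(confident)
```

Requiring both thresholds means a call seen by many tools on a single technology is still excluded, which guards against technology-specific artifacts.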
Individual VCFs and scripts for each caller are available here. For data preprocessing steps, please see
This dataset is used to develop and evaluate the somatic structural variation (SV) caller Severus.
To be added soon!
Please raise issues concerning this dataset in the GitHub repository.