Bioinformatics Workshops
www.bioinformatics.ca
Module 4
Mapping and Genome Rearrangement
Jared Simpson, Ph.D.
Paired-end Reads
DNA fragment
ATCAA CTAAG
Learning Objectives of Module
• Understand mapping sequence reads to a reference genome
• Understand file formats like FASTA, FASTQ and SAM/BAM
• Learn common terminology used to describe alignments
• Learn how paired-end reads can be used to find genome rearrangements
• Run a mapper and rearrangement caller
Module 4 bioinformatics.ca
Sequencing platforms
[Figure: sequencing platforms plotted by increasing data per run against increasing run time. Labelled throughputs: 14TB/run, 600Gb/10d, 100Gb/15d, 120Gb/1d, 90Gb/10d, 150Mb/3h, 2Gb/27h, 700Mb/23h, 100Mb/1h. Annotations: "Cross-platform data integration needed.", "Proton?", "GridION?"]
Basecalling
• How do we translate the machine data to base calls?
• How do we estimate and represent sequencing errors?
Sources of error
Illumina: Pre-phasing & Phasing
What is a base quality?
Phred quality scores:
- Estimate of the probability that the base call is incorrect
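The Phred scale relates a quality score Q to an error probability P by Q = -10 * log10(P). The following is a minimal sketch of that conversion; the function names are illustrative and not taken from any particular toolkit.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to a Phred score Q = -10*log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: P(error) = {phred_to_error_prob(q):g}")
```

Under this scale every 10 points of quality corresponds to a tenfold drop in the estimated error probability.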
Error Profiles
• Illumina
– Low error rate (~0.5%), mainly substitutions
• 454/Ion Torrent
– Mainly insertions/deletions in homopolymer runs
• PacBio
– Higher error rate, mixture of insertions, deletions, substitutions
Mismatch by cycle
FASTA files
ASF-1.fa ASF-2.fa
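A FASTA file stores each sequence as a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such records from any iterable of lines; the function name read_fasta is hypothetical, and it assumes the record name is the first whitespace-delimited word of the header.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield name, "".join(chunks)


if __name__ == "__main__":
    example = [">chr1 toy contig", "ACGT", "TTAA", ">chr2", "GG"]
    for name, seq in read_fasta(example):
        print(name, len(seq), seq)
```

With a file on disk, the same function can be called as read_fasta(open(path)), for example on ASF-1.fa.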
FASTQ files
ASF-1.fastq ASF-2.fastq
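A FASTQ record adds a quality string to each read: a header starting with "@", the sequence, a "+" separator line, and one quality character per base. The sketch below assumes four lines per record and the Phred+33 ASCII encoding (quality = character code minus 33); older encodings and multi-line records are not handled, and read_fastq is a hypothetical name.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, qualities) from four-line FASTQ records."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding, so '!' is Q0 and 'I' is Q40.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
    for name, seq, quals in read_fastq(example):
        print(name, seq, quals)
```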
Reference-based Alignment
• Goal:
– find position in reference genome from which read was sampled
• Issues:
– the human genome is large and repetitive
– NGS instruments produce huge amounts of data
– the sequenced genome will differ from the reference due to SNPs, indels and structural variation
Choosing an Aligner
• High accuracy needed
– Misaligned reads are a source of false positive variant calls
• High sensitivity needed
– The aligner must allow for differences between the individual and reference to find the correct mapping position
• High speed needed
– With large data the informatics cost is significant
• We will use the popular aligner bwa in the tutorial
Reference alignments
[Figure: a sequence read whose position in the reference genome is unknown, followed by the same read aligned to the reference genome with three mismatches marked (x)]
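To make the idea of placing a read with mismatches concrete, here is a deliberately naive sketch that slides the read along the reference and keeps the position with the fewest mismatches. It ignores indels and the reverse strand, and it is far too slow for a real genome; production aligners such as bwa use an index of the reference instead. The function name is hypothetical.

```python
from typing import Optional, Tuple


def best_ungapped_hit(reference: str, read: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (0-based position, mismatch count) of the best ungapped placement.

    Ties are resolved in favour of the leftmost position. Returns (None, None)
    if the read is longer than the reference.
    """
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if best_mm is None or mismatches < best_mm:
            best_pos, best_mm = pos, mismatches
    return best_pos, best_mm


if __name__ == "__main__":
    reference = "GATTACAGGCTTAACG"
    read = "CAGGGTT"  # differs from the reference at one base
    print(best_ungapped_hit(reference, read))
```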
Alignment Quality
• Most aligners will estimate how reliable the alignment is with a Mapping Quality
– Phred-scaled estimate of the probability that the chosen mapping is wrong
– On average, 1 in 1000 reads with a “Q30” alignment will be placed incorrectly
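Because mapping quality is Phred-scaled, the expected number of misplaced reads in a set is the sum of the per-read error probabilities 10^(-MAPQ/10). The sketch below illustrates this arithmetic and a simple threshold filter; the function names and the threshold value are illustrative assumptions, and it takes the aligner's estimates at face value.

```python
from typing import Iterable, List


def expected_misplaced(mapqs: Iterable[int]) -> float:
    """Expected number of wrongly placed reads, given their mapping qualities."""
    return sum(10 ** (-q / 10) for q in mapqs)


def filter_by_mapq(mapqs: Iterable[int], min_mapq: int) -> List[int]:
    """Keep only mapping qualities at or above the chosen threshold."""
    return [q for q in mapqs if q >= min_mapq]


if __name__ == "__main__":
    mapqs = [30] * 1000 + [0] * 10
    print("expected misplaced, all reads:", expected_misplaced(mapqs))
    kept = filter_by_mapq(mapqs, min_mapq=20)
    print("expected misplaced, MAPQ >= 20:", expected_misplaced(kept))
```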
What are Paired Reads?
[Figure: paired-end reads (ATCAA, CTAAG) sequenced from the two ends of a DNA fragment]
Slides by M. Brudno
Paired Reads
[Figure: a sequence read pair whose position in the reference genome is unknown]
Read pair alignment
[Figure: the read pair aligned to the reference genome, with mismatches marked (x)]
Working with alignments
• SAM/BAM is a standardized format for working with read alignments
• SAM is a tab-delimited text representation
• BAM is a compressed binary representation
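Since SAM is tab-delimited text, an alignment line can be split into the eleven mandatory fields defined by the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below is a minimal parser for illustration only; the class and function names are hypothetical, optional tags are ignored, and the example record is made up. For real work a library or samtools is a better choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII encoded)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


if __name__ == "__main__":
    # Made-up record for illustration.
    line = "read1\t99\tchr1\t100\t60\t5M\t=\t300\t205\tATCAA\tIIIII"
    rec = parse_sam_line(line)
    print(rec.rname, rec.pos, rec.mapq, rec.cigar)
    print("paired:", bool(rec.flag & 0x1), "reverse strand:", bool(rec.flag & 0x10))
```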
SAM Description
[Figure: an example SAM record, with successive slides highlighting the chromosome and coordinate, the mapping quality, and the CIGAR fields]
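The CIGAR field encodes the alignment as a run of (length, operation) pairs, for example M for aligned bases, I for an insertion and D for a deletion relative to the reference. A minimal sketch of decoding it, with hypothetical function names, follows; which operations consume reference or read bases is taken from the SAM specification.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '3M1I2M' into [(3, 'M'), (1, 'I'), (2, 'M')]."""
    if cigar == "*":
        return []  # CIGAR unavailable
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (operations M, D, N, = and X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar: str) -> int:
    """Number of read bases described (operations M, I, S, = and X)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


if __name__ == "__main__":
    cigar = "3M1I2M2D4M"
    print(parse_cigar(cigar), reference_span(cigar), query_length(cigar))
```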
Resources
• samtools: toolkit for working with SAM/BAM files
– Convert between SAM/BAM
– Sort alignments
– Extract alignments for a given genomic location
• SAM/BAM specification:
http://samtools.sourceforge.net/SAM1.pdf
• Questions/Help
– https://lists.sourceforge.net/lists/listinfo/samtools-help
– http://www.biostars.org/
– http://seqanswers.com/
We are now going to start an exercise
in read mapping
What kinds of variation are there?
• Single Nucleotide Polymorphisms (SNPs)
• Short indels (< read length)
• Structural variations
– Large insertions and deletions
– Inversions
– Translocations
– Copy number variation
Structural variants
Mate-pair and paired-end reads can be used to detect structural variants
[Figure: library preparation from genomic DNA for mate-pairs and paired-ends. Steps shown: isolate internal adaptors and fragment ends; add amplification and sequencing adaptors; sequence.]
Read pair orientation
Reference genome
Read pair alignment
[Figure: histogram of fragment size (x-axis) against fragment number (y-axis)]
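The fragment size distribution gives a baseline for what a normal read pair looks like. One simple way to turn it into cutoffs is the median plus or minus a multiple of the median absolute deviation (MAD), which is robust to the outliers we are trying to find. This is only a sketch: the function names and the default of 5 MADs are assumptions, and SV callers use their own rules.

```python
from statistics import median
from typing import List, Sequence, Tuple


def insert_size_cutoffs(sizes: Sequence[int], n_mads: float = 5.0) -> Tuple[float, float]:
    """Return (low, high) cutoffs as median -/+ n_mads * MAD of the fragment sizes."""
    med = median(sizes)
    mad = median(abs(s - med) for s in sizes)
    return med - n_mads * mad, med + n_mads * mad


def discordant_sizes(sizes: Sequence[int], n_mads: float = 5.0) -> List[int]:
    """Fragment sizes that fall outside the cutoffs."""
    low, high = insert_size_cutoffs(sizes, n_mads)
    return [s for s in sizes if not low <= s <= high]


if __name__ == "__main__":
    sizes = [290, 295, 300, 300, 305, 310, 2500]  # toy data
    print(insert_size_cutoffs(sizes))
    print(discordant_sizes(sizes))
```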
SV Signatures: Deletion
[Figure: donor (don) vs. reference (ref)]
Slides by M. Brudno
SV Signatures: Insertion
[Figure: donor (don) vs. reference (ref)]
SV Signatures: Tandem Duplication
[Figure: donor (don) vs. reference (ref)]
SV Signatures: Inversion
[Figure: donor (don) vs. reference (ref)]
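The paired-read signatures for these SV types can be summarised as rules on the mapped distance and orientation of a pair. The sketch below assumes a standard paired-end library in which a concordant pair maps with the left read on the forward strand and the right read on the reverse strand (facing inward). Under that assumption, a deletion in the donor makes the pair map too far apart, an insertion makes it map too close, a tandem duplication gives an outward-facing pair, and an inversion gives a pair on the same strand. The function name, labels, and the use of leftmost coordinates as a stand-in for fragment size are all simplifications of my own, not the method of any specific caller.

```python
def classify_pair(chrom1: str, pos1: int, reverse1: bool,
                  chrom2: str, pos2: int, reverse2: bool,
                  min_span: int, max_span: int) -> str:
    """Label one read pair by the SV signature it is consistent with."""
    if chrom1 != chrom2:
        return "interchromosomal"  # e.g. a translocation
    # Order the two reads by mapping position.
    (left_pos, left_rev), (right_pos, right_rev) = sorted(
        [(pos1, reverse1), (pos2, reverse2)])
    if left_rev == right_rev:
        return "inversion"  # both reads on the same strand
    if left_rev and not right_rev:
        return "tandem_duplication"  # reads face outward
    # Simplification: distance between leftmost coordinates.
    span = right_pos - left_pos
    if span > max_span:
        return "deletion"  # maps farther apart than expected
    if span < min_span:
        return "insertion"  # maps closer together than expected
    return "concordant"


if __name__ == "__main__":
    print(classify_pair("chr1", 100, False, "chr1", 5000, True, 200, 500))
```

The min_span and max_span arguments would come from the fragment size distribution of the library.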
SV summary
Slides by M. Brudno
Where can we go wrong: missed insertion
[Figure: donor (don) vs. reference (ref); IS = insert size]
Insertions larger than the insert size cannot be detected this way
Structural Variants and Split Reads
Align
Deletion: split read signature
[Figure: donor (don) vs. reference (ref)]
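A read that spans a deletion breakpoint aligns in two pieces, and the gap between them on the reference gives the deleted interval at base-pair resolution. The sketch below shows that arithmetic under simplifying assumptions: both pieces align ungapped to the same chromosome and strand, coordinates are 0-based, and the first piece is the left part of the read. The function name is hypothetical.

```python
from typing import Optional, Tuple


def deletion_from_split(first_ref_start: int, first_len: int,
                        second_ref_start: int) -> Optional[Tuple[int, int]]:
    """Return the deleted reference interval [start, end) implied by a split read.

    first_ref_start:  reference start of the left piece of the read
    first_len:        number of bases in the left piece
    second_ref_start: reference start of the right piece of the read
    Returns None when the two pieces leave no gap on the reference.
    """
    first_ref_end = first_ref_start + first_len  # exclusive end of the left piece
    if second_ref_start <= first_ref_end:
        return None
    return first_ref_end, second_ref_start


if __name__ == "__main__":
    interval = deletion_from_split(1000, 40, 1540)
    print(interval, "size:", interval[1] - interval[0])
```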
Somatic vs. Germline
• Tumor vs. normal sequencing
• Approach 1:
– find SVs separately in the two samples
– filter out tumor SVs that overlap germline SVs
• Approach 2:
– find candidate somatic SVs in the tumor
– for each candidate, look for any type of evidence in the germline (normal) sample
– filter out anything with germline evidence
Slides by M. Brudno
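Approach 1 amounts to an interval-overlap filter between two call sets. The following is a minimal sketch, assuming each SV is reduced to a (chromosome, start, end) tuple with half-open coordinates; the function names are hypothetical, and real pipelines usually also compare SV type and allow some slack around breakpoints.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_somatic(tumor_svs: Sequence[Interval],
                   normal_svs: Sequence[Interval]) -> List[Interval]:
    """Keep tumor calls that do not overlap any call from the normal sample."""
    return [sv for sv in tumor_svs
            if not any(overlaps(sv, germline) for germline in normal_svs)]


if __name__ == "__main__":
    tumor = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    normal = [("chr1", 150, 250)]
    print(filter_somatic(tumor, normal))
```

This quadratic comparison is fine for small call sets; larger ones would benefit from sorting or an interval tree.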
Gene fusions
[Figure: Gene X on ChrA and Gene Y on ChrB joined into a fusion producing a Gene XY protein]
SV Software and Exercise
• We will use HYDRA-SV in the tutorial
– https://code.google.com/p/hydra-sv/
– Quinlan et al., Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Research
• Many others exist:
– BreakDancer, GASV, Pindel
– It is worth spending time learning multiple packages and their strengths and weaknesses
– There is rarely one program that fits all needs!
We are now going to start an exercise
in structural variant detection
Any questions?
jared.simpson@oicr.on.ca