Bioinformatics Workshops


Canadian Bioinformatics Workshops

www.bioinformatics.ca
Module 4
Mapping and Genome Rearrangement
Jared Simpson, Ph.D.

Learning Objectives of Module
• Understand mapping sequence reads to a reference
genome
• Understand file formats like FASTA, FASTQ and SAM/BAM
• Learn common terminology used to describe alignments
• Learn how paired-end reads can be used to find genome
rearrangements
• Run a mapper and rearrangement caller

Module 4 bioinformatics.ca
Sequencing platforms
[Figure: sequencing platforms arranged by increasing data per run against increasing run time. Throughput labels: 14TB/run, 600Gb/10d, 100Gb/15d, 120Gb/1d, 90Gb/10d, 150Mb/3h, 2Gb/27h, 700Mb/23h, 100Mb/1h. Proton and GridION are marked with question marks.]
Cross-platform data integration needed.

Basecalling
• How do we translate the machine data to base calls?
• How do we estimate and represent sequencing errors?

Sources of error
Illumina: Pre-phasing & Phasing

What is a base quality?
Phred quality scores:
- Estimate of probability the base call is incorrect

Base Quality    P_error(obs. base)
3               50 %
5               32 %
10              10 %
20              1 %
30              0.1 %
40              0.01 %
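The table follows from the standard Phred definition, Q = -10 * log10(P_error). A minimal sketch of the conversion in both directions (the function names are illustrative, not from the slides):

```python
import math


def phred_to_error_probability(quality):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Convert an error probability P back to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(probability)


# Recompute the rows of the table, printing each probability as a percentage.
for q in (3, 5, 10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_probability(q) * 100:.2f} %")
```

The same scale is used later for mapping quality, where the probability refers to the whole alignment being wrong rather than a single base.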

Error Profiles
• Illumina
– Low error rate (~0.5%), mainly substitutions
• 454/Ion Torrent
– Mainly insertions/deletions in homopolymer runs
• PacBio
– Higher error rate, mixture of insertions, deletions, substitutions

Mismatch by cycle

FASTA files
ASF-1.fa ASF-2.fa

• Reads are often stored in FASTA files
• Separate files for the forward and reverse reads of each pair
• Header line: ">" followed by the read identifier
• Sequence lines: nucleotides

FASTQ files

ASF-1.fastq ASF-2.fastq

• Most reads are stored in FASTQ
• 4 lines per read:
– header line: @SEQUENCE_ID
– sequence line
– line beginning with +
– encoded quality value line
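The four-line layout can be read with a few lines of Python. This is a sketch only: it assumes each record occupies exactly four lines and that qualities use the Phred+33 ASCII encoding (the slides do not say which encoding the tutorial data uses), and the record in the usage example is made up.

```python
import io


def read_fastq(handle, phred_offset=33):
    """Yield (read_id, sequence, qualities) from a 4-line-per-read FASTQ stream.

    phred_offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one Phred score.
        yield header[1:], sequence, [ord(c) - phred_offset for c in quality]


# Hypothetical single-read example.
example = io.StringIO("@read1\nACGT\n+\nII#5\n")
for read_id, sequence, qualities in read_fastq(example):
    print(read_id, sequence, qualities)
```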

Reference-based Alignment
• Goal:
– find position in reference genome from which read was sampled
• Issues:
– the human genome is large and repetitive
– NGS instruments produce huge amounts of data
– the sequenced genome will differ from the reference due to SNPs,
indels and structural variation

Choosing an Aligner
• High accuracy needed
– Misaligned reads are a source of false positive variant calls
• High sensitivity needed
– The aligner must allow for differences between the
individual and reference to find the correct mapping
position
• High speed needed
– With large data the informatics cost is significant
• We will use the popular aligner bwa in the tutorial

Reference alignments
Reference genome

Sequence read

?

Reference alignments
Reference genome

x x x

Sequence read

Alignment Quality
• Most aligners estimate how reliable an alignment is with a Mapping Quality
– Phred-scaled estimate of the probability that the chosen mapping is wrong
– On average, 1 in 1000 reads with a “Q30” alignment will be placed incorrectly

What are Paired Reads?

Paired-end Reads

DNA fragment

ATCAA CTAAG

Insert size (IS)

Slides by M. Brudno

Paired Reads
Reference genome

?
Sequence read pair

Read pair alignment
Reference genome

x x x xxxxx

Sequence read pair

Working with alignments
• SAM/BAM is a standardized format for working with read
alignments
• SAM is a tab-delimited text representation
• BAM is a compressed binary representation

SRR013667.1 99 19 8882171 60 76M = 8882214 119


NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT
GTGCAATAGACTTAT
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>
8AB685C26091:77
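The record above is one SAM line whose sequence and quality strings wrap on the slide. As a rough sketch, it can be split into the eleven mandatory SAM columns like this (column names follow the SAM specification; the helper name is illustrative):

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    record["TAGS"] = columns[11:]  # optional fields, left unparsed here
    return record


# The record from the slide, with the wrapped SEQ and QUAL lines rejoined.
example_line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT"
    "GTGCAATAGACTTAT",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>"
    "8AB685C26091:77",
])
record = parse_sam_line(example_line)
print(record["RNAME"], record["POS"], record["MAPQ"], record["CIGAR"])
```

The following slides walk through these columns one at a time.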

SAM Description

Read name Flag

SRR013667.1 99 19 8882171 60 76M = 8882214 119


NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT
GTGCAATAGACTTAT
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>
8AB685C26091:77

➞ Flag indicates the reference strand, pairing information
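The flag is a bit field, so the value 99 in the example packs several yes/no answers into one integer. A small sketch that decodes it, using the bit meanings from the SAM specification (the short names are my own labels):

```python
# (bit, meaning) pairs as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(99))
```

For flag 99 (1 + 2 + 32 + 64) the read is paired, in a proper pair, maps to the forward strand (the reverse-strand bit is not set), has a mate on the reverse strand, and is the first read of its pair.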

SAM Description

Chromosome Coordinate

SRR013667.1 99 19 8882171 60 76M = 8882214 119


NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT
GTGCAATAGACTTAT
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>
8AB685C26091:77

SAM Description

Mapping Quality

SRR013667.1 99 19 8882171 60 76M = 8882214 119


NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT
GTGCAATAGACTTAT
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>
8AB685C26091:77

SAM Description

CIGAR

SRR013667.1 99 19 8882171 60 76M = 8882214 119


NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT
GTGCAATAGACTTAT
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>
8AB685C26091:77

REF    ACGATACATAC
READ   ACGA-ACATAC
CIGAR: 4M1D6M

REF    GACA-AACC
READ   GTCATAACC
CIGAR: 4M1I4M
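To make the two examples concrete, here is a sketch that parses a CIGAR string and counts how many reference and read bases it covers. Which operations consume reference or read bases follows the SAM specification; the function names are illustrative, and the unavailable-CIGAR value "*" is deliberately rejected rather than handled.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar):
    """Turn '4M1D6M' into [(4, 'M'), (1, 'D'), (6, 'M')]."""
    operations = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches anything the pattern skipped over.
    if "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"cannot parse CIGAR string {cigar!r}")
    return operations


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


for cigar in ("4M1D6M", "4M1I4M", "76M"):
    print(cigar, reference_span(cigar), read_length(cigar))
```

A deletion lengthens the reference span relative to the read, and an insertion does the opposite, which matches the gaps drawn in the two alignments.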


SAM Description
Mate chromosome, position    Insert size

SRR013667.1 99 19 8882171 60 76M = 8882214 119


NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGAGAAAATAGGAGT
GTGCAATAGACTTAT
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>>CB?>BA:D?9>
8AB685C26091:77

ATCAA CTAAG

Insert size (IS)

Resources
• samtools: toolkit for working with SAM/BAM files
– Convert between SAM/BAM
– Sort alignments
– Extract alignments for a given genomic location

• SAM/BAM specification:
http://samtools.sourceforge.net/SAM1.pdf

• Questions/Help
– https://lists.sourceforge.net/lists/listinfo/samtools-help
– http://www.biostars.org/
– http://seqanswers.com/

We are now going to start an exercise
in read mapping

We are on a Coffee Break &
Networking Session

What kinds of variation are there?
• Single Nucleotide Polymorphisms (SNPs)
• Short indels (< read length)
• Structural variations
– Large insertions and deletions
– Inversions
– Translocations
– Copy number variation

Structural variants
Mate-pair and paired-end reads can be used to detect structural variants

[Figure: library preparation from genomic DNA.
Mate-Pairs: fragmentation (1 - 20kb) and circularization to an internal adaptor; shear; isolate internal adaptors and fragment ends; add amplification and sequencing adaptors; sequence.
Paired-Ends: fragmentation (200 – 500bp); add amplification and sequencing adaptors; sequence.]

Read pair orientation
Reference genome

Sequence read pair

• The expected orientation is one read on the forward strand and one read on the reverse strand for paired-end reads

Read pair alignment

[Figure: distribution of fragment number against fragment size]

• Fragment/insert size is determined by library preparation
• Pairs that match the expected orientation and distance are called concordant
• Discordant read pairs give evidence of structural variation
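One way to turn "expected distance" into numbers is to estimate the library's insert size range from the mapped pairs themselves. The sketch below uses the median plus or minus a multiple of the median absolute deviation; that rule, the multiplier of 3, and the insert sizes are all assumptions for illustration, not the method used by any particular caller.

```python
import statistics


def insert_size_bounds(insert_sizes, n_deviations=3.0):
    """Estimate (low, high) bounds on concordant insert sizes.

    Uses median +/- n_deviations * MAD, which tolerates a few extreme
    outliers better than mean and standard deviation would.
    """
    median = statistics.median(insert_sizes)
    mad = statistics.median([abs(size - median) for size in insert_sizes])
    return median - n_deviations * mad, median + n_deviations * mad


# Hypothetical mapped insert sizes; the last pair maps far too wide.
observed = [290, 300, 310, 295, 305, 300, 5000]
low, high = insert_size_bounds(observed)
discordant = [size for size in observed if not low <= size <= high]
print(low, high, discordant)
```

Orientation still has to be checked separately: a pair with a plausible distance but unexpected strands is also discordant.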

SV Signatures: Deletion

don

ref

Deletion signature: mapped insert size larger than expected


Slides by M. Brudno

SV Signatures: Insertion

don

ref

Insertion signature: mapped insert size smaller than expected


Slides by M. Brudno

SV Signatures: Tandem Duplication

don

ref

Tandem duplication signature: wrong orientation

SV Signatures: Inversion

don

ref

Inversion signature: wrong orientation of pairs

SV summary

Type                 Mapped Distance         Orientation
Insertion            too small               correct
Deletion             too big                 correct
Inversion            *
Tandem duplication   *
Interchromosomal     different chromosomes   N/A
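The summary can be sketched as a rule-based classifier for a single read pair. Several details here go beyond the slides and are assumptions: inversions are recognised by both reads mapping to the same strand, tandem duplications by an outward-facing pair (reverse read leftmost), distance is measured between the leftmost mapped positions, and the thresholds in the usage example are hypothetical.

```python
def classify_pair(chrom1, pos1, reverse1, chrom2, pos2, reverse2,
                  min_distance, max_distance):
    """Assign a read pair to one of the signatures in the summary table."""
    if chrom1 != chrom2:
        return "interchromosomal"
    # Order the two reads by mapped position.
    (left_pos, left_reverse), (right_pos, right_reverse) = sorted(
        [(pos1, reverse1), (pos2, reverse2)])
    if left_reverse == right_reverse:
        return "inversion"           # both reads on the same strand
    if left_reverse and not right_reverse:
        return "tandem_duplication"  # pair points outward
    distance = right_pos - left_pos
    if distance > max_distance:
        return "deletion"            # mapped distance too big
    if distance < min_distance:
        return "insertion"           # mapped distance too small
    return "concordant"


# Hypothetical library in which concordant pairs map 200 to 500 bp apart.
print(classify_pair("19", 1000, False, "19", 1300, True, 200, 500))
print(classify_pair("19", 1000, False, "19", 6000, True, 200, 500))
print(classify_pair("19", 1000, False, "7", 1300, True, 200, 500))
```

A real caller clusters many pairs that support the same event before reporting it; a single discordant pair is weak evidence on its own.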

Slides by M. Brudno

Where can we go wrong:
missed insertion

don

ref
Insertions larger than the insert size (IS) cannot be detected this way

Structural Variants and Split Reads

Paired Short Reads

Align

Most of these pairs can be aligned to the reference genome.

For some paired-end reads, one of the pair may not be mapped because it goes across the breakpoint of a structural variant. We call such reads split reads.
Slides by M. Brudno

Deletion: split read signature

don

ref

Signature: read aligns in two pieces, one on either side of the breakpoint

Somatic vs. Germline
• Tumor vs. normal sequencing
• Approach 1:
– find SVs separately in the two samples
– filter out tumor SVs that overlap SVs found in the normal (germline) sample

• Approach 2:
– find candidate somatic SVs in the tumor
– for each candidate, look for any type of evidence in the normal sample
– filter out anything with germline evidence
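Approach 1 amounts to an interval-overlap filter. A minimal sketch follows; it assumes that each SV call is reduced to a chromosome, a start, an end and a type, and that two calls match when they share chromosome and type and their intervals overlap at all. Real pipelines usually allow for breakpoint uncertainty, and the calls in the usage example are made up.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str


def overlaps(a, b):
    """True if two calls have the same chromosome and type and share a base."""
    return (a.chrom == b.chrom and a.sv_type == b.sv_type
            and a.start <= b.end and b.start <= a.end)


def somatic_candidates(tumor_svs, normal_svs):
    """Keep tumor calls that do not overlap any call from the normal sample."""
    return [sv for sv in tumor_svs
            if not any(overlaps(sv, germline) for germline in normal_svs)]


# Hypothetical calls: the first tumor deletion is also seen in the normal.
tumor = [SV("1", 100, 200, "DEL"), SV("2", 500, 900, "DEL")]
normal = [SV("1", 150, 250, "DEL")]
print(somatic_candidates(tumor, normal))
```

Approach 2 is stricter: instead of requiring a full call in the normal sample, it removes a candidate as soon as any supporting read evidence appears there.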

Slides by M. Brudno

Gene fusions

• If a linking signature connects two genes, this might indicate a gene fusion

Gene X
ChrA

Gene Y
ChrB

Gene XY Protein

SV Software and Exercise
• We will use HYDRA-SV in the tutorial
– https://code.google.com/p/hydra-sv/
– Quinlan et al, Genome-wide mapping and assembly of structural variant
breakpoints in the mouse genome. Genome Research
• Many others exist:
– BreakDancer, GASV, Pindel
– It is worth spending time learning multiple packages and their
strengths and weaknesses
– There is rarely one program that fits all needs!

We are now going to start an exercise
in structural variant detection

We are on a Coffee Break &
Networking Session

Any questions?
jared.simpson@oicr.on.ca
