Bioinformatics Workshops

This document provides an overview of Module 4 of the Canadian Bioinformatics Workshops on mapping and genome rearrangement. It discusses mapping sequence reads to a reference genome using formats like FASTA and FASTQ. It also covers using paired-end reads to find structural variations and rearrangements. The module aims to help understand common bioinformatics tasks like read mapping, variant detection, and using tools like BWA and SAM/BAM.


Canadian Bioinformatics Workshops

www.bioinformatics.ca
Module 4
Mapping and Genome Rearrangement
Jared Simpson, Ph.D.

Learning Objectives of Module
• Understand mapping sequence reads to a reference
genome
• Understand file formats like FASTA, FASTQ and SAM/BAM
• Learn common terminology used to describe alignments
• Learn how paired-end reads can be used to find genome
rearrangements
• Run a mapper and rearrangement caller

Module 4 bioinformatics.ca
Sequencing platforms
[Figure: sequencing platforms plotted by increasing data per run against increasing run time. Labelled throughputs: 100Mb/1h, 150Mb/3h, 700Mb/23h, 2Gb/27h, 120Gb/1d, 90Gb/10d, 100Gb/15d, 600Gb/10d and 14TB/run. Proton and GridION are marked with question marks. Annotation: cross-platform data integration needed.]
Basecalling
• How do we translate the machine data to base calls?
• How do we estimate and represent sequencing errors?

Sources of error
Illumina: Pre-phasing & Phasing

What is a base quality?
Phred quality scores:
– Estimate of the probability that the base call is incorrect

Base Quality    P_error(obs. base)
3               50%
5               32%
10              10%
20              1%
30              0.1%
40              0.01%
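
The table follows the standard Phred relation Q = -10 * log10(P_error). As a minimal sketch (the function names are illustrative and do not come from the slides), the conversion in both directions could look like this:

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred quality score, Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print the error probability for each base quality listed in the table.
for q in (3, 5, 10, 20, 30, 40):
    print(f"Q{q}: {phred_to_error_prob(q):.2%}")
```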

Error Profiles
• Illumina
– Low error rate (~0.5%), mainly substitutions
• 454/Ion Torrent
– Mainly insertions/deletions in homopolymer runs
• PacBio
– Higher error rate, mixture of insertions, deletions, substitutions

Mismatch by cycle

FASTA files
Example files: ASF-1.fa, ASF-2.fa

• Reads are often stored in FASTA files
• Separate files for the forward and reverse reads of each pair
• Header line: begins with > and gives the identifier
• Sequence lines: nucleotides
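
To make the header-line and sequence-line layout concrete, here is a minimal FASTA reader sketch. The function name `read_fasta` and the example records are made up for illustration. In practice an established parser is usually the better choice.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) tuples from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            # Sequence may be wrapped over several lines; collect the pieces.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-record example held in memory instead of a file.
example = io.StringIO(">read1\nATCAA\nCTAAG\n>read2\nGGCTA\n")
records = list(read_fasta(example))
print(records)
```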

FASTQ files
Example files: ASF-1.fastq, ASF-2.fastq

• Most reads are stored in FASTQ files
• 4 lines per read:
– header line: @SEQUENCE_ID
– sequence line
– line beginning with +
– encoded quality value line

Reference-based Alignment
• Goal:
– find position in reference genome from which read was sampled
• Issues:
– the human genome is large and repetitive
– NGS instruments produce huge amounts of data
– the sequenced genome will differ from the reference due to SNPs,
indels and structural variation

Choosing an Aligner
• High accuracy needed
– Misaligned reads are a source of false positive variant calls
• High sensitivity needed
– The aligner must allow for differences between the
individual and reference to find the correct mapping
position
• High speed needed
– With large data the informatics cost is significant
• We will use the popular aligner bwa in the tutorial
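
As a rough sketch of what running bwa on a read pair might involve, the snippet below builds the two command lines and optionally runs them. The choice of the `bwa mem` subcommand is an assumption (the slides name only bwa, and the tutorial may use a different bwa algorithm), and `ref.fa` and `aln.sam` are hypothetical file names.

```python
import shutil
import subprocess


def bwa_commands(reference, reads_1, reads_2):
    """Return the (index, align) command lines for a paired-end bwa run.

    Uses `bwa index` followed by `bwa mem`; the choice of `mem` is an assumption.
    """
    index_cmd = ["bwa", "index", reference]
    align_cmd = ["bwa", "mem", reference, reads_1, reads_2]
    return index_cmd, align_cmd


def run_bwa(reference, reads_1, reads_2, out_sam):
    """Index the reference, then align the read pair and write SAM to out_sam."""
    if shutil.which("bwa") is None:
        raise RuntimeError("bwa was not found on PATH")
    index_cmd, align_cmd = bwa_commands(reference, reads_1, reads_2)
    subprocess.run(index_cmd, check=True)
    # bwa mem writes SAM to standard output, so redirect it to a file.
    with open(out_sam, "w") as out:
        subprocess.run(align_cmd, stdout=out, check=True)


# Only build and show the commands here; call run_bwa(...) to execute them.
index_cmd, align_cmd = bwa_commands("ref.fa", "ASF-1.fastq", "ASF-2.fastq")
print(" ".join(index_cmd))
print(" ".join(align_cmd))
```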

Reference alignments
[Figure: a sequence read shown below the reference genome. Its mapping position is first unknown (?), then shown aligned to the reference with three mismatches marked (x).]
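
The idea of placing a read where it has the fewest mismatches can be sketched with a brute-force scan. This is only an illustration: real aligners such as bwa use an index of the reference rather than a linear scan, and they also allow insertions and deletions. The sequences and function names below are made up.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_hit(reference, read):
    """Return (position, mismatches) of the best ungapped placement of read.

    Tries every start position in the reference and keeps the first one
    with the fewest mismatches.
    """
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(reference[pos:pos + len(read)], read)
        if best_mm is None or mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


reference = "ACGTTGCATCAACTAAGGCT"
read = "ATCTACTAAG"  # differs from the reference at one base
hit = best_ungapped_hit(reference, read)
print(hit)
```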

Alignment Quality
• Most aligners will estimate how reliable the alignment is
with a Mapping Quality
– Phred-scaled estimate of the probability that the chosen
mapping is wrong
– About 1 in 1000 reads with a “Q30” alignment will be placed incorrectly
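
Because mapping quality is Phred-scaled, the expected number of misplaced reads in a set can be estimated by summing the per-read error probabilities. The sketch below shows that arithmetic and a simple quality filter. The function names and the threshold of 30 are illustrative choices, not values from the slides.

```python
def expected_misplaced(mapqs):
    """Expected number of wrongly placed reads, given Phred-scaled mapping qualities."""
    return sum(10 ** (-q / 10) for q in mapqs)


def filter_by_mapq(mapqs, min_mapq=30):
    """Keep only mapping qualities at or above a chosen threshold."""
    return [q for q in mapqs if q >= min_mapq]


# 1000 reads, each with mapping quality 30.
print(expected_misplaced([30] * 1000))
print(filter_by_mapq([0, 10, 30, 60]))
```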

What are Paired Reads?

[Figure: paired-end reads. A DNA fragment is sequenced from both ends (ATCAA ... CTAAG), and the insert size (IS) is marked.]

Slides by M. Brudno

Paired Reads
[Figure: a sequence read pair shown below the reference genome, with its mapping position unknown (?).]

Read pair alignment
[Figure: the sequence read pair aligned to the reference genome, with mismatches marked (x) in both reads.]
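
One way paired-end alignments can hint at rearrangements is by checking whether each pair maps with the expected orientation and insert size. The sketch below is a simplified illustration, not the method of any particular rearrangement caller. It assumes forward-reverse pair orientation, measures insert size as the outer distance between the reads, and uses hypothetical thresholds (`expected_insert=300`, `tolerance=100`).

```python
from dataclasses import dataclass


@dataclass
class PairAlignment:
    """Mapped positions of both reads of a pair (all field names are illustrative)."""
    chrom_1: str
    pos_1: int        # leftmost mapped coordinate of read 1
    forward_1: bool   # True if read 1 maps to the forward strand
    chrom_2: str
    pos_2: int
    forward_2: bool
    read_length: int


def classify_pair(pair, expected_insert=300, tolerance=100):
    """Label a read pair as concordant or as one of several discordant types."""
    if pair.chrom_1 != pair.chrom_2:
        return "different_chromosomes"
    # Order the two reads by position so that 'left' is upstream of 'right'.
    if pair.pos_1 <= pair.pos_2:
        left_pos, left_fwd = pair.pos_1, pair.forward_1
        right_pos, right_fwd = pair.pos_2, pair.forward_2
    else:
        left_pos, left_fwd = pair.pos_2, pair.forward_2
        right_pos, right_fwd = pair.pos_1, pair.forward_1
    # Assumed library layout: left read forward, right read reverse.
    if not (left_fwd and not right_fwd):
        return "unexpected_orientation"
    # Outer distance from the start of the left read to the end of the right read.
    span = right_pos + pair.read_length - left_pos
    if span > expected_insert + tolerance:
        return "insert_too_large"
    if span < expected_insert - tolerance:
        return "insert_too_small"
    return "concordant"


print(classify_pair(PairAlignment("chr1", 1000, True, "chr1", 1200, False, 100)))
print(classify_pair(PairAlignment("chr1", 1000, True, "chr1", 5000, False, 100)))
```

Many pairs sharing the same discordant label in one region are more informative than a single pair, since individual reads can be misaligned.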

Working with alignments
• SAM/BAM is a standardized format for working with read
alignments
• SAM is a tab-delimited text representation
• BAM is a compressed binary representation
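
To illustrate the tab-delimited layout, the sketch below splits one SAM alignment line into its 11 mandatory fields, using the field names from the SAM specification. The example record is made up, and header lines (which begin with @) are not handled. For real data, and for BAM in particular, a dedicated tool or library is the usual route.

```python
# The 11 mandatory SAM fields, in column order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one SAM alignment line into a dict of mandatory fields plus TAGS."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = columns[11:]
    return record


# Hypothetical alignment record for illustration.
line = "read1\t99\tchr1\t1000\t60\t5M\t=\t1200\t205\tATCAA\tIIIII\tNM:i:0"
record = parse_sam_line(line)
# Bit 0x10 of FLAG is set when the read maps to the reverse strand.
is_reverse = bool(record["FLAG"] & 0x10)
print(record["RNAME"], record["POS"], record["MAPQ"], is_reverse)
```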