Bioinformatics Toolbox™ User's Guide
R2014a
www.mathworks.com                    Web
comp.soft-sys.matlab                 Newsgroup
www.mathworks.com/contact_TS.html    Technical Support
suggest@mathworks.com
bugs@mathworks.com
doc@mathworks.com
service@mathworks.com
info@mathworks.com
508-647-7000 (Phone)
508-647-7001 (Fax)
The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098
For contact information about worldwide offices, see the MathWorks Web site.
Bioinformatics Toolbox™ User's Guide
© COPYRIGHT 2003–2014 by The MathWorks, Inc.
The software described in this document is furnished under a license agreement. The software may be used
or copied only under the terms of the license agreement. No part of this manual may be photocopied or
reproduced in any form without prior written consent from The MathWorks, Inc.
FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation
by, for, or through the federal government of the United States. By accepting delivery of the Program
or Documentation, the government hereby agrees that this software or documentation qualifies as
commercial computer software or commercial computer software documentation as such terms are used
or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms and
conditions of this Agreement and only those rights specified in this Agreement, shall pertain to and govern
the use, modification, reproduction, release, performance, display, and disclosure of the Program and
Documentation by the federal government (or other entity acquiring for or through the federal government)
and shall supersede any conflicting contractual terms or conditions. If this License fails to meet the
government's needs or is inconsistent in any respect with federal procurement law, the government agrees
to return the Program and Documentation, unused, to The MathWorks, Inc.
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
Revision History
September 2003    Online only
June 2004         Online only
November 2004     Online only
March 2005        Online only
May 2005          Online only
September 2005    Online only
November 2005     Online only
March 2006        Online only
May 2006          Online only
September 2006    Online only
March 2007        Online only
April 2007        Online only
September 2007    Online only
March 2008        Online only
October 2008      Online only
March 2009        Online only
September 2009    Online only
March 2010        Online only
September 2010    Online only
April 2011        Online only
September 2011    Online only
March 2012        Online only
September 2012    Online only
March 2013        Online only
September 2013    Online only
March 2014        Online only
Contents

1 Getting Started
    Bioinformatics Toolbox Product Description
        Key Features
    Product Overview
        Features
        Expected Users
    Installation
        Installing
        Required Software
        Optional Software

2 High-Throughput Sequence Analysis
    Work with Large Multi-Entry Text Files
        Overview
        What Files Can You Access?
        Before You Begin
        Create a BioIndexedFile Object to Access Your Source File
        Determine the Number of Entries Indexed By a BioIndexedFile Object
        Retrieve Entries from Your Source File
        Read Entries from Your Source File

3 Sequence Analysis
    Exploring a Nucleotide Sequence Using Command Line
        Overview of Example
    Sequence Alignment
        Overview of Example
        Find a Model Organism to Study
        Retrieve Sequence Information from a Public Database
        Search a Public Database for Related Genes
        Locate Protein Coding Sequences
        Compare Amino Acid Sequences

4 Microarray Analysis
    Managing Gene Expression Data in Objects

5 Phylogenetic Analysis
1
Getting Started
• Bioinformatics Toolbox Product Description
• Product Overview
• Installation
• Features and Functions
• Exchange Bioinformatic Data Between Excel and MATLAB
• Get Information from Web Database
Key Features
• Next Generation Sequencing analysis and browser
• Sequence analysis and visualization, including pairwise and multiple
  sequence alignment and peak detection
• Microarray data analysis, including reading, filtering, normalizing, and
  visualization
• Mass spectrometry analysis, including preprocessing, classification, and
  marker identification
• Phylogenetic tree analysis
• Graph theory functions, including interaction maps, hierarchy plots, and
  pathways
• Data import from genomic, proteomic, and gene expression files, including
  SAM, FASTA, CEL, and CDF, and from databases such as NCBI and
  GenBank
Product Overview
In this section...
• Features
• Expected Users
Features
The Bioinformatics Toolbox product extends the MATLAB environment
to provide an integrated software environment for genome and proteome
analysis. Scientists and engineers can answer questions, solve problems,
prototype new algorithms, and build applications for drug discovery and
design, genetic engineering, and biological research. An introduction to these
features will help you to develop a conceptual model for working with the
toolbox and your biological data.
The Bioinformatics Toolbox product includes many functions to help you
with genome and proteome analysis. Most functions are implemented in the
MATLAB programming language, with the source available for you to view.
This open environment lets you explore and customize the existing toolbox
algorithms or develop your own.
You can use the basic bioinformatic functions provided with this toolbox
to create more complex algorithms and applications. These robust and
well-tested functions are the functions that you would otherwise have to
create yourself.
Toolbox features and functions fall within these categories:
• Data formats and databases: Connect to Web-accessible databases
  containing genomic and proteomic data. Read and convert between
  multiple data formats.
• High-throughput sequencing: Gene expression and transcription
  factor analysis of next-generation sequencing data, including RNA-Seq
  and ChIP-Seq.
• Sequence analysis: Determine the statistical characteristics of a
  sequence, align two sequences, and multiply align several sequences.
Expected Users
The Bioinformatics Toolbox product is intended for computational biologists
and research scientists who need to develop new algorithms or implement
published ones, visualize results, and create standalone applications.
• Industry/Professional: Increasingly, drug discovery methods are being
  supported by engineering practice. This toolbox supports tool builders
  who want to create applications for the biotechnology and pharmaceutical
  industries.
• Education/Professor/Student: This toolbox is well suited for learning
  and teaching genome and proteome analysis techniques. Educators
  and students can concentrate on bioinformatic algorithms instead of
  programming basic functions such as reading and writing to files.
While the toolbox includes many bioinformatic functions, it is not intended
to be a complete set of tools for scientists to analyze their biological data.
However, the MATLAB environment is ideal for rapidly designing and
prototyping the tools you need.
Installation
In this section...
• Installing
• Required Software
• Optional Software
Installing
Install the Bioinformatics Toolbox software from a DVD or Web release
using the MathWorks Installer. For more information, see the installation
documentation.
Required Software
The Bioinformatics Toolbox software requires the following MathWorks
products to be installed on your computer.
• MATLAB
• Statistics Toolbox
Optional Software
The MATLAB and Bioinformatics Toolbox software environment is open and
extensible. In this environment you can interactively explore ideas, prototype
new algorithms, and develop complete solutions to problems in bioinformatics.
MATLAB facilitates computation, visualization, prototyping, and deployment.
Using the Bioinformatics Toolbox software with other MATLAB toolboxes and
products lets you develop advanced algorithms and solve
multidisciplinary problems.
• Parallel Computing Toolbox
• Signal Processing Toolbox
• Image Processing Toolbox
• SimBiology
• Optimization Toolbox
• Neural Network Toolbox
• Database Toolbox
• MATLAB Compiler
• MATLAB Builder NE
• MATLAB Builder JA
• MATLAB Builder EX
• Spreadsheet Link EX
from the NCBI Gene Expression Omnibus (GEO) Web site by using a single
function (getgeodata).
Get multiply aligned sequences (gethmmalignment), hidden Markov model
profiles (gethmmprof), and phylogenetic tree data (gethmmtree) from the
PFAM database.
Gene Ontology database: Load the database from the Web into
a gene ontology object (geneont.geneont). Select sections of the
ontology with methods for the geneont object (geneont.getancestors,
geneont.getdescendants, geneont.getmatrix, geneont.getrelatives),
and manipulate data with utility functions (goannotread, num2goid).
Read data from instruments: Read data generated from gene
sequencing instruments (scfread, joinseq, traceplot), mass spectrometers
(jcampread), and Agilent microarray scanners (agferead).
Reading data formats: The toolbox provides a number of functions for
reading data from common bioinformatic file formats.
• Sequence data: GenBank (genbankread), GenPept (genpeptread), EMBL
  (emblread), PDB (pdbread), and FASTA (fastaread)
• Multiply aligned sequences: ClustalW and GCG formats (multialignread)
• Gene expression data from microarrays: Gene Expression Omnibus (GEO)
  data (geosoftread), GenePix data in GPR and GAL files (gprread,
  galread), SPOT data (sptread), Affymetrix GeneChip data (affyread),
  and ImaGene results files (imageneread)
• Hidden Markov model profiles: PFAM-HMM file (pfamhmmread)
Writing data formats: The functions for getting data from the Web
include the option to save the data to a file. In addition, there is a
function to write data to a file using the FASTA format (fastawrite).
BLAST searches: Request Web-based BLAST searches (blastncbi), get
the results from a search (getblast), and read results from a previously
saved BLAST formatted report file (blastread).
Sequence Alignments
You can select from a list of analysis methods to compare nucleotide or amino
acid sequences using pairwise or multiple sequence alignment functions.
Pairwise sequence alignment: Efficient implementations of standard
algorithms such as the Needleman-Wunsch (nwalign) and Smith-Waterman
(swalign) algorithms for pairwise sequence alignment. The toolbox also
includes standard scoring matrices such as the PAM and BLOSUM
families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize
sequence similarities with seqdotplot and sequence alignment results with
showalignment.
Multiple sequence alignment: Functions for multiple sequence
alignment (multialign, profalign) and functions that support multiple
sequences (multialignread, fastaread, showalignment). There is also a
graphical interface (seqalignviewer) for viewing the results of a multiple
sequence alignment and making manual adjustments.
Multiple sequence profiles: Implementations for multiple alignment and
profile hidden Markov model algorithms (gethmmprof, gethmmalignment,
gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate,
hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).
Biological codes: Look up the letters or numeric equivalents for
commonly used biological codes (aminolookup, baselookup, geneticcode,
revgeneticcode).
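As a brief illustration of the pairwise alignment functions listed in this section, the following sketch aligns two short amino acid sequences globally and locally. The two sequences are illustrative values only, and the default scoring matrix and gap penalties are assumed.

```matlab
% Two short amino acid sequences (illustrative values only)
seq1 = 'VSPAGMASGYD';
seq2 = 'IPGKASYD';

% Global alignment (Needleman-Wunsch) with the default scoring matrix
[globalScore, globalAlignment] = nwalign(seq1, seq2);

% Local alignment (Smith-Waterman) with the default scoring matrix
[localScore, localAlignment] = swalign(seq1, seq2);

% Each alignment is a 3-row character array: the first sequence,
% the match symbols, and the second sequence
disp(globalAlignment)
disp(localAlignment)
```

A global alignment spans both sequences end to end, while a local alignment reports only the best-matching region, so the two results can differ in length.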
Phylogenetic Analysis
You can use functions for phylogenetic tree building and analysis. There is
also a GUI to draw phylograms (trees).
Phylogenetic tree data: Read and write Newick-formatted tree files
(phytreeread, phytreewrite) into the MATLAB Workspace as phylogenetic
tree objects (phytree).
Create a phylogenetic tree: Calculate the pairwise distance between
biological sequences (seqpdist), estimate the substitution rates (dnds,
dndsml), build a phylogenetic tree from pairwise distances (seqlinkage,
seqneighjoin, reroot), and view the tree in an interactive GUI that allows
you to view, edit, and explore the data (phytreeviewer or view). This GUI
also allows you to prune branches, reorder, rename, and explore distances.
Phylogenetic tree object methods: You can access the functionality
of the phytreeviewer GUI using methods for a phylogenetic tree object
(phytree). Get property values (get) and node names (getbyname). Calculate
the patristic distances between pairs of leaf nodes (pdist, weights)
and draw a phylogenetic tree object in a MATLAB Figure window as a
phylogram, cladogram, or radial treeplot (plot). Manipulate tree data by
selecting branches and leaves using a specified criterion (select, subtree)
and removing nodes (prune). Compare trees (getcanonical) and use
Newick-formatted strings (getnewickstr).
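The tree-building workflow described in this section can be sketched as follows. The four nucleotide sequences and their names are hypothetical data chosen only to keep the example small; the distance model and linkage method are one possible choice, not a recommendation.

```matlab
% Four short, equal-length nucleotide sequences (hypothetical data)
seqs  = {'ACGTACGTAC', 'ACGTTCGTAC', 'ACGAACGTTC', 'TCGTACGATC'};
names = {'A', 'B', 'C', 'D'};

% Pairwise distances using the Jukes-Cantor model
dist = seqpdist(seqs, 'Method', 'Jukes-Cantor', 'Alphabet', 'NT');

% Build a tree by average linkage (UPGMA) and label the leaves
tree = seqlinkage(dist, 'average', names);

% Query the tree object and convert it to a Newick-formatted string
numLeaves = get(tree, 'NumLeaves');
newickStr = getnewickstr(tree);
disp(newickStr)
```

From here, the tree object can be passed to plot or view for inspection.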
GPR files (gprread) and GAL files (galread). Get Gene Expression Omnibus
(GEO) data from the Web (getgeodata) and read GEO data from files
(geosoftread).
A utility function (magetfield) extracts data from one of the microarray
reader functions (gprread, agferead, sptread, imageneread).
Microarray normalization and filtering: The toolbox provides a
number of methods for normalizing microarray data, such as lowess
normalization (malowess) and mean normalization (manorm), or quantile
normalization across multiple arrays (quantilenorm). You can use filtering
functions to clean raw data before analysis (geneentropyfilter,
genelowvalfilter, generangefilter, genevarfilter), and calculate the range
and variance of values (exprprofrange, exprprofvar).
Microarray visualization: The toolbox contains routines for visualizing
microarray data. These routines include spatial plots of microarray data
(maimage, redgreencmap), box plots (maboxplot), loglog plots (maloglog),
and intensity-ratio plots (mairplot). You can also view clustered expression
profiles (clustergram, redgreencmap). You can create 2-D scatter plots of
principal components from the microarray data (mapcaplot).
Microarray utility functions: Use the following functions to work
with Affymetrix GeneChip data sets. Get library information for a probe
(probelibraryinfo), gene information from a probe set (probesetlookup),
and probe set values from CEL and CDF information (probesetvalues).
Show probe set information from NetAffx Analysis Center (probesetlink)
and plot probe set values (probesetplot).
The toolbox accesses statistical routines to perform cluster analysis and
to visualize the results, and you can view your data through statistical
visualizations such as dendrograms, classification, and regression trees.
[Figure: mass spectrometry data flow. The diagram relates an mzXML file, an mzXML structure, peak lists (centroided data), raw data, reconstructed data, and a semicontinuous signal through the functions mzxmlread, mzxml2peaks, mspeaks, msppresample, and msresample, with plots produced by msdotplot and msheatmap and the Mass Spectra Viewer opened by msviewer.]
Graph Visualization
The toolbox includes functions, objects, and methods for creating, viewing,
and manipulating graphs such as interactive maps, hierarchy plots, and
pathways. This allows you to view relationships between data.
The object constructor function (biograph) lets you create a biograph object to
hold graph data. Methods of the biograph object let you calculate the position
of nodes (dolayout), draw the graph (view), get handles to the nodes and
edges (getnodesbyid and getedgesbynodeid) to further query information,
and find relations between the nodes (getancestors, getdescendants,
and getrelatives). There are also methods that apply basic graph theory
algorithms to the biograph object.
Various properties of a biograph object let you programmatically change the
properties of the rendered graph. You can customize the node representation,
for example, drawing pie charts inside every node (CustomNodeDrawFcn). Or
you can associate your own callback functions to nodes and edges of the graph,
for example, opening a Web page with more information about the nodes
(NodeCallback and EdgeCallback).
The toolbox provides functions that build on the classification and statistical
learning tools in the Statistics Toolbox software (classify, kmeans, and
treefit).
These functions include imputation tools (knnimpute), and K-nearest neighbor
classifiers (knnclassify).
Other functions include set up of cross-validation experiments (crossvalind)
and comparison of the performance of different classification methods
(classperf). In addition, there are tools for selecting diversity and
discriminating features (rankfeatures, randfeatures).
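The cross-validation workflow described above can be sketched as follows. This is a minimal example that assumes the fisheriris sample data set from Statistics Toolbox is available and uses a 1-nearest-neighbor classifier with 10 folds; both choices are illustrative.

```matlab
% Fisher iris data: meas is 150-by-4, species is a 150-by-1 cell array
load fisheriris

% Assign each observation to one of 10 folds
indices = crossvalind('Kfold', species, 10);

% Initialize a classifier performance object with the true labels
cp = classperf(species);

for k = 1:10
    test  = (indices == k);
    train = ~test;
    % Classify the held-out fold with a K-nearest neighbor classifier
    predicted = knnclassify(meas(test,:), meas(train,:), species(train));
    % Accumulate the results for this fold
    classperf(cp, predicted, test);
end

% Overall error rate across all folds
disp(cp.ErrorRate)
```

Because crossvalind assigns folds at random, the reported error rate varies slightly from run to run.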
Data Visualization
You can visually compare pairwise sequence alignments, multiply aligned
sequences, gene expression data from microarrays, and plot nucleic acid and
protein characteristics. The 2-D and volume visualization features let you
create custom graphical representations of multidimensional data sets. You
can also create montages and overlays, and export finished graphics to an
Adobe PostScript image file or copy directly into Microsoft PowerPoint.
Toolbox software:
matlabroot\toolbox\bioinfo\biodemos\Filtered_Yeastdata.xlsm
select Macro Security from the Code group. (If the Developer tab is not
displayed on the Excel ribbon, consult Excel Help to display it.)
from DeRisi et al. Also note that cells J5, J6, J7, and J12 contain formulas
using Spreadsheet Link EX functions MLPutMatrix and MLEvalString.
Tip To view a cell's formula, select the cell, and then view the formula in
the formula bar.
2 Execute the formulas in cells J5, J6, J7, and J12, by selecting the cell,
Each of the first three cells contains a formula using the Spreadsheet Link
EX function MLPutMatrix, which creates a MATLAB variable from the
data in the spreadsheet. Cell J12 contains a formula using the Spreadsheet
Link EX function MLEvalString, which runs the Bioinformatics Toolbox
clustergram function using the three variables as input. For more
information on adding formulas using Spreadsheet Link EX functions,
see Enter Functions into Worksheet Cells in the Spreadsheet Link EX
documentation.
which was created in the Visual Basic Editor. Running this macro does the
same as the formulas in cells J5, J6, J7, and J12. Optionally, view the
Clustergram macro function by clicking the Developer tab, and then
clicking the Visual Basic button. (If the Developer tab is not on the
Excel ribbon, consult Excel Help to display it.)
For more information on creating macros using Visual Basic Editor, see
Use Spreadsheet Link EX Functions in Macros in the Spreadsheet Link
EX documentation.
4 Execute the formula in cell J17 to analyze and visualize the data:
a Select cell J17.
b Press F2.
c Press Enter.
this by editing the formulas' cell ranges to include data for only the first
30 genes:
a Select cell J5, and then press F2 to display the formula for editing.
b Select cell J6, then press F2 to display the formula for editing. Change
2 Run the formulas in cells J5, J6, J7, and J12 to analyze and visualize
Note Make sure you use the ' (transpose) symbol when plotting the data
in this step. You need to transpose the data in YAGenes so that it plots as
three genes over seven time intervals.
6 Select cell J20, and then, from the MATLAB group, select Get
MATLAB figure.
The figure is added to the spreadsheet.
Function.
2 Define the getpubmed function, its input arguments, and return values by
typing:
function pmstruct = getpubmed(searchterm,varargin)
% GETPUBMED Search PubMed database & write results to MATLAB structure
3 Add code to do some basic error checking for the required input SEARCHTERM.
5 Add code to parse the two property name/property value pairs if provided
as input.
% Parsing the property name/value pairs
num_argin = numel(varargin);
for n = 1:2:num_argin
    arg = varargin{n};
    switch lower(arg)
        % If NUMBEROFRECORDS is passed, set MAXNUM
        case 'numberofrecords'
            maxnum = varargin{n+1};
        % If DATEOFPUBLICATION is passed, set PUBDATE
        case 'dateofpublication'
            pubdate = varargin{n+1};
    end
end
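The parsing pattern in this step can be tried outside a function by simulating the contents of varargin with a cell array. The sketch below is a standalone variant, not the code from getpubmed.m: the default values, the pair-count check, the otherwise branch, and the error identifiers are all assumptions added for illustration.

```matlab
% Simulated contents of varargin for a call such as
% getpubmed(term, 'NumberOfRecords', 10, 'DateOfPublication', '2014')
args = {'NumberOfRecords', 10, 'DateOfPublication', '2014'};

% Default values (assumed for illustration)
maxnum  = 50;
pubdate = '';

% Property names and values must be supplied in pairs
if mod(numel(args), 2) ~= 0
    error('getpubmed:IncorrectNumberOfArguments', ...
          'Property names and values must be supplied in pairs.');
end

for n = 1:2:numel(args)
    switch lower(args{n})
        case 'numberofrecords'
            maxnum = args{n+1};
        case 'dateofpublication'
            pubdate = args{n+1};
        otherwise
            % Reject names that are not recognized (hypothetical check)
            error('getpubmed:UnknownProperty', ...
                  'Unknown property name: %s.', args{n});
    end
end

fprintf('maxnum = %d, pubdate = %s\n', maxnum, pubdate);
```

Lowercasing the property name before the switch makes the match case insensitive, so 'NumberOfRecords' and 'numberofrecords' are treated the same way.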
6 You access the PubMed database through a search URL, which submits
a search term and options, and then returns the search results in a
specified format. This search URL consists of a base URL and defined
parameters. Create a variable containing the base URL of the PubMed
database on the NCBI Web site.
% Create base URL for PubMed db site
baseSearchURL = 'http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';
7 Create variables to contain five defined parameters that the getpubmed
function will use, namely, db (database), term (search term), report (report
type, such as MEDLINE), format (format type, such as text), and dispmax
(maximum number of records to display).
% Set db parameter to pubmed
dbOpt = '&db=pubmed';
% Set term parameter to SEARCHTERM and PUBDATE
% (Default PUBDATE is '')
termOpt = ['&term=',searchterm,'+AND+',pubdate];
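To show how the base URL and the five parameters fit together, the following sketch assembles a complete search URL. The example inputs are hypothetical, and the strings for the report, format, and dispmax parameters are assumptions that follow the same pattern as the db and term parameters above.

```matlab
% Example inputs (hypothetical values)
searchterm = 'bioinformatics';
pubdate    = '2014';
maxnum     = 5;

% Base URL and parameters as defined in the steps above
baseSearchURL = 'http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';
dbOpt   = '&db=pubmed';
termOpt = ['&term=', searchterm, '+AND+', pubdate];

% Remaining parameters (assumed strings, following the same pattern)
reportOpt = '&report=medline';
formatOpt = '&format=text';
maxOpt    = ['&dispmax=', num2str(maxnum)];

% Concatenate the base URL and the five parameters
searchURL = [baseSearchURL, dbOpt, termOpt, reportOpt, formatOpt, maxOpt];
disp(searchURL)
```

Each parameter begins with an ampersand, so the pieces can be concatenated in any order after the base URL.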
results, and return the results (as text in the MEDLINE report type) in
medlineText, a character array.
medlineText = urlread(searchURL);
10 Use the MATLAB regexp function and regular expressions to parse and
extract the information in medlineText into hits, a cell array, where each
cell contains the MEDLINE-formatted text for one article. The first input
is the character array to search, the second input is a search expression,
which tells the regexp function to find all records that start with PMID-,
while the third input, 'match', tells the regexp function to return the
actual records, rather than the positions of the records.
hits = regexp(medlineText,'PMID-.*?(?=PMID|</pre>$)','match');
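The effect of this expression can be checked on a small stand-in for the downloaded text. The two records below are hypothetical and contain only a PMID and a title line each.

```matlab
% A small stand-in for the text returned by the search (hypothetical records)
medlineText = sprintf(['<pre>\nPMID- 111\nTI  - First title\n\n' ...
                       'PMID- 222\nTI  - Second title\n</pre>']);

% Split the text into one cell per record. Each record starts at PMID-
% and ends just before the next PMID or the closing </pre> tag.
hits = regexp(medlineText, 'PMID-.*?(?=PMID|</pre>$)', 'match');

% Extract the identifier from the first record
firstID = regexp(hits{1}, '(?<=PMID- ).*?(?=\n)', 'match', 'once');

fprintf('%d records found, first PMID is %s\n', numel(hits), firstID);
```

The lazy quantifier .*? stops each match at the earliest point where the lookahead succeeds, which keeps consecutive records from being merged into one match.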
11 Instantiate the pmstruct structure returned by getpubmed to contain six
fields.
pmstruct = struct('PubMedID','','PublicationDate','','Title','',...
'Abstract','','Authors','','Citation','');
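12 Use the MATLAB regexp function and regular expressions to loop through
each article in hits and extract the PubMed ID, publication date, title,
abstract, authors, and citation into the pmstruct structure array.
for n = 1:numel(hits)
    pmstruct(n).PubMedID = regexp(hits{n},'(?<=PMID- ).*?(?=\n)','match', 'once');
    pmstruct(n).PublicationDate = regexp(hits{n},'(?<=DP  - ).*?(?=\n)','match', 'once');
    pmstruct(n).Title = regexp(hits{n},'(?<=TI  - ).*?(?=PG  -|AB  -)','match', 'once');
    pmstruct(n).Abstract = regexp(hits{n},'(?<=AB  - ).*?(?=AD  -)','match', 'once');
    pmstruct(n).Authors = regexp(hits{n},'(?<=AU  - ).*?(?=\n)','match');
    pmstruct(n).Citation = regexp(hits{n},'(?<=SO  - ).*?(?=\n)','match', 'once');
end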
When you are done, your file should look similar to the getpubmed.m
file included with the Bioinformatics Toolbox software. The sample
getpubmed.m file, including help, is located at:
matlabroot\toolbox\bioinfo\biodemos\getpubmed.m
2
High-Throughput Sequence Analysis
• Work with Large Multi-Entry Text Files
• Manage Short-Read Sequence Data in Objects
• Store and Manage Feature Annotations in Objects
• Visualize and Investigate Short-Read Alignments
• Identifying Differentially Expressed Genes from RNA-Seq Data
• Exploring Protein-DNA Binding Sites from Paired-End ChIP-Seq Data
• Exploring Genome-wide Differences in DNA Methylation Profiles
Overview
Many biological experiments produce huge data files that are difficult to
access due to their size, which can cause memory issues when reading the
file into the MATLAB Workspace. You can construct a BioIndexedFile
object to access the contents of a large text file containing nonuniform size
entries, such as sequences, annotations, and cross-references to data sets.
The BioIndexedFile object lets you quickly and efficiently access this data
without loading the source file into memory.
You can use the BioIndexedFile object to access individual entries or a
subset of entries when the source file is too big to fit into memory. You can
access entries using indices or keys. You can read and parse one or more
entries using provided interpreters or a custom interpreter function.
Use the BioIndexedFile object in conjunction with your large source file to:
• Access a subset of the entries for validation or further analysis.
• Parse entries using a custom interpreter function.
Tip If memory is not a concern when accessing your source file, consider
using an appropriate read function instead, such as genbankread for
importing data from GenBank files.
Additionally, several read functions such as fastaread, fastqread, samread,
and sffread include a Blockread property, which lets you read a subset of
entries from a file, thus saving memory.
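As a short sketch of block reading, the following call reads only entries 5 through 10 of a FASTQ file. It assumes that the sample file SRR005164_1_50.fastq is on the MATLAB path; substitute any FASTQ file you have.

```matlab
% Read only entries 5 through 10 of a FASTQ file
% (the sample file name is an assumption; use your own file if needed)
reads = fastqread('SRR005164_1_50.fastq', 'Blockread', [5 10]);

% reads is a structure array with one element per entry read
disp(numel(reads))
disp(reads(1).Header)
```

Only the requested block is parsed into the structure, so memory use depends on the size of the block rather than the size of the file.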
your source file, use the yeastgenes.sgd file, which is included with the
Bioinformatics Toolbox software.
sourcefile = which('yeastgenes.sgd');
2 Use the BioIndexedFile constructor function to construct a
BioIndexedFile object from the source file, which is in a multi-row table
('mrtab') format. Save the index file in the Current Folder. Indicate that
the source file keys are in column 3. Also, indicate that the header lines in
the source file are prefaced with !, so the constructor ignores them.
gene2goObj = BioIndexedFile('mrtab', sourcefile, '.', ...
'KeyColumn', 3, 'HeaderPrefix','!')
There are two ways to set the Interpreter property of the BioIndexedFile
object:
When constructing the BioIndexedFile object, use the Interpreter
property name/property value pair
Example
Because the entry keys are gene names, you can quickly find all the gene
ontology (GO) terms associated with a particular gene:
1 Set the Interpreter property of the gene2goObj BioIndexedFile object
to a handle to a function that reads entries and returns only the column
containing the GO term. In this case the interpreter is a handle to an
anonymous function that accepts strings and extracts strings that start
with the characters GO.
gene2goObj.Interpreter = @(x) regexp(x,'GO:\d+','match')
2 Read only the entries that have a key of YAT2, and return their GO terms.
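To see what the interpreter returns, you can apply the same anonymous function to a sample string. The two tab-delimited lines below are hypothetical text standing in for entries that share one key; they are not copied from yeastgenes.sgd.

```matlab
% Interpreter from step 1: return every substring of the form GO:<digits>
interpreter = @(x) regexp(x, 'GO:\d+', 'match');

% Two tab-delimited lines standing in for entries that share one key
% (hypothetical text for illustration)
entries = sprintf(['SGD\tS000000001\tYAT2\tGO:0005737\tIDA\n' ...
                   'SGD\tS000000001\tYAT2\tGO:0004092\tISS\n']);

% The interpreter returns a cell array of GO terms
goTerms = interpreter(entries);
disp(goTerms)
```

With the object from the earlier steps, a call such as read(gene2goObj, 'YAT2') would apply the interpreter to the entries stored under that key; the key-based form of read is stated here as an assumption about the method's syntax.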
Overview
High-throughput sequencing instruments produce large amounts of short-read
sequence data that can be challenging to store and manage. Using objects to
contain this data lets you easily access, manipulate, and filter the data.
Bioinformatics Toolbox includes two objects for working with short-read
sequence data.
Object     Contains                            Construct from
BioRead    Sequence headers, read sequences    FASTQ file, SAM file
BioMap     Sequence headers, read sequences    SAM file, BAM file
However, you can modify object properties. When you construct a BioRead
object from a FASTQ structure or cell arrays, the data is read into memory.
When you construct a BioRead object from a FASTQ- or SAM-formatted file,
use the InMemory name-value pair argument to read the data into memory.
The constructor function constructs a BioRead object and, if an index file does
not already exist, it also creates an index file with the same file name, but
with an .IDX extension. This index file, by default, is stored in the same
location as the source file.
Caution
After constructing a BioRead object, do not modify the index file, or you
can get invalid results when using the existing object or constructing new
objects.
If you modify the source file, delete the index file, so the object constructor
creates a new index file when constructing new objects.
Note Because you constructed this BioRead object from a source file, you
cannot modify the properties (except for Name) of the BioRead object.
your source file, determine them using the saminfo or baminfo function
and the ScanDictionary name-value pair argument.
samstruct = saminfo('ex2.sam', 'ScanDictionary', true);
samstruct.ScannedDictionary
ans =
'seq1'
'seq2'
Tip The previous syntax scans the entire SAM file, which is time
consuming. If you are confident that the Header information of the SAM
file is correct, omit the ScanDictionary name-value pair argument, and
inspect the SequenceDictionary field instead.
2 Use the BioMap constructor function to construct a BioMap object from
the SAM file and set the Name property. Because the SAM-formatted file
in this example, ex2.sam, contains multiple reference sequences, use the
SelectRef name-value pair argument to specify one reference sequence,
seq1:
BMObj2 = BioMap('ex2.sam', 'SelectRef', 'seq1', 'Name', 'MyObject')
BMObj2 =
  BioMap with properties:
    SequenceDictionary: 'seq1'
             Reference: [1501x1 File indexed property]
             Signature: [1501x1 File indexed property]
                 Start: [1501x1 File indexed property]
        MappingQuality: [1501x1 File indexed property]
                  Flag: [1501x1 File indexed property]
          MatePosition: [1501x1 File indexed property]
               Quality: [1501x1 File indexed property]
              Sequence: [1501x1 File indexed property]
                Header: [1501x1 File indexed property]
                 NSeqs: 1501
                  Name: 'MyObject'
The constructor function constructs a BioMap object and, if index files do not
already exist, it also creates one or two index files:
• If constructing from a SAM-formatted file, it creates one index file that has
  the same file name as the source file, but with an .IDX extension. This
  index file, by default, is stored in the same location as the source file.
• If constructing from a BAM-formatted file, it creates two index files that
  have the same file name as the source file, but one with a .BAI extension
  and one with a .LINEARINDEX extension. These index files, by default,
  are stored in the same location as the source file.
Caution
After constructing a BioMap object, do not modify the index files, or you
can get invalid results when using the existing object or constructing new
objects.
If you modify the source file, delete the index files, so the object constructor
creates new index files when constructing new objects.
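The caution above amounts to a simple housekeeping rule: after the source file changes, remove any index files that sit next to it. A minimal sketch follows, assuming the index files are named by appending .idx, .bai, or .linearindex to the full source file name. That naming pattern, the file name reads.sam, and the scratch folder are assumptions for illustration, so confirm the actual file names in your own folder before deleting anything.

```matlab
% Work in a scratch folder so that no real data is touched.
workDir = tempname;
mkdir(workDir);
sourceFile = fullfile(workDir, 'reads.sam');      % hypothetical source file

% Assumed index file names; verify against the files on your system.
indexFiles = strcat(sourceFile, {'.idx', '.bai', '.linearindex'});

% Create empty placeholders for the source file and one stale index file.
fclose(fopen(sourceFile, 'w'));
fclose(fopen(indexFiles{1}, 'w'));

% After modifying the source file, delete whichever index files exist.
for k = 1:numel(indexFiles)
    if exist(indexFiles{k}, 'file') == 2
        delete(indexFiles{k});
    end
end

% Logical vector that is true for any index file still present.
remaining = cellfun(@(f) exist(f, 'file') == 2, indexFiles);
```

The source file itself is left in place, so the next object construction can rebuild its index files.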
Note Because you constructed this BioMap object from a source file, you
cannot modify the properties (except for Name and Reference) of the BioMap
object.
file:
SAMStruct = samread('ex2.sam');
2 To construct a valid BioMap object from a SAM-formatted file, the file must
contain only one reference sequence. Determine the number and names
of the reference sequences in your SAM-formatted file using the unique
function to find unique names in the ReferenceName field of the structure:
unique({SAMStruct.ReferenceName})
ans =
'seq1'
'seq2'
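The same pattern can be tried without a SAM file. The sketch below builds a small synthetic structure array (a stand-in for the samread output, with made-up values) and checks whether more than one reference name is present.

```matlab
% Synthetic stand-in for the structure array that samread returns.
SAMStruct = struct('ReferenceName', {'seq1', 'seq2', 'seq1', 'seq2', 'seq1'});

% Collect the field across all elements into a cell array, then deduplicate.
refNames = unique({SAMStruct.ReferenceName});

% More than one name means a single reference must be selected.
hasMultipleRefs = numel(refNames) > 1;
```

The curly braces around SAMStruct.ReferenceName gather the comma-separated list of field values into one cell array, which is the form that unique expects for text.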
    SequenceDictionary: {'seq1'}
             Reference: {1501x1 cell}
             Signature: {1501x1 cell}
                 Start: [1501x1 uint32]
        MappingQuality: [1501x1 uint8]
                  Flag: [1501x1 uint16]
          MatePosition: [1501x1 uint32]
               Quality: {1501x1 cell}
              Sequence: {1501x1 cell}
                Header: {1501x1 cell}
                 NSeqs: 1501
                  Name: ''
This syntax returns a cell array containing the headers for all elements in the
BioRead object.
Similarly, to retrieve all start positions of aligned read sequences from a
BioMap object, use the Start property of the object:
allStarts = BMObj1.Start;
This syntax returns a vector containing the start positions of aligned read
sequences with respect to the position numbers in the reference sequence in
a BioMap object.
This syntax returns a cell array containing all start positions and header
information of a BioMap object.
Note Property names are case sensitive.
For a list and description of all properties of a BioRead object, see BioRead
class. For a list and description of all properties of a BioMap object, see BioMap
class.
This syntax returns a new BioRead object containing the first 10 elements in
the original BioRead object.
For example, to retrieve the first 12 positions of sequences with headers
SRR005164.1, SRR005164.7, and SRR005164.16, use the getSubsequence
method:
subSeqs = getSubsequence(BRObj1, ...
{'SRR005164.1', 'SRR005164.7', 'SRR005164.16'}, [1:12]')
subSeqs =
'TGGCTTTAAAGC'
'CCCGAAAGCTAG'
'AATTTTGCGGCT'
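Conceptually, extracting the same positions from several sequences is plain indexing applied to each one. The sketch below illustrates the idea on a cell array of made-up sequences. It is not how getSubsequence is implemented, and the variable names are hypothetical.

```matlab
% Made-up reads; in the guide these come from a BioRead object.
seqs = {'TGGCTTTAAAGCAGAA', 'CCCGAAAGCTAGGGTT', 'AATTTTGCGGCTAACC'};
positions = 1:12;

% Index every sequence with the same position vector.
subSeqs = cellfun(@(s) s(positions), seqs, 'UniformOutput', false);
```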
To provide custom headers for sequences of interest (in this case sequences 1
to 5), do the following:
BRObj1.Header(1:5) = {'H1', 'H2', 'H3', 'H4', 'H5'};
Several other specialized set methods let you set the properties of a subset of
elements in a BioRead or BioMap object.
Note Method names are case sensitive.
For a complete list and description of methods of a BioRead object, see
BioRead class. For a complete list and description of methods of a BioMap
object, see BioMap class.
For example, you can compute the number, indices, and start positions of
the read sequences that align within the first 25 positions of the reference
sequence. To do so, use the getCounts, getIndex, and getStart methods:
Cov = getCounts(BMObj1, 1, 25)
Cov =
12
Indices = getIndex(BMObj1, 1, 25)
Indices =
1
2
3
4
5
6
7
8
9
10
11
12
startPos = getStart(BMObj1, Indices)
startPos =
1
3
5
6
9
13
13
15
18
22
22
24
The first two syntaxes return the number and indices of the read sequences
that align within the specified region of the reference sequence. The last
syntax returns a vector containing the start position of each aligned read
sequence, corresponding to the position numbers of the reference sequence.
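The relationship between the three results can be sketched with ordinary vectors. The example below assumes that a read counts as aligned within a region when at least one of its bases overlaps the region, and it assumes a fixed read length of 35. Both are assumptions for illustration, and the last two start positions are made up so that some reads fall outside the region.

```matlab
% Start positions from the output above; the read length of 35 is an
% assumption made only for this sketch.
startPos = [1 3 5 6 9 13 13 15 18 22 22 24 30 41]';   % last two are made up
readLen  = 35;
stopPos  = startPos + readLen - 1;

% A read overlaps [x1, x2] if it starts at or before x2 and ends at or after x1.
x1 = 1;
x2 = 25;
overlaps = startPos <= x2 & stopPos >= x1;

nReads  = nnz(overlaps);      % analogous to getCounts
indices = find(overlaps);     % analogous to getIndex
starts  = startPos(indices);  % analogous to getStart
```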
For example, you can also compute the number of the read sequences that
align to each of the first 10 positions of the reference sequence. For this
computation, use the getBaseCoverage method:
Cov = getBaseCoverage(BMObj1, 1, 10)
Cov =
1
Indices =
1
2
3
4
5
Return the headers of the read sequences that align to a specific region of
the reference sequence:
alignedHeaders = getHeader(BMObj2, Indices)
alignedHeaders =
'B7_591:4:96:693:509'
'EAS54_65:7:152:368:113'
'EAS51_64:8:5:734:57'
'B7_591:1:289:587:906'
'EAS56_59:8:38:671:758'
BMObj2 = BioMap('ex1.sam');
2 Use the filterByFlag method to create a logical vector indicating the read
sequences in a BioMap object that are mapped in a proper pair, that is, both
the read sequence and its mate are mapped to the reference sequence.
LogicalVec_paired = filterByFlag(BMObj2, 'pairedInMap', true);
3 Use this logical vector and the getSubset method to create a new BioMap
object containing only the read sequences that are mapped in a proper pair.
filteredBMObj_2 = getSubset(BMObj2, LogicalVec_paired);
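In the SAM format, the proper-pair status is stored in the second bit (value 2) of the flag field. The sketch below shows the equivalent bit test on a handful of made-up flag values. It illustrates the idea only and does not reproduce what filterByFlag does internally.

```matlab
% Example SAM flag values (made up for this sketch).
flags = uint16([99 147 73 137 83 163]);

% Bit 2 (value 2) is set when the read is mapped in a proper pair.
isProperPair = logical(bitget(flags, 2));

% Logical indexing keeps only the properly paired entries.
properFlags = flags(isProperPair);
```

A logical vector of this kind plays the same role as LogicalVec_paired when passed to getSubset.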
GTFAnnotObj = GTFAnnotation('hum37_2_1M.gtf')
GTFAnnotObj =
GTFAnnotation with properties:
FieldNames: {1x11 cell}
NumEntries: 308
Columns 1 through 6
'Reference'    'Start'    'Stop'    'Feature'    'Source'    'Score'
Columns 7 through 9
'Strand'    'Frame'    'Attributes'
GTFAnnotObj.FieldNames
ans =
Columns 1 through 6
'Reference'    'Start'    'Stop'    'Feature'    'Gene'    'Transcript'
Columns 7 through 11
'Source'    'Score'    'Strand'    'Frame'    'Attributes'
Determine the range of the reference sequences that are covered by feature
annotations by using the getRange method with the annotation object
constructed in the previous section:
range = getRange(GFFAnnotObj)
range =
3631
498516
Starts = AnnotStruct.Start;
Extract the start positions for annotations 12 through 17. Notice that you
must use square brackets when indexing a range of positions:
Starts_12_17 = [AnnotStruct(12:17).Start]
Starts_12_17 =
4706
5174
5174
5439
5439
5631
Extract the start position and the feature for the 12th annotation:
Start_12 = AnnotStruct(12).Start
Start_12 =
4706
Feature_12 = AnnotStruct(12).Feature
Feature_12 =
CDS
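The need for square brackets follows from how MATLAB expands a field across a structure array. The sketch below uses a small synthetic structure array (values made up, field names as above) to show the difference.

```matlab
% Synthetic annotation structure array with the same field names.
AnnotStruct = struct( ...
    'Start',   {4706, 5174, 5174, 5439}, ...
    'Feature', {'CDS', 'exon', 'CDS', 'exon'});

% Without brackets, a range produces a comma-separated list, and only the
% first value is assigned.
firstOnly = AnnotStruct(2:4).Start;

% Square brackets concatenate the list into one numeric vector.
allStarts = [AnnotStruct(2:4).Start];

% Curly braces do the same for text fields, producing a cell array.
features = {AnnotStruct(2:4).Feature};
```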
GTFAnnotObj = GTFAnnotation('hum37_2_1M.gtf');
2 Use the getReferenceNames method to return the names for the reference
annotation object:
featureNames = getFeatureNames(GTFAnnotObj)
featureNames =
'CDS'
'exon'
'start_codon'
'stop_codon'
4 Use the getGeneNames method to retrieve a list of the unique gene names
'uc002qwf.2'
'uc002qwg.2'
'uc002qwh.2'
'uc002qwi.3'
'uc002qwk.2'
'uc002qwl.2'
'uc002qwm.1'
'uc002qwn.1'
'uc002qwo.1'
'uc002qwp.2'
'uc002qwq.2'
'uc010ewe.2'
'uc010ewf.1'
'uc010ewg.2'
'uc010ewh.1'
'uc010ewi.2'
'uc010yim.1'
Filter Annotations
Use the getData method to filter the annotations and create a structure
containing only the annotations of interest, which are annotations that are
exons associated with the uc002qvv.2 gene on chromosome 2.
AnnotStruct = getData(GTFAnnotObj,'Reference','chr2',...
'Feature','exon','Gene','uc002qvv.2')
AnnotStruct =
12x1 struct array with fields:
Reference
Start
Stop
Feature
Gene
Transcript
Source
Score
Strand
Frame
Attributes
Then use the range for the annotations of interest as input to the getCounts
method of a BioMap object. This returns the counts of short reads aligned to
the annotations of interest.
counts = getCounts(BMObj3,StartPos,EndPos,'independent', true)
counts =
1399
1
54
221
97
125
0
1
0
65
9
12
Browser Displaying Reference Track, One Alignment Track, and One Annotation Track
Tip You can use the getgenbank function with the ToFile and SequenceOnly
name-value pair arguments to retrieve a reference sequence from the
GenBank database and save it to a FASTA-formatted file.
Tip If you do not have index files (IDX or BAI and LINEARINDEX) stored
in the same location as your source file, and your source file is stored in a
location to which you do not have write access, you cannot import data from
the source file directly into the browser. Instead, construct a BioMap object
from the source file using the IndexDir name-value pair argument, and
then import the BioMap object into the browser.
To import short-read alignment data:
1 Select File > Add Data from File or File > Import Alignment Data
Reference dialog box, select a reference or scan the file for available
references and their mapped reads counts. Click OK.
4 Repeat the previous steps to import additional data sets.
click Open.
3 Repeat the previous steps to import additional annotations.
Tip Use the left and right arrow keys to pan in one base pair (bp) increments.
Note The browser computes coverage at the base pair resolution, instead
of binning, even when zoomed out.
To change the percent coverage displayed, click anywhere in the alignment
track, and then edit the Alignment Coverage settings.
Tip Set Max to a value greater than 100, if needed, when comparing the
coverage of multiple tracks of reads.
Limit the depth of the reads displayed in the pileup view by setting the
Maximum display read depth in the Alignment Pileup settings.
Tip Limiting the depth of short reads in the pileup view does not change the
counts displayed in the coverage view.
Flag Reads
Click anywhere in an alignment track to display the Alignment Pileup
settings.
In addition to the base Phred quality information that displays in the tooltip,
you can visualize quality differences by using the Shade mismatch bases
by Phred quality settings.
In the prostate cancer study, the prostate cancer cell line LNCap was treated
with androgen/DHT. Mock-treated and androgen-stimulated LNCap cells
were sequenced using the Illumina 1G Genome Analyzer [1]. For the
mock-treated cells, there were four lanes totaling ~10 million reads. For the
DHT-treated cells, there were three lanes totaling ~7 million reads. All
replicates were technical replicates. Samples labeled s1 through s4 are from
mock-treated cells. Samples labeled s5, s6, and s8 are from DHT-treated
cells. The read sequences are stored in FASTA files. The sequence IDs break
down as follows: seq_(unique sequence id)_(number of times this sequence
was seen in this lane).
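If you need the two numeric fields of a sequence ID, a regular expression is one way to pull them apart. The sketch below assumes the IDs follow exactly the pattern described above. The example ID is made up.

```matlab
header = 'seq_10458_3';   % made-up ID following the pattern described above

% Capture the two numeric fields: the sequence ID and the lane count.
tokens = regexp(header, '^seq_(\d+)_(\d+)$', 'tokens', 'once');
seqId  = str2double(tokens{1});
nSeen  = str2double(tokens{2});
```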
This example assumes that you have:
(1) Downloaded and uncompressed the seven FASTA files (s1.fa, s2.fa,
s3.fa, s4.fa, s5.fa, s6.fa, and s8.fa) containing the raw, 35 bp, unmapped
short reads from the authors' Web site.
(2) Produced a SAM-formatted file for each of the seven FASTA files by
mapping the short reads to NCBI version 37 of the human genome using a
mapper such as Bowtie [2].
(3) Ordered the SAM-formatted files by reference name first, then by genomic
position.
For the published version of this example, 4,388,997 short reads were mapped
using the Bowtie aligner [2]. The aligner was instructed to report one best
valid alignment. No more than two mismatches were allowed for alignment.
Reads with more than one reportable alignment were suppressed, i.e. any
read that mapped to multiple locations was discarded. The alignment was
output to seven SAM files (s1.sam, s2.sam, s3.sam, s4.sam, s5.sam, s6.sam
and s8.sam). Because the input files were FASTA files, all quality values
were assumed to be 40 on the Phred quality scale [2]. We then used SAMtools
[3] to sort the mapped reads in the seven SAM files, one for each replicate.
Creating an Annotation Object of Target Genes
genes =
GFFAnnotation with properties:
FieldNames: {1x9 cell}
NumEntries: 21184
Create a subset with the genes present in chromosomes only (without contigs).
The resulting GFFAnnotation object contains 20012 annotated protein-coding
genes in the Ensembl database.
chrs = {'1','2','3','4','5','6','7','8','9','10','11','12','13','14',...
'15','16','17','18','19','20','21','22','X','Y','MT'};
genes = getSubset(genes,'reference',chrs)
genes =
GFFAnnotation with properties:
FieldNames: {1x9 cell}
NumEntries: 20012
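The effect of subsetting by reference can be pictured as a membership test on the reference names. The sketch below uses ismember on a short made-up list of names. It illustrates the selection logic only and is not the getSubset implementation.

```matlab
% Made-up reference names: three chromosomes and two contig-style names.
refs = {'1'; 'GL000191.1'; 'X'; 'MT'; 'GL000192.1'};
chrs = [arrayfun(@num2str, 1:22, 'UniformOutput', false), {'X', 'Y', 'MT'}];

% Keep only the entries whose reference name is in the chromosome list.
keep = ismember(refs, chrs);
chromosomeRefs = refs(keep);
```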
Copy the gene information into a structure and display the first entry.
getData(genes,1)
ans =
     Reference: '1'
         Start: 205111632
          Stop: 205180727
       Feature: 'DSTYK'
        Source: 'protein_coding'
         Score: '0.0'
        Strand: '-'
         Frame: '.'
    Attributes: ''
The sorted SAM files in this data set are on the order of 250-360 MB each.
You can access the mapped reads in s1.sam by creating a BioMap. BioMap has
an interface that provides direct access to the mapped short reads stored in
the SAM-formatted file, thus minimizing the amount of data that is actually
loaded into memory.
bm = BioMap('s1.sam')
bm =
BioMap with properties:
    SequenceDictionary: {1x25 cell}
             Reference: [458367x1 File indexed property]
             Signature: [458367x1 File indexed property]
                 Start: [458367x1 File indexed property]
        MappingQuality: [458367x1 File indexed property]
                  Flag: [458367x1 File indexed property]
          MatePosition: [458367x1 File indexed property]
               Quality: [458367x1 File indexed property]
              Sequence: [458367x1 File indexed property]
                Header: [458367x1 File indexed property]
                 NSeqs: 458367
                  Name: ''
Use the getSummary method to obtain a list of the existing references and the
actual number of short reads mapped to each one. Observe that the order of
the references is equivalent to the previously created cell string chrs.
getSummary(bm)
BioMap summary:
                                  Name: ''
                        Container_Type: 'Data is file indexed.'
             Total_Number_of_Sequences: 458367
    Number_of_References_in_Dictionary: 25

                                      Number_of_Sequences    Genomic_Range
    gi|224589800|ref|NC_000001.10|    39037                  564571 2492
    gi|224589811|ref|NC_000002.11|    23102                  39107
    gi|224589815|ref|NC_000003.11|    23788                  578280
    gi|224589816|ref|NC_000004.11|    16273                  56044
    gi|224589817|ref|NC_000005.9|     20875                  50342
    gi|224589818|ref|NC_000006.11|    16743                  277774
    gi|224589819|ref|NC_000007.13|    17022                  146474
    gi|224589820|ref|NC_000008.10|    12199                  162668
    gi|224589821|ref|NC_000009.11|    13988                  21790
    gi|224589801|ref|NC_000010.10|    15707                  179281
    gi|224589802|ref|NC_000011.9|     37506                  203411
    gi|224589803|ref|NC_000012.11|    21714                  79745
    gi|224589804|ref|NC_000013.10|    6078                   19335895
    gi|224589805|ref|NC_000014.8|     14644                  19123810
    gi|224589806|ref|NC_000015.9|     13199                  20145084
    gi|224589807|ref|NC_000016.9|     15423                  92212
    gi|224589808|ref|NC_000017.10|    22089                  56680
    gi|224589809|ref|NC_000018.9|     5986                   111538
    gi|224589810|ref|NC_000019.9|     17690                  63006
    gi|224589812|ref|NC_000020.10|    10026                  119233
    gi|224589813|ref|NC_000021.8|     6119                   9421584
    gi|224589814|ref|NC_000022.10|    7366                   16150315
    gi|224589822|ref|NC_000023.10|    12939                  2774622
    gi|224589823|ref|NC_000024.9|     2819                   2711686
    gi|17981852|ref|NC_001807.4|      66035                  12
You can access the alignments, and perform operations like getting counts
and coverage from bm. For more examples of getting read coverage at the
chromosome level, see Exploring Protein-DNA Binding Sites from Paired-End
ChIP-Seq Data.
Determining Digital Gene Expression
Next, you will determine the mapped reads associated with each Ensembl
gene. Because the strings used in the SAM files to denote the reference names
are different from those provided in the annotations, find a vector with the
reference index for each gene:
geneReference = seqmatch(genes.Reference,chrs,'exact',true);
2431
1977
1909
1806
1708
1588
1462
1410
1355
1343
1337
1150
1072
1025
901
810
779
590
629
480
512
1545
590
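The reference index is simply the position of each gene's reference name in the list chrs. As an illustration, the sketch below performs the same kind of lookup with the base MATLAB function ismember on made-up data. It is an alternative shown for clarity, not a statement about how seqmatch works, and the shortened chrs list is an assumption.

```matlab
chrs = {'1', '2', '3', 'X', 'MT'};        % shortened list for illustration
geneRefs = {'3'; '1'; 'MT'; '1'; 'X'};     % made-up reference name per gene

% The second output of ismember gives, for each gene, the position of its
% reference name in chrs (0 when there is no match).
[found, geneReference] = ismember(geneRefs, chrs);

% Tally how many genes fall on each reference.
genesPerRef = accumarray(geneReference(found), 1, [numel(chrs) 1]);
```

A per-reference tally like genesPerRef is a quick sanity check that every gene was matched to some reference.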
For each gene, count the mapped reads that overlap any part of the gene.
The read counts for each gene are the digital gene expression of that gene.
Use the getCounts method of a BioMap to compute the read count within a
specified range.
counts = getCounts(bm,genes.Start,genes.Stop,1:genes.NumEntries,geneReferen
Gene expression levels are best represented by a table, with each row
representing a gene. Create a table with two columns, set the first column to
the gene symbols and the second column to the counts of the first sample.
filenames = {'s1.sam','s2.sam','s3.sam','s4.sam','s5.sam','s6.sam','s8.sam'
samples = {'Mock_1','Mock_2','Mock_3','Mock_4','DHT_1','DHT_2','DHT_3'};
lncap = table(genes.Feature,counts,'VariableNames',{'Gene',samples{1}});
ans =
    Gene            Mock_1
    ____________    ______
    'DSTYK'         21
    'KCNJ2'          1
    'DPF3'           2
    'KRT78'          0
    'GPR19'          1
    'SOX9'           8
    'C17orf63'      13
    'AL929472.1'     0
    'INPP5B'        19
    'NME4'          10
Determine the number of genes that have counts greater than or equal to
50 in chromosome 1.
ans =
188
Repeat this step for the other six samples (SAM files) in the data set to get
their gene counts and copy the information to the previously created table.
for i = 2:7
bm = BioMap(filenames{i});
counts = getCounts(bm,genes.Start,genes.Stop,1:genes.NumEntries,geneRef
lncap.(samples{i}) = counts;
end
Inspect the first 10 rows in the table with the counts for all seven samples.
lncap(1:10, :)
ans =
    Gene            Mock_1    Mock_2    Mock_3    Mock_4    DHT_1    DHT_2    DHT_3
    ____________    ______    ______    ______    ______    _____    _____    _____
    'DSTYK'         21        15        15        24        24       24       15
    'KCNJ2'          1         0         2         0         0        2        2
    'DPF3'           2         2         2         2         2        1        1
    'KRT78'          0         0         0         0         0        0        0
    'GPR19'          1         2         1         1         0        0        0
    'SOX9'           8        13        19        15        27       22       11
    'C17orf63'      13        12        16        24        19       12        9
    'AL929472.1'     0         0         0         1         0        0        0
    'INPP5B'        19        23        27        24        35       32        9
    'NME4'          10        11        14        22        11       20        8
The table lncap contains counts for samples from two biological conditions:
mock-treated (Aidx) and DHT-treated (Bidx).
Aidx = logical([1 1 1 1 0 0 0]);
Bidx = logical([0 0 0 0 1 1 1]);
You can plot the counts for a chromosome along the chromosome genome
coordinate. For example, plot the counts for chromosome 1 for mock-treated
sample Mock_1 and DHT-treated sample DHT_1. Add the ideogram for
chromosome 1 to the plot using the chromosomeplot function.
ichr1 = find(lichr1);  % linear index to genes in chromosome 1
[~,h] = sort(genes.Start(ichr1));
ichr1 = ichr1(h);      % linear index to genes in chromosome 1 sorted by
                       % genomic position
figure
plot(genes.Start(ichr1), lncap{ichr1,'Mock_1'}, '.-r',...
genes.Start(ichr1), lncap{ichr1,'DHT_1'}, '.-b');
ylabel('Gene Counts')
title('Gene Counts on Chromosome 1')
fixGenomicPositionLabels(gca) % formats tick labels and adds datacursors
chromosomeplot('hs_cytoBand.txt', 1, 'AddToPlot', gca)
For RNA-seq experiments, the read counts have been found to be linearly
related to the abundance of the target transcripts [4]. The interest lies
in comparing the read counts between different biological conditions.
Current observations suggest that typical RNA-seq experiments have low
background noise, and the gene counts are discrete and could follow the
Poisson distribution. However, it has been noted that the Poisson assumption
often predicts smaller variation in count data than is observed, because it
ignores the extra variation due to the actual differences between replicate
samples [5]. Anders et al. (2010) proposed an error model for statistical inference
of differential signal in RNA-seq expression data that could address the
overdispersion problem [6]. Their approach uses the negative binomial
distribution to model the null distribution of the read counts. The mean and
variance of the negative binomial distribution are linked by local regression,
and these two parameters can be reliably estimated even when the number of
replicates is small [6].
In this example, you will apply this statistical model to process the count
data and test for differential expression. The details of the algorithm can be
found in reference [6]. The model of Anders et al. (2010) has three sets of
parameters that need to be estimated from the data:
1. Library size parameters;
2. Gene abundance parameters under each experimental condition;
3. The smooth functions that model the dependence of the raw variance on
the expected mean.
Estimating Library Size Factor
The expectation values of all gene counts from a sample are proportional to
the sample's library size. The effective library size can be estimated from the
count data.
Compute the geometric mean of the gene counts (rows in lncap) across all
samples in the experiment as a pseudo-reference sample.
pseudo_ref_sample = geomean(lncap{:,samples},2);
Each library size parameter is computed as the median of the ratio of the
sample's counts to those of the pseudo-reference sample.
nzi = pseudo_ref_sample>0; % ignore genes with zero geometric mean
ratios = bsxfun(@rdivide, lncap{nzi,samples}, pseudo_ref_sample(nzi));
sizeFactors = median(ratios, 1);
The counts can be transformed to a common scale using size factor adjustment.
base_lncap = lncap;
base_lncap{:,samples} = bsxfun(@rdivide,lncap{:,samples},sizeFactors);
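To see the size factor computation on numbers small enough to check by hand, consider the sketch below. It uses a made-up 3-by-2 count matrix and computes the geometric mean with exp(mean(log(...))) so that it runs without the Statistics Toolbox. The variable names are hypothetical and the data are not from the example.

```matlab
% Made-up counts: 3 genes (rows) by 2 samples (columns).
counts = [10 20; 100 200; 5 10];

% Geometric mean of each gene across samples (pseudo-reference sample).
pseudoRef = exp(mean(log(counts), 2));

% Median across genes of the sample-to-reference ratios.
nz = pseudoRef > 0;
ratios = bsxfun(@rdivide, counts(nz, :), pseudoRef(nz));
sizeFactors = median(ratios, 1);

% Divide each column by its size factor to reach the common scale.
baseCounts = bsxfun(@rdivide, counts, sizeFactors);
```

Because the second sample has exactly twice the counts of the first, the two size factors differ by a factor of two, and after adjustment both columns of baseCounts agree.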
Use the maboxplot function to inspect the count distribution of the mock-treated
and DHT-treated samples before and after the size factor adjustment.
figure
subplot(2,1,1)
maboxplot(log2(lncap{:,samples}), 'title','Raw Read Counts',...
'orientation', 'horizontal')
subplot(2,1,2)
maboxplot(log2(base_lncap{:,samples}), 'title','Size Factor Adjusted Read C
'orientation', 'horizontal')
Plot the log2 fold changes against the base means using the mairplot
function. A quick exploration reflects ~15 differentially expressed genes (20
fold change or more), though not all of these are significant due to the low
number of counts compared to the sample variance.
mairplot(mean_A(nzi),mean_B(nzi),'Labels',lncap.Gene,'Factor',20)
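The quantities behind such a plot are per-condition means and their log2 ratio. The sketch below computes them for two made-up genes, using the same logical index convention as Aidx and Bidx. The variable names and the direction of the ratio (DHT-treated over mock-treated) are assumptions for illustration.

```matlab
% Made-up counts on the common scale: 2 genes by 7 samples.
baseCounts = [10 12  8 10 40 38 42;
              50 50 50 50 25 25 25];
Aidx = logical([1 1 1 1 0 0 0]);   % mock-treated columns
Bidx = logical([0 0 0 0 1 1 1]);   % DHT-treated columns

% Mean expression of each gene within each condition.
meanA = mean(baseCounts(:, Aidx), 2);
meanB = mean(baseCounts(:, Bidx), 2);

% Positive values mean higher expression in the DHT-treated samples.
log2FoldChange = log2(meanB ./ meanA);
```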
In the model, the variances of the counts of a gene are considered as the sum
of a shot noise term and a raw variance term. The shot noise term is the
mean counts of the gene, while the raw variance can be predicted from the
mean, i.e., genes with a similar expression level have similar variance across
the replicates (samples of the same biological condition). A smooth function
that models the dependence of the raw variance on the mean is obtained by
fitting the sample mean and variance within replicates for each gene using
a local regression function.
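The decomposition can be sketched numerically before any smoothing is applied. The example below assumes, following Eqs. 7-9 of [6], that the shot noise term on the common scale is the mean multiplied by the average inverse size factor, and that subtracting it from the sample variance gives a per-gene raw variance estimate. The data and variable names are made up, and the local regression step is not shown.

```matlab
% Made-up raw counts for 2 genes across 4 replicates of one condition.
counts = [ 5  30  20  80;
          25 150 100 400];
sizeFactors = [0.5 1 1 2];

% Bring the counts to the common scale: [10 30 20 40; 50 150 100 200].
base = bsxfun(@rdivide, counts, sizeFactors);

meanEst = mean(base, 2);
varEst  = var(base, 0, 2);

% Shot noise term: the mean scaled by the average inverse size factor.
z = meanEst * mean(1 ./ sizeFactors);

% What remains after removing shot noise estimates the raw variance.
rawVar = varEst - z;
```

Single-gene estimates like rawVar are noisy with only a few replicates, which is why the model fits a smooth function of the mean instead of using them directly.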
Compute sample variances transformed to the common scale for mock-treated
samples. (Eq. 7 in [6])
var_A = var(base_lncap{:,samples(Aidx)}, 0, 2);
raw_var_func_A =
@(meanEstimate)calculateUnbiasedRawVariance(meanEstimate)
raw_var_func_A(mean_A) + z;
Plot the sample variance to its regressed value to check the fit of the variance
function.
figure
loglog(mean_A, var_A, '*')
hold on
loglog(mean_A, var_fit_A, '.r')
ylabel('Base Variances')
xlabel('Base Means')
title('Dependence of the Variance on the Mean for Mock-Treated Samples')
The fit (red line) follows the single-gene estimates well, even though the
spread of the latter is considerable, as one would expect, given that each
raw variance value is estimated from only four values (four mock-treated
replicates).
Empirical Cumulative Distribution Functions
hold on
cm = jet(7);
for i = 1:7
[Y1,X1] = ecdf(pchisq(grps==i));
plot(X1,Y1,'LineWidth',2,'color',cm(i,:))
end
plot([0,1],[0,1] ,'k', 'linewidth', 2)
set(gca, 'Box', 'on')
legend(labels,'Location','NorthWest')
xlabel('Chi-squared probability of residual')
ylabel('ECDF')
title('Residuals ECDF plot for mock-treated samples')
The ECDF curves of count levels greater than 3 and below 130 follow the
diagonal (black line) well. If the ECDF curves are below the black line,
variance is underestimated. If the ECDF curves are above the black line,
variance is overestimated [6]. For very low counts (below 3), the deviations
become stronger, but at these levels, shot noise dominates. For the high
count cases, the variance is overestimated. The reason might be that there are
not enough genes with high counts. Get the number of genes in each of the
count levels.
array2table(accumarray(grps,1),'VariableNames',{'Counts'},'RowNames',labels
ans =
               Counts
               ______
    0-3        8984
    4-12       3405
    13-30      3481
    31-65      2418
    66-130     1173
    131-310     428
    > 311       123
Increasing the sequence depth, which in turn increases the number of genes
with higher counts, improves the variance estimation.
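Grouping genes into count levels and tallying each level can be sketched as follows. The bin edges are chosen to mirror the level labels above, which is an assumption about how the levels were defined. The mean counts are made up, and discretize requires MATLAB R2015a or later.

```matlab
% Made-up mean counts for eight genes.
meanCounts = [0 2 3.5 12 40 100 200 500];

% Bin edges chosen to mirror the count levels listed above (an assumption).
edges  = [0 3 12 30 65 130 310 Inf];
labels = {'0-3', '4-12', '13-30', '31-65', '66-130', '131-310', '> 311'};

% Each bin includes its right edge; the first bin also includes 0.
grps = discretize(meanCounts, edges, 'IncludedEdge', 'right');

% Count the genes in each level.
genesPerLevel = accumarray(grps(:), 1, [numel(labels) 1]);
```

Levels with few genes, such as the highest ones here, give less reliable ECDF curves.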
Testing for Differential Expression
Having estimated and verified the mean-variance dependence, you can test
for differentially expressed genes between the samples from the mock- and
DHT-treated conditions. Define, as test statistic, the total counts in each
condition, k_A and k_B:
k_A = sum(lncap{:, samples(Aidx)}, 2);
k_B = sum(lncap{:, samples(Bidx)}, 2);
Parameters of the new negative binomial distributions for count sums k_A can
be calculated by Eqs. 12-14 in [6]:
Compute the p-values for the statistical significance of the change from
DHT-treated condition to mock-treated condition. The helper function
computePVal implements the numerical computation of the p-values
presented in the reference [6].
res = table(genes.Feature,'VariableNames',{'Gene'});
res.pvals = computePVal(k_B, mean_k_B, var_k_B, k_A, mean_k_A, var_k_A);
You can empirically adjust the p-values from the multiple tests for false
discovery rate (FDR) with the Benjamini-Hochberg procedure [7] using the
mafdr function.
res.p_fdr = mafdr(res.pvals, 'BHFDR', true);
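For readers who want to see what the Benjamini-Hochberg adjustment does, the sketch below applies the standard step-up formula to four made-up p-values using only base MATLAB (cummin requires R2014b or later). It is a textbook version of the procedure and is not guaranteed to match mafdr in every edge case.

```matlab
p = [0.01 0.04 0.03 0.005]';   % made-up raw p-values

% Sort ascending and scale each p-value by m / rank.
[pSorted, order] = sort(p);
m = numel(p);
adjusted = pSorted .* m ./ (1:m)';

% Enforce monotonicity from the largest p-value downward and cap at 1.
adjusted = min(1, flipud(cummin(flipud(adjusted))));

% Return the adjusted values to the original order.
pFdr = zeros(m, 1);
pFdr(order) = adjusted;
```

Each adjusted value is at least as large as the raw p-value it came from, so thresholding pFdr is more conservative than thresholding p.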
Plot the log2 fold changes against the base means, and color the genes
according to their FDR-adjusted p-values.
figure
scatter(log2(pooled_mean), res.log2_fold_change,3,(res.p_fdr).^(.02),'o')
xlabel('log2 Mean')
ylabel('log2 Fold Change')
colormap(flipud(cool(256)))
hc = colorbar;
set(hc,'YTickLabel',num2str((get(hc,'Ytick').^50)','%6.1g'))
title('Fold Change colored by False Discovery Rate (FDR)')
You can identify up- or down-regulated genes for mean base count levels
over 3.
up_idx = find(res.p_fdr < 0.01 & res.log2_fold_change >= 2 & pooled_mean >
numel(up_idx)
ans =
185
ans =
190
This analysis identified 375 statistically significant genes (out of 20,012) that
were differentially up- or down-regulated by hormone treatment. You can
sort table res by statistical significance and display the top of the list.
[~,h] = sort(res.p_fdr);
res(h(1:20),:)
ans =
    Gene         pvals          p_fdr          log2_fold_change
    _________    ___________    ___________    ________________
    'FKBP5'                0              0     5.0449
    'NCAPD3'               0              0     5.4914
    'CENPN'      6.6707e-300    4.4498e-296     4.8519
    'LIFR'       2.4939e-284    1.2477e-280     4.0734
    'DHCR24'     2.0847e-249    8.3437e-246     3.1845
    'ERRFI1'     9.2602e-246    3.0886e-242     4.0914
    'GLYATL2'    8.5613e-244    2.4475e-240     3.4522
    'ACSL3'      2.6073e-225    6.5221e-222     3.6953
    'ATF3'       1.2368e-193      2.75e-190      3.368
    'MLPH'       2.0119e-185    4.0263e-182     2.5466
    'STEAP4'     1.7537e-182    3.1905e-179     9.9479
    'DBI'         3.787e-173    6.3155e-170     2.7759
    'ABCC4'      8.5321e-166    1.3134e-162     2.8211
    'KLK2'       2.7911e-163    3.9897e-160     2.9506
    'SAT1'       1.2922e-161     1.724e-158     2.6687
    'CAMK2N1'    8.8046e-161    1.1012e-157    -4.2901
    'JAM3'       4.7333e-151    5.5719e-148     5.7235
    'MBOAT2'      1.556e-140    1.7299e-137      3.285
    'RHOU'       1.4157e-138    1.4911e-135     4.0932
    'NNMT'       5.6484e-138    5.6517e-135     4.3572
References
[1] Li, H., Lovci, M.T., Kwon, Y-S., Rosenfeld, M.G., Fu, X-D., and Yeo, G.W.
"Determination of Tag Density Required for Digital Transcriptome Analysis:
Application to an Androgen-Sensitive Prostate Cancer Model", PNAS, 105(51),
pp 20179-20184, 2008.
[2] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. "Ultrafast and
Memory-efficient Alignment of Short DNA Sequences to the Human Genome",
Genome Biology, 10:R25, pp 1-10, 2009.
[3] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing
Subgroup, "The Sequence Alignment/map (SAM) Format and SAMtools",
Bioinformatics, 25, pp 2078-2079, 2009.
[4] Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B.
"Mapping and quantifying mammalian transcriptomes by RNA-Seq", Nature
Methods, 5, pp 621-628, 2008.
[5] Robinson, M.D., and Oshlack, A. "A Scaling Normalization method for
differential Expression Analysis of RNA-seq Data", Genome Biology 11:R25,
1-9, 2010.
[6] Anders, S. and Huber W. "Differential Expression Analysis for Sequence
Count Data", Genome Biology, 11:R106, 2010.
[7] Benjamini, Y., and Hochberg, Y. "Controlling the false discovery rate: a
practical and powerful approach to multiple testing", J. Royal Stat. Soc.,
B 57, 289-300, 1995.
(1) downloaded the file SRR054715.sra containing the unmapped short reads
and converted it to FASTQ-formatted files using the NCBI SRA Toolkit,
(2) produced a SAM-formatted file by mapping the short reads to the Thale
Cress reference genome, using a mapper such as BWA [2], Bowtie, or SSAHA2
(which is the mapper used by the authors of [1]), and
(3) ordered the SAM-formatted file by reference name first, then by genomic
position.
For the published version of this example, 8,655,859 paired-end short reads
were mapped using the BWA mapper [2]. BWA produced a SAM-formatted
file (aratha.sam) with 17,311,718 records (8,655,859 x 2). Repetitive hits
were randomly chosen, and only one hit was reported, but with lower mapping
quality. The SAM file was ordered and converted to a BAM-formatted file
using SAMtools [3] before being loaded into MATLAB.
The last part of the example also assumes that you downloaded the
reference genome for the Thale Cress model organism (which includes five
chromosomes). Uncomment the following lines of code to download the
reference from the NCBI repository:
% getgenbank('NC_003070','FileFormat','fasta','tofile','ach1.fasta');
% getgenbank('NC_003071','FileFormat','fasta','tofile','ach2.fasta');
% getgenbank('NC_003074','FileFormat','fasta','tofile','ach3.fasta');
% getgenbank('NC_003075','FileFormat','fasta','tofile','ach4.fasta');
% getgenbank('NC_003076','FileFormat','fasta','tofile','ach5.fasta');
bm =
BioMap
Properties:
    SequenceDictionary: {5x1 cell}
             Reference: [14637324x1 File indexed property]
             Signature: [14637324x1 File indexed property]
                 Start: [14637324x1 File indexed property]
        MappingQuality: [14637324x1 File indexed property]
                  Flag: [14637324x1 File indexed property]
          MatePosition: [14637324x1 File indexed property]
               Quality: [14637324x1 File indexed property]
              Sequence: [14637324x1 File indexed property]
                Header: [14637324x1 File indexed property]
                 NSeqs: 14637324
                  Name: ''
Use the getSummary method to obtain a list of the existing references and the
actual number of short reads mapped to each one.
getSummary(bm)
BioMap summary:
                                  Name: ''
                        Container_Type: 'Data is file indexed.'
             Total_Number_of_Sequences: 14637324
    Number_of_References_in_Dictionary: 5

            Number_of_Sequences    Genomic_Range
    Chr1    3151847                   1  30427671
    Chr2    3080417                1000  19698292
    Chr3    3062917                  94  23459782
    Chr4    2218868                1029  18585050
    Chr5    3123275                  11  26975502
The remainder of this example focuses on the analysis of one of the five
chromosomes, Chr1. Create a new BioMap to access the short reads mapped to
the first chromosome by subsetting the original object.
bm1 = getSubset(bm,'SelectReference','Chr1')
bm1 =
BioMap
Properties:
    SequenceDictionary: {'Chr1'}
             Reference: [3151847x1 File indexed property]
             Signature: [3151847x1 File indexed property]
                 Start: [3151847x1 File indexed property]
        MappingQuality: [3151847x1 File indexed property]
                  Flag: [3151847x1 File indexed property]
          MatePosition: [3151847x1 File indexed property]
               Quality: [3151847x1 File indexed property]
              Sequence: [3151847x1 File indexed property]
                Header: [3151847x1 File indexed property]
                 NSeqs: 3151847
                  Name: ''
By accessing the Start and Stop positions of the mapped short reads, you can
obtain the genomic range.
x1 = min(getStart(bm1))
x2 = max(getStop(bm1))
x1 =
1
x2 =
30427671
To explore the coverage for the whole range of the chromosome, a binning
algorithm is required. The getBaseCoverage method produces a coverage
signal based on effective alignments. It also allows you to specify a bin width
to control the size (or resolution) of the output signal. However, internal
computations are still performed at the base pair (bp) resolution. This means
that despite setting a large bin size, narrow peaks in the coverage signal can
still be observed. Once the coverage signal is plotted, you can program the
figure's data cursor to display the genomic position when using the tooltip.
You can zoom and pan the figure to determine the position and height of
the ChIP-Seq peaks.
[cov,bin] = getBaseCoverage(bm1,x1,x2,'binWidth',1000,'binType','max');
figure
plot(bin,cov)
axis([x1,x2,0,100])        % sets the axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage in Chromosome 1')
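The idea of computing coverage at base pair resolution and then summarizing it in bins by the maximum can be sketched with base MATLAB. The example below uses three made-up reads on a reference of length 10. It illustrates the concept only and is not the getBaseCoverage implementation.

```matlab
% Three made-up reads on a reference of length 10.
starts = [1 3 5]';
stops  = [4 6 9]';
refLen = 10;

% Add 1 where a read starts and subtract 1 just past where it ends; the
% running sum is the per-base coverage.
n = numel(starts);
delta = accumarray([starts; stops + 1], [ones(n, 1); -ones(n, 1)], [refLen + 1, 1]);
cov = cumsum(delta(1:refLen));

% Summarize fixed-width bins by their maximum, so narrow peaks survive.
binWidth = 5;
binnedMax = max(reshape(cov, binWidth, []), [], 1);
```

Taking the maximum within each bin, instead of the mean, is what keeps a narrow peak visible even when the bin is much wider than the peak.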
Observe the large peak with coverage depth of 800+ between positions
4599029 and 4599145. Investigate how these reads are aligning to the
reference chromosome. You can retrieve a subset of these reads large enough to
satisfy a coverage depth of 25, since this is sufficient to understand what is
happening in this region. Use getIndex to obtain indices to this subset. Then
use getCompactAlignment to display the corresponding multiple alignment of
the short reads.
i = getIndex(bm1,4599029,4599145,'depth',25);
bmx = getSubset(bm1,i,'inmemory',false)
getCompactAlignment(bmx,4599029,4599145)
bmx =
BioMap
Properties:
    SequenceDictionary: {'Chr1'}
             Reference: [62x1 File indexed property]
             Signature: [62x1 File indexed property]
                 Start: [62x1 File indexed property]
        MappingQuality: [62x1 File indexed property]
                  Flag: [62x1 File indexed property]
          MatePosition: [62x1 File indexed property]
               Quality: [62x1 File indexed property]
              Sequence: [62x1 File indexed property]
                Header: [62x1 File indexed property]
                 NSeqs: 62
                  Name: ''
ans =
In addition to visually confirming the alignment, you can also explore the
mapping quality for all the short reads in this region, as this may hint at a
potential problem. In this case, less than one percent of the short reads have
a Phred quality of 60, indicating that the mapper most likely found multiple
hits within the reference genome and hence assigned a lower mapping quality.
figure
i = getIndex(bm1,4599029,4599145);
hist(double(getMappingQuality(bm1,i)))
title('Mapping Quality of the reads between 4599029 and 4599145')
xlabel('Phred Quality Score')
ylabel('Number of Reads')
Most of the large peaks in this data set occur due to satellite repeat regions or
due to their closeness to the centromere [4], and show characteristics similar to
the example just explored. You may explore other regions with large peaks
using the same procedure.
To keep these problematic regions from affecting the analysis, two techniques
are used. First, given that the provided data set uses paired-end sequencing,
removing the reads that are not aligned in a proper pair reduces the number of
potential aligner errors or ambiguities. You can achieve this by exploring the
flag field of the SAM-formatted file, in which the second least significant bit
indicates whether the short read is mapped in a proper pair.
i = find(bitget(getFlag(bm1),2));
bm1_filtered = getSubset(bm1,i)
bm1_filtered =
  BioMap
  Properties:
    SequenceDictionary: {'Chr1'}
             Reference: [3040724x1 File indexed property]
             Signature: [3040724x1 File indexed property]
                 Start: [3040724x1 File indexed property]
        MappingQuality: [3040724x1 File indexed property]
                  Flag: [3040724x1 File indexed property]
          MatePosition: [3040724x1 File indexed property]
               Quality: [3040724x1 File indexed property]
              Sequence: [3040724x1 File indexed property]
                Header: [3040724x1 File indexed property]
                 NSeqs: 3040724
                  Name: ''
Second, consider only uniquely mapped reads. You can detect reads that are
equally mapped to different regions of the reference sequence by looking at
the mapping quality, because BWA assigns a lower mapping quality (less
than 60) to this type of short read.
i = find(getMappingQuality(bm1_filtered)==60);
bm1_filtered = getSubset(bm1_filtered,i)
bm1_filtered =
  BioMap
  Properties:
    SequenceDictionary: {'Chr1'}
             Reference: [2313252x1 File indexed property]
             Signature: [2313252x1 File indexed property]
                 Start: [2313252x1 File indexed property]
        MappingQuality: [2313252x1 File indexed property]
                  Flag: [2313252x1 File indexed property]
          MatePosition: [2313252x1 File indexed property]
               Quality: [2313252x1 File indexed property]
              Sequence: [2313252x1 File indexed property]
                Header: [2313252x1 File indexed property]
                 NSeqs: 2313252
                  Name: ''
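The following sketch shows how the SAM flag bits and the mapping quality can be combined into one filter. The flag and quality values are hypothetical toy data, not taken from bm1. The bit positions follow the SAM format: bit 2 marks a proper pair, and bit 5 marks a read mapped to the reverse strand.

```matlab
% Hypothetical SAM flags and mapping qualities for five reads.
flags = uint16([99; 147; 65; 83; 163]);
mapq  = uint8([60; 60; 60; 0; 37]);

isProperPair = bitget(flags, 2) == 1;   % second least significant bit
isReverse    = bitget(flags, 5) == 1;   % fifth least significant bit

% Keep reads that are properly paired and have the top mapping quality.
keep = find(isProperPair & mapq == 60);
```

Here the third read fails the proper-pair test and the last two fail the quality test, so only the first two reads are kept.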
Visualize the filtered data set again using both a coarse resolution with 1000
bp bins for the whole chromosome and a fine resolution for a small region of
20,000 bp. Most of the large peaks due to artifacts have been removed.
[cov,bin] = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','m
figure
plot(bin,cov)
axis([x1,x2,0,100])        % sets the axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')
p1 = 24275801-10000;
p2 = 24275801+10000;
figure
plot(p1:p2,getBaseCoverage(bm1_filtered,p1,p2))
xlim([p1,p2])              % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base Position')
ylabel('Depth')
title('Coverage in Chromosome 1 after Filtering')
information is captured in the fifth bit of the flag field, according to the SAM
file format.
fow_idx = find(~bitget(getFlag(bm1_filtered),5));
rev_idx = find(bitget(getFlag(bm1_filtered),5));
SAM-formatted files use the same header strings to identify pair mates. By
pairing the header strings you can determine how the short reads in BioMap
are paired. To pair the header strings, simply order them in ascending order
and use the sorting indices (hf and hr) to link the unsorted header strings.
[~,hf] = sort(getHeader(bm1_filtered,fow_idx));
[~,hr] = sort(getHeader(bm1_filtered,rev_idx));
mate_idx = zeros(numel(fow_idx),1);
mate_idx(hf) = rev_idx(hr);
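The index manipulation above is easier to follow on a small case. The sketch below repeats the same four statements on hypothetical toy headers and indices (none of these values come from the data set) so that you can check the pairing by eye.

```matlab
% Hypothetical headers of the forward and reverse reads, in file order.
fow_headers = {'read3'; 'read1'; 'read2'};
rev_headers = {'read2'; 'read3'; 'read1'};
fow_idx = [1; 3; 5];   % positions of the forward reads in the full set
rev_idx = [2; 4; 6];   % positions of the reverse reads in the full set

[~, hf] = sort(fow_headers);   % hf(k): forward read with the k-th header
[~, hr] = sort(rev_headers);   % hr(k): reverse read with the k-th header

mate_idx = zeros(numel(fow_idx), 1);
mate_idx(hf) = rev_idx(hr);    % mate of forward read j is mate_idx(j)
```

The first forward read is 'read3', whose mate is the second reverse read, stored at position 4 of the full set, so mate_idx(1) is 4. This only works when every forward read has exactly one mate with an identical header.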
Use the resulting fow_idx and mate_idx variables to retrieve pair mates. For
example, retrieve the paired-end reads for the first 10 fragments.
for j = 1:10
disp(getInfo(bm1_filtered, fow_idx(j)))
disp(getInfo(bm1_filtered, mate_idx(j)))
end
SRR054715.sra.6849385
SRR054715.sra.6849385
SRR054715.sra.6992346
SRR054715.sra.6992346
SRR054715.sra.8438570
SRR054715.sra.8438570
SRR054715.sra.1676744
SRR054715.sra.1676744
SRR054715.sra.6820328
SRR054715.sra.6820328
SRR054715.sra.1559757
SRR054715.sra.1559757
SRR054715.sra.5658991
SRR054715.sra.5658991
SRR054715.sra.4625439
SRR054715.sra.4625439
SRR054715.sra.1007474
SRR054715.sra.1007474
SRR054715.sra.7345693
SRR054715.sra.7345693
Use the paired-end indices to construct a new BioMap with the minimal
information needed to represent the sequencing fragments. First, calculate
the insert sizes.
J = getStop(bm1_filtered, fow_idx);
K = getStart(bm1_filtered, mate_idx);
L = K - J - 1;
Obtain the new signature (or CIGAR string) for each fragment by using the
short read original signatures separated by the appropriate number of skip
CIGAR symbols (N).
n = numel(L);
cigars = cell(n,1);
for i = 1:n
cigars{i} = sprintf('%dN' ,L(i));
end
cigars = strcat( getSignature(bm1_filtered, fow_idx),...
cigars,...
getSignature(bm1_filtered, mate_idx));
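As a small worked case, the sketch below builds fragment signatures for two hypothetical read pairs. All positions and CIGAR strings are made-up toy values. The gap between the mates is expressed with the skip symbol N, so that the fragment spans from the start of the forward read to the end of its mate.

```matlab
% Hypothetical toy values for two read pairs.
fow_stop   = [136; 250];           % last base of each forward read
mate_start = [201; 400];           % first base of each mate
fow_cigar  = {'36M'; '36M'};
mate_cigar = {'36M'; '30M6S'};

gap = mate_start - fow_stop - 1;   % bases between the two mates

% One skip token per pair, for example '64N'.
skips = arrayfun(@(g) sprintf('%dN', g), gap, 'UniformOutput', false);

% Element-wise concatenation of the three cell arrays.
fragment_cigar = strcat(fow_cigar, skips, mate_cigar);
```

This assumes the mates do not overlap. A negative gap would produce an invalid CIGAR token and would need separate handling.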
ylabel('Count')
bm1_fragments =
  BioMap
  Properties:
    SequenceDictionary: {0x1 cell}
             Reference: {0x1 cell}
             Signature: {1156626x1 cell}
                 Start: [1156626x1 uint32]
        MappingQuality: [0x1 uint8]
                  Flag: [0x1 uint16]
          MatePosition: [0x1 uint32]
               Quality: {0x1 cell}
              Sequence: {1156626x1 cell}
                Header: {0x1 cell}
                 NSeqs: 1156626
                  Name: ''
cov_reads = getBaseCoverage(bm1_filtered,x1,x2,'binWidth',1000,'binType','m
[cov_fragments,bin] = getBaseCoverage(bm1_fragments,x1,x2,'binWidth',1000,'
figure
plot(bin,cov_reads,bin,cov_fragments)
xlim([x1,x2])              % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
xlabel('Base position')
ylabel('Depth')
title('Coverage Comparison')
legend('Short Reads','Fragments')
title('Coverage Comparison')
legend('Short Reads','Fragments','E-box motif')
Observe that it is not possible to associate each peak in the coverage signals
with an E-box motif. This is because the length of the sequencing fragments is
comparable to the average motif distance, blurring peaks that are close. Plot
the distribution of the distances between the E-box motif sites.
motif_sep = diff(sort(motifs));
figure
hist(motif_sep(motif_sep<500),50)
title('Distance (bp) between adjacent E-box motifs')
xlabel('Distance (bp)')
ylabel('Counts')
Use the function mspeaks to perform peak detection with wavelet denoising
on the coverage signal of the fragment alignments. Filter putative ChIP peaks
using a height filter to remove peaks that are not enriched by the binding
process under consideration.
putative_peaks = mspeaks(bin,cov_fragments,'noiseestimator',20,...
'heightfilter',10,'showplot',true);
hold on
plot([1;1;1]*motifs(motifs>p1 & motifs<p2),[0;max(ylim);NaN],'r')
xlim([111000 114000])      % sets the x-axis limits
fixGenomicPositionLabels   % formats tick labels and adds datacursors
legend('Coverage from Fragments','Wavelet Denoised Coverage','Putative ChIP
xlabel('Base position')
ylabel('Depth')
title('ChIP-Seq Peak Detection')
Use the knnsearch function to find the closest motif to each one of the
putative peaks. As expected, most of the enriched ChIP peaks are close to an
E-box motif [1]. This reinforces the importance of performing peak detection
at the finest resolution possible (bp resolution) when the expected density of
binding sites is high, as it is in the case of the E-box motif. This example also
illustrates that for this type of analysis, paired-end sequencing should be
considered over single-end sequencing [1].
h = knnsearch(motifs',putative_peaks(:,1));
distance = putative_peaks(:,1)-motifs(h(:))';
figure
hist(distance(abs(distance)<200),50)
title('Distance to Closest E-box Motif for Each Detected Peak')
xlabel('Distance (bp)')
ylabel('Counts')
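If you want to see what the nearest-motif step computes without calling knnsearch, the sketch below does the same lookup by brute force in base MATLAB. The motif and peak positions are hypothetical toy values. For one-dimensional positions the nearest neighbor is simply the motif with the smallest absolute distance.

```matlab
% Hypothetical motif sites (row vector) and peak locations (column vector).
motifs = [100 480 950 2000];
peaks  = [110; 900; 1500];

% Pairwise absolute distances: one row per peak, one column per motif.
d = abs(bsxfun(@minus, peaks, motifs));

% Index of the closest motif for each peak.
[~, h] = min(d, [], 2);

% Signed distance from each peak to its closest motif.
distance = peaks - motifs(h)';
```

The brute-force table grows with the number of peaks times the number of motifs, so for chromosome-scale data a search function such as knnsearch is the more practical choice.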
References
[1] Wang C., Xu J., Zhang D., Wilson Z.A., and Zhang D. "An effective
approach for identification of in vivo protein-DNA binding sites from
paired-end ChIP-Seq data", BMC Bioinformatics, 11:81, Feb 9, 2010.
[2] Li H. and Durbin R. "Fast and accurate short read alignment with
Burrows-Wheeler transform", Bioinformatics, 25, pp 1754-60, 2009.
[3] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing
Subgroup "The Sequence Alignment/map (SAM) Format and SAMtools",
Bioinformatics, 25, pp 2078-2079, 2009.
[4] Jothi R, Cuddapah S, Barski A, Cui K, Zhao K. "Genome-wide identification
of in vivo protein-DNA binding sites from ChIP-Seq data", Nucleic Acids
Research, 36(16), pp 5221-31, Sep 2008.
[5] Hoffman B.G., and Jones S.J.M. "Genome-wide identification of
DNA-protein interactions using chromatin immunoprecipitation coupled with
flow cell sequencing", Journal of Endocrinology 201, pp 1-13, 2009.
[6] Ramsey SA, Knijnenburg TA, Kennedy KA, Zak DE, Gilchrist M, Gold
ES, Johnson CD, Lampano AE, Litvak V, Navarro G, Stolyar T, Aderem A,
Shmulevich I. "Genome-wide histone acetylation data improve prediction of
mammalian transcription factor binding sites", Bioinformatics, 26(17), pp
2071-5, Sep 1, 2010.
You can obtain the unmapped single-end reads for four sequencing
experiments from the NCBI FTP site. Short reads were produced using
Illumina's Genome Analyzer II. The average insert size is 120 bp, and the
length of the short reads is 36 bp.
To explore the signal coverage of the HCT116 samples you need to construct a
BioMap. BioMap has an interface that provides direct access to the mapped
short reads stored in the BAM-formatted file, thus minimizing the amount
of data that is actually loaded into memory. Use the function baminfo to
obtain a list of the existing references and the actual number of short reads
mapped to each one.
info = baminfo('SRR030224.bam','ScanDictionary',true);
fprintf('%-35s%s\n','Reference','Number of Reads');
for i = 1:numel(info.ScannedDictionary)
fprintf('%-35s%d\n',info.ScannedDictionary{i},...
info.ScannedDictionaryCount(i));
end
Reference                          Number of Reads
gi|224589800|ref|NC_000001.10|     205065
gi|224589811|ref|NC_000002.11|     187019
gi|224589815|ref|NC_000003.11|     73986
gi|224589816|ref|NC_000004.11|     84033
gi|224589817|ref|NC_000005.9|      96898
gi|224589818|ref|NC_000006.11|     87990
gi|224589819|ref|NC_000007.13|     120816
gi|224589820|ref|NC_000008.10|     111229
gi|224589821|ref|NC_000009.11|     106189
gi|224589801|ref|NC_000010.10|     112279
gi|224589802|ref|NC_000011.9|      104466
gi|224589803|ref|NC_000012.11|     87091
gi|224589804|ref|NC_000013.10|     53638
gi|224589805|ref|NC_000014.8|      64049
gi|224589806|ref|NC_000015.9|      60183
gi|224589807|ref|NC_000016.9|      146868
gi|224589808|ref|NC_000017.10|     195893
gi|224589809|ref|NC_000018.9|      60344
gi|224589810|ref|NC_000019.9|      166420
gi|224589812|ref|NC_000020.10|     148950
gi|224589813|ref|NC_000021.8|      310048
gi|224589814|ref|NC_000022.10|     76037
gi|224589822|ref|NC_000023.10|     32421
gi|224589823|ref|NC_000024.9|      18870
gi|17981852|ref|NC_001807.4|       1015
Unmapped                           6805842
bm_hct116_1 = BioMap('SRR030224.bam','SelectRef','gi|224589821|ref|NC_00000
bm_hct116_2 = BioMap('SRR030225.bam','SelectRef','gi|224589821|ref|NC_00000
bm_hct116_1 =
  BioMap with properties:
    SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
             Reference: [106189x1 File indexed property]
             Signature: [106189x1 File indexed property]
                 Start: [106189x1 File indexed property]
        MappingQuality: [106189x1 File indexed property]
                  Flag: [106189x1 File indexed property]
          MatePosition: [106189x1 File indexed property]
               Quality: [106189x1 File indexed property]
              Sequence: [106189x1 File indexed property]
                Header: [106189x1 File indexed property]
                 NSeqs: 106189
                  Name:
bm_hct116_2 =
  BioMap with properties:
    SequenceDictionary: 'gi|224589821|ref|NC_000009.11|'
             Reference: [107586x1 File indexed property]
             Signature: [107586x1 File indexed property]
                 Start: [107586x1 File indexed property]
        MappingQuality: [107586x1 File indexed property]
                  Flag: [107586x1 File indexed property]
          MatePosition: [107586x1 File indexed property]
               Quality: [107586x1 File indexed property]
              Sequence: [107586x1 File indexed property]
                Header: [107586x1 File indexed property]
                 NSeqs: 107586
                  Name:
Because short reads represent the methylated regions of the DNA, there
is a correlation between aligned coverage and DNA methylation. Observe
the increased DNA methylation close to the chromosome telomeres; it is
known that there is an association between DNA methylation and the role of
telomeres for maintaining the integrity of the chromosomes. In the coverage
plot you can also see a long gap over the chromosome centromere. This is due
to the repetitive sequences present in the centromere, which prevent us from
aligning short reads to a unique position in this region. For the data sets used
in this example, only about 30% of the short reads were uniquely mapped to
the reference genome.
Correlating CpG Islands and DNA Methylation
chr9 =
Use the cpgisland function to find the CpG clusters. Using the standard
definition for CpG islands [4], 200 or more bp islands with a CpG
observed/expected ratio of 60% or greater, leads to 1682 CpG islands found in
chromosome 9.
cpgi = cpgisland(chr9.Sequence)
cpgi =
Starts: [1x1682 double]
Stops: [1x1682 double]
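For reference, the observed/expected CpG ratio in the definition above can be computed by hand for a single window. The sketch below uses a hypothetical 20-base toy sequence and the usual form of the ratio, (number of CpG x N) / (number of C x number of G). It is meant only to illustrate the quantity and makes no claim about how cpgisland scans a chromosome internally.

```matlab
% Hypothetical toy window.
seq = 'CGCGATCGCGTACGCGCGAT';
N = numel(seq);

numC   = sum(seq == 'C');
numG   = sum(seq == 'G');
numCpG = numel(strfind(seq, 'CG'));   % C immediately followed by G

gcContent = (numC + numG) / N;        % fraction of G and C bases
obsExp    = numCpG * N / (numC * numG);
```

A window qualifies under the standard definition when it is long enough and obsExp is at least 0.6. This toy window is far too short to be an island and only shows the arithmetic.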
Use the getCounts method to calculate the ratio of aligned bases that are
inside CpG islands. For the first replicate of the sample HCT116, the ratio
is close to 45%.
aligned_bases_in_CpG_islands = getCounts(bm_hct116_1,cpgi.Starts,cpgi.Stops
aligned_bases_total = getCounts(bm_hct116_1,1,n,'method','sum')
ratio = aligned_bases_in_CpG_islands ./ aligned_bases_total
aligned_bases_in_CpG_islands =
1724363
aligned_bases_total =
3822804
ratio =
0.4511
You can explore high-resolution coverage plots of the two sample replicates
and observe how the signal correlates with the CpG islands. For example,
explore the region between 23,820,000 and 23,830,000 bp. This is the 5' region
of the human gene ELAVL2.
r1 = 23820001; % set the region limits
r2 = 23830000;
fhELAVL2 = figure; % keep the figure handle to use it later
hold on
% plot high-resolution coverage of bm_hct116_1
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
% plot high-resolution coverage of bm_hct116_2
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');
To find regions that contain more mapped reads than would be expected by
chance, you can follow a similar approach to the one described by Serre et al.
counts_1 = getCounts(bm_hct116_1,w,w+99,'independent',true,'overlap','start
counts_2 = getCounts(bm_hct116_2,w,w+99,'independent',true,'overlap','start
First, try to model the counts assuming that all the windows with counts are
biologically significant and therefore from the same distribution. Use the
negative binomial distribution to fit a model to the count data.
nbp = nbinfit(counts_1);
figure
hold on
emphist = histc(counts_1,0:100); % calculate the empirical distribution
bar(0:100,emphist./sum(emphist),'c','grouped') % plot histogram
plot(0:100,nbinpdf(0:100,nbp(1),nbp(2)),'b','linewidth',2); % plot fitted m
axis([0 50 0 .001])
legend('Empirical Distribution','Negative Binomial Fit')
ylabel('Frequency')
xlabel('Counts')
title('Frequency of counts for 100 bp windows (HCT116-1)')
The poor fit indicates that the observed distribution may be due to the
mixture of two models, one that represents the background and one that
represents the count data in methylated DNA windows.
A more realistic scenario would be to assume that windows with a small
number of mapped reads are mainly the background (or null model). Serre et
al. assumed that 100-bp windows containing four or more reads are unlikely
to be generated by chance. To estimate a good approximation to the null
model, you can fit the left body of the empirical distribution to a truncated
negative binomial distribution. To fit a truncated distribution, use the mle
function. First you need to define an anonymous function that defines the
right-truncated version of nbinpdf.
rtnbinpdf = @(x,p1,p2,t) nbinpdf(x,p1,p2) ./ nbincdf(t-1,p1,p2);
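The division by the cumulative probability at t-1 is what turns the ordinary density into a proper distribution on the truncated support 0..t-1. The sketch below checks this numerically. To stay self-contained it writes the negative binomial mass function with gammaln from base MATLAB. The helper name nbpmf and the parameter values are hypothetical and chosen only for illustration.

```matlab
% Negative binomial probability mass function, base MATLAB only:
% P(x) = Gamma(r+x) / (Gamma(r) * x!) * p^r * (1-p)^x
nbpmf = @(x,r,p) exp(gammaln(r + x) - gammaln(x + 1) - gammaln(r) ...
                     + r .* log(p) + x .* log(1 - p));

r = 0.5;  p = 0.2;  t = 6;       % hypothetical parameters and threshold
support = 0:t-1;                 % values that survive the truncation

% Probability mass below the threshold (the cdf evaluated at t-1).
massBelow = sum(nbpmf(support, r, p));

% Right-truncated mass function: rescale so the kept values sum to one.
rtpmf = nbpmf(support, r, p) ./ massBelow;
total = sum(rtpmf);
```

Because massBelow is less than one, every truncated probability is larger than its untruncated counterpart, and total equals one up to rounding.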
Before fitting the real data, assess the fitting procedure with some
sampled data from a known distribution.
nbp = [0.5 0.2];                    % Known coefficients
x = nbinrnd(nbp(1),nbp(2),10000,1); % Random sample
trun = 6;                           % Set a truncation threshold
nbphat1 = nbinfit(x);               % Fit non-truncated model to all data
nbphat2 = nbinfit(x(x<trun)); % Fit non-truncated model to truncated data (
nbphat3 = rtnbinfit(x(x<trun),trun); % Fit truncated model to truncated dat
figure
hold on
emphist = histc(x,0:100); % Calculate the empirical distribution
bar(0:100,emphist./sum(emphist),'c','grouped') % plot histogram
h1 = plot(0:100,nbinpdf(0:100,nbphat1(1),nbphat1(2)),'b-o','linewidth',2);
h2 = plot(0:100,nbinpdf(0:100,nbphat2(1),nbphat2(2)),'r','linewidth',2);
h3 = plot(0:100,nbinpdf(0:100,nbphat3(1),nbphat3(2)),'g','linewidth',2);
axis([0 25 0 .2])
legend([h1 h2 h3],'Neg-binomial fitted to all data',...
'Neg-binomial fitted to truncated data',...
'Truncated neg-binomial fitted to truncated data')
ylabel('Frequency')
xlabel('Counts')
For the two replicates of the HCT116 sample, fit a right-truncated negative
binomial distribution to the observed null model using the rtnbinfit
anonymous function previously defined.
trun = 4; % Set a truncation threshold (as in [1])
pn1 = rtnbinfit(counts_1(counts_1<trun),trun); % Fit to HCT116-1 counts
pn2 = rtnbinfit(counts_2(counts_2<trun),trun); % Fit to HCT116-2 counts
Calculate the false discovery rate using the mafdr function. Use the
name-value pair BHFDR to use the linear step-up (LSU) procedure [6] to
calculate the FDR-adjusted p-values. Setting FDR < 0.01 lets you
identify the 100-bp windows that are significantly methylated.
fdr1 = mafdr(pval1,'bhfdr',true);
fdr2 = mafdr(pval2,'bhfdr',true);
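To show what the linear step-up adjustment does to a list of p-values, the sketch below applies the textbook Benjamini-Hochberg procedure [6] to five hypothetical p-values in base MATLAB. It is a hand-written illustration of the procedure, not the mafdr source code.

```matlab
% Hypothetical p-values for five windows.
pvals = [0.001; 0.008; 0.039; 0.041; 0.20];
m = numel(pvals);

% Sort, then scale the k-th smallest p-value by m/k.
[ps, order] = sort(pvals);
adj = ps .* m ./ (1:m)';

% Enforce monotonicity from the largest p-value downward.
for k = m-1:-1:1
    adj(k) = min(adj(k), adj(k+1));
end
adj = min(adj, 1);

% Return the adjusted values in the original order.
fdr = zeros(m, 1);
fdr(order) = adj;

significant = fdr < 0.01;
```

Only the first window stays below the 0.01 threshold here. The third p-value is pulled down to the adjusted value of the fourth by the monotonicity step.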
Number_of_sig_windows_HCT116_1 =
1662
Number_of_sig_windows_HCT116_2 =
1674
Number_of_sig_windows_HCT116 =
1346
Overall, you identified 1662 and 1674 non-overlapping 100-bp windows in the
two replicates of the HCT116 samples, which indicates there is significant
evidence of DNA methylation. There are 1346 windows that are significant in
both replicates.
For example, looking again in the promoter region of the ELAVL2 human
gene you can observe that in both sample replicates, multiple 100-bp windows
have been marked significant.
a =
GFFAnnotation with properties:
FieldNames: {1x9 cell}
NumEntries: 21184
a9 =
GFFAnnotation with properties:
FieldNames: {1x9 cell}
NumEntries: 800
numGenes =
800
Find the promoter regions for each gene. In this example we consider the
proximal promoter as the -500/100 upstream region.
downstream = 500;
upstream   = 100;
geneDir = strcmp(a9.Strand,'+');
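One way to turn gene coordinates into strand-aware promoter windows is sketched below. The variable names bpBefore and bpAfter are hypothetical, and the two genes use coordinates that appear in the tables of this example. MPDZ is treated as a reverse-strand gene here, an assumption that is consistent with the promoter coordinates listed for it. For a forward-strand gene the transcription start is the gene start. For a reverse-strand gene it is the gene stop, and the window is mirrored.

```matlab
% Gene coordinates for DMRT3 (forward strand) and MPDZ (assumed reverse).
geneStart = [976964; 13105703];
geneStop  = [991731; 13279589];
isPlus    = [true; false];

bpBefore = 500;   % hypothetical: bases before the transcription start
bpAfter  = 100;   % hypothetical: bases after the transcription start

% Transcription start site depends on the strand.
tss = geneStop;
tss(isPlus) = geneStart(isPlus);

% Reverse-strand default, then overwrite the forward-strand entries.
promStart = tss - bpAfter;
promStop  = tss + bpBefore;
promStart(isPlus) = tss(isPlus) - bpBefore;
promStop(isPlus)  = tss(isPlus) + bpAfter;
```

Both windows come out 600 bp wide and match the promoter Start and Stop values listed for these two genes.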
promoters.Counts_1 = getCounts(bm_hct116_1,promoters.Start,promoters.Stop,...
'overlap',1,'independent',true);
promoters.Counts_2 = getCounts(bm_hct116_2,promoters.Start,promoters.Stop,...
'overlap',1,'independent',true);
Fit a null distribution for each sample replicate and compute the p-values:
Ratio_of_sig_methylated_promoters = Number_of_sig_promoters./numGenes
Number_of_sig_promoters =
74
Ratio_of_sig_methylated_promoters =
0.0925
ans =
Gene             Start        Stop         Counts_1    pval_1        Counts_2    pval_2
'DMRT3'             976464       977064    223         6.6613e-16    253         6.6613e-16
'CNTFR'           34590021     34590621    219         6.6613e-16    226         6.6613e-16
'GABBR2'         101471379    101471979    404         6.6613e-16    400         6.6613e-16
'CACNA1B'        140771741    140772341    454         6.6613e-16    408         6.6613e-16
'BARX1'           96717554     96718154    264         6.6613e-16    286         6.6613e-16
'FAM78A'         134151834    134152434    497         6.6613e-16    499         6.6613e-16
'FOXB2'           79634071     79634671    163            1.4e-13    165         6.0363e-13
'TLE4'            82186188     82186788    157         3.5649e-13    151         4.7348e-12
'ASTN2'          120177248    120177848    141         4.3566e-12    163          8.098e-13
'FOXE1'          100615036    100615636    149         1.2447e-12    133         6.7598e-11
'MPDZ'            13279489     13280089    129         2.8679e-11    148         7.3683e-12
'PTPRD'           10612623     10613223    145         2.3279e-12    127         1.6448e-10
'PALM2-AKAP2'    112542089    112542689    134         1.3068e-11    135         5.0276e-11
'FAM69B'         139606522    139607122    112         4.1911e-10    144         1.3295e-11
'WNK2'            95946698     95947298    108          7.897e-10    125         2.2131e-10
'IGFBPL1'         38424344     38424944    110         5.7523e-10    114         1.1364e-09
'AKAP2'          112542269    112542869    107         9.2538e-10    106         3.7513e-09
'C9orf4'         111929471    111930071    102         2.0467e-09     96         1.6795e-08
'COL5A1'         137533120    137533720     84         3.6266e-08     97         1.4452e-08
'LHX3'           139096855    139097455     74         1.8171e-07     91         3.5644e-08
'OLFM1'          137966768    137967368     75         1.5457e-07     69         1.0074e-06
'NPR2'            35791651     35792251     68         4.8093e-07     73         5.4629e-07
'DBC1'           122131645    122132245     61         1.5082e-06     62         2.9575e-06
'SOHLH1'         138591274    138591874     56         3.4322e-06     67         1.3692e-06
'PIP5K1B'         71320075     71320675     59         2.0943e-06     63         2.5345e-06
'PRDM12'         133539481    133540081     53         5.6364e-06     61         3.4518e-06
'ELAVL2'          23826235     23826835     50         9.2778e-06     62         2.9575e-06
'ZFP37'          115818939    115819539     59         2.0943e-06     47         3.0746e-05
'RP11-35N6.1'    103790491    103791091     60         1.7771e-06     42         6.8037e-05
'DMRT2'            1049854      1050454     54         4.7762e-06     46         3.6016e-05
Serre et al. [1] reported that, in these data sets, approximately 90% of the
uniquely mapped reads fall outside the 5' gene promoter regions. Using
an approach similar to the previous one, you can find genes that have intergenic
methylated regions. To compensate for the varying lengths of the genes, you
can use the maximum coverage, computed base-by-base, instead of the raw
number of mapped short reads. An alternative way to normalize
the counts by gene length is to set the METHOD name-value pair to rpkm
in the getCounts function.
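To illustrate why length normalization matters, the sketch below computes RPKM by hand for three hypothetical genes, using the conventional definition of reads per kilobase of gene per million mapped reads. All numbers are made up, and the sketch does not assert which read total getCounts uses internally.

```matlab
% Hypothetical read counts, gene lengths (bp), and library size.
readCounts  = [50; 50; 200];
geneLength  = [1000; 5000; 20000];
totalMapped = 2e6;

% Reads per kilobase of gene, per million mapped reads.
rpkm = readCounts ./ (geneLength / 1e3) ./ (totalMapped / 1e6);
```

The third gene has four times the raw count of the second, yet both have the same RPKM because the third gene is four times longer.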
intergenic = dataset({a9.Feature,'Gene'});
intergenic.Strand = char(a9.Strand);
intergenic.Start = a9.Start;
intergenic.Stop = a9.Stop;
intergenic.Counts_1 = getCounts(bm_hct116_1,intergenic.Start,intergenic.Sto
'overlap','full','method','max','independent',true);
intergenic.Counts_2 = getCounts(bm_hct116_2,intergenic.Start,intergenic.Sto
'overlap','full','method','max','independent',true);
trun = 10; % Set a truncation threshold
pn1 = rtnbinfit(intergenic.Counts_1(intergenic.Counts_1<trun),trun); % Fit
pn2 = rtnbinfit(intergenic.Counts_2(intergenic.Counts_2<trun),trun); % Fit
intergenic.pval_1 = 1 - nbincdf(intergenic.Counts_1,pn1(1),pn1(2)); % p-val
intergenic.pval_2 = 1 - nbincdf(intergenic.Counts_2,pn2(1),pn2(2)); % p-val
Number_of_sig_genes =
Ratio_of_sig_methylated_genes = Number_of_sig_genes./numGenes
[~,order] = sort(intergenic.pval_1.*intergenic.pval_2);
intergenic(order(1:30),[1 2 3 4 5 7 6 8])
Number_of_sig_genes =
62
Ratio_of_sig_methylated_genes =
0.0775
ans =
Gene            Start        Stop         Counts_1    pval_1        Counts_2    pval_2
'AL772363.1'    140762377    140787022    106         8.3267e-15     98         1.8097e-14
'CACNA1B'       140772241    141019076    106         8.3267e-15     98         1.8097e-14
'SUSD1'         114803065    114937688     88         2.2901e-12    112         1.1102e-16
'C9orf172'      139738867    139741797     99         7.4385e-14     96         3.5083e-14
'NR5A1'         127243516    127269709     86         4.2677e-12     90         2.5391e-13
'BARX1'          96713628     96717654     77         7.0112e-11     62         2.5691e-09
'KCNT1'         138594031    138684992     58         2.5424e-08     73         6.9018e-11
'GABBR2'        101050391    101471479     65         2.9078e-09     58         9.5469e-09
'FOXB2'          79634571     79635869     51         2.2131e-07     58         9.5469e-09
'NDOR1'         140100119    140113813     54         8.7601e-08     55         2.5525e-08
'KIAA1045'       34957484     34984679     50         3.0134e-07     55         2.5525e-08
'ADAMTSL2'      136397286    136440641     55         6.4307e-08     45         6.7163e-07
'PAX5'           36833272     37034476     48          5.585e-07     49         1.8188e-07
'OLFM1'         137967268    138013025     55         6.4307e-08     42         1.7861e-06
'PBX3'          128508551    128729656     45         1.4079e-06     51         9.4566e-08
'FOXE1'         100615536    100618986     49         4.1027e-07     46         4.8461e-07
'MPDZ'           13105703     13279589     51         2.2131e-07     42         1.7861e-06
'ASTN2'         119187504    120177348     43         2.6058e-06     43         1.2894e-06
'ARRDC1'        140500106    140509812     49         4.1027e-07     36         1.2564e-05
'IGFBPL1'        38408991     38424444     45         1.4079e-06     39         4.7417e-06
'LHX3'          139088096    139096955     44         1.9155e-06     36         1.2564e-05
'PAPPA'         118916083    119164601     44         1.9155e-06     35         1.7377e-05
'CNTFR'          34551430     34590121     41         4.8199e-06     37         9.0816e-06
'DMRT3'            976964       991731     40         6.5537e-06     37         9.0816e-06
'TUSC1'          25676396     25678856     46         1.0346e-06     31         6.3417e-05
'ELAVL2'         23690102     23826335     35         3.0371e-05     41         2.4736e-06
'SMARCA2'         2015342      2193624     36         2.2358e-05     40         3.4251e-06
'GAS1'           89559279     89562104     34         4.1245e-05     41         2.4736e-06
'GRIN1'         140032842    140063207     36         2.2358e-05     38         6.5629e-06
'TLE4'           82186688     82341658     36         2.2358e-05     37         9.0816e-06
For instance, explore the methylation profile of the BARX1 gene, the sixth
significant gene with intergenic methylation in the previous list. The GTF
barx1 =
GTFAnnotation with properties:
FieldNames: {1x11 cell}
NumEntries: 18
transcripts =
'ENST00000253968'
'ENST00000401724'
Plot the DNA methylation profile for both HCT116 sample replicates with
base-pair resolution. Overlay the CpG islands and plot the exons for each of
the two transcripts along the bottom of the plot.
range = barx1.getRange;
r1 = range(1)-1000; % set the region limits
r2 = range(2)+1000;
figure
hold on
% plot high-resolution coverage of bm_hct116_1
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
% plot high-resolution coverage of bm_hct116_2
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');
% mark the CpG islands within the [r1 r2] region
for i=1:numel(cpgi.Starts)
In the study by Serre et al., another cell line is also analyzed. New cells
(DICERex5) are derived from the same HCT116 colon cancer cells after
truncating the DICER1 alleles. It has been reported in the literature [5] that
there is a localized change of DNA methylation at a small number of gene
promoters. In this example, you find significant 100-bp windows in two
sample replicates of the DICERex5 cells following the same approach as for the
parental HCT116 cells, and then you search for statistically significant
differences between the two cell lines.
The helper function getWindowCounts captures the same steps used earlier to
find windows with significant coverage. getWindowCounts returns
vectors with counts, p-values, and false discovery rates for each new replicate.
bm_dicer_1 = BioMap('SRR030222.bam','SelectRef','gi|224589821|ref|NC_000009
bm_dicer_2 = BioMap('SRR030223.bam','SelectRef','gi|224589821|ref|NC_000009
[counts_3,pval3,fdr3] = getWindowCounts(bm_dicer_1,4,w,100);
[counts_4,pval4,fdr4] = getWindowCounts(bm_dicer_2,4,w,100);
w3 = fdr3<.01; % logical vector indicating significant windows in DICERex5_
w4 = fdr4<.01; % logical vector indicating significant windows in DICERex5
w34 = w3 & w4; % logical vector indicating significant windows in both repl
Number_of_sig_windows_DICERex5_1 = sum(w3)
Number_of_sig_windows_DICERex5_2 = sum(w4)
Number_of_sig_windows_DICERex5 = sum(w34)
Number_of_sig_windows_DICERex5_1 =
908
Number_of_sig_windows_DICERex5_2 =
1041
Number_of_sig_windows_DICERex5 =
759
To perform a differential analysis you use the 100-bp windows that are
significant in at least one of the samples (either HCT116 or DICERex5).
Use the function manorm to normalize the data. The PERCENTILE name-value
pair lets you filter out windows with a very large number of counts while
normalizing, since these windows are mainly due to artifacts, such as
repetitive regions in the reference chromosome.
counts_norm = round(manorm(counts,'percentile',90).*100);
             Type                         p-value     HCT116        DICERex5
             Intergenic (EXD3)            0.000022     13   13      104  105
             Intragenic                   0.001684     21   21       91   93
             Intragenic                   0.002478    258  257      434  428
             Intergenic (ASTN2)           0.002531    266  270      155  155
             Intergenic (ABCA2)           0.002770     64   63       26   25
126128501    Intergenic (CRB2)            0.002968     94   93      129  130
 71939501    Prox. Promoter (FAM189A2)    0.005178    107  101        0    0
124461001    Intergenic (DAB2IP)          0.005243     77   76       39   37
140086501    Intergenic (TPRN)            0.006053     47   42      123  124
 79637201    Intragenic                   0.006998     52   51       32   31
136470801    Intragenic                   0.006998     52   51       32   31
140918001    Intergenic (CACNA1B)         0.007555    176  169       71   68
100615901    Intergenic (FOXE1)           0.007758    262  253      123  118
 98221901    Intergenic (PTCH1)           0.009231     26   30      104   99
138730601    Intergenic (CAMSAP1)         0.009552     26   21       97   93
 89561701    Intergenic (GAS1)            0.009618     77   76        6   12
   977401    Intergenic (DMRT3)           0.009656    236  245      129  124
 37002601    Intergenic (PAX5)            0.009808    133  127      207  211
139744401    Intergenic (PHPT1)           0.010087     47   46       32   31
126771301    Intragenic                   0.010638     43   46       97   93
 93922501    Intragenic                   0.010672     34   34      149  161
 94187101    Intragenic                   0.010696     73   80        6    6
136044401    Intragenic                   0.010756     39   34      110  105
139611201    Intergenic (FAM69B)          0.010756     39   34      110  105
139716201    Intergenic (C9orf86)         0.010946     73   72      136  130
Plot the DNA methylation profile for the promoter region of gene FAM189A2,
the most significant differentially covered promoter region from the previous
list. Overlay the CpG islands and the FAM189A2 gene.
range = getRange(getSubset(a9,'Feature','FAM189A2'));
r1 = range(1)-1000;
r2 = range(2)+1000;
figure
hold on
% plot high-resolution coverage of all replicates
h1 = plot(r1:r2,getBaseCoverage(bm_hct116_1,r1,r2,'binWidth',1),'b');
h2 = plot(r1:r2,getBaseCoverage(bm_hct116_2,r1,r2,'binWidth',1),'g');
h3 = plot(r1:r2,getBaseCoverage(bm_dicer_1,r1,r2,'binWidth',1),'r');
h4 = plot(r1:r2,getBaseCoverage(bm_dicer_2,r1,r2,'binWidth',1),'m');
Observe that the CpG islands are clearly unmethylated for both of the
DICERex5 replicates.
References
[1] Serre, D., Lee, B.H., and Ting A.H. "MBD-isolated Genome Sequencing
provides a high-throughput and comprehensive survey of DNA methylation in
the human genome", Nucleic Acids Research, 38(2), pp 391-399, 2010.
[2] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. "Ultrafast and
Memory-efficient Alignment of Short DNA Sequences to the Human Genome",
Genome Biology, 10:R25, pp 1-10, 2009.
[3] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,
Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing
Subgroup "The Sequence Alignment/map (SAM) Format and SAMtools",
Bioinformatics, 25, pp 2078-2079, 2009.
[4] Gardiner-Garden, M and Frommer, M. "CpG islands in vertebrate
genomes", J.Mol.Biol. 196, pp 261-282, 1987.
[5] Ting, A.H., Suzuki, H., Cope, L., Schuebel, K.E., Lee, B.H., Toyota, M.,
Imai, K., Shinomura, Y., Tokino, T. and Baylin, S.B. "A Requirement for
DICER to Maintain Full Promoter CpG Island Hypermethylation in Human
Cancer Cells", Cancer Research, 68, 2570, April 15, 2008.
[6] Benjamini, Y., Hochberg, Y., "Controlling the false discovery rate: a
practical and powerful approach to multiple testing", Journal of the Royal
Statistical Society, 57, pp 289-300, 1995.
3
Sequence Analysis
Sequence analysis is the process you use to find information about a nucleotide
or amino acid sequence using computational methods. Common tasks in
sequence analysis are identifying genes, determining the similarity of two
genes, determining the protein coded by a gene, and determining the function
of a gene by finding a similar gene in another organism with a known function.
Exploring a Nucleotide Sequence Using Command Line
Exploring a Nucleotide Sequence Using the Sequence Viewer App
Explore a Protein Sequence Using the Sequence Viewer App
Sequence Alignment
View and Align Multiple Sequences
Overview of Example
After sequencing a piece of DNA, one of the first tasks is to investigate the
nucleotide content in the sequence. Starting with a DNA sequence, this
example uses sequence statistics functions to determine mono-, di-, and
trinucleotide content, and to locate open reading frames.
A separate browser window opens with the home page for the NCBI Web
site.
2 Search the NCBI Web site for information. For example, to search for the
human mitochondrial genome, from the Search list, select Genome, and in
the search box, enter mitochondrion homo sapiens.
3 Select a result page. For example, click the link labeled NC_012920.
The MATLAB Help browser displays the NCBI page for the human
mitochondrial genome.
The load function loads the sequence mitochondria into the MATLAB
Workspace.
3 Get information about the sequence. Type
whos mitochondria
  Name              Size         Bytes  Class    Attributes
  mitochondria      1x16569      33138  char
basecount(mitochondria)
ans =
    A: 5124
    C: 5181
    G: 2169
    T: 4094
seqrcomplement function.
basecount(seqrcomplement(mitochondria))
ans =
    A: 4094
    C: 2169
    G: 5181
    T: 5124
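The relationship between the two sets of counts (the A and T counts swap, as do the C and G counts) can be checked by hand. The following is a minimal sketch in base MATLAB that uses a short toy sequence in place of the mitochondria variable; the sequence and variable names are assumptions for illustration, and the code does not call basecount or seqrcomplement.

```matlab
% Toy sequence standing in for the mitochondria variable (assumption).
seq   = 'ACGTTGCAAC';
bases = 'ACGT';

% Count each base on the forward strand, in the order A, C, G, T.
counts = arrayfun(@(b) sum(seq == b), bases);

% Build the complement using masks taken from the original sequence,
% then reverse it to obtain the reverse complement.
complement = seq;
complement(seq == 'A') = 'T';
complement(seq == 'T') = 'A';
complement(seq == 'C') = 'G';
complement(seq == 'G') = 'C';
rc = fliplr(complement);

% Count each base on the reverse complement strand.
rcCounts = arrayfun(@(b) sum(rc == b), bases);

% The reverse complement counts are the forward counts in reverse order.
disp(isequal(rcCounts, fliplr(counts)))
```

Because the bases are ordered A, C, G, T, swapping A with T and C with G is the same as reversing the vector of counts.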
4 Use the function basecount with the chart option to visualize the
nucleotide distribution.
figure
basecount(mitochondria,'chart','pie');
5 Count the dimers in a sequence and display the information in a bar chart.
figure
dimercount(mitochondria,'chart','bar')
ans =
    AA: 1604
    AC: 1495
    AG: 795
    AT: 1230
    CA: 1534
    CC: 1771
    CG: 435
    CT: 1440
    GA: 613
    GC: 711
    GG: 425
    GT: 419
    TA: 1373
    TC: 1204
    TG: 513
    TT: 1004
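A dimer count tallies every overlapping pair of adjacent nucleotides, so a sequence of length n contributes n-1 dimers. The following base MATLAB sketch shows the idea on a toy sequence; the sequence and variable names are assumptions, and it is not how dimercount itself is implemented.

```matlab
% Toy sequence (assumption); a real run would use the mitochondria variable.
seq = 'ACGTACGA';

% One overlapping dimer per row: characters k and k+1 of the sequence.
dimers = [seq(1:end-1)' seq(2:end)'];

% Tally identical dimers. unique returns the labels in sorted order.
[labels, ~, idx] = unique(cellstr(dimers));
counts = accumarray(idx, 1);

% Display each dimer with its count.
for k = 1:numel(labels)
    fprintf('%s: %d\n', labels{k}, counts(k));
end
```

Unlike dimercount, this sketch lists only the dimers that occur at least once.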
Window, type
codoncount(mitochondria)
AAA - 167     AAC - 171     AAG - 71      AAT - 130
ACA - 137     ACC - 191     ACG - 42      ACT - 153
AGA - 59      AGC - 87      AGG - 51      AGT - 54
ATA - 126     ATC - 131     ATG - 55      ATT - 113
CAA - 146     CAC - 145     CAG - 68      CAT - 148
CCA - 141     CCC - 205     CCG - 49      CCT - 173
CGA - 40      CGC - 54      CGG - 29      CGT - 27
CTA - 175     CTC - 142     CTG - 74      CTT - 101
GAA - 67      GAC - 53      GAG - 49      GAT - 35
GCA - 81      GCC - 101     GCG - 16      GCT - 59
GGA - 36      GGC - 47      GGG - 23      GGT - 28
GTA - 43      GTC - 26      GTG - 18      GTT - 41
TAA - 157     TAC - 118     TAG - 94      TAT - 107
TCA - 125     TCC - 116     TCG - 37      TCT - 103
TGA - 64      TGC - 40      TGG - 29      TGT - 26
TTA - 96      TTC - 107     TTG - 47      TTT - 78
2 Count the codons in all six reading frames and plot the results in heat maps.
for frame = 1:3
    figure
    % Top panel: codons on the forward strand for this frame
    subplot(2,1,1);
    codoncount(mitochondria,'frame',frame,'figure',true,...
        'geneticcode','Vertebrate Mitochondrial');
    title(sprintf('Codons for frame %d',frame));
    % Bottom panel: codons on the reverse strand for this frame
    subplot(2,1,2);
    codoncount(mitochondria,'reverse',true,'frame',frame,...
        'figure',true,'geneticcode','Vertebrate Mitochondrial');
    title(sprintf('Codons for reverse frame %d',frame));
end
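To see why the reading frame changes the counts, it can help to split a short sequence into codons by hand. The sketch below uses base MATLAB only and a toy sequence; all names are assumptions, and it covers the three forward frames (the reverse frames would apply the same steps to the reverse complement).

```matlab
% Toy sequence (assumption), 15 nucleotides long.
seq = 'ATGACCATAACCTAA';

codonTotals = zeros(1,3);
accCounts   = zeros(1,3);
for frame = 1:3
    % Drop the first frame-1 nucleotides, then keep whole codons only.
    sub     = seq(frame:end);
    nCodons = floor(numel(sub)/3);
    % reshape fills column by column, so each row of the transpose
    % is one codon.
    codons  = cellstr(reshape(sub(1:3*nCodons), 3, [])');
    codonTotals(frame) = nCodons;
    accCounts(frame)   = sum(strcmp(codons, 'ACC'));
    fprintf('Frame %d: %d codons, ACC appears %d time(s)\n', ...
        frame, codonTotals(frame), accCounts(frame));
end
```

Shifting the frame by one nucleotide regroups every codon, which is why the heat maps for the three frames can look quite different.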
After you read a sequence into the MATLAB environment, you can analyze
the sequence for open reading frames. This procedure uses the human
mitochondrial genome as an example. See Reading Sequence Information
from the Web on page 3-5.
1 Display open reading frames (ORFs) in a nucleotide sequence. In the
If you compare this output to the genes shown on the NCBI page for
NC_012920, there are fewer genes than expected. This is because vertebrate
mitochondria use a genetic code slightly different from the standard genetic
code. For a list of genetic codes, see the Genetic Code table in the aa2nt
reference page.
2 Display ORFs using the Vertebrate Mitochondrial code.
orfs= seqshoworfs(mitochondria,...
'GeneticCode','Vertebrate Mitochondrial',...
'alternativestart',true);
Notice that there are now two large ORFs on the third reading frame. One
starts at position 4470 and the other starts at 5904. These correspond to
the genes ND2 (NADH dehydrogenase subunit 2) and COX1 (cytochrome c
oxidase subunit I).
3 Find the corresponding stop codon. The Start and Stop fields of the
structure are parallel arrays: the stop position of an ORF has the same index
in Stop as its start position has in Start.
ND2Start = 4470;
StartIndex = find(orfs(3).Start == ND2Start)
ND2Stop = orfs(3).Stop(StartIndex)
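The lookup pattern in this step can be tried without the genome data. The sketch below builds a small, hypothetical stand-in for the structure array returned by seqshoworfs; the positions and the toy sequence are made up for illustration and are not real output.

```matlab
% Hypothetical 1x3 structure array (one element per reading frame).
% The positions are illustrative only.
orfs = struct('Start', {[4 70], 20, [7 40 91]}, ...
              'Stop',  {[33 99], 62, [31 88 130]});

% Toy sequence long enough to index with the positions above.
seq = repmat('ACGT', 1, 40);

% Find which ORF in frame 3 starts at the position of interest,
% then read the stop position stored at the same index.
geneStart  = 40;
startIndex = find(orfs(3).Start == geneStart);
geneStop   = orfs(3).Stop(startIndex);

% Extract the subsequence between the two positions.
geneSeq = seq(geneStart:geneStop);
```

If find returns an empty result, no ORF in that frame starts at the requested position, and the subsequent indexing returns an empty array.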
4 Using the sequence indices for the start and stop of the gene, extract the
codoncount(ND2Seq)
The codon count shows high counts for ACC, ATA, CTA, and ATC.
AAA - 10      AAC - 14      AAG - 2       AAT - 6
ACA - 11      ACC - 24      ACG - 3       ACT - 5
AGA - 0       AGC - 4       AGG - 0       AGT - 1
ATA - 23      ATC - 24      ATG - 1       ATT - 8
CAA - 8       CAC - 3       CAG - 2       CAT - 1
CCA - 4       CCC - 12      CCG - 2       CCT - 5
CGA - 0       CGC - 3       CGG - 0       CGT - 1
CTA - 26      CTC - 18      CTG - 4       CTT - 7
GAA - 5       GAC - 0       GAG - 1       GAT - 0
GCA - 8       GCC - 7       GCG - 1       GCT - 4
GGA - 5       GGC - 7       GGG - 0       GGT - 1
GTA - 3       GTC - 2       GTG - 0       GTT - 3
TAA - 0       TAC - 8       TAG - 0       TAT - 2
TCA - 7       TCC - 11      TCG - 1       TCT - 4
TGA - 10      TGC - 0       TGG - 1       TGT - 0
TTA - 8       TTC - 7       TTG - 1       TTT - 8
6 Look up the amino acids for codons ATA, CTA, ACC, and ATC.
aminolookup('code',nt2aa('ATA'))
aminolookup('code',nt2aa('CTA'))
aminolookup('code',nt2aa('ACC'))
aminolookup('code',nt2aa('ATC'))
isoleucine
leucine
threonine
isoleucine
only the protein-coding sequence between the start and stop codons is
converted.
ND2AASeq = nt2aa(ND2Seq,'geneticcode',...
'Vertebrate Mitochondrial')
LGGLPPLTGFLPKWAIIEEFTKNNSLIIPTIMATITLLNLYFYLRLIYST
SITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTTLLLPISPFMLMIL
2 Compare your conversion with the published conversion in the GenPept
database.
ND2protein = getgenpept('YP_003024027','sequenceonly',true)
The getgenpept function retrieves the published conversion from the NCBI
database and reads it into the MATLAB Workspace.
3 Count the amino acids in the protein sequence.
aacount(ND2AASeq, 'chart','bar')
A bar graph displays. Notice the high content for leucine, threonine and
isoleucine, and also notice the lack of cysteine and aspartic acid.
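What the residue count involves can be sketched in base MATLAB. The example below tallies the first 14 residues of the translated sequence shown earlier; the variable names are assumptions, and the sketch does not call aacount or draw the bar chart.

```matlab
% First 14 residues of the translated ND2 fragment shown above.
frag = 'LGGLPPLTGFLPKW';

% unique returns the distinct residues in sorted order and, for each
% position, the index of its residue in that sorted list.
[residues, ~, idx] = unique(frag(:));
counts = accumarray(idx, 1);

% Display each residue with its count.
for k = 1:numel(residues)
    fprintf('%s: %d\n', residues(k), counts(k));
end
```

Even in this short fragment, leucine (L) is the most frequent residue.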
atomiccomp(ND2AASeq)
molweight (ND2AASeq)
ans =
    C: 1818
    H: 2882
    N: 420
    O: 471
    S: 25
ans =
  3.8960e+004
If this sequence were unknown, you could use this information to identify
the protein by comparing it with the atomic composition of other proteins
in a database.
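The atomic composition and the molecular weight reported above are consistent with each other, which can be checked with a short calculation. The sketch below assumes that the five atom counts are, in order, carbon, hydrogen, nitrogen, oxygen, and sulfur, and it uses rounded standard atomic masses; both are assumptions made for illustration.

```matlab
% Atom counts from the atomiccomp output (element order is an assumption).
comp = struct('C', 1818, 'H', 2882, 'N', 420, 'O', 471, 'S', 25);

% Rounded standard atomic masses in daltons (assumption).
atomicMass = struct('C', 12.011, 'H', 1.008, 'N', 14.007, ...
                    'O', 15.999, 'S', 32.06);

% Sum count * mass over all elements.
elements = fieldnames(comp);
mw = 0;
for k = 1:numel(elements)
    el = elements{k};
    mw = mw + comp.(el) * atomicMass.(el);
end

fprintf('Estimated molecular weight: %.4e\n', mw);
```

The estimate agrees with the molweight result to within rounding of the atomic masses.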
seqviewer
2 To retrieve a sequence from the NCBI database, select File > Download
sequence.
The Sequence Viewer searches and displays the location of the selected
word.
on
the toolbar.
The Sequence Viewer displays the ORFs for the six reading frames in
the lower-right pane. Hover the cursor over a frame to display information
about it.
The ORF is highlighted to indicate the part of the sequence that is selected.
the Export to MATLAB Workspace dialog box, type a variable name, for
example, NM_000520_ORF_2, then click Export.
5 In the left pane, click Full Translation. Select Display > Amino Acid
The Sequence Viewer accesses the NCBI database on the Web and loads
amino acid sequence information for the accession number you entered.
3 Select Display > Amino Acid Color Scheme, and then select Charge,
Color Legend
Charge
    Acidic                   Red
    Basic                    Light Blue
    Neutral                  Black
Function
    Acidic                   Red
    Basic                    Light Blue
    Hydrophobic, nonpolar    Black
    Polar, uncharged         Green
Hydrophobicity
Structure
Taylor
References
[1] Taylor, W.R. (1997). Residual colours: a proposal for aminochromography.
Protein Engineering 10, 7, 743-746.
Sequence Alignment
In this section...
Overview of Example on page 3-38
Find a Model Organism to Study on page 3-38
Retrieve Sequence Information from a Public Database on page 3-41
Search a Public Database for Related Genes on page 3-43
Locate Protein Coding Sequences on page 3-45
Compare Amino Acid Sequences on page 3-49
Overview of Example
Determining the similarity between two sequences is a common task in
computational biology. Starting with a nucleotide sequence for a human gene,
this example uses alignment algorithms to locate and verify a corresponding
gene in a model organism.
The MATLAB Help browser opens with the Tay-Sachs disease page in the
Genes and Diseases section of the NCBI web site. This section provides a
comprehensive introduction to medical genetics. In particular, this page
The gene HEXA codes for the alpha subunit of the dimer enzyme
hexosaminidase A (Hex A), while the gene HEXB codes for the beta subunit
of the enzyme. A third gene, GM2A, codes for the activator protein GM2.
However, it is a mutation in the gene HEXA that causes Tay-Sachs.
The MATLAB Help browser window opens with the NCBI home page.
2 Search for the gene you are interested in studying. For example, from the
Search list, select Nucleotide, and in the for box enter Tay-Sachs.
The search returns entries for the genes that code for the alpha and beta
subunits of the enzyme hexosaminidase A (Hex A), and the gene that codes
for the activator protein. The NCBI reference for the human gene HEXA has
accession number NM_000520.
3 Get sequence data into the MATLAB environment. For example, to get
humanHEXA = getgenbank('NM_000520')
LocusName: 'NM_000520'
LocusSequenceLength: '2255'
LocusNumberofStrands: ''
LocusTopology: 'linear'
LocusMoleculeType: 'mRNA'
LocusGenBankDivision: 'PRI'
LocusModificationDate: '13-AUG-2006'
Definition: 'Homo sapiens hexosaminidase A (alpha polypeptide) (HEXA), mRNA.'
Accession: 'NM_000520'
Version: 'NM_000520.2'
GI: '13128865'
Project: []
Keywords: []
Segment: []
Source: 'Homo sapiens (human)'
SourceOrganism: [4x65 char]
Reference: {1x58 cell}
Comment: [15x67 char]
Features: [74x74 char]
CDS: [1x1 struct]
Sequence: [1x2255 char]
SearchURL: [1x108 char]
RetrieveURL: [1x97 char]
Homologous genes are genes that have a common ancestor and similar
sequences. One goal of searching a public database is to find similar genes.
If you are able to locate a sequence in a database that is similar to your
unknown gene or protein, it is likely that the known and unknown genes
have similar functions and characteristics.
After finding the nucleotide sequence for a human gene, you can do a BLAST
search or search in the genome of another organism for the corresponding
gene. This procedure uses the mouse genome as an example.
1 Open the MATLAB Help browser to the NCBI Web site. In the MATLAB
studying. For example, from the Search list, select Nucleotide, and in the
for box enter hexosaminidase A.
The search returns entries for the mouse and human genomes. The NCBI
reference for the mouse gene HEXA has accession number AK080777.
3 Get sequence information for the mouse gene into the MATLAB
environment. Type
mouseHEXA = getgenbank('AK080777')
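                LocusName: 'AK080777'
      LocusSequenceLength: '1839'
     LocusNumberofStrands: ''
            LocusTopology: 'linear'
        LocusMoleculeType: 'mRNA'
     LocusGenBankDivision: 'HTC'
    LocusModificationDate: '02-SEP-2005'
               Definition: [1x150 char]
                Accession: 'AK080777'
                  Version: 'AK080777.1'
                       GI: '26348756'
                  Project: []
                 Keywords: 'HTC; CAP trapper.'
                  Segment: []
                   Source: 'Mus musculus (house mouse)'
           SourceOrganism: [4x65 char]
                Reference: {1x8 cell}
                  Comment: [8x66 char]
                 Features: [33x74 char]
                      CDS: [1x1 struct]
                 Sequence: [1x1839 char]
                SearchURL: [1x107 char]
              RetrieveURL: [1x97 char]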
1 If you did not retrieve gene data from the Web, you can load example data
contains the position of the start and stop codons for all open reading
frames (ORFs) on each reading frame.
humanORFs =
1x3 struct array with fields:
Start
Stop
The Help browser opens displaying the three reading frames with the
ORFs colored blue, red, and green. Notice that the longest ORF is in the
first reading frame.
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
seqshoworfs creates the structure mouseORFs.
mouseORFs =
1x3 struct array with fields:
Start
Stop
The mouse gene shows the longest ORF on the first reading frame.
and mouse DNA sequences to the amino acid sequences. Because the longest
ORFs of both the human and mouse HEXA genes are in the first reading frame
(the default), you do not need to specify a frame. Type
humanProtein = nt2aa(humanHEXA.Sequence);
mouseProtein = nt2aa(mouseHEXA.Sequence);
2 Draw a dot plot comparing the human and mouse amino acid sequences.
Type
seqdotplot(mouseProtein,humanProtein,4,3)
ylabel('Mouse hexosaminidase A (alpha subunit)')
xlabel('Human hexosaminidase A (alpha subunit)')
Dot plots are one of the easiest ways to look for similarity between
sequences. The diagonal line shown below indicates that there may be a
good alignment between the two sequences.
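The idea behind a windowed dot plot can be sketched in base MATLAB: mark a position when at least a minimum number of residues match within a short diagonal window. The example uses the first 12 residues of the human and mouse protein ORFs shown later in this example, with a window of 4 and a minimum of 3 matches. The variable names are assumptions, and the sketch illustrates the concept rather than the implementation of seqdotplot.

```matlab
% First 12 residues of the human and mouse protein ORFs.
s1 = 'MTSSRLWFSLLL';
s2 = 'MAGCRLWVSLLL';
window     = 4;
minMatches = 3;

% match(i,j) is true when residue i of s1 equals residue j of s2.
match = bsxfun(@eq, s1(:), s2(:)');

% Sum the matches along each diagonal window of the given length.
windowSums = conv2(double(match), eye(window), 'valid');

% A dot is drawn where the window contains enough matches.
dots = windowSums >= minMatches;

% Show the dot matrix as an image.
figure
imagesc(dots)
axis image
colormap(gray)
```

Runs of dots along the main diagonal indicate stretches where the two sequences align without gaps.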
3 Globally align the two amino acid sequences, using the Needleman-Wunsch
algorithm. Type
[GlobalScore, GlobalAlignment] = nwalign(humanProtein,...
mouseProtein);
showalignment(GlobalAlignment)
showalignment displays the global alignment of the two sequences in
the Help browser. Notice that the calculated identity between the two
sequences is 60%.
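One common way to define percent identity is the number of aligned positions with identical residues divided by the alignment length. The sketch below applies that definition to the first 11 columns of the human and mouse rows of the local alignment shown later in this example; the variable names are assumptions, and showalignment may compute its figure differently.

```matlab
% Two rows of an alignment; '-' marks a gap.
a = 'RGDQR-AMTSS';
b = 'RGAGRWAMAGC';

% A column counts as identical when the residues match and neither is a gap.
isGap      = (a == '-') | (b == '-');
identical  = (a == b) & ~isGap;
nIdentical = sum(identical);

% Percent identity over the full alignment length.
pctIdentity = 100 * nIdentical / numel(a);
fprintf('%d of %d positions identical (%.1f%%)\n', ...
    nIdentical, numel(a), pctIdentity);
```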
The alignment is very good between amino acid position 69 and 599, after
which the two sequences appear to be unrelated. Notice that there is a
stop (*) in the sequence at this point. If you shorten the sequences to
include only the amino acids that are in the protein you might get a better
alignment. Include the amino acid positions from the first methionine (M) to
the first stop (*) that occurs after the first methionine.
4 Trim the sequence from the first start amino acid (usually M) to the first
stop (*) and then try alignment again. Find the indices for the stops in
the sequences.
humanStops = find(humanProtein == '*')
humanStops =
    41   599   611   713   722   730
557
574
606
the stop.
humanProteinORF = humanProtein(70:humanStops(2))
humanProteinORF =
MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDV
SSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVV
TPGCNQLPTLESVENYTLTINDDQCLLLSETVWGALRGLETFSQLVWKSA
EGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLSSILDTLDVMAYNKLNV
FHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEYARLRG
IRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEF
MSTFFLEVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQ
LESFYIQTLLDIVSSYGKGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNY
MKELELVTKAGFRALLSAPWYLNRISYGPDWKDFYIVEPLAFEGTPEQKA
LVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKLTSDLTFAYERL
SHFRCELLRRGVQAQPLNVGFCEQEFEQT*
mouseProteinORF = mouseProtein(11:mouseStops(1))
mouseProteinORF =
MAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYTLYPNNFQFRYHV
SSAAQAGCVVLDEAFRRYRNLLFGSGSWPRPSFSNKQQTLGKNILVVSVV
TAECNEFPNLESVENYTLTINDDQCLLASETVWGALRGLETFSQLVWKSA
EGTFFINKTKIKDFPRFPHRGVLLDTSRHYLPLSSILDTLDVMAYNKFNV
FHWHLVDDSSFPYESFTFPELTRKGSFNPVTHIYTAQDVKEVIEYARLRG
IRVLAEFDTPGHTLSWGPGAPGLLTPCYSGSHLSGTFGPVNPSLNSTYDF
MSTLFLEISSVFPDFYLHLGGDEVDFTCWKSNPNIQAFMKKKGFTDFKQL
ESFYIQTLLDIVSDYDKGYVVWQEVFDNKVKVRPDTIIQVWREEMPVEYM
LEMQDITRAGFRALLSAPWYLNRVKYGPDWKDMYKVEPLAFHGTPEQKAL
VIGGEACMWGEYVDSTNLVPRLWPRAGAVAERLWSSNLTTNIDFAFKRLS
HFRCELVRRGIQAQPISVGCCEQEFEQT*
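The trimming rule described in this step (from the first methionine to the first stop that follows it) can also be written without hard-coded positions. The following base MATLAB sketch uses a toy protein string; the string and variable names are assumptions for illustration.

```matlab
% Toy translated sequence (assumption); '*' marks a stop.
protein = 'RG*AQMTSSRLW*KLM*';

% Position of the first methionine.
firstM = find(protein == 'M', 1);

% Positions of all stops, then the first stop after that methionine.
stops     = find(protein == '*');
firstStop = stops(find(stops > firstM, 1));

% Keep the residues from the methionine through the stop.
proteinORF = protein(firstM:firstStop);
```

In this toy string the first stop precedes the first methionine, so simply taking the first element of stops would give the wrong end position.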
6 Globally align the trimmed amino acid sequences. Type
7 Another way to truncate an amino acid sequence to only those amino acids
in the protein is to first truncate the nucleotide sequence with indices from
the seqshoworfs function. Remember that the ORF for the human HEXA
gene and the ORF for the mouse HEXA were both on the first reading
frame.
humanORFs = seqshoworfs(humanHEXA.Sequence)
humanORFs =
1x3 struct array with fields:
Start
Stop
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
mouseORFs =
1x3 struct array with fields:
Start
Stop
humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):...
humanORFs(1).Stop(1)));
mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):...
mouseORFs(1).Stop(1)));
[GlobalScore2, GlobalAlignment2] = nwalign(humanPORF, mousePORF);
algorithm. Type
[LocalScore, LocalAlignment] = swalign(humanProtein,...
mouseProtein)
LocalScore =
1057
LocalAlignment =
RGDQR-AMTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYV . . .
|| | ||:: ||| |||||||:| ||||||||| :|| :||: . . .
RGAGRWAMAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYT . . .
9 Show the alignment in color.
showalignment(LocalAlignment)
load primates.mat
2 Create a phylogenetic tree.
phytreeviewer(tree)
2 Click the branches to prune (remove) from the tree. For this example, click
3 Export the selected branches to a second tree. Select File > Export to
ma = multialign(primates2);
2 View the aligned sequences in the Sequence Alignment app.
seqalignviewer(ma);
2 Click a letter to select it, and then move the cursor over the red direction
3 Click and drag the sequence to the right to insert a gap. If there is a gap to
the left, you can also move the sequence to the left and eliminate the gap.
Alternatively, to insert a gap, select a character, and then click the Insert
Gap icon on the toolbar or press the spacebar.
Note You cannot delete or add letters to a sequence, but you can add or
delete gaps. If all of the sequences at one alignment position have gaps,
you can delete that column of gaps.
4
Microarray Analysis
Managing Gene Expression Data in Objects on page 4-2
Representing Expression Data Values in DataMatrix Objects on page 4-5
Representing Expression Data Values in ExptData Objects on page 4-11
Representing Sample and Feature Metadata in MetaData Objects on
page 4-15
Representing Experiment Information in a MIAME Object on page 4-22
Representing All Data in an ExpressionSet Object on page 4-27
Visualizing Microarray Images on page 4-33
Analyzing Gene Expression Profiles on page 4-57
Detecting DNA Copy Number Alteration in Array-Based CGH Data on
page 4-72
Exploring Gene Expression Data on page 4-85
rows and first four columns of the yeastvalues matrix, the genes cell
array, and the times vector.
yeastvalues = yeastvalues(1:5,1:4);
genes = genes(1:5,:);
times = times(1:4);
3 Import the microarray object package so that the DataMatrix constructor
object from the gene expression data in the variables you created in step 2.
dmo = DataMatrix(yeastvalues,genes,times)
dmo =
                    0       9.5      11.5      13.5
    SS DNA     -0.131     1.699    -0.026     0.365
    YAL003W     0.305     0.146    -0.129    -0.444
    YAL012W     0.157     0.175     0.467    -0.379
    YAL026C     0.246     0.796     0.384     0.981
    YAL034C    -0.235     0.487    -0.184    -0.669
get(dmo)
            Name: ''
        RowNames: {5x1 cell}
        ColNames: {'   0'  ' 9.5'  '11.5'  '13.5'}
           NRows: 5
           NCols: 4
           NDims: 2
    ElementClass: 'double'
2 Use the set method to specify a name for the DataMatrix object, dmo.
dmo = set(dmo,'Name','MyDMObject');
3 Use the get method again to display the properties of the DataMatrix
object, dmo.
get(dmo)
            Name: 'MyDMObject'
        RowNames: {5x1 cell}
        ColNames: {'   0'  ' 9.5'  '11.5'  '13.5'}
           NRows: 5
           NCols: 4
           NDims: 2
    ElementClass: 'double'
Parentheses () Indexing
Use parenthesis indexing to extract a subset of the data in dmo and assign
it to a new DataMatrix object dmo2:
dmo2 = dmo(1:5,2:3)
dmo2 =
                  9.5      11.5
    SS DNA      1.699    -0.026
    YAL003W     0.146    -0.129
    YAL012W     0.175     0.467
    YAL026C     0.796     0.384
    YAL034C     0.487    -0.184
Use parenthesis indexing to extract a subset of the data using row names and
column names, and assign it to a new DataMatrix object dmo3:
dmo3 = dmo({'SS DNA','YAL012W','YAL034C'},'11.5')
dmo3 =
                 11.5
    SS DNA     -0.026
    YAL012W     0.467
    YAL034C    -0.184
Note If you use a cell array of row names or column names to index into a
DataMatrix object, the names in the cell array must be unique, even though
the row names or column names within the DataMatrix object need not be
unique.
                  9.5      11.5
    SS DNA        1.7     -0.03
    YAL003W      0.15     -0.13
    YAL012W     0.175     0.467
    YAL026C     0.796     0.384
    YAL034C     0.487    -0.184

                  9.5      11.5
    YAL012W     0.175     0.467
    YAL026C     0.796     0.384
    YAL034C     0.487    -0.184
Dot . Indexing
Note In the following examples, notice that when using dot indexing with
DataMatrix objects, you specify all rows or all columns using a colon within
single quotation marks, (':').
Use dot indexing to extract the data from the 11.5 column only of dmo:
timeValues = dmo.(':')('11.5')
timeValues =
-0.0260
-0.1290
0.4670
0.3840
-0.1840
Use dot indexing to assign new data to a subset of the elements in dmo:
dmo.(1:2)(':') = 7
dmo =
                    0       9.5      11.5      13.5
    SS DNA          7         7         7         7
    YAL003W         7         7         7         7
    YAL012W     0.157     0.175     0.467    -0.379
    YAL026C     0.246     0.796     0.384     0.981
    YAL034C    -0.235     0.487    -0.184    -0.669
7
7
0.157
0.246
9.5
7
7
0.175
0.796
11.5
13.5
7
7
0.467
0.384
7
7
-0.379
0.981
7
7
0.157
0.246
13.5
7
7
-0.379
0.981
2.26
158.86
68.11
74.32
75.05
80.36
216.64
B
20.14
236.25
105.45
96.68
53.17
42.89
191.32
C
31.66
206.27
82.92
84.87
57.94
77.21
219.48
An ExptData object lets you store, manage, and subset the data values from a
microarray experiment. An ExptData object includes properties and methods
that let you access, retrieve, and change data values from a microarray
experiment. These properties and methods are useful to view and analyze the
data. For a list of the properties and methods, see ExptData class.
EDObj
Experiment Data:
500 features, 26 samples
1 elements
Element names: Elmt1
Note Property names are case sensitive. For a list and description of all
properties of an ExptData object, see ExptData class.
or
methodname(objectname)
'B'
'C'
'D'
'E'
'F'
'G'
'H'
'I'
...
ans =
500
26
References
[1] Hovatta, I., Tennant, R.S., Helton, R., et al. (2005). Glyoxalase 1 and
glutathione reductase 1 regulate anxiety in mice. Nature 438, 662-666.
         Gender    Age    Type           Strain             Source
    A    'Male'    8      'Wild type'    '129S6/SvEvTac'    'amygdala'
    B    'Male'    8      'Wild type'    '129S6/SvEvTac'    'amygdala'
    C    'Male'    8      'Wild type'    '129S6/SvEvTac'    'amygdala'
    D    'Male'    8      'Wild type'    'A/J '             'amygdala'
    E    'Male'    8      'Wild type'    'A/J '             'amygdala'
    F    'Male'    8      'Wild type'    'C57BL/6J '        'amygdala'

              VariableDescription
    id        'Sample identifier'
    Gender    'Gender of the mouse in study'
    Age       'The number of weeks since mouse birth'
    Type      'Genetic characters'
    Strain    'The mouse strain'
    Source    'The tissue source for RNA collection'
A MetaData object lets you store, manage, and subset the metadata from a
microarray experiment. A MetaData object includes properties and methods
that let you access, retrieve, and change metadata from a microarray
experiment. These properties and methods are useful to view and analyze the
metadata. For a list of the properties and methods, see MetaData class
is available.
import bioma.data.*
2 Load some sample data, which includes Fisher's iris data of 5 measurements
load fisheriris
3 Create a dataset array from some of Fisher's iris data. The dataset
array will contain 750 measured values, one for each of 150 samples (iris
replicates) at five variables (species, SL, SW, PL, PW). In this dataset array,
the rows correspond to samples, and the columns correspond to variables.
irisValues = dataset({nominal(species),'species'}, ...
{meas, 'SL', 'SW', 'PL', 'PW'});
4 Create another dataset array containing a list of the variable names
and their descriptions. This dataset array will contain five rows, one for
each of the five variables: species, SL, SW, PL, and PW. The
first column will contain the variable name. The second column will have
a column header of VariableDescription and contain a description of
the variable.
% Create 5-by-1 cell array of description text for the variables
varDesc = {'Iris species', 'Sepal Length', 'Sepal Width', ...
'Petal Length', 'Petal Width'}';
% Create the dataset array from the variable descriptions
irisVarDesc = dataset(varDesc, ...
'ObsNames', {'species','SL','SW','PL','PW'}, ...
'VarNames', {'VariableDescription'})
irisVarDesc =
               VariableDescription
    species    'Iris species'
    SL         'Sepal Length'
    SW         'Sepal Width'
    PL         'Petal Length'
    PW         'Petal Width'
is available.
import bioma.data.*
2 View the mouseSampleData.txt file included with the Bioinformatics
Toolbox software.
Note that this text file contains two tables. One table contains 130
measured values, one for each of 26 samples (A through Z) at five variables
(Gender, Age, Type, Strain, and Source). In this table, the rows correspond
to samples, and the columns correspond to variables. The second table has
lines prefaced by the # symbol. It contains one row for each of the five
variables (Gender, Age, Type, Strain, and Source), plus a row for the sample
identifier. The first column contains the variable name. The second column
has a column header of VariableDescription and contains a description of
the variable.
# id: Sample identifier
# Gender: Gender of the mouse in study
# Age: The number of weeks since mouse birth
# Type: Genetic characters
# Strain: The mouse strain
# Source: The tissue source for RNA collection
ID Gender Age Type Strain Source
A Male 8 Wild type 129S6/SvEvTac amygdala
B Male 8 Wild type 129S6/SvEvTac amygdala
C Male 8 Wild type 129S6/SvEvTac amygdala
D Male 8 Wild type A/J amygdala
E Male 8 Wild type A/J amygdala
F Male 8 Wild type C57BL/6J amygdala
G Male 8 Wild type C57BL/6J amygdala
H Male 8 Wild type 129S6/SvEvTac cingulate cortex
I Male 8 Wild type 129S6/SvEvTac cingulate cortex
J Male 8 Wild type A/J cingulate cortex
K Male 8 Wild type A/J cingulate cortex
L Male 8 Wild type A/J cingulate cortex
M Male 8 Wild type C57BL/6J cingulate cortex
N Male 8 Wild type C57BL/6J cingulate cortex
O Male 8 Wild type 129S6/SvEvTac hippocampus
P Male 8 Wild type 129S6/SvEvTac hippocampus
Q Male 8 Wild type A/J hippocampus
R Male 8 Wild type A/J hippocampus
S Male 8 Wild type C57BL/6J hippocampus
T Male 8 Wild type C57BL/6J4 hippocampus
U Male 8 Wild type 129S6/SvEvTac hypothalamus
V Male 8 Wild type 129S6/SvEvTac hypothalamus
W Male 8 Wild type A/J hypothalamus
X Male 8 Wild type A/J hypothalamus
Y Male 8 Wild type C57BL/6J hypothalamus
Z Male 8 Wild type C57BL/6J hypothalamus
file.
MDObj2 = MetaData('File', 'mouseSampleData.txt', 'VarDescChar', '#')
Sample Names:
A, B, ...,Z (26 total)
Variable Names and Meta Information:
VariableDescription
Gender
Age
Type
Strain
Source
ans =
5
Note Property names are case sensitive. For a list and description of all
properties of a MetaData object, see MetaData class.
or
methodname(objectname)
For example, to access the dataset array in a MetaData object that contains
the variable values:
MDObj2.variableValues;
To access the dataset array of a MetaData object that contains the variable
descriptions:
variableDesc(MDObj2)
ans =
              VariableDescription
    Gender    ' Gender of the mouse in study'
    Age       ' The number of weeks since mouse birth'
    Type
    Strain
    Source
is available.
import bioma.data.*
2 Use the getgeodata function to return a MATLAB structure containing
structure.
MIAMEObj1 = MIAME(geoStruct);
4 Display information about the MIAME object, MIAMEObj1.
MIAMEObj1
MIAMEObj1 =
Experiment Description:
Author name: Mika,,Silvennoinen
Riikka,,Kivel
Maarit,,Lehti
Anna-Maria,,Touvras
Jyrki,,Komulainen
Veikko,,Vihko
Heikki,,Kainulainen
Laboratory: LIKES - Research Center
Contact information: Mika,,Silvennoinen
URL:
PubMedIDs: 17003243
is available.
import bioma.data.*
2 Use the MIAME constructor function to create a MIAME object using
individual properties.
MIAMEObj2 = MIAME('investigator', 'Jane Researcher',...
'lab', 'One Bioinformatics Laboratory',...
'contact', 'jresearcher@lab.not.exist',...
'url', 'www.lab.not.exist',...
'title', 'Normal vs. Diseased Experiment',...
'abstract', 'Example of using expression data',...
'other', {'Notes:Created from a text file.'});
Note Property names are case sensitive. For a list and description of all
properties of a MIAME object, see MIAME class.
or
methodname(objectname)
Note For a complete list of methods of a MIAME object, see MIAME class.
is available.
import bioma.*
2 Construct an ExpressionSet object from EDObj, an ExptData object, MDObj2,
ESObj
ExpressionSet
Experiment Data: 500 features, 26 samples
Element names: Expressions
Sample Data:
Sample names:
A, B, ...,Z (26 total)
Sample variable names and meta information:
Gender: Gender of the mouse in study
Age: The number of weeks since mouse birth
Type: Genetic characters
Strain: The mouse strain
Source: The tissue source for RNA collection
Feature Data: none
Experiment Information: use 'exptInfo(obj)'
Note Property names are case sensitive. For a list and description of all
properties of an ExpressionSet object, see ExpressionSet class.
objectname.methodname
or
methodname(objectname)
'Age'
'Type'
'Strain'
'Source'
The microarray data is also available on the Gene Expression Omnibus Web
site at
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30
The GenePix GPR-formatted file mouse_a1pd.gpr contains the data for one
of the microarrays used in the study. This is data from voxel A1 of the brain
of a mouse in which a pharmacological model of Parkinson's disease (PD)
was induced using methamphetamine. The voxel sample was labeled with
Cy3 (green) and the control, RNA from a total (not voxelated) normal mouse
brain, was labeled with Cy5 (red). GPR-formatted files provide a large amount
of information about the array, including the mean, median, and standard
         Header: [1x1 struct]
           Data: [9504x38 double]
         Blocks: [9504x1 double]
        Columns: [9504x1 double]
           Rows: [9504x1 double]
          Names: {9504x1 cell}
            IDs: {9504x1 cell}
    ColumnNames: {38x1 cell}
        Indices: [132x72 double]
          Shape: [1x1 struct]
example, you can access the field ColumnNames of the structure pd by typing
pd.ColumnNames
'F635 Median'
'F635 Mean'
'F635 SD'
'B635 Median'
'B635 Mean'
'B635 SD'
'% > B635+1SD'
'% > B635+2SD'
'F635 % Sat.'
'F532 Median'
'F532 Mean'
'F532 SD'
'B532 Median'
'B532 Mean'
'B532 SD'
'% > B532+1SD'
'% > B532+2SD'
'F532 % Sat.'
'Ratio of Medians'
'Ratio of Means'
'Median of Ratios'
'Mean of Ratios'
'Ratios SD'
'Rgn Ratio'
'Rgn R'
'F Pixels'
'B Pixels'
'Sum of Medians'
'Sum of Means'
'Log Ratio'
'F635 Median - B635'
'F532 Median - B532'
'F635 Mean - B635'
'F532 Mean - B532'
'Flags'
3 Access the names of the genes. For example, to list the first 20 gene names,
type
pd.Names(1:20)
The MATLAB software plots an image showing the median pixel values for
the foreground of the red (Cy5) channel.
2 Plot the median values for the green channel. For example, to plot data
The MATLAB software plots an image showing the median pixel values of
the foreground of the green (Cy3) channel.
3 Plot the median values for the red background. The field B635 Median
shows the median values for the background of the red channel.
figure
maimage(pd,'B635 Median')
The MATLAB software plots an image for the background of the red
channel. Notice the very high background levels down the right side of
the array.
4 Plot the median values for the green background. The field B532 Median
shows the median values for the background of the green channel.
figure
maimage(pd,'B532 Median')
The MATLAB software plots an image for the background of the green
channel.
5 The first array was for the Parkinson's disease model mouse. Now read in
the data for the same brain voxel but for the untreated control mouse. In
this case, the voxel sample was labeled with Cy3 and the control, total
brain (not voxelated), was labeled with Cy5.
wt = gprread('mouse_a1wt.gpr')
wt =
         Header: [1x1 struct]
           Data: [9504x38 double]
         Blocks: [9504x1 double]
        Columns: [9504x1 double]
           Rows: [9504x1 double]
          Names: {9504x1 cell}
            IDs: {9504x1 cell}
    ColumnNames: {38x1 cell}
        Indices: [132x72 double]
          Shape: [1x1 struct]
and background. You can use the function subplot to put all the plots
onto one figure.
figure
subplot(2,2,1);
maimage(wt,'F635 Median')
subplot(2,2,2);
maimage(wt,'F532 Median')
subplot(2,2,3);
maimage(wt,'B635 Median')
subplot(2,2,4);
maimage(wt,'B532 Median')
7 If you look at the scale for the background images, you will notice that the
background levels are much higher than those for the PD mouse and there
appears to be something nonrandom affecting the background of the Cy3
channel of this slide. Changing the colormap can sometimes provide more
insight into what is going on in pseudocolor plots. For more control over the
color, try the colormapeditor function.
colormap hot
b532Data = wt.Data(:,b532MedCol);
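The line above indexes wt.Data with a column number held in b532MedCol. One way such a column number could be looked up from a column name is sketched below with a small, hypothetical stand-in for the structure returned by gprread; the field contents are made up, and only the field names ColumnNames and Data are taken from the structure display shown earlier.

```matlab
% Hypothetical miniature of a gprread structure (contents are made up).
wt = struct('ColumnNames', {{'X'; 'Y'; 'F532 Median'; 'B532 Median'}}, ...
            'Data', magic(4));

% Find the position of the column whose name matches exactly.
b532MedCol = find(strcmp(wt.ColumnNames, 'B532 Median'));

% Extract that column from the data matrix.
b532Data = wt.Data(:, b532MedCol);
```

strcmp requires an exact match, including capitalization and spacing, so a misspelled column name yields an empty index rather than an error.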
figure
subplot(1,2,1);
imagesc(b532Data(wt.Indices))
axis image
colorbar
title('B532 Median')
11 Bound the intensities of the background plot to give more contrast in the
image.
maskedData = b532Data;
maskedData(b532Data<500) = 500;
maskedData(b532Data>2000) = 2000;
subplot(1,2,2);
imagesc(maskedData(wt.Indices))
axis image
colorbar
title('Enhanced B532 Median')
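Bounding the intensities in this way is a clipping operation: values below the lower bound are raised to it, and values above the upper bound are lowered to it. The sketch below shows the masking approach next to an equivalent one-line form using min and max on toy values; the data and variable names are assumptions.

```matlab
% Toy background intensities (assumption).
data  = [120 480 950 1500 2600 3100];
lower = 500;
upper = 2000;

% Masking approach, as in the step above.
maskedData = data;
maskedData(data < lower) = lower;
maskedData(data > upper) = upper;

% Equivalent clipping with max and min.
clippedData = min(max(data, lower), upper);

disp(isequal(maskedData, clippedData))
```

Clipping narrows the color scale to the range of interest, at the cost of hiding any variation outside the bounds.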
From the box plots you can clearly see the spatial effects in the background
intensities. Block numbers 1, 3, 5, and 7 are on the left side of the
arrays, and numbers 2, 4, 6, and 8 are on the right side. The data must be
normalized to remove this spatial bias.
function maloglog is used to do this. Points that are above the diagonal in
this plot correspond to genes that have higher expression levels in the A1
voxel than in the brain as a whole.
figure
maloglog(cy5Data,cy3Data)
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
The MATLAB software displays the following messages and plots the
images.
Warning: Zero values are ignored
(Type "warning off Bioinfo:MaloglogZeroValues" to suppress
this warning.)
Warning: Negative values are ignored.
(Type "warning off Bioinfo:MaloglogNegativeValues" to suppress
this warning.)
Notice that this function gives some warnings about negative and zero
elements. This is because some of the values in the 'F635 Median - B635'
and 'F532 Median - B532' columns are zero or even less than zero. Spots
where this happened might be bad spots or spots that failed to hybridize.
Points with positive, but very small, differences between foreground and
background should also be considered to be bad spots.
3 Disable the display of warnings by using the warning command. Although
figure
maloglog(cy5Data,cy3Data) % Create the loglog plot
warning(warnState); % Reset the warning state.
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
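The save, disable, and restore pattern behind warnState can be sketched as follows. This is an illustration under the assumption that warnState was obtained by calling warning with no arguments; the two message identifiers are the ones printed in the warning text earlier in this procedure, and the variable zeroState is introduced here only to inspect the result.

```matlab
% Save the current warning state so that it can be restored later.
warnState = warning;
% Turn off only the two maloglog messages shown earlier.
warning('off', 'Bioinfo:MaloglogZeroValues');
warning('off', 'Bioinfo:MaloglogNegativeValues');
% Query one identifier to confirm that it is now disabled.
zeroState = warning('query', 'Bioinfo:MaloglogZeroValues');
disp(zeroState.state)
% ... call maloglog here ...
% Restore the saved warning state.
warning(warnState);
```

Disabling individual identifiers, rather than all warnings, keeps unrelated problems visible while the plot is created.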
the bad spots from the data set. You can do this by finding points where
either the red or green channel has values less than or equal to a threshold
value. For example, use a threshold value of 10.
threshold = 10;
badPoints = (cy5Data <= threshold) | (cy3Data <= threshold);
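A logical mask such as badPoints has to be applied to every variable that describes the spots, or the intensities and the labels fall out of step. The sketch below shows this on made-up data; redVals, greenVals, and spotIDs are hypothetical stand-ins for cy5Data, cy3Data, and wt.IDs.

```matlab
% Hypothetical background-subtracted intensities and spot IDs.
redVals   = [250; 8; 1200; 0; 640; -15];
greenVals = [300; 400; 5; 90; 700; 50];
spotIDs   = {'g1'; 'g2'; 'g3'; 'g4'; 'g5'; 'g6'};
threshold = 10;
% A spot is bad if either channel is at or below the threshold.
badSpots = (redVals <= threshold) | (greenVals <= threshold);
% Apply the same mask to all three variables so they stay aligned.
redKept   = redVals(~badSpots);
greenKept = greenVals(~badSpots);
idsKept   = spotIDs(~badSpots);
disp(idsKept')
```

Only the spots g1 and g5 pass both channel checks in this example.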
5 You can then remove these points and redraw the loglog plot.
This plot shows the distribution of points but does not give any indication
about which genes correspond to which points.
6 Add gene labels to the plot. Because some of the data points have
been removed, the corresponding gene IDs must also be removed from
the data set before you can use them. The simplest way to do that is
wt.IDs(~badPoints).
maloglog(cy5Data,cy3Data,'labels',wt.IDs(~badPoints),...
'factorlines',2)
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
You will see the gene ID associated with the point. Most of the outliers are
below the y = x line. In fact, most of the points are below this line. Ideally
the points should be evenly distributed on either side of this line.
8 Normalize the points to evenly distribute them on either side of the line.
If you plot the normalized data you will see that the points are more evenly
distributed about the y = x line.
figure
maloglog(normcy5,normcy3,'labels',wt.IDs(~badPoints),...
'factorlines',2)
xlabel('F635 Median - B635 (Control)');
ylabel('F532 Median - B532 (Voxel A1)');
9 The function mairplot is used to create an Intensity vs. Ratio plot for the
normalized data. This function works in the same way as the function
maloglog.
figure
mairplot(normcy5,normcy3,'labels',wt.IDs(~badPoints),...
'factorlines',2)
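An intensity versus ratio plot shows, for each spot, a log ratio of the two channels against a measure of overall intensity. The sketch below uses one common convention (M as the log2 ratio, A as the mean log2 intensity) on hypothetical values; the exact axis definitions that mairplot uses may differ, so treat the formulas as an assumption.

```matlab
x = [100; 400; 1600];   % hypothetical control intensities
y = [200; 400; 400];    % hypothetical sample intensities
% Log ratio: positive when the sample channel is brighter.
M = log2(y ./ x);
% Mean log intensity of the two channels.
A = (log2(x) + log2(y)) / 2;
disp([A M])
```

A spot with equal intensities in both channels has M equal to 0, so well-normalized data should scatter around the horizontal zero line.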
10 You can click the points in this plot to see the name of the gene associated
load yeastdata.mat
2 Get the size of the data by typing
numel(genes)
The number of genes in the data set displays in the MATLAB Command
Window. The MATLAB variable genes is a cell array of the gene names.
ans =
6400
3 Access the entries using cell array indexing.
genes{15}
This displays the 15th entry of the variable genes, the open reading frame
(ORF) YAL054C. The 15th row of the variable yeastvalues contains the
expression levels for this ORF.
ans =
YAL054C
4 Use the function web to access information about this ORF in the
plot(times, yeastvalues(15,:))
xlabel('Time (Hours)');
ylabel('Log2 Relative Expression Level');
The MATLAB software plots the figure. The values are log2 ratios.
plot(times, 2.^yeastvalues(15,:))
xlabel('Time (Hours)');
ylabel('Relative Expression Level');
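Raising 2 to the power of a log2 ratio recovers the linear-scale ratio. A minimal sketch with hypothetical values:

```matlab
log2Ratios = [-1 0 1 3];          % hypothetical log2 expression ratios
foldChanges = 2 .^ log2Ratios;    % convert back to linear scale
disp(foldChanges)
```

A log2 ratio of 3 corresponds to an 8-fold increase, and a log2 ratio of -1 to a halving.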
The MATLAB software plots the figure. The gene associated with this
ORF, ACS1, appears to be strongly up-regulated during the diauxic shift.
hold on
plot(times, 2.^yeastvalues(16:26,:)')
xlabel('Time (Hours)');
ylabel('Relative Expression Level');
title('Profile Expression Levels');
Filtering Genes
This procedure illustrates how to filter the data by removing genes that are
not expressed or do not change. The data set is quite large and a lot of the
information corresponds to genes that do not show any interesting changes
during the experiment. To make it easier to find the interesting genes, reduce
the size of the data set by removing genes with expression profiles that do not
show anything of interest. There are 6400 expression profiles. You can use
a number of techniques to reduce the number of expression profiles to some
subset that contains the most significant genes.
1 If you look through the gene list you will see several spots marked as
'EMPTY'. These are empty spots on the array, and while they might have
data associated with them, for the purposes of this example, you can
consider these points to be noise. These points can be found using the
strcmp function and removed from the data set with indexing commands.
emptySpots = strcmp('EMPTY',genes);
yeastvalues(emptySpots,:) = [];
genes(emptySpots) = [];
numel(genes)
In the yeastvalues data you will also see several places where the
expression level is marked as NaN. This indicates that no data was collected
for this spot at the particular time step. One approach to dealing with
these missing values would be to impute them using the mean or median of
data for the particular gene over time. This example uses a less rigorous
approach of simply throwing away the data for any genes where one or
more expression levels were not measured.
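Discarding every gene that has at least one missing value can be sketched as follows. The matrix vals and the cell array names are hypothetical stand-ins for yeastvalues and genes.

```matlab
% Hypothetical expression matrix (genes in rows, time points in columns).
vals  = [0.1 0.2 0.3; NaN 0.5 0.6; 0.7 0.8 NaN; 1.0 1.1 1.2];
names = {'geneA'; 'geneB'; 'geneC'; 'geneD'};
% Flag rows that contain at least one NaN.
hasMissing = any(isnan(vals), 2);
% Remove the flagged rows from both the data and the names.
vals(hasMissing, :) = [];
names(hasMissing) = [];
disp(names')
```

Removing rows from both variables with the same index keeps each gene name paired with its expression profile.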
2 Use the isnan function to identify the genes with missing data and then
If you were to plot the expression profiles of all the remaining profiles,
you would see that most profiles are flat and not significantly different
from the others. This flat data is obviously of use as it indicates that the
genes associated with these profiles are not significantly affected by the
diauxic shift. However, in this example, you are interested in the genes
with large changes in expression accompanying the diauxic shift. You can
use filtering functions in the toolbox to remove genes with various types
over time. The function returns a logical array of the same size as the
variable genes with ones corresponding to rows of yeastvalues with
variance greater than the 10th percentile and zeros corresponding to those
below the threshold.
mask = genevarfilter(yeastvalues);
% Use the mask as an index into the values to remove the
% filtered genes.
yeastvalues = yeastvalues(mask,:);
genes = genes(mask);
numel(genes)
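The idea behind a variance filter can be illustrated without the toolbox. The sketch below is an approximation and not the genevarfilter implementation: it takes the cutoff as the variance at the 10% rank of the sorted values, and all data are hypothetical.

```matlab
% Hypothetical expression profiles (genes in rows).
vals = [1 1 1 1;
        0 2 0 2;
        5 5 5 6;
        0 4 8 12;
        1 3 1 3;
        2 2 2 9;
        0 1 0 1;
        3 0 3 0;
        7 1 7 1;
        4 4 0 0];
rowVar = var(vals, 0, 2);                 % variance of each profile
sortedVar = sort(rowVar);
pct = 10;                                  % percentile used as the cutoff
cutoffRank = ceil(numel(rowVar) * pct / 100);
cutoff = sortedVar(cutoffRank);
keep = rowVar > cutoff;                    % logical mask, like the toolbox mask
fprintf('%d of %d profiles kept\n', sum(keep), numel(keep));
```

Here only the completely flat first profile falls at or below the cutoff, so nine of the ten profiles are kept.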
absolute expression values. Note that the gene filter functions can also
automatically calculate the filtered data and names.
[mask, yeastvalues, genes] = genelowvalfilter(yeastvalues,genes,...
'absval',log2(4));
numel(genes)
low entropy:
[mask, yeastvalues, genes] = geneentropyfilter(yeastvalues,genes,...
'prctile',15);
numel(genes)
310
Clustering Genes
Now that you have a manageable list of genes, you can look for relationships
between the profiles using some different clustering techniques from the
Statistics Toolbox software.
1 For hierarchical clustering, the function pdist calculates the pairwise
Again, 16 clusters are found, but because the algorithm is different these
are not necessarily the same clusters as those found by hierarchical
clustering.
[cidx, ctrs] = kmeans(yeastvalues, 16,...
'dist','corr',...
'rep',5,...
'disp','final');
figure
for c = 1:16
subplot(4,4,c);
plot(times,yeastvalues((cidx == c),:)');
axis tight
end
suptitle('K-Means Clustering of Profiles');
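The 'corr' distance used in the kmeans call is one minus the correlation between two profiles, so profiles with the same shape are close even when their scales differ. A minimal sketch with hypothetical profiles, using only corrcoef:

```matlab
a = [1 2 3 4 5];         % hypothetical expression profile
b = [2 4 6 8 10];        % same shape as a, different scale
c = [5 4 3 2 1];         % opposite trend
r_ab = corrcoef(a, b);   % 2-by-2 correlation matrix
r_ac = corrcoef(a, c);
d_ab = 1 - r_ab(1, 2);   % correlation distance between a and b
d_ac = 1 - r_ac(1, 2);   % correlation distance between a and c
fprintf('d(a,b) = %.2f, d(a,c) = %.2f\n', d_ab, d_ac);
```

The distance ranges from 0 for perfectly correlated profiles to 2 for perfectly anticorrelated ones, which is why this measure groups genes by the shape of their response rather than by its magnitude.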
iterations, total sum of distances = 11.4042
iterations, total sum of distances = 8.62674
iterations, total sum of distances = 8.86066
iterations, total sum of distances = 9.77676
iterations, total sum of distances = 9.01035
5 Instead of plotting all of the profiles, you can plot just the centroids.
figure
for c = 1:16
subplot(4,4,c);
plot(times,ctrs(c,:)');
axis tight
axis off % turn off the axis
end
suptitle('K-Means Clustering of Profiles');
6 You can use the function clustergram to create a heat map and
Columns 1 through 4
   -0.0245   -0.3033   -0.1710   -0.2831
    0.0186   -0.5309   -0.3843   -0.5419
    0.0713   -0.1970    0.2493    0.4042
    0.2254   -0.2941    0.1667    0.1705
    0.2950   -0.6422    0.1415    0.3358
    0.6596    0.1788    0.5155   -0.5032
    0.6490    0.2377   -0.6689    0.2601
Columns 5 through 7
   -0.1155    0.4034    0.7887
   -0.2384   -0.2903   -0.3679
   -0.7452   -0.3657    0.2035
   -0.2385    0.7520   -0.4283
    0.5592   -0.2110    0.1032
   -0.0194   -0.0961    0.0667
   -0.0673   -0.0039    0.0521
2 You can use the function cumsum to see the cumulative sum of the variances.
cumsum(pcvars./sum(pcvars) * 100)
This shows that almost 90% of the variance is accounted for by the first
two principal components.
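The cumulative percentage is easier to follow on a small example. The variances below are hypothetical and are not the pcvars values of this data set.

```matlab
componentVars = [4; 2; 1; 1];    % hypothetical principal component variances
% Percentage of the total variance carried by each component.
pctExplained = componentVars ./ sum(componentVars) * 100;
% Running total: variance explained by the first k components.
cumPct = cumsum(pctExplained);
disp(cumPct')
```

In this example the first two components account for 75% of the variance, and the last entry is always 100.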
3 A scatter plot of the scores of the first two principal components shows that
there are two distinct regions. This is not unexpected, because the filtering
process removed many of the genes with low variance or low information.
These genes would have appeared in the middle of the scatter plot.
figure
scatter(zscores(:,1),zscores(:,2));
xlabel('First Principal Component');
ylabel('Second Principal Component');
title('Principal Component Scatter Plot');
4 The gname function from the Statistics Toolbox software can be used to
identify genes on a scatter plot. You can select as many points as you like
on the scatter plot.
gname(genes);
plot where points from each group have a different color or marker. You
can use clusterdata, or any other clustering function, to group the points.
figure
pcclusters = clusterdata(zscores(:,1:2),6);
gscatter(zscores(:,1),zscores(:,2),pcclusters)
xlabel('First Principal Component');
ylabel('Second Principal Component');
title('Principal Component Scatter Plot with Colored Clusters');
gname(genes) % Press enter when you finish selecting genes.
coriell_data =
             Sample: {1x15 cell}
         Chromosome: [2285x1 int8]
    GenomicPosition: [2285x1 int32]
          Log2Ratio: [2285x15 double]
            FISHMap: {2285x1 cell}
You can plot the genome-wide log2-based test/reference intensity ratios of
DNA clones. In this example, you will display the log2 intensity ratios for cell
line GM03576 for chromosomes 1 through 23.
Find the sample index for the GM03576 cell line.
sample = find(strcmpi(coriell_data.Sample, 'GM03576'))
sample =
8
To label chromosomes and draw the chromosome borders, you need to find
the number of data points in each chromosome.
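Counting the data points per chromosome and locating the last point of each chromosome can be sketched as follows. The vector chrom is a hypothetical stand-in for coriell_data.Chromosome, and the names clonesPerChr and lastIndex are illustrative; they are assumed to play the roles of chr_data_len and chr_nums in the plotting code, which is an interpretation rather than something stated in the text.

```matlab
% Hypothetical chromosome label for each clone, sorted by chromosome.
chrom = int8([1 1 1 2 2 3 3 3 3])';
nChr = double(max(chrom));
clonesPerChr = zeros(1, nChr);   % number of data points on each chromosome
lastIndex = zeros(1, nChr);      % index of the last data point on each chromosome
for c = 1:nChr
    onChr = (chrom == c);
    clonesPerChr(c) = sum(onChr);
    lastIndex(c) = find(onChr, 1, 'last');
end
disp([clonesPerChr; lastIndex])
```

The last index of each chromosome marks where a border line belongs, and subtracting half the count from it gives a centered position for the chromosome label.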
% Label the autosomes with their chromosome numbers, and the sex chromosome
% with X.
x_label = chr_nums - ceil(chr_data_len/2);
y_label = zeros(1, length(x_label)) - 1.6;
chr_labels=num2str((1:1:23)');
chr_labels = cellstr(chr_labels);
chr_labels{end} = 'X';
figure;hold on
h_ratio = plot(coriell_data.Log2Ratio(:,sample), '.');
h_vbar = line(x_vbar, y_vbar, 'color', [0.8 0.8 0.8]);
h_text = text(x_label, y_label, chr_labels,...
'fontsize', 8, 'HorizontalAlignment', 'center');
h_axis = get(h_ratio, 'parent');
set(h_axis, 'xtick', [], 'ygrid', 'on', 'box', 'on',...
'xlim', [0 chr_nums(23)], 'ylim', [-1.5 1.5])
title(coriell_data.Sample{sample})
xlabel({'', 'Chromosome'})
ylabel('Log2(T/R)')
hold off
In the plot, borders between chromosomes are indicated by grey vertical bars.
The plot indicates that the GM03576 cell line is trisomic for chromosomes
2 and 21 [3].
You can also plot the profile of each chromosome in a genome. In this
example, you will display the log2 intensity ratios for each chromosome in cell
line GM05296 individually.
sample = find(strcmpi(coriell_data.Sample, 'GM05296'));
figure;
for c = 1:23
idx = coriell_data.Chromosome == c;
chr_y = coriell_data.Log2Ratio(idx, sample);
subplot(5,5,c);
hp = plot(chr_y, '.');
line([0, chr_data_len(c)], [0,0], 'color', 'r');
h_axis = get(hp, 'Parent');
set(h_axis, 'xtick', [], 'Box', 'on',...
'xlim', [0 chr_data_len(c)], 'ylim', [-1.5 1.5])
xlabel(['chr ' chr_labels{c}], 'FontSize', 8)
end
suptitle('GM05296');
The plot indicates the GM05296 cell line has a partial trisomy at chromosome
10 and a partial monosomy at chromosome 11.
Observe that the gains and losses of copy number are discrete. These
alterations occur in contiguous regions of a chromosome that range from
several clones to an entire chromosome.
The array-based CGH data can be quite noisy. Therefore, accurate
identification of chromosome regions of equal copy number that accounts for
the noise in the data requires robust computational methods. In the rest of
this example, you will work with the data of chromosomes 9, 10 and 11 of
the GM05296 cell line.
Initialize a structure array for the data of these three chromosomes.
GM05296_Data = struct('Chromosome', {9 10 11},...
'GenomicPosition', {[], [], []},...
= 1:length(GM05296_Data)
coriell_data.Chromosome == GM05296_Data(iloop).Chromosome;
= coriell_data.GenomicPosition(idx);
= coriell_data.Log2Ratio(idx, sample);
To better visualize and later validate the locations of copy number changes,
we need cytoband information. Read the human cytoband information from
the hs_cytoBand.txt data file using the cytobandread function. It returns a
structure of human cytoband information [4].
hs_cytobands = cytobandread('hs_cytoBand.txt')
% Find the centromere positions for the chromosomes.
hs_cytobands =
     ChromLabels: {862x1 cell}
    BandStartBPs: [862x1 int32]
      BandEndBPs: [862x1 int32]
      BandLabels: {862x1 cell}
       GieStains: {862x1 cell}
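One way to locate centromeres in a structure of this shape is to look for bands whose stain is 'acen'. The sketch below runs on a small hypothetical structure rather than on hs_cytobands, and it assumes that each chromosome has exactly two consecutive 'acen' bands, with the centromere at the end of the first one. Check that assumption against the actual cytoband file before relying on it.

```matlab
% Hypothetical cytoband structure with two chromosomes.
bands.ChromLabels = {'1'; '1'; '1'; '1'; '2'; '2'; '2'; '2'};
bands.BandEndBPs  = int32([100; 120; 140; 300; 90; 105; 118; 250]);
bands.GieStains   = {'gneg'; 'acen'; 'acen'; 'gpos'; ...
                     'gneg'; 'acen'; 'acen'; 'gpos'};
% Select the centromeric bands.
isAcen = strcmpi(bands.GieStains, 'acen');
acenEnds = bands.BandEndBPs(isAcen);
% Take the end of the first acen band of each pair as the centromere.
centromere = acenEnds(1:2:end);
disp(centromere')
```

The result has one position per chromosome, in the same order as the chromosomes appear in the structure.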
You can inspect the data by plotting the log2-based ratios, the smoothed ratios
and the derivative of the smoothed ratios together. You can also display the
centromere position of a chromosome in the data plots. The magenta vertical
bar marks the centromere of the chromosome.
for iloop = 1:length(GM05296_Data)
chr = GM05296_Data(iloop).Chromosome;
chr_x = GM05296_Data(iloop).GenomicPosition;
figure; hold on
plot(chr_x, GM05296_Data(iloop).Log2Ratio, '.');
line(chr_x, GM05296_Data(iloop).SmoothedRatio,...
'Color', 'r', 'LineWidth', 2);
line(chr_x, GM05296_Data(iloop).DiffRatio,...
'Color', 'k', 'LineWidth', 2);
line([acen_pos(chr), acen_pos(chr)], [-1, 1],...
'Color', 'm', 'LineWidth', 2, 'LineStyle', '-.');
if iloop == 1
legend('Raw','Smoothed','Diff', 'Centromere');
end
ylim([-1, 1])
xlabel('Genomic Position')
ylabel('Log2(T/R)')
title(sprintf('GM05296: Chromosome %d ', chr))
hold off
end
Detecting Change-Points
% Select an initial guess for the cluster index of each point.
gmpart = (gmy > (min(gmy) + range(gmy)/2)) + 1;
% Create a Gaussian mixture model object
gm = gmdistribution.fit(gmy, 2, 'start', gmpart);
gmid = gm.cluster(gmy);
Once you determine the optimal change-point indices, you also need to
determine whether each segment represents a statistically significant change
in DNA copy number. You will perform permutation t-tests to assess the
significance of the segments identified. A segment includes all the data points
from one change-point to the next change-point or the chromosome end. In
this example, you will perform 10,000 permutations of the data points on two
consecutive segments along the chromosome at the significance level of 0.01.
alpha = 0.01;
for iloop = 1:length(GM05296_Data)
seg_num = numel(GM05296_Data(iloop).SegIndex) - 1;
seg_index = GM05296_Data(iloop).SegIndex;
if seg_num > 1
ppvals = zeros(seg_num+1, 1);
for sloop = 1:seg_num-1
seg1idx = seg_index(sloop):seg_index(sloop+1)-1;
if sloop== seg_num-1
seg2idx = seg_index(sloop+1):(seg_index(sloop+2));
else
seg2idx = seg_index(sloop+1):(seg_index(sloop+2)-1);
end
seg1 = GM05296_Data(iloop).SmoothedRatio(seg1idx);
seg2 = GM05296_Data(iloop).SmoothedRatio(seg2idx);
n1 = numel(seg1);
n2 = numel(seg2);
N = n1+n2;
segs = [seg1;seg2];
% Compute observed t statistics
t_obs = mean(seg1) - mean(seg2);
% Permutation test
iter = 10000;
t_perm = zeros(iter,1);
for i = 1:iter
randseg = segs(randperm(N));
t_perm(i) = abs(mean(randseg(1:n1))-mean(randseg(n1+1:N)));
end
ppvals(sloop+1) = sum(t_perm >= abs(t_obs))/iter;
end
sigidx = ppvals < alpha;
GM05296_Data(iloop).SegIndex = seg_index(sigidx);
end
numel(GM05296_Data(iloop).SegIndex) - 1, GM05296_Data(iloop).Chromos
end
1 segments found on Chromosome 9 after significance tests.
3 segments found on Chromosome 10 after significance tests.
4 segments found on Chromosome 11 after significance tests.
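The core of the permutation test, isolated from the loop over chromosomes and segments, can be sketched on synthetic data. The two segments below are hypothetical, and 2000 permutations are used instead of 10,000 only to keep the sketch fast.

```matlab
rng(0);                                % make the random shuffles reproducible
seg1 = 0.0 + 0.1 * randn(20, 1);       % hypothetical segment centered at 0
seg2 = 0.5 + 0.1 * randn(20, 1);       % hypothetical segment shifted by 0.5
n1 = numel(seg1);
pooled = [seg1; seg2];
N = numel(pooled);
% Observed statistic: absolute difference of the segment means.
obsDiff = abs(mean(seg1) - mean(seg2));
nPerm = 2000;
permDiff = zeros(nPerm, 1);
for k = 1:nPerm
    % Shuffle the pooled points and split them into two pseudo-segments.
    shuffled = pooled(randperm(N));
    permDiff(k) = abs(mean(shuffled(1:n1)) - mean(shuffled(n1+1:N)));
end
% Fraction of permutations at least as extreme as the observation.
pValue = sum(permDiff >= obsDiff) / nPerm;
fprintf('Permutation p-value: %g\n', pValue);
```

Because the two synthetic segments are well separated, almost no shuffled split reproduces the observed difference, and the p-value falls below the 0.01 significance level used in this example.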
Assessing Copy Number Alterations
'unit', 2)
end
cna_struct =
Chromosome: [10 11]
CNVType: [2 1]
Start: [69209000 34420000]
This example shows how MATLAB and its toolboxes provide tools for the
analysis and visualization of copy-number alterations in array-based CGH
data.
References
[1] Redon, R., Ishikawa, S., Fitch, K.R., et al. (2006). Global variation in copy
number in the human genome. Nature 444, 444-454.
[2] Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., Collins,
C., Kuo, W.L., Chen, C., Zhai, Y., et al. (1998). High resolution analysis of
DNA copy number variations using comparative genomic hybridization to
microarrays. Nat. Genet. 20, 207-211.
[3] Snijders, A.M., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy,
J., Hamilton, G., Hindle, A.K., Huey, B., Kimura, K., et al. (2001). Assembly
of microarrays for genome-wide measurement of DNA copy number. Nat.
Genet. 29, 263-264.
[4] Human Genome NCBI Build 36.
[5] Myers, C.L., Dunham, M.J., Kung, S.Y., and Troyanskaya, O.G. (2004).
Accurate detection of aneuploidies in array CGH and gene expression
microarray data. Bioinformatics 20, 18, 3533-3543.
            Name: ''
        RowNames: {7129x1 cell}
        ColNames: {1x42 cell}
           NRows: 7129
           NCols: 42
           NDims: 2
    ElementClass: 'single'
nGenes =
7129
nSamples =
42
You can use gene symbols instead of the probe set IDs to label the expression
values. The gene symbols for the HuGeneFL array are provided in a MAT
file containing a Map object.
load HuGeneFL_GeneSymbol_Map;
Warning: Unknown parameter name
of known parameters.
will be ignored.
Create a cell array of gene symbols for the expression values from the
hu6800GeneSymbolMap object.
huGenes = values(hu6800GeneSymbolMap, expr_cns_gcrma_eb.RowNames);
Set the row names of expr_cns_gcrma_eb to gene symbols using the
rownames method of the DataMatrix object.
expr_cns_gcrma_eb = rownames(expr_cns_gcrma_eb, ':', huGenes);
Filtering the Expression Data
Remove gene expression data with empty gene symbols. In the example, the
empty symbols are labeled as '---'.
expr_cns_gcrma_eb('---', :) = [];
Many of the genes in this study are not expressed, or have only small
variability across the samples. Remove these genes using non-specific
filtering.
Use genelowvalfilter to filter out genes with very low absolute expression
values.
[mask, expr_cns_gcrma_eb] = genelowvalfilter(expr_cns_gcrma_eb);
Use genevarfilter to filter out genes with a small variance across samples.
[mask, expr_cns_gcrma_eb] = genevarfilter(expr_cns_gcrma_eb);
nGenes =
5669
You can now compare the gene expression values between two groups of data:
CNS medulloblastomas (MD) and non-neuronal origin malignant gliomas
(Mglio) tumor.
From the expression data of all 42 samples, extract the data of the 10 MD
samples and the 10 Mglio samples.
MDs = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MD', 8);
Mglios = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MGlio', 11);
MDData = expr_cns_gcrma_eb(:, MDs);
get(MDData)
            Name: ''
        RowNames: {5669x1 cell}
        ColNames: {1x10 cell}
           NRows: 5669
           NCols: 10
           NDims: 2
    ElementClass: 'single'
            Name: ''
        RowNames: {5669x1 cell}
        ColNames: {1x10 cell}
           NRows: 5669
           NCols: 10
           NDims: 2
    ElementClass: 'single'
In any test situation, two types of errors can occur, a false positive by
declaring that a gene is differentially expressed when it is not, and a false
negative when the test fails to identify a truly differentially expressed gene.
In multiple hypothesis testing, which simultaneously tests the null hypothesis
of thousands of genes using microarray expression data, each test has a
specific false positive rate, or a false discovery rate (FDR). False discovery
rate is defined as the expected ratio of the number of false positives to the
total number of positive calls in a differential expression analysis between
two groups of samples (Storey et al., 2003).
In this example, you will compute the FDR using the Storey-Tibshirani
procedure (Storey et al., 2003). The procedure also computes the q-value of
a test, which measures the minimum FDR that occurs when calling the test
significant. The estimation of FDR depends on the truly null distribution of
the multiple tests, which is unknown. Permutation methods can be used to
estimate the truly null distribution of the test statistics by permuting the
columns of the gene expression data matrix (Storey et al., 2003, Dudoit et
al., 2003). Depending on the sample size, it may not be feasible to consider
all possible permutations. For large sample sizes, a random subset of
permutations is usually considered. Use the MATLAB nchoosek function to
find the number of all possible permutations of the samples in this example.
ans =
184756
ans =
2121
Estimate the FDR and q-values for each test using mafdr. The quantity pi0 is
the overall proportion of true null hypotheses in the study. It is estimated
from the simulated null distribution via bootstrap or the cubic polynomial fit.
Note: You can also manually set the value of lambda for estimating pi0.
figure;
[pFDR, qvalues] = mafdr(pvaluesCorr, 'showplot', true);
Determine the number of genes that have q-values less than the cutoff value.
Note: You may get a different number of genes due to the permutation test
and the bootstrap outcomes.
sum(qvalues < cutoff)
ans =
2173
The large number of genes with low FDR implies that the two groups, MD and
Mglio, are biologically distinct.
You can also empirically estimate the FDR-adjusted p-values using the
Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995) by setting
the mafdr input parameter BHFDR to true.
pvaluesBH = mafdr(pvaluesCorr, 'BHFDR', true);
sum(pvaluesBH < cutoff)
ans =
1374
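The Benjamini-Hochberg adjustment itself is short enough to write out by hand. The following sketch implements the textbook step-up procedure on hypothetical p-values; it is meant to illustrate the idea, and the mafdr implementation may differ in details. The 0.05 cutoff is also an assumption made for the illustration.

```matlab
p = [0.001 0.008 0.039 0.041 0.042 0.06 0.074 0.205];  % hypothetical raw p-values
m = numel(p);
[pSorted, order] = sort(p);
% Scale the i-th smallest p-value by m/i.
adjSorted = pSorted .* m ./ (1:m);
% Enforce monotonicity from the largest p-value downwards.
for k = m-1:-1:1
    adjSorted(k) = min(adjSorted(k), adjSorted(k+1));
end
adjSorted = min(adjSorted, 1);          % adjusted values cannot exceed 1
% Put the adjusted values back in the original order.
pAdjusted = zeros(size(p));
pAdjusted(order) = adjSorted;
nSignificant = sum(pAdjusted < 0.05);
fprintf('%d of %d tests pass at FDR 0.05\n', nSignificant, m);
```

In this example five raw p-values are below 0.05, but only two remain below 0.05 after the adjustment, which shows how the procedure guards against false positives when many tests are run at once.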
You can store the t-scores, p-values, pFDRs, q-values and BH FDR corrected
p-values together as a DataMatrix object.
testResults = [tscores pvaluesCorr pFDR qvalues pvaluesBH];
Update the column name for BH FDR corrected p-values using the colnames
method of DataMatrix object.
testResults = colnames(testResults, 5, {'FDR_BH'});
ans =
            t-scores   p-values     FDR          q-values     FDR_B
PLEC1       -9.6223    6.7194e-09   1.3675e-05   7.171e-06    1.997
HNRPA1       9.359     1.382e-08    1.4063e-05   7.171e-06    1.997
FCGR2A      -9.3548    1.394e-08    9.457e-06    7.171e-06    1.997
PLEC1       -9.3495    1.4094e-08   7.171e-06    7.171e-06    1.997
FBL          9.1518    1.9875e-08   8.0899e-06   7.1728e-06   1.99
KIAA0367    -8.996     2.4324e-08   8.2509e-06   7.1728e-06   1.99
ID2B        -8.9285    2.6667e-08   7.7533e-06   7.1728e-06   1.99
RBMX         8.8905    2.8195e-08   7.1728e-06   7.1728e-06   1.99
PAFAH1B3     8.7561    3.5317e-08   7.9864e-06   7.9864e-06   2.224
H3F3A        8.6512    4.5191e-08   9.1973e-06   8.5559e-06   2.383
LRP1        -8.6465    4.6243e-08   8.5559e-06   8.5559e-06   2.383
PEA15       -8.3256    1.1419e-07   1.9367e-05   1.9367e-05   5.394
ID2B        -8.1183    1.7041e-07   2.6679e-05   2.4793e-05   6.905
SFRS3        8.1166    1.7055e-07   2.4793e-05   2.4793e-05   6.905
HLA-DPA1    -7.8546    2.4004e-07   3.2569e-05   3.2569e-05   9.07
C5orf13      7.7195    2.9229e-07   3.7179e-05   3.3452e-05   9.317
PTMA         7.7013    2.9658e-07   3.5506e-05   3.3452e-05   9.317
NAP1L1       7.674     3.0477e-07   3.446e-05    3.3452e-05   9.317
HMGB2        7.6532    3.123e-07    3.3452e-05   3.3452e-05   9.317
RAB31      -13.664     3.308e-07    3.3662e-05   3.3662e-05   9.376
ARAF        -7.5549    4.7835e-07   4.6359e-05   4.614e-05    0.000
PTPRZ1      -7.5352    4.9875e-07   4.614e-05    4.614e-05    0.000
SPARCL1     -7.3639    7.8426e-07   6.9397e-05   6.2018e-05   0.000
diffStruct =
           Name: 'Differentially Expressed'
       PVCutoff: 0.0500
    FCThreshold: 2
     GeneLabels: {327x1 cell}
        PValues: [327x1 bioma.data.DataMatrix]
    FoldChanges: [327x1 bioma.data.DataMatrix]
Ctrl-click genes in the gene lists to label the genes in the plot. As seen in the
volcano plot, genes specific for neuronal-based cerebellar granule cells, such
as ZIC and NEUROD, are found in the up-regulated gene list, while genes
typical of the astrocytic and oligodendrocytic lineage and cell differentiation,
such as SOX2, PEA15, and ID2B, are found in the down-regulated list.
Determine the number of differentially expressed genes.
nDiffGenes = diffStruct.PValues.NRows
nDiffGenes =
327
nUpGenes =
225
nDownGenes =
102
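The selection behind a volcano plot combines a p-value cutoff with a fold-change threshold. The sketch below applies the two values reported in diffStruct (0.05 and 2) to hypothetical data; the variable names are illustrative, and the direction of the ratio (which group is the numerator) is an assumption.

```matlab
foldChange = [4.0; 0.2; 1.1; 3.0; 0.4; 0.9];     % hypothetical expression ratios
pVals      = [0.001; 0.01; 0.001; 0.2; 0.04; 0.5]; % hypothetical p-values
pCutoff = 0.05;        % PVCutoff reported in diffStruct
fcThreshold = 2;       % FCThreshold reported in diffStruct
log2FC = log2(foldChange);
isSig  = pVals < pCutoff;
% Up-regulated: significant and at least a 2-fold increase.
isUp   = isSig & (log2FC >=  log2(fcThreshold));
% Down-regulated: significant and at least a 2-fold decrease.
isDown = isSig & (log2FC <= -log2(fcThreshold));
fprintf('%d up-regulated, %d down-regulated\n', sum(isUp), sum(isDown));
```

A gene must pass both criteria: the fourth gene has a 3-fold change but is not significant, and the third gene is significant but barely changes, so neither is selected.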
Use Gene Ontology (GO) to annotate the differentially expressed genes. You
can look at the up-regulated genes from the analysis above. Download the
Homo sapiens annotations (gene_association.goa_human.gz file) from
Gene Ontology Current Annotations, unzip it, and store it in your current
directory.
Find the indices of the up-regulated genes for Gene Ontology analysis.
huGenes = rownames(expr_cns_gcrma_eb);
for i = 1:nUpGenes
up_geneidx(i) = find(strncmpi(huGenes, up_genes{i}, length(up_genes{i})), 1);
end
Load the Gene Ontology database into a MATLAB object using the geneont
function.
GO = geneont('live',true);
Read the Homo sapiens gene annotation file. For this example, you will look
only at genes that are related to molecular function, so you only need to read
the information where the Aspect field is set to F. The fields that are of
interest are the gene symbol and associated ID. In GO Annotation files these
have field names DB_Object_Symbol and GOid respectively.
HGann = goannotread('gene_association.goa_human',...
'Aspect','F','Fields',{'DB_Object_Symbol','GOid'});
ans =
16006
Not all of the 5758 genes on the HuGeneFL chip are annotated. For every
gene on the chip, see if it is annotated by comparing its gene symbol to the
list of gene symbols from GO. Track the number of annotated genes and the
number of up-regulated genes associated with each GO term. Note that data
in public repositories is frequently curated and updated; therefore the results
of this example might be slightly different when you use up-to-date datasets.
It is also possible that you get warnings about invalid or obsolete IDs due to
an updated Homo sapiens gene annotation file.
m = GO.Terms(end).id; % gets the last term id
chipgenesCount = zeros(m,1); % a vector of GO term counts for the entire ch
upgenesCount = zeros(m,1); % a vector of GO term counts for up-regulated
for i = 1:length(huGenes)
if isKey(HGmap,huGenes{i})
goid = getrelatives(GO,HGmap(huGenes{i}));
% Update the tally
chipgenesCount(goid) = chipgenesCount(goid) + 1;
if (any(i == up_geneidx))
upgenesCount(goid) = upgenesCount(goid) +1;
end
end
end
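Counts such as chipgenesCount and upgenesCount are typically turned into an enrichment p-value with the hypergeometric distribution. The step that does this is not shown here, so the following is a generic sketch of a one-sided enrichment test on hypothetical counts, written with the base MATLAB function nchoosek. It is not necessarily the calculation used in this example.

```matlab
N = 20;   % genes on a hypothetical chip
K = 5;    % genes on the chip annotated with a given GO term
n = 4;    % up-regulated genes
k = 3;    % up-regulated genes annotated with the term
% Probability of seeing k or more annotated genes among n draws by chance.
p = 0;
for i = k:min(K, n)
    p = p + nchoosek(K, i) * nchoosek(N - K, n - i) / nchoosek(N, n);
end
fprintf('Enrichment p-value: %.4f\n', p);
```

A small p-value means that the term appears among the up-regulated genes more often than its frequency on the chip would suggest.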
p-value    counts      definition
0.00000    57 / 162    The action of a molecule that contributes to t
0.00000    53 / 219    Interacting selectively and non-covalently wit
0.00000    55 / 244    Functions during translation by interacting se
0.00000    52 / 220    Interacting selectively and non-covalently wit
0.00000    51 / 213    Interacting selectively and non-covalently wit
0.00000    51 / 213    Interacting selectively and non-covalently wit
0.00000    51 / 213    Interacting selectively and non-covalently wit
0.00000    51 / 213    Interacting selectively and non-covalently wit
0.00000    51 / 213    Interacting selectively and non-covalently wit
0.00000    51 / 213    Interacting selectively and non-covalently wit
Inspect the significant GO terms and select the terms related to specific
molecular functions to build a sub-ontology that includes the ancestors of the
terms. Visualize this ontology using the biograph function. You can also color
the graph nodes. In this example, the red nodes are the most significant,
while the blue nodes are the least significant gene ontology terms. Note: The
GO terms returned may differ from those shown due to frequent updates to
the Homo sapiens gene annotation file.
fcnAncestors = GO(getancestors(GO,idx(1:5)))
[cm acc rels] = getmatrix(fcnAncestors);
BG = biograph(cm,get(fcnAncestors.Terms,'name'))
for i=1:numel(acc)
pval = gopvalues(acc(i));
color = [(1-pval).^(1) pval.^(1/8) pval.^(1/8)];
set(BG.Nodes(i),'Color',color);
end
view(BG)
Gene Ontology object with 13 Terms.
Biograph object with 13 nodes and 14 edges.
You can query the pathway information of the differentially expressed genes
from the KEGG pathway database through the KEGG Web Service.
Following are a few pathway maps with the genes in the up-regulated gene
list highlighted:
Cell Cycle
Hedgehog Signaling pathway
mTor Signaling pathway
References
[1] Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M.,
McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., Black, P.M., Lau, C., Allen,
J.C., Zagzag, D., Olson, J.M., Curran, T., Wetmore, C., Biegel, J.A., Poggio, T.,
Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D.N., Mesirov,
J.P., Lander, E.S., and Golub, T.R. (2002). Prediction of central nervous
system embryonal tumour outcome based on gene expression. Nature,
415(6870), 436-442.
[2] Storey, J.D., and Tibshirani, R. (2003). Statistical significance for
genomewide studies. Proc. Nat. Acad. Sci., 100(16), 9440-9445.
[3] Dudoit, S., Shaffer, J.P., and Boldrick, J.C. (2003). Multiple hypothesis
testing in microarray experiments. Statistical Science, 18, 71-103.
[4] Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery
rate: a practical and powerful approach to multiple testing. J. Royal Stat.
Soc., B 57, 289-300.
5
Phylogenetic Analysis
Overview of Phylogenetic Analysis
Building a Phylogenetic Tree
Phylogenetic Tree App Reference
allows sequences to be traced through one genetic line and all polymorphisms
assumed to be caused by mutations.
Mitochondrial DNA in mammals has a faster mutation rate than nuclear
DNA sequences. This faster rate of mutation produces more variance between
sequences and is an advantage when studying closely related species. The
mitochondrial control region (Displacement or D-loop) is one of the fastest
mutating sequence regions in animal DNA.
Neanderthal DNA
The ability to isolate mitochondrial DNA (mtDNA) from palaeontological
samples has allowed genetic comparisons between extinct species and closely
related nonextinct species. The reasons for isolating mtDNA instead of
nuclear DNA in fossil samples have to do with the fact that:
mtDNA, because it is circular, is more stable and degrades more slowly than
nuclear DNA.
Each cell can contain a thousand copies of mtDNA and only a single copy
of nuclear DNA.
While there is still controversy as to whether Neanderthals are direct
ancestors of humans or evolved independently, the use of ancient genetic
sequences in phylogenetic analysis adds an interesting dimension to the
question of human ancestry.
References
Ovchinnikov I., et al. (2000). Molecular analysis of Neanderthal DNA from
the northern Caucasus. Nature 404(6777), 490-493.
Sajantila A., et al. (1995). Genes and languages in Europe: an analysis of
mitochondrial lineages. Genome Research 5(1), 42-52.
Krings M., et al. (1997). Neanderthal DNA sequences and the origin of
modern humans. Cell 90(1), 19-30.
Jensen-Seaman, M., Kidd K. (2001). Mitochondrial DNA variation and
biogeography of eastern gorillas. Molecular Ecology 10(9), 2241-2247.
A separate browser window opens with the home page for the NCBI Web
site.
2 Search the NCBI Web site for information. For example, to search for the
human taxonomy, from the Search list, select Taxonomy, and in the for
box, enter hominidae.
3 Select the taxonomy link for the family Hominidae. A page with the
you can select a method for calculating the hierarchical clustering distances
used to build a tree.
After locating the GenBank accession codes for the sequences you are
interested in studying, you can create a phylogenetic tree with the data. For
information on locating accession codes, see Searching NCBI for Phylogenetic
Data on page 5-5.
In the following example, you use the Jukes-Cantor method to calculate
distances between sequences and the Unweighted Pair Group Method with
Arithmetic Mean (UPGMA) to link the tree nodes.
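As a rough illustration of what the Jukes-Cantor method computes, the following sketch applies the standard Jukes-Cantor correction, d = -(3/4)ln(1 - (4/3)p), to two short sequences. The sequences are made-up toy data and are assumed to be already aligned and of equal length; the variable names are hypothetical. Note that seqpdist performs a pairwise alignment first when the inputs are not prealigned, which this sketch does not attempt.

```matlab
% Two toy DNA sequences, assumed to be aligned and of equal length
% (made-up data for illustration only).
s1 = 'ACGTACGTAC';
s2 = 'ACGTTCGTAA';

% p-distance: the proportion of sites at which the sequences differ.
p = mean(s1 ~= s2);

% Jukes-Cantor correction, which accounts for multiple substitutions
% at the same site. It is defined only for p < 3/4.
d = -3/4 * log(1 - 4/3 * p);

fprintf('p-distance = %.3f, Jukes-Cantor distance = %.4f\n', p, d);
```

The corrected distance is always at least as large as the raw proportion of differing sites, and the gap grows as the sequences diverge.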
1 Create a MATLAB structure with information about the sequences. This
step uses the accession codes for the mitochondrial D-loop sequences
isolated from different hominid species.
data = {'German_Neanderthal'      'AF011222';
        'Russian_Neanderthal'     'AF254446';
        'European_Human'          'X90314'  ;
        'Mountain_Gorilla_Rwanda' 'AF089820';
        'Chimp_Troglodytes'       'AF176766';
       };
2 Retrieve sequence data from the GenBank database and copy into the
MATLAB environment.
for ind = 1:5
    seqs(ind).Header   = data{ind,1};
    seqs(ind).Sequence = getgenbank(data{ind,2},...
                                    'sequenceonly', true);
end
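3 Calculate pairwise distances and create a phytree object. For example,
compute the pairwise distances using the Jukes-Cantor method and build the
tree using the UPGMA linkage method.
distances = seqpdist(seqs,'Method','Jukes-Cantor','Alpha','DNA');
tree = seqlinkage(distances,'UPGMA',seqs);
4 Draw a phylogenetic tree.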
h = plot(tree,'orient','top');
ylabel('Evolutionary distance')
set(h.terminalNodeLabels,'Rotation',65)
'AF451972';
'AF451964';
'AY079510';
'AF050738';
'Chimp_Schweinfurthii'    'AF176722';
'Chimp_Vellerosus'        'AF315498';
'Chimp_Verus'             'AF176731';
};
2 Get additional sequence data from the GenBank database, and copy the
distances = seqpdist(seqs,'Method','Jukes-Cantor','Alpha','DNA');
tree = seqlinkage(distances,'UPGMA',seqs);
4 Draw a phylogenetic tree.
h = plot(tree,'orient','top');
ylabel('Evolutionary distance')
set(h.terminalNodeLabels,'Rotation',65)
names = get(tree,'LeafNames')
names =
'German_Neanderthal'
'Russian_Neanderthal'
'European_Human'
'Chimp_Troglodytes'
'Chimp_Schweinfurthii'
'Chimp_Verus'
'Chimp_Vellerosus'
'Puti_Orangutan'
'Jari_Orangutan'
'Mountain_Gorilla_Rwanda'
'Eastern_Lowland_Gorilla'
'Western_Lowland_Gorilla'
From the list, you can determine the indices for its members. For example,
the European Human leaf is the third entry.
2 Find the closest species to a selected species in a tree. For example, find
subtree_names = names(h_leaves)
'European_Human'
'Chimp_Schweinfurthii'
'Chimp_Verus'
'Chimp_Troglodytes'
4 Extract a subtree from the whole tree by removing unwanted leaves. For
example, prune the tree to species within 0.6 of the European human
species.
leaves_to_prune = ~h_leaves;
pruned_tree = prune(tree,leaves_to_prune)
h = plot(pruned_tree,'orient','top');
ylabel('Evolutionary distance')
set(h.terminalNodeLabels,'Rotation',65)
The MATLAB software returns information about the new subtree and
plots the pruned phylogenetic tree in a Figure window.
Phylogenetic tree object with 6 leaves (5 branches)
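The select-then-prune pattern in these steps can be tried without downloading any sequences. The following is a minimal sketch on a toy tree, assuming four hypothetical taxa (A through D) with made-up distances; only seqlinkage, get, select, and prune from the toolbox are used, and the threshold of 0.5 is an arbitrary choice for illustration.

```matlab
% Toy pairwise distances for four hypothetical taxa, given in the
% row-vector order (1,2) (1,3) (1,4) (2,3) (2,4) (3,4).
labels = {'A','B','C','D'};
d = [0.1 0.4 0.8 0.4 0.8 0.8];
toyTree = seqlinkage(d,'UPGMA',labels);

% Leaf order in the tree can differ from the input order, so look up
% the index of the reference leaf by name.
leafNames = get(toyTree,'LeafNames');
ref = find(strcmp(leafNames,'A'));

% Select the nodes whose path distance to the reference leaf falls
% within the threshold. The second output is a logical vector over leaves.
[h_all,h_leaves] = select(toyTree,'reference',ref,...
                          'criteria','distance','threshold',0.5);
kept_names = leafNames(h_leaves)

% Remove every leaf outside the selection.
small_tree = prune(toyTree,~h_leaves)
```

Looking up the reference leaf by name avoids depending on the position of a leaf in the LeafNames list, which is what the steps above do by inspection.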
Tree app.
phytreeviewer(pruned_tree)
You can interactively change the appearance of the tree using the app.
For information on using this app, see Phylogenetic Tree App Reference
on page 5-16.
phytreeviewer(tr)
File Menu
The File menu includes the standard commands for opening and closing a
file, and it includes commands to use phytree object data from the MATLAB
Workspace. The File menu commands are shown below.
A second Phylogenetic Tree viewer opens with tree data from the selected
file.
Open Command
Use the Open command to read tree data from a Newick-formatted file and
display that data in the app.
1 From the File menu, click Open.
app uses the file extension .tree for Newick-formatted files, but you can
use any Newick-formatted file with any extension.
The app replaces the current tree data with data from the selected file.
The app replaces the current tree data with data from the selected object.
Save As Command
After you create a phytree object or prune a tree from existing data, you can
save the resulting tree in a Newick-formatted file. The sequence data used to
create the phytree object is not saved with the tree.
1 From the File menu, select Save As.
extension .tree for Newick-formatted files, but you can use any file
extension.
3 Click Save.
The app saves tree data without the deleted branches, and it saves changes
to branch and leaf names. Formatting changes such as branch rotations,
collapsed branches, and zoom settings are not saved in the file.
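Saving and reopening a Newick-formatted file can also be done from the command line. The following is a minimal sketch, assuming a toy tree built from made-up distances and a temporary file name; phytreewrite, phytreeread, and getnewickstr are the toolbox functions for writing, reading, and printing Newick data.

```matlab
% Build a small tree from toy distances (hypothetical taxa A to D).
labels = {'A','B','C','D'};
d = [0.1 0.4 0.8 0.4 0.8 0.8];
toyTree = seqlinkage(d,'UPGMA',labels);

% Show the Newick text that represents the tree.
newickText = getnewickstr(toyTree)

% Write the tree to a Newick-formatted file. The .tree extension matches
% the default used by the app, but any extension works.
fname = [tempname '.tree'];
phytreewrite(fname,toyTree);

% Read the file back into a new phytree object and list its leaves.
reloaded = phytreeread(fname);
get(reloaded,'LeafNames')

% Remove the temporary file.
delete(fname);
```

As with the Save As command, only the tree topology, branch lengths, and names are written. The sequence data used to build the tree is not part of the file.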
3 Click OK.
Rendering Type
'square' (default)
'angular'
'radial'
'equalangle'
3 Select the Display Labels you want on your figure. You can select from all
The Print Preview window opens, which you can use to select page
formatting options.
2 Select the page formatting options and values you want, and then click
Print.
Print Command
Use the Print command to make a copy of your phylogenetic tree after you
use the Print Preview command to select formatting options.
1 From the File menu, select Print.
Tools Menu
Use the Tools menu to:
Explore branch paths
Rotate branches
Find, rename, hide, and prune branches and leaves.
The Tools menu and toolbar contain most of the commands specific to trees
and phylogenetic analysis. Use these commands and modes to edit and format
your tree interactively. The Tools menu commands are:
Inspect Mode
Viewing a phylogenetic tree in the Phylogenetic Tree app provides a rough
idea of how closely related two sequences are. However, to see exactly how
closely related two sequences are, measure the distance of the path between
them. Use the Inspect command to display and measure the path between
two sequences.
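The same path measurement is available outside the app. The following is a minimal sketch, assuming a toy tree built from made-up distances for four hypothetical taxa; it uses the pdist method of the phytree object to get the path (patristic) distance between every pair of leaves.

```matlab
% Build a small tree from toy distances (hypothetical taxa A to D).
labels = {'A','B','C','D'};
d = [0.1 0.4 0.8 0.4 0.8 0.8];
toyTree = seqlinkage(d,'UPGMA',labels);

% Path distances between all pairs of leaves, as a square matrix.
D = pdist(toyTree,'Squareform',true);

% Look up two leaves by name, because leaf order in the tree can differ
% from the input order.
leafNames = get(toyTree,'LeafNames');
iA = find(strcmp(leafNames,'A'));
iC = find(strcmp(leafNames,'C'));

fprintf('Path length between A and C: %.3f\n', D(iA,iC));
```

The value is the sum of the branch lengths along the path that connects the two leaves, which is the quantity the Inspect mode reports interactively.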
1 Select Tools > Inspect, or from the toolbar, click the Inspect Tool Mode
icon.
2 Click a branch or leaf node (selected node), and then hover your cursor over
The paths, branch nodes, and leaf nodes below the selected branch appear
in gray, indicating you selected them to collapse (hide from view).
The app hides the display of paths, branch nodes, and leaf nodes below the
selected branch. However, it does not remove the data.
Tip After collapsing nodes, you can redraw the tree by selecting Tools >
Fit to Window.
The branch and leaf nodes below the selected branch node rotate 180
degrees around the branch node.
4 To undo the rotation, simply click the branch node again.
4 To accept your changes and close the text box, click outside of the text box.
1 Select Tools > Prune, or from the toolbar, click the Prune (delete)
For a leaf node, the branch line connected to the leaf appears in gray. For a
branch node, the branch lines below the node appear in gray.
Note If you delete nodes (branches or leaves), you cannot undo the
changes. The Phylogenetic Tree app does not have an Undo command.
3 Click the branch or leaf node.
The tool removes the branch from the figure and rearranges the other
nodes to balance the tree structure. It does not recalculate the phylogeny.
Tip After pruning nodes, you can redraw the tree by selecting Tools > Fit
to Window.
The app activates zoom in mode and changes the cursor to a magnifying
glass.
2 Place the cursor over the section of the tree diagram you want to enlarge
4 Move the cursor over the tree diagram, left-click, and drag the diagram to
Select Submenu
Select a single branch or leaf node by clicking it. Select multiple branch or
leaf nodes by Shift-clicking the nodes, or click-dragging to draw a box around
nodes.
Use the Select submenu to select specific branch and leaf nodes based on
different criteria.
Select By Distance: Displays a slider bar at the top of the window,
which you slide to specify a distance threshold. Nodes whose distance from
the selected node is below this threshold appear in red. Nodes whose
distance from the selected node is above this threshold appear in blue.
Select Common Ancestor: For all selected nodes, highlights the closest
common ancestor branch node in red.
Select Leaves: If one or more nodes are selected, highlights the nodes
that are leaf nodes in red. If no nodes are selected, highlights all leaf
nodes in red.
Propagate Selection: For all selected nodes, highlights the descendant
nodes in red.
Swap Selection: Clears all selected nodes and selects all deselected
nodes.
After selecting nodes using one of the previous commands, hide and show the
nodes using the following commands:
Collapse Selected
Expand Selected
Expand All
Clear all selected nodes by clicking anywhere else in the Phylogenetic Tree
app.
The branch or leaf nodes that match the expression appear in red.
After selecting nodes using the Find Leaf/Branch command, you can hide
and show the nodes using the following commands:
Collapse Selected
Expand Selected
Expand All
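Name matching by regular expression can also be done from the command line with the getbyname method of the phytree object. The following is a minimal sketch, assuming a toy tree with made-up taxon names and distances; the pattern '^Chimp' is only an example.

```matlab
% Build a small tree with hypothetical taxon names and toy distances.
labels = {'Chimp_A','Chimp_B','Gorilla_A','Human_A'};
d = [0.1 0.4 0.8 0.4 0.8 0.8];
toyTree = seqlinkage(d,'UPGMA',labels);

% getbyname returns a logical vector with one entry per node
% (leaves and branches) marking the names that match the expression.
sel = getbyname(toyTree,'^Chimp');

% List the names of the matching nodes.
nodeNames = get(toyTree,'NodeNames');
matched = nodeNames(sel)
```

The logical vector covers branch nodes as well as leaves, so a pattern that matches the automatically generated branch names selects those nodes too.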
to Window command to redraw the tree diagram to fill the entire Figure
window.
Select Tools > Fit to Window.
Options Submenu
Use the Options command to select the behavior for the zoom and pan modes.
Unconstrained Zoom: Allow zooming in both horizontal and vertical
directions.
Horizontal Zoom: Restrict zooming to the horizontal direction.
Vertical Zoom (default): Restrict zooming to the vertical direction.
Unconstrained Pan: Allow panning in both horizontal and vertical
directions.
Horizontal Pan: Restrict panning to the horizontal direction.
Vertical Pan (default): Restrict panning to the vertical direction.
Window Menu
This section illustrates how to switch to any open window.
The Window menu is standard on MATLAB interfaces and Figure windows.
Use this menu to select any opened window.
Help Menu
This section illustrates how to select quick links to the Bioinformatics
Toolbox documentation for phylogenetic analysis functions, tutorials, and the
Phylogenetic Tree app reference.
Use the Help menu to select quick links to the Bioinformatics Toolbox
documentation for phylogenetic analysis functions, tutorials, and the
phytreeviewer reference.