
Lesson 6


Lesson 6 Genome

Positional Cloning: An Introduction to Genomics
• Before we examine the techniques of genomic research, let us consider one of the important uses of genomic information: positional cloning, which is one method for discovering the genes involved in genetic traits. In humans, this frequently involves the identification of genes that govern genetic diseases. We will begin by considering an example of positional cloning that was done before the genomic era: finding the gene whose malfunction causes Huntington disease in humans. We will see that much of the effort went into narrowing down the region in which to look for the faulty gene. One reason for all this effort was to avoid having to sequence a huge chunk of DNA. Nowadays, that is not a problem because the sequencing has already been done. Nevertheless, this example serves as a good introduction to genomics for several reasons: It illustrates the principle of positional cloning, which is still a major use of genomic information; it shows how difficult positional cloning was in the absence of genomic information; and it is a heroic story that still deserves to be told.
Classical Tools of Positional Cloning
• Geneticists seeking the genes responsible for human genetic disorders frequently face a problem: They do not know the identity of the defective protein, so they are looking for a gene without knowing its function. Thus, they have to identify the gene by finding its position on the human genetic map, and this process has therefore come to be called positional cloning. The strategy of positional cloning begins with the study of a family or families afflicted with the disorder, with the goal of finding one or more markers that are tightly linked to the “disease gene,” that is, the gene which, when mutated, causes the disease. Frequently, these markers are not genes, but stretches of DNA whose pattern of cleavage by restriction enzymes or other physical attributes vary from one individual to another.
• Because the position of the marker is known, the disease gene can be pinned down to a relatively small region of the genome. However, that “relatively small” region usually contains about a million base pairs, so the job is not over. The next step is to search through the million or so base pairs to find a gene that is the likely culprit. Several tools have traditionally been used in the search, and we will describe two here: (1) finding exons with exon traps; and (2) locating the CpG islands that tend to be associated with genes. We will see how these tools have been used as we discuss our example in the next section of this chapter. First, let us examine a favorite method to map a gene to a fairly small region of the genome.
Restriction Fragment Length Polymorphisms
• In the late twentieth century, we knew the locations of relatively few human genes, so the likelihood of finding one of these close to a new gene we were trying to map was small. Another approach, which does not depend on finding linkage with a known gene, is to establish linkage with an “anonymous” stretch of DNA that may not even contain any genes. We can recognize such a piece of DNA by its pattern of cleavage by restriction enzymes. Because each person differs genetically from every other, the sequences of their DNAs will differ a little bit, as will the pattern of cutting by restriction enzymes. Consider the restriction enzyme HindIII, which recognizes the sequence AAGCTT. One individual may have three such sites separated by 4 and 2 kb, respectively, in a given region of a chromosome.
• Detecting a RFLP. Two individuals are polymorphic with respect to a HindIII restriction site (red). The first individual contains the site, so cutting the DNA with HindIII yields two fragments, 2 and 4 kb long, that can hybridize with the probe, whose extent is shown at top. The second individual lacks this site, so cutting that DNA with HindIII yields only one fragment, 6 kb long, which can hybridize with the probe. The results from electrophoresis of these fragments, followed by blotting, hybridization to the radioactive probe, and autoradiography, are shown at right. The fragments at either end, represented by dashed lines, do not show up because they cannot hybridize to the probe.
• Another individual may lack the middle site but have the other two, which are 6 kb apart. This means that if we cut the first person’s DNA with HindIII, we will produce two fragments, 2 kb and 4 kb long, respectively. The second person’s DNA will yield a 6-kb fragment instead. In other words, we are dealing with a restriction fragment length polymorphism (RFLP). Polymorphism means that a genetic locus has different forms, or alleles (Chapter 1), so this clumsy term simply means that cutting the DNA from any two individuals with a restriction enzyme may yield fragments of different lengths. The abbreviated term, RFLP, is usually pronounced “rifflip.”
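• The HindIII example can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the sequences are invented, the scale is shrunk so that 10 bp stands for 1 kb, and the names `digest` and `make_region` are hypothetical. HindIII cuts between the two A's of AAGCTT, which is what `cut_offset=1` models.

```python
SITE = "AAGCTT"  # HindIII recognition sequence


def digest(seq: str, site: str = SITE, cut_offset: int = 1) -> list[int]:
    """Return fragment lengths after cutting seq at every occurrence of site."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)  # cut just after the first A
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def make_region(middle_site: str) -> str:
    """Toy region (10 bp = 1 kb): three site positions, 40 and 20 bp apart."""
    # Filler is all C, so it can never create an extra AAGCTT site.
    return "C" * 5 + SITE + "C" * 34 + middle_site + "C" * 14 + SITE + "C" * 5


person1 = make_region("AAGCTT")  # middle site present
person2 = make_region("AAGCTC")  # one base changed: middle site lost

# Drop the two end fragments, which the probe would not detect.
print(digest(person1)[1:-1])  # internal fragments for person 1
print(digest(person2)[1:-1])  # internal fragment for person 2
```

A single base change removes the middle site, so the two internal fragments of the first individual are replaced by one longer fragment in the second. That length difference is the RFLP.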
• How do we go about looking for a RFLP? Clearly, we cannot analyze the whole human genome at once. It contains approximately a million cleavage sites for a typical restriction enzyme, so each time we cut the whole genome with such an enzyme, we release about a million fragments. No one would relish sorting through that morass for subtle differences between individuals. Fortunately, there is an easier way. With a Southern blot one can highlight small portions of the total genome with various probes, so any differences are easy to see. However, there is a catch. Because each labeled probe hybridizes only to a small fraction of the total human DNA, the chances are very poor that any given one will reveal a RFLP linked to the gene of interest. We may have to screen many thousands of probes before we find the right one. As laborious as it is, this procedure at least provides a starting point, and it has been a key to finding the genes responsible for several genetic diseases.
Exon Traps

• Once a gene has been pinned down to a region stretching over hundreds of kilobases, how does one sort out the genes from the other DNA? If that DNA region has not yet been sequenced, one can sequence it and look for open reading frames (ORFs). An ORF is a sequence of bases that, if translated in one reading frame, contains no stop codons for a relatively long distance. But searching for ORFs is very laborious. Several more efficient methods are available, including a procedure invented by Alan Buckler called exon amplification or exon trapping.
• Exon trapping. Begin with a cloning vector, such as pSPL1, shown here in slightly simplified form. This vector has an SV40 promoter (P), which drives expression of a hybrid gene containing the rabbit β-globin gene (orange), interrupted by part of the HIV tat gene, which includes two exon fragments (blue) surrounding an intron (yellow). The exon–intron borders contain 5′- and 3′-splice sites (ss). The tat intron contains a cloning site, into which random DNA fragments can be inserted. In step 1, an exon (red) has been inserted, flanked by parts of its own introns, and its own 5′- and 3′-splice sites. In step 2, insert this construct into COS cells, where it can be transcribed and then the transcript can be spliced. Note that the foreign exon (red) has been retained in the spliced transcript, because it had its own splice sites. Finally (steps 3 and 4), subject the transcripts to reverse transcription and PCR amplification, with primers indicated by the arrows. This gives many copies of a DNA fragment containing the foreign exon, which can now be cloned and examined. Note that a non-exon will not have splice sites and will therefore be spliced out of the transcript along with the intron. It will not survive to be amplified in step 3, so one does not waste time studying it.
• Positional cloning begins with mapping studies to pin down the location of the gene of interest to a reasonably small region of DNA. Mapping depends on a set of landmarks to which the position of a gene can be related. Sometimes such landmarks are genes, but more often they are RFLPs: sites at which the lengths of restriction fragments generated by a given restriction enzyme vary from one individual to another. Several methods are available for identifying the genes in a large region of unsequenced DNA. One of these is the exon trap, which uses a special vector to help clone exons only. Another is to use methylation-sensitive restriction enzymes to search for CpG islands, which are DNA regions containing unmethylated CpG sequences.
Identifying the Gene Mutated in a Human Disease
• Let us conclude this section with a classic example of positional cloning: pinpointing the gene for Huntington disease. Huntington disease (HD) is a progressive nerve disorder. It begins almost imperceptibly with small tics and clumsiness. Over a period of years, these symptoms intensify and are accompanied by emotional disturbances. Nancy Wexler, an HD researcher, describes the advanced disease as follows: “The entire body is encompassed by adventitious movements. The trunk is writhing and the face is twisting. The full-fledged Huntington patient is very dramatic to look at.” Finally, after 10–20 years, the patient dies. Huntington disease is controlled by a single dominant gene. Therefore, a child of an HD patient has a 50:50 chance of being affected. People who have the disease could avoid passing it on by not having children, except that the first symptoms usually do not appear until after the childbearing years.
• Because they did not know the nature of the product of the HD gene (HD), geneticists could not look for the gene directly. The next best approach was to look for a gene or other marker that is tightly linked to HD. Michael Conneally and his colleagues spent more than a decade trying to find such a linked gene, but with no success. In their attempt to find a genetic marker linked to HD, Wexler, Conneally, and James Gusella turned next to RFLPs. They were fortunate to have a very large family to study. Living around Lake Maracaibo in Venezuela is a family whose members have suffered from HD since the early nineteenth century. The first member of the family to be so afflicted was a woman whose father, presumably a European, carried the defective gene. So the pedigree of this family can be traced through seven generations, and the number of individuals is unusually large: It is not uncommon for a family to have 15–18 children.
• Gusella and colleagues knew they might have to test hundreds of probes to detect a RFLP linked to HD, but they were amazingly lucky. Among the first dozen probes they tried, they found one (called G8) that detected a RFLP that is very tightly linked to HD in the Venezuelan family. The following figure shows the locations of HindIII sites in the stretch of DNA that hybridizes to the probe. We can see seven sites in all, but only five of these are found in all family members. The other two, marked with asterisks and numbered 1 and 2, may or may not be present. These latter two sites are therefore polymorphic, or variable.
The RFLP associated with the Huntington disease gene. The HindIII sites in the region that hybridizes to the G8 probe are shown. The families studied show polymorphisms in two of these sites, marked with an asterisk and numbered 1 (blue) and 2 (red). Presence of site 1 results in a 15-kb fragment plus a 2.5-kb fragment that is not detected because it lies outside the region that hybridizes to the G8 probe. Absence of this site results in a 17.5-kb fragment. Presence of site 2 results in two fragments of 3.7 and 1.2 kb. Absence of this site results in a 4.9-kb fragment. Four haplotypes (A–D) result from the four combinations of presence or absence of these two sites. These are listed at right, beside a list of polymorphic HindIII sites and a diagram of the HindIII restriction fragments detected by the G8 probe for each haplotype. For example, haplotype A lacks site 1 but has site 2. As a result, HindIII fragments of 17.5, 3.7, and 1.2 kb are produced. The 2.3- and 8.4-kb fragments are also detected by the probe, but we ignore them because they are common to all four haplotypes.
• Let us see how the presence or absence of these two restriction sites gives rise to a RFLP. If site 1 is absent, a single fragment 17.5 kb long will be produced. However, if site 1 is present, the 17.5-kb fragment will be cut into two pieces having lengths of 15 kb and 2.5 kb, respectively. Only the 15-kb band will show up on the autoradiograph because the 2.5-kb fragment lies outside the region that hybridizes to the G8 probe. If site 2 is absent, a 4.9-kb fragment will be produced. On the other hand, if site 2 is present, the 4.9-kb fragment will be subdivided into a 3.7-kb fragment and a 1.2-kb fragment.
• There are four possible haplotypes (clusters of alleles on a single chromosome) with respect to these two polymorphic HindIII sites, and they have been labeled A–D:
• The term haplotype is a contraction of haploid genotype, which emphasizes that each member of the family will inherit two haplotypes, one from each parent. For example, an individual might inherit the A haplotype from one parent and the D haplotype from the other. This person would have the AD genotype. Sometimes different genotypes (pairs of haplotypes) can be indistinguishable. For example, a person with the AD genotype will have the same RFLP pattern as one with the BC genotype because all five fragments will be present in both cases. However, the true genotype can be deduced by examining the parents’ genotypes. The following figure shows autoradiographs of Southern blots of two families, using the radioactive G8 probe. The 17.5- and 15-kb fragments migrate very close together, so they are difficult to distinguish when both are present, as in the AC genotype; nevertheless, the AA genotype with only the 17.5-kb fragment is relatively easy to distinguish from the CC genotype with only the 15-kb fragment. The B haplotype in the first family is obvious because of the presence of the 4.9-kb fragment.
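• The relationship between haplotypes, genotypes, and band patterns can be checked with a short sketch. The text states explicitly only that haplotype A lacks site 1 and has site 2. The assignments for B, C, and D below are inferred from the surrounding statements (B shows the 4.9-kb fragment, CC shows only the 15-kb fragment among the site 1 bands, and AD matches BC), so treat them as an assumption. The function names are hypothetical.

```python
# (site 1 present, site 2 present) for each haplotype.
# A is stated in the text; B, C and D are inferred (assumption).
HAPLOTYPES = {
    "A": (False, True),
    "B": (False, False),
    "C": (True, True),
    "D": (True, False),
}


def fragments(hap: str) -> set[float]:
    """Variable HindIII fragments (kb) that the G8 probe detects for one haplotype."""
    site1, site2 = HAPLOTYPES[hap]
    # With site 1 present, the 2.5-kb piece lies outside the probed region.
    frags = {15.0} if site1 else {17.5}
    frags |= {3.7, 1.2} if site2 else {4.9}
    return frags


def pattern(genotype: str) -> list[float]:
    """Band pattern for a two-letter genotype such as 'AD', largest band first."""
    return sorted(fragments(genotype[0]) | fragments(genotype[1]), reverse=True)


for g in ("AA", "CC", "AC", "AD", "BC"):
    print(g, pattern(g))
```

Under these assignments AD and BC give the same five bands, which is why the parents' genotypes are needed to tell them apart. The constant 2.3- and 8.4-kb fragments are left out because they appear in every haplotype.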
• Southern blots of HindIII fragments from members of two families, hybridized to the G8 probe. The bands in the autoradiographs represent DNA fragments whose sizes are listed at right. The genotypes of all the children and three of the parents are shown at top. The fourth parent was deceased, so his genotype could not be determined.
• Which haplotype is associated with the disease in the Venezuelan family? Figure 24.5 demonstrates that it is C. Nearly all individuals with this haplotype have the disease. Those who do not have the disease yet will almost certainly develop it later. Equally telling is the fact that no individual lacking the C haplotype has the disease. Thus, this is a very accurate way of predicting whether a member of this family is carrying the Huntington disease gene. A similar study of an American family showed that, in this family, the A haplotype was linked with the disease. Therefore, each family varies in the haplotype associated with the disease, but within a family, the linkage between the RFLP site and HD is so close that recombination between these sites is very rare. Thus we see that a RFLP can be used as a genetic marker for mapping, just as if it were a gene.
• Finding linkage between HD and the DNA region that hybridizes to the G8 probe also allowed Gusella and colleagues to locate HD to chromosome 4. They did this by making mouse–human hybrid cell lines, each containing only a few human chromosomes. They then prepared DNA from each of these lines and hybridized it to the radioactive G8 probe. Only the cell lines having chromosome 4 hybridized; the presence or absence of all other chromosomes did not matter. Therefore, human chromosome 4 carries HD.
• At this point, the HD mapping team’s luck ran out. One long detour arose from a mapping study that indicated the gene lay far out at the end of chromosome 4. This made the search much more difficult because the tip of the chromosome is a genetic wasteland, full of repetitive sequences, and apparently devoid of genes. Finally, after wandering for years in what he called a genetic “junkyard,” Gusella and his group turned their attention to a more promising region. Some mapping work suggested that HD resided, not at the tip of the chromosome, but in a 2.2-Mb region several megabases removed from the tip. Unless you know the DNA sequence, over 2 Mb is a tremendous amount of DNA to sift through to find a gene, so Gusella decided to focus on a 500-kb region that was highly conserved among about one-third of HD patients, who seemed to have a common ancestor.
• On average, a 500-kb region of the human genome contains about five genes. To find them, Gusella and colleagues used an exon-trapping strategy and identified a handful of exon clones. They then used these exons to probe a cDNA library to identify the DNA copies of mRNAs transcribed from the target region. One of the clones, called IT15, for “interesting transcript number 15,” hybridized to cDNAs that identified a large (10,366 nt) transcript that codes for a large (3144 amino acid) protein called huntingtin. The presumed protein product did not resemble any known proteins, so that did not provide any evidence that this is indeed HD. However, the gene had an intriguing repeat of 23 copies of the triplet CAG (one copy is actually CAA), encoding a stretch of 23 glutamines.
• Is this really HD? A comparison by Gusella’s team of the gene in affected and unaffected individuals in 75 HD families demonstrated that it is. In all unaffected individuals, the number of CAG repeats ranged from 11 to 34, and 98% of these unaffected people had 24 or fewer CAG repeats. In all affected individuals, the number of CAG repeats had expanded to at least 42, up to a high of about 100. Thus, we can predict whether an individual will be affected by the disease by looking at the number of CAG repeats in this gene. Furthermore, the severity, or age of onset, of the disease correlates at least roughly with the number of CAG repeats. People with a number of repeats at the low end of the affected range (now known to be 36–40) generally survive well into adulthood before symptoms appear, whereas people with a number of repeats at the high end of the range tend to show symptoms in childhood. In one extreme example, an individual with the highest number of repeats detected (about 100) started showing disease symptoms at the extraordinarily early age of 2.
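• Counting CAG repeats is simple to express in code. The sketch below is illustrative only and is not a diagnostic tool: the function names are hypothetical, the sequence is invented, and the default thresholds are just the ranges reported for the 75-family study above (at most 34 in unaffected people, at least 42 in affected people). Only uninterrupted CAG runs are counted, so a CAA copy like the one mentioned earlier ends a run.

```python
import re


def longest_cag_run(seq: str) -> int:
    """Number of triplets in the longest uninterrupted run of CAG in seq."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify(repeats: int, max_unaffected: int = 34, min_affected: int = 42) -> str:
    """Place a repeat count relative to the ranges seen in the 75-family study."""
    if repeats <= max_unaffected:
        return "unaffected range"
    if repeats >= min_affected:
        return "affected range"
    return "intermediate"  # not observed in that study


# Invented allele: 20 CAG copies, one CAA, then 2 more CAG copies.
allele = "TT" + "CAG" * 20 + "CAA" + "CAG" * 2 + "CCGCCA"
n = longest_cag_run(allele)
print(n, classify(n))
```

The thresholds are parameters because the text itself gives more than one boundary for the affected range.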
• Finally, two people were affected, even though their parents were not. In both cases, the affected individuals had expanded CAG repeats, whereas their parents did not. New mutations (expanded CAG repeats), although a rare occurrence in HD, apparently caused both these cases of disease.
• Another way of demonstrating that this gene is really HD would be to deliberately mutate it and show that the mutation has neurological effects. Obviously, one cannot perform such an experiment in humans, but it would be feasible in mice, if the gene corresponding to HD is known. Fortunately, HD is conserved in many species, including the mouse, where the gene is known as Hdh. In 1995, a team of geneticists led by Michael Hayden created knockout mice (Chapter 5) with a targeted disruption in exon 5 of Hdh. Mice that are homozygous for this mutation die in utero. Heterozygotes are viable, but they show loss of neurons with corresponding lowering of intelligence. This reinforces the notion that Hdh, and therefore HD, plays an important role in the brain, which is exactly what we would expect of the gene that causes HD.
• How can we put this new knowledge to work? One obvious way is to perform accurate genetic screening to detect people who will be affected by the disease. In fact, by counting the CAG repeats, we may even be able to predict the age of onset of the disease. However, that kind of information is a mixed blessing, as it can be psychologically devastating. What we really need, of course, is a cure, but that may be a long way off.
The Advantage of Genomic Data
• The positional cloning study we have just examined took years, and much of that time was spent sequencing DNA in the suspected regions and trying to determine which gene in the sequence was the most likely culprit. With the human genome now finished, that job has become much easier. Just how much easier is indicated by Neal Copeland, a mouse geneticist who has been doing positional cloning in mice for years. He says, “It took us 15 years to get 10 possible cancer genes before we had the sequence. And it took us a few months to get 130 genes once we had the sequence.” He was talking about the mouse sequence, of course, but the same principle applies to humans, and mouse positional-cloning studies very often identify genes that cause similar problems in humans. So one of the biggest anticipated payoffs of genomics research will be the acceleration of discovery of disease genes in humans. You should not conclude from this discussion that positional cloning is obsolete. It will be important as long as we are curious about finding genes responsible for traits in any organism. Sequenced genomes simply make positional cloning much easier.
• Using RFLPs, geneticists mapped the Huntington disease gene (HD) to a region near the end of chromosome 4. Then they used an exon trap to identify the gene itself. The mutation that causes the disease is an expansion of a CAG repeat from the normal range of 11–34 copies, to the abnormal range of at least 38 copies. The extra CAG repeats cause extra glutamines to be inserted into huntingtin.
Techniques in Genomic Sequencing
• The first genome to be sequenced, as you might expect, was a very simple one: the small DNA genome of an E. coli phage called φX174. Frederick Sanger, the inventor of the dideoxy chain termination method of DNA sequencing, obtained the sequence of this 5375-nt genome in 1977.
• What kind of information can we glean from this sequence? First, we can locate exactly the coding regions for all the genes. This tells us the spatial relationships among genes and the distances between them to the exact nucleotide. How do we recognize a coding region? It contains an ORF that is long enough to code for one of the phage proteins. Furthermore, the ORF must start with an ATG (or occasionally a GTG) triplet, corresponding to an AUG (or GUG) translation initiation codon, and end with the DNA equivalent of a stop codon (UAG, UAA, or UGA). In other words, an ORF in a bacterium or phage is the same as a gene’s coding region.
• The base sequence of the phage DNA also tells us the amino acid sequences of all the phage proteins. All we have to do is use the genetic code to translate the DNA base sequence of each open reading frame into the corresponding amino acid sequence. This may sound like a laborious process, but a personal computer can do it in a split second.
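• Translating an ORF with the genetic code takes only a few lines. The sketch below builds the standard genetic code table and translates a DNA coding sequence codon by codon until it reaches a stop codon. The function name `translate` and the example ORF are invented for illustration, and the sketch makes no special case for a GTG start codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate a DNA coding sequence (sense strand) up to the first stop codon."""
    orf = orf.upper()
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGCAGCAGGCCTAA"))  # invented ORF: Met-Gln-Gln-Ala, then stop
```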
• With the advent of automated sequencing, geneticists have added much larger genomes to the list of total known sequences. In 1988, D.J. McGeoch and colleagues published the sequence of an important human virus (herpes simplex virus I) with a relatively large genome: 152,260 bp. In 1995, Craig Venter and Hamilton Smith and colleagues determined the entire base sequences of the genomes of two bacteria: Haemophilus influenzae and Mycoplasma genitalium. The H. influenzae (strain Rd) genome contains 1,830,137 bp and it was the first genome from a free-living organism to be completely sequenced. The M. genitalium genome, at only 580,000 bp, is the smallest of any known free-living organism and contains only about 470 genes.
• In April 1996, the leaders of an international consortium of laboratories announced another milestone: The 12-million-bp genome of baker’s yeast (Saccharomyces cerevisiae) had been sequenced. This was the first eukaryotic genome to be entirely sequenced. Later in 1996, the first genome of an organism (Methanococcus jannaschii) from the third domain of life, the archaea, was sequenced. Then, in 1997, the long-awaited sequence of the 4.6-million-bp E. coli genome was reported. This is only about one-third the size of the yeast genome, but the importance of E. coli as a genetic tool made this a milestone as well.
• In 1998, the sequence of the first animal genome, from the roundworm Caenorhabditis elegans, was reported. The first plant genome (from the mustard family member Arabidopsis thaliana) was completed in 2000. C. elegans and A. thaliana are both model organisms chosen for study because of their small genome size, short generation time, and their ease of manipulation in genetic experiments. C. elegans has the additional advantages of having fewer than 1000 cells, and being transparent, so the development of each of its cells can be tracked visually. Two other famous model organisms are the fruit fly Drosophila melanogaster and the house mouse Mus musculus. The sequences of the genomes of these two organisms were reported in 2000 and 2002, respectively. Also in 2000, the eagerly awaited rough draft of the human genome sequence was announced. By 2001, this “working draft” of the human genome was published.
Vectors for Large-Scale Genome Projects

• Two high-capacity vectors have been used extensively in the Human Genome Project. Much of the mapping work was done with yeast artificial chromosomes (YACs), which can accept inserts of a million or more base pairs. Most of the sequencing work was performed with bacterial artificial chromosomes (BACs), which can accept up to about 300,000 bp. The BACs are more stable and easier to work with than the YACs.
Shotgun Sequencing

• Massive sequencing projects can take two forms: (1) In the map-then-sequence strategy, one produces a physical map of the genome including STSs, then sequences the clones (mostly BACs) used in the mapping. This places the sequences in order so they can be pieced together. (2) In the shotgun approach, one assembles libraries of clones with different size inserts, then sequences the inserts at random. This method relies on a computer program to find areas of overlap among the sequences and piece them together. In practice, a combination of these methods was used to sequence the human genome.
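• The overlap-finding step in the shotgun approach can be illustrated with a toy greedy assembler. This is a sketch under strong assumptions and not the program used in any genome project: the reads are error-free, all come from the same strand, and contain no repeats, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b
    (at least min_len bases), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping "reads" taken at random order from an invented 21-base sequence.
reads = ["TTAGCCGATAA", "ATGGCGTACG", "GTACGTTAGC"]
print(greedy_assemble(reads))
```

Real genomes contain repeated sequences that mislead a purely overlap-based approach, which is one reason mapped clones were combined with shotgun data in practice.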
• Yeast Artificial Chromosomes
Cloning in yeast artificial chromosomes. We begin with two tiny pieces of DNA from the two ends of a yeast chromosome. One of these, the left arm, contains the left telomere (yellow, labeled L) plus the centromere (red, labeled C). The right arm contains the right telomere (yellow, labeled R). These two arms are ligated to a large piece of foreign DNA (blue), for example several hundred kilobases of human DNA, to form the YAC, which can replicate in yeast cells along with the real chromosomes.
• Bacterial Artificial Chromosomes
Map of the BAC vector, pBAC108L. Key features include the cloning sites HindIII and BamHI, at top; the chloramphenicol resistance gene (CmR), used as a selection tool; the origin of replication (oriS); and the genes governing partition of plasmids to daughter cells.
Self study
Functional Genomics, Proteomics, and Bioinformatics
• Functional Genomics: Gene Expression on a Genomic Scale
• Genomic Functional Profiling: Deletion Analysis
• Tissue-Specific Functional Profiling
• Locating Target Sites for Transcription Factors
• In Situ Expression Analysis
Closing Words

• I have only led you through the door of molecular biology; much more work is needed for you to explore the ocean of knowledge.
