Proposal for an Allele Nomenclature System Based on the Evolutionary Divergence of Haplotypes

HUMAN MUTATION 20:463–472 (2002)
SPECIAL ARTICLE
Daniel W. Nebert
Center for Environmental Genetics, Department of Environmental Health, and Department of Pediatrics/Division of Human Genetics,
University of Cincinnati Medical Center, Cincinnati, Ohio
Communicated by Richard G.H. Cotton

The classical view of what constitutes an “allele” has been challenged by recent findings of a great deal of human genetic variability, i.e., we can expect, on average, one variant site every 100–250 bases of our haploid genome. The haplotype is defined as “the patterns of co-occurrence of variant sites on the same chromosome” (and therefore within each particular gene). Sufficient evidence exists for the divergence of haplotypes during evolution of Homo sapiens sapiens, and the total number of haplotypes per gene will reflect the amount of time any particular ethnic group has existed on the planet, e.g., greatest in Africans, fewer in East Asians, and still fewer in Caucasians. If the average gene spans 30 kb, we can expect ~170 polymorphic variant sites per gene in the world population. We do not see 2^170 haplotypes, however; we might find only 10 to 200 haplotypes (depending on the gene’s size and degree of conservation of the gene product). This finite number allows for a reasonable haplotype nomenclature system for each gene, based on evolutionary divergence. For polymorphic variants (i.e., frequency ≥ 0.01), I propose using Arabic numerals for the major clades (e.g., *1, *2, … *20, *21), capital letters for sublineages (e.g., *2A, *2B, *2C), and Arabic numerals for sub-sublineages (e.g., *22G12, *22G13); additional subcategories may be added, in an alternating number/letter/number/letter sequence, depending on the complexity of present-day haplotypes of a particular gene. Web sites with a web master and external advisory committee should be set up for each gene superfamily, family, or individual gene (depending on complexity), and an international haplotype nomenclature committee, perhaps comprised of several dozen of these web masters, should oversee haplotype nomenclature for the entire human genome. The higher heterozygosity and multiallelic nature make haplotypes more informative than biallelic SNPs. Ultimately, our knowledge of haplotype patterns, rather than single variant sites, of perhaps several hundred genes will likely be helpful in finding associations between genotype and any multiplex phenotype (e.g., complex diseases including cancer, and/or toxicity of pharmaceutical agents or environmental pollutants).
Hum Mutat 20:463–472, 2002. © 2002 Wiley-Liss, Inc.
KEY WORDS: nomenclature; variation; allele; evolution; bioinformatics; haplotype
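The naming scheme summarized above (a leading asterisk followed by alternating Arabic numerals and capital letters) lends itself to mechanical checking. The following Python sketch is an illustration only and is not part of the proposal itself: the function names `parse_haplotype_name` and `clade_of` are hypothetical, and it assumes that every letter level is a single capital letter and that I, O, P, and X are excluded, as recommended later in the text.

```python
import re

# Letters the article recommends avoiding (confusable with "l", zero,
# pseudogene, and chromosomal designations).
FORBIDDEN_LETTERS = set("IOPX")

# A level is either a run of digits or one capital letter (assumption).
_LEVEL = re.compile(r"[0-9]+|[A-Z]")


def parse_haplotype_name(name):
    """Split a name such as '*7A2S29Q1' into its alternating levels.

    Returns a list of strings, e.g. ['7', 'A', '2', 'S', '29', 'Q', '1'].
    Raises ValueError if the name does not follow the assumed pattern.
    """
    if not name.startswith("*"):
        raise ValueError("haplotype name must start with '*'")
    body = name[1:]
    levels = _LEVEL.findall(body)
    # Reject empty names and any character the tokenizer skipped.
    if not levels or "".join(levels) != body:
        raise ValueError(f"unexpected characters in {name!r}")
    for index, level in enumerate(levels):
        expect_number = index % 2 == 0  # number, letter, number, ...
        if expect_number != level.isdigit():
            raise ValueError(f"levels must alternate number/letter in {name!r}")
        if not expect_number and level in FORBIDDEN_LETTERS:
            raise ValueError(f"letter {level!r} should be avoided in {name!r}")
    return levels


def clade_of(name):
    """Return the major clade of a haplotype name, e.g. '*7' for '*7A2'."""
    return "*" + parse_haplotype_name(name)[0]
```

Under these assumptions, `*7A2S29Q1` splits into the levels 7, A, 2, S, 29, Q, 1 and belongs to the *7 clade, whereas a name such as `*2X1` is rejected.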
Received 24 April 2002; accepted revised manuscript 21 August 2002.
Correspondence to: Daniel W. Nebert, M.D., Department of Environmental Health, University of Cincinnati Medical Center, P.O. Box 670056, Cincinnati OH 45267-0056. E-mail: dan.nebert@uc.edu
Contract grant sponsor: NIH; Contract grant numbers: P30 ES06096; R01 ES06321; R01 ES08147; R01 ES10416.
DOI: 10.1002/humu.10143
Published online in Wiley InterScience (www.interscience.wiley.com).

INTRODUCTION

At the December 2001 Symposium on Gene Expression and Proteomics in Environmental Health Research (Bethesda, Maryland), Leroy Hood described the fast-moving field of high-throughput DNA sequencing: “Between 1990 and 2002, ‘one’ haploid genome has been more or less sequenced at a cost of more than US$500 million. During this time, there has been a 2,000-fold increase in throughput, resulting in greater quality and diminishing costs per base sequenced. Ten years from now, I predict that an individual’s entire genome will be sequenced in one day, at a cost of US$10,000.” It is within the framework of Hood’s estimation that this special article is offered. Because the human genome has now not only been sequenced, but an enormous amount of data on interindividual and ethnic variability in and

around each gene is exploding at such an increasingly rapid pace, I would like to propose that we begin thinking about a standardized nomenclature system to categorize and simplify the extensive variability that will be found in and around each human gene.

Terms Defined

Polymorphism. A polymorphism exists any time there are two or more subgroups (of a phenotype or genotype) in any species population. E. B. Ford in the 1940s and, more recently, Harry Harris [1980] defined a genetic polymorphism when the “commonest identifiable allele has a frequency no greater than 0.99.” When an allele has a frequency of 0.10 or greater, this is termed a “common variant.” When an allele has a frequency of 0.01 or greater, this is called a “polymorphic variant,” and if a minor allele has a frequency of less than 0.01, this is a “rare variant.” Considering the Hardy-Weinberg distribution (p^2 + 2pq + q^2 = 1), therefore, p has classically denoted the frequency of the most common allele and q represents the sum of all other (common, polymorphic, plus rare variant) alleles.

Gene. A gene is any “segment of DNA from which a functional unit (gene product) is derived.” The length of a gene should extend from the 5′-most experimentally proven enhancer region to the 3′-most DNA motif affecting expression of that gene. What represents the span for each of the ~50,000 genes in the human genome, and all their regulatory regions, is anticipated to be fully resolved during the next several years. The length of a 5′-regulatory region can be enormous. For example, the β-globin gene involves enhancer sequences scattered over hundreds of kilobases 5′ and 3′ of the HBB gene [Andrin and Spencer, 1994; Levings and Bungert, 2002], and regions controlling the sex-determining-region-Y-box-9 (SOX9) gene associated with campomelic dysplasia extend over more than 1 Mb [Pfeifer et al., 1999]. One gene can be located entirely within, or overlapping with, a second gene, or might also be transcribed on the opposite strand located within, or overlapping with, a second gene. We must also consider as a gene any small stretch of DNA (usually 21–23 bp) that transcribes antisense RNA, because this acts as a “functional unit” in RNA interference (RNAi), a recently discovered form of gene regulation [Bernstein et al., 2001; Moss, 2001; Nishikura, 2001]. It would also seem most reasonable that multiple transcripts as the result of alternative splicing be called separate gene products, each derived from a “gene.” These topics, however, are beyond the scope of this special article.

Allele. During the past three decades of human genetics, our understanding of an “allele” has changed, due to the unexpected degree of variability in each gene. For example, the etiology of phenylketonuria (PKU) was first found to be caused by the IVS12+1G>A “mutant allele” of the phenylalanine hydroxylase (PAH) gene [DiLella et al., 1986]; this was soon followed by discovery of the R261Q and R408W “mutant alleles.” It has now been determined that the most frequent Caucasian mutations include these three, plus M1V, Y414C, and IVS10+546 [Eisensmith and Woo, 1992], with no single mutation accounting for the majority of PKU patients. There has been a simple numbering system (#1 through #87) on the PKU web site for the extended PAH haplotypes, of which at least 87 are known so far [Scriver and Prevost, 2002]. Currently there are more than 400 PAH gene mutations responsible for the PKU phenotype, with no “cosmopolitan” haplotype, i.e., none that is present in all populations worldwide [Kidd et al., 2000; Scriver et al., 2000].

A similar story exists for cystic fibrosis. Defects in the CFTR gene, which comprises 27 exons and spans ~190 kb, were shown to be associated with the disease phenotype [Kerem et al., 1989]. Since the first ΔF508 “mutant allele” [Claustres et al., 1990], today there are more than 830 CFTR variants responsible for cystic fibrosis. The same story also holds for Gaucher disease, which is the result of defects in the glucosylceramide β-glucosidase (GBA) gene. Since the time that the first N370S “mutant allele” was reported, there are currently more than 100 mutant GBA alleles described in the literature [Beutler, 1993].

Mutation. Even the meaning of the term “mutation” is changing [Marshall, 2002]. Alterations in the DNA sequence are perhaps better referred to as “variants.” Although most DNA sequence variants are harmless, some are responsible for predominantly monogenic diseases, and presumably many are associated with complex diseases. “Variants” are generally associated with harmless polymorphisms in a gene, whereas “mutations” might be regarded as “disease-causing” [den Dunnen and Antonarakis, 2001; Cotton and Scriver, 1998]; in terms of complex diseases, variant sites, in isolation, will not cause disease, whereas the combination of numerous variant sites together may cause disease.

Thus, as additional populations are screened for any disease gene, more and more “variant alleles” are discovered. The same has been found for all genes: the more one searches, the more variant sites one finds. For each gene, therefore, what used to be called “the allele” now needs to be reconsidered in terms of the haplotype.

Haplotype. The haplotype represents the “patterns of co-occurrence of variant sites on the same chromosome” (and therefore within each particular gene). The number of haplotypes per gene is considerably smaller than the total number of variant sites found in the whole population for that gene. For
example, although 78 DNA sequence variants were found in resequencing the 24 kb of the ACE gene in 11 individuals (Fig. 1), the authors did not find 2^78 haplotypes; rather, 13 distinct patterns, or haplotypes, were identified [Rieder et al., 1999]. (The co-authors did identify 26 “singletons,” however, and, until a sufficient number of additional individuals from ethnically very diverse backgrounds are sequenced, one cannot be confident as to how many of these singletons might be part of a distinct haplotype.) Moreover, there can be problems with a string of tightly linked biallelic SNPs, in which linkage disequilibrium would mitigate ambiguity, partially by making some haplotypes a priori more likely than others [Hodge et al., 1999; Stephens et al., 2001a].

FIGURE 1. ACE genotypes and haplotypes. Top, genotypes of the 11 individuals for each of the 78 variant sites found. Individual sample identifiers are shown at left (C = European-American; S = African-American), and at the top the variant sites are numbered consecutively (5′ to 3′ across the 24,070-base region that was resequenced). At each site, individuals homozygous for the common allele are denoted blue, homozygotes for the rare allele are yellow, and heterozygotes are red. Middle, genotypes of the 11 individuals at the 52 non-unique polymorphic sites (the 26 singletons are excluded). Bottom, corresponding haplotypes for each individual; 13 unique haplotype patterns were resolved from the 22 chromosomes (reproduced with permission from authors and journal [Rieder et al., 1999]).

Evolutionary Divergence of Haplotypes

It is now quite clear that Homo sapiens evolved in Africa and migrated to Europe and Asia in two or more waves [Cavalli-Sforza and Cavalli-Sforza, 1995; Underhill et al., 2000; Templeton, 2002]. The most recent, and most significant, diaspora occurred between 110,000 and 60,000 years ago and probably accounts for most if not all of present-day Homo sapiens sapiens. Thousands of years of being essentially geographically isolated led to development of the five major races: African, East Asian, Caucasian, Pacific Islander, and Native American [Risch et al., 2002].

Long-distance travel by foot, then horse-and-cart, and then ships, was responsible for genetic admixture during the last two or more thousands of years; furthermore, travel across great distances became greatly accelerated by the train, automobile, and airplane more recently. The remarkably high degree of present-day genetic admixture has recently been described [Excoffier et al., 1992; Destro-Bisol et al., 1999; Lonjou et al., 1999; Owens and King, 1999]. It is therefore obvious that many humans now share African and Caucasian alleles; many share African and East Asian alleles; and others share African, Caucasian, and East Asian alleles [Nebert and Menon, 2001]. The “Latino” is a good example of admixture between Spanish, African, and Amerindian genes in just the last 400 years (~20 generations). Individuals in some geographically isolated pockets may possess relatively little admixture (e.g., uniquely
African alleles in the rural-living sub-Saharan African native; Amerindian alleles in tribes located in the jungles of Brazil or Panama). DNA sequence variant sites thus have “evolved” over thousands of years, just as genes have evolved over millions of years. Today’s haplotypes (for any given gene) therefore mirror all the DNA sequence variants that have occurred before in the particular lineage of that person, and to be able to unravel the evolution of these haplotype patterns is to understand better the origin of human populations.

Suggested Approach to Naming Haplotypes Based on Evolutionary Divergence

As the world population becomes more thoroughly studied, it has been estimated [Kruglyak and Nickerson, 2001] that we will asymptotically approach 6 million common (q ≥ 0.10) and 11 million polymorphic (q ≥ 0.01) variant sites. The latter number can be extrapolated to one variant site, on average, per 100–250 bases. Hence, for the average gene spanning 30 kb, we would anticipate between 120 and 300 polymorphic sites in the world (170 sites, on average), but 2^170 haplotype patterns are not seen. A much smaller number of haplotypes is seen, and this is the basis for proposing a reasonable nomenclature system for naming all haplotypes associated with a particular gene.

It is possible to discern haplotype patterns by eye (Fig. 1), and their divergence from one another as a function of evolutionary time can be estimated using various software programs (Fig. 2A). Instead of using “H” for haplotype (Fig. 2A), it is proposed (Fig. 2B) to name these haplotypes first using Arabic numbers, e.g., “members of the *1, *2, *3, *4, *5, etc. clade.” Sublineages of each clade would be designated with capital letters (e.g., *6A, *6B, *6C, etc.), and individual present-day haplotypes would be given Arabic numerals (e.g., *12A1, *12A2, *12A3, etc.). Use of “I,” “O,” “P,” and “X” should be avoided so as not to be confused with lower-case “l,” zero, pseudogene, or chromosomal designations. This same
approach to a nomenclature based on divergent evolution, including number/letter/number, has been proposed for several dozen gene families and superfamilies [Nebert et al., 1987; Nelson et al., 1996; Mackenzie et al., 1997; Vasiliou et al., 1999; Nuclear Receptors Nomenclature Committee, 1999; Freimuth et al., 2000; Mier et al., 2001; Nelson, 2002; Povey et al., 2002], which has been uniformly and enthusiastically embraced by the scientific community. It should be noted that, in contrast to naming genes, the alternating use of number/letter/number/letter for designating haplotypes of genes that have diverged throughout the evolution of Homo sapiens sapiens, might possibly become strings of four or more numbers and letters each (e.g., *7A28T17L47B88), depending on the gene and especially for larger genes.

Figure 2C is an attempt to illustrate that the oldest DNA sequence variants will be derived from the oldest (African) populations, and more recent variant sites will have appeared in East Asians, Caucasians, Pacific Islanders, Native Americans, and so forth. Figure 3 illustrates the divergence of haplotype patterns in more detail. Figure 3 includes examples of identity-by-descent, identity-by-state, and back-mutation of a variant site to the base present in the ancestral gene. This latter scenario serves to emphasize the point that studies of genotype–phenotype associations using only a single variant site can be equivocal. Obviously, genotype–phenotype associations with one or more informative haplotypes should provide a more unequivocal approach to such studies.

FIGURE 2. A: Consensus parsimony tree for the 13 human ACE haplotypes (H), constructed from the 52 nonunique variant sites. The size of each H encircled is correlated with the frequency of that haplotype in the population studied; the only haplotypes that were found more than once in these 11 individuals were H1 (n = 5), H6 (n = 4), and H7 (n = 3). The program DNAPARS in Phylip 3.5 was used to infer the maximum parsimony tree. The insertion (I) or deletion (D) of the reverse-oriented Alu I repeat element in intron 16 has been used as a common marker in numerous association studies; the D allele is found in chimpanzee and African populations (modified and reproduced with permission from authors and journal [Rieder et al., 1999]). B: The same consensus parsimony tree, with haplotype nomenclature consistent with that suggested in the text. C: Hypothetical divergence of these 13 haplotypes over evolutionary time, consistent with the parsimony tree. Note that the Alu I insertion (Alu*Ins) in intron 16 occurs in the *4 clade (Caucasian portion of this tree) and is not seen in the *1, *2, or *3 clades of African origin. P, present day.

FIGURE 3. Diagram of a hypothetical gene, and how it has diverged through tens of thousands of years of evolution, to result in today’s haplotype patterns (depicted across the bottom). For the sake of simplicity, only 12 variant sites are depicted; open rectangles denote variant sites in the ancestral gene, and closed rectangles denote mutated variant sites. For the sake of simplicity, only number/letter/number patterns are shown here, although additional letters and numbers are possible with more complex haplotypes (discussed in text). The *1A1 allele, with a single variant site (position #5) mutated, is agreed upon by investigators in the field as the consensus haplotype in today’s population, although the frequency need not be >0.50 in any ethnic group (as discussed in the text). Several points are illustrated. First, the number of variant sites generally increases as a function of evolutionary time. Second, there are many cases of “identity-by-descent” (IBD) in which the variant site in the present generation (haplotypes lined up along the bottom) reflects mutations that arose by direct lineage in earlier generations. For example, variant site #7 arose early in the “*3 clade” and still exists today in haplotypes *3A1, *3A2, *3A3, and *3B1; the mutated variant site #7 thus defines the “*3 clade.” Third, variant site #1 exists in both the *2B1 and the *4A1 haplotype, but they can be seen to have arisen independently; this is “identity-by-state” (IBS). Finally, variant site #2, although representing an ancient mutation and present in the *2, *3, and *4 clades, is shown in haplotype *2B1 as back-mutated to the reference base; this, of course, can happen. This example serves to emphasize the point that the possibility of back-mutation of any individual SNP is a good reason why genotype–phenotype association studies should be carried out on haplotype patterns rather than with individual variant sites.

By consensus, the “*1 haplotype” might be agreed upon by investigators in the field as the reference haplotype for that gene in the entire world population. However, it should be kept in mind that, for many genes, no “major allele” might exist world-wide. For example, Stephens et al. [2001a], in a study of 82 unrelated individuals (Africans, East Asians, Caucasians, Hispanic-Latino, and Amerindians), found 35% of the 331 genes examined to exhibit no allelic frequency >0.50.

For a hypothetical 8-exon gene (including 5′ and 3′ regulatory regions) spanning 30 kb and having 30 variant sites, therefore, the conventional haplotype by the current rules would be designated as: “-4588A>G; -3460_3458delACT; -2170G>C; -1897C>G; -466insGG; -15G>T; +173A>C; W26C; S155Y; IVS1+2C>T; IVS1+408A>G; IVS1+816C>T; IVS1-8C>T; IVS2+168G>A; +888T>C; +899C>T; +2244delACCC; IVS2-3T>C;
R249Q; L391N; +1,508_09dupTG; IVS4+16C>T; IVS6-9C>T; +3167_72AGGTCAinsTG; +3588A>C; +4460_4464delACGGT; +5270G>C; IVS7+1G>A; +6331dupCCC; +7652C>T.” It is thus proposed that it would be more convenient to name this haplotype, for example, “*7A2,” because “it includes the +1,508_09dupTG; IVS4+16C>T that assigns it to the *7 clade.” The “S155Y; IVS7+1G>A; +6331dupCCC” might further be considered the signature for assigning the haplotype to the “*7A sublineage,” and the “+888T>C; +899C>T” variant sites make it a unique newly defined haplotype (*7A2) within the “*7A sublineage.”

This hypothetical example is probably overly simplistic, compared with real-life examples. For example, we might discover the haplotype should be named *7A2S29Q1. It must also be emphasized that this method of naming alleles is quite arbitrary and depends to a large extent on which allele is sequenced first, on which haplotype is called the reference sequence, and on the chronological order in which variant alleles are revealed. It is predicted that, because of our rapidly increasing achievements in high-throughput resequencing (see Lee Hood’s quote at the beginning of this article), the evolution of haplotype patterns will become defined for every gene in the human genome within the next few years. The easiest forum to record all haplotypes discovered for a particular gene is the Internet, and clearly a haplotype nomenclature committee should agree to serve as advisors to this web site so that all complex decisions regarding new haplotypes can be properly and efficiently discussed and taken care of.

It is realized that this described task is so enormous that web sites with a web master and external advisory committee should be set up for each gene superfamily, family, or individual gene (depending on complexity). For example, the Duchenne muscular dystrophy (dystrophin, DMD) gene on the X chromosome spans at least 2.4 Mb, having nine distinct promoters and therefore N-termini, and having at least two different 3′-termini. The neurexin-3 (NRXN3) gene on chromosome 14 spans 1.46 Mb, and the first intron is 479 kb. An international haplotype nomenclature committee, perhaps comprised of several dozen of these web masters, should then oversee haplotype nomenclature for the entire human genome.

This type of haplotype nomenclature would be useful in the fields of criminology and anthropology, as well as the fields of metabolism genes, transcription factor genes, receptor genes, transporter genes, or any other family of genes of interest to a particular research group. A simplified nomenclature system, such as that described here, will aid not only the researchers working on the alleles of a particular human gene but should also help postdoctoral fellows and graduate students entering the field. This special article is intended to offer a long-range view of concern, that we urgently need such a nomenclature system in place before the explosion of high-throughput resequencing information for each gene becomes overwhelming to us all.

Haplotype Inference Methods

To date, it has not been trivial to determine which variant sites occur together on the same chromosome or within the same gene. Using PCR primers on genomic DNA as a template, obviously, one has an equal probability of sequencing either of the two chromosomes. Family studies are of course helpful in discerning haplotypes [Stephens et al., 2001a]. Additional methods for inferring haplotype include: the classical algorithm by Clark [1990]; the expectation-maximization (EM) algorithm [Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995]; a pseudo-Gibbs sampler (PGS) Bayesian algorithm [Stephens et al., 2001b]; and a novel Bayesian Monte Carlo method with an underlying statistical model similar to that of the EM [Niu et al., 2002]. Clark’s parsimony algorithm is most successful when there is a sufficient number of homozygous reference and homozygous variant haplotypes [Rieder et al., 1999; Nebert, 2000]. For example, studying six SNPs of the NAT2 gene in 241 Panamanian Amerindians (482 chromosomes), Jorge-Nebert et al. [2002] used the Clark method successfully in resolving seven alleles, which had already been named on the web [Hein et al., 2002], into seven distinct haplotypes that had evolved over time from the original NAT2*4 consensus allele. The EM method is accurate in the inference of common haplotypes [Tishkoff et al., 2000; Zhang et al., 2001] but cannot handle a large number of sequence variants. The PGS method [Stephens et al., 2001b] provides an appealing strategy for the incorporation of evolutionary effects in haplotype construction; its shortcomings are detailed by Niu et al. [2002]. The pros and cons of using the latest Bayesian Monte Carlo method, including employment of partition ligation and prior annealing, are described by Niu et al. [2002]; for 100 subjects, their software currently can handle 256 SNPs, but for a sample of 1,000 individuals, their software is currently limited to only 50 SNPs.

New and better software programs are expected in the near future to help determine haplotype patterns, especially for larger populations and for larger gene sizes. Obviously, the larger the gene, the larger the number of variant sites expected to be found, and the greater the complexity might be in distinguishing the haplotypes. For genes spanning >100 kb or >1 Mb, for example, the recombination fraction will increase, and the definition of haplotype patterns, as well as carrying out linkage disequilibrium studies, will become increasingly problematic.
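To make the EM idea mentioned above more concrete, the following Python sketch estimates haplotype frequencies from unphased genotypes at a few biallelic sites. It is a minimal illustration under stated assumptions, not the software of any of the cited authors: the function names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical, genotypes are coded as the number of variant alleles per site (0, 1, or 2), Hardy-Weinberg proportions are assumed, and every compatible haplotype pair is enumerated, which is only practical for a small number of sites.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with one genotype.

    genotype: tuple with 0, 1, or 2 copies of the variant allele per site.
    Haplotypes are tuples of 0 (reference base) or 1 (variant base).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites are fixed
    pairs = set()
    for phase in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, phase):
            h1[site], h2[site] = allele, 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by expectation-maximization."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pl in pair_lists for pair in pl for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    n_chromosomes = 2 * len(genotypes)
    for _ in range(n_iter):
        expected = defaultdict(float)
        for pairs in pair_lists:
            # E-step: probability of each phase under Hardy-Weinberg
            # (p_a * p_b, doubled when the two haplotypes differ).
            weights = [freq[a] * freq[b] * (1 if a == b else 2)
                       for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M-step: new frequencies are the expected haplotype counts.
        updated = {h: expected[h] / n_chromosomes for h in haplotypes}
        change = max(abs(updated[h] - freq[h]) for h in haplotypes)
        freq = updated
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    # Two sites, four individuals; only the double heterozygote is ambiguous.
    sample = [(0, 0), (0, 0), (2, 2), (1, 1)]
    for hap, f in em_haplotype_frequencies(sample).items():
        print(hap, round(f, 4))
```

In this toy sample the homozygous individuals fix the presence of the (0, 0) and (1, 1) haplotypes, so the estimate assigns the double heterozygote almost entirely to that pairing. The number of candidate pairs doubles with each additional heterozygous site, which is one way to see why exhaustive EM struggles with a large number of sequence variants.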
Other Proposed Haplotype Nomenclature Systems

Early attempts at setting general rules for the naming of human alleles have been developed, initially from the CYP2D6 report [Daly et al., 1996], and more recently for the UGT1A1 gene [Mackenzie et al., 1997; McKinnon and Mackenzie, 2002], several ALDH genes [Vasiliou et al., 1999; Vasiliou, 2002], the NAT2 and NAT1 genes [Hein et al., 2002], and 18 CYP genes [Oscarson et al., 2002]. By and large, these endeavors at naming alleles have been very helpful to colleagues in the field. Some have wanted to list only those alleles with an experimentally proven altered function of the gene product. It should be emphasized, however, that these web sites have not yet fully taken into account the importance of haplotypes that will be found to exist, nor the significance of noncoding variant sites that may or may not be shown experimentally to change the activity or function of the gene product. Of course, catalytic changes would be of great interest to, for example, an enzymologist or pharmacologist; receptor affinity would be of most importance to the receptorologist. However, it is suggested that all variant sites within the total span of each gene be given equal weight and should always be considered in specifying haplotype patterns. These haplotypes ultimately will become useful in genotype–phenotype association studies involving complex diseases such as risk of diabetes mellitus, obesity, hypertension, stroke, rheumatoid arthritis, or particular forms of cancer or toxicity caused by drugs or environmental chemicals.

For several years, some web curators sponsoring a disease gene (e.g., the PAH gene in phenylketonuria) have listed haplotypes simply as numbers in succession of chronology as to when they were first discovered or reported (e.g., #1, #2, … #86, #87, etc.) [Scriver et al., 2000]. A recently proposed complex SNP-based haplotype nomenclature system for three human helicases having implications for cancer association studies [Trikka et al., 2002] was described, based on the location of the SNP within the gene (e.g., B21.2, W15.3, and R19 for SNPs in the 22nd intron of the BLM, the 24th intron of the WRN, and the 10th intron of the RECQL gene). Neither of these two proposed systems provides any insight into the origin of human populations, however, compared with the nomenclature system proposed in the present special article.

While this special article was being written, there has been a very recent explosion in the number of haplotype-naming systems described; this outburst of information is consistent with the predictions of Lee Hood, detailed in the first part of this article. For example, several unrelated and nonsystematic nomenclatures for the nonrecombining portion of the Y chromosome (NRY) into binary haplogroups had been an increasing source of confusion. (“Haplogroup” refers to NRY lineages defined by binary polymorphisms, whereas “haplotype” is reserved for all sublineages of haplogroups that are defined by variations at STRs on the NRY locus.) The Y Chromosome Consortium therefore constructed a single most-parsimonious phylogeny for 245 markers into 153 haplogroups, and a simple set of rules was developed to label unambiguously the different clades nested within this tree [Hammer et al., 2002]. Their approach is quite similar to the one set forth in this special article, except theirs is letter/number/letter whereas the one proposed herein is number/letter/number. The consortium proposed two complementary NRY nomenclatures, and these are viewed by this author as highly commendable: the first one being hierarchical to enable clades at all levels to be named without ambiguity, and the second one retaining the major haplogroup information from past publications (e.g., “M” for “mutation,” “P” for “polymorphism”). The proposed cladistic nomenclature for NRY [Hammer et al., 2002] is considerably better than the proposed naming of “clusters” representing human mtDNA diversity [Richards et al., 1998], which had been enthusiastically received by many groups and has greatly advanced studies of maternal lineages and communication of their conclusions. A recent report describes further the mtDNA diversity within the New World haplogroups among Native North Americans [Malhi et al., 2002].

One small problem with this nomenclature system [Hammer et al., 2002], in my opinion, is that each clade is assigned a capital letter including “I,” “O,” and “P,” followed by numbers to define sublineages, then further divided into lowercase letters followed by Arabic numerals, and so on (e.g., A3b2b, E3b3a1, I1b2a, O3d1, etc.). I believe that the Arabic numeral/capital letter/Arabic numeral/capital letter system, as proposed in the present special article, and avoiding use of I, O, P, or X, is preferred. For major clades, especially beyond 22 in number for any gene, the use of Arabic numerals makes it easier than using capital letters. (Agreed, one might use double-letters to designate those beyond 22 clades, e.g., AA3b2b, EE3b3a1.) Irrespective of this, I submit that *1C3L5, *29E4H7Q, etc. is a better nomenclature system than A3b2b, E3b3a1, I1b2a, O3d1, etc.

Two caveats were raised by the Y Chromosome Consortium [Hammer et al., 2002]. First, not all polymorphisms in the database have been genotyped in all individuals. This is always going to be a problem, as the isolated sequencing of exons and intronic regions only near exons has been the standard practice for some years; resequencing of all exons, introns and flanking regions has only begun to replace the older approach. Second, it is possible that some variant sites assumed to be unique actually are recurrent on the tree. This latter point has been illustrated in Figure 3
of the present special article (examples of identity-by-state and of back-mutation to the ancestral base).

Two proposals were made by the Y Chromosome Consortium [Hammer et al., 2002]. First, a nomenclature committee must be willing and able to receive requests from investigators who wish new markers or haplogroups to be incorporated into the nomenclature tree, and this committee will make joint decisions on changes to be made with the existing nomenclature system. Second, the current nomenclature and the committee’s contact details will be made available on a web site. These same proposals are suggested in the present special article, specifically, that I propose the haplotype nomenclature system to be carried out for each gene (or class, family, or superfamily of genes), as the Y Chromosome Consortium proposes to do for the NRY locus.

As the Y Chromosome Consortium, as well as the present special article, have pointed out, it is extremely valuable to consider the evolution of linkage disequilibrium [Nordborg and Tavaré, 2002], as well as any data on population genetics, in order to understand genealogy, origin of human populations, the Great Human Diaspora, and the forces that shape genetic diversity. This knowledge is likely to lead to a better understanding of the population basis of etiology of complex diseases.

The explosion of haplotype information is presently very much in full swing. The global haplotype structure of chromosome 21 was recently studied [Patil et al., 2001]. Very recently, a programming algorithm for “haplotype-block” partitioning was presented [Zhang et al., 2002], which decreases the size of the haplotype block to between 10 and 92 kb, about one-third the size of that reported earlier by Patil et al. [2001]. Moreover, the haplotype patterns across 51 autosomal regions (spanning 13 Mb of the human genome) have recently been characterized in African, East Asian, and Caucasian genomes [Gabriel et al., 2002]. This study, as part of the SNP Consortium Allele Frequency Project, provides the best foundation yet published for the imminent construction of a complete haplotype map of the human genome.

Which is better? Studying the evolutionary divergence of haplotype patterns of a specific gene, or of a particular haplotype block? I would much prefer that of a particular gene, because geneticists are generally interested in a specific gene or family of genes. I thank Richard G.H. Cotton for inviting me to write this special article, and I invite interested colleagues to use this journal as a forum, or to correspond directly with me at dan.nebert@uc.edu for further discussion of this rapidly evolving topic.

CONCLUSIONS

Genes within a growing number of superfamilies are being classified into families and subfamilies and thereby named, based on evolutionary divergence, using an Arabic numeral/capital letter/Arabic numeral format. Alleles of particular genes have also begun to be named in this fashion. Such a system is convenient and provides important evolutionary information for those in the field and for young colleagues entering the field. In most cases, these gene names were in place in anticipation of the complexity and chaos that might have occurred if no such system were in place. Similarly, given the explosion in successful advances in high-throughput resequencing methods, it is now time to seriously consider such a nomenclature system for the haplotype patterns of each human gene, based on the evolutionary divergence of its variant sites. Haplotype nomenclature of genes, in my opinion, is more informative than haplotype nomenclature of chromosomal segments, or haplotype blocks. It is expected that the number of haplotypes will range between about 10 and 200 for most genes; this is a reasonable number for instituting such a taxonomic system. It would be most convenient to establish these innumerable haplotype nomenclature systems for genes (or families of homologous genes) on the web, because curators/volunteers could keep the information updated much more frequently than anyone might do in a series of published journal articles. Moreover, each web site, taking care of one superfamily, one family, or one gene depending on its complexity, should have an external advisory committee to oversee this information flow and to advise/participate in all decisions necessary to maintain accuracy of this information. An international haplotype nomenclature committee, perhaps comprised of web masters from some of these sites, should oversee the haplotype nomenclature of the entire human genome. The development and acceptance of such a nomenclature system will greatly facilitate research in the medical and genetics fields, decrease confusion in the literature, and be extremely important for the future of human genetics, anthropology, ecogenetics and pharmacogenetics, and molecular epidemiology.

NOTE ADDED IN PROOF

Between the time of submitting this manuscript and viewing the galley proofs, it has now become clear that “haplotype blocks” are shorter in older populations (e.g., African) and longer in younger populations (e.g., Finnish). The lengths of “short” blocks can be <1 kb to 5 kb; the lengths of “long” blocks can be 300 kb to >1 Mb; the average length of a haplotype block appears to be ~20 kb.

ACKNOWLEDGMENTS

The concepts proposed in this special article were initiated at the Second International Nomenclature Workshop (May, 1999) in Cambridge, England. It is
realized that nomenclature issues are always complex and often difficult to arrive at a consensus. This author was first invited in 1995 by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) to promote the nomenclature of various gene superfamilies on the basis of divergent evolution. Because of the interest and success in helping format the naming of several gene superfamilies, this author in 1999 accepted an invitation to join the 12-member International Advisory Committee (IAC) of the Human Gene Nomenclature Committee (HGNC), which is endorsed by the Council of the Human Genome Organization. I thank my numerous colleagues, especially Lucia Jorge-Nebert, Mark Rieder, Debbie Nickerson, Richard Cotton, and Beate Pesch for valuable discussions and suggestions.

REFERENCES

Andrin C, Spencer C. 1994. The intricacies of β-globin gene expression. Biochem Cell Biol 72:377–380.
Bernstein E, Denli AM, Hannon GJ. 2001. The rest is silence. RNA 7:1509–1521.
Beutler E. 1993. Gaucher disease as a paradigm of current issues regarding single gene mutations of humans. Proc Natl Acad Sci USA 90:5384–5390.
Cavalli-Sforza LL, Cavalli-Sforza F. 1995. The great human diasporas: the history of diversity and evolution. New York: Addison-Wesley Publishing Company, Inc. 300 p.
Clark AG. 1990. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111–122.
Claustres M, Desgeorges M, Kjellberg P, Demaille J. 1990. Identification of carriers by screening for ΔF508 deletion in a multi-generation cystic fibrosis family. Genet Couns 1:211–217.
Cotton RG, Scriver CR. 1998. Proof of “disease-causing” mutation. Hum Mutat 12:1–3.
Daly AK, Brockmoller J, Broly F, Eichelbaum M, Evans WE, Gonzalez FJ, Huang JD, Idle JR, Ingelman-Sundberg M, Ishizaki T, Jacqz-Aigrain E, Meyer UA, Nebert DW, Steen VM, Wolf CR, Zanger UM. 1996. Nomenclature for human CYP2D6 alleles. Pharmacogenetics 6:193–201.
den Dunnen JT, Antonarakis SE. 2001. Nomenclature for the description of human sequence variations. Hum Genet 109:121–124.
Destro-Bisol G, Maviglia R, Caglia A, Boschi I, Spedini G, Pascali V, Clark A, Tishkoff S. 1999. Estimating European admixture in African Americans by using microsatellites and a microsatellite haplotype (CD4/Alu). Hum Genet 104:149–157.
DiLella AG, Marvit J, Lidsky AS, Guttler F, Woo SL. 1986. Tight linkage between a splicing mutation and a specific DNA haplotype in phenylketonuria. Nature 322:799–803.
Eisensmith RC, Woo SLC. 1992. Updated listing of haplotypes at the human phenylalanine hydroxylase (PAH) locus. Am J Hum Genet 51:1445–1448.
Excoffier L, Smouse PE, Quattro JM. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131:479–491.
Excoffier L, Slatkin M. 1995. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12:921–927.
Freimuth RR, Raftogianis RB, Wood TC, Moon E, Kim UJ, Xu J, Siciliano MJ, Weinshilboum RM. 2000. Human sulfotransferases SULT1C1 and SULT1C2: cDNA characterization, gene cloning, and chromosomal localization. Genomics 65:157–165.
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D. 2002. The structure of haplotype blocks in the human genome. Science 296:2225–2229.
Hammer M, Forster P, Hurles ME, Jobling MA, de Knijff P, Tyler-Smith C, Underhill PA, Ellis N, Y Chromosome Consortium Nomenclature Committee. 2002. A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res 12:339–348.
Harris H. 1980. Principles of human biochemical genetics, 3rd edition. New York: Elsevier/North Holland Biomedical. p 331.
Hawley ME, Kidd KK. 1995. HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J Hered 86:409–411.
Hein DW, Grant DM, Sim E. 2002. Arylamine N-acetyltransferase (EC 2.3.1.5). www.louisville.edu/medschool/pharmacology/NAT.html/.
Hodge SE, Boehnke M, Spence MA. 1999. Loss of information due to ambiguous haplotyping of SNPs. Nat Genet 21:360–361.
Jorge-Nebert LF, Eichelbaum M, Griese EU, Inaba T, Arias TD. 2002. Analysis of six SNPs of the NAT2 gene in Ngawbe and Embera Amerindians of Panama and determination of the Embera acetylation phenotype using caffeine. Pharmacogenetics 12:39–48.
Kerem B, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, Chakravarti A, Buchwald M, Tsui LC. 1989. Identification of the cystic fibrosis gene: genetic analysis. Science 245:1073–1080.
Kidd JR, Pakstis AJ, Zhao H, Lu RB, Okonofua FE, Odunsi A, Grigorenko E, Tamir BB, Friedlaender J, Schulz LO, Parnas J, Kidd KK. 2000. Haplotypes and linkage disequilibrium at the phenylalanine hydroxylase locus, PAH, in a global representation of populations. Am J Hum Genet 66:1882–1899.
Kruglyak L, Nickerson DA. 2001. Variation is the spice of life. Nat Genet 27:234–236.
Levings PP, Bungert J. 2002. The human β-globin locus control region. Eur J Biochem 269:1589–1599.
Long JC, Williams RC, Urbanek M. 1995. An E-M algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet 56:799–810.
Lonjou C, Collins A, Morton NE. 1999. Allelic association between marker loci. Proc Natl Acad Sci USA 96:1621–1626.
Mackenzie PI, Owens IS, Burchell B, Bock KW, Bairoch A, Bélanger A, Fournel-Gigleux S, Green M, Hum DW, Iyanagi T, Lancet D, Louisot P, Magdalou J, Roy Chowdhury J, Ritter JK, Schachter H, Tephly TR, Tipton KF, Nebert DW. 1997. The UDP glycosyltransferase gene superfamily: recommended nomenclature update based on evolutionary divergence. Pharmacogenetics 7:255–269.
Malhi RS, Eshleman JA, Greenberg JA, Weiss DA, Schultz Shook BA, Kaestle FA, Lorenz JG, Kemp BM, Johnson JR, Smith DG. 2002. The structure of diversity within New World mitochondrial DNA haplogroups: implications for the prehistory of North America. Am J Hum Genet 70:905–919.
Marshall JH. 2002. On the changing meanings of “mutation.” Hum Mutat 19:76–78.
McKinnon RA, Mackenzie PI. 2002. UDP glycosyltransferase gene superfamily. www.unisa.edu.au/pharm_medsci/gluc_trans/.
Mier RJ, Holderbaum D, Ferguson R, Moskowitz R. 2001. Osteoarthritis in children associated with a mutation in the type II procollagen gene (COL2A1). Mol Genet Metab 74:338–341.
Moss EG. 2001. RNA interference: it’s a small RNA world. Curr Biol 11:R772–R775.
Nebert DW, Adesnik M, Coon MJ, Estabrook RW, Gonzalez FJ, Guengerich FP, Gunsalus IC, Johnson EF, Kemper B, Levin W, Phillips IR, Sato R, Waterman MR. 1987. The P450 gene superfamily. Recommended nomenclature. DNA 6:1–11.
Nebert DW. 2000. Suggestions for the nomenclature of human alleles: relevance to ecogenetics, pharmacogenetics and molecular epidemiology. Pharmacogenetics 10:279–290.
Nebert DW, Menon AG. 2001. Pharmacogenomics, ethnicity, and susceptibility genes. The Pharmacogenomics J 1:19–22.
Nelson DR, Koymans L, Kamataki T, Stegeman JJ, Feyereisen R, Waxman DJ, Waterman MR, Gotoh O, Coon MJ, Estabrook RW, Gunsalus IC, Nebert DW. 1996. Cytochrome P450 superfamily: update on new sequences, gene mapping, accession numbers, and nomenclature. Pharmacogenetics 6:1–42.
Nelson DR. 2002. Cytochrome P450 gene superfamily. http://drnelson.utmem.edu/cytochromeP450.html/.
Nishikura K. 2001. A short primer on RNAi: RNA-directed RNA polymerase acts as a key catalyst. Cell 107:415–418.
Niu T, Qin ZS, Xu X, Liu JS. 2002. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 70:157–169.
Nordborg M, Tavaré S. 2002. Linkage disequilibrium: what history has to tell us. Trends Genet 18:83–89.
Nuclear Receptors Nomenclature Committee. 1999. A unified nomenclature system for the nuclear receptor superfamily. Cell 97:161–163.
Oscarson M, Ingelman-Sundberg M, Daly AK, Nebert DW. 2002. Human cytochrome P450 (CYP) alleles. www.imm.ki.se/CYPalleles/.
Owens K, King MC. 1999. Genomic views of human history. Science 286:451–453.
Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, Cox DR. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719–1723.
Pfeifer D, Kist R, Dewar K, Devon K, Lander ES, Birren B, Korniszewski L, Back E, Scherer G. 1999. Campomelic dysplasia translocation breakpoints are scattered over 1 Mb proximal to SOX9: evidence for an extended control region. Am J Hum Genet 65:111–124.
Povey MS, Bruford E, Wain H, White JA. 2002. Human gene nomenclature committee. www.gene.ucl.ac.uk/nomenclature/.
Richards MB, Macaulay VA, Bandelt HJ, Sykes BC. 1998. Phylogeography of mitochondrial DNA in western Europe. Ann Hum Genet 62:241–260.
Rieder MJ, Taylor SL, Clark AG, Nickerson DA. 1999. Sequence variation in the human angiotensin converting enzyme. Nat Genet 22:59–62.
Risch N, Burchard E, Ziv E, Tang H. 2002. Categorization of humans in biomedical research: genes, race and disease. Genome Biol 3:1–12.
Scriver CR, Waters PJ, Sarkissian C, Ryan S, Prevost L, Cote D, Novak J, Teebi S, Nowacki PM. 2000. PAHdb: a locus-specific knowledgebase. Hum Mutat 15:99–104.
Scriver CR, Prevost L. 2002. Phenylalanine hydroxylase locus knowledgebase (PAHdb). http://data.mch.mcgill.ca/pahdb_new/about_team.html.
Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, Stanley SE, Jiang R, Messer CJ, Chew A, Han JH, Duan J, Carr JL, Lee MS, Koshy B, Kumar AM, Zhang G, Newell WR, Windemuth A, Xu C, Kalbfleisch TS, Shaner SL, Arnold K, Schulz V, Drysdale CM, Nandabalan K, Judson RS, Ruano G, Vovis GF. 2001a. Haplotype variation and linkage disequilibrium in 313 human genes. Science 293:489–493.
Stephens M, Smith NJ, Donnelly P. 2001b. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68:978–989.
Templeton A. 2002. Out of Africa again and again. Nature 416:45–51.
Tishkoff SA, Pakstis AJ, Ruano G, Kidd KK. 2000. The accuracy of statistical methods for estimation of haplotype frequencies: an example from the CD4 locus. Am J Hum Genet 67:518–522.
Trikka D, Fang Z, Renwick A, Jones SH, Chakraborty R, Kimmel M, Nelson DL. 2002. Complex SNP-based haplotypes in three human helicases: implications for cancer association studies. Genome Res 12:627–639.
Underhill PA, Shen P, Lin AA, Jin L, Passarino G, Yang WH, Kauffman E, Bonne-Tamir B, Bertranpetit J, Francalacci P, Ibrahim M, Jenkins T, Kidd JR, Mehdi SQ, Seielstad MT, Wells RS, Piazza A, Davis RW, Feldman MW, Cavalli-Sforza LL, Oefner PJ. 2000. Y chromosome sequence variation and the history of human populations. Nat Genet 26:358–361.
Vasiliou V, Bairoch A, Tipton KF, Nebert DW. 1999. Eukaryotic aldehyde dehydrogenase (ALDH) genes: human polymorphisms, and recommended nomenclature based on divergent evolution and chromosomal mapping. Pharmacogenetics 9:421–434.
Vasiliou V. 2002. Aldehyde dehydrogenase gene superfamily. www.uchsc.edu/sp/sp/alcdbase/aldhcov.html/.
Zhang K, Deng M, Chen T, Waterman MS, Sun F. 2002. A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA 99:7335–7339.
Zhang S, Pakstis AJ, Kidd KK, Zhao H. 2001. Comparisons of two methods for haplotype reconstruction and haplotype frequency estimation from population data. Am J Hum Genet 69:906–914.

KEY WORDS: nomenclature; variation; allele; evolution; bioinformatics; haplotype


© 2002 WILEY-LISS, INC.
