GLM: General Language Model Pretraining With Autoregressive Blank Infilling
Figure 2: GLM pretraining. (a) The original text is [x1, x2, x3, x4, x5, x6]. Two spans [x3] and [x5, x6] are sampled. (b) Replace the sampled spans with [M] in Part A, and shuffle the spans in Part B. (c) GLM autoregressively generates Part B. Each span is prepended with [S] as input and appended with [E] as output. 2D positional encoding represents inter- and intra-span positions. (d) Self-attention mask. Grey areas are masked out. Part A tokens can attend to themselves (blue frame) but not B. Part B tokens can attend to A and their antecedents in B (yellow and green frames correspond to the two spans). [M] := [MASK], [S] := [START], and [E] := [END].
In this way, our model automatically learns a bidirectional encoder (for Part A) and a unidirectional decoder (for Part B) in a unified model. The implementation of GLM is illustrated in Figure 2.

We randomly sample spans of length drawn from a Poisson distribution with λ = 3. We repeatedly sample new spans until at least 15% of the original tokens are masked. Empirically, we have found that the 15% ratio is critical for good performance on downstream NLU tasks.

2.1.2 Multi-Task Pretraining

In the previous section, GLM masks short spans and is suited for NLU tasks. However, we are interested in pretraining a single model that can handle both NLU and text generation. We then study a multi-task pretraining setup, in which a second objective of generating longer text is jointly optimized with the blank infilling objective. We consider the following two objectives:

• Document-level. We sample a single span whose length is sampled from a uniform distribution over 50%–100% of the original length. The objective aims for long text generation.

• Sentence-level. We restrict that the masked spans must be full sentences. Multiple spans (sentences) are sampled to cover 15% of the original tokens. This objective aims for seq2seq tasks whose predictions are often complete sentences or paragraphs.

Both new objectives are defined in the same way as the original objective, i.e. Eq. 1. The only difference is the number of spans and the span lengths.

2.2 Model Architecture

GLM uses a single Transformer with several modifications to the architecture: (1) we rearrange the order of layer normalization and the residual connection, which has been shown critical for large-scale language models to avoid numerical errors (Shoeybi et al., 2019); (2) we use a single linear layer for the output token prediction; (3) we replace ReLU activation functions with GeLUs (Hendrycks and Gimpel, 2016).
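The following PyTorch snippet sketches modifications (1) and (3): a Transformer layer in which LayerNorm precedes each sub-layer and the feed-forward network uses GeLU. The layer sizes are illustrative assumptions, and this is a generic sketch rather than the Megatron-LM code that GLM builds on.

```python
import torch
from torch import nn


class PreLNTransformerLayer(nn.Module):
    """Generic pre-LayerNorm Transformer layer with GeLU activations."""

    def __init__(self, d_model=1024, n_heads=16, d_ff=4096):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                      # GeLU instead of ReLU
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        # LayerNorm is applied before each sub-layer; the residual branch
        # adds the sub-layer output back to the un-normalized input.
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + h
        x = x + self.ff(self.ln2(x))
        return x
```

Normalizing before each sub-layer keeps the residual path unnormalized, which is commonly credited with more stable training at large scale.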
2.2.1 2D Positional Encoding

One of the challenges of the autoregressive blank infilling task is how to encode the positional information. Transformers rely on positional encodings to inject the absolute and relative positions of the tokens. We propose 2D positional encodings to address the challenge. Specifically, each token is encoded with two positional ids. The first positional id represents the position in the corrupted text x_corrupt. For the masked spans, it is the position of the corresponding [MASK] token. The second positional id represents the intra-span position. For tokens in Part A, their second positional ids are 0. For tokens in Part B, they range from 1 to the length of the span. The two positional ids are projected into two vectors via learnable embedding tables, which are both added to the input token embeddings.

Our encoding method ensures that the model is not aware of the length of the masked span when reconstructing them.
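To make the construction in Figure 2 concrete, the following is a minimal Python sketch of how one training example could be assembled: Part A with [MASK] placeholders, the shuffled Part B spans with [S]/[E], the two positional id sequences, and the self-attention mask. It is an illustration only, not the released implementation; build_glm_example is a hypothetical helper, token strings stand in for vocabulary ids, and span sampling (Poisson lengths, 15% ratio) is assumed to have happened already.

```python
import random

MASK, START, END = "[MASK]", "[S]", "[E]"


def build_glm_example(tokens, spans, seed=0):
    """Assemble one GLM blank-infilling example (cf. Figure 2).

    tokens: list of tokens, e.g. ["x1", ..., "x6"].
    spans:  list of (start, end) index pairs (end exclusive) of sampled spans.
    Returns (inputs, targets, position_ids_1, position_ids_2, attention_mask).
    """
    rng = random.Random(seed)
    spans = sorted(spans)

    # Part A: replace every sampled span with a single [MASK] token.
    part_a, prev = [], 0
    for s, e in spans:
        part_a += tokens[prev:s] + [MASK]
        prev = e
    part_a += tokens[prev:]

    pos1 = list(range(1, len(part_a) + 1))   # position in the corrupted text
    pos2 = [0] * len(part_a)                 # intra-span position is 0 in Part A
    mask_positions = [p for p, t in zip(pos1, part_a) if t == MASK]

    # Part B: shuffle the spans; each span is prepended with [S] on the input
    # side and appended with [E] on the target side.
    inputs = list(part_a)
    targets = [None] * len(part_a)           # Part A tokens are not predicted
    order = list(range(len(spans)))
    rng.shuffle(order)
    for k in order:
        s, e = spans[k]
        span_tokens = tokens[s:e]
        inputs += [START] + span_tokens
        targets += span_tokens + [END]
        # All tokens of a span share the position of its [MASK] token ...
        pos1 += [mask_positions[k]] * (len(span_tokens) + 1)
        # ... while the intra-span ids count up from 1.
        pos2 += list(range(1, len(span_tokens) + 2))

    # Self-attention mask: Part A attends bidirectionally to Part A only;
    # Part B tokens attend to all of Part A and to earlier Part B tokens.
    n_a, n = len(part_a), len(inputs)
    attention_mask = [[1 if k < n_a or (q >= n_a and k <= q) else 0
                       for k in range(n)] for q in range(n)]
    return inputs, targets, pos1, pos2, attention_mask


if __name__ == "__main__":
    # The example from Figure 2: spans [x3] and [x5, x6] sampled from [x1..x6].
    toks = ["x1", "x2", "x3", "x4", "x5", "x6"]
    inp, tgt, p1, p2, mask = build_glm_example(toks, [(2, 3), (4, 6)])
    print(inp)   # e.g. ['x1', 'x2', '[MASK]', 'x4', '[MASK]', '[S]', 'x5', 'x6', '[S]', 'x3']
    print(p1)    # e.g. [1, 2, 3, 4, 5, 5, 5, 5, 3, 3]
    print(p2)    # e.g. [0, 0, 0, 0, 0, 1, 2, 3, 1, 2]
```

On the Figure 2 example this reproduces the two position rows 1 2 3 4 5 5 5 5 3 3 and 0 0 0 0 0 1 2 3 1 2 shown in the figure, up to the random order of the two spans.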
For text generation tasks, the given context constitutes the Part A of the input, with a mask token appended at the end. The model generates the text of Part B autoregressively.
Table 2: Results of abstractive summarization on the CNN/DailyMail and XSum test sets.

Model | CNN/DailyMail RG-1/RG-2/RG-L | XSum RG-1/RG-2/RG-L
BERTSumAbs (Liu and Lapata, 2019) | 41.7/19.4/38.8 | 38.8/16.3/31.2
UniLMv2Base (Bao et al., 2020) | 43.2/20.4/40.1 | 44.0/21.1/36.1
T5Large (Raffel et al., 2020) | 42.5/20.7/39.8 | 40.9/17.3/33.0
BARTLarge (Lewis et al., 2019) | 44.2/21.3/40.9 | 45.1/22.3/37.3
GLMRoBERTa | 43.8/21.0/40.5 | 45.5/23.5/37.3
are also shown in Table 1. We observe that with multi-task pretraining, GLMDoc and GLMSent perform slightly worse than GLMLarge, but still outperform BERTLarge and UniLMLarge. Among multi-task models, GLMSent outperforms GLMDoc by 1.1% on average. Increasing GLMDoc's parameters to 410M (1.25×BERTLarge) leads to better performance than GLMLarge. GLM with 515M parameters (1.5×BERTLarge) can perform even better.

Sequence-to-Sequence. Considering the available baseline results, we use the Gigaword dataset (Rush et al., 2015) for abstractive summarization and the SQuAD 1.1 dataset (Rajpurkar et al., 2016) for question generation (Du et al., 2017) as the benchmarks for models pretrained on BookCorpus and Wikipedia. Additionally, we use the CNN/DailyMail (See et al., 2017) and XSum (Narayan et al., 2018) datasets for abstractive summarization as the benchmarks for models pretrained on larger corpora.

The results for models trained on BookCorpus and Wikipedia are shown in Tables 3 and 4. We observe that GLMLarge can achieve performance matching the other pretraining models on the two generation tasks. GLMSent can perform better than GLMLarge, while GLMDoc performs slightly worse than GLMLarge. This indicates that the document-level objective, which teaches the model to extend the given contexts, is less helpful to conditional generation, which aims to extract useful information from the context. Increasing GLMDoc's parameters to 410M leads to the best performance on both tasks. The results for models trained on larger corpora are shown in Table 2. GLMRoBERTa can achieve performance matching the seq2seq BART model, and outperform T5 and UniLMv2.
Table 3: Results on Gigaword summarization.

Model | RG-1 | RG-2 | RG-L
MASS | 37.7 | 18.5 | 34.9
UniLMLarge | 38.5 | 19.5 | 35.8
GLMLarge | 38.6 | 19.7 | 36.0
GLMDoc | 38.5 | 19.4 | 35.8
GLMSent | 38.9 | 20.0 | 36.3
GLM410M | 38.9 | 20.0 | 36.2

Table 4: Results on SQuAD question generation.

Model | BLEU-4 | MTR | RG-L
SemQG | 18.4 | 22.7 | 46.7
UniLMLarge | 22.1 | 25.1 | 51.1
GLMLarge | 22.4 | 25.2 | 50.4
GLMDoc | 22.3 | 25.0 | 50.2
GLMSent | 22.6 | 25.4 | 50.4
GLM410M | 22.9 | 25.6 | 50.5

Figure 4: Zero-shot language modeling results. [Bar charts omitted: perplexity on the Books&Wiki test set and accuracy on LAMBADA, each under unidirectional and bidirectional attention, for GLMDoc, GLMDoc – 2D, GLM410M, GLM515M, and GPTLarge.]
Table 5: BLEU scores on Yahoo text infilling. † indicates the results from (Shen et al., 2020).

Mask ratio | 10% | 20% | 30% | 40% | 50%
BERT† | 82.8 | 66.3 | 50.3 | 37.4 | 26.2
BLM† | 86.5 | 73.2 | 59.6 | 46.8 | 34.8
GLMLarge | 87.8 | 76.7 | 64.2 | 48.9 | 38.7
GLMDoc | 87.5 | 76.0 | 63.2 | 47.9 | 37.6

Text Infilling. Text infilling is the task of predicting missing spans of text which are consistent with the surrounding context (Zhu et al., 2019; Donahue et al., 2020; Shen et al., 2020). GLM is trained with an autoregressive blank infilling objective, thus can straightforwardly solve this task. We evaluate GLM on the Yahoo Answers dataset (Yang et al., 2017) and compare it with Blank Language Model (BLM) (Shen et al., 2020), which is a specifically designed model for text infilling. From the results in Table 5, GLM outperforms previous methods by large margins (1.3 to 3.9 BLEU) and achieves the state-of-the-art result on this dataset. We notice that GLMDoc slightly underperforms GLMLarge, which is consistent with our observations in the seq2seq experiments.

Language Modeling. Most language modeling datasets such as WikiText103 are constructed from Wikipedia documents, which our pretraining dataset already contains. Therefore, we evaluate the language modeling perplexity on a held-out test set of our pretraining dataset, which contains about 20M tokens, denoted as BookWiki. We also evaluate GLM on the LAMBADA dataset (Paperno et al., 2016), which tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of a passage. As the baseline, we train a GPTLarge model (Radford et al., 2018b; Brown et al., 2020) with the same data and tokenization as GLMLarge.

The results are shown in Figure 4. All the models are evaluated in the zero-shot setting. Since GLM learns the bidirectional attention, we also evaluate GLM under the setting in which the contexts are encoded with bidirectional attention. Without generative objective during pretraining, GLMLarge cannot complete the language modeling tasks, with perplexity larger than 100. With the same amount of parameters, GLMDoc performs worse than GPTLarge. This is expected since GLMDoc also optimizes the blank infilling objective. Increasing the model's parameters to 410M (1.25× of GPTLarge) leads to a performance close to GPTLarge. GLM515M (1.5× of GPTLarge) can further outperform GPTLarge. With the same amount of parameters, encoding the context with bidirectional attention can improve the performance of language modeling. Under this setting, GLM410M outperforms GPTLarge. This is the advantage of GLM over unidirectional GPT. We also study the contribution of 2D positional encoding to long text generation. We find that removing the 2D positional encoding leads to lower accuracy and higher perplexity in language modeling.
Table 6: Ablation study on the SuperGLUE dev set. (T5 ≈ GLM – shuffle spans + sentinel tokens.)
includes: (1) Due to the resource limit, we only pretrain GLMRoBERTa for 250,000 steps, which is half of RoBERTa and BART's training steps and close to T5 in the number of trained tokens. (2) We use cosine decay instead of linear decay for learning rate scheduling. (3) We additionally apply gradient clipping with value 1.0.

The hyperparameters for all the pretraining settings are summarized in Table 7.

A.3 Implementation

Our pretraining implementation is based on Megatron-LM (Shoeybi et al., 2019) and DeepSpeed (Rasley et al., 2020). We include our code in the supplementary material. Due to the size limit of supplementary material, we cannot include the pretrained models, but will make them publicly available in the future.

B Downstream Tasks

B.1 SuperGLUE

The SuperGLUE benchmark consists of 8 NLU tasks. We formulate them as blank infilling tasks, following Schick and Schütze (2020b). Table 8 shows the cloze questions and verbalizers we used in our experiments. For 3 tasks (ReCoRD, COPA, and WSC), the answer may consist of multiple tokens, and for the other 5 tasks, the answer is always a single token.

When finetuning GLM on the SuperGLUE tasks, we construct the input using the cloze questions in Table 8 and replace the blank with a [MASK] token. Then we compute the score of generating each answer candidate. For the 5 single-token tasks, the score is defined to be the logit of the verbalizer token. For the 3 multi-token tasks, we use the sum of the log-probabilities of the verbalizer tokens. Thanks to the autoregressive blank infilling mechanism we proposed, we can obtain all the log-probabilities in one pass. Then we compute the cross-entropy loss using the ground-truth label and update the model parameters.

For the baseline classifiers, we follow the standard practice of concatenating the input parts of each task (such as the premise and hypothesis for textual entailment, or the passage, question, and answer for ReCoRD and MultiRC) and adding a classification layer on top of the [CLS] token representation. We also implemented cloze-style finetuning for the other pretrained models, but the performance was usually similar to that of the standard classifier, as shown in the ablation study. Models with blank infilling objectives, such as T5 and our GLM, benefit more from converting the NLU tasks into cloze questions. Thus, for T5 and GLM, we report the performance after such conversion in our main results.
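To make the scoring concrete, a small Python sketch follows. It assumes a hypothetical model call span_log_probs(cloze_input, tokens) that returns the log-probability of each verbalizer token filled into the blank (obtainable in a single pass, as noted above); the single-token case in the paper uses the raw logit, which the sketch approximates with a log-probability.

```python
import math


def score_candidates(span_log_probs, cloze_input, verbalizers):
    """Score each answer candidate for one cloze-style SuperGLUE example.

    `span_log_probs(cloze_input, tokens)` is a hypothetical model call that
    returns a list with the log-probability of each verbalizer token filled
    into the [MASK] blank. Multi-token candidates are scored by the sum of
    their token log-probabilities.
    """
    return [sum(span_log_probs(cloze_input, v)) for v in verbalizers]


def candidate_cross_entropy(scores, gold):
    """Softmax cross-entropy over candidate scores against the gold label."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold]


# Hypothetical usage with a model function `model_fn`:
# scores = score_candidates(model_fn, "premise? [MASK], hypothesis", [["Yes"], ["No"]])
# loss = candidate_cross_entropy(scores, gold=0)
```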
Table 8: Cloze questions and verbalizers for the 8 SuperGLUE tasks used in our experiments. ∗ denotes the answer
contains multiple tokens.
where x_{<t} = [x_0, ..., x_{t-1}]. Since transformers can only operate on a window of fixed input size w, we cannot fully calculate p(x_t | x_{<t}) and can only calculate p(x_t | x_{t-w:t-1}). Even calculating this value for each token is prohibitively expensive, since we need to conduct T evaluations of w-size contexts. To improve evaluation efficiency, we adopt overlapping evaluation, where we advance the sliding windows by some overlap o each time and only compute the cross entropy loss for the last o tokens of the window. In our experiments we set o = 256 for all the models.
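A small Python sketch of this overlapping evaluation scheme is given below; token_nll_fn(context, targets) is a hypothetical stand-in for the model forward pass that returns per-token negative log-likelihoods, and the window size shown is only illustrative.

```python
import math


def overlapping_perplexity(token_nll_fn, tokens, window=512, overlap=256):
    """Sliding-window perplexity with overlap o (the paper sets o = 256).

    The window advances by `overlap` tokens at a time, and only the newly
    covered tail of each window is scored, so every scored token is
    conditioned on up to `window - len(tail)` preceding tokens.
    """
    total_nll, n_scored = 0.0, 0
    prev_end = 0
    while prev_end < len(tokens):
        # The first window scores all of its tokens (no earlier context exists);
        # later windows advance by `overlap` and score only the new tail.
        end = min(prev_end + (window if prev_end == 0 else overlap), len(tokens))
        tail = tokens[prev_end:end]                  # tokens scored in this step
        ctx = tokens[max(0, end - window):prev_end]  # their reusable left context
        total_nll += sum(token_nll_fn(ctx, tail))
        n_scored += len(tail)
        prev_end = end
    return math.exp(total_nll / n_scored)
```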
LAMBADA is a cloze-style dataset to test the ability of long-range dependency modeling. Each example is a passage consisting of 4-5 sentences with the last word missing, and the model is required to predict the last word of the passage. Since we use WordPiece tokenization, a word can be split into several subword units. We use teacher forcing and consider the prediction correct only when all the predicted tokens are correct.

C Results on Other NLU Benchmarks

GLUE (Wang et al., 2018) is another widely-used NLU benchmark, including single sentence tasks (e.g. sentiment analysis (Socher et al., 2013)) and sentence pair tasks (e.g. text similarity (Cer et al., 2017) and natural language inference (Williams et al., 2018; Dagan et al., 2005)). The benchmark is usually considered as less challenging than SuperGLUE. SQuAD (Rajpurkar et al., 2016, 2018) is an extractive question answering benchmark. We further compare GLM with BERT on the two benchmarks.

The results on GLUE and SQuAD are shown in Tables 9 and 10. On the two benchmarks, GLM can still outperform BERT with the same amount of parameters, but with a smaller margin.

D Text Generation Samples

We show texts generated by GLMDoc given unseen contexts randomly sampled from the test set. We use top-k random sampling with k = 40 for generation and set the maximum sequence length to 512. Some of the texts are cut short.
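For reference, top-k random sampling over a vector of logits can be sketched as follows; this is a generic illustration rather than the exact decoding code used to produce the samples.

```python
import math
import random


def top_k_sample(logits, k=40, rng=random):
    """Sample a token id from the k highest-scoring logits (k = 40 above).

    `logits` is a list of unnormalized scores over the vocabulary; the
    probability mass is renormalized over the top-k entries before sampling.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```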
Table 9: Results on the GLUE dev set.
Model MNLI QNLI QQP RTE SST-2 MRPC CoLA STS-B Avg
BERTLarge 86.6 92.3 91.3 73.6 93.2 88.0 60.6 90.0 84.4
GLMLarge 86.7 92.8 91.5 74.0 93.5 90.0 61.4 90.7 85.1
Table 10: Results on the SQuAD v1.1/v2.0 dev sets.

Model | SQuAD v1.1 EM/F1 | SQuAD v2.0 EM/F1
BERTBase | 80.8 / 88.5 | 73.7 / 76.3
GLMBase | 81.5 / 88.6 | 74.7 / 77.8
BERTLarge | 84.1 / 90.9 | 79.0 / 81.8
GLMLarge | 85.4 / 91.6 | 80.3 / 83.3
elected from single-member districts representing the majority of the state. The current state senate members are: In recent years, there have been four changes to the senate. The most recent is the creation of a six-seat district that includes all or part of the following: In the 2009 elections, the state senate members were elected to six-year terms. The current state house members are: The Wyoming Constitution assigns certain powers to the governor. Most notably, the governor is president of the senate and governor. However, if the governor desires to appoint a member to the Wyoming state senate, a law authorizes the governor to do so. The governor of Wyoming holds no legislative power but has the power to veto lawmakers, which is not limited to the veto of laws. Under the wyoming state constitution, the governor can veto the actions of the other members of the wyoming house of representatives. The governor can also appoint members of the wyoming senate. In addition, the governor can appoint members of the Wyoming house of representatives. Wyoming's constitution provides that the governor can appoint a member of the wyoming state senate to the wyoming supreme court, and the chairman of the wyoming senate.

Example D.2. Jonathan Terry is a television and film actor.

GLM: He was a voice actor for the "X-Men" cartoon series. He was also a voice actor for "the Simpsons" as well as "the marvelous misadventures of superman. He also has voiced characters in "the legend of korra" He has appeared on several television series, including "the simpsons", "the x-files", "heroes" and "the simpsons movie" as the character captain billy higgledy-pig, and the web series "krusty mysteries" as the character Colonel Trungus. He wrote for "the Guardian" newspaper. Jonathan Terry was born in London. Terry grew up in Surrey, England and attended the University of Sussex in the United Kingdom, graduating with a degree in english literature. He was a guest lecturer at King's College London, and then took two years of acting courses at the brit school of acting to prepare for his future career in the entertainment industry. Terry first appeared in the TV series "the Simpsons" as the character captain Billy Higgledy-pig, but his character was only a one-time recurring character in the series' first six seasons. He later appeared as a regular for the show's final six seasons, and has been a frequent guest in the show since. He appeared in the first few episodes of "" as the character major Jack Ryan. He has also appeared as part of the supporting cast of several episodes of "the secret life of pets". He has also worked on "the simpsons" TV show since "the simpsons movie", most notably playing the roles of Captain Skeletor and the ghost of the same name. He plays characters in several films, including "", "", "" and "". He has appeared in music videos for the killers in 1993, the pretenders in 1995, and in the TV shows "the royal" and "the bill".

Example D.3. Corona was a station along the port Washington branch of the long island rail road in the Corona section of queens, New York City. It was one of two stations built by the flushing railroad in Corona, this one having been at Grand Avenue (later called National Avenue, now National Street) and 45th Avenue.

GLM: It had a brick entrance building next to the tracks, and one platform with 2 side platforms. The station burned down and was rebuilt in 1908. The station closed on september 15, 1927, with the train service transferred from Grand Avenue to 45th Avenue. There were plans to build a subway extension to Corona, but it was never built. In the 1980s, two stations were constructed on the line, Corona Road and Corona Park. The new Corona station opened in 1988, and the original Corona station was demolished.

Example D.4. Robert Lee Smith (born July 5, 1938) is a former American football cornerback in the national football league. He played for the Los Angeles Rams (1962–1965) and the Detroit Lions (1965–1966). He was also a kick and punt returner.

GLM: In his four-year NFL career, he played in 33 games and started 14, registering 62 career interceptions. He completed his NFL career with five interceptions. Smith was born in La Canada Flintridge, Michigan, in 1938. He attended Bishop O' Dowd high school in Flintridge. Smith' s father was a pharmacist. He graduated from Michigan State University in 1958 with a degree in business administration. Smith played for the Michigan Wolverines football team from 1959 to 1963. In his two years as a reserve cornerback, he led the conference in interceptions with five. In 1962, he set the Wolverines' all-time interception record with 13, and was second overall in the 1962 season's Heisman Trophy voting. Smith also won the Dick Butkus award as the nation's outstanding linebacker. In 1961, the "Los Angeles Times" wrote that Smith "is an outstanding pass rusher, with an average of almost 100 yards per punt return." Smith was inducted into the university of Michigan athletic hall of honor in 1989 and the national football foundation hall of fame in 1991. He was elected to the Michigan sports hall of fame in 1995. Smith earned the honor because of his accomplishments prior to his NFL career. He was one of four Michigan players honored as first-overall selections in the 1964 NFL draft. The others were Joe Namath, Bill Nelsen, and Jerry Kramer. In 1966, the NFL gave players $300,000 a season to play football. After his rookie season, he was not selected to play in the 1966 pro bowl. On January 13, 1966, the Rams traded smith to the Detroit Lions for Paul Hornung, and later that year he was traded to the Lions for Ray "the Lion" Jones in exchange for Linebacker Jim "the Hawk" Johnson. On September 10, 1968, he was traded back to Los Angeles for a second round pick in the 1970 draft. He was also traded to the St. Louis Cardinals for a second round pick in the 1970 draft. On June 2, 1970 he was cut by the Cardinals. On November 15, 1970, the Los Angeles Rams acquired Smith from the Lions in exchange for Linebacker Tony Harris. The Rams waived Smith during the September 1, 1972 offseason. Smith's number at Michigan State was # 7 in 1969.