Main

SARS-CoV-2 emerged in 2019 and has spread rapidly around the world, causing over 80 million recorded cases of COVID-19 and over 1.7 million deaths attributable to this disease by the end of 2020. The failure of public health measures to contain the spread of the virus in many countries has given rise to a large number of virus lineages. Open sharing of genomic surveillance data and collaborative online platforms have enabled the real-time tracking of the emergence and spread of these lineages9,10.

To date, there has been relatively limited evidence for SARS-CoV-2 mutations that have had a substantial functional effect on the virus. A mutation resulting in a substitution in the spike protein (D614G) emerged early in the epidemic, and spread rapidly through Europe and North America in particular. Several lines of evidence now suggest that SARS-CoV-2 variants that carry this mutation have increased transmissibility11,12,13,14. Later in the epidemic, several lineages with a N439K substitution in the receptor-binding domain (RBD) of the spike protein emerged independently, probably in a range of European countries and the USA. This mutation is associated with escape from neutralization mediated by monoclonal antibodies or polyclonal serum15.

South Africa is the most severely affected country in Africa, with over 80,000 excess natural deaths having occurred by the end of 2020 (approximately 1,400 per million individuals)16. The introduction and spread of several SARS-CoV-2 lineages to South Africa have previously been described, as has the identification of lineages unique to South Africa during the early phase of the epidemic17,18. Here we describe the emergence and spread of a SARS-CoV-2 lineage that contains several nonsynonymous spike mutations, including mutations that affect key sites in the RBD (resulting in K417N, E484K and N501Y substitutions) that may have functional importance. We demonstrate that this lineage is likely to have emerged after the first wave of the epidemic in the worst-affected metropolitan area within the Eastern Cape province. This was followed by rapid spread of this lineage, to the extent that by the end of 2020 it had become the dominant lineage in three provinces.

Epidemic dynamics in South Africa

The second wave of the SARS-CoV-2 epidemic in South Africa began around October 2020, weeks after a trough in daily recorded cases following the first peak19 (Fig. 1a). The country-wide estimated effective reproduction number (Re) increased to above 1 at the end of October (indicating a growing epidemic), which coincided with a steady rise in daily cases. At the peak of the national epidemic in the middle of July, there were over 13,000 confirmed cases per day and almost 7,000 excess deaths per week. The epidemiological profile in the three provinces that are the focus of this analysis (the Eastern Cape, Western Cape and KwaZulu–Natal) were broadly similar, although the Western Cape had an earlier and flatter peak in the first wave (Fig. 1b–d). At the end of the first wave of the epidemic in early September, there had been over 10,000 excess deaths in the Eastern Cape (1,510 per million individuals)—the highest for any province (Extended Data Fig. 1). Although there was a plateau in cases after the first wave, this was noticeably short in the Eastern Cape; by early October, there was a second phase of exponential growth that was associated with an increase in deaths at a rate similar to that of the first wave (Fig. 1b). The rate of positive PCR tests at a local-municipality level shows very high levels of infection (>20%) in Nelson Mandela Bay from the middle of October, followed by rapidly rising levels in the surrounding areas through October and November (Extended Data Fig. 2). The resurgence of the daily case counts at an exponential rate happened later for the Western Cape and KwaZulu–Natal than for the Eastern Cape (Fig. 1c, d). By early December, all three provinces were experiencing a second wave and new cases in the Western Cape had already surpassed the peak of the first wave.

Fig. 1: SARS-CoV-2 epidemiological dynamics in South Africa.
figure 1

a–e, Histograms show the number of daily confirmed cases of COVID-19 (mapped to the left y axis) from March 2020 to January 2021 in South Africa (a) and in the four provinces under study: Eastern Cape (b), Western Cape (c), KwaZulu–Natal (d) and Northern Cape (e). Fluctuations in the daily estimates of Re are shown in red (mapped to the right y axis); the mean estimated median Re with upper and lower bounds of the 95% confidence interval are shown, along with a cut-off for R = 1 (broken red line). Weekly excess deaths in South Africa and in each region are shown as black broken lines (mapped to the left y axis).

Phylogenetic and phylogeographic analysis

The early and rapid resurgence of the epidemic in parts of the Eastern Cape and Western Cape prompted the intensification of genomic surveillance by the Network for Genomic Surveillance in South Africa (NGS-SA), including sampling in and around Nelson Mandela Bay in the Eastern Cape and in the neighbouring Garden Route district of the Western Cape (Extended Data Fig. 3). We analysed 2,882 whole genomes of SARS-CoV-2 from South Africa, which were collected between 5 March and 10 December 2020. We estimated preliminary maximum-likelihood and molecular clock phylogenies for a dataset containing an additional 2,573 global reference genomes (Fig. 2a). We identified a previously unrecognized monophyletic cluster (501Y.V2) that contained 341 sequences, from samples collected between 8 October and 10 December in KwaZulu–Natal, Eastern Cape, Western Cape and Northern Cape (Fig. 2b). Seven South African sequences that are basal to the 501Y.V2 cluster (Fig. 2a) were sampled in the Eastern Cape, Western Cape, Gauteng and KwaZulu–Natal provinces between late June and early September. Although these sequences do not have any of the defining mutations of the 501Y.V2 variant, they are basal to the B.1.351 lineage and indicate that the precursor to the new variant was probably circulating throughout the country before the emergence of 501Y.V2.

Fig. 2: Evolution and spread of the 501Y.V2 cluster in South Africa.
figure 2

a, Time-resolved maximum clade credibility phylogeny of 5,329 SARS-CoV-2 sequences; 2,756 of these are from South Africa (red). The newly identified SARS-CoV-2 cluster (501Y.V2) is highlighted in yellow. b, Time-resolved maximum clade credibility phylogeny of the 501Y.V2 cluster (n = 341), with province indicated. Mutations that characterize the cluster are highlighted at the branch at which each first emerged. c, Frequency and distribution of SARS-CoV-2 lineages circulating in South Africa over time. d, Spatiotemporal reconstruction of the spread of the 501Y.V2 cluster in South Africa during the second wave of the epidemic. Circles represent nodes of the maximum clade credibility phylogeny, coloured according to their inferred time of occurrence (scale in bottom panel). Shaded areas represent the 80% highest posterior density interval and depict the uncertainty of the phylogeographic estimates for each node. Solid curved lines denote the links between nodes and the directionality of movement.

The 501Y.V2 cluster is phylogenetically distinct from the three main lineages (B.1.1.54, B.1.1.56 and C.1) that were circulating widely in South Africa (>42% of samples sequenced before October 2020) during the first wave of infections18 (Fig. 2a). These three lineages had been circulating in the KwaZulu–Natal, Western Cape, Gauteng, Free State, Limpopo and North-West provinces. By the middle of November, the 501Y.V2 lineage had superseded the B.1.1.54, B.1.1.56 and C.1 lineages, and it rapidly became the dominant lineage in samples from the Eastern Cape, KwaZulu–Natal and Western Cape (Fig. 2c, Extended Data Fig. 4).

Our spatiotemporal phylogeographic analysis suggests that the 501Y.V2 lineage emerged in early August (95% highest posterior density ranging from the middle of July to the end of August 2020) in Nelson Mandela Bay. Its initial spread to the Garden Route district of the Western Cape was followed by a more-diffuse spread from both of these areas to other regions of the Eastern Cape, and more recently to the City of Cape Town municipality and several locations in KwaZulu–Natal (Fig. 2d). From the City of Cape Town, the variant has travelled north along the west coast of the country to the Namakwa district in the Northern Cape province.

Mutational profile

At the point of first sampling on the 15 October, this lineage had—in addition to D614G—five nonsynonymous mutations resulting in substitutions in the spike protein: D80A, D215G, E484K, N501Y and A701V (Figs. 2b, 3a, Extended Data Fig. 5). A further three mutations that lead to substitutions in the spike protein had emerged by the end of November: L18F, R246I and K417N. We also observe a deletion of three amino acids at positions 242 to 244, which was seen in samples extracted and generated in different laboratories across the NGS-SA. This region is difficult to align; the deletion could potentially also be located at positions 241 to 243, but the resulting sequence would be exactly the same. Although the variants appeared in a varying proportion of the sampled genomes and showed changing levels of frequency with time, the mutations in RBD seem to become fixed in our sampling set, are present in almost all of the samples and are consistently high in frequency across time (Fig. 3a, b). Compared to the previous three largest lineages circulating in South Africa, 501Y.V2 shows marked hypermutation both in the whole genomes and the spike regions—including nonsynonymous mutations that lead to amino acid changes (Fig. 3c). The main lineages identified in South Africa during first wave (B.1.1.54, B.1.1.56 and C.1) contained only the single nonsynonymous spike mutation (D614G) and did not show the rapid accumulation of mutations, as is observed with 501Y.V2. We estimate that substitutions on the 501Y.V2 lineage are happening at 1.917 × 10−3 nucleotide changes per site per year, compared to 5.344 × 10−4, 4.251 × 10−4 and 9.781 × 10−4 nucleotide changes per site per year for B.1.1.54, B.1.1.56 and C.1, respectively (Extended Data Fig. 6).We performed structural modelling of the spike trimer with these mutations, which revealed that three of the substitutions (N501Y, E484K and K417N) are at key residues in the RBD; three (L18F, D80A and D215G) are in the N-terminal domain; and one (A701V) is in loop 2 (Fig. 3d). The deletion of three amino acid (242 to 244) also lies in the N-terminal domain. In particular, two of the RBD sites (at positions 417 and 484) are key regions for the binding of neutralizing antibodies (Extended Data Fig. 7).

Fig. 3: Mutational profile of the spike region of the 501Y.V2 lineage.
figure 3

a, Amino acid changes in the spike region of the 501Y.V2 genomes in this study (n = 341) mapped to the spike-protein sequence structure, indicating key regions (such as the RBD). Each spike protein variant is shown at its respective protein location; bar lengths represent the number of genomes that contain the specific mutations. Only mutations that appear in >10% (grey dotted line) of sequences are shown. The D614G substitution (in black) is already present in the parent lineage. b, Changes in the mutation frequency of each variant observed during the course of sampling. Grey bars show the number of 501Y.V2 sequences sampled at a given time point; coloured lines show the change in the number of sequences that contain each variant at the respective time points. c, Violin plots showing the number of nucleotide substitutions and amino acid changes that have accumulated in both the whole genomes and the spike region of the 501Y.V2 lineage (n = 341), compared to lineages B.1.1.54 (n = 472), B.1.1.56 (n = 179) and C.1 (n = 271). The dot and error bars inside each group denote the mean and range for two s.d., respectively. d, A complete model of the SARS-CoV-2 spike trimer is shown, with domains of a single protomer shown in cartoon view and coloured cyan (N-terminal domain), yellow (C-terminal domain and receptor binding domain), purple (subdomain 1 and 2), and dark green; N-acetylglucosamine moieties are coloured in light green. The adjacent protomers are shown in surface view and coloured shades of grey. Eight nonsynonymous mutants (red) and a deletion of three amino acids (pink) that together define the spike of the 501Y.V2 lineage are shown as spheres.

Selection analysis

We examined patterns of nucleotide variation and fluctuations in mutant frequencies at eight polymorphic sites in the spike gene (Fig. 3a) to determine whether any of the observed polymorphisms might contribute to changes in viral fitness worldwide. For this analysis, we used 142,037 high-quality sequences from the Global Initiative On Sharing All Influenza Data (GISAID) sampled between the 24 December 2019 and 14 November 2020, which represented 5,964 unique spike haplotypes. The analysis indicated that two of the three sites in the RBD (E484 and N501) display a pattern of nucleotide variation that is consistent with the site evolving under diversifying positive selection. The N501Y polymorphism that first appears in our sequences sampled on the 15 October shows indications of positive selection on five global-tree internal branches; codon 501 of the spike gene displays a significant excess of nonsynonymous substitutions globally (dN/dS > 1 on internal branches, P = 0.0011 by the fixed-effects likelihood method), and mutant viruses that encode Y at this site have rapidly increased in frequency in both the UK and South Africa (z score = 11, trend Jonckheere Terpstra non-parametric trend test). Similarly, at codon 484, there is an indication of positive selection on seven global-tree internal branches, with an overall significant excess of nonsynonymous substitutions globally (P = 0.015). Outside the RBD, codons 18 (P < 0.001), 80 (P = 0.0014) and 215 (P < 0.001) show evidence of positive diversifying selection globally, and the L18F mutation has also increased in frequency in the regions in which it has occurred (z score = 17). Up until the 14 November 2020, there was no statistical evidence of positive selection at codons 417, 246 and 701.

Discussion

We describe and characterize a newly identified SARS-CoV-2 lineage with several spike mutations that is likely to have emerged in a major metropolitan area in South Africa after the first wave of the epidemic, and then to have spread to multiple locations within two neighbouring provinces. We show that this lineage has rapidly expanded and become dominant in three provinces, at the same time as there has been a rapid resurgence in infections. Although the full import of the mutations is not yet clear, the genomic and epidemiological data suggest that this variant has a selective advantage—from increased transmissibility, immune escape or both. These data highlight the urgent need to refocus the public health response in South Africa on driving transmission down to low levels, not only to reduce hospitalizations and deaths but also to limit the spread of this lineage and the further evolution of the virus.

We detected this variant through intensified genomic surveillance that was enacted in response to a rapid resurgence of cases in the Eastern Cape province20. However, both before and after the detection of 501Y.V2, our genomic surveillance involved the regular sequencing of a random selection of residual samples from routine diagnostic services. We show that 501Y.V2 was detected in samples from 197 health facilities in multiple districts across four provinces. We are therefore confident that, although our sequencing coverage is relatively low, the sequences are representative of the circulating viruses in these provinces. Although the epidemic in the Eastern Cape was contracting from the middle of July to the middle of August (the estimated time to the most-recent common ancestor), this was not a period of low transmission: incidence was above 20 case per 100,000 people per week at this time and the positive testing rate remained above 10%, which suggests moderate-to-high levels of transmission. As there were many lineages circulating at this time, the rapid expansion of 501Y.V2 and the almost complete displacement of other lineages in multiple regions strongly suggest a selective advantage for this variant.

Preliminary modelling suggests that the 501Y.V2 could be approximately 50% more transmissible than the previously circulating variants, although this estimate assumes that natural immunity confers complete protection against reinfection6. Increased transmissibility is plausible, given what we know about the spike mutations in 501Y.V2 and what we are learning about similar SARS-CoV-2 variants that are emerging in other locations. The 501Y.V2 lineage has three substitutions that affect key sites in the RBD (K417N, E484K and N501Y). The N501Y substitution has also recently been identified in a lineage that has spread rapidly in the UK (designated B.1.1.7)21. There is now good evidence that this lineage is associated with increased transmissibility22. The N501Y substitution has previously been shown through deep mutation scanning, and in a mouse model, to enhance binding affinity to human ACE23,23. There is some evidence that the E484K substitution may also increase binding affinity to human ACE23; and that the combination of N501Y and E484K enhances binding affinity still further24,25. Additional work is being conducted to understand the precise mechanisms that underlie the increased transmissibility of these new variants.

The other reason for a selective advantage of 501Y.V2 could be immune escape (that is, the capacity to cause reinfection). We have very limited SARS-CoV-2 seroprevalence data from South Africa to help us to understand the true extent of the epidemic. In studies that used residual blood samples from routine public sector antenatal and HIV care, seroprevalence in parts of the City of Cape Town was estimated at approximately 40% in July and August (toward the end of the first wave of the epidemic in this area)26. We have shown that the Eastern Cape—and Nelson Mandela Bay, in particular—were worse-affected than City of Cape Town in the first wave, and we therefore believe that population immunity could have been sufficiently high in this region to contribute to population-level selection. The RBD of the spike protein is the main target of neutralizing antibodies that are elicited during SARS-CoV-2 infection27. Neutralizing antibodies to the RBD can be broadly divided into four main classes28. Of these, class 1 and class 2 antibodies appear to be elicited most frequently during SARS-CoV-2 infection, and their epitopes directly overlap the human ACE2 binding site27. Class 2 antibodies bind to E484, and the E484K substitution has previously been shown to confer resistance to neutralizing antibodies in this class and to panels of convalescent sera, which suggests that E484 is a dominant neutralizing epitope4,5,29,30,31. Aside from the RBD, the remaining neutralizing activity is targeted at the N-terminal domain, and some of the N-terminal domain mutations in 501Y.V2 affect residues that form an antigenic supersite or are close to this site32,33. Preliminary evidence from live virus and pseudovirus experiments indicates that 501Y.V2 shows substantial or complete escape from neutralizing antibodies in convalescent plasma7,8. We are currently investigating the frequency of reinfection in the second wave, as well as the clinical presentations of individuals with reinfection to better understand the clinical and epidemiological effects of any immune escape. We are also conducting neutralization assays on plasma from recipients of vaccines, and await results of vaccine efficacy trials conducted in South Africa during the expansion of 501Y.V2.

One hypothesis for the emergence of this lineage (given the large number of mutations relative to the background mutation rate of SARS-CoV-2) is that it may have arisen through intrahost evolution34,35,36. This hypothesis is supported by the long branch length that connects the lineage to the remaining sequences in our phylogenetic tree (Extended Data Fig. 8). The mutation leading to the N501Y substitution is one of several spike mutations that emerged in an immunocompromised individual in the USA who had prolonged viral replication for over 20 weeks34. In South Africa (which has the largest HIV epidemic in the world), one concern has been the possibility of prolonged viral replication and intrahost evolution in the context of HIV infection, although the limited evidence so far does not suggest that HIV infection is associated with persistent SARS-CoV-2 replication37. However, the observed diversity within this lineage cannot be explained by a single long-term infection in one individual, because the lineage contains circulating intermediate mutants with subsets of the main mutations that characterize the lineage. If evolution within long-term infections were the explanation for the evolution of this lineage, then one would need to invoke a transmission chain that passes through several individuals. Furthermore, antigenic evolution—even within individuals who are not immunosuppressed—could offer an alternative explanation, as several of the individual sites in the spike protein appear to be under selective pressure worldwide and several of the identified mutations have emerged independently around the world (Extended Data Fig. 9) and been found in circulating lineages together.

Although the full implications of the 501Y.V2 lineage in South Africa are yet to be determined, these findings highlight the importance of coordinated molecular surveillance systems in all parts of the world in enabling the early detection and characterization of new lineages and informing the global response to the COVID-19 pandemic.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.

Epidemiological dynamics

We analysed daily cases of SARS-CoV-2 in South Africa up the 16 January 2020 from publicly released data provided by the National Department of Health and the National Institute for Communicable Diseases. This was accessible through the repository of the Data Science for Social Impact Research Group at the University of Pretoria (https://github.com/dsfsi/covid19za)38,39. The National Department of Health releases daily updates on the number of confirmed new cases, deaths and recoveries, with a breakdown by province. We also mapped excess deaths in each province and in South Africa as a whole onto general epidemiological data to determine the extent of potential underreporting of case numbers and gauge the severity of the epidemic. Excess deaths here are defined as the excess natural deaths (in individuals aged 1 year and above) relative to the value predicted from 2018 and 2019 data, setting any negative excesses to zero. We obtained these data from the Report on Weekly Deaths from the South Africa Medical Research Council Burden of Disease Research Unit16. We generated estimates for the Re of SARS-CoV-2 in South Africa from the ‘covid-19-Re’ data repository (https://github.com/covid-19-Re/dailyRe-Data) as of the 14 December 202040.

Sampling of SARS-CoV-2

As part of the NGS-SA20, five sequencing hubs receive randomly selected samples for sequencing every week according to approved protocols at each site. These samples include remnant nucleic acid extracts or remnant nasopharyngeal and oropharyngeal swab samples from routine diagnostic SARS-CoV-2 PCR testing from public and private laboratories in South Africa. In response to a rapid resurgence of COVID-19 in the Eastern Cape and the Garden Route district of the Western Cape in November, we enriched our routine sampling with additional samples from those areas. In total, we received samples from over 50 health facilities in the Eastern Cape and Western Cape (Extended Data Fig. 10).

Ethical statement

The project was approved by University of KwaZulu–Natal Biomedical Research Ethics Committee (ref. BREC/00001510/2020), the University of the Witwatersrand Human Research Ethics Committee (HREC) (ref. M180832), Stellenbosch University HREC (ref. N20/04/008_COVID-19) and the University of Cape Town HREC (ref. 383/2020). Individual participant consent was not required for the genomic surveillance. This requirement was waived by the Research Ethics Committees.

Whole-genome sequencing and genome assembly

cDNA synthesis was performed on the extracted RNA using random primers followed by gene-specific multiplex PCR using the ARTIC V3 protocol41. In brief, extracted RNA was converted to cDNA using the Superscript IV First Strand synthesis system (Life Technologies) and random hexamer primers. SARS-CoV-2 whole-genome amplification was performed by multiplex PCR using primers designed on Primal Scheme (http://primal.zibraproject.org/) to generate 400-bp amplicons with an overlap of 70 bp that covers the 30-kb SARS-CoV-2 genome. For nanopore sequencing, we adapted the nCoV-2019 sequencing LoCost protocol v341. In brief, PCR reactions were done in 12.5 μl volumes and no PCR product purification was done. After DNA repair (NEB) and end-prep reactions (NEB), up to 24 samples were barcoded by ligation (EXP-NBD104/NBD114, Oxford Nanopore Technologies). Barcoded samples were pooled, bead-purified and ligated to sequence adapters. After the bead clean-up, the DNA concentration was determined with a Qubit 2.0 instrument (Thermo Fisher). Up to 50 ng of the library in 75 μl were loaded on a prepared R9.4.1 flow-cell. A GridION X5 sequencing run was initiated using MinKNOW software with the high-accuracy base-call setting. The NC045512 reference was used for alignment during base-calling and the barcodes were split into different folders. .fastq files were downloaded from the GridION X5 for assembly and further analysis.

For Illumina sequencing, PCR products were cleaned up using AmpureXP purification beads (Beckman Coulter) and quantified using the Qubit dsDNA High Sensitivity assay on the Qubit 4.0 instrument (Life Technologies).

We then used the Illumina Nextera Flex DNA Library Prep kit according to the manufacturer’s protocol to prepare indexed paired end libraries of genomic DNA. Sequencing libraries were normalized to 4 nM, pooled and denatured with 0.2 N sodium acetate. A 12 pM sample library was spiked with 1% PhiX (PhiX Control v3 adaptor-ligated library used as a control). We sequenced libraries on a 500-cycle v2 MiSeq Reagent Kit on the Illumina MiSeq instrument (Illumina). Full details of the amplification and sequencing protocol have previously been published42,43.

We assembled paired-end and nanopore .fastq reads using Genome Detective 1.132 (https://www.genomedetective.com) and the Coronavirus Typing Tool44. For short reads, to accurately call mutations and short insertions and deletions (indels) for SARS-CoV-2, Genome Detective software was updated with an additional assembly step after the de novo assembly and strain identification. When the de novo assembly indicates a nucleotide similarity higher than 97% to the reference strain, a new assembly is made by read mapping against the reference. In this process, for strains satisfying this criterion, reads are mapped using minimap245 against the reference rather than the de novo consensus sequence, and subsequently final mutations and indels are called using GATK HaplotypeCaller46, with low-quality variants (with QD < 10) filtered using GATK VariantFiltration46. To call the consensus sequence, GATK HaplotypeCaller is used with default settings, followed by GATK VariantFiltration to select only variants with a variant confidence normalized by unfiltered depth of variant samples of at least 10 (QualByDepth ≥ 10). For nanopore data, candidate reads are assigned to candidate reference sequences using NCBI blastn with sensitive settings and low gap costs. Candidate reads are then aligned using Annotated Genome Aligner, after which a draft majority consensus sequence is subsequently called, and iteratively improved by realignment of all reads against the draft consensus sequence and realignment of regions with a putative insert against the reference using global alignment (MAFFT). The resulting consensus sequence is further polished by considering and correcting indels of length one or two in homopolymer regions of length four or longer that break the open reading frame (probably sequencing errors). Mutations were confirmed visually with .bam files using Geneious software V2020.1.2 (Biomatters). The reference genome used throughout the assembly process was NC_045512.2 (numbering equivalent to MN908947.3). All of the sequences were deposited in GISAID (https://www.gisaid.org/), and the GISAID accession identifiers are included as part of Supplementary Table 2. Raw reads for our sequences have also been deposited at the NCBI Sequence Read Archive (BioProject accession PRJNA694014).

In some samples, the K417N substitution was previously not called. To avoid an assembly concern, these samples were also analysed using the ARTIC Illumina pipeline (https://github.com/connor-lab/ncov2019-artic-nf, git revision 9ac3119a87). Results between the two pipelines were highly consistent with respect to the lineage-defining mutations, but also consistent with respect to the missing 22813G>T (K417N) mutation in these samples, despite being considered covered by both pipelines (Supplementary Table 1). In addition, we have implemented a Sanger sequencing method that covers the main RBD sites and this was used to confirm the K417N and other substitutions (that is, E484K and N501Y) in sequences in which we were not confident about the call from next-generation sequencing data. The full sequence properties, mutation and spike mutations of the 501Y.V2 sequences are shown in Supplementary Tables 3, 4.

LoFreq was used to detect minor viral variants to study the intrahost heterogeneity of viral variants (quasi-species)47 (Extended Data Fig. 5). Variants were called with at minimum coverage of 10% and conservative false discovery rate P value of 0.1. LoFreq models sequencing error rate and implements a Poisson distribution to probe the statistical significance of nucleotide variants at each position, filtering out all variants that fall below the P value threshold.

Quality control of genomic sequences from South Africa

We retrieved all SARS-CoV-2 genomes from South Africa from the GISAID database as of the 4 January 2021 (n = 2,882). Before phylogenetic reconstruction, we removed low-quality sequences from this dataset. We filtered out genomes that did not pass standard quality assessment parameters used in NextClade (https://clades.nextstrain.org). We filtered out 105 genomes from South Africa owing to low coverage, and a further 18 owing to poor sequence quality. Poor sequence quality was defined as sequences with clustered single-nucleotide polymorphisms and ambiguous bases at >10% of sites, and low-coverage genomes were anything with <90% genome coverage against the reference. We therefore analysed a total of 2,756 South African genomes. We also retrieved a global reference dataset (n = 2,573). This was selected from the NextStrain global reference dataset, plus the five most similar sequences to each of the sequences from South Africa as defined by a local BLAST search.

Phylogenetic analysis

We initially analysed genomes from South Africa against the global reference dataset using a custom pipeline based on a local version of NextStrain (https://github.com/nextstrain/ncov)9. The pipeline contains several Python scripts that manage the analysis workflow. It performs an alignment of genomes in MAFFT48, phylogenetic tree inference in IQ-Tree V1.6.949, tree dating and ancestral state construction and annotation (https://github.com/nextstrain/ncov). The full NextStrain build can be viewed at https://nextstrain.org/groups/ngs-sa/COVID19-ZA-2021.01.18.

The initial phylogenetic analysis enabled us to identify a large cluster of sequences (n = 341) with multiple spike mutations. We extracted this cluster and constructed a preliminary maximum-likelihood tree in IQ-tree, together with seven basal sequences from the region that were sampled from June to September 2020. We inspected this maximum-likelihood tree in TempEst v.1.5.3 for the presence of a temporal (that is, molecular clock) signal. Linear regression of root-to-tip genetic distances against sampling dates indicated that the SARS-CoV-2 sequences evolved in a relatively strong clock-like manner (correlation coefficient = 0.33, R2 = 0.11) (Extended Data Fig. 6).

We then estimated time-calibrated phylogenies using the Bayesian software package BEAST v.1.10.4. For this analysis, we used the strict molecular clock model, the HKY+I, nucleotide substitution model and the exponential growth coalescent model50. We computed Markov chain Monte Carlo (MCMC) in duplicate runs of 100 million states each, sampling every 10,000 steps. Convergence of MCMC chains was checked using Tracer v.1.7.151. Maximum clade credibility trees were summarized from the MCMC samples using TreeAnnotator after discarding 10% as burn-in. The phylogenetic trees were visualized using ggplot and ggtree52,53.

Phylogeographic analysis

To model phylogenetic diffusion of the new cluster across the country, we used a flexible relaxed random walk diffusion model that accommodates branch-specific variation in rates of dispersal with a Cauchy distribution54. For each sequence, latitude and longitude were attributed to the health facility at which the diagnostic sample was obtained or, if that information was not available, to a point randomly sampled within the local area or district of origin. Given that we do not have access to residential geolocations within the genomic surveillance, the location of the health facility serves as a reasonable proxy, especially as two-thirds of the population live within 2 km of their nearest health facility55.

As described in ‘Phylogenetic analysis’, MCMC chains were run in duplicate for 100 million generations and sampled every 10,000 steps, with convergence assessed using Tracer v.1.7.1. Maximum clade credibility trees were summarized using TreeAnnotator after discarding 10% as burn-in. We used the R package seraphim to extract and map spatiotemporal information embedded in posterior trees.

Lineage classification

We used a previously proposed56 dynamic lineage classification method from the ‘Phylogenetic Assignment of Named Global Outbreak Lineages’ (PANGOLIN) software suite (https://github.com/hCoV-2019/pangolin). This is aimed at identifying the most epidemiologically important lineages of SARS-CoV-2 at the time of analysis, enabling researchers to monitor the epidemic in a particular geographic region. A lineage is a linear chain of viruses in a phylogenetic tree showing connection from the ancestor to the last descendant. Variant refers to a genetically distinct virus with different mutations to other viruses. For the variant identified in South Africa in this study, we have assigned it the name 501Y.V2; the corresponding PANGO lineage classification is B.1.351 (lineages version 2021-01-06).

Selection analysis

To identify which (if any) of the observed mutations in the spike protein was most likely to increase viral fitness, we used the natural selection analysis of SARS-CoV-2 pipeline (https://observablehq.com/@spond/revised-sars-cov-2-analytics-page). This pipeline examines the entire global SARS-CoV-2 nucleotide sequence dataset for evidence of: (i) polymorphisms having arisen in multiple epidemiologically unlinked lineages that have statistical support for non-neutral evolution (mixed effects model of evolution)57, (ii) sites at which these polymorphisms have support for a greater-than-expected ratio of nonsynonymous-to-synonymous nucleotide substitution rates on internal branches of the phylogenetic tree (fixed-effects likelihood)58 and (iii) whether these polymorphisms have increased in frequency in the regions of the world in which they have occurred.

Structural modelling

We modelled the spike protein on the basis of the Protein Data Bank coordinate set 7A94, showing the first step of the spike protein trimer activation with one RBD domain in the up position, bound to the human ACE2 receptor59. We used the Pymol program (The PyMOL Molecular Graphics System, version 2.2.0) for visualization.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.