Variants Within over a Hundred Complete COVID-19 Genomes and the Impact on Health Security

Email: samer@nauss.edu.sa

Abstract: A total of 102 complete COVID-19 genomes were collected from viral genome databases to track and characterize novel variants. Bioinformatic analysis identified 172 variants, of which 127 were unique and 45 polymorphic. The 127 unique variants consist of 76 missense, 39 synonymous, 6 non-coding, 5 deletions and 1 insertion. The 45 polymorphic variants consist of 25 missense, 15 synonymous, 4 non-coding and 1 in-frame deletion. The most common variants are 28144T>C (missense, 33 samples) and 8782C>T (synonymous, 31 samples), followed by the missense variants 11083G>T (11 samples), 18060C>T (9 samples) and 26144G>T (7 samples). L3606F, S5932F and L84S are the amino acid changes caused by 11083G>T, 18060C>T and 28144T>C, respectively. Most variants were found in the ORF1ab gene, within the regions encoding nsp4 and nsp6, and in the ORF8 gene. The variant 28144T>C could be among the main enhancers of viral transmission. The most frequently recorded variants tend to be specific to particular countries. The outbreak could reflect spread between countries or diversity at the place of origin. There is reasonable evidence for a possible Chinese origin of the virus, and more genomes should therefore be collected and analyzed to understand its origin and the reasons for its outbreak. This could support human health security by helping either to develop suitable vaccines or to manage health precautionary measures.

Taxonomically, CoVs belong to the subfamily Orthocoronavirinae of the family Coronaviridae in the order Nidovirales. Orthocoronavirinae contains four genera: Alphacoronavirus, Betacoronavirus (4 subgenera) (Chan et al., 2013), Gammacoronavirus and Deltacoronavirus (Chen et al., 2020). Bats and rodents are the probable source of alpha- and betacoronaviruses, whereas birds could be the source of gamma- and deltacoronaviruses (Cascella et al., 2020). Over 210,000 complete COVID-19 genomes have so far been sequenced, with an approximate genome length of 30,000 bases. As the virus spreads, more variants accumulate, with the possibility that more virulent strains will emerge. The viral genome is a single-stranded positive-sense RNA (+ssRNA) (Perlman and Netland, 2009) with a 5′-cap structure and a 3′-poly-A tail (Chen et al., 2020). As a severe global health threat, CoV outbreaks are most probably unavoidable in the future. There is thus an urgent need to identify the variants in the 102 sequenced genomes, which might help in understanding the reasons for the outbreak and in producing an effective therapy and vaccine against the virus.

Materials and Methods
The COVID-19 page (https://bigd.big.ac.cn/ncov) available in China's National Genomics Data Center (NGDC) was accessed on the 27th of September 2020.
From the 2019-nCoV genomes available on this page and in three GenBank databases, 102 publicly available complete genomes were collected; they are listed in Table 1. As in previous studies (Koyama et al., 2020; Matsuda et al., 2020), the two longest and identical sequences, NC_045512 and MN908947 (29903 bp), were used as reference genomes to coordinate the differences in the ribosomal slippage of the 102 genomes. The alignments were checked for accuracy, and the aligned data can be obtained from the author upon request. Because the genomes differ in their start and end points, their lengths were adjusted to those of NC_045512 and MN908947, and their variants were numbered according to positions in these two genomes. All genomes were first aligned to each other using BioEdit Sequence Alignment Editor (Hall et al., 2011); the FASTA file of the aligned data was then imported into MacClade v.4.10 (Maddison and Maddison, 2002) and manually aligned to the reference genomes. Mutations were recorded in MacClade v.4.10, and the data were transferred to PAUP v. 4.0b10 (Swofford, 2002) for phylogenetic analysis.
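The numbering convention described above, in which each variant is reported at its reference position (for example 28144T>C), can be illustrated with a short sketch. This is not the procedure used in the study, where mutations were recorded in MacClade; the function name `call_variants` and the toy sequences are hypothetical, and the sketch assumes that both sequences come from the same alignment and use `-` for gaps.

```python
def call_variants(ref_aln: str, sample_aln: str) -> list[str]:
    """List differences between two aligned sequences in reference coordinates.

    Both strings are rows of the same alignment, so they have equal length
    and use '-' for gaps. Positions are 1-based on the ungapped reference.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")

    variants = []
    ref_pos = 0  # number of reference bases consumed so far
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == s or s == "N":
            continue  # identical column, or base not called in the sample
        if r == "-":
            variants.append(f"{ref_pos}ins{s}")  # inserted after ref_pos
        elif s == "-":
            variants.append(f"{ref_pos}del{r}")
        else:
            variants.append(f"{ref_pos}{r}>{s}")
    return variants


# Toy example (hypothetical sequences, not taken from the genomes studied).
print(call_variants("ACGTACGT", "ACTTAC-T"))
```

Each deleted or inserted base is reported separately here, and terminal gaps in a shorter genome would show up as deletions; a fuller implementation would merge adjacent events and mask the unsequenced ends.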
After deleting ambiguous and gap-containing sites and sites with N or mixed bases, the remaining 29532 sites were analyzed by Maximum-Parsimony (MP), Neighbor-Joining (NJ) and Maximum-Likelihood (ML) methods in PAUP v. 4.0b10. For MP, heuristic searches with 10 random stepwise additions were conducted using Tree Bisection Reconnection (TBR) branch swapping and 1000 bootstrap replications. For the NJ analysis (Saitou and Nei, 1987; Tamura and Nei, 1993), the distance option and 1000 bootstrap replications were used. Because the MT226610 sequence had evolved rapidly, it was used as an outgroup. For ML, heuristic searches with as-is addition and Nearest-Neighbor Interchange (NNI) branch swapping were used. The ML analysis also used a gamma shape parameter of 1.5214 with 4 rate categories. The substitution rate matrix of the data was as follows: R(a) = 0.1571, R(b) = 0.7500, R(c) = 0.2121, R(d) = 0.6049, R(e) = 2.6308 and R(f) = 1.00. Likelihood settings from the best-fit model (GTR+I+G) were selected by AIC in Modeltest version 3.06.
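The site-filtering step at the start of this analysis can be sketched as follows. This is a minimal illustration rather than the PAUP procedure itself: the function names `informative_columns` and `strip_columns` are hypothetical, and the sketch assumes that sequences are written with DNA letters (T rather than U), as in GenBank records, so that any character other than A, C, G or T marks a gap, an N or a mixed base.

```python
def informative_columns(alignment: list[str]) -> list[int]:
    """Return 0-based column indices where every sequence has A, C, G or T."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    unambiguous = set("ACGT")
    return [
        col
        for col in range(length)
        if all(seq[col].upper() in unambiguous for seq in alignment)
    ]


def strip_columns(alignment: list[str], keep: list[int]) -> list[str]:
    """Keep only the listed columns in every sequence."""
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy alignment (hypothetical): column 1 has a mixed base (R),
# column 2 a gap and column 4 an N, so only columns 0 and 3 survive.
toy = ["ACGTN", "AC-TA", "ARGTA"]
keep = informative_columns(toy)
print(keep, strip_columns(toy, keep))
```

Dropping whole columns keeps every sequence the same length, which is what the distance, parsimony and likelihood methods require.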
In agreement with previous studies (Koyama et al., 2020; Matsuda et al., 2020), the most common variants were 8782C>T (ORF1ab) and 28144T>C (ORF8) in 31 samples, followed by 11083G>T (nsp6) in 11 samples and 18060C>T (nsp14) in 9 samples. The occurrences of 8782C>T and 28144T>C coincide, and most of the other common variants occur in subsets of the samples carrying these two. 8782C>T is synonymous, whereas 11083G>T (L3606F), 18060C>T (S5932F) and 28144T>C (L84S) are missense variants causing amino acid changes in nsp6, nsp14 and ORF8, respectively. The variant 28144T>C, which changes an amino acid in the ORF8 protein from leucine to serine, lies in the polypeptide implicated in enhancing virus transition from bat to human (Chan et al., 2020; Nguyen et al., 2020). As this variant was recorded in thirteen North Americans, ten central Asians (Chinese, including Wuhan residents), three Japanese, one Taiwanese, one Indian and one Spaniard, it could be among the main enhancers of viral transmission.

Table 1 (excerpt). Genomes used in this study.
No.  Country   Length (bp)  Database          Accession
…    …         …            Genbank           MT159715
67   USA       29882        Genbank           MT159716
68   China     29923        Genbank           MT123292
69   China     29871        Genbank           MT123293
70   Italy     29867        Genbank           MT066156
71   Japan     29903        Genbank           LC529905
72   USA       29882        Genbank           MT184907
73   USA       29880        Genbank           MT184908
74   USA       29882        Genbank           MT184909
75   USA       29882        Genbank           MT184910
76   USA       29882        Genbank           MT184911
77   USA       29882        Genbank           MT184912
78   USA       29882        Genbank           MT184913
79   USA       29783        Genbank           MT188339
80   USA       29845        Genbank           MT188340
81   USA       29835        Genbank           MT188341
82   USA       29829        Genbank           MT192765
83   Taiwan    29862        Genbank           MT192759
84   Vietnam   29891        Genbank           MT192772
85   Vietnam   29891        Genbank           MT192773
86   Spain     29611        Genbank           MT198651
87   Spain     29782        Genbank           MT198652
88   China     29899        Genbank           MT226610
89   Pakistan  29836        Genome Warehouse  GWHACDD01000001
90   Wuhan     29899        Genome Warehouse  GWHABKF0000000001

Of the 101 missense variants, 80 are found in the longest gene, ORF1ab, distributed across the cleaved nonstructural proteins (NSP1-NSP16). However, further variants are found in the structural protein genes (S, ORF3a and N). MT226610-CHN was the fastest evolving substrain, exhibiting 29 substitutions. One of the out-of-frame deletions is found close to the 3′ end of the nsp15 protein of an Indian substrain, and the other is found close to the 5′ end of the S protein of the USA strain (Fig. 1). The first mutation probably did not alter the O-ribose methyltransferase (nsp16) since it is located at the end of the gene, while the second could alter the post-translational spike glycoprotein. This mutation may increase disease susceptibility (Zimmerman et al., 1997) or stop protein function, which would indicate that the protein is not necessary for efficient viral transmission. It is not known whether the S deletion enhances the virulence or transmission rate of the virus, or whether the strain carrying this deletion could successfully transmit to a new host (Assiri et al., 2016).
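The variant classes used above (synonymous, missense, in-frame and out-of-frame deletions) follow directly from the standard genetic code, as the sketch below illustrates. The helper names `classify_substitution` and `deletion_frame` are hypothetical, and the codons in the example are toy inputs chosen to reproduce a leucine-to-serine change and a silent serine change; they are not read from the reference genome.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_substitution(codon: str, offset: int, alt: str) -> str:
    """Classify a single-base change at offset 0, 1 or 2 within a codon."""
    codon = codon.upper()
    mutated = codon[:offset] + alt.upper() + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return f"missense ({before}>{after})"


def deletion_frame(length: int) -> str:
    """A deletion keeps the reading frame only if its length is a multiple of 3."""
    return "in-frame" if length % 3 == 0 else "out-of-frame"


# Toy codons (assumed for illustration only).
print(classify_substitution("TTA", 1, "C"))  # Leu codon, second base T>C
print(classify_substitution("AGC", 2, "T"))  # Ser codon, third base C>T
print(deletion_frame(3), deletion_frame(1))
```

In practice the codon and offset would be derived from the variant position and the start coordinate of the gene in the reference annotation, which this sketch leaves out.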
Fortunately, this study collected multiple COVID-19 genomes from the same places of origin, as shown in Table 5 (51 genomes from the USA, 25 from China (including 10 from Wuhan) and 8 from Japan). This suggests that the novel mutations found here could reflect diversity at the place of origin rather than being acquired during the spread of the infection (Matsuda et al., 2020). It is therefore an indication that stopping the outbreak is possible in the near future. However, the constructed tree (Fig. 2) indicated that viruses from the same country did not form a single group, which suggests that COVID-19 viruses were introduced to each country several times (Koyama et al., 2020; Matsuda et al., 2020); it may thus be difficult to trace the origin of the virus. The maximum-likelihood phylogeny (Fig. 3) indicated a possible transmission scenario: the tree pointed to a Chinese origin of COVID-19 and showed its transmission to the USA, Spain, Japan and India.
Researchers sequenced many SARS-CoV-2 genomes and shared the results during the pandemic. The sequence data allowed public health officials to evaluate relevant epidemiological parameters, such as the reproductive number, and virus introductions into new regions. While the possible causes of the outbreak are still being established, managing it relies on health precautionary measures that can be practised in daily life (Hopkins, 2020). Understanding the genetic framework of the COVID-19 genome enhances WHO's ability to analyze the risk of virus introduction into countries and to define response actions and the prioritization of resources, as well as the possible capacity to manage the outbreak. Action plans for health security are being implemented globally at varying rates of progress (Samhouri et al., 2018), actively supported by WHO to enhance countries' operational readiness for the virus (Al-Mandhari et al., 2020).

Conclusion
In conclusion, the virus is still considered a threat to human health security because its origin and the reasons for its outbreak are not well understood. A Chinese origin is possible. Two explanations for the outbreak are debated: diversity at the place of origin, or spread of the infection through individuals' movements between countries. Identifying new variants as more genomes are released could help to clarify the origin of the virus and the reasons for its outbreak, and to develop vaccines or effective precautions.