Comparative Analysis of the Complete Chloroplast Genome of the Alloplasmic Sunflower (Helianthus L.) Lines with Various CMS Types

Corresponding Author: Kirill Azarin, Southern Federal University, Rostov-on-Don, Russia. Email: sunlitart@yandex.ru
Abstract: The complete chloroplast genomes of the fertile sunflower line HA89 and of isonuclear CMS lines with four different cytoplasmic backgrounds (PET1, PET2, ANN2 and MAX) were sequenced. A total of 451 polymorphic sites, including 58 SSRs, 317 SNPs and 76 microindels, were revealed between the fertile and CMS cytotypes. Among the alloplasmic male-sterile lines, the cpDNA of CMS-MAX had the largest number of polymorphisms. The lowest number of polymorphic sites was revealed in CMS-PET1. Like CMS-PET1, CMS-PET2 was obtained by interspecific crossing between H. petiolaris and cultivated sunflower H. annuus. Nevertheless, the numbers of INDELs and SNPs in the CMS-PET2 chloroplast genome were more than 4-fold and 6.5-fold higher, respectively, than those in the cpDNA of CMS-PET1. The average frequencies of SNPs and INDELs in the non-genic and genic regions were 0.0062 and 0.0046, respectively. Increased mutation rates were found in the psbM-rpoB, rps16 intron, atpA-psbD, rps4-ndhJ and ndhC-atpE non-coding regions, as well as in the rpoC2, atpA, rbcL, ndhF and ycf1 genes. In addition to short insertions and deletions ranging from 1 to 5 bp, relatively long INDELs (14-24 bp) unique to each CMS line were found. These insertions and deletions may be useful for PCR differentiation of the CMS lines because they produce amplicons of different lengths.


Introduction
The success of breeding highly productive hybrids that tolerate a range of environmental factors depends largely on the genetic potential of the parental lines. The overwhelming majority of analyses of genetic diversity in plants focus on combinations of nuclear alleles (Pervaiz et al., 2015). At the same time, the potential of cytoplasmic variability and of new cytoplasmic-nuclear combinations is largely ignored. Although the nuclear genome plays a significant role in plant ontogenesis, cytoplasmic genes have been shown to affect both the expression of quantitative traits and the adaptive potential of plants under extreme environmental conditions (Mashkina et al., 2010). Indeed, plastid DNA, which accounts for only a few percent of the total cellular DNA, is involved in vital plant functions such as photosynthesis (Jansen and Ruhlman, 2012). Nuclear-cytoplasmic genomic interactions are confirmed by studies of the simultaneous variability of organelle and nuclear DNA (Russell et al., 2003). Mitochondrial and chloroplast DNA have a much lower level of variability than the nuclear genome. In addition, a reduction of genetic diversity during domestication and subsequent selection has been demonstrated. Thus, based on restriction site polymorphism in the cpDNA of 34 wild and cultivated lines of sunflower, Rieseberg and Seiler (1990) showed that cultivated lines were monomorphic for cpDNA phenotypes. Similarly, an analysis of chloroplast microsatellite loci in six Helianthus species and 46 lines of cultivated sunflower revealed no cpDNA polymorphism within cultivated forms of H. annuus (Markin et al., 2015).
Currently, almost all commercial sunflower hybrids are based on a single CMS type, PET1, which was discovered by P. Leclercq (1966) in an interspecific hybrid between H. petiolaris Nutt. and H. annuus L. However, intensive use of only one CMS source makes cultivated hybrids extremely vulnerable to new strains of pathogens (Levings, 1990; Liberatore et al., 2016). Given the decrease in cytoplasmic genetic diversity that accompanies domestication and artificial selection, it is necessary to introduce new plasmotypes into cultivated plants. The study of the structural and functional organization of organelle genomes is also relevant to modern taxonomy and phylogenetics. Until now, only a few chloroplast markers have been used in studies of plant diversity and phylogeny (Soejima and Wen, 2006; Schroeder et al., 2011). With the advent of Next-Generation Sequencing (NGS) technologies, it has become feasible to use complete chloroplast genomes in taxonomic and evolutionary studies (McPherson et al., 2013). The complete cpDNA sequences of HA89-alloplasmic sunflower lines with the PET1, PET2, ANN2 and MAX types of cytoplasmic male sterility will allow us to identify an additional source of genetic variation and will contribute to solving fundamental problems of biology, especially in the genetics and evolutionary genomics of plastids.
The aim of this study was to investigate the polymorphism of the complete chloroplast genomes of sunflower fertile line HA89 and isonuclear CMS lines with four different cytoplasmic backgrounds (PET1, PET2, ANN2 and MAX).

Plant Materials
The study was carried out on the fertile sunflower line HA89 and on isonuclear CMS lines with four different cytoplasmic backgrounds. The plant materials were obtained from the genetic collection of the N. I. Vavilov Institute of Plant Genetic Resources (Russia). The CMS sources belonged to the following species of the genus Helianthus: H. petiolaris (PET1, PET2), H. annuus (ANN2) and H. maximiliani (MAX).

Organelle DNA Extraction, Genome Library Construction and Sequencing
First, an organelle fraction with a reduced amount of nuclear DNA was extracted from the leaves of 14-day-old sunflower seedlings as described earlier (Makarenko et al., 2016). For each line, equal quantities of leaf tissue from 5 plants were used. DNA was isolated from this fraction with the PhytoSorb kit (Syntol, Russia) according to the manufacturer's instructions. NGS libraries were prepared from 1 ng of DNA using the Nextera XT DNA Library Prep Kit (Illumina, USA) (Head et al., 2014). All preparation steps followed the manual. Library quality was assessed with a Bioanalyzer 2100 (Agilent, USA). Libraries were quantified with a Qubit fluorimeter (Invitrogen, USA) and by qPCR (van Dijk et al., 2014). For sequencing, libraries were diluted to a concentration of 8 pM. The libraries were sequenced on different platforms: the fertile line, PET1 and MAX1 libraries on a NextSeq 500 sequencer with the High Output v2 kit (Illumina, USA), and the PET2 and ANN2 libraries on a HiSeq 2000 platform with the TruSeq SBS Kit v3-HS (Illumina, USA).

Results and Discussion
About 13.2, 14.7, 9.3, 6.9 and 10.2 GB of raw DNA sequence data were obtained for the fertile line HA89, CMS-PET1, CMS-PET2, CMS-ANN2 and CMS-MAX, respectively. The overall alignment rates for the chloroplast genomes were 25-30% of the total read number, depending on the sample. The average read coverage of the chloroplast genomes was more than 1000-fold. By aligning the sequencing reads to the reference sequence (NC_007977.1), we obtained the complete chloroplast genome sequences of the studied sunflower lines. The chloroplast genomes of HA89, CMS-PET1, CMS-PET2, CMS-ANN2 and CMS-MAX consisted of 151,094 bp, 151,110 bp, 151,127 bp, 151,147 bp and 151,138 bp, respectively. The difference in size is due to the increased length of individual non-coding regions of the chloroplast genome. In most terrestrial plants, the chloroplast genome is a circular DNA molecule that includes a Large Single Copy region (LSC) and a Small Single Copy region (SSC) separated by two copies of an Inverted Repeat (IR). The content, order and organization of chloroplast genes are usually highly conserved, which makes chloroplast genomes invaluable for genetic and phylogenetic studies. The chloroplast genomes of the studied sunflower lines follow this conserved organization and contain an LSC, an SSC and a pair of IR regions. The lengths of the LSC regions ranged from 83,527 bp in the fertile line to 83,605 bp in CMS-ANN2, whereas the SSC lengths varied among the five cytotypes from 19,113 bp in CMS-MAX to 19,147 bp in the fertile line. The IR regions showed the smallest change in length (Table 1). The content and order of the genes in the investigated cytotypes were identical to those of the previously sequenced sunflower cpDNA (NC_007977.1). Regions of high G+C content are more sensitive to mutation (Smith et al., 2002). The total G+C content varied only slightly, from 37.60 to 37.62% (Table 1), which is comparable with the chloroplast genomes of other angiosperms (Jansen and Ruhlman, 2012).
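The size differences between the cytotypes can be read directly from the genome lengths given above. The following minimal sketch simply subtracts the HA89 length from each CMS genome length; the dictionary keys and the helper name `size_differences` are illustrative choices, not part of the original analysis.

```python
# Chloroplast genome lengths in bp, as reported in the text.
genome_size_bp = {
    "HA89": 151094,
    "CMS-PET1": 151110,
    "CMS-PET2": 151127,
    "CMS-ANN2": 151147,
    "CMS-MAX": 151138,
}


def size_differences(sizes, baseline="HA89"):
    """Return the length difference (bp) of every line relative to the baseline."""
    base = sizes[baseline]
    return {line: bp - base for line, bp in sizes.items() if line != baseline}


diffs = size_differences(genome_size_bp)
# Print the lines ordered by increasing difference from the fertile line.
for line, delta in sorted(diffs.items(), key=lambda item: item[1]):
    print(f"{line}: {delta:+d} bp relative to HA89")
```

Under these figures, every CMS genome is slightly longer than the fertile-line genome, by a few tens of base pairs at most.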
Comparative analysis of the chloroplast genomes of the alloplasmic CMS lines of sunflower revealed 451 polymorphic sites. Among them, 58 sites were microsatellite loci, represented exclusively by mononucleotide repeats and located in non-coding cpDNA regions (Table 2). Because of their uniparental inheritance, chloroplast microsatellites have been widely used in the analysis of genetic diversity, differentiation and population structure (Provan, 2000; Flannery et al., 2006). In addition, 317 SNPs and 76 microindels were identified. The proportions of SSR, SNP and INDEL variations were 12.9, 70.2 and 16.9%, respectively. The SNPs are represented by six types of nucleotide substitutions: A/G (26.9%), A/C (14.4%), A/T (10.5%), T/C (29.7%), T/G (12.5%) and G/C (6.0%), whereas the SSRs were distributed as follows: (A)6-30 (46.6%), (T)6-31 (46.6%), (C)6-11 (5.1%) and (G)7-9 (1.7%). Our results agree well with data from other studies showing that chloroplast microsatellite loci are mainly represented by short polyT and polyA repeats, which in turn contributes significantly to the prevalence of AT nucleobases (62.38-62.40%). As is typical for other flowering plants (Ni et al., 2016; Shen et al., 2017), the largest number of polymorphic sites is located in the large single copy region, and the lowest number in the inverted repeat regions of the chloroplast genomes. Of the 317 SNPs detected, 120 were located in coding regions, and 59 of these substitutions were nonsynonymous (Table 3). The highest number of nonsynonymous SNPs was identified in CMS-MAX (36 substitutions), whereas the lowest was in CMS-PET1 (3 substitutions). In CMS-PET2 and CMS-ANN2 cpDNA, 21 and 23 nonsynonymous substitutions were identified, respectively. The chloroplast genome of the HA89 fertile line was largely similar to the reference sequence (NC_007977.1). Only 2 INDELs in coding regions, and 1 SNP and 4 INDELs in intergenic regions, were identified in the fertile line (Table 4).
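The class proportions quoted above follow from the raw counts, and a short sketch makes the arithmetic explicit. The counts and substitution percentages are taken from the text; the grouping of substitution types into transitions (A/G, T/C) and transversions (all others) is a standard classification added here for illustration and is not a figure reported by the authors. Recomputed percentages agree with the reported values to within rounding.

```python
# Counts of polymorphic sites by class, as reported in the text.
counts = {"SSR": 58, "SNP": 317, "INDEL": 76}


def percentages(class_counts):
    """Convert per-class counts to percentages of the overall total."""
    total = sum(class_counts.values())
    return {name: 100.0 * n / total for name, n in class_counts.items()}


pct = percentages(counts)

# Percentages of SNP substitution types, as reported in the text.
snp_type_pct = {
    "A/G": 26.9, "A/C": 14.4, "A/T": 10.5,
    "T/C": 29.7, "T/G": 12.5, "G/C": 6.0,
}

# Transitions swap purine with purine (A/G) or pyrimidine with pyrimidine (T/C).
TRANSITIONS = {"A/G", "T/C"}
transition_pct = sum(p for t, p in snp_type_pct.items() if t in TRANSITIONS)
transversion_pct = sum(p for t, p in snp_type_pct.items() if t not in TRANSITIONS)

print("total polymorphic sites:", sum(counts.values()))
for name, value in pct.items():
    print(f"{name}: {value:.2f}%")
print(f"transitions: {transition_pct:.1f}%, transversions: {transversion_pct:.1f}%")
```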
Among the alloplasmic male-sterile lines, the cpDNA of CMS-MAX had the largest number of polymorphisms in comparison with the reference sequence. The lowest number of polymorphisms was revealed in CMS-PET1 (23 SNPs and 10 INDEL mutations throughout the genic and non-genic regions). Like CMS-PET1, CMS-PET2 was obtained by interspecific crossing between H. petiolaris and cultivated sunflower H. annuus. Interestingly, the numbers of INDELs and SNPs in the CMS-PET2 chloroplast genome were more than 4-fold and 6.5-fold higher, respectively, than those in the cpDNA of CMS-PET1. Previously, analysis of restriction site polymorphisms in the chloroplast DNA of accessions of wild and cultivated H. annuus, including the lines CMS 89 and HA89, demonstrated that all accessions of domesticated sunflower had an H. annuus cpDNA (Rieseberg and Seiler, 1990). The authors explained the absence of H. petiolaris cpDNA in cultivated sunflower by suggesting that the original H. petiolaris source for CMS 89 was a hybrid or introgressive population of H. annuus and H. petiolaris. This was subsequently confirmed by the analysis of seven individuals from the source population of H. petiolaris, all of which were morphologically H. petiolaris but had the cpDNA of H. annuus (Rieseberg and Seiler, 1990). A high frequency of SNPs and INDEL mutations was also detected in the CMS-ANN2 cytotype (Table 4). The ANN2 cytotype was derived from a cross between wild and cultivated sunflower H. annuus (Serieys, 1984; Skoric et al., 2012). Previously, a comparative analysis of the complete chloroplast genomes of cultivated and wild sunflower H. annuus revealed only 43 variant sites, including 22 SNPs and 21 polymorphic SSR loci (Makarenko et al., 2016).
The average frequency of SNPs and INDELs in the intergenic regions was 0.0062. A twofold increase in frequency was identified in the intergenic region between the psbI and petN genes of the CMS-MAX line. In addition, the frequency was higher in non-genic regions such as psbM-rpoB, atpA-psbD, rps4-ndhJ, ndhC-atpE and the rps16 intron (Fig. 1).
Meanwhile, the average frequency of SNPs and INDELs in the genic regions was 0.0046. Some chloroplast genes, such as rpoC2, atpA, rbcL, ndhF and ycf1, were distinguished by a higher level of polymorphism. Interestingly, the greatest increase in frequency was observed for the ycf1 gene, with a 4.3-fold increase in the CMS-ANN2 line (Fig. 2). A comparative analysis of the complete chloroplast genome sequences of a male-fertile line and two CMS lines of onion reported that the average frequency of SNPs and INDELs in the non-genic regions was 0.0057, while the mean frequency in the genic regions was 0.0016 (Kim et al., 2015). Among the intergenic regions in that study, the most polymorphic were ndhF-rpl32, petN-psbM and trnS-trnG. As in our study, an increase in the frequency of SNPs and INDELs in the genic regions was observed in the ndhF (3.5-fold) and ycf1 (4.2-fold) genes (Kim et al., 2015). The rpo genes have been shown to be highly variable and reliable phylogenetic markers, effective in reconstructing the relationships of species belonging to the same genus (Krawczyk and Sawicki, 2013). A high level of polymorphism in the ndh, rpoC2, rbcL and ycf1 genes has also been demonstrated in other studies (Wei et al., 2017; Joseph et al., 2013; Benkeblia, 2014).
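The frequencies and fold increases discussed above can be illustrated with a minimal sketch. It assumes that the frequency of a region is the number of SNPs plus INDELs divided by the region length in bp, and that the fold increase is the ratio of a region's frequency to the reported average; the text does not spell out either definition, so both are assumptions. The example region (1,000 bp, 12 SNPs, 1 INDEL) is hypothetical and is not data from this study.

```python
# Average frequencies reported in the text.
AVERAGE_NON_GENIC = 0.0062
AVERAGE_GENIC = 0.0046


def variant_frequency(n_snps, n_indels, region_length_bp):
    """Variants per bp, assuming frequency = (SNPs + INDELs) / region length."""
    if region_length_bp <= 0:
        raise ValueError("region length must be positive")
    return (n_snps + n_indels) / region_length_bp


def fold_change(region_frequency, average_frequency):
    """Ratio of a region's variant frequency to the average frequency."""
    if average_frequency <= 0:
        raise ValueError("average frequency must be positive")
    return region_frequency / average_frequency


# Hypothetical intergenic region, for illustration only.
example_frequency = variant_frequency(n_snps=12, n_indels=1, region_length_bp=1000)
example_fold = fold_change(example_frequency, AVERAGE_NON_GENIC)
print(f"frequency = {example_frequency:.4f}, fold increase = {example_fold:.2f}")
```

With this definition, a region counts as a hotspot when its fold increase is well above 1, which is how the twofold and 4.3-fold values quoted above would be read.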
In addition to short insertions and deletions (ranging from 1 to 5 bp) in the chloroplast genomes of the CMS lines, INDELs of 14-24 bp in length were also found in the investigated cpDNA (Table 5). These insertions and deletions may be suitable for PCR differentiation of the CMS lines because they produce amplicons of different lengths. CMS-PET1 is widely used for the commercial production of F1 hybrid sunflower seeds, and this type of CMS has been fairly well studied. In particular, its molecular genetic basis is known, and STS markers have been developed for the mitochondrial orf522 gene that make it possible to distinguish fertile lines from sterile lines carrying CMS-PET1 (Schnabel et al., 2008). The molecular basis of CMS types such as PET2, ANN2 and MAX1 is insufficiently studied, which is an obstacle to their introduction into commercial breeding. In our study, we detected INDELs specific to the chloroplast DNA of the CMS-PET2, CMS-ANN2 and CMS-MAX1 cytotypes (Table 5). Designing primer pairs for the conserved regions flanking these INDELs will allow the development of PCR markers for the identification of various types of sunflower CMS. Such markers are a prerequisite for the development of highly productive heterotic sunflower hybrids based on new CMS sources.
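The idea of distinguishing cytotypes by amplicon length can be sketched as follows. Primers placed in conserved flanking regions amplify a fragment whose length changes by the size of the INDEL between them. The 200 bp reference amplicon, the 10 bp resolution threshold and the function names are hypothetical choices for illustration; only the 14-24 bp INDEL range comes from the text.

```python
def amplicon_length(reference_bp, indel_bp):
    """Expected amplicon length; indel_bp > 0 for an insertion, < 0 for a deletion."""
    length = reference_bp + indel_bp
    if length <= 0:
        raise ValueError("INDEL removes the whole amplicon")
    return length


def distinguishable(length_a, length_b, min_diff_bp=10):
    """True if two amplicons differ by at least min_diff_bp (assumed gel resolution)."""
    return abs(length_a - length_b) >= min_diff_bp


REFERENCE_AMPLICON_BP = 200  # hypothetical amplicon size in the fertile line

with_insertion = amplicon_length(REFERENCE_AMPLICON_BP, +24)  # longest INDEL in the range
with_deletion = amplicon_length(REFERENCE_AMPLICON_BP, -14)   # shortest INDEL in the range

print(with_insertion, distinguishable(REFERENCE_AMPLICON_BP, with_insertion))
print(with_deletion, distinguishable(REFERENCE_AMPLICON_BP, with_deletion))
```

Under the assumed threshold, INDELs in the 14-24 bp range shift the amplicon enough to be separated from the fertile-line product, whereas 1-5 bp INDELs would not.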

Conclusion
The comparative analysis of the complete chloroplast genomes of the fertile line HA89 and 4 alloplasmic CMS lines (PET1, PET2, ANN2, MAX1) revealed a total of 451 polymorphic sites, including 58 SSRs, 317 SNPs and 76 microindels. Chloroplast microsatellite loci are mainly represented by short polyT and polyA repeats. Of the 317 SNPs detected, 120 were located in coding regions, and 59 of these substitutions are nonsynonymous. Among the alloplasmic male-sterile lines, the cpDNA of CMS-MAX was characterized by the largest number of polymorphisms in comparison with the sequence of the fertile line. The lowest number of polymorphic sites was revealed in CMS-PET1. Like CMS-PET1, CMS-PET2 was obtained by interspecific crossing between H. petiolaris and cultivated sunflower H. annuus. Nevertheless, the numbers of INDELs and SNPs in the CMS-PET2 chloroplast genome were more than 4-fold and 6.5-fold higher, respectively, than those in the cpDNA of CMS-PET1. The average frequencies of SNPs and INDELs in the non-genic and genic regions were 0.0062 and 0.0046, respectively. Increased mutation rates were found in the psbM-rpoB, rps16 intron, atpA-psbD, rps4-ndhJ and ndhC-atpE non-coding regions, as well as in the rpoC2, atpA, rbcL, ndhF and ycf1 genes. In addition to short insertions and deletions ranging from 1 to 5 bp, relatively long INDELs (14-24 bp) unique to each CMS line were found. These insertions and deletions may be useful for PCR differentiation of the CMS lines because they produce amplicons of different lengths.