Illumina-Based De Novo Transcriptome Analysis and Identifications of Genes Involved in the Monolignol Biosynthesis Pathway in Acacia koa

Corresponding Author: Dulal Borthakur Department of Molecular Biosciences and Bioengineering, University of Hawaii, Honolulu, HI 96822, USA Email: dulal@hawaii.edu Abstract: Acacia koa is a leguminous timber tree endemic to the Hawaiian Islands. For breeding projects involved in improving wood quality of A. koa, understanding of genes influencing wood quality is crucial. Therefore, the objective of this study was to identify A. koa genes in the monolignol biosynthesis pathway, which is involved in wood formation and development. In this study, whole transcritpome sequencing of A. koa seedlings was performed through Illumina-based sequencing and over 88 million high-quality paired-end raw reads were generated. Trinity de novo assembly of those reads yielded 85,533 unigenes with an average length of 641 bp. Based on sequence similarity search with known proteins, we annotated 47,038 of the unigenes. Using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, 149 unigenes were assigned to ortholog groups of enzymes involved in the monolignol biosynthesis pathway. In addition, we identified complete coding sequences of genes for all the ten identified enzymes of the pathway. Future studies on expression levels of these genes in A. koa with different wood qualities will provide a tool for selection of desirable types. Comprehensive sequence resources of A. koa generated through this study will contribute to genomic studies and improvement programs for this tree.


Introduction
Acacia koa is an important leguminous tree endemic to the Hawaiian Islands. The native A. koa forests are broadly distributed across all five major Hawaiian Islands (Wagner et al., 1990). The A. koa populations in these islands are genetically diverse and can be divided into morphologically distinguishable groups of koa, koaia and an intermediate type (Adamski et al., 2012). A. koa serves as an ecologically and economically vital resource for the Hawaiian Islands. It provides a habitat for many native fauna and flora (Elevitch et al., 2006;Sakai, 1988;Whitesell, 1990). In addition, due to the beautiful texture, hardiness, and carving quality of the wood, the A. koa timber, referred to as Hawaiian mahogany, is a high priced commodity with a current market value of up to $125 per board foot (Baker et al., 2009). A. koa wood is used for fine furniture, decorative items, musical instruments, and jewelry. The gross value of the A. koa timber and wood products produced is estimated to be in the range of $20-$30 million annually (Baker et al., 2009;Yanagida et al., 2004).
Because of high value of A. koa wood, it is crucial to understand the factors affecting wood formation and development. There have been many studies to identify factors that affect wood qualities, including wood density, wood color, stiffness and orientation and morphology of fiber. Although wood quality is a highly complex trait, in recent years, technologies such as gene mapping, sequencing and microarrays have been developed to understand molecular mechanisms underlying it. Once candidate genes are identified, they can be used as markers for selection of seedlings with desirable traits at early stages. However, there are only a limited number of nucleotide sequences publicly available for A. koa. To identify genes for wood formation and development in A. koa, sequencing of the genome will be necessary. Given that A. koa is an allotetraploid, it is substantially challenging to sequence and assemble its complex genome (Hamilton and Buell, 2012). For such cases, transcriptome sequencing provides an alternative to the whole genome sequencing.
In the present study, we utilized Illumina de novo sequencing technology to characterize the transcriptome of A. koa. Our objectives were to enrich the gene resource of A. koa with the sequencing data and to identify the transcripts involved in the monolignol biosynthesis pathway, which may be related to wood formation and development in A. koa, as lignin is one of the major constituents of wood. To the best of our knowledge, this study is the first exploration to characterize the transcriptome of A. koa. The transcriptome sequencing of A. koa will offer valuable sequence resources and contribute to further research on functional genomics and improvement of A. koa.

Plant Materials and RNA Extraction
Two and a half month old A. koa seedlings were obtained from the Maunawili sub-center of Hawaii Agriculture Research Center (HARC), Kailua, HI. Total RNA was extracted from the whole seedlings using RNeasy Plant Mini Kit (Qiagen) and purified with TURBO DNA-free Kit (Ambion). The quality and quantity of the RNA were assessed using NanoDrop Spectrophotometer (ND-1000).

Library Construction, Sequencing and Assembly
Cofactor Genomics, St. Louis, MO conducted cDNA library construction, sequencing and assembly. Sequencing was performed through the Illumina platform (Illumina Genome Analyzer IIx) with 60 bp paired-end reads. The quality of the raw reads were assessed through FASTQC to make sure more than 90% of the bases have Q20 or higher and were assembled using a de novo assembly program Trinity (http://trinityrnaseq.sourceforge.net/) (Grabherr et al., 2011). The resulting assembled sequences were defined as unigenes. Assembled sequences with lengths ≥200 bp were included in the downstream analysis.

Functional Annotations of Unigenes
The assembled unigene sequences were compared against multiple protein databases, including the NCBI non-redundant (nr) database, the Swiss-Prot database, the Translated European Molecular Biology Laboratory (TrEMBL) database, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and the Clusters of Orthologous Group (COG) database, through the Basic Local Alignment Search Tool (BLAST) algorithm with a cut-off E-value of 1E-3, using the doblast server of the Noble Foundation (http://bioinfo3.noble.org/doblast/) and the WebMGA server (http://weizhonglab.ucsd.edu/metagenomic-analysis/) (Wu et al., 2011). Gene names were assigned to each query based on the highest sequence similarity. A Java program Blast2Go (Conesa et al., 2005) was utilized to assign Gene Ontology (GO) functional categories for the annotated unigenes. The COG database, which classifies orthologous gene products, was used to categorize the annotated unigenes into 26 general functional groups. With the KEGG database, which contains systematic analysis of biochemical pathways and functions of the gene products, unigenes involved in the monolignol biosynthesis pathway were identified. The BLASTX analysis was performed to confirm the sequence identities of some unigenes in the ortholog groups and to detect unigenes with a complete Open Reading Frame (ORF). NCBI ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/) was used to determine the ORFs and the protein sequences of the unigenes.

Putative SSR Molecular Markers
For development of new molecular markers, the annotated unigenes were used to identify potential simple sequence repeats (SSRs). With the MISA Perl script (http://pgrc.ipk-gatersleben.de/misa/), motifs of dito hexanucleotide with a minimum of four repetitions and compound motifs, in this case, motifs interrupted by sequences of up to 100 bp, were also identified.

Sequence Analysis and Assembly
In this study, a total of 88,983,363 paired-end raw reads were generated from a 250-bp insert library. These reads contained 97.66% Q20 bases (base quality 20) and were used for de novo assembly. The raw reads were deposited on the NCBI Sequence Read Archive (SRA) with an accession number SRR1686818. Using the Trinity de novo assembly software, 85,533 unigenes were generated with a total length of 45.82 Mb, an average length of 640.97 bp and an N50 length of 1,068 bp (Table 1)

Functional Annotation
The 85,533 unigenes were searched against diverse protein databases, including the nr database, the Swiss-Prot database, the TrEMBL database, the KEGG database and the COG database, using the BLAST algorithm (E-value <1E-3). The annotation with the TrEMBL database had the highest aligned unigenes (46,146 unigenes), followed by the annotation with the nr database (45,800 unigenes). With the two databases combined, a total of 46,782 unigenes (54.69%) were annotated. The number of unigenes that showed homology with sequences in the Swiss-Prot, KEGG and COG databases were 33,113, 26,024 and 20,288, respectively. Overall, a total of 47,038 unigenes (54.99%) were successfully annotated using nr, TrEMBL, Swiss-Prot, KEGG and COG (Table 3).
Among the unigenes annotated in the nr and TrEMBL databases, 50.24% and 42.75%, respectively, had an Evalue <1.0E-50, showing strong homology; however, only 33.53% of the unigenes annotated in the Swiss-Prot database had an E-value <1.0E-50 ( Fig. 1).  Green bars represent the total number of unigenes that have high similarities to known plant proteins. The two peak-pattern of the graph is due to two ranges of data used; the distribution range was 100 bp for unigenes with lengths of up to 1 kb and 500 bp for unigenes above 1 kb The annotation rate decreased as the lengths of unigenes decreased; 98.16% of unigenes with the length of ≥1,000 bp showed homologous matches, whereas the annotation rates for unigenes with a length of 500-999 bp and unigenes <500 bp were 78.78 and 37.60%, respectively (Fig. 2). The majority of the unigenes without annotations from the nr and TrEMBL databases were <500 bp, as 62.25 and 28.59% were 200-299 bp and 300-499 bp, respectively, totaling 90.84%. The reason for this was most likely their short sequence lengths, resulting statistically insignificant matches.
For the plant species distribution, the most represented species in the unigenes aligned in the nr database were all legumes: Glycine max (32.25%), Cicer arietinum (13.52%) and Medicago truncatula (8.02%) (A). Similarly, more than half of the unigenes that matched with sequence in the TrEMBL database showed homology with legumes, including G. max (47.21%) and M. truncatula (8.72%) (Fig. 3B). Since the Swiss-Prot database contains only manually reviewed protein sequences, a higher percentage of the unigenes (76.62%) showed homology with the well-studied Arabidopsis thaliana sequences (Fig.  3C). Considering the E-value and plant species distributions, the annotations of the unigenes with the nr and TrEMBL databases gave consistent results. The 46,782 unigenes annotated from the nr and TrEMBL databases were used for further analysis. Additionally, we found that a total of 3,473 unigenes (3,262 and 3,442 annotated with the nr and TrEMBL databases, respectively) had homology to sequences with nonplant origins, such as Staphylococcus and Drosophila. Approximately 90.09% (3,128 unigenes) of those were 200-499 bp in length; 9.04% (314 unigenes) were 500-999 bp and 0.86% (30 unigenes) were ≥1,000 bp (Fig. 2). These sequences were considered contaminants and removed and the remaining 43,309 unigenes were used for sequence classifications.

GO Classification
Of the 43,309 unigenes, 20,884 were classified into 3 GO functional categories: biological process, cellular component, and molecular function (Fig. 4). In the biological process category, the unigenes were further clustered into 20 subcategories. Of those, the largest was metabolic process (23.99%); the second was cellular process (19.62%), and the third was single-organism process (14.25%). Under the cellular component category, the unigenes were assigned to 16 subcategories; the most abundant classes were cell (21.85%), cell part (21.84%), and organelle (15.15%). The unigenes under the molecular function category were sorted into 6 subcategories; the most represented ones were binding activity (44.33%), catalytic activity (42.69%), and transporter activity (4.76%).

SSR Identification
For development of new molecular markers, the 43,309 annotated unigenes were used to identify potential Simple Sequence Repeats (SSRs). With the MISA Perl script, we searched for di-to hexa-nucleotides with a minimum of four repetitions and identified 13,109 unigenes containing a total of 20,755 putative SSRs. Among them, 4,731 unigenes had more than one SSR (Table 5). Of those, 2,699 had SSRs in compound formation, having two or more consecutive SSRs interrupted by less than 100 bp. In total, we detected 111 different motifs. Di-nucleotide repeats except CG/CG (0.35%) were the most abundant (71.95%), and tri-nucleotide was the second abundant (26.95%) ( Table 6). The dominant repeat motif was AG/CT (45.28%), followed by AT/AT (13.72%), AC/GT (12.45%), and AAG/CTT (8.78%) (Fig. 7).

Transcriptome Sequencing and Assembly
Despite numerous studies on genomes of various legume species, only a limited number of nucleotide or protein sequences of A. koa have been reported, and almost no genomic information is available in public databases. As a majestic timber tree, A. koa could be a rich source of genes for tree improvement programs. Thus, the objective of this study was to produce a global overview of the whole transcriptome of A. koa. After comparing with the five databases and filtering out all of the mostly small, unannotated sequences, a total of 43,309 unigenes were identified in this study. A large proportion of smaller unigenes obtained through Illumina sequencing may be due to the allotetraploid \genome of A. koa (2n = 52); homeologous or paralogous gene copies can be distinct yet highly similar, possibly causing incomplete assembly (Duan et al., 2012;Gruenheit et al., 2012;Nakasugi et al., 2014). In spite of being an allotetraploid species, both the average length and the N50 length obtained from A. koa in the present study were greater than those obtained from other related diploid legume species, such as Acacia auriculifomis (496 and 949 bp), Acacia mangium (498 and 938 bp) (Wong et al., 2011), and Cicer arietinum (523 and 900 bp) (Garg et al., 2011). Also, the total number and cumulative length of unigenes of A. koa were more than twice of those of A. auriculiformis (42,217 unigenes and 21.02 Mb) and A. mangium (35,759 unigenes and 17.84 Mb) (Wong et al., 2011). These differences may be also due to their ploidy levels, as A. auriculiformis and A. mangium are both diploid (2n = 26), in addition to the use of different assembly software in those studies.

Genes Involved in the Monolignol Biosynthesis Pathway in A. koa
In this study, we identified genes in the monolignol biosynthesis pathway in A. koa because the pathway is involved in wood formation and development. Monolignols, which consist of p-hydroxyphenyl (H), guaiacyl (G), and syringyl (S) units, are the building blocks of lignin. Lignin constitutes 25-35% of the secondary cell wall (Plomion et al., 2001) and lignin composition is one of the major determinants of physical characteristics of wood (Novaes et al., 2010). Various studies have shown that the downregulation of upstream biosynthesis genes PAL, C4H, and 4CL results in lower content and altered composition of lignin in N. tabacum, A. thaliana and Populus tremuloides (Bate et al., 1994;Elkind et al., 1990;Hu et al., 1999;Kajita et al., 1997;Lee et al., 1997;Sewalt et al., 1997). Silencing a 4CL gene in Pinus taeda also reduced the G/H ratio, making it similar to that of compression wood (Wagner et al., 2009). Other enzymes in the pathway also affect lignin content and composition of wood. For instance, in transgenic Populus, the downregulation of C3H decreased lignin levels by half and highly increased the proportion of H units (Ralph et al., 2012). The repression of CCoAOMT also reduced lignin production and increased S/G ratio in Zea mays and Populus (Li et al., 2013;Meyermans et al., 2000). The downregulated COMT expression reduced content of S units in N. tabacum and Populus (Atanassova et al., 1995;Jouanin et al., 2000;Pincon et al., 2001). In P. taeda, a mutation in the CAD gene causes a decline in CAD protein, resulting in lower lignification, higher wood density, and increased stem-growth, thus affecting wood quality Yu et al., 2005;Wu et al., 1999). Also, some monoligninol biosynthesis genes (4CL, C4H, C3H and CCoAOMT) matched with quantitative trait loci (QTL) for wood density in P. taeda . In addition, the monolignol biosynthesis pathway is part of the phenylpropanoid pathway, which generates a wide variety of secondary metabolites, such as flavonoids and tannins (Fig. 8), so regulations of monolignol biosynthesis enzymes can affect other metabolite production. For example, repression of a HCT gene in A. thaliana resulted in accumulation of flavonoids, such as anthocyanin (Besseau et al., 2007). Based on these published reports, we expect wood quality attributes, such as wood density and wood color, are determined through differential expression of the genes encoding enzymes in the monolignol biosynthesis pathway.  In the present study, 149 unigenes were assigned as orthologs of the enzymes involved in the monolignol biosynthesis pathway (Fig. 8). However, there are many closely related superfamily members ("like" proteins) and some of them may be unrelated to the monolignol biosynthesis pathway, so we excluded unigenes with <50% homology with A. thaliana monolignol genes and unigenes without important conserved amino acid motifs identified in previous studies (Ehlting et al., 2001;Hoffmann et al., 2003;Joshi and Chiang, 1998;Larsen 2004;Lynch et al., 2002;Mckie et al., 1993;Zubieta et al., 2002;Schuler, 1996;Wanner et al., 1995) (Fig. S1). We identified complete CDSs of genes for all the ten enzymes involved in monolignol biosynthesis. There may exist more isoforms in A. koa because conserved motifs could not be determined in some of the unigenes due to incomplete assembly. Future studies of A. koa involving significant variations in wood quality attributes will determine the level of expressions of key monolignol biosynthesis genes that result in specific phenotypes. Therefore, determining the expression levels of those key genes will be useful for selection for improved wood quality.

Putative SSR Molecular Markers
Next-generation sequencing (NGS) is a rapid and effective approach to identify SSR molecular markers in non-model organisms without known genomic sequences. The traditional approach to develop SSR molecular markers involves fragmentation of DNA, construction of genomic DNA libraries in Escherichia coli, PCR amplification, and sequencing of the amplified fragments (Sahu et al., 2012;Song et al., 2005). These steps are time-consuming and labor-intensive. Using NGS technology and bioinformatics, identification of numerous SSRs from the sequence data can be rapid and cost-effective. Previously, through the traditional approach, Fredua-Agyeman et al. (2008) analyzed 96 sequences and developed 31 primer pairs that targeted microsatellite loci in A. koa. Some of these primers successfully identified polymorphic loci and were also used to measure genetic diversity in A. koa (Adamski et al., 2012); yet, only limited number of genetic markers exists in A. koa, so the identification of more SSRs with NGS technology will be useful.
In the present study, we predicted 13,109 unigenes containing a total of 20,755 putative SSRs. In A. koa, dinucleotide repeats were the predominant motif as in many other plants, such as A. thaliana, Arachis hypogaea, Brassica napus, Beta vulgaris, Brassica oleracea, G. max, Vitis vinifera, and Sesamum indicum (Kumpatla and Mukhopadhyay, 2005;Wei et al., 2014). The frequency of SSR repeat motifs in A. koa obtained in this study was consistent with that of other plant species. The AG/CT repeat (45.28%) was the most abundant dinucleotide motif group and the CG/CG repeat (0.35%) was the smallest dinucleotide motif group in A. koa just as in various species studied by Jayashree et al. (2006) and Kumpatla and Mukhopadhyay (2005). The AAG/CTT repeat (8.78%) was the predominant trinucleotide motif in A. koa, and it was also predominant in other plants, including three legume species, G. max, M. truncatula and Lotus japonicus (Jayashree et al., 2006;Kumpatla and Mukhopadhyay, 2005). Our results provide a substantial number of SSRs; in future studies, we may be able to identify SSR loci linked to genes associated with wood properties.

Conclusion
This is the first comprehensive transcriptome-wide analysis of A. koa using NGS technology. Illumina sequencing and Trinity de novo assembly generated 85,533 unigenes, and we successfully annotated 43,309 of them. With the KEGG database, we identified complete coding sequences of all the ten genes involved in the monolignol biosynthesis pathway, which could be highly associated with wood formation and development in A. koa. Further characterization of these genes will contribute to a deeper understanding of wood quality in A. koa. In addition, we predicted a significant number of potential SSR markers from our transcriptome data. Our results will be a valuable resource for future genetic studies and improvement programs of A. koa.