Introduction

The human population is expected to increase 45% by the year 2050, which will in turn lead to a massive increase in the global demand for food1. Given the scarcity of arable land worldwide, an increase in agricultural production of this magnitude will require vast increases in cropping intensity and yield2. It has been estimated that as much as 90% of the increase in global crop production will need to come from increased yield alone3. At the same time, climate change and other environmental challenges will necessitate the development of agricultural practices that are more ecologically friendly and sustainable.

Chemical fertilizers that provide critical macronutrients to crops—such as nitrogen (N), phosphorus (P), potassium (K), and sulfur (S)—are widely used to maximize agricultural yield4. The application of chemical fertilizers represents a major cost for agricultural companies and also contributes to environmental damage, such as air pollution through the formation of microparticles, soil depletion, and water contamination via run-off5. Biological fertilizers (biofertilizers) are comprised of microbial inoculants that promote plant growth, thereby representing an alternative or complementary approach for increasing crop yield, which is more sustainable and environmentally friendly. Biofertilizers augment plant growth through nutrient acquisition, hormone production, and by boosting immunity to pathogens6.

Sugarcane is a tall, perennial grass cultivated in tropical and warm temperate regions around the world, which is capable of producing high concentrations of sugar (sucrose) and generating diverse byproducts7. Sugarcane is consistently ranked as one of the top ten planted crops in the world8. Sugarcane agriculture plays a vital role in the economy of Colombia by supporting the production of food, energy, and fuel (ethanol) along with a variety of organic by-products. Our group is working to help develop more effective and sustainable sugarcane cropping practices in Colombia. The long-term goals of this work are to simultaneously (i) increase crop yield, and (ii) decrease the reliance on chemical fertilizers via the discovery, characterization, and application of endemic (native) biofertilizers to Colombian sugarcane fields.

Most sugarcane companies in Colombia currently use commercially available biofertilizers, consisting primarily of nitrogen-fixing bacteria, which were discovered and isolated from other countries (primarily Brazil), with limited success. We hypothesized that indigenous bacteria should be better adapted to the local environment and thereby serve as more effective biofertilizers for Colombian sugarcane. The use of indigenous bacteria as biofertilizers should also mitigate potential threats to the environment posed by non-native, and potentially invasive, species of bacteria. Finally, indigenous bacteria represent a renewable resource that agronomists can continually develop through isolation and cultivation of local strains.

The advent of next-generation sequencing technologies has catalyzed the development of genome-enabled approaches to harness plant microbiomes in sustainable agriculture9,10. The objective of this study was to use genome analysis to predict the local bacterial isolates that have the greatest potential for plant growth promotion while representing the lowest risk for virulence and antibiotic resistance. Putative biofertilizer strains were isolated and cultivated from Colombian sugarcane fields, and computational phenotyping was employed to predict their potential utility of strains as biofertilizers. We then performed a laboratory evaluation of predict the potential utility of these strains as biofertilizers, with the aim of validating our computational phenotyping approach.

Results

Initial genome characterization of putative nitrogen-fixing bacteria

A systematic cultivation approach, incorporating seven carbon substrates in nitrogen-free media (Supplementary Figure S1), was employed to isolate putative nitrogen-fixing bacteria from the four different sugarcane plant compartments, and isolates were screened for nitrogen fixation potential through PCR amplification of nifH genes. This initial screening procedure yielded several hundred clonal isolates of putative nitrogen-fixing bacteria, and Ribosomal Intergenic Spacer Analysis (RISA) was subsequently used to identify the (presumably) genetically unique strains from the larger set of clonal isolates. A total of 22 potentially unique strains of putative nitrogen-fixing bacteria were isolated in this way and selected for genome sequence analysis.

Genome sequencing and assembly summary statistics for the 22 isolates are shown in Table 1. Isolate genomes were sequenced to an average of 67x coverage (range 50x—88x) and genome sizes range from 4.5 to 6.1 Mb. GC content varies from 41.82 to 66.69%, with a distinct mode at ~ 57%. The genome assemblies show a range of 24–294 contigs ≥ 500 bp in length, with N50 values that range from 65,329 to 670,745bp (avg. = 310,166bp) and L50 values that range from 3 to 29 (avg. = 8.4). Genome sequence assemblies, along with their functional annotations, can all be found using the NCBI BioProject PRJNA418312. Individual BioSample, Genbank Accession, and Assembly Accession numbers for the 22 isolates are shown in Supplementary Table S1.

Table 1 Genome assembly statistics for the isolates characterized here.

Comparative genomic analysis

Average nucleotide identity (ANI; Fig. 1) and 16S rRNA sequence analysis (Supplementary Figure S2) were used to assign the species (genus) origins for the 22 putative nitrogen-fixing isolate genome sequences and the results of both approaches are highly concordant (Table 2), with ANI yielding superior resolution to 16S rRNA sequence analysis. A total of seven different species and seven different genera were identified among the 22 isolates characterized here. Analysis of nifH gene sequences also gave similar results; however, four of the isolates were not found to encode nifH genes, despite their (apparent) ability to grow on nitrogen-free media and the positive nifH PCR results. As described in the “Methods” section, we used the Rapid Annotations using Subsystems Technology (RAST) server to predict and annotate genes from our assemblies, and this approach was unable to detect nifH genes in those 4 isolates. We performed additional analyses on these four genomes to confirm the absence of nifH: we used NCBI BLAST with nifH nucleotide and amino acid queries to search the genomes and the nifH-specific tool TaxADivA11. Neither of these additional analyses found nifH genes in these four genomes. This could be due to false-positives in the original PCR analysis for the presence of nifH genes, or to changes in the composition of (possibly mixed) bacterial cultures during subsequent growth steps after the initial isolation on nitrogen-free media.

Figure 1
figure 1

Phylogeny of the bacterial isolates characterized here (SCK numbers) together with their most closely related bacterial type strains. The phylogeny was reconstructed using pairwise average nucleotide identities between whole genome sequence assemblies, converted to p-distances, with the neighbor-joining method. Horizontal branch lengths are scaled according the p-distances as shown.

Table 2 Identity of the most closely related species (genus) for the isolates characterized here.

Most of the isolates characterized here belong to the genus Klebsiella. K. variicola, 14 of 22 isolates, is the most abundant species characterized here. This finding is consistent with previous studies showing that Klebsiella strains are capable of fixing nitrogen12; in fact, the canonical nif operons were defined in the K. variicola type strain 342 (originally classified as Kelbsiella pneumoniae strain 342) genome sequence13.

K. variicola is also known to be an opportunistic pathogen that can cause disease in immunocompromised human hosts14,15, which raises obvious safety concerns regarding its application to crops as part of a biofertilizer inoculum. We performed a comparative sequence analysis between the endophytic nitrogen-fixing K. variicola type strain 342, which is capable of infecting the mouse urinary tract and lung13, and five of the isolates identified as K. variicola here. The nif cluster, which contains five functionally related nif operons involved in nitrogen fixation, is present in all of these genomes (Fig. 2). However, the four most critical pathogenicity islands implicated in the virulence of K. variicola 342 are all missing in the environmental K. variicola isolates characterized here (PAI 1–4 in Fig. 2a). The absence of pathogenicity islands in the genome of the endophytic nitrogen-fixer K. michiganensis Kd70 is associated with an inability to infect the urinary tract in mice16. Our results indicate that nitrogen-fixing K. variicola environmental isolates from Colombian sugarcane fields do not pose a health risk compared to clinical and environmental isolates that have previously been associated with pathogenicity. We explore this possibility in more detail in the following section on computational phenotyping.

Figure 2
figure 2

Comparison of the K. variicola type strain 342 to K. variicola sugarcane isolates characterized here. (a) BLAST ring plot showing synteny and sequence similarity between K. variicola 342 and five K. variicola sugarcane isolates. The K. variicola 342 genome sequence is shown as the inner ring, and syntenic regions of the five K. variicola sugarcane isolates are shown as rings with strain-specific color-coding according to the percent identity between regions of K. variicola 342 and the sugarcane isolates. The genomic locations of nif operon cluster along with four important pathogenicity islands (PAIs) are indicated. PAI1—type IV secretion and aminoglycoside resistance, PAI2 hemolysin and fimbria secretion, heme scavenging, PAI3—radical S-adenosyl-L-methionine (SAM) and antibiotic resistance pathways, PAI4—fosfomycin resistance and hemolysin production. (b) A scheme of the nif operon cluster present in both K. variicola 342 and the five K. variicola sugarcane isolates.

The nifH genes from the Klebsiella isolates characterized here form two distinct phylogenetic clusters (Fig. 3). This finding is consistent with previous results showing multiple clades of nifH among Klebsiella genome sequences17,18,19 and underscores the potential functional diversity, with respect to nitrogen fixation, for the sugarcane isolates characterized here.

Figure 3
figure 3

Phylogeny of the nifH genes for the bacterial isolates characterized here (SCK numbers). The phylogeny was reconstructed using pairwise nucleotide p-distances between nifH genes recovered from the isolate genome sequences using the neighbor-joining method. Horizontal branch lengths are scaled according the p-distances as shown.

Computational phenotyping

Computational phenotyping, also referred to as reverse genomics, was used to evaluate the potential of the bacterial isolates characterized here to serve as biofertilizers for Colombian sugarcane fields. For the purpose of this study, computational phenotyping entails the prediction of specific organismal phenotypes, or biochemical capacities, based on the analysis of functionally annotated genome sequences20. The goal of the computational phenotyping performed here was to identify isolates that show the highest predicted capacity for plant growth promotion while presenting the lowest risk to human populations. Accordingly, bacterial isolate genome sequences were screened for gene features that correspond to the desirable (positive) characteristics of (1) nitrogen fixation and (2) plant growth promotion and the disadvantageous (negative) characteristics of (3) virulence and (4) antimicrobial resistance. Genome sequences were scored and ranked according to the combined presence or absence of these four categories of gene features as described in the “Methods”. To compute genome scores, the presence of nitrogenase and plant growth promoting genes contribute positive values, whereas the presence of virulence factors and predicted antibiotic resistance yield negative values. Scores for each of the four specific phenotypic categories were normalized and combined to yield a single composite score for each bacterial isolate genome. The highest scoring isolates are predicted as best candidates to be included as part of a sugarcane biofertilizer inoculum. The predicted biochemical capacities of the highest scoring isolates were subsequently experimentally validated.

The results of the computational phenotyping analysis for the 22 bacterial isolate genome sequences characterized here are visualized as a heatmap in Fig. 4, and the presence/absence patterns for all of the genes analyzed here are shown in Supplementary Table S2. Isolates are ranked according to their composite genome scores, with the highest potential (10.87) for biofertilizers production shown at the top. Individual gene and phenotype scores are color coded for each genome, and the four functional-specific categories are shown separately. Individual gene results are shown for the nitrogenase (nif) genes, whereas genes are combined into functional sub-categories for the plant growth promoting and virulence factor genes. Predicted antimicrobial resistance phenotypes are shown for individual antibiotic classes.

Figure 4
figure 4

Computational phenotyping of the sugarcane bacterial isolates characterized here. The presence (red) and absence (blue) profiles for nitrogen fixation genes, plant growth promoting genes, and virulence factor genes are shown for the 22 bacterial isolates. Results are shown for all n = 21 nitrogen-fixing genes. Results for plant growth promoting genes (n = 25) and virulence factor genes (n = 44) are merged into six gene categories each. Predicted antibiotic resistance profiles are shown for n = 20 antibiotic classes. Detailed results for gene presence/absence and predicted antibiotic resistance profiles are shown in Supplementary Table S2. The results for all four phenotypic classes of interest were merged into a single priority score for each isolate (right side of plot), as described in the “Methods”, and used to rank the isolates with respect to their potential as biofertilizers.

The nif gene presence/absence profiles are highly similar for all but four of the bacterial isolates characterized here. These four isolates do not correspond to the Klebsiella genus, or closely related species, and do not encode any nif genes. These four isolates represent bacterial species that are commonly found in soil21,22,23,24,25, but they are not predicted to be viable biofertilizers. The Kosakonia radicincitans genome encodes the largest number of nif genes (n = 17) seen for any of the isolates characterized here. This is consistent with previous studies showing that isolates of this species can fix nitrogen26. The 14 K. variicola genomes characterized here all contain 16 out of 21 nif genes, including the core nifD and nifK genes, which encode the heterotetramer core of the nitrogenase enzyme, and the nifH gene, which encodes the dinitrogenase reductase subunit27. These genomes also all encode the nitrogenase master regulators nifA and nifL. The missing nif genes for the K. variicola isolates correspond to accessory structural and regulatory proteins that are not critical for nitrogen fixation. Accordingly, all of K. variicola isolate genomes are predicted to encode the capacity for nitrogen fixation, consistent with previous results13,28. The single Raoultella electrica isolate characterized here also contains the same 16 nif genes; Raoultella species have previously been isolated from sugarcane29 and have also been demonstrated to fix nitrogen30.

Initially, a total of 29 canonical bacterial plant growth promoting genes were mined from the literature, 25 of which were found to be present in at least one of the bacterial isolate genome sequences characterized here. These 25 plant growth promoting genes were organized into six distinct functional categories: phosphate solubilization, indolic acetic acid (IAA) production, siderophore production, 1-aminocyclopropane-1-carboxylate (ACC) deaminase, acetoin butanediol synthesis, and peroxidases (Supplementary Table S3). For the purposes of visualization (Fig. 4), each functional category is deemed to be present in an isolate genome sequence if all required genes for that function can be found, but the weighted scoring for these categories is based on individual gene counts as described in the “Methods”. The R. electrica isolate shows the highest predicted capacity for plant growth promotion, with 5 of the 6 functional categories found to be fully present. The majority K. variicola isolates also show similar, but not identical, plant growth promoting gene presence/absence profiles, with 3 or 4 functional categories present. The capacity for siderophore production is predicted to vary among K. variicola isolates. The K. radicincitans genome also encodes 4 functional categories of plant growth promoting genes, but differs from the K. variicola isolates with respect to absence of phosphate solubilization genes and the presence of acetoin butanediol synthesis genes. Three of the four species found to lack nif genes also do not score present for any of the plant growth promoting gene categories, further underscoring their predicted lack of utility as biofertilizers.

Initially, a total of ~ 2500 virulence factor genes were mined from the Virulence Factor Database (VFDB)31, 44 of which were found to be present in at least one of the bacterial isolate genome sequences characterized here. These 44 virulence factors were organized into six distinct functional categories related to virulence and toxicity: adherence, invasion, capsules, endotoxins, exotoxins, and siderophores. The weighted scores for these categories were computed based on individual gene presence/absence patterns, which were combined to yield the color scheme shown in Fig. 4. Despite the fact that K. pneumoniae clinical isolates have previously been characterized as opportunistic pathogens, the K. variicola environmental isolates characterized here show uniformly low virulence scores. The virulence factor genes found among the K. variicola isolates correspond to adherence proteins, capsules, and siderophores. As shown in Fig. 2, these genomes lack coding capacity for important invasion and toxin proteins, including the Type IV secretion system, which can be found in clinical K. pneumoniae isolates. The R. electrica and K. radicincitans isolates, both of which show high scores for nitrogen fixation and plant growth promoting, have higher virulence scores than can be seen for the environmental K. variicola isolates characterized here. Whereas Bacillus safensis has the lowest virulence score for any of the isolates characterized here, the remaining three isolates that lack nif coding capacity have the highest virulence scores and encode well- known virulence factors, such as Type IV, hemolysin, and fimbria secretion systems. The results of a more detailed comparison of the predicted virulence for clinical isolates of K. pneumoniae and closely related species, compared to the environmental isolates, are reported in the following section of the manuscript.

The predicted antibiotic resistance phenotypes for the 20 classes of antimicrobial compounds for which predictions were made are fairly similar across the isolates characterized here. The majority of the K. variicola isolates, along with the relatively high scoring R. electrica and K. radicincitans isolates, show predicted susceptibility to 10 of the 20 classes of antimicrobial compounds, intermediate susceptibility for 2–4, and predicted resistance to 5–8. The highest level of predicted antibiotic resistance was seen for Serratia marcescens, with resistance predicted for 8 compounds and intermediate susceptibility predicted for 4.

Computational phenotyping scores for the four categories were normalized and combined into a final score, which is shown on the right of Fig. 4 and used to rank the isolates top-to-bottom with respect to their potential as biofertilizers. Most of the top positions are occupied by K. variicola isolates, with the exception of the second-ranked R. electrica and the third-ranked K. radicincitans. The results of a similar analysis of four additional plant associated Klebsiella genomes are shown in Supplementary Figure S3.

Virulence comparison

The results described in the previous section indicate that the most of the K. variicola strains isolated from Colombian sugarcane fields have the highest overall potential as biofertilizers, including a low predicted potential for virulence. Nevertheless, the fact that strains of K. variicola have previously been characterized as opportunistic pathogens17,32 raises concerns when considering the use of K. variicola as part of a bioinoculum that will be applied to sugarcane fields. With this in mind, we performed a broader comparison of the predicted virulence profiles for the 22 environmental bacterial isolates characterized here compared to a collection of 28 clinical and one environmental isolate of K. pneumoniae and several other closely related species (See Supplementary Table S5 for isolate accession numbers). For this comparison, the same virulence factor scoring scheme described in the previous section was applied to all 51 genome sequences. The results of this comparison are shown in Fig. 5. Perhaps most importantly, there is a very clear distinction in the virulence score distribution, whereby all surveyed clinical strains show higher predicted virulence (from 4.45 to 2.11) than any of the environmental isolates characterized here (1.55 to 0.00). Furthermore, the three environmental isolates that show the highest predicted virulence correspond to species with low predicted capacity for both nitrogen fixation and plant growth promotion; as such, these isolates are not being considered as potential biofertilizers. The K. variicola environmental isolates, on the other hand, show uniformly low predicted virulence compared to clinical isolates of the same species. These results support, in principle, the use of the environmental K. variicola isolates characterized here as biofertilizers for Colombian sugarcane fields.

Figure 5
figure 5

Comparison of predicted virulence profiles for clinical K. pneumoniae isolates compared to the environmental (sugarcane) bacterial isolates characterized here. As in Fig. 4, predicted virulence profiles for six classes of virulence factor genes are shown for each isolate. Isolate-specific virulence factor scores are shown for each isolate are based on the presence/absence profiles for the n = 44 virulence factor genes as described in the “Methods”. Relative virulence levels are color-coded as shown in the key (note that the color coding here is slightly different than seen in Fig. 4). The virulence factor genes are used to rank the genomes from most (left) to least (right) virulent. Clinical versus environmental samples are shown to the left and right, respectively, of the red line, based on their virulence scores.

Experimental validation of prioritized isolates

The top six scoring isolates from the computational phenotyping were subjected to a series of cultivation-based phenotypic assays in order to validate their predicted biochemical activities: (1) acetylene reduction (a proxy for nitrogen fixation), (2) phosphate solubilization, (3) siderophore production, (4) gibberellic acid production, and (5) indole acetic acid production.

Nitrogen fixation activity, as measured by acetylene reduction to ethylene, was observed in all six isolates, three of which had higher levels in comparison to the positive control (Fig. 6A). The remaining three isolates showed higher levels of ethylene production compared to the negative control. All six of the isolates showed high levels of phosphate solubilization (Fig. 6B, C) and siderophore production (Fig. 6D, E) compared to the respective negative controls. All six isolates showed the ability to produce gibberellic acid (Fig. 6F), whereas none were able to produce indole acetic acid. The biochemical assay results are consistent with the computational phenotype predictions for these isolates.

Figure 6
figure 6

Experimental validation of prioritized biofertilizer isolates. The computationally predicted plant growth promoting phenotypes for the top six isolates were experimentally validated. All six strains were capable of acetylene reduction, i.e. ethylene production (Azotobacter vinelandii positive and Escherichia coli negative control) (a), phosphate solubilization (Pseudomonas aeruginosa ATCC 2785 positive and E. coli negative control) (b, c), siderophore production (E. coli ATCC 35,218 positive and Bacillus safensis negative control) (d, e), and gibberellic acid production (A. vinelandii positive and Stenotrophomonas sp negative control) (f). For panels (a), (b), (d), and (f), mean values are shown, with error bars representing + /− 1 standard deviation.

Discussion

The overall goal of this study was to characterize nitrogen-fixing bacteria and their potential as biofertilizers. Strains cultivated from Colombian sugarcane fields were screened by nifH specific PCR, and both 16S rRNA gene sequence and whole genome sequence comparisons were used to identify the isolates’ taxonomic origins. We found that most of the isolates characterized in this study belong to the family Enterobacteriaceae, with Klebsiella as the most abundant genus. 15 of the 22 nitrogen-fixing bacteria cultivated from Colombian sugarcane were identified as Klebsiella (Fig. 1 and Table 2), and 7 distinct isolates from other genera were also identified, including Raoultella electrica and Kosakonia radicincitans, which are also members of the family Enterobacteriaceae. These two species are closely related to Klebsiella and have been previously misclassified as Klebsiella33. In addition, we also isolated Serratia marcescens, Phytobacter diazotrophicus, Stenotrophomonas spp., and Bacillus safensis from Colombian sugarcane fields, all of which have previously been found in soils and associated with several plants, including sugarcane, and are opportunistic pathogens34,35,36,37.

Klebsiella are Gram-negative, facultative anaerobic bacteria that can be isolated from soils, plants, or water38. Klebsiella species have been isolated from a large variety of crops worldwide, such as sugarcane, rice, wheat, and maize38,39,40. Klebsiella species associated with plants have been shown to fix nitrogen and express other plant growth promoting traits12,39. Many of the bacteria previously isolated from sugarcane fields belong to the family Enterobacteriaceae, and Klebsiella species are abundant amongst the cultivable strains of Enterobacteriaceae obtained from sugarcane41. Klebsiella species present in sugarcane fields have been identified in several areas of the world. A survey of sugarcane in Guangxi, China found Enterobacteriaceae, especially Klebsiella, to be the most abundant plant-associated nitrogen-fixing bacteria41. The same group demonstrated that a nitrogen-fixing strain of K. variicola was able to colonize sugarcane and promote plant growth39. In Brazil, endophytic Klebsiella spp. have been isolated from commercial sugarcane, and their ability to produce plant growth promoting activity has been evaluated in vitro42. In Pakistan, the phenotypic diversity of plant growth promoting bacteria associated with sugarcane has been determined, with Klebsiella also appearing as one of the most abundant bacteria found43.

In this study, we developed a computational phenotyping approach for the screening of potential plant growth promoting bacteria that can serve as biofertilizers. Computational phenotyping entails the implementation of a variety of bioinformatic and statistical methods to predict phenotypes of interest based on whole genome sequence analysis44,45. This approach has been used for a variety of applications in the biomedical sciences: prediction of clinically relevant phenotypes, study of infectious diseases, identification of opportunistic pathogenic bacteria in the human microbiome, and cancer treatment decisions46,47. To our knowledge, this study represents the first-time computational phenotyping has been used for agricultural applications. To implement computational phenotyping for the prioritization of potential biofertilizers, we developed a scoring scheme based on the genome content of four functional gene categories of interest: nitrogen-fixing genes, other plant growth promoting genes, virulence factor genes, and antimicrobial resistance genes.

The results of the computational phenotyping predictions, confirmed by laboratory experiment, supported the potential use of some of the bacterial strains isolated from Colombian sugarcane fields as biofertilizers with minimal human health risk. This is particularly true for the isolates that show the higher scores (5.53–10.87, Fig. 4), all of which encode the potential to fix nitrogen and promote plant growth but lack many of the important known virulence factors and antibiotic resistance genes that can be found in clinical isolates of the same species. In general, isolates SCK7, SCK14, and SCK19 appeared to possess more potent plant growth promoting properties compared to isolates SCK9, SCK16, and SCK21 (Fig. 4). Our computational phenotyping scheme also has valuable negative predictive value. We discovered isolates that show few or none of the beneficial traits that characterized biofertilizers; Bacillus safensis SCK3 and Stenotrophomonas maltophilia SCK1 had the lowest scores (− 10 and − 11 respectively). Finally, it is also worth reiterating that the computationally predicted biochemical activities related to plant growth promotion were all validated by experimental results (Fig. 6).

A number of the nitrogen-fixing bacterial species isolated from sugarcane can also be opportunistic pathogens, which are microorganisms that usually do not cause disease in a healthy host; instead, they colonize and infect the immunocompromised host48,49. Although Klebsiella spp. exist in the environment and show plant growth promoting potential, they have also been associated with nosocomial diseases in humans17. Klebsiella spp. isolates causing hospital-acquired infections, primarily in immunocompromised persons, include Klebsiella pneumoniae, Klebsiella oxytoca, and Klebsiella granulomatis50.

The potential for virulence, along with the presence of antimicrobial resistance genes, is an obvious concern when proposing to use Klebsiella spp. as biofertilizers. Importantly, we found that the environmental Klebsiella isolates did not contain pathogenicity islands associated with many virulence factor genes usually found in clinical isolates of Klebsiella spp. (Fig. 2). These results are consistent with a recent study using whole genome sequences analysis16, which found that the Klebsiella michiganensis Kd70 isolated from the intestine of larvae of Diatraea saccharalis contains multiple genes associated with plant growth promotion and root colonization, but lacks pathogenicity islands in its genome. We also performed a broader comparison of the presence of virulence factors in the environmental isolates characterized here versus genomes of Klebsiella clinical isolates associated with opportunistic infections in humans along with a number of other environmental isolates that have available genome sequences (Fig. 5). The virulence factor profiles for all of the environmental isolates were clearly distinct from the clinical strains, which show uniformly higher virulence profile scores, underscoring the relative safety of Klebsiella environmental isolates for use as biofertilizers.

The results obtained from the computational phenotyping approach developed in this study serve as a proof of principle in support of genomic guided approaches to sustainable agriculture. In particular, computational phenotyping can serve to substantially narrow the search space for potential plant growth promoting bacterial isolates, which can be further interrogated via experimental methods. Computational phenotyping can be used to simultaneously identify beneficial properties of plant associated bacterial isolates while avoiding potentially negative characteristics. In principle, this approach can be applied to a broad range of potential plant growth promoting isolates, or even assembled metagenomes, from managed agricultural ecosystems.

We can also envision a number of other potential applications for computational phenotyping of microbial genomes. The computational phenotyping methodology developed here has broad potential including diverse applications in agriculture, plant and animal breeding, food safety, water quality microbiology along with other industrial microbiology applications such as bioenergy, quality control/quality assurance, and fermentation microbiology as well as human health applications such as pathogen antibiotic resistance, virulence predictions, and microbiome characterization. For instance, computational phenotyping could be useful in food safety related to vegetable crop production. Vegetables such as lettuce, spinach, and carrots are usually consumed raw, which increases the potential for bacterial infections or human disease outbreaks48,51. Vegetable plants and other crops harbor diverse bacterial communities, and the dominant families in these communities vary according to different variables, such as soil type and host genotype. The microbiome of the sugar cane crops studied here is dominated by the family Enterobacteriaceae, Gram-negative bacteria that include a huge diversity of plant growth promoting bacteria and enteric pathogens52,53,54,55,56. However, in Arabidopsis thaliana and other plants, gamma-Proteobacteria other than Enterobacteriaceae appear to dominate the plant-associated microbiome57.

Increasing antibiotic resistance, generated by the abuse of antibiotics in agriculture as well as medicine, is another major threat to human health58, and the food supply chain creates a direct connection between the environmental habitat of bacteria and human consumers59. Our computational phenotyping approach could provide for an additional food safety solution, which could be used to prevent the spread of antibiotic resistant pathogens present in the food chain.

Conclusions

A genome-enabled approach was developed for the prioritization of native bacterial isolates with the potential to serve as biofertilizers for sugarcane fields in Colombia’s Cauca Valley. The approach is based on computational phenotyping, which entails predictions related to traits of interest based on bioinformatic analysis of whole genome sequences. Bioinformatic predictions were validated through investigation of plant growth promoting traits with experimental assays in the laboratory, thereby demonstrating the utility of computational phenotyping for assessing the benefits and risks posed by bacterial isolates that can be used as biofertilizers. The quantitative approach to computational phenotyping developed here for the discovery of potential biofertilizers has broad potential applications for environmental and industrial microbiology, including potential use in food safety, water quality, and antibiotic resistance studies.

Methods

Sampling and cultivation of putative nitrogen-fixing bacteria from sugarcane

INCAUCA is a Colombian sugarcane company located in the Cauca River Valley in the southwest region of the country between the western and central Andes mountain ranges (http://www.incauca.com/). Samples of leaves, rhizosphere soil, stem, and roots were collected from the sugarcane fields designated as 32 T and 37 T of the INCAUCA San Fernando farm located in the Cauca Valley (3° 16′ 30.0″ N 76° 21′ 00.0″ W). Samples were collected with the permission of the United States Department of Agriculture and the Colombian Ministry of Environment and Sustainable Development, and all experiments were performed in compliance with institutional, national, and international guidelines. A high-throughput enrichment approach was developed to enable the cultivation of multiple strains of putative nitrogen-fixing bacteria from sugarcane field samples; details of this approach can be found in the Supplementary Material (Supplementary Methods and Supplementary Figure S1).

A total of 22 distinct nifH PCR + isolates that passed the initial cultivation and screening steps were grown in LB medium (Difco) at 37 °C for subsequent genomic DNA extraction. The E.Z.N.A. bacterial DNA kit (Omega Bio-Tek) was used for genomic DNA extraction, and paired-end fragment libraries (~ 1,000 bp) were constructed using the Nextera XT DNA library preparation kit (Illumina).

Genome sequencing, assembly, and annotation

Isolate genomic DNA libraries were sequenced on the Illumina MiSeq platform using V3 chemistry, yielding approximately 400,000 paired-end 300 bp sequence reads per sample. A list of all genome sequence analysis programs that were used for this study is provided Supplementary Table S4. Sequence read quality control and trimming were performed using the programs FastQC version0.11.560 and Trimmomatic (v.0.35)61. De novo sequence assembly was performed using the program SPAdes (v.3.6)62. Assembled genome sequences were annotated using the Rapid Annotations using Subsystems Technology (RAST) Web server63,64 and NCBI Prokaryotic Genome Annotation Pipeline (PGAP)65. The 15 Klebsiella isolates characterized in this way were briefly described in a Genome Announcement66, and the analysis here includes 7 additional non-Klebesiella isolates.

Comparative genomic analysis

Average Nucleotide Identity (ANI) was employed to assign the taxonomy of the bacterial isolates characterized here67,68. Taxonomic assignment was also conducted by targeting small subunit ribosomal RNA (SSU rRNA) gene sequences. Nitrogenase enzyme encoding nifH gene sequences were extracted from isolate genome sequences, clustered, and taxonomically assigned using the TaxaDiva (v.0.11.3) method developed by our group11. Whole genome sequence comparisons between bacterial isolates characterized here and the K. variicola type strain 342 were performed using BLAST + (v.2.2.28)69 and visualized with the program CGView (v.1.0)70. Details of the methods used comparative genomic analysis can be found in the Supplementary Methods section.

Computational phenotyping

Computational phenotyping was performed by searching the bacterial isolate genome sequences characterized here for the presence/absence of genes or features related to four functional classes of interest, with respect to their potential as biofertilizers: (1) nitrogen fixation (NF), (2) plant growth promotion (PGP), (3) virulence factors, and (4) antimicrobial resistance (AMR). Gene panels were manually curated by searching the literature (NCBI PubMed) for genes implicated in nitrogen fixation and plant growth promotion. The Virulence Factors Database (VFDB) was used to curate the virulence factor gene panel31. AMR levels were quantified using the PATRIC3/mic prediction tool71. A composite score was developed to characterize each bacterial isolate genome sequence with respect to the presence/absence of genes from the NF, PGP, and VF gene panels along with the predicted AMR levels. Details on the gene panels, AMR level, and the composite scoring system can be found in the Supplementary Methods.

Experimental validation

Predictions made by computational phenotyping were validated using five distinct experimental assays: (1) Acetylene reduction assay for nitrogen fixation activity, (2) Phosphate solubilization assay, (3) Siderophore production assay, (4) Gibberellic acid production assay, and (5) Indole acetic acid production assay. Details of each experimental assay can be found in the Supplementary Methods.