Biodiversity: the foundation of discovery, Part 1

To find new functional ingredients for the plant-based industry, we've been hard at work growing our protein database of natural sources. While a database of 450 million proteins is certainly impressive, read on to learn why we think it's time for you to pay more attention to what is arguably the heart of our database: biodiversity.

March 29, 2022

An Introduction to Biodiversity in Proteins

Here at Shiru, we discover proteins from plants and other natural sources that can serve as plant-based functional food ingredients. Our discovery process is fueled by data, and we have lots of it; our ever-growing database currently hosts almost 450 million protein sequences. Having a plethora of data, however, is necessary but not sufficient for driving our innovative discovery process. What we really need is biodiversity in our data.

Biodiversity refers to variation (diversity) in living organisms (biology) at multiple scales. In studying proteins, we are interested in a gradient of biodiversity scales: from the variety of proteins produced by one individual, to the divergent collections of proteins manufactured by different organisms. In an individual organism, there are a variety of different proteins. If one looks at a population of those organisms, one encounters even more proteins, as there are variations on that initial individual's collection of proteins in that population. Expand the scope of inquiry even further, to different branches in the tree of life, and one finds more distant variations, or homologs, of those proteins, as well as completely distinct proteins that are unique to disparate lifeways and niches. By aggregating protein sequences across this gradient of biodiversity scales, we can assemble a protein space that hosts gradients of similarity and connectivity between proteins. Such a protein space can then serve as an excellent foundation for searching out patterns in protein sequences that relate to functional properties of interest, such as gelation or emulsification. This protein space can also capture insight into how proteins have evolved in convergent, divergent, and conserved fashions, which may further be leveraged to inform predictions of protein structure and function.

Protein Richness across the Taxonomic Hierarchy of Corn

One measure of protein diversity is protein richness, or the number of unique proteins attributable to a particular organism or group of organisms. As an example of how protein richness varies across taxonomic rank, let us examine a broadly familiar individual: Zea mays, commonly known as maize or corn. Zea mays B84 is a founding line of inbred stiff-stalk corn, and is an example of a sub-specific taxonomic grouping, which is the lowest taxonomic ranking we have access to in our data. Sub-specific taxonomic groupings can fall under an array of different names in plants: subspecies, variants, cultivars, breeding lines, etc. In our data, we have 41,436 (~4x10^4) unique proteins attributable to Zea mays B84 (Fig. 1, denoted with a species count of 10^-1, due to its sub-specific rank), which is consistent with a reference corn proteome (~40,000 proteins). Due to corn's ubiquity in American industrial agriculture, our protein dataset contains proteins from a number of different corn lines. When one expands the query to the species level and looks at all Zea mays proteins currently available to us, one sees an almost order-of-magnitude increase in proteins (386,177, or ~4x10^5; roughly 9.3 times the B84 count). This notable expansion in unique proteins is a testament to the intense interest in corn breeding for the development of different useful phenotypes, an interest that dates back millennia in the Americas and has been pursued with greatly increased ferocity in modern agritechnological contexts. This focus on corn is further highlighted when one ascends to the next rank in the taxonomic hierarchy.
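To make the definition of protein richness concrete, here is a minimal Python sketch that counts unique sequences per taxon label. The record layout, the function name `protein_richness`, the "other line" label, and the toy sequences are all assumptions made for illustration; this is not a description of how our database is actually queried.

```python
from collections import defaultdict

def protein_richness(records):
    """Count unique protein sequences per taxon label.

    `records` is an iterable of (taxon, sequence) pairs. This layout is a
    hypothetical simplification, not the schema of any real database.
    """
    unique = defaultdict(set)
    for taxon, sequence in records:
        unique[taxon].add(sequence)  # sets drop exact duplicate sequences
    return {taxon: len(seqs) for taxon, seqs in unique.items()}

# Toy records with made-up sequences (not real corn proteins).
records = [
    ("Zea mays B84", "MKTAYIAK"),
    ("Zea mays B84", "MKTAYIAK"),  # exact duplicate, counted once
    ("Zea mays B84", "MSLLTEVE"),
    ("Zea mays (other line)", "MKTAYIAK"),
    ("Zea mays (other line)", "MGDVEKGK"),
]
richness = protein_richness(records)
```

Note that this sketch treats two proteins as the same only when their sequences match exactly. Any looser notion of "unique" (for example, clustering near-identical sequences) would give different counts.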

Figure 1 - Protein count as a function of species count in the Corn Taxonomic Hierarchy

In biological contexts, taxonomy refers to our attempts at hierarchically organizing evolutionary relationships between organisms. Canonically, the ranks in the taxonomic hierarchy are (from highest to lowest): domain, kingdom, phylum, class, order, family, genus, and species. These ranks are, naturally, a reductionist view of the complex structure of evolutionary relationships, and so are often expanded to various degrees to capture additional complexity. An example we discussed earlier is sub-specific organization, e.g., subspecies and variants. One also encounters inter-rank groupings, which can include those explicitly tied to canonical ranks (e.g., superorders, subfamilies), those with consistent positioning between ranks but no explicit tie based on naming (e.g., tribe, which falls between family and genus), and those that are not indicative of rank at all (namely, clade). Further complexity arises because taxonomic ranks were traditionally assigned based on similarities in phenotype, or physical characteristics. The increasing accessibility and coverage of genotypic and genomic data has revealed a widespread need for reorganization if taxonomy is to correctly represent evolutionary relationships. Because taxonomic entities are numerous and many branches of the tree of life are understudied to varying degrees, we can expect taxonomy to remain a dynamic and approximate descriptor of organismal relatedness well into the future.
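One way to picture how richness accumulates up a taxonomic hierarchy is to store each taxon's lineage as an ordered list and credit every sequence to all of its ancestors. The sketch below does this in Python. The lineage names come from this post, but the `LINEAGES` dictionary, the function `richness_by_rank`, and the sequences are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Lineages listed from lowest to highest grouping, using names from this post.
# The dictionary layout is a hypothetical illustration.
LINEAGES = {
    "Zea mays B84": ["Zea mays B84", "Zea mays", "Zea", "Tripsacinae",
                     "Andropogoneae", "Poaceae"],
    "Tripsacum": ["Tripsacum", "Tripsacinae", "Andropogoneae", "Poaceae"],
}

def richness_by_rank(records, lineages):
    """Credit each unique sequence to every ancestor of its source taxon."""
    unique = defaultdict(set)
    for taxon, sequence in records:
        for ancestor in lineages[taxon]:
            unique[ancestor].add(sequence)
    return {name: len(seqs) for name, seqs in unique.items()}

# Made-up sequences for illustration only.
records = [
    ("Zea mays B84", "MKTAYIAK"),
    ("Zea mays B84", "MSLLTEVE"),
    ("Tripsacum", "MKTAYIAK"),  # identical to a corn sequence, counted once from Tripsacinae upward
    ("Tripsacum", "MPQRSTVW"),
]
counts = richness_by_rank(records, LINEAGES)
```

Because each group's set of sequences contains the sets of its descendants, richness can only stay flat or grow as one moves up a lineage. How much it grows depends on how many new sequences the added taxa contribute.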

Prior to defining taxonomy, we discussed that moving from an individual line of corn to corn as a species resulted in an approximately order-of-magnitude increase in the available protein sequences, and that this was likely a function of intensive corn breeding programs, especially in modern industrialized agriculture. This hypothesis is further supported when we ascend through subsequent taxonomic ranks (Fig. 2). As we move from the species Zea mays up to its genus Zea, our scope expands to include the four additional congenerics of Zea mays (5x10^0 species), but our protein count increases by just under 2,000 (total of ~4x10^5). That is far fewer than we would expect from four additional species, given that a typical plant genome tends to code for tens of thousands of proteins. As close relatives of corn, Zea spp. have received some attention, likely aimed at identifying resistance or other genes that could be of interest in hybrid corn crosses, but the volume of data we have been able to aggregate on these lesser-known species suggests that this attention has been very limited in scope. We see this pattern continue with inspection of Tripsacinae (~2x10^1 species, ~4x10^5 proteins), the subtribe within which Zea is found. We have so far only come across about 150 additional protein sequences that are attributable to Tripsacinae entities not of the Zea genus. The Tripsacum genus and its 11 attributable species, also known as the Gamagrasses, make up the rest of the subtribe. Since Gamagrasses have limited commercial interest, there exists limited protein data, an unfortunate reality shared by many non-crop plants. Tripsacum undoubtedly has some fascinating proteins worthy of study, but many remain a mystery to us at this time. As we proceed next to the tribe Andropogoneae, and an expanded view encompassing 505 (~5x10^2) species, we see a near doubling of protein sequences available (total of ~7x10^5). While a modest expansion in relation to the number of species added, this notable increase in proteins is likely attributable to the inclusion of two more crops of commercial importance: sugarcane and sorghum.

We shall terminate our progression up the corn taxonomic hierarchy at the rank of family: Poaceae, or the family of grasses. While not the most species-rich family in the plant kingdom (~6,700 [~7x10^3] by NCBI's count), Poaceae hosts the largest selection of proteins in our current database with 4.3 million (4.3x10^6)! While this may at first be surprising, it is important to note that what are commonly referred to as grasses comprise only a subset of the grass family. Along with the aforementioned corn, sugarcane, and sorghum, other notable members of the grass family are wheat, rice, barley, and bamboo.

Figure 2 - Abbreviated Taxonomic Hierarchies for Corn and Sunchoke

Protein Richness across the Taxonomic Hierarchy of Sunchoke

Compared to corn, Helianthus tuberosus, the sunchoke, is considerably less well known. The sunchoke is a tall perennial with golden flowers that is native to North America and produces a tasty tuber that can be prepared like other root vegetables. As it is not a common staple in most kitchens, it is perhaps unsurprising that it does not appear to be a common focus of scientific inquiry. We have access to little more than 500 protein sequences from the plant (Fig. 3; ~5x10^2 proteins). The Helianthus genus is a notably different story: it comprises 54 (~5x10^1) species with over 110,000 (~1x10^5) available protein sequences. The richness of protein data in this genus is courtesy of its most famous member, our thirst for its oil, and our hunger for its seeds. Greater than 98% of the protein sequences we have from Helianthus are from Helianthus annuus, the common sunflower. Proceeding upwards to the tribe Heliantheae, the expansion to 708 (~7x10^2) species results in a relatively modest increase of ~40,000 proteins (total: ~2x10^5 proteins). We then ascend to the subfamily Asteroideae, and leap to over half a million (~5x10^5) proteins across over 8,000 (~8x10^3) species. This subfamily is significant as its plant members range from the delicious (tarragon) and libatious (wormwood, a key ingredient in absinthe), to the medicinal (Artemisia annua is the organismal source of artemisinin, a small molecule with large importance in the treatment of malaria). We will again terminate at the family rank, which in this case is Asteraceae, the daisy family. This step up in rank involves another expansive increase in species and protein sequences, nearing 700,000 (~7x10^5) sequences across almost 13,000 species (~1x10^4).

Figure 3 - Protein and species counts in Corn and Sunchoke Taxonomic Hierarchies

We have thus far explored protein diversity across taxonomic rankings only as measured by protein sequence richness, or the counts of unique proteins per rank. This approach has revealed two key aspects of our data. First, when more species are considered, protein richness increases. Second, there is a sampling bias that skews our pool of proteins toward sequences originating from taxonomic groups of greater human interest and interaction. While bias is something that we generally like to avoid in our data, this particular bias works in our favor at Shiru, as it enriches our search space with proteins from organisms that we already know that we love to eat! In other words, it increases our probability of finding an innovative new use for a protein that can otherwise be found in our home vegetable gardens or down the produce aisle.
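A rough way to see this sampling bias is to divide the protein count by the species count at each rank. The Python sketch below uses the approximate figures quoted in this post. The Heliantheae protein total (110,000 plus ~40,000) and the rounding of the other totals are assumptions made for illustration, and the table layout and function name are hypothetical.

```python
import math

# (rank, taxon, approximate species count, approximate protein count).
# Values are rounded from the figures quoted in this post; the Heliantheae
# total of 150,000 is an assumed sum of ~110,000 and ~40,000.
COUNTS = [
    ("species", "Zea mays", 1, 386_177),
    ("tribe", "Andropogoneae", 505, 700_000),
    ("family", "Poaceae", 6_700, 4_300_000),
    ("species", "Helianthus tuberosus", 1, 500),
    ("genus", "Helianthus", 54, 110_000),
    ("tribe", "Heliantheae", 708, 150_000),
    ("subfamily", "Asteroideae", 8_000, 500_000),
    ("family", "Asteraceae", 13_000, 700_000),
]

def proteins_per_species(counts):
    """Average number of available protein sequences per species in each taxon."""
    return {taxon: proteins / species for _, taxon, species, proteins in counts}

per_species = proteins_per_species(COUNTS)
for rank, taxon, species, proteins in COUNTS:
    print(f"{taxon:22s} {rank:10s} 10^{math.log10(species):.1f} species  "
          f"{per_species[taxon]:>10,.0f} proteins/species")
```

Under these rounded inputs, the average for corn as a species is several hundred times that of sunchoke, and the grass family averages roughly an order of magnitude more sequences per species than the daisy family. A per-species average hides how unevenly sequences are spread within a group, so it is only a coarse indicator.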

In the following post, we will explore how our understanding of protein diversity deepens when we move beyond simple protein richness.

Note on species counts

The species counts per taxonomic rank used in this blog post were determined via NCBI Taxonomy data.
