Proteins analogous to natural language: mere chance, fortuitous necessity, or intelligent design?

Wednesday, February 13, 2019

Grammar of protein domain architectures

Lijia Yu, Deepak Kumar Tanwar, Emanuel Diego S. Penha, Yuri I. Wolf, Eugene V. Koonin, and Malay Kumar Basu

PNAS published ahead of print February 7, 2019

Edited by Clyde A. Hutchison III, J. Craig Venter Institute, La Jolla, CA, and approved January 4, 2019 (received for review August 27, 2018).

Phylogenetic tree built from cross-entropy values. Domain bigram models were generated from 37 selected eukaryotic clades (Dataset S2) from the main branches of Eukaryota. The cross-entropies of bigram models were calculated in an all-vs.-all comparison. The entropy values were then normalized to create a distance matrix (see Methods for details), and the tree was constructed using the neighbor-joining method. The major groups are colored as shown in the legend.
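The first steps in the caption (bigram models per clade, all-vs.-all cross-entropy, a distance matrix) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the add-one smoothing, and the symmetrized "excess cross-entropy" used as a distance are all assumptions made for illustration, and the paper's own normalization is described in its Methods. Each clade is assumed to be a list of domain architectures, and each architecture a list of domain names.

```python
import math
from collections import Counter


def bigram_counts(architectures):
    """Count adjacent domain pairs over all architectures of one clade."""
    counts = Counter()
    for arch in architectures:
        counts.update(zip(arch, arch[1:]))
    return counts


def smoothed_probs(counts, vocab, alpha=1.0):
    """Additive smoothing (an assumption) so that no bigram has zero probability."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {bg: (counts.get(bg, 0) + alpha) / total for bg in vocab}


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits; H(p, p) is the entropy of p."""
    return -sum(p[bg] * math.log2(q[bg]) for bg in p)


def distance_matrix(clades):
    """All-vs.-all comparison of the clades' bigram models.

    The distance is the symmetrized excess cross-entropy
    (H(p,q) - H(p)) + (H(q,p) - H(q)), a hypothetical stand-in for the
    normalization used in the paper.
    """
    names = sorted(clades)
    counts = {name: bigram_counts(clades[name]) for name in names}
    vocab = sorted(set().union(*counts.values()))
    probs = {name: smoothed_probs(counts[name], vocab) for name in names}
    matrix = {}
    for a in names:
        for b in names:
            p, q = probs[a], probs[b]
            matrix[a, b] = (cross_entropy(p, q) - cross_entropy(p, p)
                            + cross_entropy(q, p) - cross_entropy(q, q))
    return names, matrix
```

The resulting matrix is symmetric with a zero diagonal, which is the form a neighbor-joining implementation expects as input; the tree-building step itself is left out of this sketch.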


Genomes appear similar to natural language texts, and protein domains can be treated as analogs of words. To investigate the linguistic properties of genomes further, we calculated the complexity of the “protein languages” in all major branches of life and identified a nearly universal value of information gain associated with the transition from a random domain arrangement to the current protein domain architecture. An exploration of the evolutionary relationship of the protein languages identified the domain combinations that discriminate between the major branches of cellular life. We conclude that there exists a “quasi-universal grammar” of protein domains and that the nearly constant information gain we identified corresponds to the minimal complexity required to maintain a functional cell.


From an abstract, informational perspective, protein domains appear analogous to words in natural languages in which the rules of word association are dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution. We employ a popular linguistic technique, n-gram analysis, to probe the “proteome grammar”—that is, the rules of association of domains that generate various domain architectures of proteins. Comparison of the complexity measures of “protein languages” in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at ∼1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appears to be cells simplified to the limit, and animals that display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. The results of this analysis bolster the analogy between genomes and natural language and show that a “quasi-universal grammar” underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell.
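To make the idea of information gain concrete, the following Python sketch compares the bigram entropy of a set of domain architectures with the average entropy of the same domains shuffled at random across proteins. It is only an illustration under stated assumptions: the start and end markers, the shuffling scheme (protein lengths preserved), the number of shuffles, and all function names are hypothetical choices, not the procedure reported in the paper.

```python
import math
import random
from collections import Counter


def bigram_entropy(architectures):
    """Shannon entropy (bits) of the bigram distribution, with boundary markers."""
    counts = Counter()
    for arch in architectures:
        padded = ["<s>", *arch, "</s>"]  # assumed start/end tokens
        counts.update(zip(padded, padded[1:]))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def shuffled(architectures, rng):
    """Randomly reassign domains to proteins, keeping every protein's length."""
    pool = [domain for arch in architectures for domain in arch]
    rng.shuffle(pool)
    it = iter(pool)
    return [[next(it) for _ in arch] for arch in architectures]


def information_gain(architectures, n_shuffles=100, seed=0):
    """Mean entropy of the randomized arrangements minus the observed entropy."""
    rng = random.Random(seed)
    random_h = sum(
        bigram_entropy(shuffled(architectures, rng)) for _ in range(n_shuffles)
    ) / n_shuffles
    return random_h - bigram_entropy(architectures)
```

A positive value means the observed architectures use fewer, more predictable domain pairs than a random arrangement of the same domains would. The figure of about 1.2 bits quoted above comes from the paper's own analysis and should not be expected from this toy measure.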

n-gram, bigram, protein domain, language, domain architecture