Computational investigation of the sequence context of arginine/glycine-rich motifs in the human proteome

Human RG motif statistical comparisons reveal significant differences in physicochemical properties

To understand important characteristics or properties of an RG motif, their properties were compared between a positive (“functional”) set and a negative (“nonfunctional”) set. These two groups were defined by classifying human proteins containing at least one RG motif (pattern definition shown in Fig. 1A) into positive and negative sets on the basis of phase separation prediction and NA-binding annotation. The positive set (193 proteins) included those predicted to phase separate and annotated with at least one NA-binding GO term, whereas the negative set (230 proteins) lacked both features (see Methods for details).

Fig. 1

A Venn diagram of protein sets with predicted phase separation characteristics, nucleic acid binding and RG motif-containing proteins, which defines the positive and negative subsets. B-I General comparative analysis between RG proteins from the positive set (green) and the negative set (red) mIDR (motif-containing IDR): IDR in which the RG motif is located, oIDR (other IDR): IDR without any RG motif.

While the average length of the RG motifs seems to be greater in the positive set (see Fig. 1B), the percentage of amino acids, which are neither arginine nor glycine (here called impurity), does not differ between the groups (see Fig. 1C). This underlines the problem of defining a clear RG motif, since impurities do not affect the function as much as they do for structured regions.

The number of RG duplets vs. the number of RGG triplets reveals an opposing image, where the number of RGG triplets is greater in the positive group, whereas more RG duplets can be found in the nonfunctional group (Fig. 1D, E). Notably, only the RG duplets that do not have a glycine residue in the following position are counted because that would create an overlap between the RG duplet count and the RGG triplet count. Although there are well-studied RG motifs consisting largely of RG duplets (314 isoforms found with a tri-RG motif [3]), they seem to occur less in the positive dataset than in the negative dataset.

Compared with those in the negative dataset, the proteins in the positive dataset were significantly longer (Fig. 1F), suggesting that functional RG motifs are preferentially found in larger proteins. While this relationship has not been systematically reported, our results indicate a novel link between RG motif functionality and host protein length. Furthermore, the number of intrinsically disordered regions (IDRs) per protein was significantly greater in the positive dataset (Fig. 1G), reinforcing the established link between RG motifs and disordered protein domains. IDRs provide structural flexibility, facilitating transient interactions with nucleic acids and other biomolecules, which is a hallmark of RG-containing RNA-binding proteins [60]. This, of course, is directly related to host protein length; however, it has never been directly shown.

The absolute length of IDRs was also significantly greater in proteins from the positive dataset (Fig. 1H). When distinguishing between motif-containing IDRs (mIDRs)—disordered regions that contain at least one RG motif—and other IDRs (oIDRs)—disordered regions that lack an RG motif, mIDRs are significantly longer in the positive dataset than in the negative dataset, suggesting that functional RG motifs tend to appear within extended disordered regions. This association is however correlative, since there is no sign that longer mIDRs cause increased motif functionality. For oIDRs, we also observed a length increase, but this increase was not as strong (Fig. 1H). Interestingly, the relative length of IDRs (expressed as a percentage of total protein length) did not differ significantly between the two datasets, indicating that while RG-containing proteins tend to have longer IDRs, the proportion of disordered content within a given protein remains consistent. This can also be observed for the oIDRs. The effect that remains consistent across an absolute or relative perspective is that mIDRs are generally much longer than oIDRs are, regardless of the set. These observations support the hypothesis that RG motifs are functionally linked to protein disorder and are preferentially embedded within longer IDRs, where they may contribute to RNA-binding and phase-separating functions [61]. However, we notice that the negative subset also has longer mIDRs than oIDRs so the embedding of RG motifs in longer IDR stretches seems to be independent of their functionality but clearly not independent to the existence of an RG-motif. Furthermore, no significance in the positioning of the RG motif within the IDR was found, suggesting that there is no generally necessary position of the motif within the IDR (see supplementary material S3). This result is consistent with a general lack of positional bias found for all types of compositionally biased regions within the IDRs of human proteins [62].

Taken together, the general findings resulting from the comparison of the RG motifs in the positive and negative datasets confirm many previous independent observations about RG motif functionality. Beyond this, we aimed to underline the validity of the separation of the RG-rich proteins into positive and negative sets, by comparing the number of methylations in and around the RG-motifs between the two sets. We observed that there was a significant enrichment of Omega-N-methylarginine and asymmetric dimethylarginine in the positive set (p-values 2.4e-08 and 2.0e-07 respectively, Mann–Whitney U test), while other methylation types do not appear at all in the negative set. Since arginine methylation is a well-known regulation mechanism in RG-motifs, we conclude that the approach presented here is valid.

RG motifs appear along a large variety of domains and maintain specific distances from them

RG motifs can act as functional elements on their own within a protein, but they can also enable, fine-tune and enhance the function of structural domains in a protein. For example, the RG motif in Scd6 (the yeast homolog of LSM14A) is solely responsible for translational repression [63], whereas the RG motifs in FMRP, hnRNPU or FUS increase binding to RNA and fine-tune the RNA-binding specificity of those proteins [6, 64].

In the positive set of proteins, we observed 25 proteins (out of 193) without any domain annotated to it, suggesting that the RG motif could be the only RNA binding element and solely responsible for the nucleic acid binding character (see Fig. 2A). Most RG motifs co-occur with domains, indicating that the RG motif often functions in combination with structured domains to enable, enhance or fine-tune certain functions. The finding that a much larger share of proteins in the negative set did not have any annotated domains in the protein (111 proteins out of 230) suggests that functional RG motifs functioning on their own are most likely less frequent and rare (see Fig. 2B).

Fig. 2

A-B Distribution of the number of domains per protein for the positive and negative sets. C-D Domain types sorted by the number of occurrences in the positive and negative sets, respectively. Domains with fewer than 5 occurrences are not shown. E–F Top 10 most common GO term types for the annotated domains of the positive and negative sets, respectively. G Distributions of the distances between an RG motif and annotated domains that appear together at least 10 times. A cross indicates a case where a protein has exactly 1 domain and 1 RG motif annotated.

Owing to the functional filtering of the positive dataset (see Sect. 2.2), some very well-studied RNA/DNA-binding domains, such as the RNA recognition motif (RRM), homeobox domains and a subset of the zinc-finger domain family, appear in the positive set of the human RG proteome, whereas they are missing in the negative set (see Fig. 2C, D). However, the amount of domain variety that appears is surprising, since in 168 proteins in the positive dataset with at least one domain, we find 149 distinct domains appearing 330 times in total. Over 50% (168 out of 330) of the domain appearances in the positive dataset are not annotated with any GO term, which underlines the missing knowledge about functionality in the RG proteome (see Fig. 2E). However, the case is even more extreme in the negative dataset—170 different domains appearing 258 times, with almost 60% (146 out of 258) being of unknown function—which might be due to many annotated domains not being properly understood in their function (see Fig. 2F). Understanding the function of the domains might help understand exactly how the RG motif – provided it is a functional motif – works collectively with the domain.

Since an RG motif—by our definition (see Methods)—exists in a disordered region, an RG motif that works collectively with a domain does not have to be directly adjacent to the domain but could theoretically, owing to the flexibility of the disordered region, be located apart from the domain. Many well-studied RG proteins have relatively large sequence distances between the domains and the RG motif. FUS has an RG motif that is 40 residues away from the zinc-finger domain and over 100 residues apart from another RG motif. The heterogeneous nuclear ribonucleoprotein A1 (hnRNPA1) motif contains an RG motif that is 40 residues apart from the first RRM and more than 100 residues apart from the second RRM.

When comparing distances between domains and RG motifs, we observed the distance values clustering together specific to their domain types (see Fig. 2G). While no clusters or patterns emerge when comparing the entirety of domain‒motif distances across the datasets, they can be detected when looking at the domains individually, suggesting that the distance is related to the specific functionality of the domain.

While some domains show clear singular clusters (e.g., K homology domain (type 1), Sam68 (tyrosine-rich) domain or KHDRBS (Qua1) domain), others show a more homogeneous distribution (e.g., the zinc-finger types), but with visible differences in the overall proximity to the motif. Additionally, some domains appear (almost) exclusively in either the N- or C-terminus of the motif, whereas others seem symmetrically distributed between the N- or C-termini. A clustering of distances across unrelated proteins suggests that there is either a functional connection between the domain and the motif and/or an evolutionary reason for the consistency. If that is the case, a “preference” of a side (N-terminal or C-terminal) suggests that there is some mechanistic association between the two elements.

An analysis of the domains grouped by annotated function (GO terms) did not reveal any patterns or clusters. This finding indicates that the positioning of the RG motif relative to a domain is not correlated with the overall function but rather, more specifically, with the specific mechanism of a certain domain type. The residues that separate the RG motif and the domain type could act as linker elements and, depending on the functional mechanism, are optimized to reach a certain length.

This domain analysis in the context of RG motifs shows that distances between RG motifs and domains could be a useful property to determine whether an RG motif interacts functionally with a domain.

Amino acid composition analysis reveals biases and properties of “true” RG motifs

Composition analysis of disordered regions has been a difficult area of research because of the naturally high variance in sequences within these regions. Here, we attempt a systematic approach by comparing the amino acid compositions of the positive and negative protein sets to identify biases or trends, despite the high variance of IDRs.

First, we compared the amino acid composition of the RG motif of both datasets and revealed that phenylalanine (F), aspartic acid (D), and asparagine (N) increased by more than 1% in the positive set compared with the negative set, whereas tyrosine (Y) and methionine (M) increased slightly less than 1% (see Fig. 3A). Notable decreases of more than 1% are only observed for alanine (A) and proline (P), with leucine (L) barely missing the 1% mark. As expected, arginine and glycine are more common than average in all human IDRs, but this is also the case for tyrosine and phenylalanine. These results are not consistent with classical amino acid groupings (aromatics, positively charged, negatively charged, uncharged and hydrophobic), especially since tryptophane (W, as the third aromatic amino acid) is slightly decreased, whereas glutamic acid (E, as the second negatively charged amino acid) and histidine (H) and lysine (K) (the two other positively charged amino acids) barely change. Tyrosine and phenylalanine have previously been associated with RG motifs [4, 65].

Fig. 3

A Changes in amino acid frequency between the positive and negative sets of motifs, grouped by amino acid type. The dashed bars indicate the average amino acid proportions calculated for all human IDRs. B-E Comparative analysis of the NCPR, aromaticity, hydrophobicity and disorder-promoting residue fraction of the motif and the regions of 10 residues, N-(left) and C-terminal (right) of the motif.

Arginine and glycine content differences are also shown in the graph, and while the arginine content is lower, the glycine content is higher in the positive set, which is consistent with the finding that RGG triplets are more prevalent in the positive set than are RG duplets.

In general, these results raise the question of the extent to which the chemical properties of charge, aromaticity and hydrophobicity play a role in affecting the functionality of the RG motifs and whether other (usually more complex) properties, such as the propensity to form secondary structures, or the chemical groups of the amino acid side chains (amino groups, guanidium groups, etc.) are more important for their role in RG motifs.

To answer these questions, we compared the amino acid properties (net charge per residue (NCPR), aromaticity, hydrophobicity and fraction of disorder-promoting residues) in and around the motif. Here, we observed no significant differences in NCPR directly around the motif; however, only in the motif itself was the NCPR slightly greater in the negative dataset, possibly because of the greater percentage of arginine (Fig. 3B). This finding strongly suggests that the charge itself is most likely not the main driving force behind a functioning RG motif or at least not enough to explain their full function, which has been indicated in past works by observing loss of function through mutation of arginines to lysines [35].

The aromaticity shows a very strong and uniform signal of being more prevalent in the positive set, which is underlined by the findings concerning tyrosine and phenylalanine mentioned above (Fig. 3C). The importance of aromaticity (not including tryptophan, which was found more often in the negative set; Fig. 3A) is well known for RG motifs, but it seems that the presence of aromatic compounds extends far beyond the motif itself.

Despite the differences in the frequency of aromatic residues, the hydrophobicity of the adjacent regions is reduced in the positive set, suggesting that a large increase in hydrophobicity directly around the motif might affect how well the RG motif is available to its hydrophilic environment (Fig. 3D). Strong hydrophobic forces in sequences usually appear inside folded domains, facilitating the folding and exposition of hydrophilic residues to the surface. If strong hydrophobic forces surround the motif, the RG motif could be hidden by hydrophobic residues clustering together around the RG motif, thus making it difficult for the motif to be accessed by interacting partners or domains, which would effectively limit their functionality and explain the prevalence in the negative dataset.

Therefore, we also analyzed the fraction of disorder-promoting residues (Threonine, Alanine, Glycine, Arginine, Aspartic Acid, Histidine, Glutamine, Lysine, Serine, Glutamic Acid and Proline are considered disorder-promoting according to [59]). Inside and closely around the motif, a lower fraction of disorder-promoting residues can be found in the positive set (Fig. 3E). This could suggest that possible secondary structures might arise under certain conditions. Since RG motifs are regions of low complexity, no general structure has yet been defined for RG motifs. However, in the RG motif of nucleolin (NCL), which is rich in RGGF repeats, repeated β-turns are the major structural component that is observed [66]. In fragile X mental retardation protein (FMRP), arginines at positions 533 and 538 have been shown to form intramolecular contacts with the RNA duplex-quadruplex junction [67]. Notably, the residue at position 532, which is directly adjacent to the first arginine, is a phenylalanine. Additionally, RNA-binding protein EWS (EWSR1) is associated with G-quartets and contains many phenylalanine residues within its RG motif [68]. Thus, we find evidence of substructures in RG motifs, which should be evaluated more closely.

To expand the analysis, we also examined the amino acid composition of the entire protein, which was separated into regions with different structural propensities and relationships with the RG motif (Fig. 4A, B). We differentiated between 4 regions: the actual RG motif (1), the motif-containing IDR (mIDR), (2), other IDRs (oIDRs) in the protein (3) and structured regions (4). Tyrosine (Y) and asparagine (N) seem to be enriched over the entire protein, which could be associated with their functions, for which they were selected. The protein composition depends on the context, including the subcellular location [69]. Phenylalanine (F) is enriched only in the direct vicinity of the RG motif or in the motif itself, therefore showing a different image than tyrosine (Y). Aspartic acid (D) stands out, especially since its negative charge can inhibit or promote phase separation depending on the sequence context, particularly in relation to arginine-rich motifs, as well as the overall charge patterning of intrinsically disordered regions [65, 70, 71]. Also visible is the enrichment of lysine everywhere except in the RG motif itself. Lysine has been shown to have a weaker phase separation propensity and is outcompeted by arginine for negatively charged partners [72]. Furthermore, acetylated lysine can even reverse lysine-driven phase separation [73, 74], suggesting that lysine should not appear within RG motifs, which is also visible in the figure. However, the enrichment of lysines outside of the RG motif in both ordered and disordered parts of the protein suggests a possible further role of this motif in terms of RG-rich proteins. Finally, we note the enrichment of methionine, which has been associated with a role in regulating LLPS [75, 76].

Fig. 4

A Positional enrichment heatmap of the amino acid composition of the positive vs. negative subsets (positive log. fold change (green): enrichment in pos. dataset; negative log. fold change (red): depletion in neg. dataset) when separated into 4 regions, as shown schematically in (B). C-I Amino acid compositions at the residue level of the regions left and right of the motif acquired through a sliding window approach with error bars as transparent background areas and the average proportion of each amino acid calculated for all human IDRs as dotted lines.

Significant amino acid depletion does not occur directly in the motif; however, the observed fold changes indicate depletion of a group of hydrophobic residues (most notably tryptophan) but is not statistically significant, most likely due to the small sample size. Cysteine, leucine, alanine, threonine and proline are significantly depleted in mIDRs, which fits previous observations, especially the lower hydrophobicity around the motif in the positive set.

Some other signals could lead to interesting yet unknown insights into the RG motif context. For example, we do notice an enrichment of isoleucine (I) in oIDRs. Additionally, as mentioned above, lysine was enriched throughout the entire protein except for the RG motif itself. Similarly, glutamic acid is enriched in the structured parts of the protein. The protein may balance out functionally similar amino acids (such as D and E, R and K or I and L), and since aspartic acid and arginine are used heavily in and around the motif, glutamic acid or lysine are used more frequently in other areas of the protein to correct the overall imbalance of usage in other parts of the protein. There are studies that show that there is selection pressure in organisms not only to use low-cost amino acids but also to balance out the usage of certain amino acids, since the availability of heavily used amino acids will be lower and, therefore, the synthesis of the protein overall will be lower [77]. However, it is also possible that the increase in these amino acids could imply functional aspects.

For a more detailed composition analysis, we specifically looked at the amino acid compositions around the motif at the residue level (Fig. 4C-I). We applied a sliding window method (see Methods; Chapter 2.3). These results show that the two aromatic residues (F and Y), which we previously identified as strong signals, manifest very different profiles. While phenylalanine is strongly enriched (fold change of 1.5) in the motif and less enriched but still significantly enriched in the mIDR, the phenylalanine signal outside the motif ends after 10 residues left or right of the motif. Tyrosine, however, seems to be even more enriched outside the motif but weaker in the RG motif itself. This is surprising considering the similar chemical properties of the two amino acids and given that both have been strongly associated with RG motifs and are necessary for function [65, 78, 79].

Another interesting fact we observed is that in the case of asparagine (Fig. 4H), the average proportion shown is comparable to the human IDR average, but the average proportion in the negative dataset is far below the human IDR average. This is in contrast with tyrosine, for example, where the average proportion in the negative set is close to the human IDR average of tyrosine, and the average proportion is much greater (Fig. 4F). Aspartic acid and phenylalanine show mixed images (Fig. 4I and G, respectively). It is unclear what this result could imply at this stage, however, this could provide an interesting and fruitful hypothesis to test.

Additionally, the frequency profile of leucine seems intriguing since the composition of the positive and negative sets is not different directly adjacent to the motif but starts diverging only after approximately 20 residues outside the motif (Fig. 4E). This could reflect constraints on the composition of the sequences surrounding the RG motif associated with the physicochemical properties of the motif independent of function.

Notably, the arginine and glycine proportions are strongly above the human IDR average for both the positive and the negative sets, even outside of the actual motif, particularly for arginine (see supplementary material S4). This strongly suggests that the actual RG motif might extend far beyond the definition used in this work and previous definitions. It is more likely that a whole region must be considered for its functionality, and it would be vital to determine if and where the functional cutoff in terms of arginine and glycine content would be.

In the following chapter, we conducted deeper computational analysis on one of the clearest signals, which are the phenylalanine and tyrosine enrichment profiles.

Opposing tyrosine and phenylalanine profiles in RG motifs suggest different roles of the two aromatic residues

The clearest signals that could be observed in the amino acid composition analysis were the signals of F/Y and how they affected the entire aromaticity of the motif and the surrounding region (up to 30 residues in the N- and C-termini of the motif) to be enriched in the positive dataset. The role of aromaticity in the RG motif is well known; however, the difference between these two aromatic compounds, in addition to the possible phosphorylation of tyrosine, is unclear. Thus, we applied further computational research.

To understand the relationship between phenylalanine and tyrosine frequencies, we overlapped all proteins containing at least 5 tyrosines/phenylalanines in the RG motifs or the surrounding regions of 30 N- and C-terminal residues (analogous to the analysis in Fig. 4C–I) to determine whether there was any overlap between the proteins with “phenylalanine-rich” and “tyrosine-rich” motif regions (see Fig. 5A). The overlap is minimal, and only one motif region contains at least 5 residues of both tyrosine and phenylalanine. This small overlap created the opportunity to perform an enrichment analysis to find possible different functions for the distinct sets of proteins containing either of these regions.

Fig. 5

A Venn diagram of groups with at least 5 tyrosine (pink) and at least 5 phenylalanine residues (blue) in their motif region. A motif region is defined as the motif itself and the 30 residues left and right of the motif. B-D Enrichment analysis of the 2 protein sets from (A) in terms of GO biological processes, molecular functions and cellular components. The labels that are unique to either the tyrosine or phenylalanine set are marked in bold. E- F Correlation matrices for selected amino acids for the motifs of the positive and negative sets, respectively

Indeed, we detected differences in the biological processes, molecular functions and cellular components of the two protein sets via GO term enrichment analysis (Fig. 5B, C and D, respectively). While the tyrosine set seems to show a much stronger connection to spliceosome-related processes (four unique spliceosome-related processes and generally lower p values) and is associated with the cellular component “U12-type spliceosome complex”, the phenylalanine set is rather involved in nuclear body organization, with PML body organization actually being the most significant biological process visible, as are cellular components such as P granule (and its subcomponents piP-body and pi-body), stress granules and the nucleolus. This clear distinction could suggest unique functions of phenylalanine-rich motif regions versus tyrosine-rich motif regions and provides a strong case for further experimentation.

To further support the notion of a potential regulatory role of tyrosines within or adjacent to RG motifs, we examined the occurrence of phosphotyrosine sites in the positive and negative sets. In the negative set (230 proteins), only four phosphotyrosine sites were detected within mIDRs, whereas the positive set (193 proteins) contained 21 such sites. This represents a more than five-fold increase in the positive set, suggesting that tyrosines in the vicinity of RG motifs are preferentially phosphorylated and may contribute to regulatory functions. Importantly, this effect persists even after normalizing for the total length of mIDRs, which is higher in the positive set (37,157 residues) compared to the negative set (19,556 residues), although the enrichment is somewhat reduced.

Furthermore, we compared the Pearson correlation coefficient between the occurrences of the amino acids within the RG motifs. In addition to an obvious and expected correlation between arginine and glycine (0.82 in the positive set and 0.55 in the negative set), the only amino acid with which we observed any correlation was phenylalanine, with both glycine and arginine values of 0.73 and 0.55, respectively (Fig. 5E–F; full correlation matrices in supplementary material S5). These findings suggest that phenylalanine can play a role in RG motifs, possibly by emerging as a pattern together with arginine and glycine. One pattern that has already been mentioned and appears in well-known RG motif-containing proteins, such as FUS, EWS and TAF15 (FET family), is the RGGF tetramer. Since tyrosine shows no correlation, this result again underlines that it manifests in a different way than phenylalanine, but is still very prevalent in RG motifs and surrounding regions.

Computational investigation of the sequence context of arginine/glycine-rich motifs in the human proteome | BMC Genomics

Tags: