We present an ANN model that uses novel fusion features (sequence-based and structural) to predict protein-RNA binding sites, intended to serve as a better scoring function for protein-RNA docking MD. In contrast to most conventional machine learning approaches, which focus mainly on protein sequences, our model also incorporates structural features of proteins and achieves better predictive power when tested on a set of benchmarks. At the sequence level, equivalent residues within a protein family that are crucial for RNA binding are generally more conserved during evolution than the remaining residues. Although this sequence information is useful to a certain degree, it is not sufficient for precise prediction. More importantly, an RNA binding site often requires a group of residues that are close in space but possibly far apart along the sequence, which together form a specific milieu favorable for RNA contacts. Sequence-level and structure-level features can therefore be mutually supportive. After investigating the structural features of RNA-binding residues, we identified several features with better discriminative power, including spatially neighboring residues, relative accessible surface area (rASA) normalized in a new fashion, residue representation via multi-scale Laplacian coordinates, and geometric constraints derived from the coevolution of proteins and their RNA partners.
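To illustrate the notion of residues that are close in space yet distant along the sequence, the following minimal sketch lists spatial neighbors from one representative coordinate per residue. It is not the feature extraction used in our model: the function name `spatial_neighbors`, the distance cutoff, the minimum sequence separation `min_seq_sep`, and the toy coordinates are all hypothetical choices made for illustration.

```python
import math


def spatial_neighbors(coords, cutoff=8.0, min_seq_sep=1):
    """Map each residue index to the indices of its spatial neighbors.

    coords      -- list of (x, y, z) tuples, one representative atom per residue
                   (assumed here, e.g. C-alpha; the choice is illustrative)
    cutoff      -- distance threshold in angstroms (assumed value)
    min_seq_sep -- minimum separation along the chain for a pair to count
    """
    neighbors = {i: [] for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if j - i < min_seq_sep:
                continue  # skip pairs that are adjacent along the sequence
            if math.dist(coords[i], coords[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


# Toy hairpin-like chain: residues 0 and 5 are far apart in sequence
# but only 5 angstroms apart in space.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0),
    (3.8, 5.0, 0.0),
    (0.0, 5.0, 0.0),
]
toy_neighbors = spatial_neighbors(toy_coords, cutoff=6.0, min_seq_sep=3)
```

In this toy chain, residue 0 pairs with residue 5 and residue 1 with residue 4, although each pair is separated by at least three positions along the sequence.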
Given the residue sequence and atomic coordinates of a protein structure of interest, our ANN model outputs a binding propensity for each residue along the chain. A test prediction for the PIN domain of human regnase-1 (PDB ID: 3V33) is shown in Fig. 1, and the corresponding surface representation of the residues' binding propensities is given in Fig. 2. The predicted binding propensities are subsequently used to score and rank the decoys generated in docking MD. In a decoy, contacts between RNA and residues with large binding propensities are rewarded in the score, and vice versa. Docking results indicate that our ANN model can specifically select decoys that are in near-native conformations.
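The scoring idea can be sketched as follows, under stated assumptions. The exact scoring function is not specified here, so this example uses a simple hypothetical form in which each residue in RNA contact contributes its propensity minus a threshold. The names `score_decoy` and `rank_decoys`, the threshold of 0.5, and the toy data are illustrative and not part of our method.

```python
def score_decoy(propensities, contact_residues, threshold=0.5):
    """Score one decoy from per-residue binding propensities.

    Assumed form: residues in RNA contact with propensity above `threshold`
    add to the score, and those below it subtract from the score.
    """
    return sum(propensities[i] - threshold for i in contact_residues)


def rank_decoys(propensities, decoys, threshold=0.5):
    """Return (decoy_name, score) pairs sorted from best to worst score."""
    scored = [
        (name, score_decoy(propensities, contacts, threshold))
        for name, contacts in decoys.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: predicted propensities for a five-residue chain and, for each
# decoy, the indices of residues that contact RNA in that decoy.
toy_propensities = [0.9, 0.8, 0.1, 0.2, 0.7]
toy_decoys = {
    "decoy_a": [0, 1, 4],  # contacts made by high-propensity residues
    "decoy_b": [2, 3],     # contacts made by low-propensity residues
    "decoy_c": [0, 2],     # mixed contacts
}
ranking = rank_decoys(toy_propensities, toy_decoys)
```

With these toy values, the decoy whose RNA contacts involve only high-propensity residues ranks first and the one with only low-propensity contacts ranks last.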
Figure 1: Predicted binding propensities of residues along the chain, using the crystal structure of the PIN domain of regnase-1. Binding propensities range from 0 to 1. The larger the propensity, the higher the probability that a residue is in contact with RNA. The first row of the x-axis shows the residue types along the chain, and the second row gives the residue numbers from the structure file in steps of 10 residues.
Figure 2: Surface representation of residues' binding propensities on the PIN domain of regnase-1 (PDB ID: 3V33, chains A and B). All residues are colored according to their predicted binding propensities using a rainbow spectrum, in which a change from blue to red indicates a binding propensity increasing from 0 to 1.