Citation
Vandenbon A, Kumagai Y, Akira S, Standley DM. "A novel unbiased measure for motif co-occurrence predicts combinatorial regulation of transcription", BMC Genomics, 2012. doi: 10.1186/1471-2164-13-S7-S11, PMID: 23282148.
Introduction
REgulatory MOtif COmbination Detector (REMOCOD) is a program for the systematic identification of pairs of transcription factor binding sites (TFBSs) that tend to co-occur in promoter sequences of a set of co-regulated genes. Given a set of gene IDs, the program first detects over-represented TFBSs in the promoter sequences corresponding to these genes, and subsequently predicts regulatory motifs that show a tendency to co-occur with any of the over-represented TFBSs.
Methods
Existing methods for predicting TFBS co-occurrences suffer from biases caused by TFBS over-representation, and from the difficulty of evaluating the significance of co-occurrences using standard statistical tests. We aimed to overcome these problems. Our measure for co-occurrences of pairs of regulatory motifs, the frequency ratio (FR), takes into account not only the number of cases where two motifs are present in the same sequence, but also the cases where they are not present together. The significance of observed tendencies in a set of input sequences is evaluated by comparing them with a large number of randomly sampled cases.
Usage
Input Sequence IDs
RefSeq IDs or Gene symbols of a set of co-regulated or co-expressed genes. Supported organisms are for the moment limited to human and mouse. The "Home" tab contains sample inputs for human and mouse.
Select 'RefSeq IDs' or 'Gene symbols', in accordance with the sequence IDs given as input.
Select either 'human' or 'mouse'.
Remove redundancy in input sequences? (Default: Yes)
Bioinformatics approaches for the prediction of regulatory motifs are in general sensitive to redundancy in the input sequences. In the case of REMOCOD, two or more highly similar or identical input sequences can bias the degree to which pairs of TFBSs appear to be co-occurring. It is therefore strongly recommended to remove redundancy in the input sequences. In practice, our approach checks for pairs of similar sequences in the input set, based on BLAST E values, and removes 1 sequence of each similar pair, if any.
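The redundancy removal described above can be illustrated with a minimal sketch. It assumes the similar pairs and their BLAST E values have already been computed elsewhere; the function name, the input format, the E-value cutoff, and the rule of keeping the sequence that appears first in the input are all assumptions made for illustration, not REMOCOD's actual implementation.

```python
def remove_redundant(seq_ids, similar_pairs, e_value_cutoff=1e-10):
    """Drop one sequence of each similar pair (illustrative sketch only).

    seq_ids: ordered list of input sequence IDs.
    similar_pairs: iterable of (id_a, id_b, e_value) tuples, for example
        parsed from an all-against-all BLAST run (hypothetical format).
    e_value_cutoff: hypothetical threshold; the value used by REMOCOD is
        not stated in this documentation.
    """
    order = {sid: i for i, sid in enumerate(seq_ids)}
    removed = set()
    for id_a, id_b, e_value in similar_pairs:
        # Ignore pairs that are not similar enough or not in the input set.
        if e_value > e_value_cutoff:
            continue
        if id_a == id_b or id_a not in order or id_b not in order:
            continue
        # If one member is already gone, the pair is no longer redundant.
        if id_a in removed or id_b in removed:
            continue
        # Assumption: keep the sequence listed first, drop the other one.
        removed.add(id_b if order[id_a] < order[id_b] else id_a)
    return [sid for sid in seq_ids if sid not in removed]


ids = ["seq1", "seq2", "seq3", "seq4"]
pairs = [("seq1", "seq2", 1e-50), ("seq3", "seq4", 0.5), ("seq2", "seq3", 1e-30)]
kept = remove_redundant(ids, pairs)
print(kept)  # ['seq1', 'seq3', 'seq4']
```

In this toy input, seq2 is removed because it is highly similar to seq1; the seq3/seq4 pair is kept because its E value is above the assumed cutoff.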
Sample Results
The following links show the sample results when running this tool using the sample input sequence IDs (see Usage > Input Sequence IDs) with default parameters set:
Explanatory notes for the results table
A successful run of REMOCOD returns a table similar to the one shown below. At the top, a unique identifier (TFBSxxxxxxx) is shown, which can be used for accessing results at a later time.
Below that is a table reporting significant (P-value < 0.01) co-occurring pairs of transcription factor binding site (TFBS) motifs. If no significant pairs are found, the top co-occurring motif pairs are shown. Each row in the table corresponds to a pair of co-occurring motifs.
The following is a description of each column in the table:
- Motif A shows generally over-represented motifs in the input set of sequences.
- Motif B shows a second motif that co-occurs with Motif A. Motif B can be any motif, and is not restricted to over-represented motifs. If Motif B is the same motif as Motif A, the co-occurrence is homotypic, indicating that Motif A sites occur in pairs more often than expected.
- Input shows some properties of the motif pair in the input sequences. This will be explained further below.
- Reference shows the same properties, for the reference set of promoter sequences (e.g. the genomic set of promoter sequences).
- Frequency Ratio shows a visual representation of the Frequency Ratios in the input sequences and the reference sequences. The Frequency Ratio is a measure for the tendency of Motif B sites to co-occur with Motif A sites. For a detailed explanation we refer to our publication on REMOCOD.
- P-value shows a p-value for the Frequency Ratio observed in the input sequences, estimated by random sampling from the reference set of promoter sequences. The p-value shown is the fraction of random samplings resulting in a higher Frequency Ratio than observed in the actual input sequence set.
The Input column contains 2 pie charts.
- The left pie chart shows the number (percentage) of input sequences that contain 1 or more sites for Motif A (orange), and the same for sequences that do not contain any site for motif A (blue).
- The right pie chart shows the number (percentage) of sites for Motif B that are present in sequences also containing 1 or more Motif A sites (= co-occurrences, orange), and the same for Motif B sites that are present in sequences lacking Motif A sites (= avoidances, blue).
The Reference column shows the same as above (Input column), but for the reference set of promoter sequences (e.g. the genomic set of promoter sequences).
Specific explanation of the HNF1 – CIZ case
In the first row of the example result table, the co-occurrence between the binding motifs for HNF1 and CIZ is shown. As shown in the Input column, of the 125 input sequences, 49 (39%) contain one or more predicted sites for the HNF1 transcription factor. In the reference sequences, on the other hand, only 17% of the sequences contain one or more HNF1 sites. In other words, the HNF1 motif is enriched in the input sequences as compared to the reference. This makes sense, because the example input sequences are a set of promoter sequences of genes with high expression levels in liver tissues, and HNF1 is a major regulator of liver-specific gene expression. Also shown in the Input column is that 25 out of 32 sites (78%) for the CIZ motif predicted in the input set of sequences are present in sequences also containing one or more sites for HNF1, while only 7 (21%) are present in sequences lacking an HNF1 site. Given the ratio of input sequences having/lacking HNF1 sites (39% vs 60%), this shows that CIZ sites tend to co-occur with HNF1 sites more often than expected.
In the reference sequences, about 23% of CIZ sites co-occur with HNF1 sites, which is more or less as expected given that 17% of the reference sequences contain one or more HNF1 sites.
These results indicate that in the liver-specific promoter sequences, CIZ sites show a tendency to be present together with HNF1 sites and not in sequences lacking HNF1 sites; and that in reference sequences there is no clear preference for co-occurrence or avoidance.
The Frequency Ratio column shows the Frequency Ratios for the pair HNF1 – CIZ in input and reference sequences. For a description of the calculation of Frequency Ratios, we refer to our paper on REMOCOD. Here we just observe that in input sequences the Frequency Ratio is about 5, while in reference sequences it is slightly higher than 1. Roughly speaking, this means that in input sequences, CIZ sites are about 5 times more likely to be present in sequences containing HNF1 sites than in sequences lacking HNF1 sites. In reference sequences there is a slight preference for sequences containing HNF1 sites.
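As a rough check of the "about 5" figure, the input-set counts quoted above for the HNF1 – CIZ pair can be combined as in the sketch below. It follows only the informal reading given in this section (Motif B sites per sequence containing Motif A, divided by Motif B sites per sequence lacking Motif A). This is an assumption for illustration and not the published definition of the Frequency Ratio, and the function name is hypothetical.

```python
def rough_frequency_ratio(n_seqs_with_a, n_seqs_without_a,
                          b_sites_with_a, b_sites_without_a):
    """Informal ratio of Motif B site rates (not the published FR formula)."""
    if n_seqs_with_a <= 0 or n_seqs_without_a <= 0:
        raise ValueError("both sequence groups must be non-empty")
    if b_sites_without_a <= 0:
        raise ValueError("ratio is undefined without any avoidances")
    # Motif B sites per sequence that contains at least one Motif A site.
    rate_with_a = b_sites_with_a / n_seqs_with_a
    # Motif B sites per sequence that contains no Motif A site.
    rate_without_a = b_sites_without_a / n_seqs_without_a
    return rate_with_a / rate_without_a


# Counts from the HNF1 - CIZ example: 125 input sequences, 49 with HNF1,
# 25 CIZ sites in sequences with HNF1, 7 CIZ sites in sequences without.
fr = rough_frequency_ratio(49, 125 - 49, 25, 7)
print(round(fr, 2))  # 5.54
```

The result is in line with the value of about 5 shown in the table; any difference may simply reflect that the published measure is calculated differently.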
The significance of this tendency for co-occurrence in input sequences is evaluated by randomly sampling sets of sequences of the same size as the input set from the reference set, and recalculating the Frequency Ratio in these sampled sequences. A p-value is estimated as the fraction of randomly sampled sets having a Frequency Ratio equal to or higher than the one observed in the actual input sequences. In the case of the HNF1 – CIZ pair, in 1000 random samplings, none of the sets led to a higher Frequency Ratio, so the p-value is estimated to be < 0.001 (shown in the table as 0.000).
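The sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `empirical_p_value`, its parameters, and the `statistic` callback (standing in for the Frequency Ratio calculation) are hypothetical names, and sampling without replacement is assumed. The comparison uses "equal to or higher", following the wording above.

```python
import random


def empirical_p_value(observed, reference, sample_size, statistic,
                      n_samplings=1000, seed=0):
    """Fraction of random reference samples scoring >= the observed value.

    observed: statistic measured on the real input set.
    reference: list of per-sequence records from the reference set.
    sample_size: number of sequences in the real input set.
    statistic: function mapping a list of records to a number; here it
        stands in for the Frequency Ratio calculation (hypothetical).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samplings):
        # Draw a random set of the same size as the input set.
        sample = rng.sample(reference, sample_size)
        if statistic(sample) >= observed:
            hits += 1
    return hits / n_samplings


# Toy usage with a stand-in statistic (the mean of some per-sequence score).
toy_reference = list(range(100))
mean = lambda records: sum(records) / len(records)
p = empirical_p_value(observed=1000, reference=toy_reference,
                      sample_size=10, statistic=mean)
print(p)  # 0.0
```

With 1000 samplings the smallest non-zero estimate is 0.001, so a result of zero hits is best read as p < 0.001 and not as an exact zero.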
Funding
This research is supported by the Japan Society for the Promotion of Science (JSPS) through the “Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program),” initiated by the Council for Science and Technology Policy (CSTP), and by a Kakenhi Grant-in-Aid for Scientific Research (23710234) from the Japan Society for the Promotion of Science.
All Rights Reserved 2012
Systems Immunology, IFReC