Last year, the massive ENCODE consortium reported, on the basis of a broad panel of detailed biochemical experiments, that over 80% of the human genome appears to be functional. Their findings fuelled an already heated debate regarding the biological pertinence of such results. Many old-school biochemists and proponents of the “selfish DNA” hypothesis (whom I collectively refer to as junk DNAy-sayers) dismiss the use of such data to support the notion that the majority of the genome is functional.
Amidst the nit-picking, bickering, and refutations, one logical argument stands out that somewhat confounds the ENCODE findings: the lack of detectable evolutionary conservation. Indeed, the statement that > 80% of the human genome sequence is biologically functional lies in stark contrast to the fact that < 9% of it is observed to be conserved throughout mammalian evolution. But is this estimate really accurate?
Some have argued that these estimates are biased, as they are based on assumptions of non-functionality and they sample a non-representative set of true-negatives, or “neutrally” evolving sequences (see Pheasant and Mattick). Others, including the ENCODE consortium, have investigated DNA sequence variation within human populations in an attempt to bridge the gap between biochemical activity and evolutionary conservation.
The findings are largely in agreement: the majority of genomic elements required for organismal function are largely unconstrained at the level of DNA sequence. If you like to jump to conclusions, this statement would lead you to believe that most of the observed biochemical activity in our genome is not necessary for our basic survival.
On the other hand, if you like to think critically you might consider that something might be missing. Perhaps something particular to the function of non-protein coding RNAs?
To answer this, let’s step back and look at the facts:
1. The human genome is pervasively transcribed in a tissue-specific and developmentally coordinated manner.
2. The proportion of non-coding DNA in metazoan genomes scales consistently with biological complexity (N.B. no “dog ass plots” in that paper).
3. Evolutionary conservation of DNA sequence is insufficient to substantiate the function of ncRNAs under purifying (negative) selection.
With regard to that last point, we know that the sequence composition (a.k.a. primary structure) of DNA is essential for the translation of messenger RNA into proteins–the dogmatic genetic code. We also know that some regulatory mechanisms, such as microRNAs and gene promoters, require precise DNA sequences to function properly.
So how can the gap between points (1) and (3) be explained without dismissing pervasive transcription in complex organisms as “leaky” or “background noise”, when most evidence suggests otherwise? One must consider that most biological sequences–especially RNA–function through distinct physicochemical structures that nature has forged over millions of years.
In my opinion, RNA is too often dismissed as merely a primary structure required for protein synthesis. RNA, much like proteins, can fold into complex structures via specific base-pairings and tertiary interactions. These properties enable both sequence- and structure-specific interactions centred on RNA molecules within the diverse cellular environment. In fact, the most fundamental cellular machinery is centrally composed of structured RNA (e.g. ribosomes, tRNAs, sn(o)RNAs in eukaryotes, etc).
These specific pairings are also encoded in DNA (much like the protein code); however, their evolutionary dynamics are much more flexible than those of protein-coding genes or other sequence-specific elements. For instance, a C->T mutation in DNA may cause a lethal disease when located in a protein-coding sequence (or a promoter, for example). However, if the “healthy” C nucleotide is required to form a helix in an RNA structure (via a C-G base pair), the mutation to a T (U in RNA) will nonetheless form a compatible U-G base pair, maintaining the structural topology of the molecule.
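To make the wobble argument concrete, here is a minimal Python sketch; the pairing table and helper names are mine, for illustration only. It checks whether a DNA point mutation preserves an RNA base pair:

```python
# Base pairs tolerated in an RNA helix: Watson-Crick plus the G-U wobble.
RNA_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def transcribe(dna_base: str) -> str:
    """Map a DNA base to its RNA equivalent (T becomes U)."""
    return "U" if dna_base == "T" else dna_base

def pair_preserved(base: str, partner: str, mutant: str) -> bool:
    """True if mutating `base` to `mutant` still pairs with `partner`."""
    return (transcribe(mutant), transcribe(partner)) in RNA_PAIRS

# A C-G pair survives a C->T mutation thanks to the U-G wobble...
print(pair_preserved("C", "G", "T"))  # True
# ...but a C->A mutation breaks the helix (A-G is not a valid pair).
print(pair_preserved("C", "G", "A"))  # False
```

The same DNA position is therefore under looser constraint when its "function" is a structural pairing rather than a codon identity.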
So how much of the transcribed genome forms RNA structures in our cells? More importantly, how can we determine if these structures are functional? My colleagues and I set out to answer these questions and report the findings in a paper that was just published in Nucleic Acids Research.
What we found is that substantially more of the human genome is functional than previously believed.
As mentioned above, a biochemical measure of function is not quite convincing at the genome-wide scale unless it is supported by evolutionary data. So, over 3 years ago I set out to look at this specific question: how can we reliably detect evolutionary conservation of function at the level of RNA structure?
Several algorithms–or computer programs, if you prefer–have been developed over the past 30 years to predict an RNA secondary structure from a given sequence. However, almost any DNA sequence will form a defined RNA secondary structure, sometimes an extremely stable one (try for yourself by typing a random DNA sequence into the RNAfold webserver).
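To see why, consider a toy version of the problem: a Nussinov-style dynamic program that counts the maximum number of nested base pairs a sequence can form. This is only an illustration of how readily sequences pair with themselves, not the thermodynamic model that RNAfold actually uses:

```python
def max_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in `seq` (RNA alphabet),
    requiring at least `min_loop` unpaired bases in every hairpin loop."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]          # dp[i][j] = best for seq[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]               # case 1: j stays unpaired
            for k in range(i, j - min_loop):  # case 2: pair k with j
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]

# Even this trivial hairpin forms a 3-base-pair helix:
print(max_pairs("GGGAAACCC"))  # 3
```

Because so many pairings are possible, a stable predicted structure by itself says nothing about biological function.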
In the RNA field, the best way to prove that an RNA 2D (and 3D) structure is correct is to validate the observed structural interactions (the base pairs in an RNA helix) with “compensatory mutations” in orthologous RNAs. In other words, you can deduce biological function by detecting mutations in DNA that change the sequence but not the structure of the resulting RNA molecule.
Over sufficient evolutionary distance, mutations that disrupt the primary sequence but not the secondary structure of RNA are indicative of evolutionary pressure to maintain the specific structure. This provides bona fide evidence that the RNA structure has a biological function, one that affects an organism’s evolutionary success.
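The idea can be sketched in a few lines: given an alignment and one candidate base pair, count the species that mutated relative to a reference yet still pair. The function name and toy alignment below are illustrative, not taken from our pipeline:

```python
RNA_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def compensatory_support(alignment: list[str], i: int, j: int) -> int:
    """Count sequences whose bases at columns i and j differ from the first
    (reference) sequence yet still form a valid RNA base pair."""
    ref = alignment[0]
    support = 0
    for seq in alignment[1:]:
        changed = seq[i] != ref[i] or seq[j] != ref[j]
        paired = (seq[i], seq[j]) in RNA_PAIRS
        if changed and paired:
            support += 1
    return support

aln = [
    "GCAUACGC",  # reference: columns 1 and 6 form a C-G pair
    "GUAUACAC",  # C->U and G->A: U-A still pairs (compensatory evidence)
    "GCAUACGC",  # identical to the reference: no evidence either way
]
print(compensatory_support(aln, 1, 6))  # 1
```

Real tools weigh such covariation statistically across whole helices, but the core signal is exactly this kind of paired substitution.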
We searched for evolutionarily conserved RNA structures in the genomes of 35 placental mammals, using multiple sequence alignments freely accessible from the ENSEMBL website. Similar work was done by other groups a few years ago and published in Nature and Genome Research. Their findings were quite modest: a few thousand predictions at most, which also suffered from very high false discovery rates (i.e. many false positives).
One reason for this is that they used multiple sequence alignments produced by a pair of programs called TBA and MULTIZ. A few years ago, a different set of algorithms (Enredo-Pecan-Ortheus), published by Dr Benedict Paten, produced much longer syntenic blocks than MULTIZ, meaning that longer stretches of the genome could be compared across different species. Anyways, back to the point…
Such predictions have also been stigmatised for their poor reliability. Until now, genome-wide screens for RNA structure had very little to go on when calibrating their accuracy, i.e. there are not many reliable true positives and true negatives in multiple genome alignments with which to train the prediction tools.
How we improved the reliability of RNA structure predictions
So how can one accurately measure the performance of RNA structure prediction? I decided to develop an analytic pipeline to generate accurate benchmarking data consistent with the experimental conditions of genome-wide screens. The most innovative aspect of the study is how we measured the accuracy of both (i) the prediction software employed, and (ii) the general methodology used in genome-wide scans.
We did this by taking a very diverse set of “real” RNA structures that are well described in the literature, and using them as positive controls under the same conditions as a genome-wide screen for conserved RNA structures. This benchmarking dataset should, in theory, yield RNA structure predictions 100% of the time. It also allows us to calibrate the various parameters of a genome-wide screen.
We found that the 2 programs we tested (RNAz and SISSIz) predicted at most ~40% of the sampled RNA secondary structures. The main reason for this is that a fixed sampling “window” has to be employed to scan the genome, mainly because of computational complexity constraints: longer windows are more likely to encompass paired bases but require exponentially more computation time. On a control data set of scrambled sequences, the programs also predicted very few conserved RNA structures (I will get back to this important point later).
This approach allows us to estimate how many conserved RNA structure predictions we are potentially missing and how many of the predictions can be explained by chance. We also discovered that both programs produced high-confidence predictions with different affinities. For instance, one of the programs prefers multiple alignments with more species in it than the other program.
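For the curious, the fixed-window scanning described above boils down to something like this; the window and step sizes here are illustrative, not the exact values we used:

```python
def windows(alignment_block: list[str], size: int = 120, step: int = 40):
    """Yield (start, sub-block) slices of an alignment block: every sequence
    is cut to the same [start, start + size) columns."""
    length = len(alignment_block[0])
    for start in range(0, length - size + 1, step):
        yield start, [seq[start:start + size] for seq in alignment_block]

# Toy 2-species alignment block of 400 columns:
block = ["A" * 400, "C" * 400]
starts = [start for start, _ in windows(block)]
print(len(starts))  # 8 windows: starts 0, 40, ..., 280
```

The key limitation follows directly from the code: a base pair whose two partners sit more than `size` columns apart can never fall inside a single window, so long-range structures are systematically missed.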
Our genome-wide screen for evolutionary conservation
Alright, so we did some fancy bioinformatics and now have an accurate set of parameters for a genome-wide screen. I wrote a chunk of Java code that essentially breaks up the large multiple alignment files of 35 mammals and scans them in parallel on a supercomputer (this took about 15 CPU-years to complete, and I ran it 3 times to be sure).
We found > 4 million RNA structures that were conserved across the 35 species. When mapped onto the human genome, these predictions encompass 13.6% of it. That’s roughly 10x more than the amount of protein-coding sequence in the human genome.
Furthermore, the bulk of the 4 million predictions fails to overlap most known sequence-constrained elements. This supports the novelty of our findings: RNA structures are distinct functional elements that are widespread throughout our genome.
But it gets better. Considering that the programs we used predicted ~40% of known RNA structures (the prediction sensitivity), we can extrapolate the findings like so: 13.6% at 40% sensitivity = 34%. That’s one third of the human genome that is likely to function through RNA secondary structure components alone.
Furthermore, only 85% of the human genome sequence was surveyed in this screen. The other 15% is either unalignable (human-specific) or identical across the sampled sequences in the multiple alignments (our analysis ignored these situations). So that 34% could, in principle, be further extrapolated to 40%.
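Spelling out the back-of-envelope arithmetic from the last two paragraphs:

```python
# Extrapolation of the screen's coverage, using the figures quoted above.
covered = 0.136      # fraction of the genome covered by our predictions
sensitivity = 0.40   # fraction of known structures the programs recover
surveyed = 0.85      # fraction of the genome that could actually be scanned

adjusted = covered / sensitivity      # correct for structures we likely missed
print(round(adjusted, 2))             # 0.34 -> ~34% of the genome

fully_adjusted = adjusted / surveyed  # further correct for unscanned sequence
print(round(fully_adjusted, 2))       # 0.4 -> ~40%
```

Both corrections assume the unscanned and missed regions behave like the scanned ones, which is why the text hedges with "in principle".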
Keep in mind that there are several other limitations to this analysis that could be used to argue that even more of the genome is functional through RNA structure. For example, our analysis does not consider RNA pseudoknots, non-standard base-pairs, tertiary structures, multiple alignment artefacts, etc.
In any case, I hope that these findings place RNA structure elements into the spotlight for future investigations of genome functionality and of ncRNA regulatory mechanisms.
On the matter of negative controls
One problem with any experimental analysis of this sort is that you need a reliable set of negative controls to estimate how many false positive results you can expect (this is called specificity). This is particularly true for RNA secondary structure prediction endeavours, which have been plagued by high false discovery rates.
As highlighted in this work, assumptions of functionality are directly linked to biased estimates. For this reason, we did not directly include native biological sequences in our evaluation of false positives. This made it quite challenging to produce a reliable set of negative controls.
We tested the false discovery rate in several ways. Traditionally, multiple sequence alignments have been shuffled by permuting the individual columns, thus preserving global sequence conservation while disrupting any conserved base-pairings. This also disrupts the dinucleotide composition of the alignment, which is an important factor in RNA structure prediction–specific dinucleotide arrangements contribute to base-pair stacking energies.
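For illustration, the traditional column shuffle amounts to a few lines of code (a toy sketch, not the actual tools we used):

```python
import random

def shuffle_columns(alignment: list[str], seed: int = 0) -> list[str]:
    """Permute whole columns of an alignment (equal-length strings), applying
    the same permutation to every sequence."""
    cols = list(zip(*alignment))       # one tuple of characters per column
    random.Random(seed).shuffle(cols)  # reorder columns, keep them intact
    return ["".join(chars) for chars in zip(*cols)]

aln = ["GGGAAACCC", "GGGAAACCC"]
shuf = shuffle_columns(aln)
print(shuf[0] == shuf[1])  # True: per-column conservation is untouched
```

Because entire columns move together, conservation survives, but any helix (and every native dinucleotide) is torn apart, which is exactly the bias described above.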
Thus, we explored alternative methods. One approach permuted chunks of 3 consecutive columns across a large section of chromosome while ensuring that no 2 chunks would end up in the same sampled window. We also realigned the resulting pseudo-chromosome blocks to scramble the transition/transversion base substitution patterns. Both approaches produced unnatural-looking multiple sequence alignments whose characteristics differed markedly from those of native genomic alignments.
We therefore resorted to algorithms that randomise or shuffle native alignments. We combined the SISSI and MULTIPERM algorithms to generate a shuffled chromosome 10, which was used to calibrate the specificity and false discovery rate of our method.
Using MULTIPERM alone, 22% of the ~4 million predictions could be explained by chance. However, we also show that MULTIPERM does not maintain the general sequence conservation very accurately, whereas SISSI does so quite well. Since we also use SISSI to make the RNA structure predictions, we combined it with MULTIPERM in a manner that removes any potential redundancy bias.
This double-shuffled true-negative dataset produced a false discovery rate of < 5%, suggesting that the false RNA structure predictions we report encompass less than 0.7% of the human genome (13.6% × 0.05).
So, what does this mean for mankind?
Prima facie, our findings provide evolutionary support for the rampant biochemical activity observed across our genome. Perhaps this will help some of the skeptics embrace the noncoding world for what it likely is: a plastic genomic matrix for regulation of gene expression. As we state in the manuscript:
Concordantly, we propose that the higher-order structural components of RNA serve as a flexible and modular evolutionary platform for the diversification of regulatory mechanisms guiding developmental ontologies, assisted by low penetrance of the affected alleles and by compensatory base pairing.
For the most common of mortals, these findings provide medical researchers with a new resource to further understand the genetic factors underlying disease, development, and evolution.