The extent of the genome that is transcribed (known as pervasive transcription) and the amount of “dark matter” RNA (uncharacterised transcripts and/or transcripts of unknown function) produced by the cell have ignited a few controversies over the years.
This month marks 2 years since the last salvo from the pervasive transcription ‘skeptics’, van Bakel et al 2010, was published in PLoS Biology. The authors claimed, amongst other things, that previous studies had overestimated the pervasiveness of transcription due to false positives, that the small percentage of RNA “dark matter” in their RNA sequencing datasets supported this conclusion, and that much of the low level transcription found in intergenic regions was probably made up of transcriptional by-products (i.e. some sort of noise, be it technical or biological).
We were critical of many aspects of this study in a reply we published in PLoS Biology in 2011 (further info here). Now that another year has passed, it’s probably worth seeing how their conclusions are holding up.
Most RNA sequencing reads in a standard sample are re-sequencing of the highly expressed transcripts in the sample, so a lot of sequencing depth is required to properly discover and characterise lowly expressed RNAs. The greatest depth of sequencing for human RNA in van Bakel 2010 was ~20 million paired-end reads, and this was in a pooled sample of 10 tissues/cell lines and in whole brain (which is also like a pooled sample given its complexity). We criticized this sequencing depth at the time as being insufficient for drawing conclusions about pervasive (mostly low level) transcription. With the increased sequencing depth now readily available, I think a good rhetorical question can be asked here: does anyone think ~20 M paired-end reads is enough to fully characterize the transcriptome in a complex sample?
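To put rough numbers on the depth argument, here is a minimal back-of-the-envelope sketch. It assumes reads are sampled independently and in proportion to transcript abundance (a Poisson approximation), which ignores mapping, library and amplification biases. The abundance fractions, the 5-read threshold and the function names are hypothetical values chosen for illustration; none of them come from van Bakel 2010 or any other study discussed here.

```python
import math


def expected_reads(depth, fraction):
    """Expected number of reads from a transcript that makes up
    `fraction` of all sequenced fragments, at a total of `depth` reads."""
    return depth * fraction


def prob_at_least(k, depth, fraction):
    """Poisson approximation of the probability of seeing at least
    `k` reads from that transcript."""
    lam = expected_reads(depth, fraction)
    # P(X >= k) = 1 - sum over i < k of exp(-lam) * lam^i / i!
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    # Hypothetical abundance fractions, for illustration only.
    for depth in (20_000_000, 100_000_000):
        for fraction in (1e-5, 1e-7, 1e-8):
            mean = expected_reads(depth, fraction)
            p5 = prob_at_least(5, depth, fraction)
            print(f"depth={depth:>11,} fraction={fraction:.0e} "
                  f"expected={mean:8.1f} P(>=5 reads)={p5:.3f}")
```

Under these assumptions the expected read count for a transcript scales linearly with depth, so the rarer the transcript, the more depth is needed before it yields enough reads to detect, let alone assemble and characterise.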
A couple of studies published since van Bakel 2010 have helped to throw more light on how much of the transcriptome is made up of “dark matter” and on the nature of uninformative intergenic reads in an RNA sequencing library of a depth similar to van Bakel’s.
While our response to van Bakel was in review, Kapranov et al 2010 published a study looking at the percentage of the transcriptome consisting of “dark matter”. Unlike many studies, Kapranov et al did not limit themselves to polyA+ RNA and instead sequenced rRNA-depleted total RNA. This difference in methodology led to a strikingly different estimate of the amount of cellular “dark matter”, at around 50% of the transcriptome. They also came up with the thought-provoking estimate that around 100 million 35 nt sequencing reads would be required to fully sequence all the diversity of RNA in just one cell.
The second study, by Mercer et al 2011, sequenced transcripts from foot fibroblasts before and after using capture arrays to enrich for transcripts from regions of interest, including gene deserts. Sequencing (at a similar depth to van Bakel 2010) revealed that intronic and intergenic reads, which were present as uninformative reads pre-capture and would likely have been classified as “random” or “background” by van Bakel, were instead often from complex, spliced, but rare transcripts. These results supported our argument that van Bakel et al concluded many sequence reads were by-products or noise because of a lack of sequencing depth, not because the reads were actually artifacts.
In summary, I think that the literature published since van Bakel 2010 generally agrees with our critiques of their findings. There really is a lot of dark matter transcription out there. Many of van Bakel’s criticisms were directed at the amount of transcription identified in the ENCODE pilot project, so with the results of the full ENCODE project soon to be published it will be interesting to see whether these also refute their conclusions.
Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, Rinn JL, Ponting CP, Stadler PF, Morris KJ, Morillon A, Rozowsky JS, Gerstein M, Wahlestedt C, Hayashizaki Y, Carninci P, Gingeras TR, Mattick JS. 2011. The reality of pervasive transcription. PLoS Biol 9: e1000625.
Kapranov P, St Laurent G, Raz T, Ozsolak F, Reynolds CP, Sorensen PH, Reaman G, Milos P, Arceci RJ, Thompson JF et al. 2010. The majority of total nuclear-encoded non-ribosomal RNA in a human cell is ‘dark matter’ un-annotated RNA. BMC Biol 8: 149.
Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, Mattick JS, Rinn JL. 2011. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol 30: 99-104.
van Bakel H, Nislow C, Blencowe BJ, Hughes TR. 2010. Most “dark matter” transcripts are associated with known genes. PLoS Biol 8: e1000371.