What does the human genome Encode?

So the ENCODE consortium, which aims to catalogue all the functional elements in the human genome, has just published its results to date in 30 articles in Nature, Genome Research and Genome Biology. As far as I can tell, all the papers are open access, so anyone can read them. A reasonably useful place to start is Nature’s ENCODE explorer.

To date I have only read a few of the new ENCODE papers (and no doubt will be busy reading more over the next week or two) but here are some of my initial thoughts.

1. As far as biology goes, the new ENCODE results don’t provide a lot of big novel insights. Many of the conclusions in ENCODE 2012 are the same as those from the ENCODE pilot paper in 2007, except now they cover the whole genome, not just 1%. For example, if you take my areas of transcriptomics and noncoding RNAs, these new results confirm that yes, the genome really is pervasively transcribed (~75% as part of primary transcripts from 15 cell lines according to Djebali et al.) and that there really do seem to be similar numbers of coding and noncoding genes in the genome (Harrow et al. and Derrien et al.).

We shouldn’t confuse “not novel” with “not useful”, though. It was always likely ENCODE 12 would mostly confirm ENCODE 07, and it has. Understanding transcription, transcription factor binding sites and chromatin marks genome-wide is a good thing: it provides a much more comprehensive and thorough understanding of the genome.

2. There is a lot more insight that can be pulled out of this data than ENCODE have had time to look at. These 30 papers are just the start; using ENCODE data in their own studies is going to keep people busy for years. In fact, most of the novel insight will probably come from scientists integrating and parsing this data in interesting ways to answer their own questions about gene regulation and function, genomic variants, disease and so on.

3. ENCODE have created a very useful resource not just for the genomics community but for a much wider range of biological scientists. Given the cataloguing nature of ENCODE, we now have stupendous amounts of data about the transcription and regulation of most genes in the genome. I think one of the key challenges to get the maximum value out of ENCODE is going to be getting the word out to the non-genomics community that there is now a detailed resource about their favourite gene that they can access and use in their own studies.

4. We need better ways of visualising all this data. I like the UCSC browser, but its current design isn’t really set up to show me RNA sequencing from 80 different experiments and the binding sites of 100 different transcription factors plus all the other data I’m interested in. As ENCODE produces more and more data this problem isn’t going to get smaller: how are we going to visualise data on 1000 transcription factors in 100 cell lines, for example? Looking through all this data can be a bit daunting for someone who uses the browser every day, so it’s likely completely overwhelming for many who don’t work in genomics, but we need to make sure they can easily pull out the data that’s useful to them (see point 3). I don’t have the solution to this, but hopefully someone does.

… more to come (probably)


