BMI – Busted mass index?

There’s been a lot of press about a new paper by Tomiyama et al. reporting that using BMI alone incorrectly classifies some 75 million Americans as healthy or unhealthy. The authors make the fair point that relying solely on BMI to gauge health is a bad idea. Much of the news coverage, impressed by that big number, has taken the view that BMI is pretty useless as an indicator of health*.

The paper didn’t include any figures, so I’ve used the results from its Table 2 to plot how metabolic health varies with BMI.


Viewed this way, BMI doesn’t look so bad and shows a very clear trend. People who are classed as obese are highly unlikely to be metabolically healthy, while those of normal weight are mostly healthy. People who are overweight but not obese seem to be the grey area: their metabolic health is worse than that of people of normal weight, but BMI is not very instructive at an individual level. Still, there are over 300 million Americans, which means that even if BMI is wrong about the health of 75 million, it is still instructive about 75% of the time.
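For anyone who wants to check that last number, here is the back-of-the-envelope sum as a tiny Python sketch. It uses only the two round figures quoted above (75 million misclassified, 300 million people), so treat the result as a rough lower bound and not a figure from the paper.

```python
# Round figures from the paragraph above, in millions of people.
population_millions = 300
misclassified_millions = 75

# Share of people for whom BMI alone gives the right health classification.
share_correct = 1 - misclassified_millions / population_millions
print(f"{share_correct:.0%}")  # 75%
```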

I guess this is why people have used, and still use, BMI: it is extremely fast, cheap and easy to calculate (you only need to know someone’s weight and height). But you’d also want to know things like blood pressure, smoking status, family history of disease and level of exercise to properly understand a person’s health.
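To show just how little goes into it, here is a minimal Python sketch of the calculation. BMI is weight in kilograms divided by the square of height in metres. The category cut-offs below are the standard adult ones as I understand them (18.5, 25, 30, 35, 40) and are my assumption, not values taken from the Tomiyama paper. The function names are made up for this example.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to a category (assumed standard adult cut-offs)."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    if value < 35:
        return "obesity I"
    if value < 40:
        return "obesity II"
    return "obesity III"


for weight in (70, 85, 100, 110):
    value = bmi(weight, 1.75)
    print(f"{weight} kg at 1.75 m: BMI {value:.1f} ({bmi_category(value)})")
```

Two numbers in, one number out, which is exactly why it is so widely used and also why it can’t tell you anything about blood pressure or exercise.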

The study was a snapshot of the cardio-metabolic health of people at the time. It didn’t follow people to see whether they got sick (morbidity) or died (mortality). A number of studies have done this. I’m not a specialist in the field, but the question seems to be somewhat controversial at the moment, with studies disagreeing about when increased weight correlates with increased mortality. One thing these studies do agree on, though, is that people with a BMI above 35 (Obesity II and III) are more likely to die than those in the lower BMI categories.


*Let’s also dispense with the ridiculous trope that always gets brought up about BMI, something along the lines of BMI is no good because it classes some professional athletes as obese. BMI doesn’t distinguish why someone is heavy, so you can have a high BMI due to large muscle mass. However, what percentage of the population are professional athletes with high muscle mass? Barely enough to be a rounding error. So it’s a bad argument based on nit-picking.


Filed under Clarksy's corner

Lariats of fire – Genome-wide discovery of human splicing branchpoints

One of the cool things about science is using new technologies to address previously intractable questions. In a paper just published we did just that to create the first genome-wide map of splicing branchpoints in the human genome. But what are branchpoints and why are they important?

Like many things in biology, RNA splicing is a pretty wondrous process. Small exonic regions are picked out of a vast intronic sea and pasted together to create a mature RNA. Many factors are required for splicing, including the proteins and small RNAs of the spliceosome, as well as nucleic acid sequence motifs that mark the borders of introns and exons and localise spliceosomal components on the RNA.

Crucial sequence features in introns include the branchpoint and the 5′ and 3′ splice sites, which sit at the intronic termini next to the upstream and downstream exons. The branchpoint, which is generally close to the 3′ end of the intron, is recognised early in the splicing process, and in recognising it the spliceosome selects the nearby exon for inclusion in the mRNA. During splicing the 5′ splice site and branchpoint nucleotide are brought together and joined to form an intron lariat. This both frees the upstream exon and brings it into close proximity with the downstream exon, allowing them to be joined and the intron lariat to be cut adrift (see figure below).

Branchpoints are a vital component in RNA splicing and mutation of branchpoints can disrupt proper splicing and cause diseases such as cancer. In yeast, branchpoints are easy to find because their sequence is always the same. This is not the case in humans, where the sequence motif at the branchpoint is known to vary considerably, making them difficult to confidently identify by sequence analysis. Compounding this issue, the rare and transient nature of intron lariats makes branchpoints difficult to pinpoint experimentally. So, despite having hundreds of thousands of introns, only a few hundred human splicing branchpoints had been previously identified.

In our study we tackled this issue using two complementary experimental techniques that enrich for branchpoint sequences within intron lariats. We identified ~60,000 branchpoints in >10,000 genes, providing a first genome-wide map of splicing branchpoints in the human genome. Having this many branchpoints allowed us to perform a much more comprehensive analysis than has previously been possible, providing some cool new insights into branchpoints and their role in RNA splicing.

So how did we do it?

Firstly, we used RNA Capture Sequencing (targeted RNA sequencing, or RNA CaptureSeq). RNA CaptureSeq uses oligo probes as baits to pull out RNA regions of interest and is super sensitive. CaptureSeq has been used previously to find rare mRNAs but is also ideal for rare RNA processing intermediates. Here, instead of capturing exonic sequence, we targeted the 5′ end of introns (which loops around to join the branchpoint) and the 3′ end of introns (where we predict most branchpoints to be).

Secondly, we used RNaseR digestion. RNaseR digests linear (but not circular) RNA, removing most RNA species and leaving behind lariats and other circRNAs. Unlike CaptureSeq, we don’t expect RNaseR to home in on the branchpoint any more than on the rest of the intron, but it does give nice enrichment for intronic sequence and hence for branchpoints too.

The figure below gives you an idea of how these methods work compared to standard RNA-seq:

How many reads were required to identify one lariat junction spanning read for different techniques. n is number of sequencing libraries examined. Figure reproduced from Mercer et al.


The reason this works at all is that reverse transcriptase can cross the unusual bond between the 5′ intron end and the branchpoint.

Intron lariat showing the 5’SS of the intron joined to the branchpoint adenosine. Direction of reverse transcriptase is shown by arrow. Figure reproduced from Mercer et al.

This means branchpoint sequences are present in RNA sequencing libraries. The difficulty (see figure below) is that the sequence you get cannot be aligned to the genome in the standard way, but it does give you both the 5′ end of the intron and the branchpoint. So you know not only where the branchpoint is, but also which upstream exon was involved in this splicing event.

Branchpoint Identification

RNA splicing, lariat formation and branchpoint identifying sequence reads. Red-blue line labelled with B & A shows the direction traveled by reverse transcriptase and the branchpoint identifying sequence created.
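To make the idea of a branchpoint-identifying read concrete, here is a toy Python sketch. It is not the pipeline used in the paper. It assumes a single intron, a read given in the sense orientation, and exact matching only. In such a read, the sequence leading up to the branchpoint is followed directly by the first bases of the intron, so the two halves map to the intron in inverted order. The function name, the `min_anchor` parameter and the intron sequence are all made up for illustration.

```python
from typing import Optional


def find_branchpoint(intron: str, read: str, min_anchor: int = 5) -> Optional[int]:
    """Return the 0-based intron index of the inferred branchpoint, or None.

    Hypothetical helper: exact matching, one intron, sense-strand read.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        before_junction, after_junction = read[:split], read[split:]
        # The part after the junction must be the very start of the intron,
        # i.e. the sequence immediately downstream of the 5' splice site.
        if not intron.startswith(after_junction):
            continue
        # The part before the junction must map further downstream in the
        # intron; its last base is the branchpoint nucleotide.
        start = intron.find(before_junction, len(after_junction))
        if start != -1:
            return start + len(before_junction) - 1
    return None


toy_intron = "GTAAGTCCGGTTCACGATTGCTAACTTTCCTTTCAG"  # made-up sequence
# Build a lariat-junction read: 8 bases ending at the branchpoint,
# followed by the first 6 bases of the intron.
lariat_read = toy_intron[16:24] + toy_intron[:6]

bp = find_branchpoint(toy_intron, lariat_read)
print(bp, toy_intron[bp])  # 23 A
```

Real reads are messier than this: a practical version would have to tolerate sequencing errors and mismatches around the junction, and search across every annotated intron, not just one.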


Our analysis of ~60,000 branchpoints shows they are “predominantly adenosine, highly conserved, and closely distributed to the 3′ splice site.” We find that multiple branchpoints within an intron are common.

During splicing the U2 snRNA binds to the sequence around the branchpoint, an essential step in productive splicing. We analysed the Beta-box, our term for the conserved sequence motif that overlaps the branchpoint and interacts with the U2 snRNA, and identified the following distinct features.

  • The density of G & U-residues within U2 snRNA enables greater base-pairing possibilities with Beta-boxes through RNA wobble-base pair interactions. This allows high sequence diversity amongst Beta-boxes while maintaining Beta-box function, similar to previous observations in microRNA seed sequences.
  • The type of base-pairing allowed between U2 snRNA and Beta-boxes makes Beta-box function resistant to disruption by common transition mutations.
  • The abundance of Beta-box families differs widely within the human genome and diverges between metazoan lineages. “Branchpoints with strong U2 binding (strong B-boxes) outcompete those with weak B-boxes … to specify exon inclusion. In addition, U2 binding strength positively correlates with both B-box occurrence and conservation, supporting the importance of the B-box to efficient splicing.”

Beta-box counts, conservation, U2 binding strength and over-representation. Figure reproduced from Mercer et al.

  • “B-box families preferentially associate with distinct classes of intron–exon architecture that can be distinguished by polypyrimidine tract nucleotide content, GC content, and conservation. It has been proposed these alternative architectures correspond to intron- and exon-defined splicing mechanisms. Therefore, B-box motifs contribute a further distinction between these two alternative architectures. This integration of multiple splicing features suggests the coevolution of B-box motifs with the surrounding sequence and their integration into the competitive and compensatory mechanisms that regulate splicing”
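The first two points in the list above can be illustrated with a toy model in Python. It considers a single base-pairing position and allows only the canonical RNA pairs (A-U, G-C) plus the G-U wobble pair. That is a deliberate simplification of my own and not the analysis from the paper. The function names are made up for this sketch.

```python
# Canonical Watson-Crick RNA pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# Transition mutations swap purine for purine or pyrimidine for pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "U", "U": "C"}


def can_pair(snrna_base: str, motif_base: str) -> bool:
    """True if the two RNA bases can pair in this toy model."""
    return (snrna_base, motif_base) in PAIRS


def transition_tolerated(snrna_base: str, motif_base: str) -> bool:
    """True if pairing survives a transition mutation of the motif base."""
    return can_pair(snrna_base, motif_base) and can_pair(
        snrna_base, TRANSITION[motif_base]
    )


for snrna_base in "GUAC":
    partners = [b for b in "ACGU" if can_pair(snrna_base, b)]
    tolerant = all(transition_tolerated(snrna_base, b) for b in partners)
    print(snrna_base, "pairs with", partners, "| transition-tolerant:", tolerant)
```

In this model a G or U on the snRNA side each has two possible partners, and those two partners are exactly a transition apart (C and U, or A and G), so a transition in the motif still leaves a valid pair. An A or C on the snRNA side has a single partner, so any change breaks the pairing.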

Looking at common and disease SNPs at branchpoints, we find that common SNPs are ~3.1-fold depleted at branchpoints, while disease-associated SNPs are 16.5-fold enriched there, where they can cause aberrant splicing in patients. A potential outcome of the loss of a branchpoint sequence is exon skipping, and we confirm that previously identified mutations in RB1 and the MET oncogene that were found to drive cancer development involve the elimination of branchpoint nucleotides.

Finally, we took a look at branchpoint usage in primate-specific exons, specifically Alu element exonizations. It has previously been noted that Alu elements are ‘‘pre-exons’’ well placed for inclusion into mature RNA transcripts in the inverted orientation. Our results build on this, showing that inverted Alus have a strong Beta-box element just 5′ of an internal polyT tract in their native sequence. This cryptic Beta-box is widely used in exonized Alu elements and likely promotes their exonization.


Reference: Mercer, Clark et al Genome-wide discovery of human splicing branchpoints. 2015. Genome Research. 25: 290-303

Filed under Clarksy's corner

RNAcentral v1.0 is launched

RNAcentral 1.0 provides a single access point to non-coding RNA data, vastly improving research into gene products.

RNAcentral, the first unified resource for all types of non-coding RNA data, has been launched today by the RNAcentral Consortium. It aggregates information from a federation of expert databases, and provides tools for easy browsing. The initial release of RNAcentral contains approximately 8 million sequences.

Since the 1950s, scientists have thought of RNA as an intermediate molecule that provides a link between stable DNA and proteins. However, over the past decade it has become clear that RNA plays a much wider range of roles in living organisms. Researchers have discovered a lot about different types of RNA, but until now these data have not been put in one place.

Before RNAcentral, finding the RNAs encoded by a specific genome required fetching information from several independent resources, for example miRBase for microRNAs and HAVANA for lncRNAs.

“There is plenty of published data on non-coding RNAs, but each subtype is maintained separately,” explains Alex Bateman, head of Protein Sequence Resources at EMBL-EBI. “This is the first time we have a central place where you can find it all: piRNAs, ribosomal RNAs, everything. A lot of that information has typically been locked up in supplementary materials, or referred to only by a non-standard gene name. RNAcentral is a big step towards making RNA sequence as easy to access for research as protein sequence.”

RNAcentral 1.0 offers access to data from ten different expert databases and provides stable accession numbers that can be used consistently in the literature, other molecular databases and search engines. The RNAcentral website features a faceted search, which lets users explore different RNA sequences according to source, species and molecular function. Further expert databases are expected to be included in future releases.

The RNAcentral consortium has its roots in a workshop held on the Wellcome Genome Campus in 2010, where members of the RNA community came together to discuss the lack of centralised access to RNA data.

“It is really satisfying to see this project come to fruition,” explains Sam Griffiths-Jones of The University of Manchester. “When the consortium first met in 2010, we were seeing very rapid growth in ncRNA sequence and functional information and that trend has continued. New types of RNAs continue to be reported and there has never been a greater demand for a universal resource for these data.”

Thanks to funding from the UK’s Biotechnology and Biological Sciences Research Council (BBSRC), partner institutes throughout the world were able to come together and build a practical solution to a shared problem.

BBSRC Chief Executive Professor Jackie Hunter said: “Fundamental research into non-coding RNAs has many potential applications, including disease diagnostics, new therapies and biotechnology. With the abundance of data now available due to next generation DNA sequencing, there is an urgent need for informatics tools to decipher it. RNAcentral is a vital resource that will aggregate and integrate information to unify the data landscape and improve the discoverability and use of data by researchers worldwide.”

The resource uses EMBL-EBI infrastructure, notably data-submission and cross-reference services provided by the European Nucleotide Archive (ENA). It takes advantage of the nightly, global synchronisation of data from the International Nucleotide Sequence Database Collaboration (INSDC).

Future versions of RNAcentral will include additional data types and information about RNA structure, modifications, molecular interactions and function. A paper describing RNAcentral tools and features in detail has been accepted for publication in the journal Nucleic Acids Research.


RNAcentral expert databases

The RNAcentral consortium currently includes 24 RNA database resources. Ten of these are present in the first release: European Nucleotide Archive, Rfam, RefSeq, VEGA, gtRNAdb, RDP, miRBase, tmRNA Website, SRPDB and lncRNAdb, with many others planned for coming releases. See the RNAcentral website for an up-to-date list.


RNAcentral partners

European Bioinformatics Institute (EMBL-EBI), UK; University of Manchester, UK; Wellcome Trust Sanger Institute, UK; University of California Santa Cruz, US; University of Texas, US; Auburn University, US; Sandia National Laboratory, US; University of Oxford, UK; Garvan Institute of Medical Research, Australia; International Institute of Molecular and Cell Biology Warsaw and Adam Mickiewicz University, Poland; Rockefeller University, US; Chinese Academy of Sciences, China; Peking Union Medical College and Taicang Institute of Life Sciences Information, China; Michigan State University, US; National Chiao Tung University, China; Stanford University, US; University of Thessaly, Greece; National Center for Biotechnology Information (NCBI), US.

Source article

RNAcentral Consortium. (2014). RNAcentral: an international database of ncRNA sequences. Nucleic Acids Res. (In revision).

Filed under Uncategorized

Functional annotation of your long non-coding RNA

So you have performed some differential gene expression experiments and have discovered a (few) non-coding RNAs that are of conspicuous interest… What now? Unless you are lucky and someone else has already characterised your needle in a haystack, odds are little is known about this transcript. You might be tempted to paste that .fasta file into mfold and say: “Look! It folds into an RNA secondary structure!” yet this won’t tell you much, besides that your RNA might look like a Christmas tree in February. This video explains how you can find out which regions of your RNA transcript of interest might be responsible for its biological function.

Continue reading

August 27, 2013 · 6:37 am

A novel role for Alu elements in epigenetic trans-regulation of gene networks

Every so often, you stumble across a magnificent work of science. This was the case for me a few weeks ago when this work popped up in my news feed. The authors investigate how a genomic locus that is the strongest risk factor for atherosclerosis produces a regulatory non-coding gene that regulates other genes associated with the disease.

They used stable over-expression and knock-down approaches to investigate the role of distinct ANRIL (a long non-coding RNA, aka lncRNA) isoforms in several key mechanisms of atherogenesis. They show that this gene guides epigenetic effector complexes to specific genomic loci.

Through what molecular mechanism, you ask? None other than via endogenous transposable elements–Alus specifically–that have been harnessed through evolution to regulate gene expression in our genomes. FYI, repetitive elements compose ~46% of the human genome, 20% of which are Alus.
Continue reading

Filed under Smithy's structures

Evolutionary proof that much of our genome is functional

Last year, the massive ENCODE consortium disclosed that over 80% of the human genome appears to be functional through several detailed biochemical experiments. Their findings fuelled an already heated debate regarding the biological pertinence of similar findings. Many old-school biochemists and proponents of the “selfish” DNA hypothesis (who I collectively refer to as junk DNAy-sayers) dismiss the use of such data to support the notion that the majority of the genome is functional.

Amidst the nit-picking, bickering, and refutations, one logical argument stands out that somewhat confounds the ENCODE findings: the lack of detectable evolutionary conservation. Indeed, the statement that > 80% of the human genome sequence is biologically functional lies in stark contrast to the fact that < 9% of it is observed to be conserved throughout mammalian evolution. But is this estimate really accurate?

Continue reading

Filed under Smithy's structures

The genome’s 3D structure shapes how genes are expressed

We have just been lucky enough to have a paper published in Nature Genetics (Mercer et al 2013) showing how the 3D structure of the genome appears to play an important role in gene expression. You can find the original paper here and the press release here.

While it has been known for a few years that many exons are marked by a nucleosome sitting over them, we found that a subset of exons shows the opposite. Instead, these exons seem to have nucleosomes sitting adjacent to them, and this lack of nucleosomes helps create DNaseI hypersensitivity sites (DHS) at these locations. Looking across the vast number of cell types investigated as part of the ENCODE project, we can also see that the DNase sensitivity of these exons is specific to subsets of cells.

Where this gets more interesting is that these DNaseI-marked exons also show ChIP-seq profiles that look like those you find in other parts of the genome.

Continue reading

Filed under Clarksy's corner