Category Archives: Clarksy’s corner

Clarksy’s corner of the net

BMI – Busted mass index?

There’s been a lot of press about a new paper by Tomiyama et al. reporting that using BMI alone incorrectly classifies some 75 million Americans as healthy or unhealthy. The authors make the fair point that relying solely on BMI to gauge health is a bad idea. Much of the news coverage, impressed by that big number, has taken the view that BMI is pretty useless as an indicator of health*.

The paper didn’t have any figures, so I’ve used the results from their Table 2 to show how metabolic health compares with BMI.


Viewed this way, BMI doesn’t look so bad and shows a very clear trend. People who are classed as obese are highly unlikely to be metabolically healthy, while those of normal weight are mostly healthy. People who are overweight but not obese seem to be the grey area: their metabolic health is worse than that of people of normal weight, but BMI is not very instructive at an individual level. Still, there are over 300 million Americans, which means that even if BMI is wrong about the health of 75 million, it is still instructive about 75% of the time.
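The back-of-the-envelope arithmetic behind that 75% can be written out as a quick sketch. It uses the round numbers quoted above (300 million people, 75 million misclassified) rather than exact census or study figures, and the variable names are just for illustration.

```python
# Round figures from the post, not exact census or study numbers.
us_population = 300e6   # "over 300 million" Americans
misclassified = 75e6    # people whose health BMI gets wrong

share_wrong = misclassified / us_population
share_right = 1 - share_wrong

print(f"BMI wrong: {share_wrong:.0%}, BMI right: {share_right:.0%}")
```

Because the population is "over" 300 million, the true share classified correctly on these numbers would be slightly above 75%.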

I guess this is why people have used, and still use, BMI: it is extremely fast, cheap and easy to calculate (you only need to know someone’s weight and height). To properly understand a person’s health, though, you’d also want to know things like blood pressure, smoking status, family history of disease, level of exercise etc.
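To show just how little goes into it, here is a minimal sketch of the calculation: BMI is weight in kilograms divided by the square of height in metres. The category cut-offs are the standard WHO adult thresholds, not anything taken from the Tomiyama paper, and the function names are my own.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def who_category(value: float) -> str:
    """Map a BMI value to the standard WHO adult category."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    if value < 35:
        return "obesity class I"
    if value < 40:
        return "obesity class II"
    return "obesity class III"


# Example: an 80 kg person who is 1.80 m tall.
example = bmi(80, 1.80)
print(round(example, 1), who_category(example))
```

Note that nothing in the calculation knows about muscle mass, blood pressure or anything else, which is exactly why it is cheap and exactly why it can't be the whole story.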

The study was a snapshot of people’s cardio-metabolic health at a single point in time. It didn’t follow people to see whether they got sick (morbidity) or died (mortality). A number of other studies have done this. I’m not a specialist in the field, but the question seems to be somewhat controversial at the moment, with studies disagreeing about when increased weight correlates with increased mortality. One thing these studies do agree on, though, is that people with a BMI above 35 (Obesity II and III) are more likely to die than those in the lower BMI categories.


*Let’s also dispense with the ridiculous trope that always gets brought up about BMI, something along the lines of BMI is no good because it classes some professional athletes as obese. BMI doesn’t distinguish why someone is heavy, so you can have a high BMI due to large muscle mass. However, what percentage of the population are professional athletes with high muscle mass? Barely enough to be a rounding error. So it’s a bad argument based on nit-picking.


Filed under Clarksy's corner

Lariats of fire – Genome-wide discovery of human splicing branchpoints

One of the cool things about science is using new technologies to address previously intractable questions. In a paper just published we did just that to create the first genome-wide map of splicing branchpoints in the human genome. But what are branchpoints and why are they important?

Like many things in biology, RNA splicing is a pretty wondrous process. Small exonic regions are picked out of a vast intronic sea and pasted together to create a mature RNA. Many factors are required for splicing, including the proteins and small RNAs of the spliceosome, as well as nucleic acid sequence motifs that mark the borders of introns and exons and localise spliceosomal components on the RNA.

Crucial sequence features in introns include the branchpoint and the 5′ and 3′ splice sites, which sit at the intronic termini next to the upstream and downstream exons. The branchpoint, which is generally close to the 3′ end of the intron, is recognised early in the splicing process; in recognising it, the spliceosome selects the nearby exon for inclusion in the mRNA. During splicing the 5′ splice site and branchpoint nucleotide are brought together and joined to form an intron lariat. This both frees the upstream exon and brings it into close proximity with the downstream exon, allowing the two exons to be joined and the intron lariat cut adrift (see figure below).

Branchpoints are a vital component in RNA splicing and mutation of branchpoints can disrupt proper splicing and cause diseases such as cancer. In yeast, branchpoints are easy to find because their sequence is always the same. This is not the case in humans, where the sequence motif at the branchpoint is known to vary considerably, making them difficult to confidently identify by sequence analysis. Compounding this issue, the rare and transient nature of intron lariats makes branchpoints difficult to pinpoint experimentally. So, despite having hundreds of thousands of introns, only a few hundred human splicing branchpoints had been previously identified.

In our study we tackled this issue using two complementary experimental techniques that enrich for branchpoint sequences within intron lariats. We identified ~60 000 branchpoints in >10 000 genes, providing a first genome-wide map of splicing branchpoints in the human genome. Having this many branchpoints allowed us to perform a much more comprehensive analysis than has previously been possible, providing some cool new insights into branchpoints and their role in RNA splicing.

So how did we do it?

Firstly we used RNA Capture Sequencing (targeted RNA sequencing/ RNA CaptureSeq). RNA CaptureSeq uses oligo probes as baits to pull out RNA regions of interest and is super sensitive. CaptureSeq has been used previously to find rare mRNAs but is also ideal for rare RNA processing intermediates. Here, instead of capturing exonic sequence, we targeted the 5′ of introns (which loops around to join the branchpoint) and the 3′ of introns (where we predict most branchpoints to be).

Secondly, we used RNaseR digestion. RNaseR digests linear (but not circular) RNA, removing most RNA species and leaving behind lariats and other circRNAs. Although, unlike CaptureSeq, we don’t expect RNaseR to home in on the branchpoint any more than on the rest of the intron, it does give nice enrichment for intronic sequence and hence for branchpoints too.

The figure below gives you an idea of how these methods work compared to standard RNA-seq:

How many reads were required to identify one lariat junction spanning read for different techniques. n is number of sequencing libraries examined. Figure reproduced from Mercer et al.


The reason this works at all is that reverse transcriptase can cross the unusual bond between the 5′ intron end and the branchpoint.

Intron lariat showing the 5’SS of the intron joined to the branchpoint adenosine. Direction of reverse transcriptase is shown by arrow. Figure reproduced from Mercer et al.

This means branchpoint sequences are present in RNA sequencing libraries. The difficulty (see figure below) is that the sequence you get cannot be aligned to the genome in the standard way, but it does give you both the 5′ end of the intron and the branchpoint. So you know not only where the branchpoint is, but also which upstream exon was involved in the splicing event.

Branchpoint Identification

RNA splicing, lariat formation and branchpoint identifying sequence reads. Red-blue line labelled with B & A shows the direction traveled by reverse transcriptase and the branchpoint identifying sequence created.
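To make the idea of a branchpoint-identifying read concrete, here is a toy sketch. It assumes the read, written in the sense orientation, is the intron sequence ending at the branchpoint followed directly by the start of the intron (the order in which the two pieces end up joined across the lariat junction). The intron sequence, the function name `find_branchpoint` and the `min_anchor` parameter are all made up for illustration, and the exact string matching ignores sequencing errors; this is not the alignment pipeline used in the paper.

```python
from typing import Optional

# Hypothetical toy intron (5' -> 3'); the branchpoint A is at index 39.
INTRON = (
    "GTAAGTCTGACC"               # 5' end of the intron
    "TTGGACATCCGTTAGCATCGGATT"   # filler sequence
    "CTAAC"                      # branchpoint region
    "TTTCCTTTCTCTTTCAG"          # polypyrimidine tract and 3' end
)

# Toy lariat read: 10 bases ending at the branchpoint, then the intron start.
READ = INTRON[30:40] + INTRON[:10]


def find_branchpoint(intron: str, read: str, min_anchor: int = 8) -> Optional[int]:
    """Return the 0-based intron index of the putative branchpoint, or None.

    The read is split in two: the second piece must match the very start of
    the intron, and the first piece must match further downstream in the
    intron. The last base of the first piece is then the branchpoint.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        upstream_piece, five_prime_piece = read[:split], read[split:]
        if not intron.startswith(five_prime_piece):
            continue
        # Only search downstream of the region matched by the 5' piece.
        pos = intron.find(upstream_piece, len(five_prime_piece))
        if pos != -1:
            return pos + len(upstream_piece) - 1
    return None


bp = find_branchpoint(INTRON, READ)
print(bp, INTRON[bp])
```

The point to take away is that the two halves of the read map in the "wrong" order relative to the genome, which is why a standard aligner will not place them, and why the junction between them tells you where the branchpoint is.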


Our analysis of ~60 000 branchpoints shows they are “predominantly adenosine, highly conserved, and closely distributed to the 3′ splice site.” We find that multiple branchpoints within an intron are common.

During splicing the U2 snRNA binds to the sequence around the branchpoint, an essential step in productive splicing. We analysed the conserved sequence motif, termed the Beta-box (B-box in the quotes below), that overlaps the branchpoint and interacts with the U2 snRNA, and identified the following distinct features:

  • The density of G & U-residues within U2 snRNA enables greater base-pairing possibilities with Beta-boxes through RNA wobble-base pair interactions. This allows high sequence diversity amongst Beta-boxes while maintaining Beta-box function, similar to previous observations in microRNA seed sequences.
  • The type of base-pairing allowed between U2 snRNA and Beta-boxes makes Beta-box function resistant to disruption by common transition mutations.
  • The abundance of Beta-box families differs widely within the human genome and diverges between metazoan lineages. “Branchpoints with strong U2 binding (strong B-boxes) outcompete those with weak B-boxes … to specify exon inclusion. In addition, U2 binding strength positively correlates with both B-box occurrence and conservation, supporting the importance of the B-box to efficient splicing.”

Beta-box counts, conservation, U2 binding strength and over-representation. Figure reproduced from Mercer et al.

  • “B-box families preferentially associate with distinct classes of intron–exon architecture that can be distinguished by polypyrimidine tract nucleotide content, GC content, and conservation. It has been proposed these alternative architectures correspond to intron- and exon-defined splicing mechanisms. Therefore, B-box motifs contribute a further distinction between these two alternative architectures. This integration of multiple splicing features suggests the coevolution of B-box motifs with the surrounding sequence and their integration into the competitive and compensatory mechanisms that regulate splicing”
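The first two points in the list, about wobble pairing and transition mutations, can be illustrated with a small sketch. It assumes only the standard RNA pairing rules (A-U and G-C Watson-Crick pairs plus the G-U wobble pair) and the definition of a transition (A↔G or C↔U). The function names are hypothetical and no real U2 or Beta-box sequence is used.

```python
# Allowed RNA base pairs: Watson-Crick plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# Transition mutations swap purine for purine or pyrimidine for pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "U", "U": "C"}


def can_pair(u2_base: str, box_base: str) -> bool:
    """True if the two bases can form a Watson-Crick or wobble pair."""
    return (u2_base, box_base) in PAIRS


def survives_transition(u2_base: str, box_base: str) -> bool:
    """True if pairing holds before and after a transition in the Beta-box base."""
    return can_pair(u2_base, box_base) and can_pair(u2_base, TRANSITION[box_base])


for u2_base in "GUAC":
    for box_base in "ACGU":
        if can_pair(u2_base, box_base):
            verdict = "kept" if survives_transition(u2_base, box_base) else "lost"
            print(f"U2 {u2_base} : box {box_base} -> pairing {verdict} after transition")
```

Every pair involving a G or U on the U2 side keeps pairing after a transition in the Beta-box, while pairs with an A or C on the U2 side do not. That is the sense in which a G- and U-rich U2 sequence tolerates both sequence diversity and common transition mutations.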

Looking at common and disease SNPs at branchpoints, we find that branchpoints are ~3.1-fold depleted in common SNPs, while disease-associated SNPs are 16.5-fold enriched at branchpoints, where they can cause aberrant splicing in patients. A potential outcome of losing a branchpoint sequence is exon skipping, and we confirm that previously identified cancer-driving mutations in RB1 and the MET oncogene involve the elimination of branchpoint nucleotides.

Finally, we took a look at branchpoint usage in primate-specific exons, specifically Alu element exonizations. It has previously been noted that Alu elements are ‘‘pre-exons’’, well placed for inclusion into mature RNA transcripts in the inverted orientation. Our results build on this, showing that inverted Alus have a strong Beta-box element just 5′ of an internal polyT tract in their native sequence. This cryptic Beta-box is widely used in exonized Alu elements and likely promotes their exonization.


Reference: Mercer, Clark et al. (2015). Genome-wide discovery of human splicing branchpoints. Genome Research 25: 290-303.


The genome’s 3D structure shapes how genes are expressed

We have just been lucky enough to have a paper published in Nature Genetics (Mercer et al 2013) showing how the 3D structure of the genome appears to play an important role in gene expression. You can find the original paper here and the press release here.

While it has been known for a few years that many exons are marked by a nucleosome sitting over them, we found that a subset of exons show the opposite. Instead, these exons seem to have nucleosomes sitting adjacent to them, and this lack of nucleosomes helps create DNaseI hypersensitivity sites (DHS) at these locations. Looking across the vast number of cell types investigated as part of the ENCODE project, we can also see that the DNase sensitivity of these exons is specific to subsets of cells.

Where this gets more interesting is that these DNaseI-marked exons also show ChIP-seq profiles that look like those you find in other parts of the genome. Continue reading


The end has no end?

There is an interesting new paper out in Genome Research from Eric Lai’s lab (Miura et al. 2013) that finds many genes have much longer 3’UTRs than previously annotated. Sometimes these extended 3’UTRs look constitutive; other times the authors found alternative gene isoforms with 3’UTRs that terminate transcription (on average) several kb further downstream.

In some ways this isn’t too surprising. Having spent a lot of time these past years gazing at the UCSC genome browser, I find it clear that 3’UTRs keep getting longer and longer. For those in the lncRNA field, this presents some difficulties in determining whether an RNA downstream from a 3’UTR in the sense direction is an independent transcript with its own start site, a processed RNA from the 3’UTR, or part of the UTR for which transcription joining the two hasn’t yet been found. Generally, transcripts downstream from a 3’UTR that pass whatever cutoffs a study imposes will look like (and be called) long noncoding RNAs (lncRNAs). Continue reading


What does the human genome Encode?

So the ENCODE consortium, which aims to catalogue all the functional elements in the human genome, has just published its results to date in 30 articles in Nature, Genome Research and Genome Biology. As far as I can tell all the papers are open access, so anyone can read them. A reasonably useful place to start is Nature’s ENCODE explorer.

To date I have only read a few of the new ENCODE papers (and no doubt will be busy reading more over the next week or two) but here are some of my initial thoughts.

1. As far as biology goes, the new ENCODE results don’t provide a lot of big novel insights. Many of the conclusions in ENCODE 2012 are the same as those from the ENCODE pilot paper in 2007, except now they are over all the genome, not just 1%. Continue reading


Do you want your genome sequenced?

After reading Mike Snyder’s recent (and very cool) mega-omics-on-self paper, it dawned on me that I am not the only Mike Clark working in genomics. As I’m sure my scientific (but more beardy) doppelganger can no doubt vouch, this happens all the time when you have a common name (there were, I believe, three Mike Clarks at my high school).

The Stanford Mike “Geneticist extraordinaire” Clark also does some science-based blogging and earlier this year outlined why he is going to get his exome sequenced. It’s a persuasive piece which I encourage everyone to read, although for many people the actual answer to the question “should I have my genome sequenced?” may be determined more by legal and insurance implications than by whether or not they would like to know their genetic code. In my opinion, Stanford Mike’s first reason of curiosity (both from a personal and scientific point of view) is reason enough, but I also think the potential medical benefits of knowing your genome are going to become progressively larger. As Stanford Mike says:

“Moreover, as more and more information regarding the genetic causes of various traits and diseases are discovered, my exome sequence will always be at hand for me to cross-reference. Imagine that tomorrow a study is released identifying a gene that tells you with complete confidence whether or not you’ll get type 2 diabetes. I would check that gene in my own exome for mutations immediately!
That may sound unrealistic, but when it comes to conditions like cancer, these kinds of studies come out all the time. I may identify a random mutation in a gene that pre-disposes people to getting a particular type of cancer in my own genome, and then I will know that I need to have my doctor monitor for that. Having worked closely on brain cancer for a few years, it struck me that the reason it’s the deadliest type of cancer is because by the time we detect it, it’s already at a very advanced stage. But if we have a gene or set of genes that we know predisposes people to get malignant brain tumors, we could look in our own exomes for mutations in those genes and then get ourselves MRIs starting at a particular age to try to detect them earlier and hopefully allow effective, long-term treatment.”

Something quite similar to this came out of the Mike Snyder mega-omics-on-self data, where they predicted an increased risk of diabetes based on his genome and then, during the course of the study, observed the onset of diabetes. Blood glucose tests are cheap and type-2 diabetes is reversible with lifestyle changes (as Mike found), so this was a clear example of the benefit of monitoring yourself for diseases you have a predisposition to.

Some other members of the Mattick Lab and I caught up with Mike Snyder at HGM earlier this year; unsurprisingly, the subject of sequencing your own genome/exome was the main topic of conversation. While we were chatting about a particular phenotype I have, but which is not found elsewhere in my family, Mike suggested I should get my genome sequenced, something which is quite tempting. So what does everyone else think: do you want your genome sequenced?


Heart of darkness?

The extent of the genome that is transcribed (known as pervasive transcription) and the amount of “dark matter” RNA (uncharacterised and/or function unknown transcripts) produced by the cell has ignited a few controversies over the years.

This month marks 2 years since the last salvo from the pervasive transcription ‘skeptics’, van Bakel et al. 2010, was published in PLoS Biology. van Bakel et al. claimed, amongst other things, that previous studies had overestimated the pervasiveness of transcription due to false positives, that the small percentage of RNA “dark matter” in their RNA sequencing datasets supported this conclusion, and that much of the low-level transcription found in intergenic regions was probably transcriptional by-products (i.e. some sort of noise, be it technical or biological).

We were critical of many aspects of this study in a reply we published in PLoS in 2011 (further info here). Now that another year has passed, it’s probably worth seeing how their conclusions are holding up. Continue reading
