Switch Scores for EteRNA Switch Puzzles

  • Updated 1 year ago
An exciting direction in EteRNA is the study of riboswitches!

We have recently finished our pilot experiments with great initial success. Using a new technique that measures switching directly on a sequencing chip, we can observe the switching of thousands of designs at once. The signal is generated by a fluorescent RNA-binding protein, MS2. Instead of the standard EteRNA score, which is based on the correct folding of each base, we have introduced a new Switch Score.

The Switch Score (0 - 100) has three components:
1) The Switch Subscore (0 - 40)
2) The Baseline Subscore (0 - 30)
3) The Folding Subscore (0 - 30)
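
As a minimal sketch of how the three components add up: the subscore formulas themselves are in the PDF and are not reproduced here, and the function name and the clamping of out-of-range inputs are assumptions made only for illustration.

```python
def total_switch_score(switch, baseline, folding):
    """Sum the three subscores into the 0-100 Switch Score.

    The ranges come from the list above. Clamping out-of-range inputs is an
    assumption of this sketch, not part of the published scheme.
    """
    limits = {"switch": 40.0, "baseline": 30.0, "folding": 30.0}
    parts = {"switch": switch, "baseline": baseline, "folding": folding}
    return sum(min(max(parts[name], 0.0), limits[name]) for name in limits)


# A design with a perfect hairpin but only half of the switch points.
print(total_switch_score(switch=20, baseline=30, folding=30))
```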

The scoring scheme is summarized below. A more detailed description is given in this PDF:
https://drive.google.com/open?id=0B_N0OA9NROPGel80SG5LM0wtZms&authuser=0

A typical example of a switch puzzle is shown below:


The player designs the structures in [1*] and [2]. To observe the switching we then measure the fluorescent signal of MS2, which binds specifically to the MS2 hairpin seen in [2]. In the absence of FMN, the MS2 should bind and the switch is ON. On the other hand, if we introduce FMN, the ligand in [1*], the switch should be OFF and not exhibit fluorescence.

No switch is 100% ON or OFF in the absence or presence of ligand, but a good switch can come very close (and get a perfect EteRNA Switch Score!). At some MS2 concentration, the difference should be large (e.g., at ~100 nM MS2 in the figure below). In practice, we don't know this concentration beforehand, so instead we perform measurements at many concentrations to obtain binding curves. When the switch turns OFF (red curve), the effective dissociation constant increases. The dissociation constant, Kd, is the concentration where half of the RNA binds MS2.
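
To make the binding-curve picture concrete, here is a small sketch that assumes a simple one-site binding model, which is consistent with Kd being the concentration where half of the RNA is bound. The two Kd values are made up for illustration and are not measured data.

```python
def fraction_bound(ms2_nM, kd_nM):
    """Fraction of RNA bound to MS2, assuming simple one-site binding."""
    return ms2_nM / (kd_nM + ms2_nM)


# Hypothetical Kd values, chosen only for illustration.
KD_ON_NM = 10.0      # no FMN: tight MS2 binding (blue curve)
KD_OFF_NM = 1000.0   # with FMN: weak MS2 binding (red curve)

for conc in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    on = fraction_bound(conc, KD_ON_NM)
    off = fraction_bound(conc, KD_OFF_NM)
    print(f"{conc:>8.0f} nM  ON={on:.2f}  OFF={off:.2f}  diff={on - off:.2f}")
```

In this model the ON/OFF gap is widest at the geometric mean of the two Kd values (100 nM for the numbers above), and it shrinks at both very low and very high MS2 concentrations. That is why a single measurement concentration is not enough.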


The Switch Subscore quantifies how far apart the Kd's are in the absence and presence of FMN (horizontal distance between the red and blue curves).

The Baseline Subscore is a measure of how close the ON-state is to the original MS2 hairpin (lower Kd is better, i.e., the blue curve should be far to the left).

The Folding Subscore is high if MS2 binds properly in the ON-state at any concentration (the signal should be high for the blue curve at high concentrations of MS2, i.e., high values to the right).

In our first experiments, we found that the easiest score to maximize is the Folding Subscore, followed by the Baseline Subscore. These two ensure that the MS2 hairpin is properly formed in the ON-state. The hard one is the Switch Subscore, which is highest when the energy difference between the states is finely tuned to the energy conferred by binding to FMN (or other future ligands).

johana, Researcher

Posted 5 years ago


Eli Fisker

Thx Johana and Jnicol!

This is amazing and long awaited news. It was worth the wait. :)

I have written up a few early thoughts on the data we got back, and on what I think might work well for this kind of switch.

Thoughts about the lab results

I look forward to hearing what all of you players think about the results from our favorite fluorescent puzzle. :)

johana, Researcher

Fantastic writeup!!

This was very interesting to read and I think that you found some very fascinating trends regarding the placement of the MS2 hairpin and the distance to the complementary segments.

I'm delighted and hope that a lot of players take a look at your findings.

Great work! I believe that the same labs are now up again for another round, so hopefully your ideas will lead to even better results. In the future we should perhaps also make puzzles that vary the position on purpose to test this idea more systematically.

whbob

In the chart of high baseline and folding subscores, a delta of about 10 between baseline and switch scores could be a trend for higher overall scores.
If the switch works better with the MS2 towards the middle between the aptamer sequences, does that seem to indicate that the MS2 is happier when it has more buffer space on either side of its stem?

johana, Researcher

Thanks for asking some great questions!

1. By yield, I assume that you mean the number of designs that can be tested in one round.

Theoretical yield: The current protocol is primarily limited by the number of sequences we can order in one oligopool synthesis. That number is currently 92,918 (http://www.customarrayinc.com/oligos_...).

Actual yield: Our experience so far, from both the EteRNA pilot and other experiments in our lab, points to an actual yield higher than 95%. In these experiments we duplicated the sequences on the chip, effectively reducing the number by 2X or 10X. However, we did measure 96% of 46000 sequences so we are hopeful that we will consistently get 95% yields in future EteRNA rounds, even when sequences are not duplicated.

Theoretical turnaround: 2.5 weeks (1 week synthesis, 2 days PCR for sequencing library preparation, 1 day for sequencing, 4 days of data collection, 4 days of data crunching).

Actual turnaround: Expect a month. Things break, experiments sometimes fail, small steps take longer than anticipated, and even scientists need to sleep. We hope you understand :-).
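
The numbers in this answer can be checked with a few lines of arithmetic. This is only a sketch: the pool size, the 95% figure and the step durations are taken from the text above, and the variable names are illustrative.

```python
POOL_SIZE = 92_918         # sequences per oligopool synthesis (from the post)
MEASURED_FRACTION = 0.95   # hoped-for yield (from the post)

# Distinct designs per round when each design is placed 1, 2 or 10 times.
for copies in (1, 2, 10):
    designs = POOL_SIZE // copies
    measured = round(designs * MEASURED_FRACTION)
    print(f"{copies:>2}x copies: {designs} designs, ~{measured} measured")

# Theoretical turnaround in days, from the steps listed above.
steps = {"synthesis": 7, "PCR": 2, "sequencing": 1,
         "data collection": 4, "data crunching": 4}
total_days = sum(steps.values())
print(total_days, "days =", round(total_days / 7, 1), "weeks")
```

With 2X duplication the pool holds about 46,000 distinct designs, which matches the 46000 sequences mentioned above, and the listed steps add up to 18 days.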

2. Excellent point. The fit errors were not reported in the figures, for clarity (there was already a lot of text), but I will try to post them in a spreadsheet online.

The figures with the curves also show each data point as a light-colored dot. As you notice there is a large spread between the RNA clusters. We use the median values for fitting.
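
For readers who want to try this on the posted data, here is a rough sketch of "take the median per concentration, then fit". It is not the lab's actual fitting code: the one-site binding model, the grid search and all names are assumptions, and the data at the bottom is synthetic.

```python
import statistics


def median_curve(clusters_by_conc):
    """Collapse per-cluster signals to one median value per concentration."""
    return [(c, statistics.median(v)) for c, v in sorted(clusters_by_conc.items())]


def fit_kd(points):
    """Fit signal = fmax * c / (kd + c) to (concentration, signal) pairs.

    Scans kd on a log grid from 0.1 nM to 100 uM. For each kd the best fmax
    has a closed form (linear least squares through the origin).
    """
    best = None
    for i in range(601):
        kd = 10 ** (-1 + i / 100)
        x = [c / (kd + c) for c, _ in points]
        y = [s for _, s in points]
        fmax = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        err = sum((fmax * a - b) ** 2 for a, b in zip(x, y))
        if best is None or err < best[0]:
            best = (err, kd, fmax)
    return best[1], best[2]


# Synthetic example: true Kd = 50 nM, three clusters with a large spread.
data = {}
for conc in (1.0, 10.0, 100.0, 1000.0, 10000.0):
    signal = conc / (50.0 + conc)
    data[conc] = [0.5 * signal, signal, 2.0 * signal]

kd, fmax = fit_kd(median_curve(data))
print(f"fitted Kd ~ {kd:.1f} nM, fmax ~ {fmax:.2f}")
```

The median is used here for the same reason as in the post: a single very bright or very dim cluster does not pull the fitted curve.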

The round-to-round variation is currently unknown, but we hope to repeat the pilot round next time to quantify this. Luckily, those sequences do not need to be included in the synthesis. We are still optimizing the protocol and will, for example, increase the laser exposure next time to achieve a higher signal. Our hope is that the normalization by the internal MS2 control will take care of some of the reproducibility issues associated with this and other currently uncharacterized sources of variation.

Eli Fisker

Johan, I love your explanation of the known phenomenon, the difference between theoretical and actual lab data return time.

"Things break, experiments sometimes fail, small steps take longer than anticipated, and even scientists need to sleep. We hope you understand :-)."

Brourd

I hear ya about the sleep being needed!

Thank you for the clear and concise answers, Dr. Andreasson.

Meechl

I was unable to resist the lure of the MS2 data any longer. Naturally, I made a spreadsheet:

MS2 Spreadsheet

Unfortunately, it doesn't have all the data I'd like to add to it, such as the theoretical folding shape and the number of AU/GU/GC pairs in the second state, but one should be able to make some pretty graphs for the free energy, melting point, and such. I myself should probably be working on some other things though, so I have no graphs to share... for now. :)

Eli Fisker

Beautiful :)

Big thx, Meechl!

johana, Researcher

This is great.

We are working on more details and Nando gave me a link to a script that should return the energy of the two states. If you feel adventurous you could try it:
http://nando.eternadev.org/web/script...

Omei Turnbull, Player Developer

Meechl and salish99 - I've been working on getting something similar, with my main objective being the switching charts.

Although I'm using a spreadsheet as an interim step, the end result will be a Google fusion table. One of the many cool features about fusion tables is that they can be "merged", which is the equivalent of doing a join on SQL tables.

It's easy to move data between fusion table and spreadsheet formats, using .csv files. So instead of collecting all the data fields I have been, I am going to concentrate only on the ones I need that Meechl doesn't already have in her spreadsheet. When I'm done (hopefully today), I'll convert them both into fusion tables, merge them, and publish the result.

I'm bringing this up now, because it could serve as a more general mechanism for collaboration on gathering data. If anyone makes a spreadsheet where one column is the Eterna solution ID (e.g. 4789644 for Helter Skelter 2), it can easily be merged with everything others have collected.

Once data is in the merged fusion table, anyone can do many cool things with the stock fusion table UI. But even more, tool builders can use the RESTful API automatically available for fusion tables to quickly access whatever part of the data is of interest to them, and then focus their effort on exploring more customized, Eterna-specific presentations of the data.
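
The same kind of merge can be sketched outside of fusion tables with plain .csv text. This is only an illustration of joining on the solution ID: the column names and all values except the solution ID 4789644 are placeholders, and it does not use the fusion table API.

```python
import csv
import io


def merge_on_solution_id(csv_a, csv_b, key="solution_id"):
    """Inner-join two CSV texts on a shared solution-ID column."""
    rows_b = {row[key]: row for row in csv.DictReader(io.StringIO(csv_b))}
    merged = []
    for row in csv.DictReader(io.StringIO(csv_a)):
        match = rows_b.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Placeholder data: one row per solution, one column holding the solution ID.
scores = "solution_id,title,score\n4789644,Helter Skelter 2,0\n1,Other,0\n"
energies = "solution_id,free_energy\n4789644,-1.0\n"

for row in merge_on_solution_id(scores, energies):
    print(row)
```

Only solutions present in both sheets survive an inner join like this one, which is why every contributed spreadsheet needs the solution ID column.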

salish99

nice work.
Just to confirm - did you mean to say "one column is for one solution"?, not one row?

Omei Turnbull, Player Developer

Each row should hold the data for one solution, yes. But one of the spreadsheet columns should be for the solution ID, since that's what is needed for merging the data.

Eli Fisker

A few ideas for designing

- For those who are short on ideas for making designs for the MS2 lab, I suggest either modifying a design from last round or modifying designs submitted in this round.

- Having a design with only minor changes compared to one with a known score is very helpful for analysis and for finding out what really works, as one can see which small changes improve things and which make them worse.

- I have been making something like 5 to 10 mods of a design from last round. The idea is to make a small cluster of data around a design with a known score. That way I have something to compare against, which might be able to tell me what I want to know. It's sort of a way to test multiple hypotheses from a known spot.
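
One simple way to generate such a cluster of mods is to list every single-base change at the unlocked positions and then pick the ones that test a specific idea. This is a hypothetical helper, not an Eterna tool, and the locked positions in the example are made up.

```python
BASES = "ACGU"


def single_base_mods(sequence, locked=()):
    """Return every design that differs from `sequence` at one unlocked base."""
    mods = []
    for i, old in enumerate(sequence):
        if i in locked:
            continue  # locked bases (e.g. aptamer or MS2 hairpin) stay fixed
        for new in BASES:
            if new != old:
                mods.append(sequence[:i] + new + sequence[i + 1:])
    return mods


# Toy example: a 6-base sequence with positions 0 and 1 locked.
for mod in single_base_mods("GGAUCC", locked={0, 1}):
    print(mod)
```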

Eli Fisker

Test the same ideas in many different designs

I have often ended up making somewhere between 5 and 20 mods of a design, either one I modify from another player or one I start from scratch. I think of these sibling designs as clusters.

I expect my small data clusters to be helpful for comparison. So instead of comparing 100 different designs with each other, I am testing multiple puzzles with the same set of ideas.

I'm testing many of the same things in different designs, in the hope that I can find out whether e.g. a segment will do well in several different designs and is not just a coincidence in one.

So basically, two designs will be two data clusters that have many of the mods in common.

Eli Fisker

Optimal range of different bases in relation to each other

I believe there is an optimal % range of G's, A's, U's and C's for the good designs in the MS2 labs. I went through 10 of the top scoring designs from round 1 in the MS2 lab and noted their base distribution. For some bases (C and G) those ranges were quite narrow and close in number, while the range of A's and U's varied more.



- G’s have a narrow range (15-21%)

- C’s have a narrower range than G’s (10-15%)

- U’s range vary the most (6-24%)

- A’s are the most prevalent base, covering a range from 49-65%

- The high scorers generally have more U’s than C’s.

- The high scorers generally have more G’s than Cs.

- The high scorers generally have equal amount or higher % of A's, than all of the other 3 bases together.

Around 37% of the bases are fixed, due to the locked bases.

I expect there to be slight variation between sub labs. I also expect the ranges to widen for the next round. Still, I believe there is some optimal range for the highest scoring designs, which will be useful for weeding out bad solutions without looking at base pairs, which we can’t know anyway. The base content, however, is fixed.

I believe we can use the knowledge that C’s will generally be less numerous than G’s, and that A’s will generally be at least as numerous as the other bases together.

I simply think we can use these base % relationships to our advantage for picking out winners. The truer range will display itself to us soon, when we get more data back.
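
A quick sketch of how a design could be checked against the ranges listed above. The ranges are the ones noted from the ten top scorers of round 1. The function names are illustrative, and treating the ranges as a hard pass/fail filter is an assumption rather than a tested rule.

```python
from collections import Counter

# Percent ranges noted above for ten top-scoring round 1 designs.
TOP_SCORER_RANGES = {"G": (15, 21), "C": (10, 15), "U": (6, 24), "A": (49, 65)}


def base_percentages(sequence):
    """Percentage of each base in the sequence."""
    counts = Counter(sequence)
    return {base: 100.0 * counts[base] / len(sequence) for base in "ACGU"}


def within_ranges(sequence, ranges=TOP_SCORER_RANGES):
    """True if every base percentage falls inside its range."""
    pct = base_percentages(sequence)
    return all(low <= pct[base] <= high for base, (low, high) in ranges.items())


# Toy 20-mer: 55% A, 20% G, 15% C, 10% U.
toy = "A" * 11 + "G" * 4 + "C" * 3 + "U" * 2
print(base_percentages(toy), within_ranges(toy))
```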

Omei Turnbull, Player Developer

Eli, I think this is a good thing to be investigating. Independent of Meechl, I added these stats to the merged fusion table.

Something to think about is that if a statistic, or range of statistics, is proposed for picking out good designs, it needs to not only pick out high scoring designs but also reject low scoring designs. With that in mind, here's a graph of all the designs from Round 1, with each of the base percentages plotted against the Eterna score.



The general impression it gives me is that the ranges for high scores are not all that different from the ranges of other scores. But this is with all designs lumped together; looking at filtered subsets of designs may show something else entirely.


Also, the fusion table has a lot more designs with scores of 71 or better. Did you manually exclude some of the designs because of obviously low data counts? Or maybe the fusion table has got some invalid records in it; I haven't checked it carefully. If you find any, let me know.
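
The point about rejecting low scorers can be checked numerically by comparing how often a proposed filter accepts high-scoring versus low-scoring designs. This is a sketch with made-up data: the threshold of 70 and the example filter are assumptions for illustration only.

```python
def filter_report(designs, passes, high_score=70):
    """Compare how often a filter accepts high- vs. low-scoring designs.

    designs: iterable of (sequence, score) pairs.
    passes: function taking a sequence and returning True or False.
    """
    designs = list(designs)
    high = [passes(seq) for seq, score in designs if score >= high_score]
    low = [passes(seq) for seq, score in designs if score < high_score]

    def rate(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    return {"high_pass_rate": rate(high), "low_pass_rate": rate(low)}


# Made-up designs and a made-up filter ("at least half of the bases are A").
toy_designs = [("AAAA", 80), ("AAGC", 75), ("AAAU", 40), ("GGCC", 30)]
print(filter_report(toy_designs, lambda s: s.count("A") >= len(s) / 2))
```

A useful filter shows a high pass rate for the high scorers and a clearly lower one for the rest. If both rates are similar, the statistic is not separating good designs from bad ones.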

salish99

nice

Eli Fisker

Hehe, wow!

I went to bed having posted a hideous handwritten note, and I woke up to this.

Awesome spreadsheet and fusion table enhancements, beautiful graphs and thoughtful analysis.

You guys are amazing! :)

Eli Fisker

Omei, you asked where I found the top scorers and why I didn't find more.

I got them from the total lab overview, where all the sub labs get shown in one lab:

http://eterna.cmu.edu/web/browse/4736...

And I can see that I somehow got my sorting wrong, as I didn't see the complete set of designs scoring over 70 that was actually included. Thx for catching this.

salish99

top score was 79.96

Eli Fisker

The aptamer matrix

Lately I have been talking a lot about C and G segments in relation to the MS2 lab and how they help make the switch happen.

Aptamer sequences in green boxes, G and C segments highlighted in blue.


I kept seeing them pop up around the MS2 hairpin, but also around the aptamer and it got me wondering if something similar happened for FMN aptamers in general.

Which color door to take?

The MS2 lab results sent me down a rabbit hole to the old switch lab data.

Image credit

Try to imagine the stems around the aptamer as gates, each with a double door, where each strand is a door.

The image above looks like a pretty solid house. But really the aptamer house is more like a tent with one or more doors, and the person inside the tent is the molecule.



Some aptamer tents will have two doors, because both stems around the aptamer will be active in the switching area when the aptamer gets turned off.

But aptamers can also have just one gate, at either end. The doors will have different colors: one door will be mostly green/blue and the other will be mostly red. Which is which depends on several things.

While I was working, one question led me to the next. I started a spreadsheet and ticked off different options to get a clearer idea of what was going on.

Double gated aptamers

The clearest pattern showed up for the double gated aptamers, the aptamers that have the switching area on both sides. There, the first two doors on the right-hand side, going through the aptamer from the sequence start, tend to be mainly green and blue, especially if there are no multiloops in either state. Most of these switches were also short-range moves.


Spreadsheet: Patterns for segments in switching area next to aptamer

Example:




And here is the reason why this pattern shows up for the short-range switches in which all the stems around the aptamer move: the locked aptamer sequence itself creates the repeat.



Something Brourd has pointed out.

Gate before the aptamer (Switching area before FMN)

Those aptamers that had the switching area before the aptamer were among the higher scoring switch labs. Most of them were also only partially moving switches, which I have pointed out earlier is a far more successful strategy for solving a switch than having the whole switch move, except if the fully moving switch is short. Different types of switches

These aptamer gates have a different pattern for the first door. Some have green, others have red.

What does fit, however, is that if the switch moves backward - the aptamer door pairs up with a sequence with a lower base number than it had before - then the door is red. Those were short-range switches.

Example of red switching gate, moving backwards.


If the switch moves forward - the aptamer door pairs up with a sequence with a higher base number than it had before - then the door is green. Those were long range switches.

Red door first - aptamer before or after switching area

These 3 labs with the red door first also stood out in one more way. They have their magnet sequence before the aptamer, placed in an area of unpaired bases. Two of the labs have it in the hook area (Stratospheric and My screw up) and the last has it in a multiloop ring area (Top Notch). All of them have the magnet sequence at short range before the aptamer.

I have mentioned earlier that multiloops are good. I suspect multiloops are actively helping make the switch happen and are good to have in the unbound state at least. And some of the best scoring labs had multiloops in both states. I think one may have more options for switching with a multiloop (Top Notch) and 2 stems moving than with a big internal loop with one stem moving and a loop area becoming stem. Though the Top Notch lab shows that it is possible.

Green door first

Most of the time, the first aptamer door one goes through will be green. That's the overall tendency.

Gate after the aptamer (Switching area after FMN)

There were only very few switch labs of this type.

Mixed pattern for aptamer door/s

Some labs have mixed patterns for the doors, with doors that are neither green nor red. But I suspect we might still be able to use the knowledge that a certain pattern occurs often, even though we cannot yet fully predict the colors and bases around the aptamer.
Now, not every switch lab has the aptamer closed by double G’s paired with double C’s or similar G and C/U heavy strands. For short stems after an aptamer, even lines of A/U’s can sometimes do the job. However, if the design can’t go for the stronger C element, there is often a number of U’s instead at a similar spot.

MS2 segments as door magnets

I am starting to think of the G and C segments that help get the MS2 hairpin moving as magnets: magnets that catch the door when it swings open, in addition to closing the aptamer loop.

So there are both doors and magnets for switching. And I think this is why the aptamer house with a single door is often closed with e.g. two GC's lining up, even though that is a bad choice for something one wants to get moving. But they are short, usually 2, 3 or 4 base pairs long. And if they are 3 or 4 long, then not all the base pairs are GC's in line.

Machinelves: Physics and chemistry is all about balance and proportion, so it doesn't matter if GC is strong, if there is an equal weight pulling elsewhere

I also think this is why the magnets that pull the MS2 hairpin open are usually a bit longer than those normally needed to open up the aptamer doors. Sometimes the one aptamer door and the MS2 magnet team up to do the task together. As the MS2 sequence haves

So why am I talking about these aptamer doors? Because I think it matters what colors we paint them, and that will depend on where the gates are and how many there are. It will also depend on things like whether the sequence near the aptamer moves backward or forward compared to the sequence it pairs with, and whether there are multiloops in one or both states. And this again will determine what magnet sequence to use in the design and where.

I have been seeing something else for the old switches. Which kind of makes sense now. The leading first door in or out or both, is most often green or bluish. That leaves the last aptamer door when walking around along the sequence, red. This door will often want an outside magnet sequence to help it open up. Which can explain that there are often one or several stretches of magnet sequences outside of the aptamer area that are blue and green.

Aptamer doors versus MS2 segments

I have been struggling with getting my mind around this segment thing. Because I saw the G and C segments show up around aptamer sequence and sometimes also MS2 sequence, but they didn’t seem to be working in the same way.

The aptamer has a gate made of two doors that needs to close in one state but be open in another. But the MS2 hairpin needs to have its G and C segments, if it has both, out of sync. It can't have them both close to the MS2 sequence as if they were doors, because then they would rather pair up with each other and under no condition let the MS2 sequence out to play. So the MS2 hairpin will not open up.

Machinelves suggested they could be swivel doors like in a western saloon - never locking closed. And you can imagine each door being placed far apart, not even across from each other.

Or instead, try to imagine that the MS2 hairpin has one set of doors, hidden up inside the MS2 hairpin. Each door can then go on to form either one or two gates with outside magnet segments.

Often the red segment of the MS2 goes far after the MS2 sequence, to sometimes fuse with the red door of the aptamer, sometimes even using the red bases in the aptamer itself.

Both FMN aptamers and the MS2 sequence have one thing in common: they both often like to use G and C hook segments, lines of C's and G's that they themselves can grab onto to get the switch moving.

MS2 element is made for switching (Has 3 out of 4 as G’s, has 3 C’s in line)
FMN aptamer element is made for switching. (Has two double G’s in line)
Orange marked area shows where aptamer doors often turn up.



Sum on aptamer gates

I basically see the same mechanism in play in the MS2 labs for both aptamers and MS2, but also that FMN aptamers have patterns in common for how they like to get solved, depending on whether the sequence around the aptamer moves backwards or forwards, and depending on which of the stems around the aptamer is in an active switching area.

But unlike the MS2 segments, which are usually placed apart because they are generally longer and lines of C’s and G’s love to pair, the G and C segments for the aptamer are often smaller and weaker, and often pair so as to close the one end of the aptamer, on the side that is involved in the switching.

Most of these double gated aptamers had in common that the whole switch was moving too, which I earlier have mentioned is less than optimal for gaining a high score. Different types of switches

So I wonder what will be best. I think one of the 3 types of aptamer doors will be better than the others. For now, the most successful ones are the few that start with a red door and a short-range switch and have the switching area before the aptamer. I wonder how the aptamers with a switching area on both sides will do in designs that have a smaller actual switching area. Will they do better if given better starting conditions? I think I have my bet mainly on single gated aptamers. And for now, a switching area before the aptamer looks best. But when other elements like MS2 bricks need to get involved, I think I’m in favor of having the MS2 side held between the aptamer sequences. The MS2 lab that had the worst average Eterna score was Exclusion 1, which had the MS2 sequence outside of, not in between, the two aptamer sequences. Source: Omei's screenshot

Acknowledgement

Thx to Machinelves for listening to my lab talk, commenting and helping me visualize and put mental images to this - with gatekeepers, aptamer houses and swivel doors.

Perspective

I didn’t start out on this journey with any particular aim, but the aim turned into an effort to predict the sequence around aptamers and MS2 magnets. I can’t do this fully, but as I have tried to show, there already seem to be tendencies for aptamer closing segments in the small set of switch labs we have solved.

I have pointed out that the mentioned patterns turn up in specific kinds of aptamer settings and seem to get the switch solved. But do they also indicate that the aptamer will work well? Some adjustments to this idea might be needed to get the best possible overall switch along with solid aptamer binding.

I also think there is something like a good MS2 magnet position before and/or after the MS2 sequence, just like with the aptamer house. This will depend partly on the position of the MS2 sequence relative to the aptamer sequences. But I think there will be a pattern for when a red or a green magnet sequence is needed before or after the MS2 pin, and how close it should be positioned to the MS2 sequence, likely also depending on which way the switching area moves.

We still have rather few switch labs, but I suspect that when we get more, a clearer picture will emerge. I hope it will be a help for a start.

I’m certain there are many more factors to take in and more connections to be gained from the spreadsheet and with more lab data. You may see other connections than I do. Please bring up what YOU see, so we can all get better at solving switches for lab and help each other further science and medicine. Your turn. :)

Resources
Image document: Where is my switching area versus my aptamer
Spreadsheet: Patterns for segments in switching area next to aptamer

rhiju, Researcher

based on this magnet/door picture, can you make any predictions? That is, do you see any pairs of sequences that johan is synthesizing in the current FMN/MS2 round where you can predict one design will switch better than the other? [Or will we need to design special experiments in the next round to explicitly test?]

Eli Fisker

Hi Rhiju!

I have tried to answer your question in this post below:

https://getsatisfaction.com/eternagam...

salish99

I would assume that these are the calculated, and not the measured shape data, Eli.

Eli Fisker

We don't have SHAPE data for the MS2 labs. More like fluorescent green signals. So we don't know the exact shapes. You can read a bit more about the experiment here:

http://eterna.cmu.edu/web/lab/4736274/

salish99

my point exactly

johana, Researcher

I found Brourd's plots so great that I decided to replicate them, this time also including all the ~1100 sequences that resulted in no clusters and hence weren't included in the dataset I posted online.

I tried to match the color scheme. To fit the data on a log-scale for the y-axis, all 0 values have been changed to 0.5.
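
The zero-handling described here can be sketched in a couple of lines. The 0.5 placeholder is the value mentioned above, while the function name is only illustrative.

```python
import math


def log_axis_value(cluster_count, zero_placeholder=0.5):
    """Return log10 of a cluster count, plotting zeros at a placeholder.

    log10(0) is undefined, so zero counts are drawn at 0.5, which sits
    below the smallest real count of 1 on a log axis.
    """
    value = cluster_count if cluster_count > 0 else zero_placeholder
    return math.log10(value)


for count in (0, 1, 10, 250):
    print(count, round(log_axis_value(count), 2))
```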

I also separated the R88 and R93 populations, which were mixed in at different concentrations.

The trends identified by Brourd seem to extend down to the absence of clusters on the chip (not surprisingly). A next step could be to investigate if the problem is overall content or if the effect is largely caused by long stretches of identical bases that could throw off sequencing or synthesis.

All sequences:



R93:



R88:


salish99

Errr, out of curiosity, how do you display a zero (i.e., zero clusters) on a log axis?
I presume you arbitrarily gave this a value, e.g. 0.1 or 0.01, to distinguish it from the rest of the data set, and still be able to display it?

Brourd

Based on a very rough preliminary analysis, it would appear that the overall content of adenine residues may be important in determining the cluster yield, although consecutive adenine length is just as or more important for determining cluster yields. It's hard to tell, given that as a player breaks up more consecutive adenine sequences, they lower the overall content of adenine at the same time.

salish99

(or, 0.5, judging by the graph values...)

salish99

yes, Brourd. I wonder if we could introduce a counter for "total A", not just "A's in a row" and set it, based on the length of the RNA string, to max. 55% of total length, and make it a hard rejection rule (as with the 4 C's in a row)
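
A sketch of what such a rule could look like. The 55% limit is the value proposed above, while the function names and the run-length helper are hypothetical and not the game's actual constraint code.

```python
import re


def exceeds_a_limit(sequence, max_fraction=0.55):
    """True if adenines make up more than max_fraction of the sequence."""
    return sequence.count("A") / len(sequence) > max_fraction


def longest_run(sequence, base):
    """Length of the longest stretch of consecutive `base` in the sequence."""
    runs = re.findall(f"{base}+", sequence)
    return max((len(run) for run in runs), default=0)


toy = "GGAAAAAACCUAAA"  # made-up sequence
print(exceeds_a_limit(toy), longest_run(toy, "A"), longest_run(toy, "C"))
```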

johana, Researcher

The counter is coming

jandersonlee

Out of curiosity I checked to see how various bases and sub-sequences correlate to cluster size. To eliminate outliers I used only sequences with clusters between 10 and 250 in size.

As Brourd has already shown, too many As seems to be harmful, with an increased A count negatively correlated with cluster size:



However the context seems to matter. When looking at length two sub-sequences, the worst offenders are AA and AG:



For length three, the worst offenders are AAA, GAA, AAG, AGA, and AAU:



While correlation is not causation, it seems that using large open loops with multiple As, optionally speckled with Gs or Us, may not be a good strategy for good cluster yields.

The Google Sheet can be found here. Best viewed in Chrome on a machine with at least 8 GB of RAM. (You have been warned.)

Omei Turnbull, Player Developer

I think that looking at these higher order frequencies is a great idea!

How did you calculate the digram percentages for a sequence? I checked your spreadsheet, and they don't seem to add up to 100%, as I would expect. In general, they add up to something less than 200%. It would be fine if they were all simply doubled, since it shouldn't change the correlation coefficients. But the actual sum differs among the sequences.

As a specific example, it looks to me like only one CC digram got counted in the sequence AUUUUACAUGAGGAUCACCCAUGUUUUGGCGGGCAGGAUAUAGAUCGGAUGAGUUCUGUCUAGAAGGGACAUGUU, where I would have expected the CCC trigram to contribute two. (This sequence is for JR_SS001_Sub511, the first on the sheet.)

jandersonlee

It's a quick hack using string substitution. CC and CCC match one CC. CCCC and CCCCC match two, and so forth. So for repeated/overlapping sequences it can be off. I'm looking for rough correlations, not absolute counts. Also, I think it would be better to correlate to ln(num_clusters) rather than num_clusters. Still playing with it.
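
The difference between the two ways of counting can be shown in Python, whose built-in `str.count` also counts non-overlapping matches, like the substitution approach described here. The spreadsheet formula itself is not shown in the thread, so this is an analogy rather than a reproduction of it.

```python
def count_non_overlapping(sequence, motif):
    """Count matches without overlap; str.count skips past each match."""
    return sequence.count(motif)


def count_overlapping(sequence, motif):
    """Count every position where the motif starts, overlaps included."""
    return sum(sequence.startswith(motif, i) for i in range(len(sequence)))


for run in ("CC", "CCC", "CCCC", "CCCCC"):
    print(run, count_non_overlapping(run, "CC"), count_overlapping(run, "CC"))

# The JR_SS001_Sub511 sequence quoted earlier in the thread.
seq = "AUUUUACAUGAGGAUCACCCAUGUUUUGGCGGGCAGGAUAUAGAUCGGAUGAGUUCUGUCUAGAAGGGACAUGUU"
print(count_non_overlapping(seq, "CC"), count_overlapping(seq, "CC"))
```

With overlapping counts, a sequence of length n always contains n - 1 digrams, so the digram percentages sum to 100%. With non-overlapping counts the total varies from sequence to sequence.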

jandersonlee

Rhiju wrote:

A probable explanation of the result is here (note that depurination primarily affects A's):

http://nar.oxfordjournals.org/content/38/8/2522.abstract

And so it seems. When I restrict the MS2 data to round 93 and clusters of 10 to 250 in size, it seems that the strongest correlation to cluster size is simply the number of As, followed by AAs and AAAs. In sequences of length three, having at least two As is bad news, regardless of the third base.

Google sheet here.

So: High percentage Adenine seems adverse to yield. That GAA and AGA and AAG are also fairly bad is likely due to their presence in breaking up large runs of AAAs in open loops. Avoiding large open loops seems one way to reduce the excess As.


jandersonlee

  • 555 Posts
  • 131 Reply Likes
I was looking at correlations on score and sub-sequence vs cluster size and sub-sequence, but spreadsheet conflicts are causing questions regarding results. I'll see if I have the time to start over.

jandersonlee

  • 555 Posts
  • 131 Reply Likes
When you restrict the cluster size to 20..250 and the score to 50..80 to help eliminate outliers, there is still a large negative correlation between %As and cluster size (-0.641). Even if you limit the %As to under 40% the correlation is still very large (-0.606).

However, there is no strong correlation between subsequences and score, with the largest correlations (of sequences up to length 3) being 0.148 for the subsequence UAC and -0.136 for the subsequence UAA.

https://docs.google.com/spreadsheets/...
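
For anyone who wants to repeat this kind of check outside the spreadsheet, here is a minimal sketch of the calculation: filter by cluster count, then take the Pearson correlation between %A and ln(clusters). The function names and the toy rows are invented for illustration (they are not lab data), and the 20..250 window is just the range mentioned above.

```python
import math

def percent_a(seq):
    # Percentage of bases in the sequence that are A.
    return 100.0 * seq.count("A") / len(seq)

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def a_vs_ln_clusters(rows, min_clusters=20, max_clusters=250):
    # rows: (sequence, cluster_count) pairs; counts outside the window are dropped.
    kept = [(seq, n) for seq, n in rows if min_clusters <= n <= max_clusters]
    xs = [percent_a(seq) for seq, _ in kept]
    ys = [math.log(n) for _, n in kept]
    return pearson(xs, ys)

# Made-up rows, only to show the call; the last one falls outside the window.
toy_rows = [("AAAAGC", 25), ("AAGCGC", 80), ("AGCGCU", 200), ("AAAAAA", 5)]
print(a_vs_ln_clusters(toy_rows))
```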

johana, Researcher

  • 96 Posts
  • 45 Reply Likes
This is a great analysis!

Limiting the total number of As seems to be the most straightforward method for achieving better yields and it's something we will implement for the next round.

Unfortunately we have no direct control over the actual synthesis but I'm looking forward to seeing the results, if any, from limiting the A percentage.
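
As a sketch of what such a limit could look like when checking a design by hand: the snippet below reports the A percentage and the longest run of As. The 40% threshold is only a placeholder taken from the range discussed above, not an announced rule, and the function names are hypothetical.

```python
import re

def a_stats(seq):
    # Percentage of As and the length of the longest uninterrupted run of As.
    runs = re.findall(r"A+", seq)
    return {
        "percent_a": 100.0 * seq.count("A") / len(seq),
        "longest_a_run": max((len(r) for r in runs), default=0),
    }

def exceeds_a_limit(seq, max_percent_a=40.0):
    # max_percent_a is an assumed placeholder, not the lab's actual cutoff.
    return a_stats(seq)["percent_a"] > max_percent_a

print(a_stats("GAAAACAAG"))
```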

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
Finally, here is one more really good reason why repeat A's can be a problem, besides the depurination explanation that we also got.

macclark52: Eli or somebody may already have posted this, but here is a neat news story on multiple AAAs in a mRNA stalling out protein synthesis.

AAAAA Is for Arrested Translation 

I had not. So big thx to Macclark for sharing. :)

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
This news story actually fits quite well with our investigations of poly-RNA in the SHAPE labs.

For those that weren't active at that time, we found quite clear evidence that poly-A chains longer than about 6 bases started forming some kind of 3D structure of low enough energy that it protected them from the SHAPE probe. Whatever it is, this could be the same 3D structure that is jamming up the translation mechanism. Now if we could only figure out what that structure is ...

Another thought. So far as we know, the variation in cluster count originates in the DNA amplification that gives us the RNA to test, not in the part of the process that actually experiments with the RNA. As such, a poly-A RNA structure might not be relevant to cluster variation. But since we don't know what poly-A RNA is doing, perhaps poly-A DNA might do something similar.

Rhiju, suppose we were to submit some designs in pairs, one containing a long poly-A sequence, with its matching pair being identical except for having some of the As in the poly-A stretch replaced by Gs. If breaking up the poly-A strings with Gs raised the average cluster count, would that be evidence in favor of 3D structure, as opposed to depurination, being a (the?) cause of the inverse correlation between poly-A and cluster count?

jandersonlee

  • 555 Posts
  • 131 Reply Likes
More on %seq vs ln(clusters). Looking at sequences in MS2/93 with 20%≤A≤50%.









The %A has the strongest (negative) correlation, followed by C and U. The %G is not very correlated with the cluster count. If you reduce the range to 20% to 40% for A, the correlations diminish somewhat, and there is a lot of spread.

Still, these graphs give some rough ideas of possible target percentage ranges.

salish99

  • 295 Posts
  • 58 Reply Likes
Good correlations, j!

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
Additional descriptive data

Meechl sent me a PM saying she had created an extended spreadsheet with some additional columns, i.e. secondary structure, estimate of the free energy and base pair counts / frequencies for both states.

I created a Google fusion table from her spreadsheet.

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
Hi Omei and Meechl!

This is extremely helpful. Big thx! :)

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
I got inspired by playing around with the fusion table and ended up writing a small intro on how one can get started using one.

Have fun! :)

Intro to Fusion Tables

jandersonlee

  • 555 Posts
  • 131 Reply Likes
Thanks Eli! I've been struggling trying to figure them out.

salish99

  • 295 Posts
  • 58 Reply Likes
I find them of limited use - my xlsx sheets sadly never upload, for one, due to the extensive tab use (I suppose) or the graphs

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
Structure metrics for round 2 submissions

I've calculated some structure metrics for (each state of) each round 2 submission and merged them with the previous data to create a new fusion table.

The metrics are:

* Stem segments. This is the number of distinct groups of consecutive ( or ) characters in the estimated folding. For example, .((((((((....))))((((....))).))))).. would be counted as having 5 segments.
* Hairpin loops. This has the obvious interpretation. The above example has 2.
* Unpaired bases. The number of dots in the secondary structure, 12 for the above example.
* Dangles: Unpaired bases at the ends of structure, 3 in the above case.
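
For anyone who would rather compute these four metrics on a downloaded CSV than in the sheet, here is a small sketch that reproduces the counts for the example structure. It is not the Google sheet macro itself, and the function and key names are invented; it just follows the definitions in the list above.

```python
import re

def structure_metrics(dotbracket):
    # Stem segments: runs of consecutive "(" or consecutive ")" characters.
    # Hairpin loops: a "(" followed by one or more dots and then a ")".
    # Unpaired bases: every dot. Dangles: dots at either end of the structure.
    return {
        "stem_segments": len(re.findall(r"\(+|\)+", dotbracket)),
        "hairpin_loops": len(re.findall(r"\(\.+\)", dotbracket)),
        "unpaired_bases": dotbracket.count("."),
        "dangles": len(dotbracket) - len(dotbracket.strip(".")),
    }

example = ".((((((((....))))((((....))).))))).."
print(structure_metrics(example))
```

The metric_delta columns would then just be the state 2 value minus the state 1 value for each key (assuming that is how the delta is defined).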

In each case, there are three columns -- metric, metric_2 and metric_delta. If anyone has a desire to see other metrics and I can calculate them with a Google sheet macro, I'll add them. If it requires a more complex calculation, we'll have to discuss the effort/reward tradeoff.

If the fusion table UI doesn't suffice for your needs (it is most certainly limited), you can easily get a CSV file with the File/Download command and load that into a spreadsheet, R, or whatever toolset you use.

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
Omei, this is simply amazing. I have long been wishing to have data like the number of stems available. I see what you have done. :)

Since you asked, here are a few other ideas for what could be useful. Just mentioning and not expecting:

- Length of stems.

This is something I have been counting in a past spreadsheet, since I found it useful for categorizing lab structures, such as pressured designs that are hard to solve. A large number of stems of certain lengths (namely short ones) makes lab puzzles harder to solve.

One more idea (expecting this one to be particularly hard) of things I find useful:

- How many of the stems are actually estimated to be in the switching area?

For switches, what really counts when counting stems is whether they are judged to be in the switching area. If they all are, the switch will generally be harder to solve, though not necessarily impossible. The majority of the winners we have gotten back on MS2 switches and microRNA are partial switches, not full moving switches: not all of the stems appear to be switching. Most of the Exclusion lab designs seemingly had the aptamer sealed up and static in one end. And several of them had an extra static stem form instead of having dangling bases.

One day I would love eternabot to have a chat with Meechl's spreadsheet, your fusion table and DataMiner. If the bot knows what is normal amount of stems (interchange with any other interesting feature as well) in winners for a particular kind of structure, then it can hedge its bets better. Read the rules from the data. :)

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
Thanks, Eli. I like your stem length suggestion. Do you think a single value for the average stem segment length would be useful? That would fall in the easy category. Eterna has Javascript code for determining all the stem lengths, and that code should be useable inside a Google spreadsheet, so it would involve an intermediate amount of work. But it would also produce a bunch of values, which makes it harder to draw simple comparisons. Do you know what you would do if you had the full list of all stem lengths?

Differentiating between stable and switching segments is also probably an intermediate difficulty -- not hard, but more than can be done with a reasonably compact spreadsheet formula. I'm inclined to see how much utility we can get out of the low-hanging fruit (i.e. simple formulas) for now. We've only got 10 more days to learn what we can from round 2 and apply that to round 3.

And yes, I think that while looking at round 2, we can develop specific recommendations in a form that Eternabot could use to predict results for round 3, before the round 3 measurements are made. And I think it would make pretty good predictions. :-)

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
Even a stem length average will be useful.

I think that together with your stem segment count it could help us shoot towards a more optimal value for both.

By the way, one comment on your find in your Maslow analysis. You found that having more stem segments had a positive influence on the folding score. I can add that many of the designs that did well actually added an extra static single stem, either for stabilizing one end of the aptamer or, in some cases, directly in the middle of the switching area, which I suspect is for bringing the switching elements closer to each other. (https://docs.google.com/document/d/1u...) Both are things I think aid the switching.

Page 2: https://docs.google.com/document/d/1L...

I originally used stem length of different labs to single out what would make a hard static design.

Background documents:
https://docs.google.com/spreadsheets/...
https://docs.google.com/document/d/1a...#

I basically think the same can be done for switches. Stem length together with switching area will help show what will be most optimal.

Oh, and I'm very pleased to hear that differentiating between stable and switching elements might not be a real hard nut to crack. But good point to focus on what can help us for next round.

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
OK, I added two new structure metrics. One is the average stem segment length. The other is the position number of the first closing base pair of the first hairpin (LowHairpinPos). So for example, if the 5' end of the structure started with

...((....))..........

LowHairpinPos would be 5. It's kind of an ad hoc metric, but it was easy and it seemed like it might be interesting.

The new URL is https://www.google.com/fusiontables/D...
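
A sketch of how these two values could be computed from the dot-bracket string, for anyone working from the CSV. It is not the actual spreadsheet formula; the function names are invented, and it assumes LowHairpinPos is the 1-based position of the "(" that closes the first hairpin loop, which matches the example above.

```python
import re

def average_stem_segment_length(dotbracket):
    # Mean length of the runs of consecutive "(" or consecutive ")" characters.
    runs = re.findall(r"\(+|\)+", dotbracket)
    return sum(len(r) for r in runs) / len(runs) if runs else 0.0

def low_hairpin_pos(dotbracket):
    # 1-based position of the "(" directly before the first hairpin loop,
    # or None when the structure has no hairpin.
    match = re.search(r"\(\.+\)", dotbracket)
    return match.start() + 1 if match else None

example = "...((....)).........."
print(average_stem_segment_length(example), low_hairpin_pos(example))
```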

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
Thx Omei!

I like it :)

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
Johan, I've been looking forward to seeing the binding curve image files that we got for Round 1. All the numeric statistics are great for finding designs of interest. But when it comes to evaluating a specific design, that one image is much more informative than a row of numbers.

Are they still in the plan for round 2? Or perhaps they already are available and I just missed that fact?

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
Thank you. Those are interesting.

But it does seem like the one image we got in round 1 more succinctly conveyed a lot about how the switch behaved. I don't know how much effort they took, but I, for one, would really appreciate seeing them.

jandersonlee

  • 555 Posts
  • 131 Reply Likes
Much appreciated. A few naive questions from someone a bit late to this party: What are Kd and Fmax and how should one read/interpret these graphs?

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
After fitting the data to a sigmoidal curve, FMax is the top asymptote of the curve, and Kd is the concentration at which the intensity is half the maximum.


The Fold Change, which is the basic measure of switching, is the ratio of the two Kd's. Since the horizontal axis on the graph is log scaled, the horizontal distance between the two Kd values on the graph directly corresponds to the log of the Fold Change.
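
To make the reading of the graphs concrete, here is a small sketch. It assumes a simple one-site binding curve (the post only says "sigmoidal", so the exact model used for the fits is an assumption here) and takes the fold change as the ratio of the two Kd values. The Kd numbers are invented for illustration.

```python
import math

def signal(conc, kd, fmax=1.0):
    # One-site binding curve: half of fmax is reached when conc == kd.
    # Plotted against log(conc) this is the familiar S-shaped curve.
    return fmax * conc / (conc + kd)

def fold_change(kd_off, kd_on):
    # Assumed definition: ratio of the Kd with the switch OFF to the Kd with it ON.
    return kd_off / kd_on

def log_axis_distance(kd_off, kd_on):
    # Horizontal gap between the two Kd values on a log10 concentration axis.
    return math.log10(kd_off) - math.log10(kd_on)

# Invented values in nM, not measurements.
kd_on, kd_off = 10.0, 1000.0
print(signal(kd_on, kd_on), fold_change(kd_off, kd_on), log_axis_distance(kd_off, kd_on))
```

With these made-up numbers the two curves sit two decades apart on the log axis, which is the same thing as a 100-fold change in Kd.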

salish99

  • 295 Posts
  • 58 Reply Likes
kd is the MS2 conc at the point where the UV-vis level is half the maximum

salish99

  • 295 Posts
  • 58 Reply Likes
hm, maybe I should wait for the page to finish loading... see Omei's picture above for explanation, much better than my ramblings...

Omei Turnbull, Player Developer

  • 1026 Posts
  • 332 Reply Likes
Comparison of similar designs - the Maslow mods

I'm particularly interested in comparing designs that have similar sequences but a large difference in scores. With that in mind, I selected the 58 designs from round 2 that had Maslow in the name, and looked at how the various metrics derived from the predicted secondary structure of each state could be used to predict and improve the Eterna score. (Actually, I focused on the FoldChange value rather than the total Eterna score.) I came away with some specific recommendations for improving the scores for Maslow-based designs in Round 3. For the specifics, see Variations on the Maslow design, Round 2.

Leaving aside the specific numbers, I'm going to go out on a limb and hypothesize that the following general recommendations will generalize to all the exclusion puzzles.

For the next round, if you are making mods to existing designs, consider:

* Decrease the number of unbound bases (increasing the number of bound pairs), in both the ON and OFF states.
* Pay particular attention to increasing the AU pairs (in absolute numbers) and GU pairs in percentage, i.e. from none to at least 1 or 2, in both states.
* Consider increasing the number of stem segments (as opposed to combining segments, or simply extending the existing segments).

These recommendations are made with the intent of nudging average scores higher. Each design still has to be judged on its own merits, and many good design mods will undoubtedly move in the opposite directions.

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
Beautiful work, Omei!

This is very practical advice for designing plus a fine guide to good settings for the fusion table. Nice analysis. :)

jandersonlee

  • 555 Posts
  • 131 Reply Likes
I notice in looking at the R93 results that there seem to be some common cases for how relatively high-scoring designs "missed" and have some suppositions on how they might be tweaked.

High Eterna_Score, Low NumberOfClusters
example: Exclusion 1 (5480276) 'Nebkaure Khety II'
Typically these have a high percentage of As (e.g. 54.1%), often unbonded. Reduce the number of As.

Exclusion: High Baseline_Subscore and Folding_Score, Low Switch_Subscore
example: Exclusion 1 (5490401) Tirebi
MS2 bonds well in the ON (FMN-) mode, but the low switch score means there is little difference between the ON (FMN-) and OFF (FMN+) curves. Need to change stacks and boosts to increase the kcal delta between the states. If baseline is high but not 30, may also need to improve the stability of MS2 hairpin in ON (FMN-) mode (e.g. clean up dotplot?).

Same State: High Folding_Score, Low Switch_Subscore
example: Same State 1 (5510749) Garnet 75
MS2 bonds well in the ON (FMN+) mode, but the low switch score means there is little difference between the ON (FMN+) and OFF (FMN-) curves. If baseline==30, we probably need to change stacks and boosts to increase the kcal delta between the states. If baseline is less than 30 it may be too hard to form the MS2 arm in the ON state meaning that a higher concentration of MS2 is needed for its bonding bonus.

less often:

Exclusion: High Switch_subscore, Low Folding_Score
example Exclusion 3 (5498302) salish99-ex3-25
The low folding score means it doesn't get a strong MS2 signal in the ON mode, meaning the MS2 arm is not forming well. It could be too many misfolds of that region. Check the dotplot. High switch score may mean the kcal delta is too large between the states, so it is very slow to switch.

Does this make sense? Anyone see it differently?

salish99

  • 295 Posts
  • 58 Reply Likes
Excellent summary. As for the 5498302, I typically tried to have clean dotplots for the early series (numbers 10-30), but I had difficulty doing this, especially for the ex3 and ex4 series...

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
I started making spreadsheets and splitting data for individual labs and adding a few categories that I thought might reveal something interesting. I ended up making drawings from those. What I find interesting is that many of the lab winners have a fairly similar pattern for switching: each lab generally has a few variant types that work.

I have been making drawings for the exclusion labs also now. But added the rest too, since I have redone some. I have mainly been interested in the winners and designs scoring over 80.


Lab drawings


























What MS2 wants

I think it is not just the aptamer that wants space around it so it gets its optimal door colors. I think the MS2 hairpin will often need it too.

With part of the FMN aptamer sequence next to it, as in the exclusion labs, it doesn't have that full luxury of free choice like the Same State labs and microRNA.

It is possible to solve these exclusion puzzles, as there have been winners in most of the exclusion labs. There are just very few of them compared to the Same State designs and the microRNA labs. Another thing: all the labs that turn on the MS2 in state 2 (Mir 208A, Same State 1 and Same State 2) have been doing better than the turn-off labs.

Now in the Exclusion labs, MS2 can’t as easily have these gate doors on both sides of the MS2 hairpin, since the aptamer sequence next to the MS2 sequence is already a given.

So instead the FMN sequence in some cases gets made into one of the MS2 gate door sequences, with a complementary sequence on the other side. This is the case in all of the Exclusion labs in at least one of the solve variants. (Exclusion 1 A + B, Exclusion 2 A + B, Exclusion 3 A + B, Exclusion 4 B)

The MS2 hairpin can get turned on and off in different ways. But what characterizes many of the winners is that they get turned on and off in specific ways.

Background articles:

Aptamer doors
MS2 gate doors


Exclusion 1 and 4

MS2 is on the outside of the FMN sequences. Since the aptamer sequence, which contains twin G’s and A’s, gets made into an MS2 gate door sequence, its complement will be U and C heavy. That CU element also often gets used for turning the MS2 hairpin off. So these magnet segment doors for the MS2 hairpin have a similar function to the sequence around the MS2 hairpins in the logic gate puzzles, despite the base composition being different.


Exclusion 2 and 3

Here the MS2 sequence is between the FMN aptamer sequences. Here the trend is to place a magnet element at the side of the MS2 that does not involve the FMN sequence. It can be either a G or a CU element, and it is used both for turn on (together with the FMN on the other side of the MS2) and for turn off of the MS2 hairpin. Also, one of the winners went without the G and C magnet segments but used mirroring of the MS2 hairpin for turn on and turn off. (Exclusion 2 A)

Otherwise the trend is that the magnet element placed after the MS2 either pairs up with the FMN sequence before MS2 or with the MS2 G’s if it is a CU element, or reaches for the MS2 C’s if it is a G magnet segment.


MicroRNA

The microRNA labs had many winners, considering that it was the first round for them.

The majority of the 208a solves had a quite different way of solving the MS2 hairpin compared to the MS2 labs: far fewer magnet segments and more mirror complementarity to an MS2 fragment, which gave long MS2 gate doors on either side of the MS2. Only a minority of the top scorers had a G or C magnet segment in use to turn off the MS2 hairpin.

Of course the microRNA labs have another reason for not needing as many magnet segments, as the FMN aptamer is not there. The twin G segments in FMN are partly responsible for introducing C magnet segments, and these can sometimes pair up with magnet segments in MS2 or share them. So that too is a reason for the different solving style in the microRNA labs. But the microRNA labs show the same pattern as the XOR puzzles with MS2 gate doors, with sequence just before and just after the MS2 hairpin. I think we are going to see a lot of that in the future.

MS2 sequence highlighted. Focus on the area close by. Notice the palindrome-like sequence around the MS2 hairpin. It's a fragment from the beginning of the MS2 sequence, mirrored onto both sides of MS2.




Messy spreadsheets

Exclusion 1
Exclusion 2
Exclusion 4
Same State 1
Same State 2
Mir 208A
Turnoff variant 2, v2

I have also added a bit of details on some of the labs in my thoughts about the MS2 data:

https://docs.google.com/document/d/1u...

Plus I'm beginning to sum up the document:

https://docs.google.com/document/d/1u...

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
GU content

There are differences in GU content between different solving variations in the same lab. Top scoring design siblings will have a similar amount of GU content, but since many of the labs have more than one way of solving, not all styles take an equal amount of GU’s.

By solving styles I mean that the winners often will have only 2 or 3 main ways of solving: they switch in the same spots and do it the same way. Just like there are design siblings.

So I think GU will partially depend on what way one goes about solving the puzzle. A complementary mirroring solving style leaves a different amount of GU content than a magnet segment solving style.

When I go on about magnet segments, I mean designs that use short elements of C/U's (often to bind up with the MS2 G's or twin G's in the aptamer) or short elements of G's that target the MS2 C's.

The other type usually uses longer stems and has them complementary to a piece or more of the MS2 sequence, with that fragment mirrored outside.

The differences of GU between different types of design can be seen in some of the spreadsheets in the above post.

Eli Fisker

  • 2328 Posts
  • 541 Reply Likes
Switches need higher entropy than static designs. I think there is a connection between repeat bases and higher entropy.

I basically think base repeats are a way of raising entropy. Repeats in themselves are not enough; the ratio between repeats also matters for the result. I think we need to raise the amount of repeat bases when solving switches, except for A repeats.

Putting some of the RNA switch puzzle pieces together

I was reading a new science paper on riboswitches. It says that entropy is higher for RNA designs that are also riboswitches.

Pablo gave a lecture on entropy in relation to switches some time back and mentioned that switches land in a higher entropy range. Here is an intro by Machinelves and a link:

Entropy, RNA and free energy.

Usually our good static designs of the past had low entropy, meaning they were highly ordered and unlikely to take on many other forms. Although low entropy was no guarantee of winners. Ding on entropy. Designs with high entropy, on the other hand, were always bad.

However, since naturally occurring riboswitches and our RNA switches need to be able to shift shape, it makes perfectly good sense that they will also need to have higher entropy.

What particularly caught my attention in the new switch paper was the collection of riboswitch sequences on page 31-33. I thought I saw a raised amount of repeats compared to what I thought normal for an average static RNA design, so I brought out my Indian Ink. :)

Paper:
Secondary structural entropy in RNA switch (Riboswitch) identification













How to raise entropy?

All these repeats in the riboswitches made me wonder if there was any connection between repeats and high entropy.

What has been telltale of high entropy designs earlier? Bingo - repeat bases. :)

The badly scoring static designs of the past not only scored badly, they often contained a hideous amount of repeats. Something I was not pleased about.

Back then we had rather small designs, and I later learned that repeat bases are more welcome in certain places, in particular in longer stems and, for U repeats, in bigger loops.

Base repeats in natural riboswitches

I also noticed that natural RNA switches often seem to be riddled with repeat bases, at a different frequency than normal static designs, and was perplexed about the heavy G and C repeat base sequences. I thought of GC pairs, and lots of them, as the hallmark of making things stable, which was not something we were too interested in for switches. But while there was a huge extent of G and C repeats, there also were exceptions, so I didn’t know what to make of it.

Collection of screenshots of riboswitches:

Rfam picture archive of riboswitches

Patterns in the natural riboswitches set

Dataset:
https://docs.google.com/spreadsheets/...

Patterns I see in the riboswitches that I have been playing with:


  • There is an unusually low amount of A repeats: sometimes just above, but most often below, the highest amount of repeat bases of the other colors. In the static RNA labs, A repeats are usually the most frequent type of repeat base.


  • There is a higher than usual rate of G and C repeats. And Us as well.


  • There often is a relation between the number of A and U repeats and between C and G repeats. No surprise there. It's similar to the relation between A and U bases and C and G bases, because of Watson-Crick pairing.


  • There is a different repeat base ratio for switches compared to static designs. Whereas static designs as a rough estimate seem to have 20-50% repeat bases, depending on factors like stem length and structure in general, the riboswitches more often have a repeat base ratio of 30-50%. I’m guessing that where a static design often will land in the 30% repeat base range, a switch more often lands in the 40% repeat base range.


  • Even if static and switch designs have the same amount of repeat bases, the ”colors” of repeats are not distributed in the same way. There are fewer A repeats for the switches.


  • Generally entropy of switches is high. Between 0.8 and 2.8 (there can be multiple peaks). With an average of 1.8, which is pretty high compared to what is normal for static designs.


  • I was wondering if the areas that are supposed to be non switching has low entropy compared to the supposedly switching areas. It seems to be the case so far. So while Vienna is only meant for single state puzzles, it seems to be revealing the switching areas, in the positional entropy drawings. Still there are many repeats in the static parts of the switches too. Something that I have been wondering about. I might have found a possible explanation.


  • Typically it is the shorter riboswitches that show less of a pattern for the repeats.


  • Ensemble diversity is lower for short sequences - Kind of makes sense - as the shorter the sequence, the fewer alternative pairing options there will be.

    Ensemble diversity measures how varied the shapes are that the RNA is predicted to adopt; the lower it is, the more time the RNA spends in one shape. When we want RNA to stay in one target shape, then we want ensemble diversity to be low.

    Ensemble diversity often isn’t as low for many of the switches as it is for the static designs.

    Therefore it makes sense that when we want RNA to change between multiple shapes, then we may want ensemble diversity to be higher.


  • C repeats - somewhat close but most often not as close as G repeats. Sometimes there are 3 C repeats in a short row and similar distance between.


  • Short sequence = lower entropy than longer sequences.
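
In case it helps others compare their own designs against these ranges, here is a sketch of one way to compute a repeat base ratio. The definition is my assumption for illustration: a base counts as a repeat base when it sits in a run of two or more identical bases. The percentages quoted above may have been counted differently, and the function name is made up.

```python
import re

def repeat_base_stats(seq, min_run=2):
    # Count bases that sit in runs of at least min_run identical bases,
    # split by base "color". min_run=2 is an assumed definition of a repeat.
    per_base = {base: 0 for base in "ACGU"}
    for match in re.finditer(r"A+|C+|G+|U+", seq):
        run = match.group()
        if len(run) >= min_run:
            per_base[run[0]] += len(run)
    ratio = 100.0 * sum(per_base.values()) / len(seq)
    return ratio, per_base

# Toy sequence, not a real riboswitch.
print(repeat_base_stats("GGAUCCCA"))
```

The per-base counts make it easy to see whether the repeats are mostly A's or mostly G's and C's.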



This also might explain why there is a minority of riboswitches that don’t have the general patterns that I noticed for riboswitches, with the raised amount of repeats and C and G repeats in particular. These exceptions are usually the short ones. Short riboswitches often don’t have as high entropy as the bigger ones. (I think this is general for short RNAs versus long RNAs, be they switch or not.

And I know that natural RNA, when really huge, also gets a lot more long repeats. But even though growing length of RNA means more repeat bases for both switch and static, the ratio and the kind of repeats are not the same.)

The short exceptions among the riboswitches don’t have the same amount of repeat bases or C and G repeats. I think being short means they can compensate in other ways. They often don’t even have the repeat bases I otherwise found in many of the riboswitches.

RNA switches and base repeats

In the early switch labs I noticed that repeat sequences (not base repeats) seemed to spread like wildfire in some of the labs. Periodic repeats in rna switches

Now a lot of the switch sequence repeats seemed to be caused by the spread of FMN aptamer repeats. As Brourd demonstrated with the TEP design, the TEP aptamer doesn’t spread the same sequence repeats as the FMN aptamer does.

But most of the natural riboswitches don’t even have FMN aptamers, and still most of them have an unreasonably high rate of repeat bases compared to static labs. And when looking at the TEP design with that in mind, this particular design has an unusually high rate of repeat bases (55%).


Image taken from Brourd’s comment

Even in switches where the repeats caused by the FMN aptamer were contained in the switching area, due to the switch only being a partial switch, there is still a raised amount of repeats outside of the switching area. Also, the FMN aptamer causing repeats doesn’t explain the repeats in the many other types of switches that do not have FMN.

Sequence repeats

I suspect that base repeats are not the only thing that raise entropy. I think that some specific sequences may also contribute more. And it also matters where they are placed. Some sequence may cause more trouble if placed in stem or loop.

Switches are also rich in other kinds of repeats that are not necessarily base repeats, such as sequence repeats like ...CUC... and ...GUG..., which are normally less than helpful in static puzzles in bigger amounts. Example from static labs: Strand repetition ban.

The same goes for sequences like GUGG, CUUC and so on. Basically the basepairs are not well enough mixed (too many are turning the same way) to form a stable stem in a static design, especially if there are multiple of these sequences. However, in switches they thrive.

Sequences like GUG and CUC can make stems unstable if there are too many of them and they continue beyond just a few bases, while G and C base repeats in loops can make nearby stems unstable. The FMN aptamer actually does this for its switch mechanism, and so does even the MS2 hairpin, though it hides its repeats in the stem.

And both these kinds of repeats, base and sequence repeats, are high in switch labs and similarly in failed static designs.

What sequences do you think will help raise entropy and make switch RNA switch?

Eterna at work

The finding that the frequency of A repeats needs to be lowered for switches, and the A repeat fragments made shorter, also fits well with why there needs to be a limit on A’s for our MS2 switches. As the story from the Eterna lab goes:

After we got first round results back from the MS2 lab, Johan gave us an update:

“The highest scoring designs had very few clusters, so beware when interpreting the results.”

http://eterna.cmu.edu/web/lab/5448678/

This made me wonder if I could find anything that separated designs with a low number of clusters from the winners. I noticed that designs that got a low cluster size had a high A percentage and long A repeats.

This got confirmed with graphs by Brourd, and later jandersonlee made statistics and we got the A meter - thx, Nando!



However, if you check jandersonlee’s numbers for UUU, GGG and CCC repeats, those do not score badly compared with the AAA repeats.

So it is only the long A repeats that get in trouble, and only when there are lots of A repeats. In natural riboswitches there are regularly 4 G’s in a row - something that our lab prohibits - and I have even seen a fiver.

What’s frequency got to do with it?

It is not just that having more G and C repeats is good for switches; there is also something about A repeats that makes them problematic. U repeats also seem to be beneficial to a higher degree than usual for static labs. Something that has also been visible in our recent MS2 and miRNA lab results.

Which reminds me of the intrinsical labs (http://eterna.cmu.edu/web/labs/past/?...) that were playing with the frequency of A repeats. These labs had forced base frequency.

When I checked the Intrinsical 8 lab, the designs tended to have super low ensemble diversity, something which is normally counted as good for static designs. But most of the intrinsical labs have crappy signal to noise ratios and they score crappy. The longer the stretch of A before a breaking base, the lower the entropy.

A few frequencies came back with a better signal to noise ratio than the others. But otherwise most of the rounds came back with bad noise. And the bigger frequency labs didn’t have winners.


From Meech’s signal to noise ratio spreadsheet

The trend however is that the lower the frequency is - the shorter the distance between the A’s - the higher the entropy shown in Vienna's estimate. Although there were more winning designs among these when they were run in lab. Not all colors of base affect entropy equally.

Perhaps someone remembers these particular Intrinsical labs breaking several lab batches a while back. Now the lab results from these labs finally seem to be of some good use. :) Beyond that, we got the needed hard restraints for base repeats. ;)

When I ran some of the Intrinsically Red lab designs - they have monster long red repeats - entropy went super low along with ensemble diversity. This is not the case for Intrinsically Blue; there entropy goes high. And in Intrinsically Green, entropy goes super high.

Real long base repeats ranked after estimated entropy inducement:

High
Green
Blue

Low
Red
Yellow

So it seems that having repeat A’s, and many of them, is a way to ensure super low entropy = static structure. No wonder we had low entropy in the classic Eterna labs, where many of us thought it bad to have much other than A bases in single base areas. Vienna overused bases other than A and repeats in single base stretches, spread in an unbalanced way, which made this strategy look bad, and the lab method for obtaining data back then generally showed A’s to be beneficial in the loop area. Something that changed with Cloud lab.

While red also gives low entropy, red repeats can’t be used the same way as yellow repeats, as they cause trouble for the polymerase when there are too many of them.

So as I like to say:

Basically RNA folding is a game of frequencies.

There is no one right answer
on how to fold RNA,
beside the questions,
what length,
what elements,
how many parts?

What color, base frequency and base repeats are needed really just depends on what structure you want to make, and what function you want the molecule to perform.

Repeat the same small sequence with too high a frequency and too close together, and have two such repeated sequences that are complementary, and misfolds are bound to happen. That's unwanted in static designs, however we can turn it to our advantage for switches.

Stems take different base frequencies than loops, and small designs take different frequencies from big ones. Similar elements vary with size. Big loops take different base frequencies from small loops.

Balancing the repeats

For the riboswitches, the A repeats are lowered much compared to the normal frequency in static labs. In static labs, A repeats are mostly the most frequent ones. In really big loops the repeat frequency is also changed. There the A repeats are the most frequent too, just to a more extreme degree. And often with a raised amount of U repeats too.

Rough guess for ranking of base repeats in different types of RNA designs



The different base repeats affect entropy differently. UU, GG and CC are not alike. If these are added in a loop, the ones with the stronger pull (CC and then GG) have more power to disturb the fold of a nearby stem. Meaning they can make the stem go slightly or very unstable. Meaning that they actually help facilitate movement of that stem. The more pulling power and the longer the base repeat, the more potential disturbance. I think this is why U repeats are regularly longer than C repeats, as U repeats are less aggressive.

I basically think we can make better switches if we up the amount of repeats in them - that is, with the exception of A repeats. And keep these repeats in a fine ratio balance.

Why base repeat ratio matters

If static labs generally have a 30% repeat ratio and switches have a 40% ratio, it will make a difference for folding opportunities.

Sequence repeats not only provoke higher entropy - which is normally bad for getting a solid fold - they also change how many ways the RNA can fold.

Let's do a small example. Imagine you have two designs with a different amount of repeats.

Now imagine a riboswitch with 50% base repeats - as some of them have that many - and a static design with just 30% repeats. Let's say both designs have 100 bases each.

Now the switch design will have 50 bases that are repeats and the static design will have 30 bases that are repeats.

Let's say the repeats on average are around 3 bases long.

Switch: 50 single bases + 17 repeats = 67 base regions

A raised amount of the repeats will have an option to pair with each other. But overall there are a lot fewer ways the RNA can fold.

Static design: 70 single bases + 10 repeats = 80 base regions

Now there are so few repeats that not all of them will pair with each other. However overall there is a huge opportunity for many different RNA folds, and some of those folds being real strong.
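
The arithmetic in this small example can be written out as a quick sketch. The function name `base_regions` and the idea of counting each repeat as one "region" are just my reading of the example above, and the average repeat length of 3 is the same rough guess used in the text.

```python
def base_regions(length, repeat_fraction, mean_repeat_len=3):
    """Count 'base regions' when every repeat is treated as a single unit.

    length           -- number of bases in the design
    repeat_fraction  -- fraction of bases that sit inside repeats (0..1)
    mean_repeat_len  -- assumed average repeat length (rough guess: 3)
    """
    repeat_bases = round(length * repeat_fraction)
    single_bases = length - repeat_bases
    repeats = round(repeat_bases / mean_repeat_len)
    return single_bases, repeats, single_bases + repeats


# Switch with 50% repeats versus static design with 30% repeats
print("switch:", base_regions(100, 0.50))
print("static:", base_regions(100, 0.30))
```

With these numbers the switch ends up with fewer regions than the static design, which is the point of the example: more repeats means fewer distinct pieces to pair up.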

I think the repeats may find it a bit easier pairing with a repeat than pairing with single base regions.

In the single state design, the repeats are kept in control by A repeats, which are less likely to pair with anything else than other kinds of repeats are. Repeat A’s actively lower entropy. The most interactive repeats, such as C and G, are kept at a lower rate, and generally the C’s are kept shorter than the G’s. Also they are safer placed in stems, as in loops they will love to wreak havoc. Which is basically what Vienna generally did wrong in past classic Eterna labs.

Higher entropy as switching force and C and G repeats as anchors

So I basically think that raised entropy is what unleashes the power for getting the switch moving, the repeats greatly help limit the structure folding options, and the G and C repeats are the skeleton in the switch mechanism. Or put a bit differently:


  • Raised entropy leaves the RNA wiggle room to change shape.

  • A generally raised level of repeats limits the general pairing options and steers the switch towards the correct fold.

  • G and C repeats make strong enough anchors for each of the different states, to make up for raised entropy. I’m imagining the G and C repeats as a kind of snap fasteners.

  • Perhaps the U repeats, which are also often long and more present than is usual in static labs, are helping the switch slide.



Things that could be interesting looking at


  • Percentage of repeat bases, versus single bases in switch puzzles. And then again the same for static puzzles.

  • Percentage of repeat U’s against repeat A’s, percentage of G repeats versus C repeats. Plus a combo of those two groups.

  • Average length of repeats according to base.

  • Optimal distance of base repeats. They seem to be spread throughout the design with no big gaps.
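
The first three items on this list are easy to count from a sequence. Below is a minimal sketch, assuming that a "repeat" means a run of 2 or more identical bases (the thread does not pin down the exact definition, so `min_run` is a knob you can change). The function name `repeat_stats` is made up for this illustration.

```python
from itertools import groupby


def repeat_stats(seq, min_run=2):
    """Summarize base repeats in an RNA sequence.

    A repeat is taken to be a run of at least `min_run` identical bases
    (an assumption; adjust to your own definition).
    """
    seq = seq.upper()
    # Collapse the sequence into (base, run length) pairs
    runs = [(base, len(list(group))) for base, group in groupby(seq)]
    repeat_runs = [(base, n) for base, n in runs if n >= min_run]
    repeat_bases = sum(n for _, n in repeat_runs)

    per_base = {}
    for base in "ACGU":
        lengths = [n for b, n in repeat_runs if b == base]
        per_base[base] = {
            "runs": len(lengths),
            "bases": sum(lengths),
            "mean_length": sum(lengths) / len(lengths) if lengths else 0.0,
        }
    return {
        "repeat_fraction": repeat_bases / len(seq) if seq else 0.0,
        "per_base": per_base,
    }


stats = repeat_stats("GGAUCCCAAU")
print(stats["repeat_fraction"])
print(stats["per_base"])
```

Running this over a set of switch winners and a set of static winners would give the repeat percentage, the U versus A and G versus C comparisons, and the average repeat length per base in one go.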


Perspective and thoughts for the future

When I run switch winners through Vienna and watch positional entropy, their entropy is often in the range of around 1-2, which is far outside of what is the normal range for a static design. Usually static designs range somewhere between 0.2 and 0.9.

Now Vienna is only a tool and as we well know far from perfect. I wouldn’t trust it to do my static lab designing. However it may still be useful for pointers of some of the really bad designs - as it was in the past.

I basically think we can use entropy as a discriminator for whether a design really switches. If you have designed a switch and it scores below 0.9, then it is likely not a switch at all. If it has entropy above 2.5, it is likely not switching the way you want it to. And if it lands in a good switching range, you have absolutely no guarantee that your switch will work the way you intended. You only know that there is a good chance that it will actually switch.
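
As a rough sketch of how this check could be automated: the first function computes a per-base positional entropy from base pairing probabilities, and the second applies the 0.9 and 2.5 cutoffs mentioned above. The entropy formula is the usual Shannon form over "paired with j" and "unpaired"; using log base 2 is my assumption about the scale, and the function names and the dictionary format for the probabilities are invented for this example. Which single number to feed into the verdict (mean, peak, or value in the switching area) is also left open here.

```python
import math


def positional_entropy(n, pair_probs):
    """Per-base entropy from pairing probabilities.

    n          -- sequence length
    pair_probs -- {(i, j): p} with 1-based positions, i < j
    Returns a list of n entropy values (log base 2, an assumption).
    """
    paired = [[] for _ in range(n + 1)]
    for (i, j), p in pair_probs.items():
        paired[i].append(p)
        paired[j].append(p)

    entropies = []
    for i in range(1, n + 1):
        unpaired = max(0.0, 1.0 - sum(paired[i]))
        terms = paired[i] + [unpaired]
        entropies.append(-sum(p * math.log2(p) for p in terms if p > 0))
    return entropies


def entropy_verdict(value, low=0.9, high=2.5):
    """Apply the rough cutoffs from the post; they are estimates, not rules."""
    if value < low:
        return "likely static"
    if value > high:
        return "likely too floppy"
    return "plausible switching range"


# Toy case: two bases that are paired with each other half of the time
print(positional_entropy(2, {(1, 2): 0.5}))
print(entropy_verdict(1.5))
```

A "plausible switching range" verdict only says the design has room to move, not that it moves the way you intended.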

I even think we can use the colors of entropy to help determine if a switch is potentially happening in the area where we want it most. These entropy range numbers are just a rough estimate, so watch out for what you think of as optimal.

I have been swearing off outside tools for a real long time. But I think Vienna is my new best (old) friend. :)

So, advice for future switch labs where there already are a few winners: try sending them through Vienna to get an idea of what entropy range may be smart to aim for - to repeat the success.

Thx to Machinelves for input and discussion.

Eli Fisker

Part III - Potential switch pattern?

While doing the color marking of the repeats in riboswitches - as mentioned in the above post - I noticed a peculiar pattern that appeared in many of the switches. I noticed it because I was looking for patterns in the positions of the G segments compared to each other. A lot of the switches had at least 1, sometimes more, of their G segments really close together, whereas the C segments often also had some closeness, but not to the same degree. And this pattern kept turning up: (Y)GGNNGG, (Y)GGNGG, and to some degree also reduced patterns like (Y)GNNGG and (Y)GGNG and variations on that theme (Y signifying pYrimidine, meaning C and U bases). Regularly there is a pyrimidine at the end too. Plus at the beginning of the second G repeat.
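
For anyone who wants to scan sequences for this pattern, here is a minimal sketch using regular expressions. It treats N as any base (including G), reports whether a pyrimidine sits directly in front of the hit, and uses a lookahead so overlapping hits are all found. The motif names and the function `find_g_motifs` are made up for this example.

```python
import re

# Core patterns from the post, with N read as "any base"
MOTIFS = {
    "GGNNGG": "GG[ACGU]{2}GG",
    "GGNGG": "GG[ACGU]GG",
    "GNNGG": "G[ACGU]{2}GG",
    "GGNG": "GG[ACGU]G",
}


def find_g_motifs(seq):
    """Return (motif, 1-based start, matched text, pyrimidine_before) tuples."""
    seq = seq.upper().replace("T", "U")
    hits = []
    for name, core in MOTIFS.items():
        # Lookahead keeps the match zero-width so overlapping hits are found
        for m in re.finditer(f"(?=({core}))", seq):
            start = m.start(1)
            y_before = start > 0 and seq[start - 1] in "CU"
            hits.append((name, start + 1, m.group(1), y_before))
    return sorted(hits, key=lambda h: h[1])


for hit in find_g_motifs("AACGGAUGGCAA"):
    print(hit)
```

The reduced patterns overlap with the full ones, so expect the same stretch to be reported more than once.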

Peculiarly enough, this G segment repeat pattern often appears to land in the switching area, and I can also find it in a number of the Eterna switch winners, both FMN and TEP, although not all. There the pyrimidine start gets lost due to the locked FMN sequence, as this is where these two close double G repeats often turn up. One of the G repeats is often in a stem and the other in a loop. Which reminds me of something else. A number of the G and C repeats are placed in loops, and I think they are left there to initiate the switching.

Later I ran the naturally occurring riboswitches through Vienna RNAfold, to see if this sequence would land in a high entropy area, and it regularly does.


>Magnesium riboswitch mgtA: E. coli. Alteration: Normal.
CUUACCGGAGGUUAUAUGGAACCUGAUCCCACGCCUCUCCCUCGACGGAGAUUAAAACUUUUCCGGUAAGCCCGUCUUUUCACGGCGUUACCGGAUGCGUAAGGCCGUGA

The pseudoknot riboswitches mostly seem to be exempt from this sequence pattern. Perhaps they have another switching mechanism?

The two close G repeats often work in such a way that one repeat is embedded in a stem and the other in a loop area in one state, and then the repeat G’s in the loop help as an anchor for shifting to the other state. Similar to the twin G’s in FMN, where the twin G’s are in the aptamer loop but often get bound up when the state is shifted.

Now I think I finally understand why there are so many G and C repeats in riboswitches. I think the C repeats help raise entropy, as do U repeats. Plus, when one makes the entropy of the design higher through repeat sequences, and thus greatly raises the probability that the design can fold into many structures other than the target structure(s), one also needs to make the binding parts stronger = lots of GC = lots of G and C repeats.

I think one can balance the entropy at a good level by playing with the right amount and types of repeats. I even think there is a different frequency for each kind of repeat. C and G repeats occur to a much higher degree, where normally A repeats and to some degree U repeats dominate in static puzzles. I think the ratio between the different base repeats matters.

Omei Turnbull, Player Developer

Eli, I like your line of thought. I would like to support it from the perspective of the finer details that are ignored in nearest neighbor energy models (which is basically all the state of the art offers).

Consider the following. It is Chimera's "ladder" rendition of the 3D structure of a hairpin from the human 7SK snRNA in complex with arginine. (I selected it just as an example of an RNA that does not seem to rely on changing its shape to perform its function.)



This model comes from NMR imaging, which is capable of "seeing" multiple configurations that the RNA takes on. (Unlike X-ray crystallography, which requires that all the molecules be "frozen" into one configuration, so a crystal can form.)
All the configurations are superimposed here, and you can see that for the most part, the differences are small. (I've called out the one exception, where the uracil bulge will occasionally form a hydrogen bond with the uracil on the other side of the helix.)

In contrast, consider the following NMR model, which is for a riboswitch (specifically, a preQ1 riboswitch in the bound state).



The image in the upper left shows all the configurations. Notice how much more variety (i.e. entropy) there is. In particular, there is a lot of switching of specific hydrogen bonds, while the overall structure remains essentially unchanged. The other three quadrants each show just one of the 21 states that are superimposed in the first quadrant.

What I think is happening is that this local variation in states (substates?) forms a "broad energy valley" of states that increases the stability of the general shape more than the single minimum free energy value suggests.
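
A toy calculation can make the "broad energy valley" idea concrete. The sketch below uses the standard Boltzmann relation, where the free energy of a group of states is -kT ln(sum of exp(-E/kT)). Treating the 21 NMR configurations as 21 substates of equal energy is purely an illustrative assumption, and so is the energy value of -10 kcal/mol; the function name is made up.

```python
import math

# kT in kcal/mol at 37 C (gas constant in kcal/(mol K) times temperature in K)
KT = 0.0019872 * 310.15


def ensemble_free_energy(energies, kt=KT):
    """Free energy of a set of states: -kT * ln(sum(exp(-E/kT)))."""
    e_min = min(energies)
    # Factor out the minimum energy so the exponentials stay well-behaved
    z = sum(math.exp(-(e - e_min) / kt) for e in energies)
    return e_min - kt * math.log(z)


single_state = ensemble_free_energy([-10.0])
broad_valley = ensemble_free_energy([-10.0] * 21)
print(single_state, broad_valley)
```

Under these assumptions the valley of 21 equal substates comes out roughly 1.9 kcal/mol more stable than a single state with the same minimum energy, even though the MFE value is identical in both cases.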

But entropy is a two-edged sword. If there is a lot of possible variation that stabilizes each of the two desired states (e.g., in the Exclusion case, one and only one of FMN and MS2 bound), that is good. But variation that allows for neither of them to be bound would result in a "mushy" switch, which wouldn't get good Eterna scores.

Eli Fisker

Hi Omei!

Thx :)

This is super cool! I love the images and the explanation. When I run the two sequences through Vienna, it shows quite accurately where change is happening according to the NMR you show. For the first one, only one base is higher entropy in the bulge region. The general entropy is low, but entropy spikes at the bulge base and its sometime partner.





So while the two designs seem similar in entropy when running them through Vienna (first one 0.9 and second one 1), they are not. The higher entropy in the RNA hairpin is a spike at only two bases, but the slightly higher entropy in the riboswitch is spread over the whole stem. The latter fits quite nicely with there being much more movement in the NMR images.





I find your thoughts interesting on entropy being a double edged sword. It's needed for getting work done, but it might not be doing what we intended it to.

I can't see anything about entropy in relation to state with Vienna, since it only treats RNA as single state. I can only see where it thinks there is action. So anything that could show us entropy trends for the different states would be most helpful. Or if we could find some features that can help us tell whether one of the states is not going to form.

jandersonlee

I use RNAsubopt a lot, which shows multiple foldings rather than just the MFE shape. I've not developed a bot for switch design using it yet, but if I do I will definitely keep the entropy concept in mind.

Eli Fisker

Hi JL!

Thx for the RNAsubopt tip. I will try to take a look at it.

I also really like the thought of a switch bot taking entropy into account. :)

Omei Turnbull, Player Developer

Good news on using ViennaRNA: Version 2.2, which is still considered beta but which is being used on the Vienna Web server, can calculate the partition function for hard constraints. What that means is that you can get statistics that represent the whole ensemble of possible foldings (like entropy and base pairing probabilities) using the same constraints that the Eterna UI currently uses to estimate the bound state MFE. You can find the constraints for the original six MS2 puzzles on Nando's eternadev server, here for example. And just knowing that in the constraint language, "|" means paired, "x" means not paired, and "." means "don't care", you can figure out the proper constraint for any other placement of the FMN aptamer.
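
To make the constraint language concrete, here is a small sketch that builds such a string from lists of positions. The helper name `constraint_string` and the example positions are hypothetical; the real positions for the FMN aptamer have to come from the puzzle's own constraint files, which are not reproduced here.

```python
def constraint_string(length, paired=(), unpaired=()):
    """Build a hard-constraint string: '|' paired, 'x' unpaired, '.' don't care.

    Positions are 1-based. The string has the same length as the sequence.
    """
    paired, unpaired = set(paired), set(unpaired)
    if paired & unpaired:
        raise ValueError("a position cannot be both paired and unpaired")
    for pos in paired | unpaired:
        if not 1 <= pos <= length:
            raise ValueError(f"position {pos} is outside 1..{length}")

    chars = ["."] * length
    for pos in paired:
        chars[pos - 1] = "|"
    for pos in unpaired:
        chars[pos - 1] = "x"
    return "".join(chars)


# Hypothetical example: bases 2 and 3 must pair, base 5 must stay unpaired
print(constraint_string(10, paired=[2, 3], unpaired=[5]))
```

The resulting string goes on the line directly under the sequence when a folding run is given constraints, so it is worth double checking that the two lines have the same length.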

Note that Nando has pointed out that this hard constraint is not exactly the same thing as modeling the actual FMN binding. But given all the other simplifying assumptions being made, I think it is good enough to yield some insight into the bound state. Nando hacked up (his description) a Vienna 1.x version that does a better calculation. I tried that version out, but it seemed like the current 2.2 version gave more plausible results in the unbound state, so I have been using 2.2 in my current investigations of predicting the MS2 switch scores from the partition functions of the two states. (Which, btw, has yielded some positive results, but I still have more work to do before "publishing" that.)

One other caveat: There was a bug in release candidate 2 of ViennaRNA that often caused it to abort when calculating the partition function in combination with the constraints for our MS2 RNAs. I submitted a bug report, and got a quite prompt response acknowledging the bug and saying it had been found, and the fix would be appearing in release candidate 3. I just checked, and the source code for release candidate 3 is now available. I'm guessing the Web server has also been updated to RC3, but I don't know how to verify it short of trying it out. If you do get the error "unbalanced brackets in make_pair_table", post the sequence. I'll be downloading RC3 so I can use it to check if the bug lives on, or if RC3 just hasn't made its way to the server yet.

Eli Fisker

Turning off MS2

I have been talking about MS2 turnoff sequences earlier. They seem to be using the same method of operating in the turnoff labs. Working in concert with the MS2 gate.

In the turnoff labs (Exclusion type), MS2 is on and formed in State 1 and needs to get turned off in State 2.

MS2 is particularly fond of having its turnoff sequence right after itself or right in front. It depends on the position of the FMN. The turnoff sequence lands after the MS2 sequence when an FMN sequence is close in front of the MS2 sequence. However, when the FMN is close after the MS2 sequence, the MS2 turnoff sequence lands before the MS2 sequence.

Usually this turnoff sequence lands right next to the MS2 sequence. It typically consists of 4-6 bases, although it can in rare cases be shorter or longer. These 4-6 bases are typically complementary to a stretch inside of the MS2. Most of the time it contains a majority of C’s and U’s. It is also what I have earlier called a strong CU magnet segment - although these do not always need to be right next to the MS2 sequence.
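
A simple way to look for candidate turnoff sequences is to scan the bases on either side of the MS2 sequence for 4-6 base windows whose reverse complement occurs inside the MS2. The sketch below does that with Watson-Crick pairs only (ignoring GU pairs is a simplification), and the `flank` distance of 10 bases is an arbitrary choice. The function names and the toy sequences in the example are made up; pass in the real MS2 hairpin sequence from your puzzle.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq):
    """Watson-Crick reverse complement of an RNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_turnoff_candidates(design, ms2, min_len=4, max_len=6, flank=10):
    """Find flanking windows that can pair with a stretch inside the MS2.

    Returns (side, 1-based start in design, window, 1-based offset in ms2).
    """
    design, ms2 = design.upper(), ms2.upper()
    start = design.find(ms2)
    if start < 0:
        raise ValueError("MS2 sequence not found in design")
    end = start + len(ms2)

    flanks = {
        "before": (max(0, start - flank), start),
        "after": (end, min(len(design), end + flank)),
    }
    hits = []
    for side, (lo, hi) in flanks.items():
        for size in range(min_len, max_len + 1):
            for i in range(lo, hi - size + 1):
                window = design[i:i + size]
                offset = ms2.find(reverse_complement(window))
                if offset >= 0:
                    hits.append((side, i + 1, window, offset + 1))
    return hits


# Toy hairpin standing in for the MS2 sequence, with a UCCC stretch right after it
print(find_turnoff_candidates("AAGGGAAACCCUCCCAA", "GGGAAACCC"))
```

A hit only says that pairing is possible on paper; whether it actually turns the MS2 off still has to show in the lab data.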


Image examples with MS2 turnoff



What is quite interesting here is that the Sensor v3, variant 2 lab, that does not have an aptamer, has a kind of pseudo FMN sequence in front of its MS2 sequence, so it gets similarities to the Ex 3 and Ex 4 labs.




Perspective

One of the exclusion labs that stands out from the MS2 turnoff pattern is Brourd's mod of Exclusion 4. In that lab most of the top scorers don’t use a long turnoff sequence for the MS2, nor do they make an MS2 gate. Instead they tend to solve in a style much like the Zipper complementary style of turn on labs like Same State 2. Which I find interesting. I look forward to seeing if this pattern shows a way of escaping the more fixed pattern of MS2 gates and turnoff sequences.

jandersonlee

Thanks for the analysis.