Switch Scores for EteRNA Switch Puzzles

  • Updated 1 year ago
An exciting direction in EteRNA is the study of riboswitches!

We have recently finished our pilot experiments with great initial success. Using a new technique that measures switching directly on a sequencing chip, we can observe the switching of thousands of designs at once. The signal is generated by a fluorescent RNA-binding protein, MS2. Instead of the standard EteRNA score, which is based on the correct folding of each base, we have introduced a new Switch Score.

The Switch Score (0 - 100) has three components:
1) The Switch Subscore (0 - 40)
2) The Baseline Subscore (0 - 30)
3) The Folding Subscore (0 - 30)

The scoring scheme is summarized below. A more detailed description is given in this PDF:
https://drive.google.com/open?id=0B_N0OA9NROPGel80SG5LM0wtZms&authuser=0

A typical example of a switch puzzle is shown below:


The player designs the structures in [1*] and [2]. To observe the switching we then measure the fluorescent signal of MS2, which binds specifically to the MS2 hairpin seen in [2]. In the absence of FMN, the MS2 should bind and the switch is ON. On the other hand, if we introduce FMN, the ligand in [1*], the switch should be OFF and not exhibit fluorescence.

No switch is 100% ON or OFF in the absence or presence of ligand, but a good switch can come very close (and get a perfect EteRNA Switch Score!). At some MS2 concentration, the difference in signal between the two states should be large (e.g., at ~100 nM MS2 in the figure below). In practice, we don't know this concentration beforehand, so instead we perform measurements at many concentrations to obtain binding curves. When the switch turns OFF (red curve), the effective dissociation constant increases. The dissociation constant, Kd, is the concentration where half of the RNA binds MS2.
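To make the binding-curve picture concrete, here is a minimal Python sketch. It assumes a simple one-site binding isotherm, which is consistent with Kd being the half-binding concentration but is not necessarily the exact model used for the fits. The function name and the two Kd values are hypothetical, chosen only for illustration.

```python
def fraction_bound(ms2_nM: float, kd_nM: float) -> float:
    """One-site binding isotherm: fraction of RNA bound by MS2 (assumed model)."""
    return ms2_nM / (ms2_nM + kd_nM)


# Hypothetical Kd values for illustration only (not measured data).
KD_ON_NM = 10.0     # no FMN: MS2 binds tightly (blue curve)
KD_OFF_NM = 1000.0  # with FMN: MS2 binds weakly (red curve)

for conc in (1, 10, 100, 1000, 10000):
    on = fraction_bound(conc, KD_ON_NM)
    off = fraction_bound(conc, KD_OFF_NM)
    print(f"{conc:>6} nM MS2  ON={on:.2f}  OFF={off:.2f}  difference={on - off:.2f}")
```

For this simple model the gap between the two curves is widest at the geometric mean of the two Kd values, which is 100 nM for the numbers assumed here.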


The Switch Subscore quantifies how far apart the Kd's are in the absence and presence of FMN (horizontal distance between the red and blue curves).

The Baseline Subscore is a measure of how close the ON-state is to the original MS2 hairpin (lower Kd is better, i.e., the blue curve should be far to the left).

The Folding Subscore is high if MS2 binds properly in the ON-state at any concentration (the blue curve should reach high values at high concentrations of MS2, i.e., high values to the right).

In our first experiments, we found that the easiest score to maximize is the Folding Subscore, followed by the Baseline Subscore. These two ensure that the MS2 hairpin is properly formed in the ON-state. The hard one is the Switch Subscore, which is the highest when the energy difference between the states is finely-tuned to the energy conferred by binding to FMN (or other future ligands).
johana, Researcher

Posted 5 years ago

Eli Fisker
Cluster siblings

I had been wondering whether, when a winning design has a good number of clusters, its lower-scoring siblings do too. Check, confirmed.

Next I wondered whether anything makes designs with many clusters stand out from the rest, and likewise whether anything sets designs with few clusters apart.

The designs with a larger number of clusters tended to have more repeat U’s than average.



The designs with a very low number of clusters tend to have more long stretches of A’s than average, and several of them.

Designs with only 1 cluster.


However, as an odd curiosity, these seem to score better than the designs with the highest number of clusters. Not sure what is going on here.

But if a certain number of clusters is good to have (assuming there is a range that also works well for high-scoring designs), it seems better to avoid having longer stretches of A’s (5+) and having many of them. A few are allowed with no problems, and some of the winners have them. But there can be too many.

For more details see here.

Amount of clusters per design sibling

Also, I noticed in the lab results that there was a tendency for designs with more A's and longer repeats to end up more often at the bottom of the scoring list (and among the designs that didn't get picked for synthesis), even though such designs also occur among the winners. There were just more of them at the bottom. Also, different sublabs seemed to have a somewhat different tolerance for repeat A's.

Base distribution
johana, Researcher
Thanks for looking into this. There's definitely a sequence-dependent effect on the number of clusters and we don't know at this point if it happens during synthesis or during PCR amplification. One reasonable assumption would be that the GC-rich sequences amplify less during PCR. The low number of clusters for polyA's suggests that maybe the synthesis has problems with these segments.

I think that some of the high-scoring designs get a high score by random chance and if the number of clusters is really low (… I'm delighted that it is already underway. This issue will only rise in importance as the number of designs goes up.

Good job!
Eli Fisker
Np and thx, Johan! :)
salish99
So, thanks for all your comments.

First, I used k = 3.2e-6 hartree/K and 1 hartree = 627.5 kcal/mol at a temp of 295 K (well, what is the temp in the lab?) to get a conversion factor of 0.592 from kT to kcal/mol, lemme know if you agree. Then, we arrive at the following set of graphs for the dG no FMN set:
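As a cross-check of that conversion, here is a small Python sketch. It uses the molar gas constant in kcal/(mol·K), which is physically the same thing as Boltzmann's constant per mole; the function names are hypothetical and the lab temperature is an assumption, since it is not stated in the thread.

```python
# Molar gas constant in kcal/(mol*K); same physics as Boltzmann's constant per mole.
R_KCAL_PER_MOL_K = 1.987204e-3


def kt_in_kcal_per_mol(temp_kelvin: float) -> float:
    """Thermal energy kT expressed in kcal/mol at the given temperature."""
    return R_KCAL_PER_MOL_K * temp_kelvin


def dg_kcal_per_mol(dg_in_kt: float, temp_kelvin: float) -> float:
    """Convert a free energy given in units of kT to kcal/mol."""
    return dg_in_kt * kt_in_kcal_per_mol(temp_kelvin)


# Assumed temperature of 295 K, as in the post above.
print(round(kt_in_kcal_per_mol(295.0), 3))   # 0.586
# Same calculation with k rounded to 3.2e-6 hartree/K, as in the post above.
print(round(3.2e-6 * 295.0 * 627.5, 3))      # 0.592
```

The two numbers differ only because of the rounding of k. A factor of 0.592 is what the unrounded constant gives at about 298 K, so either value is reasonable depending on the actual lab temperature.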



and individualized:

salish99
I left the indicator triangle in place on all the separated graphs for reference. It becomes immediately evident that it indicates the outstanding results (outliers) more than the norm of the bulk. In the latter case, the triangle would be lower.
jandersonlee
When I look back at these graphs, I'm "seeing" an open Z (in a lot of the Exclusion graphs) with the top bar at score 60 from -11 to -10, a sloped bar from around (60,-10) to (30,-9.5) and a bottom bar at around (30,-9.5) and rightward. (Perhaps most visible in Exclusion 3 and 4.)

In the Same State graphs there appear to be two clusters, one sloping up and right from (60,-11) and one horizontal at 30.

Any thoughts on what these might be?
Omei Turnbull, Player Developer
These look like the many designs that got a zero on the switch score.
salish99
The Z's are the threshold values at which labs hit rock bottom or the glass roof of their respective score bracket (folding score, switching score, MS2 score), either at 0 points or at the max value for that subscore. Thus, all the mixed score graphs have these Z shapes.
salish99
And here part two of the exercise, dG binding with FMN:


and with the best results from each individual experiment:

and the clean version

and split in the individual parts:






salish99
Top scoring entry of each individual lab highlighted
salish99
So, that's it from me for now - looking forward to your critique and comments.
salish99
One question, if I have a sequence, such as, for example, GGUUGGCGAGGAUAUACAUGAGGAUCACCCAUGUGAGAGAAAAAAGUGAGGAAGAAAAGUAGAAGGCGCUGACG, how do I count the "A"''s in an automated system?
I tried sth like countif(table slot;"A"), to no avail. Tips and tricks appreciated. Thx.
Omei Turnbull, Player Developer
Assuming your environment has a function that will substitute all occurrences of one substring with another substring, you can substitute all occurrences of "A" with the null string ("") and compute the difference in length of the two strings. For example in Excel, the formula would be

=LEN(cell_ref)-LEN(SUBSTITUTE(cell_ref,"A",""))

There are more concise ways in other environments, but I think the above is probably what you are looking for.
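As one example of a more concise environment, the same count can be sketched in Python. The function names are hypothetical; `count_by_substitution` mirrors the spreadsheet formula above, while `count_base` uses Python's built-in `str.count`.

```python
def count_base(sequence: str, base: str = "A") -> int:
    """Count occurrences of a single base, ignoring letter case."""
    return sequence.upper().count(base.upper())


def count_by_substitution(sequence: str, base: str = "A") -> int:
    """Same idea as the spreadsheet formula: length lost when the base is removed."""
    return len(sequence) - len(sequence.replace(base, ""))


# Example sequence taken from the question above.
seq = "GGUUGGCGAGGAUAUACAUGAGGAUCACCCAUGUGAGAGAAAAAAGUGAGGAAGAAAAGUAGAAGGCGCUGACG"
print(count_base(seq), count_by_substitution(seq))  # both approaches agree
```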
salish99
Ah, thanks. excellent idea!
it's (with semi colons):
=LEN(cell_ref)-LEN(SUBSTITUTE(cell_ref;"A";""))
Brourd
This is a question related to data analysis:

Dr. Andreasson - How many clusters should we expect to be considered an adequate number for robust data analysis? 5? 10? 20? 30?

Should it be taken under consideration to synthesize each sequence on the chip twice, with different barcodes or whatever method is used for sequence identification and differentiation?
johana, Researcher
I will probably return the standard error of the mean for log(kd) for all cases with n>2. Please let me know if you prefer another metric. I could also, in principle, upload the entire data set (the fits of each individual cluster) but only if people find it valuable.
salish99
Oh, I would find it valuable.
johana, Researcher
OK

Google Spreadsheet:
Eterna R93, expanded data


or as an Excel file:
Eterna R93, expanded data
johana, Researcher
Just uploaded an expanded data set that includes estimated errors and base content of all the sequences.
johana, Researcher
It's not the individual cluster fits, but it's more than before.
salish99
I did some stat analysis on the KD values. First, the KD,off:





salish99
then, the kd,on





salish99
Some discussion on the graphs: SS2 can achieve much lower KD values, in both the on and off states.
Coupled with very high Kd,on max values, this should make a beautiful switch (and does so, in 5494860).
The large standard deviation of the SS2 data set, coupled with the highest skewness and kurtosis, also means this is the most widely spread distribution of data.
Also, since starting to look at the data, I have been wondering why the ex4 set scored so much lower, both on average and at the top. This was the set I actually invested the most time in, and I produced the maximum possible entries, too, so this is something of a personal fail. What would the analytics tell us? Well, for one, in the kd,off, the data has the highest skewness and kurtosis. But this didn't actually bode badly for the kd,on SS2. The maximum values in kd,off are in the 5000 range, about half of what could be observed in kd,on SS2. It may be that having no equivalent "outstanding" value in kd,on is what is missing in ex4 to allow it to score higher. If we managed to maximise this (in addition to the kd,off), ex4 could also produce a good FMN binding molecule.
Interestingly, the ex2 maxima are very low compared to the rest, and yet this also produced some good binding molecules. It may be that the relative difference counts more than the absolute difference, so once this ratio is sufficiently high, a further difference in k will not improve the molecule towards our set parameters.
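For readers who want to reproduce this kind of summary, here is a minimal Python sketch of the statistics mentioned (mean, standard deviation, skewness, kurtosis). It is an assumption about the definitions, not a record of the tool actually used for the graphs: it computes plain population moments, and the function name `describe` is hypothetical.

```python
import math


def describe(values):
    """Population mean, standard deviation, skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,          # > 0 means a long tail to the right
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # > 0 means heavier tails than a normal
    }


# Toy numbers for illustration only (not Kd values from the data set).
print(describe([1, 2, 3, 4, 5]))
```

Spreadsheet functions such as Excel's SKEW and KURT apply sample-size corrections, so their values will differ slightly from these population formulas, especially for small sets.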
salish99
For anybody interested, the data analysis, STAT analysis, all graphs, and the color coded data (as well as the recalculated data) is available to all, here:
https://www.dropbox.com/s/cdvxchcc3ag...
johana, Researcher
Preliminary data for the miRNA switches are now also available along with an expanded data set for the FMN switches.


Google Spreadsheet:
Eterna R93, expanded data

or as an Excel file:
Eterna R93, expanded data

There are many columns. Basically, I have estimated the standard error for the log(kd) = dG and also converted these to an 'error factor' for the kds and the fold-change. The 'error factor' is just exp(dG_err), and should be interpreted as a multiplicative (*/) error instead of an additive (+-) one.
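A minimal Python sketch of how such an error factor could be computed and read. It assumes natural logarithms (so that exp() inverts the log) and the standard error of the mean over per-cluster fits; the function name and the input numbers are hypothetical.

```python
import math
import statistics


def log_kd_summary(kd_values_nM):
    """Return (kd, error_factor, lower, upper) from per-cluster Kd fits.

    Works on ln(Kd): the mean gives the central Kd, and the standard error
    of the mean becomes a multiplicative factor after exponentiation.
    Needs at least two values.
    """
    logs = [math.log(kd) for kd in kd_values_nM]
    mean_log = statistics.mean(logs)
    sem_log = statistics.stdev(logs) / math.sqrt(len(logs))
    kd = math.exp(mean_log)
    factor = math.exp(sem_log)
    # "*/" error: divide for the lower bound, multiply for the upper bound.
    return kd, factor, kd / factor, kd * factor


# Hypothetical per-cluster fits, for illustration only.
kd, factor, lower, upper = log_kd_summary([5.0, 20.0])
print(kd, factor, lower, upper)
```

With an error factor of 2, for example, a Kd of 10 nM reads as "between 5 and 20 nM", rather than "10 plus or minus something".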

I have also included base content for all the sequences. Please let me know if you need anything else!
salish99
Ok, I used Omei's algorithm to calculate base content, and included this and my graphs in tab #data-sort-colorcoded. I also included Johan Andréasson's new set of data, incl. more lab data (tab data-orig-extended). The set includes another 3000 molecule analyses, and the data diverges beyond the #16 best scoring (E.Sc.) result.

Here, again, the xls source file, with all the new stuff:
https://www.dropbox.com/s/cdvxchcc3ag...

Here the graphs for base distribution:







salish99
So, what does this mean?

1) To achieve high binding energies, we need:
a) 10-15% C
b) 35-60 % A
c) 10-16 % U
d) 35 % G for Ex3, and 12-16 % G for Ex4 and SS1

and for the second data set

2) To achieve high Eternascores, we need:
a) 10-24 % C
b) 25-58 % A
c) 8-25 % U
d) 18-40 % G
!
These spans are, unfortunately, so wide that they likely encompass 3/4 of all the molecules in this (or any other) data set.

Of particular interest are those runaway data lines (e.g. at 38 % A and corresponding 14 % U for Ex3), where a whole set of very high binding energies can be achieved with essentially the same composition, and just one minute change in the molecule makes a large difference, whereas any such change in molecules outside this very defined parameter set barely changes the binding energy at all. Similar peaks exist for other experiments, too, but I wanted to point this one out as it's very prevalent across the first four graphs.
Eli Fisker
This is intriguing. Very interesting what you did there.
Eli Fisker
General thoughts on this round's results

While red and green magnet segments do play a role in switching, and I have gotten some of my designs to switch where I mainly focused on red and green segments without adding much else, I have learned something: that alone is most often not enough. Having other stretches be complementary to other parts of the MS2 or microRNA is very important too. The microRNA designs generally did better if the design tried to pair up with most of the mir hairpin.

In general, switching can also often be achieved with pure complementarity alone, using longer stretches of it. Using only G or C segments leaves much of the sequence as A’s, and those don’t do well in synthesis. So it isn’t bad to use longer stretches of mixed bases to get complementarity, even though it means longer switching stretches.

So based on what I see now, I expect longer stems for the XOR puzzles.

Still, when using G and C segments, their positioning often matters very much. The same goes for stretches complementary to MS2.
salish99
Question:
How "natural" are molecules that contain a desert full of A's, e.g. the best in ex4:
AAAAAGUGAGGAUAUGGUAAAAAAAAAAAAAAAAAAAAACUCCAGAAGGCACAUGAGGAUCACCCAUGUAAAAA.
Would this form naturally? Should we rather cut short on those A-strings? Does it matter?
Eli Fisker
Hi Salish!

From what I have seen of natural RNA so far, having that many A's in a row with nothing happening in between is not very natural. You can take a peek for yourself. Here is a link and intro to a database of natural RNA:

https://getsatisfaction.com/eternagam...

So I too would support some limitations on how many A's in line can get used.

Love how you describe the perpetrator A's as a yellow desert. :)
salish99
Ah, that term isn't from me, it comes from DNA research. We have short intervals of genes that are activated and expressed in certain ways, and then there are long stretches of DNA that do nothing, don't activate, don't control the activation, and aren't expressed - those are our "ancestry", old genes, no longer used for millennia (well, or so goes the theory, since we need those stretches for nothing (that we know of yet)) - those stretches are called deserts. Found it appropriate to call those eternal A's the same.
We have limitations for A's (4 long) in the puzzles, but they are not enforced... And many well-scoring proposals have long stretches of A, so it is beneficial, if not natural, for switches...
salish99
It may be an idea to put an overall counter limiter on the A's
Brourd
Here are some charts showing the relationship between individual nucleotide percentages and the number of clusters for each sequence. On the right side of each image is the same chart on a logarithmic scale.

Percentage Cytosine in a sequence as related to the number of clusters



Left Regular scale, Right Logarithmic Scale

Percentage Guanine in a sequence as related to the number of clusters



Left Regular Scale, Right Logarithmic scale.

Percentage Adenine in a sequence as related to the number of clusters



Left Regular scale, Right Logarithmic Scale

Percentage Uracil in a sequence as related to the number of clusters



Left Regular scale, Right Logarithmic Scale

According to the combined R88 and R93 data, there is clearly a correlation between the experimental yield of clusters and the percentage of adenine residues within a sequence, with the percentage of adenine going up as the cluster yield goes down.

There is also a weak correlation between cytosine and uracil percentages, and the number of clusters, with an increase in uracil and cytosine percentages being related to an increase in the number of clusters for a sequence.

From this data set, it would appear that there is little to no correlation between the percentage guanine of a sequence and the experimental yield.

Experiment suggestion: Perhaps we should create a set of sub-puzzles where the number of allowed adenine residues is capped at 40%.
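A minimal sketch of what such a cap could look like as a check on a design, written in Python. The function names are hypothetical, and treating exactly 40% as allowed is an assumption; the real in-game constraint may be defined differently.

```python
def adenine_fraction(sequence: str) -> float:
    """Fraction of the sequence made up of A residues (0.0 to 1.0)."""
    if not sequence:
        return 0.0
    return sequence.upper().count("A") / len(sequence)


def passes_adenine_cap(sequence: str, max_fraction: float = 0.40) -> bool:
    """True if the adenine content is at or below the cap (assumed inclusive)."""
    return adenine_fraction(sequence) <= max_fraction


# The ex4 design quoted earlier in the thread, used here only as an example input.
design = "AAAAAGUGAGGAUAUGGUAAAAAAAAAAAAAAAAAAAAACUCCAGAAGGCACAUGAGGAUCACCCAUGUAAAAA"
print(f"{adenine_fraction(design):.0%} A, passes 40% cap: {passes_adenine_cap(design)}")
```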
johana, Researcher
Great plots!

The adenine to cluster number relationship is definitely convincing. Capping the A content, or at least encouraging the players to reduce the number of long stretches of A, is a good idea.
salish99
nice
johana, Researcher
I found Brourd's plots so great that I decided to replicate them, this time also including all the ~1100 sequences that resulted in no clusters and hence weren't included in the dataset I posted online.

I tried to match the color scheme. To fit the data on a log-scale for the y-axis, all 0 values have been changed to 0.5.
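The zero handling described above can be sketched in a few lines of Python. This is only an illustration of the idea (a logarithm of 0 is undefined, so zeros get a placeholder below the smallest real count of 1); the function name is hypothetical and the plotting itself is left out.

```python
import math


def clusters_for_log_axis(cluster_counts, zero_placeholder=0.5):
    """Replace zero counts with a small positive placeholder so log10 is defined."""
    return [zero_placeholder if count == 0 else count for count in cluster_counts]


# Toy counts for illustration only.
counts = [0, 1, 25, 300]
plottable = clusters_for_log_axis(counts)
print(plottable)
print([round(math.log10(value), 2) for value in plottable])
```

Because 0.5 sits below 1 on the log axis, the sequences with no clusters show up as a separate band under all sequences that produced at least one cluster.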

I also separated the R88 and R93 populations, which were mixed in at different concentrations.

The trends identified by Brourd seem to extend down to the absence of clusters on the chip (not surprisingly). A next step could be to investigate whether the problem is overall content or whether the effect is largely caused by long stretches of identical bases that could throw off sequencing or synthesis.

All sequences:



R93:



R88:

salish99
Errr, out of curiosity, how do you display a zero (i.e., zero clusters) on a log axis?
I presume you arbitrarily gave this a value, e.g. 0.1 or 0.01, to distinguish it from the rest of the data set and still be able to display it?
Brourd
Based on a very rough preliminary analysis, it would appear that the overall content of adenine residues may be important in determining the cluster yield, although consecutive adenine length is just as or more important for determining cluster yields. It's hard to tell, given that as a player breaks up more consecutive adenine sequences, they lower the overall content of adenine at the same time.
salish99
(or, 0.5, judging by the graph values...)
salish99
yes, Brourd. I wonder if we could introduce a counter for "total A", not just "A's in a row" and set it, based on the length of the RNA string, to max. 55% of total length, and make it a hard rejection rule (as with the 4 C's in a row)
johana, Researcher
The counter is coming
salish99
I know, not the right place, but: did the eterna.stanford server vanish? i can only connect via cmu?
salish99
One thing I noticed when trying to recreate the better scoring molecules: in either Vienna1 or 2, one of the hairpins gets predicted as forming incorrectly or not forming at all. I wonder if this is a "quality mark" of good MS2 switches, as they are plastic enough to adjust to the FMN. Just a thought.
Eli Fisker
I have been making some drawings and spreadsheets for the Same State 1 + 2 labs.

I started the Same State 2 spreadsheet before I decided to use 10 as the cluster cutoff. So for the drawings and the summary I will try to ignore the designs that have fewer clusters.

Drawings first and then some sum up. :)






Same State 1

- This one is one of the more flexible labs with multiple solving ways. However one thing is generally true. Early part of MS2 usually pairs up either before, after or at FMN2. Late part of MS2 usually targets before, after or at FMN1. Then there are the exceptions... There are several distinct different ways of solving this lab.

- Often one or both of the two sets of twin G’s in the aptamer segments are bound up.

- CCCAC’s most often end up as multiloop spacers, with at least one of the C’s being part of a closing GC pair when the MS2 splits up and pairs.

- 3-6 base pairs after the aptamer stem in state 2, where the aptamer should form. (The longer stems seem to harbour GU’s a bit more often.)

- Most switches seem to be partially moving, but there are a few fully moving ones.

- A good part of the designs have a GGG segment. The ones that have GGG segments use more GU’s in both states than those without.

- Most designs made a static stem or two from the middle of the sequence after the aptamer.

Spreadsheet for Same State 1


Same State 2

- There are typically 3 basepairs after aptamer in state 2 where it should form.

- Most switches are partial switches

- CCCAC’s often end up as multiloop spacers, with the C’s at either end being part of a closing GC pair when the MS2 splits up and pairs.

- Half the designs have some kind of GGG magnet segment.

- Most have early MS2 pair up with the first FMN sequence (FMN1) and similar late MS2 pair up with the second FMN sequence (FMN2)

- The majority has an extra static stem.

Spreadsheet for Same State 2
Eli Fisker
Single base to Base pair ratio

I think one factor that could be interesting to look into for the MS2 labs is the ratio of single bases to paired bases. In some of the labs there is a clear preference for not leaving many bases single in either state. In others it is more tolerated. Perhaps single tail stretches, single loop stretches or gap bases could also be worth looking into, for when to use them and when to avoid them.



When I look at high-scoring designs with a minimum of 10 clusters, it's my impression that Ex2, Ex3 and SS2 prefer having their tails fold up together. (These are the labs that have their MS2 sequence somewhat more towards the middle.)

The Ex1 and SS1 labs seem to mind single tail stretches less if they are close to or involved in the switching area. (Both of these have their MS2 sequence at the beginning of the RNA sequence.)

Well, that's except Ex4 - it also has switching before the aptamer, but has its tail region filled with bases other than A, too. So it is just switching and pairing between states. (This one has a late positioning of its MS2 sequence.)

I have been looking a bit into this from different angles. Here are links to these separate investigations:

Neck formation and dangling ends

Base distribution

What are the benefits of creating static stems?
Brourd
The following are a series of graphs relating the percentage adenine of a sequence to the number of clusters, based on the MAX number of consecutive adenine residues in a sequence, from 10+ consecutive adenine to 2 consecutive adenine.

Max 10+ Consecutive Adenine



Average Percent Adenine: 52%
Average Number of Clusters: 15
Total Number of Clusters: 7966
Total Number of Sequences: 542
Number of Seq. with less than 10 Clusters: 396
Number of Seq. between 10 and 100 Clusters: 132
Number of Seq. greater than 100 Clusters: 14

Max 9 Consecutive Adenine



Average Percent Adenine: 49%
Average Number of Clusters: 27
Total Number of Clusters: 3930
Total Number of Sequences: 145
Number of Seq. with less than 10 Clusters: 69
Number of Seq. between 10 and 100 Clusters: 66
Number of Seq. greater than 100 Clusters: 10

Max 8 Consecutive Adenine



Average Percent Adenine: 49%
Average Number of Clusters: 31
Total Number of Clusters: 8621
Total Number of Sequences: 279
Number of Seq. with less than 10 Clusters: 131
Number of Seq. between 10 and 100 Clusters: 128
Number of Seq. greater than 100 Clusters: 20

Max 7 Consecutive Adenine



Average Percent Adenine: 47%
Average Number of Clusters: 32
Total Number of Clusters: 10711
Total Number of Sequences: 335
Number of Seq. with less than 10 Clusters: 135
Number of Seq. between 10 and 100 Clusters: 178
Number of Seq. greater than 100 Clusters: 22

Max 6 Consecutive Adenine



Average Percent Adenine: 45%
Average Number of Clusters: 28
Total Number of Clusters: 28744
Total Number of Sequences: 1011
Number of Seq. with less than 10 Clusters: 392
Number of Seq. between 10 and 100 Clusters: 572
Number of Seq. greater than 100 Clusters: 47

Max 5 Consecutive Adenine Residues



Average Percent Adenine: 45%
Average Number of Clusters: 32
Total Number of Clusters: 50667
Total Number of Sequences: 1560
Number of Seq. with less than 10 Clusters: 588
Number of Seq. between 10 and 100 Clusters: 863
Number of Seq. greater than 100 Clusters: 109

Max 4 Consecutive Adenine Residues



Average Percent Adenine: 41%
Average Number of Clusters: 49
Total Number of Clusters: 137254
Total Number of Sequences: 2789
Number of Seq. with less than 10 Clusters: 889
Number of Seq. between 10 and 100 Clusters: 1603
Number of Seq. greater than 100 Clusters: 297
Number of Seq. greater than 500 Clusters: 6

Max 3 Consecutive Adenine Residues



Average Percent Adenine: 34%
Average Number of Clusters: 108
Total Number of Clusters: 255229
Total Number of Sequences: 2355
Number of Seq. with less than 10 Clusters: 213
Number of Seq. between 10 and 100 Clusters: 1372
Number of Seq. greater than 100 Clusters: 770
Number of Seq. greater than 500 Clusters: 71

Max 2 Consecutive Adenine Residues



Average Percent Adenine: 28%
Average Number of Clusters: 209
Total Number of Clusters: 328702
Total Number of Sequences: 1571
Number of Seq. with less than 10 Clusters: 58
Number of Seq. between 10 and 100 Clusters: 562
Number of Seq. greater than 100 Clusters: 951
Number of Seq. greater than 500 Clusters: 137
Number of Seq. greater than 1000 Clusters: 25

Based on the extensive statistics here, and on the trends of several of these graphs, there appears to be a correlation between the adenine content of a sequence and the cluster yield. However, filling a sequence with uracil may not necessarily reflect well on the riboswitch ability. How this may be used or implemented as a constraint in game for the puzzles is unknown as well.
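The grouping behind tables like the ones above can be sketched in Python. This is an illustrative reconstruction under stated assumptions, not the procedure actually used: the function names are hypothetical, runs of 10 or more A's are pooled into one group as in the first table, and the input is assumed to be (sequence, cluster count) pairs.

```python
from collections import defaultdict
from itertools import groupby


def longest_run(sequence: str, base: str = "A") -> int:
    """Length of the longest uninterrupted run of `base` in the sequence."""
    return max(
        (sum(1 for _ in group) for char, group in groupby(sequence) if char == base),
        default=0,
    )


def summarize_by_run(records):
    """Group (sequence, cluster_count) pairs by longest A run and summarize each group."""
    groups = defaultdict(list)
    for sequence, clusters in records:
        key = min(longest_run(sequence), 10)  # 10 stands for "10 or more"
        groups[key].append(clusters)
    return {
        key: {
            "sequences": len(counts),
            "total_clusters": sum(counts),
            "average_clusters": sum(counts) / len(counts),
            "below_10_clusters": sum(1 for c in counts if c < 10),
        }
        for key, counts in sorted(groups.items())
    }


# Toy records for illustration only (not taken from the R88/R93 data).
print(summarize_by_run([("GAAAC", 5), ("AAAGG", 15), ("GCGC", 100)]))
```

A summary like this also makes it easy to cross-check a table: the three cluster-count bins should add up to the total number of sequences, and total clusters divided by sequences should match the reported average.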
jandersonlee
What is curious to me are the outliers with higher adenine content that still generate 100+ clusters. They almost seem to form a second parallel region to the main one. If we could understand what makes them different, perhaps that could help.
Brourd
I believe those are the sequences from R88 that Dr. Andreasson stated were mixed in at different concentrations, if I understood his post correctly.
johana, Researcher
Great analysis!

The outliers are indeed the sequences from R88 that were simply added at a higher concentration. If you filter out these sequences (as in some of the figures above) you recover a single population for R93.

Although simply adding more DNA for the sequencing is a solution that worked for R88 (500 sequences) we have already hit that limit for R93 (~11000 sequences).

Nando has been hard at work and I have already seen constraints implemented both for successive A's and for overall A content (<40%). Hopefully these can be implemented into all the puzzles for the next round shortly.
salish99
nice
salish99
strong argument to make the restrictions on the A's "hard" restrictions in the lab, just as with the no G>4, C>3 etc...
salish99
We have, until now, looked at the individual cluster scores only as a function of the experiment (e.g. R93, R88). I have separated this out into the six sub-units, and something very interesting emerges: the adenine relation holds true for all BUT the Exclusion 1 experiment.

or, rather better:


separated into their individual parts:






The SS2 has the strongest dependence of cluster formation on adenine content, but that may be due to the fact that we have the molecules with the lowest A content only for the SS2 experiments.

All above graphs are made with data from R93 only.

Furthermore, I was interested, whether there was a dependence of the KD,on/KD,off and the cluster, here the result:

Brourd
The first graph looks off, since few sequences in the entire R93 round with over 50% adenine content managed to have cluster counts higher than 100.

salish99
I was wondering what was going on here, I finally found it in my data set...
can't post images here.... will do so somewhere else
salish99
The latter two graphs imply that, with the exception of outliers, there is no tapered correlation between KD values and cluster size. In fact, there is a pretty steep incline in possible cluster sizes at KD 10, and a (not as) steep dropoff at KD 500.
johana, Researcher
Good observation.

The KD=10 is close to the affinity for the MS2 hairpin (~16 nM for R88) and no values would be expected below that value, with the exception of a few outliers.

KD=1000 is about as high as we can measure since we only go up to 3000 nM during the experiments. I only removed fits with KD>40000 but it would not be surprising if some clusters simply don't yield a reasonable signal above 1000 nM. The dropoff is most likely due to imperfect switching rather than measurement noise.

Good switches, with eterna score = 100, should end up with KD>400, but theoretically an even higher KD is possible.
salish99
I got the data from the original 93/88 set above, which overlaid parts of the ex2 on the ex1. Here are the corrected graphs separating each of the 6 sub-experiments:



Brourd
A minor question.

I believe Johan stated in his presentation that a significant number of the clusters on the chip cannot be used due to mutations that occur with the DNA. Are any of these mutations based on predictable sequences, or are they entirely random?
johana, Researcher
Good question.
It's something we hope to look into but we don't have the answer yet.
The data can be made available but it requires some coding...
Omei Turnbull, Player Developer
If a mutated DNA sequence got replicated often enough that it created a statistically significant number of RNA clusters, would the data be any less interesting than the original sequence? Or is there some subtlety in the process that requires the possible sequences to be known a priori?
johana, Researcher
I think that the sequence would still be interesting. It would not be a switch directly designed by a player, but it may represent a useful "natural" mod of another design.