Phylogenetic tools: So for each design, you need to include the mutations made for the RNA. It would even be cool as the experimental throughput increases to begin mapping phylogenetic trees of the distance these sequences have from the WT ribosome.
Viewing Options: I would even say that it would be interesting to see if you can incorporate a separate view of the ribosome in its 2-dimensional structure, with the mutations clearly highlighted. While unlikely, it would be even better to include 3D views of the ribosome crystal/cryo-EM structures, with views of the mutations in their local environment.
Data: The ability to synthesize a single protein is probably not a good measure of robustness or betterness of any design compared to the WT ribosome. Similar to how selection processes work with development of aptamers and in vivo evolution of proteins, it could be that whatever players did made the ribosome really good at the synthesis of a specific protein or possibly even peptide sequence (unlikely, but it's something players need to remember when making mods of designs in the future). I'm sure that they're looking at taking the best of the best player sequences and transforming the plasmids into ribosome deficient E. coli cells, so this is definitely a comment about player conclusions that are made, but an important one to keep under consideration.
As an additional note on the data, it could be useful to include a numerical "relative change" score compared to the WT ribosome. It's not helpful to throw easily misinterpreted bar graphs and sigmoidal curves at players. I would recommend a numerical score, or some other easily summarized way to show success or failure, together with the error associated with it.
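To make the "relative change" idea concrete, here is a minimal sketch in Python. It assumes each design and the WT have one measured value plus an error estimate, and that the two measurements are independent. The function names and the example numbers are made up for illustration and are not lab data.

```python
import math

def relative_change(design, wt, design_err=0.0, wt_err=0.0):
    """Return (relative change vs. WT, propagated uncertainty).

    Assumes the design and WT measurements are independent, so the
    uncertainty of the ratio design/wt follows first-order error propagation.
    """
    if wt == 0:
        raise ValueError("WT value must be non-zero")
    change = design / wt - 1.0
    err = math.sqrt((design_err / wt) ** 2 + (design * wt_err / wt ** 2) ** 2)
    return change, err

def format_score(change, err):
    """Render the score as a signed percentage with its error."""
    return f"{change:+.0%} ± {err:.0%}"

# Hypothetical numbers: design reads 120 ± 12, WT reads 100 ± 10.
change, err = relative_change(design=120.0, wt=100.0, design_err=12.0, wt_err=10.0)
print(format_score(change, err))  # +20% ± 17%
```

In this made-up case the error is nearly as large as the change itself, which is exactly the kind of thing a bare bar graph makes easy to miss.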
QOL Changes: The link to the lab needs to be in an accessible place. It needs to be in the news item, and links to each design should be in the news item as well.
Some of the very best designs are indeed among those with fewer mutations. The best-performing 5S design has the fewest mutations.
Some of the worst-performing designs have many mutations. 2 out of 3 of the worst-performing 16S designs have 55 mutations.
However, the happy surprise is that even designs from the 16S lab with a good bunch of mutations are actually doing well too. But I do wish to highlight that Jieux's design with just 4 mutations is doing really well.
I have made a spreadsheet for quicker overview of the actual mutated bases in relation to their designs.
5S and 16S mutation count and mutated bases
My first general thought (and please disagree if I am way off base) is to find designs that fix global misfolds while honoring IUPAC constraints. This could mean 4-8 mutations to the 16S that address one or two global misfolds, or it could be 10-20 mutations needed to address multiple misfolds. Actually, from what I experienced, it might take 15 mutations to correct one global misfold.
My second general thought is to explore phylogenetic covariation, looking to see if certain phylogenetic mutations occur in tandem with each other frequently across rRNA variants in E. coli.
More specific approaches: the 16S central spine, redesign pseudoknots
Omei, any clue as to when we get 23S results and any estimation when Round 2 puzzles will be up?
Recently jandersonlee took the sequences from a couple of the POTW puzzles and played around with them. He created a Google Sheet that calculates the number of mutations against the puzzle starter sequence, and shows whether the mutations are in stems or in single bases.
I think we can use things like this as a tool for future analysis in lab, as it can easily be fitted to new datasets, with more functions added on as we figure out we need them. So hereby I present the 5S and 16S lab data, jandersonlee style. He made all the formulas, I just ripped them off. :) He even helped me with creating the last big chart for the 16S data and fixing a few errors. I hereby pass it on. You can make a copy of the sheet and play around with it as you like.
5s and 16S rRNA from Jeff's RibosomeChallenge1.2
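For anyone who prefers a script to a spreadsheet, the same kind of count can be sketched in a few lines of Python. This is not jandersonlee's formula set, just an illustration of the idea. The sequences and structure below are toy examples, and I am assuming "stem" means a base that is paired in the dot-bracket structure.

```python
from collections import Counter

def list_mutations(starter, design, structure):
    """Compare a design to the starter sequence, position by position.

    Returns a list of (base_number, old_base, new_base, region), where region
    is "stem" if the base is paired in the dot-bracket structure and
    "unpaired" otherwise. Base numbers are 1-based.
    """
    if not (len(starter) == len(design) == len(structure)):
        raise ValueError("sequences and structure must have the same length")
    mutations = []
    for number, (old, new, symbol) in enumerate(zip(starter, design, structure), start=1):
        if old != new:
            region = "stem" if symbol in "()" else "unpaired"
            mutations.append((number, old, new, region))
    return mutations

# Toy example, not real lab data.
starter   = "GGGAAACCC"
structure = "(((...)))"
design    = "GGCAUAGCC"

mutations = list_mutations(starter, design, structure)
print(len(mutations), "mutations")
print(Counter(region for *_, region in mutations))            # stem vs. unpaired
print(Counter(f"{old}>{new}" for _, old, new, _ in mutations))  # which base became which
```

The last line gives a tally of substitution types, which is a quick way to check patterns such as "many bases mutated to A".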
Here are some images from the sheet:
As you can see for the 5S rRNA design we mutated mostly in the loops (red):
Another mutation pattern turned up for the 5S design. Generally we have a lot of bases mutated to A. Also, stronger bases in the original 5S rRNA sequence were in general mutated to weaker bases.
For 16S we are more all over the place in relation to mutating stems and single bases.
Jandersonlee took the 16S (Noller) and 5S (Gutell) Helix sheet and broke the data up in a new way for the 5S lab, plus added a lot of new functions. I played the copycat and continued for the 16S lab.
He said that I should let you know that: "...that some data is manually entered and should be checked. For example the helix numbers and mismatched pairs don't always exactly match with the structure. I've highlighted a few questionable cases in red."
Also be aware that the structure in Noller's ribosome image doesn't always match up with the dot-bracket structure from the game.
5S and 16S Helix Map
New spreadsheet functions:
IUPAC - The entire sequence in IUPAC, with information on which bases can be changed.
Pairs with - you can now see which base a base is paired with. (As per @DigitalEmbrace's wish - while not in the game yet, this is a start.) jandersonlee has added this manually for both sheets. (manually entered)
Base Cons - conserved bases. Based on the IUPAC sequence, which holds information on whether other close relatives of our E. coli ribosome (gammaproteobacteria) allow mutations at a specific base. The IUPAC sequence is what is used to give us the colored rings in the ribosome lab and puzzles, and it shows what changes nature has already approved of in E. coli relatives.
Fixed - Fixed bases that don't have other base placing options.
Pair Base - Show the partner base to the sequence base
Pair Cons - Show if the partner base is conserved.
Better Mods - Holds information on which bases were changed, and how many, in the designs that did better. (manually entered)
Worse Mods - Holds information on which bases were changed, and how many, in the designs that did worse. (manually entered)
Player designs + their sequence - Sorted in the order Better (left) and worse (right).
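Several of these columns can be derived automatically from the sequence, the IUPAC string and the dot-bracket structure. Below is a rough Python sketch of that derivation. It is not how the sheet itself works (parts of the sheet are manually entered), the example strings are toy data, and I am assuming "fixed" means the IUPAC code allows only one base. Only plain ( ) brackets are handled, so pseudoknot notation is ignored.

```python
# Standard IUPAC nucleotide codes, written with U for RNA.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}

def pair_partners(structure):
    """Return a list giving the 1-based partner of each base, or None if unpaired."""
    partners = [None] * len(structure)
    stack = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            j = stack.pop()
            partners[i] = j + 1
            partners[j] = i + 1
    return partners

def helix_rows(sequence, iupac, structure):
    """Build one row per base with pairing and conservation information."""
    partners = pair_partners(structure)
    rows = []
    for i, base in enumerate(sequence):
        partner = partners[i]
        rows.append({
            "base_number": i + 1,
            "base": base,
            "allowed": IUPAC[iupac[i]],
            "fixed": len(IUPAC[iupac[i]]) == 1,
            "pairs_with": partner,
            "pair_base": sequence[partner - 1] if partner else None,
            "pair_fixed": len(IUPAC[iupac[partner - 1]]) == 1 if partner else None,
        })
    return rows

# Toy example, not the real rRNA.
for row in helix_rows("GGGAAACCC", "GGRANACYC", "(((...)))"):
    print(row)
```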
Astromon asked what these t's mean.
Joke aside. If you wish to learn more on error bars, here is what I have found.
Humorous walkthrough of the dangers of ignoring error bars.
THE IMPORTANCE OF UNCERTAINTY by Chris Holdgraf
Standard Error by Bozemanscience
jandersonlee has updated his previous script. Now it can highlight a range of bases.
Let's say you want to highlight the bases in helix 10 in the 16S rRNA puzzle.
You can find the base numbers related to helix 10 in this spreadsheet:
5S and 16S Helix Map
The bases in helix 10 are 198:207,212:219.
The script is called: Report/Mutate/Mark/Unmark Bases (v1.1)
To run the script, make your own copy and save it as a booster. Then pull the script from your booster list and enter the bases you wish to highlight.
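If you want to work with the same range notation outside the game, for example to check which base numbers a string like 198:207,212:219 covers, a small Python helper could look like this. It is only an illustration of the notation, not part of jandersonlee's booster script, and it assumes both ends of each range are included.

```python
def parse_base_ranges(text):
    """Expand a string such as "198:207,212:219" into a list of base numbers.

    Each comma-separated part is either a single base ("42") or an
    inclusive range written start:end.
    """
    bases = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            start, end = (int(value) for value in part.split(":"))
            if start > end:
                raise ValueError(f"range {part!r} runs backwards")
            bases.extend(range(start, end + 1))
        else:
            bases.append(int(part))
    return bases

helix_10 = parse_base_ranges("198:207,212:219")
print(len(helix_10), helix_10[0], helix_10[-1])  # 18 198 219
```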
I highlighted the areas where 2.21 did better at correcting misfolds than 2.18. Perhaps these areas are more critical?
Here I chose to compare the predicted structures of the designs, rather than their sequences. The same technique can also be used on sequences, which can give a different perspective.
For those who are not familiar with hierarchical clustering, it's a way of organizing the designs to give an overview of which designs are the most similar (in this case, in their predicted structure) and which are most dissimilar.
The analysis starts with all the designs grouped together and decides how to divide that group into two subgroups to best minimize the differences within each group while maximizing the difference between the two groups. In this case, it determined that 1.18 was more different from all the others than any of the others were with each other, and so it split out that design from all the others.
At the second step, it calculated that splitting out 1.20 and 1.24 as one group and leaving all the other designs in the other group was the best choice. Since there were only two designs in the first group, there was nothing more to split there, but there were six designs in the second group, so it next worked on that group. It continued in this manner until all the groups had either one or two members.
Looking at the final graph, we see that 1.22's predicted folding was the most similar to that of the WT, and that 1.17 was closest to those two. Designs 1.19 and 1.23 were closer to each other than to their three closest neighbors, and so on up the tree.
The cluster analysis, by itself, doesn't say anything about the experimental results; it is just a structure that can be useful for organizing the data in a way that prompts interesting questions. In this case, I manually annotated each design number with three numbers derived from the data, as described in the key.
For example, 1.20 and 1.24 are close in predicted secondary structure, but extremely different in results. Why so different? In this case, there is an obvious candidate -- 1.24 was the design that pretty much ignored the IUPAC constraints reflecting what nature has found works for E. coli's close relatives, the gammaproteobacteria. So this is not a surprise, but it is nice to see a vote of confidence for the relevance of the constraints.
Looking for another pair showing similarity in structure with differences in results, 1.22 and the WT stand out. Design 1.22 didn't violate any of the constraints. What aspect of the difference with the WT did cause the problem? That's an interesting question, and I don't have the answer. Any theories?
Anyway, this is meant as just an illustration of why I find hierarchical clustering useful. The two tools I used for this analysis are the RNAFold website for the (Vienna 2) structure prediction and R/RStudio for the cluster analysis. The RNAFold website interface is easy to learn, while R/RStudio is not quite so easy. (But if you're comfortable making CSV files, I have written notes for how to turn such a file into a cluster diagram like the one above.) If anyone would like help in learning to use these tools, this is as good a place as any to ask questions.
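For readers who would rather see the idea in code than in R, here is a small pure-Python sketch of clustering designs by predicted structure. Please read it as an illustration under stated assumptions, not as the analysis behind the diagram: it builds the tree bottom-up (agglomerative, average linkage) rather than by the top-down splitting described above, it measures distance as the number of base pairs present in one structure but not the other, and the four structures are toy examples with hypothetical names.

```python
from itertools import combinations

def base_pairs(structure):
    """Return the set of (i, j) base pairs in a dot-bracket string."""
    stack, pairs = [], set()
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.add((stack.pop(), i))
    return pairs

def bp_distance(s1, s2):
    """Count base pairs found in exactly one of the two structures."""
    return len(base_pairs(s1) ^ base_pairs(s2))

def cluster(structures):
    """Agglomerative average-linkage clustering.

    structures maps a design name to its dot-bracket string.
    Returns the merges in order as (members_a, members_b, average_distance).
    """
    def average(c1, c2):
        total = sum(bp_distance(structures[a], structures[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in structures]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((a, b, average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy structures with hypothetical names.
toy = {
    "WT": "((((....))))",
    "A":  "(((......)))",
    "B":  "..((....))..",
    "C":  "(())(())....",
}
for members_a, members_b, distance in cluster(toy):
    print(members_a, "+", members_b, "at distance", distance)
```

Reading the merges from last to first gives the same kind of picture as the diagram: the last merge separates the design that is most unlike all the others.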
Here is a helix map for 23S in jandersonlee style. (Third sheet)
5S, 16S and 23S Helix Map
Big thx to Gerry for shipping me the numbers for the base pairs, this saved a lot of work. Thanks also for catching a good bunch of errors.
NB. The areas with red are those where there were discrepancies between Noller's ribosome image (http://rna.ucsc.edu/rnacenter/images/figs/ecoli_23s.jpg) and eterna's dot bracket structure.
1. Given Astro's 2.21 high score and many mutations, there are positive and negative mutations in this design.
2. Given low scores by 2.18, 2.20 and 2.24, these had no positive mutation effects.
Delete from Astro's design (2.21) all the mutations it has in common with the three low-scoring designs, and see if that improves 2.21's score. Deletions (common mutations) are:
RNA motifs - cheat sheet for the ribosome
I think RNA motifs will be really helpful for further limiting the task of which bases we are most likely to get away with modifying in our ribosome labs. However they may not all be equal.
DigitalEmbrace: Many of our best designs have mutated bases in the motifs, so disrupting these motifs is not necessarily a bad thing. I wonder if perhaps altering certain motifs can sometimes be beneficial, or at least benign. Also, I'm still figuring out which NTs are flexible within each motif style. For example, it appears the Watson-Crick pair can be changed within the A Minor motif without disturbing the structure?
Andy Watkins: really interesting! yeah, there’s no guarantee that each motif is so critical that on its own mutations would kill ribosome function, or anything like that
Astromon: what is a motif? (A Minor motif) this sounds like a guitar chord to me :)
Andy Watkins: a motif is a recurrent, conserved feature in RNA structure. To be a little too cute about it, it’s anything other than a helix.
Andy Watkins: from the eterna-perspective — you know how one really great way to stabilize a 2/2 bulge is to make it GU/UG? that’s a sort of motif.
in particular the reason people are interested in motifs is because it seems that their properties — in particular their 3D structure — are relatively independent of their sequence context
Andy Watkins: not all motifs have this “modularity” property, but many do
Rhiju has some fine lists with the base positions of where the RNA motifs are in the ribosome.
Which motifs are safe to change?
DigitalEmbrace: “I wonder whether we need to conserve all these motifs in the ribosome. Perhaps a structure formed by a motif is hindering performance? Perhaps a certain motif in a certain location is the problem? The A Minor motif helps form tertiary connections so I think those are the most important to preserve.”
She volunteered to go through the RNA motif list against our 23S lab results, to see which designs mutated bases in the motifs and how well those designs did, and also to see whether violations of specific motifs were more or less involved in ribosome accidents than others.
Notes: Shaded Loop E motifs are covered by the corresponding Bulged-G motif. I only searched for the first base in the A Minor motifs, the “A” component. I only searched for the first component (2NTs) of the UA handle.
2.17 - 0 motif violations - does fair (1M)
2.18 - 0 motif violations - does bad (32M)
2.19 - 1 motif violations - does fair (15M) (Platform/Bulged G)
2.20 - 14 motif violations - does bad (72M) (Platform/GA minor, Loop E, GA minor, Platform, Bulged G, U-Turn, GA minor, U-Turn, Bulged-G/Platform, U-Turn, Platform) *Two bases were changed in U-Turn, GA Minor, U-Turn*
2.21 - 4 motif violations - does fair (43M) (UA handle, GA minor, Tandem GA)
2.22 - 0 motif violations - does OK (4M)
2.23 - 1 motif violations - does fair (7M) (A minor non-WC pair)
2.24 - 4 motif violations - does bad (35M) (A minor, Platform, U-Turn, Platform/Bulged G)
5 designs that do rather well have 6 motif violations in total - 1.2 violations per design
Together they have 70 mutations - 14 mutations per design
3 designs that do badly have 18 motif violations in total - 6 violations per design
Together they have 139 mutations - 46.33 mutations per design
2.09 - 0 motif violations - does fair (2M)
2.10 - 0 motif violations - does fair (13M)
2.11 - 1 motif violations - does fair (21M) (Z-turn)
2.12 - 2 motif violations - does bad (55M) (2 A-minor)
2.13 - 0 motif violations - does OK (17M)
2.14 - 2 motif violations - does bad (11M) (Platform, GA-minor)
2.15 - 2 motif violations - does fair (4M) (Same Z-turn motif)
2.16 - 4 motif violations - does bad (55M) (2 A-minor, Z-turn, Loop E submotif)
5 designs that do rather well have 3 motif violations in total - 0.6 per design
Together they have 57 mutations - 11.4 mutations per design
3 designs that do badly have 8 motif violations in total - 2.67 per design
Together they have 121 mutations - 40.33 mutations per design
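Since these totals are easy to get wrong by hand, here is a short Python sketch that recomputes them from the per-design lines above. The tuples are transcribed from those lines as (design, motif violations, mutations, outcome). The variable names are my own, and I am treating "fair" and "OK" together as the group that does rather well.

```python
# Transcribed from the per-design lines above.
DESIGNS_217_224 = [
    ("2.17", 0, 1, "fair"), ("2.18", 0, 32, "bad"), ("2.19", 1, 15, "fair"),
    ("2.20", 14, 72, "bad"), ("2.21", 4, 43, "fair"), ("2.22", 0, 4, "OK"),
    ("2.23", 1, 7, "fair"), ("2.24", 4, 35, "bad"),
]
DESIGNS_209_216 = [
    ("2.09", 0, 2, "fair"), ("2.10", 0, 13, "fair"), ("2.11", 1, 21, "fair"),
    ("2.12", 2, 55, "bad"), ("2.13", 0, 17, "OK"), ("2.14", 2, 11, "bad"),
    ("2.15", 2, 4, "fair"), ("2.16", 4, 55, "bad"),
]

def summarize(designs):
    """Total and per-design motif violations and mutations, split by outcome."""
    groups = {"fair or OK": [], "bad": []}
    for _name, violations, mutations, outcome in designs:
        key = "bad" if outcome == "bad" else "fair or OK"
        groups[key].append((violations, mutations))
    summary = {}
    for key, rows in groups.items():
        count = len(rows)
        total_violations = sum(v for v, _ in rows)
        total_mutations = sum(m for _, m in rows)
        summary[key] = {
            "designs": count,
            "violations": total_violations,
            "mutations": total_mutations,
            "violations_per_design": total_violations / count if count else 0.0,
            "mutations_per_design": total_mutations / count if count else 0.0,
        }
    return summary

for label, designs in (("2.17-2.24", DESIGNS_217_224), ("2.09-2.16", DESIGNS_209_216)):
    for group, stats in summarize(designs).items():
        print(label, group, stats)
```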
Sum up on motif violations
Four designs (2.11, 2.15, 2.19 and 2.21) performed well despite motif violations.
Three defective designs changed a base in an A Minor (2.12, 2.16, and 2.24), but those were the designs with the highest number of mutations, so we can’t necessarily blame the A Minor.
In 2 out of 3 cases (2.11 and 2.15), the Z-turn violation doesn't seem to harm the design.
A high number of motif violations does seem to turn up in designs that do less well. However, for both 23S and 16S, a high number of motif violations combined with a high number of mutations seems to be extra bad.
Perhaps by combining the conserved bases (IUPAC) with the motif positions, we can work out which motifs are most conserved and which motifs we are more likely to get away with modifying or breaking.
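One possible way to try this, sketched in Python: for each motif, count what fraction of its positions have an IUPAC code that allows only a single base, and rank the motifs by that fraction. The IUPAC string, motif names and positions below are invented toy data, and using "fraction of fixed positions" as the conservation measure is my own assumption.

```python
def motif_conservation(iupac, motifs):
    """Rank motifs by the fraction of their positions that are fixed.

    iupac is the IUPAC constraint string; motifs maps a motif label to a
    list of 1-based base numbers. A position counts as fixed when its
    IUPAC code is a plain A, C, G or U.
    """
    scores = {}
    for label, positions in motifs.items():
        fixed = sum(1 for p in positions if iupac[p - 1] in "ACGU")
        scores[label] = fixed / len(positions)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data, not the real rRNA or the real motif list.
toy_iupac = "GGRANACYCU"
toy_motifs = {
    "motif_1": [1, 2, 4],
    "motif_2": [3, 5, 8],
    "motif_3": [6, 7, 8, 9],
}
for label, fraction in motif_conservation(toy_iupac, toy_motifs):
    print(f"{label}: {fraction:.0%} of positions fixed")
```

Motifs near the bottom of such a ranking would be the first candidates for modification, though whether they are actually safe to change is something only the lab results can tell us.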
DigitalEmbrace asked me: Let me know if you find a clear definition of what each motif is.
The motifs in Rhiju's RNA motif lists are called things like A_Minor, GA minor, Bulged G, GNRA Tetraloop, Intercalated T-Loop, Loop-E submotif, Platform, P-Loop, Tandem GA Sheared, T-Loop, UA Handle and U-Turn.
RNA Motif Definitions
I have dug up a bunch of papers with explanations and started a list. I haven't gotten all the explanations in yet, and I will update the list as I come across better definitions.
One base moonlighting as two motifs?
DigitalEmbrace noticed that specific bases seemed to belong to more than one motif.
Switch motif? ;)
DigitalEmbrace: Just so you know, sometimes bases are listed in more than one motif.
DigitalEmbrace: Can a base be involved in two different motifs in the 3D structure?
Andy Watkins: yes, in a couple of different ways.
for example, consider:
UA_HANDLE C:QA:305-306 C:QA:310 C:QA:312
put these two together, and you get
T_LOOP C:QA:305-310 C:QA:312
there is a “compositional” element here. some motif definitions are more granular than others, and “submotifs” exist.
DigitalEmbrace: Example: A_MINOR C:QA:917 C:RA:79 C:RA:97 and LOOP_E_SUBMOTIF C:QA:915-917 C:QA:860-862 both involve 917. Three contain 1085: U_TURN C:QA:1083-1085 and A_MINOR C:QA:1085 C:QA:1055 C:QA:1104 and PLATFORM C:QA:1083 C:QA:1085-1086 C:QA:1082. Two involve 1393: U_TURN C:QA:1391-1393 and A_MINOR C:QA:1393 C:QA:1338 C:QA:1314.
Andy Watkins: very easy for A_MINOR to involve bases something else involves. Two of those bases are BPed to each other and the third is an adenosine making some interactions with the BP (edited)
DigitalEmbrace: Great, thank you!
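Finding these overlaps by eye is tedious, so here is a Python sketch that does it for the motif lines quoted in the example above. I am assuming the format is a motif name followed by residue entries, where the part after the last colon is a base number or a start-end range and the part before it identifies the chain. That reading of the format is my guess and not a documented specification.

```python
from collections import defaultdict

MOTIF_LINES = """\
A_MINOR C:QA:917 C:RA:79 C:RA:97
LOOP_E_SUBMOTIF C:QA:915-917 C:QA:860-862
U_TURN C:QA:1083-1085
A_MINOR C:QA:1085 C:QA:1055 C:QA:1104
PLATFORM C:QA:1083 C:QA:1085-1086 C:QA:1082
U_TURN C:QA:1391-1393
A_MINOR C:QA:1393 C:QA:1338 C:QA:1314
"""

def expand(entry):
    """Turn "C:QA:915-917" into [("C:QA", 915), ("C:QA", 916), ("C:QA", 917)]."""
    chain, _, numbers = entry.rpartition(":")
    if "-" in numbers:
        start, end = (int(value) for value in numbers.split("-"))
    else:
        start = end = int(numbers)
    return [(chain, position) for position in range(start, end + 1)]

def shared_bases(text):
    """Map each (chain, base number) used by more than one motif line to those motifs."""
    usage = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *entries = line.split()
        for entry in entries:
            for base in expand(entry):
                usage[base].append(name)
    return {base: names for base, names in usage.items() if len(names) > 1}

for (chain, position), names in sorted(shared_bases(MOTIF_LINES).items()):
    print(chain, position, names)
```

Run on these lines, it reports 917, 1085 and 1393 as in the example, and also 1083, which is shared by the U_TURN and PLATFORM entries.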
I have found a paper with an image that shows the additive nature of some motifs.
It seems that the UA handle in this case is a submotif of a bigger motif. Read the text under the image in the paper.
In another paper, I read that submotifs can be combined into bigger motifs. So motifs are sort of like Legos: building blocks.
I've been thinking about the significant global misfolds LinearFold (and Vienna 2 to an even greater extent) are modeling in the WT ribosome. If such large misfolds were actually happening, the ribosome would not be able to function. This makes me wonder, how close are these helices to each other in the folded ribosome? Can I view them in a 3D representation?
The helices involved are:
50-65/1930-1945 H5, H6/H71
175-180/1830-1835 Loop between H10, H11/H67
(Bases are approximate)
It’s pretty much domain I interacting with domain IV. How close is domain IV to domain I? Are they interacting with each other in a way that is impeding ribosome efficiency? Or is the issue in our modeling?
I lean towards thinking the modeling is incomplete and am not nearly as focused on the global misfolds as I was initially. Instead I now wonder if we may find more benefit from addressing local misfolds.
Diagram indicating no base pairing between domain I and domain IV.
If Eli or anyone else who has learned the 3D molecular modeller wants to pull up these helices, I’d be curious to learn the position of these helices in relation to their possible mispair.