Making strategies for good automated barcodes

Rhiju suggested we start a new barcode post to continue the discussion from here:

https://getsatisfaction.com/eternagam...

“@Eli: Regarding barcodes, now that we have very high quality data from R88 on the same design with different barcodes, would you be able to work with other players to draw conclusions on good barcode strategies? Maybe post a link in your project description?

I have now run into other applications at Stanford requiring hairpin barcodes, including new efforts to understand how cells send packets of RNA to each other via 'exosomes'. It would be wonderful to formalize what we are learning in EteRNA so that other scientists can make use of it, and we need your help!

Perhaps it makes sense to open (or revive) a separate thread on barcodes so as to not mix the discussion with error rates.”

Hi Rhiju!

Here is the link to the background for my History Tour.

Background for the barcode experiment

Meechl and Omei, I could sure use your help if you are interested.
Eli Fisker
Posted 5 years ago

rhiju, Researcher

Thanks for putting this together!

I reviewed the prior barcode discussions -- I actually did not know about further progress on the G*U idea from several of the players, and would be very curious to see if the recent barcode labs bear out the picture that having one G*U is 'optimal' in the context of designs with long stems. Can you give us an update on that?

I also am intrigued by your recent results (in the google docs) suggesting G-C to be the best closing base pair for RNAs with zero gap between design and barcode, but less of a preference for longer gaps. Are these conclusions based on R88 results only? Do they hold up with the most recent reacquisition of those data? Do we have other data that we can bring to bear?

As a specific goal, I want to know if we can explain these observations/rules in terms of computational models. That would be the critical point at which I'd feel comfortable using these rules for other applications.

For example, if we get 'dot plots' or NUPACK 'ensemble defects' on designs with the G*U-containing barcodes vs. mutants with G*U, do we see cleaner predicted dot plots for the former? If so, then our existing nearest-neighbor rules provide a basis for understanding a strategy. If not, then we'll perhaps have to think beyond nearest neighbors, e.g., with the 'backbone strain' models.
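A minimal sketch of one way to run that comparison, assuming the ViennaRNA Python bindings (module `RNA`) rather than NUPACK, with hypothetical hairpin sequences; the idea is to ask how cleanly the intended barcode stem shows up in the predicted base-pair probabilities (the data behind a dot plot) for the G*U variant versus the mutant without it:

```python
import RNA

def stem_pair_probability(seq, stem_pairs):
    """Average predicted probability of the intended stem pairs (1-indexed positions)."""
    fc = RNA.fold_compound(seq)
    _, mfe = fc.mfe()               # minimum free energy structure
    fc.exp_params_rescale(mfe)      # rescale Boltzmann factors for numerical stability
    fc.pf()                         # fill the partition-function (probability) matrices
    bpp = fc.bpp()                  # (n+1) x (n+1) matrix of base-pair probabilities
    return sum(bpp[i][j] for i, j in stem_pairs) / len(stem_pairs)

# Hypothetical barcode hairpins: one with a G*U wobble, one with A-U at the same spot.
with_gu    = "GCAGUGAUUCGUCGCUGC"
without_gu = "GCAGAGAUUCGUCUCUGC"
stem = [(1, 18), (2, 17), (3, 16), (4, 15), (5, 14), (6, 13), (7, 12)]  # intended stem pairs
print(stem_pair_probability(with_gu, stem), stem_pair_probability(without_gu, stem))
```

The same comparison on full designs (main design plus barcode) would show whether the nearest-neighbor model already predicts the cleaner folding, or whether something beyond it, such as backbone strain, is needed.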

Eli Fisker

Hi Rhiju!

Np.

I haven't looked at the R88 data in game, so it has nothing to do with my analysis. Could it be loaded into the in-game interface? I know we have the RDAT data, but I can answer your questions better if I can see it in colors.

You mention a google doc. I'm unsure which google doc you are referring to. Please point me in the right direction. I would love to help.

Eli Fisker

Hi Rhiju!

You asked: “I reviewed the prior barcode discussions -- I actually did not know about further progress on the G*U idea from several of the players, and would be very curious to see if the recent barcode labs bear out the picture that having one G*U is 'optimal' in the context of designs with long stems. Can you give us an update on that?”

I still think my GU strategy of adding a GU to barcodes in long-stemmed designs has merit, when looking at R88, which has good data.

To get full confirmation of the usefulness of GU in barcodes, you could run a mutant experiment like both you and Brourd suggested. The same goes for our long-stemmed designs: since these have been haunted by high error rates, an extra check would be useful to fully confirm that really long-stemmed designs do crave GU's (Lab designs that crave GU). (I believe that many non-complementary base pairs can work similarly to GU's, but that not all combos would work equally well.)

So you could run a long-stemmed lab which could now get good S/N data, if run without alternative barcodes. That way you can rule out that it was just an accident that got enhanced and looked good due to the error rates. However, I suspect the result will actually hold, judging from the weaker tendencies I see in the GU distribution for the much shorter-stemmed Cross lab designs in the good R88 data.

I have done a small summary in this document:

GU count in Cross lab - Round 88

Afternote: One thing I really do wonder about: of the two frozen labs based on a winning design, differing only in the gap between barcode and main design (0 or 4 bases), the one with the 0 gap didn't have as strong a need for GU's in the barcode. That fits with my view that having a GU in an adjacent barcode is riskier than in one spaced away from the main design. I wonder if this is generally the case? I also saw that while the SHAPE data was most stable for the lab with the adjacent barcode and 0 gap, this lab had fewer winners than the labs with bigger gaps. Being one of the shorter Cross labs, it ought not to be hurt by error rate due to length, so I think something else is going on. Adjacent stems mean calm SHAPE data, but they seem to make it harder to get a massive number of winners, compared to similar designs with gaps between their stems. I simply wonder why.

Eli Fisker

More comments on your question:

"For example, if we get 'dot plots' or NUPACK 'ensemble defects' on designs with the G*U-containing barcodes vs. mutants with G*U, do we see cleaner predicted dot plots for the former? If so, then our existing nearest-neighbor rules provide a basis for understanding a strategy. If not, then we'll perhaps have to think beyond nearest neighbors, e.g., with the 'backbone strain' models."

I have located 10 designs in the R88 data that were identical except for the absence or presence of a GU in the barcode. Earlier I had 2 identical designs with mutant barcodes, and several that were close to similar. There, the tendency was that the design with the GU scored just a bit higher than the design with the barcode without it. The tendency I saw earlier continues.

1. Overall, the design with the extra GU in the barcode scores higher than the one without (9 of 12), though still within error. Only two score even, and only 1 design with a GU in the barcode scores lower.

2. Melt and dot plots for the design with a GU in the barcode and the mutant without are remarkably similar. There are only very minute differences; most of the time they look virtually identical, especially the melt plots.

Only one design with a GU scores lower than its twin without. This is in the frozen Cross lab with 0 gap, the one I noticed has less of a need for GU's in the barcode compared to the identical lab with a 4 base gap.

Note that both of these designs already had 3 GU's in the main design.

You can see the new collection of GU mutants here:

https://docs.google.com/document/d/1a...

Eli Fisker

My best bid for a good barcode strategy, one that could even be automated, is datamined barcodes. Combine the following:

1. Jandersonlee's LabDataMiner for searching out great barcodes
2. Mat747's cut-and-paste method for “splicing” RNA elements together.

The LabDataMiner can pick up the best barcodes from past winning designs, and Mat's cut-and-paste method helps ensure a good match between the barcode and the main design it gets attached to.

Basically, Mat cuts elements 2 base pairs into the element. This is what I did in my lab design for the current round:

http://eterna.cmu.edu/game/browse/499...

I simply cut here in the original design and took along the two bases U and G from the main design.

Original design without barcode:


Then I added the U and G when I searched for a barcode with the LabDataMiner. Notice that I took extra bases from the neck along, to make sure I didn't get a barcode from a design with a really short neck; short necks solve differently than longer ones. This ensures I get barcodes which have been attached to a similar design. Also, I don't make the neck exactly the full length of the original neck, so as not to rule out too many potential barcodes. There are far fewer designs with long necks, and I would be cutting myself off from a good pool of potentially great barcodes.

How I searched for the barcode with the LabDataMiner


Also notice that I took the yellow gap bases along in my search. A design will need a slightly different barcode depending on how close the barcode is to the main design.

New design with the datamined barcode. Notice that the U and G from above are included.


Cut and pasted barcode with explanation


What Mat and I usually do is put two GC pairs at both ends of the barcode. Mat sometimes even puts 3 GC pairs at the first part of the barcode (furthest away from the UUCG loop). Usually stems don't need two GC pairs at both ends; in fact, designs that do this all over regularly do less well. However, since the new synthesis method, the neck and barcode have often shown a need for it, something Mat noticed earlier than I did. (Earlier there was a preference for low-GC necks, like the old design above, which is part of why I recycle it, to see how it will fare with a regular good barcode.) The neck and the barcode have a higher need for GC pairs than regular average stems, something that Mat in particular has noticed.

However, how many GC pairs the neck needs is also hugely related to its length. Like short stems, short necks have a high need for GC, and longer necks can live with less, like longer stems. Still, the neck and barcode in general can take quite a lot of GC pairs.

Designs and barcodes can sometimes be solved with 1 or 0 GC base pairs for the closing and next pair, especially at the non-loop end of the stem, but GC is generally the most reliable.

Other good advice:

1. Orientation of GC pairs in the barcode matters too, for the following reason: how close the barcode is to the main design often gives a bias to which way the GC is oriented.



NB: If there are many double same-turning GC pairs in the main design, having double same-turning GC pairs in the barcode can be bad, because it means more stretches of the RNA are complementary, and especially with such strong bases, that can create misfolds. Otherwise, double same-turning GC base pairs often do quite well at both ends of the hairpin barcode stem.

Additional NB from Jeff on the LabDataMiner on min score: However as I've come to realize, looking at the shape data in just that one region does not tell how well the whole shape functioned and whether or not those bases were bonding to the right partners. That's why it's good to set a fairly high score threshold as well. (e.g. minscore=94).

Loose thought: I think there can even be a difference in the GC ratio a barcode needs, depending on what type of design it is attached to, on things like gap distance or the length of stems in the design itself.

Learn more here

Mat's lab design strategy:
https://getsatisfaction.com/eternagam...

Here are two demos of jandersonlee using Mat’s lab analysis method:

https://docs.google.com/document/d/1S...

https://docs.google.com/document/d/15...

Intro to LabDataMiner
http://eternawiki.org/wiki/index.php5...

rhiju, Researcher

OK – additional strategies based on stem lengths in the main design? Also what if there are no barcodes that match the desired sequence or secondary structure in the LabDataMiner? 

Eli Fisker

I forgot to mention that in the LabDataMiner, I can preset which bases I want where. So I set, e.g., the closing and next base pair of the barcode hairpin to be Strong, so either GC or CG. I can simply ask smartly for what I wish, not by calling for a specific base, but by calling for, e.g., which two or three bases I would prefer to see at a certain spot.

According to the IUPAC notation: http://eternawiki.org/wiki/index.php5...

Also, I can set the ranking of importance of the bases. As you might have noticed, I set closing bases to 9 in the ranking settings and middle bases to 6. This means I will mainly get barcodes which have strong, good SHAPE data for the closing bases.

This part of the tool is something Mat has wished for and something he has been thinking about designing the whole time. See his Computationally Selected Elements:

https://getsatisfaction.com/eternagam...

Closing base pairs, and after those the next pairs, are far more important to overall stability than the base pairs in the middle of a longer stem (which is why the latter can just be set as N = any base pair). The middle base pairs can be almost anything: GU, non-canonical. As long as the closing base pairs and next base pairs are good and well picked, they will hold the stem together. The longer the stem, the truer this becomes. Like I love to say, the power of hydrogen bonding forgives a wealth of sins.
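To make this concrete, here is a minimal sketch (my own formalization for illustration, not the actual LabDataMiner code; the records, pattern and weights are hypothetical) of an IUPAC-constrained barcode search that ranks hits by a weighted SHAPE-based score, weighting the closing pairs most heavily:

```python
# Hypothetical sketch of a constrained barcode search: filter past barcodes by an
# IUPAC position pattern and rank them with position weights (closing pairs weighted 9,
# next pairs 6, middle bases 1). Lower SHAPE reactivity at heavily weighted positions
# ranks higher.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "GC", "W": "AU",
    "K": "GU", "M": "AC", "N": "ACGU",
}

def matches(seq, pattern):
    return len(seq) == len(pattern) and all(b in IUPAC[p] for b, p in zip(seq, pattern))

def weighted_rank(shape_per_base, weights):
    return sum(w * (1.0 - r) for r, w in zip(shape_per_base, weights))

# Hypothetical records: (barcode hairpin sequence, per-base SHAPE reactivity, design score)
records = [
    ("GCAAUGCUUCGGCAUUGC", [0.05] * 18, 96.0),
    ("GCUAAGAUUCGUCUUAGC", [0.10] * 18, 95.0),
]
pattern = "SSNNNNNUUCGNNNNNSS"        # strong (G/C) closing pairs, UUCG loop fixed
weights = [9, 6] + [1] * 14 + [6, 9]  # closing pair weighted 9, next pair 6
hits = sorted(((weighted_rank(shape, weights), barcode)
               for barcode, shape, score in records
               if score >= 94 and matches(barcode, pattern)),
              reverse=True)
```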

"Also what if there are no barcodes that match the desired sequence or secondary structure in the LabDataMiner?"

Lower requirements... With time there will be an extensive database of good barcodes, from a ton of different designs.

So what if I can't get the two bases from the main design to match exactly? Then I will ignore the base from the next pair (the base pair just behind the closing base pair) and try to get the closing base; otherwise I drop that requirement. Similar logic applies if I can't get a good barcode with two strong base pairs at each end that ranks over a certain threshold (which you can set as a minimum when automating): I first lower the requirement to one strong pair for the bottom of the hairpin, and later lower the requirement of two GC closing base pairs for the loop down to one. One can also shorten the length of the barcode one searches for. The more specific the search, the more barcodes get ruled out.

"OK – additional strategies based on stem lengths in the main design?"

Yes, I think I have or can come up with a few. But I will call it a night now; it's late at my place. :)

Eli Fisker

Rhiju, you mentioned that you were working on "new efforts to understand how cells send packets of RNA to each other via 'exosomes'."

Cells sending RNA packages to each other reminded me of a cool article I recently read, that Nascarnut shared with me. It was about parasitic plants using RNA signals from their cells to control cells in their host plant.

Plants communicating via mRNA

I wanted to understand more about exosomes and why they are important to humans, though I already think it is fascinating that plant cells can communicate with RNA, even between cells of different species.

I learned that exosomes are involved in cancer and other diseases. Here is the best material I have found so far.

This is a miniature video series which is really well made and gives a great intro to what exosomes are and why they are so important.


Here is a more visual introduction:


Fine small background video.


Also the WIKI article on exosomes has a great introduction to the subject.

Omei Turnbull, Player Developer

I think this is a very interesting topic, with lots of subtleties. I regret that I haven't had more time to actively participate. :-(

Perhaps I just haven't looked at Eli's background material carefully enough, but it seems that the single biggest issue surrounding automated barcode generation hasn't been mentioned yet. That issue is misfolding. I suspect most experienced Eterna players routinely check the dot plot to make sure the barcode they have chosen doesn't have a high probability of causing misfolds. But apparently the algorithm used for automatically generating barcodes does not. This is most easily seen in "expert" labs because none of the expert labs get expert barcode assignments. Here's an example:


The dot plot clearly suggests that some of the barcode sequence will pair up with the main design, and the SHAPE results suggest that's what really happened.

rhiju, Researcher

There is a check that the barcode should at least not disrupt the main design in the lowest free energy structure, but even that check isn't satisfied for all designs -- it's a hard computational problem to optimize the barcodes for 1000s of sequences... that's partly why it would be useful to have heuristics, and one motivation for me asking Eli to start this thread...

Brourd

Indeed. As far as I can tell Rhiju, based on what I saw of Eli's barcode results, there are a few important details necessary for a good barcode "hairpin".

1. It cannot be pure G-C. The data indicated, even when there was a high signal to noise and readable data, that pure G-C barcode hairpins had an adverse effect on the chemical reactivity signal globally.

2. Unpaired barcode sequences should probably be avoided.

Brourd

Other than that, it mostly comes down to having 3 or so G-C base pairs, spread out, per barcode.

Eli has mentioned the use of G-U wobble and noncanonical base pairs. It's difficult to ascertain the usefulness of adding these to the helix.

There has not been a significant number of actual sequences that test this hypothesis in Eli's project. For example, randomly picking 20 or so sequences, adding a G-U to the barcode helix and submitting the mutants, taking those same 20 and creating a noncanonical mismatch in the helix and submitting those mutants, etc.

Then, determining how much the SHAPE reactivity signals deviate for each submitted sequence.
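A minimal sketch of the comparison proposed here, under the assumption that we have per-base SHAPE reactivity profiles for the original and mutant sequences (the values below are made up): quantify the deviation between the two profiles, for example as a root-mean-square deviation.

```python
import math

def shape_rmsd(profile_a, profile_b):
    """Root-mean-square deviation between two per-base SHAPE reactivity profiles."""
    diffs = [(a - b) ** 2 for a, b in zip(profile_a, profile_b)]
    return math.sqrt(sum(diffs) / len(diffs))

original  = [0.05, 0.10, 0.80, 0.75, 0.07]  # hypothetical per-base reactivities
gu_mutant = [0.06, 0.12, 0.70, 0.80, 0.09]  # same design with a G-U added to the barcode helix
print(shape_rmsd(original, gu_mutant))
```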

Omei Turnbull, Player Developer

Rhiju, I'm not clear on what "There is a check that the barcode should at least not disrupt the main design in the lowest free energy structure" means. Can you elaborate? Here's the puzzle maker's estimate for the lowest free energy without the barcode.


It is different from the estimate of the lowest free energy with the barcode.

rhiju, Researcher

Just sent a note to Jee, as he coded up the algorithm...

Eli Fisker

Rhiju, you asked: "For example, if we get 'dot plots' or NUPACK 'ensemble defects' on designs with the G*U-containing barcodes vs. mutants with G*U, do we see cleaner predicted dot plots for the former?"

I assume you mean mutants without GU in their barcodes.

I can't tell you much specific to barcodes, as I have found only two cases of designs that are identical, with almost identical barcodes except for a GU. Does anyone know of near-identical designs where the only difference in the barcode is an AU versus a GU at the same spot? Feel free to dig them out; I think there are more cases.

Here is what I have collected on identical designs with close to identical barcodes.

Identical designs with similar barcodes

However, here is my experience from designs with long stems, with and without GU:

I would expect the melt plot to generally grow worse for the designs with GU versus those without. When the melt plot gets worse, the dot plot often does too; they are interconnected. I have written about melt plots and GU's here:

https://docs.google.com/document/d/1D...

Also, I ran the LabDataMiner on barcodes in general.

I set the min score at 94 to only get folding designs, so I'm likely to get working barcodes. I put in a GU base pair, set the rest of the barcode to be any base except for that one base pair, and also fixed the hairpin loop, to make sure I only pull in barcodes. Then I submitted and watched the average ranking, which shows how high the average SHAPE score was for all the barcodes that fit my search. Then I replaced the GU with an AU at the exact same spot and ran the DataMiner again.



I then repeated this for all spots and orientations in the barcode. Except for a very few places, typically close to the ends of the stem, all cases had the AU scoring a higher ranking than the case with the GU. Yet in most cases the average design score was slightly higher for the cases with GU over AU. (OK, the hairpin loop doesn't like being closed up with much other than GC, and preferably two.)

This would suggest that the GU does not do much in itself to add stability to a design, but rather that its positive effect might be due to something else.

So I think I'm in favor of the backbone strain hypothesis when it comes to GU, as a well-placed GU can have a positive effect, or, if placed unluckily, a dramatically worsening effect far, far away. Still, it seems to depend a lot on the surrounding sequence.

You can read more here on how a single GU can have a dramatic effect on otherwise identical designs.
https://docs.google.com/document/d/1D...

rhiju, Researcher

@Eli, can you explain in more detail what rank means?

Eli Fisker

Hi Rhiju!

Rank is based on SHAPE data. So it basically means that the barcodes are ranked by their overall SHAPE score.

Here I just did a search where I say I want strong (GC) closing base pairs, and the LabDataMiner spits out barcodes ranked by their overall SHAPE score.


The reason I have set 94 as min score is to make sure I pick only barcodes from overall stable designs with both barcode and main design forming. Like Brourd mentioned: "Unpaired barcode sequences should probably be avoided."

However, what I did in the search I showed further up in this post, with the numbers added, was to make an additional check to ensure that the barcode had actually formed, by saying that I wanted barcodes which had both their closing base pair and the next pair forming, setting the number to 9 for RankBy.

Translated to SHAPE data, this would mean deep dark SHAPE data for the closing bases.

Image Mat got on request from the devs:

(Source: Computationally Selected Elements)

This number scale follows the same score scale used for SHAPE data. So what I basically told the tool is to find barcodes that have formed; I care less about the SHAPE values of the middle bases in the stem. If the end bases hold, the barcode usually holds and does its job.

Additionally, I made sure that the main design was stable too, by asking for the closing and next base pair from the main design to be extremely stable as well (9). It doesn't matter that the barcode is stable if its attachment is not. I wanted to ensure there was a match, so I didn't take a barcode that looked good but might not fit well and make part of the neck unstable when connected.

The numbers added to RankBy are simply an option to tell the tool what I value most to have ranked highest. This is Mat's idea and something we have been playing with.

Eli Fisker

Barcode strategies

GC priority positioning

On the importance of GC pairs: I have here ranked how important I think it is that specific spots are GC pairs.



(If the basepair at spot 4 is an AU, the one to the left should be a GC)

Avoiding mispairing between barcode and design

Omei mentioned mispairing as an important factor in picking barcodes. This is really where players have a huge advantage over an automated algorithm, because we see the RNA as a whole and often spot potentially wrecking situations.

1) If a design already has many double G's and C's (or even longer stretches of these bases in a row) in the main design, then having double same-turning GC's in the barcode can pose a risk of mispairing.

This despite the fact that, within the barcode itself, double same-turning GC pairs have proved particularly useful for stabilization at the ends of the barcode hairpin.

2) I also dislike having a double same-turning AU base pair right after the closing base pair at either end of the barcode stem; I have regularly seen barcodes with these open up. Double same-turning AU pairs also pose an extra risk if there are many of them in the main design. I usually only want these in the middle of the barcode, where they can be useful.





Put max and min on GC, AU and GU for a barcode

1) Preferred range: 3-5 GC pairs, AU for the rest. Only use GU's if the stems in the main design are really long.

Personally, I prefer having 4 GC pairs in a barcode, two at each end of the barcode stem, but barcodes can often work with 3 or 5 too. Mat often prefers 5. Since both the neck and the barcode seem to like having a bit more GC than stems of a similar length, I would prefer 5 GC pairs over 3 for a barcode.

Like Brourd, who mentioned not liking 7 GC pairs for a barcode, we dislike that too. Such barcodes actually do fly sometimes, as does 6 GC's with a GU, but generally having this many GC's in a row means trouble.

2) Max 4 AU's. Similarly, a barcode with all or almost all AU's can also fly if attached to a design with an extremely long stem, also solved with all AU's. Again, this is extremely rare and not recommendable.

However, the LabDataMiner has no way to limit the max or min number of a base pair type, so barcodes with 7 GC can turn up as the top-ranked barcode. I think imposing GC max and min limits for barcodes will benefit designs in general, similar to how I think designs benefit from having an equal energy distribution all over the design, which is effectively the same as a max and min base pair limit per element.

3) Also impose a GU limit, and set it to 0 for designs with generally short stems, as already mentioned above in this post. While barcodes can fly with 2 GU's inside, generally it is best not to have more than 1.

It's very possible to find barcodes which will work but do not follow the above. I just count them as posing a bigger risk, as I think they will not work in as broad a range of situations as a regular good barcode. (A small sketch of these count checks is shown below.)
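Here is a minimal sketch of those count limits as an automatic check (my own formalization for illustration; the function names and the 7 bp example are hypothetical):

```python
# Check a 7 bp barcode stem against the suggested pair-count limits:
# 3-5 G-C pairs, at most 4 A-U pairs, at most 1 G-U (0 for designs with generally short stems).
def classify_pair(b5, b3):
    pair = frozenset((b5, b3))
    if pair == frozenset("GC"):
        return "GC"
    if pair == frozenset("AU"):
        return "AU"
    if pair == frozenset("GU"):
        return "GU"
    return "other"

def barcode_counts_ok(pairs, short_stems=False):
    """pairs: list of (5' base, 3' base) tuples for the barcode stem."""
    counts = {"GC": 0, "AU": 0, "GU": 0, "other": 0}
    for b5, b3 in pairs:
        counts[classify_pair(b5, b3)] += 1
    gu_max = 0 if short_stems else 1
    return (3 <= counts["GC"] <= 5
            and counts["AU"] <= 4
            and counts["GU"] <= gu_max
            and counts["other"] == 0)

# Example stem: G-C, C-G, A-U, U-A, G-U, G-C, C-G -> 4 GC, 2 AU, 1 GU -> passes
print(barcode_counts_ok([("G","C"), ("C","G"), ("A","U"), ("U","A"), ("G","U"), ("G","C"), ("C","G")]))
```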

Pick the best barcodes for the designs with the shortest necks

Designs with very short necks (2 or 3 bp) need a good barcode the most, in particular if there is no gap between neck and barcode.

For designs with long necks, or just long stems in general, if the main design itself is great, they will do mostly fine almost no matter what barcodes they get. The barcode may break, though. :) There were very few really low-scoring designs in the two Frozen Cross labs, which were based on a past winner.

Examples with lowest scoring designs from the Cross lab.


Frozen Cross labs:
http://eterna.cmu.edu/web/browse/4739...
http://eterna.cmu.edu/game/browse/473...

Eli Fisker

One more small part-strategy: in case the base pair in one of the orange-marked boxes is an AU and not a GC, the base pair next to it (the spot marked with red X's) should not be a GU.

This holds no matter how the base pairs are oriented. If the orange-marked spot is not solved with a strong base pair, introducing a GU next to it can make the barcode weaker and more likely to break open or mispair.

rhiju, Researcher

All: I'm discussing with John Nicol the possibility of formalizing these rules into an automated algorithm -- we may need some help after that in devising the incisive tests of the method within eterna. In addition to improving barcode generation in eterna, these rules -- and an automated algorithm to create libraries -- should be useful for other applications (incl. at stanford) -- so we might consider writing up into a paper, if you're up for it...

rhiju, Researcher

To folks following this thread, this barcode design paper/algorithm might be of interest:

http://elledgelab.med.harvard.edu/?pa...

It was recommended to me by a neighboring lab. I don't think they were working on hairpins like we are...

Eli Fisker

On the paper you shared: one thing I noted is that they sort their good and bad barcodes into separate batches. There is one option in the LabDataMiner I don't think I highlighted well enough: the Min count option.

This box can be used to say that I only want results for barcodes which have been used at least, e.g., 3 times. Pairing this option with a minimum score requirement of 94, I can make sure to get only barcodes which have proven successful in multiple cases.

Of course this will limit the output of barcodes a lot. However, the more data you accumulate over time on designs and barcodes, the better you will be able to pick out the overall most successful barcodes, regardless of their context.

Even so, I still believe that a barcode's context, such as distance to the design, stem length and design size, also affects how well the barcode will fold. Just as I showed that adding a different neck to identical designs could affect stability in the main design, I believe the same is happening for barcodes. (https://getsatisfaction.com/eternagam...)

Also, I think that where the barcode was the same but the design was not, the design can affect how well the barcode folds; a barcode won't necessarily work equally well, depending on which design it gets attached to.

It would be great if you could also get all the additional data from the alternative barcodes you autogenerated into this coming barcode algorithm.

Omei Turnbull, Player Developer

I'd like to reinforce the point that Eli makes. If we state the problem as "design an algorithm to assign good barcode sequences", we lose sight of the fact that the relationship between the barcode and the main design, e.g. the number and type of bases separating the barcode from the main design, is at least as important as the barcode sequence itself.

Eli Fisker

Strategy for distance between main design and barcode

For now I believe that a gap size of 2 bases between main design and barcode is one of the more optimal ones. I believe that having a barcode too close to the design - while it may give calmer SHAPE data overall - may actually lead to fewer winning designs.

Background on why I think this

Earlier I did a check of labs, listing winning percentages for labs that had a barcode hairpin and one stem attached in the hook area. As I wrote there: I believe there might be an optimal distance between the barcode and the nearest stem; for now a 2 base distance looks more optimal. (What makes an RNA design hard?)



Since then we have had a lot of labs, so this time I decided to instead ask the LabDataMiner. (Thx jandersonlee and Mat!) I set the minimum score to 94% for a design, to rule out misfolds and designs with high error rates. I also asked for the closing base pairs in the barcode to be GC pairs, and added the hairpin loop sequence to make sure I mostly get barcodes; far fewer stems (of which even fewer will be 7 bp) use that loop sequence. (For the scientists making the barcode-generating algorithm: you should preferably set exactly the neck length your design has, as this will raise your chances of a good barcode-versus-design fit.)

I chose a mid-length neck (5 base pairs). The LabDataMiner cannot distinguish stem length, and as such will take in both short necks and much longer ones which fit the bill too. So a short neck would give too many results, many of which might not work exactly the same when finding a barcode for a design with a longer neck, whereas setting a long neck would give far too few results.

Example search for 0 gap.


Then I simply noted the number of winning designs that came out per gap length.

I ran the same request to the LabDataMiner but without the minimum score, to get an idea of the distribution of designs, to rule out that I was just seeing the number of designs submitted, and to confirm that there really is a barcode gap tendency on its own.

Then I just divided the count of winners by the count of total barcodes for that gap size, to get the % of winners per slot in each gap size category. A barcode gap of 2 bases is the combo that gets the most winning slots per lab. (A sketch of this calculation follows.)
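As a minimal sketch of that arithmetic (the counts below are placeholders, not the real lab numbers):

```python
# Winners per gap size: fraction of barcodes at each gap length that belonged to winning designs.
winners = {0: 30, 1: 42, 2: 55, 3: 38, 4: 33, 5: 20}        # hypothetical winner counts per gap
totals  = {0: 400, 1: 450, 2: 430, 3: 410, 4: 420, 5: 300}  # hypothetical total barcodes per gap
win_rate = {gap: winners[gap] / totals[gap] for gap in winners}
best_gap = max(win_rate, key=win_rate.get)  # gap size with the highest fraction of winners
print(win_rate, best_gap)
```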

There are fine count numbers up to gap 6, so I can't say anything for sure about gaps above size 5. But it looks like a gap of 2 is better than 0 and 4, though 1 and 3 don't look too bad either. This is despite the actual SHAPE data being ranked lower for barcodes with a gap size of 2; however, this gap size gives the designs the highest average score. I think if the design generally does well, it might matter a little less whether the barcode itself is super stable. A barcode's main function is to be an identifier, and since it is there and affecting the design, I think it is our job to make sure that it affects the design as little as possible in a negative direction, and perhaps instead helps the design fold well. I have done a few comparisons based on the recent good lab data, of identical designs with and without barcodes. It actually seems that barcodes in some cases act to stabilize bases in the neck or in the 4-way junction that were unstable in the original design. (https://docs.google.com/document/d/1g...)



Now, this is still not connected to the stem length of the lab design itself. I think it could very well be that a design with a short neck and generally short stems might need a bigger gap distance than a design with a long neck and generally long stems.

So when we have a lot more data, I think it could be very interesting to have the designs grouped in such a way that one could draw barcodes from designs with a similar-sized neck, but also based on design size and general stem length. However, for now the gap size and the first bit of the neck sequence are the easiest way to dig up a barcode.

An additional note: usually when Mat cuts an element to pieces to join it with another element, he looks at the full stem of the element and does a blunt-end cut. Similarly, when I design, I watch the whole base-paired element and not just one strand of it. However, with the LabDataMiner we can't watch the beginning strand of the neck area, so we only use the last bases from the last strand in the barcode search. But when we design manually, we take the full neck into account, in particular the closing and next pair.

More background on barcode distance

On why I believe that having a barcode too close to the design - while it may give calmer SHAPE data overall - may actually lead to fewer winning designs.

I believe the closer the barcode is to the design, the more the sequence of the design that the barcode is attached to matters. I simply think that the sequences of barcode and design affect each other more when they are really close, and that this limits the possibilities of getting a good match.

However, when there is distance, I think it matters a little less how the barcode gets solved; there are simply more legal solves. I can't say this with 100% certainty yet, but these are the beginning tendencies I see based on my latest lab, Eterna History Tour - Introduction to Lab - The Cross lab, which had good signal-to-noise ratios and great data.

There I had two sublabs with identical designs based on a past winning design from the early Eterna days, where only the barcodes could be changed. The first lab had a 0 base gap between barcode and main design; the second had a 4 base gap. The lab with the 4 base gap had almost 3 times more winners than the lab with the 0 gap.

16 winners


http://eterna.cmu.edu/game/browse/473...

All designs on display are winners (45 winners all in all)

http://eterna.cmu.edu/game/browse/473...

Also, the lab with the 0 gap comes first in the lab project, and players usually solve in order, so they would likely have picked their best barcode guesses for this lab and not the 4 gap lab that ended up with the most winners.

Based on what I have seen in other labs, I think that a distance of 2 is better than 4. But when I designed the lab project, I chose the bigger distance between the labs to provoke a bigger contrast in the results, to have a better chance of seeing what was going on.

For the Bulge Cross lab, which had an identical shape except for a 1-1 loop, the pattern was reversed. However, this lab had far higher error rates, so I wouldn't trust those results too much.

Although the past labs have generally shown higher error rates for the designs that had bigger gaps, and as a consequence fewer winners, I think this might be an artifact of error rate growing with design length and our problems with bad lab data.

So at the moment I think that although barcodes adjacent to the design give the calmest SHAPE data in general, they might produce fewer winners compared to when the design and barcode are given some distance.

Eli Fisker

Strategy for gap bases between barcode and main design

Should a distance between barcode and main design other than 2 bases be chosen:

0 - 4 base gap

- Keep it all A’s

I believe there is no real benefit from using non-A bases in the gap when it is 3 bases or less. At gap size 4 and up, the benefit starts to rise. However, I would only start recommending adding non-A bases for gap sizes of 5 and up.

Though a boosting base sometimes does enhance the stability of the design, I have also often seen it help cause misfolds if places in the main design are weak and want to be elsewhere; then such an extra base could be the invitation.

5 base gap +

I will recommend a strategy similar to the one I recommend for 5’ tails for the Eterna History Tour project:

- Make around 1 in every 5 bases something other than A (20%)

For which color to use: U's generally have less risk of interfering with the rest of the design, compared to both C and G. Pick G's over C's if you don't pick a U, but both G's and C's often end up causing mispairing.

- Keep Non-A's away from boost spot.

Though there seems to be some benefit to boosting the 5' side of the gap bases, I still recommend keeping the non-A bases away from boost spots.



If the barcode and the stems in your design are stable, they don't need a boost, as the boost can easily become a spot for mispairing.

Should you still choose to place bases at boost spots, G's seem to be preferred at the 5' end of the gap string, then C and then U.

Better yet, you can make the LabDataMiner spit out the most successful cases of boosts, and similarly the best cases of where to place the non-A base to break a poly-A sequence. (A small sketch of generating such gap strings is shown below.)
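As a minimal sketch of the padding heuristic above (my own illustration; the function and its defaults are hypothetical):

```python
# Generate a poly-A gap/padding string with roughly 1 in 5 bases swapped to a non-A base
# (U preferred). Gaps of 4 or fewer bases are left as all A's, per the recommendation above.
import random

def gap_sequence(length, non_a_every=5, non_a_base="U", seed=None):
    if length <= 4:
        return "A" * length
    rng = random.Random(seed)
    seq = ["A"] * length
    for start in range(0, length, non_a_every):
        block = list(range(start, min(start + non_a_every, length)))
        seq[rng.choice(block)] = non_a_base   # one non-A base per block of ~5
    return "".join(seq)

print(gap_sequence(10, seed=1))   # e.g. a 10-base gap with ~2 non-A bases
```

In practice you would also want to check that the chosen non-A positions stay away from boost spots, which this sketch does not do.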

rhiju, Researcher

I realized that in addition to finding a barcoding scheme that may be general for other experiments, we're really doing the first very deep test of how to design one motif so that it is independent of its sequence neighbors. This could be very powerful for eterna as we move to much more complex switches and longer structures, especially if we repeat what we are doing here systematically for other common motifs.

OK, sounds like we have strategies to test based on the paper lab project so far.

Roughly:

A. Just a random barcode
B. Random barcode with some MFE optimization (current default for creating alternative barcodes)
C. Barcode optimized to have highest average base pair probability in hairpin -- something like a dot plot.*
D. Eli's heuristics.*
E. LabDataMiner.*

*Automated code not available yet, I think.

Now the question is what would be the most rigorous test. Any ideas?

Omei Turnbull, Player Developer

As a prelude, I'll (re)state my belief that a significant factor in how the barcode affects the SHAPE scores of the main design is how the barcode and neck stacks align. If they align parallel (or is it anti-parallel?) because of mutually reinforcing stacking energies, the two tails are directed toward the other stack, encouraging tertiary interactions and affecting the SHAPE scores in ways we don't fully understand. Eli often refers to this as "unsettling" of the SHAPE scores.

I doubt we're going to find a "best" strategy that works for all neck/barcode separations. So I suggest we start with the one that has the fewest uncertainties, i.e. zero separation. Here, I think, parallel alignment between the neck and barcode is basically impossible, due to the need of the tail sequences to get out of the gap.

Of course, there are undoubtedly other considerations that should be taken into account before making a decision; this is just one.

Eli Fisker

@Omei, thx for bringing this up. I very much agree with what you say here:

"I doubt we're going to find a "best" strategy that works for all neck/barcode separations."

The overall SHAPE data for the Cross lab with long-stemmed designs looked best for the 0 gap distance, though it gave fewer winners. I think designs with short necks are going to be hurt most by an adjacent barcode, and that this type of design will benefit more from distance between barcode and design.

I don't understand why it is so, but I find your theory interesting.

I picked the 2 base gap because I think this is the distance that would end up with the most winners in the test tube, no matter what kind of design, and it might also put fewer limits on which barcodes would be useful.

Which reminds me that I don't understand what is most important for RNA when it is in a cell, whether the most important thing is for it to have super-blue SHAPE data. Rhiju did mention wanting to use this barcode technique for exosomes and transport in and out of cells.

Actually I'm reminded about a conversation I had with Rhiju a while back. I asked him: Is natural RNA dark blue in SHAPE data? Here is the very interesting answer he gave me:

https://docs.google.com/document/d/18...

Meechl

I've fallen a bit behind in this discussion, but a while ago I thought it'd be interesting to make some graphs to help visualize some of the trends Eli mentioned with barcode success, particularly the gap size and the amount of GC, AU, and GU (because they sounded the easiest to do).

I only used the data from Eli's Level 0 puzzles in the History Lab Tour, because the only variable in those puzzles is the barcode, since the rest of the design is locked. However, I did not separate re-runs (like 87 and 87.1), nor did I separate the alternate barcodes.

I cut down the title of each lab to be just the name of the design and the size of the gap (also, Bulge Cross was shortened to Bulge Cros). I looked at both the score and the S/N. The S/N is more interesting, so I'll save that for last. :)







Eli Fisker

Hi Meechl!

Thx for the kind offer of more graphs. :) I'm still very much learning about graphs, but I will say so if I think a good idea pops up.

Really neat observation:
"Mostly what I noticed was that the gap size had a big impact on S/N."

When I look at the Cross lab with 0 and 4 size gaps, the 0 gap size has a far better S/N ratio.

I think what you did in taking in the alternative barcodes was good, so no need to do extra work on that. It's a smart way of doubling the data set; this is not what I call into question.

What I am concerned about is that the Finger lab and the Bulge Cross lab overall had a much worse signal-to-noise average than the Cross lab, especially the Finger lab. I simply don't think they represent the truth. I think that many more of the sequences we submitted for these 2 labs were supposed to be winners, but due to bad signal, many of them didn't make it through and got a lower score than they ought to. Therefore I believe we cannot draw too solid conclusions from the data for these labs, only from the Cross lab, which had good data. I simply think the Cross lab data paints a truer picture.

The reason the Cross lab had a better signal-to-noise average was that the alternative barcodes were left out by mistake.

Like Rhiju said in this comment thread: (https://getsatisfaction.com/eternagam...)

"Also note that in most prior rounds, we synthesized all the RNAs with the user-defined barcode as well as with an alternative barcode. But in R88, there was some confusion (on both ends), and the company did not synthesize the alternative barcodes."

Here I have added the S/N ratios (that you shared earlier) for the 3 history tour labs for comparison:

87: 1.004 (The Finger lab)
87.1: 1.731 (The Finger lab)
87.2: 1.109 (The Finger lab)

88.2: 5.627 (The Cross lab)

89: 3.149 (The Bulge Cross lab)

Also recall your image of this round - that I added some doodles in:

https://d2r1vs3d9006ap.cloudfront.net...

3.149 was the S/N average for this whole round; however, Brourd's mimic (to the left) had a much better signal-to-noise ratio than the Bulge Cross lab (to the right), leaving my labs with a much lower S/N ratio than the round average.

I don't know if I'm totally right about this, but I suspect that only the data on the winners is really to be trusted in these 2 labs where the error rate is high, because I have noticed that when the signal-to-noise ratio is low, there generally are not many winners, if any, in such labs.

Meechl

I see what you mean. I too have noticed that the score is generally lower when the signal to noise is lower.

I didn't separate the rounds, so the plots include all three R87's and both R88's. Do you think the data would be more reliable if I just looked at R88.2, because R88's signal to noise was much lower (5.627 for R88.2 vs. 0.918 for R88)?

Eli Fisker

Oh, yes, I think it will help if you separate R88 and R88.2. This will give us even better data. I would love to see R88.2 on its own. :)

Meechl

Okay, I made the R88.2 graphs. First, I'll share some general stats:

Average S/N:
0 nt gap: 6.099
4 nt gap: 3.873

Average Score:
0 nt gap: 92.0
4 nt gap: 92.4

The graphs:




Eli Fisker

Hi Meechl!

Big thx for your graphs and numbers, the Let's Break the Barcode ones too.

I'm still wondering about the Cross lab, because it was long but still had a great signal-to-noise ratio. Yet it follows the same pattern as earlier labs that were hit by high error rates, likely due to both length and the chemical problem that was identified as causing some of our lab data problems: error rates grew worse with longer gap size. However, when looking at the good lab data (the Cross) with sane error rates, the bigger gap gets the most winners.

This of course also coincided with length; but 3 of the Cross labs had the same lengths as 3 of the Bulge Cross labs, and without the alternative barcodes they got fine results. Yet the pattern is the same: the bigger the gap, the bigger the error rates.

This is part of why I suggest a barcode distance of 2 bases to the main design - to get the best of both worlds: more winners plus not-too-bad error rates.

However, given this signal-to-noise change for identical designs differing only in barcode gap, I was wondering whether it would be exactly the same if we looked at two identical but smaller designs instead. I was looking at the graphs you made for the Finger lab, and despite this design being much shorter, the trend for gap size and signal-to-noise ratio was the same: bigger gap, worse error rates. I know these data are not exactly to be trusted, given their bad error rate and the lab problems we have had.

But I have high hopes for the coming lab results from the current open lab, The Asymmetry, as this is a smaller design and because we are expecting good data.

So if the trend remains the same (the bigger the gap, the higher the error rate), I suspect that this is not just an artifact of the design being right at the limit of what the lab synthesis method can handle, where adding a few bases to the design length triggers bad error rates, but rather something related to the gap size itself.

Hyphema

Wow Meechl, incredible. There looks to be some good info in these graphs. From just "eyeballing" the graphs, haha, it looks like the designs with gap #0 scored slightly better and had better S/N ratios overall than the designs with gap #4, regardless of the number of GC, AU, and GU. There may be something to Eli's thoughts on recommending a gap size of 2 and avoiding gap lengths greater than 4.

On a side note, I would be interested to see an overall graph of the actual free energy of the barcode as a factor in score, S/N ratio and gap size. Perhaps it is not just having a GU that helps, but rather an ideal free energy value for the barcode? If so, this could be a better means of generalizing good barcodes for automation.
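A minimal sketch of how one could compute a barcode hairpin's predicted free energy for such a graph, assuming the ViennaRNA Python bindings (module `RNA`) are available; the sequence is a hypothetical 7 bp stem with a UUCG loop and one G-U wobble:

```python
import RNA

barcode_hairpin = "GCAGUGAUUCGUCGCUGC"     # hypothetical: 7 bp stem + UUCG loop, one G*U
structure, dG = RNA.fold(barcode_hairpin)  # predicted MFE structure and free energy (kcal/mol)
print(structure, dG)
```

Binning barcodes by dG and plotting score or S/N per bin would be one way to test the "ideal free energy window" idea.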

Meechl, there is another set of labs (JR's Break the Barcode, three of them in two different rounds, and JR's original Break the Barcode in another older round) that was testing the barcode with a locked winning 100% design with varying gap lengths. The point of that lab was more or less to test how weak a barcode we can use without affecting a winning design. It would be interesting to see graphs of that, as there would be many barcodes with less than ideal sequences (i.e., number of GU's, intended mispairings within the main design, etc.). I wonder if the alternate barcodes were used in those labs as well? It may be very interesting to compare the automated barcodes, which may actually be more stable sequences than the players' barcodes.

I would recommend continuing with Eli's History Tour and throwing in some locked designs with scores from 85-89 and another set with 90-93, with varying gaps of 0, 2, and 4, to test a few things. One, the hypothesis that designs overall score better and have better S/N ratios with gaps of 0-2 rather than longer gaps. Two, whether designs with one or two GU's in the barcode are still "optimal", per Eli's thoughts. Three, whether there is an "ideal" free energy value to have in the barcode.

Brourd

In addition to that, both the 3nt and 2nt gap were a part of Round 82, which was not a particularly great round for synthesis.

Hyphema

Thank you, Brourd. Yes, as I recall, round 82 was fraught with errors, and making any sound interpretations from that data may be difficult, which is why I didn't bother making any observations from that lab.

Omei Turnbull, Player Developer

Actually, round 82 was an extraordinarily good round if you look only at the alternate barcodes, which never made it into the game UI. When I calculated it just now from Meechl's spreadsheet, I got an average S/N ratio of 5.23 for the alternate barcodes, as opposed to 0.75 for the primary ones. (Looking back at a previous post, I had come up with slightly different numbers, but I trust the ones directly from the spreadsheets much more.) Unfortunately, we don't have a nice Eterna-centric browser to display the alternate data (yet), but the data is there in the Meechl's spreadsheets.

A major contribution to this difference, I think, is the padding strategies that were used for that round -- 5' random for the primary set and 3' random for the alternate set. In both rounds where these were pitted against each other, 3' random was much better than 5' random. (Unfortunately, in both these labs, the 5' random is what got into the game database.) 3' random has been used for the primary set in all rounds since 83, and it has consistently been somewhat better than the 3' poly-A used for the alternate set.

However, 3' random has never been pitted against 5' poly-A, which was routinely used until round 80 (which is where we started to see a large dependence of the S/N ratio on the length). So I think we need to make that comparison.

But I guess I'm wandering off-topic for this thread.

Eli Fisker

Meechl, thx for your smart way of showing S/N averages with a red line in the images. I'm still very much learning when it comes to box-and-whisker plots; what you did was a help for me to understand.

Meechl

Thanks for the feedback everyone. There are a lot of variables in the data and it's hard to keep track of everything (for me at least).

Eli, I like to use the box-and-whisker plots to get an idea of 1) where the average is and 2) how spread out the data is. The thick black line is the average, the box contains the middle half of the data, and the whiskers basically show the range of the data. There are sometimes a couple outliers that the whiskers don't include though.
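A minimal sketch of producing such a box-and-whisker plot with matplotlib (the S/N values below are placeholders, not the real lab numbers):

```python
import matplotlib.pyplot as plt

sn_by_gap = {
    "0 nt gap": [5.8, 6.4, 6.0, 5.5, 6.7],   # hypothetical per-design S/N values
    "4 nt gap": [3.5, 4.1, 3.9, 3.7, 4.3],
}
# Each box spans the middle half of the data (interquartile range), the whiskers the bulk
# of the range; showmeans adds a marker for the average alongside the median line.
plt.boxplot(list(sn_by_gap.values()), labels=list(sn_by_gap.keys()), showmeans=True)
plt.ylabel("Signal-to-noise ratio")
plt.title("S/N by barcode gap size (placeholder data)")
plt.show()
```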

nando, Player Developer

Not sure if this fits very well here, but I'd like to question the choice of a locked UUCG tetraloop.

The most important argument in its favor is that it is an "ultrastable" apical loop which, contrary to GNRA ones, has almost no tendency to engage in tertiary interactions. This argument looks rather feeble in my opinion: the really ultrastable UUCG conformation only happens if the supporting base pair is exactly C-G. In the other orientation (G-C), or if the closing pair is different, the apical loop has just as much potential to engage in tertiary interactions, if not more. Locking the apical loop this way also seems to influence the designs, since it pushes designers to avoid sequences like CGAA, CGGA and/or CGAG, which could be useful in the main design.

If there are other good arguments in favor of a locked UUCG, I'd like to hear them.

A silly idea: unlock the apical loop, and why not, make it part of the barcode. With 11 positions instead of 7, there would be less pressure felt by players while designing for a "busy" round, and more freedom in tuning the designs themselves (if the main design "clashes" with a UUCG barcode loop, the option of changing the apical loop is still available)

Thoughts?

rhiju, Researcher

Allowing the UUCG to float is a good idea.

We were originally going to use the UUCG as an 'internal standard', like in this paper, but we ended up figuring out how to do the standardization by including a separate RNA in the experiments.

We are actually looking into a protocol that might let us get rid of the barcode altogether now. This might take 3-4 months though.

I'd still love for the community to 'close the loop' -- that is, have a formal writeup of what we learned -- as a prototype for others who need barcoded libraries and also as a template for how to robustly design other modules.

Eli Fisker

I think one thing that speaks in favor of keeping the loop sequence locked is that unlocking it will make it even harder to find identical designs with identical barcodes, something which has been possible until now and which I consider valuable for teasing out which option is best in cases where only slight changes could alter the outcome. Recently I have compared designs whose only difference was a GU interchanged with an AU in the barcode at the exact same spot. Because the loop was locked, I was able to find these close-to-identical designs. The designs with the GU in the barcode generally scored a little higher than those with the AU.

Had the loops been allowed to differ, I would have had far fewer designs to compare.

@Nando, you said: "The most important argument in its favor, is that it is an "ultrastable" apical loop, which contrary to GNRA ones, has almost no tendencies to engage in tertiary interactions. This argument looks rather feeble in my opinion: the really ultrastable UUCG conformation only happens if the supporting base pair is exactly C-G."

I fully agree here. It takes at least one closing GC to make this UUCG barcode loop happy and more often two.

I also very much agree with the notion that the sequence of the barcode loop could potentially interfere with the rest of the design if the first part of the barcode stem isn't stable. This is what happened to a huge extent with Vienna, when it threw a lot of non-A bases into the loops, which, in case the stems are weak, love to interfere. I consider the GAAA loop a far safer choice than the UUCG loop.

@Rhiju, :) on your loop pun.

I also like the prospect of getting rid of the barcode in the future, as I think it is interfering with the design and thus the result.
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
@Rhiju, one question for you. A while back you said that Eterna's lab results had around 4% error. You also said that you would be working on getting more precise results.

What is the error margin for the Eterna results of today?
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
I got this answer back from Rhiju on how to calculate current error rates:

That was estimated via replicates (see also the PNAS paper) — if you look at some of your recent tests with the same sequence and different barcodes, you can get a similar error estimate based on the standard deviation of the score.
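For anyone who wants to reproduce that kind of estimate, here is a minimal sketch of the replicate-based calculation Rhiju describes; the score values are made-up placeholders, not real lab data.

    import statistics

    # Eterna scores for one design synthesized with several different barcodes.
    # These numbers are placeholders, purely to illustrate the calculation.
    replicate_scores = [94.0, 91.5, 96.0, 93.0, 92.5]

    mean_score = statistics.mean(replicate_scores)
    error_estimate = statistics.stdev(replicate_scores)  # sample standard deviation

    print(f"mean score {mean_score:.1f}, replicate error estimate +/- {error_estimate:.1f}")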
Photo of nando

nando, Player Developer

  • 388 Posts
  • 71 Reply Likes
Rhiju wrote:

Roughly:

A. Just a random barcode
B. Random barcode with some MFE optimization (current default for creating alternative barcodes)
C. Barcode optimized to have highest average base pair probability in hairpin -- something like a dot plot.*
D. Eli's heuristics.*
E. LabDataMiner.*

*Automated code not available yet, I think.


Whether C, D or E, it all sounds like an assignment problem to me. For the cases D and E, it could be that the cost calculations are cheap (CPU-wise) enough to generate the (num_sequences) times (6^7 possible barcodes) long list of costs and apply the Hungarian algorithm. For the case C, it is obvious that this strategy wouldn't scale very well (computing partition functions is comparatively quite expensive), and maybe some variant of the auction algorithm could be successfully used...

Hmm, anyone majoring in CS tempted by the challenge? :P
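As a starting point, here is a minimal sketch of that assignment-problem framing using SciPy's Hungarian-style solver. The cost matrix here is random placeholder data; in a real run, the costs would come from the MFE, dot-plot, or heuristic calculations named in strategies B-E, and the columns would be the ~6^7 admissible barcodes.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Toy cost matrix: rows are designs, columns are candidate barcodes, and
    # cost[i, j] says how "bad" barcode j is for design i (lower is better).
    # Random numbers stand in for whatever scoring a real strategy would use.
    rng = np.random.default_rng(0)
    num_designs, num_candidates = 5, 12
    cost = rng.random((num_designs, num_candidates))

    # linear_sum_assignment solves the rectangular assignment problem optimally,
    # guaranteeing each barcode is used at most once.
    design_idx, barcode_idx = linear_sum_assignment(cost)
    for d, b in zip(design_idx, barcode_idx):
        print(f"design {d} -> barcode {b} (cost {cost[d, b]:.3f})")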
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
A few things it does not seem like Eli has answered...

You mention the "Eterna" score of sequences quite a bit in your analysis. Does the "Eterna" score of the barcode have a significant effect on the individual measured SHAPE reactivities for all the residues in a sequence?

That is, if a barcode is exposed to the chemical probe, does that affect the rest of the sequence? If not, wouldn't that make a "bad Eterna scoring" barcode just as good as a "good Eterna scoring" barcode for the purposes of being an identification marker?

In addition to this, how do you know one design is a "winner" compared to another if there is sequence-dependent variation in the SHAPE reactivities of individual residues, say in the GaaaaC tetraloop?

Second, what are the average SHAPE reactivities for each individual residue, and how do they compare between the 4 nucleotide gap project and the 0 nucleotide gap project? Are they identical, or completely different? In addition, is there any significant difference between the standard deviations for residues in the 0 gap sequence compared to the 4 gap sequence?

Third, how do the SHAPE reactivities of the "unpaired" control sequence barcodes I designed compare to those that you consider to be "bad scoring"? Are they similar, or completely different? How do they compare against the average chemical mapping profile for the locked R88 sequences with both barcode gaps?

Fourth. Is it possible we are not looking at a problem of "design" but actually at a problem involving both the computational aspect of converting cDNA libraries to readable data, as well as a chemical problem, where the greater the number of exposed residues being modified before the target sequence, the greater the deviation in the data?

Finally, is it possible that I have already looked at these statistics, and drawn up charts and conclusions for my own amusement? Quite possible, I suppose. However, I do not have the time to formally write up such conclusions, other than to say that I believe Omei's point about a barcode that focuses on a rigorous MFE optimization is probably the best option, along with a few simple heuristics.
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
Hi Brourd!

Thx for adding your two cents. You ask many great questions.

Everybody, feel free to help answer and share what you got of statistics and charts.
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
The UUCG loop - A hint on good barcode distance?

I was comparing the 0 gap and 4 gap versions of otherwise identical designs in the Cross lab.

One thing in particular caught my eye, something I find very interesting. I was watching the SHAPE data for the UUCG hairpin loop. The pattern that stood out was that for the highest scoring designs, where the barcode was also generally most stable, the first U was usually yellow, but otherwise the UUCG loop mostly showed a blurry blue for the later bases (in particular if the closing base before the loop was a C). However, as the score dropped and the barcodes didn't always form, the SHAPE colors for the UUCG loop moved more towards yellowish.

But what I found most remarkable was that the labs showed a difference in the SHAPE painting of the barcode loop. The 0 gap lab had the most bluish loop parts for its winners, which I take to mean that the loop itself forms internal base contacts and helps additionally stabilize the barcode.

Gap size 0

http://eterna.cmu.edu/web/browse/4739...

Gap size 4

http://eterna.cmu.edu/web/browse/4739...

Now all I want to know is if this is also the case in other labs where there is a big or small gap between barcode and the main design. :)

Nope. Something else seems to be going on besides gap size. It seems that stem length also plays a huge role here. The Aires lab, which has an adjacent barcode but enormously long stems, is mostly yellow in the SHAPE data for the UUCG loop. Whereas the lab An arm and a leg, which has a huge gap but really short stems, has more bluish SHAPE data for the UUCG loop.

20 Gap size

http://eterna.cmu.edu/web/browse/2712...



0 Gap size

http://eterna.cmu.edu/web/browse/2333...



But what I find really interesting is that there is this big difference in the SHAPE data for the exact same loop, a difference that ought not to be there, except perhaps for things like backbone strain.

I decided to check somewhere I could show this was stem-length dependent too: labs with close to the same structure and the same gap size, but different stem lengths. The Cross lab levels 1 and 2 were such labs. They had the same gap size and differed only in that one of them had a stem shortened by 4 base pairs. Since this is the main difference (despite different sequences for solving), the results ought not to have been this different.

Lab Entry Level 1 - The Cross lab revisited - Gap size 2

http://eterna.cmu.edu/web/browse/4739...



Lab Entry Level 2 - The Cross lab modified - Gap size 2

http://eterna.cmu.edu/web/browse/4739...



So I conclude that both gap size and stem length affect how stable the UUCG hairpin loop seems to be. This plays into what Omei said, that it would likely not be possible to have one gap size that satisfies every puzzle. I agree.

So while I'm still not sure how to give concrete advice on how we can use this small piece of the barcode puzzle, I suspect the SHAPE color variance, as it shows up in the UUCG loops of a lab's winners, might be useful, like a canary in the coal mine, giving a warning about growing instability. I think that when a design's barcode UUCG loop generally shows bluish for the last 3 bases, it tells us that the hairpin loop is generally happy and will help stabilize its barcode, and perhaps also the rest of the design, so that this barcode has a good distance to the main design.

Whereas I suspect that when a design's barcode UUCG loop is mostly yellowish for the last 3 bases, it might be a warning that the loop is not adding additional stability to the barcode, and as such the barcode might not have the optimal distance to the main design for this lab.

I think that winning designs with generally short stems can give a more bluish SHAPE color to the UUCG hairpin loop for the last 3 bases, whereas winning designs with generally longer stems will push the UUCG hairpin loop towards a more yellowish SHAPE color for the last 3 bases.

Similarly, I think that bigger gaps push the UUCG loop more towards yellowish in the SHAPE colors, whereas smaller gaps push it more towards bluish.

Let me hear what you guys think about this.
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
Did you check the identity of the closing base pair?

Quoting Nando in a comment above:

"...the really ultrastable UUCG conformation only happens if the supporting base pair is exactly C-G."
Photo of Omei Turnbull

Omei Turnbull, Player Developer

  • 1008 Posts
  • 324 Reply Likes
I don't know exactly what Nando had in mind when he said that; it might be based on a melting experiment, or someone's interpretation of NMR experiments, or something else. But it doesn't hold up very well as a prediction of SHAPE scores for the barcode hairpin.

Here's a query for the barcode hairpins from jandersonlee's database:


Although CG closing base pairs are used about twice as often as GC, the GC closing actually does a little better in terms of how it affects the barcode. The average score column is the Eterna score for the full design. The "rank" is jandersonlee's own computation, indicating how close the individual SHAPE scores are to the ideal 0 or 1 predicted by the base pairing, but only for the bases selected in the query. By both measures, the average GC closing is slightly better than CG.

This is consistent with my own, much less comprehensive, analyses of individual
labs (and in one case, a full round of labs) where I couldn't find a difference I thought was large enough to be meaningful.
Photo of nando

nando, Player Developer

  • 388 Posts
  • 71 Reply Likes
I would be careful about drawing conclusions based on EteRNA scores, precisely because of cases like the one you're presenting, Omei.

It's been long known that such artefacts exist. You can search this forum for discussions about why AAAA scores better on average than GAAA, although every biochemist knows, and all predictive models say, that GAAA is more stable.

The same discrepancies between the model and the SHAPE signal allow me for instance to get away with 2 successive 7-7 loops and still score 100 in http://eterna.cmu.edu/game/browse/435...

For the UUCG loop, the data you're presenting is actually consistent with what I said: with the CG closing pair, the loop is so stable that protection signal appears in the region where the model predicts reactivity, thus having a negative impact on the EteRNA score. With GC, the tetraloop is less stable, causing more reactivity in the region where it is expected by the model, thus resulting on average in better scores.

In conclusion, these stats aren't invalidating my assertion. UUCG is freakishly stable only when supported by a CG pair.
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
Eli's hypothesis is:

"I think the designs who’s UUCG loop in the barcode, for the last 3 bases generally show bluish, tells that this hairpin loop generally happy and will help stabilizing its barcode and perhaps also the rest of the design, so that this barcode has a good distance to the main design.

Whereas I suspect that designs with barcodes and the UUCG loop being mostly yellowish for the last 3 bases, might be a warning about that the loop is not helping adding additional stability to the barcode, and as such the barcode might not have the optimal distance to the main design for this lab.

I think that winning designs with general short stems can give a more bluish SHAPE color to the UUCG hairpin loop for the last 3 bases. Where as winning designs with generally longer stems will cause the UUCG hairpin loop to go towards a more yellowish SHAPE color for the last 3 bases.

I think that bigger gaps cause the UUCG loop to go more yellowish in the SHAPE colors, where as smaller gaps makes it go more toward bluish."

However, the SHAPE reactivity signal for the UUCG tetraloop is affected by the identity of the closing base pair, and Eli only mentioned it in passing in the summary. As for nando's statement, he was referring to the U x G pair that forms in the CuucgG tetraloop, and to the conformation of the tetraloop not being conducive to tertiary interactions as a result.

Now, I am not quite sure what Eli means by the reactivity signals of the UUCG being a good hint to the distance between the barcode helix and the sequence that is going to be probed, other than that, perhaps, the barcode helix coaxially stacks when adjacent to another helix, preventing the kind of backbone flexibility that could occur if that was not the case.



As for this, I would like to point out the average reactivities for each closing base pair/uucg tetraloop, pulled from Jandersonlee's data mining tool. This is significant, as the result shows that for each closing base pair, there are some minor variations from Eli's "1 exposed, 3 protected" idea, enough to point out that the closing base pair of the barcode tetraloop may cause a significant deviation for a tetraloop residue's reactivity.
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
Since it does not seem like any players have answered any of my queries, I guess I have to do it myself.

[Chart: Reactivity Average Comparisons]

The chart above is a line graph of the average reactivity for every nucleotide at a specific location across all RNA sequences in the Cross Revisited target, for both the four residue gap sequences and zero residue gap sequences.

[Chart: Std Dev]

This second line chart is the standard deviation for each residue, the blue line once again representing the four nucleotide gap, and the red line the zero nucleotide gap.

The first thing to note: the average reactivities for both the 4-gap and 0-gap sequences are practically identical, except for the final eight residues before the gap and barcode, where there is a noticeable increase in the average reactivities for the 4-gap sequences.

The second thing to note: the 4-Gap standard deviation is higher than the 0-Gap standard deviation, but overall, both sequences show a significant amount of precision in the measurements for residues that are base paired, especially the 0-gap designs. We'll revisit hypotheses as to why this is near the end.
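For reference, here is a minimal numpy sketch of the per-residue average and standard-deviation calculation behind charts like these, assuming the reactivities have already been loaded as a (designs x residues) array; the array below is random placeholder data, not the RDAT values.

    import numpy as np

    # Placeholder reactivity matrix: one row per design, one column per residue.
    reactivities = np.random.default_rng(1).random((100, 85))

    per_residue_mean = reactivities.mean(axis=0)        # average reactivity per position
    per_residue_std = reactivities.std(axis=0, ddof=1)  # spread across designs

    for pos, (m, s) in enumerate(zip(per_residue_mean, per_residue_std), start=1):
        print(f"residue {pos:3d}: mean {m:.3f}  std dev {s:.3f}")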

So, first for the 0-gap sequences, a comparison of the average reactivities to some of the lowest Eterna score barcodes, and some of the highest Eterna score barcodes.

[Chart: High Eterna score]

[Chart: High Eterna score 3D]

The two charts above are line charts of the barcodes that scored "higher" with the Eterna score, in the sense that the average reactivity for all 7 residues was lower than 0.125.

[Chart: Lowest]

[Chart: Lowest 3D]

These two charts above are the main sequences with those designs that had the "lowest Eterna score" for their barcodes. This would be defined as those designs where most of the 7 residues in the barcode have a chemical mapping reactivity of 0.5 or higher.
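As a concrete reading of those two cutoffs, here is a small sketch of how such "high" and "low" scoring barcode sets could be pulled out of a reactivity matrix. The array, the barcode column positions, and the reading of "most" as 4 of the 7 residues are all assumptions made for illustration.

    import numpy as np

    # Placeholder reactivity matrix and a hypothetical location for the 7-nt barcode.
    reactivities = np.random.default_rng(2).random((100, 85))
    barcode_cols = slice(70, 77)

    barcode = reactivities[:, barcode_cols]
    high_scoring = np.where(barcode.mean(axis=1) < 0.125)[0]      # well-protected barcodes
    low_scoring = np.where((barcode >= 0.5).sum(axis=1) >= 4)[0]  # "most" residues reactive

    print("designs with high-scoring barcodes:", high_scoring)
    print("designs with low-scoring barcodes:", low_scoring)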

From these charts alone, what can we determine? Well, for one thing, in both the lowest and highest scoring designs, certain residues have a specific chemical map that is constant across all sequences, and featured heavily in the average. These would be nucleotides 14 and 76, both of which have a spike in reactivity compared to their surrounding nucleotides, and both of which feature prominently in almost all replicate sequences charted here.

In addition, the GaaaaC tetraloop sequences in this target had especially reactive Guanine nucleotides in some instances, causing some of the sequences to be penalized in score.

What I conclude from this particular target, just from the 0-Gap results, are some very specific conclusions.

1. The Eterna score of a barcode does not necessarily have a direct cause and effect on the global reactivity scores of an RNA sequence. Most likely, these sequences are still forming a barcode, just with highly mobile backbones allowing for the 1m7 probe to modify the residues, or there is something occurring in the processing, synthesis, or sequencing.

By extension, this means that some of Eli's more elaborate strategies for maximizing the "Eterna score" of the barcode are unnecessary, such as data mining, or special placement of every single base pair. A barcode is a means for identifying a sequence, and any interaction with the ensemble will probably need to be direct to significantly affect reactivity of specific residues.

2. This most likely means that the most effective method for barcode design is, as Rhiju stated and Omei and Nando have pointed out before:

Barcode optimized to have highest average base pair probability in hairpin -- something like a dot plot.*

I'll see when I can get the 4-Gap sequences up, as that may shed a little light on far more interesting matters, as well as some sequences that use truly "unpaired" barcode sequences.
Photo of Omei Turnbull

Omei Turnbull, Player Developer

  • 1008 Posts
  • 324 Reply Likes
This is great, Brourd! I love to see data summarized so well.

I find the first pair of graphs most interesting because the differences between 0-gap and 4-gap stand out so well, especially in how consistently the SHAPE scores vary more for the 4-gap version than the 0-gap version. This dependence on gap size, which Eli has noted and gathered data on for some time, deserves a forum topic all its own. If I weren't so swamped by work right now, I would start it this evening. :-)

I have a couple of questions about the second pair of graphs, because when I went to the data to investigate more, I couldn't quite match things up.

1) Are the design names in the legend the exact names for the designs or paraphrases?

2) Is the data in the graphs from the 88.0 or the 88.1 results? It seems like it is probably 88.1. But then, is the selection of "best" and "worst" designs based on the 88.1 data also, or the 88.0 data?
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
1) They should be the exact name (or similar). For example, there were three designs with the exact name "Fiana." It would probably be a good idea to replace those names "Fiana1" and "Fiana2" with "Fiana[Barcode Seq]" and the same for the second.

2) Data is based on 88.1. I could maybe draw up reactivity averages and StdDev as well for the Low Signal-to-noise results, but I am busy as well, unfortunately.
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
The barcode gap problem is an interesting one. I'll probably have time Tuesday night to make the charts for the 4-Gap data, and post my personal hypothesis on it.
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
This plot should have all of the correct design names (as well as the first 7 residues for each Fiana Barcode)

[Chart: Plot 2 correction]
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
Wow, beautiful graphs, Brourd!

Thx for adding your charts and thoughts to the data party.
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
While I have the time, this is my suggestion for the continuation of Eli's project in the next round.

1) A better experimental setup. As it is right now, it is not really a paper lab. Eli has proposed several hypotheses but has rarely designed the experiments to test them. I do not particularly agree with this setup, as it stops being an actual experiment and instead turns into a bunch of players submitting barcodes for a sequence that Eli chose.

For example, take Eli's G-U barcode hypothesis. Picking out of the data those sequences that happen to have a barcode with a GU pair stuck in does not lend itself well to credibility, since there is no direct comparison to similar helices without the GU pair.

2) Now the question is what would be the most rigorous test. Any ideas?

Rhiju's question is an important one to think about. Is there a target that we can rigorously test barcode design against?

Omei suggested placing the barcode helix adjacent to other helices in the sequence, but what does that accomplish? To a certain extent, it only facilitates the formation of coaxial stacking and other helical interactions, as well as removing residues between the barcode and main design that could potentially affect the data through their own modification. If you set the main design sequence apart from the barcode helix, the modification of residues between the two increases, while removing any immediate direct tertiary interactions.

With this, it is probably a good idea to look towards the use of RNA switch secondary structures, and measure how barcode interactions may possibly be affecting the rates of reactivity for each target state.

In addition, perhaps we should create an experiment that tests a sequence's ability to interact with the rest of the ensemble. What this experiment would entail is beyond my current ability to think on, as I am sadly busy. Maybe Eli has some ideas?
Photo of rhiju

rhiju, Researcher

  • 404 Posts
  • 123 Reply Likes
thoughts on what might be a rigorous test:

To me, a convincing test would be to take 10 diverse designs for different challenges, perhaps including sequences that are supposed to switch and therefore might be quite sensitive to the barcode.

And for each, have perhaps barcodes spaced from the design by 0 or 4 nts. So 20 sequences.

Then for each of the 5 barcode strategies generate 3 barcodes. The assessment will be how well the SHAPE data match between this triplet of barcodes for the same sequence.

It would be important here to get barcodes from the different strategies at the same time -- because of the uniqueness criterion, barcodes generated at the 'end' will necessarily be suboptimal. It will probably be important to also randomize the order of the sequences. Maybe each member of the triplet of barcodes could be chosen at different times as well. Perhaps best might be an auction algorithm, as nando has been exploring.

That's a total of 300 sequences, which would be doable in one round.

The question is -- how do we get the code, and then the sequences prepared...
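One simple way to score the proposed triplets, sketched below under the assumption that each replicate's SHAPE profile is available as an array: take the mean pairwise correlation of the three profiles for the same design. Any agreement measure would do, and the data here are placeholders.

    from itertools import combinations
    import numpy as np

    def triplet_agreement(profiles):
        """Mean pairwise Pearson correlation between the SHAPE profiles of the
        three barcoded replicates of one design; higher means better agreement."""
        corrs = [np.corrcoef(a, b)[0, 1] for a, b in combinations(profiles, 2)]
        return float(np.mean(corrs))

    # Placeholder: three replicate profiles for one design, differing only by noise.
    rng = np.random.default_rng(3)
    true_profile = rng.random(80)
    profiles = [true_profile + 0.05 * rng.standard_normal(80) for _ in range(3)]

    print(f"triplet agreement: {triplet_agreement(profiles):.3f}")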
Photo of Hyphema

Hyphema

  • 91 Posts
  • 25 Reply Likes
Rhiju said : To me, a convincing test would be to take 10 diverse designs for different challenges, perhaps including sequences that are supposed to switch and therefore might be quite sensitive to the barcode.

Did you mean take 10 designs "from" different past lab challenges? Meaning, use known sequences that have a baseline SHAPE profile and then compare the different barcode strategies? If that's not what you meant, it is what I'm suggesting.
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
Hi Rhiju!

Here are the 10 labs I would suggest for this test. I have picked a mix of labs. Besides switches, I have picked some harder labs with many short and similar-length stems that I know are also sensitive to small changes, just like switches. Similarly, I have picked a long-stemmed lab, as I suspect these also behave differently. And last, I have picked what I consider to be uncomplicated designs for comparison.

I have chosen labs which already had a barcode and have aimed to pick labs with a gap distance of 2, with a few exceptions. I have also mainly focused on the earlier cloud labs, to avoid the later labs with high error rates.

Also, picking designs that already have barcodes should give data that is more comparable than the classic Eterna designs, which were run with a completely different lab method and no barcodes. However, I did put in a few classic labs.

Pressured or otherwise hard designs

1)
Cloud Lab 16 - Section from lab Water Strider by Brourd
Link: http://eterna.cmu.edu/web/lab/3376094/

Reason for pick: Pressured design.
Original gap distance 3
Winners: 3

2)
Tighter Two Stacks for 3-way Multi-Loop testing
Link: http://eterna.cmu.edu/web/lab/3376139/

Reason for pick: Somewhat short and similar length stemmed design that looks like it could be grumpy with bad barcodes.
Original gap distance 2
Winners: Only 3

3)
Random 4
Link: http://eterna.cmu.edu/web/lab/3376082/

Reason for pick: This is a short-stemmed design and I believe it to be potentially vulnerable due to its short neck, in particular for short gaps.
Original gap distance 2
Winners: 12

Switch designs

4)
Top Notch
Link: http://eterna.cmu.edu/web/lab/3376119/

Reason for pick: This is a partial moving switch, so I'm expecting it to be less touchy when it comes to different gap sizes, compared to a full moving switch.
Original gap distance 2
Winners: 2

5)
My Screw-Up Corrected?
Link: http://eterna.cmu.edu/web/lab/3376113/

Reason for pick: Because this was one of 3 cloud switch labs to have more than one winner. (I consider the lab Will it bind? too odd to choose, despite it having 4 winners.)

Original gap distance
Winners: 3

6)
Simple RNA Switch
Link: http://eterna.cmu.edu/game/browse/885...

Reason for pick: I have picked an Eterna classic switch, the first one we solved and worked with for 9 rounds. I picked this one because it seemed extremely touchy, so I expect it to react badly if it doesn't like its barcodes or gaps. One base change could send the score down by something like 20%.
Original gap distance : None
Winners: 2

7)
FMN aptamer 20

Link: http://eterna.cmu.edu/game/browse/142...
Reason for pick: Eterna classic switch. I picked this one as it is a full moving switch and it has short stems in one of the states.
Original gap distance : None
Winners: 2

Normal easy design

8)
Simplify
Link: http://eterna.cmu.edu/web/lab/3376123/

Reason for pick: To have not only the outlier labs, but to see that the barcode strategies also work well with a normal, relaxed lab.
Original gap distance 2
Winners: Lots

9)
Relaxed multiloop 1
Link: http://eterna.cmu.edu/web/lab/3376234/

Reason for pick: Uncomplicated to solve, all designs scored 92% or above.
Original gap distance 2
Winners: Lots

Long stemmed designs

10)
Anaconda
Link: http://eterna.cmu.edu/web/lab/3376096/

Reason for pick:
Original gap distance 2
Winners: Lots
Photo of rhiju

rhiju, Researcher

  • 404 Posts
  • 123 Reply Likes
Super! These look good to me... possible to deploy for the next round (November deadline)?
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
That should be easy enough to do, Rhiju. Who are you specifically asking to generate the sequences, structures, and barcodes?
Photo of Omei Turnbull

Omei Turnbull, Player Developer

  • 1008 Posts
  • 324 Reply Likes
Brourd, I think your description of the effect of the spacing between barcode and neck is not right. Having two helices adjacent (in general) prevents, not facilitates, coaxial stacking.

To see this, think about the exception, where the RNA molecule consists solely of two hairpins, with no tail at either end. Here, stacking forces between the two helices would encourage their coaxial alignment, and the 5' and 3' ends would be drawn into positions very close to what they would be if the RNA were a closed loop.

But if the backbone at one or both ends of the molecule continues on to single stranded RNA (which is always the case in Eterna), the two helices are forced out of alignment in order to make room for the unmatched bases. In essence, the backbone section that joins the two helices is forced to act as a hinge, opening up space on the other side for the single strand(s) to "escape".

On the other hand, having a few unmatched bases between the two helices allows them to minimize their combined energy by stacking coaxially, while leaving enough space for the ends to escape. If what we were shooting for were coaxial alignment, there would probably be some optimal compromise between too close together (forcing the alignment away from coaxial) and too far apart (weakening the interaction between the individual stacks). From what I have seen from the lab data, I'm guessing that optimal number is in the range of 2-4 bases.

But coaxial stacking of helices is, in general, not what we want if the goal is to minimize tertiary interactions. Coaxial stacking of two helices with single stranded tails points each of the tails in the direction of the other helix, facilitating the formation of tertiary interactions between the helix and the ssRNA. These interactions can take the form of either base triples or base-backbone interactions. Although the understanding of these interactions isn't well enough developed to predict exactly how they affect the SHAPE data for the backbone (or the ssRNA), we can certainly see the effect in the lab data by looking at the level of deviations in the SHAPE data for the helices, i.e. the extent to which the values differ from the predicted value of zero. The data Eli has collected shows that this deviation is lower when the gap is zero than when it is 2 or 4.
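A minimal sketch of that helix-deviation measure, assuming the target structure's paired positions are known; all of the values below are placeholders, not lab data.

    import numpy as np

    def helix_deviation(reactivities, paired_positions):
        """Mean absolute SHAPE reactivity over positions predicted to be paired
        (ideal value 0); larger values suggest breathing or tertiary contacts."""
        vals = np.asarray(reactivities)[list(paired_positions)]
        return float(np.mean(np.abs(vals)))

    # Placeholder data for one design.
    rng = np.random.default_rng(4)
    reactivities = rng.random(80)
    paired_positions = range(5, 25)   # hypothetical helix positions (0-based)

    print(f"helix deviation: {helix_deviation(reactivities, paired_positions):.3f}")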
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
By the way, there is a thing I have been wondering about. It is not just the growing gap between barcode and design that I have seen higher error rates for, but also single-base stretches between multiloop stems. So a widening gap seems to be followed by a growing error rate.

Here is an example from my relaxed multiloop labs. (
http://eterna.cmu.edu/web/labs/past/?...)

Relaxed multiloop 1 Gap size 5
Relaxed multiloop 2 Gap size 6
Relaxed multiloop 3 Gap size 7
Very relaxed multiloop Gap size 8

The signal-to-noise ratio is dropping the bigger the gap gets.

Image from Omei's fusion table. (https://www.google.com/fusiontables/D...)

Relaxed multiloop was in round 77, which overall had worse average error rates (1.714) than round 79 (3.046), which Relaxed multiloop 1 and 3 were in.

Very relaxed multiloop was in round 81 (2.620). I know these labs keep getting longer, but still they are not as long as other labs that caused really bad error rates. These results are horrible. (This last lab was intended to test the max number of single A's in a row before the odd SHAPE pattern appeared: http://eterna.cmu.edu/web/lab/3376348/. So for this one I'm not surprised about getting really bad error rates.)

I'm basically thinking it is the number of single bases in a row that might be triggering this. It seems that long stretches of A's push the error rate higher. Designs seem to relax quite a bit if they get a few non-A spacers.
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
Average and median for the reactivity errors, Omei

(The measurements for median and average 1m7 reactivities for both the 0GAP and 4GAP targets are quite similar, I can post those as well.)

Plus, what would be the point of synthesizing the same sequence 100 times if you are not able to average and find the median of the measurements for that sequence, and then use that for comparison against all other chemical mapping profiles?
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
"Designs seem to relax quite a bit if they get a few non-A spacers."

Again Eli. The only way to test a hypothesis is to develop an experiment to do so.

And this could be the case. Perhaps the 1m7 probe has an affinity for homopolymer AMP sequences (or possibly AMP is just extremely mobile in comparison to the other residues, meaning it is more likely to be modified. This behavior could be an attribute of purines in general as well), and this over-modification leads to errors that propagate through the entire sequence.

As for what you are saying, Omei, I can understand your apprehension about using the average reactivity per residue as the "true value" for nucleotide reactivity in a sequence. I agree that, to a certain extent, I would prefer the synthesis of a target over multiple rounds with various barcodes before I would try to compute a "true value."

However, with that said, even with a higher standard deviation for the chemical reactivity at each residue, both the median and average reactivities for the 4GAP sequence are extremely similar to the average and median of the 0GAP sequence. The precision of these two average measurements is too high for it to be random coincidence, especially when it is over a relatively high number of RNA sequences (around 100 for each target).

Still, I would not trust the data enough to make specific measurements about the accuracy of a single sequence in comparison. On the other hand, broad observations can be made, and for the single stranded barcodes, the observation is that the reactivities of the sequence are spiked both positive and negative in comparison to averages for a target. Additionally, the reactivity errors for the single stranded barcodes are very high.

Did that answer your question?
Photo of Eli Fisker

Eli Fisker

  • 2289 Posts
  • 518 Reply Likes
Something Cody said yesterday made me think about my earlier single-stranded barcode experiment.

In this lab I put single-stranded barcodes on designs that I already had data on with double-stranded barcodes. (Unfortunately, the single-stranded barcodes don't have exactly the same distance to the design as the double-stranded ones.) I looked at it again.

What stood out in the images of the designs with single-stranded barcodes versus the ones with double-stranded barcodes was that the GGAAA sequence of the 5' tail was far more blue in the SHAPE data than the 5' tail of the designs with our regular barcode.

I'm guessing that the presence of a lot of bases other than A in the single-stranded barcode area of the design has a big effect on this. For making a stem or a loop, base frequency matters a lot. However, I'm not sure this is why there is this distinct difference. Here is an example; the tendency is general.



https://docs.google.com/document/d/1m...# (Page 21-22)
Photo of Brourd

Brourd

  • 461 Posts
  • 84 Reply Likes
Well, that barcode sequence has a high likelihood of forming a 3x2 loop with the complementary GG sequence. Unfortunately, we don't have much data on single stranded barcodes, and to add on to that, there is something that has an indirect effect on the data as well.

When a barcode is single stranded, at least seven residues will be exposed to the chemical probe, but no data will be collected for them whatsoever. The effect that this has on the cDNA library, as well as on data processing, is unknown.