What does the lab data say about the various causes of high error rates?

Article · Updated 5 years ago
For quite some time now, the Das Lab has been working to reduce the higher-than-desired error rates associated with the massively parallel Eterna lab protocol. Some of the causes, such as a batch of reagents going bad, or a specific DNA template strategy that proved to adversely affect RNA amplification, have been clearly identified. In other cases, there are causes the Lab thinks are relevant (high GC content, large variability in sequence length, ...) but that are less well quantified. The newly announced plan to make all the sequences in a round the same length is an attempt to see how big an improvement that makes.

What strikes me is that as Eterna players, we have generated a tremendous amount of data about this, since every individual RNA molecule analyzed in the Cloud lab comes with an estimate of the error associated with the SHAPE values for that molecule. This estimate is called the signal-to-noise ratio, often written as S/N ratio. These have been recorded in the RDB files archived in Stanford's RNA Mapping Database, but have not been easily accessible.

But now, Meechl has created a folder of Google spreadsheets containing most, if not all, of the Eterna synthesis data produced by the Cloud Lab. There is data here on many more syntheses than we can see from the game API. For instance, all recent labs have synthesized each solution twice, with two different barcodes, but only one set has been reported in the game. And just as importantly, it is now available in a familiar format, so that any player who can use a spreadsheet to sort, filter and summarize data has full access. Meechl deserves a huge round of applause for this! In addition, Meechl and Eli have created a table in the wiki that gives the round number (and hence the spreadsheet) for every cloud lab, starting with the player projects that were used as pilot experiments.
Omei Turnbull, Player Developer
Posted 5 years ago
I just posted copies of Ann's lab notes for rounds 73 and 74.
Meechl
It looks like ETERNA_R88_0002 and ETERNA_R88_0003 were added to the RMDB database, but the two files look the same to me? I updated the spreadsheets for R88.0002, but didn't include R88.0003.

The good news is the average S/N does look a lot better now.
88: 0.918
88.2: 5.627
Omei Turnbull, Player Developer
That's a huge improvement!
Eli Fisker
Thx for the great news, Meechl!

Now the scores for the frozen winner designs in the History tour lab look a lot more like what I would have expected. They now mostly have fine scores somewhere in the 90s.

And agreed, R88.0002 and R88.0003 look like they have identical values.
Hyphema
So the obvious question is what did they do to make such a drastic change in quality?
Eli Fisker
Good one, Hyphema. I'm curious too. Has the problem been identified and solved?
rhiju, Researcher
This is good news, right!

We had the library for R88 re-synthesized by the company. Largely based on our discussions with you, they agreed that there must have been a problem on their end; they had also changed a chemical in the synthesis process and suspected it was bad.

Also note that in most prior rounds, we synthesized all the RNAs with the user-defined barcode as well as with an alternative barcode. But in R88, there was some confusion (on both ends), and the company did not synthesize the alternative barcodes. That means that there were effectively fewer RNAs on which to concentrate the sequencing reads, so signal-to-noise should be high.
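The read-spreading argument above can be sketched numerically. This is a toy back-of-the-envelope check, not real run data: the read total is invented, and 377 is just R88's design count.

```python
# Toy arithmetic for the read-spreading argument: a fixed sequencing
# budget concentrated on half as many RNA species doubles the average
# coverage per design. The read total is invented for illustration.
total_reads = 1_000_000   # hypothetical sequencing reads in one run
designs = 377             # designs in the round (R88's count)

reads_one_variant = total_reads / designs         # no alternate barcodes
reads_two_variants = total_reads / (designs * 2)  # each design synthesized twice

print(reads_one_variant / reads_two_variants)  # -> 2.0
```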

The company is resynthesizing R87 for us now...

Big thanks to Ann Kladwang & the company CustomArray, and to you all, for helping us get on track again!
Eli Fisker
Rhiju, this is really great news!

Looking forward to good lab data from now on. :)

So I guess we should know from the re-synthesis of R87 whether the signal-to-noise ratio will still be nice and high when there are alternative barcodes as well.
Eli Fisker
Just another thought. If the signal-to-noise ratio drops considerably for R87, what about removing the alternative barcodes? When did we start adding them?
rhiju, Researcher
i've been considering it. we originally created the alternate barcodes to allow for tests of whether the barcodes were interfering with the main RNA. but it has been difficult to get those data to show in the game viewer for technical reasons. let's see how things look for R87 and discuss here.
Eli Fisker
Since we have the RDAT files, I was wondering if it would be possible to make a graph comparing data from player-picked barcodes against data from the alternative barcodes. Each design name appears twice in the spreadsheet, once for each type of barcode, so I guess it could be tricky to make.

What I hope is that it could give us an idea of how close the different results are, and whether the alternative barcodes are doing their job well. If it is possible, I'm not sure what should be compared. Score?
rhiju, Researcher
I really like this idea. (Sometimes the alternative barcodes also used different padding strategies, which might influence signal-to-noise.) A score comparison does seem like the natural set of plots to make -- any takers out there?

We could look to see how often the alternative barcodes produce dramatically worse or better scores than the player-derived ones. If there are often outliers (e.g. >30%), we should probably just not bother with these alternatives. If there are not often outliers, perhaps we can consider averaging the data over the barcodes and presenting them in the game, which would improve signal-to-noise.
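A minimal sketch of the outlier check described above, assuming the paired scores have already been extracted from the spreadsheets; the design names, scores, and 30% threshold are illustrative.

```python
# Flag designs whose alternate-barcode score differs from the
# player-barcode score by more than a threshold (30% here), relative
# to the player-barcode score. The input format is an assumption.
def flag_outliers(pairs, threshold=0.30):
    """pairs: iterable of (design_name, player_score, alt_score)."""
    outliers = []
    for name, player_score, alt_score in pairs:
        if player_score == 0:
            continue  # skip unscored designs to avoid dividing by zero
        if abs(alt_score - player_score) / player_score > threshold:
            outliers.append(name)
    return outliers

# Made-up scores for illustration.
pairs = [
    ("Design A", 95.0, 93.0),  # small difference
    ("Design B", 90.0, 55.0),  # ~39% lower -> flagged
    ("Design C", 80.0, 82.0),
]
print(flag_outliers(pairs))  # -> ['Design B']
```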
Omei Turnbull, Player Developer
This is actually something I have been wanting to do ever since I found out that each design was often synthesized twice, with different barcodes. But all the labs whose templates I had examined used different padding strategies for the two variants, as Rhiju said, and the average differences between padding strategies were large enough that they would make the differences due to barcodes alone hard to tease out.

But now that the data is so much easier to access, I scanned through the files and found that RMDB file 76.1 is a perfect test case. Each design was synthesized with two barcodes, and both variants used 5' poly-A padding.

The results are definitely interesting. I'll follow up with graphs and specific numbers, but I can say already that

1) The algorithm-generated barcodes have very slightly higher average reactivity errors. Despite the small magnitude of the average difference, there is enough data that there is no question about the statistical significance of the difference. So this is another case where Eterna players beat the machine algorithm.

2) There are a lot of intriguing outliers in the SHAPE value differences of specific base positions in specific designs.
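As a sketch of how the significance claim in point 1 can be checked: with the same design measured under both barcodes, a paired t statistic on the per-design error differences is the natural test. The numbers below are toy data, not the actual round 76.1 values.

```python
import statistics

def paired_t_statistic(errors_player, errors_algo):
    """Paired t statistic for the mean difference (algo - player).
    A consistently positive t suggests the algorithm-generated barcodes
    have systematically higher reactivity errors."""
    diffs = [a - p for p, a in zip(errors_player, errors_algo)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / n ** 0.5  # standard error of the mean
    return mean / stderr

# Toy reactivity errors: algorithmic barcodes slightly worse on average,
# but consistently so across designs.
player = [0.10, 0.12, 0.11, 0.13, 0.10, 0.12]
algo = [0.11, 0.13, 0.12, 0.13, 0.11, 0.13]
t = paired_t_statistic(player, algo)  # small mean difference, large t
```

With enough designs, even a tiny average difference yields a large t, which is what makes Omei's "small magnitude, no question about significance" observation possible.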
Eli Fisker
Omei, thx. This already sounds very interesting. I look forward to your conclusions and graphs.
rhiju, Researcher
@Eli: Regarding barcodes, now that we have very high quality data from R88 on the same design with different barcodes, would you be able to work with other players to draw conclusions on good barcode strategies? Maybe post a link in your project description?

 I have now run into other applications at Stanford requiring hairpin barcodes, including new efforts to understand how cells send packets of RNA to each other via 'exosomes'. It would be wonderful to formalize what we are learning in EteRNA so that other scientists can make use of it, and we need your help!

Perhaps it makes sense to open (or revive) a separate thread on barcodes so as to not mix the discussion with error rates.
Eli Fisker
Hi Rhiju! Started a new thread over here:

https://getsatisfaction.com/eternagam...
Meechl
I've been looking at the differences between the original and new barcodes, and (surprise) I made some graphs. I know Omei mentioned differences in processing in many of the rounds, but I looked at all that were available anyway, for fun. :)

First, I should note that I made two assumptions with these graphs:
1. Designs with a new barcode have "-0" added to the end of the Sequence ID
2. The new barcode designs are listed in the same order as the original designs
I am pretty sure these are valid assumptions for every round I looked at.
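Under those assumptions, pairing the rows can be sketched like this; the IDs below are illustrative, and only assumption 1 (the "-0" suffix) is used.

```python
def pair_variants(sequence_ids):
    """Pair each original sequence ID with its alternate-barcode
    counterpart, assuming the alternate is the same ID plus '-0'."""
    ids = set(sequence_ids)
    pairs = []
    for sid in sequence_ids:
        if sid.endswith("-0"):
            continue  # alternates are picked up via their originals
        if sid + "-0" in ids:
            pairs.append((sid, sid + "-0"))
    return pairs

# Illustrative IDs, not real spreadsheet contents.
ids = ["R89_design_001", "R89_design_001-0", "R89_design_002"]
print(pair_variants(ids))  # -> [('R89_design_001', 'R89_design_001-0')]
```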

These first graphs are the scores for all of the designs in the rounds with alternate barcodes. The original designs are in pink, and the new ones are green. Rounds 77 through 83.2 are not included because their scores are not listed on RMDB.

Score

It looks to me like the scores are fairly similar for the first five runs, but then in the last three runs, the alternate barcodes generally score lower.

These next graphs are comparing the S/N ratios between the original (pink) and alternate (green) barcodes.

S/N 1
S/N 2

Again, they look pretty similar, but in rounds 80 and 82, there are patches where the alternate barcode had much higher S/N, and in rounds 87, 87.1, and 87.2, there are patches where the original barcode had much higher S/N. So, my question is: why?

If the large differences are based on the lab project, what about that lab project could have made it so much different with a different barcode? For example, in round 80, the big differences are in The RFAM Mapping project and Nando's Zippers - Shape 0. Is there something special about those labs?
Eli Fisker
Hi Meechl! I love your surprises. :)

Wow!

You asked if there was anything special about Nando's Zippers - Shape 0 and the RFAM Mapping project.

I can say that the RFAM mapping project, as a single project, contained many different shapes and lengths, where a usual lab contains a lot of designs with, hopefully, all the same shape. Also, most of these RFAM designs were really small and short.

Similarly, Nando's Zippers was a really short lab.

(Source: https://docs.google.com/spreadsheets/...)
Eli Fisker
Also, these two labs have some weird-looking S/N averages. The RFAM project ones are calculated per design.

What strikes me as weird is that they have high S/N numbers, which should be good. But they don't look equally good when viewed with error rates turned on in the in-game interface. Any ideas why?


http://eterna.cmu.edu/web/browse/3562...


http://eterna.cmu.edu/web/browse/3562...

Also we don't have any of the vital info like Score etc.
Omei Turnbull, Player Developer
Well done, Meechl!

So far, I've only looked into the Round 80 results, which is the one containing the RFAM mapping project and Nando's Zippers. Although there is nothing in the RDAT file that makes this obvious, these were submitted to Eterna via a route that didn't require (or allow, I think) barcodes to be assigned by the investigator. Thus, I suspect both sets of barcodes were assigned by an algorithm. It's possible, though, that Jee was experimenting with different algorithms for assigning the barcode. Still, the consistency of the differences over whole projects like that makes it seem more likely that something differed in the processing of the two variants.

Looking at Ann's master spreadsheet (https://docs.google.com/spreadsheets/...), I see that there are actually three lines for Round 80, but only one RDAT file. My suspicion is that new templates were made for some (all?) of the scientist labs, which then went through a separate lab processing round by themselves. The RDAT file we see is probably a combination of those two lab runs.

Ann and Rhiju, can you tell us just what happened in Round 80?
tsuname, Alum
@Meechl ETERNA_R88_0002 and ETERNA_R88_0003 were indeed duplicates. I've eliminated them and put the newest data, with higher S/N, in the entry ETERNA_R88_0001. Sorry for the confusion!

Pablo
On behalf of the RMDB team
Omei Turnbull, Player Developer
Eli, I think the discrepancies you point out in your last comment can be accounted for by identifying which variant of a design you are looking at.

For example, here are the two copies of The RFAM Mapping project - Shape 98.



The S/N ratio of 20.219 you see in your spreadsheet screenshot is for the -0 variant. The variant shown in the Eterna UI is the unmarked variant, as you can verify by checking the barcode sequence, CCACUUAUUCGUAAGUGG.
Eli Fisker
Thx Omei. Now I understand better. So we have the worse variant in-game, and not the best data.
rhiju, Researcher
Any suggestions for how to show the alternative sequences and/or data in the eterna UI? Clearly here's a case where we'd want both. The devs agreed that it would be confusing to average the two data sets, since they are different sequences.

One option would be to show only the player inputted sequence & data by default, but then allow a button click to show or hide alternative sequences/data as well.
Eli Fisker
I very much like the idea of a button to call up the extra data. I agree with you. We want all the data, to learn from it and put what we learn to use.

However, the lab interface is very buggy, so it takes too long to get from one design to the next.

For us to be able to analyse all the data that we have, we need smarter ways to view it.

Imagine that we could get from one design on the list to the next in one click, with full design view, not losing any of the precious SHAPE data or having an endless list of windows open.



Based on Mat's FlickRNA idea:
https://getsatisfaction.com/eternagam...
https://getsatisfaction.com/eternagam...

Imagine we could see the SHAPE data of a design beside its letter sequence.



Mat has several beautiful ideas for how we can view the lab data in a better and more effective way:
https://getsatisfaction.com/eternagam...
rhiju, Researcher
Thanks for the reminder about the buggy and inefficient lab browser/viewer. This might be a good time to revisit it -- let me bring up with the dev team next week.
Omei Turnbull, Player Developer
Rhiju, here's a suggestion for a minor change, based on my experience of putting the data into fusion tables. Simply add one extra data field that contains the "suffix", which I've started calling the "variant". So the user would typically see two rows for each design, one with a blank variant and the other with a variant of -0. The user could then use the existing mechanisms in either of the current sequence browsers to hide one or the other set if so desired.

(If there's a technical problem in the UI with distinguishing "filter on the variant being blank" and "no filtering at all", the "unmarked" version could be given some non-blank name, like "original".)

I don't want to minimize the importance of finding better ways of comparing designs. But I know that's a much bigger project than just adding a field.
rhiju, Researcher
that's genius. one potential issue is that if we require the ID fields to be unique in the database, we'll have a problem -- but I can see how to get around that with separate IDfull, IDnumber, and IDsuffix fields.
Omei Turnbull, Player Developer
Question about DNA templates

Rhiju and Ann,

I don't know how it would explain the particular results we see, but I have noticed something that seems odd when 3' random tail padding is used.

In the RMDB files and in the game, the 3' ending sequence is specified as AAAAGAAACAACAACAACAAC. This sequence is also there in the DNA template files when poly(A) padding is used.

However, when random padding is used, the final C is replaced with a random base, as though there is an off-by-one error in the algorithm for assigning the random bases.

I doubt this is on purpose. But what effect do you think it might have on the results?
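To make the suspected bug concrete, here is a toy reconstruction of what such an off-by-one could look like. This is not the actual Eterna pipeline code, just an illustration that reproduces the symptom: the tail's final C overwritten by a random base.

```python
import random

TAIL = "AAAAGAAACAACAACAACAAC"  # the fixed 3' ending quoted above

def pad_random_off_by_one(seq, target_len, rng):
    """Random 3' padding with a suspected off-by-one: the padding
    starts one base too early, overwriting the last base of seq."""
    n_pad = target_len - len(seq) + 1  # one more than needed
    padding = "".join(rng.choice("ACGT") for _ in range(n_pad))
    return seq[:-1] + padding

rng = random.Random(0)
padded = pad_random_off_by_one("GGAAAC" + TAIL, 40, rng)
print(padded.startswith("GGAAAC" + TAIL[:-1]))  # -> True: tail intact except its final C
```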
Eli Fisker
Omei, what you said reminded me of something weird I see in the lab interface, though I'm not sure it is related. There is often a white line of letters at this specific spot in the lab data. I strongly suspect they are not supposed to be white, I think they are supposed to be grey.

rhiju, Researcher
@omei, during the PCR the final C will get added back to the amplified DNA template and then show up in the RNA. I'm not too worried about it.

@eli, can you send a PM to jnicol to look into this? that looks messed up. Also there should be data going out all the way to the UUCG.
Omei Turnbull, Player Developer
I agree, Eli, that is a very strange report. Looks like two fifths of the SHAPE scores one would expect are missing, in addition to the one column that looks to be all 0.5.

My guess is that it is not related to the final C of the 3' end being overwritten in the DNA template.
Omei Turnbull, Player Developer
Thank you, Rhiju. Given what I have learned in the process of studying Ann's lab notes, I can actually see why it gets added back. :-)
Eli Fisker
@Rhiju, will do. I have seen it in several labs.

@Omei, thx.
Omei Turnbull, Player Developer
More on Meechl's findings for round 87

I looked into the data for Round 87.0 in light of Meechl's graph, which I'll reproduce here.



Looking at the data in a fusion table, I first noticed that almost all of the syntheses with high S/N ratios were from Nando's Cyanocobalamin project, so I went and looked at the barcodes in that project. Interestingly, the project seems to have been submitted as an "expert" project, meaning that all the barcodes were assigned by algorithm. That would seem to preclude "good" or "bad" barcodes being the issue here.

I next discovered that all of the syntheses with an S/N ratio of 4.0 or more had two characteristics in common. First, they were relatively short, and second, they were the variants made from templates with random 3' tails. So I created the following two charts. The first plots S/N ratio against length for all the designs.


It has the kind of dependence of S/N ratio on length that we've seen since round 80.

The second chart plots the same for only the sequences with poly-A padding.


The difference is obvious -- the shorter sequences that had high S/N values have all disappeared. Otherwise, the graph patterns look pretty much the same. (For the careful reader who wonders why there are some points in the second chart that aren't in the first, the charts built into fusion tables seem to be optimized for visual appearance by plotting only a (hopefully) representative sample of the data.)
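The split described here can be reproduced with a simple grouping once (length, padding, S/N) triples are extracted from the RDAT file; the records below are invented to mimic the pattern, not real round 87 values.

```python
from collections import defaultdict
from statistics import mean

# Invented (length, padding, S/N) records mimicking the pattern described:
# short random-padded sequences with very high S/N pull that group up.
records = [
    (85, "random", 6.1), (90, "random", 5.4), (130, "random", 1.2),
    (85, "polyA", 2.0), (90, "polyA", 1.9), (130, "polyA", 1.1),
]

by_padding = defaultdict(list)
for length, padding, s_n in records:
    by_padding[padding].append(s_n)

# Average S/N per padding strategy.
summary = {pad: round(mean(vals), 2) for pad, vals in by_padding.items()}
print(summary)
```

With these toy numbers, the random-padded group averages much higher than the poly-A group purely because of its short high-S/N outliers, which is the kind of skew the charts show.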

I don't know what to make of this. If we believe that shorter sequences are "taking over" the PCR, and thus have higher average S/N ratios, why would having poly-A tails prevent this so completely? As I understand it, the padding is removed after the first two (of 30-40) PCR cycles.

Rhiju, do you have a theory?
Meechl
Round 89 is here!

Scores (Pink = original barcode, Green = alternate)


Signal to Noise


Eli's History Tour Labs start at index 332. Since those labs are all at least 15 bases longer than the Mimic Labs, I suspect that is why the S/N ratio dropped.

Average S/N for original barcode: 3.998301
Average S/N for alternate barcode: 2.299623
Overall Average S/N: 3.148962
Eli Fisker
Hi Meechl!

I found this really interesting. In particular, now that we have generally better data, it really stands out that the alternative barcode scores a lot worse on average.

Since my four History tour labs from the Bulged cross lab are close to identical, I think we can tease one more interesting thing out of them. Three designs had identical main designs; only the single gap bases varied. The fourth had 4 bases fewer than the others. Despite this, the signal to noise varies between the labs.





The last lab, between 698-800, does best overall. This is the lab that has 4 bases fewer (2 base pairs fewer) in the main design compared to the others. So even this slight shortening has an effect, since it does slightly better than the design between 589-698, which has the exact same gap size.

What else I find very interesting is that there is a pattern for which labs have worse signal-to-noise ratios. This is actually part of what I have been after the whole time with the History tour project.

The lab that does worst is the lab with gap size 4 (486-588), followed by gap size 2 (698-800), and the one with the best signal-to-noise ratio is the one with 0 gap (332-485).

Link to background info on the barcode project:

The bigger the gap between the design and barcode, the higher the error rate seems to become
https://docs.google.com/document/d/1g...

I'm aware that the first lab, with gap size 0, will by nature also be 4 bases shorter than the second one, with the 4-base gap. So I decided to add in base numbers also.



So now I wonder whether it is really the gap difference, or actually the length difference, that gives such a big difference between very similar shapes with very small length differences. I still suspect gap size; it is not completely off the hook.
Omei Turnbull, Player Developer
Ann has put up the template order for round 89. The recent practice of 3' padding the unmarked variant with a random sequence and 3' padding the -0 variant with poly-A was used here again.
Eli Fisker
Hi Omei!

I have been wondering how you see the padding thing.

By the way, I have started comparing rounds 88 and 89 from inside the game. I was looking at barcodes at the time, and I think I found something interesting. Round 88 held the Cross lab and round 89 held the Bulged cross lab. Both shapes are practically identical, except that the bulged cross contains a 1-1 loop. They have the same length. However, round 88 had far better error rates than round 89. The SHAPE data is pristine in comparison.

The Cross - 128 bases (R88 data)


The Bulged cross - 128 bases (R89 data)


What it really looks like to me is that we can safely do longer designs and get good data back if we skip the alternative barcode. So the length limit gets imposed earlier because of the alternative barcode.

All of these cross labs have exactly the same length as the bulged cross labs in the R89 image in my comment above (except for the fourth lab in each). But all the R88 data look fine, whereas the R89 data are not as fine.
Brourd
In addition to the bulged cross lab, Round 89 contained the second round for the thermodynamic mimic sequences, which were all around 110 residues in length.

I have to say that I get fantastic data from the alternative barcode sequences there, and prefer to keep it!
Brourd
Another option for the alternative barcodes is to use a different custom padding instead of a homopolymer-A sequence, perhaps AAAAAN repeated until the padding length is reached.
Omei Turnbull, Player Developer
Eli, you wrote: "I have been wondering how you see the padding thing." I'm presuming you're asking how I know what padding strategies were used in a particular round. If that's not what you meant, let me know.

The authoritative source for this information is the DNA template files. These are copies of the files exactly as they are sent to CustomArray, the company that synthesizes the DNA. Here's how I typically use the files to determine what padding strategies were used.

1) Pick a sequence of interest from the RMDB file, noting what variant (typically either none or -0) it is. (Do not use a sequence whose length is the longest in that round, since it will have no padding.)
2) Mark the barcode plus some of the trailing sequence (but don't go all the way to the trailing C) and cut it.
3) Search for that sequence in the corresponding template file after you have replaced any U's with T's. You should find multiple lines that match, with the contents of each matched line being identical. I usually make at least a mental note of how many matched lines there are, because it often comes in handy when things don't seem to add up.
4) Using one of the matched lines, select the bases that make up the T7 promoter and the bases that correspond to our RNA. (Remember that RNA U's will be DNA T's.)
5) Hopefully at this point it will be obvious where the padding is and what it consists of. Two things to keep in mind: 1) there will always be a few bases before the T7 promoter (typically, but not always, TTC) that are not padding, and 2) if 3' random padding is used, the last C in the RNA gets replaced by a random base. This is not on purpose, but Rhiju is not concerned that it will have any significant effect on the outcome.
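Steps 2-4 above can be sketched in code. The template lines and RNA fragment below are invented; only the TTC prefix, the standard T7 promoter sequence TAATACGACTCACTATA, and the 3' tail follow the conventions described in the steps.

```python
def rna_to_dna(rna):
    """Spell an RNA sequence as its DNA template (U -> T), per step 3."""
    return rna.replace("U", "T")

def find_template_lines(template_lines, rna_fragment):
    """Steps 2-3: search the template file for the DNA spelling of the
    barcode plus some trailing sequence."""
    query = rna_to_dna(rna_fragment)
    return [line for line in template_lines if query in line]

# Invented template lines: TTC prefix + T7 promoter + design + 3' tail.
# The second line differs only in its final base, mimicking the
# overwritten last C seen with random padding.
templates = [
    "TTCTAATACGACTCACTATAGGAAAGCCGTGCAAAGCGGAAAAGAAACAACAACAACAAC",
    "TTCTAATACGACTCACTATAGGAAAGCCGTGCAAAGCGGAAAAGAAACAACAACAACAAG",
]

# Step 2: barcode region plus some trailing sequence, as RNA
# (stopping well before the trailing C).
fragment = "CCGUGCAAAGCGGAAAAGAAACAACAAC"
matches = find_template_lines(templates, fragment)
print(len(matches))  # -> 2; a count worth noting when things don't add up
```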

Everyone agrees that the padding strategies should be recorded in Ann's master document. I've been really bad about not collecting my findings in a central place. If you start looking at them, will you make notes and share them with me? That will help to make sure we get them into the main document.
Eli Fisker
Omei, thx, and beautiful! This was exactly what I was after. Now I understand better how to get at it. If I get around to making notes on this, I will share. I will pass this request on to others, too, who may wish to look into this.
Omei Turnbull, Player Developer
Brourd wrote: "I have to say that I get fantastic data from the alternative barcode sequences there, and prefer to keep it!"

Something doesn't seem right here. If I've done the calculations right, the alternate barcode sequences (aka the -0 variant) used 3' poly-A padding, while the unmarked variant used 3' random padding. The average S/N ratio for your mimic sequences was much better with the unmarked variant (6.47) than with the -0 variant (3.47). So it would seem you would favor getting rid of the poly-A padding.

You don't actually describe your preference in terms of the padding strategy, but in terms of the "alternate barcodes." Are you looking at the data differently?
Brourd
The thermodynamic mimic project, at its core, is a project studying riboswitch sequences. Ideally, one would hope that every single RNA molecule folds exactly the same when all conditions are equal. However, a riboswitch sequence, being designed to adopt two stable structures at equilibrium, may have varying measurements of reactivity across multiple sequencing runs.

In addition, the current method for scoring the mimics is crude, at best, in order to reflect this potential for varying reactivities.

As for the padding, it should not be a problem in future sequencing runs, since the length of all future DNA templates ordered should be at the maximum.
Brourd
I also do not have as much of a problem with the randomly generated barcodes, as a solid design heuristic would make them perfectly fine.

In addition, according to the data from Eli's R88 project, the SHAPE signal of the barcode helix did not affect the SHAPE signal globally, even when the barcode helix was exposed. You can interpret that in any way you like, however, I see it as being that the barcode helix is not actually "unpaired" but in a state of high mobility, or there is some form of error in the reactivity measurements there.

Unless we were to base everything on the Eterna score, then that means any barcode sequence following some basic design heuristics should work fine.
Brourd
Ah, I see Rhiju's comment down lower about the padding strategy being necessary with piggyback orders. In that case, the random padding appears to be the best option based on all the data gathered so far. I would say it may be an interesting idea to try other custom padding strategies, and compare those to the random sequence padding. Unless that is not feasible, of course.
Omei Turnbull, Player Developer
What is the lab testing with Round 88?

Rhiju and Ann,

I took a look at the various files for R88. The RDAT file says you synthesized 377 designs, with just the one unmarked variant. The DNA template file, however, has lots of sequences that look like an interesting experiment, but I can't tell what of. It does look like they use a different transcription promoter. Can you tell us what you are testing? Or is some entirely different project just piggybacking on the same template order?
rhiju, Researcher
piggybacking -- some sequences from a different lab that will be testing determinants of RNA/protein binding, and some sequences that Johan and I will be testing to allow FRET 'molecular ruler' measurements for 3D information on DNAs and (I hope in the next year) RNA designs from EteRNA.
rhiju, Researcher
(1) Based on the data so far, can you all now give us 'final' advice on whether to use poly(A) or random padding?
I think future orders will primarily have fixed lengths, but if we piggyback non-Eterna orders we will need guidance on padding.

(2) Can you give us final advice on whether we should be making alternative barcodes? It doesn't cost us more in terms of synthesis, but it does reduce signal-to-noise at the final sequencing step as the sequencing counts are spread over 2x more RNAs.
Eli Fisker
Comment to 2)

I would like the longest designs to have no alternative barcodes, to ensure that we get good data for them and that, in time, we can make longer designs.

Brourd requested to keep alternative barcodes for his shorter sequences for the following reasons:

"In addition to the bulged cross lab, Round 89 contained the second round for the thermodynamic mimic sequences, which were all around 110 residues in length.

I have to say that I get fantastic data from the alternative barcode sequences there, and prefer to keep it!"
Omei Turnbull, Player Developer
My 2 cents:

I haven't looked at all the labs, but I think we have enough evidence to strongly suspect that poly-A padding has an adverse effect on S/N ratios. So I would say stop using it routinely. If, after understanding everything we can from the data we already have, we realize we need to do further tests to make sure poly-A isn't taking a bum rap, we can always do that.

As for studying alternate barcodes, and how to improve their algorithmic generation, I suggest we make a change and treat that as a research project within the normal Eterna environment. For example, there might be a research project that has two sub-projects, the first being designed completely by players. The second subproject would not be open to players, but would duplicate the main sequences and use an algorithm to generate the barcodes. Or there could be multiple "machine" sub-projects to give multiple algorithms the chance to beat the players. Except for that research project, we'd no longer routinely synthesize two different versions of each design, but focus on getting cleaner results for the designs that players make.
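To make the "machine" sub-project idea concrete, here is a hypothetical greedy barcode generator that keeps only random candidates at a minimum Hamming distance from every barcode already chosen. The alphabet, barcode length, and distance threshold are illustrative assumptions on my part, not the lab's actual algorithm:

```python
import random

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def generate_barcodes(n: int, length: int = 8, min_dist: int = 3,
                      seed: int = 0) -> list[str]:
    """Greedy sketch: draw random barcodes, keep those at least
    min_dist mutations away from every barcode kept so far."""
    rng = random.Random(seed)
    kept: list[str] = []
    while len(kept) < n:
        cand = "".join(rng.choice("ACGU") for _ in range(length))
        if all(hamming(cand, b) >= min_dist for b in kept):
            kept.append(cand)
    return kept

codes = generate_barcodes(5)
```

A real algorithm would also need to respect the secondary-structure constraints the game places on barcode hairpins, which this sketch ignores.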
rhiju, Researcher
Thanks to all. Based on above, I am working with devs to:

1. use only random padding (not polyA) when it's needed.
2. no longer generate alternative barcodes unless needed (e.g., for designs without player barcodes).

Also, as mentioned in the barcode thread, we will explore the idea of improving our automated barcoding strategy and testing it through incisive Eterna experiments.

Overall, thanks much for the careful analysis and concrete advice -- it's been very exciting to see how these discussions have made real improvements to a cutting-edge experimental technique.
Omei Turnbull, Player Developer
Rhiju, this makes sense given what we've confirmed so far. But recall that the results prior to Round 80 had a better pattern of S/N ratios than those of Round 80 and after. The S/N ratio didn't seem to be affected by length, and we didn't have the symptoms of the shorter designs taking over the amplification, resulting in extremely high S/N ratios for a few designs and lousy ones for most. That wide distribution raises the average S/N ratio, but in a way that isn't desirable.
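To illustrate the mean-inflation effect with made-up numbers: a couple of runaway designs can drag the average S/N far above what a typical design actually sees, while the median stays low. The values below are invented for illustration, not taken from any round:

```python
from statistics import mean, median

# Invented S/N values: two runaway designs, the rest starved of reads.
snr = [45.0, 38.0, 2.1, 1.8, 1.5, 1.2, 1.1, 0.9, 0.8, 0.7]

print(f"mean={mean(snr):.2f}, median={median(snr):.2f}")  # mean=9.31, median=1.35
```

This is why the median (or the full distribution) is a better health check for a round than the average alone.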

Eli and I are working on a systematic listing of the padding strategies and S/N behavior for all the rounds, and we've found a smoking gun for what changed in Round 80. (To see the current status, check out the last section in https://docs.google.com/document/d/1w...). In rounds 76-79 (which are the earliest we have the template files for at the moment), 5' poly(A) padding was used for both normal and alternate barcodes. Starting with Round 80, all kinds of variations were tried, but as far as we've found, there has not been another round with all 5' poly(A).

It's too early to say this with any certainty, but it is looking like the data says that 5' poly(A) padding may actually be better than 3' random. If we can get better results, while not mandating that all Eterna designs be a fixed length, it will be an all-round win.
Brourd
I believe the use of 5' padding was discussed in this forum post.

https://getsatisfaction.com/eternagam...
Omei Turnbull, Player Developer
That's an interesting observation. At that time, I concluded "The end result is that the round 82 Repro lab results make a strong case for padding the 3' end, not the 5' end." In retrospect, this seems not to have been the appropriate generalization. What the data actually indicated was that 3' random padding is better than 5' random padding; I jumped to the conclusion that the 5' vs 3' distinction was all that was relevant.
Omei Turnbull, Player Developer
The results are in on how padding strategy has affected the S/N ratio in past Eterna labs. Meechl pitched in to finish the analysis.

Summary:

The lab has experimented with 4 different padding strategies

* 3' random
* 3' poly(A)
* 5' random
* 5' poly(A)

but the head-to-head comparisons haven't tested all the various combinations. From those that have been tested head-to-head, we can say

* 3' random is better than 3' poly(A). [Rounds 83-87]
* 3' random is much better than 5' random. [Rounds 80 and 82]

So of those three, 3' random is the best choice, and it is the variation that has been used for the data reported in the game UI for all the latest rounds.

The fourth variation, 5' poly(A) was used in a number of labs (76-79), but always for both original and alternate barcodes. So we don't have any direct comparison between 3' random and 5' poly(A). But we had observed previously that rounds up to 79 exhibited little or no dependence of the average S/N ratio on RNA length, whereas in later labs there was a strong dependence on length. Thus 5' poly(A) may well be superior to 3' random.
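The sort of head-to-head comparison above can be reproduced by anyone with Meechl's spreadsheets: group the rows by padding strategy and compare average S/N. A sketch with invented numbers (the strategy labels and values are mine, purely illustrative):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (padding strategy, S/N) rows, the sort of thing one could
# pull from the spreadsheets; the values below are invented.
rows = [
    ("3' random", 4.2), ("3' random", 3.8),
    ("3' poly(A)", 2.1), ("3' poly(A)", 2.6),
    ("5' random", 1.1), ("5' random", 0.9),
]

by_strategy = defaultdict(list)
for strategy, snr in rows:
    by_strategy[strategy].append(snr)

for strategy, snrs in sorted(by_strategy.items()):
    print(f"{strategy}: mean S/N = {mean(snrs):.2f}")
```

The real comparison should of course only pool designs that were tested head-to-head within the same round, since round-to-round conditions vary.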

I think a direct experimental comparison between 3' random and 5' poly(A) should be part of the next round.
Brourd
Is there any padding for the DNA templates since the sequences are at their maximum length?
Omei Turnbull, Player Developer
Sorry, Brourd, for the slow response.

All of the rounds we included in this study had projects of varying lengths. So all of the RNAs, with the exception of the longest for each round, had padding added as necessary to make all the DNA template sequences the same length. (This is desired because it minimizes the cost of the DNA synthesis step.)
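For concreteness, here is a minimal sketch of that padding step in Python. The mode names and the choice of a seeded generator are my own illustrative assumptions, not the lab's actual pipeline:

```python
import random

def pad_template(seq: str, target_len: int, mode: str = "3p_random",
                 seed: int = 0) -> str:
    """Pad a DNA template out to target_len using one of the four
    strategies discussed in this thread. A sketch only, not lab code."""
    pad_len = target_len - len(seq)
    if pad_len < 0:
        raise ValueError("sequence longer than target length")
    if mode == "3p_polyA":
        return seq + "A" * pad_len
    if mode == "5p_polyA":
        return "A" * pad_len + seq
    rng = random.Random(seed)
    pad = "".join(rng.choice("ACGT") for _ in range(pad_len))
    return seq + pad if mode == "3p_random" else pad + seq
```

The longest design in a round gets `pad_len == 0` and passes through unchanged, matching the exception noted above.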

Is this responsive to your question?
Brourd
Ah, my apologies. I should have been more specific. That question was actually a response to:

I think a direct experimental comparison between 3' random and 5' poly(A) should be part of the next round.

Since all RNA sequences are now at the maximum length, this is not possible, unfortunately. However, if you can convince the Das Lab to *re*order an older library to use this as a comparison, I believe that would work out well for the hypothesis and experiment.
Omei Turnbull, Player Developer
Ah, got it.

There are still some choices available for the next synthesis round.

If the synthesis includes piggyback labs with lengths different from the Eterna experiments, then padding will be required.

If all the designs are the same length, padding isn't required for the DNA synthesis step, but neither is it excluded; the padding would just need to be the same length for all templates.

But it may well be that Rhiju wants to run this next round using the simplest possible scenario, to establish a "best case" baseline for what the lab is capable of producing under ideal circumstances. In that case, any test of padding strategy would be put off, of course. But requiring experimenters to use a pre-specified length for their designs is a significant constraint. Given that none of the experiments done with 5' poly(A) showed a significant correlation between S/N ratios and RNA length, it could be the perfect solution for us.