Discovery that padding strategy seems to affect reactivity error

  • Updated 5 years ago
I haven't gotten around to updating my data mining tool for all the cloud lab changes since last fall, so when I wanted to look at the most recent Reproducibility Lab results in round 85, I went to the RDAT files. There, I discovered that most, if not all, of the round's designs were synthesized twice. For example, one of the Repro lab designs has a MAPseq ID of 3573904 and its duplicate in the other set has MAPseq ID 3573904-0. So I decided to compare the Repro lab results between the "unmarked" set and the set marked with "-0".

What I found was that there was a significant difference between the results for the two sets, both wrt SHAPE scores and errors. Designs in the unmarked set almost always had higher average values than the corresponding design in the "-0" set. I knew that the lab had done some experimentation with different padding techniques, so I suspected that might account for the difference, but the RDAT file doesn't give any indication of how the designs were padded.
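
A minimal sketch of that kind of comparison, assuming the per-base values have already been pulled out of the RDAT file into a dictionary keyed by MAPseq ID. The function name `pair_by_suffix`, the ID 3573905, and all the numbers are made up for illustration; they are not real Round 85 data.

```python
from statistics import mean


def pair_by_suffix(values_by_id, suffix="-0"):
    """Return {base_id: (unmarked_values, marked_values)} for IDs found in both sets."""
    pairs = {}
    for design_id, values in values_by_id.items():
        if design_id.endswith(suffix):
            continue  # marked entries are picked up through their unmarked twin
        twin = design_id + suffix
        if twin in values_by_id:
            pairs[design_id] = (values, values_by_id[twin])
    return pairs


# Hypothetical values, for illustration only.
values_by_id = {
    "3573904": [0.9, 1.1, 1.0],
    "3573904-0": [0.4, 0.6, 0.5],
    "3573905": [0.8, 0.7],  # no "-0" twin, so it is skipped
}

pairs = pair_by_suffix(values_by_id)

# Positive difference means the unmarked set has the higher average.
mean_diffs = {
    design_id: mean(unmarked) - mean(marked)
    for design_id, (unmarked, marked) in pairs.items()
}
print(mean_diffs)
```

The same pairing works for either the SHAPE values or the error values; only the input dictionary changes.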

Background: Rhiju had told me that the standard procedure for the cloud lab was to add extra bases to the DNA templates they submitted for synthesis, so that all the DNA templates were the same length. As I understand it, the purpose of this is to prevent shorter sequences from "taking over" the PCR amplification, resulting in too little data for longer designs. But that led to questions such as whether it was better to add the padding to the 5' end or the 3' end, and whether it was better to use all adenines or randomly chosen bases for padding. I knew that in Round 83, a number of padding strategies had been tested by duplicating designs with different padding and distinguishing the duplicates with the "-0" suffix in the ID field.
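
To make the padding choices concrete, here is a rough sketch that pads a set of sequences to a common length, at either end, with either adenines or random bases. It is only an illustration of the idea: the function `pad_template`, its parameters, and the example sequences are hypothetical, and it ignores the promoter and tail sequences that the real lab templates carry.

```python
import random


def pad_template(seq, target_len, end="3'", random_bases=False, rng=None):
    """Pad seq to target_len at the chosen end with adenines or random bases."""
    missing = target_len - len(seq)
    if missing < 0:
        raise ValueError("sequence is longer than the target length")
    if random_bases:
        rng = rng or random.Random()
        pad = "".join(rng.choice("ACGT") for _ in range(missing))
    else:
        pad = "A" * missing
    if end == "5'":
        return pad + seq
    if end == "3'":
        return seq + pad
    raise ValueError("end must be \"5'\" or \"3'\"")


# Hypothetical sequences, for illustration only.
designs = ["GGAAACGT", "GGAAACGTACGTAC"]
target = max(len(d) for d in designs)

padded_3 = [pad_template(d, target, end="3'") for d in designs]
padded_5 = [pad_template(d, target, end="5'") for d in designs]
padded_random = [
    pad_template(d, target, end="3'", random_bases=True, rng=random.Random(0))
    for d in designs
]
print(padded_3, padded_5, padded_random)
```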

Anyway, I went through the RDAT files for the four rounds that have included the Repro lab and discovered that in every case, each design had been synthesized twice! Furthermore, I already had a complete list of what DNA template sequences (including the padding) were ordered in Round 82, which is one of the rounds that included the Repro lab. So I decided to dig into the data for round 82.

The end result is that the round 82 Repro lab results make a strong case for padding the 3' end, not the 5' end. Here's a graph summarizing the error rates using the two padding strategies:


The horizontal axis, labelled 1-40, represents the 40 designs in the Repro lab set. The vertical axis represents the median REACTIVITY_ERROR score for all the bases in that design. Clearly, padding at the 5' end resulted in generally higher error values than padding at the 3' end. I'll ask Rhiju for more of the DNA order files, to verify that this pattern isn't a one-round fluke.
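
The per-design summary behind a graph like this can be sketched as follows, assuming the REACTIVITY_ERROR values for each design are already available as lists. The function names and the numbers are invented for illustration and are not the Round 82 results.

```python
from statistics import median


def median_error_by_design(errors_by_design):
    """Map each design to the median of its per-base error values."""
    return {design: median(errors) for design, errors in errors_by_design.items()}


def fraction_higher(medians_a, medians_b):
    """Fraction of shared designs whose median in medians_a exceeds medians_b."""
    shared = [d for d in medians_a if d in medians_b]
    if not shared:
        raise ValueError("no designs in common")
    higher = sum(1 for d in shared if medians_a[d] > medians_b[d])
    return higher / len(shared)


# Hypothetical per-base errors, for illustration only.
errors_5prime = {
    "design_1": [0.5, 0.7, 0.6],
    "design_2": [0.3, 0.9, 0.4],
    "design_3": [0.2, 0.2, 0.3],
}
errors_3prime = {
    "design_1": [0.2, 0.3, 0.25],
    "design_2": [0.3, 0.5, 0.35],
    "design_3": [0.2, 0.4, 0.3],
}

medians_5 = median_error_by_design(errors_5prime)
medians_3 = median_error_by_design(errors_3prime)
frac = fraction_higher(medians_5, medians_3)
print(f"5' padding had the higher median error in {frac:.0%} of designs")
```

The median is used rather than the mean so that a few bases with very large errors do not dominate a design's score.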

Rhiju, I have a theory for why it isn't good to pad the 5' end, but it rests on an assumption that I don't know is true. I am thinking that when transcribing the DNA template into RNA, you add a primer that binds to the repetitive AAAC pattern in the 3' end, marking that as the place for the transcription enzyme to start. Is that what actually happens? If so, it seems that padding the 5' end of the DNA results in doing the SHAPE analysis on an RNA molecule that includes the tail padding, while any padding on the 3' end of the DNA doesn't get transcribed into RNA. If I'm right about all that, the results mesh well with my general feeling that the single stranded tails can and sometimes do form tertiary bonds with the main design, which does affect the SHAPE results. What do you think?

Omei Turnbull, Player Developer

  • 968 Posts
  • 304 Reply Likes

Posted 5 years ago

Omei Turnbull, Player Developer

For all those who wrote off the round 82 data as being too noisy to be interpretable, here's some good news. It appears that all the Eterna labs got synthesized twice, and the data in the RDAT files whose IDs carry the "-0" suffix is much better than the unmarked data (which is what we see in the UI).

Here's an example for the "Late bulge control" lab. The average signal_to_noise ratio for the designs loaded into the UI is 0.49. For the other data set, it is 4.90. That's right, a full order of magnitude better!

The bad news, of course, is that there is no UI (so far as I know) for displaying data in an RDAT file in a way that is meaningful to players. Currently, my approach is to cut and paste the relevant sections from the file into an Excel spreadsheet and then interpret the data with formulas and charts. That works for summary data, but it doesn't help much with interpreting the data at the level of individual bases.

Eli Fisker

  • 2223 Posts
  • 485 Reply Likes
Hi Omei! Thx for the good news. I'm glad there is good data.

I wonder if there is any way we can get the bad set of data swapped out for the good one, so we can see it. It will make interpretation a lot easier.

rhiju, Researcher

  • 403 Posts
  • 123 Reply Likes
I'm checking with devs

rhiju, Researcher

Omei – I think your hypothesis about the mechanism isn't quite correct, since we do PCR of the DNA template with primers that select the 5' end to be a special T7 promoter sequence. The transcription of the DNA into RNA starts after this sequence with GG.. and then goes all the way to the shared AAAC... pattern (which we call tail2 in the lab). I think the differences in signal-to-noise are likely due to problems in the actual synthesis that occur when padding at the 5' end.

Still, it's great that you caught that we have excellent data for all the sequences. Am working with devs to figure out how to display it.

Could you also use your analysis tools to look at prior labs with different padding strategies too?

Omei Turnbull, Player Developer

Thank you for the clarification, Rhiju. Let me see if I understand correctly. You start PCR with the templates (that have the padding). If all goes as expected, the first amplification step gets rid of the padding, and further amplification isn't affected by what the padding was. But you think the padding on the 5' end interfered with the initial amplification of the original template, so it may have taken many temperature cycles before the exponential amplification of a given sequence started?

And yes, I'll look at more of the RDAT files. Can I safely assume that within any given round, all the unmarked lab IDs used one padding strategy and all the "-0" ones used a different one? I didn't verify that for all the round 82 results; I just matched up a few barcodes against the lab order file, and hoped it was consistent across all the projects.

rhiju, Researcher

You're right on the padding being removed during amplification. One model for the observations involves knowing that the initial DNA library is synthesized 3' to 5' ('backwards'). With 5' padding, the sequence of interest is synthesized first, which sounds great. The problem is that those nucleotides are then still present during the further chemical steps that follow, which can actually damage them.

As for "-0" vs. not, I think you've got to check the library orders unfortunately to figure out which padding strategies were used... prior to November 2013 (I think), we did not test padding strategies, but still got data with alternative barcodes...

BTW, are you able to directly visualize RDAT files from RMDB in your 'custom' sequence browser?

Omei Turnbull, Player Developer

No, for my analysis I'm cobbling together Linux command line tools and Excel.

Meechl

  • 81 Posts
  • 27 Reply Likes
Thanks for posting about this, Omei. I've been wondering why all the sequences were run twice.

For visualizing the data, I like to reformat everything to make it easier to read, like I did here with R85 (X is reactivity, E is error): https://docs.google.com/spreadsheets/...

It would be nice to see the data in-game though.

Omei Turnbull, Player Developer

Meechl and Rhiju, maybe we can pool our resources. I have my data mining tool, but it only understands the JSON format that is used in-game, not the RDAT format. It also has to be downloaded and run locally, which has limited its use. Meechl obviously has an RDAT file parser and Rhiju has control of the Eterna servers.

I don't know why the (semi) open source initiative faltered, but maybe now is the time to push through any obstacles and create a tool that would give all Eterna players a useful UI to the RDAT files.

Your thoughts?

Meechl

I'm not sure I'd be very helpful. I only look at the data in Excel and R, and anything beyond that is out of my range of knowledge. Hopefully the JSON format tsuname mentioned will work for you, because I've never used JSON before. :)

rhiju, Researcher

I agree. Now is the time to develop an open source initiative to read RDAT files.
We could even imagine creating a browser for RMDB that can be deployed and stringently tested there, and can then be merged into EteRNA's sequence browser.

If JSON output from the RMDB would be useful, I think it would be pretty easy to provide (it may already exist, actually). I'll ask tsuname to join in the conversation.

tsuname, Alum

  • 12 Posts
  • 2 Reply Likes
Hi Omei,

There is indeed a not-so-well advertised small JSON API for the RMDB. Take a look here https://sites.google.com/site/rmdbwik...
You can download any entry in JSON format.

We also have Python and MATLAB scripts for reading RDATs, but they're barely documented. Documenting them should, I think, be the next step to make this more accessible.
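
As a stopgap for anyone without those scripts, here is a very small sketch of pulling the per-design reactivity and error rows out of RDAT-like text. It is not the RMDB scripts mentioned above, and it rests on an assumption about the layout: whitespace-separated lines whose first field looks like `REACTIVITY:<n>` or `REACTIVITY_ERROR:<n>`, followed by the numeric values. The function name `parse_rdat_values` and the sample text are made up for illustration.

```python
def parse_rdat_values(text):
    """Collect REACTIVITY and REACTIVITY_ERROR rows keyed by their numeric index.

    Assumes each data line starts with KEY:<index> followed by the values;
    any other line is ignored.
    """
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or ":" not in fields[0]:
            continue
        key, _, index = fields[0].partition(":")
        if key not in ("REACTIVITY", "REACTIVITY_ERROR") or not index.isdigit():
            continue
        rows.setdefault(int(index), {})[key] = [float(v) for v in fields[1:]]
    return rows


# Hypothetical sample in the assumed layout, for illustration only.
sample = "\n".join([
    "REACTIVITY:1\t0.10\t0.80\t0.30",
    "REACTIVITY_ERROR:1\t0.05\t0.10\t0.07",
    "REACTIVITY:2\t0.20\t0.60\t0.40",
    "REACTIVITY_ERROR:2\t0.30\t0.25\t0.35",
])

rows = parse_rdat_values(sample)
for index, data in sorted(rows.items()):
    print(index, data["REACTIVITY"], data["REACTIVITY_ERROR"])
```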

Omei Turnbull, Player Developer

Hi tsuname,

I took a look at the JSON output for a single RMDB id. With some changes for the different JSON structure, my lab data mining app (an HTML/JavaScript app that currently understands the JSON sent by the Eterna server) could serve as a good starting point for displaying groups of related designs (e.g. an Eterna lab) in an interactive UI. I'm sure there'll be some issues (in particular wrt the larger files), but I'll go ahead and code up a prototype for further discussion.

I'm currently looking at this from the viewpoint of an Eterna player. But if it's going to be a browser for all RDAT data, there are undoubtedly things in the data files I know nothing about. As an example, the bulk of the data in the 16SFWJ_1M7_0001 entry is labeled "trace", which is in addition to the "value" and "error" data. Do you and Rhiju have a sense of what you would like to see?

Omei Turnbull, Player Developer

Status update:

When I got into the details of converting my Eterna data mining app to the RMDB JSON API, it turned out there were a number of issues -- the JSON query would only return a small part of the Eterna data, being abruptly truncated, resulting in invalid JSON; there didn't seem to be an API for requesting a subset of the results in the file; much information that an Eterna specific data mining tool would want isn't included in the RDAT file.

So I decided to take a different tack. My current plan is to first construct a SQL-like database (initially using Google Fusion Tables) that combines all the data from both the RDAT files and the Eterna database. Meechl has made a big contribution to this effort by converting all the RDAT files into Google Sheets. Once I've worked out the details of adding additional fields that Eterna players are accustomed to seeing (e.g. free energy), I'll redo my tool to use the Fusion Table API. Since all Fusion Tables expose a well-documented SQL query capability, all that data will be available to any player-created apps.
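
To illustrate the kind of combined query this would allow, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for Fusion Tables. The table names, column names, and rows are all hypothetical; the point is only that once the RDAT-derived numbers and the Eterna design fields share a design ID, one SQL join brings them together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table per data source, joined on design_id.
conn.executescript("""
    CREATE TABLE rdat_results (design_id TEXT PRIMARY KEY, median_error REAL);
    CREATE TABLE eterna_designs (
        design_id TEXT PRIMARY KEY, title TEXT, free_energy REAL
    );
""")

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO rdat_results VALUES (?, ?)",
    [("d1", 0.25), ("d2", 0.60)],
)
conn.executemany(
    "INSERT INTO eterna_designs VALUES (?, ?, ?)",
    [("d1", "Example design A", -12.3), ("d2", "Example design B", -8.1)],
)

# Combine the fields players are used to seeing with the RDAT-derived error.
rows = conn.execute("""
    SELECT e.design_id, e.title, e.free_energy, r.median_error
    FROM eterna_designs AS e
    JOIN rdat_results AS r ON r.design_id = e.design_id
    ORDER BY r.median_error
""").fetchall()
conn.close()

for row in rows:
    print(row)
```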

I realize this doesn't contribute anything to making the non-Eterna RDAT files more accessible. :-(

tsuname, Alum

Hi Omei,

Thanks for testing out the JSON RMDB capabilities -- it's very useful since we haven't fully tested this on Eterna's large datasets. I'm very much interested in which entries it failed on, so I can fix the bugs and make the data more accessible to you guys. Do you have a list of RMDB ids for which the JSON results are truncated?

Meechl

Is there any chance we could enable CORS on the server to allow programmatic access to the data? I'm getting this error message when trying to grab files from the site.

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at http://rmdb.stanford.edu/repository/a.... This can be fixed by moving the resource to the same domain or enabling CORS.

Meechl

I also tried this neat little utility at jsonlint.com. It showed me where the JSON formatting went wrong in round 86, not that I could do much about it.


Parse error on line 581:
... "errors": "3.1697,3.1697,3.169
-----------------------^
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '['

tsuname, Alum

I tried jsonlint and it validates the JSON of round 86 with no errors. An important note though: I used the pro version, since the normal version seems to have a limit on the number of input characters and truncates the JSON if you just copy and paste.
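
One way to take the paste limit out of the picture is to check the downloaded text locally. Below is a minimal sketch using Python's standard `json` module; the helper name `check_json` and the sample strings are made up for illustration, and the "truncated" sample simply imitates a response that was cut off mid-string.

```python
import json


def check_json(text):
    """Return (True, None) if text parses as JSON, else (False, error description)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, None


# Hypothetical samples: a complete document and the same text cut short.
complete = '{"errors": "3.1697,3.1697,3.1697"}'
truncated = complete[:-6]

print(check_json(complete))
print(check_json(truncated))
```

If the full download parses locally but a pasted copy does not, the truncation happened in the copy and paste step rather than on the server.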