How can we improve search?

Updated 6 years ago
Hi everyone. I just joined the EteRNA team and will be improving the RNA search functionality. A bit about myself -- I'm also an engineer at Google, and am excited to have the privilege of spending my 20% time on a project that is revolutionizing the way scientific research is done.

Right now, I'm collecting ideas on how you'd like us to improve the RNA database search. Please throw any suggestions you have my way. Thanks!

Andrew First
Posted 7 years ago

Eli Fisker

Hi Andrew!

Big welcome in Eterna. :)

Engineer from Google, awesome! That might bring us closer to Adrien's vision that "We need to become the Google of RNA".

You have probably already been introduced to our forum posts on the topic. In case you haven't, I have dug up some of the posts with our ideas for an upcoming RNA library system. Here are the links.

New idea on the RNA sequence search tool
How can we handle big amounts of data
Here are some ideas on an RNA library system:
Eterna dreams
CSV file

Good luck!

Edward Lane

Nice to have more coding power, welcome and thanks in advance :)

Eli Fisker

One more from here:

As I understand from the science papers Rhiju posted, RNA has motifs: special 3D patterns that recur. If there is an information system that specializes in saving and grouping 3D information for comparative search, that could be useful too, once we know something about RNA's 3D structure.

I'm thinking that other science projects, e.g. in protein folding, molecule structure or ribosomes, already have a library structure for viewing and searching things in 3D that we could perhaps learn from.

jandersonlee

Hey Andrew, Great to have you on-board!

In addition to the forum posts that Eli pointed out above, the thread "Discussion on new lab system with 20,000 synthesis per month" has some discussion near the end on searching the proposed "database" of lab results.

Searching an RNA results database seems (to me, a relative search novice) to be complicated by how many variables there are to search on and how contextual and interrelated they are. For example:

Target/Estimated Secondary Shape:

Looking for solutions to a "tetraloop" may be relatively straightforward. Assuming you want a tetraloop with an attached stack of at least 3 base pairs, you might search with a selection clause something like "shape contains '(((....)))'". But how do you tie in whether you want "successful" results or "all" results, and constrain the selection to labs with a synthesis score >= 95? Also, how do you search for a more open pattern like a four-branch multiloop with no nucleotides between the stacks? Using a typical "*" in the search pattern for an arbitrary match won't work, because "*" would also match unbalanced runs of brackets. You would have to add a character like "#" to represent a ()-balanced sequence of structure characters, then search for something like "((((((#)))(((#)))(((#))))))".
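To make the "#" idea concrete, here is a minimal Python sketch. It assumes shapes are stored as plain dot-bracket strings, and the function names (is_balanced, match_at, find_shape) are made up for illustration; they are not part of any existing EteRNA tool. "#" is treated as "any non-empty, ()-balanced stretch of structure characters".

```python
def is_balanced(fragment):
    """True if every '(' in the fragment is closed inside the fragment."""
    depth = 0
    for ch in fragment:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def match_at(structure, pattern, i, j):
    """Return the end index if pattern[j:] matches structure at i, else None."""
    if j == len(pattern):
        return i
    if pattern[j] == "#":
        # Assumption: '#' stands for any non-empty balanced stretch.
        for end in range(i + 1, len(structure) + 1):
            if is_balanced(structure[i:end]):
                result = match_at(structure, pattern, end, j + 1)
                if result is not None:
                    return result
        return None
    if i < len(structure) and structure[i] == pattern[j]:
        return match_at(structure, pattern, i + 1, j + 1)
    return None


def find_shape(structure, pattern):
    """List (start, end) spans of the structure that match the pattern."""
    hits = []
    for start in range(len(structure)):
        end = match_at(structure, pattern, start, 0)
        if end is not None:
            hits.append((start, end))
    return hits


# A four-branch multiloop with no nucleotides between the stacks.
multiloop = "(((" + "(((....)))" * 3 + ")))"
print(find_shape(multiloop, "((((((#)))(((#)))(((#))))))"))
print(find_shape("..(((....)))..", "(((....)))"))
```

As long as the pattern itself is balanced, every matched span is balanced too, so the literal brackets in the pattern line up with real base pairs in the hit. A score constraint such as "synthesis score >= 95" would be a separate filter applied to the records before or after this match.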

Sequence:

Sometimes you want to find lab results that contain a particular sequence of nucleotides instead. For example, you might look for "GUGU" to see whether it causes trouble. But then you might like the result to include the sequence, the matching portion of the secondary shape, and the SHAPE data results for that sequence. It might also be interesting to see what (if anything) that sequence was matched with in the Target Shape, the Estimated Shape, or both.
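A sequence search that returns the matching slice of shape and SHAPE data could look roughly like the sketch below. The record layout (name, score, sequence, target, shape) and the values in it are placeholders invented for this example, not real lab results or a real schema.

```python
def find_motif(records, motif, min_score=0):
    """Find every occurrence of a motif, with the aligned structure and SHAPE slice."""
    hits = []
    for rec in records:
        if rec["score"] < min_score:
            continue
        seq = rec["sequence"]
        start = seq.find(motif)
        while start != -1:
            end = start + len(motif)
            hits.append({
                "name": rec["name"],
                "start": start,
                "target": rec["target"][start:end],
                "shape": rec["shape"][start:end],
            })
            # Step by one so overlapping occurrences are reported too.
            start = seq.find(motif, start + 1)
    return hits


# Placeholder records, invented purely to exercise the function.
records = [
    {"name": "design_a", "score": 96, "sequence": "GGGUGUAAACCC",
     "target": "(((......)))",
     "shape": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9, 0.7, 0.9, 0.8, 0.1, 0.2, 0.1]},
    {"name": "design_b", "score": 60, "sequence": "GUGUGUAAACAC",
     "target": "(((......)))", "shape": [0.5] * 12},
]

for hit in find_motif(records, "GUGU", min_score=95):
    print(hit)
```

The min_score argument shows one way to combine the sequence search with a synthesis score constraint in the same query.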

SHAPE Data:

The SHAPE Data is part of the lab results. It records whether a nucleotide appeared to be bonded or unbonded in the lab test. Right now the access is mostly visual rather than numerical. Also it can be binary (bonded/unbonded) or binned/continuous measurements. Sometimes it could be useful to know how successful a particular sequence was in forming a target shape, or to search for sequences and shapes that meet some particular threshold of success. How to specify that threshold in the search is an interesting question, for instance, how to look for "near misses" that had only one mismatch. Also, some cases like GNRA tetraloops may give misleading SHAPE results; how to ignore/compensate for that?
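One possible way to express "near misses" is to count positions where the SHAPE call disagrees with the target shape. The sketch below assumes continuous reactivities where a value at or above a cutoff means "unbonded"; the 0.5 cutoff, the record fields, and the function names are all assumptions for illustration. The ignore argument is one simple way to skip positions, such as GNRA tetraloop bases, whose readings may be misleading.

```python
def count_mismatches(target, reactivity, threshold=0.5, ignore=()):
    """Count positions where the observed bonded/unbonded call differs from the target."""
    if len(target) != len(reactivity):
        raise ValueError("target and reactivity must have the same length")
    skipped = set(ignore)
    mismatches = 0
    for pos, (ch, value) in enumerate(zip(target, reactivity)):
        if pos in skipped:
            continue
        expected_unbonded = ch == "."
        observed_unbonded = value >= threshold  # assumed cutoff
        if expected_unbonded != observed_unbonded:
            mismatches += 1
    return mismatches


def near_misses(records, max_mismatches=1):
    """Names of designs that missed the target by 1..max_mismatches positions."""
    names = []
    for rec in records:
        n = count_mismatches(rec["target"], rec["shape"])
        if 0 < n <= max_mismatches:
            names.append(rec["name"])
    return names


# Placeholder data, invented for the example.
labs = [
    {"name": "perfect", "target": "((..))", "shape": [0.1, 0.2, 0.9, 0.8, 0.1, 0.0]},
    {"name": "one_off", "target": "((..))", "shape": [0.1, 0.7, 0.9, 0.8, 0.1, 0.0]},
]
print(near_misses(labs))
```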

Synthesis Score:

Lab results are assigned a synthesis score from 0 to 100. Sometimes it may be useful to constrain a search to only look at labs with a score in a given range (like score>=95). This is based on the idea that mismatches and misfolds in one part of a design could affect what might otherwise have been a successful region.

Visual/Query-By-Example:

Ultimately it might be nice to have a visual search interface like the puzzle-maker where you could build a shape, fill in some of the nucleotides and leave others as undefined (e.g. N=any) or partially constrained (R=GA, Y=CU), select some of the nodes to define a target shape (e.g. 4 branch multi-loop), then ask for matches. This might work well for associating Secondary Shape and Sequence in a query, but how to add SHAPE Data constraints etc. is even more of a UI research question.
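Behind such an interface, the partially constrained nucleotides could be translated into an ordinary regular expression. This sketch covers only the three codes mentioned above (N, R, Y) plus the four plain bases; the function names are hypothetical.

```python
import re

# Only the codes mentioned in the post; a fuller table would add the rest.
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[GA]",    # R = G or A
    "Y": "[CU]",    # Y = C or U
    "N": "[ACGU]",  # N = any
}


def motif_to_regex(motif):
    """Translate a motif such as 'GNRA' into a regular expression string."""
    try:
        return "".join(CODES[ch] for ch in motif.upper())
    except KeyError as err:
        raise ValueError(f"unsupported nucleotide code: {err.args[0]}") from None


def find_matches(motif, sequence):
    """Return (start, matched_text) for every match, overlapping ones included."""
    # The lookahead lets matches overlap, since it consumes no characters.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


print(find_matches("GNRA", "CCGAAAGG"))
```

The visual builder would then only need to emit a motif string and a target shape; how SHAPE constraints fit into the same query remains the open UI question.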

JavaScript:

There is talk of creating a programming mode where users can design and test (against the model) RNA sequences in JavaScript. Some way to integrate the database search with this functionality could be a great boon. For example, if a design could be specified where at least the initial value for some region could be selected from a query result or from a user library of shape data populated from a query result. Just thinking.....

Adrien Treuille, Alum

I'm so excited to see progress on this side of EteRNA.

Quasispecies

I had the idea of building a "fragment library" a while ago.

The idea is to break every sequence into fragments that correspond to a loop and its attached stack(s). This database could be searched based on:

-Loop type (hairpins, internal loops, bulges, multiloops, and external loops)
-Loop size (number of unpaired bases, branch distribution in multiloops)
-Length of their attached stack(s)
-Synthesis score(s) of the molecule(s) where the fragment occurs
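A rough sketch of how a dot-bracket structure could be split into such fragments follows. It assumes pseudoknot-free dot-bracket strings, and the field names (type, unpaired, stack) are invented for this example; the synthesis score would simply be copied from the parent design onto each fragment.

```python
def pair_table(structure):
    """pairs[i] is the partner of position i, or -1 if i is unpaired."""
    stack, pairs = [], [-1] * len(structure)
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def scan(pairs, start, end):
    """Walk start..end-1; return unpaired run lengths and the number of branches."""
    runs, run, branches = [], 0, 0
    k = start
    while k < end:
        if pairs[k] == -1:
            run += 1
            k += 1
        else:
            runs.append(run)
            run = 0
            branches += 1
            k = pairs[k] + 1  # jump over the whole branch
    runs.append(run)
    return runs, branches


def closing_stack_length(pairs, i, j):
    """Number of consecutive stacked pairs going outward from pair (i, j)."""
    length = 0
    while i >= 0 and j < len(pairs) and pairs[i] == j:
        length += 1
        i, j = i - 1, j + 1
    return length


def fragments(structure):
    """Split a structure into loops, each with the stack that closes it."""
    pairs = pair_table(structure)
    runs, _ = scan(pairs, 0, len(pairs))
    found = [{"type": "external", "unpaired": runs, "stack": 0}]
    for i, j in enumerate(pairs):
        if j <= i:  # unpaired position or the closing side of a pair
            continue
        runs, branches = scan(pairs, i + 1, j)
        if branches == 0:
            kind = "hairpin"
        elif branches == 1:
            if sum(runs) == 0:
                continue  # a stacked pair, not a loop
            kind = "bulge" if 0 in runs else "internal"
        else:
            kind = "multiloop"
        found.append({"type": kind, "unpaired": runs,
                      "stack": closing_stack_length(pairs, i, j)})
    return found


for frag in fragments("((..((....))..))"):
    print(frag)
```

For multiloops, the unpaired list doubles as the branch distribution: it gives the number of unpaired bases before, between, and after the branches.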

Eli Fisker

Hi Jpbida!

I'm having a problem with Git.

jpbida

GitHub has a lot of documentation describing how to clone a repository. Google "cloning a github repository" and you should be able to find a solution.

If you don't need the code and only want the data you can download the files through the web.

https://github.com/jpbida/RSIM/tree/m...

All the pdb files for individual components are in comps.tgz
The secondary structures of the components are in ss_comps.txt
The sequences are in seqs.tgz

Eli Fisker

Hi Jpbida, now I know where to look for the data. I'm still not sure how to use it, as I don't know how to tell from the data whether a sequence does well or not. But time and chat discussions will probably help. Thx for pointing me in the right direction.

jpbida

@Fisker
I just created a merged dataset that makes it a little easier to find sequences that match a target secondary structure.

https://raw.github.com/jpbida/RSIM/ma...

You can search for a target secondary structure in this file and find a sequence that has been shown to experimentally fold up into it.

For example,

Searching for (((....))) finds the sequence CCUUUAAGG, in the pdb file 1c2w (http://www.rcsb.org/pdb/home/home.do). So you have the secondary structure, the sequence, and the 3D structure. Hope this helps.

Eli Fisker

Hi Jpbida!

Big thx, now it looks more understandable. :)

I can't access the last link, where I should be able to get hold of the 3D structure. It says: HTTP Status 404 - /pdb/home/home.do) - description The requested resource (/pdb/home/home.do)) is not available. Mat got the same problem.

tsuname, Alum

Hi all,

I'm a member of the Das lab, where we have recently finished coding up a search tool for our repository of chemical mapping data. The repository is found at http://rmdb.stanford.edu/repository/ and the search tool at http://rmdb.stanford.edu/repository/a.... Maybe you guys could take a look at the tool and see if there is anything you like/dislike about it as a starting point for the EteRNA search tool. Just a bit of a warning, we are just starting to test the tool and it may break =P

jpbida

Hi Andrew,

Welcome to EteRNA. I outlined a general graph representation that could be used for search and for building classifiers or scoring functions. You can check out the description here:

https://github.com/jpbida/RSIM/wiki/C...

The idea is to represent all RNA structures as graphs and use existing algorithms to search for subgraphs.
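As a very rough illustration of what "RNA as a graph" can mean, the sketch below turns a sequence plus a dot-bracket structure into labeled nodes and edges. This encoding (one node per nucleotide, "backbone" and "pair" edge labels) is an assumption made for the example and is not necessarily the representation described on the wiki page.

```python
def rna_graph(sequence, structure):
    """Build (nodes, edges) from a sequence and its dot-bracket structure.

    nodes maps position -> base; edges is a set of (i, j, label) tuples
    with i < j and label either "backbone" or "pair".
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    nodes = dict(enumerate(sequence))
    edges = set()
    # Consecutive nucleotides are linked along the backbone.
    for i in range(len(sequence) - 1):
        edges.add((i, i + 1, "backbone"))
    # Matching brackets are linked as base pairs.
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            edges.add((stack.pop(), i, "pair"))
    if stack:
        raise ValueError("unbalanced structure")
    return nodes, edges


nodes, edges = rna_graph("GCAAAAGC", "((....))")
print(sorted(e for e in edges if e[2] == "pair"))
```

A query motif built the same way could then be handed to an off-the-shelf subgraph-matching routine, with the node and edge labels used to restrict which matches count.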

jandersonlee

I started a new thread called "Modular RNA using Junctures coupled with Index and Search" that relates to this thread. It proposes constructing a database of substructures based on using 3-pair stack ends that act as junctures between stacks and "fobs". The database would be populated using RNA sequences with known secondary structures from PDB, the EteRNA lab results, and elsewhere.

It seemed a different enough idea to warrant a new thread, but related enough to be worth mentioning here.

Quasispecies

@ jpbida - Is RSIM assembling fragments only for tertiary structure prediction? If so, do you think the approach could be extended to inverse folding as well? Some of the challenges look like you're doing both.

@ tsuname - It may be useful to add some more specific search options, especially with so many new sequences about to be synthesized. For example, allow searching by the size of bulges, the number of unpaired bases in hairpins, the dimensions of internal loops, the distribution of branches in multiloops, etc. Then it would be possible to very quickly search for all RNAs with a 2x2 loop (or some other desired element).
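For instance, a "find all RNAs with a 2x2 loop" option could be backed by something like this sketch, which lists the dimensions of every internal loop in a dot-bracket structure. The function names are hypothetical and pseudoknot-free structures are assumed.

```python
def internal_loop_sizes(structure):
    """Return (left, right) unpaired counts for every internal loop."""
    partner, stack = {}, []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            partner[stack.pop()] = i  # opening index -> closing index
    if stack:
        raise ValueError("unbalanced structure")

    sizes = []
    for i, j in partner.items():
        # Skip unpaired bases after the opening bracket.
        k = i + 1
        while k < j and structure[k] == ".":
            k += 1
        if k == j:
            continue  # hairpin: nothing paired inside
        inner_close = partner[k]
        # Exactly one branch only if the rest up to j is unpaired.
        if set(structure[inner_close + 1:j]) <= {"."}:
            left, right = k - i - 1, j - inner_close - 1
            if left > 0 and right > 0:  # excludes stacks and bulges
                sizes.append((left, right))
    return sizes


def has_internal_loop(structure, left, right):
    """True if the structure contains an internal loop of the given dimensions."""
    return (left, right) in internal_loop_sizes(structure)


print(internal_loop_sizes("((..((....))..))"))
print(has_internal_loop("((..((....))..))", 2, 2))
```

Computing these dimensions once per design and storing them in an index would keep such searches fast even with many new sequences.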

jandersonlee

@Quasispecies - my thread on Modular RNA poses one possibility in that regard. It breaks a structure into segments which could then be classified by shape/size. Then you could index back to the full sequence/structure.

merryskies

Hi,

This is an exciting project, and it would be great to get an update on how things are going with it.

Regards,
Merryskies