The OPRELA RNA Bot

  • Idea
  • Updated 9 years ago
OPRELA stands for Optimized Revision Layers. It is a bot that takes the data from the top percentage of previous lab synthesis results and computes the pairs that are more likely to bond under synthesis, while attempting to maintain a good MFE frequency. This bot is still in its infancy.



How the probability matching works

The process is really simple. It does not use the usual inverse folding models to predict the sequence. When the synthesis results are posted for a particular design, they are imported into the bot's database. This data consists of 3 parts: the string notation, the sequence, and the result data. The result data is a string that holds the probability of each bond from 0.0 to 1.0, where 0.0 is equal to 0% and 1.0 is equal to 100%.
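To make the 3 parts concrete, here is a rough sketch of how one imported lab result could be held in Python. The `LabResult` and `parse_result` names are made up for illustration, and the format of the result string (comma-separated values, one per nucleotide) and the dot-bracket string notation are assumptions, not the bot's actual storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabResult:
    structure: str        # string notation (dot-bracket assumed), e.g. "((....))"
    sequence: str         # nucleotide sequence, e.g. "GCGAAAGC"
    probabilities: tuple  # one bond probability per nucleotide, each 0.0 to 1.0


def parse_result(structure, sequence, result_data):
    """Build a LabResult from the 3 parts (hypothetical helper).

    Assumes result_data looks like "0.9,0.8,0.1" with one value per nucleotide.
    """
    probs = tuple(float(x) for x in result_data.split(","))
    if not (len(structure) == len(sequence) == len(probs)):
        raise ValueError("structure, sequence and result data differ in length")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie between 0.0 and 1.0")
    return LabResult(structure, sequence.upper(), probs)
```

A value of 0.9 in this record would mean a 90% bond probability at that position.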

The first step the bot takes is to collect the design's pairs and loops in a list. It then cycles through the data from the lab results, compares the probabilities, and rates each pair. The higher the rating, the better the chance the pair has to bond, and the highest-rated pair is added to the predicted sequence.
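The two steps described above could look roughly like this in Python. This is only a sketch: `find_pairs` and `rate_pair_types` are hypothetical names, the string notation is assumed to be dot-bracket, and rating a pair type by the average probability of its two positions across all results is an assumption about how the rating might be computed.

```python
from collections import defaultdict


def find_pairs(structure):
    """Return (pairs, loops) for a dot-bracket string.

    pairs is a sorted list of (i, j) index tuples, loops is a list of
    unpaired indices.
    """
    stack, pairs, loops = [], [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
        else:
            loops.append(i)
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs), loops


def rate_pair_types(results):
    """Rate each pair type (e.g. "GC") by its average lab probability.

    results is an iterable of (structure, sequence, probabilities) tuples.
    The averaging rule is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for structure, sequence, probs in results:
        pairs, _ = find_pairs(structure)
        for i, j in pairs:
            pair_type = sequence[i] + sequence[j]
            totals[pair_type] += (probs[i] + probs[j]) / 2.0
            counts[pair_type] += 1
    return {t: totals[t] / counts[t] for t in totals}
```

Picking the pair for the predicted sequence is then a matter of taking the highest-rated entry, for example `max(ratings, key=ratings.get)`.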

There are 3 modes the bot will use to compute the predicted sequence: Single pair checking, Quad-nucleotide checking, and Tri-nucleotide pair checking. Single pair checking considers only single pairs and not neighboring pairs.

Single pair checking is the only mode that is enabled right now.

Below is an example of how the Single pair checking works.



SN* String Notation
SPC* Single Pair Checking

As you can see, right now it's really simple, with only a few functions. It basically keeps the most probable bonding pairs. There are a few exceptions and preset measures. In loops it prefers A, C, and G over U; in order, it prefers G, then A, then C, and last U. This is visible in the n2* and n3* columns. It also prefers AU over GC pairs in the middle of stacks and GC over AU on the ends. The n1* columns show how the bot computes the most probable pair.
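As a rough illustration of these preset preferences, the sketch below applies them as tie-breakers when two candidates have the same rating. That is an assumption: how the bot actually weighs the preferences against the ratings is not spelled out here, and `pick_loop_base` and `pick_pair` are made-up names.

```python
LOOP_PREFERENCE = "GACU"  # loop bases in order of preference: G, A, C, then U
PAIR_CANDIDATES = ("GC", "CG", "AU", "UA", "GU", "UG")


def pick_loop_base(ratings):
    """Pick the best-rated loop base; ties go to G, then A, then C, then U.

    ratings maps a base to its rating; missing bases count as 0.0.
    """
    return max(
        LOOP_PREFERENCE,
        key=lambda b: (ratings.get(b, 0.0), -LOOP_PREFERENCE.index(b)),
    )


def pick_pair(ratings, at_stack_end):
    """Pick the best-rated pair type for one stack position.

    On a tie, GC/CG wins at the ends of a stack and AU/UA wins in the
    middle (tie-breaking is an assumption of this sketch).
    """
    preferred = ("GC", "CG") if at_stack_end else ("AU", "UA")
    return max(
        PAIR_CANDIDATES,
        key=lambda p: (ratings.get(p, 0.0), p in preferred),
    )
```

With equal ratings for GC and AU, `pick_pair` returns "AU" for a position in the middle of a stack and "GC" for a position at the end; a clearly higher rating still wins over the preference.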

With the other modes, the calculation is identical to how the n1* columns are computed, except that it looks for the best Quad-nucleotides or Tri-nucleotide pairs instead of just the single pairs. It uses several passes to find the most probable occurrences.

Next step is MFE optimization

After the predicted sequence for the given shape is produced, the MFE optimization layer tries to refine it. Its level of refinement is controlled by a setting from 0.0 to 1.0, depending on the integrity you want to keep. When I say integrity, I mean how much of the predicted sequence you want to keep the same.
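One way to picture the integrity setting is as a budget on how many positions the optimization layer is allowed to change. The sketch below is only an illustration of that idea: `apply_with_integrity` is a hypothetical name, the refined sequence is assumed to come from some MFE optimizer that is not shown, and spending the budget from left to right is an arbitrary choice made for the example.

```python
import math


def apply_with_integrity(predicted, refined, integrity):
    """Accept changes from `refined` while keeping `integrity` of `predicted`.

    integrity = 1.0 keeps the predicted sequence untouched, 0.0 accepts
    every change. Changes are accepted left to right until the budget
    of floor((1 - integrity) * length) positions is used up.
    """
    if len(predicted) != len(refined):
        raise ValueError("sequences must have the same length")
    if not 0.0 <= integrity <= 1.0:
        raise ValueError("integrity must lie between 0.0 and 1.0")
    budget = math.floor((1.0 - integrity) * len(predicted))
    out = list(predicted)
    for i, (p, r) in enumerate(zip(predicted, refined)):
        if budget == 0:
            break
        if p != r:
            out[i] = r
            budget -= 1
    return "".join(out)
```

For a 12-nucleotide sequence, an integrity of 0.75 would allow at most 3 positions to differ from the prediction.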

Right now I have only implemented stacks and tetraloops. Multiloops and bulges are ignored, but I will add them at a later time. I also want to add checking for equal loop energy distributions, as noted in Eli's observations.

What's next

This version of the bot is just 2 layers of calculations. What I'm hoping to achieve in the future is a stable database of synthesized designs, with an additional layer that will look at these designs and see what works and what doesn't according to certain attributes such as stack lengths and/or mirrored energy distributions.

The more data and new attributes there are for the bot to compare, the better it will understand what works and what doesn't under actual synthesis.

I'm planning on making an online version of the bot and releasing some binaries and source once it is further developed.

iojp

  • 3 Posts
  • 1 Reply Like

Posted 9 years ago


Jeehyung Lee, Alum

  • 708 Posts
  • 94 Reply Likes
This is just incredible iojp!

Simple, yet it makes a lot of sense. I have been hearing many players say good things about this bot design.

I have one question - after creating sequences from most probable bond types, how do you make sure the whole sequence folds into the target shape in the game?

iojp

I'm glad people like it. :]

@Jee - It actually doesn't do anything to make sure it folds correctly. When I first added just a few results to the program, it kept producing sequences that wouldn't work in game. The more results I added and the tighter I set the MFE optimization, the more foldable sequences started forming.

I don't want to publish a whole lot of these predictions using just the 2 layers though. It might just optimize the potential from the work already done but I'm really not sure.

What I'm planning to do next is use a modified version of Vienna RNA and then mesh the 2 results. I also want to have the bot measure and compare as many aspects as possible in order to figure out patterns and adjust its predictions based on the data.