Periodic repeats in RNA switches - How can they be programmed?

  • Updated 7 years ago
(Title snatched from Rhiju’s answer)

I have been discussing switch patterns with Rhiju. Here is a part of our discussion:

Rhiju: One other thing that I thought of, after your note -- these 'periodic' repeats appear to be hallmarks of switches that we have designed. Are they hallmarks also of undiscovered switches in the RNAs in our cells?

Yes, I think so. They will arise each time two, three (and perhaps sometimes more) nucleotides jump from one stem to another as a group. And as naturally occurring switches have stems, a strand sequence from one stem will turn up as the exact same strand sequence in another stem, thus causing a repeat. The same holds, to some extent, for their complementary strands.

(Switch diary outline 27-8-2012: Repeat sequences seem to be more allowed in the switch puzzles. A sequence like CUC binds to GAG and needs to occur twice to serve the switch.)

Whereas in single shape lab I particularly identified sequences like CUC as big troublemakers. Here is my strategy, Strand repetition ban, explaining my reasoning:

Later thoughts: I think the reason why sequences like CUC and GUG make a lot of trouble in single shape lab is that both C and G are nucleotides with a pretty strong pull compared to U and A. Having them close together, without much variation, makes them want to jump elsewhere. Which is sort of practical in switches. :)

From what I see so far, switches are more tolerant of repeat sequences than single shape lab, where my motto for all basepairs was: Twist em, baby! And my advice was: No two basepairs of the same kind beside each other, unless in very low abundance and in longer strings. Little variation makes bad bind.

Switches appear to be more tolerant of double same-turning AU-pairs too, and to some extent of double same-turning GC-pairs. Plus repeat sequences.

I actually just did a calculation the other day on the current winner. I was considering a switch strategy, and based on my single shape lab experience I hated repeats, especially double same-turning repeats. But Jnicol's 92% design made me change my mind. I have marked the same-turning double pairs. They account for 12 of the 32 nucleotides in all in the unbound shape. That is 37.5%, which is rather high.



Ok, back to sequence patterns. I think there might be patterns specific to groups of nucleotides that jump from one stem to the next. And I think that can be used for a robot switch strategy, as the repeat sequences turn up in pairs in a shape (the same sequence pattern in at least two strings in the same shape, if they don’t spread over the whole shape, which they probably won’t when we make more balanced designs).

But I think we might later be able to find patterns also for groups of nucleotides that jump from loop to loop (fewer of those jumps happen), nucleotides that jump from string to loop, and nucleotides that jump from loop to string.

Rhiju: Can we come up with a computational metric to discover such components in our 'junk' RNA/DNA?

Could one imagine running a program that looks for e.g. a sequence like CUC and predicts a potential repeat in a switch?

If such a program existed, it should break each strand into fragments of two and three nucleotides. I don’t think the repeats get much longer than that. (From what I see happening for now, but it might change if we get switches with much longer stems.) Then the fragments should be held up against the string they jump into in the second shape and checked for overlap. It should also be possible to determine in which direction to search, as the two shapes jump in a defined direction relative to each other. So one will not need to look at both the left and the right string in a shape for a match for a fragment in the middle string, but only at one of them. And I guess it will be computationally quite heavy.
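A minimal sketch of the fragment comparison described above, in Python. It only shows the idea of cutting two strands into fragments of two or three nucleotides and listing the fragments they share. The function names and the two example strands are invented for illustration, and positions are counted from 0.

```python
def fragments(strand, k):
    """Return every length-k fragment of a strand with its start index."""
    return [(i, strand[i:i + k]) for i in range(len(strand) - k + 1)]


def shared_fragments(strand_a, strand_b, sizes=(2, 3)):
    """List fragments that occur in both strands.

    Each hit is (fragment, index in strand_a, index in strand_b).
    """
    hits = []
    for k in sizes:
        # Index strand_b once so each fragment of strand_a is a dict lookup.
        positions_b = {}
        for j, frag in fragments(strand_b, k):
            positions_b.setdefault(frag, []).append(j)
        for i, frag in fragments(strand_a, k):
            for j in positions_b.get(frag, []):
                hits.append((frag, i, j))
    return hits


# Hypothetical strands, invented for illustration only.
middle = "GCUCA"
right = "ACUCG"
for frag, i, j in shared_fragments(middle, right, sizes=(3,)):
    print(frag, i, j)  # prints: CUC 1 1
```

Because only one neighbouring strand is compared at a time, the work grows with the product of the two strand lengths, which is small for stems of this size.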

Also one needs to decide which fragment sequences to go for. I think it will be crucial to go for the C instead of the G, as the C will determine its partner, whereas G can pair up with both C and U. So some sequence fragments will be more interesting to look for than others. And some might turn up with a higher frequency too. (I got inspiration for this section from a recent discussion with Mat on his system for scoring designs.)

Rhiju: ... perhaps someone out there with some computational skills could help write a 'discriminator' of eterna switch sequences from non-switch sequences, and then apply it to bacterial and eukaryotic genomes. There are a few 'riboswitches' and 'attenuator' systems that have been well characterized in these organisms, so they could serve as 'positive controls', i.e. components that should emerge from such a search.

I like your suggestion of using naturally occurring switches as check points, and of seeing if a repeat sequence search could be useful in practice. Did you mean using sequence repeat fragments as a direct search query in whole genomes? Ah, I think I understand now. Let's go BLAST it! :) This is just awesome. I sort of assumed that RNA switches were being coded up as genes. Are you saying they are lying around in the "junk"? And that we just need a method to find them?

Rhiju: Well, there's a lot of RNA in 'untranslated regions' that flank 5' and 3' ends of protein coding genes, but are part of the RNA messages for those genes. Also there are huge parts of our genomes that don't code for proteins but appear to get transcribed to RNA. It's kind of a big mystery what, if anything, those RNAs are doing, but one hypothesis might be that they harbor a lot of sensors and switches and act as computing elements.

I would love to have someone with programming skills have a go at this and be creative on how it can get done.

Eli Fisker

  • 2311 Posts
  • 531 Reply Likes

Posted 7 years ago


Brourd

  • 466 Posts
  • 86 Reply Likes
So, as long as the aptamer loop is a part of a base pair in the 1st state, we should see highly repeated designs. And what happens when we add a different molecular trigger, like theophylline, into the equation? Here is a view of the secondary structure, using the sequence from the PDB entry.

http://www.pdb.org/pdb/explore/explor...



And, here is a screenshot of the 3 Dimensional structure, using Chimera molecule viewer.



Looking at this aptamer, we see the number of sequences that could be potentially repeated in a design increase from UCU, UUC, UAU, etc. to the point where you would have to check for several repeating sequences throughout the RNA, ranging from AUA, to GGA, to GUC. I don't personally know what the sequences of natural junk RNA and DNA can be, but how many times would a sequence have to be repeated in a design before we can say it is a switch? 2, 3, 4, 5 times? In the lab puzzles, we only see sequences repeat a handful of times, but who knows how many times a sequence could be repeated in an RNA without it actually being a switch.

Then, what about RNA like FMN Aptamer 2.0? In that lab, the locked bases of the aptamer loop were not directly affecting the base pairs in the design. So, if a switch was like that, we would actually have to pull a sequence out of the RNA, and determine how many times that sequence could potentially pair up with other sections of the RNA. So, it would have to break the RNA up into multiple 3 base sequences, and then determine how many times a potential match could occur. These are just a few of my thoughts on the subject.
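A rough Python sketch of that last step: slide a 3-base window along the RNA and list where each window could pair with another stretch further along the sequence. It assumes Watson-Crick pairs plus the G-U wobble, counts positions from 0, and leaves a minimum loop of three bases between the two windows. The names `can_stack`, `pairing_sites` and `min_loop` are made up for this example, and no energies are considered.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_stack(seq, i, j, k):
    """True if seq[i:i+k] can pair antiparallel with seq[j:j+k]."""
    return all((seq[i + n], seq[j + k - 1 - n]) in PAIRS for n in range(k))


def pairing_sites(seq, k=3, min_loop=3):
    """Map each window start to the later window starts it could pair with.

    min_loop is the smallest number of bases left between the two windows.
    Each candidate pairing is reported once, under the earlier window.
    """
    sites = {}
    last = len(seq) - k
    for i in range(last + 1):
        partners = [j for j in range(i + k + min_loop, last + 1)
                    if can_stack(seq, i, j, k)]
        if partners:
            sites[i] = partners
    return sites


# Toy sequence invented for illustration: CUC at 0 can pair with GAG at 6.
print(pairing_sites("CUCAAAGAG"))  # prints: {0: [6]}
```

The number of partners per window is then a simple count of how many times a potential match could occur for that 3-base sequence.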

Eli Fisker

It would definitely be interesting if something could be said about how many times a repeat would occur, depending on the number of stems, their length or the length of the sequence.

I find it likely that aptamers could cause repeats or patterns. They are groups of nucleotides that jump from string to loop and from loop to string. The aptamers we have seen in game have been very similar in color content of the bases on both sides of the ring, and thus they will be able to spread repeats too. That is, once the blue/green pattern coming from the locked nucleotides in the lab FMN Switch 2.0 gets eliminated and balanced out. I suspect that the aptamer repeat could then go into action.

Do you see this aptamer repeat pattern in the first two labs?

Brourd

Actually, I believe you are correct, the first lab did not have any repeating patterns, as the stack bonding to the aptamer loop switched to a loop, and the other sections of the design were loops bonding with loops to create a stack. The only possible way one could create a program in order to determine if that was a switch or not, would be to take the sequences in the loop, and determine if there are any complementary sequences.

Here is Eli's mod of Tebowned as an example





Looking at complementary sequences for 34-36, we can already see there are 2 options. Ultimately, it comes down to the length of the sequence that you are searching for and you would probably need to match multiple sequences.

Eli Fisker

Thx, interesting note.

Eli Fisker

Here is more about the background of this idea. In the beginning I didn’t know what caused the pattern, but I ended up getting a clearer idea why. It all started with a blue/green pattern I noticed some time back. Some of this content comes from my Switch diary. The rest is from the mail discussion with Rhiju that I referred to in the post intro.

BLUE/GREEN PATTERN

There is a special pattern turning up in a majority of the solves for the current lab. There is a surplus of blue and green on one side of the strings. I used my knowledge of this pattern to solve some of the switch puzzles.

I have marked the repeat sequences in Starting point by Alacarus, as they stand out especially in this design. (Outline from my "Switch diary" 27-8-2012 where I started writing about this pattern.)



I later described this blue/green pattern, which turns up on a specific side of the strings, to my cousin. And he summed it up real nicely: the blue/green pattern didn’t show up on the north side of the puzzle.

I don't know what causes it. There can be different things in play. Either the pattern I see can be partly caused by the locked basepairs. Or it can be encouraged by the energy model, which I find a bit more likely. The energy model encourages it more than nature, though even nature seems to favor this tendency. I will look forward to more lab results. If the tendency continues, for shapes that don't move far from each other, it is handy knowing that 60-70% of the blue green bases have to be on a particular side of the string.

The first two labs did behave differently in relation to this pattern. I think it is because the strings in the two shapes fold up farther away from each other; they don't just jump a few nucleotides to one side to get to the other shape.

Actually it might be the jumping from one shape to another with just 4-5 nucleotides between most of the jumps that is the origin of this pattern. When I sit and watch the design Starting point and switch back and forth between the two shapes, I really think it is. The pattern is started by the locked nucleotides and gets repeated from there. So the repeats get induced by the shapes being close to each other.

Unbound: The locked nucleotides crave blue and green.


Bound: And from there the pattern spread


I think we can use this knowledge: shapes that jump short will create repeats that spread and cause repetitiveness in the solution, on a strand basis.

I think this pattern will mainly hold for shapes that don't jump far from each other. And as most switch challenge puzzles didn't, I could use the pattern for solving them. It was my second step after filling out the puzzle in a sudoku like manner.

(I wrote most of the text below before the newest batch of switch puzzles.)

I have been wondering a couple of things about the blue/green pattern I told you about, namely about our experimental setting. I think the pattern arose in the last lab and the challenge switch puzzles solely due to the colors of the locked nucleotides (yellow and red) and the short distance the two shapes moved apart from each other. Both things added up and caused the pattern to spread in the design.

So I'm wondering if the exact opposite pattern would arise in the design if the locked nucleotides had their colors changed to blue and green. Would the yellow and red nucleotides end up mainly in the south part of both shapes, as the blue/green pattern did? I think this could be tested by trying the idea out in the switch challenge puzzles and checking if the opposite pattern shows up in our solves. So almost the same puzzle as the lab one (or preferably the same) but with differently colored locked nucleotides.

I'm sure the colors of the locked nucleotides and their position are picked for a particular reason. If they can't be changed, we have a pattern we can use for one particular type of switches. But for now I'm more worried about the experimental setting creating this pattern. No matter which, we should be able to use this to learn more about our switches. When I saw the pattern, the first thing I realized was its potential for puzzle solving. So I shared it with some of the new players that attempted solving switches and were stuck, and it did help some.

New puzzles comment: The blue/green pattern is still present in the south part of one shape of my solve of Pinchers, while in the other puzzles it looks like the placement and more mixed colors of the locked nucleotides toned down the pattern. Drakes still has a pattern of blue and green in its unbound shape, while Swimmer and Sad little hairpin follow the pattern to a much lesser degree. So I guess I partly got my answer already. It is the locked nucleotide colors and placement that cause the pattern. Let me guess, you didn't find this blue/green pattern in the naturally occurring RNA switches.

Suggestion for picking of lab puzzle, if you want to avoid this green/blue pattern repetition all over the puzzle, look through our switch puzzles solves and pick one where this pattern doesn't occur to a big extent. You can use the puzzle solves to get an idea if the puzzle is balanced, before sending it to lab.

Rhiju: I think your hypothesis about the locked residues setting the overall pattern is probably right -- we can actually test this by permuting the two strands on either side of the FMN aptamer. I'll bring this up with devs today!

Brourd

FMN Switch 2.0 is an interesting lab, as over 50% of the bases are affected by the locked aptamer bases, when they only account for about 20% of the bases in the puzzle. Here is a visual of how it spreads through the design.

You can do this with any lab or switch challenge in order to determine which bases have their identity determined or influenced by the locked aptamer bases.

Now, you compare this to the 2nd lab, FMN Aptamer 2.0. There, the locked bases are in the loops, and have little effect on the identity for the rest of the design. Even more interesting, we got a winner in the 1st round, as our designs were not limited to a specific pattern, and puzzle design was really just like any other lab we have done in the past. However, when it comes to the in silico problem, the only time the locked bases cause an issue is when they lock certain parts of the design, limiting us to specific solutions, like the switch challenge FMN Switch 2.

jandersonlee

  • 555 Posts
  • 131 Reply Likes
Yes. If you follow through from the fixed bases you find that many others are constrained. In some cases there is only one option (e.g. paired with fixed A, must be U; or paired with C, must be G) while in others cases it is more open (e.g. paired with U, must be purine A or G; or paired with G, must be pyrimidine: U or C). I once tried solving a switch using an Excel spreadsheet to track the constraints, but it got complicated.
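The constraint tracking described above can be sketched in a few lines of Python instead of a spreadsheet. This is only an illustration under stated assumptions: positions are counted from 0, `fixed` holds the locked bases, `bonds` lists every pair of positions that must pair in at least one of the two states, and the function name `propagate` is made up here.

```python
# Allowed partners for each base, including the G-U wobble pair.
PARTNERS = {"A": {"U"}, "C": {"G"}, "G": {"C", "U"}, "U": {"A", "G"}}


def propagate(fixed, bonds, length):
    """Narrow the allowed bases at each position from fixed bases and bonds.

    fixed:  {position: base} for locked bases
    bonds:  (i, j) position pairs that must be able to pair
    Returns a list of sets of allowed bases, one set per position.
    """
    allowed = [set("ACGU") for _ in range(length)]
    for pos, base in fixed.items():
        allowed[pos] = {base}
    changed = True
    while changed:
        changed = False
        for i, j in bonds:
            for a, b in ((i, j), (j, i)):
                # Keep only bases at a that still have a possible partner at b.
                keep = {x for x in allowed[a] if PARTNERS[x] & allowed[b]}
                if keep != allowed[a]:
                    allowed[a] = keep
                    changed = True
    return allowed


# A locked A at position 0, followed along a chain of bonds 0-1, 1-2, 2-3.
print(propagate({0: "A"}, [(0, 1), (1, 2), (2, 3)], 4))
```

In this toy chain the locked A forces a U at position 1, which leaves A or G at position 2 and then C or U at position 3, the same narrowing as in the examples above.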

It might help though if we could optionally have colored marker rings (e.g. orange for A or G, aqua for C or U) to help track the constraints.

I sometimes start a switch by initially pairing U+A and G+C in the initial setup, then know that I can *often* morph C->U and A->G and *sometimes* morph G->A and U->C.

Eli Fisker

You did some nice explaining. I agree that many other bases are constrained by the locked nucleotides. I like the idea of marking the partners, and the coming partners of those, each time one of the locked nucleotides gets presented with a matching nucleotide. That way one can see if the move one plans potentially gets one in trouble somewhere else in the puzzle.

jandersonlee

Ideally when we get around to "programming" EteRNA scripts, it would be nice to have some "declarative" way to specify constraints. For example from the current switch:

bondsWith(60,10) and bondsWith(10,55) and bondsWith(55,16) and unpaired(16,48) and bondsWith(48,37) and bondsWith(37,28)

Or, using / for bonds and ! for unpaired:

chain(60/10/55/16!48/37/28)

Setting any one of these seven bases potentially has an effect on the other six. Aside from 16!48 where a C in one does not preclude an A in the other, they must be purine/pyrimidine/purine/... (or pyrimidine/purine/pyrimidine/...).

Having these sorts of constraints may help to limit the search space.
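As a hedged illustration of how such a declarative chain could be read by a script, here is a small Python sketch. The chain notation is the one proposed above; the parser and the helper names (`parse_chain`, `alternate_classes`) are invented for this example and are not part of any EteRNA scripting interface.

```python
import re


def parse_chain(text):
    """Parse 'chain(60/10/55/16!48/37/28)' into bond and unpaired lists.

    '/' between two positions means they bond; '!' means they are unpaired.
    """
    inner = re.fullmatch(r"chain\((.*)\)", text.strip()).group(1)
    tokens = re.split(r"([/!])", inner)
    positions = [int(t) for t in tokens[0::2]]
    operators = tokens[1::2]
    bonds, unpaired = [], []
    for left, op, right in zip(positions, operators, positions[1:]):
        (bonds if op == "/" else unpaired).append((left, right))
    return bonds, unpaired


def alternate_classes(bonds, start, start_class="purine"):
    """Follow bonds from start, alternating purine and pyrimidine."""
    flip = {"purine": "pyrimidine", "pyrimidine": "purine"}
    classes = {start: start_class}
    frontier = [start]
    while frontier:
        pos = frontier.pop()
        for a, b in bonds:
            for here, there in ((a, b), (b, a)):
                if here == pos and there not in classes:
                    classes[there] = flip[classes[pos]]
                    frontier.append(there)
    return classes


bonds, unpaired = parse_chain("chain(60/10/55/16!48/37/28)")
print(bonds)     # prints: [(60, 10), (10, 55), (55, 16), (48, 37), (37, 28)]
print(unpaired)  # prints: [(16, 48)]
print(alternate_classes(bonds, 60))
```

The alternation stops at the `16!48` link, so bases 48, 37 and 28 form a second run whose starting class has to be chosen separately.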


Eli Fisker

I like your idea of setting constraints. It looks like you have spread them out at a relatively even distance, and I'm sure you have some thoughts behind that. I especially like that purine/pyrimidine part.

Brourd

I decided to do the same thing as my previous example of FMN Switch 2.0 with the Switch challenge Pincers, which currently has the most solvers of the new batch of switches.

So, the ultimate question with this is, would this be a good lab? Only the identity of 2 base pairs in the 1st state can be determined by the player, and up to 5 base pairs in the 2nd state. Another interesting question is, why does this have more solvers than the other switch puzzle? Is it easier for players to follow along as the identity of most bases are determined by the aptamer loop? Has it just been mentioned more in the chat compared to the other switch challenges?

Eli Fisker

I really like the demonstration you did here.

Eli Fisker

I was on a longer walk, when I realized something about switches. Something that made me understand them a little better. Some of this I have already mentioned and some will be commonly known.

I think that stem nucleotides that jump to become stem nucleotides in a different place have a much stronger pull than groups of loop nucleotides (aptamer included) that jump to become stem, and vice versa.

When I thought about this, I saw something that is very characteristic of RNA designs and makes loops stand out compared to stems. Imagine that you take two single loop strings and pair them up so they become a stem. That stem will on average look very different from an average stem. What is typical of a stem is that it has an overall more negative free energy than two loop strings would have if forced to pair up. Or said another way, a strand from a stem has a different energy potential than a loop string when matched with a partner string. Some colors of nucleotides will more often end up in a stem than in a loop. Put differently, it makes much better sense for a G and a C to form a basepair (and so a stem) than to end up in loops, as they have strong binding forces. The nucleotides with the strongest pull will be drawn towards stem formation. That was what I tried to show in this post on Clean halo in multiloop ring.

I later made the strategy Gravitation of nucleotides based on the post above. Basically reverse the recipe I made for stem formation, and I think it will indicate how much trouble these colors of nucleotides will cause in switch loops.

There will be a higher ratio of C's in stems compared to loops, and generally a higher ratio of A’s in loops. The difference in color distribution between loop and stem stands out especially clearly in single shape lab. So the distribution of nucleotide colors is a little different depending on whether it is a string or a loop. I think that holds for both single shape lab and switch lab, though the nature of the switches causes many more colored nucleotides to end up in loop ring, multiloop ring and hook area.

To make a long story short, I think I now understand why switches don't necessarily need to have GC-pairs at the end bases of a string, like they did to a great extent in single shape lab. (The loop nucleotides that jump and don't end up in either loop, multiloop ring or hook have a habit of turning up in the beginning of a string.) It is better for the color distribution in the loop that the nucleotides from the closing basepairs, which will jump into it, are not G or C, as that would raise the G and C content in the loop area.

I think this is why we sometimes see a somewhat higher distribution of GC-pairs in the middle of strings and fewer at the end bases, compared to single shape lab: an attempt to keep as many strong-pulling nucleotides as possible out of the ring area.

Of course this will all need to be looked into, but I just wanted to share my thoughts on this.

jandersonlee

Any interest in looking at data from what nucleotides show up in natural loops, end-loops, and multi-loops? I still have that old Protein Database RNA data around with about 62K loops extracted from real RNA.

Eli Fisker

Sequences of real RNA. Sure, put it up. :)

Eli Fisker

I mentioned this in the post above:
“The loop nucleotides that jump and doesn't end up in either loop, multiloop ring or hook, have a habit of turning up in the beginning of a string.” (this is what happens in designs with 4 basepair long stems)

Now I’m thinking that when we get switches with longer strings, GC-pairs might show less preference for being in the middle of the stem rather than at the closing basepair, compared to now, where the stems are quite short. I think longer strings will lessen the pressure on the loops, because more basepairs will jump from stem to stem instead of stem to loop. And thus fewer of the nucleotides in the loop will be of such colors that they wish to pair up elsewhere.

Brourd

This is actually the subject of a strategy I am writing up xD

So, I guess I could just spill what I've been looking at so far.

Anyway, when it comes to loops, base identity does matter, although not to as great an extent as we like to make it seem. Now, my strategy, in a nutshell:

Loops Defined

Open Loop/Hook


Internal Multiloop


Internal Loop


End/Hairpin Loop


Cytosine Strategy

Stack to Stack - The preferred state for cytosine.

Stack to loop - A state that is not wanted, however, if necessary, the identity of the loop must be either a hairpin loop or internal loop, preferably with few other bases around it. The open loop is allowed somewhat, but no hard rules have been determined for that yet.

Loop to Stack - Like the above, the preferred loops for cytosine are hairpin loops and internal loops. The open loop is allowed, and the general rule is to make sure the cytosine is isolated from other bases and surrounded by A's.

Loop to Loop - chances of needing this are slim, but like the above, hairpin loops and internal loops are the preferred starting point.

I was in the middle of coming up with strategies for the other bases :P

Some patterns I saw, Uracil is great in loops. It can be used in internal multiloops, the open loop, hairpin and internal loops, and it seems to do quite well with switching between most states.

Adenine and guanine are both great for stack to loop switches.

So, that's what I had so far.

Eli Fisker

More on the idea of programming a tool to search for periodic repeats in a complete genome to identify potential RNA switches.

Rhiju has helped me get started, and Pablo has some fine ideas on how we can get on with this.

From a mail discussion with Pablo; my questions and answers are marked with my name at the front of the section.

Hi Eli,

I think the pattern repetition hypothesis is quite interesting and RFAM is definitely the resource you want to use to give you a tentative answer. Like Rhiju mentioned, the easiest way to go would be to look at repetitions occurring in riboswitches versus random RNA sequences and RNAs that we wouldn't expect to have switching behavior (ribosomal RNA, also known as rRNA, is a good example of this). So, the way I would go about this would be to:

1) Quantify sequence repetition somehow: How can we summarize the periodic repeats of an RNA sequence in one number? You can think about it and create a scoring system yourself or use some webserver to give you the answer (see, for example). Note that this question about periodicity in sequences of DNA and RNA is an old problem in computational biology, and many people have come up with different ways to find pattern repeats and sequence scoring...the methods have found lots of regions in our own genomes that are highly repetitive and that we have no clue why they are there.

Eli: I also think stem to stem repeats will differ from stem to loop and loop to stem repeats, and from loop to loop repeats too. And I think stem repeats will be the strongest and most interesting search factor. Of course, when we search in junk DNA we won't know what is what. But I think that means the most interesting fragments to look for are the ones that are typical of stem regions, according to known switch data. What I'm saying is that those should score highest in a scoring system. A periodic repeat should receive a lower score the farther apart its copies are.

Eli: So with a scoring system, do I understand this correctly: some fragments with periodic repeats will be more worth looking for than others. Like a fragment from a stem that jumps to a stem. If it has 1 or 2 C's, it will be more of a determining factor than if it has two U's or two A's. Also, some repeats are more likely to be true repeats and not just random ones.

2) Once you have a scoring mechanism in place, try writing a program (or convince someone to do so =) ) to score all the sequences contained in a file. With that, you could score the sequences in RFAM.

3) Try scoring also a shuffled version of all sequences in RFAM, that is, take each sequence and scramble the nucleotides. You can do that per single nucleotide or per two nucleotides (like Rhiju was suggesting), per three nucleotides, etc. I already have a scrambled version of RFAM per single and double nucleotides if you are interested (I can put it in a dropbox or somewhere else, the files are too big to fit by email).

Eli: I will first be interested in the per-two-nucleotides version, like Rhiju. I think most of the repeats will be of that size. I do see a few of 3. But if the switches stand out against other types of RNA at two-nucleotide repeats, then it would be interesting to run 3 too.

What I have are two files, one with all the RFAM sequences where each has been shuffled per nucleotide (just reordered the nucleotides in the sequence) and one where the sequences have been shuffled per two nucleotides (grabbing pairs of nucleotides and reordering them at random in the sequence). These can serve as good controls, since you expect that repetitions would disappear when you randomize the sequences.

4) Plot lots of histograms (you can do this in excel for example). You can plot a histogram of your RFAM scores, then a histogram of scores for only riboswitch sequences, then a histogram of scores of only ribosomal RNA sequences (look at the RFAM names, where you see rRNA, that is a ribosomal RNA), then a histogram of scores of RFAM scrambled by one and two nucleotides. Do the histograms look like each other? Does the riboswitch scoring histogram look a lot different than the rRNA and scrambled histograms? If the riboswitch histogram looks shifted as if the scores are higher than the other histograms then you are on to something!

This can seem daunting at first, but in reality only very little coding experience is needed, you can definitely write simple scripts that would solve this in less than 50 lines of code!

Let me know how it goes.

Pablo
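A minimal Python sketch of steps 1 to 3 above, under loud assumptions: the score used here (the fraction of length-k windows whose fragment occurs more than once in the sequence) is just one possible way to put repetition into a single number, not an established measure, and all function names are invented. With k=2 there are only 16 possible fragments, so this particular score saturates quickly on long sequences; k=3 is used as the default for that reason.

```python
import random
from statistics import mean


def repeat_score(seq, k=3):
    """Fraction of length-k windows whose fragment occurs more than once."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not windows:
        return 0.0
    counts = {}
    for w in windows:
        counts[w] = counts.get(w, 0) + 1
    return sum(1 for w in windows if counts[w] > 1) / len(windows)


def shuffled(seq, rng):
    """Control sequence: the same nucleotides in random order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def compare_to_controls(sequences, k=3, n_shuffles=100, seed=0):
    """Return (mean real score, mean shuffled score) for a set of sequences."""
    rng = random.Random(seed)
    real = [repeat_score(s, k) for s in sequences]
    control = [repeat_score(shuffled(s, rng), k)
               for s in sequences for _ in range(n_shuffles)]
    return mean(real), mean(control)


# Toy sequence invented for illustration; real input would come from RFAM.
print(repeat_score("CUCAAACUC"))
```

The per-sequence scores from `repeat_score` are the numbers one would then plot as histograms for riboswitches, rRNA and the shuffled controls in step 4.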

Eli: I have the beginning of an idea for programming a periodic fragment search. We know twin repeats will usually both sit within one half of the RNA switch molecule or within the other half, and typically in two stem strands next to each other. (Okay, a repeat can jump from one half of the design to the other at the middle of the sequence.) But what I'm saying is that they should not count as repeats if they are too far apart in the sequence. They will have a limited number of nucleotides between them. So that should be put in as a constraint in the search program.

More on how it started

I have been playing a little with RFAM to get a feel for the switches and see what I would find. Get an introduction to Rfam - database over natural RNA

Rhiju suggested that I check another type of RNA that was not a switch, to see if that also had repeats. As I didn’t know much about different types of RNA, I asked for an example. I got: The ribosomal RNA sequences should be less likely to have secondary structure 'switches'.

I have been comparing switch RNA to non-switch RNA, to see if periodic repeats were just as common in non-switch RNA.

I ended up getting a little sceptical about the microRNA mir’s. I caught one of them turning up in both my searches, on riboswitches and on Ribosomal RNA. So is it both a ribosomal RNA and a switch? I checked a random one, as it looked like there was overlap between my two search lists. Here it is:

RF02021 mir-3179 MicroRNA mir 3179

The mir’s did look like they had some repeats, and they showed up in both my search on Ribosomal RNA and my search on RNA switches. So I ended up excluding MicroRNA mir’s from my search by going for the Riboswitches. I did the search "Ribosomal RNA -switch" and made a random check, and I will say Ribosomal RNA has a much lower repeat rate than Riboswitches.

Rhiju: Another test set (a 'negative control' set) would be to take the switch sequences and 'scramble' them. Or better, scramble in blocks of two, to maintain the RNA's dinucleotide composition.

If a starting sequence is:

ACGUCGAU...

I meant just breaking it into doublet blocks:

AC GU CG AU ...

And then writing a program to permute them randomly:

GU AC AU CG ....

Then test your idea on these sequences.
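The doublet shuffle described above is short to write in Python. This is a sketch with an invented function name; note that it keeps the in-frame doublet blocks intact, while the dinucleotides that span two neighbouring blocks will generally change.

```python
import random


def shuffle_blocks(seq, block=2, seed=None):
    """Cut seq into consecutive blocks and return them in random order.

    A trailing block shorter than `block` is kept and shuffled with the rest.
    """
    rng = random.Random(seed)
    blocks = [seq[i:i + block] for i in range(0, len(seq), block)]
    rng.shuffle(blocks)
    return "".join(blocks)


# The starting sequence from the example above; the output order is random.
print(shuffle_blocks("ACGUCGAU"))
```

Passing a `seed` makes a given control set reproducible, which helps when the same scrambled sequences need to be scored again later.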

Repeats should occur in longer stems in other RNA types too, but not to the same extent. And I guess they won't be as periodic either.

Another thought: I’m wondering if aptamers (okay, I know natural RNA doesn't have aptamers, as those were evolved in the lab, but I mean the binding part of the switches) have enough characteristics to be used for fragment search too. If they have, they could be the second search deployed, after a periodic repeat search has revealed a high content of fragment repeats that could suggest an RNA switch.

jandersonlee

  • 555 Posts
  • 131 Reply Likes
I suspect that to be critical a repeat probably needs to (be able to) form a stack with lower than some threshold of energy, at a guess around -2.0 kcal, although it might be less. (What is the energy distribution of a Brownian-motion bump, I wonder?)

AU can bond with AU (-1.1) or GU (-1.4) and is probably not as critical a repeat as GC, which can bond with GC (-3.4) or GU (-2.5). On the other hand, the triple UAU can bond with GUA (-2.7), AUA (-2.4), GUG (-2.4) or AUG (-2.2), so a UAU repeat might have additional effect because it can form a stack with four different sequences. (Note, one should probably not ignore the boost from a G-A bond in a loop when considering stack energies.)

It would be interesting to see the %distribution of pairs (16), triples (64) and quads (256) in both switching and non-switching RNA and whether some show up more often than others. (I don't know if there is enough lab/database data yet to see statistical significance on triples and quads, but there may be enough for pairs.)
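A minimal sketch of how such a distribution could be tallied, assuming plain sequence strings as input. The function name `kmer_percentages` is hypothetical; with k = 2, 3 or 4 there are at most 16, 64 or 256 possible fragments, and only the ones that actually occur are reported.

```python
from collections import Counter


def kmer_percentages(seq, k):
    """Return the percentage of each overlapping k-mer found in seq.

    Hypothetical helper: sequences are upper-cased and T is read as U
    (an assumption, so that DNA-style database entries can be used too).
    """
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: 100.0 * n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    for kmer, pct in sorted(kmer_percentages("GGACUUCGGUCC", 2).items()):
        print(kmer, round(pct, 1))
```

Comparing the resulting tables for a set of switching and a set of non-switching sequences would show whether some fragments turn up more often in one group.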

Eli Fisker

  • 2311 Posts
  • 531 Reply Likes
Hi JL!

I find your energy approach interesting. I think you may be right that there might not be enough triplets. I don't believe in quads either, though I have seen one. I have made a primitive start on a scoring system. Here is what I have got so far:

I checked two nucleotides at a time. I tried to limit the stem-to-stem stretch to around 25 nucleotides of sequence between the repeats. Also, I didn't count repeats that were less than 3 nucleotides apart, as a loop should have a chance to form between the repeats. I also ignored repeats that were already paired up. I started in reading direction, left to right, like in our lab. I also tried not to count the second member of the first set of repeats again if there was more than one set of periodic repeats. Nor have I counted periodic repeats as repeats if they were on the same strand. I only picked RNA switches that had "riboswitch" in their name.
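The sequence-only part of these counting rules could be sketched roughly as follows. This is one interpretation, not the procedure actually used for the hand count: the name `find_periodic_repeats` and its parameters are hypothetical, and the structure-dependent rules (skipping repeats that are already paired up or that sit on the same strand) are left out because they need a secondary structure as input.

```python
def find_periodic_repeats(seq, k=2, min_gap=3, max_span=25):
    """Scan left to right for repeated k-mers that could sit in two stems.

    Assumptions (hypothetical reading of the rules):
      - at least min_gap nucleotides lie between the two copies,
        so a loop has a chance to form;
      - the two copies start at most max_span nucleotides apart;
      - a position already used in one repeat set is not counted again.
    Returns a list of (fragment, first_start, second_start), 0-based.
    """
    used = set()
    pairs = []
    n = len(seq)
    for i in range(n - k + 1):
        if i in used:
            continue
        frag = seq[i:i + k]
        first_j = i + k + min_gap
        last_j = min(i + max_span, n - k)
        # Take the nearest later copy that satisfies the distance rules
        for j in range(first_j, last_j + 1):
            if j not in used and seq[j:j + k] == frag:
                pairs.append((frag, i, j))
                used.update((i, j))
                break
    return pairs


if __name__ == "__main__":
    print(find_periodic_repeats("GGAAAGG"))
```

The defaults mirror the numbers in the post (two nucleotides at a time, at least 3 apart, around 25 as the limit), but they are parameters so that the distance can be varied.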

What already stands out after analysing just a few riboswitches is that repeats show up in a particular pattern. If there is one set of periodic repeats, there is quite a big chance of there being one more set of repeats of the same kind. That goes in particular for GG, CC and to a lesser degree UU and AA, in that order. GG's in particular, as they can have 2 different partners, as opposed to CC's. Another one strongly represented is GC and GC, which is not so odd, as these have a really strong pulling power. Actually, all repeats with G or C in their combo have a higher abundance, those with A and U to a lesser extent.

Number of GG repeats per design.


Number of CC repeats per design.


Other repeats are really low in abundance, which can also be useful knowledge when ruling out some potential switches from a full genome search.

Smaller switches have a lot fewer repeats, probably simply because they are short. So I predict that a fragment periodic repeat search probably won't be much help with those. I guess they may have stem-to-loop and loop-to-stem repeats, but I haven't checked yet. I analysed stem-to-stem first, as I think those will have the strongest effect and be most abundant. I suspect other patterns for loop-to-stem and stem-to-loop periodic repeats.

Picture of my data dragged together. Link to spreadsheet.

Notice how the repeats are centered on the middle. My data might be less precise for the first riboswitches, as I was still working out how to count the repeats.



Actually, all repeats with G or C in their combo have a higher abundance, those with A and U to a lesser extent.

The designs with no repeats are typically quite small. But as you can see, there is a big difference in how many repeats these riboswitches have. After watching the switching behavior in the present lab, I think I should allow a longer distance between the periodic repeats. Also, I think the distance for a periodic repeat will vary depending on the RNA switch size.

Some repeats are even 4 nt long.


jandersonlee

  • 555 Posts
  • 131 Reply Likes
I have a hypothesis that many RNA structures fold starting with the hairpins/arms and work back towards the neck. For this, possible matches that are in close proximity would be important as a starter, like a spark to start a fire. But they need not be on the same side of the target, nor identical. For instance, a GG might match with a CC on one side of it when folding one way and a CU on the other side of it when folding another way. The CU and CC are not identical repeats, but do target the same matching pair.

As a secondary complication, once a hairpin or arm(s) forms, it brings potentially matching pairs closer together, potentially enhancing the chances for interaction. For example, in the diagram above, the UC|GA closing the 7-6 loop between the two red arms are initially far apart, but after the first hairpin forms they are closer together and more likely to interact. Likewise, the neck pairs are far apart until all of the other arms form. So the potential for interaction of a pair or pairs would seemingly increase if there are other matches in between them. I'm not quite sure how to experimentally test this yet.

Furthermore, there are only 5 combinations of pairs that can hold closed a hairpin on their own, and those are over a small range:

GG|AC and CG|AG can pinch off a stable tetra-loop with a GA-boost. GC|GC can form a tetra-loop to 13-loop, and GG|CC and CC|GG can hold closed a tetra-loop to 8-loop. Any other stable loops require additional assistance from additional pairs or boosts. Thus we must either consider unstable transient states or longer matches (triples or quads) to explain how many shapes form.

Eli Fisker

  • 2311 Posts
  • 531 Reply Likes
"For instance, a GG might match with a CC on one side of it when folding one way and a CU on the other side of it when folding another way. The CU and CC are not identical repeats, but do target the same matching pair."

I really like what you pointed out above. I think this is why an excess of the periodic repeats are CC's or GG's or mixtures of those. Those will have the strongest pull, so their partners don't have to be repeats. Repeats of CU will be less frequent, as a GG repeat is stronger and will be more effective.

What you wrote made me think of the following situation. Imagine having two sets of periodic repeats, where a GG and a CU match up. That would very much limit the number of legal solutions, at least if it happens with many sets of periodic repeats. I know that the above picture actually has a double set of periodic repeats: the two times 4 U's and the two times 4 A's. But they are rarer.

Thanks for your addition of ideas. Keep them coming.

jandersonlee

  • 555 Posts
  • 131 Reply Likes
I'm not sure if this fits in this thread or should start a new one, but it is an extension of the discussion above.

I've been playing with the EteRNA energy model (via the puzzle maker) and the hypothesis that many shapes form in stages, by folding hairpins/arms first, and fold by forming small changes of one or two bonds at a time.

For a first step, I started by looking at the modeled energy of various stack cells and loop closures. A non-exhaustive sampling can be found in a Google Doc here.



The rows indicate various possible pairing combinations. For example, spreadsheet row 2 is a GC pairing with GC combination, which is the strongest possible pairing according to the energy model. The columns indicate the free energy depending on the spacing of the strings. For example, when the second GC is offset 5 nucleotides (inclusive) from the start of the first sequence, the result is a tri-loop. At an offset of 6 nucleotides (e.g. GC starting at 3 matches GC starting at 9), it forms a tetra-loop, and so on.
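The relation between offset and loop size in that example can be written out as a small worked calculation. The helper name `loop_size` is hypothetical; it just counts the unpaired nucleotides left between two runs of equal length.

```python
def loop_size(first_start, second_start, run_length=2):
    """Unpaired nucleotides enclosed when two runs pair up as a hairpin.

    Hypothetical helper: positions are start positions on the same
    strand, and both runs are run_length nucleotides long.
    """
    return second_start - (first_start + run_length)


if __name__ == "__main__":
    # GC at 3-4 pairing with GC at 9-10 leaves positions 5-8 unpaired
    print(loop_size(3, 9))  # offset 6, loop of 4 (tetra-loop)
    print(loop_size(3, 8))  # offset 5, loop of 3 (tri-loop)
```

So for two-nucleotide runs the loop size is simply the offset minus 2.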



The top portion of the spreadsheet lists fully bonded stack cells. The bottom portion lists a few cases for simple loop closures (boosted or not boosted).

Very few of the combinations form a loop that lowers the free-energy of the structure. (Only five, with very limited spans.) Assuming that hairpins do form patchwork rather than whole-cloth, there must be some allowance for a temporary increase in free-energy during the structure formation.

I'm curious whether some of the stronger bonding runs (e.g. GC, CC, GG, CG, GA, UC, GU, AC) show up more frequently in either the single-shape or switch RNA sequences. Also, is there a difference in occurrence frequency between the first tier (GC, CC, GG, CG) and second tier (GA, UC, GU, AC) runs?
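One way to put numbers on that question would be to tally how large a share of the adjacent-base runs in a sequence falls in each tier. This is only a sketch: the tier sets are copied from the lists above, while the function name `tier_fractions` and the use of overlapping two-nucleotide windows are assumptions.

```python
# Tier membership as listed in the post
FIRST_TIER = {"GC", "CC", "GG", "CG"}
SECOND_TIER = {"GA", "UC", "GU", "AC"}


def tier_fractions(seq):
    """Fraction of overlapping dinucleotides in each tier (hypothetical helper)."""
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(dinucs)
    if total == 0:
        return {"first": 0.0, "second": 0.0, "other": 0.0}
    first = sum(d in FIRST_TIER for d in dinucs)
    second = sum(d in SECOND_TIER for d in dinucs)
    return {
        "first": first / total,
        "second": second / total,
        "other": (total - first - second) / total,
    }


if __name__ == "__main__":
    print(tier_fractions("GGACUUCGGUCC"))
```

Averaging these fractions over a set of single-shape designs and a set of switch designs would give a first, rough comparison.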

Eli Fisker

  • 2311 Posts
  • 531 Reply Likes
Hi JL!

Thanks for your thoughts. Yes, I think there will be a difference in the frequency of occurrence between the strongest GC, CC, GG, CG and the weaker GA, UC, GU, AC. Also for single shape lab. I think that the weakest "pairs" will be most prevalent in single shape labs, whereas I think the strongest, particularly the double GG and CC, but also AA and UU, will be more prevalent in switch labs. Thus making these base repeats more of an RNA switch marker.

However, there is a difference between designs in single shape lab. The designs with more long strings will harbour much more of the first tier than designs with short ones, as long strings are more tolerant of repeat bases.

I have taken a look at the two top-scoring designs in a lab with long strings, The cross, and in a lab with short strings, The backwards C, to be sure I covered the normal variation between single shape designs.

I added a spreadsheet with my numbers. If I take the average of the first and second tier for the two labs together, the second tier still comes out strong, though the long-stringed designs helped mask the tendency. I think there is a good chance this tendency will continue if carried out on more designs. More numbers can and probably should be done.

I think the reason why the second tier, GA, UC, GU and AC, is particularly common in single shape lab is that almost all strings are most successfully closed by a GC-pair, leaving a G or a C to finish off a strand. Thus all of the second tier will be present in big amounts, as all junctions get a GC-closing.

jandersonlee

  • 555 Posts
  • 131 Reply Likes
Hmm. A GC can have a range of effect much greater than other runs. A GC|GC stack cell can create a 32-loop with an FE rise of just under 1 kcal. For a GG|CC stack cell, a 21-loop has an FE rise of 1.0 kcal. For CC|GG it is a 23-loop. For CG|CG it's a 10-loop. For UA|UA, even a 6-loop has an energy rise of 3.2 kcal.

So always looking out 25 positions for repeats may *underestimate* the number of some critical repeats, and overestimate others.
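As a rough sketch of what a variable search distance might look like, the loop sizes quoted above can be turned into a per-stack-cell window. The table below only restates the numbers from this post; the name `max_search_offset`, the idea of using a rise of about 1 kcal as the cutoff, and the conversion of loop size to offset (loop size plus the run length of 2) are all assumptions.

```python
# Largest loop each stack cell closes at a free-energy rise of about 1 kcal,
# as quoted in the post (the 1 kcal cutoff itself is an assumption).
MAX_LOOP_AT_ABOUT_1_KCAL = {
    "GC|GC": 32,
    "GG|CC": 21,
    "CC|GG": 23,
    "CG|CG": 10,
}


def max_search_offset(stack_cell, run_length=2):
    """Largest start-to-start offset worth searching (hypothetical helper)."""
    return MAX_LOOP_AT_ABOUT_1_KCAL[stack_cell] + run_length


if __name__ == "__main__":
    for cell in MAX_LOOP_AT_ABOUT_1_KCAL:
        print(cell, max_search_offset(cell))
```

Under these assumptions a fixed window of 25 would be too short for GC|GC and far too long for CG|CG.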