Issues with disappearing and reappearing replies have been solved

Over the past few days, some of you may have noticed strange issues with our topic pages: replies and comments would disappear and reappear.

I've tracked down the problem and solved it. You shouldn't be seeing this particular behavior again. I can go into the details of the cause and solution if anyone is interested.

I apologize for any frustrations.
 
happy I’m tired

  • Oy! Thank you! I've been having this problem. Feel better already.
     
    happy I’m relieved
  • mdy
    I'm interested in the cause and solution.
     
    happy I’m in a good mood
  • We use Memcache on topic pages pretty heavily to speed things up. Each reply on a topic page has many moving parts, and that translates into a *lot* of partial template calls. We break up our code into partials so that we can more easily reuse portions around the site. Pretty average stuff really, but there are so many partial template renderings on a topic page that it can take up to a second to generate the HTML for a page with 50 replies. We had two choices to eliminate this drain on speed: eliminate the partials and deal with duplication and less readable code, or cache the final generated HTML using fragment caching. We chose the latter option.
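
    As a rough sketch of the fragment-caching idea (written in Python purely for illustration, not the site's actual code), the generated HTML for a whole conversation is stored under one key and reused until it is expired. The dict standing in for memcache and the names render_reply, render_conversation, and expire_conversation are all hypothetical.

```python
def render_reply(reply):
    # Hypothetical stand-in for the many partial-template calls per reply.
    return "<li><b>{author}</b>: {body}</li>".format(**reply)


def conversation_key(topic_id):
    return "topics_%d/full_conversation" % topic_id


def render_conversation(topic_id, replies, cache):
    """Return the cached HTML for a topic, rendering it only on a cache miss."""
    key = conversation_key(topic_id)
    html = cache.get(key)
    if html is None:
        # Slow path: run every partial, then remember the finished HTML.
        html = "<ul>" + "".join(render_reply(r) for r in replies) + "</ul>"
        cache[key] = html
    return html


def expire_conversation(topic_id, cache):
    """Drop the cached HTML, e.g. after a new reply is posted."""
    cache.pop(conversation_key(topic_id), None)
```

    The trade-off is the usual one: the partials stay reusable and readable, but every change to a conversation has to expire the right key.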

    Every piece of data in memcache has a key that accesses that data. To help prevent us from getting stale data (and to protect against strange caching errors if we roll back a change), we prepend the revision number of the deployed code so that each key looks like: 9250:topics_11232/full_conversation. I developed this code around the 2nd week after we decided to build Satisfaction, around February 2007, and it hasn't been touched since. But here's the key: it has a fail-safe in it. If it cannot retrieve the revision number from Subversion, it will default to using a timestamp rather than crashing. Basically, if the servers all get started within the same 10 minute block, things work fine since they all have the same timestamp. 99% of the time this happens, and so even if the revision number can't be retrieved from Subversion there are no problems.
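
    A minimal sketch of that key scheme, again in Python for illustration only. The post doesn't say exactly how the fallback timestamp is computed, so rounding the start time down to its 10 minute block is an assumption here, as are the names key_prefix and cache_key.

```python
BLOCK_SECONDS = 10 * 60  # the 10 minute block described above


def key_prefix(revision, started_at):
    """Use the deployed revision if known, else a coarse start-up timestamp."""
    if revision is not None:
        return str(revision)
    # Fail-safe (assumed form): round the start time down to its 10 minute block.
    return str(int(started_at) // BLOCK_SECONDS * BLOCK_SECONDS)


def cache_key(prefix, name):
    return "%s:%s" % (prefix, name)


# With a known revision, every server builds the same key.
print(cache_key(key_prefix(9250, 0), "topics_11232/full_conversation"))
# -> 9250:topics_11232/full_conversation
```

    Without a revision, two servers started at 1795 and 1799 seconds would both get the prefix 1200, while one started at 1805 would get 1800 and so read and write a different set of keys.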

    Now, at the time I developed this code, we were using Subversion. Around June of last year, I think, I started using Git to manage my personal development. It integrates well enough with Subversion and provides a much better experience: I'm in love with a piece of software. Cameron, our other back-end developer, followed several months later, and just last month I convinced Ted, our front-end man, to follow along. With the trifecta of switching complete, we were able to eliminate Subversion from the mix and only use Git to manage our source code.

    The only problem: I had forgotten about the revision key code I talked about above. Trap set! With Subversion gone, the revision lookup can no longer succeed, so every server falls back to the timestamp. If you recall, I said that the fail-safe works as long as all of the application servers get started in the same 10 minute block. The trap didn't get sprung until this past weekend, when I was switching the site over to a new DB setup. I took down the site to move data over and then I spun up each application server independently, not even thinking this was an issue. It looks like they got restarted across the 10 minute boundary and our cache began to fragment.

    To make matters worse, I pulled the age-old microsoftian solution for random problems yesterday when Eric reported to me about the weirdness: I rebooted the server. But no, I couldn't have rebooted all of the server processes at once. I did it 5 at a time. Well, the trap hit me again, and then we were running on 3 fragments of the cache.
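
    To make the fragmentation concrete, here is a small simulation in Python. It is a sketch under stated assumptions, not the real application: the AppServer class is hypothetical, the fallback prefix is assumed to be the start time rounded down to a 10 minute block, and a server is assumed to expire only its own key when a reply is posted.

```python
BLOCK_SECONDS = 10 * 60


class AppServer:
    """Hypothetical app server that prefixes cache keys with its start-up block."""

    def __init__(self, started_at, shared_cache):
        self.prefix = str(int(started_at) // BLOCK_SECONDS * BLOCK_SECONDS)
        self.cache = shared_cache

    def _key(self, topic_id):
        return "%s:topics_%d/full_conversation" % (self.prefix, topic_id)

    def view(self, topic_id, replies_in_db):
        key = self._key(topic_id)
        if key not in self.cache:
            self.cache[key] = list(replies_in_db)  # "render" and cache
        return self.cache[key]

    def post_reply(self, topic_id, replies_in_db, reply):
        replies_in_db.append(reply)
        # Only this server's key is expired; other prefixes keep stale copies.
        self.cache.pop(self._key(topic_id), None)


shared = {}
db = ["first reply"]
a = AppServer(started_at=1795, shared_cache=shared)  # block starting at 1200
b = AppServer(started_at=1805, shared_cache=shared)  # block starting at 1800
a.view(11232, db)
b.view(11232, db)
a.post_reply(11232, db, "second reply")
print(a.view(11232, db))  # ['first reply', 'second reply']
print(b.view(11232, db))  # ['first reply']
```

    A visitor bounced between the two servers would see the second reply appear and disappear depending on which one handled the request.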

    ----

    So, when diagnosing this, I had no clue what the cause of it was, but my immediate attention went to the network and the database. I had just had a ticket resolved with our host that turned out to be a problem in their network. So, I postulated that one of our app servers wasn't able to connect to one of our memcache servers and it was failing over to the other. If I had thought it through, this wouldn't have made sense, but instead I ended up spending an hour diagnosing the network to see if I could recreate the problem in a debug session. After that I thought it was a database issue. Here's the chat log between Thor and me (please excuse the grammar and spelling):


    9:41:59 PM Thor Muller: is this a db issue?
    9:42:41 PM Scott Fleckenstein: it's a caching connectivity issue. looks like I need to install some logging
    9:43:39 PM Scott Fleckenstein: ill get to work on it. In the meantime, if you refresh enough you'll hit the server that can access the memcache with the latest data.
    9:44:18 PM Thor Muller: hmm, you'll have to explain this in detail tomorrow
    9:44:30 PM Thor Muller: thanks
    9:44:34 PM Scott Fleckenstein: np
    10:14:40 PM Scott Fleckenstein: nevermind, it is the database, im investigating


    Given the solution I gave him, you could see I had no fucking clue what I was talking about. The third fragment hadn't exposed itself, so I was still thinking it was a connectivity issue at heart. I thought perhaps something was causing MySQL to cache results at the connection level. Perhaps one of the servers wasn't committing a transaction and was operating in its own little unfinished transaction land. I really had completely forgotten about the cache revision key stuff and had no clue. I quickly ruled out the DB though, maybe 10 minutes later, and went back to work on the cache.

    The next thing I did was get closer to the metal, finding the actual server that should be serving the cache for this particular key, and what that key actually was. Bingo! Once I saw the raw key, the whole timestamp-revision-key-fallback-thingy came rushing back and I was able to put in the fix. A couple of lines of code later and I was done!
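
    For readers wondering what "the server that should be serving the cache for this key" means: memcache clients typically choose a server by hashing the key. The sketch below shows one simple modulo scheme in Python. The exact hash the site's client used isn't stated, so the CRC32 choice, the function name server_for_key, and the host names are all assumptions.

```python
import zlib


def server_for_key(key, servers):
    """Pick the server responsible for a key with a simple hash-modulo scheme."""
    digest = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return servers[digest % len(servers)]


# Hypothetical hosts; inspecting the raw key is what exposes the odd prefix.
servers = ["cache1.example:11211", "cache2.example:11211"]
raw_key = "1200:topics_11232/full_conversation"
print(raw_key, "->", server_for_key(raw_key, servers))
```

    The same key always maps to the same server, so once the raw key is known you can ask that one server directly what it holds.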
  • thanks for the super thorough explanation! Leslie Chicoine (Official Rep), on May 06, 2008 18:23
  • mdy
    Scott, thank you so much for taking the trouble to explain it in such beautiful detail. I appreciate the honesty with which this was written.

    I'd been curious about the way the team at GSFN works for a while, and your explanation gives me a much-appreciated peek into what goes on behind the scenes. 8-) Thanks again!

    Edited to add:
    You should have heard my laugh when I got to this part...

    I pulled the age-old microsoftian solution for random problems yesterday when Eric reported to me about the weirdness: I rebooted the server.

    ... because I would have done exactly the same thing!
     
    happy I’m thankful
  • Thanks for investigating and fixing this Scott. It was a bit strange when replies I was emailed weren't showing up on the site but it got really weird when replies I had already SEEN on the site were mysteriously gone. So glad it's been solved! Nice work.
     
    happy I’m relieved
  • You're welcome for the explanation.

    The best part was when I figured it out. It was like I was some amnesiac hero from the movies who gets his memory back in a series of flashing images before he saves the day, only instead of a hero I'm just some schmuck slapping my forehead.
     
    silly
  • Tim
    This is a wonderful writeup that gives an outsider a real feel for what writing and debugging code is like, especially the experience of building something, having it work fine and so forgetting about it, then breaking it by accident later and having no F'ing clue why it's broken. Happens to me all the time.

    You have a gift with words, Mr. Fleckenstein.
     
    happy I’m delighted to read brilliance