Questionable Content Feed missing items

  • Problem
  • Updated 9 months ago
The feed is here

http://www.questionablecontent.net/QC...

And some of the items don't show up in NewsBlur. For example, the Sept 2nd one isn't there. I checked the feed itself and the item is there to be found, but somehow it's being missed.

  • 3 Posts
  • 0 Reply Likes

Posted 2 years ago

Benjamin Schwartz

  • 3 Posts
  • 0 Reply Likes
Same issue! Didn't even realize I'd missed it until I saw this post.
steve

  • 8 Posts
  • 0 Reply Likes
This has been mentioned before, and still has not been fixed. This feed's functionality is horribly broken.

QC generally runs M-F every single week, and is a linear, story- and plot-driven comic. Skipping days, or even weeks, is absolutely unacceptable, as it leads to a completely broken storyline that is practically unreadable.

Regardless of the reason that the maintainer provides, this feed *always* worked on Google Reader, and I have a feeling that the several direct competitors to NewsBlur are able to handle it....

  • 3 Posts
  • 0 Reply Likes
I definitely had problems with this feed on Google Reader. It would go missing for a week, and sometimes the previous day's post would be shown. Not too dissimilar to what's happening now on Newsblur.
steve

  • 8 Posts
  • 0 Reply Likes
I've been going through and comparing the feeds... there are bits of story dating all the way back to MAY that I completely missed out on due to spotty, incomplete display on NewsBlur.

I've already paid my $36 this year, but QC is one of my favorite comics, and that would be a dealbreaker if it remains broken.
Samuel Clay, Official Rep

  • 5248 Posts
  • 1170 Reply Likes
I just checked and the Sept 2nd story is there. Does that look correct to you now? The site URL is http://www.newsblur.com/site/774/qc-rss.
C Dave

  • 21 Posts
  • 0 Reply Likes
The Sept 2nd one isn't showing up for me, but it is in the RSS, with a unique ID:

[code]

<item>
<title>Ackbar, Repeat, Ackbar</title>
<link>http://questionablecontent.net/view.p...</link>
<description>
<![CDATA[
<p><!-- Beginning of Project Wonderful ad code: --> <br /> <!-- Ad box ID: 39770 --> <br /> <map name="admap39770" id="admap39770"><area href="http://www.projectwonderful.com/out_n..." shape="rect" coords="0,0,728,90" title="" alt="" target="_blank" /></map> <br /> <table cellpadding="0" border="0" cellspacing="0" width="728" bgcolor="#ffffff"><tr><td><img src="http://www.projectwonderful.com/nojs...." width="728" height="90" usemap="#admap39770" border="0" alt="" /></td></tr><tr><td bgcolor="#ffffff" colspan="1"><center><a style="font-size:10px;color:#0000ff;text-decoration:none;line-height:1.2;font-weight:bold;font-family:Tahoma, verdana,arial,helvetica,sans-serif;text-transform: none;letter-spacing:normal;text-shadow:none;white-space:normal;word-spacing:normal;" href="http://www.projectwonderful.com/adver..." target="_blank">Ads by Project Wonderful! Your ad could be here, right now.</a></center></td></tr><tr><td colspan="1" valign="top" width="728" bgcolor="#000000" style="height:3px;font-size:1px;padding:0px;max-height:3px;"></td></tr></table> <br /><img src="http://www.questionablecontent.net/co...>
]]>
</description>
<pubDate>Mon, 02 Sep 2013 23:34:08 -0400</pubDate>
<guid>02B7D3AF-B5FA-49C6-A76B-4B92440D5CBE</guid>
</item>

[/code]
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
I actually went and looked, and the story that I missed isn't the one referenced as Sept 2nd on NewsBlur. The one showing up as Sept 2 on NewsBlur is Sept 3 in the feed, and the one showing as Sept 3 is Sept 4 in the file. Looks like a timezone difference. Not a problem in itself, but that could be the communication breakdown.
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
Huh. I just found the strip. It's set to the 19th of August, tagged on top of the comic for that day. Odd. And it actually missed more than just that one and the guest strips: 2514 up to 2522 are missing, aside from 2516.

On the plus side, Schlock Mercenary's been updating perfectly for over a month. Yay!
Steve Beattie

  • 1 Post
  • 0 Reply Likes
Nope. Going there now, my feed has the following top 6 entries:

  • Putting in Hours, 04 Sep 2013, 12:29am (maps to comic 2527)

  • Did You Know?, 03 Sep 2013, 1:23am (2526)

  • Proportionate Response, 29 Aug 2013, 11:59pm (2524)

  • Division of Labor, 28 Aug 2013, 11:21pm (2523)

  • Highdeas, 27 Aug 2013, 10:28pm (2522)

  • The End will Camelid A thief in the Night, 19 Aug 2013, 10:36pm (2516, not marked in comic itself)



Hope this helps diagnose things...
C Dave

  • 21 Posts
  • 0 Reply Likes
The 2517 to 2521 strips were guest strips, and never appeared in the RSS feed.
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
I just checked the RSS feed, and they are in there.

  • 3 Posts
  • 0 Reply Likes
The link you give is missing the Sept 2nd one, as previously mentioned. As you can see from Steve's list, it is missing 2525, which is the Sept 2nd comic.
canpolat

  • 8 Posts
  • 0 Reply Likes
I've just noticed this problem too. Some of my feeds don't update properly either (most of them are Yahoo Pipes). One example is: http://pipes.yahoo.com/pipes/pipe.run...

Between August 1st and today, NewsBlur shows 2 items whereas The Old Reader shows 15 items for the same period.

This is not acceptable. If this is not resolved soon, I want a refund. (I'm a premium user.)
Bryn Hughes

  • 6 Posts
  • 1 Reply Like
Wow, hadn't even realized I was missing stories until I saw this. Now I'm wondering about the rest of my feeds....
Samuel Clay, Official Rep

  • 5248 Posts
  • 1170 Reply Likes
I realize this is frustrating, but the feed is at fault here. It needs to publish GUIDs that don't change and don't overlap. I'll see what I can do, but know that this problem is specific to this feed. The publisher should fix this, ideally.
Benjamin Schwartz

  • 3 Posts
  • 0 Reply Likes
There has to be some solution; other RSS readers seem to handle it fine, guest strips and all.
Samuel Clay, Official Rep

  • 5248 Posts
  • 1170 Reply Likes
Here's the raw log of the feed:

[Sep 05 02:28:20] ---> [QC RSS ] IntegrityError on updated story: So Demure
[Sep 05 02:28:23] ---> [QC RSS ] IntegrityError on updated story: 2458
[Sep 05 02:28:24] ---> [QC RSS ] IntegrityError on updated story: Wear The Guitar Lower Next Tim
[Sep 05 02:28:25] ---> [QC RSS ] Parsed Feed: new=0 up=97 same=0 err=3 total=100

So that's 97 OK stories, and 3 error stories. Those 3 errors are because the GUIDs match another story already in the system. NewsBlur is much more stringent about GUIDs, since they are effectively the publisher saying that these stories are the same. I know that they work in other readers, but that's a bug that probably causes duplicates in many, many more feeds.
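Put as a minimal sketch (hypothetical code, not NewsBlur's actual implementation): a story whose GUID is already in the system is treated as an update to the existing story, never as a new one, which is why a hand-edited feed that reuses a GUID for a different strip makes that strip disappear.

```python
def ingest(stories_by_guid, story):
    """Store a story keyed on its GUID. A repeated GUID is treated as the
    publisher re-publishing the *same* story, so the existing entry is
    updated in place instead of a duplicate being created."""
    guid = story["guid"]
    if guid in stories_by_guid:
        stories_by_guid[guid].update(story)
        return "updated"
    stories_by_guid[guid] = story
    return "new"

db = {}
ingest(db, {"guid": "02B7D3AF", "title": "So Demure"})
# A feed that reuses the GUID for a *different* strip silently overwrites
# the earlier story rather than adding a new one:
status = ingest(db, {"guid": "02B7D3AF", "title": "2458"})
```

The "IntegrityError" lines in the log above are this situation being detected and rejected rather than applied.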
tedder42

  • 135 Posts
  • 9 Reply Likes
I tried reaching out to Jeph on twitter. We need someone with verified weight. He's producing his RSS feed by hand. There's a much better way, I'd do it for free. I can actually do it without him but nobody would adopt it.
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
It seems odd that there would be duplicates, as they seem to be UUIDs, which are 128-bit numbers that are very unlikely to collide. Could you show us which stories match the GUIDs? I'm also quite curious about the entries where we're getting multiple posts collapsed into a single story on NewsBlur; I think that may be giving you some of your duplication issues. Perhaps you're getting the title from one post and the GUID from a different post, because some malformed CDATA is causing the XML parser to gag and merge posts. Perhaps it's detecting end tags accidentally located inside the CDATA?
Ben Lowery

  • 2 Posts
  • 0 Reply Likes
Yeah, he updates the feed by hand. No idea *why* it's this way, but it is what it is. That's the root cause.
spiffytech

  • 19 Posts
  • 5 Reply Likes
@tedder42: if you do it and post it here, at least I'd adopt it. Something reliable sounds better than something unreliable :)
A Cariset

  • 12 Posts
  • 1 Reply Like
So it looks like it's stopped working again; the last two updates aren't showing up. I don't know if this is a change from before, but the feed currently claims to be generated by "Feeder 2.3.8(1705); Mac OS X Version 10.8.4 (Build 12E55) http://reinventedsoftware.com/feeder".

  • 3 Posts
  • 0 Reply Likes
The current feed says it was last published on "Tue, 17 Sep 2013" so, since the guy produces the feed by hand (see other posts in this thread), it seems he just hasn't updated the feed in the last few days.

  • 93 Posts
  • 28 Reply Likes
He's done his back in, so no comic the last few days.
There's a filler up today, but those are never in the feed.
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
There are still 2 items in the feed that aren't on NewsBlur yet, even though the filler isn't in the feed: 2535 and "New Shoes". The 3rd item in the feed, "Prince Albert", is the first on NewsBlur. Here's another oddity in the QC feed. Near the start of several of the posts it has:

[code]
<table ]]>
<![CDATA[ cellpadding="0" border="0" cellspacing="0" width="728" bgcolor="#ffffff">
[/code]

Some parsers might choke on that.
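For what it's worth, adjacent CDATA sections like the split one quoted above are legal XML, and a conforming parser simply concatenates their character data, so a spec-compliant parser shouldn't choke on it. A quick check with Python's ElementTree (the snippet is a trimmed stand-in for the feed's actual markup):

```python
import xml.etree.ElementTree as ET

# A <description> whose content is split across two CDATA sections,
# mimicking the feed's "<table ]]> <![CDATA[ cellpadding=..." pattern.
xml = ('<description><![CDATA[<table ]]>'
       '<![CDATA[ cellpadding="0" border="0">]]></description>')
text = ET.fromstring(xml).text
# ElementTree joins the two sections back into one run of text.
```

So if a parser is merging posts, the cause is more likely genuinely malformed markup than the split CDATA itself.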
tedder42

  • 135 Posts
  • 9 Reply Likes
Since the feed is updated manually I'm sure he hasn't been updating it.

Ugh, time to write something.
Will

  • 24 Posts
  • 0 Reply Likes
I'm having problems with QC now. Haven't had much issue until last week, but now it's pretty much completely broken. I'm using the same RSS URL as above, but this maps to a different Newsblur URL - http://www.newsblur.com/site/774/ques...

It's not picking up new stories, at the same time as repeatedly re-marking an older one as 'unread'. Today the "unread" count in the main list of feeds is probably correct, but doesn't match up with the actual list of stories (which contains none of the several entries since the 16th).

If the sidebar doesn't match up with the story list, that's got to be a NewsBlur bug, right? I've had a look through the QC feed and there doesn't appear to be a recent issue with duplicate GUIDs.

Thanks,
Will
Karl Craven

  • 2 Posts
  • 0 Reply Likes
I'm actually getting the most recent couple of QC strips returning to my Newsblur feed again and again. Enough, already! I've read those!
Samuel Clay, Official Rep

  • 5248 Posts
  • 1170 Reply Likes
There's still a bunch of stories that are being duped:

[Oct 21 17:00:52] ---> [QC RSS* ] Feed fetch in 6.06s
[Oct 21 17:00:52] ---> [QC RSS* ] Checking 100 new/updated against 95 stories
[Oct 21 17:01:10] ---> [QC RSS ] IntegrityError on updated story: 2514
[Oct 21 17:01:13] ---> [QC RSS ] IntegrityError on updated story: So Demure
[Oct 21 17:01:14] ---> [QC RSS ] IntegrityError on new story: F2D902B7-90E9-4C1B-8D45-04C008AE3845 - Tried to save duplicate unique keys (E11000 duplicate key error index: newsblur.stories.$story_hash_1 dup key: { : "774:ab98f3" })
[Oct 21 17:01:18] ---> [QC RSS ] IntegrityError on updated story: On The Scent
[Oct 21 17:01:21] ---> [QC RSS ] IntegrityError on updated story: 2458

If you check above in the comments, you'll notice that the publisher maintains the RSS feed by hand, and NewsBlur checks every single GUID against every other GUID. So it's rather hard to verify by hand.

Note that I have two options. Keep it as is and disallow multiple stories with the same GUID (the purpose of a GUID is to globally identify a story, and reusing one is essentially the publisher saying that two stories are the same and one has been updated). Or I can allow dupes, and suddenly you'll see duplicates in a bunch of other unrelated RSS feeds.

My competitors allow dupes and that seems to be more acceptable, but I have no plans to change. Not only would it be a major architectural change (which isn't going to happen), but the decision to de-dupe is the right one, as it does exactly what the publisher is explicitly stating by using the same GUID on multiple stories.
jamuraa

  • 1 Post
  • 0 Reply Likes
Honestly, I would take the tack that this is uncommon but the feeds are unlikely to fix themselves: somehow mark the URL as "problematic" and have the parser ignore the GUIDs for those feeds. Then you get the stronger de-duping on everything except the feeds that are the exception to the rule.
Will

  • 24 Posts
  • 0 Reply Likes
I completely understand your handling of GUIDs. No one wants to implement a broken system because of broken implementations (I'm a software engineer for radio comms - that sort of battle is my life!).

I just took a dump of QCRSS.xml and checked all the GUIDs in it with a little Python script, and none of the GUIDs are duplicated. That's obviously only some recent history, so I can't check back further than that.
Are these GUID collisions against older, previously downloaded items?
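For anyone who wants to repeat that check, a minimal version might look like this (a sketch; the `duplicate_guids` helper and the inline sample are illustrative, not the actual script, and in practice you'd feed it a saved copy of QCRSS.xml):

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_guids(feed_xml):
    """Return the <guid> values that appear more than once in an RSS document."""
    root = ET.fromstring(feed_xml)
    counts = Counter(g.text for g in root.iter("guid"))
    return [guid for guid, n in counts.items() if n > 1]

# In practice: duplicate_guids(open("QCRSS.xml").read())
sample = (
    "<rss><channel>"
    "<item><guid>02B7D3AF-B5FA-49C6-A76B-4B92440D5CBE</guid></item>"
    "<item><guid>F2D902B7-90E9-4C1B-8D45-04C008AE3845</guid></item>"
    "<item><guid>02B7D3AF-B5FA-49C6-A76B-4B92440D5CBE</guid></item>"
    "</channel></rss>"
)
dupes = duplicate_guids(sample)
```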

Is the above log several stories colliding on that one quoted GUID, or is that several stories colliding with some previously stored GUIDs?

Is it possible that the author has sorted out his GUID problems now (possibly by changing some GUIDs?), but you've still got some older values stored? How far back does your history go? The oldest item in the XML is over a year old now; could previous history be flushed in some way? (Obviously I have little idea about how your system works, apart from what you've hinted at above.)

Wow that's a lot of questions...
Anyway, grateful for the rapid response!
tedder42

  • 135 Posts
  • 9 Reply Likes
Will, Jeph manually creates the RSS, so it tends to be a problem if he copy/pastes an entry and then fixes it later.
Will

  • 24 Posts
  • 0 Reply Likes
From what Mr Clay has said above, surely that would be treated as an update to the old entry that the GUID collides with (or produce no change, because the error is caught), and then, when he fixes the GUID, the entry should be processed properly and appear in the feed. There might be a delay, but it should get through.

I've validated the current set of GUIDs as being unique, but we're still a few days behind in the list. That's why I'm a little confused by the log output just above: it appears to be showing collisions that I can't see.
Will

  • 24 Posts
  • 0 Reply Likes
Is the same GUID problem still ongoing? I checked for duplicate GUIDs again and everything looks unique, but I haven't had any new entries since 24th Oct (and they're in the feed pretty much daily).
C Dave

  • 21 Posts
  • 0 Reply Likes
Looks like Jeph knows about it. https://twitter.com/jephjacques/statu...
Will

  • 24 Posts
  • 0 Reply Likes
Should probably have actually looked at the XML!
The GUIDs are all unique, but that doesn't help us if none of the entries are in there, apart from the most recent one, and that's got the wrong date!

Thanks C Dave.
tsuckow

  • 6 Posts
  • 0 Reply Likes
Is this issue back? I am noticing a lot of missing posts.

All - Newest First:
2654
2650
2645
2538

Maybe we need to get the feeder devs involved?
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
After poking around a bit, I found it: https://github.com/samuelclay/NewsBlu... I'll pull it down sometime this weekend and try it.
Will

  • 24 Posts
  • 0 Reply Likes
Yeah, I had a look at that earlier. Technically the feed_fetcher is actually using 'feedparser_trunk.py', but the two files are identical. I've run it against the current contents of QCRSS.xml and get a GUID for every entry, and none are duplicates.
Will

  • 24 Posts
  • 0 Reply Likes
It's a bit subtler than that. It's Mongo that's actually throwing the error. The stories are keyed on a hash of the GUID, calculated as 'hashlib.sha1(guid).hexdigest()[:6]'. I've checked all of those too, though: no duplicates on those either (for the 353 entries starting Mon 20th August).
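That key is easy to reproduce, and it also shows why the "unique ID" is weaker than it looks: the UUID's 128 bits are cut down to 24, so two different GUIDs can in principle land on the same story_hash. A sketch based on the formula quoted above (which GUID maps to the "ab98f3" in the log is not something I've verified):

```python
import hashlib

def story_hash(feed_id, guid):
    """Recreate a 'feed_id:hash' key: the first 6 hex digits of the
    SHA-1 of the GUID, matching the shape of the Mongo dup-key error
    in the log above ('774:ab98f3')."""
    digest = hashlib.sha1(guid.encode("utf-8")).hexdigest()
    return "%s:%s" % (feed_id, digest[:6])

key = story_hash(774, "02B7D3AF-B5FA-49C6-A76B-4B92440D5CBE")
# 6 hex digits means only 16,777,216 possible values, so a stale story
# kept from an older version of the feed can collide with a new one.
```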
Will

  • 24 Posts
  • 0 Reply Likes
My current guess, and this is a complete guess, is that there are older stories in the database than are in the XML feed. Jeph made a few errors with the feed in the past, and although it's fine now, the new feed content is colliding with the old feed content. That's the only thing I can think of.
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
I wonder what it would take to wipe the feed out and reimport it?
Patrick O'Doherty

  • 2 Posts
  • 0 Reply Likes
This happened to me today.

[Screenshot of what NewsBlur is currently showing me for the QC feed]
Will

  • 24 Posts
  • 0 Reply Likes
Yes, that's what I've got. Everyone will see the same thing - the feed is only parsed once, for everyone.
Samuel Clay, Official Rep

  • 5248 Posts
  • 1170 Reply Likes
Good news: I went in with a whacking stick and started clearing out debris from the feed fetcher's overly aggressive de-duplicator. What was happening was that NewsBlur determined that because the stories were 98.9% the same in terms of content (they were; just a single digit differed between some of them), they must be the same story.

I made the de-duplicator check that the two stories share at least 75% of their title and are published within a day of each other. That should do the trick, although keep an eye on some of your other feeds and let me know if anything looks amiss.
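Roughly, that new check can be sketched like this (hypothetical code using difflib for the title similarity; the real similarity measure inside NewsBlur may differ):

```python
import difflib
from datetime import datetime, timedelta

def looks_like_same_story(title_a, date_a, title_b, date_b):
    """Treat two stories as the same only if their titles are at least
    75% similar AND they were published within a day of each other."""
    similarity = difflib.SequenceMatcher(None, title_a, title_b).ratio()
    return similarity >= 0.75 and abs(date_a - date_b) <= timedelta(days=1)

# Two strips whose bodies differ by a single digit are no longer merged
# unless the titles and dates also agree:
same = looks_like_same_story(
    "Putting in Hours", datetime(2013, 9, 4, 0, 29),
    "Putting in Hours", datetime(2013, 9, 4, 12, 0),
)
different = looks_like_same_story(
    "Putting in Hours", datetime(2013, 9, 4, 0, 29),
    "Did You Know?", datetime(2013, 9, 3, 1, 23),
)
```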

This feed will return to normal. Unfortunately there are 7 stories I couldn't get back into the archive, but future stories should work.
C Dave

  • 21 Posts
  • 0 Reply Likes
Thanks!

(Aside: surely there are a lot of feeds where only the URL changes by a single digit as counter/date URLs update?)
Kazriko Redclaw

  • 51 Posts
  • 5 Reply Likes
Thanks a lot! I'll keep an eye on it and see how well it works.

That would also explain why I was getting a whole lot of other comics with similar issues.
Samuel Clay, Official Rep

  • 5248 Posts
  • 1170 Reply Likes
They should all be fixed with this update.
Will

  • 24 Posts
  • 0 Reply Likes
The feed has been perfect from what I can see for the last 4 months, but the problem is back. Most of the posts from the last week are missing again.
C Dave

  • 21 Posts
  • 0 Reply Likes
It's his week off, and he creates the RSS feed manually.
Will

  • 24 Posts
  • 0 Reply Likes
Good point. Thanks for that.