Aggressive filter for Duplicate items.

  • Idea
  • Updated 2 months ago
I know this has been asked for many times in the past, but has there been any movement on aggressive duplicate item removal? The new Google News feeds have severe issues with duplicate items. Can we de-duplicate based only on the headline, or by some other method? I really like your product otherwise!!

Thomas Pemberton

  • 14 Posts
  • 1 Reply Like

Posted 6 months ago


Samuel Clay, Official Rep

  • 6512 Posts
  • 1474 Reply Likes
Give me a bit more context. Include screenshots of stories you wish were de-duped.

Thomas Pemberton

  • 14 Posts
  • 1 Reply Like
Another example from today, although I'm sure you get the idea by now: https://imgur.com/g7zInOT

Feed: https://news.google.com/news/rss/headlines/section/topic/TECHNOLOGY?ned=us&hl=en&gl=US

Samuel Clay, Official Rep

  • 6512 Posts
  • 1474 Reply Likes
Are the stories empty? NewsBlur already has an aggressive de-duper on a per-feed basis, but it needs more than 100 characters in a story to check against.

Thomas Pemberton

  • 14 Posts
  • 1 Reply Like
With their old feed (which didn't have many dupe issues), they would include the headline and a one-paragraph blurb about the article. With the new feed they seem to include just the headline and not much else:


Emil Pop

  • 12 Posts
  • 1 Reply Like
I also have duplicate items in my feeds. Being able to filter these out would be a great feature.

Victor Cunha

  • 1 Post
  • 0 Reply Likes
This would be really great! I also have these issues with Google News.

Thomas Pemberton

  • 14 Posts
  • 1 Reply Like
Any update on this? Other reader sites such as Inoreader have the same issue. See this forum post for an example:

https://forum.inoreader.com/topic/12092-all-google-news-feeds-deprecated/?tab=comments#comment-28809

The end of the URL is different, which may be what triggers the duplicate. Is there any way to filter on this to reduce duplicates?
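
To illustrate the idea of ignoring the end of the URL, here is a minimal sketch that compares story links after dropping the query string and fragment. It is only an assumption that the differing part of the Google News links lives there, and the `canonical_link` and `dedupe_by_link` names and the `link` key are hypothetical, not part of NewsBlur or Inoreader.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link(url: str) -> str:
    """Return the URL without its query string and fragment.

    Assumption: two links that differ only after the path point to the
    same story. Sites that identify articles in the query string
    (for example ?id=123) would be wrongly merged by this.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), "", "")
    )


def dedupe_by_link(stories):
    """Keep the first story seen for each canonical link, preserving order."""
    seen = set()
    unique = []
    for story in stories:
        key = canonical_link(story["link"])
        if key not in seen:
            seen.add(key)
            unique.append(story)
    return unique
```

The trade-off is the usual one for aggressive filters: the more of the URL you ignore, the more likely two genuinely different stories collapse into one.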

Samuel Clay, Official Rep

  • 6512 Posts
  • 1474 Reply Likes
NewsBlur is already using a pretty aggressive de-duplication heuristic. I sometimes lower it, but it ends up eating comic sites that publish new comics with the same title. In most cases, the feed is claiming it's a different story and NewsBlur's de-duplicator doesn't have enough of a story (minimum of 100 characters) to make a determination whether it's a dupe or not.

The code is available here for you to peruse: https://github.com/samuelclay/NewsBlur/blob/master/apps/rss_feeds/models.py#L1909-L2009
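
For readers who want the gist without opening the source, here is a rough sketch of the kind of check described above: compare story content, but only when both stories have enough text to compare. This is not NewsBlur's actual code (see the link above for that). The function name, the `content` key, and the similarity cutoff are assumptions made for illustration; only the 100-character minimum comes from this thread.

```python
from difflib import SequenceMatcher

MIN_CONTENT_LENGTH = 100  # minimum story length mentioned in the thread
SIMILARITY_CUTOFF = 0.9   # hypothetical value, chosen only for this example


def is_duplicate(new_story: dict, existing_story: dict) -> bool:
    """Return True only when both stories are long enough and closely match."""
    new_text = new_story.get("content", "")
    old_text = existing_story.get("content", "")

    # With headline-only stories there is too little text to decide,
    # so the stories are treated as distinct.
    if len(new_text) < MIN_CONTENT_LENGTH or len(old_text) < MIN_CONTENT_LENGTH:
        return False

    ratio = SequenceMatcher(None, new_text, old_text).ratio()
    return ratio >= SIMILARITY_CUTOFF
```

Under these assumptions, the headline-only Google News items never reach the comparison step, which matches the behavior described in this thread. Matching on title alone would catch them, but it would also merge comics that reuse the same title for every strip.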