dupeGuru still finds duplicates after I deleted all of them in a previous scan
I performed a scan, deleted all duplicates, then made another scan with the same preferences, and dupeGuru still finds duplicates. Why?
The best answer from the company
-
This is because, in some cases, some matches are left out of the final results for safety reasons. Let me use an example. We have 3 files: A, B and C. We scan them using a low filter hardness. The scanner determines that A matches B and A matches C, but B does not match C. Here, dupeGuru has a problem. It cannot create a duplicate group with A, B and C in it because not all files in the group would match each other. It could create 2 groups, one A-B group and one A-C group, but it will not, for safety reasons. Let's think about it: if B doesn't match C, it probably means that B, C or both are not actually duplicates. If there were 2 groups (A-B and A-C), you would end up deleting both B and C. And if one of them is not a duplicate, that is definitely not what you want to do, right? So what dupeGuru does in a case like this is discard the A-C match. Therefore, if you delete B and re-run a scan, you will have an A-C match in your next results.
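To make the A-B-C example concrete, here is a minimal Python sketch of that kind of grouping. It is not dupeGuru's actual code: the function name `build_groups`, the greedy strongest-first strategy and the match percentages are all assumptions made up for illustration.

```python
def build_groups(matches):
    """Greedily build groups in which every member matches every other member.

    `matches` is a list of (file_a, file_b, percentage) tuples.
    Returns (groups, discarded): `groups` is a list of sets of files and
    `discarded` is the list of matches left out of the results.
    """
    # Remember which pairs matched, so we can test "matches everyone in the group".
    matched = {frozenset((a, b)) for a, b, _ in matches}
    groups = []
    group_of = {}  # file -> the group (set) it was placed in
    discarded = []

    # Look at the strongest matches first (hypothetical ordering).
    for a, b, pct in sorted(matches, key=lambda m: m[2], reverse=True):
        ga, gb = group_of.get(a), group_of.get(b)
        if ga is None and gb is None:
            # Neither file is grouped yet: start a new group.
            group = {a, b}
            groups.append(group)
            group_of[a] = group_of[b] = group
        elif ga is not None and gb is not None:
            # Both files already sit in groups; if those groups differ,
            # this match cannot be shown without listing a file twice.
            if ga is not gb:
                discarded.append((a, b, pct))
        else:
            # One file is grouped; the other may join only if it matches
            # every current member of that group.
            group, newcomer = (ga, b) if ga is not None else (gb, a)
            if all(frozenset((newcomer, member)) in matched for member in group):
                group.add(newcomer)
                group_of[newcomer] = group
            else:
                discarded.append((a, b, pct))
    return groups, discarded


# First scan: A matches B and C, but B does not match C.
first_groups, first_discarded = build_groups([("A", "B", 90), ("A", "C", 85)])

# Second scan, after B has been deleted: only the A-C match remains.
second_groups, second_discarded = build_groups([("A", "C", 85)])
```

In this sketch the first scan yields a single A-B group with the A-C match discarded, and the second scan yields the A-C group, which is the behavior described above.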
The company says
this answers the question
-
What would you think about the program advising the user to run another scan? It could either have a status message somewhere or it could even post a dialog (search, cancel) once existing entries in the list have been dealt with.
I’m confident
-
I always considered this situation a corner case where the user likely doesn't want to delete any further duplicates anyway. Also, in the A-B-C example above, if A-B is displayed but is a false duplicate and it turns out that A-C is the real match, you're not going to find the A-C match anyway because you will not delete B.
... Which gets me thinking... I'm not sure that I verify, in that case, that only the dupe with the highest match % is kept... I'll look that up.
Anyway, I don't think that this corner case is worth this notice, which would only be confusing to users who don't know about it.
-
I'm just trying to avoid the need to keep scanning and scanning until it comes up empty. It can take a while, and it'd be nice to know that the program has already noticed further potential matches. Maybe there's a better way. Heck, if the program cached nondisplayed match candidates, it wouldn't even need to take the time for an entire additional scan. ;)
-
The thing is, it's not supposed to happen often. So what you're saying is that you routinely have those kinds of "second scan matches"? Either there's a bug in DG or your filter hardness is not low enough (if it were low enough, all the files would match together).
It might be a lot to ask, because it's not really possible to know beforehand when it happens, but it would be nice to have a specific example of the compared names (filename or tags, depending on what you use) as well as the preferences you use. The tricky thing is to remember "B"'s metadata after you've deleted it. Well, if you ever happen to remember.
-
Yes, it's definitely routine to have subsequent matches. I keep the filter set at 80% hardness for "filename - fields" and I don't think I've EVER had a single scan show everything. There's always at least a second or third, and sometimes 5-6 before I stop getting results.
If there's a way to export results I'll be glad to contact you privately and show you what I'm seeing. (A screenshot won't be much help if the list goes to multiple screensful.)
-
Sorry for the second reply -- somehow during editing my note lost some of the text. Here's the rest:
________________________
I first compare filenames for the likelihood that it's the same song, and if it's a possibility that there's a match then I compare bitrate and duration. If those are really close then I compare tag contents and audition 'em in my favorite player.
-
I haven't verified yet whether only the highest matching % is kept in that A-B-C situation (it involves digging into the core of the matching code and there are many things to look at), but I think we should wait until then. If it isn't, I'll fix it. Then we'll have better ground for discussing this issue.
Because when you say you routinely perform several scanning rounds, do you mean that you delete all the files in each round, or do you ignore matches, then perform another round? Because if there is indeed a flaw in the matching code, always making sure that only the highest matches are kept could fix the issue altogether and make second scan rounds the exception, as they should be.
As soon as I'm finished with moneyGuru 1.1, I'm on it.
-
> Because when you say you routinely perform several scanning rounds, do you mean that you delete all the files in each round, or do you ignore matches, then perform another round?
Yes. :) I don't completely understand the question, because as I understand it that's the only way to proceed -- either delete or ignore, depending on the situation. I resolve all potential matches in the first list and then go again. (And again...) :)
What I normally see is something like this. First run:
file1 [live].mp3
file1.mp3
Second run, after the file1.mp3 is either deleted or ignored:
file1 [live].mp3
file 1.mp3
...something like that. Maybe those particular examples would all appear in one run as a group of 3 or more, but that's the general pattern anyway. The second run either produces other variants of examples from the first run, or it produces ones I haven't even seen in the first run.
By the way, I'm glad you started this topic -- it never occurred to me that this was not expected behavior!
-
I did take a look at the matching code, and it already keeps the highest matching pairs in situations like the one described in this thread.
I was under the impression that this case was supposed to be rare, but it seems like I was wrong. At this point, it's been a long time since I had any duplicates to manage :) All I have is test data sets, so I'm a little out of touch with what actually happens in real use cases.
It would be a definite possibility to add a notice in the status bar. However, I have to ponder it. I'm afraid such a notice would be confusing to most users.
I’m pondering
-
> It would be a definite possibility to add a notice in the status bar. However, I have to ponder it. I'm afraid such a notice would be confusing to most users.
Well, since it'd be new, it's possible most people wouldn't even notice it. I think probably the cleanest solution would be a new option (with the default being disabled to remain compatible with existing behavior). When enabled, it would post a dialog offering to run it again when there are additional known dupes (maybe even loading the known similar ones from cache). "OK" would be the default so you could just hit "enter", and cancel would respond to ESC.
I’m encouraged
-
I'm pretty much done with the next releases and I implemented a "(X discarded)" notice in the stat line. It works as expected (if there is no notice, your next scan will be empty), but it's interesting to see how fuzzy things can get at lower thresholds. The notice could say "(1 discarded)", but when you scan again, you get 6 dupes and a new "(12 discarded)" notice. That's of course if you do "Add to Ignore list". If you delete the files, it won't happen, as the same files can't come back in the results. But when adding something to the ignore list, you only ignore pairs of files, not the files themselves, so the files can show up again, paired with other files.
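The difference between deleting and ignoring can be shown with a small Python sketch. This is only an illustration under assumed names (`visible_matches`, `deleted`, `ignored_pairs`), not the way dupeGuru stores its ignore list internally.

```python
def visible_matches(matches, deleted=(), ignored_pairs=()):
    """Return the matches a later scan could still report.

    Deleting removes a file from every pair it appears in, while the
    ignore list only hides the exact pairs that were added to it.
    """
    deleted = set(deleted)
    # Store pairs as frozensets so (A, B) and (B, A) count as the same pair.
    ignored = {frozenset(pair) for pair in ignored_pairs}
    return [
        (a, b)
        for a, b in matches
        if a not in deleted
        and b not in deleted
        and frozenset((a, b)) not in ignored
    ]


# B matches both A and C (hypothetical scan results).
matches = [("A", "B"), ("B", "C")]

# Deleting B: no match involving B can come back.
after_delete = visible_matches(matches, deleted=["B"])

# Ignoring the A-B pair: B can still show up, paired with C.
after_ignore = visible_matches(matches, ignored_pairs=[("A", "B")])
```

Here `after_delete` is empty, while `after_ignore` still contains the B-C pair, which is why ignored files can reappear in later scans.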
-
This sounds really good! I would just ask that you rethink the term "discarded" -- seems like it could get new users thinking that files got deleted or something. Also, since the exact number is subject to change when files get ignored, why not just say "Known files are pending" or something? Wouldn't that cover all the bases without scaring anybody? ;)
-
I don't think that "X discarded" is scary. Given the way dupeGuru works, where you have to mark and delete, I don't think people would think that dupeGuru would actually delete files during the scan.
The problem with "pending" is that it's not true. If you have 1 discarded match and you happen to delete a file contained in that match from your current results, the next scan will be empty.
This notice doesn't mean "the next scan will have stuff". The lack of notice means that the next scan won't.
-
Is there a chance the "X discarded" process could cache the potential matches? There was one time it listed nearly 100 "discarded" items! The next search cut it way down to 30-50 (I don't remember exactly) but it took numerous rescans to get the discard numbers down. When it gets down to the single digits it often takes an individual time-consuming scan for each discarded item. Any optimization of this weeding process would be gratefully welcomed!
By the way, prior to the discard notification, it was hit or miss as to how many more scans were needed, so it's definitely good to have the additional info. Thanks for adding that.
-
I looked at the grouping code this morning, and yes, I think it might be possible. It would complicate the grouping code, but I think it would be worth it. Then again, there's the possibility that when I start working on it, I realize that I was stupid and that it's not possible after all.
But I'll revisit that code again. I created a ticket for it.
-
Thanks much! You're not stupid, regardless, but I understand what you mean. :)
I see your point in the ticket about unconfirmed matches, too, but caching discarded matches would also obviate the need for a lengthy rescan (presuming nothing has changed the filesystem in the meantime). It could be left up to the user whether or not to worry about that.
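As a rough Python sketch of what reusing cached discarded matches might look like: the thread only proposes such a cache, so the function name `surviving_matches` and the idea of re-checking file existence are assumptions, not anything dupeGuru is known to do.

```python
import os


def surviving_matches(cached_matches, exists=os.path.exists):
    """Keep only the cached matches whose two files are both still present.

    `cached_matches` is a list of (path_a, path_b) pairs remembered from
    the previous scan. `exists` is injectable so the check can be replaced
    in tests; by default it asks the filesystem.
    """
    return [(a, b) for a, b in cached_matches if exists(a) and exists(b)]
```

The surviving pairs could then go back through grouping without a full rescan. This only catches deleted files; any other change to the filesystem since the scan would go unnoticed, which is the risk mentioned above.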