Is dupeGuru PE (1.5) using all available cores?
The new dupeGuru PE 1.5 is very thorough, but it struck me as a bit slower than it seemed like it should be if it were taking advantage of all the resources my system was making available to it. On an 8-core 3.0 GHz Mac Pro with 9 GB of RAM it took 45 minutes to get through a collection of 15,000 images at 95% hardness and exact dimensions, and 75 minutes without exact dimensions. I noticed in the Activity Monitor that dupeGuru PE was running 16 threads, but I know some of those are used for the UI. To what extent is dupeGuru PE taking advantage of multiple cores to process these work units in parallel?
The best answer from the company
-
dupeGuru PE only uses 1 core for matching/processing. I am aware that this is a problem, because PE is the only dupeGuru edition which has a big CPU-bound task to crunch. I really don't want to blame it on Python, but its threads are rather pseudo-threads: because of the interpreter's global lock, only one of them runs Python code at a time, so together they never use more than one core's worth of CPU. I could do this in multiple processes and do IPC (inter-process communication), but I have never done that and never got around to learning it.
The good news is Python 2.6 is coming out soon, and it will include a library to launch subprocesses and do IPC easily (code-wise, it is supposed to work much like threads). I look forward to it, and if it works well, I intend to use it in PE.
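As a rough illustration of the multi-process approach described above, here is a minimal sketch using the `multiprocessing` module that shipped with Python 2.6 (written in present-day Python 3 syntax). It is not dupeGuru's code: the fingerprint format, the `difference` metric, the threshold, and every function name are assumptions made up for the example. It only shows how a CPU-bound all-pairs comparison can be dealt out to worker processes so that each one runs on its own core.

```python
from multiprocessing import Pool


def difference(a, b):
    """Sum of absolute differences between two equal-length fingerprints."""
    return sum(abs(x - y) for x, y in zip(a, b))


def matches_for_rows(task):
    """Compare each fingerprint whose index is in `rows` with every later one.

    `task` is a (fingerprints, rows, threshold) tuple so that it can be
    passed through Pool.map as a single picklable argument.
    """
    fingerprints, rows, threshold = task
    found = []
    for i in rows:
        for j in range(i + 1, len(fingerprints)):
            if difference(fingerprints[i], fingerprints[j]) <= threshold:
                found.append((i, j))
    return found


def split_rows(n, parts):
    """Deal row indices round-robin so every part gets a similar workload."""
    return [list(range(k, n, parts)) for k in range(parts)]


def find_matches(fingerprints, threshold, processes=2):
    """Run the pairwise comparison across `processes` worker processes."""
    tasks = [(fingerprints, rows, threshold)
             for rows in split_rows(len(fingerprints), processes)]
    with Pool(processes) as pool:
        results = pool.map(matches_for_rows, tasks)
    return sorted(pair for chunk in results for pair in chunk)


if __name__ == "__main__":
    # Hypothetical 3-value fingerprints; real image fingerprints are larger.
    sample = [[0, 0, 0], [0, 1, 0], [9, 9, 9], [9, 8, 9]]
    print(find_matches(sample, threshold=1))
```

Rows are dealt round-robin rather than in contiguous blocks because early rows have more partners to compare against than late ones, so contiguous blocks would leave some workers idle sooner than others.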
-
Are you still interested in a dupeGuru PE version that uses more than one core? Since Python 2.6 came out, I've been toying with the new IPC library and it works rather well. Since I was into performance tweaks, I also rewrote the core bottleneck of the app (that's a tiny bit of code that takes all the matching time) in C. I hoped for a much better speed increase, but I still get a 2x-3x speedup on my dual core. I would be curious to see how much faster it gets on your 8-core machine. It's a very preliminary version, so crashes are likely to happen, but I managed to get through rather big scans here. Here's the link:
http://download.hardcoded.net/dupegur... -
I'm still very much interested, yes! I tried your new version on the same set of 15,000 images, same hardness (95%), with inexact dimension-matching, and this time instead of 75 minutes it completed in 15, which is a 5x improvement. This was with all 8 cores engaged and no other active apps, so dupeGuru PE had virtually 100% of the available processing power.
I noticed that with large (e.g. 4+ megapixel) images, all the cores were maxed out, while for smaller images the cores never exceeded about 50% utilization, so there may be some more efficiencies to be gained by pre-queuing, or by reducing the overhead involved in starting each new image, since it's clearly more efficient with large images than smaller ones.
Very promising in any case, though! I'll take a 5x speed-up any day! :)
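To put the 5x figure in context, the small sketch below computes the speedup and a rough per-core efficiency from the timings quoted in this post (75 minutes, 15 minutes, 8 cores). The function names are made up for the example. Note that this build also moved the bottleneck to C, so the result mixes the gain from the rewrite with the gain from using more cores; the efficiency number is only a loose indicator.

```python
def speedup(old_minutes, new_minutes):
    """How many times faster the new run is than the old one."""
    return old_minutes / new_minutes


def parallel_efficiency(speedup_factor, cores):
    """Fraction of the ideal 'one times per core' speedup actually achieved."""
    return speedup_factor / cores


s = speedup(75, 15)
print(s)                          # 5.0
print(parallel_efficiency(s, 8))  # 0.625
```

An efficiency of about 0.6 is in line with the observation that cores sat near 50% utilization on small images.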
-
Wow, this thread is already 8 months old... Anyway, I wanted to let you know that I released dupeGuru PE 1.7.0 today, which is much faster than 1.6.0, due to better optimized bottlenecks (thanks to Cython) and better multiprocessing code.
Normally, this version should be very, very fast (well, its algorithm still inherently takes time... but well, you know what I mean by "fast" :) ) on your 8-core machine. I'd be curious to know, in fact... -
The 1.7.0 version is definitely much faster--7 minutes 44 seconds to perform the initial analysis (empty cache) on 14,415 images (95% hardness, scaled images), 21 seconds to perform the dupe-matching thereafter. The initial analysis consumed perhaps 14% of each core's potential, which is understandable given that this stage is largely disk I/O-bound (albeit on a RAID 1+0 stripe in this case). The matching stage consumed 100% of all cores until it was finished about 20 seconds later. Very nice!
One minor note is that the progress bar during the initial analysis stage may be miscalibrated--it reached 14415/14415 yet only about 20% of the progress bar was filled in before it jumped to the dupe-matching bar.
Apart from that, I also noticed that my license doesn't seem to work with this new version. I purchased a license in October (for 1.5.0 when this thread was started), but 1.7.0 says my old registration code is invalid. -
21 seconds... woo, this seems almost too fast (those machines are monsters!)
As for the initial analysis, it is not multiprocessed, since it's mostly I/O-bound. There's also significant CPU processing going on, though, so I might try converting it to multiprocessing one day to see if it's worth it. What you saw must therefore be 100% of one CPU rather than 14% of all of them (100/8 = 12.5).
As for the miscalibration, on normal machines (2-cores like my laptop), especially when images are cached, the matching phase takes a much larger part of the scanning time. That's why I assigned 20% of the job progress to the analysis phase, and 80% to the matching phase.
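The fixed 20%/80% split described above can be sketched as a weighted sum of the two phases. This is only an illustration of the idea, not dupeGuru's actual progress code; the function and constant names are invented for the example. It shows why the bar sits at about 20% at the moment the analysis counter reaches 14415/14415.

```python
ANALYSIS_WEIGHT = 0.2  # share of the bar given to the analysis phase
MATCHING_WEIGHT = 0.8  # share of the bar given to the matching phase


def overall_progress(analysis_done, analysis_total,
                     matching_done, matching_total):
    """Combine both phases into a single 0.0-1.0 progress value."""
    a = analysis_done / analysis_total if analysis_total else 0.0
    m = matching_done / matching_total if matching_total else 0.0
    return ANALYSIS_WEIGHT * a + MATCHING_WEIGHT * m


# Analysis finished, matching not yet started (matching total is arbitrary).
print(overall_progress(14415, 14415, 0, 100))  # 0.2
```

On a machine where matching takes 21 seconds and analysis takes nearly 8 minutes, fixed weights like these will look badly off, which is the mismatch reported earlier in the thread.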
And for the registration, the initial build I published had a bug preventing registration from working. Just download it again (the version didn't change), it will work.
21 seconds... now that's a difference! -
I can confirm (using the core monitors in iStat Menus) that the analysis stage does indeed distribute its load across all 8 cores, varying from 8-15% utilization each for the duration (2-3% during subsequent runs when images have been cached). This parallelism may be taking place at a lower level however (e.g. software RAID drivers). The load is distributed more or less evenly, with one core slightly more utilized than the others.
As for the progress bar calibration issue, perhaps it would be simplest to just provide two progress bars, one above the other--one for the analysis stage, one for the matching stage--rather than trying to estimate the relative durations of each stage, since this proportion will clearly vary with hardware.
As for your comment about 21 seconds being perhaps "too fast", perhaps there is some truth to that. I noticed for instance that while the analysis stage examined 14,415 images, the 21-second matching stage only reported 6,211 images. Is this expected, or should the matching stage have processed the same number of images that were seen during the analysis stage? -
I was about to confirm with you that you really had "Match scaled" on (because when it's off, it is indeed possible to have a smaller number of images matched than analyzed), when I realized that there's a bug causing the Match Scaled option to be reversed.
Another possibility is that dupeGuru PE choked on pictures during the analysis phase. If it did, your Console will tell you.
So, until I release v1.7.1, set Match Scaled off if you want it to be on. -
All right, switching "match scaled" off (to actually turn it on) provides more realistic results, though they are somewhat disappointing. The "preparing for matching" stage lasts a full minute now, and the matching stage itself took about 45 minutes to complete, using 100% of all cores for the duration. This is about 3 times slower than 1.5.1, and with significantly more processor utilization.
-
Now that's strange... I'll try to figure it out when I have some time. -
I've been looking at it, and while I found an error in the bottleneck code that caused 7 needless conversions (thus theoretically making the scanning process significantly slower), the net gain in speed I get is not so great. But then again, I did not experience the slowdown you had from 1.5.1 to 1.7.0. I think that the bottlenecks on my machine and on yours occur at different places. I'll include these changes in the 1.7.1 release (which fixes the "match scaled" bug). Hopefully, this will at least get you back to the 1.5.1 levels (those extra conversions were made at the core of the bottleneck code, so it really should make a difference). -
With 1.7.1 the same task took about 40 minutes, all at 100% utilization on all eight cores. It's an improvement, but not a big one, and it certainly does not compare to the performance of 1.5.1, which did the same job in 15 minutes using only 50% of each core's potential for most of the run. I do appreciate your efforts, though, and if you have a debugging version you want me to run to log some profiling data for you I'd gladly do that and send you the results.
-
This time, I think I got it. I haven't released 1.7.2 yet (I still have the windows build to make), but it's on the server at http://download.hardcoded.net/dupegur... -
Ok, 1.7.2 is a significant improvement over the earlier 1.7.x releases--the same tests I've been running above now complete in 19 minutes. It's still not quite as fast as 1.5.1 was, despite using 100% of all cores the whole time, which is a bit puzzling, since 1.5.1 wasted a lot of processor cycles. Still, it's in the same ballpark now, which is encouraging :)