find duplicate photo's - fuzzy

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

find duplicate photo's - fuzzy

Post by ErikJan »

I have thousands of pictures. They are generally well organized (named yyyy-mm-dd etc.) but sometimes I find a new album or box somewhere with undated pics, all of which I digitize.
I'm looking for a picture scanning tool that allows fuzzy matching, i.e. it should flag pics that are almost identical (but also lower res versions, JPG vs TIFF).

I know there are many out there, but the problem is that each scan starts fresh and seems to save nothing. That means that if I have 200 new pics and want to check them for duplicates against my 20K collection, every run takes at least 200 x 20K comparisons, and often many more because every file is compared with every other file (which will take days per scan).
I'd like a tool that scans everything and creates and stores a 'fingerprint' of every picture in my collection ONE time, so that new pics can be compared super fast (and if a likely match is then found, I'm fine if it still reads the files for an in-depth analysis).
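
To make the numbers and the "store it ONE time" idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names `comparison_counts` and `update_index` and the JSON index layout are hypothetical, and the `fingerprint` argument stands in for whatever similarity fingerprint a real tool would compute.

```python
import json
import os
from typing import Callable, Dict, Iterable, Tuple


def comparison_counts(new: int, existing: int) -> Tuple[int, int]:
    """Return (new-vs-existing comparisons, all-pairs comparisons)."""
    new_vs_existing = new * existing
    total = new + existing
    all_pairs = total * (total - 1) // 2
    return new_vs_existing, all_pairs


def update_index(paths: Iterable[str], index_file: str,
                 fingerprint: Callable[[str], str]) -> Dict[str, dict]:
    """Fingerprint only files that are new or changed since the last run.

    The index is a JSON file (hypothetical layout) keyed by path; size and
    modification time decide whether a stored fingerprint can be reused.
    """
    index: Dict[str, dict] = {}
    if os.path.exists(index_file):
        with open(index_file, "r", encoding="utf-8") as fh:
            index = json.load(fh)

    for path in paths:
        stat = os.stat(path)
        entry = index.get(path)
        if entry and entry["size"] == stat.st_size and entry["mtime"] == stat.st_mtime:
            continue  # unchanged file: keep the stored fingerprint
        index[path] = {
            "size": stat.st_size,
            "mtime": stat.st_mtime,
            "fp": fingerprint(path),  # the expensive step, done once per file
        }

    with open(index_file, "w", encoding="utf-8") as fh:
        json.dump(index, fh)
    return index
```

With 200 new pics and a 20K collection, `comparison_counts(200, 20000)` gives 4,000,000 new-vs-existing comparisons against 204,009,900 if every file is compared with every other file. A second call to `update_index` on unchanged files does not call `fingerprint` at all.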

To date, I don't think I've found something like that so I'm really hoping someone in this forum can help me in my search ;-)

stuck
Panoramic Lounger
Posts: 8195
Joined: 25 Jan 2010, 09:09
Location: retirement

Re: find duplicate photo's - fuzzy

Post by stuck »

I've never found a simple way to find duplicates in my photo collection. I work on the principle that digital storage is cheap and I have plenty of it, so I can deal with duplicates manually, as and when I come across them. Life's too short for any other approach.

Ken

ChrisGreaves
PlutoniumLounger
Posts: 15660
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: find duplicate photo's - fuzzy

Post by ChrisGreaves »

ErikJan wrote:
10 Mar 2023, 21:53
... to check for duplicates in my 20K collection, every run could take 200 x 20K comparisons ...
Hi ErikJan.
I know almost nothing about content matching, but I do know what I have worked towards in the past.

My problem is not visual images but musical tracks, and to compound matters I am cursed with a musical ear; I suspect that this is the equivalent of your "fuzzy but similar" images. FWIW I too started scanning photos back in 2011 and have a collection of originals (4MB images) and a subset used in web pages, VSOImageResizered down to about 500KB each.

So in theory we can discuss theory.

A gazillion 3rd party applications claim to locate duplicates. Mostly they match name, and sometimes size in bytes.
Musical MP3 tracks can be identical duplicates but one track has the original diacritical marks in the file name, the other does not.
I have d/l the same version of Eugène Gigout's "Grand Chœur Dialogué" five times (Well, I *like* it!) but trimmed the ending two minutes of applause from the end. Or (Audacity.exe) used Effects, Fade-out on the applause.
Worse: I have three versions of Don and Phil singing "Temptation", one of the versions remastered.

I worked around diacritics and teenage mis-spellings ten years ago with something like Soundex, but that still left me with my excellent musical ear and memory (equivalent to your "fuzzy images") which daily tells me "I've just heard this a day or two ago; why am I hearing it again?"

The answer for MP3 tracks would appear to be in the "packets" of sound of which the body is composed. I reason that no matter that I have trimmed out the applause as the conductor strides onto the stage, if the tracks are from the same recording, then the internal "packets" of sound, each packet heralded by a meta-tag of some sort, should match.

So I can ferret out some duplicates by extracting, say, three such packets from one track and then trying to match them, in the same sequence, against the packets in a second suspect track. Real Soon Now, I promise! [[ I got as far as storing an array of filenames with (say) three checksum packets for each track, then quickly eliminating non-matching tracks from contention with the array stored in RAM; this is akin to Void's use of indexes in Everything.exe ]]
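
For anyone who wants to try the idea, here is a rough Python sketch of the "three packets in the same sequence" test. It is not my actual code: the function names are made up for illustration, CRC32 is just one possible checksum, and it assumes each track has already been split into its packets (parsing real MP3 frames is not shown).

```python
import zlib
from typing import List, Sequence


def packet_checksums(packets: Sequence[bytes]) -> List[int]:
    """One CRC32 checksum per packet (packet splitting is assumed done)."""
    return [zlib.crc32(p) for p in packets]


def find_run(sample: Sequence[int], checksums: Sequence[int]) -> int:
    """Index where `sample` occurs as a consecutive run, or -1 if absent."""
    n = len(sample)
    for i in range(len(checksums) - n + 1):
        if list(checksums[i:i + n]) == list(sample):
            return i
    return -1


def probably_same_recording(track_a: Sequence[bytes],
                            track_b: Sequence[bytes],
                            run_length: int = 3) -> bool:
    """Take `run_length` packets from the middle of A and look for them in B."""
    sums_a = packet_checksums(track_a)
    sums_b = packet_checksums(track_b)
    if len(sums_a) < run_length:
        return False
    start = (len(sums_a) - run_length) // 2
    return find_run(sums_a[start:start + run_length], sums_b) >= 0
```

Sampling from the middle of the track is deliberate: trimmed applause or a fade-out changes the ends, not the middle. A re-encoded copy (different bit rate) would have different packets throughout, so this kind of test would miss it.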

Which leads me to:-
(1) Do images in the same format (for example, both JPEG or both PNG or even, at a pinch, both converted to BMP for the comparison) consist of like-minded packets of data
(2) Would Everything with its Dupes: function and its Content: modifier and its new all-singing all-dancing SHA256 mechanism be capable of getting the job done?

Cheers, Chris
He who plants a seed, plants life.

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Chris,

Heard about Similarity? https://www.similarityapp.com/. That is something I've been using a bit a while back for music.
Just realized this would also work for images so maybe I'll re-check that.

On Everything, I thought that was a super quick search tool only... does that do fuzzy then?

PS. My pics are all 'behaved' (so rotated correctly), but I would like to be able to deal with JPG and TIFF and with different resolutions and crops (B&W vs color would be nice to have, not need to have)

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Not sure if I'm supposed to mention commercial tools here but I continued my search a bit more this afternoon and found this site https://www.mindgems.com/

They have a tool called "Duplicate Image Finder" which seems to be doing exactly what I was looking for... They also have a tool that does this for audio: "Audio Dedupe". Both are fuzzy, and I quickly tested the demos and was impressed by the speed (and they allow a cache to be saved so that future scans are super fast). I'll do some more testing but I'm considering buying these...

stuck
Panoramic Lounger
Posts: 8195
Joined: 25 Jan 2010, 09:09
Location: retirement

Re: find duplicate photo's - fuzzy

Post by stuck »

They seem to have at least four variations of the same software:
https://www.mindgems.com/products/VS-Du ... -About.htm
https://www.mindgems.com/products/Fast- ... -About.htm
https://www.mindgems.com/Duplicate-Photo-Finder.html
https://www.mindgems.com/products/Dupli ... -About.htm

Also, Mindgems has only two reviews on Trustpilot and both are warnings to avoid the company:
https://uk.trustpilot.com/review/www.mindgems.com

Ken

EDITED TO ADD:
Another thing that makes me feel uncomfortable about this company is the very dated 1990s text under System Requirements where they mention the need for 'True Color display and video card'

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Image Finder is fuzzy and paid. Fast Duplicate is 100% match only and free. The last one is the one for audio I mentioned.
But you are right... not sure what "Duplicate Photo Finder" is and how it relates to "Duplicate Image Finder" (Update: it seems to be the same tool)

Thanks for the Trustpilot warning, I read them. The second I could understand, not nice maybe but they gave out the license information so I could understand. The first is weird... it detects all duplicates and the person paid. Then the paid version would not delete... why would they do that? The tool works... I did search on Google for reviews but didn't find anything too negative. But OK, I didn't buy this yet ;-)

ChrisGreaves
PlutoniumLounger
Posts: 15660
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: find duplicate photo's - fuzzy

Post by ChrisGreaves »

ErikJan wrote:
12 Mar 2023, 14:28
Heard about Similarity? https://www.similarityapp.com/. That is something I've been using a bit a while back for music.
Thanks for this tip, ErikJan, and I note the discussion about other packages that this has sparked.

I don't expect to eliminate all duplicate tracks because "duplicate" is in the ear of the beholder. For some folks one version of Widor's "Toccata from Symphony No. 5" is too many. For at least one other person twenty-five is fifty too few.
ErikJan wrote:
On Everything, I thought that was a super quick search tool only... does that do fuzzy then?
That's what I thought prior to 31st January this year when I abandoned the idea of a few quick tips and decided to go the whole hog and write a User Tutorial. Topics to date include Basic Everything Help, Bookmarks Rework, Column Formulas, Command Line Interface, Command Line Options, CommandLineInterface, CommandLineOptions, Comments, Content Indexing, Customizing, Dark Mode, DatabasesOfEverything, Decomposition String Matching, Editorial Board, eraseme, Etp, Everything Server, Everything Service, Everything, EverythingManagement, EverythingService, Extracizes, FAT Indexing, File Lists, File Types Known to Everything, FileLists, Filter Rework, Find Duplicates, Findbar, Folder Indexing, FolderIndexing, Functions, Guide To Template Features, Hard Link Tracking, Http, Ignore Punctuation and Ignore White Space, Import and Export Settings, Index Journal, Index Virtual Folders, Indexes, INI settings, Ini, Installation, Installing Everything, Keyboard Shortcuts, KeyboardShortcuts, Lexicon, Macros, Mix Files and Folders, Multiple Instances, MultipleInstances, Omit Results, Options, Plugin Support, Prefix and Suffix Search Options, Previous Versions, PreviousVersions, Properties, Reason For Being, Recent Changes, RecentChanges, Results, Regular Expressions, Run History, RunHistory, Sample Data Sets, Sdk, Search Commands, Search Functions, Search History, Search Modifiers, Search Preprocessor, SearchHistory, Searching, Searching2, Simplest Search, SimplestSearch, Sorting, Stickied Threads, Supported Languages, SupportedLanguages, Template, Terminology, Translating, Traps For Young Players, TrapsForYoungPlayers, Troubleshooting, Tutorials, UI, Undo System, Uninstalling Everything, UninstallingEverything, Update in Background, User Tutorial, Using Everything, Weighted Searches, and this weekend's ZIP file uploaded at 56MB, of which 20MB are screenshots, admittedly.

David Carpenter's only fault is that he responds to every question with "I'll put it on my Todo List", and he is churning out enhancements faster than I can document. An Exciting application.

I suggest you post an enquiry under "Everything 1.5 Alpha" or in "Suggestions".
Cheers, Chris
He who plants a seed, plants life.

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Thanks Chris. Again, I'm not an expert in "Everything" but fuzzy searching is very different from 'normal' searching. It requires very special algorithms in order to determine matches and calculate 'similarities'. With these you can identify the different versions of Widor's "Toccata from Symphony No. 5" (to use your example) and determine how 'close' the different versions are.
In my -audio- situation, the recordings are the same and I want to find them even if the files are different (e.g. shorter recording by a few seconds, different bit rate). And yes, I DO want live and studio versions to stay as to me these are certainly not duplicates.
Maybe I misunderstand "Everything" (and I read some of your tips in this forum), but it still seems to me that this is an (admittedly) advanced search tool that can't do similarity scoring. I can imagine it might identify real duplicates via hash or bit-for-bit comparison. Am I wrong?

ChrisGreaves
PlutoniumLounger
Posts: 15660
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: find duplicate photo's - fuzzy

Post by ChrisGreaves »

ErikJan wrote:
12 Mar 2023, 21:26
Thanks Chris. Again, I'm not an expert in "Everything" but fuzzy searching is very different from 'normal' searching. It requires very special algorithms in order to determine matches and calculate 'similarities'. With these you can identify the different versions of Widor's "Toccata from Symphony No. 5" (to use your example) and determine how 'close' the different versions are.
In my -audio- situation, the recordings are the same and I want to find them even if the files are different (e.g. shorter recording by a few seconds, different bit rate). And yes, I DO want live and studio versions to stay as to me these are certainly not duplicates.
Maybe I misunderstand "Everything" (and I read some of your tips in this forum), but it still seems to me that this is an (admittedly) advanced search tool that can't do similarity scoring. I can imagine it might identify real duplicates via hash or bit-for-bit comparison. Am I wrong?
Hello again ErikJan.
I can't really answer your question!
I have little idea what "similarity scoring" is, but I may have thought about it under a different name.

I am writing a user tutorial NOT because I know about Everything, but because I want to know about Everything, and my experience is that the best way to learn is to teach!
I figure it will take me another three months just to master and document Functions and Modifiers, let alone things like Macros, Bookmarks, command-line processing.

I stumbled across posts about SHA256, but have not used it or pursued it (I have about a dozen on-the-go threads hanging around from as far back as Feb 20th)

My advice would be to trot over to voidtools forums, register, and then post a question about "duplicates" and "content searching".
"Dupes:" is the function in Everything. Or it may be that you will read something in the Dupes: posts that rings a bell for you, something I would not have recognized.
There again, Void may have already researched your field of duplicates and have an answer that he has not yet exposed in the forums.

Everything 1.5a is pretty stable right now.
Cheers, Chris
He who plants a seed, plants life.

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Chris, I registered in the void forum (just in case) and searched for similarity / fuzzy. I found a feature request from 2021 but no more. Then I scanned the info on the site but I couldn't find references. I think it's as I wrote: Everything is a simple but very powerful search engine. It can search inside files but it cannot find duplicates, let alone 'almost duplicates'. The latter is even more complex, because 'almost' is not the same as a small bit-wise difference: for sound, such a tool uses a 'sounds like' algorithm (so it can still detect 'duplicates' with different bit-rates), and for pictures it does something similar (which detects images with different resolutions or even rotated pictures).

stuck
Panoramic Lounger
Posts: 8195
Joined: 25 Jan 2010, 09:09
Location: retirement

Re: find duplicate photo's - fuzzy

Post by stuck »

ErikJan wrote:
10 Mar 2023, 21:53
...
I'd like a tool that can scan all and creates and stores a 'fingerprint' of every picture...
I've reread that bit of your original post. While I can imagine that such a digital fingerprint could be generated, I can't see how it would help with finding duplicates. I say that because I'd expect such a fingerprint would have to be based on the 'bits' in each image, but the 'bits' of an image of a tree encoded as a JPEG will be very different to the 'bits' of an identical image that's encoded as a TIFF, even though the two images are a photo of the same tree.
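
For reference, a fingerprint can also be computed from the decoded pixels rather than from the file's bytes, which sidesteps the JPEG vs TIFF problem. The Python sketch below uses a simple "average hash" as an illustration; I am not claiming this is what any of the tools mentioned in this thread actually do. It assumes the image has already been decoded to a grid of grayscale values (a real script would need an imaging library for that step), and the function names are my own.

```python
from typing import List

Grid = List[List[float]]


def shrink(pixels: Grid, size: int = 8) -> Grid:
    """Block-average a grayscale grid down to size x size cells."""
    height, width = len(pixels), len(pixels[0])
    out: Grid = []
    for r in range(size):
        rows = range(r * height // size, (r + 1) * height // size)
        row_out = []
        for c in range(size):
            cols = range(c * width // size, (c + 1) * width // size)
            total = sum(pixels[y][x] for y in rows for x in cols)
            row_out.append(total / (len(rows) * len(cols)))
        out.append(row_out)
    return out


def average_hash(pixels: Grid, size: int = 8) -> int:
    """One bit per cell: 1 if the cell is brighter than the overall mean."""
    cells = [v for row in shrink(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; small means 'looks similar'."""
    return bin(a ^ b).count("1")


# A 32x32 left-to-right gradient, a half-resolution copy, and a negative.
full = [[x * 8 for x in range(32)] for _ in range(32)]
half = shrink(full, 16)
inverted = [[255 - v for v in row] for row in full]

print(hamming(average_hash(full), average_hash(half)))      # 0
print(hamming(average_hash(full), average_hash(inverted)))  # 64
```

The 64-bit hash is tiny, so it can be stored per picture and compared very quickly. A hash this simple copes with resizing and re-encoding but not with crops or rotations.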

What you really want is some sort of pattern recognition algorithm and that realisation prompted me to look at the 'Tools' menu in Google Picasa. There is a feature on that menu, buried under 'Experimental', called 'Show Duplicate Files'. Using that, Picasa found 226 pictures in 29 albums in 0.029 secs from the 32,746 pictures it has indexed on 'My Computer'. Some of the 226 pictures are obvious duplicates but many are just near matches. Some are shown as single pictures, yet the folders they are in do indeed contain duplicates. It's clearly not a perfect tool but it certainly is a simple starting point if you really do feel compelled to find duplicates in your collection.

The flaw in this approach is that Google dropped Picasa years and years ago so it is no longer available for download (unless you want to risk getting it, and perhaps more than you'd like, from a third party website). This is not a fatal flaw though; send me a PM and I will explain why not.

Ken

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Yeah, maybe I formulated it in the wrong way. Of course, I didn't mean a real fingerprint (as in a hash), because we're not looking for real duplicates but mostly for 'approximate duplicates'. I meant a representation that is calculated once and then stored, so that new pictures can be quickly evaluated without redoing all the calculations on the original large collection.
And yes, I know Picasa and I used it in the past. As you said, it's old and the option you mention is (was) experimental (I have used it in the past). There are better tools out there nowadays and I've been in contact with the company I talked about earlier and I might actually try that out as it looks very nice (and they have tools for picture and audio!). I'll report back here what I will find.

stuck
Panoramic Lounger
Posts: 8195
Joined: 25 Jan 2010, 09:09
Location: retirement

Re: find duplicate photo's - fuzzy

Post by stuck »

ErikJan wrote:
14 Mar 2023, 10:07
...I know Picasa and I used it in the past...
Ah, OK, no problem.

Maybe it's because my main data HDD has a capacity of 10 TB but I see no reason to put any effort into looking for duplicates. I just deal with them if I come across them. However, if we all thought alike, the world would be very boring :smile:

Ken

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Yep, agree. Actually, you might be better off than I am... :cheers: My brain is perfectionistic and very detail oriented. All my pictures and music files are well organized in folders and named uniformly. Pictures all start with YYYY MMM DD - MMHHSS, are geotagged etc. And I hate doubles or "almost doubles". I'm also lazy... that explains my frantic search for tools that help me achieve what my brain wants :hairout: :laugh:

ChrisGreaves
PlutoniumLounger
Posts: 15660
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: find duplicate photo's - fuzzy

Post by ChrisGreaves »

ErikJan wrote:
14 Mar 2023, 12:01
...I'm also lazy... that explains my frantic search for tools that help me achieve what my brain wants :hairout: :laugh:
Don't knock "lazy". That's what got me into computers in the first place. :grin:

@ErikJan and @Stuck

Over the (audio music track MP3) years I have continuously refined my attitude towards duplicates.
The disk space doesn't bother me. I am more concerned with hearing "the same" piece of music twice in the same fortnight.

To me a track is a duplicate if it sounds boring to me within a time span!

So much so that I have a new randomizer in my jukebox: It loads WinAmp with nine hours of tracks taken one at a time from the stalest 1,000 tracks in the data base (MDB) and sets the DatelastPlayed cell to NOW(). With 75 days of music in 19,162 tracks you'd think I would be satisfied, right? But a few years ago I developed a crush on Doris Day and ...

Today when I think of "finding duplicate tracks" I think of low-hanging fruit.
(1) track Name and
(2) track Size
(3) a Soundex track name match
(4) First 1000 or last 1000 bytes match
(5) Packets within each track
and so on.
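
As a rough illustration of items (1) to (4), here is a Python sketch that computes the cheap keys and groups tracks that share one of them. It is not the code I actually use: the helper names are invented, the Soundex is the standard American variant applied word by word, CRC32 stands in for comparing the raw first and last 1000 bytes, and tracks are held in memory as a name-to-bytes dictionary to keep the example short.

```python
import unicodedata
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

# Standard American Soundex digit for each consonant.
CODES: Dict[str, str] = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Soundex code of one word, with diacritics stripped first."""
    plain = unicodedata.normalize("NFKD", word).lower()
    chars = [ch for ch in plain if "a" <= ch <= "z"]
    if not chars:
        return ""
    result = chars[0].upper()
    prev = CODES.get(chars[0], "")
    for ch in chars[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]


def name_key(title: str) -> Tuple[str, ...]:
    """Soundex of every word in a track name."""
    return tuple(k for k in (soundex(w) for w in title.split()) if k)


def cheap_keys(name: str, data: bytes, edge: int = 1000) -> dict:
    """Keys (1) to (4): name, size, Soundex name, first/last bytes."""
    return {
        "name": name.lower(),
        "size": len(data),
        "soundex": name_key(name),
        "head": zlib.crc32(data[:edge]),
        "tail": zlib.crc32(data[-edge:]),
    }


def candidate_groups(tracks: Dict[str, bytes], key: str) -> List[List[str]]:
    """Groups of track names that share the chosen cheap key."""
    groups = defaultdict(list)
    for name, data in tracks.items():
        groups[cheap_keys(name, data)[key]].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

For example, `candidate_groups(tracks, "soundex")` would put "Eugène Gigout" and "Eugene Gigout" in the same group despite the diacritic. The groups are only candidates; the ear still has the final say.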

The best that these schemes can do in my opinion is build up small groups of files that MIGHT be edited versions of an original track.

Today, only by listening can I decide whether any one of the ten versions of "Mussorgsky Pictures at an Exhibition" can be turfed, and that decision is almost certainly influenced by my mental state or mood at the time of decision.
That is, it can't be a hard science or algorithm.

Cheers, Chris
He who plants a seed, plants life.

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Again, think about an original MP3 that you created at 320 kbps (high quality) and imagine there is another file with exactly the same music, the same audio length, the same track name, same composer, same recording. But this version is lower quality (e.g. 128 kbps). The files are digitally very different, the file sizes are different and the first and last x bytes are different. The same goes for an MP3 vs a WAV or FLAC.
Personally, I'd like to remove the lower quality duplicate(s) and keep the higher quality one(s)... And you?

stuck
Panoramic Lounger
Posts: 8195
Joined: 25 Jan 2010, 09:09
Location: retirement

Re: find duplicate photo's - fuzzy

Post by stuck »

ErikJan wrote:
14 Mar 2023, 12:01
...My brain is perfectionistic and very detail oriented. All my pictures and music files are well organized in folders and named uniformly...
Me too!

I just realised that life is too short to worry about a few (hundred) duplicates in a digital collection of over 30,000 images :laugh:

Have fun!

Ken

ErikJan
BronzeLounger
Posts: 1264
Joined: 03 Feb 2010, 19:59
Location: Terneuzen, the Netherlands

Re: find duplicate photo's - fuzzy

Post by ErikJan »

Update: as I indicated I purchased "Duplicate Image Finder" (https://www.mindgems.com/products/VS-Du ... -About.htm) after having explored the GUI and capabilities a bit with the free demo version.
I'm still playing with it but I can already say that it does exactly what they promise! :cheers: I'm quite impressed by the setup and the speed. Moreover, I've been in contact with MindGems and I was open about the TrustPilot reviews that were shared. They are quick to respond and indicated they were very frustrated by these reviews and have tried to get things corrected. I must admit that what I've been seeing so far has already convinced me that these TrustPilot reviews are false.
I don't want to draw conclusions too fast here but this seems to be a very good tool (the best I've seen), their help is fast and to the point.
Will spend more time the coming weeks playing with the tool as it has many options and the help file isn't always complete (I commented on that and they admitted that and indicated they'd update). I will further update you all here, but for now I can probably already say that if you want to get a flavor, you can safely try the demo. Of course it's not fully functional (no one would expect that) but it does work.
Initial scan of your collection is slow (although MUCH faster than what I've seen with other tools I tested briefly), after that, all update scans and new tasks run a few hundred times faster.
I'm thinking that AudioDeDupe, another tool that does the same with music files (that is: "similarity matching"), will work in the same way, and I'd suggest that if you're looking for that, it would be worth trying that one out as well. I'll certainly consider playing with the demo myself later.

ChrisGreaves
PlutoniumLounger
Posts: 15660
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: find duplicate photo's - fuzzy

Post by ChrisGreaves »

stuck wrote:
14 Mar 2023, 14:27
I just realised that life is too short to worry about a few (hundred) duplicates in a digital collection of over 30,000 images :laugh:
Hi Ken, on this you are correct. I did the maths some thirty years ago in Toronto when I calculated that it was cheaper for me to drive across town, buy a bigger IDE drive, drive back, format it and XCOPY files across - than it was to spend time looking at and in folders to determine how best I could free up some space.

That said there is still/always the perceived loss of value with duplicates; in my case having to listen to specific tunes far too frequently.
Cheers, Chris
He who plants a seed, plants life.