-
Notifications
You must be signed in to change notification settings - Fork 3
Support deduplication of various formats #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
deduplication/args.py
Outdated
| ) | ||
| parser.add_argument( | ||
| "--skip-insertion", | ||
| help="If set, will skip inserting entries to index. THis is a MOCK ARG", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit on capitalization here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yadudoc I think you made this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are fixed in #8.
| case _: | ||
| raise NotImplementedError(f"{p.suffixes[-2:]} is not a supported filetype") | ||
|
|
||
| def compute_minhash_for_anyfile(infile: str, output_dir: str, num_perm: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this covers all filetypes, perhaps we should delete compute_minhash_for_file below and refactor our minhasher to use this function instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if this was used else where or upstream or if this was something you wrote. If you wrote this compute_minhash_for_file, I'll remove/refactor it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think we can refactor this here. The only other thing to note is that the MinHasher class has another compute_minhash_for_file function that needs to be updated to reflect the new change - that MinHasher method is used elsewhere in workflows.py.
|
There are several changes from me that need cleanup here; I will address these as soon as possible. |
…ents * When `skip_insertion` is enabled, unique entries found are not inserted into the index. This enables deduplicating against an index without modifying it.
|
@123epsilon @robertu94 Is there anything that's pending here that's blocking? |
|
@yadudoc @robertu94 Not to my knowledge - I am partial to merging #7 first because its small, there's some overlap, and its mostly minor bugfixes |
@123epsilon and @yadudoc here are the changes to support various formats we can review when we chat next week.