@robertu94 robertu94 commented Sep 11, 2025

@123epsilon and @yadudoc, here are the changes to support various formats; we can review them when we chat next week.

)
parser.add_argument(
"--skip-insertion",
help="If set, will skip inserting entries to index. THis is a MOCK ARG",
Collaborator

nit on capitalization here :)

Author

@yadudoc I think you made this change.

These are fixed in #8.

case _:
raise NotImplementedError(f"{p.suffixes[-2:]} is not a supported filetype")

def compute_minhash_for_anyfile(infile: str, output_dir: str, num_perm: int):
Collaborator

Since this covers all filetypes, perhaps we should delete `compute_minhash_for_file` below and refactor our minhasher to use this function instead.

Author

I wasn't sure whether this was used elsewhere or upstream, or whether it was something you wrote. If you wrote `compute_minhash_for_file`, I'll remove or refactor it.

Collaborator

Yeah I think we can refactor this here. The only other thing to note is that the MinHasher class has another compute_minhash_for_file function that needs to be updated to reflect the new change - that MinHasher method is used elsewhere in workflows.py.

yadudoc commented Sep 13, 2025

There are several changes from me that need cleanup here; I will address these as soon as possible.

yadudoc and others added 3 commits September 17, 2025 10:20
…ents

* When `skip_insertion` is enabled, unique entries found are not inserted into the index. This enables deduplicating against an index without modifying it.
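
The behavior described in this commit message can be sketched as follows. This is a simplified illustration, not the project's implementation: it swaps the real index for a plain Python `set` of exact keys, and the function name `deduplicate` and the `(key, value)` entry shape are assumptions.

```python
def deduplicate(entries, index: set, skip_insertion: bool = False) -> list:
    """Return the entries whose key is not already in the index.

    Hypothetical sketch: a set of exact keys stands in for the real index.
    """
    unique = []
    for key, value in entries:
        if key in index:
            continue  # already known, so treat as a duplicate
        unique.append((key, value))
        if not skip_insertion:
            # Normal mode: remember the entry for later lookups.
            index.add(key)
        # With skip_insertion the index is left untouched (read-only check).
    return unique


index = {"a"}
found = deduplicate([("a", 1), ("b", 2)], index, skip_insertion=True)
# found holds only the "b" entry, and index is still {"a"}.
```

In this simplified version, two identical new entries in the same batch are both reported as unique when `skip_insertion` is set, because nothing records the first one. Whether the real code behaves this way is not stated in the thread.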
@robertu94 robertu94 marked this pull request as ready for review October 3, 2025 20:01
yadudoc commented Oct 29, 2025

@123epsilon @robertu94 Is there anything still pending here that's blocking?

@123epsilon
Collaborator

@yadudoc @robertu94 Not to my knowledge. I am partial to merging #7 first because it's small, there's some overlap, and it's mostly minor bug fixes.
