Skip to content

Conversation

@srandall02
Copy link
Collaborator

No description provided.

@srandall02 srandall02 requested a review from gkarthik August 30, 2023 02:39
@mindoftea mindoftea self-requested a review September 6, 2023 16:35
@mindoftea
Copy link

Sorry for the delay in writing back, but like we talked about this is looking good. The logic for handling combinations of mindate/maxdate/ndays to select the right date range seems complete. At this point, I think the only remaining challenge is to combine your date-filtering code with the prevalence-filtering logic from the original piece of code and test by comparing the two locally -- the behavior should be the same when min_date and max_date aren't specified. The intended behavior for this version is that a lineage will be grouped into 'other' iff it is (a) not in keep_lineages, and (b) occurs with a prevalence greater than the prevalence_threshold for at least nday_threshold days within the date range.

@srandall02 srandall02 marked this pull request as ready for review October 2, 2023 04:42
keep_lineages.append(lineages_to_retain)
date_limit = dt.strptime(max_date, "%Y-%m-%d") - timedelta(days=ndays) # searches from max_date to ndays back
df = df[(df["prevalence"] >= prevalence_threshold) & (df['date'] < max_date) & (df['date'] > date_limit)]
num_unique_dates = df[df["date"] >= date_limit]["date"].unique().shape[0]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you've already filtered the df at this point, I think you can skip the [df["date"] <= date_limit] part of this line and the two similar ones above, and then just have one line that does this after the if statement.



if min_date and max_date:
df = df[(df["date"].between(min_date, max_date)) & (df["prevalence"] >= prevalence_threshold)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you just want to filter by date here, and not by prevalence just yet. Later, you're removing the lineages which don't have enough days above the prevalence threshold, but we still want to return data for low-prevalence days for lineages that are above the threshold.


if num_unique_dates < nday_threshold:
nday_threshold = round((nday_threshold/ndays) * num_unique_dates)
lineage_counts = df["lineage"].value_counts() #number of times lineage is found in df

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put your prevalence threshold filter down here instead, you should get the same counts as your current version, but there won't be gaps in the dataframe on low prevalence days

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants