Skip to content

feat(comments): Add runner for comments migration separately#380

Open
sakshamarora1 wants to merge 5 commits intoCERNDocumentServer:masterfrom
sakshamarora1:feature/comments_migration
Open

feat(comments): Add runner for comments migration separately#380
sakshamarora1 wants to merge 5 commits intoCERNDocumentServer:masterfrom
sakshamarora1:feature/comments_migration

Conversation

@sakshamarora1
Copy link
Contributor

@sakshamarora1 sakshamarora1 commented Feb 2, 2026

closes: #286

Steps

  1. Update the collection queries for a collection, retreive all the comments for the records in the records found and create a json metadata file.
ipython ./scripts/dump_comments_to_migrate.py

Output file: comments_metadata.json
The above script also runs this script partly to generate the missing_users.json file

  1. Ensure poeple.csv, missing_users.json in /eos/media/cds/cds-rdm/<env>/migration/users/ directory and comments_metadata.json file in /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/

IMPORTANT: We are not running the ./scripts/copy_comments_attached_files.py yet (or we can, for thesis and IT there are none anyways) [See child issue in the main issue attached]

  1. Create those users (using people.csv containing person_id already placed in the /eos/media/cds/cds-rdm/<env>/migration/users/):
cds-migrator-kit comments commenters-run --filepath /eos/media/cds/cds-rdm/<env>/migration/users/missing_users.json --missing-users-dir /eos/media/cds/cds-rdm/<env>/migration/users/ --dry-run

(Remove --dry-run)

  1. Finally migrate the comments:
invenio migration comments --filepath /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/comments_metadata.json --dirpath /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/ --dry-run

(Remove --dry-run)

@sakshamarora1 sakshamarora1 marked this pull request as ready for review February 4, 2026 16:33
self.all_record_versions = {
str(hit["versions"]["index"]): hit for hit in search_result
}
oldest_version = min(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it be faster via record._record.versions[-1]? I mean instead of scan_versions etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That returns an instance of VersionsManager and it doesn't have other versions stored in it. We will have to do scan_versions to find all the versions and select the minimum un-deleted version available

elif comment_status == "dm":
comment_payload["payload"].update(
{
"content": "comment was deleted by the moderator.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in RDM we do not have the "moderator" - it would be good to align it with what we display when we delete a comment in RDM (I don't remember the exact text). ping @zzacharo for more opinions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We not display this content for the deleted comments, we do this in the frontend

{}, request=request.model, request_id=str(request.id), type=event_type
)

if data.get("file_relation"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a small comment on why we are doing this?

)
return self.all_record_versions[str(oldest_version)]

def create_event(self, request, data, community, record, parent_comment_id=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any way we can optimise this function to be more readable? there are a lot of conditional statements, some with repeated conditions, also it would be good if we avoid nesting

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more optimisations

{"user": str(user.id)}, raise_=True
)
else:
print("User not found for email: ", data.get("created_by"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the print is redundant if you raise.
what will happen if you raise? will the whole script halt? and need to be re-run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it gets caught and logged in _load() and now that I have put it under the UnitOfWork context as you suggested, it will rollback when this is raised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will it rollback the whole record or the whole migration?
If this happens, how will you resume so that missing (failed) comments are covered in the second/third and subsequent attempts?

event.model.version_id = 0

event.commit()
db.session.commit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better if we do the uow instead? otherwise you will need to re-index all requests
plus, from records migration experience I can tell you uow is faster

created_at = datetime.fromisoformat(record["created"])
request.model.created = created_at

request.commit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part would also benefic from uow

environment, collection
)
"""
collection_name/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, this docstring very helpful, thank you!



# Function to flatten arbitrarily nested comment replies into a 1-level replies list
def flatten_replies(comments_list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's do the rubber duck excersise on this one :)

identity=system_identity,
request_id=request["id"],
)
assert comments.total == 2 # 1 comment and 1 reply
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add maybe a small check for the content of the comment and reply, just to make sure

Copy link
Contributor

@kpsherva kpsherva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we store migrated comments ids on the request level to have a retry strategy if a migration run fails? In order to understand which comments to skip on the second/third etc run

@sakshamarora1 sakshamarora1 force-pushed the feature/comments_migration branch from f420de2 to 7f6c6ed Compare February 16, 2026 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comments migration

2 participants