feat(comments): Add runner for comments migration separately #380

Open

sakshamarora1 wants to merge 5 commits into CERNDocumentServer:master from sakshamarora1:feature/comments_migration

Contributor

sakshamarora1 commented Feb 2, 2026

closes: #286

Steps

Update the collection queries for a collection, retrieve all the comments for the records found, and create a JSON metadata file.

ipython ./scripts/dump_comments_to_migrate.py

Output file: comments_metadata.json
The above script also runs this script partly to generate the missing_users.json file

Ensure people.csv and missing_users.json are in the /eos/media/cds/cds-rdm/<env>/migration/users/ directory and the comments_metadata.json file is in /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/

IMPORTANT: We are not running the ./scripts/copy_comments_attached_files.py yet (or we can, for thesis and IT there are none anyways) [See child issue in the main issue attached]

Create those users (using people.csv containing person_id already placed in the /eos/media/cds/cds-rdm/<env>/migration/users/):

cds-migrator-kit comments commenters-run --filepath /eos/media/cds/cds-rdm/<env>/migration/users/missing_users.json --missing-users-dir /eos/media/cds/cds-rdm/<env>/migration/users/ --dry-run

(Remove --dry-run to create the users for real)

Finally migrate the comments:

invenio migration comments --filepath /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/comments_metadata.json --dirpath /eos/media/cds/cds-rdm/<env>/migration/<collection>/comments/ --dry-run

(Remove --dry-run to migrate the comments for real)
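
The file placement described in the steps above can be verified before running either command. The following is a minimal sketch, not part of this PR: the helper names `expected_files` and `missing_files` are hypothetical, and only the directory layout and file names are taken from the steps.

```python
from pathlib import Path

# Base path taken from the steps above; env and collection are supplied by the caller.
MIGRATION_ROOT = "/eos/media/cds/cds-rdm/{env}/migration"


def expected_files(env, collection, root=MIGRATION_ROOT):
    """Return the input files the two migration commands expect to find."""
    base = Path(root.format(env=env))
    return [
        base / "users" / "people.csv",
        base / "users" / "missing_users.json",
        base / collection / "comments" / "comments_metadata.json",
    ]


def missing_files(env, collection, root=MIGRATION_ROOT):
    """Return the expected files that do not exist yet (empty list means ready)."""
    return [p for p in expected_files(env, collection, root) if not p.is_file()]
```

Calling `missing_files("<env>", "<collection>")` with real values and getting an empty list back would indicate that the dry runs have the inputs they need.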

sakshamarora1 marked this pull request as ready for review

February 4, 2026 16:33

kpsherva reviewed

cds_migrator_kit/rdm/comments/load.py Outdated

+                      self.all_record_versions = {
+                          str(hit["versions"]["index"]): hit for hit in search_result
+                      }
+                      oldest_version = min(

Contributor

kpsherva Feb 9, 2026

wouldn't it be faster via record._record.versions[-1]? I mean instead of scan_versions etc.

Contributor Author

sakshamarora1 Feb 9, 2026

That returns an instance of VersionsManager and it doesn't have other versions stored in it. We will have to do scan_versions to find all the versions and select the minimum un-deleted version available
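
For illustration, the selection described in this reply might look roughly like the sketch below. It works on plain dicts standing in for the hits returned by scanning all versions; the `deletion_status.is_deleted` field and the function name `oldest_available_version` are assumptions made for the example, not taken from the PR.

```python
def oldest_available_version(hits):
    """Pick the hit with the lowest version index among non-deleted versions.

    `hits` stands in for the result of scanning every version of a record.
    The `deletion_status.is_deleted` field is an assumption for illustration.
    """
    by_index = {
        hit["versions"]["index"]: hit
        for hit in hits
        if not hit.get("deletion_status", {}).get("is_deleted", False)
    }
    if not by_index:
        return None  # every version is deleted, or there were no hits
    return by_index[min(by_index)]
```

The sketch keeps the version index as an integer for the comparison, because with string keys `"10"` would sort before `"2"`.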

kpsherva reviewed

cds_migrator_kit/rdm/comments/load.py Outdated

+                      elif comment_status == "dm":
+                          comment_payload["payload"].update(
+                              {
+                                  "content": "comment was deleted by the moderator.",

Contributor

kpsherva Feb 9, 2026

in RDM we do not have the "moderator" - it would be good to align it with what we display when we delete a comment in RDM (I don't remember the exact text). ping @zzacharo for more opinions

Contributor Author

sakshamarora1 Feb 9, 2026

We do not display this content for deleted comments; that is handled in the frontend

kpsherva reviewed

cds_migrator_kit/rdm/comments/load.py

+                          {}, request=request.model, request_id=str(request.id), type=event_type
+                      )
+                      if data.get("file_relation"):

Contributor

kpsherva Feb 9, 2026

can you add a small comment on why we are doing this?

kpsherva reviewed

cds_migrator_kit/rdm/comments/load.py Outdated

+                      )
+                      return self.all_record_versions[str(oldest_version)]
+                  def create_event(self, request, data, community, record, parent_comment_id=None):

Contributor

kpsherva Feb 9, 2026

is there any way we can optimise this function to be more readable? there are a lot of conditional statements, some with repeated conditions, also it would be good if we avoid nesting

Contributor Author

sakshamarora1 Feb 10, 2026

I did some more optimisations

kpsherva reviewed

cds_migrator_kit/rdm/comments/load.py Outdated

+                              {"user": str(user.id)}, raise_=True
+                          )
+                      else:
+                          print("User not found for email: ", data.get("created_by"))

Contributor

kpsherva Feb 9, 2026

the print is redundant if you raise.
what will happen if you raise? will the whole script halt? and need to be re-run?

Contributor Author

sakshamarora1 Feb 10, 2026

No, it gets caught and logged in _load() and now that I have put it under the UnitOfWork context as you suggested, it will rollback when this is raised.

Contributor

kpsherva Feb 12, 2026

will it rollback the whole record or the whole migration?
If this happens, how will you resume so that missing (failed) comments are covered in the second/third and subsequent attempts?
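
To make the two options in this question concrete, here is a sketch of per-record rollback in which a failure undoes only the comments of the record being processed and the run continues with the next one. It uses `sqlite3` from the standard library as a stand-in for the database and is not Invenio's UnitOfWork; the table layout and the names `migrate_record` and `migrate_all` are hypothetical, and it does not claim to describe what the PR's `_load()` actually does.

```python
import sqlite3


def migrate_record(conn, record_id, comments):
    """Insert all comments of one record atomically; undo that record on error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for comment in comments:
                if comment.get("created_by") is None:
                    raise ValueError(f"User not found for comment {comment['id']}")
                conn.execute(
                    "INSERT INTO comments (id, record_id, content) VALUES (?, ?, ?)",
                    (comment["id"], record_id, comment["content"]),
                )
    except ValueError as exc:
        return str(exc)  # the caller logs it and moves on to the next record
    return None


def migrate_all(conn, records):
    """Migrate every record; return {record_id: error} for the ones that failed."""
    failed = {}
    for record_id, comments in records.items():
        error = migrate_record(conn, record_id, comments)
        if error:
            failed[record_id] = error
    return failed
```

With this shape, the returned mapping of failed records doubles as the input list for a second attempt.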

kpsherva reviewed

cds_migrator_kit/rdm/comments/load.py Outdated

+                      event.model.version_id = 0
+                      event.commit()
+                      db.session.commit()

Contributor

kpsherva Feb 9, 2026

would it be better if we do the uow instead? otherwise you will need to re-index all requests
plus, from records migration experience I can tell you uow is faster

kpsherva reviewed

cds_migrator_kit/rdm/comments/load.py Outdated

+                      created_at = datetime.fromisoformat(record["created"])
+                      request.model.created = created_at
+                      request.commit()

Contributor

kpsherva Feb 9, 2026

this part would also benefit from uow

kpsherva reviewed

scripts/copy_comments_attached_files.py

+                  environment, collection
+              )
+              """
+              collection_name/

Contributor

kpsherva Feb 9, 2026

nice, this docstring is very helpful, thank you!

kpsherva reviewed

scripts/dump_comments_to_migrate.py



		# Function to flatten arbitrarily nested comment replies into a 1-level replies list
		def flatten_replies(comments_list):

Contributor

kpsherva Feb 9, 2026

let's do the rubber duck exercise on this one :)
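
As a starting point for walking through it, one way such a flattening could be written is sketched below. Only the function name and its stated purpose come from the diff; the body is an illustration that assumes each comment is a dict with an optional `replies` list, which may not match the script's actual data structure.

```python
def flatten_replies(comments_list):
    """Return top-level comments whose `replies` hold every descendant, one level deep."""
    flattened = []
    for comment in comments_list:
        replies = []
        # Explicit stack instead of recursion, so deep threads cannot hit the recursion limit.
        stack = list(reversed(comment.get("replies", [])))
        while stack:
            reply = stack.pop()
            # Queue nested replies so they follow their parent in the output.
            stack.extend(reversed(reply.get("replies", [])))
            # Copy the reply without its own nested list.
            replies.append({k: v for k, v in reply.items() if k != "replies"})
        flattened.append({**comment, "replies": replies})
    return flattened
```

The sketch builds new dicts rather than mutating the input, and it keeps replies in depth-first order so that each nested reply stays directly after the reply it answered.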

kpsherva reviewed

tests/cds-rdm/test_comments_migration.py Outdated

+                      identity=system_identity,
+                      request_id=request["id"],
+                  )
+                  assert comments.total == 2  # 1 comment and 1 reply

Contributor

kpsherva Feb 12, 2026

I would add maybe a small check for the content of the comment and reply, just to make sure

kpsherva reviewed

Contributor

kpsherva left a comment

can we store migrated comments ids on the request level to have a retry strategy if a migration run fails? In order to understand which comments to skip on the second/third etc run
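
A rough sketch of the retry idea, assuming the IDs of already migrated comments are available as a set. Here the set is a plain in-memory stand-in for whatever would be stored on the request, and the names `pending_comments` and `run_with_resume` and the `migrate_one` callback are hypothetical.

```python
def pending_comments(comments, migrated_ids):
    """Return the comments whose legacy id has not been migrated yet."""
    return [c for c in comments if c["id"] not in migrated_ids]


def run_with_resume(comments, migrated_ids, migrate_one):
    """Migrate only the pending comments, recording each id once it succeeds.

    `migrated_ids` is updated in place, so if `migrate_one` raises part-way
    through, the ids recorded so far are still available for the next run.
    """
    for comment in pending_comments(comments, migrated_ids):
        migrate_one(comment)
        migrated_ids.add(comment["id"])
    return migrated_ids
```

A second or third run with the same `migrated_ids` would then skip whatever already succeeded and only retry the rest.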

sakshamarora1 added 5 commits

February 16, 2026 20:08


          feat(comments): Add runner for comments migration separately

a3140cd


          feat(comments): Add link in comment content for linked files

4c236be


          rdm: comments: Add commenterRunner and UnitOfWork to CommentsRunner.load

ff69078


          tests: Add tests for comments migration

b27a854


          Refactor(tests): Add more precise assertions

7f6c6ed

sakshamarora1 force-pushed the feature/comments_migration branch from f420de2 to 7f6c6ed

February 16, 2026 20:16
