Copilot AI commented Dec 23, 2025

Implementation Plan for Creator Biographical Information

  • Reverted previous commits to start over with better structure
  • 1. Refactor XML injection logic for better extensibility
  • 2. Add linked_agents to resolve parameter
  • 3. Import xml.sax.saxutils.escape
  • 4. Add get_creator_bioghist method
  • 5. Integrate bioghist into XML injection
  • 6. Update comment to reflect new behavior
  • 7. Add copilot-instructions.md with commit guidelines
  • 8. Clarify import guidelines in copilot-instructions.md
  • 9. Add newlines for XML readability
  • 10. Escape XML special characters in all labels
  • 11. Fix bioghist XML escaping to preserve EAD markup
  • 12. Document XML escaping patterns for future work
  • 13. Structure bioghist with nested elements per creator
  • 14. Add line breaks in bioghist for improved legibility
  • 15. Handle list content in bioghist subnotes
  • 16. Improve error handling and handle existing bioghist elements
  • 17. Remove agent_id fallback for clearer ID semantics

Summary

Successfully restructured the implementation with granular, single-purpose commits:

  • Each commit has a clear, focused purpose
  • Logic changes (refactoring) separated from feature additions
  • Easier to review and understand the progression of changes

Documentation Added

Created .github/copilot-instructions.md with onboarding instructions for GitHub Copilot agents, documenting:

  • Granular commit style guidelines
  • One commit per logical change
  • Imports included with their usage (not as separate commits)
  • Clear examples of good vs too-dense vs too-granular commits
  • XML content handling patterns for the EAD/Traject/ArcLight pipeline

XML Formatting

  • Injected XML elements now include newlines for better human readability
  • Proper escaping strategy:
    • recordgroup/subgroup labels: Escaped (plain text that may contain special chars)
    • bioghist content: NOT escaped (already contains valid EAD XML markup from ArchivesSpace)
    • creator names in head elements: Escaped (plain text)
  • Line breaks between paragraphs within bioghist for improved XML legibility
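
The escaping strategy above can be sketched roughly as follows. This is an illustration, not the code in arcflow/main.py: the helper names and the `recordgroup` tag name are hypothetical. Note that `xml.sax.saxutils.escape` escapes `&`, `<`, and `>` by default; quotes are only escaped if passed in through its `entities` argument.

```python
from xml.sax.saxutils import escape as xml_escape


def label_element(tag: str, label: str) -> str:
    """Wrap a plain-text label in an element, escaping &, < and >.

    `tag` is a hypothetical element name used only for illustration.
    """
    return f"<{tag}>{xml_escape(label)}</{tag}>\n"


def bioghist_paragraph(ead_markup: str) -> str:
    """Wrap content that is assumed to already be valid EAD markup.

    Escaping here would turn inline tags such as <emph> into literal text,
    so the content is passed through unchanged.
    """
    return f"<p>{ead_markup}</p>\n"


escaped_label = label_element("recordgroup", "Arts & Architecture")
raw_paragraph = bioghist_paragraph(
    "<emph render='italic'>Transactions of the Board of Trustees</emph>, p. 1121."
)
```

The trade-off is that unescaped content is trusted: if a note contained a stray `&` or `<`, the resulting XML would not be well-formed, so this only holds as long as the upstream content really is valid markup.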

Bioghist Structure

  • Creates separate nested <bioghist> element for each creator
  • Each bioghist includes:
    • id attribute with note's persistent_id (format: aspace_{persistent_id}) - omitted if persistent_id not available
    • <head> element with pattern "Historical Note from {creator title} Creator Record"
    • Content split on single newlines into separate <p> tags
    • Line breaks between <p> elements for better readability
    • Handles both string and list content types from subnotes
    • Explicit type checking with prominent error logging for data type violations
  • Maintains order of creators from linked_agents to match <origination> order
  • Appends to existing <bioghist> elements if they exist in the EAD, rather than replacing them
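
A minimal sketch of the per-creator element described above, assuming the paragraphs have already been extracted from the subnotes. The function name and signature are hypothetical; only the heading pattern, the `aspace_{persistent_id}` id format, and the omit-when-missing behavior come from this summary.

```python
from typing import List, Optional
from xml.sax.saxutils import escape as xml_escape


def build_creator_bioghist(
    creator_title: str,
    persistent_id: Optional[str],
    paragraphs: List[str],
) -> str:
    """Build one nested <bioghist> element for a single creator.

    Assumptions: `paragraphs` already holds valid EAD markup (not escaped
    again), and `persistent_id` contains no characters that need escaping
    inside an attribute value.
    """
    # Omit the id attribute entirely when no persistent_id is available.
    id_attr = f' id="aspace_{persistent_id}"' if persistent_id else ""
    # The creator title is plain text, so it is escaped.
    head = f"<head>Historical Note from {xml_escape(creator_title)} Creator Record</head>"
    # One <p> per paragraph, separated by newlines for legibility.
    body = "\n".join(f"<p>{p}</p>" for p in paragraphs)
    return f"<bioghist{id_attr}>\n{head}\n{body}\n</bioghist>\n"
```

Calling this once per creator, in `linked_agents` order, and concatenating the results would keep the output aligned with the `<origination>` order.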

Error Handling

  • Robust type checking for subnote content (string vs list)
  • Skips id attribute when persistent_id is not available (no fallback to agent_id to avoid ID type confusion)
  • Prominent logging for data type assumption violations and missing persistent_id

Original prompt

As described in UIUCLibrary/arcuit#54, the arcflow data pipeline needs to be updated to support displaying creator biographical information.

The implementation should:

  1. Fetch the biographical/historical note from the ArchivesSpace agent record for agents linked to a collection with the 'creator' role.
  2. Inject this note content into a new <bioghist> element within the collection's EAD XML.
  3. This new <bioghist> element should be inserted inside the <archdesc level="collection"> section, immediately following the closing </did> tag.
  4. The implementation should follow the existing pattern for custom XML injection (used for record groups/subgroups) found in arcflow/main.py.
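
As a rough illustration of step 3, the insertion can be done with plain string operations on the decoded EAD. This sketch assumes the first `</did>` in the document is the collection-level one (component-level `<did>` elements come later, inside `<dsc>`); the function name is hypothetical and the existing injection code in arcflow/main.py may differ.

```python
def inject_after_did(ead_xml: str, extra_xml: str) -> str:
    """Insert extra_xml immediately after the first closing </did> tag.

    Returns the input unchanged if there is nothing to insert or no
    </did> tag is found.
    """
    closing = "</did>"
    idx = ead_xml.find(closing)
    if idx == -1 or not extra_xml:
        return ead_xml
    insert_at = idx + len(closing)
    # A newline before the injected block keeps the output readable.
    return ead_xml[:insert_at] + "\n" + extra_xml + ead_xml[insert_at:]


sample = (
    '<archdesc level="collection"><did><unittitle>T</unittitle></did>'
    "<dsc><c><did><unittitle>C</unittitle></did></c></dsc></archdesc>"
)
result = inject_after_did(sample, "<bioghist><p>Note</p></bioghist>\n")
```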

This pull request was created from Copilot chat.



Copilot AI and others added 3 commits December 23, 2025 16:10
- Add get_creator_bioghist method to fetch biographical notes
- Update task_resource to resolve linked_agents
- Inject bioghist element into EAD XML after </did> tag
- Follow existing pattern for custom XML injection

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
- Use xml.sax.saxutils.escape for proper XML escaping
- Format bioghist notes as separate paragraphs
- Remove manual XML escaping in favor of library function

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
- Make error message more specific about the operation that failed

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update arcflow pipeline to display creator biographical information" to "Add creator biographical information to EAD XML exports" Dec 23, 2025
Copilot AI requested a review from alexdryden December 23, 2025 16:20
arcflow/main.py Outdated
'ead3': 'false',
})

# add record group and subgroup labels to EAD inside <archdesc level="collection">
Contributor

Make sure to update the comments to reflect the new behavior

Author

Updated in commit db640ce. The comment now reflects that the section handles multiple types of custom XML elements (record group/subgroup labels and biographical/historical notes).

Contributor

@alexdryden alexdryden left a comment

Before I continue with the review, I'd like you to start over and rework this with more frequent and verbose commits. So, keep everything the same, but parse out the changes more clearly through more frequent, single-purpose commits.

Even if humans committing to this repo have dense commits, I'd like for you to be more explicit. For example, in the first commit to main.py you are making changes to things like locating the closing </did> tag, which is great, but not really part of "Add creator biographical information to EAD XML exports".

This makes it easier to review and helps me understand your specific logic and choices.

Copilot AI and others added 10 commits December 23, 2025 16:48
Restructure the XML injection code to make it easier to add
additional XML elements. Changes:
- Move XML decode/encode outside the label-conditional block
- Calculate insertion point (after </did> tag) upfront
- Use extra_xml variable to accumulate multiple XML elements
- Maintains existing recordgroup/subgroup functionality

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Include linked_agents in the resource API call resolution to
fetch agent information along with the resource. This prepares
for extracting creator biographical information.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Add import for xml_escape function to properly escape XML special
characters (ampersands, less-than, greater-than, quotes) when
injecting content into EAD XML.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Implement method to fetch biographical/historical notes from
ArchivesSpace agent records. The method:
- Filters agents by 'creator' role
- Fetches agent details via API
- Extracts note_bioghist content from subnotes
- Escapes XML special characters using xml_escape
- Wraps each note in <p> tags for proper XML structure
- Returns concatenated paragraphs or None if no notes found

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
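
The first and third steps of that method (filtering by role and picking out `note_bioghist` notes) might look roughly like this. It is a sketch under assumptions: the field names (`linked_agents`, `role`, `ref`, `notes`, `jsonmodel_type`) follow the ArchivesSpace JSON model as I understand it, and the helper names are hypothetical rather than taken from arcflow.

```python
from typing import Any, Dict, Iterator, List


def creator_refs(resource: Dict[str, Any]) -> List[str]:
    """Return agent refs with the 'creator' role, preserving linked_agents order."""
    return [
        agent["ref"]
        for agent in resource.get("linked_agents", [])
        if agent.get("role") == "creator" and "ref" in agent
    ]


def bioghist_notes(agent: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield only the biographical/historical notes from an agent record."""
    for note in agent.get("notes", []):
        if note.get("jsonmodel_type") == "note_bioghist":
            yield note


resource = {
    "linked_agents": [
        {"role": "creator", "ref": "/agents/people/1"},
        {"role": "subject", "ref": "/agents/people/2"},
        {"role": "creator", "ref": "/agents/corporate_entities/680"},
    ]
}
agent = {
    "notes": [
        {"jsonmodel_type": "note_bioghist", "persistent_id": "p1", "subnotes": []},
        {"jsonmodel_type": "note_general_context"},
    ]
}
```
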
Call get_creator_bioghist to fetch biographical notes and inject
them into the EAD XML as a <bioghist> element. The bioghist element
is added after </did> tag, following recordgroup/subgroup labels.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Update the comment on the XML injection section to indicate that
it now handles multiple types of custom XML elements (record group/
subgroup labels AND biographical/historical notes), not just labels.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Dec 23, 2025

Before I continue with the review, I'd like for you to start over re-work this with more frequent and verbose commits. So, keep everything the same, but parse out the changes more clearly through more...

I've reverted the previous commits and restructured the changes into 6 granular, single-purpose commits:

  1. b39d03b - Refactor XML injection logic (separates structure changes)
  2. a50571c - Add linked_agents to resolve parameter
  3. 600442f - Import xml.sax.saxutils.escape
  4. d7f1bc1 - Add get_creator_bioghist method
  5. 2ae937e - Integrate bioghist into XML injection
  6. db640ce - Update comment to reflect new behavior

Each commit now has a clear, focused purpose that's easier to review.

@alexdryden
Contributor

@copilot Great! This commit style suits me much better for this repo. Can you add a .github/copilot-instructions.md file with some basic onboarding instructions that include this commit style? Consult https://docs.github.com/en/enterprise-cloud@latest/copilot/tutorials/coding-agent/get-the-best-results if you need context for what this file is for. Keep it very simple for now and focus mainly on the commit style. Structure the file so that it is easy and clear for another agent to add more instructions later.

Create onboarding instructions for GitHub Copilot agents with
focus on granular, single-purpose commit style. Includes:
- Guidelines for commit granularity
- Examples of good vs bad commit sequences
- Commit message format recommendations
- Instructions for extending the file

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Dec 23, 2025

@copilot Great! This commit style suits me much better for this repo. Can you add a .github/copilot-instructions.md file with some basic on boarding instructions that includes this commit style? Consu...

Created in commit b485f55. The file documents the granular commit style with clear examples and is structured to be easily extended by other agents.

@alexdryden
Contributor

alexdryden commented Jan 9, 2026

@copilot for the incomplete comment, it looks like a formatting issue with quoting xml syntax in markdown in the comment--regardless, that comment was a note to myself and you can ignore it.

For your questions:
For paragraphing: split on single \n
For multiple creators structure, let's go with a "yes and" strategy and we can pull back later if we need to. Use nested <bioghist> elements with creator names in a <head> element of each nested <bioghist>. The only attribute I'm seeing here (https://www.loc.gov/ead/tglib1998/tlin020.html) that would make sense is ID. Go ahead and also use that attribute and populate it with the agent id from ArchivesSpace.

I'm not an expert in EAD, so check these instructions against generally accepted practices and let me know if anything looks off.

@alexdryden
Contributor

@copilot see my updated comment above--I needed to escape some of the xml tags

Author

Copilot AI commented Jan 9, 2026

@copilot for the incomplete comment, it looks like a formatting issue with quoting xml syntax in markdown in the comment--regardless, that comment was a note to myself and you can ignore it.

For yo...

Understood. I'll implement:

  1. Split bioghist content on single \n to create separate <p> tags
  2. Nest <bioghist> elements for each creator
  3. Add <head> element with creator name in each nested <bioghist>
  4. Include id attribute populated with ArchivesSpace agent ID

EAD Standards Check:
According to EAD 2002 standards:

  • <bioghist> can be nested within <archdesc>
  • <head> element is valid within <bioghist>
  • id attribute is valid on <bioghist>
  • Multiple <bioghist> elements are permitted ✓

The approach looks compliant with EAD standards. I'll also ensure:

  • Order preservation: Process linked_agents in the order they appear (which should match <origination> order from ArchivesSpace)
  • Extract agent names (title/display_name) from agent records for the <head> element

Proceeding with implementation.

Implement improved bioghist structure:
- Create separate <bioghist> element for each creator
- Add <head> element with creator name (XML-escaped)
- Include id attribute with ArchivesSpace agent ID
- Split content on single newlines to create <p> paragraphs
- Maintain linked_agents order for consistency with origination

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
@alexdryden
Contributor

@copilot For legibility, also keep line breaks in the bioghist narrative.

Add newline characters between <p> elements within bioghist
content and after head/before closing tag to improve
readability of the generated XML.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Jan 9, 2026

@copilot For legibility, also keep line breaks in the bioghist narrative.

Added in commit 8d33aef. The bioghist content now includes newlines between <p> elements and after the <head> element for improved XML legibility.

@alexdryden
Contributor

@copilot here are a few things we need to clean up:

  1. Error for some agents: For some agents we are getting back "Error fetching biographical information for agent /agents/corporate_entities/680: 'list' object has no attribute 'split' ". In this case, there was a second sub-note with jsonmodel_type: "note_citation" and the content of that subnote was a list. Here is the note element from the json:

	"notes": [
	  {
	    "persistent_id": "66ba4ddf42c22ada3d9b8be129829e89",
	    "label": "Historical Note",
	    "subnotes": [
	      {
	        "publish": true,
	        "jsonmodel_type": "note_text",
	        "content": "The School of Music purchased its first recording equipment in 1948 (University of Illinois, 1948).  Since then, the School of Music Audio Department has provided recording services for student, faculty, and guest performers at University venues including Smith Music Hall, the Krannert Center for Performing Arts, and the Music Building Recording Studios (University of Illinois School of Music, 2013).  Past Directors of the Audio Department include Jim Campbell, Carl Volkers (ca. 1950s-1976), Jeff Whimsatt (1976-78), and Rex Anderson (1978-2010).\n\tThe Festival of Contemporary Arts, first organized in 1948, included as part of its programming dance, theater, landscape architecture, music and an Exhibition of Contemporary American Painting and Sculpture. The Festival's music events included a variety of concerts, lectures, and workshops that were produced by University faculty and student ensembles, and guest composers and performers.  The styles of new music performance incorporated into these festivals encompassed everything from contemporary orchestral, jazz, opera, and choral music to new forms of performance art.  Initially held annually, the five- to six-week festival of performing and studio arts became a biennial event after 1953. Budgetary concerns led to the effective end of the festival after 1971, however, a related Festival of Contemporary Performing Arts was held in 1978."
	      },
	      {
	        "persistent_id": "d33b997189edbd0eec1e7c3a66ad5013",
	        "publish": true,
	        "jsonmodel_type": "note_citation",
	        "content": [
	          "University of Illinois.  (May 22, 1948).  <emph render='italic'>Transactions of the Board of Trustees</emph>, p. 1121.\n\tUniversity of Illinois School of Music.  \"Audio and Video Services\", http://music.illinois.edu/resources/audio-and-video-services, accessed June 27 2013."
	        ]
	      }
	    ],
	    "jsonmodel_type": "note_bioghist",
	    "publish": true
	  }
	],

During this development stage, let's be extra careful about making sure we have the data type we assume we have, and try to handle exceptions gracefully while noting in the error logs in bold anything that is violating those assumptions.

  2. One of the assumptions we came into this feature with was that there would not be a bioghist element in the EAD and we would populate it with this information. That is not the case. We need to account for instances when the collection already has a bioghist element and we want to append the agent bioghist statement to that. Similarly, we need to make sure we are accounting for instances where there are multiple creators with their own historical notes, and we are appending those notes with the proper headings as well. For now, let's use this pattern for the headings for creator records: "Historical Note from Creator Record"

  3. It looks like in the JSON there are persistent ids returned that represent internal database ids, and this is what is stored elsewhere in the EAD for id. Let's use that pattern instead of the agent id; so, it looks like id="aspace_{persistent_id}"

Fix error when subnote content is a list instead of a string.
Some subnotes (e.g., note_citation) have content as a list,
which was causing "'list' object has no attribute 'split'"
errors. Now properly handles both string and list content types.

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Jan 26, 2026

@copilot here are a few things we need to clean up:

  1. Error for some agents: For some agents we are getting back "Error fetching biographical information for agent /agents/corporate_entities/680: ...

Fixed in commit fb0dccc. The code now handles subnote content as both string and list types. When content is a list (as with note_citation), each item is converted to a string and wrapped in <p> tags. When it's a string (as with note_text), it's split on newlines as before.
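
A sketch of that string-versus-list handling, with a loud log entry when the content is neither. The function name, logger name, and log wording are hypothetical; stripping whitespace and dropping empty lines is an added assumption, not something confirmed by the commit.

```python
import logging
from typing import Any, List

logger = logging.getLogger("arcflow.bioghist")  # hypothetical logger name


def subnote_paragraphs(content: Any) -> List[str]:
    """Normalize subnote content into a list of paragraph strings.

    - str (e.g. note_text): split on single newlines.
    - list (e.g. note_citation): each item becomes one paragraph.
    - anything else: log prominently and return no paragraphs.
    """
    if isinstance(content, str):
        pieces = content.split("\n")
    elif isinstance(content, list):
        pieces = [str(item) for item in content]
    else:
        logger.error(
            "**ASSUMPTION VIOLATION** subnote content is %s, expected str or list",
            type(content).__name__,
        )
        return []
    # Assumption: leading tabs and blank lines carry no meaning and can be dropped.
    return [piece.strip() for piece in pieces if piece.strip()]
```

Each returned string would then be wrapped in its own `<p>` element. Note that under this sketch a list item containing an embedded `\n` stays a single paragraph, which matches the description above but may or may not be the desired rendering for citations.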

@alexdryden
Contributor

@copilot This looks good, but take a look at points 2 and 3 from my previous comment. In case something got chopped off from where I inserted the JSON example, here are the other details from my above comment:

During this development stage, let's be extra careful about making sure we have the data type we assume we have, and try to handle exceptions gracefully while noting in the error logs in bold anything that is violating those assumptions.

One of the assumptions we came into this feature with was that there would not be a bioghist element in the EAD and we would populate it with this information. That is not the case. We need to account for instances when the collection already has a bioghist element and we want to append the agent bioghist statement to that. Similarly, we need to make sure we are accounting for instances where there are multiple creators with their own historical notes, and we are appending those notes with the proper headings as well. For now, let's use this pattern for the headings for creator records: "Historical Note from <creator title> Creator Record"

It looks like in the JSON there are persistent ids returned that represent internal database ids, and this is what is stored elsewhere in the EAD for id. Let's use that pattern instead of the agent id; so, it looks like id="aspace_{persistent_id}"

Author

Copilot AI commented Jan 26, 2026

@copilot This looks good, but take a look at points 2 and 3 from my previous comment. In case something got chopped off from where I inserted the json example, here is the other details from my above ...

The comment appears to be cut off at "Historical Note from". Could you complete the heading pattern you'd like me to use? Should it be:

  • "Historical Note from [creator_name]"
  • Something else?

I'll proceed with implementing:

  1. Better type checking with prominent error logging for assumption violations
  2. Handling existing <bioghist> elements by appending rather than inserting new ones
  3. Using "Historical Note from [creator_name]" as the heading pattern (unless you specify otherwise)

- Add explicit type checking with prominent error logging for
  assumption violations
- Use persistent_id from notes for id attribute instead of agent_id
- Update heading pattern to "Historical Note from [name] Creator Record"
- Handle existing bioghist elements by appending agent bioghist
  after them rather than always inserting after </did>
- Log missing persistent_id with fallback to agent_id

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
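
One way the "append after existing bioghist" behavior could be approximated with string searches is shown below. This is a guess at the approach, not the committed code: the function name is hypothetical, and it assumes collection-level `<bioghist>` elements sit between the first `</did>` and the start of `<dsc>`, so that component-level notes inside `<dsc>` are ignored.

```python
def find_insertion_point(ead_xml: str) -> int:
    """Return the index at which creator bioghist XML should be inserted.

    Prefers the position right after the last collection-level </bioghist>;
    falls back to right after the first </did>. Returns -1 if no </did>
    is present.
    """
    did_close = ead_xml.find("</did>")
    if did_close == -1:
        return -1
    after_did = did_close + len("</did>")

    # Limit the search to the collection-level region before <dsc>, if any.
    dsc_start = ead_xml.find("<dsc", after_did)
    region_end = dsc_start if dsc_start != -1 else len(ead_xml)

    last_bioghist = ead_xml.rfind("</bioghist>", after_did, region_end)
    if last_bioghist != -1:
        return last_bioghist + len("</bioghist>")
    return after_did


with_existing = (
    "<archdesc><did></did><bioghist><p>old</p></bioghist>"
    "<dsc><c><did></did><bioghist><p>c</p></bioghist></c></dsc></archdesc>"
)
without_existing = "<archdesc><did></did><dsc><c><did></did></c></dsc></archdesc>"
```

A real XML parser would be more robust than substring searches, but string operations stay consistent with the decode-and-inject pattern the earlier commits describe.
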
@alexdryden
Contributor

@copilot 1. Don't fall back to agent_id; that would just be confusing, and there would be no way to know that it is an agent_id and not just a really short persistent_id.
2. The pattern should be "Historical Note from creator_title Creator Record".

- Remove fallback to agent_id when persistent_id is missing
  to avoid confusion between different ID types
- Skip id attribute entirely if persistent_id is not available
- Heading pattern already correct: "Historical Note from
  {creator_title} Creator Record"

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
Author

Copilot AI commented Jan 26, 2026

@copilot 1. don't fall back to agent_id, that would just be confusing and there would be no way to know that is an agent_id and not just a really short persistent_id
2. the pattern should be "Historic...

Fixed in commit 716c7db:

  1. Removed agent_id fallback - now skips id attribute entirely when persistent_id is missing
  2. Heading pattern confirmed as "Historical Note from {creator_title} Creator Record"
