Skip to content

Conversation

@kennethrioja
Copy link
Contributor

Summary of changes

  • We now have a Github Ingestion Method for materials. Here is the list of the fetched metadata fields (n = 14):
    • title, url, description, keywords, licence, status, doi, version, date {created|published|modified}, contributors, resource type, prerequisites
  • Github has a limited rate of API calls, thus we are caching the JSON response 7 days (see TTL) before re-updating the material. Here is the list of the API calls (n = 4 per Github repo)

Motivation and context

  • For CERN, and more especially the HEP Software Foundation, most of the training material were Github repos or pages, very few had bioschema (except the carpentries one). I wanted to get more elaborated entries for the HSF Training Center, thus this Github ingestor.

Checklist

  • I have read and followed the CONTRIBUTING guide.
  • I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree
    to license it to the TeSS codebase under the
    BSD license.

Comment on lines 98 to 103
content = open_url(url)
data = content ? JSON.parse(content.read) : nil

set_cache(key, data) if Rails.cache.read(key).nil? # sets cache only if there is no cache yet or is expired
get_cache(key)
data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be simplified:

Rails.cache.fetch(key) do
  JSON.parse(open_url(url).read)
end

Rails will decide whether to get/set the cache.

(Also your current implementation will set the cache even if no content was read)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also add a prefix to the cache key, github_ingestor_ or something, since this cache is used across the entire application

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@fbacall fbacall Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw a comment via email about the cache not working, but no idea where it is in the GitHub UI.

Possibly you did not have caching enabled on your dev instance:
https://github.com/ElixirTeSS/TeSS/blob/master/config/environments/development.rb#L20-L32

Rails.cache.fetch(key, expires_in: 1.week) do ... will not execute the code in the block if key exists in the cache and the cache entry is < 1 week old.

https://guides.rubyonrails.org/caching_with_rails.html#low-level-caching-using-rails-cache

Copy link
Contributor Author

@kennethrioja kennethrioja Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted the message my bad! I tested everything and it works as expected now. My issue was that I was worried that it does open_url even when doing cache fetch. As you've shown with the docs and the tests I've done, it now correctly goes through the correct number of API calls. I refactored accordingly

Comment on lines 58 to 60
rescue StandardError
nil
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally not a good idea to rescue StandardError like this, it will obscure legitimate bugs: typos in method calls etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a @message to each and every StdErr rescue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the actual exceptions you are trying to rescue in these cases?

I think the metadata extraction code should be robust enough to not raise exceptions when it can't parse out the DOI etc. from the git repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see, I avoided the raise exceptions which were indeed not legitimate in the parsing

@kennethrioja
Copy link
Contributor Author

History:

  1. I setup my dev branch with config/environments/test.rb having config.cache_store = :memory_store, this is why some tests did not pass -> 1c3a0a6
  2. Force-pushed (only did --amend) multiple times to get CodeQL pass -> 791b4dc
  3. Added more tests to test the exact github.{com|io} format to be able to fetch data -> 23e04f1

@kennethrioja kennethrioja requested a review from fbacall October 24, 2025 11:50
@kennethrioja
Copy link
Contributor Author

kennethrioja commented Oct 24, 2025

I have changed the get_or_set_cache method to have less Github API calls + added the relevant tests about this -> d52a2a0

Comment on lines 58 to 60
rescue StandardError
nil
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the actual exceptions you are trying to rescue in these cases?

I think the metadata extraction code should be robust enough to not raise exceptions when it can't parse out the DOI etc. from the git repo.

Comment on lines 98 to 103
content = open_url(url)
data = content ? JSON.parse(content.read) : nil

set_cache(key, data) if Rails.cache.read(key).nil? # sets cache only if there is no cache yet or is expired
get_cache(key)
data
Copy link
Member

@fbacall fbacall Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw a comment via email about the cache not working, but no idea where it is in the GitHub UI.

Possibly you did not have caching enabled on your dev instance:
https://github.com/ElixirTeSS/TeSS/blob/master/config/environments/development.rb#L20-L32

Rails.cache.fetch(key, expires_in: 1.week) do ... will not execute the code in the block if key exists in the cache and the cache entry is < 1 week old.

https://guides.rubyonrails.org/caching_with_rails.html#low-level-caching-using-rails-cache

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a GitHub Material Ingestor that fetches training material metadata from GitHub repositories and GitHub Pages. The implementation includes support for sitemap parsing (XML and TXT), API response caching with a 7-day TTL to respect GitHub API rate limits (60 requests/hour for unauthenticated users), and extraction of 14 metadata fields including title, description, keywords, license, DOI, version, dates, contributors, resource type, and prerequisites.

Key Changes

  • New GitHub ingestor class that transforms GitHub repositories/pages into materials with comprehensive metadata extraction
  • Extracted sitemap parsing logic into a reusable SitemapHelpers concern module
  • Added redirect handling methods to the base Ingestor class for following HTTP and meta refresh redirects

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
lib/ingestors/github_ingestor.rb Main ingestor implementation with GitHub API integration, caching, and metadata extraction logic
lib/ingestors/concerns/sitemap_helpers.rb Reusable sitemap parsing module extracted from common patterns
lib/ingestors/ingestor.rb Added get_redirected_url and to_https helper methods for redirect handling
lib/ingestors/ingestor_factory.rb Registered the new GitHub ingestor
test/unit/ingestors/github_ingestor_test.rb Comprehensive test coverage including URL parsing, caching behavior, TTL handling, and metadata extraction
test/unit/ingestors/ingestor_test.rb Tests for new base ingestor methods (open_url, get_redirected_url)
test/helpers/sitemap_helper_test.rb Tests for the sitemap helpers module
test/test_helper.rb Added DummyIngestor test helper class
test/fixtures/files/ingestion/github/* Mock data files for testing GitHub API responses

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Nokogiri::HTML(response.body)
end

# DEFINITION – Opens the GitHub homepage, fetches the 3 first >25 char <p> tags'text
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment incorrectly states '>25 char' but the code checks for 50 characters. Update comment to '>50 char' or 'more than 50 chars' to match the code at line 148.

Suggested change
# DEFINITION – Opens the GitHub homepage, fetches the 3 first >25 char <p> tags'text
# DEFINITION – Opens the GitHub homepage, fetches the 3 first >50 char <p> tags' text

Copilot uses AI. Check for mistakes.
round = 3
doc.css('p').each do |p|
p_txt = p&.text&.strip&.gsub(/\s+/, ' ')
next if (p_txt.length < 50) || round.zero?
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any method in the safe navigation chain returns nil (e.g., if text or strip returns nil), p_txt will be nil, and line 148 will crash with NoMethodError when calling .length on nil. Add nil check: next if p_txt.nil? || p_txt.length < 50 || round.zero?

Suggested change
next if (p_txt.length < 50) || round.zero?
next if p_txt.nil? || p_txt.length < 50 || round.zero?

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred to make p_txt = '' if nil, this way it can use .length method
The solution proposed by Copilot was crashing tests

raise 'Too many redirects' if limit.zero?

https_url = to_https(url) # some `homepage` were http
response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] })
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If config[:user_agent] is not defined (e.g., in ingestors that don't set it), this will pass nil as the User-Agent header value. Add a fallback or check: use config[:user_agent] || 'TeSS Bot' or verify user_agent exists before calling this method.

Suggested change
response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] })
response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => (config[:user_agent] || 'TeSS Bot') })

Copilot uses AI. Check for mistakes.
decorator_ref.open_url_counter + 1
)

# puts("open_url has been used, total #{decorator_ref.open_url_counter}")
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented-out debug code before merging to keep the codebase clean.

Suggested change
# puts("open_url has been used, total #{decorator_ref.open_url_counter}")

Copilot uses AI. Check for mistakes.
# ... any tag with id containing "prereq" (EN) or "prerreq" (ES)
prereq_paragraphs = fetch_prerequisites_from_id_or_class(doc, prereq_paragraphs) if prereq_paragraphs.empty?

prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.to_s&.strip
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The safe navigation chain with &.to_s is redundant—if join succeeds, it always returns a string, making &.to_s unnecessary. Simplify to: prereq_paragraphs&.join(\"\\n\")&.gsub(/\\n\\n+/, \"\\n\")&.strip or add explicit nil handling.

Suggested change
prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.to_s&.strip
prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.strip || ""

Copilot uses AI. Check for mistakes.

GITHUB_API_BASE = 'https://api.github.com/repos'
CACHE_PREFIX = 'github_ingestor_'
TTL = 1.week # time to live after the cache is deleted
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is slightly misleading: 'time to live after the cache is deleted' suggests TTL applies post-deletion, but TTL is actually the duration before cache deletion. Clarify to: 'time to live before cache expires' or 'cache expiration time'.

Suggested change
TTL = 1.week # time to live after the cache is deleted
TTL = 1.week # cache expiration time (time to live before cache expires)

Copilot uses AI. Check for mistakes.
ingestor.open_url("| touch #{file}")
rescue StandardError
end
`ls #{dir}` # This is needed or the `exist?` check below seems to return a stale result
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using backticks to execute shell commands for cache-busting filesystem checks is fragile and platform-dependent. Consider using Dir.children(dir) or File.exist?(file) with proper filesystem sync, or document why this workaround is necessary.

Suggested change
`ls #{dir}` # This is needed or the `exist?` check below seems to return a stale result
Dir.children(dir) # This is needed or the `exist?` check below seems to return a stale result

Copilot uses AI. Check for mistakes.
Comment on lines 59 to 61
rescue StandardError => e
@messages << "#{self.class.name} read failed, #{e.message}"
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These messages can be viewed by users, it's better to make them user-friendly and send the real exception message to the rails log, see:
https://github.com/ElixirTeSS/TeSS/blob/master/lib/ingestors/bioschemas_ingestor.rb#L125-L142

(I know there are other ingestors that do this - they are wrong...)

kennethrioja added a commit to kennethrioja/TeSS that referenced this pull request Dec 19, 2025
kennethrioja added a commit to kennethrioja/TeSS that referenced this pull request Dec 19, 2025
@fbacall fbacall merged commit c8464e8 into ElixirTeSS:master Dec 19, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants