Material Ingestor – GitHub #1154

kennethrioja · 2025-10-06T12:34:56Z

Summary of changes

We now have a GitHub ingestion method for materials. The following metadata fields are fetched (n = 14):
- title, url, description, keywords, licence, status, doi, version, date {created|published|modified}, contributors, resource type, prerequisites
GitHub rate-limits its API, so the JSON responses are cached for 7 days (see TTL) before the material is refreshed. The following API calls are made (n = 4 per GitHub repo):
- Repository : GET /repos/{owner}/{repo}
- Content : GET /repos/{owner}/{repo}/contents/{path}
- Releases : GET /repos/{owner}/{repo}/releases
- Repo contributors : GET /repos/{owner}/{repo}/contributors
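
As a rough illustration of the four calls listed above, the sketch below builds the endpoint URLs for one repository. `GITHUB_API_BASE` matches the constant in this PR; the helper name `github_api_urls`, the default `README.md` path, and the owner/repo values are assumptions made for the example only.

```ruby
GITHUB_API_BASE = 'https://api.github.com/repos'

# Hypothetical helper (not part of the PR): returns the four endpoint URLs
# that would be requested for a single repository.
def github_api_urls(owner, repo, path = 'README.md')
  base = "#{GITHUB_API_BASE}/#{owner}/#{repo}"
  {
    repository: base,
    content: "#{base}/contents/#{path}",
    releases: "#{base}/releases",
    contributors: "#{base}/contributors"
  }
end

# Placeholder owner and repo names, for illustration only.
urls = github_api_urls('example-owner', 'example-repo')
urls.each { |name, url| puts "#{name}: GET #{url}" }
```

With four requests per repository, caching each JSON response keeps repeated ingestion runs well away from the API rate limit.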

Motivation and context

For CERN, and especially the HEP Software Foundation (HSF), most of the training materials were GitHub repos or pages, and very few had Bioschemas markup (except the Carpentries material). I wanted more detailed entries for the HSF Training Center, hence this GitHub ingestor.

Checklist

I have read and followed the CONTRIBUTING guide.
I confirm that I have the authority necessary to make this contribution on behalf of its copyright owner and agree to license it to the TeSS codebase under the BSD license.

lib/ingestors/concerns/github_ingestor_read_helpers.rb

lib/ingestors/concerns/github_ingestor_material_helpers.rb

lib/ingestors/github_ingestor.rb

fbacall · 2025-10-13T17:31:16Z

lib/ingestors/github_ingestor.rb

+      content = open_url(url)
+      data = content ? JSON.parse(content.read) : nil
+
+      set_cache(key, data) if Rails.cache.read(key).nil? # sets cache only if there is no cache yet or is expired
+      get_cache(key)
+      data


Can be simplified:

Rails.cache.fetch(key) do JSON.parse(open_url(url).read) end

Rails will decide whether to get/set the cache.

(Also your current implementation will set the cache even if no content was read)

I'd also add a prefix to the cache key, github_ingestor_ or something, since this cache is used across the entire application

Rails caching has been updated, and the github_ingestor_ prefix has been added.

I saw a comment via email about the cache not working, but no idea where it is in the GitHub UI.

Possibly you did not have caching enabled on your dev instance:
https://github.com/ElixirTeSS/TeSS/blob/master/config/environments/development.rb#L20-L32

Rails.cache.fetch(key, expires_in: 1.week) do ... will not execute the code in the block if key exists in the cache and the cache entry is < 1 week old.

https://guides.rubyonrails.org/caching_with_rails.html#low-level-caching-using-rails-cache
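
The fetch-or-compute behaviour described above can be sketched without Rails. The `TinyCache` class below is a hypothetical stand-in written for this example, not the Rails implementation; it only mimics the documented behaviour that the block is skipped while a non-expired entry exists. The key name and the injected clock are likewise assumptions.

```ruby
# Hypothetical in-memory stand-in for a cache store (illustration only).
class TinyCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(clock: -> { Time.now })
    @store = {}
    @clock = clock
  end

  # Returns the cached value if the entry exists and has not expired;
  # otherwise runs the block, stores its result, and returns it.
  def fetch(key, expires_in:)
    now = @clock.call
    entry = @store[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @store[key] = Entry.new(value, now + expires_in)
    value
  end
end

ONE_WEEK = 7 * 24 * 60 * 60 # seconds
now = Time.at(0)
cache = TinyCache.new(clock: -> { now })
calls = 0

load_repo = lambda do
  cache.fetch('github_ingestor_example', expires_in: ONE_WEEK) do
    calls += 1 # stands in for one GitHub API request
    { 'name' => 'example-repo' }
  end
end

load_repo.call      # cache miss: the block runs
load_repo.call      # cache hit: the block is skipped
now += ONE_WEEK + 1 # move the clock past the TTL
load_repo.call      # entry expired: the block runs again
puts "API calls made: #{calls}"
```

In the application itself the same effect comes from `Rails.cache.fetch(key, expires_in: 1.week) do ... end`, provided a cache store is enabled in the environment.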

I deleted the message, my bad! I tested everything and it works as expected now. I was worried that open_url was still being called even on a cache fetch. As the docs you linked and my own tests show, it now makes the correct number of API calls. I refactored accordingly.

lib/ingestors/concerns/github_ingestor_read_helpers.rb

lib/ingestors/concerns/sitemap_helpers.rb

lib/ingestors/concerns/github_ingestor_material_helpers.rb

fbacall · 2025-10-15T14:03:44Z

lib/ingestors/concerns/github_ingestor_material_helpers.rb

+      rescue StandardError
+        nil
+      end


Generally not a good idea to rescue StandardError like this: it will obscure legitimate bugs such as typos in method calls.

I added an @messages entry to each and every StandardError rescue.

What are the actual exceptions you are trying to rescue in these cases?

I think the metadata extraction code should be robust enough to not raise exceptions when it can't parse out the DOI etc. from the git repo.

OK, I see. I reworked the parsing so that it no longer raises exceptions, which indeed were not legitimate there.
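
One way to read this exchange: an extraction helper can return nil for missing metadata instead of relying on a blanket rescue. The sketch below is an assumption about what such a helper might look like; `extract_doi`, the `DOI_PATTERN` regex, and the sample DOI are made up for the example and are not taken from the PR.

```ruby
# Simplified, hypothetical pattern for DOI-like strings (illustration only).
DOI_PATTERN = %r{\b10\.\d{4,9}/[^\s"'<>)]+}

# Returns the first DOI-looking substring, or nil when the text is nil or
# contains no match. No exception handling is involved.
def extract_doi(text)
  text.to_s[DOI_PATTERN]
end

p extract_doi('See https://doi.org/10.1234/example.5678 for details')
p extract_doi('No identifier in this README')
p extract_doi(nil)
```

Because nothing is rescued, a genuine bug such as a misspelled method name would still surface as an exception rather than being silently turned into nil.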

lib/ingestors/ingestor.rb

lib/ingestors/github_ingestor.rb

kennethrioja · 2025-10-23T13:24:48Z

History:

I set up my dev branch with config/environments/test.rb having config.cache_store = :memory_store, which is why some tests did not pass -> 1c3a0a6
Force-pushed (only --amend) multiple times to get CodeQL to pass -> 791b4dc
Added more tests for the exact github.{com|io} URL format needed to fetch data -> 23e04f1

kennethrioja · 2025-10-24T11:51:12Z

I have changed the get_or_set_cache method to make fewer GitHub API calls and added the relevant tests -> d52a2a0

Copilot

Pull Request Overview

This PR adds a GitHub Material Ingestor that fetches training material metadata from GitHub repositories and GitHub Pages. The implementation includes support for sitemap parsing (XML and TXT), API response caching with a 7-day TTL to respect GitHub API rate limits (60 requests/hour for unauthenticated users), and extraction of 14 metadata fields including title, description, keywords, license, DOI, version, dates, contributors, resource type, and prerequisites.

Key Changes

- New GitHub ingestor class that transforms GitHub repositories/pages into materials with comprehensive metadata extraction
- Extracted sitemap parsing logic into a reusable SitemapHelpers concern module
- Added redirect handling methods to the base Ingestor class for following HTTP and meta refresh redirects

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `lib/ingestors/github_ingestor.rb` | Main ingestor implementation with GitHub API integration, caching, and metadata extraction logic |
| `lib/ingestors/concerns/sitemap_helpers.rb` | Reusable sitemap parsing module extracted from common patterns |
| `lib/ingestors/ingestor.rb` | Added `get_redirected_url` and `to_https` helper methods for redirect handling |
| `lib/ingestors/ingestor_factory.rb` | Registered the new GitHub ingestor |
| `test/unit/ingestors/github_ingestor_test.rb` | Comprehensive test coverage including URL parsing, caching behavior, TTL handling, and metadata extraction |
| `test/unit/ingestors/ingestor_test.rb` | Tests for new base ingestor methods (`open_url`, `get_redirected_url`) |
| `test/helpers/sitemap_helper_test.rb` | Tests for the sitemap helpers module |
| `test/test_helper.rb` | Added `DummyIngestor` test helper class |
| `test/fixtures/files/ingestion/github/*` | Mock data files for testing GitHub API responses |

Copilot · 2025-11-20T15:56:58Z

lib/ingestors/github_ingestor.rb

+      Nokogiri::HTML(response.body)
+    end
+
+    # DEFINITION – Opens the GitHub homepage, fetches the 3 first >25 char <p> tags'text


The comment incorrectly states '>25 char' but the code checks for 50 characters. Update comment to '>50 char' or 'more than 50 chars' to match the code at line 148.

Suggested change

-    # DEFINITION – Opens the GitHub homepage, fetches the 3 first >25 char <p> tags'text
+    # DEFINITION – Opens the GitHub homepage, fetches the 3 first >50 char <p> tags' text

Copilot · 2025-11-20T15:56:59Z

lib/ingestors/github_ingestor.rb

+      round = 3
+      doc.css('p').each do |p|
+        p_txt = p&.text&.strip&.gsub(/\s+/, ' ')
+        next if (p_txt.length < 50) || round.zero?


If any method in the safe navigation chain returns nil (e.g., if text or strip returns nil), p_txt will be nil, and line 148 will crash with NoMethodError when calling .length on nil. Add nil check: next if p_txt.nil? || p_txt.length < 50 || round.zero?

Suggested change

-        next if (p_txt.length < 50) || round.zero?
+        next if p_txt.nil? || p_txt.length < 50 || round.zero?

I preferred to set p_txt to '' when it is nil, so that the .length method can still be called on it.
The solution proposed by Copilot was crashing the tests.
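
The approach described in this reply (coercing nil to an empty string before measuring its length) might look like the following minimal sketch. It works on plain strings rather than Nokogiri nodes so that it runs on its own; the helper name `first_long_paragraphs` and its keyword arguments are assumptions, not the PR's code.

```ruby
# Hypothetical helper: normalises whitespace and keeps the first `limit`
# texts that are at least `min_length` characters long. nil entries become
# '' via to_s, so calling length on them is always safe.
def first_long_paragraphs(texts, min_length: 50, limit: 3)
  texts
    .map { |text| text.to_s.strip.gsub(/\s+/, ' ') }
    .select { |text| text.length >= min_length }
    .first(limit)
end

samples = [nil, 'short', "  #{'a' * 60}  ", 'b' * 50, 'c' * 49, 'd' * 70, 'e' * 80]
kept = first_long_paragraphs(samples)
puts kept.map(&:length).inspect
```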

Copilot · 2025-11-20T15:56:59Z

lib/ingestors/ingestor.rb

+      raise 'Too many redirects' if limit.zero?
+
+      https_url = to_https(url) # some `homepage` were http
+      response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] })


If config[:user_agent] is not defined (e.g., in ingestors that don't set it), this will pass nil as the User-Agent header value. Add a fallback or check: use config[:user_agent] || 'TeSS Bot' or verify user_agent exists before calling this method.

Suggested change

-      response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] })
+      response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => (config[:user_agent] || 'TeSS Bot') })

Copilot · 2025-11-20T15:56:59Z

test/unit/ingestors/github_ingestor_test.rb

+          decorator_ref.open_url_counter + 1
+        )
+
+        # puts("open_url has been used, total #{decorator_ref.open_url_counter}")


Remove commented-out debug code before merging to keep the codebase clean.

Suggested change

-        # puts("open_url has been used, total #{decorator_ref.open_url_counter}")

Copilot · 2025-11-20T15:57:00Z

lib/ingestors/github_ingestor.rb

+      # ... any tag with id containing "prereq" (EN) or "prerreq" (ES)
+      prereq_paragraphs = fetch_prerequisites_from_id_or_class(doc, prereq_paragraphs) if prereq_paragraphs.empty?
+
+      prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.to_s&.strip


[nitpick] The safe navigation chain with &.to_s is redundant: if join succeeds, it always returns a string, making &.to_s unnecessary. Simplify to: prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.strip or add explicit nil handling.

Suggested change

-      prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.to_s&.strip
+      prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.strip || ""

Copilot · 2025-11-20T15:57:00Z

lib/ingestors/github_ingestor.rb

+
+    GITHUB_API_BASE = 'https://api.github.com/repos'
+    CACHE_PREFIX = 'github_ingestor_'
+    TTL = 1.week # time to live after the cache is deleted


Comment is slightly misleading: 'time to live after the cache is deleted' suggests TTL applies post-deletion, but TTL is actually the duration before cache deletion. Clarify to: 'time to live before cache expires' or 'cache expiration time'.

Suggested change

-    TTL = 1.week # time to live after the cache is deleted
+    TTL = 1.week # cache expiration time (time to live before cache expires)

Copilot · 2025-11-20T15:57:00Z

test/unit/ingestors/github_ingestor_test.rb

+      ingestor.open_url("| touch #{file}")
+    rescue StandardError
+    end
+    `ls #{dir}` # This is needed or the `exist?` check below seems to return a stale result


[nitpick] Using backticks to execute shell commands for cache-busting filesystem checks is fragile and platform-dependent. Consider using Dir.children(dir) or File.exist?(file) with proper filesystem sync, or document why this workaround is necessary.

Suggested change

-    `ls #{dir}` # This is needed or the `exist?` check below seems to return a stale result
+    Dir.children(dir) # This is needed or the `exist?` check below seems to return a stale result

fbacall · 2025-11-20T16:36:34Z

lib/ingestors/github_ingestor.rb

+    rescue StandardError => e
+      @messages << "#{self.class.name} read failed, #{e.message}"
+    end


These messages can be viewed by users, so it's better to make them user-friendly and send the real exception message to the Rails log; see:
https://github.com/ElixirTeSS/TeSS/blob/master/lib/ingestors/bioschemas_ingestor.rb#L125-L142

(I know there are other ingestors that do this - they are wrong...)
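
A minimal sketch of that split between a user-facing message and a logged exception is shown below. It uses Ruby's standard `Logger` writing to a `StringIO` so that it runs outside Rails; in the application the logger would presumably be `Rails.logger`. The `ExampleIngestor` class, the message wording, and the URL are assumptions for illustration and are not copied from the linked bioschemas_ingestor code.

```ruby
require 'logger'
require 'stringio'

# Hypothetical ingestor skeleton (illustration only).
class ExampleIngestor
  attr_reader :messages

  def initialize(logger)
    @messages = []
    @logger = logger
  end

  # Runs the given block for the URL. On failure, the technical details go
  # to the log and only a generic message is stored for the user.
  def read(url)
    yield url
  rescue StandardError => e
    @logger.error("#{self.class.name} read failed for #{url}: #{e.class}: #{e.message}")
    @messages << "#{self.class.name}: the source could not be read. Please check the URL."
  end
end

log_output = StringIO.new
ingestor = ExampleIngestor.new(Logger.new(log_output))
ingestor.read('https://example.org/repo') { |_url| raise ArgumentError, 'unexpected token' }

puts ingestor.messages
puts log_output.string
```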

kennethrioja added 3 commits October 6, 2025 14:10

feat(github-ingestor): source can be a sitemap of|or github.{io|com}

ec6bbe3

test(github-ingestor): added

ceff245

refactor(github-ingestor): rubocop lint ok

9bfbc67

github-advanced-security bot found potential problems Oct 6, 2025

lib/ingestors/concerns/github_ingestor_read_helpers.rb (fixed)

lib/ingestors/concerns/github_ingestor_read_helpers.rb (fixed)

kennethrioja commented Oct 6, 2025

lib/ingestors/concerns/github_ingestor_material_helpers.rb (outdated)

kennethrioja commented Oct 6, 2025

lib/ingestors/concerns/github_ingestor_material_helpers.rb (outdated)

kennethrioja commented Oct 6, 2025

lib/ingestors/github_ingestor.rb

kennethrioja commented Oct 6, 2025

lib/ingestors/concerns/github_ingestor_material_helpers.rb (outdated)

PhilReedData mentioned this pull request Oct 8, 2025

Write guide to ingestion via Google Sheet ElixirTeSS/docs#10

Open

fbacall requested changes Oct 15, 2025

kennethrioja added 2 commits October 23, 2025 11:08

review(ElixirTeSS#1154): refactor, redirect, regex, stderr

1ceba7f

test(github-ingestor): all related code is 100% covered

1c3a0a6

github-advanced-security bot found potential problems Oct 23, 2025

lib/ingestors/github_ingestor.rb (fixed)

lib/ingestors/github_ingestor.rb (fixed)

kennethrioja force-pushed the github-ingestor branch from a100dee to 1d93c58 October 23, 2025 11:41

github-advanced-security bot found potential problems Oct 23, 2025

lib/ingestors/github_ingestor.rb (fixed)

lib/ingestors/github_ingestor.rb (fixed)

kennethrioja force-pushed the github-ingestor branch from 1d93c58 to 44cfe9c Compare October 23, 2025 11:43

fixtest(github-ingestor): tests now pass and corrected CodeQL error

791b4dc

kennethrioja force-pushed the github-ingestor branch from 44cfe9c to 791b4dc Compare October 23, 2025 12:58

test(github-ingestor): stricter on github.{com|io} urls

23e04f1

refactor(github-ingestor): less open_url calls

d52a2a0

kennethrioja requested a review from fbacall October 24, 2025 11:50

fbacall requested changes Oct 29, 2025

refactor(github-ingestor): cache_fetch and get rid of rescues

6c234ab

kennethrioja requested a review from fbacall November 14, 2025 14:07

fbacall requested a review from Copilot November 20, 2025 15:53

Copilot started reviewing on behalf of fbacall November 20, 2025 15:53

Copilot finished reviewing on behalf of fbacall November 20, 2025 15:55

Copilot AI reviewed Nov 20, 2025

fbacall requested changes Nov 20, 2025

kennethrioja added 2 commits December 10, 2025 10:15

review(ElixirTeSS#1154): answering to 20/11/2025 requested changes

c7d4b75

fix(github-ingestor): when another url was in description

7690b6b

kennethrioja added a commit to kennethrioja/TeSS that referenced this pull request Dec 19, 2025

review(ElixirTeSS#1154): refactor, redirect, regex, stderr

0abe39f

kennethrioja added a commit to kennethrioja/TeSS that referenced this pull request Dec 19, 2025

review(ElixirTeSS#1154): answering to 20/11/2025 requested changes

583618a

chore(parse_sitemap): renaming get_sources to parse_sitemap

18b415f

kennethrioja force-pushed the github-ingestor branch from 15d7ac9 to 18b415f Compare December 19, 2025 10:34

Merge branch 'master' into github-ingestor

fe1a4fd

fbacall approved these changes Dec 19, 2025

fbacall merged commit c8464e8 into ElixirTeSS:master Dec 19, 2025
7 checks passed
