-
Notifications
You must be signed in to change notification settings - Fork 20
Material Ingestor – GitHub #1154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lib/ingestors/github_ingestor.rb
Outdated
| content = open_url(url) | ||
| data = content ? JSON.parse(content.read) : nil | ||
|
|
||
| set_cache(key, data) if Rails.cache.read(key).nil? # sets cache only if there is no cache yet or is expired | ||
| get_cache(key) | ||
| data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be simplified:
Rails.cache.fetch(key) do
JSON.parse(open_url(url).read)
endRails will decide whether to get/set the cache.
(Also your current implementation will set the cache even if no content was read)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also add a prefix to the cache key, github_ingestor_ or something, since this cache is used across the entire application
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw a comment via email about the cache not working, but no idea where it is in the GitHub UI.
Possibly you did not have caching enabled on your dev instance:
https://github.com/ElixirTeSS/TeSS/blob/master/config/environments/development.rb#L20-L32
Rails.cache.fetch(key, expires_in: 1.week) do ... will not execute the code in the block if key exists in the cache and the cache entry is < 1 week old.
https://guides.rubyonrails.org/caching_with_rails.html#low-level-caching-using-rails-cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deleted the message my bad! I tested everything and it works as expected now. My issue was that I was worried that it does open_url even when doing cache fetch. As you've shown with the docs and the tests I've done, it now correctly goes through the correct number of API calls. I refactored accordingly
| rescue StandardError | ||
| nil | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally not a good idea to rescue StandardError like this, it will obscure legitimate bugs: typos in method calls etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a @message to each and every StdErr rescue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the actual exceptions you are trying to rescue in these cases?
I think the metadata extraction code should be robust enough to not raise exceptions when it can't parse out the DOI etc. from the git repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see, I avoided the raise exceptions which were indeed not legitimate in the parsing
a100dee to
1d93c58
Compare
1d93c58 to
44cfe9c
Compare
44cfe9c to
791b4dc
Compare
|
History:
|
|
I have changed the |
| rescue StandardError | ||
| nil | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the actual exceptions you are trying to rescue in these cases?
I think the metadata extraction code should be robust enough to not raise exceptions when it can't parse out the DOI etc. from the git repo.
lib/ingestors/github_ingestor.rb
Outdated
| content = open_url(url) | ||
| data = content ? JSON.parse(content.read) : nil | ||
|
|
||
| set_cache(key, data) if Rails.cache.read(key).nil? # sets cache only if there is no cache yet or is expired | ||
| get_cache(key) | ||
| data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw a comment via email about the cache not working, but no idea where it is in the GitHub UI.
Possibly you did not have caching enabled on your dev instance:
https://github.com/ElixirTeSS/TeSS/blob/master/config/environments/development.rb#L20-L32
Rails.cache.fetch(key, expires_in: 1.week) do ... will not execute the code in the block if key exists in the cache and the cache entry is < 1 week old.
https://guides.rubyonrails.org/caching_with_rails.html#low-level-caching-using-rails-cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a GitHub Material Ingestor that fetches training material metadata from GitHub repositories and GitHub Pages. The implementation includes support for sitemap parsing (XML and TXT), API response caching with a 7-day TTL to respect GitHub API rate limits (60 requests/hour for unauthenticated users), and extraction of 14 metadata fields including title, description, keywords, license, DOI, version, dates, contributors, resource type, and prerequisites.
Key Changes
- New GitHub ingestor class that transforms GitHub repositories/pages into materials with comprehensive metadata extraction
- Extracted sitemap parsing logic into a reusable
SitemapHelpersconcern module - Added redirect handling methods to the base
Ingestorclass for following HTTP and meta refresh redirects
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
lib/ingestors/github_ingestor.rb |
Main ingestor implementation with GitHub API integration, caching, and metadata extraction logic |
lib/ingestors/concerns/sitemap_helpers.rb |
Reusable sitemap parsing module extracted from common patterns |
lib/ingestors/ingestor.rb |
Added get_redirected_url and to_https helper methods for redirect handling |
lib/ingestors/ingestor_factory.rb |
Registered the new GitHub ingestor |
test/unit/ingestors/github_ingestor_test.rb |
Comprehensive test coverage including URL parsing, caching behavior, TTL handling, and metadata extraction |
test/unit/ingestors/ingestor_test.rb |
Tests for new base ingestor methods (open_url, get_redirected_url) |
test/helpers/sitemap_helper_test.rb |
Tests for the sitemap helpers module |
test/test_helper.rb |
Added DummyIngestor test helper class |
test/fixtures/files/ingestion/github/* |
Mock data files for testing GitHub API responses |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lib/ingestors/github_ingestor.rb
Outdated
| Nokogiri::HTML(response.body) | ||
| end | ||
|
|
||
| # DEFINITION – Opens the GitHub homepage, fetches the 3 first >25 char <p> tags'text |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment incorrectly states '>25 char' but the code checks for 50 characters. Update comment to '>50 char' or 'more than 50 chars' to match the code at line 148.
| # DEFINITION – Opens the GitHub homepage, fetches the 3 first >25 char <p> tags'text | |
| # DEFINITION – Opens the GitHub homepage, fetches the 3 first >50 char <p> tags' text |
lib/ingestors/github_ingestor.rb
Outdated
| round = 3 | ||
| doc.css('p').each do |p| | ||
| p_txt = p&.text&.strip&.gsub(/\s+/, ' ') | ||
| next if (p_txt.length < 50) || round.zero? |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If any method in the safe navigation chain returns nil (e.g., if text or strip returns nil), p_txt will be nil, and line 148 will crash with NoMethodError when calling .length on nil. Add nil check: next if p_txt.nil? || p_txt.length < 50 || round.zero?
| next if (p_txt.length < 50) || round.zero? | |
| next if p_txt.nil? || p_txt.length < 50 || round.zero? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I preferred to make p_txt = '' if nil, this way it can use .length method
The solution proposed by Copilot was crashing tests
lib/ingestors/ingestor.rb
Outdated
| raise 'Too many redirects' if limit.zero? | ||
|
|
||
| https_url = to_https(url) # some `homepage` were http | ||
| response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] }) |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If config[:user_agent] is not defined (e.g., in ingestors that don't set it), this will pass nil as the User-Agent header value. Add a fallback or check: use config[:user_agent] || 'TeSS Bot' or verify user_agent exists before calling this method.
| response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => config[:user_agent] }) | |
| response = HTTParty.get(https_url, follow_redirects: true, headers: { 'User-Agent' => (config[:user_agent] || 'TeSS Bot') }) |
| decorator_ref.open_url_counter + 1 | ||
| ) | ||
|
|
||
| # puts("open_url has been used, total #{decorator_ref.open_url_counter}") |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented-out debug code before merging to keep the codebase clean.
| # puts("open_url has been used, total #{decorator_ref.open_url_counter}") |
lib/ingestors/github_ingestor.rb
Outdated
| # ... any tag with id containing "prereq" (EN) or "prerreq" (ES) | ||
| prereq_paragraphs = fetch_prerequisites_from_id_or_class(doc, prereq_paragraphs) if prereq_paragraphs.empty? | ||
|
|
||
| prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.to_s&.strip |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The safe navigation chain with &.to_s is redundant—if join succeeds, it always returns a string, making &.to_s unnecessary. Simplify to: prereq_paragraphs&.join(\"\\n\")&.gsub(/\\n\\n+/, \"\\n\")&.strip or add explicit nil handling.
| prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.to_s&.strip | |
| prereq_paragraphs&.join("\n")&.gsub(/\n\n+/, "\n")&.strip || "" |
lib/ingestors/github_ingestor.rb
Outdated
|
|
||
| GITHUB_API_BASE = 'https://api.github.com/repos' | ||
| CACHE_PREFIX = 'github_ingestor_' | ||
| TTL = 1.week # time to live after the cache is deleted |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is slightly misleading: 'time to live after the cache is deleted' suggests TTL applies post-deletion, but TTL is actually the duration before cache deletion. Clarify to: 'time to live before cache expires' or 'cache expiration time'.
| TTL = 1.week # time to live after the cache is deleted | |
| TTL = 1.week # cache expiration time (time to live before cache expires) |
| ingestor.open_url("| touch #{file}") | ||
| rescue StandardError | ||
| end | ||
| `ls #{dir}` # This is needed or the `exist?` check below seems to return a stale result |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using backticks to execute shell commands for cache-busting filesystem checks is fragile and platform-dependent. Consider using Dir.children(dir) or File.exist?(file) with proper filesystem sync, or document why this workaround is necessary.
| `ls #{dir}` # This is needed or the `exist?` check below seems to return a stale result | |
| Dir.children(dir) # This is needed or the `exist?` check below seems to return a stale result |
| rescue StandardError => e | ||
| @messages << "#{self.class.name} read failed, #{e.message}" | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These messages can be viewed by users, it's better to make them user-friendly and send the real exception message to the rails log, see:
https://github.com/ElixirTeSS/TeSS/blob/master/lib/ingestors/bioschemas_ingestor.rb#L125-L142
(I know there are other ingestors that do this - they are wrong...)
15d7ac9 to
18b415f
Compare
Summary of changes
TTL) before re-updating the material. Here is the list of the API calls (n = 4 per Github repo)Motivation and context
Checklist
to license it to the TeSS codebase under the
BSD license.