Cel/terraform datacrunch #69
Merged
Conversation
Add comprehensive DataCrunch cloud provider support for GPU instances ranging from Blackwell B300 to Tesla V100.

Features intelligent GPU selection with three strategies: wildcard variant selection (ANY_1H100), tier-based fallback (H100_OR_LESS, B300_OR_LESS) which falls back automatically through GPU tiers when the top options are unavailable, and explicit instance types for production workloads. The GPU tier hierarchy supports 10 tiers with automatic fallback to maximize provisioning success while capping costs. Tier-based selection is recommended for most users (~$1.99/hr cap for the H100 tier).

Add capacity checking infrastructure to validate instance availability before provisioning. API credential management uses OAuth2 with secure token retrieval from ~/.datacrunch/credentials.

Add special handling for local provider development with dev_overrides in ~/.terraformrc, skipping force_init in the Ansible terraform module.

Add ML environment setup for DataCrunch instances: package updates, ML dependencies, PyTorch in a virtualenv, NVIDIA driver reload with the proper module unload sequence (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia) to avoid a PyTorch/driver mismatch, Claude Code installation, an MOTD with PyTorch activation instructions, and bashrc auto-activation.

Add persistent volume support with a KEEP=1 configuration option. Volume mappings cached in ~/.cache/kdevops/datacrunch/ enable fast reprovisioning (seconds instead of minutes) while managing ~$10/month storage costs. Volume lifecycle wrapper scripts automate volume ID tracking across instance destroy/recreate cycles.

Add defconfig files for tier-based selection, specific GPUs, and multi-GPU configurations. Comprehensive documentation covers capacity challenges, selection strategies, and best practices.

When a DataCrunch instance type fails to deploy due to capacity issues, the terraform bringup retries with a different instance from the same tier group. An exclude option can be used multiple times to skip instance types that have already failed.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
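The tier-based fallback with exclusions described above can be sketched in Python. The tier contents and instance type names below are hypothetical stand-ins, not DataCrunch's real identifiers; only the selection logic follows the commit message:

```python
# Hypothetical sketch of tier-based GPU selection with fallback.
# Instance type names are illustrative, not real DataCrunch IDs.

def pick_instance(tier, available, exclude=()):
    """Walk a tier's instance types in preference order and return the
    first one that is currently available and not excluded (e.g. types
    that already failed with a capacity error)."""
    for itype in tier:
        if itype in available and itype not in exclude:
            return itype
    return None  # caller falls through to the next tier or gives up

H100_OR_LESS = ["1xH100", "1xA100-80G", "1xA100-40G", "1xV100"]  # illustrative

# First choice has no capacity and the second already failed once,
# so selection falls back to the third entry in the tier.
choice = pick_instance(
    H100_OR_LESS,
    available={"1xA100-80G", "1xA100-40G", "1xV100"},
    exclude=("1xA100-80G",),
)
```

Passing each failed type back via the exclude list is what lets repeated bringup attempts make progress instead of retrying the same capacity-starved instance.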
…rapper.sh

When terraform output fails, the script suppresses the error with 2>/dev/null and exits immediately at line 52. This prevents the "Volume cache updated successfully" message from printing and hides the actual terraform error from the user.

Replace the early exit with an else block so the script continues to the normal exit at line 79. Change 2>/dev/null to 2>&1 so terraform errors are captured and displayed when the command fails.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
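The fixed control flow can be sketched as a small shell fragment. The function below is a stand-in for the real `terraform output` invocation in the wrapper; the message strings are illustrative:

```shell
# Sketch of the apply_wrapper.sh fix: capture stderr with 2>&1 and fall
# through to a single exit point instead of exiting early on failure.
run_terraform_output() {
    # stand-in for `terraform output`; fails when FAKE_FAIL=1
    if [ "${FAKE_FAIL:-0}" = "1" ]; then
        echo "Error: no outputs defined" >&2
        return 1
    fi
    echo '{"volumes": {}}'
}

if output=$(run_terraform_output 2>&1); then
    echo "Volume cache updated successfully"
else
    # the error text was captured, not discarded with 2>/dev/null
    echo "terraform output failed: $output"
fi
# ...script continues to its normal exit at the bottom
```

The key changes are the `else` branch (so the script keeps running to its normal exit) and `2>&1` (so the failure message reaches the user).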
The grep + cut approach for extracting CONFIG_KDEVOPS_HOSTS_PREFIX assumes the config value is always wrapped in double quotes with no escaped quotes inside. If the prefix contains special characters or quotes, parsing will fail.

Replace the fragile grep+cut parsing with sourcing the .config file directly. This is the standard kdevops pattern used by other scripts such as gen_ssh_key.sh and gen_pcie_passthrough_vars.sh. The .config file is generated by Kconfig and is designed to be shell-sourceable. The fix also uses shell parameter expansion with defaults for safer variable access, ensuring empty values are handled correctly.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
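The difference between the two parsing approaches can be sketched as follows; the .config contents here are made up, but the source-plus-parameter-expansion pattern is the one the fix adopts:

```shell
# Sketch: Kconfig output is shell-sourceable, so let the shell parse it.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
CONFIG_KDEVOPS_HOSTS_PREFIX="demo \"quoted\" prefix"
EOF

# Fragile pattern being replaced:
#   grep ^CONFIG_KDEVOPS_HOSTS_PREFIX "$cfg" | cut -d'"' -f2
# cut splits on every double quote, so the escaped quote truncates
# the value to: demo \

# Sourcing handles quoting and escapes correctly:
. "$cfg"
# parameter expansion with a default guards against an empty value
HOSTS_PREFIX="${CONFIG_KDEVOPS_HOSTS_PREFIX:-kdevops}"
rm -f "$cfg"
```

With sourcing, the shell itself interprets the quoting Kconfig emitted, so embedded quotes and special characters round-trip intact.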
The previous implementation downloaded and executed the uv installer script directly from astral.sh without any integrity verification. This posed a security risk, as an attacker could compromise the domain or perform a MITM attack to inject malicious code.

Replace the curl-pipe-to-shell pattern with a secure installation flow: download the installer from GitHub releases to a temporary file, verify its SHA256 checksum using ansible.builtin.get_url's checksum parameter, execute the verified script, then remove the temporary file. The installer is pinned to version 0.9.17.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
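The verified-download flow can be sketched as Ansible tasks. The URL path and the checksum digest below are placeholders, not the real release artifact or its published SHA256; only the get_url checksum mechanism and the download/verify/run/remove shape follow the commit message:

```yaml
# Placeholder values: the real tasks pin uv 0.9.17 and its published
# SHA256; the digest below is illustrative, not the real checksum.
- name: Download uv installer to a temporary file
  ansible.builtin.get_url:
    url: "https://github.com/astral-sh/uv/releases/download/0.9.17/uv-installer.sh"
    dest: /tmp/uv-installer.sh
    mode: "0755"
    checksum: "sha256:0000000000000000000000000000000000000000000000000000000000000000"

- name: Run the verified installer
  ansible.builtin.command: /tmp/uv-installer.sh

- name: Remove the temporary installer
  ansible.builtin.file:
    path: /tmp/uv-installer.sh
    state: absent
```

get_url fails the task when the downloaded file's digest does not match, so a tampered installer is never executed.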
urllib.request.urlopen without a timeout can block indefinitely if the server doesn't respond.

Add a DEFAULT_TIMEOUT constant (30 seconds) to both datacrunch_api.py and datacrunch_ssh_keys.py, and pass it to all urlopen calls. Also add explicit exception handling for socket.timeout and urllib.error.URLError so that timeout and connection errors are handled cleanly with informative error messages rather than being caught by the generic Exception handler.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
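The hardened request path can be sketched as below; the function name and error-reporting style are illustrative, but the DEFAULT_TIMEOUT constant and the explicit socket.timeout/URLError handling match what the commit describes:

```python
import socket
import urllib.error
import urllib.request

DEFAULT_TIMEOUT = 30  # seconds, as described in the commit

def fetch_url(url, timeout=DEFAULT_TIMEOUT):
    """Illustrative sketch: urlopen with a timeout and explicit
    handling for timeouts and connection failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout:
        # without the timeout argument this call could block forever
        raise SystemExit(f"request to {url} timed out after {timeout}s")
    except urllib.error.URLError as exc:
        raise SystemExit(f"connection to {url} failed: {exc.reason}")
```

The specific exception clauses run before any generic handler would, so the user sees "timed out" or "connection failed" instead of an opaque traceback.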
The get_unique_key_name() function used os.getcwd() to generate a directory-based hash for SSH key names. This broke the guarantee that the same project always gets the same key name, because os.getcwd() returns different paths when the script is run from different subdirectories within the same project.

Use git rev-parse --show-toplevel to get the repository root path, which remains constant regardless of which subdirectory the script is invoked from. Fall back to os.getcwd() if not in a git repository.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
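A minimal sketch of the stable-name scheme; the hash length, prefix, and helper names here are hypothetical, but the git-toplevel-with-cwd-fallback logic is the one the commit describes:

```python
import hashlib
import os
import subprocess

def project_root():
    """Prefer the git repository toplevel so the derived hash is stable
    no matter which subdirectory the script runs from; fall back to the
    current directory outside a git repo (or if git is missing)."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return os.getcwd()

def unique_key_name(prefix="kdevops"):
    # hypothetical naming scheme: <prefix>-<8 hex chars of sha256(root)>
    digest = hashlib.sha256(project_root().encode()).hexdigest()[:8]
    return f"{prefix}-{digest}"
```

Since the repository root is constant for every subdirectory of the project, unique_key_name() now returns the same value wherever it is invoked.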
The embedded Python script in apply_wrapper.sh constructs a subprocess command list using shell-expanded variables ($VOLUME_CACHE and $HOST_PREFIX). When bash expands these variables inside the double-quoted heredoc, they become literal strings in the Python code. If either path contains spaces, the Python list will be incorrectly split.

Refactor to have Python output hostname/volume_id pairs in a parseable format, then let the shell call volume_cache.py directly with properly quoted arguments. This matches the pattern already used in destroy_wrapper.sh and ensures paths with spaces are handled safely.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
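The refactored pattern can be sketched as follows. The emitter function below stands in for the embedded Python, and the field values and the volume_cache.py invocation shown in the comment are hypothetical:

```shell
# Sketch: Python (faked here) prints one tab-separated hostname/volume_id
# pair per line; the shell reads each pair and quotes the fields when
# invoking the cache helper, so spaces survive intact.
emit_pairs() {
    # stand-in for the Python helper's parseable output
    printf '%s\t%s\n' "host with spaces" "vol-123"
    printf '%s\t%s\n' "host-two" "vol-456"
}

tab=$(printf '\t')
emit_pairs | while IFS="$tab" read -r hostname volume_id; do
    # quoted expansions keep whitespace intact; echo stands in for the
    # real call, e.g.: volume_cache.py set "$hostname" "$volume_id"
    echo "set [$hostname] -> [$volume_id]"
done
```

Because the shell quotes each field at the call site, no value is ever spliced as text into generated Python code, which is what made the heredoc approach fragile.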
The TERRAFORM_DATACRUNCH_SSH_KEY_FILE config uses a static default path that is shared across all projects, contradicting the documented security model, which states that each kdevops project should use its own SSH key.

Update the default to use the same directory-based checksum pattern already used by TERRAFORM_SSH_CONFIG_PUBKEY_FILE, ensuring the SSH key file path is unique per project directory. Also update the Python script default to match the new behavior.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The on-demand availability check was using location_codes[0] to return results for only the first location. This is inconsistent with spot mode behavior, which returns per-location availability.

Change the on-demand branch to iterate over all location_codes and return a result entry for each location. This provides complete availability information when querying on-demand instances and aligns the output format with spot mode.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
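The shape of the change can be sketched in a few lines; the location codes, dict keys, and the is_available callback are illustrative stand-ins for the real API query:

```python
# Sketch: return one availability entry per location instead of only
# location_codes[0]; names and codes here are illustrative.
def check_availability(location_codes, is_available):
    """is_available(loc) stands in for the per-location API query."""
    return [
        {"location_code": loc, "available": is_available(loc)}
        for loc in location_codes
    ]

# Every queried location appears in the result, matching spot mode.
results = check_availability(["LOC-01", "LOC-02"],
                             lambda loc: loc == "LOC-01")
```

The old behavior corresponded to calling is_available(location_codes[0]) once, which silently dropped every other location from the output.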
The cmd_delete function correctly checks whether save_cache succeeds and returns 0 on success. However, if save_cache fails and returns False, the function implicitly returns None instead of an integer exit code. This causes unexpected behavior in shell scripts that check the exit status.

Add an else clause to return 1 when save_cache fails, ensuring all execution paths return an integer exit code.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
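A minimal sketch of the fix, with save_cache reduced to a stand-in so the control flow is visible; the real function's signature and cache handling differ:

```python
# save_cache is a stand-in for the real cache writer; the fail flag
# only exists here to exercise both paths.
def save_cache(cache, fail=False):
    return not fail

def cmd_delete(cache, hostname, fail=False):
    """Every execution path now returns an int exit code."""
    cache.pop(hostname, None)
    if save_cache(cache, fail=fail):
        return 0
    else:
        # previously this path fell off the end and returned None,
        # which shell callers saw as a meaningless exit status
        return 1
```

Returning an int on every path matters because the value typically feeds sys.exit(), and sys.exit(None) exits 0, masking the failure from shell callers.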
The default value for ssh_config_name uses a relative path that gets evaluated against the working directory, causing inconsistent behavior when terraform is executed from different locations.

Change the default to use the home-relative path pattern already established in the rcloud module.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
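A hypothetical sketch of what a home-relative default looks like in Terraform; the variable name follows the commit, but the default filename and the use of pathexpand() are illustrative assumptions, not the exact rcloud pattern:

```hcl
# Hypothetical sketch: a home-relative default resolved with
# pathexpand(), instead of a path relative to the working directory.
variable "ssh_config_name" {
  description = "SSH config file updated for provisioned hosts"
  type        = string
  default     = "~/.ssh/config_kdevops" # illustrative filename
}

locals {
  # pathexpand turns the leading ~ into the user's home directory,
  # so the result no longer depends on where terraform was invoked.
  ssh_config_path = pathexpand(var.ssh_config_name)
}
```

Anchoring the default to the home directory gives the same resolved path whether terraform runs from the project root or a subdirectory.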
Force-pushed from 25a781e to c1ea579.
Merge mcgrof's implementation of the DataCrunch Terraform provider. It looks like they've rebranded as https://verda.com.