Skip to content

Conversation

@chucklever
Copy link
Contributor

@chucklever chucklever commented Dec 15, 2025

Merge mcgrof's implementation of the datacrunch terraform provider. Looks like they've rebranded as https://verda.com.

mcgrof and others added 11 commits December 15, 2025 11:22
Add comprehensive DataCrunch cloud provider support for GPU instances
ranging from Blackwell B300 to Tesla V100.

Features intelligent GPU selection with three strategies: wildcard
variant selection (ANY_1H100), tier-based fallback (H100_OR_LESS,
B300_OR_LESS) for automatic fallback through GPU tiers when top options
unavailable, and explicit instance types for production workloads.

GPU tier hierarchy supports 10 tiers with automatic fallback to
maximize provisioning success while capping costs. Tier-based selection
recommended for most users (~$1.99/hr cap for H100 tier).

Add capacity checking infrastructure to validate instance availability
before provisioning. API credential management uses OAuth2 with secure
token retrieval from ~/.datacrunch/credentials.

Special handling for local provider development with dev_overrides in
~/.terraformrc, skipping force_init in Ansible terraform module.

Add ML environment setup for DataCrunch instances: package updates, ML
dependencies, PyTorch in virtualenv, NVIDIA driver reload with proper
module unload sequence (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia)
to avoid PyTorch/driver mismatch, Claude Code installation, MOTD with
PyTorch activation instructions, and bashrc auto-activation.

Add persistent volume support with KEEP=1 configuration option. Volume
mappings cached in ~/.cache/kdevops/datacrunch/ enable fast
reprovisioning (seconds vs minutes) while managing ~$10/month storage
costs. Volume lifecycle wrapper scripts automate volume ID tracking
across instance destroy/recreate cycles.

Add defconfig files for tier-based selection, specific GPUs, and
multi-GPU configurations. Comprehensive documentation focuses on
capacity challenges, selection strategies, and best practices.

When a DataCrunch instance type fails to deploy due to capacity issues,
the terraform bringup will retry with a different instance from the
same tier group. An exclude option can be used multiple times to skip
instance types that have already failed.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
…rapper.sh

When terraform output fails, the script suppresses the error with
2>/dev/null and exits immediately at line 52. This prevents the
"Volume cache updated successfully" message from printing and hides
the actual terraform error from the user.

Replace the early exit with an else block so the script continues to
the normal exit at line 79. Change 2>/dev/null to 2>&1 so terraform
errors are captured and displayed when the command fails.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The grep + cut approach for extracting CONFIG_KDEVOPS_HOSTS_PREFIX
assumes the config value is always wrapped in double quotes with no
escaped quotes inside. If the prefix contains special characters or
quotes, parsing will fail.

Replace the fragile grep+cut parsing with sourcing the .config file
directly. This is the standard kdevops pattern used by other scripts
like gen_ssh_key.sh and gen_pcie_passthrough_vars.sh. The .config file
is generated by Kconfig and is designed to be shell-sourceable.

The fix also uses shell parameter expansion with defaults for safer
variable access, ensuring empty values are handled correctly.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The previous implementation downloaded and executed the uv installer
script directly from astral.sh without any integrity verification.
This posed a security risk as an attacker could compromise the domain
or perform a MITM attack to inject malicious code.

Replace the curl pipe to shell pattern with a secure installation
flow: download the installer from GitHub releases to a temporary file,
verify its SHA256 checksum using ansible.builtin.get_url's checksum
parameter, execute the verified script, then remove the temporary
file. The installer is pinned to version 0.9.17.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
urllib.request.urlopen without a timeout can block indefinitely if the
server doesn't respond. Add a DEFAULT_TIMEOUT constant (30 seconds) to
both datacrunch_api.py and datacrunch_ssh_keys.py, and pass it to all
urlopen calls.

Also add explicit exception handling for socket.timeout and
urllib.error.URLError so that timeout and connection errors are handled
cleanly with informative error messages rather than being caught by the
generic Exception handler.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The get_unique_key_name() function used os.getcwd() to generate a
directory-based hash for SSH key names. This broke the guarantee that
the same project always gets the same key name, because os.getcwd()
returns different paths when the script is run from different
subdirectories within the same project.

Use git rev-parse --show-toplevel to get the repository root path,
which remains constant regardless of which subdirectory the script
is invoked from. Fall back to os.getcwd() if not in a git repository.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The embedded Python script in apply_wrapper.sh constructs a subprocess
command list using shell-expanded variables ($VOLUME_CACHE and
$HOST_PREFIX). When bash expands these variables inside the double-quoted
heredoc, they become literal strings in the Python code. If either path
contains spaces, the Python list will be incorrectly split.

Refactor to have Python output hostname/volume_id pairs in a parseable
format, then let the shell call volume_cache.py directly with properly
quoted arguments. This matches the pattern already used in
destroy_wrapper.sh and ensures paths with spaces are handled safely.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The TERRAFORM_DATACRUNCH_SSH_KEY_FILE config uses a static default path
that is shared across all projects, contradicting the documented security
model which states that each kdevops project should use its own SSH key.

Update the default to use the same directory-based checksum pattern
already used by TERRAFORM_SSH_CONFIG_PUBKEY_FILE, ensuring the SSH key
file path is unique per project directory.

Also update the Python script default to match the new behavior.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The on-demand availability check was using location_codes[0] to
return results for only the first location. This is inconsistent
with spot mode behavior which returns per-location availability.

Change the on-demand branch to iterate over all location_codes and
return a result entry for each location. This provides complete
availability information when querying on-demand instances and
aligns the output format with spot mode.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The cmd_delete function correctly checks whether save_cache succeeds
and returns 0 on success. However, if save_cache fails and returns
False, the function implicitly returns None instead of an integer
exit code. This causes unexpected behavior in shell scripts that
check the exit status.

Add an else clause to return 1 when save_cache fails, ensuring all
execution paths return an integer exit code.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The default value for ssh_config_name uses a relative path that gets
evaluated against the working directory, causing inconsistent behavior
when terraform is executed from different locations. Change the default
to use the home-relative path pattern already established in the rcloud
module.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
@chucklever chucklever force-pushed the cel/terraform-datacrunch branch from 25a781e to c1ea579 Compare December 15, 2025 20:22
@chucklever chucklever merged commit e2e2d8f into main Dec 16, 2025
20 of 22 checks passed
@chucklever chucklever deleted the cel/terraform-datacrunch branch December 16, 2025 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants