Merged
8 changes: 4 additions & 4 deletions docs/get-started/verify-hami.md
Original file line number Diff line number Diff line change
@@ -30,17 +30,17 @@ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

### 2. Configure your runtime

* For containerd: Edit `/etc/containerd/config.toml` to set the default runtime name to `"nvidia"` and the binary name to `"/usr/bin/nvidia-container-runtime"`.
- For containerd: Edit `/etc/containerd/config.toml` to set the default runtime name to `"nvidia"` and the binary name to `"/usr/bin/nvidia-container-runtime"`.

* Restart:
- Restart:

```bash
sudo systemctl daemon-reload && sudo systemctl restart containerd
```
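
The containerd edit described above typically comes down to two keys, `default_runtime_name` and `BinaryName`. The helper below is a minimal sketch for checking that both are present; the function name `show_nvidia_runtime` is hypothetical, and because the TOML table that holds these keys differs between containerd versions, it only searches for the key names rather than a full table path.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the runtime-related keys in a containerd config.
# Defaults to the path named in the step above; pass another path to override.
show_nvidia_runtime() {
  local config="${1:-/etc/containerd/config.toml}"
  # -n prints line numbers so the entries are easy to locate in an editor.
  grep -nE 'default_runtime_name|BinaryName' "$config"
}
```

Running `show_nvidia_runtime` after editing should list one line that sets the default runtime to `"nvidia"` and one that points to `/usr/bin/nvidia-container-runtime`.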

* For Docker: Edit `/etc/docker/daemon.json` to set `"default-runtime": "nvidia"`.
- For Docker: Edit `/etc/docker/daemon.json` to set `"default-runtime": "nvidia"`.

* Restart:
- Restart:

```bash
sudo systemctl daemon-reload && sudo systemctl restart docker
4 changes: 2 additions & 2 deletions docs/installation/how-to-use-hami-dra.md
@@ -11,7 +11,7 @@ By installing the [HAMi DRA webhook](https://github.com/Project-HAMi/HAMi-DRA) i

## Prerequisites

* Kubernetes version >= 1.34 with DRA Consumable Capacity [featuregate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) enabled
- Kubernetes version >= 1.34 with DRA Consumable Capacity [featuregate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) enabled
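
The version half of this prerequisite can be checked with a few lines of shell. This is a sketch under assumptions: the function name `k8s_at_least_1_34` is hypothetical, and it expects a version string shaped like `v1.34.1`. It does not check the feature gate itself; the gate is assumed here to be named `DRAConsumableCapacity`, which should be confirmed on the linked feature gates page.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when a Kubernetes version string such as
# "v1.34.1" is at least 1.34.
k8s_at_least_1_34() {
  local v="${1#v}"            # drop a leading "v"
  local major="${v%%.*}"      # text before the first dot
  local rest="${v#*.}"        # text after the first dot
  local minor="${rest%%.*}"   # text before the next dot
  minor="${minor%%[!0-9]*}"   # strip suffixes such as "34+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 34 ]; }
}

# Example call against a live cluster (assumes kubectl and jq are installed):
#   k8s_at_least_1_34 "$(kubectl version -o json | jq -r .serverVersion.gitVersion)"
```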

## Installation

@@ -38,7 +38,7 @@ DRA mode is not compatible with traditional mode. Do not enable both at the same

The implementation of DRA functionality requires support from the corresponding device's DRA Driver. Currently supported devices include:

* [NVIDIA GPU](../userguide/nvidia-device/dynamic-resource-allocation)
- [NVIDIA GPU](../userguide/nvidia-device/dynamic-resource-allocation)

Please refer to the corresponding page to install the device driver.

20 changes: 10 additions & 10 deletions docs/releases.md
@@ -26,14 +26,14 @@ Minor releases contain features, enhancements, and fixes that are introduced in
Since HAMi is a fast growing project and features continue to iterate rapidly,
having a minor release approximately every few months helps balance speed and stability.

* Roughly every 3 months
- Roughly every 3 months

### PATCH release

Patch releases are for backwards-compatible bug fixes and very minor enhancements which do not impact stability or compatibility.
Typically only critical fixes are selected for patch releases. Usually there will be at least one patch release in a minor release cycle.

* When critical fixes are required, or roughly each month
- When critical fixes are required, or roughly each month

### Versioning

@@ -53,17 +53,17 @@ Critical issues, with no work-arounds, are added to the next patch release.

Release branches and PRs are managed as follows:

* All changes are always first committed to `master`.
* Branches are created for each major or minor release.
* The branch name will contain the version, for example release-1.2.
* Patch releases are created from a release branch.
* For critical fixes that need to be included in a patch release, PRs should always be first merged to master
- All changes are always first committed to `master`.
- Branches are created for each major or minor release.
- The branch name will contain the version, for example release-1.2.
- Patch releases are created from a release branch.
- For critical fixes that need to be included in a patch release, PRs should always be first merged to master
and then cherry-picked to the release branch. PRs need to be guaranteed to have a release note written and
these descriptions will be reflected in the next patch release.
The cherry-pick process of PRs is executed through the script. See [cherry-pick usage](https://project-hami.io/docs/contributor/cherry-picks).
* For complex changes, especially critical bug fixes, separate PRs may be required for master and release branches.
* A milestone (for example v1.4) is added to each PR to indicate which release will include its changes.
* During PR review, the Assignee selection is used to indicate the reviewer.
- For complex changes, especially critical bug fixes, separate PRs may be required for master and release branches.
- A milestone (for example v1.4) is added to each PR to indicate which release will include its changes.
- During PR review, the Assignee selection is used to indicate the reviewer.
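
The branch naming rule in the list above can be sketched as a small helper. This is an illustration only: the function name `release_branch_for` is hypothetical, and it assumes version strings of the form `v<major>.<minor>[.<patch>]`, matching the examples `release-1.2` and `v1.4`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a version such as "v1.4.2" to its release branch.
release_branch_for() {
  local v="${1#v}"            # drop a leading "v"
  local major="${v%%.*}"      # text before the first dot
  local rest="${v#*.}"        # text after the first dot
  local minor="${rest%%.*}"   # text before the next dot, if any
  printf 'release-%s.%s\n' "$major" "$minor"
}

# A manual equivalent of a cherry-pick would look roughly like this
# (the project's own script remains the documented path):
#   git checkout "$(release_branch_for v1.4.1)"
#   git cherry-pick -x <commit-already-merged-to-master>
```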

### Release Planning

20 changes: 10 additions & 10 deletions docs/userguide/awsneuron-device/enable-awsneuron-managing.md
@@ -8,20 +8,20 @@ AWS Neuron devices are specialized hardware accelerators designed by AWS to opti

HAMi now integrates with [Neuron scheduler extension](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#deploy-neuron-scheduler-extension), providing the following capabilities:

* **Neuron sharing**: HAMi supports sharing `aws.amazon.com/neuron` devices by allocating device cores (`aws.amazon.com/neuroncore`). Each Neuron core equals 1/2 of a Neuron device.
- **Neuron sharing**: HAMi supports sharing `aws.amazon.com/neuron` devices by allocating device cores (`aws.amazon.com/neuroncore`). Each Neuron core equals 1/2 of a Neuron device.

* **Topology awareness**: When allocating multiple aws-neuron devices in a container, HAMi ensures these devices are connected to minimize the communication cost between neuron devices. For details about how these devices are connected, refer to [Container Device Allocation On Different Instance Types](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#container-device-allocation-on-different-instance-types).
- **Topology awareness**: When allocating multiple aws-neuron devices in a container, HAMi ensures these devices are connected to minimize the communication cost between neuron devices. For details about how these devices are connected, refer to [Container Device Allocation On Different Instance Types](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#container-device-allocation-on-different-instance-types).

## Prerequisites

* Neuron-device-plugin
* EC2 instance of type `Inf` or `Trn`
- Neuron-device-plugin
- EC2 instance of type `Inf` or `Trn`

## Enabling Neuron-sharing Support

* Deploy neuron-device-plugin on EC2 neuron nodes according to the AWS documentation: [Neuron Device Plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#neuron-device-plugin)
- Deploy neuron-device-plugin on EC2 neuron nodes according to the AWS documentation: [Neuron Device Plugin](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#neuron-device-plugin)

* Deploy HAMi
- Deploy HAMi

```bash
helm install hami hami-charts/hami -n kube-system
@@ -33,10 +33,10 @@ HAMi divides each AWS Neuron device into 2 units for resource allocation. You ca

### Neuron Allocation

* Each unit of `aws.amazon.com/neuroncore` represents 1/2 of a Neuron device.
* Do not request `aws.amazon.com/neuron` directly as you would for other devices; requesting `aws.amazon.com/neuroncore` is sufficient.
* When the number of `aws.amazon.com/neuroncore` is >= 2, it is equivalent to setting `aws.amazon.com/neuron=1/2 * neuronCoreNumber`.
* The topology awareness scheduling is automatically enabled when tasks require multiple neuron devices.
- Each unit of `aws.amazon.com/neuroncore` represents 1/2 of a Neuron device.
- Do not request `aws.amazon.com/neuron` directly as you would for other devices; requesting `aws.amazon.com/neuroncore` is sufficient.
- When the number of `aws.amazon.com/neuroncore` is >= 2, it is equivalent to setting `aws.amazon.com/neuron=1/2 * neuronCoreNumber`.
- The topology awareness scheduling is automatically enabled when tasks require multiple neuron devices.
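
The core-to-device arithmetic above can be made concrete with a short sketch. The function name `neuron_devices_for_cores` is hypothetical; it only applies the 2-cores-per-device ratio stated in the list and says nothing about how HAMi actually places an odd number of cores.

```bash
#!/usr/bin/env bash
# Hypothetical helper: express a neuroncore request as whole devices plus
# leftover cores, using the 2-cores-per-device ratio stated above.
neuron_devices_for_cores() {
  local cores="$1"
  printf '%d device(s) + %d core(s)\n' "$((cores / 2))" "$((cores % 2))"
}
```

For example, a request for 4 cores corresponds to 2 whole devices, while a request for 1 core is half of one device.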

## Running Neuron jobs

22 changes: 11 additions & 11 deletions docs/userguide/cambricon-device/enable-cambricon-mlu-sharing.md
@@ -4,22 +4,22 @@ title: Enable Cambricon MLU Sharing

**HAMi now supports `cambricon.com/mlu` by implementing most device-sharing features similar to NVIDIA GPUs**, including:

* **MLU Sharing**: Tasks can request a fraction of an MLU instead of an entire MLU card.
- **MLU Sharing**: Tasks can request a fraction of an MLU instead of an entire MLU card.
This enables multiple tasks to share the same MLU device.

* **Device Memory Control**: You can allocate MLUs with a specified memory size, with
- **Device Memory Control**: You can allocate MLUs with a specified memory size, with
guaranteed enforcement to ensure usage does not exceed the requested limit.

* **Device Core Control**: MLUs can be assigned a specific number of compute cores,
- **Device Core Control**: MLUs can be assigned a specific number of compute cores,
and enforcement ensures core usage remains within bounds.

* **MLU Type Selection**: You can use annotations to specify which MLU types a task *must use* or
- **MLU Type Selection**: You can use annotations to specify which MLU types a task *must use* or
*must avoid* by setting `cambricon.com/use-mlutype` or `cambricon.com/nouse-mlutype`.

## Prerequisites

* neuware-mlu370-driver > 5.10
* cntoolkit > 2.5.3
- neuware-mlu370-driver > 5.10
- cntoolkit > 2.5.3

## Enabling MLU Sharing

@@ -40,8 +40,8 @@ title: Enable Cambricon MLU Sharing

Get the `cambricon-device-plugin` from your device provider, and configure it with the following parameters:

* `mode=dynamic-smlu`: Enables dynamic SMLU support.
* `min-dsmlu-unit=256`: Sets the minimum allocatable memory unit to 256 MB.
- `mode=dynamic-smlu`: Enables dynamic SMLU support.
- `min-dsmlu-unit=256`: Sets the minimum allocatable memory unit to 256 MB.

Refer to your provider’s documentation for additional details.

@@ -55,9 +55,9 @@ title: Enable Cambricon MLU Sharing

To request shared MLU resources in a container, use the following resource types:

* `cambricon.com/vmlu`
* `cambricon.com/mlu.smlu.vmemory`
* `cambricon.com/mlu.smlu.vcore`
- `cambricon.com/vmlu`
- `cambricon.com/mlu.smlu.vmemory`
- `cambricon.com/mlu.smlu.vcore`

Here is a YAML example:

18 changes: 9 additions & 9 deletions docs/userguide/configure.md
@@ -47,15 +47,15 @@ HAMi allows configuring per-node behavior for device plugin. Edit the ConfigMap:
```sh
kubectl -n <namespace> edit cm hami-device-plugin
```
* `name`: Name of the node.
* `operatingmode`: Operating mode of the node, can be "hami-core" or "mig", default: "hami-core".
* `devicememoryscaling`: Overcommit ratio of device memory.
* `devicecorescaling`: Overcommit ratio of device core.
* `devicesplitcount`: Allowed number of tasks sharing a device.
* `filterdevices`: Devices that are not registered to HAMi.
* `uuid`: UUIDs of devices to ignore.
* `index`: Indexes of devices to ignore.
* A device is ignored by HAMi if it is in the `uuid` or `index` list.
- `name`: Name of the node.
- `operatingmode`: Operating mode of the node, can be "hami-core" or "mig", default: "hami-core".
- `devicememoryscaling`: Overcommit ratio of device memory.
- `devicecorescaling`: Overcommit ratio of device core.
- `devicesplitcount`: Allowed number of tasks sharing a device.
- `filterdevices`: Devices that are not registered to HAMi.
- `uuid`: UUIDs of devices to ignore.
- `index`: Indexes of devices to ignore.
- A device is ignored by HAMi if it is in the `uuid` or `index` list.
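
To show how the fields listed above fit together, the sketch below prints one illustrative per-node entry. The field names come from the list; the enclosing `nodeconfig` array, the node name `gpu-node-1`, and every value are assumptions for illustration, so compare against the ConfigMap in your own cluster before copying anything.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print an illustrative per-node entry.
# Field names come from the list above; the "nodeconfig" wrapper and all
# values are assumptions, not settings taken from a real cluster.
print_example_nodeconfig() {
  cat <<'EOF'
{
  "nodeconfig": [
    {
      "name": "gpu-node-1",
      "operatingmode": "hami-core",
      "devicememoryscaling": 1.0,
      "devicecorescaling": 1.0,
      "devicesplitcount": 10,
      "filterdevices": {
        "uuid": [],
        "index": [0]
      }
    }
  ]
}
EOF
}
```

In this sketch the device at index 0 on `gpu-node-1` would be ignored, and each remaining device could be shared by up to 10 tasks.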

## Chart Configs: arguments

20 changes: 10 additions & 10 deletions docs/userguide/enflame-device/enable-enflame-gcu-sharing.md
@@ -17,14 +17,14 @@ title: Enable Enflame GCU Sharing

## Prerequisites

* Enflame gcushare-device-plugin >= 2.1.6 (please consult your device provider, gcushare has two components: gcushare-scheduler-plugin and gcushare-device-plugin; only gcushare-device-plugin is needed here)
* driver version >= 1.2.3.14
* kubernetes >= 1.24
* enflame-container-toolkit >= 2.0.50
- Enflame gcushare-device-plugin >= 2.1.6 (please consult your device provider, gcushare has two components: gcushare-scheduler-plugin and gcushare-device-plugin; only gcushare-device-plugin is needed here)
- driver version >= 1.2.3.14
- kubernetes >= 1.24
- enflame-container-toolkit >= 2.0.50

## Enabling GCU-sharing Support

* Deploy gcushare-device-plugin on Enflame nodes (please consult your device provider to obtain the package and documentation)
- Deploy gcushare-device-plugin on Enflame nodes (please consult your device provider to obtain the package and documentation)

:::caution
Install only `gcushare-device-plugin`. Do not install the `gcushare-scheduler-plugin` package.
@@ -39,7 +39,7 @@ The default resource names are:
You can customize these names by modifying the `hami-scheduler-device` ConfigMap.
:::

* Set `devices.enflame.enabled=true` when deploying HAMi
- Set `devices.enflame.enabled=true` when deploying HAMi

```bash
helm install hami hami-charts/hami --set devices.enflame.enabled=true -n kube-system
@@ -51,10 +51,10 @@ HAMi divides each Enflame GCU into 100 units for resource allocation. When you r

### GCU Slice Allocation

* Each unit of `enflame.com/vgcu-percentage` represents 1% device memory and 1% core
* If you don't specify a memory request, the system will default to using 100% of the available memory
* Memory allocation is enforced with hard limits to ensure tasks don't exceed their allocated memory
* Core allocation is enforced with hard limits to ensure tasks don't exceed their allocated cores
- Each unit of `enflame.com/vgcu-percentage` represents 1% device memory and 1% core
- If you don't specify a memory request, the system will default to using 100% of the available memory
- Memory allocation is enforced with hard limits to ensure tasks don't exceed their allocated memory
- Core allocation is enforced with hard limits to ensure tasks don't exceed their allocated cores

## Running Enflame jobs

6 changes: 3 additions & 3 deletions docs/userguide/hygon-device/enable-hygon-dcu-sharing.md
@@ -16,12 +16,12 @@ title: Enable Hygon DCU sharing

## Prerequisites

* dtk driver >= 24.04
* hy-smi v1.6.0
- dtk driver >= 24.04
- hy-smi v1.6.0

## Enabling DCU-sharing Support

* Deploy the [dcu-vgpu-device-plugin](https://github.com/Project-HAMi/dcu-vgpu-device-plugin)
- Deploy the [dcu-vgpu-device-plugin](https://github.com/Project-HAMi/dcu-vgpu-device-plugin)

## Running DCU jobs

10 changes: 5 additions & 5 deletions docs/userguide/kunlunxin-device/enable-kunlunxin-schedule.md
@@ -26,15 +26,15 @@ of the requested resources on the selected node, following these rules:

## Prerequisites

* Kunlunxin driver >= v5.0.21
* Kubernetes >= v1.23
* kunlunxin k8s-device-plugin
- Kunlunxin driver >= v5.0.21
- Kubernetes >= v1.23
- kunlunxin k8s-device-plugin

## Enabling Topology-Aware Scheduling

* Deploy the Kunlunxin device plugin on P800 nodes.
- Deploy the Kunlunxin device plugin on P800 nodes.
(Please contact your device vendor to obtain the appropriate package and documentation.)
* Deploy HAMi according to the instructions in `README.md`.
- Deploy HAMi according to the instructions in `README.md`.

## Running Kunlunxin Jobs

12 changes: 6 additions & 6 deletions docs/userguide/kunlunxin-device/enable-kunlunxin-vxpu.md
@@ -14,13 +14,13 @@ This component supports multiplexing Kunlunxin XPU devices (P800-OAM) and provid

## Prerequisites

* driver version >= 5.0.21.16
* xpu-container-toolkit >= xpu_container_1.0.2-1
* XPU device type: P800-OAM
- driver version >= 5.0.21.16
- xpu-container-toolkit >= xpu_container_1.0.2-1
- XPU device type: P800-OAM

## Enable XPU-sharing Support

* Deploy [vxpu-device-plugin]
- Deploy [vxpu-device-plugin]

```yaml
apiVersion: rbac.authorization.k8s.io/v1
@@ -125,8 +125,8 @@ spec:
:::note
Default resource names are as follows:

* `kunlunxin.com/vxpu` for VXPU count
* `kunlunxin.com/vxpu-memory` for memory allocation
- `kunlunxin.com/vxpu` for VXPU count
- `kunlunxin.com/vxpu-memory` for memory allocation

You can customize these names using the parameters above.
:::
8 changes: 4 additions & 4 deletions docs/userguide/mthreads-device/enable-mthreads-gpu-sharing.md
@@ -24,20 +24,20 @@ title: Enable Mthreads GPU sharing

## Prerequisites

* [MT CloudNative Toolkits > 1.9.0](https://docs.mthreads.com/cloud-native/cloud-native-doc-online/)
* driver version >= 1.2.0
- [MT CloudNative Toolkits > 1.9.0](https://docs.mthreads.com/cloud-native/cloud-native-doc-online/)
- driver version >= 1.2.0

## Enabling GPU-sharing Support

* Deploy MT-CloudNative Toolkit on Mthreads nodes (please consult your device provider to obtain the package and documentation)
- Deploy MT-CloudNative Toolkit on Mthreads nodes (please consult your device provider to obtain the package and documentation)

:::note

You can remove `mt-mutating-webhook` and `mt-gpu-scheduler` after installation (optional).

:::

* Set `devices.mthreads.enabled=true` when installing HAMi
- Set `devices.mthreads.enabled=true` when installing HAMi

```bash
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.mthreads.enabled=true -n kube-system
@@ -10,7 +10,7 @@ By installing hami-k8s-dra-driver, your cluster scheduler can discover NVIDIA GP

## Prerequisites

* The underlying container runtime (e.g., containerd or CRI-O) has [CDI](https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#how-to-configure-cdi) enabled
- The underlying container runtime (e.g., containerd or CRI-O) has [CDI](https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#how-to-configure-cdi) enabled

## Installation
