Commit 4731379

FEAT Integrate BD-LoRA into PEFT (#2895)

Implements BD-LoRA: Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving (https://openreview.net/forum?id=1cjLvtFOmL). With BD-LoRA, the LoRA weights are implemented in a block-diagonal way. This reduces communication overhead when using tensor parallelism (TP) and thus enables faster serving. There is an experimental vLLM PR to support this, but it is not merged (yet): vllm-project/vllm#28136.

1 parent 4d63474 commit 4731379

File tree

16 files changed: +907 −3 lines changed

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# BD-LoRA Finetuning
Block-Diagonal LoRA (BD-LoRA) is a LoRA variant in which some LoRA factors are constrained to be block-diagonal. This eliminates the communication overhead of running inference with tensor parallelism across multiple GPUs, enabling faster serving while matching the finetuning performance of vanilla LoRA.
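
To see why the block-diagonal constraint matters: when each tensor-parallel shard holds one diagonal block of a LoRA factor, it can compute its slice of the LoRA output entirely from local data. The following minimal numerical sketch illustrates this identity; the shapes and names are illustrative assumptions, not PEFT's actual implementation:

```python
# Minimal numerical sketch (not PEFT's implementation) of why a
# block-diagonal LoRA factor removes tensor-parallel communication.
import torch

d_in, d_out, r, tp = 16, 16, 4, 2   # hidden sizes, LoRA rank, TP degree
x = torch.randn(3, d_in)            # a batch of activations

# A stays dense; B is block-diagonal with one (r/tp, d_out/tp) block per shard.
A = torch.randn(d_in, r)
blocks = [torch.randn(r // tp, d_out // tp) for _ in range(tp)]
B = torch.block_diag(*blocks)       # (r, d_out), zero outside the blocks

# Reference: the unsharded LoRA computation.
y_full = x @ A @ B

# Sharded: shard i only needs its slice of A's columns and its own block of B,
# so its output slice requires no cross-GPU exchange for the LoRA path.
r_blk = r // tp
y_shards = [(x @ A[:, i * r_blk:(i + 1) * r_blk]) @ blocks[i] for i in range(tp)]
assert torch.allclose(y_full, torch.cat(y_shards, dim=-1), atol=1e-5)
```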
To get an overview of how to use BD-LoRA, see the Python notebook at `peft/examples/bdlora_finetuning/bdlora_peft_demo.ipynb`.
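
For orientation, a typical PEFT setup looks roughly like the sketch below. Note that the BD-LoRA-specific configuration option is hypothetical here (shown commented out), since the exact parameter name is defined by this PR, and the model name is just a placeholder; consult the notebook for the actual API:

```python
# General PEFT workflow; the BD-LoRA-specific knob below is a placeholder,
# not a confirmed parameter name -- consult the notebook for the real API.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],
    # Hypothetical BD-LoRA option: number of diagonal blocks, which should
    # match the tensor-parallel degree you plan to serve with.
    # num_blocks=2,
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```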
To benefit from inference speed-ups, you need an inference engine that is compatible with BD-LoRA. At the moment, there is an experimental PR at https://github.com/vllm-project/vllm/pull/28136 which allows you to use BD-LoRA in vLLM. If you find this work useful, consider leaving a comment there.
To install it, clone the fork at https://github.com/Conzel/vllm/tree/bdlora-bk, then install vLLM from source following the usual instructions: https://docs.vllm.ai/en/stable/getting_started/installation/. We assume a hardware setup with at least 2 available GPUs.
This example folder contains 3 files:
- `bdlora_peft_demo.ipynb`: showcases how to instantiate a BD-LoRA model, train it, and save/reload the weights.
- `vllm_server.bash`: spins up a BD-LoRA compatible vLLM server. To use it, you need to run the notebook once to create adapters in the correct format.
- `chat.py`: queries the vLLM server after it has finished booting up. Usage example: `python3 chat.py --target lora1`. A sketch of such a query follows below.
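
As a rough sketch of what such a query looks like: the vLLM server exposes an OpenAI-compatible API in which a served LoRA adapter is selected via the model name. The port and the adapter name `lora1` below are assumptions carried over from the usage example; adjust them to match `vllm_server.bash`:

```python
# Hedged sketch of querying the vLLM OpenAI-compatible server; the port and
# adapter name are assumptions -- see vllm_server.bash for the actual values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="lora1",  # the served BD-LoRA adapter to target
    messages=[{"role": "user", "content": "What is BD-LoRA?"}],
)
print(response.choices[0].message.content)
```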
(Two binary image files added: 306 KB and 343 KB.)
