+            <li class="md-nav__item">
+              <a href="../../docs/reference/cli/dstack/offer/" class="md-nav__link">
+                <span class="md-ellipsis">
+                  dstack offer
+                </span>
+              </a>
+            </li>
             <li class="md-nav__item">
               <a href="../../docs/reference/cli/dstack/volume/" class="md-nav__link">
                 <span class="md-ellipsis">
                   <span class="md-typeset">
-                    vRAM consumption
+                    VRAM consumption
                   </span>
                 </span>
@@ -3912,8 +3940,8 @@ <h3 id="tokensec-and-ttft-per-rps">Token/sec and TTFT per RPS<a class="headerlin
 performance improved notably when the number of requests was below 900.</p>
 </blockquote>
 <p><img src="https://raw.githubusercontent.com/dstackai/benchmarks/refs/heads/main/amd/inference/charts_rps/mean_ttft_tgi_vllm.png" width="725" style="padding: 0 40px 0 50px"/></p>
-<h3 id="vram-consumption">vRAM consumption<a class="headerlink" href="#vram-consumption" title="Permanent link">¶</a></h3>
-<p>When considering vRAM consumption right after loading model weights, TGI allocates approximately 28% less vRAM compared
+<h3 id="vram-consumption">VRAM consumption<a class="headerlink" href="#vram-consumption" title="Permanent link">¶</a></h3>
+<p>When considering VRAM consumption right after loading model weights, TGI allocates approximately 28% less VRAM compared
 to vLLM.</p>
 <p><img src="https://raw.githubusercontent.com/dstackai/benchmarks/refs/heads/main/amd/inference/gpu_vram_tgi_vllm.png" width="750" /></p>
 <p>This difference may be related to how vLLM <a href="https://docs.vllm.ai/en/latest/models/performance.html" target="_blank">pre-allocates GPU cache <span class="twemoji external"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m11.93 5 2.83 2.83L5 17.59 6.42 19l9.76-9.75L19 12.07V5z"/></svg></span></a>.</p>
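Context for the paragraph above: vLLM's higher footprint right after load comes from it reserving a block of VRAM for its KV cache at startup. As a hedged illustration only (not part of this commit or of the benchmark configuration), vLLM exposes a gpu_memory_utilization parameter that bounds this pre-allocation; the model name and the 0.8 value below are assumptions chosen for the example.

    # Minimal sketch: capping how much VRAM vLLM pre-allocates for its KV cache.
    # gpu_memory_utilization defaults to 0.9; the model and the 0.8 value are
    # illustrative assumptions, not the setup used in the benchmark.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # example model only
        gpu_memory_utilization=0.8,                # fraction of GPU memory vLLM may reserve
    )

    outputs = llm.generate(
        ["Why does vLLM pre-allocate GPU memory?"],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)

Lowering this fraction reduces the VRAM footprint measured right after load, at the cost of a smaller KV cache and therefore fewer concurrent sequences.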
@@ -3931,7 +3959,7 @@ <h2 id="conclusion">Conclusion<a class="headerlink" href="#conclusion" title="Pe
 <li>With vLLM, we used the default backend configuration. With better tuning, we might have achieved improved performance.</li>
 </ul>
 </div>
-<p>In general, the 8x AMD MI300X is a good fit for larger models and allows us to make the most of its vRAM, especially for
+<p>In general, the 8x AMD MI300X is a good fit for larger models and allows us to make the most of its VRAM, especially for
 larger batches.</p>
 <p>If you’d like to support us in doing more benchmarks, please let us know.</p>
 <h2 id="whats-next">What's next?<a class="headerlink" href="#whats-next" title="Permanent link">¶</a></h2>