Enable OpenVINO model caching #118

Open
solidDoWant wants to merge 7 commits into SearchSavior:main from solidDoWant:feat/enable-ov-model-caching-1

Conversation

@solidDoWant
Contributor

When the OPENARC_OV_CACHE_DIR environment variable is set, compiled models are now cached and restored from the cache. This results in an enormous decrease in peak memory requirements after the first startup.
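
As a rough illustration of the behavior described above (not the code in this PR), the sketch below turns the environment variable into an OpenVINO property dict. The helper name `cache_properties_from_env` is hypothetical; `CACHE_DIR` is the OpenVINO property that enables model caching.

```python
import os

ENV_VAR = "OPENARC_OV_CACHE_DIR"  # variable name taken from the PR description


def cache_properties_from_env(environ=None):
    """Return OpenVINO properties that enable model caching, or {} if unset.

    Hypothetical helper: the returned dict is meant to be passed along with
    the other properties when a model is compiled.
    """
    env = os.environ if environ is None else environ
    cache_dir = env.get(ENV_VAR, "").strip()
    if not cache_dir:
        # Variable unset or blank: leave caching disabled.
        return {}
    return {"CACHE_DIR": cache_dir}
```

With the variable unset the helper returns an empty dict, so existing behavior stays unchanged.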

@SearchSavior
Owner

OpenVINO model caching affects all models, and handling it automatically is an overdue change. Good find.

I think the best way is to auto-inject "CACHE_DIR" into runtime_config somewhere in add.py. In qwen3_asr, the call to ov.Core uses the same properties API that runtime_config pokes, and setting this to the same directory as model_path should make it so compiled models are always saved and stored in an organized way. Since they are blobs, they don't contain any easy-to-understand metadata about their compile target, so hardcoding path behavior will be necessary. However, there are some edge cases.

  • Compiled models cannot be reused across devices. This feature caches a JIT-compiled model blob for the compile situation, so if a user changes the device for a given config, well, I'm not sure if OpenVINO handles this gracefully. You will need to test. Do this by watching what files are created when you load a model, then change the configs to load on CPU, etc.

  • Might need to add runtime_config to each call to ov.Core for all the submodels in each OpenVINO implementation. That code looks the same in all the implementations and should be under the load_model method in all cases, i.e. not such a massive task. Frankly, your other changes were harder to imagine (for me at least) 😄

  • The Optimum backend is out of scope for these changes; we can address it in a separate PR.
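
The auto-injection idea above could be sketched roughly as follows. This is only an illustration under assumptions: `inject_cache_dir` is a hypothetical helper, and runtime_config is assumed to be a plain dict of OpenVINO properties. A user-supplied "CACHE_DIR" is left alone so explicit configs always win.

```python
def inject_cache_dir(runtime_config, cache_dir):
    """Return a copy of runtime_config with CACHE_DIR filled in if missing.

    Hypothetical helper: the original dict is never mutated, and an explicit
    CACHE_DIR already present in runtime_config takes precedence.
    """
    merged = dict(runtime_config or {})
    if cache_dir:
        merged.setdefault("CACHE_DIR", str(cache_dir))
    return merged
```

The same merged dict could then be handed to every submodel compile call inside a load_model method, so all submodels share one cache location.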

LMK if you have any questions. Don't worry, most of the work for this one is testing to see what happens in all the cases for each model-type.

To be clear, like you found, we expect the first load to be long and subsequent loads to be fast. We also need to discover what happens when you play with settings but have not cleared the cached model files.

Also, I like the env approach. We need to start making openarc more container friendly. Can you add a toggle to shut the auto-cache behavior off (to save fighting configs if this ever breaks)?
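
One possible shape for such a toggle, as a sketch only: the variable name `OPENARC_OV_CACHE_ENABLED` is invented here for illustration and does not come from this PR, and `cache_enabled` is a hypothetical helper.

```python
import os

# Values treated as "off" for the hypothetical toggle variable.
_FALSEY = {"0", "false", "no", "off"}


def cache_enabled(environ=None):
    """Return True when a cache dir is set and the toggle is not switched off."""
    env = os.environ if environ is None else environ
    if not env.get("OPENARC_OV_CACHE_DIR", "").strip():
        # No cache directory configured: nothing to enable.
        return False
    # Hypothetical toggle; defaults to on when unset.
    flag = env.get("OPENARC_OV_CACHE_ENABLED", "1").strip().lower()
    return flag not in _FALSEY
```

Defaulting to "on" keeps the container-friendly behavior while giving users a single switch to turn caching off without removing the directory setting.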

Again, great find and thanks for the PR

@solidDoWant
Contributor Author

When you say CACHE_DIR should be auto-injected via the add command, do you mean that this should become a config property that is then passed to openvino at load time? Just trying to make sure that we're on the same page.

I'm hesitant to place the cache beside the model. This means that the model itself can no longer be stored on a read-only filesystem. Instead I'd propose this: allow the user to specify a cache directory, then load/store the compiled model in a model-specific subdirectory of this path. For example, if the user specifies /model-cache, place a compiled qwen3 model at /model-cache/qwen3. This keeps the cache file tree organized, while also allowing users to better manage how and where caches are stored. This would be particularly useful for sharing a cache directory between multiple openarc instances.
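
A minimal sketch of the proposed layout, assuming the model name is the only input used to pick the subdirectory. `model_cache_dir` and its sanitizing rule are hypothetical and are not taken from the PR.

```python
import re
from pathlib import Path


def model_cache_dir(cache_root, model_name):
    """Map a model name to its own subdirectory under the cache root.

    Hypothetical helper: characters outside [A-Za-z0-9._-] are replaced with
    underscores so that a name can never escape cache_root.
    """
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", model_name).strip("._")
    if not safe:
        raise ValueError(f"cannot derive a cache directory from {model_name!r}")
    return Path(cache_root) / safe


print(model_cache_dir("/model-cache", "qwen3").as_posix())  # /model-cache/qwen3
```

Because the cache root is separate from model_path, the model files themselves can stay on a read-only filesystem.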

compiled models cannot be reused across devices

I don't think that this is strictly true, rather, I think that compiled models are specific to (device, driver, library, model) tuples. If you're running multiple nodes with the same device and kernel modules and the same container image, a model compiled on one machine should work on any of the others. I'll do some further testing to verify. I can't easily change driver versions, but I can change the others.

@SearchSavior
Owner

When you say CACHE_DIR should be auto-injected via the add command, do you mean that this should become a config property that is then passed to openvino at load time?

Yes, exactly.

if the user specifies /model-cache, place a compiled qwen3 model at /model-cache/qwen3

OK, that framing of how to organize caching makes much more sense. In this case /model-cache/{model_name} would work nicely to keep things organized like you suggest.

If you're running multiple nodes with the same device and kernel modules and the same container image, a model compiled on one machine should work on any of the others.

During JIT creation of the model cache, hardware features are considered in ways that aren't obvious, which can change if your device supports different data types or instruction sets. In general there is a plugin system with its own compile behavior. Most concerning is this line from the docs:

If you're running multiple nodes with the same device and kernel modules and the same container image, a model compiled on one machine should work on any of the others

OK, so you are imagining a scenario where we compile once and reuse. I think that should work; however, changing the device setting for an already compiled cache will cause the runtime to build a new cache. From the above link:

"Cache files can be reused within the same Model Server version, target device, hardware, model and the model shape parameters. The Model Server, automatically detects if the cache is present and re-generates new cache files when required."

Unfortunately this assertion seems underspecified vs what I have observed in practice, so LMK what you find out. I will do some tests later tonight and report back.

@solidDoWant
Contributor Author

All this sounds great, thanks! FYI I have been building the cache on one machine and running on another (absolutely identical hardware and software, with basically every single dependency version pinned from the OS up) with great success. With this setup I can even update the cache while openarc is running in most cases, and then have a fast restart to load it. It should also support multiple openarc instances on multiple identical nodes, though I haven't rolled this out yet.

@solidDoWant solidDoWant force-pushed the feat/enable-ov-model-caching-1 branch from 38860c3 to 92e90bb Compare May 23, 2026 23:33
@solidDoWant
Contributor Author

Ran through a series of tests covering edge cases. Here's what I found:

  • The cache directory will never be pruned, and openvino provides no way to tell which files are stale and which are not. A hacky cleanup process could potentially be added, but this would tie cleanup to openvino internals. I'd recommend just documenting this limitation and leaving it to users to clean up.
  • Intel OneDNN caches OpenCL kernel binaries at this path as well. These are only compiled upon first inference. The cost of generating these is almost zero, and even with a read-only mount model loading still succeeds, so this probably isn't a concern.
  • Changing runtime_config does trigger a cache rebuild when parameters that affect the compiled model are changed. Changing PERFORMANCE_HINT: THROUGHPUT to PERFORMANCE_HINT: LATENCY (and others) triggers a rebuild, but NUM_STREAMS: 1 -> NUM_STREAMS: 2 does not.
  • Changing GPU causes a cache rebuild.
  • Changing CPU -> GPU or GPU -> CPU causes a cache rebuild.
  • Using a cache greatly reduces startup time after it is warmed. In testing, the Qwen3-ASR model loads about seven times faster when pre-compiled on my setup. My setup involves loading the cache files over a network (CephFS), so locally this should be reduced further.
  • If the cache disk is full, the model fails to load. This has been fixed ([core] Recover from model cache write failures openvinotoolkit/openvino#35132), but I think openarc needs a dep update to avoid this.
  • Multiple processes can read from the cache at the same time and see roughly the same start time improvement.
  • Multiple processes cannot write to the cache at the same time. Needs documentation but IMO not an issue.
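
Until a dependency update picks up the upstream fix, the full-disk failure in the list above could in principle be sidestepped by checking free space before enabling the cache. The sketch below is an assumption-laden illustration: `has_room_for_cache` and the threshold are hypothetical, and nothing like it is part of this PR.

```python
import shutil


def has_room_for_cache(cache_dir, min_free_bytes):
    """Return True if the filesystem holding cache_dir has enough free space.

    Hypothetical guard: a missing or unreadable directory counts as "no room",
    so the caller can fall back to loading without a cache.
    """
    try:
        return shutil.disk_usage(cache_dir).free >= min_free_bytes
    except OSError:
        return False
```

A caller might pass CACHE_DIR to OpenVINO only when this returns True. Note that the check is advisory: another process can still fill the disk between the check and the write.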

@SearchSavior I probably need a minor doc update, but do you see any other issues with this work given the above?

Signed-off-by: solidDoWant <fred.heinecke@yahoo.com>
Signed-off-by: solidDoWant <fred.heinecke@yahoo.com>
Signed-off-by: solidDoWant <fred.heinecke@yahoo.com>
Signed-off-by: solidDoWant <fred.heinecke@yahoo.com>
Signed-off-by: solidDoWant <fred.heinecke@yahoo.com>
Signed-off-by: solidDoWant <fred.heinecke@yahoo.com>
@solidDoWant solidDoWant force-pushed the feat/enable-ov-model-caching-1 branch from 412630a to bfb4e5b Compare May 25, 2026 21:10
Signed-off-by: solidDoWant <fred.heinecke@yahoo.com>
2 participants