Hi,
I use DataFaker heavily in SeedStream, where I run multiple parallel threads to generate test data. I noticed significant CPU overhead and traced it back to Faker instantiation and per-instance YAML loading. I fixed most of it with a thread-local Faker cache:
private static final ThreadLocal<Map<Locale, Faker>> CACHE = ThreadLocal.withInitial(HashMap::new);

public static Faker getOrCreate(Locale locale, Random random) {
    // One Faker per (thread, locale); the Random is only used when the entry is first created.
    return CACHE.get().computeIfAbsent(locale, loc -> new Faker(loc, random));
}
Profiling again after this change, I can see that each Faker instance still builds its own FakeValuesService with its own fakeValuesInterfaceMap. With 8 threads all using the same Locale, the same YAML gets loaded 8 times into 8 separate maps. The same goes for expression templates: they are compiled and cached inside each FakeValuesService independently via EXPRESSION_2_SPLITTED and expression2generex.
Digging some more, I noticed that BaseFaker has a constructor:
public BaseFaker(FakeValuesService fakeValuesService, FakerContext context)
This makes me think that sharing a single FakeValuesService across multiple Faker instances is at least architecturally possible.
In the typical scenario I am working on, all threads use the same locale and differ only in their Random, so a shared, read-only FakeValuesService with a per-instance FakerContext could work.
At this point I have a few questions:
- Is the current design — one FakeValuesService per Faker — intentional for thread-safety or isolation reasons I might be missing?
- Are there concurrency issues in sharing a FakeValuesService across Faker instances with different Random instances?
- Would a factory method like Faker.withSharedService(FakeValuesService shared, Locale locale, Random random) be something you would consider? It would let callers manage a shared, pre-warmed service and pass it in, with each Faker only owning its FakerContext.
- Alternatively, would a FakeValuesServiceFactory.getShared(Locale) singleton pattern fit the library's design philosophy?
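To make the second option concrete, here is a minimal, JDK-only sketch of the per-locale sharing pattern I have in mind. SharedValues, LOADS, and name(...) are hypothetical stand-ins that I made up for illustration; this is not DataFaker code and says nothing about how FakeValuesService behaves internally. The idea is only that the expensive, read-only data is loaded once per Locale, while all mutable state (the Random) stays with the caller.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a read-only, locale-specific values service. */
final class SharedValues {
    // Counts how many times the expensive load actually ran.
    static final AtomicInteger LOADS = new AtomicInteger();

    // One shared instance per locale, visible to all threads.
    private static final Map<Locale, SharedValues> BY_LOCALE = new ConcurrentHashMap<>();

    // Immutable data, safe to read from any thread.
    private final List<String> names;

    private SharedValues(Locale locale) {
        LOADS.incrementAndGet(); // stands in for YAML parsing
        this.names = List.of("Ada", "Grace", "Linus");
    }

    /** Returns the single instance for a locale, loading it at most once. */
    static SharedValues getShared(Locale locale) {
        // ConcurrentHashMap.computeIfAbsent runs the loader at most once per key.
        return BY_LOCALE.computeIfAbsent(locale, SharedValues::new);
    }

    /** Reads shared data; the Random is supplied by the caller. */
    String name(Random random) {
        return names.get(random.nextInt(names.size()));
    }
}
```

Each thread would call SharedValues.getShared(locale) and pass its own Random, so no Random is shared across threads. Whether the real FakeValuesService can be used this way depends on whether its internal caches are safe for concurrent access, which is exactly what I am asking above.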
Happy to contribute if the approach sounds reasonable and there are no blockers I'm missing.
Thanks for any insight.