perf: improve wildcard query perf with predicate and contains-check pushdown #397

Open
cheb0 wants to merge 3 commits into main from 0-wildcard-predicate-pushdown
Conversation

@cheb0
Collaborator

@cheb0 cheb0 commented Apr 3, 2026

Description

Currently, wildcard queries spend only a fraction of their time in bytes.Index itself; the rest is overhead around it. This PR partially addresses that.

This PR pushes pattern.Searcher down to the Block level, so that a Block can stream its tokens through the searcher. For ordinary wildcards like *error* there is a direct FindContains method, which is even faster.

For example, query message:*foobarf*:
main: 86 ms
using FindToken: 50 ms
using FindContains: 37 ms

So FindContains simply drops the costly abstractions to gain extra performance. We could also provide other dedicated functions, such as FindSuffix. This is a typical case where performance requires additional code.
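
To illustrate the difference between the two paths, here is a minimal sketch. It is not the PR's code: the `block` type, the `searcher` interface, `containsSearcher`, `newBlock`, and the inclusive `[from, to]` range are all assumptions made for the example, standing in for `Block` and `pattern.Searcher`.

```go
package main

import (
	"bytes"
	"fmt"
)

// block is a hypothetical stand-in for a token block: one payload plus
// offsets marking where each token starts (the last offset is len(payload)).
type block struct {
	payload []byte
	offsets []int
}

// token returns the i-th token as a sub-slice of the payload (no copy).
func (b *block) token(i int) []byte {
	return b.payload[b.offsets[i]:b.offsets[i+1]]
}

// searcher is an assumed generic predicate, standing in for pattern.Searcher.
type searcher interface {
	Match(token []byte) bool
}

// containsSearcher implements searcher for a single "*needle*" pattern.
type containsSearcher struct{ needle []byte }

func (s containsSearcher) Match(token []byte) bool {
	return bytes.Contains(token, s.needle)
}

// findToken streams tokens from..to (inclusive) through a generic searcher,
// paying one interface call per token.
func (b *block) findToken(from, to int, s searcher) []int {
	var indexes []int
	for i := from; i <= to; i++ {
		if s.Match(b.token(i)) {
			indexes = append(indexes, i)
		}
	}
	return indexes
}

// findContains is the specialized path: it calls bytes.Contains directly,
// with no interface dispatch in the loop.
func (b *block) findContains(from, to int, needle []byte) []int {
	var indexes []int
	for i := from; i <= to; i++ {
		if bytes.Contains(b.token(i), needle) {
			indexes = append(indexes, i)
		}
	}
	return indexes
}

// newBlock packs tokens back to back and records their offsets.
func newBlock(tokens ...string) *block {
	b := &block{offsets: []int{0}}
	for _, t := range tokens {
		b.payload = append(b.payload, t...)
		b.offsets = append(b.offsets, len(b.payload))
	}
	return b
}

func main() {
	b := newBlock("error: timeout", "ok", "fatal error")
	needle := []byte("error")
	fmt.Println(b.findToken(0, 2, containsSearcher{needle})) // [0 2]
	fmt.Println(b.findContains(0, 2, needle))                // [0 2]
}
```

Both paths return the same token indexes; the second one only removes the indirection, which is the trade-off described above.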

| Query | Type | Ids | cold, ms | hot, ms | cold (branch), ms | hot (branch), ms | cold diff | hot diff |
|---|---|---|---|---|---|---|---|---|
| trace_id:*foobar | reg | 0 | 18.76 | 4.37 | 16.14 | 1.84 | -14% | -57.9% |
| k8s_pod:*6 | reg | 100 | 13.3 | 0.67 | 13.03 | 0.47 | -2% | -29.9% |
| message:*err* | reg | 100 | 138.72 | 26.97 | 120.27 | 12.36 | -13.3% | -54.2% |
| message:*foo* | reg | 100 | 77.69 | 27.08 | 60.54 | 11.84 | -22.1% | -56.3% |
| message:*request* | reg | 100 | 124.95 | 25.45 | 104.13 | 10.37 | -16.7% | -59.3% |
| message:*foobar*foobar* | reg | 0 | 187.54 | 64.25 | 147.31 | 30.5 | -21.5% | -52.5% |
| message:*foobarfoobar* | reg | 0 | 184.93 | 63.87 | 121.51 | 20.39 | -34.3% | -68.1% |
| message:*very_very_message_aggregator_events* | reg | 0 | 173.45 | 51.62 | 116.9 | 12.81 | -32.6% | -75.2% |

Next steps:

  • try calling bytes.Index over the whole Block payload (this already shows good results)
  • build Offsets lazily, once the previous item is done
  • modernize token Block, boost Unpack speed

  • I have read and followed all requirements in CONTRIBUTING.md;
  • I used LLM/AI assistance to make this pull request;

@codecov-commenter

codecov-commenter commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 88.00000% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.70%. Comparing base (5115f7b) to head (c1219b9).
⚠️ Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| frac/active_token_list.go | 78.94% | 2 Missing and 2 partials ⚠️ |
| frac/sealed/token/provider.go | 87.87% | 2 Missing and 2 partials ⚠️ |
| frac/sealed/token/block_loader.go | 86.66% | 1 Missing and 1 partial ⚠️ |
| pattern/pattern.go | 91.30% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #397      +/-   ##
==========================================
- Coverage   71.28%   70.70%   -0.58%     
==========================================
  Files         210      219       +9     
  Lines       15579    17088    +1509     
==========================================
+ Hits        11105    12082     +977     
- Misses       3673     4103     +430     
- Partials      801      903     +102     

Comment thread frac/sealed/token/block_loader.go Outdated
return b.Payload[offset : offset+l]
}

func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) {
Member

We've discussed that you can perform bytes.Contains on the block payload before checking each token individually. Have you measured the performance of such an optimization?

Collaborator Author

> We've discussed that you can perform bytes.Contains on the block payload before checking each token individually.

Yes, I tried calling bytes.Index on the entire payload. It improves performance even further compared to this PR:
message:foobar
35 ms => 9 ms

However, this means that when bytes.Index returns a valid position, we need to do a binary search on Offsets to find the token index and then check for a false positive. It also comes with the neat property that we can call Unpack (which builds the offsets) lazily, which boosts cold query performance (by roughly an extra 20%).

I put a task in the backlog; I decided that it's too much for a single PR.
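
A rough sketch of that backlog idea, under stated assumptions: tokens are assumed to be stored back to back in the payload, `offsets[i]` is assumed to be the start of token `i` with a final sentinel equal to `len(payload)`, and `findContainsInPayload` is a hypothetical name. The real block layout may differ (for example, if lengths are interleaved with tokens), so treat this as an illustration only.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// findContainsInPayload scans the whole payload with bytes.Index and maps
// every hit back to a token index via binary search over offsets.
// Assumed layout: offsets[i] is the start of token i, and the last element
// of offsets equals len(payload).
func findContainsInPayload(payload []byte, offsets []int, needle []byte) []int {
	var indexes []int
	if len(needle) == 0 {
		return indexes // an empty needle is treated as "no result" in this sketch
	}
	pos := 0
	for pos < len(payload) {
		hit := bytes.Index(payload[pos:], needle)
		if hit < 0 {
			break
		}
		hit += pos
		// Binary search: the last token whose start offset is <= hit.
		i := sort.SearchInts(offsets, hit+1) - 1
		end := offsets[i+1]
		if hit+len(needle) <= end {
			// The match lies entirely inside token i.
			indexes = append(indexes, i)
			pos = end // skip the rest of this token
		} else {
			// False positive: the match straddles a token boundary.
			pos = hit + 1
		}
	}
	return indexes
}

func main() {
	// Tokens "foo", "barfoo", "ba", "rx" stored back to back.
	payload := []byte("foobarfoobarx")
	offsets := []int{0, 3, 9, 11, 13}
	fmt.Println(findContainsInPayload(payload, offsets, []byte("bar"))) // [1]
	fmt.Println(findContainsInPayload(payload, offsets, []byte("foo"))) // [0 1]
}
```

In the first call, the second occurrence of "bar" spans tokens "ba" and "rx", so it is rejected as a false positive. Note that offsets are only needed once bytes.Index reports a hit, which is what would allow building them lazily.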

Comment thread frac/sealed/token/block_loader.go Outdated
}

func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) {
indices := make([]int, 0)
Member

@dkharms dkharms Apr 3, 2026

I guess you could also pass a slice of needles here to handle queries like message:*foo*bar* with multiple needles. Or is there something that blocks such an improvement?
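
One possible shape for the multi-needle variant, as a sketch only: `containsInOrder` and `findContainsAll` are hypothetical helpers, and tokens are passed as a plain slice rather than through `Block`. A pattern like *foo*bar* requires the needles to appear in order and without overlapping, which a greedy leftmost scan checks correctly.

```go
package main

import (
	"bytes"
	"fmt"
)

// containsInOrder reports whether token contains every needle, in the given
// order and without overlap. Taking the leftmost match for each needle leaves
// the longest possible remainder for the following needles.
func containsInOrder(token []byte, needles [][]byte) bool {
	rest := token
	for _, n := range needles {
		i := bytes.Index(rest, n)
		if i < 0 {
			return false
		}
		rest = rest[i+len(n):]
	}
	return true
}

// findContainsAll returns the indexes of tokens that match all needles in order.
func findContainsAll(tokens [][]byte, needles [][]byte) []int {
	var indexes []int
	for i, t := range tokens {
		if containsInOrder(t, needles) {
			indexes = append(indexes, i)
		}
	}
	return indexes
}

func main() {
	tokens := [][]byte{
		[]byte("xx foo yy bar"),
		[]byte("bar then foo"), // wrong order, must not match
		[]byte("foobar"),
	}
	needles := [][]byte{[]byte("foo"), []byte("bar")}
	fmt.Println(findContainsAll(tokens, needles)) // [0 2]
}
```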

Collaborator Author

No, I think it's doable. I may do it.

Collaborator Author

Update: I will do this in a separate PR.

@eguguchkin eguguchkin self-requested a review April 6, 2026 10:20
@eguguchkin eguguchkin modified the milestones: v0.72.0, v0.73.0 Apr 13, 2026
@cheb0 cheb0 added the performance Features or improvements that positively affect seq-db performance label May 12, 2026
@eguguchkin eguguchkin modified the milestones: v0.73.0, v0.72.0 May 18, 2026
Comment thread frac/sealed/token/block_loader.go Outdated
Comment thread pattern/pattern.go

type tokenProvider interface {
GetToken(uint32) []byte
FindContains(firstTID uint32, lastTID uint32, needle []byte) ([]uint32, error)
Member

Why did you decide to make firstTID and lastTID part of the API?

Seems like for this specific case (e.g. query foo:'*bar*') we cannot narrow the TID search boundaries.

And now we always pass the first and last TID in this method:
https://github.com/ozontech/seq-db/blob/0-wildcard-predicate-pushdown/pattern/pattern.go#L411

Collaborator Author

For foo:'*bar*' we narrow the search down to all tokens of the foo field, i.e. foo:? tokens. If there are only 50 such tokens, we will check just a single block, and firstTID might be something like 1000 and lastTID 1050.
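
To make the narrowing concrete, here is a small sketch under assumptions: `tableEntry`, its `startTID`/`count` fields, and the `narrow` method are hypothetical stand-ins, and a block is assumed to hold tokens with consecutive TIDs. It clamps a TID range to one block and converts it to block-local indexes.

```go
package main

import "fmt"

// tableEntry is a hypothetical stand-in for a block descriptor: the block
// holds tokens with consecutive TIDs startTID .. startTID+count-1.
type tableEntry struct {
	startTID uint32
	count    uint32
}

// lastTID returns the TID of the last token in the block (count must be > 0).
func (e tableEntry) lastTID() uint32 { return e.startTID + e.count - 1 }

// narrow clamps [firstTID, lastTID] to this entry and converts the result to
// inclusive block-local indexes. ok is false when the ranges do not overlap.
func (e tableEntry) narrow(firstTID, lastTID uint32) (from, to int, ok bool) {
	if e.count == 0 || firstTID > lastTID || lastTID < e.startTID || firstTID > e.lastTID() {
		return 0, 0, false
	}
	lo, hi := firstTID, lastTID
	if lo < e.startTID {
		lo = e.startTID
	}
	if hi > e.lastTID() {
		hi = e.lastTID()
	}
	return int(lo - e.startTID), int(hi - e.startTID), true
}

func main() {
	// The foo field owns TIDs 1000..1050; this block holds TIDs 900..1099.
	e := tableEntry{startTID: 900, count: 200}
	from, to, ok := e.narrow(1000, 1050)
	fmt.Println(from, to, ok) // 100 150 true
}
```

With this shape, only tokens 100..150 of the block are checked, rather than all 200.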

Member

@dkharms dkharms May 22, 2026

Yeah, I see. But for the specific case of FindContains(), the arguments firstTID and lastTID should not be part of the API. Here, take a look: we already have all the necessary information in Provider:

type Provider struct {
	...
	// NOTE: entries have already been narrowed.
	entries  []*TableEntry
	...
}

func (tp *Provider) FirstTID() uint32 {
	return tp.entries[0].StartTID
}

func (tp *Provider) LastTID() uint32 {
	return tp.entries[len(tp.entries)-1].getLastTID()
}

func (tp *Provider) FindContains(needle []byte) ([]uint32, error) {
	return tp.findInBlocks(tp.FirstTID(), tp.LastTID(), func(b *Block, firstIndex, lastIndex int) ([]int, error) {
		return b.contains(firstIndex, lastIndex, needle)
	})
}

Comment thread frac/sealed/token/provider.go
Comment thread frac/sealed/token/provider.go Outdated
Comment thread frac/sealed/token/provider.go Outdated
Comment thread pattern/pattern.go
if err != nil {
return nil, err
}
func isSimpleWildcardContains(token parser.Token) (needle []byte, ok bool) {
Collaborator

Please add a comment with an example of an expression that fits/matches this check
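
As an illustration of the kind of examples such a comment could list, here is a string-based sketch. It is not the PR's function: the real `isSimpleWildcardContains` takes a `parser.Token`, whereas `simpleContainsNeedle` below is a hypothetical helper that assumes the pattern is a plain string, that `*` is the only wildcard, and that escaping does not need to be handled.

```go
package main

import (
	"fmt"
	"strings"
)

// simpleContainsNeedle reports whether pattern has the form *needle*:
// exactly one leading and one trailing asterisk around a non-empty literal.
//
//	*error*    -> "error", true
//	*foo*bar*  -> "", false (two needles)
//	error*     -> "", false (prefix pattern)
//	*error     -> "", false (suffix pattern)
func simpleContainsNeedle(pattern string) (needle string, ok bool) {
	if len(pattern) < 3 || !strings.HasPrefix(pattern, "*") || !strings.HasSuffix(pattern, "*") {
		return "", false
	}
	inner := pattern[1 : len(pattern)-1]
	if strings.Contains(inner, "*") {
		return "", false
	}
	return inner, true
}

func main() {
	for _, p := range []string{"*error*", "*foo*bar*", "error*", "*error"} {
		needle, ok := simpleContainsNeedle(p)
		fmt.Printf("%q -> %q, %v\n", p, needle, ok)
	}
}
```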

Comment thread frac/sealed/token/provider.go Outdated
for _, entry := range entries {
block := tp.findBlock(entry.BlockIndex)
firstIndex, lastIndex := tp.narrowTIDs(entry, firstTID, lastTID)
indices, err := search(block, firstIndex, lastIndex)
Collaborator

nit: better to use 'indexes' in this context

| Option | Context of use | Examples |
|---|---|---|
| indexes | Recommended variant for databases, books, and general context. | database indexes, book indexes, market indexes |
| indices | Preferred in mathematics, science, and engineering. | array indices, price indices, mathematical indices |

Comment thread frac/sealed/token/provider.go Outdated
return tids, nil
}

func (tp *Provider) narrowTIDs(entry *TableEntry, firstTID, fromTID uint32) (int, int) {
Collaborator

This is not a token provider method; tp is not used in this function at all. Rather, this is a method for TableEntry

Collaborator Author

moved to TableEntry

Labels

performance Features or improvements that positively affect seq-db performance

4 participants