perf: improve wildcard query perf with predicate and contains-check pushdown #397
perf: improve wildcard query perf with predicate and contains-check pushdown #397cheb0 wants to merge 3 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #397 +/- ##
==========================================
- Coverage 71.28% 70.70% -0.58%
==========================================
Files 210 219 +9
Lines 15579 17088 +1509
==========================================
+ Hits 11105 12082 +977
- Misses 3673 4103 +430
- Partials 801 903 +102 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| return b.Payload[offset : offset+l] | ||
| } | ||
|
|
||
| func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) { |
There was a problem hiding this comment.
We've discussed that you can perform bytes.Contains on the block payload before checking each token individually. Have you measured performance of such optimization?
There was a problem hiding this comment.
We've discussed that you can perform bytes.Contains on the block payload before checking each token individually.
Yes, I tried calling bytes.Index on entire payload. It boosts even further comparing to this PR:
message:foobar
35 ms => 9 ms
However, this means that when bytes.Index returns and if we have some proper index returned, then we need to do a bin search on Offsets to find an index and then check for false positive. It also comes with neat property that we can avoid call Unpack (build offsets) lazily which boosts cold query performance (somewhat around extra 20%).
I put a task to the backlog, decided that it's too much for a single PR.
| } | ||
|
|
||
| func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) { | ||
| indices := make([]int, 0) |
There was a problem hiding this comment.
I guess you could pass here slice of needles as well to handle queries like message:*foo*bar* with multiple needles. Or there is something that blocks such improvement?
There was a problem hiding this comment.
No, I think it's doable. Maybe will do
There was a problem hiding this comment.
upd: will do in a separate PR
|
|
||
| type tokenProvider interface { | ||
| GetToken(uint32) []byte | ||
| FindContains(firstTID uint32, lastTID uint32, needle []byte) ([]uint32, error) |
There was a problem hiding this comment.
Why did you decide to make firstTID and lastTID a part of an API?
Seems like for this specific case (e.g. query foo:'*bar*') we cannot narrow the TID search boundaries.
And now we always pass the first and last TID in this method:
https://github.com/ozontech/seq-db/blob/0-wildcard-predicate-pushdown/pattern/pattern.go#L411
There was a problem hiding this comment.
For foo:'*bar*' we narrow down to all tokens of foo field, i.e. foo:? tokens. If there are 50 such tokens only, we will check only a single block, and firstTID might be like 1000 and lastTID 1050
There was a problem hiding this comment.
Yeah, I see. But for specific case of FindContains() arguments firstLID, lastLID should not be a part of an API. Here, take a look, we already have all necessary information in Provider:
type Provider struct {
...
// NOTE: Entries already were narrowed.
entries []*TableEntry
...
}
func (tp *Provider) FirstTID() uint32 {
return tp.entries[0].StartTID
}
func (tp *Provider) LastTID() uint32 {
return tp.entries[len(tp.entries)-1].getLastTID()
}
func (tp *Provider) FindContains(needle []byte) ([]uint32, error) {
return tp.findInBlocks(tp.FirstTID(), tp.LastTID(), func(b *Block, firstIndex, lastIndex int) ([]int, error) {
return b.contains(firstIndex, lastIndex, needle)
})
}| if err != nil { | ||
| return nil, err | ||
| } | ||
| func isSimpleWildcardContains(token parser.Token) (needle []byte, ok bool) { |
There was a problem hiding this comment.
Please add a comment with an example of an expression that fits/matches this check
| for _, entry := range entries { | ||
| block := tp.findBlock(entry.BlockIndex) | ||
| firstIndex, lastIndex := tp.narrowTIDs(entry, firstTID, lastTID) | ||
| indices, err := search(block, firstIndex, lastIndex) |
There was a problem hiding this comment.
nit: better to use 'indexes' in this context
| Option | Context of use | Examples |
|---|---|---|
| indexes | Recommended variant for databases, books, and general context. | database indexes, book indexes, market indexes |
| indices | Preferred in mathematics, science, and engineering. | array indices, price indices, mathematical indices |
| return tids, nil | ||
| } | ||
|
|
||
| func (tp *Provider) narrowTIDs(entry *TableEntry, firstTID, fromTID uint32) (int, int) { |
There was a problem hiding this comment.
This is not a token provider method; tp is not used in this function at all. Rather, this is a method for TableEntry
Description
Currently we spend only a fraction of time calling
bytes.Index. This PR partially addresses that.This PR pushes
pattern.SearchertoBlocklevel, so thatBlockis able to stream tokens through searcher. For ordinary wildcards like*error*there is directFindContainsmethod which is even faster.For example, query
message:*foobarf*:main: 86 ms
using
FindToken: 50 msusing
FindContains: 37 msSo, FindContains just throws out costly abstractions to get additional performance. We could also provide a dedicated func like FindSuffix, for example. This is a typical example when performance requires additional code.
trace_id:*foobark8s_pod:*6message:*err*message:*foo*message:*request*message:*foobar*foobar*message:*foobarfoobar*message:*very_very_message_aggregator_events*Next steps:
bytes.IndexoverBlockpayload - already shows good results