Skip to content
/ server Public

MDEV-30182: Optimize open_tables to take O(N) time#4859

Open
FooBarrior wants to merge 5 commits intomainfrom
bb-main-opentables
Open

MDEV-30182: Optimize open_tables to take O(N) time#4859
FooBarrior wants to merge 5 commits intomainfrom
bb-main-opentables

Conversation

@FooBarrior
Copy link
Copy Markdown
Contributor

This is a continuation of #26680, after I made the adoption work, and perf analysis and two micro-optimizations (with distinguising results, see comments in https://jira.mariadb.org/browse/MDEV-30182).

Please let's try to preserve the original athor's (@Vanislavsky) creds in the commit header.

Summary

Table opening stage was known to make two list traversals:

  1. In find_fk_prelocked_table, all the query_tables list is traversed for each
    foreign key of a table to open.
  2. MDL_context::find_ticket traverses all mdl tickets, one ticket per table.

Both result in O(tables^2) time total.
This may dramatically increase the query latencty in the following known cases:

  • updates/deletes on tables with many children
  • DMLs in transactions involving many different tables

Also, it slows down the DROP DATABASE performance, with a big enough amount of
tables.

Optimization done:
find_ticket and find_fk_prelocked_table now looks up in a hash table.

The benefit of list lookup is that it's faster for 2-3 tables. This is why we introduced a new hash table, that features an optimization similar to libc++ SSO in the first place.

Let me denote the mentioned optimization as SHO, for small hash optimization.

I've also put an effort to choose the lowest latency approach to hash tables today and ended up with open pure address hash table, which was proposed to implement to the author as a part of his technical task.

So far it's been proven that it indeed outperforms mysys hash dramatically, likely by a number of reasons.

Finally, I've compared list vs SHO performance and ended up with two micro-optimizations, which provided the same results as for a list of two elements.

FooBarrior and others added 5 commits January 19, 2026 12:56
======
This is an  adopted version of patch provided in the PR #MariaDB/2680
Original author: Sergey Vanislavskiy
Adopted by: Nikita Maliavin
======

Table opening stage was known to make two list traversals:
1. In find_fk_prelocked_table, all the query_tables list is traversed for each
foreign key of a table to open.
2. MDL_context::find_ticket traverses all mdl tickets, one ticket per table.

Both result in O(tables^2) time total.
This may dramatically increase the query latencty in the following known cases:
* updates/deletes on tables with many children
* DMLs in transactions involving many different tables

Also, it slows down the DROP DATABASE performance, with a big enough amount of
tables.

So to optimize this out the following is done:
* A hash table with all FK-prelocked tables is added to THD. A table is filled
and queried inside find_fk_prelocked_table, and cleaned up at the query end.
* The find_ticket implementation is replaced with two consecutive hash lookups.
* A hash table with all tickets for current context (Query_tables_list) is
added.
* find_fk_prelocked_table now makes a hash lookup as well. We have to calculate
a hash value for this lookup, so MDL_key is created earlier. It's then reused
if a new table list is created.

Given the requirement of no performance degradation for a small table value,
a new data structure is introduced: Open_address_hash.

Open_address_hash is a generic data structure, that is optimized for usage with
a few elements, similarly to the "short string optimization" in c++ standard
libraries: if a number of elements can fit the structure's class body inline,
then they are placed there. One bit is borrowed to indicate whether the inline
mode is used.

This also means that this optimization cannot work in 32-bit environments.

The hash table is implemented from scratch using open address hashing to reduce
memory fragmentation, reduce allocation pressure and memory footprint.

Speaking of the memory footprint, it is expected to be as follows:
* +4x pointer size for each connection (2 pointers per hash, two hashes total)
* +4x pointer size for each prepared statement, or cached/exected stored
procedure.
* If number of tables opened > 2, then +2x pointer size per table, because the
hash load factor is kept 50% at this moment
* If number of FK-prelocked tables opened > 2, then +2x pointer size per
FK-prelocked table.
elem_suits => match
init_one_table_no_request => init_one_table_no_mdl
malloc,realloc,free => my_malloc,...

insert_into_bucket made return void.

Using my_realloc resulted in that shrink can now return an error. The
reason is my_realloc can fail even if shrinking with SAFEMALLOC enabled.

SAFEMALLOC is enabled by default on non-windows builds.
Copy link
Copy Markdown
Contributor

@svoj svoj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree that the problem is valid and it has to be fixed.

I couldn't yet convince myself that Open_address_hash is a necessity, but I must admit I like some of its characteristics.

I will want to have another look once issues I've outlined are sorted out.

}

size_t size() const
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing space.

}
else
{
return _size;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing space.

@@ -3065,6 +3072,9 @@ MDL_context::acquire_lock(MDL_request *mdl_request, double lock_wait_timeout)
m_wait.reset_status();

m_tickets[mdl_request->duration].push_front(ticket);
bool success = ticket_hash.insert(ticket);
if (unlikely(!success))
DBUG_RETURN(TRUE);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is wrong. You loose ticket reference and it stays locked forever.
success is redundant here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the lock should be unlocked as well, no?

Copy link
Copy Markdown
Contributor

@svoj svoj Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lock has to be released, ticket has to be deleted, ticket has to be removed from m_tickets. There's an alternative though: do nothing. find_ticket() won't find entry for given key and try_acquire_lock_impl() will have to create ticket anew. OTOH is_lock_owner() will get broken.

#include "my_sys.h"
#include "m_ctype.h"

#define ER_OUT_OF_RESOURCES 1041 // from "mysqld_error.h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? This looks really wrong.

&& tl->prelocking_placeholder == TABLE_LIST::PRELOCK_FK
&& strcmp(tl->table_name.str, key.name()) == 0
&& strcmp(tl->db.str, key.db_name()) == 0;
});
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

find_fk_prelocked_table() is unused, please remove.
You've also missed to remove table_already_fk_prelocked() declaration in sql_base.h

@@ -0,0 +1,458 @@
#pragma once
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All new files are missing copyright headers.

capacity_shift= CAPACITY_POWER_INITIAL;
hash_array= (Value *)my_malloc(PSI_NOT_INSTRUMENTED,
_buffer_size() * sizeof (Value *),
MYF(MY_ZEROFILL|MY_WME));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this instrumented?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want a new key or have some sugestion?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find anything that fits. Apparently we need a new one.

DBUG_ASSERT(thd->is_error()); // ER_OUTOFMEMORY is set inside the hash
if (arena)
thd->restore_active_arena(arena, &backup);
DBUG_RETURN(TRUE);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not:

error= TRUE;
break;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code lived through a couple of rebases for now, by the time of introduction there was no error= TRUE; break;.

I disagree that a pattern is correct to use for error handling. If someone will add logic after the loop, this error handling will be broken.

The correct way is to use RAII class Query_arena_stmt.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/me shrugs

Duplicating code that handles errors is not great either. Query_arena_stmt is fine with me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we should assert that ticket_hash is empty here.

@svoj
Copy link
Copy Markdown
Contributor

svoj commented Mar 27, 2026

Both result in O(tables^2) time total.

Is this really the case? I can't immediately see how tables ^ 2 is possible even in the worst scenario.

Copy link
Copy Markdown
Contributor Author

@FooBarrior FooBarrior left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @svoj.

I think now that I know libc api a bit better, I can refactor shrink to be error-less.

What's your concern about static and inline? Perhaps, inline is redundand, since it's implicit anyway for in-body methods.

I also like the availability of clear and well-defined decomposition when possible. init_one_table_* looks fine to me even with a fresh look.

I didn't comment on typos etc, that'll be fixed.

capacity_shift= CAPACITY_POWER_INITIAL;
hash_array= (Value *)my_malloc(PSI_NOT_INSTRUMENTED,
_buffer_size() * sizeof (Value *),
MYF(MY_ZEROFILL|MY_WME));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want a new key or have some sugestion?


Open_address_hash &operator=(const Open_address_hash&)
{
// Do nothing. Copy operator is called by set_query_tables_list used only for backup.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a part of explanation why it's actually doing nothing, violationg the assignment semantic.

@@ -2741,6 +2741,8 @@ MDL_context::try_acquire_lock_impl(MDL_request *mdl_request,
switch (mdl_locks.try_acquire_lock(m_pins, &mdl_request->key, ticket,
out_ticket != nullptr)) {
case TAL_ACQUIRED:
if (!ticket_hash.insert(ticket))
return TRUE;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's nothing new to this function:

m_tickets[mdl_request->duration].push_front

just below may silently fail on OOM resulting in same.

Though the concern in general is valid.

@@ -3065,6 +3072,9 @@ MDL_context::acquire_lock(MDL_request *mdl_request, double lock_wait_timeout)
m_wait.reset_status();

m_tickets[mdl_request->duration].push_front(ticket);
bool success = ticket_hash.insert(ticket);
if (unlikely(!success))
DBUG_RETURN(TRUE);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the lock should be unlocked as well, no?

@@ -2556,8 +2530,37 @@ struct TABLE_LIST
Lex_ident_table(*table_name_arg);
lock_type= lock_type_arg;
updating= lock_type >= TL_FIRST_WRITE;
}

static enum_mdl_type get_mdl_type(enum thr_lock_type lock_type)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because no object dependencies. It's inside class.

DBUG_ASSERT(thd->is_error()); // ER_OUTOFMEMORY is set inside the hash
if (arena)
thd->restore_active_arena(arena, &backup);
DBUG_RETURN(TRUE);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code lived through a couple of rebases for now, by the time of introduction there was no error= TRUE; break;.

I disagree that a pattern is correct to use for error handling. If someone will add logic after the loop, this error handling will be broken.

The correct way is to use RAII class Query_arena_stmt.

m_tickets[MDL_TRANSACTION].remove(mdl_xlock_request.ticket);
delete mdl_xlock_request.ticket;
}

DBUG_RETURN(FALSE);
DBUG_RETURN(MY_TEST(!success));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I had no public comment about that...
yes, that's ridiculus! I sort of agree. That's due to the fact that safe_malloc makes a new allocation even during a shrinking my_realloc.

Perhaps I could outsmart it by assuming a failed shrinking realloc a no-event. Just checked: it doesn't free the old buffer in case of failure.
Just checked cppreference as well:

https://en.cppreference.com/w/c/memory/realloc

So I could actually do that.

being requested. If not look at lists for other durations.
For the MDL_EXPLICIT duration, look for the tickets with any duration.
For other durations, first try to find the ticket with duration other than
MDL_EXPLICIT. If not found, check MDL_EXPLICIT duration as well.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks.

m_tickets[MDL_TRANSACTION].remove(mdl_xlock_request.ticket);
delete mdl_xlock_request.ticket;
}

DBUG_RETURN(FALSE);
DBUG_RETURN(MY_TEST(!success));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mdl_type, MDL_TRANSACTION);
get_mdl_type(lock_type_arg), MDL_TRANSACTION);
}
inline void init_one_table_by_key(const LEX_CSTRING *db_arg,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, perhaps, but now it's the function that could be conveniently used to init a TABLE_LIST with MDL_KEY. It looks symmetric to init_one_table.

@FooBarrior
Copy link
Copy Markdown
Contributor Author

Is this really the case? I can't immediately see how tables ^ 2 is possible even in the worst scenario.

Yes. One easy to get scenario is table with N referencial constraints pointing to it (with cascade actions) -- that immediately adds all those tables to a single ticket list, if to speak about MDL.

Same to SELECT with many tables, but I don't know how widespread is that. AFAIK, that's the reason many DBMS jit-optimize select trees, to avoid extra vfunc overhead, needless to say about O(N^2) overhead.

DROP DATABASE -- same.

Multi-statement transaction involving different tables -- same.

@svoj
Copy link
Copy Markdown
Contributor

svoj commented Mar 27, 2026

Is this really the case? I can't immediately see how tables ^ 2 is possible even in the worst scenario.

Yes. One easy to get scenario is table with N referencial constraints pointing to it (with cascade actions) -- that immediately adds all those tables to a single ticket list, if to speak about MDL.

Same to SELECT with many tables, but I don't know how widespread is that. AFAIK, that's the reason many DBMS jit-optimize select trees, to avoid extra vfunc overhead, needless to say about O(N^2) overhead.

DROP DATABASE -- same.

Multi-statement transaction involving different tables -- same.

Apparently you're right. IIUC the maximum number of comparisons is going to be N^2/2. But that's still quadratic growth rate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants