MDEV-31956 SSD based InnoDB buffer pool extension #4510

vlad-lesin · 2026-01-05T13:32:11Z

In one of the practical cloud MariaDB setups, a server node accesses its
datadir over the network, but also has a fast local SSD storage for
temporary data. The content of such temporary storage is lost when the
server container is destroyed.

The commit uses this ephemeral fast local storage (SSD) as an extension of
the portion of InnoDB buffer pool (DRAM) that caches persistent data
pages. This cache is separated from the persistent storage of data files
and ib_logfile0 and ignored during backup.

The following system variables were introduced:

innodb_extended_buffer_pool_size - the size of external buffer pool
file, if it equals to 0, external buffer pool will not be used;

innodb_extended_buffer_pool_path - the path to external buffer pool
file.

If innodb_extended_buffer_pool_size is not equal to 0, external buffer
pool file will be (re)created on startup.

Only clean pages will be flushed to external buffer pool file. There is
no need to flush dirty pages, as such pages will become clean after
flushing, and then will be evicted when they reach the tail of LRU list.

The general idea of this commit is to flush clean pages to external
buffer pool file when they are evicted.

A page can be evicted either by transaction thread or by background
thread of page cleaner. In some cases transaction thread is waiting for
page cleaner thread to finish its job. We can't do flushing in external
buffer pool file when transaction threads are waiting for eviction,
that would heart performance. That's why the only case for flushing is
when page cleaner thread evicts pages in background and there are no
waiters. For this purpose buf_pool_t::done_flush_list_waiters_count
variable was introduced, we flush evicted clean pages only if the
variable is zeroed.

Clean pages are evicted in buf_flush_LRU_list_batch() to keep some
amount of pages in buffer pool's free list. That's why we flush every
second page to external buffer pool file, otherwise there could be not
enough amount of pages in free list to let transaction threads to
allocate buffer pool pages without page cleaner waiting. This might be
not a good solution, but this is enough for prototyping.

External buffer pool page is introduced to store information in buffer
pool page hash about the certain page can be read from external buffer
pool file. The first several members of such page must be the same as the
members of internal page. External page frame must be equal to the
certain value to distinguish external page from internal one. External
buffer pages are preallocated on startup in external pages array. We
could get rid of the frame in external page, and check if the page's
address belongs to the array to distinguish external and internal pages.

There are also external pages free and LRU lists. When some internal page
is decided to be flushed in external buffer pool file, a new external
page is allocated either from the head of external free list, or from
the tail of external LRU list. Both lists are protected with
buf_pool.mutex. It makes sense, because a page is removed from internal
LRU list during eviction under buf_pool.mutex.

Then internal page is locked and the allocated external page is attached
to io request for external buffer pool file, and when write request is
completed, the internal page is replaced with external one in page hash,
external page is pushed to the head of external LRU list and internal
page is unlocked. After internal page was removed from external free list,
it was not placed in external LRU, and placed there only
after write completion, so the page can't be used by the other threads
until write is completed.

Page hash chain get element function has additional template parameter,
which notifies the function if external pages must be ignored or not. We
don't ignore external pages in page hash in two cases, when some page is
initialized for read and when one is reinitialized for new page creating.

When an internal page is initialized for read and external page with the
same page id is found in page hash, the internal page is locked,
the external page in replaced with newly initialized internal page in the
page hash chain, the external page is removed from external LRU list and
attached to io request to external buffer pool file. When the io request
is completed, external page is returned to external free list,
internal page is unlocked. So during read external page absents in both
external LRU and free lists and can't be reused.

When an internal page is initialized for new page creating and external
pages with the same page id is found in page hash, we just remove external
page from the page hash chain and external LRU list and push it to the
head of external free list. So the external page can be used for future
flushing.

There is also some magic with watch sentinels. If watch is goint to be
set for the external page with the same page id in page hash, we replace
the external page with sentinel in page hash and attach external
page to the sentinel's frame. When such sentinel with attached external
page should be removed, we replace it with the attached external page in
page hash instead of just removing the sentinel. This idea is not fully
implemented, as the function, which allocates external pages, does not
take into account that page hash can contain sentinels with attached
external pages. And, anyway, this code must be removed if we cherry-pick
the commit to 11.* branch, as change buffer is already removed in those
versions.

The pages are flushed to and read from external buffer pool file with
the same manner as they are flushed to their spaces, i.e. compressed and
encrypted pages stay compressed and encrypted in external buffer pool
file.

The Jira issue number for this PR is: MDEV-______

Description

TODO: fill description here

Release Notes

TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

This is a new feature or a refactoring, and the PR is based against the main branch.
This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

CLAassistant · 2026-01-05T13:32:18Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ vlad-lesin
❌ dr-m
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

dr-m

Before I review this deeper, could you please fix the build and the test innodb.ext_buf_pool on Microsoft Windows?

storage/innobase/buf/buf0flu.cc

mysql-test/suite/innodb_i_s/innodb_buffer_pool_stats.result

include/my_sys.h

mysql-test/suite/innodb/t/ext_buf_pool.test

dr-m · 2026-01-19T05:59:27Z

mysql-test/suite/innodb/t/ext_buf_pool.opt

@@ -0,0 +1 @@
+--innodb-buffer-pool-size=21M --innodb-extended-buffer-pool-size=1M


Typically, the local SSD is larger than the RAM, and therefore it would be more interesting to test a scenario where the buffer pool is smaller than the extended buffer pool. I understand that our current CI setup is not suitable for any performance testing. But, currently we only cover this functionality in debug-instrumented executables.

Would it be possible to write some sort of a test, or extend an existing one (such as encryption.innochecksum) with a variant that would use the minimal innodb_buffer_pool_size and a larger innodb_extended_buffer_pool_size? Currently, that test is running with innodb_buffer_pool_size=64M.

The test flushes small amount of pages into extended buffer pool file, that's why there was not any reason to set a big file size. Probably it makes sense to develop separate test, we can set the minimal innodb_buffer_pool_size and a larger innodb_extended_buffer_pool_size, but what exactly are we going to test in that separate test?

I was thinking that it would be useful to mimic a typical deployment scenario where we expect some pages to be written to the local SSD and read back from there. Sure, mtr tests are limited in size and not good for performance testing. Should there be any race conditions, having a test run on multiple platforms for every push would help catch them over time.

storage/innobase/buf/buf0dblwr.cc

storage/innobase/buf/buf0flu.cc

dr-m

This is a partial review.

storage/innobase/include/buf0buf.h

storage/innobase/include/fil0fil.h

storage/innobase/include/os0file.h

dr-m

Here is another incomplete review.

storage/innobase/os/os0file.cc

storage/innobase/log/log0recv.cc

storage/innobase/include/fil0fil.h

dr-m · 2026-01-26T08:39:15Z

storage/innobase/include/fil0fil.h

+  /** Extended buffer pool file path */
+  char  *ext_bp_path;


Could this be const char*?

No, because the path can be set with system variable.

storage/innobase/include/fil0fil.h

storage/innobase/include/buf0lru.h

storage/innobase/include/buf0buf.h

dr-m

Here are some more comments. I’m doing this review in batches so that you can work concurrently on addressing some of these comments. Everything seems to be relatively minor stuff.

storage/innobase/handler/i_s.cc

storage/innobase/handler/ha_innodb.cc

storage/innobase/fil/fil0fil.cc

dr-m · 2026-01-26T10:58:50Z

storage/innobase/fil/fil0fil.cc

+  fil_space_t *space;
+  if (ext_buf())
+  {
+    ut_d(fil_space_t *debug_space=) space=
+        fil_space_t::get(buf_page->id().space());


Do we really need to declare the variable space this early? As far as I can tell, it’s only truly needed if !ext_buf().

In the case when ext_buf() returns true we invoke fil_space_t::get(...), what increases the space's reference counter. We need to decrease it, and we decrease it at the end of IORequest::read_complete().

dr-m · 2026-01-26T11:02:12Z

storage/innobase/fil/fil0fil.cc

+      /* We must hold buf_pool.mutex while releasing the block, so that
+      no other thread can access it before we have freed it. */
+      mysql_mutex_lock(&buf_pool.mutex);
+      buf_page->write_complete_release(buf_page->state());
+      buf_LRU_free_page(buf_page, true, ext_buf_page());
+      mysql_mutex_unlock(&buf_pool.mutex);


Is there any alternative to this, such as holding an exclusive latch on the buf_pool.page_hash slice?

Please add a FIXME comment to implement deferred, batched freeing, similar to the one we do in the normal LRU flushing/eviction in the page cleaner thread.

This would be an obvious performance bottleneck us when there is some write activity to the external buffer pool going on. The IORequest::write_complete() is expected to be called from multiple threads concurrently.

dr-m · 2026-01-26T11:03:33Z

storage/innobase/fil/fil0fil.cc

+  {
+    sql_print_error("Cannot open/create extended buffer pool file '%s'", path);
+    /* Report OS error in error log */
+    (void) os_file_get_last_error(true, false);


std::ignore= is the canonical way to discard a return value.

dr-m · 2026-01-27T06:30:53Z

mysys/my_copy.c

    file_created= 1;
    while ((Count=my_read(from_file, buff, sizeof(buff), MyFlags)) != 0)
    {
-	if (Count == (uint) -1 ||
+	if (Count == (size_t) -1 ||
 	    my_write(to_file,buff,Count,MYF(MyFlags | MY_NABP)))
 	goto err;
    }


I see that my_copy() is being outside mysqltest.cc, for example in myisampack and aria_pack. This could fix a bug in those tools. Therefore, it would make sense to file a separate MDEV and PR for this fix, so that this simple bug can be fixed in GA releases.

dr-m

I am still yet to review the change to buf0flu.cc, buf0lru.cc, buf0rea.cc.

dr-m · 2026-01-27T06:44:10Z

storage/innobase/buf/buf0buf.cc

+ext_buf_page_t *buf_pool_t::alloc_ext_page(page_id_t page_id) noexcept
+{
+  mysql_mutex_assert_owner(&mutex);
+  ext_buf_page_t *p;
+  if ((p= UT_LIST_GET_FIRST(ext_free)))
+    UT_LIST_REMOVE(ext_free, p);
+  else if ((p= UT_LIST_GET_LAST(ext_LRU))) {
+    for (; p; p= UT_LIST_GET_PREV(ext_LRU_list, p)) {
+      hash_chain &hash_chain= page_hash.cell_get(p->id_.fold());
+      page_hash_latch &hash_lock= page_hash.lock_get(hash_chain);
+      if (!hash_lock.try_lock())
+        continue;
+      UT_LIST_REMOVE(ext_LRU, p);
+      page_hash.remove(hash_chain, reinterpret_cast<buf_page_t *>(p));
+      hash_lock.unlock();
+      break;
+    }
+    if (!p)
+      return nullptr;


This is potentially a very expensive operation, and it is being executed within a single critical section of buf_pool.mutex. Several years ago in 7cffb5f, I tried to refactor code so that we will avoid calculating page_id_t::fold() while holding buf_pool.mutex. Here, we are potentially doing it for every single element of buf_pool.ext_LRU_list. I see that the caller only needs buf_pool.mutex for buf_LRU_stat_inc_io() afterwards. So, there should be no problem to release and reacquire buf_pool.mutex here.

Based on buf_pool_t::unzip() and buf_page_create_low(), the correct lock order is to acquire buf_pool.mutex first and then a latch on a buf_pool.page_hash slice. Hence, it looks like the loop could be unnecessary and we could just wait for the hash_lock on the last element of buf_pool.ext_LRU_list.

There is only a single call to this function, in buf_page_t::flush(). The code would be easier to follow and faster to execute (better locality of reference) if the function was defined inline in the same compilation unit with the caller, that is, buf0flu.cc.

dr-m · 2026-01-27T06:46:23Z

storage/innobase/buf/buf0buf.cc

    memory_unaligned= nullptr;
  }

+  my_free(ext_buf_pages_array);


Shouldn’t we assign ext_buf_pages_array= nullptr; to be in line with how we release buf_pool.memory and buf_pool.memory_unaligned?

dr-m · 2026-01-27T06:50:41Z

storage/innobase/buf/buf0buf.cc

+  buf_page_t *bpage= buf_pool.page_hash.get<true>(page_id, chain);
+
+  if (bpage && buf_pool.is_page_external(*bpage)) {
+      page_hash_latch &hash_lock= buf_pool.page_hash.lock_get(chain);
+      hash_lock.lock();
+      buf_pool.page_hash.remove(chain, bpage);
+      hash_lock.unlock();
+      ut_ad(!bpage->in_page_hash);
+      ext_buf_page_t *ext_buf_page=
+        reinterpret_cast<ext_buf_page_t *>(bpage);
+      buf_pool.remove_ext_page_from_LRU(*ext_buf_page);
+      buf_pool.free_ext_page(*ext_buf_page);
+      bpage= nullptr;
+  }

  if (bpage)
  {


We would be checking for bpage!=nullptr twice after this change. I would suggest the following:

if (!bpage) /* not found */; else if (buf_pool.is_page_extrenal(*bpage)) { // … (indent with 2 extra spaces instead of 4) } else {

that is, replacing the if (bpage) with else.

dr-m

I think I only have buf0flu.cc left.

dr-m · 2026-01-27T07:11:44Z

storage/innobase/buf/buf0lru.cc

 @param zip        whether to remove both copies of a ROW_FORMAT=COMPRESSED page
 @retval true if freed and buf_pool.mutex may have been temporarily released
 @retval false if the page was not freed */
-bool buf_LRU_free_page(buf_page_t *bpage, bool zip)
+bool buf_LRU_free_page(buf_page_t *bpage, bool zip,
+    ext_buf_page_t *ext_buf_page)


The added line is indented insufficiently, and the comment needs to be expanded, presumably as follows:

@param ext_buf_page external buffer page to replace bpage in buf_pool.page_hash

dr-m · 2026-01-27T07:13:14Z

storage/innobase/buf/buf0lru.cc

 static bool buf_LRU_block_remove_hashed(buf_page_t *bpage, const page_id_t id,
                                        buf_pool_t::hash_chain &chain,
-                                        bool zip);
+                                        bool zip,
+                                        ext_buf_page_t *ext_buf_page= nullptr);


There are only 2 callers of this function. We could remove the default parameter and explicitly pass the nullptr in buf_pool_t::corrupted_evict().

dr-m · 2026-01-27T07:15:04Z

storage/innobase/buf/buf0lru.cc

 @param id         page identifier
 @param chain      locked buf_pool.page_hash chain (will be released here)
 @param zip        whether bpage->zip of BUF_BLOCK_FILE_PAGE should be freed
+@param ext_buf_page external buffer page to replace bpage in page hash


It’s customary to align the description of each @param. The preceding lines should be edited accordingly, or the parameter name could be shortened to ext_bpage.

dr-m · 2026-01-27T07:21:03Z

storage/innobase/fil/fil0fil.cc

+    if (!space) {
+      buf_pool.corrupted_evict(buf_page, buf_page_t::READ_FIX + 1, false);


This is the only call site where we suppress the call to buf_page_t::set_corrupted_id(). Can you please add a comment that explains why that is necessary?

dr-m · 2026-01-27T07:26:51Z

storage/innobase/buf/buf0rea.cc

+struct page_init_result {
+/*  page_init_result() : bpage(nullptr), ext_buf_page(nullptr) {} */
+  bool in_ext_buffer_pool() const noexcept { return ext_buf_page; }
+  buf_page_t* bpage; /* Initialized page */
+  ext_buf_page_t *ext_buf_page; /* External buffer pool page if bpage can be
+                               read from from exretnal buffer pool file */
+};


I would omit the constructor comment as well as the member function. I don’t think that replacing init_page_result.in_ext_buffer_pool() with init_page_result.ext_buf_page or (say) init_page_result.ext_read would reduce readability.

The data members need to be documented with Doxygen compatible comments:

struct page_init_result { /** initialized page descriptor and uninitialized frame */ buf_page_t *page; /** external buffer pool page if the page can be read from ext_bp */ ext_buf_page_t *ext_read; };

dr-m · 2026-01-27T09:18:38Z

storage/innobase/buf/buf0rea.cc

-    auto fio= space->io(IORequest(IORequest::READ_SYNC),
+    auto fio=
+        init_page_result.in_ext_buffer_pool()
+            ? fil_io_t{fil_system.ext_bp_io(
+                           *bpage, *init_page_result.ext_buf_page,
+                           IORequest::READ_SYNC, nullptr, len, dst),
+                       nullptr}
+            : space->io(IORequest(IORequest::READ_SYNC),
                        os_offset_t{page_id.page_no()} * len, len, dst, bpage);
    *err= fio.err;
    thd_wait_end(thd);
+    if (init_page_result.in_ext_buffer_pool())
+    {
+      mysql_mutex_lock(&buf_pool.mutex);
+      buf_pool.free_ext_page(*init_page_result.ext_buf_page);
+      mysql_mutex_unlock(&buf_pool.mutex);
+    }


We are checking for init_page_result.ext_buf_page twice. Maybe it’s acceptable.

Thinking aloud: Could we free the block before reading it? As far as I can tell, the call buf_pool.ext_page_offset() inside fil_system.ext_bp_io() is not dereferencing the object. The answer is no, because if we freed the block first, then it could be reallocated for something else and the underlying file block could be overwritten before we get a chance to read the desired contents.

Also thinking aloud: Could we avoid the buf_pool.mutex here, and avoid having the buf_pool.ext_free list as well? We would simply identify freed blocks with an atomic acquire/release version of id_.is_corrupted() and id_.set_corrupted(). In that way, freeing blocks would be fast, but allocation would have to sweep the entire extended buffer pool until an empty slot is found. There could perhaps be a "cursor" to quickly locate the next available empty slot. Maybe you could add a FIXME or TODO comment about this somewhere, for example next to the definition of buf_pool.ext_free? This would also address my earlier comment about the new scalability bottleneck in IORequest::read_complete().

dr-m · 2026-01-27T09:23:43Z

storage/innobase/buf/buf0rea.cc

-      *err= bpage->read_complete(*fio.node, recv_sys.recovery_on);
+      *err= bpage->read_complete(init_page_result.in_ext_buffer_pool()
+                                     ? *UT_LIST_GET_FIRST(space->chain)
+                                     : *fio.node,
+                                 recv_sys.recovery_on);


What if we did not replace IORequest::node with a union, but just added another field ext_buf_page_t *const ext_buf_page? In that way, we could avoid this change altogether.

dr-m · 2026-01-27T09:33:26Z

storage/innobase/buf/buf0rea.cc

-  buf_pool.stat.n_pages_read++;
+  if (!ext_buf_page)
+    buf_pool.stat.n_pages_read++;
  ut_ad(!bpage || bpage->in_file());
+  if (ext_buf_page && !fil_system.ext_buf_pool_enabled())


Can we write else if (!fil_system.ext_buf_pool_enabled()) in order to skip one condition inside a critical section of buf_pool.mutex?

If we didn’t need the else branch inside the critical section (and I think we could fix that, see another comment), we could write

buf_pool.stat.n_pages_read+= !ext_buf_page;

to replace the condition with arithmetics, optimizing for the likely code path that no external buffer pool is being used.

dr-m · 2026-01-27T09:35:18Z

storage/innobase/buf/buf0rea.cc

+  {
+    buf_pool.free_ext_page(*ext_buf_page);
+    ext_buf_page= nullptr;
+  }
  mysql_mutex_unlock(&buf_pool.mutex);


As noted elsewhere, if we removed buf_pool.ext_free, this could be done outside the critical section of buf_pool.mutex. But, since this is an unlikely code path (the extended buffer pool was recently disabled due to an I/O error), I think that we can live with this.

dr-m · 2026-01-27T09:39:58Z

storage/innobase/buf/buf0rea.cc


-    bpage= static_cast<buf_page_t*>(ut_zalloc_nokey(sizeof *bpage));
+    bpage=
+        static_cast<buf_page_t *>(ut_zalloc_nokey(sizeof *bpage));

+    // TODO: do we need to init it for compressed pages? I think no.
    page_zip_des_init(&bpage->zip);


You are right, the page_zip_des_t::clear() is redundant here, and the entire macro declaration could be removed. This is because ut_zalloc_nokey() is already zero-initializing the entire bpage. The bpage->zip.clear() would zero-initialize everything except the buffer-fix field. Can you push this code removal as a separate pull request against 10.11?

dr-m

I completed my first round of review. I would suggest that you reply to each comment that you intend to fully address, and mark those comments as resolved.

There are some "nice to have" comments that can be addressed later. I would request to leave them as unresolved, so that we can easily find them when working on improving the performance.

Many of my comments are rather trivial to address and do not prevent this from being approved. After all, this has been already subjected to stress testing and a number of bugs have already been fixed.

dr-m · 2026-01-27T09:56:56Z

storage/innobase/buf/buf0flu.cc

+    bool persistent = (type == PERSISTENT);
    ut_d(lsn_t om= oldest_modification());
-    ut_ad(om >= 2);
+    ut_ad(type == EXT_BUF || om >= 2);
    ut_ad(persistent == (om > 2));
+    ut_ad(type != EXT_BUF || !oldest_modification());
    /* We use release memory order to guarantee that callers of
    oldest_modification_acquire() will observe the block as
    being detached from buf_pool.flush_list, after reading the value 0. */
-    oldest_modification_.store(persistent, std::memory_order_release);
+    if (type != EXT_BUF)
+      oldest_modification_.store(persistent, std::memory_order_release);


I think that it would be clearer to add a switch (type) around all this. This would also make the assertion about type==TEMPORARY stricter:

lsn_t om ut_d(= oldest_modification()); switch (type) { case PERSISTENT: ut_ad(om > 2); om= 1; goto clear_modification; case TEMPORARY: ut_ad(om == 2); om= 0; clear_modification: /* We use release memory order to guarantee that callers of oldest_modification_acquire() will observe the block as being detached from buf_pool.flush_list, after reading the value 0. */ oldest_modification_.store(om, std::memory_order_release); break; case EXT_BUF: ut_ad(om == 0); break; }

dr-m · 2026-01-27T10:29:00Z

storage/innobase/include/os0file.h

+  union
+  {
+    /** File descriptor */
+    fil_node_t *const node_ptr= nullptr;
+    /** External buffer pool page if the request is for external buffer pool
+    file, i.e. the lowest bit of bpage_ptr is set, nullptr otherwise */
+    ext_buf_page_t *const ext_buf_page_ptr;
+  };
+
+  /** Page to be written on write operation, the lowest bit shows if the
+  request is for external buffer pool or not */
+  buf_page_t *const bpage_ptr= nullptr;
+
 public:
-  /** Page to be written on write operation */
-  buf_page_t *const bpage= nullptr;

  /** Memory to be used for encrypted or page_compressed pages */
  buf_tmp_buffer_t *const slot= nullptr;

-  /** File descriptor */
-  fil_node_t *const node= nullptr;
+  buf_page_t *bpage() const noexcept
+  {
+    return reinterpret_cast<buf_page_t *>(
+        reinterpret_cast<ptrdiff_t>(bpage_ptr) & ~ptrdiff_t(1));
+  };
+
+  bool ext_buf() const noexcept
+  {
+    return reinterpret_cast<ptrdiff_t>(bpage_ptr) & 1;
+  }


Instead of replacing node with a union and repurposing a bit of bpage we could leave those data members alone and simply add an ext_buf_page that would normally be nullptr. The sizeof(IORequest) is only 32 bytes on 64-bit systems, including 32 bits of padding around type. It should be no problem at all to let it grow by a single pointer.

Yes, the size of IORequest is 32 bytes, as well as MAX_AIO_USERDATA_LEN, i.e. aiocb::m_userdata buffer size where the IORequest object should be fit in. We can, of course, increase MAX_AIO_USERDATA_LEN by one more sizeof(void*) as well as write_slots and read_slots arrays size. But it will cost a memcpy.

dr-m · 2026-01-27T10:30:58Z

storage/innobase/buf/buf0flu.cc

-  const bool persistent= bpage->oldest_modification() != 2;
+  buf_page_t::space_type type= request.ext_buf()
+                                   ? buf_page_t::EXT_BUF
+                                   : static_cast<buf_page_t::space_type>(
+                                         bpage->oldest_modification() == 2);


It would be good to document this assumption of the values 0 and 1 here:

static_assert(buf_page_t::PERSISTENT == 0, ""); static_assert(buf_page_t::TEMPORARY == 1, "");

dr-m · 2026-01-27T10:38:13Z

storage/innobase/buf/buf0flu.cc

+  if (UNIV_UNLIKELY(type != buf_page_t::PERSISTENT) && UNIV_LIKELY(!error))
  {
+    if (type == buf_page_t::EXT_BUF)
+    {
+      ut_d(if (DBUG_IF("ib_ext_bp_count_io_only_for_t")) {
+        if (fil_space_t *space= fil_space_t::get(bpage->id_.space()))
+        {
+          auto space_name= space->name();
+          if (space_name.data() &&
+              !strncmp(space_name.data(), "test/t.ibd", space_name.size()))
+          {
+            ++buf_pool.stat.n_pages_written_to_ebp;
+          }
+          space->release();
+        }
+      } else)
+        ++buf_pool.stat.n_pages_written_to_ebp;
+    }
    /* We must hold buf_pool.mutex while releasing the block, so that
    no other thread can access it before we have freed it. */
    mysql_mutex_lock(&buf_pool.mutex);
-    bpage->write_complete(persistent, error, state);
-    buf_LRU_free_page(bpage, true);
+    bpage->write_complete(type, error, state);


Can we use a normal non-atomic counter here, and increment it inside bpage->write_complete(), inside the critical section of buf_pool.mutex? I think it should be more efficient. Also, if we had an IORequest::node always present, we could simplify the debug injection and just refer to node->space in the debug injection.

I think that we’d better assign bpage->oldest_modification() to a local variable and pass it to buf_page_t::write_complete(), instead of passing type. It’s an atomic field, therefore we should load it only once. Inside that function we should assert that the value still matches, and then identify the type of the block (0=external buffer pool, 1=impossible, 2=temporary tablespace, >2=write to persistent file) based on that.

dr-m · 2026-01-27T10:40:43Z

storage/innobase/buf/buf0flu.cc

-  ut_ad(lsn
-        ? lsn >= oldest_modification() || oldest_modification() == 2
-        : (space->is_temporary() || space->is_being_imported()));
+  ut_ad(to_ext_buf ||
+        (lsn ? lsn >= oldest_modification() || oldest_modification() == 2
+             : (space->is_temporary() || space->is_being_imported())));


Can we assert that space->is_temporary() will never be written to_ext_buf?

dr-m · 2026-01-27T10:46:02Z

storage/innobase/buf/buf0flu.cc

-  space->reacquire();
+  if (!to_ext_buf)
+    space->reacquire();


If we omitted this condition, we could guarantee that IORequest::node (see my other comments) is valid until the write has been completed. I don’t think that we should attempt to optimize for the unlikely case that the tablespace is being dropped while a write to the external buffer pool is in progress. It’s better to keep it simple.

dr-m · 2026-01-27T10:47:03Z

storage/innobase/buf/buf0flu.cc

+  if (to_ext_buf) {
+    ut_ad(ext_page);
+    fil_system.ext_bp_io(*this, *ext_page, IORequest::WRITE_ASYNC, slot, size,
+                         write_frame);


We should just check if (ext_page) and let the variable to_ext_buf die earlier. The ext_page will also be passed as a parameter to the subsequent function call.

dr-m · 2026-01-27T10:53:11Z

storage/innobase/buf/buf0flu.cc

+    ut_ad(!bpage->is_io_fixed());
+    switch (bpage->oldest_modification())
    {
-      ut_ad(!bpage->is_io_fixed());
-      switch (bpage->oldest_modification()) {


In our formatting convention, there is no line break before a { that begins a switch block.

dr-m · 2026-01-27T10:58:48Z

storage/innobase/buf/buf0flu.cc

-    if (state < buf_page_t::READ_FIX && bpage->lock.u_lock_try(true))
+flush_to_ebp:
+  if ((flush_to_ebp || state < buf_page_t::READ_FIX) &&
+      bpage->lock.u_lock_try(true))


It looks like the code is being deindented by one level here for no valid reason. We’re basically just adding a label right before the if, as well as a condition flush_to_ebp || to skip the state check.

It would be nice to have a comment that explains why we would skip the state check here. It seems to me that it could lead to a failure of the subsequent statement ut_ad(!bpage->is_io_fixed()). state >= buf_page_t::READ_FIX holds exactly when the block is read or write fixed.

I could understand if we allowed a block to be concurrently written both to the persistent data file and to the extended buffer pool. But, I don’t think that buf_page_t::write_complete() is prepared to deal with such "recursion". Furthermore, I understood that the design principle is that only clean blocks (which have already been written back to the persistent data file) may be written to the extended buffer pool.

Based on the following subsequent code:

default: /* bpage->oldest_modification() could be changed from 0 to not 0 while bpage was unlocked, in this case we just flush the page to its space */ flush_to_ebp= false;

I believe that we may have a real race condition here. I don’t quite understand that comment. I would suggest replacing the phrase "was unlocked" (whose literal interpretation could be "during a call to bpage->lock.u_unlock()") with something that identifies a time frame, i.e., "between the time the block had been unlocked in … and our bpage->lock.u_lock_try(true) above". Anyway, I understood that a race condition was found during testing. Maybe it is related to this one.

dr-m · 2026-01-27T11:09:18Z

storage/innobase/buf/buf0flu.cc

+    else if (space->is_stopping_writes())
+    {
+      space->release();
+      space= nullptr;
+    no_space:
+      if (flush_to_ebp && !bpage->oldest_modification()) {
+        bpage->lock.u_unlock(true);
+        buf_LRU_free_page(bpage, true);


What if we always held a space reference, also for flush_to_ebp? It is true that we are not writing to space->chain but to the external buffer pool file, but some things would be simpler if we can assume that the tablespace won’t disappear while the write is in progress.

In one of the practical cloud MariaDB setups, a server node accesses its datadir over the network, but also has a fast local SSD storage for temporary data. The content of such temporary storage is lost when the server container is destroyed. The commit uses this ephemeral fast local storage (SSD) as an extension of the portion of InnoDB buffer pool (DRAM) that caches persistent data pages. This cache is separated from the persistent storage of data files and ib_logfile0 and ignored during backup. The following system variables were introduced: innodb_extended_buffer_pool_size - the size of external buffer pool file, if it equals to 0, external buffer pool will not be used; innodb_extended_buffer_pool_path - the path to external buffer pool file. If innodb_extended_buffer_pool_size is not equal to 0, external buffer pool file will be created on startup. Only clean pages will be flushed to external buffer pool file. There is no need to flush dirty pages, as such pages will become clean after flushing, and then will be evicted when they reach the tail of LRU list. The general idea of this commit is to flush clean pages to external buffer pool file when they are evicted. A page can be evicted either by transaction thread or by background thread of page cleaner. In some cases transaction thread is waiting for page cleaner thread to finish its job. We can't do flushing in external buffer pool file when transaction threads are waithing for eviction, that would heart performance. That's why the only case for flushing is when page cleaner thread evicts pages in background and there are no waiters. For this purprose buf_pool_t::done_flush_list_waiters_count variable was introduced, we flush evicted clean pages only if the variable is zeroed. Clean pages are evicted in buf_flush_LRU_list_batch() to keep some amount of pages in buffer pool's free list. That's why we flush every second page to external buffer pool file, otherwise there could be not enought amount of pages in free list to let transaction threads to allocate buffer pool pages without page cleaner waiting. This might be not a good solution, but this is enought for prototyping. External buffer pool page is introduced to store information in buffer pool page hash about the certain page can be read from external buffer pool file. The first several members of such page must be the same as the members of internal page. External page frame must be equal to the certain value to disthinguish external page from internal one. External buffer pages are preallocated on startup in external pages array. We could get rid of the frame in external page, and check if the page's address belongs to the array to distinguish external and internal pages. There are also external pages free and LRU lists. When some internal page is decided to be flushed in external buffer pool file, a new external page is allocated eighter from the head of external free list, or from the tail of external LRU list. Both lists are protected with buf_pool.mutex. It makes sense, because a page is removed from internal LRU list during eviction under buf_pool.mutex. Then internal page is locked and the allocated external page is attached to io request for external buffer pool file, and when write request is completed, the internal page is replaced with external one in page hash, external page is pushed to the head of external LRU list and internal page is unlocked. After internal page was removed from external free list, it was not placed in external LRU, and placed there only after write completion, so the page can't be used by the other threads until write is completed. Page hash chain get element function has additional template parameter, which notifies the function if external pages must be ignored or not. We don't ignore external pages in page hash in two cases, when some page is initialized for read and when one is reinitialized for new page creating. When an internal page is initialized for read and external page with the same page id is found in page hash, the internal page is locked, the external page in replaced with newly initialized internal page in the page hash chain, the external page is removed from external LRU list and attached to io request to external buffer pool file. When the io request is completed, external page is returned to external free list, internal page is unlocked. So during read external page absents in both external LRU and free lists and can't be reused. When an internal page is initialized for new page creating and external pages with the same page id is found in page hash, we just remove external page from the page hash chain and external LRU list and push it to the head of external free list. So the external page can be used for future flushing. The pages are flushed to and read from external buffer pool file with the same manner as they are flushed to their spaces, i.e. compressed and encrypted pages stay compressed and encrypted in external buffer pool file.

Fix some tests. Make ext_buf_pool test more stable avoiding race conditions for read/write counters.

Fix Windows and liburing issues.

Use persistent named files for external buffer pool instead of temporary one.

Squash it. Fix for the following RQG test failures: 2. Scenario: The server is under load (9 concurrent sessions). At some point of time he crashes with mariadbd: 11.8-MDEV-31956-ext_buf_pool/storage/innobase/buf/buf0flu.cc:294: void buf_page_t::write_complete(buf_page_t::space_type, bool, uint32_t): Assertion `persistent == (om > 2)' failed. 4. Scenario: The server was some time under load (one connection). Intentional SIGKILL DB server followed by restart and running certain checks. All that did not show some error. But the shutdown hang like Fragment of rqg.log: # 2026-01-16T13:15:57 [1467965] INFO: DBServer_e::MySQL::MySQLd::stopServer: server[1]: Stopping server on port 25140 ... # 2026-01-16T13:28:22 [1467965] ERROR: DBServer_e::MySQL::MySQLd::stopServer: server[1]: Did not shut down properly. Terminate it == RQG loses the "patience" and sends finally SIGABRT to the process of the DB server. The server error log shows 2026-01-16 13:15:58 0 [Note] /data/Server_bin/11.8-MDEV-31956-ext_buf_pool_debug_Og/bin/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown ... 2026-01-16 13:15:58 0 [Note] InnoDB: FTS optimize thread exiting. 2026-01-16 13:16:01 0 [Note] InnoDB: Starting shutdown... .... 2026-01-16 13:16:01 0 [Note] InnoDB: Buffer pool(s) dump completed at 260116 13:16:01 2026-01-16 13:18:37 0 [Note] InnoDB: Waiting for page cleaner thread to exit .... 2026-01-16 13:26:24 0 [Note] InnoDB: Waiting for page cleaner thread to exit

Evict page on write completion if it's space was removed. Lock external buffer pool file on Linux.

Added testing for: 1. Async read 2. Space removing during async io 3. IO error during async write The case of io error during sync and async io should also be added. Squash it.

RocksDB test fix.

buf_dblwr_t::create(): Create the doublewrite buffer in a single atomic mini-transaction. Do not write any log records for initializing any doublewrite buffer pages, in order to avoid recovery failure with innodb_log_archive=ON starting from the very beginning. Cherry-picking from 1d1699e

Remove double-write buffer workaround.

Get rid of ext_buf_page_t::frame.

Some code review fixes.

After hash chain is unlocked and buffer pool mutex is released in buf_page_init_for_read(), the chain list can be changed by concurent thread, that's why we need zero out extended buffer page pointer. Make ext_bp_size to be atomic. Process asinc errors in os_aio() correctly.

Some code review fixes.

vlad-lesin requested a review from dr-m January 5, 2026 13:32

gkodinov added the MariaDB Corporation label Jan 6, 2026

vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch 2 times, most recently from ee7d993 to bc5b76d Compare January 11, 2026 20:19

dr-m reviewed Jan 14, 2026

View reviewed changes

storage/innobase/buf/buf0flu.cc Show resolved Hide resolved

vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch 4 times, most recently from 99921cf to 64eb035 Compare January 15, 2026 10:04

dr-m reviewed Jan 19, 2026

View reviewed changes

vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch 2 times, most recently from db5856a to c78f5ac Compare January 19, 2026 09:11

dr-m reviewed Jan 21, 2026

View reviewed changes

dr-m reviewed Jan 26, 2026

View reviewed changes

vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch from 47b10e0 to 9a06eec Compare January 26, 2026 16:20

dr-m reviewed Jan 27, 2026

View reviewed changes

vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch from 6194247 to 62199a8 Compare January 28, 2026 03:24

vlad-lesin added 8 commits January 28, 2026 13:45

MDEV-31956 SSD based InnoDB buffer pool extension

006890c

Fix some tests. Make ext_buf_pool test more stable avoiding race conditions for read/write counters.

MDEV-31956 SSD based InnoDB buffer pool extension

c153311

Fix Windows and liburing issues.

MDEV-31956 SSD based InnoDB buffer pool extension

ae74df0

Use persistent named files for external buffer pool instead of temporary one.

MDEV-31956 SSD based InnoDB buffer pool extension

12d59ed

Evict page on write completion if it's space was removed. Lock external buffer pool file on Linux.

MDEV-31956 SSD based InnoDB buffer pool extension

c395aa2

Added testing for: 1. Async read 2. Space removing during async io 3. IO error during async write The case of io error during sync and async io should also be added. Squash it.

MDEV-31956 SSD based InnoDB buffer pool extension

c01312b

RocksDB test fix.

dr-m and others added 6 commits January 28, 2026 13:45

MDEV-31956 SSD based InnoDB buffer pool extension

689b89a

Remove double-write buffer workaround.

MDEV-31956 SSD based InnoDB buffer pool extension

cbbfb6f

Get rid of ext_buf_page_t::frame.

MDEV-31956 SSD based InnoDB buffer pool extension

1fe308f

Some code review fixes.

MDEV-31956 SSD based InnoDB buffer pool extension

9fdb651

Some code review fixes.

vlad-lesin force-pushed the 11.8-MDEV-31956-ext_buf_pool branch from 62199a8 to 9fdb651 Compare January 28, 2026 10:49

		@@ -0,0 +1 @@
		--innodb-buffer-pool-size=21M --innodb-extended-buffer-pool-size=1M

		if (!space) {
		buf_pool.corrupted_evict(buf_page, buf_page_t::READ_FIX + 1, false);

Uh oh!

MDEV-31956 SSD based InnoDB buffer pool extension #4510

Are you sure you want to change the base?

MDEV-31956 SSD based InnoDB buffer pool extension #4510

Conversation

vlad-lesin commented Jan 5, 2026

Description

Release Notes

How can this PR be tested?

Basing the PR against the correct MariaDB version

PR quality check

Uh oh!

CLAassistant commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dr-m left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dr-m left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dr-m left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dr-m left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dr-m left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CLAassistant commented Jan 5, 2026 •

edited

Loading

dr-m left a comment •

edited

Loading