
SQL Optimizations


Note

It might be a good idea to check out the schema page first, to get an idea of what each table actually stores.

This page only covers writes, since most reads are served from Redis anyway. We occasionally read from Postgres, though this mostly happens on program startup.

Query Types

When we flush our buffers to disk, up to 7 different queries can execute.

  1. Insert new Orders
  2. Update old Orders
  3. Insert new Pending Orders
  4. Delete old Pending Orders
  5. Update the total number of orders on the exchange
  6. Update any modified Markets
  7. Insert new Trades

Optimizations

Queries 1-4 and 7 have required some form of optimization, while queries 5 and 6 have not. This is because query 5 only modifies a single row, and query 6 modifies at most one row per market on the exchange. In short, these two queries touch so few rows that they have not needed any optimization.

As for the other 5 queries, each can insert or modify a massive number of rows depending on the number of elements we store in the program's buffers.

Query 1 - Inserts

Suppose we had to insert 100k new Orders. One approach would be to insert them one-by-one, issuing a separate query each time: INSERT INTO Orders (...) VALUES (...);. In this case, every execution of the statement updates the index on the Orders table, forces the query planner to parse and develop a new plan, and requires a request-response round trip between the client (our application) and the server (Postgres). Doing this for each of the 100k orders would make our writes extremely slow, and we would still have to execute queries 2-4 and 7.

Instead, we use a combination of Prepared Statements and Transactions for batch query execution. For example, we can prepare the following query: INSERT INTO Orders (...) VALUES (...);, which means Postgres only needs to run the query planner once, no matter how many of the 100k orders we insert. The next part, executing everything in a single transaction, ensures that we only update the Orders table's index once, after all the rows have been inserted. The one issue this leaves unaddressed is the large number of messages sent between the client and server, which gets addressed in queries 3, 4, and 7.
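
To make this concrete, here is a minimal sketch of the pattern from the application side. It assumes the synchronous postgres crate and placeholder column names; the real schema is on the schema page, and the actual code may use a different driver.

```rust
use postgres::{Client, Error};

// Placeholder order type; the field names here are illustrative, not the real schema.
struct NewOrder {
    order_id: i64,
    price: i64,
    quantity: i64,
}

fn insert_orders(client: &mut Client, orders: &[NewOrder]) -> Result<(), Error> {
    // One transaction wraps the entire batch, so everything commits together.
    let mut tx = client.transaction()?;

    // Prepared once: Postgres parses and plans this statement a single time.
    let stmt = tx.prepare(
        "INSERT INTO Orders (order_id, price, quantity) VALUES ($1, $2, $3)",
    )?;

    // Reuse the same plan for every order in the buffer.
    for order in orders {
        tx.execute(&stmt, &[&order.order_id, &order.price, &order.quantity])?;
    }

    tx.commit()
}
```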

Query 2 - Updates

Similar to query 1, we use prepared statements and a batch transaction. However, this time we prepare three separate statements, each reflecting the type of update that may have occurred. For example, an order might have been partially filled, making its filled count change. Maybe the user decided to simply cancel the order, thereby modifying its status, or perhaps the order was completely filled, requiring modification to both filled and status. We represent these states with the following queries:

1. UPDATE Orders SET filled=$1, time_updated=$2 WHERE order_id=$3;
2. UPDATE Orders SET status=$1, time_updated=$2 WHERE order_id=$3;
3. UPDATE Orders SET filled=$1, status=$2, time_updated=$3 WHERE order_id=$4;

Thus, we can prepare each statement once, reuse it for all the updates, and then execute the required queries (with the placeholders bound to real values) in a single batch transaction.
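
As a sketch (again assuming the synchronous postgres crate), the buffered updates might be modeled as an enum so that each one picks the matching prepared statement. The enum, field names, and timestamp type below are hypothetical, not the actual implementation.

```rust
use postgres::{Client, Error};

// Hypothetical representation of the three kinds of order updates in the buffer.
enum OrderUpdate {
    Filled { order_id: i64, filled: i64, time_updated: i64 },
    StatusChanged { order_id: i64, status: String, time_updated: i64 },
    FilledAndStatus { order_id: i64, filled: i64, status: String, time_updated: i64 },
}

fn apply_updates(client: &mut Client, updates: &[OrderUpdate]) -> Result<(), Error> {
    let mut tx = client.transaction()?;

    // Prepare each statement once, then reuse it for every matching update.
    let upd_filled = tx.prepare(
        "UPDATE Orders SET filled=$1, time_updated=$2 WHERE order_id=$3",
    )?;
    let upd_status = tx.prepare(
        "UPDATE Orders SET status=$1, time_updated=$2 WHERE order_id=$3",
    )?;
    let upd_both = tx.prepare(
        "UPDATE Orders SET filled=$1, status=$2, time_updated=$3 WHERE order_id=$4",
    )?;

    for update in updates {
        match update {
            OrderUpdate::Filled { order_id, filled, time_updated } => {
                tx.execute(&upd_filled, &[filled, time_updated, order_id])?;
            }
            OrderUpdate::StatusChanged { order_id, status, time_updated } => {
                tx.execute(&upd_status, &[status, time_updated, order_id])?;
            }
            OrderUpdate::FilledAndStatus { order_id, filled, status, time_updated } => {
                tx.execute(&upd_both, &[filled, status, time_updated, order_id])?;
            }
        }
    }

    tx.commit()
}
```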

We actually go even further with this specific update query by lowering the Orders table's fillfactor from the default of 100 to 70. Leaving that free space in each page allows Postgres to perform HOT (Heap-Only Tuple) updates. There is some nuance here, but it boils down to reusing existing index entries rather than creating new ones, which increases update performance significantly.
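
Concretely, that is a one-line DDL change on the table, something along the lines of ALTER TABLE Orders SET (fillfactor = 70); (table name assumed). Note that the new fillfactor only applies to pages written after the change, so existing data has to be rewritten (for example via VACUUM FULL) before it benefits.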

Queries 3, 4, and 7 - Decreasing Communication

While these three queries do very different things, we structure them similarly in our application. Going back to our earlier example of 100k inserts, recall that we never solved the problem of the sheer amount of communication between the client and server. For these remaining queries we do exactly that, without sacrificing the performance gains of prepared statements or batch transactions.

We will use the "insert pending orders" query as an example, since it is similar to inserting normal orders, just with far less data. Rather than preparing a statement, we use a multi-row insert query, i.e. INSERT INTO Table (...) VALUES (val1), (val2), ..., (valN);. While we don't explicitly use prepared statements here, the query plan is only built for the first value and then reused for the remaining n-1 values, so we still avoid unnecessary computation. Additionally, we can execute this statement within a transaction, ensuring the indexes aren't updated until all values have been inserted. Finally, since all the values go into a single statement, we only need to send one message to the server, avoiding the extra communication of reading a response for each value.
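
A rough sketch of building and sending one of these multi-row inserts, with an assumed PendingOrders table and placeholder columns (the real query and driver may differ):

```rust
use postgres::types::ToSql;
use postgres::{Client, Error};

// Placeholder pending-order record; the real table has different columns.
struct PendingOrder {
    order_id: i64,
    user_id: i64,
}

// Builds a single statement of the form
//   INSERT INTO PendingOrders (order_id, user_id) VALUES ($1, $2), ($3, $4), ...
// so the whole batch costs one round trip between client and server.
fn insert_pending_orders(client: &mut Client, pending: &[PendingOrder]) -> Result<(), Error> {
    if pending.is_empty() {
        return Ok(());
    }

    let mut sql = String::from("INSERT INTO PendingOrders (order_id, user_id) VALUES ");
    let mut params: Vec<&(dyn ToSql + Sync)> = Vec::with_capacity(pending.len() * 2);

    for (i, p) in pending.iter().enumerate() {
        if i > 0 {
            sql.push_str(", ");
        }
        // Placeholders are numbered $1, $2, ... across the whole statement.
        sql.push_str(&format!("(${}, ${})", i * 2 + 1, i * 2 + 2));
        params.push(&p.order_id);
        params.push(&p.user_id);
    }

    client.execute(sql.as_str(), &params)?;
    Ok(())
}
```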

However, this doesn't fully describe our implementation. We don't use a single giant statement; instead, we create several of these multi-row insert queries, each containing x values, where x is a value we can set at compile time. This may seem odd, but Postgres has a fixed cache size for these batch queries, so it's better to fill that cache and execute several queries in a single transaction than to execute one enormous query that is ignorant of the cache.
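
The same chunking idea carries over to the other queries in this group. As an illustration, here is how the pending-order delete (query 4) might be split into several moderately sized statements inside a single transaction; the constant, table name, and column below are assumptions, not the actual implementation.

```rust
use postgres::types::ToSql;
use postgres::{Client, Error};

// Hypothetical compile-time knob: how many ids go into each DELETE statement.
const IDS_PER_STATEMENT: usize = 1_000;

fn delete_pending_orders(client: &mut Client, order_ids: &[i64]) -> Result<(), Error> {
    // One transaction wraps all the chunks, so the whole batch still commits atomically.
    let mut tx = client.transaction()?;

    for chunk in order_ids.chunks(IDS_PER_STATEMENT) {
        // Build "DELETE FROM PendingOrders WHERE order_id IN ($1, $2, ...)" for this chunk.
        let placeholders: Vec<String> =
            (1..=chunk.len()).map(|i| format!("${}", i)).collect();
        let sql = format!(
            "DELETE FROM PendingOrders WHERE order_id IN ({})",
            placeholders.join(", ")
        );

        let params: Vec<&(dyn ToSql + Sync)> =
            chunk.iter().map(|id| id as &(dyn ToSql + Sync)).collect();

        tx.execute(sql.as_str(), &params)?;
    }

    tx.commit()
}
```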

Disclaimer: I don't have the actual number on hand right now, although I recall reading that it may be in the range of a few MB. I ran tests and found that as the number of values per statement increased, so did the speed of the query.

We use this approach for inserting and deleting pending orders, as well as inserting new trades. I was hopeful that we could use it for inserting new orders, but it seems too difficult because the update_time field can be either Some(val) or None.
