MOVED errors with Azure Managed Redis cluster #3256

@mderriey

Description

Short summary

We're running a Node.js API on Azure App Service that connects to an Azure Managed Redis cluster using the OSS clustering policy over a private endpoint.

The app occasionally gets into a state where every command it issues returns a MOVED <slot> <ip> error, and the only fix we've found is to restart the API.

General setup

We were using Azure Cache for Redis, and moved to Azure Managed Redis since Azure announced they're sunsetting the former.

We switched from createClient to createCluster since AMR runs a "proper" cluster:

const hostname = '<instance>.<region>.redis.azure.net'
const port = 10000
const client = createCluster({
  rootNodes: [{ url: `rediss://${hostname}:${port}` }],
  defaults: {
    password,
    pingInterval: 60_000,
    socket: {
      tls: true,
      // The Redis server can instruct the client to connect to specific nodes in the cluster, it does so using IP addresses.
      // When connecting to the raw IPs, TLS will fail hostname validation unless we set it explicitly.
      servername: hostname,
    },
  },
})

We changed which events we listen to following the advice from https://github.com/redis/node-redis/blob/master/docs/clustering.md#events:

type RedisNode = { host: string; port: number }
client.on('connect', () => redisLogger.info('The Redis cluster has successfully connected and is ready to use'))
client.on('disconnect', () => redisLogger.info('The Redis cluster has disconnected'))
client.on('error', async (error: Error) => {
  redisLogger.info('The Redis cluster has errored', { error })
  // More on this later
  await handleError(error, client)
})
client.on('node-ready', (node: RedisNode) => redisLogger.info('A Redis cluster node is ready to establish a connection', { node }))
client.on('node-connect', (node: RedisNode) => redisLogger.info('A Redis cluster node has connected', { node }))
client.on('node-reconnecting', (node: RedisNode) =>
  redisLogger.info('A Redis cluster node is attempting to reconnect after an error', { node }),
)
client.on('node-disconnect', (node: RedisNode) => redisLogger.info('A Redis cluster node has disconnected', { node }))
client.on('node-error', async (error: Error, node: RedisNode) => {
  redisLogger.warn('A Redis cluster node has errored', { error, node })
  // More on this later
  await handleError(error, client)
})

The API uses Redis for rate limiting/quota enforcement, using sorted sets over a 1-hour rolling window. The client is used in a single place and always performs the same operations:

  1. Query the sorted set associated with the user's key to figure out how much quota is left.
  2. Unrelated to Redis: compute the cost of the current request.
  3. If request doesn't go over quota: add elements to the sorted set and update key expiry.

// Query the sorted set associated with the user's key to figure out how much quota is left.
const startRange = Instant.now().minusSeconds(periodInSeconds).toEpochMilli()

// .multi() / .exec() performs the operations as a transaction
// See https://github.com/redis/node-redis/blob/master/docs/transactions.md
const [_, cost] = await redisClient
  .multi()
  // Remove members from the sorted set which are older than the period
  // https://redis.io/docs/latest/commands/zremrangebyscore/
  .ZREMRANGEBYSCORE(key, '-inf', startRange)
  // Get the number of members in the sorted set
  // https://redis.io/docs/latest/commands/zcard/
  .ZCARD(key)
  .execTyped()

return cost

// Add elements to the sorted set and update key expiry
const now = Instant.now().toEpochMilli()

// Create as many members to add to the sorted set as the cost of the operation:
// - Using hex so it takes less space: 1731086567987 => 1930cccdedf
// - Adding the index as a suffix to ensure uniqueness
const members = [...Array(cost).keys()].map((_, index) => {
  return {
    score: now,
    value: `${now.toString(16)}${index}`,
  } as {
    score: number
    value: string
  }
})

// Use pipelining instead of a transaction here: https://github.com/redis/node-redis/blob/master/packages/redis/README.md#auto-pipelining
// Pipelining sends multiple commands to Redis without waiting for each reply. Unlike a transaction, it doesn't guarantee
// that commands from other clients won't be processed in between.
// Adding the members to the sorted set and updating the expiry are not critical operations, and don't justify
// the overhead of a transaction.
await Promise.all([
  // Add member to the sorted set
  // https://redis.io/docs/latest/commands/zadd/
  redisClient.ZADD(key, members),
  // Mark the key for expiry so Redis removes it if this user doesn't make a request within the defined period
  // https://redis.io/docs/latest/commands/expire/
  redisClient.EXPIRE(key, periodInSeconds),
])
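
To make the quota logic easier to reason about without a Redis instance, here is a minimal in-memory sketch of the same sliding-window steps. It is a model for illustration only, not the code we run: `createSlidingWindowQuota`, `used`, `tryConsume` and the `limit` parameter are hypothetical names, and it stores bare scores rather than unique sorted-set members.

```typescript
// In-memory model of the sorted-set sliding window described above.
// Hypothetical helper for illustration; it does not talk to Redis.
function createSlidingWindowQuota(periodMs: number, limit: number) {
  const scoresByKey = new Map<string, number[]>()

  // Mirrors ZREMRANGEBYSCORE key -inf startRange followed by ZCARD key:
  // drop entries whose score is <= startRange, then count what is left.
  function used(key: string, nowMs: number): number {
    const startRange = nowMs - periodMs
    const kept = (scoresByKey.get(key) ?? []).filter((score) => score > startRange)
    scoresByKey.set(key, kept)
    return kept.length
  }

  // Mirrors step 3: only record the request if it fits in the remaining quota.
  // One entry is added per unit of cost, all scored with the current time.
  function tryConsume(key: string, cost: number, nowMs: number): boolean {
    if (used(key, nowMs) + cost > limit) return false
    const scores = scoresByKey.get(key) ?? []
    for (let i = 0; i < cost; i++) scores.push(nowMs)
    scoresByKey.set(key, scores)
    return true
  }

  return { used, tryConsume }
}
```

Every operation in this model touches a single key, which matches how the real client is used.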

Handle failovers

This isn't directly related to the MOVED issue, but I want to detail our setup in depth.

The first issue we faced came up when Azure performed maintenance on the cluster, which causes a failover. Essentially, Redis closes the connection and we need to recover from that. The automatic reconnection strategy didn't seem to work, so we implemented a custom disconnect/reconnect strategy.

client.on('error', async (error: Error) => {
  redisLogger.info('The Redis cluster has errored', { error })
  // More on this later
  await handleError(error, client)
})
client.on('node-error', async (error: Error, node: RedisNode) => {
  redisLogger.warn('A Redis cluster node has errored', { error, node })
  // More on this later
  await handleError(error, client)
})

let isTryingToReconnect = false

async function handleError(error: Error, client: RedisClient) {
  if (isTryingToReconnect) {
    redisLogger.info(`We're already trying to reconnect to Redis`)
    return
  }

  if (error.message.includes('Socket closed unexpectedly') || ('code' in error && error.code === 'ECONNRESET')) {
    redisLogger.info('Attempting to disconnect and reconnect to Redis')
    isTryingToReconnect = true

    let reconnectAttempt = 1
    const maxReconnectAttempt = 20

    while (reconnectAttempt <= maxReconnectAttempt) {
      try {
        client.destroy()
        await wait(5000)
        await client.connect()
        redisLogger.info('Successfully reconnected to Redis')
        isTryingToReconnect = false
        return
      } catch (e) {
        redisLogger.error('An error occurred while attempting to disconnect and reconnect to Redis', {
          originalError: error,
          error: e,
          attempt: reconnectAttempt,
        })
      }
      reconnectAttempt++
    }

    redisLogger.error('Failed to disconnect and reconnect to Redis')
    isTryingToReconnect = false
  }
}

We tested this manually by scaling the Redis cluster up and down a few times, which causes a failover, and the API recovered successfully.

Any feedback on this approach is appreciated, particularly on whether the client should be able to recover from this automatically.

MOVED errors

Since then, we've had two separate events where the Redis client got into a state where every command threw a MOVED <slot> <ip> error, and we had no choice but to restart the API process to get the client working again.

Here's a sample stack trace:

Error: MOVED 3664 10.61.0.133:8501
    at #decodeSimpleError (/home/site/wwwroot/node_modules/@redis/client/dist/lib/RESP/decoder.js:457:13)
    at #decodeTypeValue (/home/site/wwwroot/node_modules/@redis/client/dist/lib/RESP/decoder.js:104:91)
    at Decoder.write (/home/site/wwwroot/node_modules/@redis/client/dist/lib/RESP/decoder.js:74:38)
    at RedisSocket.<anonymous> (/home/site/wwwroot/node_modules/@redis/client/dist/lib/client/index.js:437:37)
    at RedisSocket.emit (node:events:519:28)
    at TLSSocket.<anonymous> (/home/site/wwwroot/node_modules/@redis/client/dist/lib/client/socket.js:235:38)
    at TLSSocket.emit (node:events:519:28)
    at addChunk (node:internal/streams/readable:561:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:512:3)
    at Readable.push (node:internal/streams/readable:392:5)

The IP is the one associated with the network interface of the private endpoint, so it is correct and reachable by the client. Also worth noting: transactions and pipelining are always used against a single key, so we ruled out cross-slot commands as the cause.
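
For anyone wanting to check which slot a given key maps to: Redis Cluster assigns each key to one of 16384 slots using CRC16 of the key, or of its hash tag if one is present, so all commands on a single key target one slot. Below is a minimal sketch of that calculation, based on the Redis Cluster specification. `crc16` and `keySlot` are illustrative names rather than node-redis exports, and the sketch assumes a global `TextEncoder` is available.

```typescript
// CRC16 variant named in the Redis Cluster specification (XMODEM):
// polynomial 0x1021, initial value 0x0000, no bit reflection.
function crc16(bytes: Uint8Array): number {
  let crc = 0
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i] << 8
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) !== 0 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff
    }
  }
  return crc
}

// Computes the cluster slot for a key, honouring hash tags:
// if the key contains "{...}" with at least one character inside,
// only that substring is hashed.
function keySlot(key: string): number {
  let hashed = key
  const start = key.indexOf('{')
  if (start !== -1) {
    const end = key.indexOf('}', start + 1)
    if (end > start + 1) hashed = key.slice(start + 1, end)
  }
  return crc16(new TextEncoder().encode(hashed)) % 16384
}
```

Comparing `keySlot(key)` with the slot in the MOVED error would show whether the redirect at least refers to the slot of the key being used.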

We tried to diagnose this with Claude, and it thinks it found some gaps. We don't have enough Redis expertise to judge its findings.

First, #getShards uses clusterSlots() which sends the CLUSTER SLOTS command — not CLUSTER SHARDS. The comment confirms this: // switch to CLUSTER SHARDS when Redis 7.0 will be the minimum supported version. So the rediscover is running CLUSTER SLOTS. That matters less for AMR (both return topology info), but worth noting.

Second — and this is probably the key point — #discover creates brand new client connections via #initiateSlotNode for each master and replica. It destroys old clients that are no longer in the address set (line 114–115). So rediscover does effectively tear down and rebuild shard connections if the addresses in the response differ from what's cached.

But here's the thing: if CLUSTER SLOTS returns the same addresses the client already has cached (because AMR's topology actually hasn't changed, and the problem is elsewhere), then addressesInUse contains the same addresses as nodeByAddress, and none of the existing clients are destroyed. Rediscover reuses them. So if the existing shard connection has some internal corruption that's causing misbehaviour, rediscover won't fix it because it's not actually rebuilding the connection.

This could be what you're hitting. Imagine this sequence:

  1. Some transient issue at 19:32 puts a specific shard connection into a weird state — perhaps a TLS session issue, a framing issue, or something at the socket layer that doesn't manifest as an error but causes the shard to respond to some commands with MOVED incorrectly, or causes node-redis to misroute commands to the wrong connection.
  2. Every MOVED triggers rediscover.
  3. CLUSTER SLOTS returns the same (correct) topology.
  4. Because addresses match, the existing (broken) shard connections are reused.
  5. Commands keep getting MOVEDed.
  6. App restart creates fresh connections; everything works.

This is speculative — I can't prove it from the code alone. But it's consistent with "restart fixes it and rediscover doesn't."
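
To make the hypothesis above concrete, here is a toy model of the reuse behaviour it describes. This is not node-redis code and we haven't verified that the library behaves this way; `ShardConnection`, `rediscover` and the `generation` counter are hypothetical names used only to illustrate the claim that an unchanged topology leaves existing connections in place.

```typescript
// Toy model: a cached connection per shard address.
// "generation" records which discovery pass created the connection.
type ShardConnection = { address: string; generation: number }

function rediscover(
  cached: Map<string, ShardConnection>,
  discoveredAddresses: string[],
  generation: number,
): { reused: string[]; created: string[]; destroyed: string[] } {
  const inUse = new Set(discoveredAddresses)
  const reused: string[] = []
  const created: string[] = []
  const destroyed: string[] = []

  // Addresses already cached are kept as-is; only new ones get a fresh connection.
  inUse.forEach((address) => {
    if (cached.has(address)) {
      reused.push(address)
    } else {
      cached.set(address, { address, generation })
      created.push(address)
    }
  })

  // Cached connections whose address is no longer reported are dropped.
  Array.from(cached.keys()).forEach((address) => {
    if (!inUse.has(address)) {
      cached.delete(address)
      destroyed.push(address)
    }
  })

  return { reused, created, destroyed }
}
```

In this model, a discovery pass that returns the same addresses reports everything as reused and nothing as created or destroyed, which is the situation where a bad connection would survive.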

Thoughts

We thought about moving to the Enterprise clustering policy, which abstracts the cluster away from the client (i.e. going back, client-wise, to something similar to the setup we had with Azure Cache for Redis). This should be possible given that our multi-command operations always target a single key, which resolves to a single slot.

However, we'd first like to understand if there's something wrong in our setup.

Temporary workaround

For now we've implemented a workaround that catches MOVED errors when we're issuing commands, and triggers the same disconnect/reconnect as when a failover happens. This hasn't been running long enough for us to tell whether it's effective.
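
A simplified sketch of the shape of that workaround is below. The names `parseMoved` and `withMovedRecovery` are illustrative, the `reconnect` callback stands in for the disconnect/reconnect logic from the failover section, and retrying the command once after reconnecting is an assumption made for this sketch rather than a description of our exact code.

```typescript
type MovedRedirect = { slot: number; host: string; port: number }

// Parses an error message of the form "MOVED <slot> <host>:<port>".
// Returns undefined for anything that is not a MOVED error.
function parseMoved(error: unknown): MovedRedirect | undefined {
  if (!(error instanceof Error)) return undefined
  const match = /^MOVED (\d+) (.+):(\d+)$/.exec(error.message)
  if (!match) return undefined
  return { slot: Number(match[1]), host: String(match[2]), port: Number(match[3]) }
}

// Runs a command; on a MOVED error, triggers the reconnect callback and
// retries the command once (the single retry is an assumption of this sketch).
// Any other error, or a failure on the retry, is rethrown to the caller.
async function withMovedRecovery<T>(
  command: () => Promise<T>,
  reconnect: () => Promise<void>,
): Promise<T> {
  try {
    return await command()
  } catch (error) {
    if (parseMoved(error) === undefined) throw error
    await reconnect()
    return await command()
  }
}
```

Usage would look like `await withMovedRecovery(() => redisClient.ZCARD(key), reconnectToRedis)`, where `reconnectToRedis` is a hypothetical wrapper around the disconnect/reconnect loop.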

Conclusion

I hope there's enough, but not too much information. If you have any questions, please let us know.

Node.js Version

22.22.0

Redis Server Version

7.4.3

Node Redis Version

5.10.0

Platform

Linux (Azure App Service)

Logs
