createCollection fails with EXISTING_COLLECTION_DIFFERENT_SETTINGS after lexical/rerank feature is enabled on a kube with pre-existing disabled-lexical collections #2466

@Hazel-Datastax

What happened ?

A createCollection command that had previously worked for a user started failing after the lexical and rerank features were enabled in their kube by the change https://github.com/riptano/serverless-argocd/pull/10615

The command started failing and returning errors such as:

{
	"id": "5383de4c-925e-4673-804a-0d8b4c0671ed",
	"family": "REQUEST",
	"scope": "SCHEMA",
	"errorCode": "EXISTING_COLLECTION_DIFFERENT_SETTINGS",
	"title": "Collection already exists with different settings",
	"message": "Collection 'chat_messages' already exists but with settings different from ones passed with 'createCollection' command.\n\nIf you need to change collection settings you will need to 'deleteCollection', then re-create with new settings."
}

This was not the user's fault; there was nothing they could do to fix it.

The fix was to revert the change introduced by the PR above.

Why did the user start getting EXISTING_COLLECTION_DIFFERENT_SETTINGS error ?

When creating a collection, code in the CreateCollectionOperation class failed to reconcile the configuration of an existing collection with the proposed new collection sent by the user.

There is code written specifically to do this reconciliation, but it failed.

It detects whether the two specifications differ, and works out whether the only difference is that lexical search or reranking are now enabled.

If the collection existed, why did the user try to recreate it ?

The Data API tries to follow the behaviour of MongoDB, and so allows a createCollection command to succeed silently, as a no-op, if the user tries to create a collection that is the same as an existing one. If the user tries to create a collection with a different config, they get the EXISTING_COLLECTION_DIFFERENT_SETTINGS error the customer saw.

We know that many customers use this pattern, including the one who had this issue: they create the collection many times, perhaps as instances of their code come and go.
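The create-if-same behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the Data API implementation: the `Settings` record, the `Result` enum (apart from the error code name taken from the response above) and the in-memory map are all hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentCreateSketch {
  // Hypothetical, heavily simplified settings type. Records get a value-based equals().
  public record Settings(boolean lexicalEnabled, boolean rerankEnabled) {}

  // Hypothetical outcomes; only the last name mirrors the real error code.
  public enum Result {
    CREATED,
    ALREADY_EXISTS_SAME_SETTINGS,
    EXISTING_COLLECTION_DIFFERENT_SETTINGS
  }

  // Stand-in for the schema stored in the database.
  private final Map<String, Settings> collections = new HashMap<>();

  public Result createCollection(String name, Settings requested) {
    Settings existing = collections.putIfAbsent(name, requested);
    if (existing == null) {
      return Result.CREATED;
    }
    // Repeating the same request is a silent no-op; a different one is an error.
    return existing.equals(requested)
        ? Result.ALREADY_EXISTS_SAME_SETTINGS
        : Result.EXISTING_COLLECTION_DIFFERENT_SETTINGS;
  }

  public static void main(String[] args) {
    IdempotentCreateSketch api = new IdempotentCreateSketch();
    System.out.println(api.createCollection("chat_messages", new Settings(false, false)));
    System.out.println(api.createCollection("chat_messages", new Settings(false, false)));
    System.out.println(api.createCollection("chat_messages", new Settings(true, true)));
  }
}
```

In this sketch the third call models what the customer hit: the request itself did not change, but the defaults applied to it did, so the requested settings no longer matched the stored ones.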

Why did the config reconciliation fail ?

The reconciliation failed because the code used reference equality rather than value equality.

Below is the relevant code from CreateCollectionOperation:

  boolean settingsAreEqual = existingCollectionSettings.equals(newCollectionSettings);

  if (!settingsAreEqual) {
    final var oldLexical = existingCollectionSettings.lexicalConfig();
    final var newLexical = lexicalConfig();
    final var oldReranking = existingCollectionSettings.rerankingConfig();
    final var newReranking = rerankDef();

    // So: for backwards compatibility reasons we may need to override settings if
    // (and only if) the collection was created before lexical and reranking.
    // In addition, we need to check that new lexical settings are for defaults
    // (difficult to check the same for reranking; for now assume that if lexical
    // is default, reranking is also default).
    if (oldLexical == CollectionLexicalConfig.configForPreLexical()
        && newLexical == CollectionLexicalConfig.configForDefault()
        && oldReranking == CollectionRerankDef.configForPreRerankingCollection()
        && newReranking == CollectionRerankDef.configForDefault()) {
      var originalNewSettings = newCollectionSettings;
      newCollectionSettings =
          newCollectionSettings.withLexicalAndRerankOverrides(
              oldLexical, existingCollectionSettings.rerankingConfig());
      // and now re-check if settings are the same
      settingsAreEqual = existingCollectionSettings.equals(newCollectionSettings);
      logger.info(
          "CreateCollectionOperation for {}.{} with existing legacy lexical/reranking settings, new settings differ. Tried to unify, result: {}"
              + " Old settings: {}, New settings: {}",
          commandContext.schemaObject().identifier().keyspace(),
          name,
          settingsAreEqual,
          existingCollectionSettings,
          originalNewSettings);
    } else {
      logger.info(
          "CreateCollectionOperation for {}.{} with different settings (but not old legacy lexical/reranking settings), cannot unify."
              + " Old settings: {}, New settings: {}",
          commandContext.schemaObject().identifier().keyspace(),
          name,
          existingCollectionSettings,
          newCollectionSettings);
    }
  }

NOTE: the use of == in the if statement after detecting the settings were different. This was intended to identify whether the old lexical and rerank settings were what they would have been before these features were added, and whether the new settings are the same as the current defaults, meaning the new settings came in from defaults.

If this check determines that the "new" settings are just the defaults, the code replaces the lexical and reranking config in the new settings with the "old" ones and re-compares.

It should have used value equality to compare these config items, rather than object reference equality.
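To illustrate the difference, here is a minimal sketch using a hypothetical stand-in for CollectionLexicalConfig. It assumes the config is a Java record (the log output format suggests this, but we have not confirmed it here) and that the factory method returns a shared singleton. The `LexicalConfig` type and its fields are simplified for illustration.

```java
public class RecordEqualitySketch {
  // Hypothetical stand-in for CollectionLexicalConfig; the real class has more to it.
  public record LexicalConfig(boolean enabled, String analyzerDefinition) {
    private static final LexicalConfig PRE_LEXICAL = new LexicalConfig(false, null);

    // Assumed to return a shared singleton, as the factory in the real code appears to.
    public static LexicalConfig configForPreLexical() {
      return PRE_LEXICAL;
    }
  }

  public static void main(String[] args) {
    // Same values as the singleton, but a distinct object, e.g. built while
    // parsing a stored collection definition.
    LexicalConfig parsed = new LexicalConfig(false, null);

    // Reference equality: false, because these are two different objects.
    System.out.println(parsed == LexicalConfig.configForPreLexical());

    // Value equality: true, because records compare component by component.
    System.out.println(parsed.equals(LexicalConfig.configForPreLexical()));
  }
}
```

Under these assumptions, switching the four comparisons in the if statement from == to equals() would let the check recognise configs that carry the legacy or default values without being the singleton instances.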

We confirmed this check returned false by finding the second log message in Splunk for this tenant:

(some reformatting)

CreateCollectionOperation for default_keyspace.project_documents with different settings (but not old legacy lexical/reranking settings), cannot unify. 

Old settings:
 CollectionSchemaObject[identifier=COLLECTION:default_keyspace.project_documents, idConfig=IdConfig[idType=], vectorConfig=VectorConfig[vectorEnabled=true, columnVectorDefinitions={$vectorize=VectorColumnDefinition[fieldName=$vectorize, vectorSize=3072, similarityFunction=COSINE, sourceModel=OTHER, vectorizeDefinition=null]}], indexingConfig=CollectionIndexingConfig[allowed=[documentId, projectId, userId], denied=[], indexedProject=Suppliers.memoize(io.stargate.sgv2.jsonapi.service.schema.collections.CollectionIndexingConfig$$Lambda/0x00000003018b8468@4399c3b8)], lexicalConfig=CollectionLexicalConfig[enabled=false, analyzerDefinition=null], rerankDef=CollectionRerankDef[enabled=false, rerankServiceDesc=null]], 

New settings: 

CollectionSchemaObject[identifier=COLLECTION:default_keyspace.project_documents, idConfig=IdConfig[idType=], vectorConfig=VectorConfig[vectorEnabled=true, columnVectorDefinitions={$vectorize=VectorColumnDefinition[fieldName=$vectorize, vectorSize=3072, similarityFunction=COSINE, sourceModel=OTHER, vectorizeDefinition=null]}], indexingConfig=CollectionIndexingConfig[allowed=[documentId, projectId, userId], denied=[], indexedProject=Suppliers.memoize(io.stargate.sgv2.jsonapi.service.schema.collections.CollectionIndexingConfig$$Lambda/0x00000003018b8468@169de8c0)], lexicalConfig=CollectionLexicalConfig[enabled=true, analyzerDefinition="standard"], rerankDef=CollectionRerankDef[enabled=true, rerankServiceDesc=RerankServiceDef[provider=nvidia, modelName=nvidia/llama-3.2-nv-rerankqa-1b-v2, authentication=null, parameters=null]]]

Why did testing not detect this situation ?

We have an integration test for this code, CreateCollectionBackwardCompatibilityIntegrationTest, that creates an existing collection manually using CQL, then runs a command to create a new collection using the API and checks that the user does not get an error.

We used CQL because this test was written when the lexical and reranking code was already in the code base, so we needed a way to create collections as they were before that code existed.

It created a collection via CQL as below:

  String collectionWithoutLexicalRerank =
      """
                CREATE TABLE IF NOT EXISTS "%s"."%s" (
                    key frozen<tuple<tinyint, text>> PRIMARY KEY,
                    array_contains set<text>,
                    array_size map<text, int>,
                    doc_json text,
                    exist_keys set<text>,
                    query_bool_values map<text, tinyint>,
                    query_dbl_values map<text, decimal>,
                    query_null_values set<text>,
                    query_text_values map<text, text>,
                    query_timestamp_values map<text, timestamp>,
                    query_vector_value vector<float, 123>,
                    tx_id timeuuid
                ) WITH comment = '{"collection":{"name":"%s","schema_version":1,"options":{"defaultId":{"type":""}}}}';
                """;

The schema of the collection is stored in the table comment.

It then checked that a collection with the same name could be created, and no error returned:

// create the same collection using API - should not get
// COLLECTION_EXISTS_WITH_DIFFERENT_SETTINGS error
givenHeadersPostJsonThenOkNoErrors(
          """
              {
                  "createCollection": {
                      "name": "%s"
                  }
              }
      """
          .formatted(PRE_LEXICAL_RERANK_COLLECTION_NAME))
  .body("$", responseIsStatusOnly())
  .body("status.ok", is(1));

This test did not surface the use of reference equality, for two reasons.

First, the createCollection command that was sent did not define an options member, so the code used the singleton default objects. When we updated the test to provide some options (like the user's request), the test did not fail as we expected it to. So while we recognize this as a fault in the code, it is probably not the only one.

Second, the "old" table created by the test did not match what the customer has. Looking at the customer's schema, we can see their table is as below:

CREATE TABLE IF NOT EXISTS "34613964383738372d353061632d343961352d613661322d393435623534623934363165_default_keyspace".project_documents (
    key tuple<tinyint, text> PRIMARY KEY,
    doc_json text,
    query_vector_value vector<float, 3072>,
    tx_id timeuuid,
    array_contains set<text>,
    array_size map<text, int>,
    exist_keys set<text>,
    query_bool_values map<text, tinyint>,
    query_dbl_values map<text, decimal>,
    query_null_values set<text>,
    query_text_values map<text, text>,
    query_timestamp_values map<text, timestamp>
) WITH ID = cafcdb41-93c3-11f0-b5d2-a9bcf953d305
    AND additional_write_policy = '99p'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = '{"collection":{"name":"project_documents","schema_version":1,"options":{"indexing":{"allow":["projectId","userId","documentId"]},"vector":{"dimension":3072,"metric":"cosine","sourceModel":"OTHER"},"defaultId":{"type":""},"lexical":{"enabled":false},"rerank":{"enabled":false}}}}'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.UnifiedCompactionStrategy'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = {}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p';

The collection definition in the table comment is:

{
	"collection": {
		"name": "project_documents",
		"schema_version": 1,
		"options": {
			"indexing": {
				"allow": [
					"projectId",
					"userId",
					"documentId"
				]
			},
			"vector": {
				"dimension": 3072,
				"metric": "cosine",
				"sourceModel": "OTHER"
			},
			"defaultId": {
				"type": ""
			},
			"lexical": {
				"enabled": false
			},
			"rerank": {
				"enabled": false
			}
		}
	}
}

NOTE: The customer's table has values for the lexical and rerank config settings, while the table created for the test above did not. While we did not trace the code, we believe the missing values for these settings caused the code to use the singleton defaults, similar to the fault when options are not present in the request.

After updating the test to send options, and it still not failing, we updated the test to use the same "on disk" collection config as the customer and were able to get the test to fail.

With this in mind we consider the presence in the "on disk" config of the lexical and rerank fields to be the cause of failure for this user.
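The sketch below shows how this could play out, based on our belief (not a trace of the real code) that a missing field falls back to a singleton while a present field is parsed into a new object. The `LexicalConfig` record, the `fromOptions` parser and the `Map` representation of the options are all hypothetical.

```java
import java.util.Map;

public class SingletonMaskingSketch {
  public record LexicalConfig(boolean enabled) {
    private static final LexicalConfig PRE_LEXICAL = new LexicalConfig(false);

    public static LexicalConfig configForPreLexical() {
      return PRE_LEXICAL;
    }

    // Hypothetical parser: 'options' stands in for the "options" object
    // stored in the table comment.
    public static LexicalConfig fromOptions(Map<String, Object> options) {
      Object lexical = options.get("lexical");
      if (lexical == null) {
        // Field absent (the table created by the test): reuse the singleton.
        return configForPreLexical();
      }
      // Field present (the customer's table): build a new instance.
      Map<?, ?> fields = (Map<?, ?>) lexical;
      return new LexicalConfig(Boolean.TRUE.equals(fields.get("enabled")));
    }
  }

  // The style of check used in the faulty code.
  public static boolean isLegacyByReference(LexicalConfig config) {
    return config == LexicalConfig.configForPreLexical();
  }

  // The style of check that compares values instead.
  public static boolean isLegacyByValue(LexicalConfig config) {
    return LexicalConfig.configForPreLexical().equals(config);
  }

  public static void main(String[] args) {
    Map<String, Object> testTable = Map.of("defaultId", Map.of("type", ""));
    Map<String, Object> customerTable = Map.of("lexical", Map.of("enabled", false));

    LexicalConfig fromTest = LexicalConfig.fromOptions(testTable);
    LexicalConfig fromCustomer = LexicalConfig.fromOptions(customerTable);

    System.out.println(isLegacyByReference(fromTest));     // true: the singleton itself
    System.out.println(isLegacyByReference(fromCustomer)); // false: same values, new object
    System.out.println(isLegacyByValue(fromCustomer));     // true
  }
}
```

If the real parsing behaves like this, a test table without the lexical and rerank fields can never exercise the failing path, which matches what we saw when we changed the test to use the customer's config.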

Why did the test not cover the case the user had ?

The lexical and reranking code were added to the API together a number of months ago. At the time we considered backwards compatibility, as is evidenced by the testing and code above. However, we did not consider all cases, and the length of time between the original release and the current expansion of availability exposed that.

When the code was created we considered the backwards-compatibility case where the lexical and rerank code is deployed with the features enabled, to existing kubes with existing collections created before the code was released.

Because the BM25 indexes and access to the reranking models were limited, the code was deployed to some kubes with the features enabled and to others with the features disabled.

We did not consider the backwards-compatibility case where the lexical and rerank code is deployed but disabled and then later enabled, and so must handle collections that were created after the initial code was released but before the feature was enabled.
