
Implement equals for ArkoudaStringArray and ArkoudaCategoricalArray (pandas-aligned) #5432

@ajpotts

Summary

Implement the equals method for:

  • ArkoudaStringArray
  • ArkoudaCategoricalArray

to match pandas ExtensionArray semantics.

pandas relies on .equals() for correctness checks, testing utilities,
alignment logic, and some internal fast-path decisions. Missing or
incorrect implementations can cause false negatives/positives in
comparisons and may trigger slow fallbacks (e.g., converting to
object/NumPy).


Background / Why

In pandas, ExtensionArray.equals(other) answers:

Are these two arrays the same length and do they contain equal
elements in the same positions, treating missing values as equal to
missing values?

Key points:

  • This is not elementwise comparison (==); it returns a single
    boolean.
  • Missing values compare equal only when both are missing in the
    same positions.
  • For categoricals, equality also depends on dtype metadata
    (categories/order).

This method is used in:

  • pandas tests/assertions (tm.assert_* helpers)
  • Series.equals, Index.equals
  • some optimization checks (e.g., short-circuiting operations)


Expected pandas Semantics

Strings

Two arrays are equal if:

  • Same length
  • For each position:
    • both missing → equal
    • both non-missing and strings equal → equal
    • otherwise not equal

Example:

  • ["a", None, "b"] equals ["a", None, "b"] → True
  • ["a", None, "b"] equals ["a", "x", "b"] → False
  • ["a", None] equals ["a"] → False
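
The string rule above can be written down as a small reference model in plain Python. This is only a sketch for pinning down the expected results: `strings_equal` is a hypothetical helper name, `None` stands in for a missing value, and nothing here is Arkouda or pandas code.

```python
from typing import Optional, Sequence


def strings_equal(
    left: Sequence[Optional[str]], right: Sequence[Optional[str]]
) -> bool:
    """Reference model of the rule above; None marks a missing value."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if a is None or b is None:
            # Missing matches only missing at the same position.
            if a is not b:
                return False
        elif a != b:
            return False
    return True


print(strings_equal(["a", None, "b"], ["a", None, "b"]))  # True
print(strings_equal(["a", None, "b"], ["a", "x", "b"]))   # False
print(strings_equal(["a", None], ["a"]))                  # False
```

A model like this could serve as a second oracle in unit tests, next to the pandas baseline.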

Categoricals

Two categoricals are equal if:

  • Same length
  • Same dtype metadata (pandas behavior requires):
    • same categories (same values; same order when ordered=True)
    • same ordered flag
  • And the codes (including missing) match positionally once both
    sides are expressed against the same categories

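Examples:

  • Categorical(["a", None], categories=["a","b"]) equals same dtype
    and same values → True
  • Same values but different categories order → False when
    ordered=True (pandas treats the dtype mismatch as not equal). For
    unordered categoricals, pandas' CategoricalDtype equality ignores
    category order, so this case is expected to be True; confirm
    against Categorical.equals.
  • Same categories but different ordered flag → False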

Note: If pandas allows equality when categories are the same set but
different order, we should match exactly what pandas does for
Categorical.equals.
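
To make the category-order question concrete, here is a plain-Python model of one plausible reading of the pandas rule. It assumes that unordered categoricals compare their categories as a set (as documented for CategoricalDtype equality) and that ordered categoricals require the same category sequence. `CatModel` and `categoricals_equal` are hypothetical names, codes use -1 for missing, and the behavior should still be checked against `Categorical.equals` before it is relied on.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CatModel:
    """Hypothetical stand-in: integer codes (-1 = missing) plus dtype metadata."""
    codes: List[int]
    categories: Tuple[str, ...]  # assumed unique, as in pandas
    ordered: bool = False


def categoricals_equal(left: CatModel, right: CatModel) -> bool:
    if left.ordered != right.ordered:
        return False
    if left.ordered:
        # Ordered: categories must match as a sequence, then codes must match.
        return left.categories == right.categories and left.codes == right.codes
    # Unordered (assumption): categories are compared as a set ...
    if set(left.categories) != set(right.categories):
        return False
    # ... and right's codes are re-expressed in left's category order.
    position = {cat: i for i, cat in enumerate(left.categories)}
    recoded = [-1 if c == -1 else position[right.categories[c]] for c in right.codes]
    return left.codes == recoded


base = CatModel([0, -1], ("a", "b"))          # values: ["a", missing]
permuted = CatModel([1, -1], ("b", "a"))      # same values, categories reversed
print(categoricals_equal(base, CatModel([0, -1], ("a", "b"))))  # True
print(categoricals_equal(base, permuted))                       # True (unordered)
print(categoricals_equal(base, CatModel([0, -1], ("a", "b"), ordered=True)))  # False
```

Under this model, the same permuted pair with ordered=True on both sides compares as not equal, because the category sequences differ.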


Scope

In Scope

  • Implement:
    • ArkoudaStringArray.equals(other) -> bool
    • ArkoudaCategoricalArray.equals(other) -> bool
  • Accept other as:
    • same Arkouda array type
    • pandas equivalent array type where reasonable (e.g., pandas
      StringArray/Categorical)
    • array-like (optional; if not supported, return False)
  • Ensure missing-value semantics match pandas
  • Avoid full materialization for large arrays (no .to_numpy() of
    full data)
  • Add unit tests comparing to pandas baselines
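
As a rough illustration of the shape these requirements suggest, the sketch below gates on the type of `other` (returning False rather than raising), checks length first, and then compares fixed-size slices with an early exit so that only one chunk per side is held at a time. `StringArraySketch` and `chunked_equal` are hypothetical, plain-Python stand-ins; a real Arkouda implementation would more likely push the comparison and reduction to the server instead of pulling slices to the client.

```python
from typing import Sequence


def chunked_equal(left: Sequence, right: Sequence, chunk_size: int = 1_000_000) -> bool:
    """Compare two sequences slice by slice, stopping at the first mismatch."""
    if len(left) != len(right):
        return False
    for start in range(0, len(left), chunk_size):
        stop = start + chunk_size
        # Only one chunk per side is held at a time.
        if left[start:stop] != right[start:stop]:
            return False
    return True


class StringArraySketch:
    """Hypothetical stand-in for an extension array; None marks missing."""

    def __init__(self, data):
        self._data = list(data)

    def equals(self, other: object) -> bool:
        # Unsupported types are "not equal" rather than an error.
        if not isinstance(other, StringArraySketch):
            return False
        return chunked_equal(self._data, other._data, chunk_size=2)


a = StringArraySketch(["a", None, "b"])
print(a.equals(StringArraySketch(["a", None, "b"])))  # True
print(a.equals(["a", None, "b"]))                     # False (plain list)
```

Whether plain array-likes such as the list in the last line should be accepted is the optional item above; this sketch takes the "return False" branch.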

Out of Scope

  • Elementwise comparisons (==), handled elsewhere
  • Cross-dtype "coercive" equality (should generally return False)
