BUG: DataFrame.rank does not preserve ExtensionArray dtypes #63987

Open
weeknd415 wants to merge 2 commits into pandas-dev:main from weeknd415:fix-dataframe-rank-ea-dtype-gh52829

Summary

DataFrame.rank() converts PyArrow-backed and nullable ExtensionArray columns to float64, while Series.rank() correctly preserves the EA dtype. This is because the internal ranker() function calls data.values for 2D data (DataFrames), which goes through BlockManager.as_array() and strips all ExtensionArray type information.

Reproducer

import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, 3], dtype=pd.ArrowDtype(pa.int32()))
df = s.to_frame(name="a")

print(s.rank(method="min").dtype)       # uint64[pyarrow] ✓
print(df.rank(method="min").dtypes)     # float64 ✗ (should be uint64[pyarrow])

Fix

Replace the ranker() closure with block-level processing via self._mgr.apply(), following the same pattern used by _accumulate(). This processes each block independently:

  • ExtensionArray blocks → dispatch to EA._rank(), preserving dtype
  • NumPy blocks → dispatch to algos.rank(), same as before
  • axis=1 (cross-column ranking) → falls back to NumPy conversion since ranking across columns requires a single array
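
The per-block dispatch above can be approximated from the public API by ranking one column at a time through Series.rank(), which the summary notes already preserves the EA dtype. This is only a sketch of the idea, not the PR's implementation: `rank_columnwise` is a hypothetical helper name, it assumes unique column names, and the exact result dtype depends on the installed pandas version.

```python
import pandas as pd


def rank_columnwise(df: pd.DataFrame, **rank_kwargs) -> pd.DataFrame:
    """Hypothetical helper: rank each column on its own via Series.rank.

    Each column keeps whatever dtype Series.rank returns for it, so no
    column is forced through a shared 2D NumPy array. Assumes unique
    column names, because the results are collected in a dict.
    """
    ranked = {name: column.rank(**rank_kwargs) for name, column in df.items()}
    return pd.DataFrame(ranked, index=df.index)


df = pd.DataFrame(
    {
        "a": pd.array([3, 1, 2], dtype="Int64"),  # nullable ExtensionArray column
        "b": [0.5, 0.25, 0.75],                   # plain NumPy float64 column
    }
)

ranked = rank_columnwise(df, method="min")
print(ranked.dtypes)
```

Because every column is ranked independently, this only covers axis=0; ranking across columns (axis=1) still needs all values in a single array, as noted in the last bullet.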

Tests Added

  • test_rank_ea_dtype_preservation — PyArrow int32/float64 columns across all 5 rank methods (average, min, max, first, dense)
  • test_rank_ea_dtype_preservation_nullable — Nullable Int64/Float64 columns with NA values

…ev#52829)

DataFrame.rank() converted PyArrow-backed and nullable EA columns to
float64 because the ranker() function called data.values for 2D data,
which goes through BlockManager.as_array() and strips all
ExtensionArray type information.

Fix by using _mgr.apply() to process each block independently,
dispatching to EA._rank() for ExtensionArrays and algos.rank() for
numpy arrays. This follows the same pattern used by _accumulate().

For axis=1 (cross-column ranking), fall back to the numpy conversion
path since ranking across columns requires a single array.

Numpy blocks in the BlockManager are stored transposed (n_cols, n_rows)
relative to the user-facing DataFrame layout (n_rows, n_cols). Without
transposing, algos.rank with axis=0 ranks along the wrong dimension.

Follow the same transpose pattern used by _accumulate(). For 1D
ExtensionArrays, .T is a no-op so the fix is safe for both code paths.
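
The layout issue can be illustrated with plain NumPy, outside pandas internals. This is a toy sketch: `ordinal_rank` is a hypothetical stand-in for a ranking routine (1-based ranks, assuming no ties), not algos.rank itself, and the arrays only mimic the (n_rows, n_cols) versus (n_cols, n_rows) layouts described above.

```python
import numpy as np


def ordinal_rank(values: np.ndarray, axis: int) -> np.ndarray:
    """Toy ranking routine: 1-based ranks along `axis`, assumes no ties."""
    return np.argsort(np.argsort(values, axis=axis), axis=axis) + 1


# User-facing layout: 3 rows, 2 columns.
frame = np.array([[30, 1], [10, 2], [20, 3]])

# Block layout as described in the commit message: (n_cols, n_rows).
block = frame.T

# What the user expects: ranks within each column.
expected = ordinal_rank(frame, axis=0)

# Ranking the block directly along axis=0 compares values across columns.
wrong = ordinal_rank(block, axis=0)

# Transpose to the user-facing layout, rank, then transpose back.
fixed = ordinal_rank(block.T, axis=0).T

print(expected.tolist())  # [[3, 1], [1, 2], [2, 3]]
print(wrong.tolist())     # [[2, 2, 2], [1, 1, 1]]
print(fixed.tolist())     # [[3, 1, 2], [1, 2, 3]]
```

The `fixed` result is in block layout, and transposing it recovers `expected`, which is the behaviour the transpose in the second commit is meant to restore.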

Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.rank does not return EA types when original type was an EADtype
