BUG: DataFrame.rank does not preserve ExtensionArray dtypes #63987
Open
weeknd415 wants to merge 2 commits into pandas-dev:main from
…ev#52829)

DataFrame.rank() converted PyArrow-backed and nullable EA columns to float64 because the ranker() function called data.values for 2D data, which goes through BlockManager.as_array() and strips all ExtensionArray type information.

Fix by using _mgr.apply() to process each block independently, dispatching to EA._rank() for ExtensionArrays and algos.rank() for numpy arrays. This follows the same pattern used by _accumulate().

For axis=1 (cross-column ranking), fall back to the numpy conversion path since ranking across columns requires a single array.
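The per-column dispatch described above can be approximated from user code with public API only. The sketch below is not the PR's implementation: `rank_by_column` is a hypothetical helper that calls `Series.rank()` on each column and reassembles the result, so each column keeps whatever dtype `Series.rank()` returns. It assumes unique column labels, at least one column, and the default axis=0.

```python
import pandas as pd


def rank_by_column(frame, **rank_kwargs):
    """Hypothetical workaround: rank every column on its own.

    Each column goes through Series.rank(), so no column is funneled
    through a single 2D numpy array. Assumes unique column labels and
    a non-empty set of columns.
    """
    ranked_columns = [frame[label].rank(**rank_kwargs) for label in frame.columns]
    # All pieces share the original index, so concat only lines them up side by side.
    return pd.concat(ranked_columns, axis=1)


example = pd.DataFrame({"a": pd.array([3, 1, 2], dtype="Int64")})
ranked = rank_by_column(example, method="average")
print(ranked.dtypes)
```

This trades speed for dtype fidelity: it ranks one column at a time rather than one block at a time.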
Numpy blocks in the BlockManager are stored transposed (n_cols, n_rows) relative to the user-facing DataFrame layout (n_rows, n_cols). Without transposing, algos.rank with axis=0 ranks along the wrong dimension. Follow the same transpose pattern used by _accumulate(). For 1D ExtensionArrays, .T is a no-op so the fix is safe for both code paths.
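The effect of the transposed block layout can be illustrated with plain NumPy. This is a sketch under stated assumptions, not pandas internals: `ordinal_rank` is a hypothetical stand-in for algos.rank that only handles data without ties or missing values, and `block_values` mimics how a 2D block would be stored.

```python
import numpy as np


def ordinal_rank(values, axis=0):
    # 1-based ordinal ranks via double argsort; assumes no ties and no NaN.
    return values.argsort(axis=axis).argsort(axis=axis) + 1


# User-facing layout: 3 rows, 2 columns.
frame_layout = np.array([[30, 1], [10, 3], [20, 2]])
# Block layout: one row per column, shape (n_cols, n_rows).
block_values = frame_layout.T

# Ranking the block directly along axis=0 compares values across columns.
wrong = ordinal_rank(block_values, axis=0)
# Transpose in, rank down each column, transpose back to block layout.
right = ordinal_rank(block_values.T, axis=0).T

print(wrong)
print(right)
```

In `right`, each row of the block holds the ranks of one DataFrame column, which is what a column-wise rank should produce.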
doc/source/whatsnew/ (follow the existing format)

Summary
DataFrame.rank() converts PyArrow-backed and nullable ExtensionArray columns to float64, while Series.rank() correctly preserves the EA dtype. This is because the internal ranker() function calls data.values for 2D data (DataFrames), which goes through BlockManager.as_array() and strips all ExtensionArray type information.

Reproducer
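The original reproducer snippet is not preserved in this text. A minimal sketch of one, assuming nullable Int64/Float64 columns (chosen here so that PyArrow is not required), compares the dtypes returned by Series.rank() and DataFrame.rank() for the same data. The column names and values are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "a": pd.array([3, 1, 2], dtype="Int64"),
        "b": pd.array([1.5, None, 0.5], dtype="Float64"),
    }
)

# Rank each column on its own, then rank the whole frame at once.
series_dtypes = {label: frame[label].rank().dtype for label in frame.columns}
frame_dtypes = frame.rank().dtypes.to_dict()

print("Series.rank dtypes:   ", series_dtypes)
print("DataFrame.rank dtypes:", frame_dtypes)
```

According to the summary, on affected versions the DataFrame.rank() dtypes come back as float64 while the Series.rank() dtypes stay ExtensionArray-backed. The exact output depends on the installed pandas version.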
Fix

Replace the ranker() closure with block-level processing via self._mgr.apply(), following the same pattern used by _accumulate(). This processes each block independently:

- ExtensionArray blocks dispatch to EA._rank(), preserving dtype
- NumPy blocks go through algos.rank(), same as before

Tests Added
- test_rank_ea_dtype_preservation: PyArrow int32/float64 columns across all 5 rank methods (average, min, max, first, dense)
- test_rank_ea_dtype_preservation_nullable: Nullable Int64/Float64 columns with NA values
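The test bodies are not shown here. As a rough illustration of the property they check, the sketch below reports which columns lose their dtype for each of the five rank methods. `columns_losing_dtype` is a hypothetical helper, not part of the PR, and it treats the Series.rank() dtype as the expected one.

```python
import pandas as pd

RANK_METHODS = ["average", "min", "max", "first", "dense"]


def columns_losing_dtype(frame, method):
    """Hypothetical check: labels whose DataFrame.rank dtype differs from Series.rank."""
    ranked = frame.rank(method=method)
    return [
        label
        for label in frame.columns
        # Compare dtype names to sidestep numpy-vs-extension dtype equality quirks.
        if str(ranked[label].dtype) != str(frame[label].rank(method=method).dtype)
    ]


nullable_frame = pd.DataFrame(
    {
        "a": pd.array([3, None, 2], dtype="Int64"),
        "b": pd.array([1.5, 0.5, None], dtype="Float64"),
    }
)

mismatches = {method: columns_losing_dtype(nullable_frame, method) for method in RANK_METHODS}
print(mismatches)
```

With the fix applied, every list in `mismatches` would be expected to be empty. On an affected version, the lists name the columns that were converted.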