-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG : Fix Excel header NaN duplication with merged MultiIndex columns in to_excel #62576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
6703b44 to
a5784f5
Compare
a5784f5 to
b34c5d6
Compare
af60451 to
a85715d
Compare
a85715d to
5f70c16
Compare
Alvaro-Kothe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs tests and an entry in whatsnew.
fbdd7dc to
1eaf805
Compare
1eaf805 to
1b5fe97
Compare
| with ExcelWriter("test.xlsx", engine="openpyxl") as writer: | ||
| df.to_excel(writer, sheet_name="Sheet1", merge_cells=merge_cells) | ||
|
|
||
| reader = ExcelFile("test.xlsx") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't read and write to CWD, either use a temporary file, or use an in-memory buffer.
| DummyClass.assert_called_and_reset() | ||
|
|
||
| @td.skip_if_no("openpyxl") | ||
| def test_to_excel_multiindex_nan_in_columns(self, merge_cells): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test all excel engines.
| result = pd.read_excel(reader, index_col=0, header=[0, 1]) | ||
|
|
||
| original_values = df.to_numpy() | ||
| result_values = result.to_numpy() | ||
| tm.assert_numpy_array_equal(original_values, result_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create an expected DataFrame and use tm.assert_frame_equal
|
pre-commit.ci autofix |
2dad8cc to
77422bf
Compare
|
done |
| with ExcelFile(tmp_excel) as reader: | ||
| result = pd.read_excel(reader, index_col=0, header=[0, 1]) | ||
|
|
||
| tm.assert_numpy_array_equal(result.to_numpy(), df.to_numpy()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't test the header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test validates that data survives the Excel round-trip. NaN in headers are written correctly (verified with openpyxl) but cannot be read back due to Excel treating empty cells as blanks. This is an Excel limitation, not a code bug.
| buf = BytesIO() | ||
| df.to_excel(buf) | ||
|
|
||
| def test_to_excel_multiindex_nan_in_columns(self, merge_cells, tmp_excel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test passes on main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand is it not supposed to passed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are expected to fail on main without the patch. If they’re passing, it means the bug isn’t actually being reproduced, so you are not truly verifying the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m a bit confused: this test case doesn’t exist on main at all.
I only created it in this branch, so I don’t understand how it could be “passing on main.”
Is there something I’m missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you remove your patch with
git restore --source=upstream/main -- pandas/io/formats/excel.py
and run the tests with
pytest pandas/tests/io/excel/test_writers.py
The test that you created still passes. Hence, it's not testing your fix.
GH#62340: Use original column values (with NaN) instead of NBSP-filled values when writing MultiIndex headers to Excel. - Modify _format_header_mi() to use columns.get_level_values() to get the original column values with NaN preserved - Add test to verify MultiIndex structure and data integrity are preserved during Excel round-trip - Note: read_excel() limitation means NaN in headers become empty cells in Excel and cannot be reconstructed on read, but data values are correctly preserved
6b9525e to
9b84372
Compare
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.