-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-22232: [C++][Python] Introduce optional default_column_type parameter #47663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
|
|
|
|
1 similar comment
|
|
|
|
|
|
|
@github-actions crossbow submit preview-docs |
|
|
|
|
|
|
Thank you @vladborovtsov for the contribution. |
|
|
|
Hi @AlenkaF |
|
Happy to see a response! I will wait for an opinion from a C++ dev and in the meantime try to look at the Python part. |
|
Hi @AlenkaF |
AlenkaF
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only read the c++ part. I think the way option is defined and checked is good.
A nit: I'd move c++ tests into arrow/cpp/src/arrow/csv/column_builder_test.cc and use available machinery there. See how CheckInferred is used here:
arrow/cpp/src/arrow/csv/column_builder_test.cc
Lines 441 to 457 in ad9279e
| TEST_F(InferringColumnBuilderTest, MultipleChunkTimestamp) { | |
| auto options = ConvertOptions::Defaults(); | |
| auto tg = TaskGroup::MakeSerial(); | |
| std::shared_ptr<ChunkedArray> expected; | |
| ChunkedArrayFromVector<TimestampType>(timestamp(TimeUnit::SECOND), | |
| {{false}, {true}, {true}}, | |
| {{0}, {0}, {1542129070}}, &expected); | |
| CheckInferred(tg, {{""}, {"1970-01-01"}, {"2018-11-13 17:11:10"}}, options, expected); | |
| options.timestamp_parsers.push_back(TimestampParser::MakeStrptime("%Y/%m/%d")); | |
| tg = TaskGroup::MakeSerial(); | |
| ChunkedArrayFromVector<TimestampType>(timestamp(TimeUnit::SECOND), | |
| {{false}, {true}, {true}}, | |
| {{0}, {0}, {1542067200}}, &expected); | |
| CheckInferred(tg, {{""}, {"1970/01/01"}, {"2018/11/13"}}, options, expected); | |
| } |
Edit: column_builder_test.cc is checking behavior only on single columns, perhaps keep a couple multi-column tests in the current location to test for that.
|
There is not too much to add into I’ve added a guard test to ensure that my changes don’t override the type inference logic. Not really sure if it it makes sense to have them there. As for the reader tests, I’ve tried to keep them representative while minimizing the amount of lines with recent changes. |
Rationale for this change
Add an optional default_column_type parameter to the CSV reading API (C++ and Python) to provide a fallback type when per-column types aren’t specified, improving schema consistency and complementing the existing column_types logic.
What changes are included in this PR?
Are these changes tested?
Yes. Existing and new tests are passing.
C++:
pyarrow:
New tests are passing.
Are there any user-facing changes?
I believe this change is backward compatible. Parameter is optional and its default value doesn't change the existing behavior; All the existing rests are passing.
Maybe relevant: #22232
Relates to #47502
GitHub Issue: [C++] CSV reader: add a default column type (or sentinel mapping) to avoid per-column enumeration #47502
GitHub Issue: [C++] CSV reader: Ability to not infer column types. #22232