GH-35806: [R] Improve error message for null type inference with sparse CSV data#49338
Conversation
…h sparse CSV data When a CSV column contains only missing values in the first block of data, Arrow infers the type as null. If a non-null value appears later, the conversion fails with an unhelpful error suggesting `skip = 1`. This change adds a specific check for "conversion error to null" and provides a more helpful message explaining the cause (type inference from sparse data) and the solution (specify column types explicitly). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
|
I'm not totally happy with the error message, will rewrite before marking ready for review |
jonkeane
left a comment
There was a problem hiding this comment.
I like the improved message, though I'm not totally sure I follow why we can assert the reason for why null was inferred? And also wouldn't this similarly error if someone specified null manually and then there was data(??)
7796990 to
0bb82ce
Compare
Yeah, no, you're right, updated the messaging |
jonkeane
left a comment
There was a problem hiding this comment.
One more addition to the message
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
| msg <- c( | ||
| msg, | ||
| i = paste( | ||
| "If you have not specified the schema, this error may be due to the column type being", |
There was a problem hiding this comment.
This might be a little overly complicated, but at this point, is schema NULL if it wasn't specified? If it is, we could actually detect if someone has specified or not and message that?
There was a problem hiding this comment.
Nah, it's an empty schema. I can account for that though.
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2a526c1. There was 1 benchmark result with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2a526c1. There was 1 benchmark result with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…h sparse CSV data (apache#49338) ### Rationale for this change When reading a CSV with sparse data (many missing values followed by actual values), Arrow can infer a column type as `null` based on the first block of data. When non-null values appear later, the error message incorrectly suggests using `skip = 1` for header rows, which is misleading. ### What changes are included in this PR? Adds a specific check for "conversion error to null" that provides a helpful message explaining the cause (type inference from sparse data) and the solution (change the block size to use for inference). ### Are these changes tested? Yes, added a test in `test-dataset-csv.R`. ### Are there any user-facing changes? Yes, improved error message when CSV type inference fails due to sparse data. --- This PR was authored by Claude (Opus 4.5) and reviewed by @ thisisnic. 🤖 Generated with [Claude Code](https://claude.ai/code) * GitHub Issue: apache#35806 Lead-authored-by: Nic Crane <thisisnic@gmail.com> Co-authored-by: Jonathan Keane <jkeane@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
Rationale for this change
When reading a CSV with sparse data (many missing values followed by actual values), Arrow can infer a column type as
nullbased on the first block of data. When non-null values appear later, the error message incorrectly suggests usingskip = 1for header rows, which is misleading.What changes are included in this PR?
Adds a specific check for "conversion error to null" that provides a helpful message explaining the cause (type inference from sparse data) and the solution (change the block size to use for inference).
Are these changes tested?
Yes, added a test in
test-dataset-csv.R.Are there any user-facing changes?
Yes, improved error message when CSV type inference fails due to sparse data.
This PR was authored by Claude (Opus 4.5) and reviewed by @thisisnic.
🤖 Generated with Claude Code