Skip to content

GH-35806: [R] Improve error message for null type inference with sparse CSV data#49338

Merged
thisisnic merged 9 commits intoapache:mainfrom
thisisnic:GH-35806-null-type-error-message
Mar 27, 2026
Merged

GH-35806: [R] Improve error message for null type inference with sparse CSV data#49338
thisisnic merged 9 commits intoapache:mainfrom
thisisnic:GH-35806-null-type-error-message

Conversation

@thisisnic
Copy link
Copy Markdown
Member

@thisisnic thisisnic commented Feb 19, 2026

Rationale for this change

When reading a CSV with sparse data (many missing values followed by actual values), Arrow can infer a column type as null based on the first block of data. When non-null values appear later, the error message incorrectly suggests using skip = 1 for header rows, which is misleading.

What changes are included in this PR?

Adds a specific check for "conversion error to null" that provides a helpful message explaining the cause (type inference from sparse data) and the solution (change the block size to use for inference).

Are these changes tested?

Yes, added a test in test-dataset-csv.R.

Are there any user-facing changes?

Yes, improved error message when CSV type inference fails due to sparse data.


This PR was authored by Claude (Opus 4.5) and reviewed by @thisisnic.

🤖 Generated with Claude Code

…h sparse CSV data

When a CSV column contains only missing values in the first block of data,
Arrow infers the type as null. If a non-null value appears later, the
conversion fails with an unhelpful error suggesting `skip = 1`.

This change adds a specific check for "conversion error to null" and
provides a more helpful message explaining the cause (type inference
from sparse data) and the solution (specify column types explicitly).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@thisisnic thisisnic requested a review from jonkeane as a code owner February 19, 2026 09:31
@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #35806 has been automatically assigned in GitHub to PR creator.

@thisisnic thisisnic marked this pull request as draft February 19, 2026 13:50
@thisisnic
Copy link
Copy Markdown
Member Author

I'm not totally happy with the error message, will rewrite before marking ready for review

Comment thread r/R/util.R Outdated
Comment thread r/R/util.R Outdated
@github-actions github-actions Bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Feb 23, 2026
@thisisnic thisisnic marked this pull request as ready for review February 23, 2026 14:20
Copy link
Copy Markdown
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the improved message, though I'm not totally sure I follow why we can assert the reason for why null was inferred? And also wouldn't this similarly error if someone specified null manually and then there was data(??)

Comment thread r/R/util.R Outdated
@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 23, 2026
@github-actions github-actions Bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 5, 2026
@thisisnic thisisnic force-pushed the GH-35806-null-type-error-message branch from 7796990 to 0bb82ce Compare March 5, 2026 10:49
@thisisnic
Copy link
Copy Markdown
Member Author

I'm not totally sure I follow why we can assert the reason for why null was inferred

Yeah, no, you're right, updated the messaging

Comment thread r/R/util.R Outdated
Copy link
Copy Markdown
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more addition to the message

@github-actions github-actions Bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 5, 2026
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
@github-actions github-actions Bot added awaiting review Awaiting review and removed awaiting changes Awaiting changes awaiting merge Awaiting merge labels Mar 5, 2026
Comment thread r/R/util.R Outdated
@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Mar 26, 2026
Comment thread r/R/util.R
msg <- c(
msg,
i = paste(
"If you have not specified the schema, this error may be due to the column type being",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a little overly complicated, but at this point, is schema NULL if it wasn't specified? If it is, we could actually detect if someone has specified or not and message that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, it's an empty schema. I can account for that though.

Co-authored-by: Jonathan Keane <jkeane@gmail.com>
@github-actions github-actions Bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Mar 26, 2026
@thisisnic thisisnic merged commit 2a526c1 into apache:main Mar 27, 2026
8 of 10 checks passed
@thisisnic thisisnic removed the awaiting changes Awaiting changes label Mar 27, 2026
@github-actions github-actions Bot added the awaiting review Awaiting review label Mar 27, 2026
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2a526c1.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2a526c1.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

thisisnic added a commit to thisisnic/arrow that referenced this pull request Apr 6, 2026
…h sparse CSV data (apache#49338)

### Rationale for this change

When reading a CSV with sparse data (many missing values followed by actual values), Arrow can infer a column type as `null` based on the first block of data. When non-null values appear later, the error message incorrectly suggests using `skip = 1` for header rows, which is misleading.

### What changes are included in this PR?

Adds a specific check for "conversion error to null" that provides a helpful message explaining the cause (type inference from sparse data) and the solution (change the block size to use for inference).

### Are these changes tested?

Yes, added a test in `test-dataset-csv.R`.

### Are there any user-facing changes?

Yes, improved error message when CSV type inference fails due to sparse data.

---

This PR was authored by Claude (Opus 4.5) and reviewed by @ thisisnic.

🤖 Generated with [Claude Code](https://claude.ai/code)
* GitHub Issue: apache#35806

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants