GH-35806: [R] Improve error message for null type inference with sparse CSV data by thisisnic · Pull Request #49338 · apache/arrow

thisisnic · 2026-02-19T09:31:51Z

Rationale for this change

When reading a CSV with sparse data (many missing values followed by actual values), Arrow can infer a column type as `null` based on the first block of data. When non-null values appear later, the error message incorrectly suggests using `skip = 1` for header rows, which is misleading.
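
The failure mode described above can be illustrated with a toy model in base R. This is a sketch under stated assumptions, not Arrow's reader: the helpers `infer_block_type()`, `convert_value()` and `read_in_blocks()` are hypothetical, the error text is illustrative, and the only thing modeled is that the type is fixed from the first block and never revised.

```r
# Toy model of block-wise type inference; NOT Arrow's actual implementation.
# All function names here are hypothetical and exist only for illustration.

infer_block_type <- function(values) {
  # A block holding only missing values carries no type information.
  if (all(is.na(values) | values == "")) "null" else "int64"
}

convert_value <- function(value, type) {
  is_missing <- is.na(value) || value == ""
  if (type == "null" && !is_missing) {
    # Illustrative wording, loosely modeled on the error quoted in the PR.
    stop("CSV conversion error to null: invalid value '", value, "'")
  }
  value
}

read_in_blocks <- function(column, block_size) {
  # The type is decided from the first block only and then kept.
  type <- infer_block_type(head(column, block_size))
  for (v in column) convert_value(v, type)
  type
}

# Five missing values followed by one real value.
sparse_column <- c(rep("", 5), "42")

# The first block sees only missing values, so the later "42" cannot convert.
small <- tryCatch(
  read_in_blocks(sparse_column, block_size = 5),
  error = function(e) conditionMessage(e)
)

# A block large enough to include the real value infers a usable type.
large <- read_in_blocks(sparse_column, block_size = 6)
```

In this toy, widening the first block or fixing the type up front both avoid the error, which mirrors the two remedies mentioned in this PR (a larger inference block, or explicit column types).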

What changes are included in this PR?

Adds a specific check for "conversion error to null" that provides a helpful message explaining the cause (type inference from sparse data) and the solution (change the block size to use for inference).
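
A minimal sketch of the kind of check described above, in base R. The helper name `augment_csv_error()`, its `schema_specified` argument, and the hint wording are all assumptions made for illustration; they are not the code or the message text in `r/R/util.R`.

```r
# Hypothetical helper: names and wording are illustrative only,
# not the implementation merged in this PR.
augment_csv_error <- function(msg, schema_specified = FALSE) {
  # Only act on the specific failure the PR targets.
  if (!any(grepl("conversion error to null", msg, fixed = TRUE))) {
    return(msg)
  }
  hint <- if (schema_specified) {
    "A column declared as null contains non-null values; check the schema."
  } else {
    paste(
      "The column type may have been inferred as null from the first",
      "block of data; consider specifying column types explicitly."
    )
  }
  # A bullet named "i" follows the informational-bullet convention that
  # the diff fragment in this PR also uses (`i = paste(...)`).
  c(msg, i = hint)
}

out <- augment_csv_error(
  "Invalid: CSV conversion error to null: invalid value '42'"
)
unchanged <- augment_csv_error("Invalid: some other error")
```

Returning the original message untouched for every other error keeps the existing handling (including the `skip = 1` suggestion) in place for the cases where it still applies.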

Are these changes tested?

Yes, added a test in `test-dataset-csv.R`.

Are there any user-facing changes?

Yes, improved error message when CSV type inference fails due to sparse data.

This PR was authored by Claude (Opus 4.5) and reviewed by @thisisnic.

🤖 Generated with Claude Code

GitHub Issue: [R] Error message caused by reading sparsely populated data is misleading #35806

…h sparse CSV data

When a CSV column contains only missing values in the first block of data, Arrow infers the type as null. If a non-null value appears later, the conversion fails with an unhelpful error suggesting `skip = 1`.

This change adds a specific check for "conversion error to null" and provides a more helpful message explaining the cause (type inference from sparse data) and the solution (specify column types explicitly).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-02-19T09:32:16Z

⚠️ GitHub issue #35806 has been automatically assigned in GitHub to PR creator.

thisisnic · 2026-02-19T13:50:27Z

I'm not totally happy with the error message, will rewrite before marking ready for review

jonkeane

I like the improved message, though I'm not totally sure I follow why we can assert the reason for why null was inferred? And also wouldn't this similarly error if someone specified null manually and then there was data(??)

thisisnic · 2026-03-05T10:50:16Z

> I'm not totally sure I follow why we can assert the reason for why null was inferred

Yeah, no, you're right, updated the messaging

jonkeane

One more addition to the message

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

jonkeane · 2026-03-26T16:31:34Z

+    msg <- c(
+      msg,
+      i = paste(
+        "If you have not specified the schema, this error may be due to the column type being",


This might be a little overly complicated, but at this point, is schema NULL if it wasn't specified? If it is, we could actually detect whether someone has specified one or not and message accordingly?

thisisnic · 2026-03-27

Nah, it's an empty schema. I can account for that though.

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

conbench-apache-arrow · 2026-03-27T22:06:43Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2a526c1.

There was 1 benchmark result with an error:

Commit Run on amd64-m5-4xlarge-linux at 2026-03-27 17:07:53Z
- dataset-serialize (Python) with dataset=nyctaxi_multi_parquet_s3, format=csv, selectivity=100pc

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

…h sparse CSV data (apache#49338)

### Rationale for this change

When reading a CSV with sparse data (many missing values followed by actual values), Arrow can infer a column type as `null` based on the first block of data. When non-null values appear later, the error message incorrectly suggests using `skip = 1` for header rows, which is misleading.

### What changes are included in this PR?

Adds a specific check for "conversion error to null" that provides a helpful message explaining the cause (type inference from sparse data) and the solution (change the block size to use for inference).

### Are these changes tested?

Yes, added a test in `test-dataset-csv.R`.

### Are there any user-facing changes?

Yes, improved error message when CSV type inference fails due to sparse data.

---

This PR was authored by Claude (Opus 4.5) and reviewed by @ thisisnic.

🤖 Generated with [Claude Code](https://claude.ai/code)

* GitHub Issue: apache#35806

Lead-authored-by: Nic Crane <thisisnic@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

thisisnic requested a review from jonkeane as a code owner February 19, 2026 09:31

github-actions Bot added Component: R and awaiting committer review labels Feb 19, 2026

thisisnic marked this pull request as draft February 19, 2026 13:50

Only give 1 option as to what to do, remove redundant comments

2154bdc

thisisnic commented Feb 23, 2026

Comment thread r/R/util.R Outdated

thisisnic commented Feb 23, 2026

Comment thread r/R/util.R Outdated

thisisnic added 2 commits February 23, 2026 14:19

Remove redundant comments

3538a2d

Remove redundant comments

921701a

github-actions Bot added awaiting changes, awaiting change review and removed awaiting committer review, awaiting changes labels Feb 23, 2026

thisisnic marked this pull request as ready for review February 23, 2026 14:20

jonkeane reviewed Feb 23, 2026

Comment thread r/R/util.R Outdated

github-actions Bot added awaiting changes and removed awaiting change review labels Feb 23, 2026

github-actions Bot added awaiting change review and removed awaiting changes labels Mar 5, 2026

Make error more general

0bb82ce

thisisnic force-pushed the GH-35806-null-type-error-message branch from 7796990 to 0bb82ce Compare March 5, 2026 10:49

jonkeane reviewed Mar 5, 2026

Comment thread r/R/util.R Outdated

jonkeane approved these changes Mar 5, 2026

github-actions Bot added awaiting merge, awaiting changes and removed awaiting change review labels Mar 5, 2026

Update r/R/util.R

7672821

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

github-actions Bot added awaiting review and removed awaiting changes, awaiting merge labels Mar 5, 2026

fix test error message

1c650c7

jonkeane reviewed Mar 26, 2026

Comment thread r/R/util.R Outdated

github-actions Bot added awaiting changes and removed awaiting review labels Mar 26, 2026

jonkeane reviewed Mar 26, 2026

Update r/R/util.R

9e0de65

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

github-actions Bot added awaiting change review, awaiting changes and removed awaiting changes, awaiting change review labels Mar 26, 2026

Account for no schema

2ff2e1b

thisisnic merged commit 2a526c1 into apache:main Mar 27, 2026
8 of 10 checks passed

thisisnic removed the awaiting changes label Mar 27, 2026

thisisnic mentioned this pull request Mar 27, 2026

[R] Error message caused by reading sparsely populated data is misleading #35806

Closed

github-actions Bot added the awaiting review label Mar 27, 2026

thisisnic merged 9 commits into apache:main from thisisnic:GH-35806-null-type-error-message
