HIVE-29516 Fix NPE in StatsUtils.updateStats when column statistics a…#6382

Open
shubhluck wants to merge 1 commit intoapache:masterfrom
shubhluck:HIVE-29516

@shubhluck
Contributor

…re unavailable

Added null checks before iterating over column statistics in:

  • StatsUtils.updateStats()
  • StatsUtils.getColStatisticsUpdatingTableAlias()
  • StatsRulesProcFactory (JOIN statistics)

This prevents query compilation failures during semijoin optimization when column-level statistics are incomplete, commonly seen with large TPC-DS datasets (100GB+).

What changes were proposed in this pull request?

This PR adds null checks before iterating over column statistics in three locations to prevent NullPointerException:

  1. StatsUtils.updateStats() - Added null check for stats.getColumnStats() before the for-each loop, defaulting to empty list when null
  2. StatsUtils.getColStatisticsUpdatingTableAlias() - Added null check with early return of empty list when parent column stats are null
  3. StatsRulesProcFactory (JOIN statistics computation) - Added null check before iterating over column stats during join statistics calculation

The root cause is that Statistics.getColumnStats() returns null (not an empty list) when no column statistics are available:

public List<ColStatistics> getColumnStats() {
    if (columnStats != null) {
        return Lists.newArrayList(columnStats.values());
    }
    return null;  // Returns null, causing NPE in for-each loops
}
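
The guard described above can be sketched in isolation. This is a minimal illustration, not the actual patch: `Stats` is a hypothetical stand-in for Hive's Statistics class that models only the null-returning accessor, and `sumUnguarded` / `sumGuarded` are made-up helper names. It shows why a for-each loop over the accessor's result fails when no column statistics exist, and how substituting an empty list avoids the failure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NullSafeColumnStats {

    // Hypothetical stand-in for Hive's Statistics; only the accessor behavior is modeled.
    static class Stats {
        // Column name -> distinct-value count (illustrative payload only).
        private Map<String, Long> columnStats;

        void addColumnStat(String column, long ndv) {
            if (columnStats == null) {
                columnStats = new LinkedHashMap<>();
            }
            columnStats.put(column, ndv);
        }

        // Mirrors the quoted accessor: returns null, not an empty list, when nothing is recorded.
        List<Long> getColumnStats() {
            if (columnStats != null) {
                return new ArrayList<>(columnStats.values());
            }
            return null;
        }
    }

    // Unguarded: the for-each loop dereferences the null list and throws NullPointerException.
    static long sumUnguarded(Stats stats) {
        long total = 0;
        for (long ndv : stats.getColumnStats()) {
            total += ndv;
        }
        return total;
    }

    // Guarded: fall back to an empty list so the loop body simply never runs.
    static long sumGuarded(Stats stats) {
        List<Long> colStats = stats.getColumnStats();
        if (colStats == null) {
            colStats = Collections.emptyList();
        }
        long total = 0;
        for (long ndv : colStats) {
            total += ndv;
        }
        return total;
    }

    public static void main(String[] args) {
        Stats noColumnStats = new Stats();
        System.out.println("guarded sum: " + sumGuarded(noColumnStats));
        try {
            sumUnguarded(noColumnStats);
        } catch (NullPointerException e) {
            System.out.println("unguarded loop threw NullPointerException");
        }
    }
}
```

Guarding at each call site keeps the accessor's contract unchanged for other callers; returning an empty list from the accessor itself would be an alternative, but whether that is safe depends on callers that may rely on null to mean "no column statistics".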

Why are the changes needed?

Query compilation fails with NullPointerException during semijoin optimization when column statistics are unavailable:

java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.stats.StatsUtils.updateStats(StatsUtils.java:2067)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.removeSemijoinOptimizationByBenefit(TezCompiler.java:1982)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.semijoinRemovalBasedTransformations(TezCompiler.java:539)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.optimizeOperatorPlan(TezCompiler.java:238)
    ...

This issue is particularly prevalent with:

  • Large TPC-DS datasets (100GB+) where statistics collection may be incomplete
  • Tables where column-level statistics have not been computed
  • Complex queries where intermediate operators lack column statistics

The fix ensures graceful handling when column statistics are unavailable, allowing the optimizer to continue using row-based statistics instead of failing.
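
As a rough illustration of what continuing with row-based statistics can mean, the sketch below uses a hypothetical estimator (`estimateDistinctKeys` is an invented name, and the logic is not taken from Hive). It prefers column-level distinct-value counts when they exist and otherwise falls back to the row count as a conservative upper bound, instead of throwing.

```java
import java.util.List;

public class RowBasedFallback {

    // Hypothetical estimator, not Hive's implementation.
    // columnNdvs may be null or empty when column statistics are unavailable.
    static long estimateDistinctKeys(long numRows, List<Long> columnNdvs) {
        if (columnNdvs == null || columnNdvs.isEmpty()) {
            // No column statistics: assume every row could carry a distinct key.
            return numRows;
        }
        long maxNdv = 0;
        for (long ndv : columnNdvs) {
            maxNdv = Math.max(maxNdv, ndv);
        }
        // A distinct-value count can never exceed the number of rows.
        return Math.min(maxNdv, numRows);
    }

    public static void main(String[] args) {
        System.out.println(estimateDistinctKeys(1000L, null));
    }
}
```

The fallback yields a less precise estimate than column statistics would, but the plan can still be compiled.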

Does this PR introduce any user-facing change?

No. This is a bug fix that prevents query compilation failures. Previously failing queries will now compile and execute successfully. There is no change to query results or behavior for queries that were already working.

How was this patch tested?

  1. Reproduced the issue with TPC-DS queries at 100GB scale where column statistics were incomplete
  2. Verified that queries failing with NPE now compile and execute successfully after the fix
  3. Verified that queries with complete column statistics continue to work correctly and produce the same results
  4. Existing unit tests pass without modification

To reproduce the original issue:

  1. Generate TPC-DS dataset at 100GB+ scale
  2. Do not compute column statistics (or ensure they are incomplete)
  3. Run queries involving semijoin optimizations
  4. Observe NPE during compilation
