Skip to content

GH-3522: Optimize IntList.size() from O(slabs) to O(1) with running counter#3533

Open
iemejia wants to merge 1 commit intoapache:masterfrom
iemejia:perf-intlist-size
Open

GH-3522: Optimize IntList.size() from O(slabs) to O(1) with running counter#3533
iemejia wants to merge 1 commit intoapache:masterfrom
iemejia:perf-intlist-size

Conversation

@iemejia
Copy link
Copy Markdown
Member

@iemejia iemejia commented May 1, 2026

Summary

  • Replace IntList.size() slab-iteration with a simple totalSize counter incremented on each add()
  • Eliminates O(slabs) overhead from dictionary encoding hot paths where size() is called frequently

Details

IntList.size() was iterating over all slabs to sum their lengths on every call. For delta encoding writers that check size thresholds frequently, this becomes O(slabs) per check. With a running counter it's O(1).

The improvement is primarily relevant for large row groups where the number of slabs grows.

All 576 parquet-column tests pass.

…ning counter

IntList.size() was iterating over all slabs to sum their lengths on every
call. Replace with a simple totalSize counter incremented on each add().
This eliminates O(slabs) overhead from dictionary encoding hot paths where
size() is called frequently.

private void allocateSlab() {
currentSlab = new int[currentSlabSize];
currentSlabPos = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also set totalSize to 0 here?


currentSlab[currentSlabPos] = i;
++currentSlabPos;
++totalSize;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code, totalSize will always be the same as currentSlabSize

Comment on lines -154 to -156
for (int[] slab : slabs) {
size += slab.length;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change is correct. We're returning the slab length, and this one can grow as well (double until MAX_SLAB_SIZE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants