
Apache Parquet Java Performance Improvements #3530

@iemejia

Description


These issues/PRs implement a coordinated performance improvement effort for parquet-java encoding and decoding hot paths. The work focuses on reducing CPU overhead, allocation pressure, and avoidable memory copies in commonly used readers and writers, including plain values, binary values, byte-stream split encoding, dictionary encoding, delta byte-array encoding, delta binary packing, RLE/bit-packing decoding, page assembly, and row-group flushing.

Together, the changes preserve existing Parquet format compatibility and public behavior while making the implementation more efficient internally. The improvements use more direct ByteBuffer access, batched read/write operations, reusable buffers and helpers, cached computed values, and earlier release of temporary memory. The goal of the parent issue is to track this broader optimization series as a set of focused, reviewable PRs that each improve one hot path while contributing to better end-to-end read/write performance and lower memory usage.
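As a rough illustration of the "more direct ByteBuffer access" technique, the sketch below decodes plain-encoded 32-bit ints straight from a little-endian `ByteBuffer` view instead of pulling bytes one at a time through a stream wrapper. The class and method names are illustrative only, not the actual parquet-java `PlainValuesReader` API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: batched plain-int decode directly from a ByteBuffer.
// One duplicate() and one order() call up front, then cheap relative reads,
// rather than reconstructing each int byte-by-byte from an InputStream.
public class DirectBufferDecode {
    static int[] decodeInts(ByteBuffer page, int count) {
        // duplicate() leaves the caller's buffer position untouched
        ByteBuffer buf = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = buf.getInt(); // relative read advances position by 4 bytes
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer page = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        page.putInt(1).putInt(2).putInt(3);
        page.flip();
        int[] values = decodeInts(page, 3);
        System.out.println(values[0] + "," + values[1] + "," + values[2]);
    }
}
```

The JIT can compile `ByteBuffer.getInt` on a heap buffer down to a single bounds-checked load, which is where much of the speedup in this style of change comes from.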

Benchmark summary

Benchmarks were run with JMH (-wi 3 -i 5 -f 1, 100k values/invocation) on Linux x86_64, JDK 25 (Temurin-25.0.3+9-LTS). The machine was an Azure VM with 8 vCPUs on an AMD EPYC 9V45 96-Core Processor, 4 cores / 8 threads visible, AVX2 and AVX-512 available, and 31 GiB RAM.

| Area / PR | Benchmark | Baseline | Optimized | Improvement |
|---|---|---|---|---|
| Plain values reader | IntEncodingBenchmark.decodePlain | 428M ops/s | 5,397M ops/s | 12.6x |
| Plain values writer | IntEncodingBenchmark.encodePlain | 183M ops/s | 328M ops/s | +80% |
| Binary hashCode cache | BinaryEncodingBenchmark.encodeDictionary LOW/1000 | 1.4M ops/s | 146M ops/s | +10,019% |
| Byte-stream split writer | ByteStreamSplitEncodingBenchmark Long | 51M ops/s | 423M ops/s | +732% |
| Byte-stream split reader | ByteStreamSplitDecodingBenchmark Float | 199M ops/s | 1,017M ops/s | +412% |
| Binary plain reader | BinaryEncodingBenchmark.decodePlain LOW/10 | 140M ops/s | 230M ops/s | +64% |
| Dictionary writers | IntEncodingBenchmark.encodeDictionary RANDOM | 14.7M ops/s | 23.4M ops/s | +59% |
| Delta byte-array writer | BinaryEncodingBenchmark.encodeDeltaByteArray HIGH/10 | 56.8M ops/s | 79.0M ops/s | +39% |
| RLE dictionary-id decode | IntEncodingBenchmark.decodeDictionary SEQUENTIAL | 418M ops/s | 539M ops/s | +29% |
| Delta integer decode | IntEncodingBenchmark.decodeDelta HIGH_CARDINALITY | 371M ops/s | 506M ops/s | +37% |
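For context on the byte-stream split rows above, the encoding gathers byte j of every value into a contiguous stream j, which makes the data more compressible; the writer optimization replaces per-value, per-byte emission with batched scatter writes into reusable arrays. The sketch below shows the layout for floats under those assumptions; it is not the parquet-java `ByteStreamSplitValuesWriter` implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative BYTE_STREAM_SPLIT round trip for 4-byte floats:
// byte j of value i is stored at output position j*n + i, producing
// four contiguous byte "planes" instead of interleaved values.
public class ByteStreamSplitSketch {
    static byte[] encodeFloats(float[] values) {
        int n = values.length;
        ByteBuffer raw = ByteBuffer.allocate(n * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : values) raw.putFloat(v);
        byte[] bytes = raw.array();
        byte[] out = new byte[n * 4];
        for (int i = 0; i < n; i++) {       // scatter into the 4 planes
            for (int j = 0; j < 4; j++) {
                out[j * n + i] = bytes[i * 4 + j];
            }
        }
        return out;
    }

    static float[] decodeFloats(byte[] encoded, int n) {
        byte[] bytes = new byte[n * 4];
        for (int i = 0; i < n; i++) {       // gather back from the planes
            for (int j = 0; j < 4; j++) {
                bytes[i * 4 + j] = encoded[j * n + i];
            }
        }
        ByteBuffer raw = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[n];
        for (int i = 0; i < n; i++) out[i] = raw.getFloat();
        return out;
    }
}
```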

Additional changes are primarily allocation or memory improvements rather than direct throughput microbenchmark wins: IntList.size() becomes O(1), the batch read API enables more efficient reader implementations, page assembly avoids full-page copies, and row-group flushing releases column buffers earlier to reduce peak memory usage.
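The IntList.size() change mentioned above boils down to maintaining a running counter instead of walking every slab on each call. A minimal sketch of that idea, with illustrative names rather than parquet-java's actual `IntList` fields:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a slab-backed int list. A size() that sums per-slab counts is
// O(slabs); keeping totalCount up to date on every add() makes size() O(1).
public class SlabIntList {
    private static final int SLAB_SIZE = 1024;
    private final List<int[]> slabs = new ArrayList<>();
    private int currentSlabCount = 0; // values used in the newest slab
    private int totalCount = 0;       // running counter -> O(1) size()

    public void add(int value) {
        if (slabs.isEmpty() || currentSlabCount == SLAB_SIZE) {
            slabs.add(new int[SLAB_SIZE]); // start a new slab when full
            currentSlabCount = 0;
        }
        slabs.get(slabs.size() - 1)[currentSlabCount++] = value;
        totalCount++;
    }

    public int size() {
        // previously equivalent to: (slabs.size() - 1) * SLAB_SIZE + currentSlabCount,
        // or worse, a loop over all slabs
        return totalCount;
    }
}
```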

This is a parent issue to track the ongoing work on performance improvements for encodings/decodings and other areas of the Java implementation. Since I am not a committer, I don't have permission to create sub-issues, so I am using this one as the main place to track them.

REVIEWS IN PROGRESS

  1. Optimize PlainValuesReader by reading directly from ByteBuffer (12x decode speedup) #3493
    GH-3493: Optimize PlainValuesReader with direct ByteBuffer reads #3494
  2. Optimize PlainValuesWriter by writing directly to ByteBuffer slabs (up to 2x encode speedup) #3495
    GH-3495: Optimize PlainValuesWriter with direct ByteBuffer slab writes (~2.5x encode speedup) #3496
  3. Cache hashCode() for non-reused Binary instances (huge dictionary-encode speedup) #3499
    GH-3499: Cache hashCode() for non-reused Binary instances (up to 73x dictionary-encode speedup) #3500
  4. Optimize ByteStreamSplitValuesWriter: remove per-value allocation and batch single-byte writes #3503 (NOT REVIEWED YET)
    GH-3503: Optimize ByteStreamSplitValuesWriter with batched scatter writes #3504
  5. Optimize ByteStreamSplitValuesReader page transposition #3505 (NOT REVIEWED YET)
    GH-3505: Optimize ByteStreamSplitValuesReader page transposition #3506
  6. Optimize BinaryPlainValuesReader by reading directly from ByteBuffer #3509
    GH-3509: Optimize BinaryPlainValuesReader by reading directly from ByteBuffer #3510
  7. Optimize dictionary writers by replacing fastutil Linked maps with OpenHashMap + ArrayList #3513
    Optimize dictionary writers by replacing fastutil Linked maps with OpenHashMap + ArrayList #3513
  8. Optimize DeltaByteArrayWriter and DeltaLengthByteArrayValuesWriter: remove per-value allocation and LittleEndianDataOutputStream wrapper #3516
    GH-3516: Optimize DeltaByteArrayWriter and DeltaLengthByteArrayValuesWriter #3517
  9. Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode) #3522
    GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode) #3523
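The hashCode caching change in item 3 follows a common pattern for immutable values: compute the hash once on first use and reuse it on every subsequent dictionary lookup. A hedged sketch of that pattern, using a sentinel value the way `String.hashCode` does; the class below is hypothetical and is not parquet-java's `Binary`:

```java
import java.util.Arrays;

// Sketch of lazy hashCode caching for an immutable binary value.
// 0 means "not computed yet"; if the real hash is 0 it is simply
// recomputed on each call, which is rare and still correct.
public class CachedHashBinary {
    private final byte[] bytes;
    private int cachedHash; // defaults to 0 = not computed

    public CachedHashBinary(byte[] bytes) {
        // caching is only safe because this instance never changes its bytes
        this.bytes = bytes.clone();
    }

    @Override
    public int hashCode() {
        int h = cachedHash;
        if (h == 0) {
            h = Arrays.hashCode(bytes);
            cachedHash = h; // benign data race: the write is idempotent
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedHashBinary
            && Arrays.equals(bytes, ((CachedHashBinary) o).bytes);
    }
}
```

Dictionary encoding hashes the same value on every insert probe, so removing the repeated `Arrays.hashCode`-style scan over the bytes is what produces the large speedup on low-cardinality data.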

Benchmarks PR

  1. Add JMH benchmarks for encoding/decoding paths and fix parquet-benchmarks shaded jar #3511
    GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar #3512

WARNING: The associated GH-XXXX prefixes in the PR titles below are not correct.

  1. GH-3522: Optimize IntList.size() from O(slabs) to O(1) with running counter #3533 (NOT REVIEWED YET)
  2. GH-3522: Optimize delta binary packing with batch unpack32/pack32 and cached packers (+13-37% decode) #3534 (NOT REVIEWED YET)
  3. GH-3522: Add batch read APIs to ValuesReader hierarchy #3535 (NOT REVIEWED YET)
  4. GH-3522: Eliminate unnecessary page-size copies in compressed page assembly and CRC checksums #3536 (NOT REVIEWED YET)
  5. GH-3522: Reduce peak memory during row group flush by eagerly releasing column buffers #3537 (NOT REVIEWED YET)
