Skip to content

Latest commit

 

History

History
182 lines (124 loc) · 8.03 KB

File metadata and controls

182 lines (124 loc) · 8.03 KB

Packed String Family

A family of fixed-width, bit-packed string types optimized for:

  • Deterministic memory layout
  • Zero allocation
  • O(1) comparisons
  • Fast hashing
  • Compact identifier storage

The core idea: encode restricted alphabets into 6-bit symbols and store them inside native integer words.

Overview

PackedString is a fixed-capacity, bit-packed string representation optimized for:

  • Predictable performance
  • Constant-time metadata access
  • Extremely fast equality
  • Cache efficiency
  • VM / compiler / tokenizer usage
  • Identifier-heavy workloads

Instead of storing 8-bit ASCII characters, it uses:

  • 6-bit encoding
  • 64-character alphabet
  • Fixed-width storage
  • Embedded metadata

This makes the representation:

  • 100% stack-allocatable
  • No heap allocation
  • No dynamic memory
  • No pointer chasing
  • Deterministic behavior

Encoding Model

Each character is encoded in 6 bits.

Alphabet (64 symbols):

0-9
a-z
A-Z
_ $

Total: 10 + 26 + 26 + 2 = 64.

This allows exact packing inside fixed-size integer blocks.


Alphabet Model

All APIs operate on sixbit values internally.

Alphabet (64 symbols):

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$

Encoding rule:

  • 6 bits per character
  • ASCII only touched during pack/unpack/debug
  • Public API operates on sixbit values

Example:

u8 h = ps_char('H');     // encode ASCII → sixbit
ps_find_six(ps, h);      // search

You deliberately expose encoding — this is C, not a managed abstraction.


Packed Family Variants

The PackedString family is designed to scale by capacity:

Variant Max Chars Storage Size Use Case
packed8 10 64-bit small identifiers
packed16 20 128-bit general identifiers
packed24 30 192-bit long tokens
packed32 40 256-bit extended cases

Each variant:

  • Uses the same 6-bit encoding
  • Uses embedded metadata
  • Uses fixed layout
  • Avoids dynamic allocation

The current stable implementation is packed16.


Design Goals

  1. Portable C (no __uint128_t)
  2. No platform-specific intrinsics
  3. Two u64 for 128-bit representation
  4. Constant-time equality
  5. O(1) metadata access
  6. Branch-minimal critical paths
  7. VM-friendly

Packed16

Performance Characteristics

Operation Complexity Notes
length O(1) metadata bits
flags O(1) metadata bits
equal O(1) 2× u64 compare
packed_compare O(1) 120-bit compare
hash O(1) murmur finalizer
at O(1) direct bit extraction
substring O(1) bit shifting
concat O(1) bounded 20 chars
case conversion O(N ≤ 20) max 20 iterations
contains O(N) bounded small N
contains_six O(1) jump table

All operations are bounded by 20 characters maximum.

This gives predictable latency.


Error Model

Length field is 5 bits → 0–31.

Valid lengths: 0–20
21–30: reserved
31: INVALID state

Special states:

  • PSC_INVALID
  • PSC_NULL
  • PSC_EMPTY

This allows:

  • Optional-like semantics
  • Error propagation without extra memory
  • State tagging without changing structure size

Intended Usage

This is not a general string library.

Use when:

  • You need fixed-size identifiers
  • You want value-type strings
  • You want deterministic hashing
  • You want very fast comparisons
  • You are building a compiler, VM, DSL, DB, token system

Avoid when:

  • Arbitrary UTF-8 required
  • Large dynamic text required