Add quantize operator (bf16->int8) #97

@albiol2004

Description:

Add a quantize operator that converts bfloat16 tensors to int8 with per-group symmetric scaling. This is the counterpart to the dequantize operator in #95.

Together with INT8 GEMM (#93), this completes the W8A8 quantized inference pipeline on the NPU:

bf16 activations → quantize (bf16→i8) → INT8 GEMM (i8×i8→i32) → dequant (i32→bf16) → bf16
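For concreteness, the end-to-end numerics can be sanity-checked with a minimal NumPy sketch (float32 stands in for bfloat16 since NumPy has no native bf16 dtype; per-tensor scales are used here for brevity, and all names are illustrative rather than this repo's API):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)  # stand-in for bf16 activations
w = rng.standard_normal((64, 8)).astype(np.float32)  # stand-in for bf16 weights

def quant_per_tensor(t):
    # symmetric: scale = max(|t|) / 127, so rounded values land in [-127, 127]
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -128, 127).astype(np.int8)
    return q, scale

xq, xs = quant_per_tensor(x)
wq, ws = quant_per_tensor(w)

acc = xq.astype(np.int32) @ wq.astype(np.int32)  # INT8 GEMM, i32 accumulation
y = acc.astype(np.float32) * (xs * ws)           # dequant i32 -> float

print(np.max(np.abs(y - x @ w)))  # quantization error stays small
```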

Proposed behavior:
- Input: bfloat16 tensor + group_size parameter
- Output: int8 tensor + bfloat16 scale factors (one per group)
- Scaling: symmetric, per-group, scale = max(abs(group)) / 127, out = clamp(round(in / scale), -128, 127) (a per-group reference sketch follows this list)
- Follows the dequant_i32 operator pattern (custom MLIROperator with mixed input/output dtypes)
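A reference implementation of the proposed per-group semantics might look like the following NumPy sketch (float32 again stands in for bfloat16; the function name and the all-zero-group guard are assumptions, not part of this proposal):

```python
import numpy as np

def quantize_groupwise(x: np.ndarray, group_size: int):
    """Per-group symmetric quantization along the last axis.

    Returns (int8 values, per-group scales); float32 stands in for bf16.
    """
    assert x.shape[-1] % group_size == 0
    groups = x.reshape(*x.shape[:-1], -1, group_size)           # (..., n_groups, group_size)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 127.0  # scale = max(abs(group)) / 127
    scale = np.where(scale == 0.0, 1.0, scale)                  # guard for all-zero groups (assumption)
    q = np.clip(np.round(groups / scale), -128, 127).astype(np.int8)
    return q.reshape(x.shape), scale.squeeze(-1).astype(np.float32)
```

For an (M, K) activation tensor with group size g, this yields an (M, K) int8 tensor and an (M, K // g) scale tensor, matching the one-scale-per-group output described above.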

Related:
- #93 : INT8 GEMM support
- #95 : Dequant i32→bf16 operator
- Existing iron/operators/dequant/ (int4→bf16) and iron/operators/dequant_i32/ as implementation references
