Skip to content

Support matching bitstrings that are in fact binaries (size divisible by 8)#1978

Open
mat-hek wants to merge 5 commits intoatomvm:mainfrom
mat-hek:mf/upstream-support-aligned-bitstrings
Open

Support matching bitstrings that are in fact binaries (size divisible by 8)#1978
mat-hek wants to merge 5 commits intoatomvm:mainfrom
mat-hek:mf/upstream-support-aligned-bitstrings

Conversation

@mat-hek
Copy link
Contributor

@mat-hek mat-hek commented Nov 13, 2025

In some cases, there's generic code that uses the bitstring modifier because it needs to work for both bitstrings and binaries, for example in erl_eval. Currently such code doesn't work even if no actual bitstrings are involved, and this PR attempts to fix it. In other words, this now works:

defmodule Foo do
  def pattern_match(x, s) do
    case x do
      <<a::bitstring-size(s), b::bitstring>> ->
        {a, b}

      _ ->
        :no_match
    end
  end
end

Foo.pattern_match(<<234, 123, 2>>, 16)
# => {<<234, 123>>, <<2>>}

TODO:

  • add tests
  • adjust jit.erl

@mat-hek mat-hek force-pushed the mf/upstream-support-aligned-bitstrings branch 2 times, most recently from bb34fe8 to 7bb58f2 Compare November 13, 2025 14:37
@mat-hek mat-hek force-pushed the mf/upstream-support-aligned-bitstrings branch from 576fb54 to ffcc68d Compare November 18, 2025 11:21
mat-hek added a commit to software-mansion/popcorn that referenced this pull request Nov 18, 2025
…408)

Closes #393

Once atomvm/AtomVM#1978 is merged and
downstreamed, the eval_bits patch won't be needed (unless there are
other unrelated patches in there OFC)
@mat-hek mat-hek force-pushed the mf/upstream-support-aligned-bitstrings branch from ffcc68d to 89e5cf8 Compare November 19, 2025 09:42
@mat-hek mat-hek changed the title Support creating and matching bitstrings that are in fact binaries (size divisible by 8) Support matching bitstrings that are in fact binaries (size divisible by 8) Nov 19, 2025
@mat-hek mat-hek force-pushed the mf/upstream-support-aligned-bitstrings branch 3 times, most recently from 5aa8f8c to c2b0ede Compare November 26, 2025 08:58
Copy link
Collaborator

@bettio bettio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still an issue.
build-and-test (arm-linux-gnueabihf-gcc, -mcpu=cortex-a7 -mfloat-abi=hard -O2 -mthumb -mthumb-int is failing and this tells me there is a problem with JIT implementation.
We have to fix this before merging this PR. Maybe @pguyot has some advice here.

@mat-hek mat-hek force-pushed the mf/upstream-support-aligned-bitstrings branch from c2b0ede to 6c9843d Compare December 15, 2025 11:47
Copy link
Collaborator

@bettio bettio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests keep failing:

test_binary_to_term:
CRASH 
======
pid: <0.1.0>

Stacktrace:
[{test_binary_to_term,test_function,0,[{file,"/home/runner/work/AtomVM/AtomVM/tests/erlang_tests/test_binary_to_term.erl"},{line,316}]},{test_binary_to_term,start,0,[{file,"/home/runner/work/AtomVM/AtomVM/tests/erlang_tests/test_binary_to_term.erl"},{line,170}]}]

cp: #CP<module: 0, label: 52, offset: 2272>

x[0]: error
x[1]: {badmatch,<<1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,1,119,19,116,101,115,116,95,98,105,110,97,114,121,95,116,111,95,116,101,114,109,97,5,98,5,26,215,111,88,119,13,110,111,110,111,100,101,64,110,111,104,111,115,116,0,0,0,0,0,0,0,0,0,0,0,0,97,2>>}
x[2]: {2,2,74,1,[{0,8240},{0,27350}],error}

Stack 
-----

<<88,119,13,110,111,110,111,100,101,64,110,111,104,111,115,116,0,0,0,0,0,0,0,0,0,0,0,0>>
85645167
4
85
21
<<119,19,116,101,115,116,95,98,105,110,97,114,121,95,116,111,95,116,101,114,109>>
<<131,108,0,0,0,2,112,0,0,0,85,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,119,19,116,101,115,116,95,98,105,110,97,114,121,95,116,111,95,116,101,114,109,97,4,98,5,26,215,111,88,119,13,110,111,110,111,100,101,64,110,111,104,111,115,116,0,0,0,0,0,0,0,0,0,0,0,0,112,0,0,0,87,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,1,119,19,116,101,115,116,95,98,105,110,97,114,121,95,116,111,95,116,101,114,109,97,5,98,5,26,215,111,88,119,13,110,111,110,111,100,101,64,110,111,104,111,115,116,0,0,0,0,0,0,0,0,0,0,0,0,97,2,106>>
#CP<module: 0, label: 2, offset: 2760>
#CP<module: 0, label: 0, offset: 5424>


Mailbox
-------


Monitors
--------


**End Of Crash Report**
Expected 0 but result is not an integer
test_binary_to_term:FAILED

@mat-hek
Copy link
Contributor Author

mat-hek commented Dec 15, 2025

@bettio I'll try looking into it. There's also https://github.com/atomvm/AtomVM/actions/runs/20231091694/job/58073962162?pr=1978#step:18:596 failing, with disabled JIT 🤔 Instead of {<<123, 234>>, <<245>>}, {<<123, 234>>, <<234>>} is returned. Not sure if it's my changes or some other memory corruption

test_bs:
CRASH 
======
pid: <0.1.0>

Stacktrace:
[{test_bs,test_bs_match_bitstring_modifier,0,[{file,"/__w/AtomVM/AtomVM/tests/erlang_tests/test_bs.erl"},{line,688}]},{test_bs,start,0,[{file,"/__w/AtomVM/AtomVM/tests/erlang_tests/test_bs.erl"},{line,108}]}]

cp: #CP<module: 0, label: 399, offset: 42>

x[0]: error
x[1]: {badmatch,{<<123,234>>,<<234>>}}
x[2]: {2,2,49,1,[{0,902},{0,10758}],error}

end,
MSt12 = cond_jump_to_label({{free, SizeReg}, '<', BSOffsetReg1}, Fail, MMod, MSt11),
{MSt12, Size bsl 4};
{MSt12, (Size * Unit) div 8};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to bsl 4 here as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems there is a bug in this opcode. Working on it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #2113

You may want to rebase once it is merged.

ctx, jit_state, offset, ?UNSUPPORTED_ATOM
]);
true ->
MMod:sub(MSt10, SizeReg, (Size * Unit) div 8)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still need to bsl 4 the (Size * Unit) div 8. You could write (Size * Unit) bsl 1.

Indeed, SizeReg contains the size in bytes as a term ((Imm bsl 4) bor 16#F).

We could add a comment here to clarify.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see above).

mat-hek added 2 commits March 2, 2026 17:29
… by 8)

Signed-off-by: Mateusz Front <mateusz.front@swmansion.com>
Signed-off-by: Mateusz Front <mateusz.front@swmansion.com>
@mat-hek mat-hek force-pushed the mf/upstream-support-aligned-bitstrings branch from 6c9843d to 9a1b28e Compare March 2, 2026 16:35
@mat-hek mat-hek requested a review from pguyot March 2, 2026 16:42
@petermm
Copy link
Contributor

petermm commented Mar 2, 2026

Per AMP https://ampcode.com/threads/T-019cafb8-53f3-71de-9d8e-6def785206e8:

all caveats apply, the recommended fixes looks legit.

PR Review: Support matching bitstrings that are in fact binaries

Commits: 886acf3f0 + 9a1b28ea6 (CR fixup)
Author: Mateusz Front
Files: libs/jit/src/jit.erl, src/libAtomVM/opcodesswitch.h, tests/erlang_tests/test_bs.erl


Oracle Code Review Summary

🔴 Bug 1 — Interpreter sub-binary offset (opcodesswitch.h)

The term_maybe_create_sub_binary(bs_bin, bs_offset / unit, ...) still divides by unit instead of 8. Since bs_offset is in bits, this breaks for unit != 8 at non-zero offsets.

🔴 Bug 2 — JIT tagged integer arithmetic (jit.erl)

The expression (Size * Unit) bsl 1 relies on tagged-integer math, but the binary size field may be a raw byte count, not a tagged term. The comment says it should be (((Size * Unit) div 8) bsl 4) — these are only equivalent if the value is in tagged representation. This needs verification against AtomVM's actual term encoding.

🟡 Missing guards

  • No negative-size check after size_val = term_to_int(size) * unit — a negative size could become a huge size_t.
  • Potential integer overflow on size * unit multiplication in C.

🟡 Test gaps

The test only covers size=15 (non-aligned) and size=16 (aligned) at offset 0. Missing:

  • Non-zero offset with unit != 8
  • size=0
  • Negative dynamic size
  • Other unit values (e.g., unit=2, size=4)

✅ What looks good

  • The overall approach (allow bitstring match when size*unit is byte-aligned) is correct.
  • OP_BS_MATCH binary-command path has correct alignment math.
  • Error atoms (unsupported, badarg) are appropriate.

1. TL;DR

The change is conceptually right (treat /bitstring matches as binary matches when Size*Unit is byte-aligned), but there are two correctness issues:

  1. Interpreter OP_BS_GET_BINARY2 uses bs_offset / unit instead of bs_offset / 8 when creating the sub-binary (wrong for unit != 8).
  2. JIT OP_BS_GET_BINARY2 has a suspicious constant-size path: the comment/code around (Size * Unit) bsl 1 is only valid for tagged arithmetic, but the operands in this opcode appear to be raw byte counts, so this path likely miscomputes (even if the current test doesn't hit it).

2. Recommended Fixes

A. Fix the interpreter bug in OP_BS_GET_BINARY2

In src/libAtomVM/opcodesswitch.h:

Current code (bug):

term t = term_maybe_create_sub_binary(bs_bin, bs_offset / unit, size_val, &ctx->heap, ctx->global);

Should be:

term t = term_maybe_create_sub_binary(bs_bin, bs_offset / 8, size_val, &ctx->heap, ctx->global);

Because:

  • bs_offset is in bits (confirmed by surrounding code and term_get_match_state_offset usage)
  • term_maybe_create_sub_binary expects byte offset

This is the biggest correctness issue introduced/exposed by supporting non-8 units.

B. Fix JIT constant-size arithmetic in OP_BS_GET_BINARY2

In libs/jit/src/jit.erl around:

% Equivalent of SizeReg - (((Size * Unit) div 8) bsl 4)
MMod:sub(MSt10, SizeReg, (Size * Unit) bsl 1)

That identity only holds if SizeReg is a tagged small-int representing bytes and (Size*Unit) is the bit count. But in AtomVM, a binary's stored size field (boxed_value[1]) is a raw size_t/word count in bytes, not a tagged term (see term_alloc_refc_binary / term_binary_size).

The safe/simple fix is to compute bytes explicitly and subtract raw bytes:

MatchedBits = Size * Unit,
if MatchedBits rem 8 =/= 0 -> raise unsupported end,
MatchedBytes = MatchedBits div 8,
MMod:sub(MSt10, SizeReg, MatchedBytes)

Also update the comment to reflect raw byte arithmetic, not tagged math.

C. Add negative-size guards (interpreter + JIT)

Even if BEAM never emits negative sizes, AtomVM should not risk "negative length → huge size_t" style issues.

Interpreter: after size_val = term_to_int(size) * unit; add:

if (UNLIKELY(size_val < 0)) {
    RAISE_ERROR(BADARG_ATOM);
}

JIT: after converting dynamic size to int (term_to_int), add a conditional raise/jump if < 0 before doing mul / alignment checks.


3. Rationale and Trade-offs

  • The intended rule is: allow the match if MatchedBits = Size * Unit is divisible by 8 and the current offset is byte-aligned.
  • Interpreter OP_BS_MATCH (binary command) already does the right alignment math with matched_bits = size * unit and matched_bits % 8.
  • The /bitstring modifier with dynamic size is much more likely to go through OP_BS_GET_BINARY2, not OP_BS_MATCH (because OP_BS_MATCH's binary command decodes size as a literal).
  • Therefore OP_BS_GET_BINARY2 is the critical correctness path, and the offset division bug (/ unit) breaks exactly the new feature for non-8 units whenever the match doesn't start at offset 0.

Tagged arithmetic comment vs reality

Comment: "Equivalent of SizeReg - (((Size * Unit) div 8) bsl 4)" but code does (Size * Unit) bsl 1. Is that correct?

  • Mathematically, if you are working in tagged small-int bytes, then ((bits div 8) bsl 4) equals (bits bsl 1) only when bits is divisible by 8, because (bits/8)*16 = bits*2.
  • But in AtomVM's binary layout, the stored size is raw bytes, not tagged. In that world the subtraction should be by bits div 8, not by bits bsl 1.
  • The identity is fine in a tagged-bytes universe, but it's very likely applied to the wrong representation here, making the code incorrect.

4. Risks and Guardrails

  1. Integer overflow in C (matched_bits = size * unit; with int):

    • Use int64_t or size_t and check multiplication overflow if you want hardening.
    • Minimal guardrail: cast to int64_t and reject matched_bits < 0 or matched_bits > term_binary_size(bs_bin)*8.
  2. Negative sizes:

    • Without explicit guards, negative values can slip into divisions/conversions and become huge size_t lengths when allocating sub-binaries.
  3. JIT vs interpreter divergence:

    • After fixing, ensure both paths:
      • Require bs_offset % 8 == 0
      • Require (Size*Unit) % 8 == 0 (else unsupported for /bitstring cases)
      • Use byte offsets consistently (/8)

5. Test Coverage Review

What the test covers

test_bs_match_bitstring_modifier/0 currently checks:

  • Size=15, unit=1 → expects error:unsupported on AtomVM (good; not byte-aligned)
  • Size=16, unit=1 → expects success and correct split (good; byte-aligned)

Missing high-value cases

  1. Non-zero offset + unit != 8

    bitstring_match_offset(BS) ->
        <<_Skip:8, Matched:16/bitstring-unit:1, Rest/bits>> = BS,
        {Matched, Rest}.

    Validate output on <<1,2,3,4>>Matched = <<2,3>>, Rest = <<4>>.

  2. Unit != 8 but byte-aligned

    • Size=4, unit=2 (8 bits → 1 byte)
    • Size=1, unit=16 (16 bits → 2 bytes)
  3. Size=0

    • <<Matched:0/bitstring-unit:1, Rest/bits>> should yield Matched = <<>> and Rest = original.
  4. Negative size (runtime)

    • Size = -1 should raise badarg (not crash, not allocate).

6. Optional Advanced Path

Centralize the "bitstring-as-binary allowed?" logic into a helper used by both interpreter and JIT:

  • Validate Size >= 0, Unit > 0
  • Compute MatchedBits with overflow checks
  • Enforce MatchedBits % 8 == 0
  • Compute MatchedBytes = MatchedBits / 8

This reduces the chance that JIT and interpreter drift again.

If you do only one thing beyond the core fix: fix /unit/8 in interpreter OP_BS_GET_BINARY2, because that's a concrete correctness bug that will surface as soon as anyone matches byte-aligned bitstrings at non-zero offsets with units other than 8.

Signed-off-by: Mateusz Front <mateusz.front@swmansion.com>
{MSt11, SizeReg};
is_integer(Size) ->
% SizeReg is binary size
% SizeReg contains binary size in bytes as a term
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is true.
SizeReg is just the size in bytes AFAIK, not a term. We got it from get_array_element(...,1): it's the second word of the boxed binary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I accidentally pushed a half baked thing. But I got suggested by

Indeed, SizeReg contains the size in bytes as a term

#1978 (comment)

MSt11 = MMod:sub(MSt10, SizeReg, SizeBytes),
MSt11 =
if
(Size * Unit) rem 8 =/= 0 ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Size is a tagged integer so you don't want to multiply it by Unit.

Since Unit may not be 8, you want to rename SizeBytes into SizeInUnits.

]);
true ->
% Equivalent of SizeReg - (((SizeBytes * Unit) div 8) bsl 4)
MMod:sub(MSt10, SizeReg, (SizeBytes * Unit) bsl 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since SizeReg is the size in bytes I'm not sure this is correct

@petermm
Copy link
Contributor

petermm commented Mar 4, 2026

had tokens surplus - llm claims these two commits: https://github.com/petermm/AtomVM/commits/test1978/

all caveats apply.

petermm added 2 commits March 4, 2026 12:23
Fix several issues in the JIT OP_BS_GET_BINARY2 code generation
introduced by the bitstring-as-binary matching support.

- Rename `SizeBytes` to `SizeInUnits` to reflect that `Size bsr 4`
  gives the count in units, not bytes. The actual byte count is
  `(SizeInUnits * Unit) div 8`.
- Use `SizeInUnits * Unit` (not `Size * Unit`) in the alignment check,
  since `Size` is a tagged integer and must be untagged first.
- Subtract raw bytes `(SizeInUnits * Unit) div 8` from `SizeReg`
  instead of using `bsl 1`, since `SizeReg` holds a raw byte count
  (not a tagged term).

- Add fast-path for `Unit =:= 8` that skips the `mul`/`if_block`/
  `shift_right` sequence entirely, since size-in-units already equals
  size-in-bytes. This avoids unnecessary codegen on ARM32 Thumb where
  register pressure is high.
- Fix register leak for `Unit =/= 8`: pass `{free, SizeValReg}` to
  `shift_right/3` so it reuses the register in-place instead of
  allocating a new one. Without this, `SizeValReg` (holding bits)
  remains allocated while the new result register is never freed,
  which can exhaust available GPRs on ARM32 and cause incorrect
  code generation.

Fix tagged/raw integer confusion in JIT OP_BS_GET_BINARY2
constant-size path.

- SizeReg holds raw bytes, not a tagged term. Subtract raw bytes
  `(SizeBytes * Unit) div 8` instead of `(SizeBytes * Unit) bsl 1`.
- Use untagged `SizeBytes` in the alignment check instead of tagged
  `Size`, since `Size * Unit` on a tagged integer gives a wrong result.
- Fix comment: SizeReg is "raw bytes (not a term)".

Signed-off-by: Peter M <petermm@gmail.com>
Signed-off-by: Mateusz Front <mateusz.front@swmansion.com>
Expand test_bs_match_bitstring_modifier with additional
coverage for edge cases identified during code review:

- Non-zero offset with dynamic size
- Size=0 (empty binary match)
- Negative dynamic size (should error, not crash)
- Constant size 16/bitstring (compile-time path)
- Non-zero offset with constant size

These exercise both the JIT constant-size (is_integer)
and dynamic-size (runtime variable) code paths, as well
as the interpreter's bs_offset byte-alignment logic.

Signed-off-by: Peter M <petermm@gmail.com>
Signed-off-by: Mateusz Front <mateusz.front@swmansion.com>
@mat-hek
Copy link
Contributor Author

mat-hek commented Mar 4, 2026

Thanks @petermm - it looks fine to me, I cherry-picked these commits with minor changes

MMod:free_native_registers(BSt4, [SizeValReg])
{BSt2, SizeValReg2} =
if
is_integer(SizeValReg) ->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this cannot happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants