refactor(web): split ContextState.analyzeTransition 🚂 #15770

jahorton wants to merge 2 commits into `epic/autocorrect` from …
Conversation
User Test Results: user tests are not required.
The implementation of `ContextState.analyzeTransition` is rather long and large, comprising multiple subsections. We can get clearer, more maintainable code if we split it into multiple pieces - especially if we can do so in a manner that allows individual pieces to be unit-tested.

Also, with whitespace fat-finger handling coming up, special handling for applied suggestions may be needed. This refactor may facilitate development of a suggestion-specialized variant (in #15775).

Build-bot: skip build:web
Test-bot: skip
```typescript
export function transitionTokenizations(
  precomputedTransitionSubsets: ReadonlyMap<string, TokenizationSubset>,
  transformDistribution: Distribution<Transform>
) {
```

Suggested change:

```diff
-) {
+): Map<string, ContextTokenization> {
```
```typescript
for(let baseTokenization of startTokenizations) {
  for(let mass of transformDistribution) {
    const tokenizationAnalysis = baseTokenization.mapWhitespacedTokenization(lexicalModel, mass.sample);
    const alignment = tokenizationAnalysis.alignment;

    // Pre-process any splits and merges; the result of these operations may
    // have the same properties as other base tokenizations within the
    // subset if compatible.
    const needsRealignment = (alignment.merges.length > 0 || alignment.splits.length > 0 || alignment.unmappedEdits.length > 0);
    const sourceTokenization = needsRealignment ? baseTokenization.realign(alignment) : baseTokenization;

    subsetBuilder.addPrecomputation(sourceTokenization, tokenizationAnalysis, mass.p);

    if(mass.sample == trueInput) {
      keyMatchingUserContext = subsetBuilder.keyer(tokenizationAnalysis);
    }
  }
}
```
Devin.ai flags this as a potential bug:

**Per-transform `realign` in precomputation loop creates fragmented subset entries and mutates alignment data**

In `precomputeTransitions`, calling `baseTokenization.realign(alignment)` per-transform (line 64) has two compounding problems:

1. `realign` mutates the alignment's `merges`/`splits` arrays via `.shift()` (context-tokenization.ts:527, 537). When `subsetBuilder.addPrecomputation()` subsequently calls the keyer, the alignment's merges/splits are empty, so `editKeyer` (tokenization-subsets.ts:83-99) omits the `M:` and `S:` key components. This can cause transforms with merges/splits to produce the same key as transforms without them.
2. Each `realign()` creates a new `ContextTokenization` instance (context-tokenization.ts:556). Since `Map` uses reference equality, different transforms that both require realignment produce separate `transitionEdges` entries in the same subset. When `transitionTokenizations` (transition-helpers.ts:122) finds `independentTransitionResults.length > 1`, it throws `"Multi-tokenization transitions not yet supported."` (line 129).

The old code avoided both issues: it passed the same `baseTokenization` reference to `addPrecomputation` for all transforms (accumulating into one `TransitionEdge`), and called `realign` only once after subset building, preserving alignment data for keying. This bug triggers when any fat-finger distribution contains a transform that causes a merge or split in the tokenization (e.g., word-boundary shifts from punctuation input).
> `realign` mutates the alignment's `merges`/`splits` arrays via `.shift()` (context-tokenization.ts:527, 537). When `subsetBuilder.addPrecomputation()` subsequently calls the keyer, the alignment's merges/splits are empty, so `editKeyer` (tokenization-subsets.ts:83-99) omits the `M:` and `S:` key components. This can cause transforms with merges/splits to produce the same key as transforms without them.

This is actually a goal of the logic here.

The idea is to handle the merges and splits before the rest of the tokenization-transition. By doing so, we can detect whether applying the actual inputs results in a tokenization pattern that matches other routes, so that merges and splits can actually converge with them.
To make that a lot clearer, suppose the following (forced as it is):

- [`can`, `'`] + `t`
- [`cans`] + `t`

The pre-merge step yields the following:

- [`can'`] + `t`
- [`cans`] + `t`

Both of these land with the same tokenization pattern:

- 5 codepoints
- same transform input count

They should converge to the same tokenization and be "clustered" together. Without the pre-merge / pre-split step, they would not do so. To be extra-explicit: the "keyer"'s returned key itself is the basis for determining when this convergence occurs.
... I should probably document that in a comment here so that it's much clearer down the line.
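As a rough illustration of the convergence idea described above (names here are purely illustrative stand-ins, not the engine's actual `ContextTokenization` or `editKeyer` API), routes whose tokenizations reduce to the same key land in the same `Map` entry and thus converge:

```typescript
// Hypothetical sketch: key-based clustering of tokenization routes.
type Tokenization = { tokens: string[] };

// Reduce a tokenization to the properties that matter for clustering:
// token count plus total codepoint count (an assumed, simplified keyer).
function keyFor(t: Tokenization): string {
  const codepoints = t.tokens.join("").length;
  return `${t.tokens.length}:${codepoints}`;
}

// The two routes from the example above, after the pre-merge step:
const routes: Tokenization[] = [
  { tokens: ["can'"] }, // [can, '] pre-merged into one token
  { tokens: ["cans"] }, // unchanged
];

const clusters = new Map<string, Tokenization[]>();
for (const r of routes) {
  const key = keyFor(r); // both routes yield "1:4"
  clusters.set(key, [...(clusters.get(key) ?? []), r]);
}
// clusters.size === 1: the two routes converge into a single cluster.
```

Without the pre-merge step, `[can, ']` would key as two tokens and never share a cluster with `[cans]`.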
> 2. Each `realign()` creates a new `ContextTokenization` instance (context-tokenization.ts:556). Since `Map` uses reference equality, different transforms that both require realignment produce separate `transitionEdges` entries in the same subset. When `transitionTokenizations` (transition-helpers.ts:122) finds `independentTransitionResults.length > 1`, it throws `"Multi-tokenization transitions not yet supported."` (line 129).
See #15384 - the error it mentions here is the very next thing I address.
```typescript
 * Generates all actual fixtures used in the tokenization-transition unit tests found below.
 * @returns
 */
function generateFixturesForTransitionSource() {
```
devin.ai flags this:

web/src/test/auto/headless/engine/predictive-text/worker-thread/context/transition-helpers.tests.ts:R221-316

**Test fixtures don't cover merge/split scenarios**

The test fixtures in `generateFixturesForTransitionSource` use simple append transforms on the 'cafe' tokenization (`['I', ' ', 'ate', '', 'at', ' ', 'a', ' ', 'cafe']`). None of the input distributions cause merges or splits in tokenization. The bug reported in transition-helpers.ts specifically triggers when merges/splits are present, which means it's untested. Consider adding test cases where a transform causes word-boundary shifts (e.g., typing an apostrophe that merges adjacent tokens, or a deletion that splits a merged token).
There was a problem hiding this comment.
Yeah... it took me most of this week to get the unit test scenarios even this far. I'm willing to further develop this test suite down the line, but I think it's wise to triage this task until later.
This is part of why the task took me the full week, though - I recognized the need for the utility methods to be capable of providing clear support for building such cases. Took a while to figure out a set of good patterns for that.
```typescript
// A shallow copy of the array is fine, but we'd be best off
// not aliasing the array itself.
if(stateToClone.suggestions?.length ?? 0 > 0) {
```
devin.ai flags this:

**Pre-existing operator precedence issue in clone constructor**

At context-state.ts:141, the expression `stateToClone.suggestions?.length ?? 0 > 0` is parsed as `stateToClone.suggestions?.length ?? (0 > 0)` due to `>` having higher precedence than `??`. The intent was likely `(stateToClone.suggestions?.length ?? 0) > 0`. However, this happens to work correctly in all cases: when `suggestions` is `undefined`, `undefined ?? false` yields `false` (skips the clone); when the length is 0, `0` is falsy (skips the clone); when the length is greater than 0, the truthy number enters the branch (clones). This is a pre-existing issue not introduced by this PR, but it's fragile and confusing.
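The parse in question is easy to check in isolation; the following is a standalone repro of the flagged expression, not the actual `ContextState` code:

```typescript
// `>` binds tighter than `??`, so `a ?? 0 > 0` parses as `a ?? (0 > 0)`.
function asWritten(suggestions: string[] | undefined): number | boolean {
  return suggestions?.length ?? 0 > 0;
}

// The likely intent, with explicit parentheses:
function asIntended(suggestions: string[] | undefined): boolean {
  return (suggestions?.length ?? 0) > 0;
}

// As the flag notes, all three cases happen to agree in truthiness:
asWritten(undefined); // false  (undefined ?? (0 > 0))
asWritten([]);        // 0      (falsy, so the branch is skipped)
asWritten(["a"]);     // 1      (truthy, so the branch is taken)
```

The results only coincide because `0 > 0` is `false` and a positive length is truthy; changing either comparison operand would silently change behavior, which is why the flag calls it fragile.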
```diff
  // context separately from pre-application contexts for predictions based on empty roots.
  const state = new ContextState(this);
- state.tokenization = startTokenizationsAfterSlide[0];
+ state.tokenization = [...startTokenizations.values()][0];
```
devin.ai flag:

**Unnecessary `[...startTokenizations.values()][0]` instead of `startTokenizations[0]`**

`startTokenizations` is already an Array (from `[this.tokenization].map(...)`), so `.values()` returns an array iterator. `[...startTokenizations.values()][0]` creates a temporary spread just to access element 0; it is functionally identical to `startTokenizations[0]`, but needlessly verbose, and it allocates an extra array. Not a bug, but a minor style concern.
This will become a proper array once whitespace fat-fingering lands; each possible variation in where the word boundaries land will be modeled by a separate `ContextTokenization` instance.
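The equivalence the flag points out can be verified directly; spreading an array's own iterator just produces a shallow copy (the objects below are stand-ins, not real `ContextTokenization` instances):

```typescript
// Spreading an Array's .values() iterator shallow-copies the array,
// so element 0 is the very same reference either way.
const startTokenizations = [{ id: "a" }, { id: "b" }]; // hypothetical stand-ins
const viaSpread = [...startTokenizations.values()][0];
console.log(viaSpread === startTokenizations[0]); // true
```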
```typescript
const bestProb = transformDistribution.reduce((best, curr) => Math.max(best, curr.p), 0);

// For all target tokenizations - each transition subset...
const finalTokenizations: Map<string, ContextTokenization> = new Map();
```
devin.ai flag:

web/src/engine/predictive-text/worker-thread/src/main/correction/transition-helpers.ts:R98-130

**`transitionTokenizations` evaluates ALL subsets, but only one result is used**

In context-state.ts:228-229, `transitionTokenizations(subsets, transformDistribution)` builds tokenizations for every subset, but only the one matching `trueInputSubsetKey` is retrieved via `.get()`. The extra work evaluates `evaluateTransition` (which builds `LegacyQuotientSpur` instances and search spaces) for subsets that are never used. This is a performance concern that will compound if the number of subsets grows with multi-tokenization support. Additionally, `trueInput.id` is used as the `transitionId` for ALL subsets (transition-helpers.ts:107), which is semantically incorrect for non-true-input subsets, though currently harmless since those results are discarded.
... for now, yes. The current system is designed with "one true tokenization" in mind, after all. This PR is but a step in the transition to moving past that.

Note that the unit tests added in this PR and in #15834 do use / consider the other subsets! They'll also be used down the line - that whole `this.tokenization` -> `this.tokenizations` change I mentioned in the comments above. `this.tokenizations` in the transitioned state will use many of the evaluated subsets. (Perhaps not all - some thresholding is desired, as it's not worth keeping very unlikely tokenization patterns around forever.)
```typescript
const startTokenizations = [this.tokenization].map((t) => {
  return t.applyContextSlide(lexicalModel, slideUpdateTransform);
});
```
This seems overly complex? Why turn this into an array if it only has a single entry?
Suggested change:

```diff
-const startTokenizations = [this.tokenization].map((t) => {
-  return t.applyContextSlide(lexicalModel, slideUpdateTransform);
-});
+const startTokenization = this.tokenization.applyContextSlide(lexicalModel, slideUpdateTransform);
```
Because it won't hold just a single entry once whitespace fat-fingering lands. `this.tokenization` is slated to become `this.tokenizations: ContextTokenization[]`.

Basically, I'm aiming to "telegraph" or hint at upcoming changes, structuring the code in a manner that will better model where things will land. Each possible variation in where the word boundaries land will be modeled by a separate `ContextTokenization` instance.
```diff
  // context separately from pre-application contexts for predictions based on empty roots.
  const state = new ContextState(this);
- state.tokenization = startTokenizationsAfterSlide[0];
+ state.tokenization = [...startTokenizations.values()][0];
```
I don't understand why this is necessary. Why can't you just say `startTokenizations[0]`, or, if you eliminate the seemingly extraneous array, then just `startTokenization`?

Suggested change:

```diff
-state.tokenization = [...startTokenizations.values()][0];
+state.tokenization = startTokenization;
```
```typescript
// Should gain one per subsetBuilder.subsets entry.
const realignedTokenization = baseTokenization.realign(tokenizationAnalysis.alignment);
const resultTokenization = realignedTokenization.evaluateTransition(tokenizationAnalysis, trueInput.id, bestProb);
const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions(startTokenizations, transformDistribution);
```
Suggested change:

```diff
-const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions(startTokenizations, transformDistribution);
+const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions([startTokenization], transformDistribution);
```
Following the other two comments... but if you only ever have a single token, does this need to be an array passed in?
See transition-helpers.tests.ts, line 386 - the tests there already pass in multiple ContextTokenization objects (word-boundary location variants), not just one!
Once whitespace fat-fingering is in full swing, "a single token" may rarely be the case outside of disabled-correction scenarios.
```typescript
// Of particular note - there may no longer be a specific, single set of edits
// for the transition; there will be different paths to reach a tokenization!
throw new Error("Multi-tokenization transitions not yet supported.");
```
Is this a TODO item for this epic or for future functionality?
See the next PR: #15834. It felt wrong to lump those changes in with this PR, so I split it off.