
refactor(web): split ContextState.analyzeTransition 🚂 #15770

Open
jahorton wants to merge 2 commits into epic/autocorrect from refactor/web/split-analyze-transition

Conversation

Contributor

@jahorton commented Mar 18, 2026

The implementation of ContextState.analyzeTransition is rather long, composed of multiple subsections. We can get clearer, more maintainable code by splitting it into multiple pieces - especially if we can do so in a manner that allows individual pieces to be unit-tested.

Also, with whitespace fat-finger handling coming up, special handling for applied suggestions may be needed. This refactor may facilitate development of a suggestion-specialized variant (in #15775).

Build-bot: skip build:web
Test-bot: skip


keymanapp-test-bot bot commented Mar 18, 2026

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

  • Web
    • KeymanWeb Test Home - build : all tests passed (no artifacts on BuildLevel "build")

@keymanapp-test-bot bot changed the title refactor(web): split ContextState.analyzeTransition refactor(web): split ContextState.analyzeTransition 🚂 Mar 18, 2026
@keymanapp-test-bot bot added this to the A19S25 milestone Mar 18, 2026
@jahorton force-pushed the refactor/web/split-analyze-transition branch from ab5ee30 to da0d7aa March 20, 2026 13:25
@jahorton changed the base branch from feat/web/map-suggestions-to-source-tokenization to refactor/web/create-default-keep March 20, 2026 13:26
@keyman-server modified the milestones: A19S25, A19S26 Mar 28, 2026
@jahorton force-pushed the refactor/web/create-default-keep branch from a53305b to d285478 April 2, 2026 19:26
@jahorton force-pushed the refactor/web/split-analyze-transition branch from da0d7aa to b908a23 April 6, 2026 16:55
@jahorton changed the base branch from refactor/web/create-default-keep to change/web/context-token-init April 6, 2026 16:56
Base automatically changed from change/web/context-token-init to epic/autocorrect April 9, 2026 19:53
@jahorton force-pushed the refactor/web/split-analyze-transition branch from 716b223 to 0e29439 April 9, 2026 20:18
@jahorton force-pushed the refactor/web/split-analyze-transition branch from 0e29439 to 938d5ed April 9, 2026 20:31
@jahorton marked this pull request as ready for review April 9, 2026 21:05
export function transitionTokenizations(
  precomputedTransitionSubsets: ReadonlyMap<string, TokenizationSubset>,
  transformDistribution: Distribution<Transform>
) {
Contributor

Suggested change
) {
): Map<string, ContextTokenization> {

Comment on lines +55 to +71
for(let baseTokenization of startTokenizations) {
  for(let mass of transformDistribution) {
    const tokenizationAnalysis = baseTokenization.mapWhitespacedTokenization(lexicalModel, mass.sample);
    const alignment = tokenizationAnalysis.alignment;

    // Pre-process any splits and merges; the result of these operations may
    // have the same properties as other base tokenizations within the
    // subset if compatible.
    const needsRealignment = (alignment.merges.length > 0 || alignment.splits.length > 0 || alignment.unmappedEdits.length > 0);
    const sourceTokenization = needsRealignment ? baseTokenization.realign(alignment) : baseTokenization;

    subsetBuilder.addPrecomputation(sourceTokenization, tokenizationAnalysis, mass.p);

    if(mass.sample == trueInput) {
      keyMatchingUserContext = subsetBuilder.keyer(tokenizationAnalysis);
    }
  }
Contributor

Devin.ai flags this as a potential bug:

Per-transform realign in precomputation loop creates fragmented subset entries and mutates alignment data

In precomputeTransitions, calling baseTokenization.realign(alignment) per-transform (line 64) has two compounding problems:

  1. realign mutates the alignment's merges/splits arrays via .shift() (context-tokenization.ts:527,537). When subsetBuilder.addPrecomputation() subsequently calls the keyer, the alignment's merges/splits are empty, so editKeyer (tokenization-subsets.ts:83-99) omits the M: and S: key components. This can cause transforms with merges/splits to produce the same key as transforms without them.

  2. Each realign() creates a new ContextTokenization instance (context-tokenization.ts:556). Since Map uses reference equality, different transforms that both require realignment produce separate transitionEdges entries in the same subset. When transitionTokenizations (transition-helpers.ts:122) finds independentTransitionResults.length > 1, it throws "Multi-tokenization transitions not yet supported." (line 129).

The old code avoided both issues: it passed the same baseTokenization reference to addPrecomputation for all transforms (accumulating into one TransitionEdge), and called realign only once after subset building, preserving alignment data for keying. This bug triggers when any fat-finger distribution contains a transform that causes a merge or split in the tokenization (e.g., word-boundary shifts from punctuation input).
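As a standalone illustration of the Map detail in point 2 (plain JavaScript semantics; the object shapes here are invented for the example, not code from this PR):

// Object keys in a Map compare by reference, not by structure; two
// structurally identical tokenizations produced by separate realign() calls
// therefore occupy separate entries rather than coalescing.
const a = { tokens: ["can'", "t"] };
const b = { tokens: ["can'", "t"] };

const transitionEdges = new Map<object, string>();
transitionEdges.set(a, "edge A");
transitionEdges.set(b, "edge B");
console.log(transitionEdges.size); // 2: distinct entries despite equal contents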

Contributor Author

@jahorton Apr 10, 2026

  1. realign mutates the alignment's merges/splits arrays via .shift() (context-tokenization.ts:527,537). When subsetBuilder.addPrecomputation() subsequently calls the keyer, the alignment's merges/splits are empty, so editKeyer (tokenization-subsets.ts:83-99) omits the M: and S: key components. This can cause transforms with merges/splits to produce the same key as transforms without them.

This is actually a goal of the logic here.

The idea is to handle the merges and splits before the rest of the tokenization-transition. By doing so, we can detect whether applying the actual inputs results in tokenization patterns that match other routes, so that merges and splits can actually converge with them.

To make that a lot clearer, suppose the following (forced as it is):

  • [can, '] + t
  • [cans] + t

The pre-merge step yields the following:

  • [can'] + t
  • [cans] + t

Which both land with the same tokenization pattern:

  • 5 codepoints
  • same transform input count

They should converge to the same tokenization and be "clustered" together. Without the pre-merge / pre-split step, they would not do so. To be extra-explicit: the key returned by the keyer is itself the basis for determining when this convergence occurs.

... I should probably document that in a comment here so that it's much clearer down the line.
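A minimal sketch of that convergence, assuming a hypothetical keyer reduced to just the two properties named above (the real editKeyer includes more components, such as the M: and S: parts discussed earlier):

// Hypothetical, stripped-down keyer: routes converge when their post-merge
// codepoint counts and transform input counts match.
function toyKeyer(tokens: string[], inputCount: number): string {
  const codepointCounts = tokens.map(t => [...t].length).join(",");
  return `len:${codepointCounts}|n:${inputCount}`;
}

// Route 1: [can, '] pre-merges to [can']; appending t yields "can't".
// Route 2: [cans] needs no merge; appending t yields "canst".
const route1 = toyKeyer(["can't"], 1); // "len:5|n:1"
const route2 = toyKeyer(["canst"], 1); // "len:5|n:1"
console.log(route1 === route2);        // true: the two routes cluster together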

Contributor Author

2. Each realign() creates a new ContextTokenization instance (context-tokenization.ts:556). Since Map uses reference equality, different transforms that both require realignment produce separate transitionEdges entries in the same subset. When transitionTokenizations (transition-helpers.ts:122) finds independentTransitionResults.length > 1, it throws "Multi-tokenization transitions not yet supported." (line 129).

See #15834 - the error it mentions here is the very next thing I address.

/**
 * Generates all actual fixtures used in the tokenization-transition unit tests found below.
 * @returns
 */
function generateFixturesForTransitionSource() {
Contributor

devin.ai flags this:

web/src/test/auto/headless/engine/predictive-text/worker-thread/context/transition-helpers.tests.ts:R221-316

Test fixtures don't cover merge/split scenarios

The test fixtures in generateFixturesForTransitionSource use simple append transforms on the 'cafe' tokenization (['I', ' ', 'ate', '', 'at', ' ', 'a', ' ', 'cafe']). None of the input distributions cause merges or splits in tokenization. The bug reported in transition-helpers.ts specifically triggers when merges/splits are present, which means it's untested. Consider adding test cases where a transform causes word-boundary shifts (e.g., typing an apostrophe that merges adjacent tokens, or a deletion that splits a merged token).

Contributor Author

Yeah... it took me most of this week to get the unit test scenarios even this far. I'm willing to further develop this test suite down the line, but I think it's wise to triage this task until later.

This is part of why the task took me the full week, though - I recognized the need for the utility methods to be capable of providing clear support for building such cases. Took a while to figure out a set of good patterns for that.


// A shallow copy of the array is fine, but we'd be best off
// not aliasing the array itself.
if(stateToClone.suggestions?.length ?? 0 > 0) {
Contributor

devin.ai flags this:

Pre-existing operator precedence issue in clone constructor

At context-state.ts:141, the expression stateToClone.suggestions?.length ?? 0 > 0 is parsed as stateToClone.suggestions?.length ?? (0 > 0) due to > having higher precedence than ??. The intent was likely (stateToClone.suggestions?.length ?? 0) > 0. However, this happens to work correctly in all cases: when suggestions is undefined, undefined ?? false = false (skips clone); when length is 0, 0 is falsy (skips clone); when length > 0, the truthy number enters the branch (clones). This is a pre-existing issue not introduced by this PR, but it's fragile and confusing.
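The parse in question as a standalone snippet, with parentheses added to make the grouping explicit:

// `>` binds tighter than `??`, so the right-hand operand of `??` is the
// constant boolean (0 > 0), i.e. false, not a comparison of the whole result.
function checkSuggestions(suggestions: string[] | undefined): void {
  const asWritten  = suggestions?.length ?? (0 > 0); // how the engine groups it
  const asIntended = (suggestions?.length ?? 0) > 0; // the likely intent
  console.log(asWritten, asIntended);
}

// undefined -> false vs. false; [] -> 0 (falsy) vs. false; ["x"] -> 1 (truthy)
// vs. true. The branch outcome agrees in every case, so the bug stays latent.
checkSuggestions(undefined);
checkSuggestions([]);
checkSuggestions(["keep"]);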

// context separately from pre-application contexts for predictions based on empty roots.
const state = new ContextState(this);
- state.tokenization = startTokenizationsAfterSlide[0];
+ state.tokenization = [...startTokenizations.values()][0];
Contributor

devin.ai flag:

Unnecessary [...startTokenizations.values()][0] instead of startTokenizations[0]

startTokenizations is already an Array (from [this.tokenization].map(...)) so .values() returns an array iterator. [...startTokenizations.values()][0] creates a temporary spread just to access element 0 — functionally identical to startTokenizations[0] but needlessly verbose and allocates an extra array. Not a bug, but a minor style concern.

Contributor Author

This will become a proper array once whitespace fat-fingering lands; each possible variation in where the word boundaries land will be modeled by a separate ContextTokenization instance.

const bestProb = transformDistribution.reduce((best, curr) => Math.max(best, curr.p), 0);

// For all target tokenizations - each transition subset...
const finalTokenizations: Map<string, ContextTokenization> = new Map();
Contributor

devin.ai flag:

web/src/engine/predictive-text/worker-thread/src/main/correction/transition-helpers.ts:R98-130

transitionTokenizations evaluates ALL subsets but only one result is used

In context-state.ts:228-229, transitionTokenizations(subsets, transformDistribution) builds tokenizations for every subset, but only the one matching trueInputSubsetKey is retrieved via .get(). The extra work evaluates evaluateTransition (which builds LegacyQuotientSpur instances and search spaces) for subsets that are never used. This is a performance concern that will compound if the number of subsets grows with multi-tokenization support. Additionally, trueInput.id is used as the transitionId for ALL subsets (transition-helpers.ts:107), which is semantically incorrect for non-true-input subsets — though currently harmless since those results are discarded.

Contributor Author

... for now, yes. The current system is designed with "one true tokenization" in mind, after all. This PR is but a step toward moving past that.

Note that the unit tests added in this PR and in #15834 do use / consider the other subsets! They'll also be used down the line - that whole this.tokenization -> this.tokenizations thing I mentioned in the comments above. this.tokenizations in the transitioned state will use many of the evaluated subsets. (Perhaps not all - some thresholding is desired, as it's not worth keeping very unlikely tokenization patterns around forever.)
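A hypothetical shape for that thresholding (the probability field and the constant are assumptions for illustration, not part of this PR):

// Hypothetical: retain transitioned tokenizations whose probability clears a
// floor relative to the best-scoring one, so very unlikely word-boundary
// patterns aren't kept around forever.
const KEEP_RATIO = 0.01; // assumed tuning constant
const keptTokenizations = [...finalTokenizations.values()]
  .filter(tokenization => tokenization.p >= bestProb * KEEP_RATIO);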

Comment on lines +211 to +213
const startTokenizations = [this.tokenization].map((t) => {
  return t.applyContextSlide(lexicalModel, slideUpdateTransform);
});
Member

This seems overly complex? Why turn this into an array if it only has a single entry?

Suggested change
const startTokenizations = [this.tokenization].map((t) => {
  return t.applyContextSlide(lexicalModel, slideUpdateTransform);
});
const startTokenization = this.tokenization.applyContextSlide(lexicalModel, slideUpdateTransform);

Contributor Author

Because it won't be a single-entry array once whitespace fat-fingering lands. this.tokenization is slated to become this.tokenizations: ContextTokenization[].

Basically, I'm aiming to "telegraph" or hint at upcoming changes, structuring the code in a manner that will better model where things will land. Each possible variation in where the word boundaries land will be modeled by a separate ContextTokenization instance.
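A sketch of that intended end state, assuming the future this.tokenizations: ContextTokenization[] field (which does not exist yet; the exact shape may differ):

// Hypothetical future call site: the same .map() shape then covers every
// word-boundary variant without structural change.
const startTokenizations = this.tokenizations.map((t) =>
  t.applyContextSlide(lexicalModel, slideUpdateTransform)
);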

// context separately from pre-application contexts for predictions based on empty roots.
const state = new ContextState(this);
- state.tokenization = startTokenizationsAfterSlide[0];
+ state.tokenization = [...startTokenizations.values()][0];
Member

I don't understand why this is necessary. Why can't you just say startTokenizations[0], or if you eliminate the seemingly extraneous array then just startTokenization?

Suggested change
state.tokenization = [...startTokenizations.values()][0];
state.tokenization = startTokenization;

// Should gain one per subsetBuilder.subsets entry.
const realignedTokenization = baseTokenization.realign(tokenizationAnalysis.alignment);
const resultTokenization = realignedTokenization.evaluateTransition(tokenizationAnalysis, trueInput.id, bestProb);
const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions(startTokenizations, transformDistribution);
Member

Suggested change
const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions(startTokenizations, transformDistribution);
const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions([startTokenization], transformDistribution);

Following the other two comments... but if you only ever have a single token, does this need to be an array passed in?

Contributor Author

See transition-helpers.tests.ts, line 386 - the tests there already pass in multiple ContextTokenization objects (word-boundary location variants), not just one!

Once whitespace fat-fingering is in full swing, "a single token" may rarely be the case outside of disabled-correction scenarios.

Comment on lines +127 to +129
// Of particular note - there may no longer be a specific, single set of edits
// for the transition; there will be different paths to reach a tokenization!
throw new Error("Multi-tokenization transitions not yet supported.");
Member

Is this a TODO item for this epic or for future functionality?

Contributor Author

See the next PR: #15834. It felt wrong to lump those changes in with this PR, so I split it off.

@keyman-server modified the milestones: A19S26, A19S27 Apr 14, 2026