refactor(web): split ContextState.analyzeTransition 🚂 #15770

jahorton wants to merge 2 commits into `epic/autocorrect` from …
Conversation
User Test Results: user tests are not required.
The implementation of `ContextState.analyzeTransition` is rather long and large, comprising multiple subsections. We can get clearer, more maintainable code if we split it into multiple pieces - especially if we can do so in a manner that allows individual pieces to be unit-tested.

Also, with whitespace fat-finger handling coming up, special handling for applied suggestions may be needed. This refactor may facilitate development of a suggestion-specialized variant (in #15775).

Build-bot: skip build:web
Test-bot: skip
```typescript
export function transitionTokenizations(
  precomputedTransitionSubsets: ReadonlyMap<string, TokenizationSubset>,
  transformDistribution: Distribution<Transform>
) {
```

Suggested change:

```diff
-) {
+): Map<string, ContextTokenization> {
```
```typescript
for(let baseTokenization of startTokenizations) {
  for(let mass of transformDistribution) {
    const tokenizationAnalysis = baseTokenization.mapWhitespacedTokenization(lexicalModel, mass.sample);
    const alignment = tokenizationAnalysis.alignment;

    // Pre-process any splits and merges; the result of these operations may
    // have the same properties as other base tokenizations within the
    // subset if compatible.
    const needsRealignment = (alignment.merges.length > 0 || alignment.splits.length > 0 || alignment.unmappedEdits.length > 0);
    const sourceTokenization = needsRealignment ? baseTokenization.realign(alignment) : baseTokenization;

    subsetBuilder.addPrecomputation(sourceTokenization, tokenizationAnalysis, mass.p);

    if(mass.sample == trueInput) {
      keyMatchingUserContext = subsetBuilder.keyer(tokenizationAnalysis);
    }
  }
}
```
Devin.ai flags this as a potential bug:

**Per-transform `realign` in precomputation loop creates fragmented subset entries and mutates alignment data**

In `precomputeTransitions`, calling `baseTokenization.realign(alignment)` per-transform (line 64) has two compounding problems:

1. `realign` mutates the alignment's `merges`/`splits` arrays via `.shift()` (context-tokenization.ts:527, 537). When `subsetBuilder.addPrecomputation()` subsequently calls the keyer, the alignment's merges/splits are empty, so `editKeyer` (tokenization-subsets.ts:83-99) omits the `M:` and `S:` key components. This can cause transforms with merges/splits to produce the same key as transforms without them.
2. Each `realign()` creates a new `ContextTokenization` instance (context-tokenization.ts:556). Since `Map` uses reference equality, different transforms that both require realignment produce separate `transitionEdges` entries in the same subset. When `transitionTokenizations` (transition-helpers.ts:122) finds `independentTransitionResults.length > 1`, it throws `"Multi-tokenization transitions not yet supported."` (line 129).

The old code avoided both issues: it passed the same `baseTokenization` reference to `addPrecomputation` for all transforms (accumulating into one `TransitionEdge`), and called `realign` only once after subset building, preserving alignment data for keying. This bug triggers when any fat-finger distribution contains a transform that causes a merge or split in the tokenization (e.g., word-boundary shifts from punctuation input).
> `realign` mutates the alignment's `merges`/`splits` arrays via `.shift()` (context-tokenization.ts:527, 537). When `subsetBuilder.addPrecomputation()` subsequently calls the keyer, the alignment's merges/splits are empty, so `editKeyer` (tokenization-subsets.ts:83-99) omits the `M:` and `S:` key components. This can cause transforms with merges/splits to produce the same key as transforms without them.

This is actually a goal of the logic here.

The idea is to handle the merges and splits before the rest of the tokenization-transition. By doing so, we can detect whether applying the actual inputs results in a tokenization pattern that matches other routes, so that merges and splits can actually converge with them.
To make that a lot clearer, suppose the following (forced as it is):

- [`can`, `'`] + `t`
- [`cans`] + `t`

The pre-merge step yields the following:

- [`can'`] + `t`
- [`cans`] + `t`

Both of these land with the same tokenization pattern:

- 5 codepoints
- same transform input count

They should converge to the same tokenization and be "clustered" together. Without the pre-merge / pre-split step, they would not do so. To be extra-explicit: the "keyer"'s returned key itself is the basis for determining when this convergence occurs.
... I should probably document that in a comment here so that it's much clearer down the line.
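As a rough illustration of the convergence idea described above (names here are purely illustrative stand-ins, not the engine's actual `ContextTokenization` or `editKeyer` API), routes whose tokenizations reduce to the same key land in the same `Map` entry and thus converge:

```typescript
// Hypothetical sketch: key-based clustering of tokenization routes.
type Tokenization = { tokens: string[] };

// Reduce a tokenization to the properties that matter for clustering:
// token count plus total codepoint count (an assumed, simplified keyer).
function keyFor(t: Tokenization): string {
  const codepoints = t.tokens.join("").length;
  return `${t.tokens.length}:${codepoints}`;
}

// The two routes from the example above, after the pre-merge step:
const routes: Tokenization[] = [
  { tokens: ["can'"] }, // [can, '] pre-merged into one token
  { tokens: ["cans"] }, // unchanged
];

const clusters = new Map<string, Tokenization[]>();
for (const r of routes) {
  const key = keyFor(r); // both routes yield "1:4"
  clusters.set(key, [...(clusters.get(key) ?? []), r]);
}
// clusters.size === 1: the two routes converge into a single cluster.
```

Without the pre-merge step, `[can, ']` would key as two tokens and never share a cluster with `[cans]`.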
> 2. Each `realign()` creates a new `ContextTokenization` instance (context-tokenization.ts:556). Since `Map` uses reference equality, different transforms that both require realignment produce separate `transitionEdges` entries in the same subset. When `transitionTokenizations` (transition-helpers.ts:122) finds `independentTransitionResults.length > 1`, it throws `"Multi-tokenization transitions not yet supported."` (line 129).
See #15384 - the error it mentions here is the very next thing I address.
```typescript
 * Generates all actual fixtures used in the tokenization-transition unit tests found below.
 * @returns
 */
function generateFixturesForTransitionSource() {
```
devin.ai flags this:

web/src/test/auto/headless/engine/predictive-text/worker-thread/context/transition-helpers.tests.ts:R221-316

**Test fixtures don't cover merge/split scenarios**

The test fixtures in `generateFixturesForTransitionSource` use simple append transforms on the 'cafe' tokenization (`['I', ' ', 'ate', '', 'at', ' ', 'a', ' ', 'cafe']`). None of the input distributions cause merges or splits in tokenization. The bug reported in transition-helpers.ts specifically triggers when merges/splits are present, which means it's untested. Consider adding test cases where a transform causes word-boundary shifts (e.g., typing an apostrophe that merges adjacent tokens, or a deletion that splits a merged token).
There was a problem hiding this comment.
Yeah... it took me most of this week to get the unit test scenarios even this far. I'm willing to further develop this test suite down the line, but I think it's wise to triage this task until later.
This is part of why the task took me the full week, though - I recognized the need for the utility methods to be capable of providing clear support for building such cases. Took a while to figure out a set of good patterns for that.
```typescript
// A shallow copy of the array is fine, but we'd be best off
// not aliasing the array itself.
if(stateToClone.suggestions?.length ?? 0 > 0) {
```
devin.ai flags this:

**Pre-existing operator precedence issue in clone constructor**

At context-state.ts:141, the expression `stateToClone.suggestions?.length ?? 0 > 0` is parsed as `stateToClone.suggestions?.length ?? (0 > 0)` due to `>` having higher precedence than `??`. The intent was likely `(stateToClone.suggestions?.length ?? 0) > 0`. However, this happens to work correctly in all cases: when `suggestions` is `undefined`, `undefined ?? false` yields `false` (skips the clone); when the length is 0, `0` is falsy (skips the clone); when the length is greater than 0, the truthy number enters the branch (clones). This is a pre-existing issue not introduced by this PR, but it's fragile and confusing.
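The parse in question is easy to check in isolation; the following is a standalone repro of the flagged expression, not the actual `ContextState` code:

```typescript
// `>` binds tighter than `??`, so `a ?? 0 > 0` parses as `a ?? (0 > 0)`.
function asWritten(suggestions: string[] | undefined): number | boolean {
  return suggestions?.length ?? 0 > 0;
}

// The likely intent, with explicit parentheses:
function asIntended(suggestions: string[] | undefined): boolean {
  return (suggestions?.length ?? 0) > 0;
}

// As the flag notes, all three cases happen to agree in truthiness:
asWritten(undefined); // false  (undefined ?? (0 > 0))
asWritten([]);        // 0      (falsy, so the branch is skipped)
asWritten(["a"]);     // 1      (truthy, so the branch is taken)
```

The results only coincide because `0 > 0` is `false` and a positive length is truthy; changing either comparison operand would silently change behavior, which is why the flag calls it fragile.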
```diff
  // context separately from pre-application contexts for predictions based on empty roots.
  const state = new ContextState(this);
- state.tokenization = startTokenizationsAfterSlide[0];
+ state.tokenization = [...startTokenizations.values()][0];
```
devin.ai flag:

**Unnecessary `[...startTokenizations.values()][0]` instead of `startTokenizations[0]`**

`startTokenizations` is already an Array (from `[this.tokenization].map(...)`), so `.values()` returns an array iterator. `[...startTokenizations.values()][0]` creates a temporary spread just to access element 0; it is functionally identical to `startTokenizations[0]`, but needlessly verbose, and it allocates an extra array. Not a bug, but a minor style concern.
This will become a proper array once whitespace fat-fingering lands; each possible variation in where the word boundaries land will be modeled by a separate `ContextTokenization` instance.
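The equivalence the flag points out can be verified directly; spreading an array's own iterator just produces a shallow copy (the objects below are stand-ins, not real `ContextTokenization` instances):

```typescript
// Spreading an Array's .values() iterator shallow-copies the array,
// so element 0 is the very same reference either way.
const startTokenizations = [{ id: "a" }, { id: "b" }]; // hypothetical stand-ins
const viaSpread = [...startTokenizations.values()][0];
console.log(viaSpread === startTokenizations[0]); // true
```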
```typescript
const bestProb = transformDistribution.reduce((best, curr) => Math.max(best, curr.p), 0);

// For all target tokenizations - each transition subset...
const finalTokenizations: Map<string, ContextTokenization> = new Map();
```
devin.ai flag:

web/src/engine/predictive-text/worker-thread/src/main/correction/transition-helpers.ts:R98-130

**`transitionTokenizations` evaluates ALL subsets, but only one result is used**

In context-state.ts:228-229, `transitionTokenizations(subsets, transformDistribution)` builds tokenizations for every subset, but only the one matching `trueInputSubsetKey` is retrieved via `.get()`. The extra work evaluates `evaluateTransition` (which builds `LegacyQuotientSpur` instances and search spaces) for subsets that are never used. This is a performance concern that will compound if the number of subsets grows with multi-tokenization support. Additionally, `trueInput.id` is used as the `transitionId` for ALL subsets (transition-helpers.ts:107), which is semantically incorrect for non-true-input subsets, though currently harmless since those results are discarded.
... for now, yes. The current system is designed with "one true tokenization" in mind, after all. This PR is but a step in the transition to moving past that.

Note that the unit tests added in this PR and in #15834 do use / consider the other subsets! They'll also be used down the line - that whole `this.tokenization` -> `this.tokenizations` change I mentioned in the comments above. `this.tokenizations` in the transitioned state will use many of the evaluated subsets. (Perhaps not all - some thresholding is desired, as it's not worth keeping very unlikely tokenization patterns around forever.)
```typescript
const startTokenizations = [this.tokenization].map((t) => {
  return t.applyContextSlide(lexicalModel, slideUpdateTransform);
});
```
This seems overly complex? Why turn this into an array if it only has a single entry?
Suggested change:

```diff
-const startTokenizations = [this.tokenization].map((t) => {
-  return t.applyContextSlide(lexicalModel, slideUpdateTransform);
-});
+const startTokenization = this.tokenization.applyContextSlide(lexicalModel, slideUpdateTransform);
```
Because it won't hold just a single entry once whitespace fat-fingering lands. `this.tokenization` is slated to become `this.tokenizations: ContextTokenization[]`.

Basically, I'm aiming to "telegraph" or hint at upcoming changes, structuring the code in a manner that will better model where things will land. Each possible variation in where the word boundaries land will be modeled by a separate `ContextTokenization` instance.
```diff
  // context separately from pre-application contexts for predictions based on empty roots.
  const state = new ContextState(this);
- state.tokenization = startTokenizationsAfterSlide[0];
+ state.tokenization = [...startTokenizations.values()][0];
```
I don't understand why this is necessary. Why can't you just say `startTokenizations[0]`, or, if you eliminate the seemingly extraneous array, then just `startTokenization`?

Suggested change:

```diff
-state.tokenization = [...startTokenizations.values()][0];
+state.tokenization = startTokenization;
```
```typescript
// Should gain one per subsetBuilder.subsets entry.
const realignedTokenization = baseTokenization.realign(tokenizationAnalysis.alignment);
const resultTokenization = realignedTokenization.evaluateTransition(tokenizationAnalysis, trueInput.id, bestProb);
const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions(startTokenizations, transformDistribution);
```
Suggested change:

```diff
-const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions(startTokenizations, transformDistribution);
+const { subsets, keyMatchingUserContext: trueInputSubsetKey } = precomputeTransitions([startTokenization], transformDistribution);
```
Following the other two comments... but if you only ever have a single token, does this need to be an array passed in?
See transition-helpers.tests.ts, line 386 - the tests there already pass in multiple ContextTokenization objects (word-boundary location variants), not just one!
Once whitespace fat-fingering is in full swing, "a single token" may rarely be the case outside of disabled-correction scenarios.
```typescript
// Of particular note - there may no longer be a specific, single set of edits
// for the transition; there will be different paths to reach a tokenization!
throw new Error("Multi-tokenization transitions not yet supported.");
```
Is this a TODO item for this epic or for future functionality?
See the next PR: #15834. It felt wrong to lump those changes in with this PR, so I split it off.