Splitting out AstNode to AstNode, Block, Pipeline, StatementNode, ExpressionNode. by WindSoilder · Pull Request #68 · nushell/new-nu-parser

WindSoilder · 2026-03-16T07:54:22Z

As the title says, this PR is a large refactor of the codebase: it splits the general AstNode into several more specific node types.

- A BlockNode, which contains
  - a list of StatementNode or ExpressionNode

We can look into enum StatementNode and enum ExpressionNode to see what they can be.

This differs a little from another PR, #54: this PR stores all of these nodes in Compiler, which now has the following fields:

ast_nodes (contains a smaller subset of the original AstNode variants)
expression_nodes
block_nodes
pipeline_nodes
name_nodes (explained in next paragraph)
string_nodes (explained in next paragraph)
variable_nodes (explained in next paragraph)

For ExpressionNode, I duplicated the storage of NameNode, StringNode and VariableNode into separate places, because they are reused in many places. With that in place, StatementNode::Let and StatementNode::Def become easier to define.

After pushing these nodes, we also put an Indexer into Compiler.indexer, so all nodes can still be retrieved sequentially from this field.

Every type of node has its own id type, for example:

NameNode: NameNodeId
StringNode: StringNodeId
VariableNode: VariableNodeId

We also define a NodeIdGetter trait for node ids and a NodePush trait for nodes, so we can push nodes to the compiler and get node information back from it easily.
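
To make the typed-id idea concrete, here is a minimal sketch of how such a pair of traits could look. Only the names NodeIdGetter, NodePush, StringNode, StringNodeId and Compiler come from this PR; the method names, associated types and fields are assumptions made for illustration and are not taken from the diff.

```rust
// Hypothetical sketch: these signatures are illustrative, not the PR's actual code.
#[derive(Debug, Clone, PartialEq)]
pub struct StringNode(pub String);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StringNodeId(pub usize);

#[derive(Default)]
pub struct Compiler {
    pub string_nodes: Vec<StringNode>,
}

/// Implemented by node types: pushing a node yields its typed id.
pub trait NodePush {
    type Output;
    fn push_node(self, compiler: &mut Compiler) -> Self::Output;
}

/// Implemented by id types: each id type resolves to exactly one node type.
pub trait NodeIdGetter {
    type Output;
    fn get_node<'a>(&self, compiler: &'a Compiler) -> &'a Self::Output;
}

impl NodePush for StringNode {
    type Output = StringNodeId;
    fn push_node(self, compiler: &mut Compiler) -> StringNodeId {
        compiler.string_nodes.push(self);
        StringNodeId(compiler.string_nodes.len() - 1)
    }
}

impl NodeIdGetter for StringNodeId {
    type Output = StringNode;
    fn get_node<'a>(&self, compiler: &'a Compiler) -> &'a StringNode {
        &compiler.string_nodes[self.0]
    }
}

fn main() {
    let mut compiler = Compiler::default();
    let id = StringNode("hello".to_string()).push_node(&mut compiler);
    // No match on a general AstNode is needed: the id type fixes the node type.
    println!("{:?} -> {:?}", id, id.get_node(&compiler));
}
```

With this shape, passing a StringNodeId where another id type is expected is a compile error rather than a runtime panic.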

What we can achieve with this PR:

Better runtime performance, because we no longer need to check the AstNode variant every time we fetch a node: a dedicated typed id is far more likely to give us the right node.
Fewer opportunities for bugs, because Rust's type system ensures we use the right node in the right place.

What might get worse with this PR:

Some space overhead, because we need more vectors to store these nodes.
More complicated output in display_state method, especially in Compiler, because it outputs the indirect NameNode, StringNode, VariableNode, Statement etc as well.

I manually ran the tests for the files under tests to update the snapshots, but something might still be broken.

stormasm · 2026-04-19T15:39:24Z

@WindSoilder Let us wait for a couple of weeks to see if @kubouch or @ysthakur has the time to review this PR and give feedback, as they are the most knowledgeable about how things should work.

stormasm · 2026-04-19T15:40:47Z

In the meantime there is no risk of code rot etc... because we will not land any other PRs here until your PR is resolved...

stormasm · 2026-04-24T05:20:30Z

On April 23 @kubouch mentioned on the core team channel that when he has some free time he will take a look at this PR. So let's wait for his feedback 😄

kubouch

I first made the inline comments as I was understanding the code before writing this, so this message is the primary feedback. Some of the inline feedback may not be relevant.

I think the PR targets the right issue of having a more programmer-friendly interface to the AST nodes, but I'm not convinced about this approach because it adds a lot of complexity of its own. We now have four "top-level" vectors of nodes (plus some extra for strings, variables, etc.), along with extra hash maps and several new traits. Furthermore, one of the arguments was performance, so I ran the benchmarks and it seems the parser stage can be 2x slower and all stages except lexing take a performance hit with this PR. (The benchmarks seem a bit broken, but you can still run them with cargo export target/benchmarks/ast_nodes -- bench, then target/benchmarks/ast_nodes/benchmarks solo -s 20 --warmup true; this requires the cargo-export plugin.)

I checked the code for some offending pattern that we'd like to improve and found this:

    fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
        let AstNode::Block(block_id) = self.compiler.ast_nodes[node_id.0] else {
            panic!(
                "Expected block to typecheck, got '{:?}'",
                self.compiler.ast_nodes[node_id.0]
            );
        };
        let block = &self.compiler.blocks[block_id.0];
        (...)

I think this snippet illustrates the problem well: we get a generic NodeId and then assume it's a block (and panic if it's not). What we could do is refactor it like this:

    fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
        let block = &self.compiler.get_block(node_id);
        (...)

where

impl Compiler {
    pub fn get_block(&self, node_id: NodeId) -> &Block {
        let AstNode::Block(block_id) = self.ast_nodes[node_id.0] else {
            panic!(
                "Expected block, got '{:?}'",
                self.ast_nodes[node_id.0]
            );
        };
        &self.blocks[block_id.0]
    }
}
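
For readers who want to try the helper in isolation, here is a self-contained sketch. The stand-in types (the two-variant AstNode, the fields of Block and Compiler) are simplified assumptions; the real definitions live in the new-nu-parser crate and are larger.

```rust
// Stand-in types for illustration only; the real ones are defined in new-nu-parser.
#[derive(Debug, Clone, Copy)]
pub struct NodeId(pub usize);

#[derive(Debug, Clone, Copy)]
pub struct BlockId(pub usize);

#[derive(Debug, Clone, Copy)]
pub enum AstNode {
    Int,
    Block(BlockId),
}

#[derive(Debug)]
pub struct Block {
    pub nodes: Vec<NodeId>,
}

pub struct Compiler {
    pub ast_nodes: Vec<AstNode>,
    pub blocks: Vec<Block>,
}

impl Compiler {
    /// Resolve a generic NodeId to a Block, panicking if it is another kind of node.
    pub fn get_block(&self, node_id: NodeId) -> &Block {
        let AstNode::Block(block_id) = self.ast_nodes[node_id.0] else {
            panic!("Expected block, got '{:?}'", self.ast_nodes[node_id.0]);
        };
        &self.blocks[block_id.0]
    }
}

fn main() {
    let compiler = Compiler {
        ast_nodes: vec![AstNode::Int, AstNode::Block(BlockId(0))],
        blocks: vec![Block { nodes: vec![NodeId(0)] }],
    };
    // Call sites shrink to one line; the variant check lives in the helper.
    let block = compiler.get_block(NodeId(1));
    println!("block has {} node(s)", block.nodes.len());
}
```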

Just using helpers like this would clean up the codebase without making things too complex/slow. If we were concerned about the performance impact of the runtime check (how much it actually impacts would need to be checked, I'd suspect that modern branch predictors would be quite good at predicting such a static check), I can think of two options:

Use cold_path hint which I just learned about.
Reinterpret self.ast_nodes[node_id.0] as BlockId and assume it will always be called correctly. That's unsafe, but once parsing is complete without errors, we can assume it's been parsed correctly. I'd do this only if we prove significant performance benefits.
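
Both options can be sketched roughly as follows. Two substitutions are made here and should be read as assumptions: std::hint::cold_path may not be available on every stable toolchain, so the first function uses the long-stable #[cold] attribute on an out-of-line panic function instead; and the second function uses std::hint::unreachable_unchecked rather than a literal reinterpretation of the node's memory. The function names and the reduced AstNode enum are hypothetical.

```rust
use std::hint::unreachable_unchecked;

#[derive(Debug, Clone, Copy)]
pub struct NodeId(pub usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockId(pub usize);

#[derive(Debug, Clone, Copy)]
pub enum AstNode {
    Int,
    Block(BlockId),
}

/// Out-of-line failure path; `#[cold]` hints that it is rarely called.
#[cold]
#[inline(never)]
fn not_a_block(found: AstNode) -> ! {
    panic!("Expected block, got '{:?}'", found)
}

/// Option 1 (approximation): keep the runtime check, mark the failure branch cold.
pub fn block_id_checked(ast_nodes: &[AstNode], node_id: NodeId) -> BlockId {
    match ast_nodes[node_id.0] {
        AstNode::Block(block_id) => block_id,
        other => not_a_block(other),
    }
}

/// Option 2 (approximation): trust the caller and drop the variant check in release builds.
///
/// # Safety
/// `node_id` must refer to an `AstNode::Block`; anything else is undefined behavior.
pub unsafe fn block_id_unchecked(ast_nodes: &[AstNode], node_id: NodeId) -> BlockId {
    debug_assert!(matches!(ast_nodes[node_id.0], AstNode::Block(_)));
    match ast_nodes[node_id.0] {
        AstNode::Block(block_id) => block_id,
        // SAFETY: the caller guarantees that the node is a block.
        _ => unsafe { unreachable_unchecked() },
    }
}

fn main() {
    let ast_nodes = [AstNode::Int, AstNode::Block(BlockId(7))];
    println!("{:?}", block_id_checked(&ast_nodes, NodeId(1)));
    // SAFETY: node 1 was constructed as a block just above.
    println!("{:?}", unsafe { block_id_unchecked(&ast_nodes, NodeId(1)) });
}
```

Whether either variant is measurably faster than the plain check would still need to be benchmarked, as noted above.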

There are also some things I noted in the inline comments that I didn't quite understand, like making the return types Option<> in the parser.

This is clearly a lot of work, but I'm unsure if it goes in the right direction. We could iterate on some concrete code samples to see if we can design a somewhat lighter solution, like I did above with typecheck_block(). It's easier when you have a concrete example to improve.

kubouch · 2026-04-26T13:44:54Z

+        compiler
+            .string_to_expression
+            .get(&self)
+            .expect("internal error: name node should have a corresponding expression node")


This and similar errors repeat "name node" also for non-name nodes

kubouch · 2026-04-26T13:51:58Z

+    spans: Vec<Span>,
+}
+
+impl<T> NodeSpans<T> {


I like grouping nodes and spans like this. If needed, it could have .get(&self, i: usize) -> (&T, Span) as well.

And maybe one more small thing: The .get methods could take i: NodeId instead of i: usize. It reduces the need for node_id.0 in the rest of the code and makes it explicit that it's addressed by NodeId.
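
A rough sketch of what that could look like is below. Only the spans field and the NodeSpans name appear in the diff excerpt; the nodes field, the push method and the exact shape of Span are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(pub usize);

/// Nodes and their spans stored side by side, addressed by the same index.
pub struct NodeSpans<T> {
    nodes: Vec<T>, // hypothetical field name
    spans: Vec<Span>,
}

impl<T> NodeSpans<T> {
    pub fn new() -> Self {
        Self { nodes: Vec::new(), spans: Vec::new() }
    }

    /// Store a node together with its span and return the id addressing both.
    pub fn push(&mut self, node: T, span: Span) -> NodeId {
        self.nodes.push(node);
        self.spans.push(span);
        NodeId(self.nodes.len() - 1)
    }

    /// Takes a NodeId rather than a usize, so callers never write `node_id.0`.
    pub fn get(&self, id: NodeId) -> (&T, Span) {
        (&self.nodes[id.0], self.spans[id.0])
    }
}

fn main() {
    let mut names: NodeSpans<&str> = NodeSpans::new();
    let id = names.push("foo", Span { start: 4, end: 7 });
    let (node, span) = names.get(id);
    println!("{node} at {}..{}", span.start, span.end);
}
```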

kubouch · 2026-04-26T14:01:14Z

-            .spans
-            .get(node_id.0)
-            .expect("internal error: missing span of node")
+    /// TODO: no need this.


There are some TODO comments which are a bit unclear. Should these be removed?

kubouch · 2026-04-26T14:01:42Z

    /// Get the source contents of a node
-    pub fn node_as_str(&self, node_id: NodeId) -> &str {
-        std::str::from_utf8(self.get_span_contents(node_id))
+    /// TODO: use generic rather than NodeIndexer


What is generic? NodeId?

kubouch · 2026-04-26T14:45:44Z

+    Garbage,
+}
+
+pub trait NodeIdGetter {


I have two comments about these getter/pusher traits:

The traits add a layer of complexity

It would feel more natural to do compiler.get_node(node_id) instead of node.get_node(&mut compiler).

I think both could be solved by implementing these as methods of the Compiler, for example Compiler::push_string(&mut self, span: Span, string_node: StringNode) -> StringNodeId { ... } or Compiler::get_string_mut(&mut self, id: StringNodeId) -> &mut StringNode { ... }. You'd still get the type safety and IMO it feels more natural.

Introducing push_string, get_string_mut, set_string, push_name, get_name, get_name_mut and so on would bloat Compiler with too many methods. That's why I came up with the current design.
But I agree that it adds complexity in another way.
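
For comparison, the method-based alternative from this thread could be sketched as follows. The push_string and get_string_mut signatures follow the ones suggested in the comment; get_string, get_string_span and the two storage fields are assumptions added to make the example complete.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub struct StringNode(pub String);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StringNodeId(pub usize);

#[derive(Default)]
pub struct Compiler {
    string_nodes: Vec<StringNode>, // hypothetical storage layout
    string_spans: Vec<Span>,
}

impl Compiler {
    /// Store a string node with its span and return its typed id.
    pub fn push_string(&mut self, span: Span, string_node: StringNode) -> StringNodeId {
        self.string_nodes.push(string_node);
        self.string_spans.push(span);
        StringNodeId(self.string_nodes.len() - 1)
    }

    pub fn get_string(&self, id: StringNodeId) -> &StringNode {
        &self.string_nodes[id.0]
    }

    pub fn get_string_mut(&mut self, id: StringNodeId) -> &mut StringNode {
        &mut self.string_nodes[id.0]
    }

    pub fn get_string_span(&self, id: StringNodeId) -> Span {
        self.string_spans[id.0]
    }
}

fn main() {
    let mut compiler = Compiler::default();
    let id = compiler.push_string(Span { start: 0, end: 5 }, StringNode("hello".into()));
    compiler.get_string_mut(id).0.push('!');
    println!("{:?} at {:?}", compiler.get_string(id), compiler.get_string_span(id));
}
```

The trade-off discussed here is visible in the sketch: the call direction reads as compiler.get_string(id), but every node type needs its own group of methods on Compiler.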

kubouch · 2026-04-26T15:36:42Z

-    pub ast_nodes: Vec<AstNode>,
+    // different types of nodes.
+    pub name_nodes: NodeSpans<NameNode>,
+    pub string_nodes: NodeSpans<StringNode>,


Adding a separate Vec for these trivial types adds an extra level of indirection (instead of addressing expression_nodes[i] directly, it needs to go through string_nodes[expression_nodes[i]]). I'm wondering if we could achieve type safety without having these separate Vecs.

kubouch · 2026-04-26T15:38:58Z

+    Xor,
+    Or,
+
+    // Assignments


Assignments are statements, could be moved there.

kubouch · 2026-04-26T15:40:16Z

+    FlagShortGroup,
+
+    // ??? should statement belongs to AstNode?
+    Statement(StatementNodeId),


Since there is a separate storage for statements (statement_nodes), is this necessary anymore?

kubouch · 2026-04-26T15:46:54Z

+pub struct NodeId(pub usize);
+
+#[derive(Clone, Copy, PartialEq, Eq, Hash)]
+pub enum NodeIndexer {


Just a naming issue: The NodeIndexer becomes the "top-level" node type now, so I'd name it AstNode or just Node. And the former AstNode now becomes more like a "misc" node, could be named GeneralNode or similar.

Also, if I understood it correctly, there is no longer a single storage for the AST nodes. What previously was compiler.ast_nodes is now broken down into four Vecs.

kubouch · 2026-04-26T17:20:00Z

    pub errors: Vec<SourceError>,
+
+    // cache mapping
+    pub name_to_expression: HashMap<NameNodeId, ExpressionNodeId>,


I guess these HashMaps are needed to be able to determine which ExpressionNodeId corresponds to each of the expression nodes? Because now that they are different types, they don't know about each other. This adds a lot of complexity, and hash map lookups are also more expensive than just indexing into a Vec. I'm wondering if we really need those.

Yup, this is one of the things I don't really like in this new design. It makes things like the rollback logic more complicated.

WindSoilder · 2026-04-30T09:28:03Z

@kubouch Thank you for the feedback! I agree that it doesn't feel good because of too much overhead. I actually feel the same way about this.

What do you think about wrapping NodeId inside a different type? Take AstNode::Block as an example:

impl Compiler {
    pub fn get_block(&self, node_id: BlockNodeId) -> &Block {
        let AstNode::Block(block_id) = self.ast_nodes[node_id.0.0] else {
            panic!(
                "Expected block, got '{:?}'",
                self.ast_nodes[node_id.0.0]
            );
        };
        &self.blocks[block_id.0]
    }
}

Where BlockNodeId is defined like:

struct BlockNodeId(NodeId);

This way we can make good use of Rust's type system, and maybe it's friendlier to programmers.
Maybe #54 is the right direction to go.
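
A runnable sketch of the wrapper idea, under simplifying assumptions: the AstNode enum is reduced to two variants, Block and Compiler carry only the fields needed here, and the newtype's inner field is public so call sites can wrap a NodeId directly.

```rust
#[derive(Debug, Clone, Copy)]
pub struct NodeId(pub usize);

#[derive(Debug, Clone, Copy)]
pub struct BlockId(pub usize);

/// Newtype marking a NodeId that the caller believes points at a block.
#[derive(Debug, Clone, Copy)]
pub struct BlockNodeId(pub NodeId);

#[derive(Debug, Clone, Copy)]
pub enum AstNode {
    Int,
    Block(BlockId),
}

#[derive(Debug)]
pub struct Block {
    pub nodes: Vec<NodeId>,
}

pub struct Compiler {
    pub ast_nodes: Vec<AstNode>,
    pub blocks: Vec<Block>,
}

impl Compiler {
    /// The signature documents the expectation; the variant is still checked at runtime.
    pub fn get_block(&self, node_id: BlockNodeId) -> &Block {
        let index = node_id.0 .0;
        let AstNode::Block(block_id) = self.ast_nodes[index] else {
            panic!("Expected block, got '{:?}'", self.ast_nodes[index]);
        };
        &self.blocks[block_id.0]
    }
}

fn main() {
    let compiler = Compiler {
        ast_nodes: vec![AstNode::Int, AstNode::Block(BlockId(0))],
        blocks: vec![Block { nodes: vec![NodeId(0)] }],
    };
    // The wrapping happens at the call site, from a plain NodeId.
    let block_node_id = BlockNodeId(NodeId(1));
    println!("{} node(s)", compiler.get_block(block_node_id).nodes.len());
}
```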

stormasm · 2026-05-01T15:46:07Z

If you and @kubouch think #54 is the correct way to go it would be great if @ysthakur could comment on this idea as well...

I sent out a message to @ysthakur on the core team channel but have not heard back from him yet...

It seems the three of you are the best people to make this decision...

@WindSoilder if you are able to get in contact with @ysthakur that would be great

kubouch · 2026-05-03T14:59:25Z

The BlockNodeId might give the programmer the hint to use .get_block() only if the NodeId points to a block, but it may also unnecessarily obscure the fact that NodeIds are just integers indexing into an array. The programmer would always need to do let block_node_id = BlockNodeId(node_id); compiler.get_block(...), and I'm not sure that adds more value than noise.

Fundamentally, to achieve perfect type safety, each AST node type would need to have its own storage, similar to what you did with some of the node types. But this results in a lot of fragmentation, leading to extra complexity and performance degradation. Once we have more than one type of AST node stored in Vec<AstNode>, we need to live with the fact that perfect type safety is impossible because the AST node kinds are only known at runtime. Therefore, whether an id is a BlockNodeId or an OtherNodeId is also determined at runtime; it can't be resolved by Rust's type system. So you always need to do some casting/assumptions at runtime with some kind of check (or do an unsafe cast). (The Zig parser, for example, does that heavily: https://mitchellh.com/zig/parser.)

The Statement/Expression split in #54 can be done, but it adds a lot of verbosity (eg. AstNode::Expr(Expr::Int) instead of just AstNode::Int). As mentioned in the thread, it can also probably be resolved with a helper (eg. .is_expr()).
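
As an illustration of the helper idea, a flat enum can stay flat and still answer the statement/expression question. The variant set below is a made-up subset chosen for the example, not the crate's actual AstNode.

```rust
// Hypothetical subset of variants, kept flat (AstNode::Int rather than AstNode::Expr(Expr::Int)).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AstNode {
    // expression-like nodes
    Int,
    String,
    // statement-like nodes
    Let,
    Def,
}

impl AstNode {
    /// True for nodes that produce a value.
    pub fn is_expr(&self) -> bool {
        matches!(self, AstNode::Int | AstNode::String)
    }
}

fn main() {
    for node in [AstNode::Int, AstNode::String, AstNode::Let, AstNode::Def] {
        println!("{:?}: is_expr = {}", node, node.is_expr());
    }
}
```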

I went ahead and made a helper for the blocks here: #69. There are plenty of patterns that can be refactored this way, eg.

new-nu-parser/src/typechecker.rs, lines 790 to 792 in fafd8df:

    let AstNode::InOutTypes(types) = self.compiler.get_node(ty) else {
        panic!("internal error: return type is not a return type");
    };

WindSoilder · 2026-05-06T08:20:50Z

Going to close the PR because it's not a good direction to go in.

WindSoilder added 12 commits March 13, 2026 22:00

split out AstNode

02a0e61

let's define them in compiler

1e0a8df

introduce a new span_end method, also add Pipeline

c763962

parser change

3a7d3c2

finish parser change

90e37ee

make some minor fix

ceb8a65

introduce a new NameOrString

92c0d22

some typechecker change

896ec16

rename

7c4a05b

some minor adjust for type chagnes

a82e103

some little change

d2dd7a6

more little changes

f2e928e

WindSoilder force-pushed the ast_nodes branch from 56dd9f5 to 998381f Compare March 23, 2026 13:23

WindSoilder added 8 commits April 2, 2026 16:05

finish typechecker change

3701501

resolver

a2d84d5

more changes on resolver

c71d6ed

more changes on resolver

3b32c92

ir generator

44fd508

let's change display_status

ddf0459

finish it

b2e6ecd

don't make nested insert, but it may causes problem, still need to investigate

7c8937a

WindSoilder force-pushed the ast_nodes branch from b0e8922 to 7c8937a Compare April 2, 2026 08:05

WindSoilder added 8 commits April 4, 2026 18:02

remove useless import

6f2385e

remove useless import

5478353

change into_indexer to accept compiler reference

e8a37ec

remove NodeIndexer::String, NodeIndexer::Variane, NodeIndexer::Name

50c58d2

change into_indexer to accept compiler reference, also improve typechecker

6c42fda

impl some Debug trait to reduce verbosing

4b70310

no duplicate

ad97d7a

no duplicate advance

f6fc0f7

WindSoilder added 10 commits April 10, 2026 22:49

fix if

58356ea

add mapping from pipeline to expression

e0aabee

Remove NodeIndexer::Pipeline variant and related handling; route PipelineId into Expression indexer via pipeline_to_expression mapping

9172bee

Remove handling for NodeIndexer::Pipeline in compiler.get_span and node index display

3d4c352

Remove unreachable catch-all arm in resolver.resolve now that NodeIndexer::Pipeline variant is removed

cd86cb3

fix resolve_call to keep original logic, also fix rollback apply

279d11c

fix in_out_types

4fbde40

fix type infer

fde82d1

fix clippy

0821dae

update snapshot

edb9b6b

WindSoilder force-pushed the ast_nodes branch from 1cd9742 to edb9b6b Compare April 14, 2026 05:51

WindSoilder marked this pull request as ready for review April 14, 2026 05:52

kubouch reviewed Apr 26, 2026

kubouch mentioned this pull request May 3, 2026

Add block-getting helper #69

Merged

WindSoilder mentioned this pull request May 6, 2026

Implement Copy trait on AstNode #70

Open

WindSoilder closed this May 6, 2026
