Skip to content

Splitting out AstNode to AstNode, Block, Pipeline, StatementNode, ExpressionNode.#68

Closed
WindSoilder wants to merge 38 commits intonushell:mainfrom
WindSoilder:ast_nodes
Closed

Splitting out AstNode to AstNode, Block, Pipeline, StatementNode, ExpressionNode.#68
WindSoilder wants to merge 38 commits intonushell:mainfrom
WindSoilder:ast_nodes

Conversation

@WindSoilder
Copy link
Copy Markdown
Contributor

@WindSoilder WindSoilder commented Mar 16, 2026

As title, this pr is a huge refactor of the codebase, it splits a general AstNode to different nodes.

- A BlockNode, it contains
- A list of StatementNode or ExpressionNode

We can look into enum StatementNode and enum ExpressionNode to see what they can be.

It's a little different to another pr #54 , this pr saves all these nodes into Compiler, so we have the following fields:

  • ast_nodes (It contains some smaller set of original AstNode)
  • expression_nodes
  • block_nodes
  • pipeline_nodes
  • name_nodes (explained in next paragraph)
  • string_nodes (explained in next paragraph)
  • variable_nodes (explained in next paragraph)

For Expression Node, I dupliated the storage of NameNode, StringNode and VariableNode to an individual place, because they can be re-used in many places, after that, we can define StatementNode::Let, StatementNode::Def easier.

After we pushing these node, we also put an Indexer into Compiler.indexer, we can still get all nodes sequentially from this field.

For every type of node, they have a new id type, for example:

  • NameNode: NameNodeId
  • StringNode: StringNodeId
  • VariableNode: VariableNodeId
    And here we define NodeIdGetter trait for NodeId and NodePush trait for Node. So we can push node to compiler and get node information from compiler easily.

What can we actheves after this pr:

  • A better runtime performance, because we don't need to check AstNode every time after we get a node, we are more likely to get right node when we get from a dedicated typed id.
  • Less possibility of bugs, because we can make use of Rust's type system to make sure we are using the right node in the right place.

Something might goes bad after this pr:

  • Some spaces overhead because we need to store more vectors to store these nodes.
  • More complicated output in display_state method, especially in Compiler, because it outputs the indirect NameNode, StringNode, VariableNode, Statement etc as well.

I had done tests for files under tests manually to update snapshot, but something might still broken

@WindSoilder WindSoilder marked this pull request as ready for review April 14, 2026 05:52
@stormasm
Copy link
Copy Markdown

@WindSoilder Let us wait for a couple of weeks to see if @kubouch or @ysthakur is able and has the time to review this PR and give feed back as they are the most knowledgeable about ideas on how things should work.

@stormasm
Copy link
Copy Markdown

In the meantime there is no risk of code rot etc... because we will not land any other PR's here until your PR is resolved...

@stormasm
Copy link
Copy Markdown

stormasm commented Apr 24, 2026

On April 23 @kubouch mentioned on the core team channel that when he has some free time he will take a look at this PR. So lets wait on his feedback 😄

Copy link
Copy Markdown
Contributor

@kubouch kubouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first made the inline comments as I was understanding the code before writing this, so this message is the primary feedback. Some of the inline feedback may not be relevant.

I think the PR targets the right issue of having a more programmer-friendly interface to the AST nodes, but I'm not convinced about this approach because it adds a lot of complexity of its own. We have now four "top-level" vectors of Nodes (plus some extra for strings, variables etc.), there are extra hash maps and several new traits. Furthermore, one of the arguments was performance, so I ran the benchmarks and it seems the parser stage can be 2x slower and all stages except lexing take a performance hit with this PR. (The benchmarks seem a bit broken, but you can still run them with cargo export target/benchmarks/ast_nodes -- bench, then target/benchmarks/ast_nodes/benchmarks solo -s 20 --warmup true, requires the cargo-export plugin.)

I checked the code for some offending pattern that we'd like to improve and found this:

    fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
        let AstNode::Block(block_id) = self.compiler.ast_nodes[node_id.0] else {
            panic!(
                "Expected block to typecheck, got '{:?}'",
                self.compiler.ast_nodes[node_id.0]
            );
        };
        let block = &self.compiler.blocks[block_id.0];
        (...)

I think this snippet well illustrates the problem: We get a generic NodeId and then assume it's a block (and panic if it's not). What we could do is to refactor it like this:

    fn typecheck_block(&mut self, node_id: NodeId, expected: TypeId) -> TypeId {
        let block = &self.compiler.get_block(node_id);
        (...)

where

impl Compiler {
    pub fn get_block(&self, node_id: NodeId) -> &Block {
        let AstNode::Block(block_id) = self.ast_nodes[node_id.0] else {
            panic!(
                "Expected block, got '{:?}'",
                self.ast_nodes[node_id.0]
            );
        };
        &self.compiler.blocks[block_id.0]
    }
}

Just using helpers like this would clean up the codebase without making things too complex/slow. If we were concerned about the performance impact of the runtime check (how much it actually impacts would need to be checked, I'd suspect that modern branch predictors would be quite good at predicting such a static check), I can think of two options:

  1. Use cold_path hint which I just learned about.
  2. Reinterpret self.ast_nodes[node_id.0] as BlockId and assume it will always be called correctly. That's unsafe, but once parsing is complete without errors, we can assume it's been parsed correctly. I'd do this only if we prove significant performance benefits.

There are also some things I noted in the inline comments that I didn't quite understand, like making the return types Option<> in the parser.

This is clearly a lot of work, but I'm unsure if it goes the right direction. We could iterate on some concrete code samples to see if we can design a bit lighter solution, like I did above using the typecheck_block(), it's easier when you have some concrete example to improve.

Comment thread src/ast_nodes.rs
compiler
.string_to_expression
.get(&self)
.expect("internal error: name node should have a corresponding expression node")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and similar errors repeat "name node" also for non-name nodes

Comment thread src/compiler.rs
spans: Vec<Span>,
}

impl<T> NodeSpans<T> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like grouping nodes and spans like this. If needed, it could have .get(&self, i: usize) -> (&T, Span) as well.

And maybe one more small thing: The .get methods could take i: NodeId instead of i: usize. It reduces the need for node_id.0 in the rest of the code and makes it explicit that it's addressed by NodeId.

Comment thread src/compiler.rs
.spans
.get(node_id.0)
.expect("internal error: missing span of node")
/// TODO: no need this.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some TODO comments which are a bit unclear. Should these be removed?

Comment thread src/compiler.rs
/// Get the source contents of a node
pub fn node_as_str(&self, node_id: NodeId) -> &str {
std::str::from_utf8(self.get_span_contents(node_id))
/// TODO: use generic rather than NodeIndexer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is generic? NodeId?

Comment thread src/ast_nodes.rs
Garbage,
}

pub trait NodeIdGetter {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two comments about these getter/pusher traits:

  • The traits add a layer of complexity
  • It would feel more natural to do compiler.get_node(node_id) instead of node.get_node(&mut compiler).

I think both could be solved by implementing these as methods of the Compiler, for example Compiler::push_string(&mut self, span: Span, string_node: StringNode) -> StringNodeId { ... } or Compiler::get_string_mut(&mut self, id: StringNodeId) -> &'a mut StringNode { ... }. You'd still get the type safety and IMO it feels more natural.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If introducing push_string, get_string_mut, set_string, push_name, get_name, get_name_mut it makes Compiler bloat out with too much method. It's why I came up with current design.
But I agree that it adds complexity in another way.

Comment thread src/compiler.rs
pub ast_nodes: Vec<AstNode>,
// different types of nodes.
pub name_nodes: NodeSpans<NameNode>,
pub string_nodes: NodeSpans<StringNode>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a separate Vec for these trivial type adds additional indirection (instead of addressing expression_nodes[i] directly, it needs to go string_nodes[expression_nodes[i]]. I'm wondering if we could achieve type safety without having these separate Vecs.

Comment thread src/ast_nodes.rs
Xor,
Or,

// Assignments
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assignments are statements, could be moved there.

Comment thread src/ast_nodes.rs
FlagShortGroup,

// ??? should statement belongs to AstNode?
Statement(StatementNodeId),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is a separate storage for statements (statement_nodes), is this necessary anymore?

Comment thread src/ast_nodes.rs
pub struct NodeId(pub usize);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub enum NodeIndexer {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a naming issue: The NodeIndexer becomes the "top-level" node type now, so I'd name it AstNode or just Node. And the former AstNode now becomes more like a "misc" node, could be named GeneralNode or similar.

Also, If I understood it correctly, now there is no single storage anymore for the AST nodes. What previously was compiler.ast_nodes is now broken down to four Vecs.

Comment thread src/compiler.rs
pub errors: Vec<SourceError>,

// cache mapping
pub name_to_expression: HashMap<NameNodeId, ExpressionNodeId>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these HashMaps are needed to be able to determine which ExpressionNodeId corresponds to each of the expression nodes? Because now when they are different types, they don't know about each other. This adds a lot of complexity, the hash map lookups are also more expensive than just index to a Vec. I'm wondering if we really need those.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this is one of thing I don't really like in this new design, it makes something like rollback logic more complicated

@WindSoilder
Copy link
Copy Markdown
Contributor Author

WindSoilder commented Apr 30, 2026

@kubouch Thank you for the feedback! I agree that it doesn't feel good because to too much overhead, I actually have the same feel about this.

What do you think about wrapping NodeId insider different type? Take AstNode::Block as example:

impl Compiler {
    pub fn get_block(&self, node_id: BlockNodeId) -> &Block {
        let AstNode::Block(block_id) = self.ast_nodes[node_id.0.0] else {
            panic!(
                "Expected block, got '{:?}'",
                self.ast_nodes[node_id.0]
            );
        };
        &self.compiler.blocks[block_id.0]
    }
}

Where BlockNodeId is defined like:

type BlockNodeId(NodeId)

In this way, we can make good use of rust type system, and maybe it's more friendly to programmers.
Maybe #54 it's a right direction to go.

@stormasm
Copy link
Copy Markdown

stormasm commented May 1, 2026

If you and @kubouch think #54 is the correct way to go it would be great if @ysthakur could comment on this idea as well...

I sent out a message to @ysthakur on the core team channel but have not heard back from him yet...

It seems the three of you all are the best people to make this decision...

@WindSoilder if you are able to get in contact with @ysthakur that would be great

@kubouch kubouch mentioned this pull request May 3, 2026
@kubouch
Copy link
Copy Markdown
Contributor

kubouch commented May 3, 2026

The BlockNodeId might give the programmer the hint to use .get_block() only if the NodeId points to a block, but it may also be unnecessarily obscuring the fact that NodeIds are just integer pointing to an array. The programmer would need to always do let block_node_id = BlockNodeId(node_id); compiler.get_block(...), not sure if that adds more value than noise.

Fundamentally, to achieve perfect type safety, each AST node type would need to have its own storage, similar to what you did with some of the node types. But this results in a lot of fragmentation leading to extra complexity and performance degradation. Once we have more than one type of AST node stored in Vec<AstNode>, then we need to live with the fact that perfect type safety is impossible because the AST nodes are only known at runtime. Therefore, whether it's BlockNodeId or OtherNodeId is also determined at runtime, it can't be resolved by Rust type system. So you always need to do some casting/assumptions at runtime with some kind of a check (or do an unsafe cast). (The Zig parser, for example, does that heavily: https://mitchellh.com/zig/parser.)

The Statement/Expression split in #54 can be done, but it adds a lot of verbosity (eg. AstNode::Expr(Expr::Int) instead of just AstNode::Int). As mentioned in the thread, it can also probably be resolved with a helper (eg. .is_expr()).

I went ahead and made a helper for the blocks here: #69. There are plenty of patterns that can be refactored this way, eg.

let AstNode::InOutTypes(types) = self.compiler.get_node(ty) else {
panic!("internal error: return type is not a return type");
};
.

@WindSoilder
Copy link
Copy Markdown
Contributor Author

Going to close the pr because it's not a good direction to go.

@WindSoilder WindSoilder closed this May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants