Formalize the Vortex type system by connortsui20 · Pull Request #29 · vortex-data/rfcs

connortsui20 · 2026-03-06T22:19:18Z

I wanted to write this for 2 reasons, the first being that we do not have a formalized definition of the Vortex type system. Note that I'm not saying we don't understand how it works (I think all of us intuitively understand it), but I thought it would be good to map it to actual theory.

asubiotto

Thank you for writing this RFC. Formalizing the type system, even a little, is very useful for motivating changes to its design.

I think that it would also be useful to spell out the motivation for the existence of separate concepts in the type system in order to inform the decision framework. These are things that we can probably internally/intuitively articulate but again I think it's helpful to spell it out. Specifically:

* Why do we define DTypes as logical types separately from physical encodings?
* Why do we have the concept of canonical physical representations? What's the goal?
* What is the goal of extension types? How are they different from first-class dtypes?

Other than that, I think I mostly agree with the RFC. The conclusion I take away from the FSB discussion is that FSB should be one of the possible canonicalization targets of the Binary DType. Similarly, FixedSizeList should not be its own DType but rather another canonicalization target of the List DType.

One other thing I'm curious about which might be good to add to the RFC is "what amount of gating is required for a data type to be considered an extension type rather than a first-class dtype". Every type could essentially be sugar on a bytes type.

proposed/0029-types.md

connortsui20 · 2026-03-09T13:34:54Z

Thoughts on me splitting this RFC into 2 RFCs? The first can just be the formalization and the second can be the other proposal.

Edit: I am going to split this RFC.

connortsui20 · 2026-03-09T16:21:17Z

After some offline discussion I'm going to completely pull out the second part of this RFC as we need to better understand how execute should work before we think about execution targets.

connortsui20 · 2026-03-09T17:09:17Z

@asubiotto Note that this RFC doesn't make any claims that FixedSizeList shouldn't be a dtype, nor that FixedSizeBinary should be. It only has a framework in which we can think about these things.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

asubiotto · 2026-03-11T10:34:29Z

Thanks for the changes. This looks good to me.

gatesn

I came away from this with the following (not fully formed) thoughts:

* We should deprecate DType::Utf8 and make it an extension type
* We should deprecate FSL and make it an extension type
* Why are primitive types different? Why is u16 not a refinement over the integer type?

gatesn · 2026-03-11T19:26:44Z

proposed/0029-types.md

+### What is a `Canonical` encoding?
+
+The `N + M` argument relies on a common decompression target that operations are implemented
+against. A **canonical encoding** is a physical encoding chosen as this representative for a logical


Suggested change

- against. A **canonical encoding** is a physical encoding chosen as this representative for a logical
+ against. A **canonical encoding** is a physical encoding chosen as this representation for a logical
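
For readers following the `N + M` argument quoted in this hunk, here is a minimal sketch of the idea. It assumes hypothetical names (`CanonicalInts`, `Encoding`, `RunLength`, `FrameOfReference`) that are not part of the Vortex API: each of the N encodings implements only a conversion to the canonical form, and each of the M operations is written once against that form.

```rust
// Hypothetical names throughout; this is an illustration, not the Vortex API.

/// The canonical (fully decompressed) form of a logical integer array.
#[derive(Debug, Clone, PartialEq)]
pub struct CanonicalInts(pub Vec<i64>);

/// Each of the N encodings implements exactly one thing: decoding to canonical.
pub trait Encoding {
    fn to_canonical(&self) -> CanonicalInts;
}

/// Run-length encoding: (value, run length) pairs.
pub struct RunLength(pub Vec<(i64, usize)>);

/// Frame-of-reference encoding: a base value plus small deltas.
pub struct FrameOfReference {
    pub base: i64,
    pub deltas: Vec<u8>,
}

impl Encoding for RunLength {
    fn to_canonical(&self) -> CanonicalInts {
        let mut out = Vec::new();
        for &(value, len) in &self.0 {
            out.extend(std::iter::repeat(value).take(len));
        }
        CanonicalInts(out)
    }
}

impl Encoding for FrameOfReference {
    fn to_canonical(&self) -> CanonicalInts {
        let values = self.deltas.iter().map(|&d| self.base + i64::from(d));
        CanonicalInts(values.collect())
    }
}

/// Each of the M operations is written once, against the canonical form.
pub fn sum(array: &dyn Encoding) -> i64 {
    array.to_canonical().0.iter().sum()
}

pub fn max(array: &dyn Encoding) -> Option<i64> {
    array.to_canonical().0.iter().copied().max()
}
```

With two encodings and two operations this is 2 + 2 implementations rather than 2 × 2. A real system would presumably also allow an encoding to provide a faster specialized kernel where one exists.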

gatesn · 2026-03-11T19:31:44Z

proposed/0029-types.md

+
+## Motivation
+
+This definition has mostly worked well for us. However, several recent discussions have revealed


What definition?

gatesn · 2026-03-11T19:41:35Z

proposed/0029-types.md

+Without this predicate, `Utf8` and `Binary` would be the same type, and maintaining both would be
+redundant.
+
+## What justifies a new type?


I'm not sure I see how this follows from the formal definitions? I like the formal definitions! This section now reads as if we're throwing away those principles for fluffy reasons.

Like yeah.... why do we have DType::Utf8? Maybe we should indeed deprecate it!
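
To make the "predicate" in the quoted hunk concrete, here is a minimal sketch of `Utf8` modeled as a refinement of `Binary`. The struct names, `try_new`, `into_binary`, and `as_str` are assumptions for illustration only and do not reflect how Vortex defines these dtypes.

```rust
/// Hypothetical logical type: arbitrary bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct Binary {
    pub bytes: Vec<u8>,
}

/// Hypothetical refinement: the same bytes, constructible only when the
/// predicate "is valid UTF-8" holds.
#[derive(Debug, Clone, PartialEq)]
pub struct Utf8 {
    inner: Binary,
}

impl Utf8 {
    /// Checks the refinement predicate; hands the bytes back on failure.
    pub fn try_new(binary: Binary) -> Result<Self, Binary> {
        if std::str::from_utf8(&binary.bytes).is_ok() {
            Ok(Utf8 { inner: binary })
        } else {
            Err(binary)
        }
    }

    /// Forgetting the predicate always succeeds and copies nothing.
    pub fn into_binary(self) -> Binary {
        self.inner
    }

    /// The predicate was checked in `try_new`, so this cannot fail.
    pub fn as_str(&self) -> &str {
        std::str::from_utf8(&self.inner.bytes).expect("validated at construction")
    }
}
```

In this sketch the two types share one physical representation and differ only in the predicate, which is the situation the question about deprecating `DType::Utf8` is probing.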

gatesn · 2026-03-11T19:43:03Z

proposed/0029-types.md

+constraint that every element has the same length `n`, but `scalar_at`, `filter`, `take`, etc. all
+behave identically regardless of whether the list is fixed-size.
+
+**Is it structurally distinct from an existing `DType`?** Yes. `FixedSizeList` has a different


I think I disagree? I don't think it is structurally different from a list
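
One way to read this objection is sketched below, under the assumption of a simple offsets-plus-sizes list layout. `ListView` here is a hypothetical struct over `i32` elements, not the actual Vortex array: a fixed-size list is then just a list whose `sizes` all equal `n`, and `scalar_at` is the same code either way.

```rust
/// Hypothetical list layout: list `i` is
/// `elements[offsets[i] .. offsets[i] + sizes[i]]`.
pub struct ListView {
    pub elements: Vec<i32>,
    pub offsets: Vec<usize>,
    pub sizes: Vec<usize>,
}

impl ListView {
    /// Builds a "fixed-size list" as an ordinary list view with constant sizes.
    pub fn fixed_size(elements: Vec<i32>, n: usize) -> Self {
        assert!(n > 0 && elements.len() % n == 0);
        let len = elements.len() / n;
        ListView {
            offsets: (0..len).map(|i| i * n).collect(),
            sizes: vec![n; len],
            elements,
        }
    }

    /// Identical for fixed-size and variable-size lists.
    pub fn scalar_at(&self, i: usize) -> &[i32] {
        let start = self.offsets[i];
        &self.elements[start..start + self.sizes[i]]
    }

    /// Returns `Some(n)` if every list has the same length `n`.
    pub fn constant_size(&self) -> Option<usize> {
        let first = *self.sizes.first()?;
        self.sizes.iter().all(|&s| s == first).then_some(first)
    }
}
```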

gatesn · 2026-03-11T20:15:34Z

As a crazy proposal

enum DType {
    Null,      // => Null
    Bool,      // => BitBool
    Bytes,     // => (Fixed width bytes)
    Utf8,      // => GermanStrings. This is the weird one... why?
    List,      // => ListView
    Struct,    // (Tuple) => Struct
    Union,     // => Union
    Variant,   // => Variant
    Extension, // => ...
}

Differences:
* Primitive, Decimal => Bytes(n) // Collapse the physical encodings for primitive, decimal, and FSB into one array
* Binary => List<u8> // no point storing arbitrary binary data as German strings
* FixedList => List // sizes == constant array. We still want a cheap filter, which we cannot do now with FSL.
* + Union, Variant

I would also propose adding a selection vector to the fixed width BytesArray. That means all of our canonical encodings now have "views" and can be filtered and shuffled with zero copy.
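
A minimal sketch of what a selection vector on a fixed-width bytes array could look like follows. The struct layout and every method name are assumptions for illustration, not an existing Vortex type: `filter` and `take` rewrite only the selection and share the underlying buffer.

```rust
use std::sync::Arc;

/// Hypothetical fixed-width bytes array with a selection vector.
#[derive(Clone)]
pub struct BytesArray {
    width: usize,
    data: Arc<[u8]>,       // shared buffer, never copied by filter/take
    selection: Vec<usize>, // logical row -> physical row in `data`
}

impl BytesArray {
    pub fn new(width: usize, data: Vec<u8>) -> Self {
        assert!(width > 0 && data.len() % width == 0);
        let rows = data.len() / width;
        BytesArray { width, data: data.into(), selection: (0..rows).collect() }
    }

    pub fn len(&self) -> usize {
        self.selection.len()
    }

    pub fn value(&self, i: usize) -> &[u8] {
        let start = self.selection[i] * self.width;
        &self.data[start..start + self.width]
    }

    /// Keeps the rows where `mask` is true; only the selection is rebuilt.
    pub fn filter(&self, mask: &[bool]) -> Self {
        assert_eq!(mask.len(), self.len());
        let selection = self
            .selection
            .iter()
            .zip(mask)
            .filter_map(|(&row, &keep)| keep.then_some(row))
            .collect();
        BytesArray { width: self.width, data: Arc::clone(&self.data), selection }
    }

    /// Reorders or repeats rows; again only the selection is rebuilt.
    pub fn take(&self, indices: &[usize]) -> Self {
        let selection = indices.iter().map(|&i| self.selection[i]).collect();
        BytesArray { width: self.width, data: Arc::clone(&self.data), selection }
    }

    /// True if both arrays point at the same underlying buffer.
    pub fn shares_buffer_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.data, &other.data)
    }
}
```

The trade-off in this sketch is an extra indirection on every read in exchange for filters and shuffles that touch no value bytes.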

connortsui20 force-pushed the ct/types branch from 25d9ccc to 74a0ac9 · March 6, 2026 22:20

connortsui20 requested review from gatesn and joseph-isaacs March 6, 2026 22:24

asubiotto reviewed Mar 7, 2026