asubiotto
left a comment
Thank you for writing this RFC. Formalizing the type system a little is very useful for motivating changes to its design.
I think that it would also be useful to spell out the motivation for the existence of separate concepts in the type system in order to inform the decision framework. These are things that we can probably internally/intuitively articulate but again I think it's helpful to spell it out. Specifically:
- Why do we define DTypes as logical types separately from physical encodings?
- Why do we have the concept of canonical physical representations? What's the goal?
- What is the goal of extension types? How are they different from first-class dtypes?
Other than that, I think I mostly agree with the RFC. The conclusion I take away from the FSB discussion is that FSB should be one of the possible canonicalization targets of the Binary DType. Similarly, FixedSizeList should not be its own DType; it should instead be another canonicalization target of the List DType.
One other thing I'm curious about which might be good to add to the RFC is "what amount of gating is required for a data type to be considered an extension type rather than a first-class dtype". Every type could essentially be sugar on a bytes type.
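To make the "one logical DType, several canonicalization targets" idea concrete, here is a minimal Rust sketch. Every name in it (`DType`, `CanonicalTarget`, `canonical_target`, and the `uniform_len` parameter) is hypothetical and chosen for illustration; it is not the Vortex API, and the real selection logic may look quite different.

```rust
// Hypothetical names throughout; this is NOT the Vortex API.

/// Logical types: which values a column may hold.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Binary,
    List(Box<DType>),
}

/// Physical layouts that operations could be implemented against.
#[derive(Debug, Clone, PartialEq)]
enum CanonicalTarget {
    VarBinary,
    FixedSizeBinary { width: usize },
    VarList,
    FixedSizeList { len: usize },
}

/// Pick a target for a logical type. `uniform_len` is an assumed hint:
/// `Some(n)` when every element is known to have length `n`.
fn canonical_target(dtype: &DType, uniform_len: Option<usize>) -> CanonicalTarget {
    match (dtype, uniform_len) {
        (DType::Binary, Some(width)) => CanonicalTarget::FixedSizeBinary { width },
        (DType::Binary, None) => CanonicalTarget::VarBinary,
        (DType::List(_), Some(len)) => CanonicalTarget::FixedSizeList { len },
        (DType::List(_), None) => CanonicalTarget::VarList,
    }
}
```

In this sketch the fixed-size layouts never appear in `DType`; they exist only as physical choices for `Binary` and `List`.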
Thoughts on me splitting this RFC into 2 RFCs? The first can just be the formalization and the second can be the other proposal. Edit: I am going to split this RFC.
After some offline discussion I'm going to completely pull out the second part of this RFC as we need to better understand how execute should work before we think about execution targets.
@asubiotto Note that this RFC doesn't make any claims that |
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Thanks for the changes. This looks good to me.
gatesn
left a comment
I came away from this with the following (not fully formed) thoughts:
- We should deprecate DType::Utf8 and make it an extension type
- We should deprecate FSL and make it an extension type
- Why are primitive types different? Why is u16 not a refinement over the integer type?
| ### What is a `Canonical` encoding?
|
| The `N + M` argument relies on a common decompression target that operations are implemented
| against. A **canonical encoding** is a physical encoding chosen as this representative for a logical
Suggested change:
| against. A **canonical encoding** is a physical encoding chosen as this representative for a logical
| against. A **canonical encoding** is a physical encoding chosen as this representation for a logical
| ## Motivation
|
| This definition has mostly worked well for us. However, several recent discussions have revealed

| Without this predicate, `Utf8` and `Binary` would be the same type, and maintaining both would be
| redundant.
|
| ## What justifies a new type?
I'm not sure I see how this follows from the formal definitions? I like the formal definitions! This section now reads as if we're throwing away those principles for fluffy reasons.
Like yeah.... why do we have DType::Utf8? Maybe we should indeed deprecate it!
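For reference, the "`Utf8` is `Binary` plus a predicate" reading can be sketched in a few lines of Rust. `Utf8Value` and `refine_utf8` are hypothetical names used only for this illustration; the only real API relied on is the standard library's `std::str::from_utf8`.

```rust
/// Hypothetical wrapper: bytes that have been shown to satisfy the UTF-8 predicate.
#[derive(Debug, PartialEq)]
struct Utf8Value<'a>(&'a [u8]);

/// Refine arbitrary binary data into `Utf8Value`.
/// The storage is unchanged; only the validity predicate is checked.
fn refine_utf8(bytes: &[u8]) -> Option<Utf8Value<'_>> {
    std::str::from_utf8(bytes).ok().map(|_| Utf8Value(bytes))
}
```

Under this reading, the value keeps the same bytes as its binary form, and the only thing the refined type adds is the guarantee that validation succeeded.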
| constraint that every element has the same length `n`, but `scalar_at`, `filter`, `take`, etc. all
| behave identically regardless of whether the list is fixed-size.
|
| **Is it structurally distinct from an existing `DType`?** Yes. `FixedSizeList` has a different
I think I disagree? I don't think it is structurally different from a list
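One way to picture this, as a rough sketch only: if a list is stored as an offsets buffer, "fixed-size" is just a predicate over those offsets. The function name `uniform_list_len` and the offsets-based layout are assumptions made for illustration, not how Vortex necessarily stores lists.

```rust
/// Returns `Some(n)` when the offsets describe at least one list and every
/// list has the same length `n`; returns `None` otherwise (including when
/// the offsets are empty, hold a single entry, or are not monotonic).
fn uniform_list_len(offsets: &[usize]) -> Option<usize> {
    // Length of each list is the difference between adjacent offsets.
    let mut lens = offsets.windows(2).map(|w| w[1].checked_sub(w[0]));
    let first = lens.next()??;
    lens.all(|len| len == Some(first)).then_some(first)
}
```

Here the fixed-size case uses the same structure as the general list and only adds a constraint on the lengths.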
As a crazy proposal, I would also suggest adding a selection vector to the fixed-width BytesArray. That means all of our canonical encodings would have "views" and could be filtered and shuffled with zero copy.
Rendered
I wanted to write this for 2 reasons, the first being that we do not have a formalized definition of the Vortex type system. Note that I'm not saying we don't understand how it works (I think all of us intuitively understand it), but I thought it would be good to map it to actual theory.