[Proposal] Untagged Variants for ReScript

Here’s the updated clarification to add to the document:

In the proposed design, we rely primarily on typeof checks for differentiating between types in untagged unions. With the restriction of having at most 1 record and 1 array in a union, typeof checks are sufficient for most cases. However, there is an exception when it comes to distinguishing between null and an object, as typeof null returns "object". To handle this case, we include an additional check for null values in our runtime checks.
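
The runtime check described above can be sketched in TypeScript as follows. This is only an illustration under stated assumptions: the `Point` and `Value` types and the `classify` function are hypothetical names, and the `Array.isArray` test is an assumption added here, since `typeof` also reports "object" for arrays.

```typescript
// Hypothetical union with at most one record and one array case.
type Point = { x: number; y: number };
type Value = null | boolean | number | string | Point | number[];

function classify(v: Value): string {
  // `typeof null` is "object", so null has to be ruled out first.
  if (v === null) return "null";
  // Assumption: arrays are separated from the record case with Array.isArray,
  // because `typeof` reports "object" for both.
  if (Array.isArray(v)) return "array";
  // What remains is "boolean", "number", "string", or "object" (the record).
  return typeof v;
}
```

With this ordering, the single record case is whatever remains as "object" after the null and array checks.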

By adding this clarification, the community will have a better understanding of how the proposed design handles various types and the role of runtime checks in maintaining type safety and accurate pattern matching.

Here is an example that uses both typeof and more flexible ways to classify. I can’t speak about additional complexity in the language or compiler. But @cristianoc that’s what you’re getting paid the big bucks for! I didn’t know about typeof until a few days ago. When I saw that null goes to object I laughed, like when I learned that JavaScript had BOTH null and undefined. So if I saw such a new language feature in ReScript and realized it was limited to classifying things using typeof I would find that confusing.

open ReScriptStruct
type point = {x: int, y: int}
let pointStruct = S.object(o => {
  x: o->S.field("x", S.int()),
  y: o->S.field("y", S.int()),
})
let toPoint = p => p->S.parseWith(pointStruct)->Result.okToOption

external toFloat: 'a => float = "%identity"
let toNumber = n => Type.typeof(n) == #number ? Some(toFloat(n)) : None

type thing =
  |: Float(toNumber) // My own guard using `typeof`
  |: Bool(Type.toBool) // Built-in type guard in `Type` module that uses `typeof`
  |: Point(toPoint) // Uses JSON parsing library

Looks interesting. So the main use case seems to be binding/FFI for heterogeneous return values, without having to manually write tedious and error-prone runtime type-checking code, right? If so, and if the ReScript type checker will reject any untagged variants that can’t successfully be checked at runtime, then this feature sounds pretty nice.

Edit: eg this proposal would be a lot more convenient than manually writing this type of code over and over: Union types in BuckleScript | ReScript Blog

No idea if this proposal went anywhere, but there’s an interesting similar discussion regarding untagged unions for PureScript as well: Untagged union types - Change Proposals - PureScript Language Forum. Looks like it is mainly for ease of FFI as well.

The first worry that comes to my mind is the syntax. In my mind, it should not feel “out of place” compared to JavaScript (and family) or ReScript syntax. The |: operator, in my opinion, is a departure from both JavaScript and ReScript syntaxes.

In the JavaScript family we have the Flow, TypeScript, Google Closure, and JSDoc ways of specifying a union type.

// Flow & TypeScript
boolean | string

// Google closure-compiler
(boolean | string)

// JSDoc
{(number|boolean)}
 
// ReScript syntax today
type variant =  Boolean(bool) | String(string)
type polymorphicVariant = [ #Boolean(bool) | #String(string) ]

Proposed alternative syntax like a closed polymorphic variant:

// Option 1: like Google closure-compiler, i.e. a polymorphic variant written with parens instead of square brackets
type untaggedUnion = (bool | string)
// pattern matching example using the parens syntax proposal
let value: untaggedUnion = false
let isBool = switch value {
| (b: bool) => true
| (s: string) => false
}

// Option 2: like JSDoc but removing the extra parens
type untaggedUnion = {bool | string}
// pattern match for optional record fields uses {} so it is a conflict

I think the lead author might have a good point.

The other day I was trying to handle union types in a binding to a TypeScript function, and the first thing I reached for (without looking at the docs) was the @unboxed decorator. I tried to write code that looked like this:

// this code does not compile :)
@unboxed
type context = Boolean(bool) | String(string)

type options = {
  charset: string,
  language: string,
  context: context
}

external makeT: options => t = "someFunction"

But I was disappointed to discover in the docs that @unboxed is only applicable to variants with a single constructor with a single payload. Also, @unwrap can only be used with polymorphic variants.

My point is, when I was thinking of untagged unions, (aka unions from TypeScript), I thought of using the @unboxed decorator with a Variant.

Since the motivation is to work closely with JavaScript and TypeScript, I think we should explore whether we can simplify @unboxed and @unwrap for the end user. For me that would mean a single decorator doing what both of these do today, and then extending its responsibility so that Untagged Variants are simply an unboxing of Variants.

Since the development team did a really good job with the optional record fields (making it an option internally; it’s intuitive and I picked it up naturally), I believe we can take a look at simplifying the decorators.

Null being treated as an object is not the only odd thing in JavaScript. For example, typeof NaN is "number", which is rarely what we want :sweat_smile:
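
To make the NaN point concrete, here is a small TypeScript sketch of a guard that relies on `typeof` but also rejects NaN. The name `toNumber` mirrors the guard in the earlier post; its exact shape here is an assumption.

```typescript
// Hypothetical guard: accepts only values that are numbers and not NaN.
function toNumber(v: unknown): number | undefined {
  // `typeof NaN` is "number", so typeof alone would let NaN through.
  return typeof v === "number" && !isNaN(v) ? v : undefined;
}
```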

In the code you’ve shown it’s actually better to do everything with rescript-struct instead of combining the data modeling (that should be the domain of our application) together with mapping (that should be outside of the domain of our application).

// Thing.res
type point = {x: int, y: int}
type t = Float(float) | Bool(bool) | Point(point)
// Api.res
let thingStruct: S.t<Thing.t> = S.union([
  S.float()->S.transform(f => Float(f)),
  S.bool()->S.transform(b => Bool(b)),
  S.object(o => Point({
    x: o->S.field("x", S.int()),
    y: o->S.field("y", S.int()),
  }))
])

let loadThing = GetRequest.make(~url="localhost:3000", ~bodyStruct=S.unit(), ~dataStruct=thingStruct)

loadThing->Promise.thenResolve(thing => {
  Js.log(thing)
})->ignore

That’s more flexible and correct from the application design perspective.

What I mean is that mappers shouldn’t belong to the domain code, where the type is defined.

I have pasted the readme of rescript-struct in the lead author’s chat input and asked for a comparison. Based exclusively on that information, this is how the lead author commented:

In the Related Work section, ReScript Struct could be introduced as follows:

  • ReScript Struct: A powerful and flexible library for ReScript language that provides a safe and convenient way to parse and serialize data with transformation into ReScript data structures. The library offers a range of core API functions for parsing and serializing, as well as factory functions for common data types such as strings, booleans, and integers, along with refinements and utility methods for specific types. ReScript Struct ensures type safety and allows developers to define custom constraints and transformations for their data structures, making it a relevant and valuable tool for working with complex data types and pattern matching. Although ReScript Struct does not focus specifically on untagged variants, it showcases the potential of using structured data representations to facilitate type-safe and efficient data manipulation in a functional programming language.

When comparing ReScript Struct to untagged variants, one noticeable difference lies in the runtime checks performed. ReScript Struct carries out runtime checks and validations to ensure type safety and adherence to constraints defined by the developer. This ensures that the parsed and serialized data is valid according to the specified rules, and any non-compliant data is reported as an error. This approach allows ReScript Struct to maintain type safety and provide a convenient way to work with structured data in a functional programming context.

On the other hand, untagged variants with high-level pattern matching rely on type analysis and optimized pattern matching to achieve safety and efficiency. This approach minimizes the need for runtime checks, as the type information is used at compile-time to generate efficient code for pattern matching. This results in potentially faster execution and reduced overhead compared to a solution that depends on runtime type checks and validations.

In summary, ReScript Struct and untagged variants differ in their approaches to ensuring type safety and efficient data manipulation. ReScript Struct relies on runtime checks and validations, while untagged variants leverage compile-time type analysis and optimized pattern matching. Both methods offer valuable solutions for working with complex data types in functional programming languages but cater to different requirements and performance trade-offs.

How do you deal with generics like this?

type result <A,B> = Ok (A) |: Error(B)

We don’t.
This is one of the first cases we considered. The idea for type variables is to accept at most 1 case with type variables in the definition. The alternatives would be way too complicated/inefficient to consider.
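
A TypeScript sketch of why two type-variable cases are a problem, under the assumption that an untagged case is represented by its bare payload. The names `Untagged` and `Tagged` are hypothetical.

```typescript
// Untagged analogue of a two-parameter result type: just the payload, no wrapper.
type Untagged<A, B> = A | B;

// With A = string and B = string, an "ok" value and an "error" value are
// the same thing at runtime, so no check can recover which case was meant.
const ok: Untagged<string, string> = "done";
const err: Untagged<string, string> = "done";

// A tagged encoding keeps the cases apart whatever the payload types are.
type Tagged<A, B> = { tag: "Ok"; value: A } | { tag: "Error"; value: B };
const taggedOk: Tagged<string, string> = { tag: "Ok", value: "done" };
const taggedErr: Tagged<string, string> = { tag: "Error", value: "done" };
```

With a single type-variable case, that case can act as the fallback after all the concrete cases have been ruled out.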

What’s interesting is that even with this restriction, untagged variants can encode quite a lot. Here are some reflections that were not included in the proposal above to avoid confusing people:

Appendix B: Encoding Tagged Variants Using Untagged Ones

In this conversation, we discussed the possibility of representing tagged variants using untagged ones. While it is possible to do so, it introduces verbosity and requires the user to manage the tags manually. Users need to create and maintain the object types for each case and ensure the tag property is set correctly.

In contrast, native tagged variants are more concise and less error-prone, as the language handles tagging internally. Additionally, pattern matching syntax is more ergonomic when working with tagged variants, since it is designed explicitly for that purpose.

Although it is technically possible to represent tagged variants using untagged ones, it may not be the most convenient or efficient way to work with them in practice. The benefits of using native tagged variants in terms of conciseness, ease of use, and pattern matching support typically outweigh the advantages of using untagged variants for this purpose.
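
As an illustration of the verbosity described in this appendix, here is a TypeScript sketch of tags maintained by hand. The `Circle`, `Rect`, and `kind` names are hypothetical and not taken from the proposal.

```typescript
// Each case is a plain object type; `kind` is a hand-maintained tag.
type Circle = { kind: "circle"; radius: number };
type Rect = { kind: "rect"; width: number; height: number };
type Shape = Circle | Rect;

function area(s: Shape): number {
  // The user, not the language, is responsible for setting `kind` correctly.
  switch (s.kind) {
    case "circle":
      return Math.PI * s.radius * s.radius;
    case "rect":
      return s.width * s.height;
  }
}
```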

Appendix C: Encoding Polymorphic Variants Using Untagged Variants

In this conversation, we explored the possibility of encoding polymorphic variants using untagged variants, focusing on their extensibility and expressiveness. Untagged variants differ from polymorphic variants in that they rely on runtime type information and structure of the values, instead of using tags to distinguish cases.

Untagged variants can express extensibility in a tag-less manner by including a case with a type variable as payload. This allows the type to be extended with new cases without modifying the original type definition. Here’s an example:

type untaggedFruit<'a> =
  |: Apple({ color: string })
  |: Banana({ length: int })
  |: Extra('a)

In this example, the Extra case with the type variable 'a enables the extensibility of the untaggedFruit type. By providing the appropriate payload type, we can extend the type with new cases, as shown in the example with the Orange case.

Untagged variants are strictly more expressive than polymorphic variants, as they can model tag-less, extensible types without relying on tags. However, this approach comes with trade-offs, such as potential performance implications and less concise pattern-matching syntax compared to polymorphic variants.
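
The Orange extension is mentioned above but not shown. A possible reading, sketched in TypeScript, is below. The `Orange` type, its `segments` field, and the `describe` function are assumptions made for illustration.

```typescript
type Apple = { color: string };
type Banana = { length: number };
// The type parameter plays the role of the Extra('a) case.
type Fruit<Extra> = Apple | Banana | Extra;

// Hypothetical extension supplied through the type parameter.
type Orange = { segments: number };

function describe(f: Fruit<Orange>): string {
  // Cases are told apart by the shape of the value, not by a tag.
  if ("color" in f) return "apple, " + f.color;
  if ("length" in f) return "banana, " + f.length;
  return "orange, " + f.segments + " segments";
}
```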

Just a couple more questions…

Performance

Performance is mentioned a few times (well, it even has a section at the end)… has any benchmarking been done comparing current variants to the proposed untagged variety? Even normal variants will use typeof t === "number" for some cases (where there is no payload)… so I’m curious about any actual performance differences specifically due to typeof vs mostly comparing the TAG field of the object.

Also, the proposal says: “the JavaScript code maintains readability and efficiency by leveraging the native typeof operator for type checks.” and that “the performance difference between tagged and untagged variants is likely to be minimal in most cases”. Just a guess, but wouldn’t the main efficiency gains come from not having to represent variants as objects in JS, but rather from the ability to store them as basic values?

Eg, I could imagine a function in a hot loop that creates a lot of normal tagged variants (which are represented by an object at runtime) being slower than the same function producing only untagged values (ie represented as ints/strings directly, w/o needing to create all the extra objects to store them in).

Here is one tiny benchmark I did quickly to explore that idea.

The ReScript looks like this:

  type t = String(string) | Int(int)

  let f = t =>
    switch t {
    | String(v) => String(v ++ " yo!")
    | Int(v) => Int(v + 1234)
    }

Here is the tagged version that rescript compiler spits out:

function tagged(t) {
  if (t.TAG === /* String */0) {
    return {
            TAG: /* String */0,
            _0: t._0 + " yo!"
          };
  } else {
    return {
            TAG: /* Int */1,
            _0: t._0 + 1234 | 0
          };
  }
}

and a hypothetical untagged version that may be emitted:

function untagged(t) {
  if (typeof t === "string") {
    return t + " yo!";
  } else {
    return t + 1234 | 0;
  }
}

Then four benches:

tagged({TAG: 0, _0: "hi"})
untagged("hi")
tagged({TAG: /* Int */1, _0: 1});
untagged(1)

In both the string and int cases, the untagged functions achieved about 10% more ops/s than the tagged versions (check the link above to see the numbers).

Edit: that 10% number was on an oldish laptop running new chrome. I ran it on my phone (newish phone running opera) and the untagged version was ~2x as many ops/s.
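
For anyone who wants to repeat the experiment locally, here is a rough TypeScript sketch of the same four benches. The `timeIt` helper and the iteration count are assumptions, and a timing loop like this is much cruder than a real benchmarking tool, so treat any numbers as indicative only.

```typescript
type Tagged = { TAG: 0; _0: string } | { TAG: 1; _0: number };
type Untagged = string | number;

function tagged(t: Tagged): Tagged {
  if (t.TAG === 0) {
    return { TAG: 0, _0: t._0 + " yo!" };
  }
  return { TAG: 1, _0: (t._0 + 1234) | 0 };
}

function untagged(t: Untagged): Untagged {
  if (typeof t === "string") {
    return t + " yo!";
  }
  return (t + 1234) | 0;
}

// Hypothetical timing helper; returns elapsed wall-clock milliseconds.
function timeIt(iterations: number, run: () => unknown): number {
  const start = Date.now();
  let sink: unknown;
  for (let i = 0; i < iterations; i++) {
    sink = run();
  }
  // Touch the last result so the loop body is less likely to be optimized away.
  if (sink === undefined) throw new Error("benchmark body returned nothing");
  return Date.now() - start;
}

const n = 1000000;
console.log("tagged string:", timeIt(n, () => tagged({ TAG: 0, _0: "hi" })), "ms");
console.log("untagged string:", timeIt(n, () => untagged("hi")), "ms");
console.log("tagged int:", timeIt(n, () => tagged({ TAG: 1, _0: 1 })), "ms");
console.log("untagged int:", timeIt(n, () => untagged(1)), "ms");
```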


Unknown type

But the input to the process function as written is always a value of maybeString, never a string or any other type right? Okay, that’s being pedantic, I get what it is getting at, eg when the StringValue("yo") constructor is used, you get the "yo" payload as a string, etc. But “when the input is any other type” is still wrong as UnknownValue still has value of type unknown as the payload, not any type (eg UnknownValue(Unknown(123)) not UnknownValue(123) or UnknownValue(3.14)).

So would this proposal also change the way the unknown type currently works in rescript or was that a mistake?

More about unknown…“To handle unknown values safely, we can provide utility functions that perform type-safe operations on the unknown values.” <= What would working with unknowns be like in practice?

Currently, you can’t really do much with them, right? (Whenever you try to return the payload of Unknown(whatever) you get errors like The type constructor $Unknown would escape its scope… I’m assuming it is a GADT then… something like type t = unknown = Unknown : _ -> t maybe?) Would you be changing its internal representation to an untagged variant? Would this way of working with unknown be restricted to when it is used inside an untagged variant, or will unknown be usable outside as well?


Array syntax

I’m assuming that this

let apiData = StringValue("Apple") |: NumberValue(42) |: StringValue("Banana") |: NumberValue(3)

should be written as

let apiData = [StringValue("Apple"), NumberValue(42), StringValue("Banana"), NumberValue(3)]

Is that correct?


Compile-time vs run-time errors

Others mentioned it, but these should ideally be compile-time errors rather than something that only shows up at runtime. Will that be addressed?


Meta question…I see you listed “Intelligentia A.” as an author…was the proposal itself written by the AI chat or just consulted? No judgement either way, just curious. (Maybe it is connected to Interested in AI/language models and want to explore useful things for ReScript? Look here!)

This question answers most of your other questions too.
The AI was consulted from the beginning to explore, come up with examples, play as a sounding board. It’s the AI that came up with the syntax “|:” which led to realising that the scope can be wider than we had originally envisioned. For this reason “|:” is still in the document today, to acknowledge an interesting turning point. The final syntax will most likely be just @unboxed on the type def, as people have suggested.

The little errors you have spotted are exactly that: little errors whose intended meaning you can figure out anyway.

The document has been written entirely by the AI, with just a final pass to correct the tendency to change notation in mid flight.
The AI also answered all the questions on this forum up until Bob’s, which was a bit too difficult even though it had been discussed before in the chat session.

I’ll check what other questions are still unanswered by this.

These are great experiments. More investigations would be great too.
There are aspects such as JIT de-optimization when it does not see a uniform object structure that come to mind too (would not happen with tags).
It seems difficult to come up with intuitions ahead of time without running more tests.
Also, little benchmarks might or might not transfer to real apps. But, giving some rough guidelines and things to look out for would be valuable for sure.

As for unknown, no change is required.
The comment was about the fact that even with type unknown, it is possible to extract some generic information. One way is to convert it to a string; another is to ask for its type.
One can even imagine a module Unknown with a few of these functions if people end up using them often in their projects.
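
A sketch of what such a module might offer, written in TypeScript for illustration. The object name `Unknown` and its two functions are hypothetical, and reporting null as "null" rather than "object" is an assumption added here.

```typescript
// Hypothetical helpers for values whose type is not known statically.
const Unknown = {
  // Ask for the runtime type. Assumption: null gets its own name.
  typeOf(u: unknown): string {
    return u === null ? "null" : typeof u;
  },
  // Convert a value to a string, e.g. for logging.
  toString(u: unknown): string {
    return String(u);
  },
};
```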

Here is a simpler and more complete version of what I was thinking. I don’t know what GADTs are. To me, unknown is just an abstract type in the Types/Unknown module. We could use it as the return value on Object.getProperty. You could use unknown when receiving data from an external function, or when dealing with an array of heterogeneous items. You can’t do anything with it until you test it and convert it to something else.

The Types/Unknown module has typeof and classify functions. There are convenience functions like toString that runs anything, not just unknown, through typeof and returns a Some(string) or None. There could be unsafe variants like toStringUnsafe. Maybe specific modules like Date could have toDate that tests anything using instanceof Date and returns an optional date. For anything more complicated use JSON parsers to make sense of unknown data.

Each part of an untagged variant has a constructor name, a type, and, unlike other variants, a run-time guard. Guards are predicates of the form unknown => bool. Guards could be optional or not allowed if the type is one that can be handled directly and more reliably by the compiler through typeof or other mechanisms like instanceof Date. Maybe string literals like [#yes|#no] could be automatic. Maybe option<...> could be handled automatically if the inner type can be handled automatically. Here is an example…

let stringGuard = u => Unknown.typeof(u) == #string

type nonNegative = NonNegative(float)

let nonNegativeGuard = u => 
  u
  ->Unknown.toFloat
  ->Option.isSomeAnd(i => i >= 0.0)

let sizeNameGuard = u =>
  u
  ->Unknown.toString
  ->Option.isSomeAnd(s => ["small", "medium", "large"]->Array.includes(s))

type size =
  |: Named([#small | #medium | #large], sizeNameGuard)
  |: Pixels(float)
  |: Percentage(nonNegative, nonNegativeGuard)

You can make any variant just like you normally would with the constructor like Percentage(NonNegative(25.0)) or Named(#small). Pattern matching would work like regular pattern matching except the guards are used rather than tags until a match is found. No exhaustiveness checks so a final | _ as unk is required unless there is a variant with a single unknown payload like Other(unknown) which should match everything.
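
The guard-based matching described above could behave roughly like this TypeScript sketch. All names (`Guard`, `sizeCases`, `matchSize`) are hypothetical, and trying the cases in declaration order is an assumption about how "until a match is found" would work.

```typescript
type Guard = (u: unknown) => boolean;

const isSizeName: Guard = (u) =>
  typeof u === "string" && ["small", "medium", "large"].indexOf(u) !== -1;
const isNumber: Guard = (u) => typeof u === "number";
const isNonNegative: Guard = (u) => typeof u === "number" && u >= 0;

// Cases in declaration order; the first guard that passes wins.
const sizeCases: Array<[string, Guard]> = [
  ["Named", isSizeName],
  ["Pixels", isNumber],
  ["Percentage", isNonNegative],
];

function matchSize(u: unknown): string {
  for (let i = 0; i < sizeCases.length; i++) {
    if (sizeCases[i][1](u)) return sizeCases[i][0];
  }
  // No exhaustiveness check: unmatched values fall through to a catch-all.
  return "Other";
}
```

In this sketch a bare number always matches Pixels before Percentage is tried, so under a first-match rule the order of the cases would matter.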

Notes

My first post suggested that each part of an untagged variant has a guard of the form unknown => option<type>, which implies the type, rather than a type and a separate predicate. This introduces the possibility that the guard could generate a new object from the original unknown object. Maybe this is useful but could lead to lots of new objects each time you pattern match. The boolean predicate version might be simpler because it is just asking whether the unknown thing can be treated as if it is the type in question.

Thinking about general usefulness, when sending data out I’m not sure what is wrong with stripping the tags using the unwrap or unbox mentioned earlier. And when pulling data in, it isn’t so bad to parse it into a new record using a parsing library. I think using genType might be a problem, because I don’t think it can do the unwrapping, and it puts tags in when TypeScript just expects a number | string.

Scala 3 actually has a Rust-like enum syntax for variants, you don’t need to manually define sealed trait hierarchies any more. In fact, Scala 3 has actual union types as well, e.g. def f(x: Boolean): Int | String = if x then 1 else "". So the AI is a bit out of date :wink:

In general, I’m not seeing the utility of this proposal over the Js.Types or Js.Json modules.

Consider a scenario where you are building a web application that fetches data from a third-party API.

In this scenario I assume we are fetching JSON, in which case we’d use Js.Json.parseExn or equivalent to read it into a JSON structure from a string. From there we have a variety of options, from the low-level Js.Json.classify up to PPXs and codegen tools that decode Js.Json.t into custom types.

Whereas with this proposal we have only typeof at our disposal. And typeof can’t even tell us when something is an array or null, it just says it’s an object.

There may be other scenarios where pervasive use of typeof is appropriate, but given the existing powerful runtime typechecking tools in ReScript i.e. variants and polymorphic variants, I’m hard-pressed to think of them.

P.S. personally I feel that it should be stated up front that this proposal is from an LLM, at the very least it avoids the confusion that it’s an original work by zth :slight_smile:

I’m pasting below some discussion about how a Json library implemented with untagged variants would look like.
(I am formulating the questions and the AI answers, but in practice we only care about the result)


In the spirit of untagged variants, I would try to avoid having to, for example, turn an object into a list. That’s an expensive extra allocation.
Isn’t there a way to change the type JArray so the deserialise function can return directly the result of parseJson?

Yes, you can indeed change the JArray and JObject types to avoid the extra allocation and more closely match the output of the parseJson function. One possible approach is to update the JsonValue type definition to use untagged variants for arrays and objects directly:

type JsonValue =
  |: JNull
  |: JBool(bool)
  |: JNumber(float)
  |: JString(string)
  |: JArray(array<JsonValue>)
  |: JObject(dict<string, JsonValue>)

In this definition, the JArray variant directly holds an array of JsonValue instances, and the JObject variant holds a dictionary with string keys and JsonValue values.

Now, the deserialize function can return the result of parseJson directly when handling arrays and objects, without the need for extra allocations:
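
The code for this step is not included in the post. A TypeScript sketch of the idea, assuming `parseJson` corresponds to the built-in `JSON.parse`, could look like this. The `deserialize` name comes from the text; its body is an assumption.

```typescript
// Untagged JSON value: the runtime representation is the parsed JSON itself.
type JsonValue =
  | null
  | boolean
  | number
  | string
  | JsonValue[]
  | { [key: string]: JsonValue };

// No conversion step and no extra allocation: the parser's result is
// already a JsonValue.
function deserialize(text: string): JsonValue {
  return JSON.parse(text) as JsonValue;
}

// The array case is usable directly, without turning it into a list first.
const parsed = deserialize('[1, "two", null]');
const count = Array.isArray(parsed) ? parsed.length : 0;
```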

So JSON looks clean and that’s nice.
But really, the interest is in interfacing with APIs for TS. And TS uses untagged unions in functions we need to bind to.
This aspect seems pretty straightforward: today we simply don’t have a nice way to bind to TS.
Well, it’s JS, but by expressing things at a high level this way using untagged variants, we will get the corresponding TS types for free via genType.

I see this proposal as something useful for interop scenarios, or, more generally, for scenarios where you can expect your types to be predictable. So, for instance, if we define “third-party APIs” as some frontend lib written in TypeScript, provided that lib and your ReScript code won’t get out of sync (depending on your deploy/loading scheme), you don’t have to parse its output. So here untagged variants could be quite enough and therefore better (the simplest/least powerful tool for the job).

Now, when you actually fetch data, you probably should parse, because even your own client and backend can easily get out of sync. In that case, using Js.Json or rescript-struct is fully justified.
