Parse, Don’t Validate

6 years ago (lexi-lambda.github.io)

This describes what I've known as "Typed Hungarian notation", which I've seen a few times before, though I can't seem to find it now.

The original intent of Hungarian notation was to encode metadata into the variable name - for example, "validData = validate(rawData); showValidData(validData)". The idea was that the prefixes would make it just look wrong to the programmer if someone accidentally used mismatched names, such as "showValidData(rawData)", indicating a likely bug.

This notation mutated into the far more simplistic and arguably less useful "prefix with the datatype", such as iFoo indicating Foo is an integer. This mutation became known as Systems Hungarian, while the original became Apps Hungarian.

The suggestion in this post is taking Apps Hungarian, and instead of relying on variable names, encoding it into the type system itself.

The first time I recall seeing something like this suggested was actually in Java, with examples in that blog post involving class attributes:

  public String Name;
  public String Address;

And creating classes to represent these values, preventing them from being used the wrong way:

  public NameType Name;
  public AddressType Address;

...which is why I remember this as "Typed Hungarian" - making the language's type system handle the work that a person would normally have to think about in Apps Hungarian.
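
A minimal Python sketch of the same move (the Name/Address classes and show_name function are hypothetical, purely for illustration): once the wrapper types exist, passing the wrong one becomes a type error a checker like mypy can flag, instead of a naming convention a human has to notice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Name:
    value: str

@dataclass(frozen=True)
class Address:
    value: str

def show_name(name: Name) -> str:
    # show_name(Address("221B Baker St")) is now a type error under mypy,
    # where show_name("221B Baker St") on a plain str would have passed silently.
    return f"Name: {name.value}"

print(show_name(Name("Ada Lovelace")))
```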

  • The Java example you've provided is known as Domain Modeling, it's well described (and honestly greatly expanded on) in Domain Driven Design by Eric Evans.

    It's applied heavily in the enterprise world simply because it's so powerful. In general as the domain logic becomes more complex the benefits of doing this increase.

    Actually, not encountering this style in code-base which is solving a complex problem is a massive warning sign for me. It usually means that concepts are poorly defined and that the logic is scattered randomly all over the code-base.

    Apps Hungarian is just a logical style: name functions, variables, types so that they're meaningful within your domain. The result of which is code which is very easy to understand for someone who understands the domain. This doesn't mean long names - if you're doing anything math intensive then using the short names which conform to the norms of the field is perfect. For a business process it probably isn't :)

    • >Actually, not encountering this style in code-base which is solving a complex problem is a massive warning sign for me.

      There is a fine line between capturing some common meaning in a reusable type (e.g. Optional&lt;T&gt;) and spending most of your time crafting a straitjacket of type restrictions that are supposed to thwart some imaginary low-level bug in the future.

      Java is actually a great example of how this mentality backfires in real life. The language, from day 1, had a zillion ways to force someone else to do something they didn't want to. Abstract classes. Private methods. Final classes. Etc. It was supposed to ensure good design, but in practice just led to the endless reenactment of the banana-gorilla-jungle problem. (https://pastebin.com/uvr99kBE)

      At the same time, the language designers didn't add such basic features as lambdas and streams until 1.8. This led to the creation of an ungodly amount of easily preventable bugs, since everyone was writing those awful imperative loops.

    • It's particularly powerful if you can create an algebra of your domain. You can define invariants which are then enforced by the type system.

      I use this a lot when writing C# code and the whole experience is getting better with every new release. Microsoft keep adding awesome stuff like ADTs which make working with these types a dream.

    • > Actually, not encountering this style in code-base which is solving a complex problem is a massive warning sign for me.

      This is very narrow-minded. DDD as a broad concept is good and something I practice.

      DDD in enterprise is a disastrous nightmare. When people on your team are wasting time Googling lingo to try to figure out what kind of object they need, where to put a file, or where a method should go, it's a huge red flag.

      I was on a team that seemingly spent all its time trying to figure out what the hell DDD was prescribing, let alone trying to figure out how to do it.

    • This can go two ways. One can end up with lots of different types that do nothing but wrap a string. On the other hand I have also seen a float being used to represent a price. The correct way is somewhere in between those two extremes. If you have some logic specific to a value, by all means, make it a type but if there isn't it just ends up with lots of boilerplate.
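
      For instance, a price carries real domain logic (float rounding is a classic bug), so it earns its own type; a Python sketch with a deliberately naive string parser (all names hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    """A price stored as integer cents, so arithmetic stays exact."""
    cents: int

    @staticmethod
    def from_string(text: str) -> "Price":
        # Naive "dollars.cents" parser, for illustration only.
        dollars, _, cents = text.partition(".")
        return Price(int(dollars) * 100 + int(cents or "0"))

    def __add__(self, other: "Price") -> "Price":
        return Price(self.cents + other.cents)

print(Price.from_string("19.99") + Price.from_string("0.01"))  # Price(cents=2000)
```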

  • Another good example of this is having separate classes for something like unsafe strings vs. safe strings in a web app. The functions which interact with the outside world accept unsafe strings and emit safe strings to the rest of the application. Then the rest of the application only works with safe strings.

    Anything that accepts a safe string can make an assumption that it doesn't need to do any validation (or "parsing" in the context of the OP), which lets you centralize validation logic. And since you can't turn an unsafe string into a safe string without sending it through the validator, it prevents unsafe strings from leaking into the rest of the app by accident.

    This concept can be used for pretty much anything where you are doing data validation or transformation.
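
    A minimal sketch of that discipline in Python (class and function names are hypothetical):

```python
import html
from dataclasses import dataclass

@dataclass(frozen=True)
class UnsafeString:
    """Raw input from the outside world; must pass through sanitize()."""
    value: str

@dataclass(frozen=True)
class SafeString:
    """Only ever produced by sanitize(), so the rest of the app can trust it."""
    value: str

def sanitize(raw: UnsafeString) -> SafeString:
    # The single, central place where escaping happens.
    return SafeString(html.escape(raw.value))

def render(body: SafeString) -> str:
    # Accepts only SafeString; handing it an UnsafeString is a type error.
    return f"<p>{body.value}</p>"

print(render(sanitize(UnsafeString("<script>alert(1)</script>"))))
```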

    • Also a good way to prevent hashed passwords from being accidentally logged.

          class PasswordType(django.db.models.Field):
              hashed_pw = CharField()

              def __str__(self):
                  # you can even raise an Exception here
                  return '<confidential data>'
      

      Not that you should be trying to log this stuff anyways, but unless you're a solo dev you can't prevent other people from creating bugs, but you can mitigate common scenarios.
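
      The same trick works without Django; a plain-Python sketch (hypothetical class name):

```python
class HashedPassword:
    """Wraps a password hash so accidental logging prints a redaction."""

    def __init__(self, hashed: str) -> None:
        self._hashed = hashed

    def __str__(self) -> str:
        return "<confidential data>"

    __repr__ = __str__  # so it is redacted in f-strings, print(), and logs

pw = HashedPassword("5f4dcc3b5aa765d61d8327deb882cf99")
print(f"user logged in with {pw}")  # user logged in with <confidential data>
```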

    • What are safe and unsafe strings supposed to mean? All strings seem like normal string to me, a "DELETE * FROM db" is no different from any other string until it's given to a SQL query.

  • Rust is extremely good at this with newtype syntax and the ability to apply impls and derives to masking types.

    Your JSON data might be a string, but having a Json type that guarantees it's valid Json is way better. Same with stringified Base64, a constrained numeric value for a tunable, etc. Because using From impls on these types lets the compiler figure out almost all invalid type usages at compile time and give you eloquent feedback on it.
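
    The same guarantee can be approximated in any language with constructors; a Python sketch, checked at construction time rather than compile time (type name hypothetical):

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Json:
    """A string guaranteed to be valid JSON: the constructor parses it."""
    text: str

    def __post_init__(self) -> None:
        json.loads(self.text)  # raises ValueError on malformed input

ok = Json('{"valid": true}')   # constructed fine
try:
    Json("{not json")          # rejected before it can leak anywhere
except ValueError:
    print("rejected")
```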

  • Your comment gets the history hilariously backwards.

    Actually using the type system has been standard practice in ML-family languages since at least the '70s. Simonyi described Hungarian notation as a way of, effectively, emulating a type system in less-powerful languages; describing Hungarian notation as "prefixing with the type" was not a mistake (as Spolsky claims) but an accurate description of how the technique was intended to be used.

    The "Systems Hungarian" mistake came about because C programmers misunderstood the term "type" to refer to something far more limited than it does.

    The technique in the article is not "encoding Apps Hungarian into the type system". It's doing the very thing that inspired ("Apps") Hungarian in the first place!

  • The original paper by Simonyi actually does describe prefixing with datatype. E.g. b for byte, ch for character, w for word, sz for pointer to zero-terminated string.

    He specifically states:

    The basic idea is to name all quantities by their types (...) the concept of "type" in this context is determined by the set of operations that can be applied to a quantity. (..)

    Note that the above definition of type (which, incidentally, is suggested by languages such as SIMULA and Smalltalk) is a superset of the more common definition, which takes only the quantity's representation into account

    So he is comparing to early object oriented languages where presumably Hungarian notation is not needed anymore because the type system of the language itself is rich enough to constrain the set of valid operations on the value/object.

    So the "metadata" encoded in the Hungarian prefix is exactly what would be called a type in a modern language. The idea of Hungarian was valid in a language with an insufficiently expressive type system.

    I think the term "Typed Hungarian Notation" is really weird. Hungarian Notation is a workaround for a limited type system. "Typed Hungarian" is just... typed code.

    • > I think the term "Typed Hungarian Notation" is really weird.

      I'm glad I'm not the only one. "Typed Hungarian notation" reminded me of "horseless carriage". :)

  • This helps the compiler, but only works for languages with type declarations, and is less helpful to the programmer, as to find a variable's kind (e.g. valid data vs. raw data) you need to look at its declaration, which may involve scrolling. The original idea, described in Joel Spolsky's original article on Apps vs. Systems Hungarian (https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...) is that the programmer can see, in the expression that the variable is used or the function is called, that the code is wrong.

    I apply a variant of Apps Hungarian in my own code, e.g.

        (setf input
              (inputOfEdge 
               (first (edgesOfSocket
                       (outputOfGate
                        (nth (position visibleInput (inputsOfVertex node))
                             (ingatesOfEnclosure enclosure)))))))
    

    (In the system this is lifted from, node was confirmed to be a Vertex a few lines earlier, and Output is a subclass of Socket.)

    The general idea is language independent: you could write similar code in Java or C.

  • I generally like this approach, but it's unfortunately really cumbersome in most mainstream languages like Java (and even Scala 2.x) or C#, but it works beautifully in Haskell because of two features:

    - non-exported newtypes (also known as opaque types in Scala 3.x); this means your little module is in full control of instantiation of a 'parsed' value from an 'untyped' value.

    - DerivingVia (not sure if there's an equivalent in Scala 3.x); this makes declaring type class instances equivalent to the 'wrapped' value completely trivial.

    These two features (and, I guess, the pervasiveness of type classes) in Haskell make this sort of thing as close to zero effort as possible.

    EDIT: Looking at some sibling posts, I think my C# knowledge may be out of date. Apologies for that.

    • I think you mean GeneralizedNewtypeDeriving, not DerivingVia. GND lets you pull instances through a newtype, trivially. DerivingVia lets you define your instance in terms of an instance on another newtype.

  • Here is a blog post applying a similar idea to matrix math:

    https://www.sebastiansylvan.com/post/matrix_naming_conventio...

    • I wish there were a type system for numerical computation that could encode things like the rank and size of tensors and statically check them before running. Currently, catching these bugs requires running the code, which makes for a slow development loop; it would also be a much more robust system than documenting these things in comments.
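
      Short of a dependently-typed tensor library, even a runtime-checked wrapper moves the failure to the construction and composition sites; a toy Python sketch of the kind of check being wished for here (names hypothetical, and runtime-only rather than static):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Matrix:
    """Row-major matrix whose shape is part of the value."""
    rows: int
    cols: int
    data: tuple

    def __post_init__(self) -> None:
        assert len(self.data) == self.rows * self.cols, "shape/data mismatch"

def matmul(a: Matrix, b: Matrix) -> Matrix:
    # The shape check every caller would otherwise have to remember.
    if a.cols != b.rows:
        raise TypeError(f"shape mismatch: {a.rows}x{a.cols} @ {b.rows}x{b.cols}")
    data = tuple(
        sum(a.data[i * a.cols + k] * b.data[k * b.cols + j] for k in range(a.cols))
        for i in range(a.rows)
        for j in range(b.cols)
    )
    return Matrix(a.rows, b.cols, data)
```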

  • What you’re describing is in a similar vein to what is described in my blog post, and it’s often absolutely a good idea, but it isn’t quite the same as what the post is about. Something like NameType attaches a semantic label to the data, but it doesn’t actually make any illegal states unrepresentable… since there really aren’t any illegal states when it comes to names. (See, for example, https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-... .)

    In other situations, the approach of using an abstract datatype like this can actually rule out some invalid states, and I allude to that in the penultimate section of the blog post where I talk about using abstract types to “fake” parsers from validation functions. However, even that is still different from the technique focused on in most of the post. To illustrate why, consider an encoding of the NonEmpty type from the blog post in Java using an abstract type:

      public class NonEmpty<T> {
        public final ImmutableList<T> list;
    
        private NonEmpty(ImmutableList<T> list) {
          this.list = list;
        }
    
        public static <T> Optional<NonEmpty<T>> fromList(List<T> list) {
          return list.isEmpty()
            ? Optional.empty()
            : Optional.of(new NonEmpty<>(ImmutableList.copyOf(list)));
        }
    
        public T head() {
          return list.get(0);
        }
      }
    

    In this example, since the constructor is private, a NonEmpty<T> can only be created via the static fromList method. This certainly reduces the surface area for failure, but it doesn’t technically make illegal states unrepresentable, since a mistake in the implementation of the NonEmpty class itself could theoretically lead to its list field containing an empty list.

    In contrast, the NonEmpty type described in the blog post is “correct by construction”—it genuinely makes the illegal state impossible. A translation of that type into Java syntax would look like this:

      public class NonEmpty<T> {
        public final T head;
        public final ImmutableList<T> tail;
    
        public NonEmpty(T head, ImmutableList<T> tail) {
          this.head = head;
          this.tail = tail;
        }
    
        public static <T> Optional<NonEmpty<T>> fromList(List<T> list) {
          return list.isEmpty()
            ? Optional.empty()
            : Optional.of(new NonEmpty<>(list.get(0), ImmutableList.copyOf(list.subList(1, list.size()))));
        }
      }
    

    This is a little less compelling than the Haskell version simply because of Java’s pervasive nullability and the fact that List is not an inductive type in Java so you don’t get the exhaustiveness checking, but the basic ideas are still the same. Because NonEmpty<T> is correct by construction, it doesn’t need to be an abstract type—its constructor is public—in order to enforce correctness.
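
    The same correct-by-construction shape, sketched in Python for comparison (illustration only):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NonEmpty:
    """Non-emptiness is structural: the head is a separate, mandatory field."""
    head: object
    tail: tuple

    @staticmethod
    def from_list(xs: list) -> Optional["NonEmpty"]:
        return None if not xs else NonEmpty(xs[0], tuple(xs[1:]))

assert NonEmpty.from_list([]) is None
assert NonEmpty.from_list([1, 2, 3]).head == 1
```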

    • Maybe I shouldn't have included the Java example, some others have jumped on it as well. That wasn't meant to be a summarization, but a similar idea in a different context.

      You also seem to have missed the point of the Java example anyway, in a misses-the-forest-for-the-trees way. It was meant as a 1:1 example of an important result of your blog post, not an example of constructing the parse function:

      > However, this check is fragile: it’s extremely easy to forget. Because its return value is unused, it can always be omitted, and the code that needs it would still typecheck. A better solution is to choose a data structure that disallows duplicate keys by construction, such as a Map. Adjust your function’s type signature to accept a Map instead of a list of tuples, and implement it as you normally would.

      Using NameType instead of String is the same as using a Map instead of a list of tuples.

      It was why I included AddressType in the Java example. Just like changing the function signature to not accept a list of tuples and require a Map instead, forcing you to use the parse function to construct the Map, functions that only work on AddressType or only work on NameType can't receive the other one as an argument - where with a String, they could. They have to pass through the requisite parse function to convert a String into a NameType or AddressType first, however those are implemented.

      And I've seen the falsehoods lists before; "Name" and "Address" were simply the first thing that popped into mind while typing that up. Examples are just examples, and Name and Address are conceptually different enough regardless of the falsehoods that the main idea behind the example ought to get by anyway.
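
      To make the Map analogy concrete in Python: one parse step rejects duplicates at the boundary, after which the dict guarantees key uniqueness by construction (function name hypothetical):

```python
from typing import Optional

def parse_unique(pairs: list) -> Optional[dict]:
    """Turn (key, value) tuples into a dict, or None on a duplicate key."""
    result = {}
    for key, value in pairs:
        if key in result:
            return None  # reject once, at the boundary
        result[key] = value
    return result

assert parse_unique([("a", 1), ("b", 2)]) == {"a": 1, "b": 2}
assert parse_unique([("a", 1), ("a", 2)]) is None
```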

  • I currently use this in my language "Grammar" (https://jtree.treenotation.org/designer/#standard%20grammar). In Grammar you can currently define two things: cellTypes and nodeTypes. Grammar looks at the suffix to determine the type. So if you are building a language with the concept of a person node and an age cell, you'd write "personNode" and "ageCell". "ageCell" might extend "intCell". I've found it a pleasure to use.

  • > making the language's type system handle the work that a person would normally have to think about in Apps Hungarian.

    That sounds a lot like "type validation" to me, whereas the title of the article includes "Don't Validate"

    • Read all the way down through the "Parsing, not validating, in practice" header. It eventually gets to examples exactly like mine, where "parse" means "transform into a different datatype so it can't be used wrong", while "validate" means "generate a message, but don't preserve that message in the type system in a way that's usable later".

    • > That sounds a lot like "type validation" to me, whereas the title of the article includes "Don't Validate"

      The variable name version does, sure. That's why OP said:

      > The suggestion in this post is taking Apps Hungarian, and instead of relying on variable names, encoding it into the type system itself.

      Making it clear that their example is different from what the article suggests.

  • My earliest programming experience taught me to use Systems Hungarian with scope. So, something like lcName for local character name.

    • On one of my first jobs all the class names were prefixed with C, but it wasn't carried through to variables or anything. I said we should stop prefixing class names with C and got shouted down. That's where I learned to mostly just do what I want without asking. Once you have 5k lines of something your way it is really too late for anyone to ask you to change it :)

  • Your comment is wrong. You are talking about "encapsulation tricks", where you don't make illegal states unrepresentable, but just gate all the construction so you can audit it and then declare the unrepresentable state won't be constructed.

    Sometimes this is good, but the benefits are a lot lower, and many an overzealous programmer has taken the "boolean blindness"/"string blindness" argument too far with copious newtypes.

    Actually ruling out invalid states by construction, however, is way more valuable, and far more likely to be worth the work. It requires more thought than just newtyping away, but that's a feature, not a bug.

    • Similar to some other replies, you've jumped to the Java example and made assumptions about it without reading the parts immediately before it.

Yes, yes yes! Encode invariants in your data, don't enforce invariants on your data. I particularly think this point needs to be stressed more, because practically speaking, not every language has the typechecking faculties of Haskell:

> Sometimes, making an illegal state truly unrepresentable is just plain impractical given the tools Haskell provides, such as ensuring an integer is in a particular range. In that case, use an abstract newtype with a smart constructor to “fake” a parser from a validator.

For instance, few type systems will let you encode prime numbers as datatypes, so you'd have to do something in your language like

    newtype Prime = Prime Integer

    mkPrime :: Integer -> Maybe Prime
    mkPrime p | isPrime p = Just (Prime p)
              | otherwise = Nothing

which is parsing!
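
The same smart-constructor pattern in Python, where the "private constructor" is convention rather than compiler-enforced (which is itself the limitation being worked around; names hypothetical):

```python
from typing import Optional

class Prime:
    """Construct only via Prime.parse(); holding one certifies primality."""

    def __init__(self, value: int) -> None:
        # Trusted code: by convention only parse() calls this.
        self.value = value

    @staticmethod
    def parse(n: int) -> Optional["Prime"]:
        # Trial division up to sqrt(n); enough for an illustration.
        if n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1)):
            return Prime(n)
        return None

assert Prime.parse(13) is not None
assert Prime.parse(12) is None
```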

  • This is why learning Elm has been valuable for me.

    Without the convoluted type system of Haskell, you have to do this more, and it is a lot simpler to understand and maintain. You are never going to be able to use the type system to guarantee everything about your data, so let your runtime code check it, add lots of tests and keep it isolated (Elm calls them "Opaque Types") so it is easy to reason about.

    The equivalent is possible in OO with classes, but only if you make everything immutable and ideally have a layer of compile-time null checking on top.

    In short, with Haskell I am learning and relearning language features constantly and banging my head. With Elm I grok the language and I am focused on solving the problem.

    • I think it’s fantastic that Elm works well for you; it’s a wonderful language and there’s no doubt that it is much simpler than Haskell. I agree with you that, in many cases, the technique you’re describing is sufficient, and it requires much less sophisticated machinery than some of the fancier Haskell techniques.

      That said, I do think what you’re describing is different from what is described in the blog post. Opaque/abstract types reduce the trusted code significantly, which is fantastic, but the example provided in the blog post is actually even better than that: because NonEmpty is correct by construction, the amount of trusted code goes down to zero. A lot of the tools Haskell provides, like GADTs and type families, can be used to capture more invariants in a correct-by-construction way, rather than relying on abstraction boundaries. There are advantages and disadvantages to both.

      I certainly don’t have any ill will toward anyone who finds the complexity of Haskell not worth the extra benefits, so to be quite clear, I’m not trying to sell you on Haskell instead of Elm! I do think being aware of the distinction is still useful, though.

    • Yes, being able to make trivial value types is quite valuable. Despite being a rather different language, this is something Go gets right as well.

  • > such as ensuring an integer is in a particular range

    I find it funny that one of the most advanced programming languages of our day can't do what Pascal did back in 1980 :-)

  • In short, "algorithms are for things we are able to encode in the data structures". Or "Types first, Logic second, if at all."

I learned a lot from this article and hope to revisit it.

I had one quarrel:

> Consider: what is a parser? Really, a parser is just a function that consumes less-structured input and produces more-structured output. By its very nature, a parser is a partial function—some values in the domain do not correspond to any value in the range—so all parsers must have some notion of failure.

I think it's reasonable to include some total functions as parsers as well, because some grammars -- structures that are recognized by the parser and that specify languages -- generate all strings as the corresponding language. In that case, there are no failure cases because every string is a valid input.

I came up with some more-trivial examples, but I think a less-trivial example is the split() function in many languages. Its Haskell type signature is String -> [String], and it is a total function. It handles the case where the input is a nonempty string with no delimiters (the output is a list containing a single token equal to the input), as well as the case where input is an empty string (the output is an empty list), and the case where the input consists only of delimiters (in many implementations, the output is a list containing n+1 empty strings).

Another trivial-ish case is where the grammar allows an arbitrary trailing string following the object of interest (similar to C's atoi). In that case, a parser will always return the empty object if there is no interesting or relevant data in the input. (For example, atoi returns 0 if the input string doesn't begin with an integer.) This could be viewed as a kind of error, but there may be applications in which this behavior is useful.
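
The split() case is easy to check directly in Python, where str.split with an explicit separator is total: every input string produces some list (note that Python returns [''] for the empty string, one of the implementation differences mentioned above):

```python
# str.split with an explicit separator never fails; every string is valid input.
assert "a,b,c".split(",") == ["a", "b", "c"]
assert ",,".split(",") == ["", "", ""]  # only delimiters: n+1 empty tokens
assert "".split(",") == [""]           # empty input still yields a list
```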

  • I think this is a reasonable point, and your split() example is a good one. I wasn’t sure while I was writing the blog post if I considered isomorphisms to be parsers, and I’m still not entirely sure if I do or not, but there’s an argument to be made that they are simply parsers where the set of possible failures is empty. I don’t have strong feelings about that distinction one way or the other, though.

    • Further nit-picking: even if there are no errors, I don't think it's an isomorphism unless the parser preserves all information? Usually parsers treat some input strings as equivalent (by stripping whitespace and comments), so there are fewer outputs than valid inputs.

      I can't think of an easy way to make string splitting an isomorphism in both directions. You can serialize any list of strings to a comma-separated list, but only if you use escape sequences or encode the length of each item, and then there are some input strings that are errors.

      I guess you could define a type as the set of all valid strings in the input language, but that's going out of your way to make sure parsing has to be done more than once.

      Parsers that report no errors don't seem desirable anyway; error-checking is usually good.

    • Some other random feedback: the short phrase makes me think that I should parse instead of validating. Might I suggest: "Don't just validate: parse"?

    • Structure is orthogonal to partiality.

      Partiality is what you get when you choose you only accept one branch of a disjunctive structure.

  • This seems needlessly pedantic - we could also notice that a total function is just a special case of a partial function, and so total functions are already included by the article.

Dear awesome Haskell writers. You are writing great articles, but I can't understand the code examples, as Haskell is too far from the programming languages I know. Can you provide examples in TypeScript / Rust / Swift / Reason or another C-like language? If not, I'm curious why? Love, Iddan

  • I have provided a translation of the NonEmpty datatype into Java in this comment: https://news.ycombinator.com/item?id=21478322

    However, to answer your “why” question: mostly just because I write Haskell for a living, and my primary personal motivation for writing this blog post was to share with my coworkers and to point to during code reviews. Therefore, it makes the most sense for the examples to be in Haskell! Haskell is also especially well-suited for communicating this sort of thing if you already understand the notation, as you can sort of see from the Java translation: it’s significantly more verbose just to get the same idea across. Still, it’d be great if someone were to provide a “translation” of the same ideas to other languages, no doubt!

  • Dear iddan,

    Haskell is in fact a programming language, despite what the phrasing of your comment might imply. Translation from language to language is often difficult; we don't typically do it from English to, say, Spanish, not because it wouldn't be valued, but because we may not have the same expertise in Spanish and it would be expensive to hire a person to do it. Expertise is important, because key ideas may get lost in translation.

  • I believe I understood the article completely, even though I haven't written a line of Haskell in my life.

  • Since you mentioned typescript first, here's a library that is basically an implementation of the ideas OP tries to convey: https://github.com/gcanti/io-ts

    Can't recommend using this library enough; being able to essentially reject any input (e.g. json over http) at runtime because it does not conform to the type definition is such an amazing thing!

  • One of Haskell's virtues is that it is concise enough to make pleasant reading in a blog post.

    Come learn Haskell! It's fun and edificational.

I really enjoyed this article!

I think that leveraging the type system to enforce invariants about your data model is one of the most effective things you can do as an engineer.

I gave a talk on this topic at my workplace, using Elixir and Haskell as examples. It's a little haphazard, since it wasn't written for public consumption, but someone might find it interesting: https://github.com/QuinnWilton/programming-with-types-talk

Not only was this the first article about Haskell that I could actually understand, it was also the first where type annotations made sense: they actually help you think about edge cases instead of just annoying you.

In my personal experience, type annotations (not to be confused with type systems) make the code harder to read, for very little benefit. I would like a type checker that babysits you, instead of you having to babysit it. Instead of the type checker crying over trivial issues and needing your comfort, the type checker should comfort me when I'm sad.

While manually writing human-friendly error messages helps, you still have to discover edge cases yourself. It would be nice to have a tool, not necessarily baked into the type system, that automatically detects edge cases. For example, detecting when a variable has the possibility of becoming undefined/null. Maybe achieved via fuzzing.

A colleague once asked me "How do I get rid of the damn nulls?"

Nulls are basically a lazy error/exception. Error handling is so laborious we often "forget" to do it. And this is very confusing for beginners. There are basically two camps: one that thinks errors are exceptions, and one that thinks errors should be first-class citizens. But there are different domains, of course. I don't want my CPU to maybe return an error message when I ask it to multiply two numbers; I want that to always work. I don't care if the result gets shifted (that would be a job for the type checker to detect, or better, automatically use mult64 instead of mult32), because handling errors at that level of detail would be too annoying and bad for performance. Writing network services, however, it is crucial that I get notified if a request fails.

  • For what it's worth: it's _very_ rare for a type annotation to be required in Haskell. It's just considered best practice (supported by compiler warning options) to have them on all top-level declarations, for two reasons:

    - it's a good extra layer of _human-readable_ documentation about basic behavior and requirements (as in this article); and

    - it lets the compiler give better error messages.

    The compiler is _always_ going to do its own inference, whether you give it a signature or not. If it infers a valid type _which isn't the one you thought you wrote_, and you didn't specify what the type _should_ be, that function/statement will compile -- you won't get an error until later, when you attempt to consume it and the types don't match. This can be harder to trace, especially if there's a bunch of polymorphic code in the middle where any type will pass. Supplying a manual annotation lets the compiler immediately say "hey, these don't match."

I would like to see an example of something less trivial than encoding `not empty` in the type. At some point the difference between what the author calls "parsing" and "validating" gets blurry, or runs into properties the type system can't encode.

It's pretty cute. Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+" or "Strings that are proper nouns". ;)

I'm not sure I understand the benefits of this compared to a traditional parser generator.

"Use a data structure that makes illegal states unrepresentable." is not effort spent well. (And, given Gödel's incompleteness theorems, unachievable in the general case.)

A proper parser generator using a grammar will give you the structural guarantees you need, and a lexer will reject invalid terms. A transformation step then gives you the target structure. Eventually, though, you will have to validate at runtime.

  • > Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+"

    Sure there are. :) Technically speaking, anyway. Here’s a type for “strings that end with a dot”:

      -- | A string that ends in the character '.',
      -- represented as a string without its trailing dot.
      newtype StringEndingInDot = StringEndingInDot String
    
      fromString :: String -> Maybe StringEndingInDot
      fromString str = case reverse str of
        '.':cs -> Just $ StringEndingInDot (reverse cs)
        _      -> Nothing
    
      toString :: StringEndingInDot -> String
      toString (StringEndingInDot str) = str ++ "."
    

    And here’s a type for “strings that match [a-z]+”:

      data Letter = A | B | C | D | ... | X | Y | Z
      type StringAZ = NonEmpty Letter
    

    Now, admittedly, I would never use either of these types in real code, which is probably your actual point. :) But that’s a situation where there is a pragmatic “escape hatch” of sorts, since you can create an abstract datatype to represent these sorts of things in the type system without having to genuinely prove them:

      module StringEndingInDot(StringEndingInDot, fromString, toString) where
        newtype StringEndingInDot = StringEndingInDot { toString :: String }
    
        fromString :: String -> Maybe StringEndingInDot
        fromString str = case reverse str of
          '.':_ -> Just $ StringEndingInDot str
          _     -> Nothing
    

    You may rightly complain that’s a lot of boilerplate just to represent this one property, and it doesn’t compose. That’s where the “Ghosts of Departed Proofs” paper (https://kataskeue.com/gdp.pdf) cited in the conclusion can come in. It provides a technique like the above that’s a little more advanced, but brings back composition.

    • >Technically speaking, anyway. Here’s a type for “strings that end with a dot”:

      These are just two functions that wrap and unwrap a string. Your "type" doesn't generate a compile-time error when constructed with the wrong data. This doesn't even handle the simplest use case:

         putStrLn (toString (fromString "something."))

      4 replies →

  • > It's pretty cute. Still no static types for "Strings that end with a dot" or "Strings that match [a-z]+" or "Strings that are proper nouns". ;)

    That’s not true, you can wrap these in types that can only be constructed from valid data, and from that point on you can be sure everything is correct.

  • Idris:

      data EndsWith : Type where
        EndsWithPeriod : (s: String) -> { auto prf : ("." `Strings.isSuffixOf` s = True) } -> EndsWith
    
      s : EndsWith
      s = EndsWithPeriod "."
    
      -- Value of type False = True cannot be found
      -- s2 : EndsWith
      -- s2 = EndsWithPeriod ""
    

    The noun one might be possible -- I'm honestly less familiar with what a proper noun is offhand. (E: prefix=>suffix)

  • > "Use a data structure that makes illegal states unrepresentable." is not effort spent well.

    Am I to presume you do all your programming in assembly (or opcodes?), since you never need to distinguish integers from strings in your source code?

    > Gödels incompleteness theorems

    is never relevant in the real world.

    In the general case, all real computers are broken.

  • Ignoring all the convoluted type examples here, the simplest case is that you create the type that matches the parsed value. For example, you can create an EmailAddress class that can only contain valid email addresses. Instead of String, you use EmailAddress whenever you need to ensure the valid parsed value.
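A minimal TypeScript sketch of that idea. The class name is from the comment above, but the regex and method names are my own, purely illustrative choices (real email validation is far more involved):

```typescript
// EmailAddress can only be constructed through parse(), so any value of
// this type is guaranteed to have passed the check exactly once.
class EmailAddress {
  private constructor(readonly value: string) {}

  // Returns null instead of throwing, so callers must handle failure.
  static parse(raw: string): EmailAddress | null {
    // Deliberately naive pattern, for illustration only.
    return /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(raw)
      ? new EmailAddress(raw)
      : null;
  }
}

// Downstream code takes EmailAddress, not string, so it never re-validates.
function sendWelcome(to: EmailAddress): string {
  return `Sending welcome mail to ${to.value}`;
}
```

Because the constructor is private, there is no way to smuggle an unchecked string into a function expecting an EmailAddress.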

  • Proper nouns are finite, albeit quite numerous, so these _could_ be enumerated as a sum type.

    • By the time you finished there would be new ones though (someone giving their kid a novel name, a sports team or company rebranding, etc).

Nice article! It got me thinking about an issue I've noticed in dynamically typed languages (namely JavaScript) where it's very easy to end up doing lots of (potentially redundant) validation all over the place because it's much more difficult / unwieldy to pass that information along.

  • Everybody notices this. That's why most dynamically typed languages, after reaching wide adoption, try to shoehorn in type safety to fix it.

  • In this case not even TypeScript can rescue you. You might naively implement:

        const head = (input: A[]) => input[0]
    

    and it will happily infer a return type of A. Then you'll fall flat on your face the first time you encounter an empty array:

        head([]) // -> undefined
    

    To make it correct you need to explicitly define a return type of (input: A[]): A | undefined, just like Maybe. It's obviously impossible for TS to guarantee that an array will not be empty at runtime, but I wish this specific case triggered some help from the type checker.

    • > It's obviously impossible for TS to guarantee that an array will not be empty at runtime, but I wish this specific case triggered some help from the type checker.

        type NonEmptyArray<T> = { head: T, rest: T[] };
      
        function makeNonEmptyArray<T>(array: T[]): NonEmptyArray<T> | null {
          if (array.length == 0) return null;
          return { head: array[0], rest: array.slice(1, array.length) };
        }
      
        function head<T>(array: T[]): T | null;
        function head<T>(array: NonEmptyArray<T>): T;
        function head(array: any): any {
          if (Array.isArray(array)) {
            if (array.length === 0) return null;
            return array[0];
          }
      
          if (typeof array === "object") {
            return array.head;
          }
        }
      

      http://www.typescriptlang.org/play/#code/C4TwDgpgBAcg9gOwKIF...

      It's impossible for Haskell to guarantee an array will not be empty at runtime as well; that's why we can write a new type + a smart constructor to track runtime properties.

      1 reply →

    • > It's obviously impossible for TS to guarantee that an array will not be empty at runtime,

      Maybe you could use "Rest elements in tuple types" and do an overloaded signature like this:

          function head<T>(...args: [T, ...any[]]): T;
          function head<T>(...args: []): undefined
          function head<T>(...args: [T, ...any[]] | []): T | undefined
          {
              return args[0];
          }
      

      (not tested)

      You'd have to spread the arguments or use call/bind/apply, though.

      [1]: https://github.com/microsoft/TypeScript/pull/24897

  • Yep. For example you call:

        config.load_from_file(....
    

    And you can't know whether it checks that the file exists first or not. Out of pure habit, you add checks... (What? Read the library's code? That's crazy!)

    • The “exists” state is dynamic (something else in the system could delete the file after it was last checked). It's more reliable if you don't check in advance and simply try to open the file every time you need it, and handle errors while opening the file.

      3 replies →

  • Yeah, the author mentions that too.

    For me, the funny thing about this article is that I kind of had the core insight just yesterday while hacking together a cli framework, but like the author, I couldn't quite explain what I learned.

Am I right to suspect the only reason "there can’t be any performance overhead" is that this is being done in a lazy language like Haskell? Meaning the statement won't hold in >99% of practical cases? Or did I misunderstand something?

  • No, this doesn't have anything to do with Haskell's lazy evaluation (at least not in the NonEmpty list example presented). The general idea is that if you are going to perform validation checks at some point in your dynamic language, you won't lose any performance performing those checks up-front in your static language.

    • But how can this possibly be true in general? As an example, just imagine I want to check that a string represents a list of space-delimited integers, where each one is also less than 10 digits. It's far more trivial to verify that than to actually parse the integers. And by performing the verification pass separately before the parsing pass, I can reject invalid inputs early, leading to much faster rejections than if I parse them as I validate them. The only way I can see there being zero cost difference is if everything is implicitly lazy, such that at runtime the verification won't even happen until the parsing needs to be performed too. Right?

      4 replies →

I understand the concept of wrapping values in specific types, which gives you certain guarantees at compile time. I really like this concept and will play around with it some more, because the empty-something issue is something I struggle with myself. But what really irks me is the use of throw in the exceptional case. My go-to type in these situations would be Either rather than a throw, but that would create nearly the same issues at the call site as a Maybe would. Now, one could argue that this is an exceptional case the user or program can't possibly handle. So how do you handle it then? My main usage is in Rust, and there the panic! macro seems to be used as often as unsafe raw pointers :)

Using this programming style requires a rather powerful type system, if you want to go past the simple examples the author shows and encode more complex validity conditions. I am still learning these things, but AFAIU the extreme edition of all this is the Calculus of Constructions used in Coq and Lean, where you can encode theorems and algorithm specifications as very complicated types; proving a theorem corresponds to exhibiting an element of the theorem itself (seen as a type), and if you find an element of a type encoding a certain specification, Coq can work out for you a program that implements that specification.

This is what I understood, at least. Things are quite complicated!

  • A pragmatic interpretation of this is to:

    * subclass or compose String whenever possible. (Id, Name, Address, AddressLine1)

    * subclass or compose Number/Matrix types whenever possible (e.g. Users, Packages, Widgets)

    * Use immutable collections (e.g. Google Guava library in Java)

    I have built very powerful software with small teams using these principles.

    At scale, your day to day programming becomes easy as the compiler is doing all the error checking.

    It is also very slow to program this way. You write A LOT of boilerplate, and there are only so many code-generation libraries you can throw at it before it becomes a pain to set up your IDE.

    But it is worthwhile for complex applications. We did not have bugs.
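In languages without cheap newtypes, TypeScript's "branded type" trick is one way to get a similar effect with less boilerplate than subclassing. This is a sketch; the `Brand` helper and the digit-only `UserId` rule are my own illustrative choices:

```typescript
// A phantom "brand" makes structurally identical strings incompatible,
// so a UserId cannot be passed where an Email is expected (or vice versa).
type Brand<T, B extends string> = T & { readonly __brand: B };

type UserId = Brand<string, "UserId">;
type Email = Brand<string, "Email">;

// The only way to obtain a UserId is through this checked constructor.
function toUserId(raw: string): UserId | null {
  return /^[0-9]+$/.test(raw) ? (raw as UserId) : null;
}

function lookupUser(id: UserId): string {
  return `user #${id}`;
}
```

Calling `lookupUser("42")` directly is a compile error; only the result of `toUserId` is accepted, and the brand costs nothing at runtime.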

  • > Using this programming style requires a rather powerful type system, if you want to go past the simple examples the author shows and encode more complex validity conditions.

    You need a strong type system if you want to encode complex constraints at compile time, but you can still encode a surprising number of conditions in weaker type systems. And if you can't enforce a constraint you care about at compile time, capturing that constraint in an ADT using sum types and runtime assertions can still provide a lot of value to anyone reading or maintaining your code.

    This sort of style requires a lot more diligence and verbosity than the same code in Haskell would, but I think dynamically typed programs can make a lot more use of their type system than they usually do.

    At the very least, humans can try to pick up the slack where a compiler is insufficient, and while designing your data model in Ruby as if it were in Haskell may not help the interpreter at all, it can give the programmer a lot more context about how the code is intended to work, and what assumptions it is making about its data.

  • I think it's a pragmatic style even with languages with much less powerful type systems. You might have to do more of the work yourself -- manually creating types, for example -- but the principle is good.

    I think it's advice you don't have to go all in on -- you get benefits even if you use it just a little.

Kotlin's smart cast is nice in this context.

One example is with nullable types. Whatever and Whatever? are two separate types:

    val foo: Whatever? = ... // might be null
    foo.someMethod()         // compilation error because foo is not of type Whatever

    if (foo != null) {
      foo.someMethod() // works because foo was smart cast to Whatever after the null check
    }

In the same way, doing a switch on a type smart casts the variable to that type. Smart casting of nullable types also gets rid of a lot of clumsy things like Optional, Maybe, etc. that you'd need in other languages, where you'd have to call a method to extract the value, assign it to a new variable, and so on.
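TypeScript's control-flow narrowing works much the same way; a small sketch:

```typescript
function shout(foo: string | null): string {
  // Calling foo.toUpperCase() here would be a compile error: foo might be null.
  if (foo !== null) {
    return foo.toUpperCase(); // narrowed to string inside the null check
  }
  return "";
}
```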

What is the equivalent in TypeScript?

    const head = (arr: Array<number>): number => arr[0];

This happily returns undefined if you pass an empty array (even with strict null checks).

  • TypeScript is sadly very unsound by design, so doing this kind of thing in TypeScript is always going to be more “best effort” and less “exhaustive proof.” That’s not necessarily wrong or bad per se, as after all, Haskell has escape hatches, too (and so do dependently-typed languages, for that matter). However, there are philosophical differences between the way the languages’ type systems are designed.

    When I’ve talked to people who develop and use TypeScript, the general impression I’ve gotten from them is that TS is actually as much about tooling as it is about correctness. TS enables things like autocomplete, jump to definition, and quick access to documentation, and it does that by leveraging the typing information to “see through” the dynamic dispatch that makes that sort of thing infeasible in dynamically-typed JavaScript. The TS type system does provide some correctness benefits, don’t get me wrong, but where ease of use and correctness are in conflict, usually ease of use is preferred.

    This is true even with all the “strictness” options enabled, like strict null checking. For some examples of TypeScript unsoundness, see [1] and [2]. Flow is actually generally a lot better than TS in this regard (though it does still have some major soundness holes), but it looks like TS has basically “won,” for better or for worse. But again: TS won because its tooling was better, not because its ability to actually typecheck JS programs was better, which I think reinforces what I said in the previous paragraph on what TS is really about.

    [1] https://twitter.com/lexi_lambda/status/1068704405124452352

    [2] https://twitter.com/lexi_lambda/status/1068705142109810688

  • I recently [0] updated my Control Flow Graph representation in a Typescript parser for AVM1 bytecode because I had the exact same issue as in the article. I previously represented my graph as an array of blocks. But a graph always has at least one block (the entry point). The old representation forced me to check for the empty case despite knowing that valid graphs always have at least one block.

    Here is the old representation:

        export interface Cfg {
          blocks: CfgBlock[];
        }
    

    And here is the new one:

        export interface Cfg {
          head: CfgBlock;
          tail: CfgBlock[];
        }
    

    Switching to this new representation then allowed me to simplify the code manipulating this graph. This representation is also simple enough that most schema validators can represent it.

    You should still note that with this representation you no longer have a single array. It's not an issue in my case, but your situation may vary. You may try the tuple-based solution mentioned in a sibling comment if you need a single array.

    [0] https://github.com/open-flash/avm1-types/commit/ec85efab8c9e...
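If you do occasionally need the blocks as one array, a pair of conversion helpers keeps the non-empty representation as the source of truth. A sketch, with `CfgBlock` stubbed as a string for brevity:

```typescript
type CfgBlock = string; // stand-in for the real block type

interface Cfg {
  head: CfgBlock;
  tail: CfgBlock[];
}

// Total: a Cfg always yields at least one block.
function blocks(cfg: Cfg): CfgBlock[] {
  return [cfg.head, ...cfg.tail];
}

// Partial: an empty array cannot be a Cfg, so this must be able to fail.
function fromBlocks(all: CfgBlock[]): Cfg | null {
  return all.length > 0 ? { head: all[0], tail: all.slice(1) } : null;
}
```

The emptiness check happens once, in `fromBlocks`; everything downstream works with a Cfg and never re-checks.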

When we started writing the code for Heavenly-x (https://heavenlyx.com), the first thing we needed to write before anything else was the definition of a data structure that represents the requirements as closely as possible (a concrete syntax tree).

We’re building a tool with a custom DSL for building CRUD GraphQL APIs, with auth and data validation and transformation. Our architecture consists of three phases:

- Parser
- Setup
- Runtime

There’s no way the parser would succeed if the input is not valid. We’re writing our software in Scala and we are using parboiled2 for parsing the DSL into a concrete syntax tree, so if it succeeds then it’s valid and if it fails, it fails early and we don’t have to worry about validation in later phases. We wrote some custom validation logic that traverses the concrete syntax tree after the parser to validate for some requirements that we couldn’t encode in the concrete syntax tree, but it’s really a small portion of our codebase and it’s easy to maintain.

At the Setup and the Runtime phase we assume that the concrete syntax tree is valid.

At the Runtime phase we have a running GraphQL server so we have a parsing phase too but it’s handled by Sangria so we don’t have to worry about it.

We are also building a UI for those who don’t like using our language. It’s a React app where the state data structure looks exactly like our concrete syntax tree.

We’re launching soon. You can subscribe for early access here: https://heavenlyx.com/get-started

Haxe has a nice mechanism to handle this called "abstract types". https://haxe.org/manual/types-abstract.html

The critical validation step happens when the abstract type is created, not when it is used or passed to other functions...similar to the example in TFA.

The added benefit is that you get a type that behaves pretty closely to a class, but adds no typing or runtime memory overhead. E.g. here's an example for an "unsafe" string : https://gist.github.com/jdonaldson/9a09287e540e7392857f

Another benefit is that abstract types can define relationships between multiple types in this way, making it possible to perform validation for functions that need to handle "Either<T>"-style types.

It doesn't make sense to use JSON as the interchange format for a statically typed language. There is an impedance mismatch between the two: you are forced to infer types which are not actually there.

The reason why JSON is so popular is the same reason why dynamic typed languages became popular. The interfaces between components are simple and human-readable. A library written in a dynamically typed language can be effectively integrated into any system without the possibility of type mismatches.

If you have an API written with Protocol Buffers, every system which interacts with yours needs to agree with your type system; this creates tighter coupling.

  • Doesn't this hold true for all serialization ever? Replace JSON with "bytes" (aka any serialized data) and it still holds:

    It doesn't make sense to use bytes as the interchange format for a statically typed language. There is an impedance mismatch between the two. You are forced to infer types which are not actually there.

  • > It doesn't make sense to use JSON as the interchange format for a statically typed language.

    Two problems here. One, there are multiple statically typed languages with different type systems and it is nontrivial to translate between them. It's not "having a language" versus "not having a language"; it's Type system A vs Type system B vs Type system C vs none.

    Secondly, this is only true if you assume that no dynamically typed language will ever want to interoperate with your code - which is almost always incorrect to presume.

> head :: [a] -> a

> This function returns the first element from a list. Is it possible to implement?

But the type itself doesn't say that it must return the first element.

> If you see why parseNonEmpty is preferable, you understand what I mean by the mantra “parse, don’t validate.”

Okay, what about more complex cases? E.g. how do you describe type "such value doesn't exist in the db table yet"?

  • > But the type itself doesn't say that it must return the first element.

    Sure, I didn’t say it does. That type isn’t a full specification of head. This blog post isn’t about proving literally everything in your program correct, which is impractical anyway because if you did that it would be as easy to prove the wrong thing as to write a bug in the first place. Some functional tests are still needed.

    > Okay, what about more complex cases? E.g. how do you describe type "such value doesn't exist in the db table yet"?

    I’ve addressed that kind of question more generally in another comment[1], but for that specific case, I think it probably isn’t worth proving, because a database lives outside of your application, and it’s the database’s responsibility to manage that kind of data integrity, not your application logic. That’s one of the kinds of “execution-phase failure modes” that I describe in this paragraph of the blog post:

    > Parsing avoids this problem by stratifying the program into two phases—parsing and execution—where failure due to invalid input can only happen in the first phase. The set of remaining failure modes during execution is minimal by comparison, and they can be handled with the tender care they require.

    [1]: https://news.ycombinator.com/item?id=21478427

    • > Sure, I didn’t say it does.

      Sorry, I implied it from your "foo" example.

      > two phases

      So maybe we need to change the mantra to “parse, and validate as little as possible”?

  • > Okay, what about more complex cases? E.g. how do you describe type "such value doesn't exist in the db table yet"?

    If that's supposed to be an error state then you just throw a different error.

    If it's not supposed to be an error state, then you can have something like a function that accepts a type of "Input" and returns a type of "DBTrackedInput" and in that function you do your saving to the DB. Then things which only want to work with values that are in the DB accept types of DBTrackedInput. That prevents you from passing something which hasn't been saved in the DB to those functions.

    And you can expand from there, maybe the function that saves to the DB only accepts input of type "ValidatedInput". So you first have a function that accepts an "Input", validates it, and emits a "ValidatedInput". Then you have other functions, like your DB saver, which can say "I only want validated data" and still other functions which can say "I only want data that's been saved to the DB".
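That pipeline can be sketched with tagged wrapper types. All the names here are made up for illustration, and the "validation" and "save" steps are trivial stand-ins:

```typescript
interface Input          { kind: "input"; value: string }
interface ValidatedInput { kind: "validated"; value: string }
interface DBTrackedInput { kind: "tracked"; value: string; id: number }

// The only producer of ValidatedInput: validation happens exactly here.
function validate(input: Input): ValidatedInput | null {
  return input.value.length > 0
    ? { kind: "validated", value: input.value }
    : null;
}

// Stand-in for the real DB write; assigns a fake id.
// Note it only accepts ValidatedInput, so unvalidated data can't reach it.
function save(input: ValidatedInput): DBTrackedInput {
  return { kind: "tracked", value: input.value, id: 1 };
}

// Only accepts data that has provably been saved.
function audit(input: DBTrackedInput): string {
  return `row ${input.id}: ${input.value}`;
}
```

A call like `audit({ kind: "input", value: "x" })` is a compile error: the types force the data through each stage in order.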

    • What if there is another application (or e.g. a stored procedure) that uses this DB table so you cannot control all saving operations from your application?

      1 reply →

  • > how do you describe type "such value doesn't exist in the db table yet"?

    Depends what you mean by that, but depending on the context, you may not want this in a type.

    If it's a "Dirty" value / not inserted yet, you can just wrap it with "newtype Dirty ... = SomeRecordType ...".

    If you mean something that you just checked that isn't in the database, you probably just want a properly named true/false, or a Maybe result. Encoding the value itself as not existing in the database will just bite you when something else inserts one in the meantime.

  • > But the type itself doesn't say that it must return the first element.

    The "-> a" means a guarantee that you will always get an "a" (by definition). This is why the fix is to change the return type to "Maybe a", which includes "Nothing". Why it isn't like that by default in Haskell is a different issue, but the return type really is guaranteeing that "a".

I have done, and still do, a lot of "shotgun validation" ;-(

But after having read the article, it's still not clear how to log illegal values or entities if they can't even be represented in the data model. How do you talk about things you can't name?

  • You could return an error type wrapping the original value (and any other useful context) using Either instead of Maybe.

  • You can log them by serializing the pre-parsed data model, unless I've misunderstood the question.

I agree broadly with this; however, in many of the cases I have been dealing with, I don't think a type system could be rich enough to encode the validation laws.

For example how would you encode "This purchase request must be made before the closing date for the purchase"?

  • This isn't about encoding the validation rules in the type system, only the requirement that validation with respect to some properties has been done. So in your example you would just define a function

        createPurchaseRequest :: UserRequest -> Purchase -> Maybe PurchaseRequest
    

    now everywhere in the code taking a PurchaseRequest can assume it is valid.
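A TypeScript rendering of that signature, with invented field names and only the closing-date rule modeled:

```typescript
interface UserRequest { itemId: string; requestedAt: Date }
interface Purchase { itemId: string; closingDate: Date }

// Only producible via createPurchaseRequest, so holders of this type
// know the closing-date rule has already been checked.
interface PurchaseRequest { itemId: string }

function createPurchaseRequest(
  req: UserRequest,
  purchase: Purchase,
): PurchaseRequest | null {
  // The time-based rule is checked exactly once, at the boundary.
  return req.requestedAt.getTime() <= purchase.closingDate.getTime()
    ? { itemId: purchase.itemId }
    : null;
}
```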

  • This sounds like the Nirvana fallacy. Just because parsing over validation is not perfect doesn't mean it can't lead to significantly more maintainable code and a safer, more stable system.

    Pure validation has a lot of problems, namely that you have to validate the data at every level for the invariants required at that level.

I like the concept of encoding properties, like a boolean of non-empty, in an object.

Beyond that, this article further convinced me type systems are for the pedantic. A given function signature is impossible? Seems like just another strength of a dynamic language.

  • What does your dynamic language do when you take the first element of an empty list? There is no obvious "correct" thing to do. Furthermore, whatever you do return is unlikely to be the same sort of thing that is returned for a non-empty list.

    A dynamic language will have a behavior that corresponds to some sort of type signature, and it's not possible to write behavior that corresponds with the type signatures given as examples of "impossible" in the article.

    A type signature is merely a statement about behavior, so it's nonsensical to make a false statement about the behavior, and Haskell catches this.

  • Why would you want a language with extra support for bugs?

    That's like saying Python is for the pedantic because it doesn't have a --crash_randomly flag, or won't let me write 0/0 and rely on the runtime to pick an arbitrary value and keep going, even though I might want that.