Comment by Izkata

6 years ago

This is describing what I've known as "Typed Hungarian notation", and have seen a few times before, though I can't seem to find it now.

The original intent of Hungarian notation was to encode metadata into the variable name - for example, "validData = validate(rawData); showValidData(validData)". The idea was that the prefixes would make it just look wrong to the programmer if someone accidentally used mismatched names, such as "showValidData(rawData)", indicating a likely bug.

This notation mutated into the far more simplistic and arguably less useful "prefix with the datatype", such as iFoo indicating Foo is an integer. This mutation became known as Systems Hungarian, while the original became Apps Hungarian.

The suggestion in this post is taking Apps Hungarian, and instead of relying on variable names, encoding it into the type system itself.

The first time I recall seeing something like this suggested was actually in Java, with examples in that blog post involving class attributes:

  public String Name;
  public String Address;

And creating classes to represent these values, preventing them from being used the wrong way:

  public NameType Name;
  public AddressType Address;

...which is why I remember this as "Typed Hungarian" - making the language's type system handle the work that a person would normally have to think about in Apps Hungarian.
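A minimal Java sketch of that idea, using hypothetical RawData/ValidData types that mirror the validate example above (none of these names come from the linked post):

```java
// Hypothetical RawData/ValidData types mirroring the naming example above;
// the compiler now enforces what the Apps Hungarian prefixes only suggested.
final class RawData {
    final String value;
    RawData(String value) { this.value = value; }
}

final class ValidData {
    final String value;
    private ValidData(String value) { this.value = value; }

    // The only way to obtain a ValidData is to pass validation.
    static ValidData validate(RawData raw) {
        if (raw.value == null || raw.value.isEmpty()) {
            throw new IllegalArgumentException("invalid data");
        }
        return new ValidData(raw.value);
    }
}

class Demo {
    // Accepts only validated data; the equivalent of showValidData(rawData)
    // is now a compile error rather than a naming smell.
    static String showValidData(ValidData data) {
        return "valid: " + data.value;
    }

    public static void main(String[] args) {
        System.out.println(showValidData(ValidData.validate(new RawData("hello"))));
    }
}
```

Calling `showValidData(new RawData("hello"))` fails to compile, which is exactly the "looks wrong" signal the prefixes were meant to give, but checked by the machine.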

The Java example you've provided is known as Domain Modeling, it's well described (and honestly greatly expanded on) in Domain Driven Design by Eric Evans.

It's applied heavily in the enterprise world simply because it's so powerful. In general as the domain logic becomes more complex the benefits of doing this increase.

Actually, not encountering this style in code-base which is solving a complex problem is a massive warning sign for me. It usually means that concepts are poorly defined and that the logic is scattered randomly all over the code-base.

Apps Hungarian is just a logical style: name functions, variables, types so that they're meaningful within your domain. The result of which is code which is very easy to understand for someone who understands the domain. This doesn't mean long names - if you're doing anything math intensive then using the short names which conform to the norms of the field is perfect. For a business process it probably isn't :)

  • >Actually, not encountering this style in code-base which is solving a complex problem is a massive warning sign for me.

    There is a fine line between capturing some common meaning in a reusable type (e.g. Optional<T>) and spending most of your time crafting a straight-jacket of type restrictions that are supposed to thwart some imaginary low-level bug in the future.

    Java is actually a great example of how this mentality backfires in real life. The language, from day 1, had a zillion ways to force someone else to do something they didn't want to. Abstract classes. Private methods. Final classes. Etc. It was supposed to ensure good design, but in practice just led to the endless reenactment of the banana-gorilla-jungle problem. (https://pastebin.com/uvr99kBE)

    At the same time, the language designers didn't add such basic features as lambdas and streams until 1.8. This led to the creation of an ungodly amount of easily preventable bugs, since everyone was writing those awful imperative loops.

    • Banana-gorilla-jungle problem results from OOP. This article (and presumably, the GP) is describing encoding domain-driven logic in a statically-typed functional language, and thus the OOP issues are not a problem.

      Functional languages don't carry around their environment with them, so you just get a banana. No gorilla or jungle.

      3 replies →

    • Java has just had a terrible steward in Oracle. Plus the community seems to like boilerplate code (see: Angular for more of that if you're into javascript).

      I switched to C# as my main language specifically because they were gaining awesome features like generics, lambda, LINQ. Since then it just got better and better.

      In the C# world you seem to encounter systems where the logic is either in:

      * UI event handlers
      * Methods on helper classes which consume a model generated from a database.
      * Domain modeled with classes, with the logic split between the classes themselves and services which consume the classes.

      In the first two cases the code tends to have a massive sprawling dependency tree which makes it harder to maintain. In the latter case you've usually got a better chance of having properly isolated code. Even if the isolation isn't super you at least have the logic well organised.

    • It's extremely useful, especially at the interface level but also in day-to-day coding. Pity the people whose method signature is (string, string, string), because no IDE can save them from the churn.

  • It's particularly powerful if you can create an algebra of your domain. You can define invariants which are then enforced by the type system.

    I use this a lot when writing C# code and the whole experience is getting better with every new release. Microsoft keep adding awesome stuff like ADTs which make working with these types a dream.
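One way that shape can look in code (sketched in Java rather than C#; Money, Currency, USD, and EUR are illustrative names, not a real library) is to push the invariant into a type parameter:

```java
// Illustrative sketch: currencies as phantom type parameters on a Money type,
// so adding mismatched currencies fails to compile instead of at runtime.
interface Currency {}
final class USD implements Currency {}
final class EUR implements Currency {}

final class Money<C extends Currency> {
    final long cents;

    Money(long cents) {
        // A second invariant, checked once at construction time.
        if (cents < 0) throw new IllegalArgumentException("negative amount");
        this.cents = cents;
    }

    // The algebra: addition is only defined for Money of the same currency.
    Money<C> plus(Money<C> other) {
        return new Money<>(cents + other.cents);
    }
}
```

With this, `new Money<USD>(100).plus(new Money<USD>(50))` works, while `new Money<USD>(100).plus(new Money<EUR>(50))` is rejected by the compiler.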

  • > Actually, not encountering this style in code-base which is solving a complex problem is a massive warning sign for me.

    This is very narrow-minded. DDD as a broad concept is good and something I practice.

    DDD in enterprise is a disastrous nightmare. When people on your team are wasting time Googling lingo to try to figure out what kind of object they need, where to put a file, or where a method should go, it's a huge red flag.

    I was on a team that seemingly spent all its time trying to figure out what the hell DDD was prescribing, let alone trying to figure out how to do it.

    • > When people on your team are wasting time Googling lingo to try to figure out what kind of object they need, where to put a file, or where a method should go, it's a huge red flag.

      I agree, but for a different reason. If you've got a team of programmers on a project without any understanding of what the client is doing, you're bound to implement stuff ass-backwards from the client's perspective.

      DDD is fine in enterprise, but the domain does need to be communicated to all involved parties. You can't just dump a codebase on someone and expect them to get to work, that only barely works for non-DDD codebases let alone DDD codebases.

    • The kind of DDD this article is describing is totally different from Java-world enterprise DDD. It's DDD done right, with no 50-member OOP classes. Just a single function that can only take a type that describes your domain in a way that prevents whole classes of errors.

      1 reply →

    • > [...] trying to figure out what the hell DDD was prescribing, let alone trying to figure out how to prescribe it.

      Well... If the carpenter needs to Google how to use a hammer and nails while at the construction site, something went terribly wrong, didn't it?

      8 replies →

  • This can go two ways. One can end up with lots of different types that do nothing but wrap a string. On the other hand I have also seen a float being used to represent a price. The correct way is somewhere in between those two extremes. If you have some logic specific to a value, by all means, make it a type but if there isn't it just ends up with lots of boilerplate.

Another good example of this is having separate classes for something like unsafe strings vs. safe strings in a web app. The functions which interact with the outside world accept unsafe strings and emit safe strings to the rest of the application. Then the rest of the application only works with safe strings.

Anything that accepts a safe string can make an assumption that it doesn't need to do any validation (or "parsing" in the context of the OP), which lets you centralize validation logic. And since you can't turn an unsafe string into a safe string without sending it through the validator, it prevents unsafe strings from leaking into the rest of the app by accident.

This concept can be used for pretty much anything where you are doing data validation or transformation.
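A minimal sketch of that boundary, with hypothetical UnsafeString/SafeString types and a stand-in validation rule:

```java
// Hypothetical UnsafeString/SafeString types: input from the outside world
// arrives as UnsafeString, and the only route to a SafeString is the
// centralized validator.
final class UnsafeString {
    final String raw;
    UnsafeString(String raw) { this.raw = raw; }
}

final class SafeString {
    final String value;
    private SafeString(String value) { this.value = value; }

    // Centralized validation. The rule here (reject null, strip control
    // characters) is a stand-in; real rules depend on the application.
    static SafeString validate(UnsafeString input) {
        if (input.raw == null) {
            throw new IllegalArgumentException("no input");
        }
        return new SafeString(input.raw.replaceAll("\\p{Cntrl}", ""));
    }
}
```

Everything past the boundary then takes SafeString parameters, so an unvalidated string can't leak in without a type error.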

  • Also a good way to prevent hashed passwords from being accidentally logged.

        class PasswordType(django.db.models.Field):
            hashed_pw = CharField()

            def __str__(self):
                # you can even raise an Exception here
                return '<confidential data>'
    

    Not that you should be trying to log this stuff anyway; unless you're a solo dev you can't prevent other people from creating bugs, but you can mitigate common scenarios.

  • What are safe and unsafe strings supposed to mean? All strings seem like normal strings to me, a "DELETE * FROM db" is no different from any other string until it's given to a SQL query.

    • Escaping modes. All strings are not equivalent: "Bobby tables" is very different from "'; drop table users; --".

      The idea is to encode the contexts where a string is safe to use directly into the type of the variable, and to ensure that functions that manipulate strings or send them to outside systems can only receive variables of the proper type. When you receive data from the outside world, it's always unsafe: you don't even know if you've gotten a valid UTF-8 sequence. So all external functions return an UnsafeString, which you can .decode() into a SafeString (or even call it a String for brevity, since most manipulations will be on a safe string).

      Then when you send to a different system, all strings need to be explicitly encoded: you'd pass the DB a SqlString that's been escaped to prevent SQL injection, pass a JSONString for any raw JSON fragments that's had quotes etc. escaped, pass an HtmlString to a web template that properly escapes HTML entities, and so on. It's legal to do "SELECT $fields FROM tablename WHERE $whereClause" if $fields and $whereClause are SqlStrings, but illegal if they are any other type of string. And if you do <a href="$url"> where $url is an UnsafeString, the templating engine will barf at you.

      There are various ways to cut down the syntactic overhead of this system by using sensible defaults for functions. One common one is to receive all I/O as byte[], assume all strings are safe UTF-8 encoded text, and then perform escaping at the library boundaries, using functionality like prepared statements in SQL or autoescaping in HTML templating languages. Most libraries provide an escape-hatch for special cases like directly building an SQL query out of text, using the typed-string mechanism above.
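A minimal sketch of one such context-tagged type (HtmlString and Template are illustrative names, not from any real library):

```java
// Illustrative sketch of a context-tagged string: the only way to make an
// HtmlString is the escaping function, so a template sink that accepts
// HtmlString can never receive raw untrusted text.
final class HtmlString {
    final String value;
    private HtmlString(String value) { this.value = value; }

    // Escape the HTML-significant characters ('&' first, to avoid
    // double-escaping the entities we introduce).
    static HtmlString escape(String untrusted) {
        return new HtmlString(untrusted
            .replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\"", "&quot;"));
    }
}

class Template {
    // A sink that only accepts already-escaped HTML.
    static String anchor(HtmlString url) {
        return "<a href=\"" + url.value + "\">link</a>";
    }
}
```

Passing a plain String (or a hypothetical SqlString) to `Template.anchor` is a compile error, which is the "templating engine will barf" behavior, enforced statically.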

    • A safe string is something you got from the programmer (or other trusted source), and an unsafe string is something you got from the network/environment/etc.

    • Are you genuinely curious, or are you being a troll?

      Look at the content of your string, make a decision as to whether you would give it to a SQL engine. If you have not looked, it's presumed unsafe. If you have validated it - parsed it, in the context of this article and this discussion - and decided that you consider it safe, then it is a safe string from that point on.

      This isn't a philosophical debate about what "safe" means to humans, it's a programming discussion that says if you only want to pass "select * from reports" to your database, check that's what the string contains before you pass it anywhere.

      11 replies →

Rust is extremely good at this with newtype syntax and the ability to apply impls and derives to wrapper types.

Your JSON data might be a string, but having a Json type that guarantees it's valid JSON is way better. Same with stringified Base64, a constrained numeric value for a tunable, etc. Using From impls on these types lets the compiler figure out almost all invalid type usages at compile time and give you eloquent feedback on them.

Your comment gets the history hilariously backwards.

Actually using the type system has been standard practice in ML-family languages since at least the '70s. Simonyi described Hungarian notation as a way of, effectively, emulating a type system in less-powerful languages; describing Hungarian notation as "prefixing with the type" was not a mistake (as Spolsky claims) but an accurate description of how the technique was intended to be used.

The "Systems Hungarian" mistake came about because C programmers misunderstood the term "type" to refer to something far more limited than it does.

The technique in the article is not "encoding Apps Hungarian into the type system". It's doing the very thing that inspired ("Apps") Hungarian in the first place!

  • ...Please re-read the part of my comment that starts with "The original intent"

    • Thank you for your condescension. Please re-read Simonyi's paper. What you called "metadata" is in fact type. Please re-read my comment.

The original paper by Simonyi actually does describe prefixing with datatype. E.g. b for byte, ch for character, w for word, sz for pointer to zero-terminated string.

He specifically states:

The basic idea is to name all quantities by their types (...) the concept of "type" in this context is determined by the set of operations that can be applied to a quantity. (..)

Note that the above definition of type (which, incidentally, is suggested by languages such as SIMULA and Smalltalk) is a superset of the more common definition, which takes only the quantity's representation into account

So he is comparing to early object oriented languages where presumably Hungarian notation is not needed anymore because the type system of the language itself is rich enough to constrain the set of valid operations on the value/object.

So the "metadata" encoded in the Hungarian prefix is exactly what would be called type in a modern language. The idea of Hungarian was valid in a language with an insufficiently expressive type system.

I think the term "Typed Hungarian Notation" is really weird. Hungarian Notation is a workaround for a limited type system. "Typed Hungarian" is just... typed code.

  • > I think the term "Typed Hungarian Notation" is really weird.

    I'm glad I'm not the only one. "Typed Hungarian notation" reminded me of "horseless carriage". :)

This helps the compiler, but only works for languages with type declarations, and is less helpful to the programmer: to find a variable's kind (e.g. valid data vs. raw data) you need to look at its declaration, which may involve scrolling. The original idea, described in Joel Spolsky's original article on Apps vs. Systems Hungarian (https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...) is that the programmer can see, in the expression where the variable is used or the function is called, that the code is wrong.

I apply a variant of Apps Hungarian in my own code, e.g.

    (setf input
          (inputOfEdge 
           (first (edgesOfSocket
                   (outputOfGate
                    (nth (position visibleInput (inputsOfVertex node))
                         (ingatesOfEnclosure enclosure)))))))

(In the system this is lifted from, node was confirmed to be a Vertex a few lines earlier, and Output is a subclass of Socket.)

The general idea is language independent: you could write similar code in Java or C.

I generally like this approach, but it's unfortunately really cumbersome in most mainstream languages like Java (and even Scala 2.x) or C#, but it works beautifully in Haskell because of two features:

- non-exported newtypes (also known as opaque types in Scala 3.x); this means your little module is in full control of instantiation of a 'parsed' value from an 'untyped' value.

- DerivingVia (not sure if there's an equivalent in Scala 3.x); this makes declaring type class instances equivalent to the 'wrapped' value completely trivial.

These two features (and, I guess, the pervasiveness of type classes) in Haskell make this sort of thing as close to zero effort as possible.

EDIT: Looking at some sibling posts, I think my C# knowledge may be out of date. Apologies for that.

Here is a blog post applying a similar idea to matrix math:

https://www.sebastiansylvan.com/post/matrix_naming_conventio...

  • I wish there were a type system for numerical computation that could encode things like the rank and size of tensors and statically check them before running. Currently, catching bugs requires running the code, which makes for a slow development loop; plus static checks would make for a much more robust system than documenting these things in comments.

What you’re describing is in a similar vein to what is described in my blog post, and it’s often absolutely a good idea, but it isn’t quite the same as what the post is about. Something like NameType attaches a semantic label to the data, but it doesn’t actually make any illegal states unrepresentable… since there really aren’t any illegal states when it comes to names. (See, for example, https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-... .)

In other situations, the approach of using an abstract datatype like this can actually rule out some invalid states, and I allude to that in the penultimate section of the blog post where I talk about using abstract types to “fake” parsers from validation functions. However, even that is still different from the technique focused on in most of the post. To illustrate why, consider an encoding of the NonEmpty type from the blog post in Java using an abstract type:

  public class NonEmpty<T> {
    public final ImmutableList<T> list;

    private NonEmpty(ImmutableList<T> list) {
      this.list = list;
    }

    public static <T> Optional<NonEmpty<T>> fromList(List<T> list) {
      return list.isEmpty()
        ? Optional.empty()
        : Optional.of(new NonEmpty<>(ImmutableList.copyOf(list)));
    }

    public T head() {
      return list.get(0);
    }
  }

In this example, since the constructor is private, a NonEmpty<T> can only be created via the static fromList method. This certainly reduces the surface area for failure, but it doesn’t technically make illegal states unrepresentable, since a mistake in the implementation of the NonEmpty class itself could theoretically lead to its list field containing an empty list.

In contrast, the NonEmpty type described in the blog post is “correct by construction”—it genuinely makes the illegal state impossible. A translation of that type into Java syntax would look like this:

  public class NonEmpty<T> {
    public final T head;
    public final ImmutableList<T> tail;

    public NonEmpty(T head, ImmutableList<T> tail) {
      this.head = head;
      this.tail = tail;
    }

    public static <T> Optional<NonEmpty<T>> fromList(List<T> list) {
      return list.isEmpty()
        ? Optional.empty()
        : Optional.of(new NonEmpty<>(list.get(0), ImmutableList.copyOf(list.subList(1, list.size()))));
    }
  }

This is a little less compelling than the Haskell version simply because of Java’s pervasive nullability and the fact that List is not an inductive type in Java so you don’t get the exhaustiveness checking, but the basic ideas are still the same. Because NonEmpty<T> is correct by construction, it doesn’t need to be an abstract type—its constructor is public—in order to enforce correctness.

  • Maybe I shouldn't have included the Java example, some others have jumped on it as well. That wasn't meant to be a summarization, but a similar idea in a different context.

    You also seem to have missed the point of the Java example anyway, in a misses-the-forest-for-the-trees way. It was meant as a 1:1 example of an important result of your blog post, not an example of constructing the parse function:

    > However, this check is fragile: it’s extremely easy to forget. Because its return value is unused, it can always be omitted, and the code that needs it would still typecheck. A better solution is to choose a data structure that disallows duplicate keys by construction, such as a Map. Adjust your function’s type signature to accept a Map instead of a list of tuples, and implement it as you normally would.

    Using NameType instead of String is the same as using a Map instead of a list of tuples.

    That's why I included AddressType in the Java example. Just like changing the function signature to not accept a list of tuples and require a Map instead, forcing you to use the parse function to construct the Map, functions that only work on AddressType or only work on NameType can't receive the other one as an argument - where with a String, they could. They have to pass through the requisite parse function to convert String into a NameType or AddressType first, however those are implemented.

    And I've seen the falsehoods lists before; "Name" and "Address" were simply the first thing that popped into mind while typing that up. Examples are just examples, and Name and Address are conceptually different enough regardless of the falsehoods that the main idea behind the example ought to get by anyway.

    • > You also seem to have missed the point of the Java example anyway, in a misses-the-forest-for-the-trees way.

      Perhaps I did, yes. I do think the kind of thing you’re describing is valuable, to be clear. A lot of my comment was intended more as clarification for other people reading these comments than as an argument against what you were saying. I imagine that if I misunderstood your point, other people are likely to, too, so if anything, your clarification is generally a good outcome, I think!

I believe Joel Spolsky discussed this in an article, having a wart on the name to represent if strings have raw inputs which are at risk of injection attacks.

I currently use this in my language "Grammar" (https://jtree.treenotation.org/designer/#standard%20grammar). In Grammar you can currently define two things: cellTypes and nodeTypes. Grammar looks at the suffix to determine the type. So if you are building a language with the concept of a person node and an age cell, you'd write "personNode" and "ageCell". "ageCell" might extend "intCell". I've found it a pleasure to use.

> making the language's type system handle the work that a person would normally have to think about in Apps Hungarian.

That sounds a lot like "type validation" to me, whereas the title of the article includes "Don't Validate"

  • Read all the way down through the "Parsing, not validating, in practice" header. It eventually gets to examples exactly like mine, where "parse" means "transform into a different datatype so it can't be used wrong", while "validate" means "generate a message, but don't preserve that message in the type system in a way that's usable later".

  • > That sounds a lot like "type validation" to me, whereas the title of the article includes "Don't Validate"

    The variable name version does, sure. That's why OP said:

    > The suggestion in this post is taking Apps Hungarian, and instead of relying on variable names, encoding it into the type system itself.

    Making it clear that their example is different from what the article suggests.

My earliest programming experience taught me to use Systems Hungarian with scope. So, something like lcName for local character name.

  • On one of my first jobs all the class names were prefixed with C, but it wasn't carried through to variables or anything. I said we should stop prefixing class names with C and got shouted down. That's where I learned to mostly just do what I want without asking. Once you have 5k lines of something your way it is really too late for anyone to ask you to change it :)

Your comment is wrong. You are talking about "encapsulation tricks", where you don't make illegal states unrepresentable, but merely gate all the construction so you can audit it and then declare that the illegal states won't be constructed.

Sometimes this is good, but the benefits are a lot lower, and many an overzealous programmer has taken the "boolean blindness" / "string blindness" argument too far with copious newtypes.

Actually ruling out invalid states by construction, however, is way more valuable, and far more likely to be worth the work. It requires more thought than just newtyping away, but that's a feature, not a bug.

  • Similar to some other replies, you've jumped to the Java example and made assumptions about it without reading the parts immediately before it.