How well can LLMs write COBOL?

2 years ago (bloop.ai)

This looks interesting. I'm working on an OpenAI-based tool for coding tasks that are too complex for ChatGPT - https://github.com/plandex-ai/plandex

It's working quite well for me, but it definitely needs some time spent on benchmarking and ironing out edge cases.

I'm especially curious how it will do on more "obscure" languages. Not that COBOL is obscure exactly--I suppose there's probably quite a bit of it in GPT-4's training data, considering how pervasive it is in some domains. In any case, I'll try out this benchmark and see how it goes.

  • > Not that COBOL is obscure exactly--I suppose there's probably quite a bit of it in GPT-4's training data, considering how pervasive it is in some domains

    There is a huge amount of COBOL code in existence, but almost all of it is non-public code used to run businesses and governments. Very little of it is publicly source-available (whether open source or something more restrictive than that).

    Unless GPT-4's training data includes non-public code bases (I doubt it), it likely contains rather little COBOL code.

  • > I'm especially curious how it will do on more "obscure" languages.

    There's definitely a lack of training data and ability (but unfortunately not confidence) in less widespread languages. It's quite bad at Pine Script: it confuses versions of the language, produces unrunnable code, and can't correct it when given feedback.

One thing that makes OpenAI so valuable over time is that they can take all of this expert input from the rush to test out the new technology and use it to improve the next model. Just like we see here, experts who know all of these edge cases, or other fundamental aspects of what makes their own companies or projects valuable, tend to hand that knowledge over in the process.

It's kind of interesting how far this could be taken with all kinds of valuable information from people and companies, especially code and business logic. Everyone wants to say AI this and AI that to keep up with the times, and they all keep dumping this valuable data in for free.

Then consider the authentication method tied to this data collection: if you're using a company email, etc., it helps them weed out the garbage too.

I guess we can't fight progress...

  • I think you’re forgetting about the use case where the LLM returns something partially correct to a discerning expert, who is still able to use the response, but does not bother with a message like “btw I had to do X to make your suggestions usable”.

  • That might be true, but my feeling so far is that OpenAI doesn't want to do what they'd need to do to make any one product actually good, so they're going to keep bouncing from AI thing to thing, making foundation models that have a lot of wow factor but can't really deliver on their promise because they're too closed off to integrate into a workflow that might alleviate the issues.

    Other people will try to build on OpenAI stuff and find that it's not quite good enough, and that OpenAI doesn't care to really make it good enough because it's a lot of work. It won't be until we get competitors that take less sexy model tech and take the time to make it REALLY GOOD at certain things that AI really makes good on its promise. I'm guessing that will be driven by people taking open source tools that are ~80% of the way there and really building a system and domain logic around them to make them excellent.

I tried to get ChatGPT to write LLVM IR last year. The results were interesting: the LLM wrote superficially correct-looking IR, but it ultimately failed to grasp the concept of SSA (static single assignment, under which each virtual register is assigned exactly once), as it kept trying to write to SSA registers it had already assigned. ChatGPT can generalize across language syntax reasonably well, but it doesn't understand deeper differences in language semantics.

  • I tried to get ChatGPT to write 6502 assembly and it had similar issues.

    • FWIW, I had similar issues trying to get both it and Claude to help me with x86 assembly: it kept thinking that if it added another * or some more parentheses, it could get some impossible assembly to work.

    • I got this from Gemini:

      > Gemini:

      > JSON: In JSON, double quotes are used to delimit strings. If a double quote needs to be included within the string itself, it's escaped by doubling it (e.g., `"This is a string with a "quote" in it."`).

      (And, just an editorial note: there are no backslashes in Gemini's response; HN is displaying it correctly. Gemini's response is nonsense: wrong, but also not even internally consistent.)

      > Me:

      > Does your JSON example parse as valid JSON?

      > Gemini:

      > Yes, the JSON example I provided ("This is a string with a "quote" in it.") is actually valid JSON.

      > The reason it's valid is because the doubled quote within the string is interpreted as an escaped character, representing a single literal quote within the string itself. The JSON parser recognizes the double quotes at the beginning and end as delimiters for the entire string, not the escaped quote within.

      (Its example is, of course, quite malformed.)
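
      For reference, valid JSON escapes an embedded double quote with a backslash rather than by doubling it, which is easy to check with, say, Python's json module:

          import json

          # Gemini's doubled-quote version is not valid JSON: the parser
          # reads the string as ending at the second quote and then
          # chokes on the leftover text.
          try:
              json.loads('"This is a string with a "quote" in it."')
          except json.JSONDecodeError as e:
              print("doubled quotes fail to parse:", e)

          # The backslash-escaped version parses fine:
          print(json.loads('"This is a string with a \\"quote\\" in it."'))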

I've asked ChatGPT a fair number of Fortran questions. There are differences - Fortran is still in use in lots of places, and there are forums and documentation sites (though presumably that's true for COBOL too). But compared to Python, there is way less info out there on how to do different things; for example, there's little Stack Overflow content.

I'd say I've had mixed results: ChatGPT definitely knows the language and can give examples, but I've also hit a lot of frustrating things it wasn't able to resolve.

  • Fortran also has the disadvantage of numerous separate dialects/flavours that often can't be mixed. And the exact dialect is rarely mentioned in random code found somewhere.

    It got a lot better with Fortran 95 and newer, but in the old world of e.g. Lahey compilers and custom commands only available there, every LLM I've tried has consistently failed to stick to these intricacies. I can't even blame them: when asking humans questions about these topics, you'll get all sorts of answers that are equally close to a correct solution, but almost never precisely correct.

On one hand, it'd be super cool to solve the talent shortage in the field; also, COBOL code isn't very pleasant to write.

On the other, I'm not sure I'd want tools known for poor code quality and hallucinations writing these super-critical systems.

I guess there might be a Copilot-esque productivity booster for human developers, but I think these systems are among the last places I'd want LLMs to contribute.

  • Why do people always add this "blah blah hallucinations and critical systems"? 1. People write bad and buggy code. 2. You act like we're just blindly throwing untested LLM code at production systems.

    It's just intellectually dishonest to talk this way.

    They will still be helpful but we obviously need to test before we add code into systems. It goes without saying.

    • > It's just intellectually dishonest to talk this way.

      > They will still be helpful but we obviously need to test before we add code into systems. It goes without saying.

      It's not intellectually dishonest at all. It's an issue of conditioning. There's a class of developers that blindly copy and paste code from StackOverflow or the first hit on Google. They're the same class that will uncritically copy and paste ChatGPT answers.

      ChatGPT is worse than SO because it's adaptive. If someone pastes in a SO answer and it doesn't immediately work the developer has to at least engage with the code. ChatGPT can be asked to refine its hallucination until it parses/compiles.

      The class of developer blindly copying and pasting answers will not have the expertise to spot hallucinations, or likely even to fix the inevitable bugs they introduce. Additionally, ChatGPT by its nature elides the sources of its answers. At the very least a SO answer has some provenance: not only the poster, but some social signalling through votes that the answer is legitimate.

      ChatGPT answers don't have any of that. It will also happily hallucinate references.

      Conditioning junior developers and learners to rely on and trust AI coding is setting them up to fail. It's also going to stunt their growth as developers because they'll never gain any domain knowledge. In the meantime they'll be unknowingly sabotaging products with legit-looking but broken code.

    • Look, we can't all just be realistic about a thing that's going to take our jobs, so we have to lean on tired old excuses instead. Instead of being reasonable, why don't you pick a team - for or against, and then fight about it on Internet forums because, well, what else are you going to do while the build compiles? Look at cat videos?

Since the LLM sometimes generates invalid COBOL, a simple practical solution would be to use an API, let it test its code with GnuCOBOL, feed the compiler output back, and have it try again a couple of times. I wonder what the updated benchmarks would be in that setting.
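
A rough sketch of that loop, assuming a hypothetical generate_cobol() wrapper around whatever LLM API is in use (cobc is GnuCOBOL's actual compiler driver; the other names here are invented for illustration):

    import subprocess
    import tempfile

    def compile_cobol(source: str) -> tuple[bool, str]:
        """Compile COBOL source with GnuCOBOL's cobc; return (ok, diagnostics)."""
        with tempfile.NamedTemporaryFile(suffix=".cob", mode="w", delete=False) as f:
            f.write(source)
            path = f.name
        result = subprocess.run(["cobc", "-x", path], capture_output=True, text=True)
        return result.returncode == 0, result.stderr

    def generate_with_retries(task: str, attempts: int = 5) -> str | None:
        prompt = task
        for _ in range(attempts):
            source = generate_cobol(prompt)  # hypothetical LLM call
            ok, errors = compile_cobol(source)
            if ok:
                return source
            # Feed the compiler diagnostics back and ask for a fix.
            prompt = (f"{task}\n\nYour last attempt failed to compile "
                      f"with these errors:\n{errors}\nPlease fix it.")
        return None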

  • The general approach seems to work anyway. I tried it out with ChatGPT 3.5 and an online COBOL compiler[0], manually feeding back the output, and on the 10th attempt it managed to produce a working program (one that displays the first 10 Fibonacci numbers).

    Edit: Well, maybe. With the example from the article it wasn't as successful.

    [0] https://onecompiler.com/cobol/

I tried to get ChatGPT to write 6502 assembler for the BBC Micro, the 1980s 8-bit home computer. It was game, but clueless.

  • It depends on what you test for.

    I am from the C64 demo scene, and in this regard ChatGPT is pretty useless. VIC tricks, raster timing - nothing an LLM can help with at the moment, judging from my experience with ChatGPT 4.0 so far.

    Same goes for Amiga and simple blitter access for scrolling.

    LLMs will be very limited here unless they receive sensory feedback repeatedly.

    Pure algorithms like sorting may be doable for GPT, but these machines are used very creatively to come up with effects.

    What gets reused here are techniques, not so much the code itself, which is modified and optimized for every demo and often recombined with other techniques.

    Most techniques for the VIC are pretty well documented, but the timing, as well as recombining them, is the heavy lifting.

  • I tried to get it to write 6502 assembler for the NES. As I recall, it mistakenly thought that there was a BIOS containing a print function.

  • That's a little surprising; 6502 assembly is a fairly popular hobbyist thing, so I would expect the data is out there. It's also mostly pretty simple, but you do have to watch out for quirks like rotating through the carry bit.

A while back, I asked a question here, roughly, why hasn’t someone written, say, a C to COBOL translator? Such a program might take a lot of work, but it seemed to me that with an impending dearth of COBOL programmers, there would be demand for such an app. I was informed that there were so many different COBOLs in use that the output of such a program would STILL have to be tended to by an experienced programmer in the output dialect desired. This is just the Copilot situation.

  • As much as everyone likes to poke fun at COBOL, the language itself really isn't the problem with maintaining/updating old COBOL systems. It's old, but it's not that bad.

    The real problem is the entire ecosystem around those systems. Remember, a lot of COBOL software dates back to a time before things like relational databases. You'll be working with flat files that might, if you're very lucky, have column and record separators and useful names/documentation explaining what they are. If you're unlucky you'll have to figure out field widths from the code and infer what the fields are based on their actual usage. Oh and if you get it wrong you just messed up something related to payroll or financial compliance; enjoy the punishing fines.

    That kind of stuff, more than the language, is the reason nobody wants to touch old COBOL systems.
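
    As a toy illustration, maintenance often means picking apart records like this by hand (the layout, field names, and widths below are all invented):

        # Hypothetical fixed-width record, with the layout inferred
        # from how the old COBOL code slices it up:
        #   cols 0-5   employee id
        #   cols 6-25  name, space-padded
        #   cols 26-33 pay amount in cents, zero-padded
        record = "001234DOE JOHN            00012500"

        emp_id = record[0:6]
        name = record[6:26].rstrip()
        pay_cents = int(record[26:34])
        print(emp_id, name, pay_cents / 100)  # 001234 DOE JOHN 125.0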

It would be interesting to feed it a formal language specification of some language it hasn't seen, then ask it to write code and see how it does.

That could be a test of reasoning and reading comprehension.

  • I've been thinking about a benchmark designed this way for a while. It doesn't even need to be code; it could be basic reasoning problems. The key is that you define a new, random language that has never before been seen (maybe it has statistical similarity to existing languages, maybe not), create a translation key, then ask a question in that language (a rough sketch follows below).
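
    A minimal sketch of the setup, with a word-level substitution cipher standing in for the "new language" (the vocabulary and prompt format are invented for illustration):

        # Invent a tiny "language": a fixed key mapping English words
        # to made-up tokens the model has never seen before.
        key = {
            "all": "zorp", "cats": "flim", "are": "quag",
            "animals": "brol", "some": "vekt", "not": "snib",
        }

        def translate(sentence: str) -> str:
            return " ".join(key.get(word, word) for word in sentence.lower().split())

        # State a basic reasoning problem in the new language.
        premise = translate("all cats are animals")        # zorp flim quag brol
        question = translate("are some cats not animals")  # quag vekt flim snib brol

        # The prompt hands the model the key and the premise, then asks
        # the question; the benchmark scores the model's answer.
        prompt = f"Translation key: {key}\nPremise: {premise}\nQuestion: {question}"
        print(prompt)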

  • Reasoning vs being a completion engine (I could make a guess at how well that would work)

    • Reasoning is a form of (logical) completion; the problem is that LLMs aren't language-agnostic in their learned semantic reasoning.

I suspect there were relatively few training examples for COBOL. It would be interesting to see the results for a system which had a significant number of such examples in the training set.

  • Indeed, even the author wasn't aware of modern COBOL, with IDE tooling and OOP capabilities, focusing on the classical micros instead.

    • The issue is less COBOL the language and more the literal business logic the COBOL is encoding. You can learn the COBOL language as easily as any other. What you can't learn as easily are the accounting rules, administration policies, and regulations any major COBOL codebase is implementing.

      You'll be able to see the code multiplying a dollar value by 0.03 but not necessarily know that this is because of some statutory requirement of some Minnesota tax code that only applies to industries producing both left- and right-handed monkey wrenches, but only if the company was incorporated before 1975. That obscure law isn't referenced in any documentation but was found by an accountant in 1982. The change was made to the code but only referenced in a paper memo with a small distribution list, and all of those memos were shredded after being archived for ten years.

      ChatGPT can't really help document code that's decades old and doesn't have any references to the why of the code. The how is straightforward but rarely as important as the why.

  • There is probably a business opportunity for an AI company to build private LLMs for large enterprises trained on their own COBOL code bases. They won't find much available as open source, and individual companies tend to have significantly different coding styles tied to COBOL versions and database schemas.