Max severity RCE flaw discovered in widely used Apache Parquet

11 days ago (bleepingcomputer.com)

Broken record, but "has a CVSS score of 10.0" is literally meaningless. In fact, over the last couple years, I've come to take vulnerabilities with very high CVSS scores less seriously. Remember, Heartbleed was a "7.5".

  • I am pretty convinced that CVSS has a very significant component of "how enterprise is it." Accepting untrusted parquet files without verification or exposing Apache Spark directly to users is a very "enterprise" thing to do (alongside having log4j log untrusted user inputs). Heartbleed sounded too technical and not "enterprise" enough.

    • > alongside having log4j log untrusted user inputs

      I'd think logging things like query parameters is extremely common.

  • It may be noisy, but recently DrayTek routers had a 10-point one, and indeed, an office router had been taken over. It would stubbornly reboot every couple of minutes and not accept upgrades.

  • Yep. Any software these days can be "network accessible" if you put a server in front of it; that's usually what pumps the score up.

It's so dumb to assign it a CVSS score of 10.

Unless you are blindly accepting parquet-formatted files, this really doesn't seem that bad.

A vulnerability in parsing images, XML, JSON, HTML, or CSS would be way more detrimental.

I can't think of many services that accept parquet files directly. And of those, you're usually calling them via a backend service anyway.

  • Unless you're logging user input without proper validation, log4j doesn't really seem that bad.

    As a library, this is a huge problem. If you're a user of the library, you'll have to decide if your usage of it is problematic or not.

    Either way, the safe solution is to just update the library. Or, based on the link shared elsewhere (https://github.com/apache/parquet-java/compare/apache-parque...) maybe avoid this library if you can, because the Java-specific code paths seem sketchy as hell to me.
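
    For what it's worth, the bump is a one-liner in Maven. A minimal sketch, assuming the parquet-avro module (the one the write-ups point at) and assuming 1.15.1 is the first patched release, since 1.15.0 and earlier are reported as affected; confirm against the project's release notes:

        <dependency>
          <groupId>org.apache.parquet</groupId>
          <artifactId>parquet-avro</artifactId>
          <version>1.15.1</version>
        </dependency>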

    • It’s incredibly common to log things which contain text elements that come from a user request. I’ve worked on systems that do that hundreds of thousands of times per day. I’ve literally never deserialized a parquet file that came from someone else, not even a single time, and I’ve used parquet since it was first released.

    • > Unless you're logging user input without proper validation, log4j doesn't really seem that bad.

      Most systems do log user input though, and "proper validation" is an infamously squishy phrase that mostly acts as an excuse. The bottom line is that the natural/correct/idiomatic use of Log4j exposed the library directly to user-generated data. The similar use of Apache parquet (an obscure tool many of us are learning about for the first time) does not. That doesn't make it secure, but it makes the impact inarguably lower.

      I mean, come on: the Log4j exploit was a global zero-day!


  • The score is meant for consumption by users of the software with the vulnerability. In the kind of systems where Parquet is used, blindly reading files in a context with more privileges than the user who wrote them is very common. (Think less "service accepting a parquet file from an API", more "ETL process that can read the whole company's data scanning files from a dump directory anyone can write to".)

    • I get the point you’re making, but I’m gonna push back a little on this (as someone who has written a fair few ETL processes in their time). When are you ever ETLing a parquet file? You are always ETLing some raw format (csv, json, raw text, structured text, etc.) and writing into parquet files, never reading parquet files themselves. It seems pretty bad practice to write your ETL to just pick up whatever file in whatever format from a slop bucket you don’t control. I would always pull files in specific formats from such a common staging area, and everything else would go into a random “unstructured data” dump where you just make a copy of it and record the metadata. I mean, it’s a bad bug and I’m happy they’re fixing it, but it feels like you have to go out of your way to encounter it in practice.

  • Vendor CVSS scores are always inherently meaningless because they can't take into account the factors specific to the user's environment.

    Users need to do their own assessments.

    • This comment overgeneralises the problem, and the claim is absurd on its face. There are key indicators in the scoring that describe the attack itself, which aren't environment-specific.

      I do agree that in most cases the deployment specific configuration affects the ability to be exploited and users or developers should analyse their own configuration.

As per the PoC, yes — this is the usual Java Deserialization RCE where it’ll instantiate arbitrary classes. Java serialization really is a gift that keeps on giving.
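
For the curious, the general shape of that bug class fits in a few lines. A minimal, self-contained sketch (illustrative only, not the actual parquet-java code path): a class name arrives in attacker-controlled file metadata and gets instantiated via reflection, handing the attacker the static initializer and constructor of any class on the classpath.

    import java.lang.reflect.Constructor;

    public class UnsafeMaterialize {
        // "className" arrives from attacker-controlled metadata
        static Object materialize(String className) throws Exception {
            // Class.forName runs the class's static initializers
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getDeclaredConstructor();
            // ...and this runs its no-arg constructor, side effects included
            return ctor.newInstance();
        }

        public static void main(String[] args) throws Exception {
            // Harmless with a benign name; a "gadget" class with a malicious
            // constructor or static block turns the same call into code execution.
            System.out.println(materialize("java.util.ArrayList"));
        }
    }

The standard fix is just as classic: check the incoming name against an allowlist of expected types before any reflective instantiation.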

When did vulnerability reports get so vague? Looks like a classic serialization bug

https://github.com/apache/parquet-java/compare/apache-parque...

Maybe the headline should note that this is a parser vulnerability, not one in the format itself. I suppose that is obvious, but my first knee-jerk thought was, "Am I going to have to re-encode XXX piles of data?"

  • Also that it's in the Java parquet library, which somehow is nowhere in the article

  • What would it mean for the vulnerability to be in the format and not the parser?

    • I don't know. Something like a Python pickle file where parsing is unavoidable.

      On a second read, I realized a format problem was unlikely, but the headline just said, "Apache Parquet". My mind might reach the same conclusion if it said "safetensors" or "PNG".

    • That the data could be encoded in a way that leads to unavoidable exploitation in every conforming implementation. For example, PDF permits embedded JavaScript and… that has not gone well.
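
      Java's native serialization is another canonical example of the hazard living in the format: any conforming reader of an object stream must execute readObject()/readResolve() logic from whatever classes the stream names. A minimal sketch (the blob.bin input file is hypothetical):

          import java.io.ByteArrayInputStream;
          import java.io.ObjectInputStream;
          import java.nio.file.Files;
          import java.nio.file.Path;

          public class FormatHazard {
              public static void main(String[] args) throws Exception {
                  byte[] untrusted = Files.readAllBytes(Path.of("blob.bin"));
                  try (ObjectInputStream in = new ObjectInputStream(
                          new ByteArrayInputStream(untrusted))) {
                      // readObject() runs the deserialization hooks of the
                      // classes named in the stream before it returns; a
                      // conforming reader cannot opt out of that.
                      Object o = in.readObject();
                      System.out.println(o);
                  }
              }
          }

      Parquet itself is a plain columnar data format, so the RCE here is a property of one implementation's parsing code, not of the format.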

"Maximum severity RCE" no longer means "unauthenticated RCE by any actor", it now means "the vulnerability can only be exploited if a malicious file is imported"

Grumbling about CVE inflation

  • CVSS, at least in its current form, needs to be taken out back and shot. See, for instance, https://daniel.haxx.se/blog/2025/01/23/cvss-is-dead-to-us/

    • I like the idea of CVSS, but it's definitely less precise than I'd like as-is. E.g., I've found that most issues which I would normally think of as low-severity get bumped up to medium by CVSS just for having a network-based attack vector, even if the actual issue is an extreme edge case, extremely complex and/or computationally expensive to exploit, or not clearly exploitable at all.

  • But Parquet is intended to be a safe format. So importing a malicious file should still be safe.

    Like, if a browser had a vulnerability parsing HTML, of course it would be a major concern, because browsers very often parse HTML from untrusted parties.

  • There's no such thing as CVE inflation because CVEs don't have scores. You're grumbling about CVSS inflation. But: CVSS has always been flawed, and never should have been taken seriously.

Soon to be announced "Quake PAK files identified carrying malware, critical 10/10 vulnerability"

Does anyone know if pandas is affected? I serialize/deserialize dataframes, and pandas uses parquet under the hood.

  • Pandas doesn't use the vulnerable Java parquet library under the hood: https://pandas.pydata.org/docs/reference/api/pandas.read_par...

    > Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.

    Those should be unaffected.

  • https://www.endorlabs.com/learn/critical-rce-vulnerability-i...

    > Any application or service using Apache Parquet Java library versions 1.15.0 or earlier is believed to be vulnerable (our own data indicates that this was introduced in version 1.8.0; however, current guidance is to review all historical versions). This includes systems that read or import Parquet files using popular big-data frameworks (e.g. Hadoop, Spark, Flink) or custom applications that incorporate the Parquet Java code. If you are unsure whether your software stack uses Parquet, check with your vendors or developers – many data analytics and storage solutions include this library.

    Seems safe to assume the answer is yes: pandas is probably affected if its stack pulls in this library.

I migrated off apache parquet to a very simple columnar format. Cut processing times in half, reduced RAM usage by almost 90%, and (as it turns out) dodged this security vulnerability.

I don't want to be too harsh on the project, as it may simply not have been the right tool for my use case, though it sure gave me a lot of issues.