Hacker News
Why the OpenSSL punycode vulnerability was not detected by fuzz testing (2022) (allsoftwaresucks.blogspot.com)
112 points by lemper 13 days ago | 53 comments





I'm not sure the post (from 2022) is/was correct. I've looked into it too, and I expect this was reachable by the existing x509 fuzzer. There's a fallacy in assuming that a fuzzer will solve for all reachable code paths in a reasonable time, and that if it doesn't then there must be a problem with the harness. The harness is a reasonable top-level x509 parsing harness, but starting all the way from the network input makes solving those deep constraints unlikely to happen by (feedback-driven) chance, which is what I think happened here.

Of course, a harness that started from the punycode parsing -- instead of the top-level X509 parsing -- finds this vulnerability immediately.


I've found that writing randomized unit tests for each small part of a system finds this sort of stuff immediately.

In this case, a test that just generated 1,000,000 random strings and passed them to punycode would probably have found the problem (maybe not in the first run, but after a week or two in continuous integration).
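A minimal sketch of what that could look like (decode_punycode here is a hypothetical stand-in prototype for whatever decoder is under test, not OpenSSL's actual internal API; the idea is just to hammer it with random ASCII under the sanitizers):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical stand-in for the decoder under test. */
    int decode_punycode(const char *in, size_t in_len,
                        uint32_t *out, size_t *out_len);

    int main(void)
    {
        unsigned seed = (unsigned)time(NULL);
        printf("seed=%u\n", seed);   /* log the seed so failures can be reproduced */
        srand(seed);

        for (int i = 0; i < 1000000; i++) {
            char buf[256];
            size_t len = (size_t)(rand() % sizeof(buf));
            for (size_t j = 0; j < len; j++)
                buf[j] = (char)(rand() % 128);   /* random ASCII, mostly invalid */

            uint32_t out[64];
            size_t out_len = sizeof(out) / sizeof(out[0]);
            /* No assertion needed: build with ASan/UBSan and let the
               sanitizer report any out-of-bounds write. */
            decode_punycode(buf, len, out, &out_len);
        }
        return 0;
    }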

I try to structure the tests so they can run with dumb random input or with coverage-guided input from a fuzzer. The former usually finds >90% of the bugs the fuzzer would, and does so 100x faster, so it's better than fuzzing during development, but worse during nightly testing.

One other advantage for dumb random input is that it works with distributed systems and things written in multiple languages (where coverage information isn't readily available).


I really like this idea because it avoids the issue of fuzzers needing to burn tons of CPU just to get down to the actual domain logic, which can have really thorny bugs. Per usual, the idea of "unit" gets contentious quickly, but with appropriate tooling I could foresee adding annotations to code that leverage random input, property-based testing, and a user-defined dictionary of known weird inputs.

> maybe not in the first run, but after a week or two in continuous integration

You'd use a different seed for each CI run??

That sounds like a nightmare of non-determinism, and a loss of trust in the CI system in general.


Not if you log the seed of the failing runs

Yep; I definitely log the seed and re-seed every few seconds.

Most software I work on these days is non-deterministic anyway (it involves the network, etc.), so CI is fundamentally going to fail some runs and not others.

Even stuff like deterministic simulation has this property: Those test suites rely on having a large number of randomly generated schedules, so there's always a chance that running the test one more time (with a new seed) will find a new bug.


If you’re using Git, I’d strongly recommend the hash of the current tree (not the commit). That way your tests are deterministic based on the contents of your tree. For example, if you add a commit and then revert, you’ll end up with the same test seed as if you hadn’t committed.
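A sketch of deriving such a seed, assuming the tests run inside a checkout where git is on PATH: take the tree hash of HEAD and fold its leading hex digits into a 64-bit seed.

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Derive a deterministic seed from the contents of the current tree. */
    static uint64_t seed_from_git_tree(void)
    {
        char hash[128] = {0};
        FILE *p = popen("git rev-parse HEAD^{tree}", "r");

        if (p == NULL || fgets(hash, sizeof(hash), p) == NULL) {
            if (p != NULL)
                pclose(p);
            return 1;                       /* fall back to a fixed seed */
        }
        pclose(p);

        hash[strcspn(hash, "\n")] = '\0';
        hash[16] = '\0';                    /* keep the first 16 hex digits */
        return strtoull(hash, NULL, 16);    /* 16 hex digits -> 64-bit value */
    }

    int main(void)
    {
        uint64_t seed = seed_from_git_tree();
        printf("seed=%" PRIu64 "\n", seed);
        srand((unsigned)seed);
        /* ... run the randomized tests with this seed ... */
        return 0;
    }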

The important thing to log is the seed passed to the rng. (It’s usually wallclock time measured at nanosecond granularity.)

In a typical night for a production-quality system, 100-1000+ hours of such tests would run, all with different seeds and in diverse configurations, etc., so the seed isn't derived from the git SHA or the source code.


Or someone just punches retry

Ignoring test failures is a different issue. It can range from known bugs and rational tradeoffs to a hiring/HR issue.

Establishing a healthy dev/engineering culture is hard.


You don't have auto retries anywhere in your tests?

aka property testing.

> Of course, a harness that started from the punycode parsing -- instead of the top-level X509 parsing -- finds this vulnerability immediately.

Yes, it's nice to have a fuzzer that can start from a very high-level entry point, but it's even nicer to fuzz lots of entry points (one for every parser).
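For example, a per-parser libFuzzer-style harness for the punycode decoder is only a few lines (again with decode_punycode as a hypothetical stand-in prototype rather than OpenSSL's internal symbol):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the parser being targeted. */
    int decode_punycode(const char *in, size_t in_len,
                        uint32_t *out, size_t *out_len);

    /* libFuzzer entry point; ideally every parser in the library
       gets its own small harness like this one. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        uint32_t out[64];
        size_t out_len = sizeof(out) / sizeof(out[0]);

        decode_punycode((const char *)data, size, out, &out_len);
        return 0;
    }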


If finding it required generating a valid signed certificate from random string generation, that seems essentially equivalent to "unreachable" to me...

Not necessarily, because the fuzzer has visibility into the execution pathways triggered, so it can "feel" its way through the maze.

For example, say any certificate that starts with 0x00 is invalid (and there's an up-front test and return). Once the fuzzer has tried inputs starting with 0x00 and seen that they add no new coverage, it stops prioritizing them, even though there are squillions of data blocks beginning 0x00.

It's not a highly efficient way to generate a valid certificate (especially when you have to navigate the network code as well), but it's also not just rolling 'n' 256-sided dice until the end of the universe.
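A toy illustration of the feedback effect: with something like the check below, blind random input has roughly a one-in-four-billion chance per try of reaching the deep path, while a coverage-guided fuzzer tends to solve one byte comparison at a time, because each newly passed comparison lights up new coverage and the input that got there is kept as a seed to mutate further.

    #include <stddef.h>
    #include <stdint.h>

    /* Short-circuit && compiles to separate branches, so the fuzzer
       gets coverage feedback for each byte it matches. */
    void parse(const uint8_t *data, size_t size)
    {
        if (size >= 4 &&
            data[0] == 'C' &&
            data[1] == 'E' &&
            data[2] == 'R' &&
            data[3] == 'T') {
            /* deep, rarely exercised parsing logic lives here */
        }
    }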


I think the parent is pointing out that if the signature has to be both syntactically and cryptographically valid, then this would defeat the fuzzer for obvious reasons.

But I don't think it does, for this vulnerability. The signature is checked last.


Yep. That was what I was thinking. Didn't realise the signature was checked later. Thanks for the clarification.

(2022)

Previous:

Why CVE-2022-3602 was not detected by fuzz testing - https://news.ycombinator.com/item?id=33693873 - Nov 2022 (171 comments)


Year added above. Thanks!

I suppose it would be nice to require at least 100% line coverage for tested encryption-related functionality, when tested by a fuzzer. Deviations from this could be easily detected in CI.

Testing cannot be used to prove that a flaw doesn't exist, only that it does.


> Testing cannot be used to prove that a flaw doesn't exist, only that it does.

FWIW, I wrote a similar blog post about a different encryption bug that really seemed like it should have been found by fuzzing, and had 100% coverage.

https://googleprojectzero.blogspot.com/2021/12/this-shouldnt...

Not that I disagree with you, just a practical example.


It's pretty remarkable how effective fuzzing is despite the layers upon layers of optimizations/assumptions that it requires in order to be feasible at all (e.g. max_len = 10000). I haven't tinkered with fuzzing since afl was a toddler, but its mechanism for pruning the test space seemed so brilliant and tweakable at the time.

It would be interesting to find a way to create excursions into these various parameter spaces but some of them are baked into the infrastructure in such a way that it makes it difficult.


> Testing cannot be used to prove that a flaw doesn't exist

This is not universally true; formal methods can take you a long way, depending on the problem domain.

And sometimes you can test for every possible input. This usually isn't feasible, but it's nice when it is.


Formal methods aren't "testing".

As you say, though, exhaustive testing is sometimes possible.


I agree that it's not testing in the strict sense, but fuzzing isn't really testing either.

The work done by a theorem prover and a fuzzer isn't all that different (and sufficiently advanced fuzzers use SMT solvers, etc.)


I guess I need to disagree :). Why isn't fuzzing testing? Ultimately it ends up running the code and seeing if it works (for some definition of "working", i.e. it doesn't crash), even if it picks its inputs by considering how the code behaves while running.

Also, how is even "smart fuzzing" similar to theorem provers in any way? Fuzzers still think in terms of runs, but a proof considers all cases. The best fuzzer wouldn't be able to tell if a sort algorithm works for all inputs, presuming the input size is unbounded. Actually, my understanding is that a fuzzer wouldn't be able to prove that the identity function works for all kinds of objects it may be given.


I think fuzzing is doing testing, but often using other techniques (including sometimes formal methods) to improve the effectiveness of that testing.

This seems to neglect the fundamental difference between static and dynamic analysis. Formal methods approaches don’t usually need me to run my code.

It is normally safe to assume that exhaustive testing isn't possible, because the total number of states to exhaustively test exceeds the number of atoms in the universe. There are a few exceptions, but in general we should assume it is always impossible to exhaustively test programs. Which means we need to use something else to find what the limits are and test only those, or use formal methods (both would be my recommendation, though I'll admit I have yet to figure out how to use formal methods).

> in general we should assume it is always impossible to exhaustively test programs

The whole program? Yes.

For testing individual components, there's no need to assume. The answer is very likely either clearly yes (the input is a handful of booleans) or clearly no (the input is several integers or even unbounded), and for those cases where it's not immediately clear (one i16? Three u8s?), it's probably still not hard to think through for your particular case in your particular context.


Brute forcing a single 32-bit input has also been usefully performed in some cases, such as for single-precision implementations of special functions[1].

[1] https://randomascii.wordpress.com/2014/01/27/theres-only-fou...
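A sketch of that kind of exhaustive sweep, checking a hypothetical fast_log2f against the C library's log2f for every one of the ~4.3 billion float bit patterns (link with -lm):

    #include <inttypes.h>
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical single-precision function under test. */
    float fast_log2f(float x);

    int main(void)
    {
        uint32_t bits = 0;
        do {
            float x;
            memcpy(&x, &bits, sizeof(x));   /* reinterpret the bit pattern as a float */

            float got = fast_log2f(x);
            float want = log2f(x);

            /* Treat NaN == NaN as a match; otherwise require bit-exact agreement
               (relax to a ULP tolerance if exactness isn't the contract). */
            if (!(isnan(got) && isnan(want)) && got != want)
                printf("mismatch at 0x%08" PRIx32 ": got %a, want %a\n", bits, got, want);
        } while (++bits != 0);              /* wraps back to 0 after 0xffffffff */
        return 0;
    }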


32 bits is a bit over 4 billion states, so that is doable. However, a lot of algorithms have more than 32 bits of state and so quickly get beyond what is doable.

> 32 bits is a bit over 4 billion states, so that is doable.

Whether 4B is doable depends on how much work you're doing for each of those states, but yes, "4B" alone certainly doesn't rule it out. There's also a question of whether it goes with your most frequently run tests or is something that's only run occasionally.

> However a lot of algorithms can have more than 32 bits of states and so quickly get beyond what is doable.

I'm pretty confident we're all in agreement there.


Formal methods catch a different class of problem from other tests and so are worth it. However, they cannot catch everything. The classic example: "midpoint = (high + low) / 2" is easy to formally prove correct, but in fact it isn't correct because high + low could overflow.
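For concreteness, a small sketch of the failure mode being described, together with the standard rewrite that avoids it:

    #include <stdio.h>

    int main(void)
    {
        int low = 2000000000, high = 2100000000;   /* both fit in a 32-bit int */

        /* Mathematically correct, but high + low overflows a 32-bit int,
           which is undefined behaviour in C:
           int mid_bad = (high + low) / 2; */

        /* The usual fix: the intermediate value never exceeds high. */
        int mid = low + (high - low) / 2;
        printf("mid = %d\n", mid);
        return 0;
    }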

Any formal system worth its salt is going to model overflow.

And with SPARK/Ada it will. When I was learning it that was actually a very common thing that it caught in my code (various forms of over/underflow errors). In practice most of them would not occur, but a few could have under real-world use-cases.

There is a point here: if your implementation doesn't match the specification, then that can be the case. And your example presumes the specification assumed unbounded numbers, while an implementation did not have them. Well, in that case the specification just needs to be revised to include that information.

But the trick is to remember to make the specification model the real world at a sufficient level of detail: not so detailed that the specification becomes unwieldy, not so relaxed that it skips real-world issues that need to be modeled.

I don't think there are too many good solutions to that, except perhaps theorem provers from which you can also extract code (e.g. Coq), but it seems either these tools are still difficult to use for engineers, or there is an expectation that models and actual implementations are going to be different anyway, making it difficult to see if the implementation actually implements the spec.


If you've formally proved that correct, then either it can't overflow or there's a massive bug in the prover.

I don't mean to be rude, but why are you commenting on formal methods when you clearly have no idea what they are?

The specific example I gave was from a 1982 paper (I forget who wrote it, but a big name in computer science) where that was used to prove an algorithm and thus argue that he didn't need to test it as it was proved correct.

The famous Donald Knuth has also stated "Beware of bugs in the above code; I have only proved it correct, not tried it.", which again hints that there are limits to proved programs.

I'll admit that I don't know much about them; what I know meringues me, but I can't figure out how to use them in the real world. However, I know enough to know that they are not perfect.


You use them judiciously, you learn their capabilities and limitations and the scope of their applicability.

TLA+ is very useful, but it's not going to prove out your software itself, that will require testing or other methods, it will help prove out your specification. I've used it to good effect both for designing distributed and concurrent systems and for identifying issues in existing systems (without an existing formal spec, by creating it and then playing with the invariants and other aspects of the TLA+ spec). SPARK/Ada, I mentioned in my other comment, will detect many problems, and help you prove the absence of many others. But it's also not suitable (yet) for all projects, or more than a subset of a project, since it has specific limitations on what it can prove and is only compatible with a subset of Ada (this is being expanded each year). The subset of Ada it does apply to is actually quite significant and useful, but you have to be capable of understanding that it is limited and then apply it within that limited scope.

This is true for every other formal method approach I've seen. Pay attention, read, ask questions, and don't judge them all by one paper you half remember from 42 years ago.


Is "meringues me" a typo, or a really fun new vocab word for me?

True, I should have remembered to express it as Non-exhaustive testing cannot...

I also consider formal methods not testing. But model checking (e.g. with TLC) is something in between, and sort of takes advantage of the "This usually isn't feasible, but it's nice when it is" point: when you limit your domain enough, it becomes feasible, yet exhaustively checking each small-world scenario gives pretty good confidence in the model being correct.


Would it be possible to attach an LLM to debugger sessions executing all of the fuzzer seeds, and ask it to figure out how to expand coverage?

Not to dismiss the novelty and power of LLMs, but why would you turn to a black-box language interface for that?

Wouldn't you expect a designed system to be more efficient, complete, reliable, and auditable? Most of those characteristics are critical to security applications, and the nature of LLMs largely runs counter to them.



These links are a little different to the GP comment, though. Both of these cases (which I agree show LLMs being an excellent choice for improving fuzzing coverage) are static analysis, going from the project source code to a new harness.

Some issues with that are that the model probably doesn't have enough context to be given all the project source code, so you have to work out which subset to share, including definitions of all relevant symbols (but not too many).

It helps a bit that the foundation models were already pre-trained on these large open source projects in oss-fuzz, so they already know something about the project's symbols and definitions from their original training sets -- and even from public discussions about the code! -- but that wouldn't work for a private codebase or for recent changes to a large public one.

Then the harness source that the LLM writes might have syntax errors/fail to compile, and you have to deal with that somehow, and the harness source that the LLM writes might be valid but not generate any coverage improvement and you have to deal with that, and so on.

GP seems to be talking about instead some form of LLM-aided dynamic analysis, where you are probably using some kind of symbolic execution to generate new seeds, not new harnesses.

That's important work too, because I think in this case (disagreeing with the blog post author) the vulnerable function was actually reachable by existing harnesses, just not through the seed corpora (at least the public ones).

One approach could be for the LLM to become a kind of a symbolic execution constraint solver, using the debugger as a form of path instrumentation and producing new seeds by predicting what a new input would look like when you invert each interesting constraint that the fuzzing coverage is blocked by, as the debugger hits the test for that constraint (which is difficult because it can require actual computation, not pattern matching, and because of path explosion).

Or perhaps more plausibly, the LLM could generate Z3 or other SAT-solver code to define and solve for those constraints to generate new seeds, replacing what is currently extremely tedious and time-consuming work when done by humans.


Those demonstrate that they're capable of generating capable harnesses, which is really cool but also not surprising.

What matters for engineering is how that technique compares to others in the context of specific requirements.

A big part of "falling for hype" is mistaking a new and capable tool for being the right or optimal tool.


It's fine to have LLM skepticism as a default, but here it's not justified. Google is showing here that the LLM-written harnesses improve massively on the harnesses in oss-fuzz that were written over many years by the combined sum of open source security researchers. Most dramatically, they improved tinyxml2 fuzzing coverage by 31% compared to the existing oss-fuzz harnesses, through an entirely automated flow for harness generation by LLMs.

Whatever engineering technique you are imagining would be better is not one that humanity actually applied to the problem before the automated LLM harnesses were written. In general, writing and improving fuzzing harnesses is extremely tedious work that is not being done (or paid for) by nearly enough people to adequately protect critical open source software. The LLM approach is a legitimate breakthrough in the field of open source fuzzing.


Fair enough, interesting, and plausible! I looked at the first link and saw it as more of a capabilities demo, but didn't dig into the Google one. I'm mostly just encouraging thoughtful reflection on tool choice by raising questions, not making a case against. Very cool.

Why is OpenSSL using punycode? To do internationalized domain name parsing?

I would assume it's needed for hostname validation. Is one hostname equivalent to another? Does this wildcard match this hostname?

EDIT: I looked it up. It is used to implement RFC 8398 (Internationalized Email Addresses in X.509 Certificates)

https://www.rfc-editor.org/rfc/rfc8398#section-5


Yes.


