Artificial Intelligence and Formal Verification Part 3: What Verification Means

Verification is not perfection; it's a ratchet.

Jul 07, 2023

[This is the third in a series about the formal verification of software: why it’s useful, why it’s hard to do, and why artificial intelligence may make it much easier.

If you want to understand the motivation for this installment, I’d go back and read Part 1. If you want a concrete example of a formally verified piece of code, read Part 2. If you have insomnia, read both of them back to back.]

Verification is not Perfection

So formal verification of software looks great, doesn’t it? It’s a ton of work, maybe more than you’re willing to put in, but if you were able to do it, it seems, at least at first glance, like you would finally get perfect software that’s totally free of bugs and executes correctly every time.

Well, not so fast. Sadly, it’s not that easy.

In Part 1, I noted several times that I was using weasel words pretty liberally. A proof is only as good as its assumptions, and there are always a lot of them, including that your operating system and hardware work correctly. In our second installment, I hinted that it’s all fine and well to prove a property correct, but how do you know that you’ve found the right property to prove?

In the following, I’m going to first dispense with the notion that you can actually achieve perfection with formal verification. I’m then going to claim this is not actually a problem, because verification, properly understood, isn’t about achieving perfection, but about ratcheting to ever higher levels of assurance, and that, viewed as part of an iterative engineering process, it remains a revolutionary advance over ordinary testing even if perfect software is still out of reach.

Your Environment is (Probably) Not Verified

Let me dispose of the easiest problem first. Let’s say you decide to run a verified C compiler like CompCert. CompCert was built by writing and verifying the compiler in Coq, and then extracting the Coq code to OCaml1 so that it could be compiled into a normal, high performance executable.

It should be noted that the code that extracts Coq to OCaml is not verified, and neither is the OCaml compiler, and neither is the OCaml runtime environment. The operating system you are running CompCert on is (probably) not verified either, and neither (probably) is the (possibly different) operating system running the code the compiler outputs. (There are formally verified microkernels out there, but you’re probably not using one.2) Furthermore, you’re (probably) not running on a verified microprocessor.3

However, even so, the verification of CompCert eliminated the largest source of problems one usually experiences when running a compiler, which is to say, the overwhelming bulk of the bugs in the compiler itself. This means that the work wasn’t in vain. When exhaustively tested, CompCert has indeed been found to be vastly more reliable than GCC, Clang, and a plethora of other compilers, even though it runs on unverified underpinnings.4

(I will return to this point several times: even if you cannot achieve perfection, simply verifying those parts of a system you can manage has incredible quality benefits.)

If I’m right, and verification can be made many orders of magnitude cheaper, someday we will get the whole stack of abstractions people depend on verified.5 However, even now, with most of those underlying abstractions unverified, there is still incremental benefit in verifying parts of systems. Even if you can’t yet have a fully verified technology stack, that incremental benefit will more than pay for itself, at least if we can lower the cost of verification sufficiently.

(This probably sounds a bit weaselish. “Of course it would be cost effective if you lower the cost enough, that’s a tautology!” But in this case, I really do think there’s a point here; if verification wasn’t significantly harder to perform for software engineers than normal testing, the benefits would be overwhelming.)

Hardware failures are another problem to consider. Even if a microprocessor is verified,6 and even if you’ve verified (under some model) most of the surrounding hardware, hardware operates in our debased physical reality, not in the platonic realm of pure mathematics, and thus is subject to the messy imperfections of our actual world.

Hardware is inherently imperfect. Mistakes can happen during fabrication. Mechanical or physical problems can make electronics flaky. Power can experience glitches. Sometimes a cosmic ray from a distant supernova will decide to chart a path straight through the cache of your microprocessor and will flip some bits as it traverses your chips. Whether it’s because of bad luck while fabricating a device, a component that behaves badly at certain temperatures, damage from being dropped, radiation, or a myriad of other causes, hardware will not always behave as expected.

Any proof of correctness assumes, indeed, must assume, that the program is executed correctly by the hardware. If the hardware doesn’t behave as assumed, the proof is worthless.

However, even here, there are methods that go beyond verifying the hardware that can help us mitigate the imperfections in physical machines. We can test our equipment thoroughly, including stress testing it at manufacturing time, and we can periodically retest it in the field to make sure it has not degraded. We can design circuits that are robust against likely failure modes and are thus less likely to silently fail. We can build mechanisms to correct (or at least detect) memory errors, or errors in the transmission of data through the buses in our hardware, or in stored data in mass storage. For sufficiently sensitive applications, we can even do calculations in multiple copies of hardware and compare the results.

So, we are not completely helpless against hardware failures. (Indeed, all of these techniques and more have been used in high reliability systems, like the computers used to control manned spacecraft.)

The theme here is the same: even though we cannot achieve perfection, we can make things better with careful engineering of every part of a system we build. We can’t assure correct execution of a program if the computer is destroyed in a data center fire — proofs are powerless against actual incineration of circuitry — but we can still dramatically reduce hardware error rates if we want to.

That said, even absent the use of such design techniques, verifying the hardware design and all the software on top of it still brings us an incredible improvement in reliability. It can’t bring about a system that never fails; the fact that hardware cannot be made perfectly reliable means that nothing built on top of it can be perfectly reliable either. However, verifying everything we can remains a huge improvement that, I contend, dramatically lowers error rates. Even if they cannot be brought to zero, and even if we can’t verify everything, verifying key parts of systems brings us dramatic incremental improvements in assurance.

You Might Not Verify the Right Things

In our example in Part 2, our specification of a correct sorting function was a long description in a formal language. Some specifications, like a formal description of the semantics of the C programming language, which you need to build a verified C compiler like CompCert, can be very, very long indeed. All programmers know, from bitter experience, that long pieces of code have bugs, and a specification in a language like Coq is not very different in flavor from a program.

Our proofs of correctness can only show our program faithfully implements a specification. How do we know that our specification itself is correct?

Even worse, how do we know we specified “enough”?

As an example of this latter problem, let’s say that we’re worried about whether our implementation of a cryptographic primitive is secure. We think hard, come up with some complicated formal properties that correctly describe the output the algorithm should have for every input, and we prove that the code does what we wanted. We rejoice, but we’ve overlooked something important.

Years ago, Paul Kocher discovered that you can deduce information about the cryptographic keys in use in a supposedly secure device by analyzing the power used or the time needed by the device to encrypt a particular input. This was an unexpected development to the people who built such systems.

Perhaps when you wrote a logical specification of the behavior of your code, you were unaware of this particular problem, or perhaps a new problem like it is discovered. Either way, there’s a feature of “correctness” that you as a human want but that your existing specification didn’t capture.

Clearly we cannot know that we’ve perfectly specified our program’s behavior, or that we’ve captured all the properties we need to prove. Some such properties might only be discovered in the future. Does that mean that proof is useless?

Far from it! However, we do have to alter our view about what proof provides us with. It’s clearly not perfection. Instead, it’s a ratchet.

Verification is a Ratcheting Mechanism

The right way to think of verification is not as a way to assure for all time that your program is flawless, but rather to assure, for all time, that if there’s a property you are aware you want some part of the program to have, it has it. You will almost certainly become aware of mistakes in specifying the properties or properties that you missed specifying, but that’s okay. As you become aware of problems, improve your specifications, and add the proofs for the improved specifications, your system gets better and better assurance.

Consider again the cryptographic library example. Initially, you thought only about making sure your cryptographic code produced the correct output. A proof would still demonstrate, in a way no test could, that this was true.

Say you then later learn that this was insufficient, that you also need to make sure every execution of the code takes the same amount of time. Your original proof didn’t cover this, so your code was imperfect. However, you still got the benefit of knowing that the output was always correct, and now by fixing the code to address this new isochrony requirement, and adding a proof for it, you’ve ratcheted to a new level of assurance. You may have had to redo the original proof that the output is always correct, but having done so, you are assured that the original property remains true and that the new property is also true. You will not backslide on the original property, but you did gain a new one.

When you rely on testing, every time you change your code to meet new requirements, you are hoping that your original tests will detect ways your changes broke old requirements. Often this is true, but sometimes, unfortunately, it isn’t. When you rely on proofs, and you maintain those proofs as you change your code, any old property you proved remains true going forward. You may of course have forgotten to specify an important property, but you will not lose assurance of the things you did specify.

As you add more and more proofs of properties around your code over time, you do not lose progress. You ratchet in the direction of increased quality. When you realize you failed to ask an important question about the code or that a specification wasn’t quite right, once you’ve fixed the code and successfully proven that it now does the right thing, you’ve closed the door on that problem for good.

This sort of benefit is seen in real systems that have been formally verified. It is routine that people discover that some property was not included in the proof and was important, but once it’s been added to the things being proved, the problem permanently ceases to be a future concern.7

So the right way of thinking about verification isn’t as a way of achieving perfection, but rather as a much, much more powerful mechanism for achieving quality than testing can produce, and in particular, a mechanism where quality ratchets upwards.

Testing Specifications

Another question might come to mind: if you have a very complicated specification, say the formal semantics of a programming language like C, how do you know it is “correct”? A programming language, for example, has a large and complicated description; even for a fairly simple programming language like C, the English language standards document is hundreds of pages long, and is the subject of frequent bug reports to the standards committee because of ambiguities, gaps, or outright mistakes. Even after 35 years of work, what the document means is sometimes unclear.

So, how can you know what the “correct” formal semantics of the programming language would be? The answer, yet again, is that you can’t know with complete certainty, but that’s okay if you’re not expecting perfection but rather an extremely powerful quality improvement tool, both for compilers that translate the language and to allow proofs about the properties of programs written in the language.

One practical method used by the (multiple) projects that have written formal semantics for C has been to build tools to test those semantics against large corpuses of existing C code (and even randomly generated example programs) and to see if the behavior of the executables produced by many different compilers matches the behavior predicted by the semantics. If the semantics do generally predict the consensus behavior of the real implementations, you have reason to believe that you’ve captured most of the intent of the standards document in your formalization.

You might reasonably say: “But wait, isn’t this then circular? How can you learn anything about bugs in compilers if you used the behavior of unverified compilers to check your formal semantics? Did you actually learn anything of interest this way?”

The answer is, yes, you did. First of all, the semantics are a much, much smaller description of what programs in C are supposed to do than any compiler’s source code, and one which describes the behavior of C programs at a much higher level and in a more general way than a compiler could. (The fact that the semantics are in formal logic and not an imperative programming language is a big plus in itself.) Second, you’ve tested it against many implementations, and have found it matches most of them most of the time but not all of them all of the time, and have made conscious decisions about which version of the behavior to follow; you’ve thus learned something about tricky edge cases and what they should mean along the way.8

This turns out to be a very general technique for assuring that specifications are more or less correct: one tests them! One might naïvely feel that there’s something ironic or wrong about relying on testing to make sure you have the right logical description of what you’re trying to prove. However, that feeling goes away if you properly understand formal verification as a tool to improve assurance rather than a way to produce perfect assurance.

If you (incorrectly) view the exercise as a way to achieve perfection, one might be disturbed about attempting to build a perfect implementation on top of only approximately understood specifications. However, if verification is not (incorrectly) viewed as a means of achieving total perfection, but rather as a much more powerful tool than conventional testing, and as a good way of ratcheting assurance towards higher and higher levels, then the fact that your specification itself might need testing to determine if it actually captures what you wanted it to mean should not be disturbing at all.

Verification Does Not Eliminate the Need for Testing

One might also (naïvely) expect that verification eliminates the need for testing. It does not, because you need to know that the code you are verifying is (at least mostly) correct. To quote an old saw, “you can spend an awfully long time trying to prove a false proposition.”9 Sometimes the process of trying to prove code correct turns up contradictions and counterexamples, but it’s a whole lot easier to get your work done starting with code you strongly expect to be correct to begin with.

It is therefore important to have tools to make sure that a proposition you are trying to prove appears to be true before you attempt to find a proof. Luckily, there’s a technique called Property Based Randomized Testing that can be of significant use here.

Imagine, as in Part 2, that one has written both a sorting function and a specification that one might want to prove later on to define what it means for such a function to be correct. (In the case of sorting, the property we developed in Part 2 was a machine readable definition that amounted to “sort takes a list of numbers, and returns an ordered list of numbers that is a permutation of the original list”.)

Given such a specification, one can then feed huge numbers of randomly generated inputs to the function and mechanically check that the property (“is the output an ordered permutation of the input?”) does indeed hold each and every single time. If the property holds for a huge number of random tests, one has some empirical evidence that the code indeed has the claimed property, and one can then proceed to attempt to prove that exact same property. (If it does not hold, finding a counterexample is a very good starting point for fixing a bug.)

Note this process is a two way test: if the property doesn’t hold, it might not mean that the code is wrong, but rather that the property is stated incorrectly. That’s fine, though, because we also need to make sure we didn’t make a mistake specifying what it was that we wanted to prove.

Property based testing is especially powerful because there is a direct match between an empirical test and a later attempt at proof; the two processes (ideally) share a single statement about what correctness means.10

An important part of software engineering is writing code in such a way that it is easy to test. As we’ve noted, this remains true when code needs to be formally verified, because code that is to be verified first needs to pass testing.11

In the future, we will use AI based systems to generate code, to generate propositions about code, and to find proofs of the behavior of code. Doing that efficiently will almost certainly make heavy use of test systems to get reasonable results while achieving tractable performance, because automated tests, when properly constructed, detect problems much more quickly and efficiently than other methods.

Is Verification Still Worthwhile? Is it Feasible?

Verification isn’t a panacea. It doesn’t give you perfect programs running flawlessly on perfect hardware. It doesn’t avoid the need to architect programs correctly, or to test them. Even when systems get verified, they will still sometimes have imperfections because of flaws in the specification. Hardware failures cannot be tamed by verification.

However, verification is still a gigantic and unequivocal win. The real world systems I’ve seen verified are not completely free of flaws, but they’re by far the least buggy code I’ve ever encountered, and objective tests seem to conclusively indicate that this isn’t simply my imagination but rather is a real phenomenon.

However, making practical and widespread use of verification is hard. Up until now, formal verification has usually involved large teams of extremely skilled engineers and heroic amounts of effort. The initial formal verification of the seL4 microkernel took twenty person-years, and most of those involved were doctoral students or had doctorates. The initial proof of correctness of CompCert required six person-years, again by extraordinarily talented engineers.

To some extent, this should not be unexpected. After all, proper testing of code is often incredibly laborious. As an example, take SQLite, a small database management system that’s renowned for its quality. It consists of 155,000 lines of C code and about 92,000,000 lines of tests, almost 600 times the number of lines of tests as shipped code. Compared to this effort, formal verification isn’t actually so bad. On the other hand, it should also be noted that very few programs in existence are this well tested; it’s an ideal almost every project would like to aspire to, but it’s also an ideal that is almost never realized in the real world.

Can we have our cake and eat it too? Is there a way we could achieve dramatically higher levels of code quality, even employing formal verification across our codebases, but without requiring vast amounts of human labor to achieve it?

Obviously, given the title of this series of essays, I think we can get to that goal using artificial intelligence. We can develop machine learning based systems that aid us with coding itself, with producing rigorous specifications of programs as we write them, and with proving that the code, whether created by people or machines, meets those specifications.

We’ll start to discuss that topic in the upcoming installments of this series.

There are many ways to use Coq in particular to prove a system correct. “Extraction” is a pretty popular one. In it, you write your code in Coq, prove it correct, and then “extract” the proven code into another, more efficient programming language.

The extraction process itself usually isn’t proven correct for a technical reason: there is often no formal semantics of the target language (that is to say, a description in logic of how the target language works), so there is no way to prove the extraction preserves the semantics of the program.

(Ironically, several pretty good sets of formal semantics exist for C, a fairly messy if relatively simple programming language, but no formal semantics exists for OCaml, the language Coq itself is implemented in. The cobbler’s children go shoeless.)

There are also several other options people have for using Coq for verifying software.

You can use what amounts to a compiler to convert your code (say written in C) into a description of its behavior in logic, and then prove that the logical specification you had in mind and the logical description of the final program match. This is used, for example, by Princeton’s Verified Software Toolchain for C.

You can create what amounts to a new programming language or a new hardware specification language that you embed inside Coq, in which you write your system which then gets compiled into machine code or a hardware description.

There are many other possibilities; these are just a couple of examples.

You’re certainly not using one for software development; there isn’t yet a formally verified POSIX style operating system, though it would be incredibly useful if there were.

There are in fact, several end-to-end formally verified microprocessor designs, but they’re still pretty unusual, and they’re not yet being done for high performance devices. That said, large chunks of most commercial microprocessors are verified. For example, tricky parts of modern hardware like cache coherency circuitry and even floating point units are often verified because it’s just too hard for humans to get them right without such tools, and the cost of replacing microprocessors in the field is extremely high.

See, for example, the paper Finding and Understanding Bugs in C Compilers by John Regehr and his students. Using a clever randomized testing tool called csmith, Regehr et al found very large numbers of bugs in GCC, Clang, and other C compilers, but found very few in CompCert, and none of them was in an area covered by the proof of correctness. Regehr later followed up with an interesting blog post about further testing his team did; it was unable to find bugs by randomized testing in the latest versions of CompCert they checked. Interestingly, Xavier Leroy noted in the comments to that blog post that he has used Regehr’s csmith tool to find bugs during CompCert’s development process, a topic I’ll return to later in this essay.

There’s incredible benefit to verifying infrastructure like microprocessors, operating systems, compilers, and run time systems. These are the foundations on which our computer systems are built, and verifying the foundations makes everything built on top of them more reliable.

In addition to the ecosystem benefits of verifying foundational components, these are also good targets because of their thorough existing technical specifications. Microprocessor architectures, for example, are often extremely well specified, including pseudoformal descriptions of the behavior of instructions. The POSIX operating system API is the subject of a detailed technical specification, though it is in natural language. Similarly, programming languages often have associated standards documents that attempt to be as comprehensive as natural language permits.

One might ask what it means to verify a microprocessor or other hardware. It’s much the same idea as verifying software; one builds a mathematical proof that the behavior of the device is what you believe it should be. Of course, a proof is always subject to assumptions, and in the case of hardware, one usually makes simplifying assumptions like idealized behavior of the boolean gates you implement with your transistors.

In theory, one could prove that the gates themselves work as expected assuming particular mathematical models of the behavior of the underlying components like the transistors, but I’ve never heard of that being done. Perhaps that will become feasible once AI based verification systems are well enough developed that people can afford to attempt harder proofs.

For example, the original CompCert proof didn’t include important properties of both the front end and of separately compiled code, and those had bugs. However, once those were fixed, no new bugs were found related to those areas.

As an aside, I’ll note that most bugs in compilers are in their optimizers, and that running a compiler with optimization turned on often (incorrectly!) gives you a program with different behavior than without optimization. One can use this difference in behavior to catch bugs, and I’ll allude to techniques for this elsewhere, but compilers and their particular pain points aren’t the core of this series of essays.

If your formal logic is consistent, proving a false proposition should take forever because you can’t actually do it. In practice, one starts noticing that a proof isn’t going through because it requires you prove something obviously false. We will need AI based verification tools to be able to notice such situations, too.

Ideally, the machine readable statement of a property can be directly checked on machine generated input / output pairs without any need to manually write test code, and can then be used, unchanged, as the statement one later proves.

Note that there’s a lot to property based testing that I’m glossing over here, including how you check those input / output pairs, techniques to try to generate “interesting” test cases, how to try to cover as much of the code being tested as possible, and how to reduce generated counterexamples to minimal form to make debugging easier.

If you find the topic interesting, I suggest reading more on the subject. “QuickCheck” is a typical PBT system, originally written for Haskell and now ported to a number of other languages; “QuickChick” is a similar PBT system for Coq itself.

When one is planning on verifying code, there’s a new requirement as well: code needs to be designed to be easy to prove. What that implies goes beyond the scope of these essays.

Diminished Capacity

Discussion about this post

Ready for more?