Saturday, August 15, 2015

Testing is never enough

Summary: Testing shows the presence, not the absence of bugs.

Recently, someone suggested to me that, thanks to test suites, things like changing compiler version or versions of library dependencies was "no big deal". If dependency changes still result in a passing test suite, then they have caused no harm. I disagree, and fortunately for me, Dijkstra explains it far more eloquently than I ever could:

Testing shows the presence, not the absence of bugs. Dijkstra (1969)

While a test suite can give you confidence in changes you make, it does not provide guarantees. Below are just a few reasons why.

The test suite does not cover all the code

For any reasonably sized code base (> 100 lines), covering all the lines of code is difficult. There are a number of factors that mean that mean a test suite is unlikely to provide 100% coverage:

  • Producing tests is a resource intensive activity, and most projects do not have the necessary manpower to test everything.
  • Sometimes there is no good way to test simple sugar functions - the definition is a specification of what the function should do.
  • Testing corner cases is difficult. As the corners get more obscure, the difficulty increases.
  • Testing error conditions is even harder. Some errors conditions have code to deal with them, but are believed to be unreachable.

The test suite does not cover all the ways through the code

Assuming the test suite really does cover every line of the code, making it cover every path through the code is almost certainly computationally infeasible. Consider a program taking a handful of boolean options. While it might be feasible to test each individual option in the true and false states, testing every state in conjunction with every other state requires an exponential amount of time. For programs with loops, testing every number of loop iterations is likely to be highly time consuming.

There is plenty of code you can't see

Even if you cover every line of source code, the compiler may still thwart your valiant efforts. Optimising compilers like to inline code (make copies of it) and specialise code (freeze in some details that would otherwise be dynamic). After such transformations, the compiler might spot undefined behaviour (something almost all C/C++ programs contain) and make modifications that break your code. You might have tested all the source code, but you have not tested all the code generated by the compiler. If you are writing in C/C++, and undefined behaviour and optimisation doesn't scare you a lot, you should read this LLVM article series.

Functions have huge inputs

Testing functions typically involves supplying their input and inspecting their output. Usually the input space is too large to enumerate - which is likely to be the case even if your function takes in an integer. As soon as your function takes a string or array, enumeration is definitely infeasible. Often you can pick cases at which the code is likely to go wrong (0, 1, -1, maxBound) - but maybe it only fails for Carmichael numbers. Random testing can help, and is always advisable, but the effort to deploy random testing is typically quite a bit higher than input/output samples, and it is no panacea.

Functions are not functions

Testing functions usually assumes they really are functions, which depend only on their input. In pure functional languages that is mostly true, but in C/C++ it is less common. For example, functions that have an internal cache might behave differently under parallelism, especially if their cache is not managed properly. Functions may rely on global variables, so they might perform correctly until some seemingly unrelated operation is performed. Even Haskell programs are likely to depend on global state such as the FPU flags, which may be changed unexpectedly by other code.

In my experience, the non-functional nature of functions is one of the biggest practical difficulties, and is also a common place where dependency changes cause frustration. Buggy code can work successfully for years until an improved memory allocator allows a race condition to be hit.

Performance testing is hard

Even if your code gives the correct results, it may take too long or use too much memory. Alas, testing for resource usage is difficult. Resource numbers, especially runtime, are often highly variable between runs - more so if tests are run on shared hardware or make use of parallelism. Every dependency change is likely to have some impact on resource usage, perhaps as dependencies themselves chose to trade time for memory. Spotting erroneous variations often requires a human to make a judgement call.

What is the solution?

Tests help, and are valuable, and you should aim to test as much as you can. But for any reasonably sized program, your tests will never be complete, and the program will always contain unknown bugs. Most likely someone using your code will stumble across one of these bugs. In this case, it's often possible (and indeed, highly desirable) to add a new test case specifically designed to spot this error. Bugs have a habit of recurring, and a bug that happens twice is just embarrassing.

Thinking back to dependency versions, there is often strength in numbers. If all your users are on the same version of all the dependencies, then any bug that is very common is likely to be found by at least one user and fixed for all.

Thinking more generally, it is clear that many of these issues are somewhat ameliorated by pure functional programming. I consider testability and robustness to be one of the great strengths of Haskell.

Monday, August 10, 2015

Upcoming talk to the Cambridge UK Meetup, Thursday 13 Aug (Shake 'n' Bake)

I'll be talking at the Cambridge NonDysFunctional Programmers Meetup this coming Thursday (13 Aug 2015). Doors open at 7:00pm with talk 7:30-8:30pm, followed by beer/food. I'll be talking about Shake 'n' Bake. The abstract is:

Shake is a Haskell build system, an alternative to Make, but with more powerful and accurate dependencies. I'll cover how to build things with Shake, and why I laugh at non-Monadic build systems (which covers most things that aren't Shake). Shake is an industrial quality library, with a website at http://shakebuild.com.

Bake is a Haskell continuous integration system, an alternative to Travis/Jenkins, but designed for large semi-trusted teams. Bake guarantees that all code arriving in your master branch passes all tests on all platforms, while using as few resources as possible, allowing you to have hours of tests, 100's of commits a day and one a few lonely test servers. Bake is held together with duct tape.

I look forward to seeing people there.

Thursday, July 30, 2015

Parallel/Pipelined Conduit

Summary: I wrote a Conduit combinator which makes the upstream and downstream run in parallel. It makes Hoogle database generation faster.

The Hoogle database generation parses lines one-by-one using haskell-src-exts, and then encodes each line and writes it to a file. Using Conduit, that ends up being roughly:

parse =$= write

Conduit ensures that parsing and writing are interleaved, so each line is parsed and written before the next is parsed - ensuring minimal space usage. Recently the FP Complete guys profiled Hoogle database generation and found each of these pieces takes roughly the same amount of time, and together are the bottleneck. Therefore, it seems likely that if we could parse the next line while writing the previous line we should be able to speed up database generation. I think of this as analogous to CPU pipelining, where the next instruction is decoded while the current one is executed.

I came up with the combinator:

 pipelineC :: Int -> Consumer o IO r -> Consumer o IO r

Allowing us to write:

 parse =$= pipelineC 10 write

Given a buffer size 10 (the maximum number of elements in memory simultaneously), and a Consumer (write), produce a new Consumer which is roughly the same but runs in parallel to its upstream (parse).

The Result

When using 2 threads the Hoogle 5 database creation drops from 45s to 30s. The CPU usage during the pipelined stage hovers between 180% and 200%, suggesting the stages are quite well balanced (as the profile suggested). The parsing stage is currently a little slower than the writing, so a buffer of 10 is plenty - increasing the buffer makes no meaningful difference. The reason the drop in total time is only by 33% is that the non-pipelined steps (parsing Cabal files, writing summary information) take about 12s.

Note that Hoogle 5 remains unreleased, but can be tested from the git repo and will hopefully be ready soon.

The Code

The idea is to run the Consumer on a separate thread, and on the main thread keep pulling elements (using await) and pass them to the other thread, without blocking the upstream yield. The only tricky bit is what to do with exceptions. If the consumer thread throws an exception we have to get that back to the main thread so it can be dealt with normally. Fortunately async exceptions fit the bill perfectly. The full code is:

pipelineC :: Int -> Consumer o IO r -> Consumer o IO r
pipelineC buffer sink = do
    sem <- liftIO $ newQSem buffer  -- how many are in flow, to avoid excess memory
    chan <- liftIO newChan          -- the items in flow (type o)
    bar <- liftIO newBarrier        -- the result type (type r)
    me <- liftIO myThreadId
    liftIO $ flip forkFinally (either (throwTo me) (signalBarrier bar)) $ do
        runConduit $
            (whileM $ do
                x <- liftIO $ readChan chan
                liftIO $ signalQSem sem
                whenJust x yield
                return $ isJust x) =$=
            sink
    awaitForever $ \x -> liftIO $ do
        waitQSem sem
        writeChan chan $ Just x
    liftIO $ writeChan chan Nothing
    liftIO $ waitBarrier bar

We are using a channel chan to move elements from producer to consumer, a quantity semaphore sem to limit the number of items in the channel, and a barrier bar to store the return result (see about the barrier type). On the consumer thread we read from the channel and yield to the consumer. On the main thread we awaitForever and write to the channel. At the end we move the result back from the consumer thread to the main thread. The full implementation is in the repo.

Enhancements

I have specialised pipelineC for Consumers that run in the IO monad. Since the Consumer can do IO, and the order of that IO has changed, it isn't exactly equivalent - but relying on such IO timing seems to break the spirit of Conduit anyway. I suspect pipelineC is applicable in some other moands, but am not sure which (ReaderT and ResourceT seem plausible, StateT seems less likely).

Acknowledgements: Thanks to Tom Ellis for helping figure out what type pipelineC should have.

Sunday, July 19, 2015

Thoughts on Conduits

Summary: I'm growing increasingly fond of the Conduit library. Here I give my intuitions and some hints I'd have found useful.

Recently I've been working on converting the Hoogle database generation to use the Conduit abstraction, in an effort to reduce the memory and improve the speed. It worked - database generation has gone from 2Gb of RAM to 320Mb, and time has dropped from several minutes (or > 15 mins on memory constrained machines) to 45s. These changes are all in the context of Hoogle 5, which should hopefully be out in a month or so.

The bit that I've converted to Conduit is something that takes in a tarball of one text files per Hackage file, namely the Haddock output with one definition per line (this 22Mb file). It processes each definition, saves it to a single binary file (with compression and some processing), and returns some compact information about the definition for later processing. I don't expect the process to run in constant space as it is accumulating some return information, but it is important that most of the memory used by one definition is released before the next definition. I originally tried lazy IO, and while it somewhat worked, it was hard to abstract properly and very prone to space leaks. Converting to Conduit was relatively easy and is simpler and more robust.

The Conduit model

My mental model for a conduit Conduit a m b is roughly a function [a] -> m [b] - a values go in and b values come out (but interleaved with the monadic actions). More concretely you ask for an a with await and give back a b with yield, doing stuff in the middle in the m Monad.

A piece of conduit is always either running (doing it's actual work), waiting after a yield for the downstream person to ask for more results (with await), or waiting after an await for the upstream person to give the value (with yield). You can think of a conduit as making explicit the demand-order inherent in lazy evaluation.

Things to know about Conduit

I think it's fair to say Conduit shows its history - this is good for people who have been using it for a while (your code doesn't keep breaking), but bad for people learning it (there are lots of things you don't care about). Here are some notes I made:

  • The Data.Conduit module in the conduit library is not the right module to use - it seems generally accepted to use the Conduit module from the conduit-combinators package. However, I decided to build my own conduit-combinators style replacement in the Hoogle tree, see General.Conduit - the standard Conduit module has a lot of dependencies, and a lot of generalisations.
  • Don't use Source or Sink - use Producer and Consumer - the former are just a convenient way to get confusing error messages.
  • Don't use =$ or $=, always use =$= between conduits. The =$= operator is essentially flip (.).
  • Given a Conduit you can run it with runConduit. Alternatively, given a =$= b =$= c you can replace any of the =$= with $$ to run the Conduit as well. I find that a bit ugly, and have stuck to runConduit.
  • Conduit and ConduitM have their type arguments in different orders, which is just confusing. However, generally I use either Conduit (a connector) or Producer (something with a result). You rarely need something with a result and a return value.
  • You call await to see if a value is available to process. The most common bug I've had with conduits is forgetting to make the function processing items recursive - usually you want awaitForever, not just await.
  • The ByteString lines Conduit function was accidentally O(n^2) - I spotted and fixed that. Using difference lists does not necessarily make your code O(n)!

Useful functions

When using Conduit I found a number of functions seemingly missing, so defined them myself.

First up is countC which counts the number of items that are consumed. Just a simple definition on top of sumC.

countC :: (Monad m, Num c) => Consumer a m c
countC = sumC <| mapC (const 1)

While I recommend awaitForever in preference to await, it's occasionally useful to have awaitJust as the single-step awaitForever, if you are doing your own recursion.

awaitJust :: Monad m => (i -> Conduit i m o) -> Conduit i m o
awaitJust act = do
    x <- await
    whenJust x act

I regularly find zipFrom i = zip [i..] very useful in strict languages, and since Conduit can be viewed as a strict version of lazy lists (through very foggy glasses) it's no surprise a Conduit version is also useful.

zipFromC :: (Monad m, Enum c) => c -> Conduit a m (c, a)
zipFromC !i = awaitJust $ \a -> do
    yield (i,a)
    zipFromC (succ i)

Finally, it's useful to zip two conduits. I was surprised how fiddly that was with the standard operators (you have to use newtype wrappers and an Applicative instance), but a simple |$| definition hides that complexity away.

(|$|) :: Monad m => ConduitM i o m r1 -> ConduitM i o m r2 -> ConduitM i o m (r1,r2)
(|$|) a b = getZipConduit $ (,) <$> ZipConduit a <*> ZipConduit b

Why not Pipes?

I am curious if all Pipes users get asked "Why not use Conduit?", or if this FAQ is asymmetrical?

I realise pipes are billed as the more "principled" choice for this type of programming, but I've yet to see anywhere Conduit seems fundamentally unprincipled. I use WAI/Warp and http-conduit, so learning Conduit gives me some help there.

Saturday, June 20, 2015

Announcing the 'extra' library

Summary: I wrote an extra library, which contains lots of my commonly used functions.

When starting to write Bake, my first step was to copy a lot of utility functions from Shake - things like fst3 (select the first element of a triple), concatMapM (monadic version of concatMap), withCurrentDirectory (setCurrentDirectory under bracket). None of these functions are specific to either Bake or Shake, and indeed, many of the ones in Shake had originally came from HLint. Copy and pasting code is horrible, so I extracted the best functions into a common library which I named extra. Unlike the copy/paste versions in each package, I then wrote plenty of tests, made sure the functions worked in the presence of exceptions, did basic performance optimisation and filled in some obvious gaps in functionality.

I'm now using the extra library in all the packages above, plus things like ghcid and Hoogle. Interestingly, I'm finding my one-off scripts are making particularly heavy use of the extra functions. I wrote this package to reduce my maintenance burden, but welcome other users of extra.

My goal for the extra library is simple additions to the standard Haskell libraries, just filling out missing functionality, not inventing new concepts. In some cases, later versions of the standard libraries provide the functions I want, so there extra makes them available all the way back to GHC 7.2, reducing the amount of CPP in my projects. A few examples:

The module Extra documents all functions provided by the library, so is a good place to go to see what is on offer. Modules such as Data.List.Extra provide extra functions over Data.List and also reexport Data.List. Users are recommended to replace Data.List imports with Data.List.Extra if they need the extra functionality.

Which functions?

When selecting functions I have been guided by a few principles.

  • I have been using most of these functions in my packages - they have proved useful enough to be worth copying/pasting into each project.
  • The functions follow the spirit of the original Prelude/base libraries. I am happy to provide partial functions (e.g. fromRight), and functions which are specialisations of more generic functions (whenJust).
  • Most of the functions have trivial implementations. If a beginner couldn't write the function, it probably doesn't belong here.
  • I have defined only a few new data types or type aliases. It's a package for defining new utilities on existing types, not new types or concepts.

Testing

One feature I particularly like about this library is that the documentation comments are tests. A few examples:

Just True ||^ undefined  == Just True
retry 3 (fail "die") == fail "die"
whenJust (Just 1) print == print 1
\x -> fromRight (Right x) == x
\x -> fromRight (Left  x) == undefined

These equalities are more semantic equality than Haskell's value equality. Things with lambda's are run through QuickCheck. Things which print to stdout are captured, so the print 1 test really does a print, which is scraped and compared to the LHS. I run these tests by passing them through a preprocessor, which spits out this code, which I then run with some specialised testing functions.

Friday, June 19, 2015

Maximum Patches, Minimum Time

Summary: Writing a fast continuous integration server is tricky.

When I started writing the continuous integration server Bake, I thought I had a good idea of what would go fast. It turned out I was wrong. The problem Bake is trying to solve is:

  • You have a set of tests that must pass, which take several hours to run.
  • You have a current state of the world, which passes all the tests.
  • You have a stream of hundreds of incoming patches per day that can be applied to the current state.
  • You want to advance the state by applying patches, ensuring the current state always passes all tests.
  • You want to reject bad patches and apply good patches as fast as possible.

I assume that tests can express dependencies, such as you must compile before running any tests. The test that performs compilation is special because it can go much faster if only a few things have changed, benefiting from incremental compilation.

Both my wrong solution, and my subsequent better solution, are based on the idea of a candidate - a sequence of patches applied to the current state that is the focus of testing. The solutions differ in when patches are added/removed from the candidate and how the candidates are compiled.

A Wrong Solution

My initial solution compiled and ran each candidate in a separate directory. When the directory was first created, it copied a nearby candidate to try and benefit from incremental compilation.

Each incoming patch was immediately included in the candidate, compiled, and run on all tests. I would always run the test that had not passed for the longest time, to increase confidence in more patches. Concretely, if I have run test T1 on patch P1, and P2 comes in, I start testing T2 on the combination of P1+P2. After that passes I can be somewhat confident that P1 passes both T1 and T2, despite not having run T2 on just P1.

If a test fails, I bisect to find the patch that broke it, reject the patch, and immediately throw it out of the candidate.

The Problems

There are three main problems with this approach:

  • Every compilation starts with a copy of a nearby candidate. Copying a directory of lots of small files (the typical output of a compiler) is fantastically expensive on Windows.
  • When bisecting, I have to compile at lots of prefixes of the candidate, the cost of which varies significantly based on the directory it starts from.
  • I'm regularly throwing patches out of the candidate, which requires a significant amount of compilation, as it has to recompile all patches that were after the rejected patch.
  • I'm regularly adding patches to the candidate, each of which requires an incremental compilation, but tends to be dominated by the directory copy.

This solution spent all the time copying and compiling, and relatively little time testing.

A Better Solution

To benefit from incremental compilation and avoid copying costs, I always compile in the same directory. Given a candidate, I compile each patch in the series one after another, and after each compilation I zip up the interesting files (the test executables and test data). To bisect, I unzip the relevant files to a different directory. On Windows, unzipping is much more expensive than zipping, but that only needs to be done when bisecting is required. I also only need to zip the stuff required for testing, not for building, which is often much smaller.

When testing a candidate, I run all tests without extending the candidate. If all the tests pass I update the state and create a new candidate containing all the new patches.

If any test fails I bisect to figure out who should be rejected, but don't reject until I've completed all tests. After identifying all failing tests, and the patch that caused each of them to fail, I throw those patches out of the candidate. I then rebuild with the revised candidate and run only those tests that failed last time around, trying to seek out tests where two patches in a candidate both broke them. I keep repeating with only the tests that failed last time, until no tests fail. Once there are no failing tests, I extend the candidate with all new patches, but do not update the state.

As a small tweak, if there are two patches in the queue from the same person, where one is a superset of the other, I ignore the subset. The idea is that if the base commit has an error I don't want to track it down twice, once to the first failing commit and then again to the second one.

Using this approach in Bake

First, the standard disclaimer: Bake may not meet your needs - it is a lot less developed than other continuous integration systems. If you do decide to use Bake, you should run from the git repo, as the Hackage release is far behind. That said, Bake is now in a reasonable shape, and might be suitable for early adopters.

In Bake this approach is implemented in the StepGit module, with the ovenStepGit function. Since Bake doesn't have the notion of building patches in series it pretends (to the rest of Bake) that it's building the final result, but secretly caches the intermediate steps. If there is a failure when compiling, it caches that failure, and reports it to each step in the bisection, so Bake tracks down the correct root cause.

I am currently recommending ovenStepGit as the "best" approach for combining git and an incremental build system with Bake. While any incremental build system works, I can't help but plug Shake, because its the best build system I've ever written.

Sunday, June 07, 2015

ghcid 0.4 released

Summary: I've released a new version of ghcid, which is faster and works better for Vim users.

I've just released version 0.4 of ghcid. For those who haven't tried it, it's the simplest/dumbest "IDE" that could possibly exist. It presents a small fixed-height console window that shows the active warnings/errors from ghci, which is updated on every save. While that might not seem like much, there are a number of improvements over a standard workflow:

  • You don't have to switch to a different window and hit :reload, just saving in your editor is enough.
  • It includes warnings from modules that successfully compiled previously, as opposed to ghci which omits warnings from modules that didn't reload.
  • It reformats the messages to the height of your console, so you can see several messages at once.

I find it gives me a significant productivity boost, and since there is no editor integration, it adds to my existing tools, rather than trying to replace them (unlike a real IDE).

Version 0.4 offers a few additional improvements:

  • We use fsnotify to watch for changes, rather than polling as in older versions. The result is a significant decrease in battery usage, combined with faster responses to changes.
  • If you change the .cabal or .ghci file then ghcid will restart the ghci session to pick up the changes.
  • If you are a Vim user, then the sequence of temporary files Vim uses on save could upset ghcid. I believe these issues are now all eliminated.
  • The number of errors/warnings is displayed in the titlebar of the console.
  • There is a feature to run a test after each successful save (using --test). I am expecting this to be a niche feature, but feel free to prove me wrong.

Tempted to give it a try? See the README for details.