Thursday, July 30, 2015

Parallel/Pipelined Conduit

Summary: I wrote a Conduit combinator which makes the upstream and downstream run in parallel. It makes Hoogle database generation faster.

The Hoogle database generation parses lines one-by-one using haskell-src-exts, and then encodes each line and writes it to a file. Using Conduit, that ends up being roughly:

parse =$= write

Conduit ensures that parsing and writing are interleaved, so each line is parsed and written before the next is parsed - ensuring minimal space usage. Recently the FP Complete guys profiled Hoogle database generation and found each of these pieces takes roughly the same amount of time, and together they are the bottleneck. Therefore, if we could parse the next line while writing the previous one, we should be able to speed up database generation. I think of this as analogous to CPU pipelining, where the next instruction is decoded while the current one is executed.

I came up with the combinator:

 pipelineC :: Int -> Consumer o IO r -> Consumer o IO r

Allowing us to write:

 parse =$= pipelineC 10 write

Given a buffer of size 10 (the maximum number of elements in memory simultaneously) and a Consumer (write), pipelineC produces a new Consumer which is roughly the same but runs in parallel to its upstream (parse).

The Result

When using 2 threads the Hoogle 5 database creation drops from 45s to 30s. The CPU usage during the pipelined stage hovers between 180% and 200%, suggesting the stages are quite well balanced (as the profile suggested). The parsing stage is currently a little slower than the writing, so a buffer of 10 is plenty - increasing the buffer makes no meaningful difference. The reason the total time only drops by 33% is that the non-pipelined steps (parsing Cabal files, writing summary information) take about 12s.

Note that Hoogle 5 remains unreleased, but can be tested from the git repo and will hopefully be ready soon.

The Code

The idea is to run the Consumer on a separate thread, and on the main thread keep pulling elements (using await) and pass them to the other thread, without blocking the upstream yield. The only tricky bit is what to do with exceptions. If the consumer thread throws an exception we have to get that back to the main thread so it can be dealt with normally. Fortunately async exceptions fit the bill perfectly. The full code is:

import Control.Concurrent hiding (yield)  -- Chan, QSem, forkFinally, throwTo
import Control.Concurrent.Extra (newBarrier, signalBarrier, waitBarrier)
import Control.Monad.Extra (whileM, whenJust)
import Control.Monad.IO.Class (liftIO)
import Data.Conduit
import Data.Maybe (isJust)

pipelineC :: Int -> Consumer o IO r -> Consumer o IO r
pipelineC buffer sink = do
    sem <- liftIO $ newQSem buffer  -- how many are in flow, to avoid excess memory
    chan <- liftIO newChan          -- the items in flow (type o)
    bar <- liftIO newBarrier        -- the result type (type r)
    me <- liftIO myThreadId
    liftIO $ flip forkFinally (either (throwTo me) (signalBarrier bar)) $ do
        runConduit $
            (whileM $ do
                x <- liftIO $ readChan chan
                liftIO $ signalQSem sem
                whenJust x yield
                return $ isJust x) =$=
            sink
    awaitForever $ \x -> liftIO $ do
        waitQSem sem
        writeChan chan $ Just x
    liftIO $ writeChan chan Nothing
    liftIO $ waitBarrier bar

We are using a channel chan to move elements from producer to consumer, a quantity semaphore sem to limit the number of items in the channel, and a barrier bar to store the return result (see my earlier post about the barrier type). On the consumer thread we read from the channel and yield to the consumer. On the main thread we awaitForever and write to the channel. At the end we move the result back from the consumer thread to the main thread. The full implementation is in the repo.

Enhancements

I have specialised pipelineC for Consumers that run in the IO monad. Since the Consumer can do IO, and the order of that IO has changed, it isn't exactly equivalent - but relying on such IO timing seems to break the spirit of Conduit anyway. I suspect pipelineC is applicable in some other monads, but am not sure which (ReaderT and ResourceT seem plausible, StateT seems less likely).

Acknowledgements: Thanks to Tom Ellis for helping figure out what type pipelineC should have.

Sunday, July 19, 2015

Thoughts on Conduits

Summary: I'm growing increasingly fond of the Conduit library. Here I give my intuitions and some hints I'd have found useful.

Recently I've been working on converting the Hoogle database generation to use the Conduit abstraction, in an effort to reduce the memory usage and improve the speed. It worked - database generation has gone from 2Gb of RAM to 320Mb, and time has dropped from several minutes (or > 15 mins on memory constrained machines) to 45s. These changes are all in the context of Hoogle 5, which should hopefully be out in a month or so.

The bit that I've converted to Conduit is something that takes in a tarball containing one text file per Hackage package, namely the Haddock output with one definition per line (this 22Mb file). It processes each definition, saves it to a single binary file (with compression and some processing), and returns some compact information about the definition for later processing. I don't expect the process to run in constant space as it is accumulating some return information, but it is important that most of the memory used by one definition is released before the next definition. I originally tried lazy IO, and while it somewhat worked, it was hard to abstract properly and very prone to space leaks. Converting to Conduit was relatively easy and is simpler and more robust.

The Conduit model

My mental model for a conduit Conduit a m b is roughly a function [a] -> m [b] - a values go in and b values come out (but interleaved with the monadic actions). More concretely you ask for an a with await and give back a b with yield, doing stuff in the middle in the m Monad.

A piece of conduit is always either running (doing its actual work), waiting after a yield for the downstream person to ask for more results (with await), or waiting after an await for the upstream person to provide a value (with yield). You can think of a conduit as making explicit the demand-order inherent in lazy evaluation.
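
To make that intuition concrete, here is a map-shaped conduit written directly with await and yield - a minimal sketch, assuming the conduit-1.2-era Conduit type alias used throughout this post (the mapLike name is mine):

import Data.Conduit

mapLike :: Monad m => (a -> b) -> Conduit a m b
mapLike f = do
    x <- await                 -- ask upstream for an `a`
    case x of
        Nothing -> return ()   -- upstream is exhausted, so stop
        Just a -> do
            yield (f a)        -- hand a `b` downstream
            mapLike f          -- recurse for the next element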

Things to know about Conduit

I think it's fair to say Conduit shows its history - this is good for people who have been using it for a while (your code doesn't keep breaking), but bad for people learning it (there are lots of things you don't care about). Here are some notes I made:

  • The Data.Conduit module in the conduit library is not the right module to use - it seems generally accepted to use the Conduit module from the conduit-combinators package. However, I decided to build my own conduit-combinators style replacement in the Hoogle tree, see General.Conduit - the standard Conduit module has a lot of dependencies, and a lot of generalisations.
  • Don't use Source or Sink - use Producer and Consumer - the former are just a convenient way to get confusing error messages.
  • Don't use =$ or $=, always use =$= between conduits. The =$= operator is essentially flip (.).
  • Given a Conduit you can run it with runConduit. Alternatively, given a =$= b =$= c you can replace any of the =$= with $$ to run the Conduit as well. I find that a bit ugly, and have stuck to runConduit.
  • Conduit and ConduitM have their type arguments in different orders, which is just confusing. However, generally I use either Conduit (a connector, with no result) or Consumer (something with a result). You rarely need something that both yields values and has a return value.
  • You call await to see if a value is available to process. The most common bug I've had with conduits is forgetting to make the function processing items recursive - usually you want awaitForever, not just await (see the sketch after this list).
  • The ByteString lines Conduit function was accidentally O(n^2) - I spotted and fixed that. Using difference lists does not necessarily make your code O(n)!
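
As promised, a sketch of the awaitForever point (my illustrative code, not from any library):

import Control.Monad.IO.Class (liftIO)
import Data.Conduit

-- Buggy: awaits once, processes at most one item, then stops
printOne :: Show a => Consumer a IO ()
printOne = do
    x <- await
    maybe (return ()) (liftIO . print) x  -- no recursion!

-- Correct: awaitForever loops until upstream is exhausted
printAll :: Show a => Consumer a IO ()
printAll = awaitForever $ liftIO . print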

Useful functions

When using Conduit I found a number of functions seemingly missing, so defined them myself.

First up is countC which counts the number of items that are consumed. Just a simple definition on top of sumC.

countC :: (Monad m, Num c) => Consumer a m c
countC = mapC (const 1) =$= sumC

While I recommend awaitForever in preference to await, it's occasionally useful to have awaitJust as the single-step awaitForever, if you are doing your own recursion.

awaitJust :: Monad m => (i -> Conduit i m o) -> Conduit i m o
awaitJust act = do
    x <- await
    whenJust x act

I regularly find zipFrom i = zip [i..] very useful in strict languages, and since Conduit can be viewed as a strict version of lazy lists (through very foggy glasses) it's no surprise a Conduit version is also useful.

zipFromC :: (Monad m, Enum c) => c -> Conduit a m (c, a)
zipFromC !i = awaitJust $ \a -> do
    yield (i,a)
    zipFromC (succ i)

Finally, it's useful to zip two conduits. I was surprised how fiddly that was with the standard operators (you have to use newtype wrappers and an Applicative instance), but a simple |$| definition hides that complexity away.

(|$|) :: Monad m => ConduitM i o m r1 -> ConduitM i o m r2 -> ConduitM i o m (r1,r2)
(|$|) a b = getZipConduit $ (,) <$> ZipConduit a <*> ZipConduit b
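
As a usage sketch (assuming countC from above, with sumC and yieldMany from conduit-combinators), counting and summing a stream in one pass:

countAndSum :: IO (Int, Int)
countAndSum = runConduit $ yieldMany [1..10 :: Int] =$= (countC |$| sumC)
-- countAndSum returns (10, 55)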

Why not Pipes?

I am curious if all Pipes users get asked "Why not use Conduit?", or if this FAQ is asymmetrical?

I realise pipes are billed as the more "principled" choice for this type of programming, but I've yet to see anywhere Conduit seems fundamentally unprincipled. I use WAI/Warp and http-conduit, so learning Conduit gives me some help there.

Saturday, June 20, 2015

Announcing the 'extra' library

Summary: I wrote an extra library, which contains lots of my commonly used functions.

When starting to write Bake, my first step was to copy a lot of utility functions from Shake - things like fst3 (select the first element of a triple), concatMapM (monadic version of concatMap), withCurrentDirectory (setCurrentDirectory under bracket). None of these functions are specific to either Bake or Shake, and indeed, many of the ones in Shake had originally come from HLint. Copy and pasting code is horrible, so I extracted the best functions into a common library which I named extra. Unlike the copy/paste versions in each package, I then wrote plenty of tests, made sure the functions worked in the presence of exceptions, did basic performance optimisation and filled in some obvious gaps in functionality.

I'm now using the extra library in all the packages above, plus things like ghcid and Hoogle. Interestingly, I'm finding my one-off scripts are making particularly heavy use of the extra functions. I wrote this package to reduce my maintenance burden, but welcome other users of extra.

My goal for the extra library is simple additions to the standard Haskell libraries, just filling out missing functionality, not inventing new concepts. In some cases, later versions of the standard libraries provide the functions I want, in which case extra makes them available all the way back to GHC 7.2, reducing the amount of CPP in my projects.

The module Extra documents all functions provided by the library, so is a good place to go to see what is on offer. Modules such as Data.List.Extra provide extra functions over Data.List and also reexport Data.List. Users are recommended to replace Data.List imports with Data.List.Extra if they need the extra functionality.
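
For example, a minimal sketch of that import swap (trim being one of the extra functions, if memory serves):

import Data.List.Extra  -- reexports Data.List, plus the extras

main :: IO ()
main = do
    print $ sort [3,1,2 :: Int]  -- plain Data.List, still available
    print $ trim "  hello  "     -- an extra function, prints "hello"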

Which functions?

When selecting functions I have been guided by a few principles.

  • I have been using most of these functions in my packages - they have proved useful enough to be worth copying/pasting into each project.
  • The functions follow the spirit of the original Prelude/base libraries. I am happy to provide partial functions (e.g. fromRight), and functions which are specialisations of more generic functions (whenJust).
  • Most of the functions have trivial implementations. If a beginner couldn't write the function, it probably doesn't belong here.
  • I have defined only a few new data types or type aliases. It's a package for defining new utilities on existing types, not new types or concepts.

Testing

One feature I particularly like about this library is that the documentation comments are tests. A few examples:

Just True ||^ undefined  == Just True
retry 3 (fail "die") == fail "die"
whenJust (Just 1) print == print 1
\x -> fromRight (Right x) == x
\x -> fromRight (Left  x) == undefined

These equalities are more semantic equality than Haskell's value equality. Things with lambdas are run through QuickCheck. Things which print to stdout are captured, so the print 1 test really does a print, which is scraped and compared to the LHS. I run these tests by passing them through a preprocessor, which spits out this code, which I then run with some specialised testing functions.
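
As a rough illustration, the lambda examples plausibly become QuickCheck properties along these lines (my sketch, not the actual generated code; fromRight is defined locally since it was partial at the time):

import Test.QuickCheck

fromRight :: Either a b -> b
fromRight (Right b) = b
fromRight (Left _) = error "fromRight: Left"

prop_fromRight :: Int -> Bool
prop_fromRight x = fromRight (Right x) == x

main :: IO ()
main = quickCheck prop_fromRight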

Friday, June 19, 2015

Maximum Patches, Minimum Time

Summary: Writing a fast continuous integration server is tricky.

When I started writing the continuous integration server Bake, I thought I had a good idea of what would go fast. It turned out I was wrong. The problem Bake is trying to solve is:

  • You have a set of tests that must pass, which take several hours to run.
  • You have a current state of the world, which passes all the tests.
  • You have a stream of hundreds of incoming patches per day that can be applied to the current state.
  • You want to advance the state by applying patches, ensuring the current state always passes all tests.
  • You want to reject bad patches and apply good patches as fast as possible.

I assume that tests can express dependencies, such as you must compile before running any tests. The test that performs compilation is special because it can go much faster if only a few things have changed, benefiting from incremental compilation.

Both my wrong solution, and my subsequent better solution, are based on the idea of a candidate - a sequence of patches applied to the current state that is the focus of testing. The solutions differ in when patches are added/removed from the candidate and how the candidates are compiled.

A Wrong Solution

My initial solution compiled and ran each candidate in a separate directory. When the directory was first created, it copied a nearby candidate to try and benefit from incremental compilation.

Each incoming patch was immediately included in the candidate, compiled, and run on all tests. I would always run the test that had not passed for the longest time, to increase confidence in more patches. Concretely, if I have run test T1 on patch P1, and P2 comes in, I start testing T2 on the combination of P1+P2. After that passes I can be somewhat confident that P1 passes both T1 and T2, despite not having run T2 on just P1.

If a test fails, I bisect to find the patch that broke it, reject the patch, and immediately throw it out of the candidate.

The Problems

There are four main problems with this approach:

  • Every compilation starts with a copy of a nearby candidate. Copying a directory of lots of small files (the typical output of a compiler) is fantastically expensive on Windows.
  • When bisecting, I have to compile at lots of prefixes of the candidate, the cost of which varies significantly based on the directory it starts from.
  • I'm regularly throwing patches out of the candidate, which requires a significant amount of compilation, as it has to recompile all patches that were after the rejected patch.
  • I'm regularly adding patches to the candidate, each of which requires an incremental compilation, but tends to be dominated by the directory copy.

This solution spent all the time copying and compiling, and relatively little time testing.

A Better Solution

To benefit from incremental compilation and avoid copying costs, I always compile in the same directory. Given a candidate, I compile each patch in the series one after another, and after each compilation I zip up the interesting files (the test executables and test data). To bisect, I unzip the relevant files to a different directory. On Windows, unzipping is much more expensive than zipping, but that only needs to be done when bisecting is required. I also only need to zip the stuff required for testing, not for building, which is often much smaller.

When testing a candidate, I run all tests without extending the candidate. If all the tests pass I update the state and create a new candidate containing all the new patches.

If any test fails I bisect to figure out who should be rejected, but don't reject until I've completed all tests. After identifying all failing tests, and the patch that caused each of them to fail, I throw those patches out of the candidate. I then rebuild with the revised candidate and run only those tests that failed last time around, trying to seek out tests where two patches in a candidate both broke them. I keep repeating with only the tests that failed last time, until no tests fail. Once there are no failing tests, I extend the candidate with all new patches, but do not update the state.
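
The retesting loop can be sketched as follows (all names here are hypothetical stand-ins, not Bake's actual API):

import Control.Monad (filterM)

type Patch = String
type Test = String

-- hypothetical: build the candidate and run one test, True = pass
runTest :: [Patch] -> Test -> IO Bool
runTest = undefined

-- hypothetical: bisect to find the patch that broke a failing test
bisect :: [Patch] -> Test -> IO Patch
bisect = undefined

-- Keep rerunning only the tests that failed last time, throwing the
-- blamed patch out of the candidate, until no tests fail
retest :: [Patch] -> [Test] -> IO [Patch]
retest candidate tests = do
    failed <- filterM (fmap not . runTest candidate) tests
    if null failed then return candidate else do
        bad <- mapM (bisect candidate) failed
        retest (filter (`notElem` bad) candidate) failed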

As a small tweak, if there are two patches in the queue from the same person, where one is a superset of the other, I ignore the subset. The idea is that if the base commit has an error I don't want to track it down twice, once to the first failing commit and then again to the second one.

Using this approach in Bake

First, the standard disclaimer: Bake may not meet your needs - it is a lot less developed than other continuous integration systems. If you do decide to use Bake, you should run from the git repo, as the Hackage release is far behind. That said, Bake is now in a reasonable shape, and might be suitable for early adopters.

In Bake this approach is implemented in the StepGit module, with the ovenStepGit function. Since Bake doesn't have the notion of building patches in series it pretends (to the rest of Bake) that it's building the final result, but secretly caches the intermediate steps. If there is a failure when compiling, it caches that failure, and reports it to each step in the bisection, so Bake tracks down the correct root cause.

I am currently recommending ovenStepGit as the "best" approach for combining git and an incremental build system with Bake. While any incremental build system works, I can't help but plug Shake, because it's the best build system I've ever written.

Sunday, June 07, 2015

ghcid 0.4 released

Summary: I've released a new version of ghcid, which is faster and works better for Vim users.

I've just released version 0.4 of ghcid. For those who haven't tried it, it's the simplest/dumbest "IDE" that could possibly exist. It presents a small fixed-height console window that shows the active warnings/errors from ghci, which is updated on every save. While that might not seem like much, there are a number of improvements over a standard workflow:

  • You don't have to switch to a different window and hit :reload, just saving in your editor is enough.
  • It includes warnings from modules that successfully compiled previously, as opposed to ghci which omits warnings from modules that didn't reload.
  • It reformats the messages to the height of your console, so you can see several messages at once.

I find it gives me a significant productivity boost, and since there is no editor integration, it adds to my existing tools, rather than trying to replace them (unlike a real IDE).

Version 0.4 offers a few additional improvements:

  • We use fsnotify to watch for changes, rather than polling as in older versions. The result is a significant decrease in battery usage, combined with faster responses to changes.
  • If you change the .cabal or .ghci file then ghcid will restart the ghci session to pick up the changes.
  • If you are a Vim user, then the sequence of temporary files Vim uses on save could upset ghcid. I believe these issues are now all eliminated.
  • The number of errors/warnings is displayed in the titlebar of the console.
  • There is a feature to run a test after each successful save (using --test). I am expecting this to be a niche feature, but feel free to prove me wrong.

Tempted to give it a try? See the README for details.

Thursday, May 21, 2015

Handling Control-C in Haskell

Summary: The development version of ghcid seemed to have some problems with terminating when Control-C was hit, so I investigated and learnt some things.

Given a long-running/interactive console program (e.g. ghcid), when the user hits Control-C/Ctrl-C the program should abort. In this post I'll describe how that works in Haskell, how it can fail, and what asynchronous exceptions have to do with it.

What happens when the user hits Ctrl-C?

When the user hits Ctrl-C, GHC raises an async exception of type UserInterrupt on the main thread. This happens because GHC installs an interrupt handler which raises that exception, sending it to the main thread with throwTo. If you install your own interrupt handler you won't see this behaviour and will have to handle Ctrl-C yourself.
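
For example, on POSIX systems you can install your own handler with the unix package - a sketch that merely replicates the default behaviour of forwarding Ctrl-C to the main thread:

import Control.Concurrent (myThreadId, threadDelay)
import Control.Exception (AsyncException(UserInterrupt), throwTo)
import System.Posix.Signals (Handler(Catch), installHandler, sigINT)

main :: IO ()
main = do
    mainId <- myThreadId
    _ <- installHandler sigINT (Catch $ throwTo mainId UserInterrupt) Nothing
    putStrLn "press Ctrl-C to interrupt"
    threadDelay 10000000  -- the UserInterrupt exception aborts this sleep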

There are reports that if the user hits Ctrl-C twice the runtime will abort the program. In my tests, that seems to be a feature of the shell rather than GHC itself - in the Windows Command Prompt no amount of Ctrl-C stops an errant program, while in Cygwin a single Ctrl-C works.

What happens when the main thread receives UserInterrupt?

There are a few options:

  • If you are not masked and there is no exception handler, the thread will abort, which causes the whole program to finish. This behaviour is the desirable outcome if the user hits Ctrl-C.
  • If you are running inside an exception handler (e.g. catch or try) which is capable of catching UserInterrupt then the UserInterrupt exception will be returned. The program can then take whatever action it wishes, including rethrowing UserInterrupt or exiting the program (see the sketch after this list).
  • If you are running with exceptions masked, then the exception will be delayed until you stop being masked. The most common way of running while masked is if the code is the second argument to finally or one of the first two arguments to bracket. Since Ctrl-C will be delayed while the program is masked, you should only do quick things while masked.
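
Here's that sketch - catching UserInterrupt, doing something, then rethrowing (my illustrative code):

import Control.Concurrent (threadDelay)
import Control.Exception

main :: IO ()
main = do
    r <- try $ threadDelay 5000000
    case r of
        Left UserInterrupt -> do
            putStrLn "interrupted, cleaning up"
            throwIO UserInterrupt       -- rethrow so the program still aborts
        Left e -> throwIO e             -- other AsyncExceptions
        Right () -> putStrLn "no interrupt arrived"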

How might I lose UserInterrupt?

The easiest way to "lose" a UserInterrupt is to catch it and not rethrow it. Taking a real example from ghcid, I sometimes want to check if two paths refer to the same file, and to make that check more robust I call canonicalizePath first. This function raises errors in some circumstances (e.g. the directory containing the file does not exist), but is inconsistent about error conditions between OS's, and doesn't document its exceptions, so the safest thing is to write:

canonicalizePathSafe :: FilePath -> IO FilePath
canonicalizePathSafe x = canonicalizePath x `catch`
    \(_ :: SomeException) -> return x

If there is any exception, just return the original path. Unfortunately, the catch will also catch and discard UserInterrupt. If the user hits Ctrl-C while canonicalizePath is running the program won't abort. The problem is that UserInterrupt is not thrown in response to the code inside the catch, so ignoring UserInterrupt is the wrong thing to do.

What is an async exception?

In Haskell there are two distinct ways to throw exceptions, synchronously and asynchronously.

  • Synchronous exceptions are raised on the calling thread, using functions such as throw and error. The point at which a synchronous exception is raised is explicit and can be relied upon.
  • Asynchronous exceptions are raised by a different thread, using throwTo and a different thread id. The exact point at which the exception occurs can vary.

How is the type AsyncException related?

In Haskell, there is a type called AsyncException, containing four exceptions - each special in their own way:

  • StackOverflow - the current thread has exceeded its stack limit.
  • HeapOverflow - never actually raised.
  • ThreadKilled - raised by calling killThread on this thread. Used when a programmer wants to kill a thread.
  • UserInterrupt - the one we've been talking about so far, raised on the main thread by the user hitting Ctrl-C.

While these have a type AsyncException, that's only a hint as to their intended purpose. You can throw any exception either synchronously or asynchronously. In our particular case of canonicalizePathSafe, if canonicalizePath causes a StackOverflow, we are probably happy to take the fallback case, but it's likely the stack was already close to the limit and the overflow will occur again soon. If the programmer calls killThread that thread should terminate, but in ghcid we know this thread won't be killed.

How can I avoid catching async exceptions?

There are several ways to avoid catching async exceptions. Firstly, since we expect canonicalizePath to complete quickly, we can just mask all async exceptions:

canonicalizePathSafe x = mask_ $
    canonicalizePath x `catch` \(_ :: SomeException) -> return x

We are now guaranteed that catch will not receive an async exception. Unfortunately, if canonicalizePath takes a long time, we might delay Ctrl-C unnecessarily.

Alternatively, we can catch only non-async exceptions:

canonicalizePathSafe x = catchJust
    (\e -> if async e then Nothing else Just e)
    (canonicalizePath x)
    (\_ -> return x)

async e = isJust (fromException e :: Maybe AsyncException)

We use catchJust to only catch exceptions which aren't of type AsyncException, so UserInterrupt will not be caught. Of course, this actually avoids catching exceptions of type AsyncException, which is only related to async exceptions by a partial convention not enforced by the type system.

Finally, we can catch only the relevant exceptions:

canonicalizePathSafe x = canonicalizePath x `catch`
    \(_ :: IOException) -> return x

Unfortunately, I don't know what the relevant exceptions are - on Windows canonicalizePath never seems to throw an exception. However, IOException seems like a reasonable guess.

How to robustly deal with UserInterrupt?

I've showed how to make canonicalizePathSafe not interfere with UserInterrupt, but now I need to audit every piece of code (including library functions I use) that runs on the main thread to ensure it doesn't catch UserInterrupt. That is fragile. A simpler alternative is to push all computation off the main thread:

import Control.Concurrent.Extra
import Control.Exception.Extra

ctrlC :: IO () -> IO ()
ctrlC act = do
    bar <- newBarrier
    forkFinally act $ signalBarrier bar
    either throwIO return =<< waitBarrier bar

main :: IO ()
main = ctrlC $ ... as before ...

We are using the Barrier type from my previous blog post, which is available from the extra package. We create a Barrier, run the main action on a forked thread, then marshal completion/exceptions back to the main thread. Since the main thread has no catch operations and only a few (audited) functions on it, we can be sure that Ctrl-C will quickly abort the program.

Using version 1.1.1 of the extra package we can simplify the code to ctrlC = join . onceFork.

What about cleanup?

Now we've pushed most actions off the main thread, any finally sections are on other threads, and will be skipped if the user hits Ctrl-C. Typically this isn't a problem, as program shutdown automatically cleans all non-persistent resources. As an example, ghcid spawns a copy of ghci, but on shutdown the pipes are closed and the ghci process exits on its own. If we do want robust cleanup of resources such as temporary files we would need to run the cleanup from the main thread, likely using finally.
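
A sketch of that shape, with hypothetical realMain and removeTempFiles, and ctrlC from above:

main :: IO ()
main = ctrlC realMain `finally` removeTempFiles
  where
    realMain = putStrLn "doing the real work"
    removeTempFiles = putStrLn "cleaning up temporary files"

Because finally runs on the main thread, the cleanup still happens when Ctrl-C aborts the forked computation.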

Should async exceptions be treated differently?

At the moment, Haskell defines many exceptions, any of which can be thrown either synchronously or asynchronously, but then hints that some are probably async exceptions. That's not a very Haskell-like thing to do. Perhaps there should be a catch which ignores exceptions thrown asynchronously? Perhaps the sync and async exceptions should be of different types? It seems unfortunate that functions have to care about async exceptions as much as they do.

Combining mask and StackOverflow

As a curiosity, I tried to combine a function that stack overflows (using -O0) and mask. Specifically:

main = mask_ $ print $ foldl (+) 0 [1..1000000]

I then ran that with +RTS -K1k. That prints out the value computed by the foldl three times (seemingly just a buffering issue), then fails with a StackOverflow exception. If I remove the mask, it just fails with StackOverflow. It seems that by disabling StackOverflow I'm allowed to increase my stack size arbitrarily. Changing print to appendFile causes the file to be created but not written to, so it seems there are oddities about combining these features.

Disclaimer

I'm certainly not an expert on async exceptions, so corrections welcome. All the above assumes compiling with -threaded, but most applies without -threaded.

Wednesday, May 06, 2015

Announcing js-jquery Haskell Library

Summary: The library js-jquery makes it easy to get at the jQuery Javascript code from Haskell. I've just released a new version.

I've just released the Haskell library js-jquery 1.11.3, following the announcement of jQuery 1.11.3. This package bundles the minified jQuery code into a Haskell package, so it can be depended upon by Cabal packages. The version number matches the upstream jQuery version. It's easy to grab the jQuery code from Haskell using this library, as an example:

import qualified Language.Javascript.JQuery as JQuery

main = do
    putStrLn $ "jQuery version " ++ show JQuery.version ++ " source:"
    putStrLn =<< readFile =<< JQuery.file

There are two goals behind this library:

  • Make it easier for jQuery users to use and upgrade jQuery in Haskell packages. You can upgrade jQuery without huge diffs and use it without messing around with extra-source-files.
  • Make it easier for upstream packagers like Debian. The addition of a jQuery file into a Haskell package means you are mixing licenses and authors, and distributions like Debian also require the source (unminified) version of jQuery to be distributed alongside the minified version. By having one package provide jQuery they only have to do that work once, and the package has been designed to meet their needs.

It's pretty easy to convert something that has bundled jQuery to use the library.

The library only depends on the base library so it shouldn't cause any version hassles, although (as per all Cabal packages) you can't mix and match libraries with incompatible js-jquery version constraints in one project.

As a companion, there's also js-flot, which follows the same ideas for the Flot library.