Thursday, May 21, 2015

Handling Control-C in Haskell

Summary: The development version of ghcid seemed to have some problems with terminating when Control-C was hit, so I investigated and learnt some things.

Given a long-running/interactive console program (e.g. ghcid), when the user hits Control-C/Ctrl-C the program should abort. In this post I'll describe how that works in Haskell, how it can fail, and what asynchronous exceptions have to do with it.

What happens when the user hits Ctrl-C?

When the user hits Ctrl-C, GHC raises an async exception of type UserInterrupt on the main thread. This happens because GHC installs an interrupt handler which raises that exception, sending it to the main thread with throwTo. If you install your own interrupt handler you won't see this behaviour and will have to handle Ctrl-C yourself.

There are reports that if the user hits Ctrl-C twice the runtime will abort the program. In my tests, that seems to be a feature of the shell rather than GHC itself: in the Windows Command Prompt no amount of Ctrl-C stops an errant program, while in Cygwin a single Ctrl-C works.

What happens when the main thread receives UserInterrupt?

There are a few options:

  • If you are not masked and there is no exception handler, the thread will abort, which causes the whole program to finish. This is the desired outcome if the user hits Ctrl-C.
  • If you are running inside an exception handler (e.g. catch or try) which is capable of catching UserInterrupt, then the UserInterrupt exception will be caught and passed to the handler (or returned by try). The program can then take whatever action it wishes, including rethrowing UserInterrupt or exiting the program (see the sketch after this list).
  • If you are running with exceptions masked, then the exception will be delayed until you stop being masked. The most common way of running while masked is if the code is the second argument to finally or one of the first two arguments to bracket. Since Ctrl-C will be delayed while the program is masked, you should only do quick things while masked.
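
As a hedged sketch of the second option (the names longRunningWork and onInterrupt are illustrative, not from ghcid), a program can catch UserInterrupt explicitly, perform its own shutdown and exit, while rethrowing any other async exception:

import Control.Exception
import System.Exit

main :: IO ()
main = longRunningWork `catch` onInterrupt
  where
    -- Handle Ctrl-C ourselves: report, then exit; rethrow anything else.
    onInterrupt :: AsyncException -> IO ()
    onInterrupt UserInterrupt = do
        putStrLn "Interrupted, shutting down"
        exitWith (ExitFailure 1)
    onInterrupt e = throwIO e

    -- Stand-in for the real program loop.
    longRunningWork = mapM_ print ([1..] :: [Integer])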

How might I lose UserInterrupt?

The easiest way to "lose" a UserInterrupt is to catch it and not rethrow it. Taking a real example from ghcid, I sometimes want to check if two paths refer to the same file, and to make that check more robust I call canonicalizePath first. This function raises errors in some circumstances (e.g. the directory containing the file does not exist), but is inconsistent about error conditions between OS's, and doesn't document its exceptions, so the safest thing is to write:

canonicalizePathSafe :: FilePath -> IO FilePath
canonicalizePathSafe x = canonicalizePath x `catch`
    \(_ :: SomeException) -> return x

If there is any exception, just return the original path. Unfortunately, the catch will also catch and discard UserInterrupt. If the user hits Ctrl-C while canonicalizePath is running the program won't abort. The problem is that UserInterrupt is not thrown in response to the code inside the catch, so ignoring UserInterrupt is the wrong thing to do.
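
A minimal, self-contained sketch of the problem (an illustrative program, not ghcid code): hitting Ctrl-C during the threadDelay is swallowed by the handler and the program carries on:

{-# LANGUAGE ScopedTypeVariables #-}
import Control.Concurrent (threadDelay)
import Control.Exception

main :: IO ()
main = do
    r <- (threadDelay 10000000 >> return "finished")
             `catch` \(_ :: SomeException) -> return "caught something"
    putStrLn r
    -- If Ctrl-C arrived during the delay, we print "caught something"
    -- and keep running, rather than aborting.
    putStrLn "still running"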

What is an async exception?

In Haskell there are two distinct ways to throw exceptions, synchronously and asynchronously.

  • Synchronous exceptions are raised on the calling thread, using functions such as throw and error. The point at which a synchronous exception is raised is explicit and can be relied upon.
  • Asynchronous exceptions are raised by a different thread, using throwTo and a different thread id. The exact point at which the exception occurs can vary.
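
For instance, here is a hedged sketch of delivering an exception asynchronously with throwTo; the exception value itself is ordinary, it is the delivery from another thread that makes it asynchronous:

import Control.Concurrent
import Control.Exception

main :: IO ()
main = do
    mainId <- myThreadId
    -- A second thread delivers an exception to the main thread asynchronously.
    _ <- forkIO $ do
        threadDelay 1000000
        throwTo mainId (ErrorCall "interrupted from another thread")
    -- The main thread can be interrupted at almost any point during this work.
    mapM_ print ([1..] :: [Integer])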

How is the type AsyncException related?

In Haskell, there is a type called AsyncException, containing four exceptions - each special in their own way:

  • StackOverflow - the current thread has exceeded its stack limit.
  • HeapOverflow - never actually raised.
  • ThreadKilled - raised by calling killThread on this thread. Used when a programmer wants to kill a thread.
  • UserInterrupt - the one we've been talking about so far, raised on the main thread by the user hitting Ctrl-C.

While these have a type AsyncException, that's only a hint as to their intended purpose. You can throw any exception either synchronously or asynchronously. In our particular case of canonicalizePathSafe, if canonicalizePath causes a StackOverflow, we are probably happy to take the fallback case, but the stack was likely already close to the limit and the overflow will probably recur soon. If the programmer calls killThread that thread should terminate, but in ghcid we know this thread won't be killed.

How can I avoid catching async exceptions?

There are several ways to avoid catching async exceptions. Firstly, since we expect canonicalizePath to complete quickly, we can just mask all async exceptions:

canonicalizePathSafe x = mask_ $
    canonicalizePath x `catch` \(_ :: SomeException) -> return x

We are now guaranteed that catch will not receive an async exception. Unfortunately, if canonicalizePath takes a long time, we might delay Ctrl-C unnecessarily.

Alternatively, we can catch only non-async exceptions:

canonicalizePathSafe x = catchJust
    (\e -> if async e then Nothing else Just e)
    (canonicalizePath x)
    (\_ -> return x)

async :: SomeException -> Bool
async e = isJust (fromException e :: Maybe AsyncException)

We use catchJust to only catch exceptions which aren't of type AsyncException, so UserInterrupt will not be caught. Of course, this actually avoids catching exceptions of type AsyncException, which is only related to async exceptions by a partial convention not enforced by the type system.

Finally, we can catch only the relevant exceptions:

canonicalizePathSafe x = canonicalizePath x `catch`
    \(_ :: IOException) -> return x

Unfortunately, I don't know what the relevant exceptions are - on Windows canonicalizePath never seems to throw an exception. However, IOException seems like a reasonable guess.

How to robustly deal with UserInterrupt?

I've shown how to make canonicalizePathSafe not interfere with UserInterrupt, but now I need to audit every piece of code (including library functions I use) that runs on the main thread to ensure it doesn't catch UserInterrupt. That is fragile. A simpler alternative is to push all computation off the main thread:

import Control.Concurrent.Extra
import Control.Exception.Extra

ctrlC :: IO () -> IO ()
ctrlC act = do
    bar <- newBarrier
    forkFinally act $ signalBarrier bar
    either throwIO return =<< waitBarrier bar

main :: IO ()
main = ctrlC $ ... as before ...

We are using the Barrier type from my previous blog post, which is available from the extra package. We create a Barrier, run the main action on a forked thread, then marshal completion/exceptions back to the main thread. Since the main thread has no catch operations and only a few (audited) functions on it, we can be sure that Ctrl-C will quickly abort the program.

Using version 1.1.1 of the extra package we can simplify the code to ctrlC = join . onceFork.
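
Spelled out with its imports, that might look like the following sketch (onceFork, as I understand it, forks the action and returns an action that waits for its result, rethrowing any exception on the waiting thread):

import Control.Monad (join)
import Control.Concurrent.Extra (onceFork)

ctrlC :: IO a -> IO a
ctrlC = join . onceFork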

What about cleanup?

Now that we've pushed most actions off the main thread, any finally sections are on other threads, and will be skipped if the user hits Ctrl-C. Typically this isn't a problem, as program shutdown automatically cleans up all non-persistent resources. As an example, ghcid spawns a copy of ghci, but on shutdown the pipes are closed and the ghci process exits on its own. If we do want robust cleanup of resources such as temporary files we would need to run the cleanup from the main thread, likely using finally.
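
As a hedged sketch (realMain and cleanupTempFiles are hypothetical stand-ins), keeping the cleanup on the main thread with finally ensures it still runs when Ctrl-C interrupts the wait:

import Control.Exception (finally)

main :: IO ()
main = ctrlC realMain `finally` cleanupTempFiles    -- ctrlC as defined above
  where
    realMain         = putStrLn "doing the real work"
    cleanupTempFiles = putStrLn "removing temporary files"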

Should async exceptions be treated differently?

At the moment, Haskell defines many exceptions, any of which can be thrown either synchronously or asynchronously, but then hints that some are probably async exceptions. That's not a very Haskell-like thing to do. Perhaps there should be a catch which ignores exceptions thrown asynchronously? Perhaps the sync and async exceptions should be of different types? It seems unfortunate that functions have to care about async exceptions as much as they do.

Combining mask and StackOverflow

As a curiosity, I tried to combine a function that stack overflows (using -O0) and mask. Specifically:

main = mask_ $ print $ foldl (+) 0 [1..1000000]

I then ran that with +RTS -K1k. That prints out the value computed by the foldl three times (seemingly just a buffering issue), then fails with a StackOverflow exception. If I remove the mask, it just fails with StackOverflow. It seems that by disabling StackOverflow I'm allowed to increase my stack size arbitrarily. Changing print to appendFile causes the file to be created but not written to, so it seems there are oddities about combining these features.

Disclaimer

I'm certainly not an expert on async exceptions, so corrections welcome. All the above assumes compiling with -threaded, but most applies without -threaded.

Wednesday, May 06, 2015

Announcing js-jquery Haskell Library

Summary: The library js-jquery makes it easy to get at the jQuery Javascript code from Haskell. I've just released a new version.

I've just released the Haskell library js-jquery 1.11.3, following the announcement of jQuery 1.11.3. This package bundles the minified jQuery code into a Haskell package, so it can be depended upon by Cabal packages. The version number matches the upstream jQuery version. It's easy to grab the jQuery code from Haskell using this library, as an example:

import qualified Language.Javascript.JQuery as JQuery

main = do
    putStrLn $ "jQuery version " ++ show JQuery.version ++ " source:"
    putStrLn =<< readFile =<< JQuery.file

There are two goals behind this library:

  • Make it easier for jQuery users to use and upgrade jQuery in Haskell packages. You can upgrade jQuery without huge diffs and use it without messing around with extra-source-files.
  • Make it easier for upstream packagers like Debian. The addition of a jQuery file into a Haskell package means you are mixing licenses and authors, and distributions like Debian also require the source (unminified) version of jQuery to be distributed alongside. By having one package provide jQuery they only have to do that work once, and the package has been designed to meet their needs.

It's pretty easy to convert something that has bundled jQuery to use the library.

The library only depends on the base library so it shouldn't cause any version hassles, although (as per all Cabal packages) you can't mix and match libraries with incompatible js-jquery version constraints in one project.

As a companion, there's also js-flot, which follows the same ideas for the Flot library.

Saturday, April 25, 2015

Cleaning stale files with Shake

Summary: Sometimes source files get deleted, and build products become stale. Using Shake, you can automatically delete them.

Imagine you have a build system that compiles Markdown files into HTML files for your blog. Sometimes you rename a Markdown file, which means the corresponding HTML will change name too. Typically, this will result in a stale HTML file being left, one that was previously produced by the build system, but will never be updated again. You can remove that file by cleaning all outputs and running the build again, but with the Shake build system you can do better. You can ask for a list of all live files, and delete the build products not on that list.

A basic Markdown to HTML converter

Let's start with a simple website generator. For each Markdown file, with the extension .md, we generate an HTML file. We can write that as:

import Development.Shake
import Development.Shake.FilePath

main :: IO ()
main = shakeArgs shakeOptions $ do
    action $ do
        mds <- getDirectoryFiles "." ["//*.md"]
        need ["output" </> x -<.> "html" | x <- mds]

    "output//*.html" %> \out -> do
        let src = dropDirectory1 out -<.> "md"
        need [src]
        cmd "pandoc -s -o" [out, src]

    phony "clean" $ do
        removeFilesAfter "output" ["//*.html"]

Nothing too interesting here. There are three parts:

  • Search for all .md files, and for each file foo/bar.md require output/foo/bar.html.
  • To generate an .html file, depend on the source file then run pandoc.
  • To clean everything, delete all .html files in output.

Using a new feature in Shake 0.15, we can save this script as Shakefile.hs and then:

  • shake will build all the HTML files.
  • shake -j0 will build all the files, using one thread for each processor on our system.
  • shake output/foo.html will build just that one HTML file.
  • shake clean will delete all the HTML files.

Removing stale files

Now let's imagine we've added a blog post using-pipes.md. Before publishing we decide to rename our post to using-conduit.md. If we've already run shake then there will be a stale file output/using-pipes.html. Since there is no source .md file, Shake will not attempt to rebuild the file, and it won't be automatically deleted. We can do shake clean to get rid of it, but that will also wipe all the other HTML files.

We can run shake --live=live.txt to produce a file live.txt listing all the live files - those that Shake knows about, and has built. If we run that after deleting using-pipes.md it will tell us that using-conduit.md and output/using-conduit.html are both "live". If we delete all files in output that are not mentioned as being live, that will clean away all our stale files.

Using Shake 0.15.1 (released in the last hour) you can write:

import Development.Shake
import Development.Shake.FilePath
import Development.Shake.Util
import System.Directory.Extra
import Data.List
import System.IO

pruner :: [FilePath] -> IO ()
pruner live = do
    present <- listFilesRecursive "output"
    mapM_ removeFile $ map toStandard present \\ map toStandard live

main :: IO ()
main = shakeArgsPrune shakeOptions pruner $ do
     ... as before ...

Now when running shake --prune it will build all files, then delete all stale files, such as output/using-pipes.html. We are using the shakeArgsPrune function (just sugar over --live) which lets us pass a pruner function. This function gets called after the build completes with a list of all the live files. We use listFilesRecursive from the extra package to get a list of all files in output, then take the list difference (\\) to delete all the files which are present but not live. To deal with the / vs \ path separator issue on Windows, we apply toStandard to all files to ensure they match.

A few words of warning:

  • If you run shake output/foo.html --prune then it will only pass output/foo.html and foo.md as live files, since those are the only files that are live when you ask for just a subset to be built. Generally, you want to enable all sensible targets (typically no file arguments) when passing --prune.
  • Sometimes a rule will generate something you care about, plus a few files you don't bother tracking. As an example, building a GHC DLL on Windows generates a .dll and a .dll.a file. While the .dll.a file may not be known to Shake, you probably don't want it pruned. The pruning function may need a few special cases, like not deleting the .dll.a file if the .dll is live (see the sketch below).
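
As a hedged sketch of such a special case (purely illustrative, reusing the imports from the example above): keep any .dll.a file whose corresponding .dll is live.

prunerKeepDllA :: [FilePath] -> IO ()
prunerKeepDllA live = do
    present <- listFilesRecursive "output"
    let liveStd = map toStandard live
        -- Keep a file if it is live, or if it is foo.dll.a and foo.dll is live.
        keep f = f `elem` liveStd
              || (takeExtension f == ".a" && dropExtension f `elem` liveStd)
    mapM_ removeFile [f | f <- map toStandard present, not (keep f)]

It would be passed to shakeArgsPrune exactly as before.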

Thursday, March 26, 2015

New website with new talks

My website is now located at ndmitchell.com and has a few new talks on it:

My old website at community.haskell.org is going to stop being updated, and I'll be putting in redirections shortly. That server is going to stop hosting websites, so I bought myself a domain name and setup a GitHub pages website. The repo is here, including all the data, metadata, templates and scripts.

Sunday, March 22, 2015

Finding a GHC bug

Summary: I found a nasty bug in GHC 7.10 RC3. It's been fixed.

For Shake, I have an extensive test suite (2500+ lines of tests). I also test on 7 GHC versions, including GHC HEAD. After adding GHC 7.10 Release Candidate 3 to the mix one of the tests started failing. A week later the bug had been simplified, diagnosed and fixed as bug #10176. Below is a tale of how that happened, including a technical explanation of the bug in Step 8.

Step 1: Write a lot of tests

The Shake test that caught this particular bug checks that if the user makes a mistake then the error message must contain certain substrings correctly identifying the problem. With GHC 7.10 RC3 on Travis this test stopped throwing an exception entirely, continuing as though nothing were wrong. Weird.

Step 2: Reproduce locally

I tried to reproduce the failure locally, which ended up spotting a fatal bug in the GHC 7.10 RC3 32bit Windows version. After opting for the 64bit version, at first I couldn't reproduce the error. Eventually I realised that you needed to turn on optimisation (at -O1), and that running through ghci (how I usually develop Haskell) didn't cause the problem. Noticing that -O1 was required gave me a clue that it was related to an optimisation. The typical causes of programs that work without optimisation but fail with it are raising exceptions in pure code (since the exception can change due to optimisations) or calling unsafePerformIO (it has unsafe in the name for a reason). I certainly do both those things in Shake, but I wasn't aware of anywhere I did them in a dubious manner.

Step 3: Reduce the test case

I spent a lot of time trying to reduce the test case. By inserting print statements I narrowed the place the difference was happening to Development.Shake.Core.applyKeyValue, which is a pretty core bit of Shake. However, while I was able to chop out a lot of auxiliary features (lint tracking, command tracing) the actual code remained difficult to reduce to any great extent, for two reasons. Firstly, the bug was incredibly fragile - moving a monomorphic NOINLINE function from one module to another made the bug disappear. Secondly, the applyKeyValue function is right in the middle of Shake, and the test required a few successful Shake runs to set up things for the failing test, so I couldn't change its observable semantics too much.

What I did conclude was that Shake didn't seem to be doing anything dodgy in the small patch of code that seemed relevant, giving me the first hint that maybe GHC was at fault, not Shake.

Step 4: Differences at the Core level

At this point, I reached out to the GHC mailing list, asking if anyone had any ideas of a culprit. They didn't, but Simon Peyton Jones suggested finding the smallest breaking change and comparing the generated Core. You can do that by compiling with -ddump-simpl, and adding -dsuppress-all -dsuppress-uniques to get something a bit easier to diff. Fortunately, by this point I had a very small change to make the error appear/disappear (moving a function from one module to another), so the difference in Core was tiny. The change in the problematic version read:

case (\_ -> error "here") of {}

In GHC Core a case always evaluates its scrutinee until it has the outermost value available (aka WHNF). The empty alternatives mean that GHC has proven that the evaluation always results in an exception. However, a lambda already has a value available (namely the lambda) so evaluation never throws an exception. As a result, GHC has violated the rules of Core and bad things happen.

Step 5: Reducing further

In order to reduce the bug further I now had a better test, namely:

ghc Core.hs -O -ddump-simpl | grep -F "case (\\"

With this test I didn't have to keep the internals of Shake working, and in fact didn't even have to provide a runnable entry point - all I had to do was look for the dodgy construction in the Core language. Note that I'm not actually looking for case of a lambda with empty alternatives, reasoning (seemingly correctly) that any case on a lambda with non-empty alternatives would be eliminated by the GHC simplifier, so any case followed by lambda is buggy.

I reduced by having a ghcid window open in one corner, using the warnings -fwarn-unused-binds and -fwarn-unused-imports. I hacked out some part of the program and then patched everything up so it no longer raised an error, using ghcid for rapid feedback. I then ran the grep test. If the bug had gone I put the program back to how it was and tried somewhere else. If the bug remained I then cleaned up the now redundant declarations and imports and checked again, repeating until the code was minimal.

Several hours later I was left with something like:

buggy :: (() -> Bool) -> () -> Bool -> IO ()
buggy fun unit bool =
    runReaderT (
        (if bool then liftIO $ print () else p) >>
        (if fun unit then error2Args unit unit >> p else p)) ()

{-# NOINLINE error2Args #-}
error2Args :: () -> () -> a
error2Args _ _ = error "here"

Note that error2Args must be in a different module to buggy.

Step 6: Bisecting

At this point hvr stepped in and bisected all the changes between GHC 7.10 RC2 and RC3, determining that a large Typeable change introduced the bug in the original shake test case. However, using the minimal program, the bug was also present in GHC 7.10 RC2. That suggested the bug might have been around for a while.

Step 7: Augmenting GHC's Lint Checker

GHC already has a pass in the compiler, enabled with -dcore-lint, which checks for dodgy constructs in the Core language. Enabling it didn't pick up this example (hence I used grep instead), so Joachim Breitner added such a check. He also added the example as a test case, so that if it ever breaks in future it will be spotted immediately.

Step 8: Diagnose and Fix

Joachim then continued to diagnose and fix the issue, the details of which can be found in the patch. The problem (as I understand it) is that GHC looks at the code:

fun x = error "foo" x

And concludes two facts.

  1. If fun is called with one argument then the code will raise an error. That's true, and allows the compiler to replace fun () () with fun ().
  2. After analysing all calls of fun it spots that fun is always called with two arguments, so it is free to change fun to be fun x y = error "foo" x y.

By applying these two facts, we can make the transformation:

case fun () () of {}
-- apply the first rule
case fun () of {}
-- inline fun after applying the second rule
case (\x y -> error "foo" x y) () of {}
-- beta reduce:
case (\y -> error "foo" () y) of {}

Now we have caused invalid Core to be produced. While the two facts are each individually correct, applying the first fact causes the second fact to stop being true. Joachim fixed this by making the call argument count analysis stop at the first argument that guarantees an error.

Step 9: The Impact

The manifestation of the bug is quite interesting. Essentially GHC decides something is an error, but then fails to actually throw the error. As a result, any code the simplifier places after the error call will be eliminated, and that can remove a large chunk of the program. However, any code the simplifier doesn't manage to hoist next to the error call will still get run, even though it should have been skipped due to an error. In essence, given exactly the wrong conditions to trigger the bug, you can write:

main = do
    putStrLn "here1"
    ... error "foo" ...
    putStrLn "here2"
    ...
    putStrLn "here3"

And end up with the program printing here1 followed by here3, without throwing an exception. In the case of my original Shake test it started to compile, should have stopped with an error but instead just skipped compiling altogether and went on to do the bits after compiling. A very weird manifestation.

Disclaimer: I've eliminated many missteps of mine, which included pushing random patches to try and reduce on the Travis machine and installing a Linux VM.

Monday, March 09, 2015

Implementing a Functor instance

Summary: Implementing a Functor instance is much easier than implementing a Monad instance, and can turn out to be quite useful.

Haskell forces all programmers to understand some details of the Monad typeclass to do basic IO, but currently nothing forces people to learn the Functor typeclass. However, Functor is much simpler than Monad, and all Monads must be Functors, so thinking more about Functor can be a nice route to understanding Monad better.

An intuitive description of a functor is:

A container whose contents can be replaced, without changing the shape of the container.

Some example functors include lists and Maybe. Both contain values, and you can replace the values inside them. In fact, most types with a single type parameter can be made functors. For example, in CmdArgs I define something similar to:

data Group a = Group {groupUnnamed :: [a], groupNamed :: [(String, [a])]}

This Group structure contains a values inside it. Sometimes it is useful to transform all the underlying a values, perhaps to a different type. The Functor instance has a single member:

fmap :: Functor f => (a -> b) -> f a -> f b

For the above type, we instantiate f to Group so we get:

fmap :: (a -> b) -> Group a -> Group b

We can implement fmap by applying f to every a value inside Group:

instance Functor Group where
    fmap f (Group a b) = Group (map f a) [(x, map f y) | (x,y) <- b]

Note in particular that Group is usually written Group a, but in the instance declaration we're omitting the a, to say that Group itself (without any arguments) is a functor. Applying fewer type arguments like that makes Functor a higher-kinded type class, in contrast to classes like Eq or Ord, whose instances would have been on Group a.
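
To make the kind difference concrete, here is a small sketch (the Eq instance is illustrative, not from CmdArgs): Eq constrains the fully applied type Group a, while Functor is an instance for the unapplied constructor Group.

-- Group Int :: *        -- a plain type; Eq and Ord instances live at this kind
-- Group     :: * -> *   -- a type constructor; the Functor instance lives here
instance Eq a => Eq (Group a) where
    Group a1 b1 == Group a2 b2 = a1 == a2 && b1 == b2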

When implementing fmap the type checker eliminates most bad implementations, so the only law you need to think about is that fmap id = id - given the identity function, the value shouldn't change. We can show this law for Group with:

Group a b = fmap id (Group a b)
-- inline fmap
Group a b = Group (map id a) [(x, map id y) | (x,y) <- b]
-- map id x ==> x
Group a b = Group a [(x, y) | (x,y) <- b]
-- simplify list comprehension
Group a b = Group a b
-- equal

In fact, the function map is just fmap specialised to [], so the rule map id x ==> x is just applying the fmap id = id law on lists. From this law, we can derive the additional law that:

fmap (f . g)  ==  fmap f . fmap g

Both these laws can serve as the basis for optimisation opportunities, reducing the number of times we traverse a value, and GHC exploits these laws for the list type.
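
As a small illustration using the Group type from above, the composition law says these two functions are equal, but the second traverses the structure only once:

twoPasses, onePass :: Group Int -> Group Int
twoPasses = fmap (+1) . fmap (*2)
onePass   = fmap ((+1) . (*2))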

In general, most data types that take a type parameter can be made functors, but there are a few common exceptions:

  • You have a value on the left of an arrow – for example data Foo a = Foo (a -> Int) cannot be made a functor, since we have no way to change the incoming b back to an a.
  • You have an invariant relating the structure and the elements. For example data OrdList a = Nil | Gt a (OrdList a), where all functions on OrdList have an Ord context, and OrdList is exported abstractly. Here the functor would break the abstraction.
  • You require an instance for the element type, e.g. Data.Vector.Storable requires a Storable instance to create a vector, which Functor does not allow.

The name functor may sound scary, or confusing to C++ programmers (who use functor to mean something closer to a function object) – but they are a nice simple abstraction.

Wednesday, February 25, 2015

Making withSocketsDo unnecessary

Summary: Currently you have to call withSocketsDo before using the Haskell network library. In the next version you won't have to.

The Haskell network library has always had a weird and unpleasant invariant. Under Windows, you must call withSocketsDo before calling any other functions. If you forget, the error message isn't particularly illuminating (e.g. getAddrInfo, does not exist, error 10093). Calling withSocketsDo isn't harmful under Linux, but equally isn't necessary, and thus easy to accidentally omit. The network library has recently merged some patches so that in future versions there is no requirement to call withSocketsDo, even on Windows.

Existing versions of network

The reason for requiring withSocketsDo is so that the network library can initialise the Windows Winsock library. The code for withSocketsDo was approximately:

withSocketsDo :: IO a -> IO a
#if WINDOWS
withSocketsDo act = do
    initWinsock
    act `finally` termWinsock
#else
withSocketsDo act = act
#endif

Where initWinsock and termWinsock were C functions. Both checked a mutable variable so they only initialised/terminated once. The initWinsock function immediately initialised the Winsock library. The termWinsock function did not terminate the library, but merely installed an atexit handler, providing a function that ran when the program shut down which terminated the Winsock library.

As a result, in all existing versions of the network library, it is fine to nest calls to withSocketsDo, call withSocketsDo multiple times, and to perform networking operations after withSocketsDo has returned.

Future versions of network

My approach to removing the requirement to call withSocketsDo was to make it very cheap, then sprinkle it everywhere it might be needed. Making such a function cheap on non-Windows just required an INLINE pragma (although it's very likely GHC would have always inlined the function anyway).

For Windows, I changed to:

withSocketsDo act = do
    evaluate withSocketsInit
    act

{-# NOINLINE withSocketsInit #-}
withSocketsInit = unsafePerformIO $ do
    initWinsock
    termWinsock

Now withSocketsDo is very cheap, with subsequent calls requiring no FFI calls, and thanks to pointer tagging, just a few cheap instructions. When placing additional withSocketsDo calls my strategy was to make sure I called it before constructing a Socket (which many functions take as an argument), and when taking one of the central locks required for the network library. In addition, I identified a few places not otherwise covered.

In newer versions of the network library it is probably never necessary to call withSocketsDo - if you find a place where one is necessary, let me know. However, for compatibility with older versions on Windows, it is good practice to always call withSocketsDo. Libraries making use of the network library should probably call withSocketsDo on their users' behalf.
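
A minimal sketch of that good practice (the body is a placeholder):

import Network.Socket (withSocketsDo)

main :: IO ()
main = withSocketsDo $ do
    -- networking code goes here; on newer versions of network the wrapper is
    -- redundant but harmless, on older Windows versions it is required
    putStrLn "talk to the network here"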