Wednesday, April 07, 2010

File Recovery with Haskell

Haskell has long been my favoured scripting language, and in this post I thought I'd share one of my more IO-heavy scripts. I have an external hard drive that, due to regular dropping, is somewhat unreliable. On this drive is a 1Gb file which is partly corrupted, and I'd like to copy as much of it as I can.

In the past I've used JFileRecovery, which I thoroughly recommend. The basic algorithm is that it copies the file in chunks, and any chunk whose copy exceeds a timeout is discarded. It has a nice graphical interface, and some basic control over the timeout and block size. Unfortunately, JFileRecovery didn't work for this file - it has three basic problems:


  1. The timeout sometimes fails to stop the IO, causing the program to hang.

  2. If the copy takes too long, it sometimes gives up before the end of the file.

  3. If the block size is too small it takes forever; if it is too large it drops large parts of the file.



To recover my file I needed something better, so I wrote a quick script in Haskell. The basic algorithm is to copy the file in 10Mb chunks. If any chunk fails to copy, I split it into smaller pieces and retry those after all other pending chunks. The result is that a full-length copy exists after the first pass, and the program then goes back and recovers more data where it can. I can terminate the program at any point and still have a usable file, but waiting longer will probably recover more of it.
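
For a rough picture of the chunking, here is how the split function from the script below behaves on some made-up numbers - a 25Mb file, where the chunk at offset 10000000 fails and gets retried in fifths:

split 10000000 (0, 25000000)
  == [(0,10000000), (10000000,10000000), (20000000,5000000)]
-- retrying the failed chunk in 2Mb pieces:
split (10000000 `div` 5) (10000000, 10000000)
  == [(10000000,2000000), (12000000,2000000), (14000000,2000000), (16000000,2000000), (18000000,2000000)]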

I have included the script at the bottom of this post. I ran this script from GHCi, but am not going to turn it into a proper program. If someone does wish to build on this script, please do so (I hereby place this code in the public domain, or if that is not possible then under the ∀ n . BSD n licenses).

The script took about 15 minutes to write, and makes use of exceptions and file handles - not the kind of program traditionally associated with Haskell. A lot of hard work has gone into polishing the GHC runtime and the Haskell libraries (bytestring, exceptions). Now that this work has been done, slotting together reasonably complex scripts is simple.


{-# LANGUAGE ScopedTypeVariables #-}

import Data.ByteString(hGet, hPut)
import System.IO
import System.Environment
import Control.Monad
import Control.Exception


src = "file on dodgy drive (source)"
dest = "file on safe drive (destination)"

main :: IO ()
main =
    withBinaryFile src ReadMode $ \hSrc ->
    withBinaryFile dest WriteMode $ \hDest -> do
        nSrc <- hFileSize hSrc
        nDest <- hFileSize hDest
        when (nSrc /= nDest) $ hSetFileSize hDest nSrc
        copy hSrc hDest $ split start (0,nSrc)


copy :: Handle -> Handle -> [(Integer,Integer)] -> IO ()
copy hSrc hDest [] = return ()
copy hSrc hDest chunks = do
    putStrLn $ "Copying " ++ show (length chunks) ++ " of at most " ++ show (snd $ head chunks)
    chunks <- forM chunks $ \(from,len) -> do
        res <- Control.Exception.try $ do
            hSeek hSrc AbsoluteSeek from
            hSeek hDest AbsoluteSeek from
            bs <- hGet hSrc $ fromIntegral len
            hPut hDest bs
        case res of
            -- a failed chunk prints '#' and is queued again in fifths for a later pass
            Left (a :: IOException) -> do putChar '#' ; return $ split (len `div` 5) (from,len)
            -- a successful chunk prints '.' and needs no further work
            Right _ -> do putChar '.' ; return []
    putChar '\n'
    copy hSrc hDest $ concat chunks

start = 10000000 -- initial chunk size (10Mb)
stop = 1000      -- chunks smaller than this are abandoned rather than retried

split :: Integer -> (Integer,Integer) -> [(Integer,Integer)]
split i (a,b) | i < stop = []
              | i >= b = [(a,b)]
              | otherwise = (a,i) : split i (a+i, b-i)
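
To run it, I just edit src and dest, load the script in GHCi, and call main - something like this (the file name is only an example):

$ ghci Recover.hs
*Main> main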


There are many limitations in this code, but it was sufficient to recover my file quickly and accurately.

6 comments:

Chris Dornan said...

Point well made and very neat. These kinds of notes are really important. Like you I have long favoured scripting in Haskell and, as you point out, the platform is so well built now that it is a reality.

Folks wonder where the functional shell is: really it's right there with GHCi.

Thanks!

Alex Gerdes said...

Nice example of Haskell scripting! I use ddrescue for this kind of functionality.

Neil Mitchell said...

Alex: ddrescue looks promising, and there is a version in Cygwin - I'll give it a go with my next corrupted file. Thanks!

meteficha said...

I was going to recommend ddrescue as well. It is well tested and allows you to stop and continue the copy again without recopying what was rescued already.

Thanks for sharing your thoughts with us!

Ganesh Sittampalam said...

Just used this on a corrupt DVD, and also tried ddrescue. I thought this script did a better job because it tries a wider range of sizes, so it runs faster on the uncorrupted bits.

A few notes:
- it should use powers of 2 for the sizes to better align with logical sectors
- you can't get the size of block devices using hFileSize, which is typically what you want to copy from on Linux. I got the size by hand using the blockdev command from this StackOverflow URL: http://stackoverflow.com/questions/1027037/determine-the-size-of-a-block-device and then hardcoded it

Neil Mitchell said...

Ganesh: I'm rather shocked it performed better than ddrescue under any circumstances.

I'm hoping to not have enough corrupt files to make it worth working on this any further, but anyone is welcome to take the code and build on it.