Stream data from archives using the streamly library.
Please see the README on GitHub at https://github.com/shlok/streamly-archive#readme
streamly-archive
Stream data from archives (tar, tar.gz, zip, or any other format supported by libarchive) using the Haskell streamly library.
Requirements
Install libarchive on your system.
- Debian Linux:
sudo apt-get install libarchive-dev
. - macOS:
brew install libarchive
.
Quick start
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
module Main where
import Crypto.Hash (hashFinalize, hashInit, hashUpdate)
import Crypto.Hash.Algorithms (SHA256)
import Data.ByteString (ByteString)
import Data.Function ((&))
import Data.Maybe (fromJust, fromMaybe)
import Data.Void (Void)
import qualified Streamly.Data.Fold as F
import qualified Streamly.Data.Stream.Prelude as S
import Streamly.External.Archive
( Header,
groupByHeader,
headerPathName,
readArchive,
)
import Streamly.Internal.Data.Fold.Type (Fold (Fold), Step (Partial))
import Streamly.Internal.Data.Unfold.Type (Unfold)
main :: IO ()
main = do
-- Obtain an unfold for the archive.
-- For each entry in the archive, we will get a Header followed
-- by zero or more ByteStrings containing chunks of file data.
let unf :: Unfold IO Void (Either Header ByteString) =
readArchive "/path/to/archive.tar.gz"
-- Create a fold for converting each entry (which, as we saw
-- above, is a Left followed by zero or more Rights) into a
-- path and corresponding SHA-256 hash (Nothing for no data).
let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String) =
Fold
( \(mpath, mctx) e ->
case e of
Left h -> do
mpath' <- headerPathName h
return $ Partial (mpath', mctx)
Right bs ->
return $
Partial
( mpath,
Just . (`hashUpdate` bs) $
fromMaybe (hashInit @SHA256) mctx
)
)
(return $ Partial (Nothing, Nothing))
( \(mpath, mctx) ->
return
( show $ fromJust mpath,
show . hashFinalize <$> mctx
)
)
-- Execute the stream, grouping at the headers (the Lefts) using the
-- above fold, and output the paths and SHA-256 hashes along the way.
S.unfold unf undefined
& groupByHeader entryFold
& S.mapM print
& S.fold F.drain
Benchmarks
See ./bench/README.md
. We find on our machine† that (1) reading an archive using this library is just as fast as using plain Haskell IO
code; and that (2) both are somewhere between 1.7x (large files) and 2.5x (many 1-byte files) slower than C.
The former fulfills the promise of streamly and stream fusion. The differences to C are presumably explained by the marshalling of data into the Haskell world and are currently small enough for our purposes.
† April 2023; Linode; Debian 11, Dedicated 32GB: 16 CPU, 640GB Storage, 32GB RAM.