Description
Stream data from archives using the streamly library.
Description
Please see the README on GitHub at https://github.com/shlok/streamly-archive#readme
README.md
streamly-archive
Stream data from archives (tar, tar.gz, zip, or any other format supported by libarchive) using the Haskell streamly library.
Requirements
Install libarchive on your system.
- Debian Linux:
sudo apt-get install libarchive-dev
. - macOS:
brew install libarchive
.
Quick start
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}
module Main where
import Crypto.Hash
import Data.ByteString (ByteString)
import Data.Function
import Data.Functor
import Data.Maybe
import Streamly.Data.Fold (Fold)
import qualified Streamly.Data.Fold as F
import qualified Streamly.Data.Stream.Prelude as S
import Streamly.External.Archive
main :: IO ()
main = do
-- A fold for converting each archive entry (which is a Header followed by
-- zero or more ByteStrings) into a path and corresponding SHA-256 hash
-- (Nothing for no data).
let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String) =
F.foldlM'
( \(mpath, mctx) e ->
case e of
Left h -> do
mpath' <- headerPathName h
return (mpath', mctx)
Right bs ->
return
( mpath,
Just . (`hashUpdate` bs) $
fromMaybe (hashInit @SHA256) mctx
)
)
(return (Nothing, Nothing))
<&> ( \(mpath, mctx) ->
( show $ fromMaybe (error "path expected") mpath,
show . hashFinalize <$> mctx
)
)
-- Execute the stream, grouping at the headers (the Lefts) using the above
-- fold, and output the paths and SHA-256 hashes along the way.
S.unfold readArchive (id, "/path/to/archive.tar.gz")
& groupByLeft entryFold
& S.mapM print
& S.fold F.drain
Benchmarks
See ./bench/README.md
. Summary (with rough figures from our machine†):
- For 1-byte files, this library has roughly a 70 ns/byte overhead compared to plain Haskell
IO
code, which has roughly a 895 ns/byte overhead compared to plain C. - For larger (> 10 KB) files, this library performs just as good as plain Haskell
IO
code, which has roughly a 0.15 ns/byte overhead compared to plain C.
† July 2024; NixOS 22.11; Intel i7-12700K (3.6 GHz, 12 cores); Corsair VENGEANCE LPX DDR4 RAM 64GB (2 x 32GB) 3200MHz; Samsung 970 EVO Plus SSD 2TB (M.2 NVMe).