Computations that automatically track data dependencies.
see README.md
Motivation
Haskell is a great language for data processing. You load some data in the IO monad, parse it, funnel the data through various functions and write the result back to disk or display it via a web server.
The programmer has the let
and where
patterns at hand which can be used to sub-structure a single function, e.g.
workflow x y = let
a = f x
b = g a y
in h a b
To the environment program, however, the values of the intermediate steps a
and b
are invisible and the reader does not know you used the auxiliary functions f
, g
and h
, although they might be important when an outsider tries to check the correctness of the result of the workflow
function. This is where the Provenience monad comes in.
How it works
The Provenience monad is an ordinary state monad transformer. The state is a data flow graph, which we call the variable store. Nodes are Pandoc renderings of so-called variables. A variable is simply a pair of an ordinary Haskell value together with its node in the graph. A computation in the Provenience monad performs any number of the following five actions.
- Register a new variable in the variable store
- Provide a description of a registered variable (in form of a Pandoc Block)
- Provide a short name for a registered variable (used in hyperlinks)
- Render the value of a registered variable into its node in the variable store (as a Pandoc
Block
). There is a class for default rendering methods akin to theShow
class. - Apply a variable holding a function to a variable holding a value, similar to the
<*>
operator ofApplicative
functors. In the Provenience monad, we write<%>
instead.
The fifth action is the only action that adds edges to the data dependency graph. Suppose we have registered a variable f
holding a value of type a -> b
and a variable x
holding a value of type a
. The description of f
should explain to the reader what the function that is the value of f
does. The monadic action
y <- pure f <%> x
does not register y
as a new variable; instead y
points to the same node in the variable store as f
. However, the value of y
is the application of the value of f
to the value of x
and there is now an edge from x
to y
in the data flow graph labelled with the description of f
. If y
is not itself a function but the desired result, you should overwrite the node's description (which is still the description of f
) with a new description of the value of y
.
Why this design choice? Because otherwise partial application is impossible. If <%> always registered new variables, then
f <%> a <%> b
would register both f(a)
and f(a)(b)
as variables, which might not be what the user intended. But overwriting f
also means that we can not re-use the same function variable in several applications. When that is desired, use a Provenience action producing a variable instead of the variable itself. Consider the following.
let f = var succ
x <- input 4
y <- f <%> x
z <- f <%> y
Since the Haskell identifier f
is bound to a Provenience action that registers a new variable holding the succ
function, all three of x
, y
and z
are distinct variables. The take-home message is that
f <- var succ
x <- input 4
y <- pure f <%> x
is a dangerous style because the value of f
is not what the corresponding node in the graph is being used for anymore.
alternative Representation
The variable store also permits to save an alternative representation of each variable in addition to the Pandoc rendering, since you might want to provide a machine-readable data flow graph in addition to a Pandoc document. Similarly to the IHaskellDisplay
class, each type used in a variable must have a type class instance that allows automatic conversion into the alternative representation. If you don't need this feature, simply choose () as the alternative representation type. The graph of alternative representations can be extracted from the variable store. We provide code to assemble the store into a spreadsheet (of static cells). Foldable structures of basic values become columns while doubly-nested structures become tables.
Example
Continuing the example above, in the Provenience monad you would write something like the following. Of course it is up to the programmer to decide how fine-grained the decomposition into Provenience actions should be.
workflow x' y' = do
---------- register and render the input variables ------------------
x <- input x' -- register and render x'
y <- input y'
x `named` "x" -- links to x show "x" as text
y `named` "y"
x <? renderDefault "first item of input data" -- describe x
y <? renderDefault "second item of input data"
linkx <- linkto x -- create a hyperlink, used below
let what_f_does = Para [Str "auxiliary function f applied to ",linkx]
---------------------------------------------------------------------
------ the actual computation is three lines as in the pure code ----
a <- func f what_f_does <%> x
b <- func g (renderDefault "auxiliary function g") <%> a <%> y
c <- func h (renderDefault "auxiliary function h") <%> a <%> b
------ only book-keeping below --------------------------------------
---------------------------------------------------------------------
a `named` "a" >> b `named` "b" >> c `named` "result"
a <? renderDefault "first intermediate result"
b <? renderDefault "second intermediate result"
c <? renderDefault "the workflow result"
render a >> render b >> render c
return c
Above, the action func
registers a new variable and immediately supplies a description, which is then used as edge label by the <%>
operator on the same line.
You see that instead of one line of pure Haskell you are burdened with writing four kinds of Provenience actions: register, describe, alias and render. But of the four actions, three are only concerned with providing descriptions that the pure code did not contain.
Remarks
This package was inspired by the Javelin Software. Thanks to John R Levine, one of the authors of Javelin, for explaining the concepts underlying Javelin.
By using Pandoc the user has a number of output format choices. With a little CSS, the above example may be rendered like follows. Unfortunately, hackage does not allow raw html in markdown, so you have to convert the markdown yourself.
(For the sake of example, we used f = abs
, g = replicate
and h = fmap concat . replicate
).
Alternatives and shortcomings
Spreadsheets excel at displaying intermediate values, although comprehending the meaning and intention of spreadsheet formulas requires experience. Needless to say, Haskell is way more expressive than the formula language of contemporary spreadsheets.
Limiting yourself to one of Pandoc's output formats, you might write a haskintex document instead of a Provenience expression. Haskintex may be regarded as a form of literate programming, and you can use any of the numerous string interpolation packages or a library for type-safe construction of the target document format (e.g. HaTeX, blaze or Pandoc) to achieve the same outcome. However, you will not get the complete data flow graph for free.
Some programming languages provide interactive notebooks, e.g. the IPython interactive notebooks or the IHaskell variant. Hyper-Haskell is another approach to Haskell workbooks and shares some ideas with this package. Typed Spreadsheets sort of do what this package does in an interactive way, although it does not show intermediate results.
Working with large quantities of input data might become unwieldy, however. Rendering a notebook or interactive spreadsheet, though, requires the underlying language to be installed on the system. In contrast the philosophy of Provenience is that computation and display take place on different machines, and that a Provenience computation is only a small part of an actual application.
Perhaps the biggest shortcoming at present is the behaviour of mapM
.
When you mapM
a Provenience function over a traversable structure like a list, each element registers the same number of variables, whereas your intention probably was to register the entire traversable structure as input and obtain a single output variable. At present you have to do this transformation yourself.
TODOs
Add support for an actual graph layout of the variable store by means of a library such as graphViz.
Perhaps it is wise to make usage of variables which meanwhile have been overwritten illegal to use.
How could one retain the functions in the data flow graph, making the data flow interactive? We'd have to map a subset of Haskell onto some serializable type, e.g. spreadsheet formulas or JavaScript.