Generic encoding of records.
Generic encoding of records. It currently provides a single, polymorphic function to encode sum types (i.e. categorical variables) as one-hot vectors.
record-encode
Encoding categorical variables
This library provides generic machinery to encode values of some algebraic type as points in a vector space.
Values of a sum type (e.g. enumerations) are also called "categorical" variables in statistics, because they encode a choice between a number of discrete categories.
On the other hand, many data science / machine learning algorithms rely on a purely numerical representation of data; the conversion code from values of a static type is often "boilerplate", i.e. largely repeated and not informative.
The encodeOneHot
function provided here is a generic utility function (i.e. defined once and for all) to compute the one-hot representation of any sum type.
Usage example
{-# language DeriveGeneric -#}
import qualified GHC.Generics as G
import qualified Generics.SOP as SOP
import Data.Record.Encode
data X = A | B | C deriving (G.Generic)
instance SOP.Generic X
> encodeOneHot B
OH {oDim = 3, oIx = 1}
Please refer to the documentation of Data.Record.Encode for more examples and details.
Acknowledgements
Gagandeep Bhatia (@gagandeepb) for his Google Summer of Code 2018 work on Frames-beam
, Mark Karpov (@mrkkrp) for his Template Haskell tutorial, Anthony Cowley (@acowley) for Frames
, @mniip on Freenode #haskell for helping me better understand what can be done with generic programming.