Auto-generation of records data type.
cassava-records library helps in auto-creating record data types using Template Haskell by inferring types from the columns of a csv or compatible input file. The record and type classes instances generated can be seamlessly used with cassava(the haskell csv reader library) to load the data into these record types without dealing with any other level of abstraction. Please see README on Github at https://github.com/gdevanla/cassava-records#readme
cassava-records
A library extension for cassava
(Haskell CSV parser library) that automatically creates a Record data type given an columnar input file. eg a CSV file.
What is this tool for?
Say you are working on a project that involves processing a number of comma separated or tab serparated files. Assuming, you are using cassava for loading the input files, here is a typical workflow you would follow
a. Inspect the file that contains Salaries
of Employees
.
b. Create a Record
data type called Salaries
to reflect the columns and types found in the file
c. Create required instances of the Salaries
data type that may be required to load the files with cassava
.
Now, imagine this file you are inspecting to contains tens or hundreds of columns. Now, as a good Haskeller you will want to automate steps (a) and (b) to the extend possible. That is precisely, what this library does.
cassava-records
performs the following tasks. Given, a input file (command or tab-seperated for example), it reads the whole file, infers some basic data types for each column and automatically creates a Record
of inferred data types using Template Haskell
.
Quick Start
Example 1 :
Using data/salaries_simple.csv
emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,3
4,Jill Doe Sr, 10101.10,f,4.2
5,John Doe Jr,1010101.0,true,5.1
6,Jill Doe Jr, 10101.10,false,6
{-#LANGUAGE TemplateHaskell#-}
{-#LANGUAGE DeriveGeneric #-}
import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)
The makeCsvRecord
function take 4 arguments,
"Salaries"
: AString
that will be used as name for theRecord
.data/salaries_simple.csv
: path to the input file"_"
: string to prefix each field. Useful, if we need to build lens for this recordcommaOptions
:defaultDecodeOptions
defined incassava
library
If you load this code in GHCi
, we will see
1 >:info Salaries
data Salaries
= Salaries {
_emp_no:: Integer,
_name :: Text,
_salary :: Double,
_status :: Bool,
_years:: Double}
Note that all column names are in lower case and "_" has been prefixed to the column names.
To be consistent, cassava-record
converts all column headers to lower-case before created corresponding field names for each column header. Therefore, if column headers were all upper-case, we need to provide a field modifier while creating ToNamedRecord
and FromNamedRecord
instances for cassava
.
Note, that if the column headers are mixed case, it become tricky. Current version of the library does not work very well with mixed case column headers.
There is a convenience method called makeInstances
that can create the instances required for cassava
.The instances created use the default fieldModifieroptions
settings shown below.
fieldModifierOptions :: Options
fieldModifierOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
where
rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
rmUnderscore str = str
-- the ToNamedRecord and FromNamedRecord are needed by Cassava since
-- we prefix
instance ToNamedRecord Salaries where
toNamedRecord = genericToNamedRecord fieldModifierOptions
instance FromNamedRecord Salaries where
parseNamedRecord = genericParseNamedRecord fieldModifierOptions
main :: IO ()
main = do
v <- loadData "data/salaries_simple.csv":: IO (V.Vector Salaries)
putStrLn . show $ v
In GHCi
we see (formatted for clarity)
2 >loadData "data/salaries.csv"
[Salaries {_emp_no = 1, _name = "John Doe", _salary = 100.0, _status = False, _years = 1.0},
Salaries {_emp_no = 2, _name = "Jill Doe", _salary = 200.1, _status = False, _years = 2.0},
Salaries {_emp_no = 3, _name = "John Doe Sr", _salary = 101.0, _status = True, _years = 3.0},
Salaries {_emp_no = 4, _name = "Jill Doe Sr", _salary = 10101.1, _status = False, _years = 4.2},
Salaries {_emp_no = 5, _name = "John Doe Jr", _salary = 1010101.0, _status = False, _years = 5.1},
Salaries {_emp_no = 6, _name = "Jill Doe Jr", _salary = 10101.1, _status = False, _years = 6.0}]
Note, the type inference in the above example is as follows:
- If a column has values from the set {
true
,t
,false
,f
} (ignoring case) then the inferred type isBool
. - If a column has values that are all numeric, then an
Integer
type is attempted, or else aDouble
is infered. For example foremp_no
the infered type is aInteger
whereas foryears
the type isDouble
. - For all other cases, a
Text
type is inferred.
Example 2 (Missing Values)
The library also supports type inference when values are missing. For example in, data/salaries_mixed_input.csv
emp_no,name,salary,status,years
1,John Doe,100.0,True,1
2,Jill Doe, 200.10,False,2
3,John Doe Sr,101.0,T,
4,Jill Doe Sr,10101.10,,4.2
5,John Doe Jr,,true,5.1
6,, 10101.10,false,6
the status
for Jill Doe Jr is missing and the salary
for John Doe Sr is missing. In this case, the type as wrapped in a Maybe
type.
In that case, the record instance we get will be as follows:
3 >:info Salaries
data Salaries
= Salaries {
_emp_no:: Integer,
_name :: Maybe Text,
_salary :: Maybe Double,
_status :: Maybe Bool,
_years:: Maybe Double}
Loading this data, would produce the following output
{-#LANGUAGE TemplateHaskell#-}
{-#LANGUAGE DeriveGeneric #-}
import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")
-- ^ note that we can use this function instead of manually defining
-- all instances required by Cassava
main :: IO ()
main = do
v <- loadData "data/salaries_mixed_input.csv":: IO (V.Vector SalariesMixed)
putStrLn . show $ v
The output will be as follows:
[SalariesMixed {_emp_no = 1, _name = Just "John Doe", _salary = Just 100.0, _status = Just False, _years = Just 1.0},
SalariesMixed {_emp_no = 2, _name = Just "Jill Doe", _salary = Just 200.1, _status = Just False, _years = Just 2.0},
SalariesMixed {_emp_no = 3, _name = Just "John Doe Sr", _salary = Just 101.0, _status = Just True, _years = Nothing},
SalariesMixed {_emp_no = 4, _name = Just "Jill Doe Sr", _salary = Just 10101.1, _status = Nothing, _years = Just 4.2},
SalariesMixed {_emp_no = 5, _name = Just "John Doe Jr", _salary = Nothing, _status = Just False, _years = Just 5.1},
SalariesMixed {_emp_no = 6, _name = Nothing, _salary = Just 10101.1, _status = Just False, _years = Just 6.0}]
Here is a full working code that uses both the examples:
{-#LANGUAGE TemplateHaskell#-}
{-#LANGUAGE DeriveGeneric #-}
{-#LANGUAGE ScopedTypeVariables #-}
{-#LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE DeriveDataTypeable #-}
module Main where
import Data.Cassava.Records
import Data.Csv
import qualified Data.ByteString.Lazy as BL
import Data.Vector as V
import Data.Text as DT
import qualified Text.PrettyPrint.Tabulate as T
import Language.Haskell.TH
-- import Control.Lens hiding (element)
$(makeCsvRecord "Salaries" "data/salaries_simple.csv" "_" commaOptions)
-- $(makeInstance "Salaries")
$(makeCsvRecord "SalariesMixed" "data/salaries_mixed_input.csv" "_" commaOptions)
$(makeInstance "SalariesMixed")
-- the following instance is not required, if $(makeInstance Salaries) statement
-- is spliced in (currently commented in the example)
myOptions :: Options
myOptions = defaultOptions { fieldLabelModifier = rmUnderscore }
where
rmUnderscore ('_':str) = DT.unpack . DT.toUpper . DT.pack $ str
rmUnderscore str = str
instance ToNamedRecord Salaries where
toNamedRecord = genericToNamedRecord myOptions
instance FromNamedRecord Salaries where
parseNamedRecord = genericParseNamedRecord myOptions
main :: IO ()
main = do
v <- loadData "data/salaries_simple.csv" :: IO (V.Vector Salaries)
v1 <- loadData "data/salaries_mixed_input.csv" :: IO (V.Vector SalariesMixed)
putStrLn . show $ v
putStrLn . show $ v1
Caveats (Or list of future enhancements)
- The columns names along with prefix("_") should be a valid Haskell field names. For example, column names cannot have spaces or other characters not supported for use as
field
names ofRecord
data type. - The library loads the whole file during compilation to infer types. Given the size of the file, this will increase the compile time. Alternative workflows, like stripping the file or dumping the created slice into a file is recommended. In the future, the makeCsvRecord function can take a parameter to specify the max number of rows that can be used to infer the types.
- The inferred types are limited. Text, Bool, Integer, Double and the MayBe variants of those. Future, support may include DateTime.
- Mixed case column headers not automatically supported. A more complex form of
fieldOptionModifiers
needs to be provided. - Currently there is no option to provide custom types.