MyNixOS website logo
Description

Reshape Disorganised Messy Data.

Helps the user to build and register schema descriptions of disorganised (messy) tables. Disorganised tables are tables that are not in a topologically coherent form, where packages such as 'tidyr' could be used for reshaping. The schema description documents the arrangement of input tables and is used to reshape them into a standardised (tidy) output format.

tabshiftr

CRAN_Status_Badge R-CMD-check CoverageStatus Lifecycle:maturing

Overview

Data are stored in many different ways in tables or spreadsheets because no strict semantic or topographic standards for the organisation of tables are commonly accepted. In the R environment the tidy paradigm is a first step towards interoperability of data, in that it requires a certain arrangement of tables, where variables are recorded in columns and observations in rows (see https://tidyr.tidyverse.org/). Tables can be tidied (i.e., brought into a tidy arrangement) via packages such as tidyr, however, all functions that deal with reshaping tables to date require data that are already organised into topologically coherent, rectangular tables. This is often violated in practice, especially in data that are scraped off of the internet.

tabshiftr fills this gap in the toolchain towards more interoperable data via schema descriptions that are built with setters and debugged with getters and a reorganise() function that ties everything together.

Installation

  1. Install the official version from CRAN:
install.packages("tabshiftr")

or the latest development version from github:

devtools::install_github("EhrmannS/tabshiftr")
  1. The vignette gives an introduction, provides an instruction on how to set up schema descriptions by going step by step through certain dimensions of disorganisation to show which table arrangements can be reorganised and how that works.

Examples

A disorganised table may look like the following table:

library(tabshiftr)
library(knitr)

# a rather disorganised table with messy clusters and a distinct variable
input <- tabs2shift$clusters_messy
kable(input)
X1X2X3X4X5X6X7
commoditiesharvestedproduction....
unit 1......
soybean11111112year 1...
maize11211122year 1...
soybean12111212year 2...
maize12211222year 2...
.......
commoditiesharvestedproductioncommoditiesharvestedproduction.
unit 2..unit 3...
soybean21112112soybean31113112year 1
maize21212122maize31213122year 1
soybean22112212soybean32113212year 2
maize22212222maize32213222year 2

If we were to transform this data into tidy data by merely using the functions in tidyr (or the extended tidyverse in general), we’d potentially end up with a massive algorithm, especially for such complicated table arrangements. For other tables that may or may not be as complicated, we’d have to set up yet more algorithms and while a pipeline of tidy functions is relatively easy to set up, it would still become very laborious to repeat this for the dozens of potential table arrangements. In tabshiftr we solve that by describing the schema of the input table and providing this schema description to the reorganise() function. This requires us to use a vastly smaller set of code and makes it thus a lot more efficient to bring multiple heterogeneous data into an interoperable format.

# put together schema description by ...
# ... identifying cluster positions
schema <- setCluster(id = "territories", left = c(1, 1, 4), top = c(1, 8, 8))

# ... specifying the cluster ID as id variable (obligatory for when we deal with clusters)
schema <- schema %>%
   setIDVar(name = "territories", columns = c(1, 1, 4), rows = c(2, 9, 9))

# ... specifying a distinct variable (explicit position)
schema <- schema %>%
   setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE)

# ... specifying a tidy variable (by giving the column values)
schema <- schema %>%
   setIDVar(name = "commodities", columns = c(1, 1, 4))

# ... identifying the (tidy) observed variables
schema <- schema %>%
   setObsVar(name = "harvested", columns = c(2, 2, 5)) %>%
   setObsVar(name = "production", columns = c(3, 3, 6))

# to potentially debug the schema description, first validate the schema ...
schema_valid <- validateSchema(schema = schema, input = input)

# ... and extract parts of it per cluster (also check out the other getters in
# this package)
getIDVars(schema = schema_valid, input = input)
#> [[1]]
#> [[1]]$year
#> # A tibble: 4 × 1
#>   X4    
#>   <chr> 
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#> 
#> [[1]]$commodities
#> # A tibble: 4 × 1
#>   X1     
#>   <chr>  
#> 1 soybean
#> 2 maize  
#> 3 soybean
#> 4 maize  
#> 
#> 
#> [[2]]
#> [[2]]$year
#> # A tibble: 4 × 1
#>   X4    
#>   <chr> 
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#> 
#> [[2]]$commodities
#> # A tibble: 4 × 1
#>   X1     
#>   <chr>  
#> 1 soybean
#> 2 maize  
#> 3 soybean
#> 4 maize  
#> 
#> 
#> [[3]]
#> [[3]]$year
#> # A tibble: 4 × 1
#>   X4    
#>   <chr> 
#> 1 year 1
#> 2 year 1
#> 3 year 2
#> 4 year 2
#> 
#> [[3]]$commodities
#> # A tibble: 4 × 1
#>   X4     
#>   <chr>  
#> 1 soybean
#> 2 maize  
#> 3 soybean
#> 4 maize
getObsVars(schema = schema_valid, input = input)
#> [[1]]
#> [[1]]$harvested
#> # A tibble: 4 × 1
#>   X2   
#>   <chr>
#> 1 1111 
#> 2 1121 
#> 3 1211 
#> 4 1221 
#> 
#> [[1]]$production
#> # A tibble: 4 × 1
#>   X3   
#>   <chr>
#> 1 1112 
#> 2 1122 
#> 3 1212 
#> 4 1222 
#> 
#> 
#> [[2]]
#> [[2]]$harvested
#> # A tibble: 4 × 1
#>   X2   
#>   <chr>
#> 1 2111 
#> 2 2121 
#> 3 2211 
#> 4 2221 
#> 
#> [[2]]$production
#> # A tibble: 4 × 1
#>   X3   
#>   <chr>
#> 1 2112 
#> 2 2122 
#> 3 2212 
#> 4 2222 
#> 
#> 
#> [[3]]
#> [[3]]$harvested
#> # A tibble: 4 × 1
#>   X5   
#>   <chr>
#> 1 3111 
#> 2 3121 
#> 3 3211 
#> 4 3221 
#> 
#> [[3]]$production
#> # A tibble: 4 × 1
#>   X6   
#>   <chr>
#> 1 3112 
#> 2 3122 
#> 3 3212 
#> 4 3222

# alternatively, if the clusters are regular, relative values starting from the
# cluster origin could be set
schema_alt <- setCluster(id = "territories",
                         left = c(1, 1, 4), top = c(1, 8, 8)) %>%
  setIDVar(name = "territories", columns = 1, rows = .find(row = 2, relative = TRUE)) %>%
  setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
  setIDVar(name = "commodities", columns = .find(col = 1, relative = TRUE)) %>%
  setObsVar(name = "harvested", columns = .find(col = 2, relative = TRUE)) %>%
  setObsVar(name = "production", columns = .find(col = 3, relative = TRUE))

The reorganise() function carries out the steps of validating, extracting the variables, pivoting the tentative output and putting the final table together automatically, so it merely requires the finalised schema and the input table.

schema # has a pretty print function
#>   3 clusters
#>     origin : 1|1, 8|1, 8|4  (row|col)
#>     id     : territories
#> 
#>    variable      type       row    col    dist 
#>   ------------- ---------- ------ ------ ------  
#>    territories   id         2, 9   1, 4   F  
#>    year          id         3:6    4      T  
#>    commodities   id                1, 4   F  
#>    harvested     observed          2, 5   F  
#>    production    observed          3, 6   F

output <- reorganise(input = input, schema = schema)
kable(output)
territoriesyearcommoditiesharvestedproduction
unit 1year 1maize11211122
unit 1year 1soybean11111112
unit 1year 2maize12211222
unit 1year 2soybean12111212
unit 2year 1maize21212122
unit 2year 1soybean21112112
unit 2year 2maize22212222
unit 2year 2soybean22112212
unit 3year 1maize31213122
unit 3year 1soybean31113112
unit 3year 2maize32213222
unit 3year 2soybean32113212

Contributions

  • tabshiftr is still in development. So far it reliably reorganises 20 different types of tables, but additional dimensions of disorganisation might show themselves. If you encounter a table that can’t be reorganised with the current infrastructure, we’d be more than happy to collaborate on advancing tabshiftr.
  • Informative error management is work in process.
  • Moreover, the resulting schema descriptions can be useful for data archiving or database building and tabshiftr should at some point support that those schemas can be exported into data-formats that are used by downstream applications (xml, json, …), following proper (ISO) standards. In case you have experience with those standards and would like to collaborate on it, please get in touch!

Acknowledgement

This work was supported by funding to Carsten Meyer through the Flexpool mechanism of the German Centre for Integrative Biodiversity Research (iDiv) (FZT-118, DFG).

Metadata

Version

0.4.1

License

Unknown

Platforms (77)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows