Description

Separate a Data Frame by Normalization.

Separate a data frame in two based on key columns. The function unjoin() provides an inside-out version of a nested data frame. It is used to identify duplication and normalize it (in the database sense) by linking two tables with the redundancy removed. This is a basic requirement for detecting topology within spatial structures, which motivated this package as a building block for workflows in more applied projects.


unjoin

The goal of unjoin is to provide an unjoin operation for data frames. This is part of what tidyr::nest does, but with two differences:

  • the split data frames are not nested; they are returned as two whole tibbles, main and data (in the printed results on this page the main table is named after its key column, .idx0 by default)
  • an explicit key column is added to link the de-duplicated rows in main to the rows in data.
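
The split described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: unjoin_sketch is a hypothetical helper, and it numbers keys in order of first appearance, which may differ from the numbering unjoin() produces.

```r
# Hypothetical helper (not part of unjoin): split a data frame into a table of
# distinct key combinations and a table of the remaining columns, linked by an
# integer key column.
unjoin_sketch <- function(df, key_cols, key_col = ".idx0") {
  # One string per row identifying its combination of key values.
  key_string <- do.call(paste, c(unname(as.list(df[key_cols])), sep = "\037"))
  # Position of each row's key among the distinct keys (first-appearance order).
  idx <- match(key_string, unique(key_string))

  # De-duplicated key table: the first row seen for each distinct key.
  main <- df[!duplicated(idx), key_cols, drop = FALSE]
  rownames(main) <- NULL
  main[[key_col]] <- seq_len(nrow(main))

  # Remaining columns, with the key column added as the link back to main.
  data <- df[, setdiff(names(df), key_cols), drop = FALSE]
  data[[key_col]] <- idx

  list(main = main, data = data)
}

parts <- unjoin_sketch(iris, c("Species", "Petal.Width"))
nrow(parts$main)  # number of distinct Species / Petal.Width combinations
nrow(parts$data)  # same number of rows as iris
```

Looking up parts$main by the key stored in parts$data recovers the removed columns, which is the sense in which the two tables are normalized.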

Installation

Install unjoin from CRAN:

install.packages("unjoin")

You can install the development version of unjoin from GitHub with:

# install.packages("devtools")
devtools::install_github("hypertidy/unjoin")

Example

This is a basic example which shows you how to unjoin a data frame.

library(unjoin)

unjoin(iris)
#> $.idx0
#> # A tibble: 1 x 1
#>   .idx0
#>   <int>
#> 1     1
#> 
#> $data
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species .idx0
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
#>  1          5.1         3.5          1.4         0.2 setosa      1
#>  2          4.9         3            1.4         0.2 setosa      1
#>  3          4.7         3.2          1.3         0.2 setosa      1
#>  4          4.6         3.1          1.5         0.2 setosa      1
#>  5          5           3.6          1.4         0.2 setosa      1
#>  6          5.4         3.9          1.7         0.4 setosa      1
#>  7          4.6         3.4          1.4         0.3 setosa      1
#>  8          5           3.4          1.5         0.2 setosa      1
#>  9          4.4         2.9          1.4         0.2 setosa      1
#> 10          4.9         3.1          1.5         0.1 setosa      1
#> # … with 140 more rows
#> 
#> attr(,"class")
#> [1] "unjoin"

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
iris %>% unjoin(Species)
#> $.idx0
#> # A tibble: 3 x 2
#>   Species    .idx0
#>   <fct>      <int>
#> 1 setosa         1
#> 2 versicolor     2
#> 3 virginica      3
#> 
#> $data
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width .idx0
#>           <dbl>       <dbl>        <dbl>       <dbl> <int>
#>  1          5.1         3.5          1.4         0.2     1
#>  2          4.9         3            1.4         0.2     1
#>  3          4.7         3.2          1.3         0.2     1
#>  4          4.6         3.1          1.5         0.2     1
#>  5          5           3.6          1.4         0.2     1
#>  6          5.4         3.9          1.7         0.4     1
#>  7          4.6         3.4          1.4         0.3     1
#>  8          5           3.4          1.5         0.2     1
#>  9          4.4         2.9          1.4         0.2     1
#> 10          4.9         3.1          1.5         0.1     1
#> # … with 140 more rows
#> 
#> attr(,"class")
#> [1] "unjoin"

iris %>% unjoin(Species, Petal.Width)
#> $.idx0
#> # A tibble: 27 x 3
#>    Species    Petal.Width .idx0
#>    <fct>            <dbl> <int>
#>  1 setosa             0.2     2
#>  2 setosa             0.4     4
#>  3 setosa             0.3     3
#>  4 setosa             0.1     1
#>  5 setosa             0.5     5
#>  6 setosa             0.6     6
#>  7 versicolor         1.4    11
#>  8 versicolor         1.5    12
#>  9 versicolor         1.3    10
#> 10 versicolor         1.6    13
#> # … with 17 more rows
#> 
#> $data
#> # A tibble: 150 x 4
#>    Sepal.Length Sepal.Width Petal.Length .idx0
#>           <dbl>       <dbl>        <dbl> <int>
#>  1          5.1         3.5          1.4     2
#>  2          4.9         3            1.4     2
#>  3          4.7         3.2          1.3     2
#>  4          4.6         3.1          1.5     2
#>  5          5           3.6          1.4     2
#>  6          5.4         3.9          1.7     4
#>  7          4.6         3.4          1.4     3
#>  8          5           3.4          1.5     2
#>  9          4.4         2.9          1.4     2
#> 10          4.9         3.1          1.5     1
#> # … with 140 more rows
#> 
#> attr(,"class")
#> [1] "unjoin"

This is used to build topological data structures, with a kind of inside-out version of a nested data frame. Whether it’s of broader use is unclear.

There is a record here of some of the thinking that led to unjoin: https://github.com/r-gris/babelfish

The function unjoin replaces the method here: http://rpubs.com/cyclemumner/iout_nest

(d2 <- iris %>% unjoin(Species, Petal.Width))
#> $.idx0
#> # A tibble: 27 x 3
#>    Species    Petal.Width .idx0
#>    <fct>            <dbl> <int>
#>  1 setosa             0.2     2
#>  2 setosa             0.4     4
#>  3 setosa             0.3     3
#>  4 setosa             0.1     1
#>  5 setosa             0.5     5
#>  6 setosa             0.6     6
#>  7 versicolor         1.4    11
#>  8 versicolor         1.5    12
#>  9 versicolor         1.3    10
#> 10 versicolor         1.6    13
#> # … with 17 more rows
#> 
#> $data
#> # A tibble: 150 x 4
#>    Sepal.Length Sepal.Width Petal.Length .idx0
#>           <dbl>       <dbl>        <dbl> <int>
#>  1          5.1         3.5          1.4     2
#>  2          4.9         3            1.4     2
#>  3          4.7         3.2          1.3     2
#>  4          4.6         3.1          1.5     2
#>  5          5           3.6          1.4     2
#>  6          5.4         3.9          1.7     4
#>  7          4.6         3.4          1.4     3
#>  8          5           3.4          1.5     2
#>  9          4.4         2.9          1.4     2
#> 10          4.9         3.1          1.5     1
#> # … with 140 more rows
#> 
#> attr(,"class")
#> [1] "unjoin"
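
Because the key column is shared by both tables, the original rows can be recovered with an ordinary join. A minimal sketch, assuming the result is a list whose elements are named .idx0 and data as printed above; rejoined is just an illustrative name.

```r
library(unjoin)
library(dplyr)

d2 <- iris %>% unjoin(Species, Petal.Width)

# Attach the de-duplicated key columns back onto the data rows via the shared
# key column. inner_join() keeps the row order of its first argument.
rejoined <- inner_join(d2[["data"]], d2[[".idx0"]], by = ".idx0")

# The key values line up with the original data frame, row for row.
all(as.character(rejoined$Species) == as.character(iris$Species))
all(rejoined$Petal.Width == iris$Petal.Width)
```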

We can chain unjoins together, but each call in the chain must use a different key_col.

unjoin(iris, Species, key_col = "vertex") %>% unjoin(Petal.Width, vertex,  key_col = "branch")
#> $vertex
#> # A tibble: 3 x 2
#>   Species    vertex
#>   <fct>       <int>
#> 1 setosa          1
#> 2 versicolor      2
#> 3 virginica       3
#> 
#> $branch
#> # A tibble: 27 x 3
#>    Petal.Width vertex branch
#>          <dbl>  <int>  <int>
#>  1         0.2      1      2
#>  2         0.4      1      4
#>  3         0.3      1      3
#>  4         0.1      1      1
#>  5         0.5      1      5
#>  6         0.6      1      6
#>  7         1.4      2     11
#>  8         1.5      2     13
#>  9         1.3      2     10
#> 10         1.6      2     15
#> # … with 17 more rows
#> 
#> $data
#> # A tibble: 150 x 4
#>    Sepal.Length Sepal.Width Petal.Length branch
#>           <dbl>       <dbl>        <dbl>  <int>
#>  1          5.1         3.5          1.4      2
#>  2          4.9         3            1.4      2
#>  3          4.7         3.2          1.3      2
#>  4          4.6         3.1          1.5      2
#>  5          5           3.6          1.4      2
#>  6          5.4         3.9          1.7      4
#>  7          4.6         3.4          1.4      3
#>  8          5           3.4          1.5      2
#>  9          4.4         2.9          1.4      2
#> 10          4.9         3.1          1.5      1
#> # … with 140 more rows
#> 
#> attr(,"class")
#> [1] "unjoin"

Also, there is no escape hatch here: you can’t “unjoin” your way to normal nirvana. Each unjoin needs to carry the previous unjoin key with it, so you end up with one big link table with no attributes. It needs some kind of grouping semantic to cut the chain.
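
To make the point about carried keys concrete, the chained result can be walked back one link at a time. This is a sketch that assumes the element names vertex, branch and data shown in the chained output; ch and full are illustrative names.

```r
library(unjoin)
library(dplyr)

ch <- unjoin(iris, Species, key_col = "vertex") %>%
  unjoin(Petal.Width, vertex, key_col = "branch")

# The data table only knows its branch key; the branch table carries the
# vertex key from the earlier unjoin, so it is needed to reach the vertex table.
full <- ch[["data"]] %>%
  inner_join(ch[["branch"]], by = "branch") %>%
  inner_join(ch[["vertex"]], by = "vertex")

names(ch[["branch"]])  # includes both the vertex and branch keys
```

Each extra level adds another join through the previous key, which is why the chain ends in a link table rather than a fully separated set of attribute tables.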


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Metadata

Version

0.1.0

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows