Description

Create Data Frames that are Easier to Exchange and Reuse.

The aim of the 'dataset' package is to make tidy datasets easier to release, exchange and reuse. It organizes and formats 'R' data frame objects into well-referenced, well-described, interoperable datasets that are ready for release and reuse.

The dataset R Package

dataset: Semantic Metadata for Datasets in R

The dataset package provides tools to create semantically rich and interoperable datasets in R. It improves metadata handling by introducing new S3 classes—defined(), dataset_df(), and bibrecord()—that enhance the behaviour of labelled, tibble, and bibentry objects to meet the requirements of:

  • Statistical Data and Metadata eXchange (SDMX) standards,
  • Open Science metadata practices,
  • Library and archive metadata conventions (Dublin Core, DataCite).

Motivation

Many tools exist to help document, describe, or publish datasets in R, but most separate the metadata from the data itself. This separation increases the risk of losing metadata, misaligning it with the data, or making documentation hard to maintain.

The dataset package addresses this by storing all metadata directly in R object attributes. This preserves semantic information as data is transformed, combined, or exported, preventing the loss of vital documentation and improving reproducibility.
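
As a minimal base R sketch of the idea (not the package's own implementation), metadata can be attached to a vector through attributes. The attribute names `label` and `unit` are chosen here for illustration only. The sketch also shows that ordinary subsetting of a bare vector drops such attributes, which is the kind of loss that dedicated S3 classes are meant to prevent.

```r
# Attach illustrative metadata to a plain numeric vector.
# The attribute names "label" and "unit" are assumptions for this sketch.
gdp <- c(3897, 7365)
attr(gdp, "label") <- "GDP"
attr(gdp, "unit")  <- "million euros"

# The metadata travels with the object itself.
attr(gdp, "unit")

# Ordinary subsetting of a bare vector discards custom attributes,
# so the unit is no longer available on the result.
gdp_first <- gdp[1]
is.null(attr(gdp_first, "unit"))
```

A class with its own subsetting and printing methods can carry these attributes through such operations instead of losing them silently.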

Key Features

defined()

An extended version of labelled() vectors. Adds support for:

  • Variable labels
  • Units of measure (e.g. “million euros”)
  • Concept URIs (standardized definitions)
  • Namespaces (to support URI expansion)

library(dataset)
data(orange_df)
print(orange_df$age)
#> orange_df$age: The age of the tree
#> Measured in days since 1968/12/31 
#>  [1]  118  484  664 1004 1231 1372 1582  118  484  664 1004 1231 1372 1582  118
#> [16]  484  664 1004 1231 1372 1582  118  484  664 1004 1231 1372 1582  118  484
#> [31]  664 1004 1231 1372 1582

This ensures that, for example, “GDP” is always associated with a precise concept and unit, avoiding ambiguity across analyses and publications. See also “Semantically Enriched Vectors with defined()”.

bibrecord()

An extension of R’s built-in bibentry() class, with support for:

  • Dublin Core Terms (dcterms)
  • DataCite metadata
  • Contributor roles (e.g. creator, publisher, data manager)
  • Subject tagging and geolocation

as_dublincore(orange_df)
#> Dublin Core Metadata Record
#> --------------------------
#> Title:        Growth of Orange Trees 
#> Creator(s):   N.R. Draper [cre] (http://viaf.org/viaf/84585260); H Smith [cre] 
#> Contributor(s):  :unas 
#> Publisher:    Wiley 
#> Year:         1998 
#> Language:     en 
#> Description:  The Orange data frame has 35 rows and 3 columns of records of the growth of orange trees.

This makes it easier to produce citations and metadata suitable for repositories like Zenodo or Dataverse. See more in “Modernising Citation Metadata in R: Introducing bibrecord”.
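
For comparison, the base bibentry() class that bibrecord() extends can be used on its own. The sketch below relies only on the utils package and reuses the title, authors, publisher and year from the output above. The choice of the "Misc" entry type is an assumption made for illustration.

```r
library(utils)

# A plain bibentry for the same dataset, using only base R.
# bibtype = "Misc" is an illustrative choice, not taken from the package.
orange_bib <- bibentry(
  bibtype   = "Misc",
  title     = "Growth of Orange Trees",
  author    = c(person("N.R.", "Draper"), person("H.", "Smith")),
  publisher = "Wiley",
  year      = "1998"
)

# Fields are available through the `$` accessor.
orange_bib$title
orange_bib$year
```

A plain bibentry like this covers the usual citation fields. The Dublin Core and DataCite elements listed above are what the extended class adds.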

dataset_df()

A semantic wrapper around data.frame or tibble, aligning with SDMX’s data cube model:

  • Variables (columns) can have units, labels, and definitions.
  • Observations (rows) can be assigned unique identifiers.
  • Datasets can carry complete metadata inline (title, creator, description, etc.)
  • Output can be serialized to RDF-based linked data formats (such as N-Triples).
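
To illustrate what serializing to N-Triples means, here is a minimal base R sketch that writes each cell of a small data frame as one subject-predicate-object line. It is not the package's own exporter. The helper `to_ntriple()` and the namespace `http://example.com/dataset#` are hypothetical, and every value is written as a plain string literal without datatype annotations.

```r
# Hypothetical helper: format one statement as an N-Triples line.
# It assumes the value contains no characters that need escaping.
to_ntriple <- function(subject, predicate, value) {
  sprintf('<%s> <%s> "%s" .', subject, predicate, value)
}

df <- data.frame(country = c("AD", "LI"), gdp = c(3897, 7365))
base_uri <- "http://example.com/dataset#"  # assumed namespace

# One subject URI per observation (row), one predicate URI per variable (column).
triples <- unlist(lapply(seq_len(nrow(df)), function(i) {
  subject <- paste0(base_uri, "obs", i)
  vapply(names(df), function(col) {
    to_ntriple(subject, paste0(base_uri, col), as.character(df[[col]][i]))
  }, character(1), USE.NAMES = FALSE)
}))

writeLines(triples)
```

Each row becomes a subject, each column a predicate, and each cell an object. That is the long-form shape that linked data tools expect.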

See more in “Why Semantics Matter for R Data Frames”.

Why Use This?

  • Machine-readability: Your data and metadata are tightly coupled and structured for reuse.
  • Preservation: Data exported from R retains its full descriptive context.
  • Publication-ready: Integration with modern repository standards (DataCite, DC Terms).
  • Tidy + semantic: Extends tidy principles with semantic rigor.

Example

my_data <- dataset_df(
  country = defined(
    c("AD", "LI"), 
    concept =  "http://data.europa.eu/bna/c_6c2bb82d"),
  gdp = defined(c(3897, 7365), 
                label = "GDP", 
                unit = "million euros"),
  dataset_bibentry = datacite(
    Title = "GDP Data for Small Countries",
    Description = "Example Dataset for the dataset package",
    Creator = person("Jane", "Doe"),
    Publisher = "Open Data Institute",
    Rights = "CC0", 
    Language = "en"
  )
)

head(my_data)
#> 
#> 
#>   rowid      country    gdp        
#>   <hvn_lbl_> <hvn_lbl_> <hvn_lbl_>
#> 1 eg:1       AD         3897      
#> 2 eg:2       LI         7365

as_datacite(my_data)
#> DataCite Metadata Record
#> --------------------------
#> Title:         GDP Data for Small Countries 
#> Creator(s):    Jane Doe 
#> Contributor(s): :unas 
#> Identifier:    :tba 
#> Publisher:     Open Data Institute 
#> Year:          :tba 
#> Language:      en 
#> Description:  Example Dataset for the dataset package

🧪 Contributing

We welcome contributions and discussion!

📜 Code of Conduct

This project adheres to the rOpenSci Code of Conduct. By participating, you are expected to uphold these guidelines.

Metadata

Version

0.3.9

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows