Description

Read and Write 'Parquet' Files.

Self-sufficient reader and writer for flat 'Parquet' files. Can read most 'Parquet' data types. Can write many 'R' data types, including factors and temporal types. See docs for limitations.

nanoparquet

nanoparquet is a reader and writer for a common subset of Parquet files.

Features:

  • Read and write flat (i.e. non-nested) Parquet files.
  • Can read most Parquet data types.
  • Can read a subset of columns from a Parquet file.
  • Can write many R data types, including factors and temporal types, to Parquet.
  • Can append a data frame to a Parquet file without first reading and then rewriting the whole file.
  • Completely dependency free.
  • Supports Snappy, Gzip and Zstd compression.
  • Competitive with other tools in terms of speed, memory use and file size.

Limitations:

  • Nested Parquet types are not supported.
  • Some Parquet logical types are not supported: INTERVAL, UNKNOWN.
  • Only Snappy, Gzip and Zstd compression is supported.
  • Encryption is not supported.
  • Reading files from URLs is not supported.
  • nanoparquet always reads the data (or the selected subset of it) into memory. It does not work with out-of-memory Parquet data the way Apache Arrow and DuckDB do.

Installation

Install the R package from CRAN:

install.packages("nanoparquet")

Usage

Read

Call read_parquet() to read a Parquet file:

df <- nanoparquet::read_parquet("example.parquet")

To see the columns of a Parquet file and how their types are mapped to R types by read_parquet(), call read_parquet_schema() first:

nanoparquet::read_parquet_schema("example.parquet")
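
The features list above mentions reading a subset of columns. A short sketch, assuming the col_select argument of read_parquet() (the column names here are placeholders):

df <- nanoparquet::read_parquet(
  "example.parquet",
  # placeholder names; column positions should work as well
  col_select = c("col1", "col2")
)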

Folders of similarly structured Parquet files (e.g. as produced by Spark) can be read like this:

df <- data.table::rbindlist(lapply(
  Sys.glob("some-folder/part-*.parquet"),
  nanoparquet::read_parquet
))
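
The data.table dependency here is only used to bind the rows together; a base R equivalent, assuming every part has the same columns:

files <- Sys.glob("some-folder/part-*.parquet")
df <- do.call(rbind, lapply(files, nanoparquet::read_parquet))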

Write

Call write_parquet() to write a data frame to a Parquet file:

nanoparquet::write_parquet(mtcars, "mtcars.parquet")
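
A data frame can also be appended to an existing file without rewriting it (see the features list above). A sketch, assuming the append_parquet() function of the current API:

nanoparquet::write_parquet(mtcars[1:16, ], "mtcars.parquet")
nanoparquet::append_parquet(mtcars[17:32, ], "mtcars.parquet")
nrow(nanoparquet::read_parquet("mtcars.parquet"))  # 32 rows after the append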

To see how the columns of the data frame will be mapped to Parquet types by write_parquet(), call infer_parquet_schema() first:

nanoparquet::infer_parquet_schema(mtcars)
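
Factors and temporal columns are round-tripped by default, via the Arrow metadata that write_parquet() adds (see the Options section below). A minimal sketch with made-up data:

df <- data.frame(
  grp = factor(c("a", "b", "a")),
  day = as.Date("2024-01-01") + 0:2
)
nanoparquet::write_parquet(df, "types.parquet")
sapply(nanoparquet::read_parquet("types.parquet"), class)
# grp is read back as a factor, day as a Date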

Inspect

Call read_parquet_info(), read_parquet_schema(), or read_parquet_metadata() to see various kinds of metadata from a Parquet file:

  • read_parquet_info() shows a basic summary of the file.
  • read_parquet_schema() shows all columns, including non-leaf columns, and how they are mapped to R types by read_parquet().
  • read_parquet_metadata() shows the most complete metadata: the file metadata, the schema, and the row groups and column chunks of the file.

nanoparquet::read_parquet_info("mtcars.parquet")
nanoparquet::read_parquet_schema("mtcars.parquet")
nanoparquet::read_parquet_metadata("mtcars.parquet")

If you find a file that should be supported but isn't, please open an issue in the nanoparquet issue tracker with a link to the file.

Options

See also ?parquet_options() for further details.

  • nanoparquet.class: extra class to add to data frames returned by read_parquet(). If it is not defined, the default is "tbl", which changes how the data frame is printed if the pillar package is loaded.
  • nanoparquet.compression_level: See ?parquet_options() for the defaults and the possible values for each compression method. Inf selects maximum compression for each method.
  • nanoparquet.num_rows_per_row_group: The number of rows to put into a row group by write_parquet(), if row groups are not specified explicitly. It should be an integer scalar. Defaults to 10 million.
  • nanoparquet.use_arrow_metadata: unless this is set to FALSE, read_parquet() will make use of Arrow metadata in the Parquet file. Currently this is used to detect factor columns.
  • nanoparquet.write_arrow_metadata: unless this is set to FALSE, write_parquet() will add Arrow metadata to the Parquet file. This helps preserve the classes of columns, e.g. factors will be read back as factors, both by nanoparquet and by Arrow.
  • nanoparquet.write_data_page_version: Data page version to write by default. Possible values are 1 and 2. Default is 1.
  • nanoparquet.write_minmax_values: Whether to write minimum and maximum values per row group, for data types that support this in write_parquet().
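
These are regular R options, so they can be set with options() before calling the read or write functions. A brief sketch, assuming the compression argument of write_parquet():

# Inf selects the maximum compression level for the chosen method
options(nanoparquet.compression_level = Inf)
nanoparquet::write_parquet(mtcars, "mtcars.parquet", compression = "zstd")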

License

MIT.

Metadata

Version

0.4.2

License

MIT

Platforms (75)

  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows