MyNixOS website logo
Description

Analyze Big Files Without Loading Them in Memory.

A simple set of wrapper functions for data.table::fread() that allows subsetting or filtering rows and selecting columns of table-formatted files too large for the available RAM. 'b stands for 'big files'. bread makes heavy use of Unix commands like 'grep', 'sed', 'wc', 'awk' and 'cut'. They are available by default in all Unix environments. For Windows, you need to install those commands externally in order to simulate a Unix environment and make sure that the executables are in the Windows PATH variable. To my knowledge, the simplest ways are to install 'RTools', 'Git' or 'Cygwin'. If they have been correctly installed (with the expected registry entries), they should be detected on loading the package and the correct directories will be added automatically to the PATH.

bread

bread offers simple wrapper functions of data.table::fread() that aim at making it easier to use the "cmd" argument with shell Unix (and sometimes PowerShell if available) commands like grep, wd and sed. The functions auto-generate those commands from arguments provided to the function. The main use is to allow computers with low memory to analyze big files (the "b" in bread stands for "big files") and count rows, look up column names, subset rows by index numbers or value and select columns without hitting the memory limit (and the "cannot allocate vector of size" error.) bread functions allow to analyze a 50Gb file with a computer with 8Gb of memory and:

  • split it in several smaller ones by number of rows or by values in one or many columns
  • count the number of rows
  • subset it by row number or column values (string pattern or numerical value)
  • select only the relevant variables/columns

Best practices

There are other (better) ways to do that, like - for example - loading a large file in a SQLite database. Or not working on huge csv files in the first place. But I happened to use those commands often in order to explore data. If you have to, you hopefully won't have to delve right away into the fascinating grammar of Unix commands.

Pre-requisites

bread makes heavy use of Unix commands like grep, sed, wc and cut. They are available by default in all Unix environments. For Windows, you need to install those commands externally in order to simulate a Unix environment and make sure that the executables are in the Windows PATH variable. To my knowledge, the simplest ways are to install RTools, Git or Cygwin. If they have been correctly installed (with the expected registry entries), they will be detected on loading the package and the correct directories will be added automatically to the PATH.

Installation

# Install bread from CRAN
install.packages("bread")
# Or the development version from GitHub:
# install.packages("bread")
devtools::install_github("MagicHead99/bread")
Metadata

Version

0.4.1

License

Unknown

Platforms (75)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows