Description

“Shuffle and merge overlapping chunks” lossless compression.

Shamochu is short for “Shuffle and merge overlapping chunks” lossless compression.

The idea for the compression is:

  1. Split the input array in chunks of a given size.

  2. Rearrange the chunks to maximize the overlap between consecutive chunks.

  3. Create a data table with the reordered chunks and an index table that maps the original chunk index to its offset in the data table.

Then the data can be accessed in 𝒪(1) via a few bitwise operations and indexing two arrays.

The same operation can then be applied to the index table and may lead to further compression.

Trivial example (chunk size: 4):

  [1, 2, 3, 4, 2, 3, 4, 5, 0, 1, 2, 3]          # source data
  -> [[1, 2, 3, 4], [2, 3, 4, 5], [0, 1, 2, 3]] # make chunks
  -> [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]] # rearrange to have best overlaps
  -> {data: [0, 1, 2, 3, 4, 5], offsets: [1, 2, 0]} # overlap chunks & compute
                                                    # their offsets
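
The overlap-and-offset step of this example can be sketched in Python as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper names `overlap` and `merge_chunks` are hypothetical, the chunk ordering is passed in by hand rather than searched for, and only suffix/prefix overlaps are considered.

```python
def overlap(data, chunk):
    """Length of the longest suffix of `data` that is also a prefix of `chunk`."""
    for n in range(min(len(data), len(chunk)), 0, -1):
        if data[-n:] == chunk[:n]:
            return n
    return 0


def merge_chunks(source, chunk_len, order):
    """Build the data table and the offsets table for a given chunk ordering.

    `order` lists the original chunk indices in the order they are laid out.
    Finding a good ordering is the hard part and is not attempted here.
    """
    chunks = [source[i:i + chunk_len] for i in range(0, len(source), chunk_len)]
    data = []
    offsets = [0] * len(chunks)
    for idx in order:
        chunk = chunks[idx]
        n = overlap(data, chunk)
        offsets[idx] = len(data) - n  # where this chunk starts in `data`
        data.extend(chunk[n:])        # append only the non-overlapping tail
    return data, offsets


source = [1, 2, 3, 4, 2, 3, 4, 5, 0, 1, 2, 3]
# Lay out original chunk 2 first, then chunk 0, then chunk 1.
data, offsets = merge_chunks(source, 4, order=[2, 0, 1])
print(data)     # [0, 1, 2, 3, 4, 5]
print(offsets)  # [1, 2, 0]
```

With the ordering from the example, the twelve source values shrink to a six-element data table plus three offsets.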

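Then we can retrieve the data from the original array at index i with the following formula, where chunk_size is the base-2 logarithm of the chunk length (2 for the chunks of 4 elements above):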

    mask = (1 << chunk_size) - 1
    original[i] = data[offsets[i >> chunk_size] + (i & mask)]
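
As a quick check, the formula can be run against the trivial example. The sketch below assumes that `chunk_size` is the base-2 logarithm of the chunk length, since that is what the shift and the mask require; the function name `lookup` is hypothetical.

```python
source = [1, 2, 3, 4, 2, 3, 4, 5, 0, 1, 2, 3]
data = [0, 1, 2, 3, 4, 5]
offsets = [1, 2, 0]

chunk_size = 2                # assumption: log2 of the chunk length (4 elements)
mask = (1 << chunk_size) - 1  # 0b11, selects the position inside a chunk


def lookup(i):
    """Return the original value at index i using only the two tables."""
    return data[offsets[i >> chunk_size] + (i & mask)]


restored = [lookup(i) for i in range(len(source))]
print(restored == source)  # True
```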

Since the offsets table is itself often quite repetitive for real data, we can apply the compression a second time, this time to the offsets table.

The complete algorithm optimizes the chunk sizes for both arrays in order to get the lowest total data size. Given the chunk sizes cs_data and cs_offsets (expressed as base-2 logarithms of the chunk lengths, as the shifts below require):

  1. We compute the corresponding masks:

    • mask_data = (1 << cs_data) - 1 and

    • mask_offsets = (1 << cs_offsets) - 1.

  2. We can retrieve the original value at index i with the following formula, where offsets1 is the compressed offsets table and offsets2 is the index table into it:

    data[
        offsets1[
            offsets2[i >> (cs_data + cs_offsets)] +
            ((i >> cs_data) & mask_offsets)
        ] +
        (i & mask_data)
    ];
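
The two-stage lookup can be exercised end to end with a small Python sketch. Everything here is an assumption made for illustration: the helper `compress` is hypothetical and deliberately naive (it keeps chunks in their original order instead of reordering them), the chunk sizes are fixed by hand rather than optimized, and the input length is chosen to be a multiple of 2^(cs_data + cs_offsets) so that no padding is needed.

```python
def compress(values, log2_len):
    """Naive one-level compression that keeps chunks in their original order.

    Each chunk reuses an earlier occurrence in the table when one exists,
    otherwise it is appended after the longest suffix/prefix overlap.
    Returns (table, offsets).
    """
    size = 1 << log2_len
    table, offsets = [], []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        pos = next((p for p in range(len(table) - size + 1)
                    if table[p:p + size] == chunk), None)
        if pos is None:
            n = next((n for n in range(min(len(table), size), 0, -1)
                      if table[-n:] == chunk[:n]), 0)
            pos = len(table) - n
            table.extend(chunk[n:])
        offsets.append(pos)
    return table, offsets


cs_data, cs_offsets = 2, 1  # log2 chunk lengths: 4 values, 2 offsets
mask_data = (1 << cs_data) - 1
mask_offsets = (1 << cs_offsets) - 1

source = [7, 7, 7, 7, 1, 2, 3, 4, 7, 7, 7, 7, 1, 2, 3, 4]
data, flat_offsets = compress(source, cs_data)
offsets1, offsets2 = compress(flat_offsets, cs_offsets)


def lookup(i):
    """Two-stage lookup mirroring the formula above."""
    return data[
        offsets1[
            offsets2[i >> (cs_data + cs_offsets)] +
            ((i >> cs_data) & mask_offsets)
        ] +
        (i & mask_data)
    ]


print(data, offsets1, offsets2)
print(all(lookup(i) == source[i] for i in range(len(source))))  # True
```

Here the first pass produces the data table and a flat offsets table; the second pass treats that flat offsets table as input and yields `offsets1` (its compressed form) and `offsets2` (the index into it).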

Notes

This work took inspiration from “Fast character case conversion… or how to really compress sparse arrays” by Alexander Pankratov.

Metadata

Version

0.1.0.0

Platforms (77)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows