Description

Datasets and Basic Statistics for Symbolic Data Analysis.

Collects a diverse range of symbolic data and offers a comprehensive set of functions that facilitate the conversion of traditional data into the symbolic data format.

dataSDA

Datasets and Basic Statistics for Symbolic Data Analysis

Overview

dataSDA collects a diverse range of symbolic datasets and provides functions for converting traditional data into the symbolic data format. It supports reading, writing, and converting symbolic data across formats, and it computes descriptive statistics for symbolic variables.

Installation

From GitHub

# install.packages("devtools")
devtools::install_github("hanmingwu1103/dataSDA")

From source

Download the latest release from the Releases page, then:

# Source package (all platforms)
install.packages("dataSDA_0.2.5.tar.gz", repos = NULL, type = "source")

# Binary package (Windows)
install.packages("dataSDA_0.2.5.zip", repos = NULL, type = "win.binary")

Features

Descriptive Statistics

Interval-valued data (int_*)

Compute mean, variance, covariance, and correlation for interval-valued data with 8 methods: CM, VM, QM, SE, FV, EJD, GQ, SPT.

library(dataSDA)
data(mushroom.int)

# Mean of a single interval-valued variable
int_mean(mushroom.int, var_name = "Pileus.Cap.Width")

# Variance of two variables under three methods
int_var(mushroom.int, var_name = c("Stipe.Length", "Stipe.Thickness"),
        method = c("CM", "FV", "EJD"))

# Covariance of one variable with two others under five methods
int_cov(mushroom.int, var_name1 = "Pileus.Cap.Width",
        var_name2 = c("Stipe.Length", "Stipe.Thickness"),
        method = c("CM", "VM", "EJD", "GQ", "SPT"))

# Correlation between two variables with method CM
int_cor(mushroom.int, var_name1 = "Pileus.Cap.Width",
        var_name2 = "Stipe.Length", method = "CM")
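
As a rough illustration of what a centers-style method computes, the base R sketch below reduces each interval to its midpoint and applies the classical statistic. It assumes that CM stands for the textbook centers method. The toy data and the helper names `cm_mean` and `cm_var` are hypothetical and are not part of dataSDA, whose implementation may differ (for example in the variance denominator).

```r
# Toy interval-valued variable stored as lower/upper bounds (made-up numbers).
toy <- data.frame(
  lower = c(1, 2, 4),
  upper = c(3, 6, 8)
)

# Centers method (assumed definition): replace each interval by its
# midpoint, then apply the classical statistic to the midpoints.
cm_mean <- function(lower, upper) {
  mean((lower + upper) / 2)
}

cm_var <- function(lower, upper) {
  # stats::var uses the n - 1 denominator; a package may use n instead.
  var((lower + upper) / 2)
}

cm_mean(toy$lower, toy$upper)
cm_var(toy$lower, toy$upper)
```

Because only midpoints enter the calculation, this approach ignores the width of the intervals, which is one reason several alternative methods exist.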

Histogram-valued data (hist_*)

Compute the mean, variance, covariance, and correlation of histogram-valued data with the BG and L2W methods; covariance and correlation also support the BD and B methods.

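library(dataSDA)     # provides the hist_* functions
library(HistDAWass)  # provides the BLOOD dataset

# Mean and variance of one histogram-valued variable
hist_mean(HistDAWass::BLOOD, var_name = "Cholesterol", method = "BG")
hist_var(HistDAWass::BLOOD, var_name = "Cholesterol", method = "L2W")

# Covariance and correlation between two histogram-valued variables
hist_cov(HistDAWass::BLOOD, var_name1 = "Cholesterol",
         var_name2 = "Hemoglobin", method = "BD")
hist_cor(HistDAWass::BLOOD, var_name1 = "Cholesterol",
         var_name2 = "Hemoglobin", method = "BG")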
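
To make the idea of a histogram-valued mean concrete, here is a minimal base R sketch. It assumes the common definition in which each bin contributes its midpoint weighted by its relative frequency, and the sample mean averages those per-observation means. The toy histograms and the helper names `hist_obs_mean` and `toy_hist_mean` are hypothetical; this is not the dataSDA implementation of any particular method.

```r
# Each observation is a histogram: bin breaks plus bin probabilities (made-up).
hists <- list(
  list(breaks = c(0, 10, 20), probs = c(0.5, 0.5)),
  list(breaks = c(10, 20, 30), probs = c(0.25, 0.75))
)

# Mean of one histogram: probability-weighted average of the bin midpoints.
hist_obs_mean <- function(h) {
  n <- length(h$breaks)
  mids <- (h$breaks[-n] + h$breaks[-1]) / 2
  sum(h$probs * mids)
}

# Sample mean of the histogram-valued variable: average over observations.
toy_hist_mean <- function(hs) {
  mean(vapply(hs, hist_obs_mean, numeric(1)))
}

toy_hist_mean(hists)
```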

Data Format Conversion

Interval format conversions

Function                   Description
int_detect_format          Detect the format of an interval-valued dataset
int_convert_format         Convert between interval formats
int_list_conversions       List all available format conversions
to_all_interval_formats    Convert intervals to all supported formats at once

Other conversion functions

Function                 Description
RSDA_format              Convert conventional data to RSDA format
set_variable_format      One-hot encode set variables for RSDA format
aggregate_to_symbolic    Convert traditional data to symbolic data format
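
The general idea behind turning traditional data into interval-valued data can be sketched in base R: group the classical observations and summarise each group by its minimum and maximum. The helper `to_intervals` and its output layout are hypothetical and only illustrate the concept; they do not reproduce the signature or output of `aggregate_to_symbolic`.

```r
# Aggregate classical observations into one interval [min, max] per group.
# Uses the built-in iris data; the output layout is illustrative only.
to_intervals <- function(values, groups) {
  lower <- tapply(values, groups, min)
  upper <- tapply(values, groups, max)
  data.frame(
    group = names(lower),
    lower = as.numeric(lower),
    upper = as.numeric(upper)
  )
}

sepal_int <- to_intervals(iris$Sepal.Length, iris$Species)
sepal_int
```

Each row of `sepal_int` then describes a species by the range of its sepal lengths rather than by individual measurements.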

Interval Geometry

Function           Description
int_width          Width of each interval
int_radius         Radius of each interval
int_center         Center point of each interval
int_midrange       Half-range of each interval
int_overlap        Overlap measure between two interval variables
int_containment    Check if one interval contains another
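
The basic geometric quantities in this table have simple closed forms. The base R sketch below computes them for toy intervals under the usual definitions (width = upper - lower, radius = width / 2, center = midpoint). The variable names and the helper `contains` are hypothetical and do not reflect the dataSDA function signatures.

```r
# Three toy intervals [lower, upper] (made-up numbers).
lower <- c(1, 2, 5)
upper <- c(4, 3, 9)

width  <- upper - lower        # full length of each interval
radius <- width / 2            # half of the width
center <- (lower + upper) / 2  # midpoint

# TRUE when interval [a_lo, a_hi] fully contains [b_lo, b_hi].
contains <- function(a_lo, a_hi, b_lo, b_hi) {
  a_lo <= b_lo & b_hi <= a_hi
}

contains(lower[1], upper[1], lower[2], upper[2])  # [1, 4] versus [2, 3]
```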

Interval Position and Scale

Function        Description
int_median      Median of interval data
int_quantile    Quantiles of interval data
int_range       Range of interval data
int_iqr         Interquartile range
int_mad         Median absolute deviation
int_mode        Mode of interval data

Interval Shape

Function          Description
int_skewness      Skewness of interval data
int_kurtosis      Kurtosis of interval data
int_symmetry      Symmetry coefficient
int_tailedness    Tailedness measure

Interval Distance and Similarity

Function                   Description
int_dist                   Distance measures (GD, IY, L1, L2, CB, HD, EHD, WD, etc.)
int_jaccard                Jaccard similarity coefficient
int_dice                   Dice similarity coefficient
int_cosine                 Cosine similarity
int_overlap_coefficient    Overlap coefficient
int_tanimoto               Tanimoto coefficient
int_similarity_matrix      Pairwise similarity matrix
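
Two of the measures listed above have well-known forms for a pair of closed intervals, sketched here in base R. The Jaccard similarity is taken as overlap length divided by union length, and the Hausdorff distance as the larger of the two endpoint differences; that HD abbreviates Hausdorff is an assumption. The helper names are hypothetical, and the dataSDA functions operate on whole datasets and may define these measures differently.

```r
# Length of the overlap between [a_lo, a_hi] and [b_lo, b_hi] (0 if disjoint).
overlap_len <- function(a_lo, a_hi, b_lo, b_hi) {
  pmax(0, pmin(a_hi, b_hi) - pmax(a_lo, b_lo))
}

# Jaccard similarity: overlap length divided by the length of the union.
# Note: two zero-width intervals give 0 / 0 = NaN in this sketch.
jaccard_int <- function(a_lo, a_hi, b_lo, b_hi) {
  inter <- overlap_len(a_lo, a_hi, b_lo, b_hi)
  union_len <- (a_hi - a_lo) + (b_hi - b_lo) - inter
  inter / union_len
}

# Hausdorff distance between two closed intervals.
hausdorff_int <- function(a_lo, a_hi, b_lo, b_hi) {
  pmax(abs(a_lo - b_lo), abs(a_hi - b_hi))
}

jaccard_int(1, 4, 2, 6)    # overlap is [2, 4], union is [1, 6]
hausdorff_int(1, 4, 2, 6)
```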

Interval Robust Statistics

Function               Description
int_trimmed_mean       Trimmed mean
int_winsorized_mean    Winsorized mean
int_trimmed_var        Trimmed variance
int_winsorized_var     Winsorized variance

Interval Uncertainty and Variability

Function                   Description
int_entropy                Shannon entropy
int_cv                     Coefficient of variation
int_dispersion             Dispersion index
int_imprecision            Imprecision based on interval width
int_granularity            Variability in interval sizes
int_uniformity             Uniformity of interval widths
int_information_content    Normalized entropy

Utilities

Function                 Description
clean_colnames           Clean column names of a data frame
read_symbolic_csv        Read symbolic data from CSV file
write_symbolic_csv       Write symbolic data to CSV file
search_data              Search available datasets by keyword or type
aggregate_to_symbolic    Convert traditional data to symbolic data format

Datasets

The package includes 114 built-in datasets for symbolic data analysis:

Interval-valued datasets (53 datasets, .int)

abalone.int, acid_rain.int, age_cholesterol_weight.int, baseball.int, bats.int, blood_pressure.int, car.int, car_models.int, cardiological.int, cars.int, china_temp.int, china_temp_monthly.int, credit_card.int, ecoli_routes.int, employment.int, finance.int, freshwater_fish.int, fungi.int, genome_abundances.int, hdi_gender.int, horses.int, iris.int, judge1.int, judge2.int, judge3.int, lackinfo.int, lisbon_air_quality.int, loans_by_purpose.int, loans_by_risk.int, loans_by_risk_quantile.int, lynne1.int, mushroom.int, nycflights.int, ohtemp.int, oils.int, polish_voivodships.int, profession.int, prostate.int, soccer_bivar.int, synthetic_clusters.int, teams.int, temperature_city.int, tennis.int, trivial_intervals.int, uscrime.int, utsnow.int, veterinary.int, video1.int, video2.int, video3.int, water_flow.int, wine.int, world_cup.int

Histogram-valued datasets (25 datasets, .hist)

age_pyramids.hist, airline_flights.hist, bird_color_taxonomy.hist, blood.hist, china_climate_month.hist, china_climate_season.hist, cholesterol.hist, county_income_gender.hist, cover_types.hist, exchange_rate_returns.hist, flights_detail.hist, french_agriculture.hist, glucose.hist, hardwood.hist, hematocrit.hist, hematocrit_hemoglobin.hist, hemoglobin.hist, hierarchy.hist, hospital.hist, iris_species.hist, lung_cancer.hist, ozone.hist, simulated.hist, state_income.hist, weight_age.hist

Mixed symbolic datasets (11 datasets, .mix)

bird.mix, bird_species.mix, bird_species_extended.mix, census.mix, environment.mix, health_insurance.mix, joggers.mix, mtcars.mix, mushroom_fuzzy.mix, polish_cars.mix, town_services.mix

Interval time series datasets (9 datasets, .its)

crude_oil_wti.its, djia.its, euro_usd.its, ibovespa.its, irish_wind.its, merval.its, petrobras.its, shanghai_stock.its, sp500.its

Modal-valued datasets (7 datasets, .modal)

airline_flights2.modal, crime.modal, crime2.modal, fuel_consumption.modal, health_insurance2.modal, occupations.modal, occupations2.modal

Distribution-valued datasets (3 datasets, .distr)

energy_consumption.distr, energy_usage.distr, household_characteristics.distr

iGAP format datasets (2 datasets, .iGAP)

abalone.iGAP, face.iGAP

Other datasets

bank_rates, hierarchy, mushroom.int.mm
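
Since the suffix of each dataset name encodes its type, the lists above can be filtered with plain string matching. The sketch below uses only base R. The helper `by_suffix` is hypothetical, and the optional part relies on the standard `utils::data(package = ...)` listing, whose `Item` entries may carry extra text for datasets stored in shared files.

```r
# A few dataset names copied from the lists above.
dataset_names <- c("mushroom.int", "blood.hist", "bird.mix",
                   "sp500.its", "crime.modal", "face.iGAP")

# Keep only the names that end with the given suffix, e.g. ".int".
by_suffix <- function(x, suffix) {
  x[endsWith(x, suffix)]
}

by_suffix(dataset_names, ".int")

# With the package installed, the full list can be read from its metadata.
if (requireNamespace("dataSDA", quietly = TRUE)) {
  all_names <- utils::data(package = "dataSDA")$results[, "Item"]
  by_suffix(all_names, ".hist")
}
```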

Authors

  • Po-Wei Chen (Author), Chun-houh Chen (Author)
  • Han-Ming Wu (Creator, Maintainer)

License

GPL (>= 2)

Metadata

Version

0.2.5

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows