MyNixOS website logo
Description

Datasets with Strong and Spurious Correlations.

Provides datasets from Vigen (2015) <https://web.archive.org/web/20230607181247/https%3A/tylervigen.com/spurious-correlations> rescued from the Internet Wayback Machine. These should be preserved for statistics introductory courses as these make it very clear that correlation is not causation.

Spurious correlations

The goal of spuriouscorrelations is to keep alive the amazing examples from Tyler Vigen. Unfortunately, as of 2023-10-09, the website is down as my students noticed. Therefore, I decided to use the snapshot from the Internet Wayback Machine to save the datasets from 2023-06-07.

Installation

You can install the development version of spuriouscorrelations like so:

remotes::install_github("pachadotdev/spuriouscorrelations")

Example

This is a basic example which shows you how to plot a spurious correlation:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(spuriouscorrelations)

unique(spurious_correlations$var2_short)
#>  [1] US spending on science           Nicholas Cage                   
#>  [3] Cheese consumed                  Age of Miss America             
#>  [5] Arcade revenue                   Space launches                  
#>  [7] Mozzarella cheese consumption    Kentucky marriages              
#>  [9] US crude oil imports from Norway Chicken consumption             
#> [11] Nuclear power plants             Japanese cars sold              
#> [13] Spelling bee letters             Uranium stored                  
#> 14 Levels: Age of Miss America Arcade revenue ... US spending on science

nicholas_cage <- spurious_correlations %>%
  filter(var2_short == "Nicholas Cage")

ggplot(nicholas_cage) +
  geom_point(aes(x = var1_value, y = var2_value, color = year), size = 4) +
  theme_minimal() +
  labs(
    x = "Drownings by falling into a pool",
    y = "Nicholas Cage movies",
    title = "Drownings by Falling into a Pool vs Nicholas Cage Movies"
  )

The correlation is:

cor(nicholas_cage$var1_value, nicholas_cage$var2_value)
#> [1] 0.6660043

Now let’s make it double-axis:

library(tidyr)

nicholas_cage_long <- nicholas_cage %>%
  select(year, var1_value, var2_value) %>%
  pivot_longer(
    cols = c(var1_value, var2_value),
    names_to = "variable",
    values_to = "value"
  ) %>%
  # standardize the values
  group_by(variable) %>%
  mutate(
    value = (value - mean(value)) / sd(value),
    variable = case_when(
      variable == "var1_value" ~ "Drownings by falling into a pool",
      variable == "var2_value" ~ "Nicholas Cage movies"
    )
  )

# make a double y axis plot with year on the x axis
ggplot(nicholas_cage_long, aes(
  x = year, y = value, color = variable,
  group = variable
)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(
    x = "Year",
    y = "Standardized values",
    title = "Drownings by Falling into a Pool vs Nicholas Cage Movies"
  )
Metadata

Version

0.1

License

Unknown

Platforms (76)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows