Description

The Data Defect Index for Samples that May not be IID.

Implements Meng's data defect index (ddi), which represents the degree of sample bias relative to an iid sample. The data defect correlation (ddc) represents the correlation between the outcome of interest and the selection into the sample; when the sample selection is independent across the population, the ddc is zero. Details are in Meng (2018) <doi:10.1214/18-AOAS1161SF>, "Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election." Survey estimates from the Cooperative Congressional Election Study (CCES) are included to replicate the article's results.

README.md

d.d.i. (Data Defect Index) for non i.i.d. Samples

A simple set of functions to implement the Data Defect Index (d.d.i.), described in:

Xiao-Li Meng. 2018. “Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Presidential Election.” Annals of Applied Statistics 12:2, 685–726. doi:10.1214/18-AOAS1161SF.

(ungated version)

Install

# install.packages("remotes")
remotes::install_github("kuriwaki/ddi")

Usage

Given a data frame with columns for each group’s estimates and the components of the formula, ddc computes the data defect correlation (ρ).

An example dataset from the 2016 US Presidential Election is included (this also serves as the replication dataset for the AOAS article). The dataset compares official election results with estimates from the Cooperative Congressional Election Study (CCES), the largest political survey in the US. The CCES micro-data is fully public and accessible at its website. Here, we provide state-level estimates, which are documented in help(g2016).

library(ddi)
library(tidyverse)

data(g2016)
g2016

## # A tibble: 51 x 10
##    state st    pct_djt_voters cces_pct_djt_vv cces_pct_djtrun… votes_djt
##    <chr> <chr>          <dbl>           <dbl>            <dbl>     <dbl>
##  1 Alab… AL            0.621           0.408            0.428    1318255
##  2 Alas… AK            0.513           0.306            0.319     163387
##  3 Ariz… AZ            0.487           0.423            0.445    1252401
##  4 Arka… AR            0.606           0.416            0.434     684872
##  5 Cali… CA            0.316           0.285            0.305    4483810
##  6 Colo… CO            0.433           0.350            0.371    1202484
##  7 Conn… CT            0.409           0.294            0.318     673215
##  8 Dela… DE            0.419           0.329            0.349     185127
##  9 Dist… DC            0.0409          0.0575           0.0690     12723
## 10 Flor… FL            0.490           0.403            0.422    4617886
## # … with 41 more rows, and 4 more variables: tot_votes <dbl>, cces_n_vv <dbl>,
## #   vap <dbl>, vep <dbl>

We can compute the data defect correlation just by plugging in the four numbers it needs. For example:

ddc(mu = 62984824/136639786, muhat = 12284/35829, N = 136639786, n = 35829)

## [1] -0.003837163

and the d.d.i. is the square of that, about 0.0000147.
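
For intuition, the same value can be reproduced by hand from the identity in Meng (2018), which factors the estimation error as μ̂ − μ = ρ × sqrt((N − n)/n) × σ, with σ = sqrt(μ(1 − μ)) for a binary outcome. The following is a minimal base-R sketch under that assumption. The name ddc_by_hand is a hypothetical helper written for illustration; it is not part of the package and is not claimed to be how ddc is implemented internally.

```r
# Hypothetical helper for illustration only; not part of the ddi package.
# Solves Meng's identity  muhat - mu = rho * sqrt((N - n) / n) * sigma  for rho,
# assuming a binary outcome so that sigma = sqrt(mu * (1 - mu)).
ddc_by_hand <- function(mu, muhat, N, n) {
  sigma <- sqrt(mu * (1 - mu))
  (muhat - mu) / (sqrt((N - n) / n) * sigma)
}

rho <- ddc_by_hand(mu = 62984824 / 136639786, muhat = 12284 / 35829,
                   N = 136639786, n = 35829)
ddi <- rho^2  # the d.d.i. is the squared d.d.c.

print(rho)
print(ddi)
```

Under these assumptions, rho should closely match the ddc output shown above, and ddi should be close to 0.0000147.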

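The totals passed to ddc above come from summing the state-level columns (the summed share column cces_pct_djt_vv is not itself a meaningful quantity):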

select(g2016, cces_pct_djt_vv, cces_n_vv, tot_votes, votes_djt) %>%
  summarize_all(sum)

## # A tibble: 1 x 4
##   cces_pct_djt_vv cces_n_vv tot_votes votes_djt
##             <dbl>     <dbl>     <dbl>     <dbl>
## 1            17.5     35829 136639786  62984824

where

cces_totdjt_vv: The count of Trump voters (among validated voters)
cces_n_vv: The count of CCES validated voters (sample size)
votes_djt: Total votes for Trump
tot_votes: Total turnout
cces_pct_djt_vv: Estimated vote share, cces_totdjt_vv / cces_n_vv
pct_djt_voters: Actual vote share, votes_djt / tot_votes

The function also takes vectors as inputs:

with(g2016, ddc(mu = pct_djt_voters,
                muhat = cces_pct_djt_vv, 
                N = tot_votes, 
                n = cces_n_vv))

##  [1] -0.0059541279 -0.0062341071 -0.0023488019 -0.0061097707 -0.0009864919
##  [6] -0.0025746344 -0.0035362241 -0.0033951165  0.0014015382 -0.0029747918
## [11] -0.0038228152 -0.0001757426 -0.0073716139 -0.0036437192 -0.0069956521
## [16] -0.0058255411 -0.0059093759 -0.0057837854 -0.0040533230 -0.0047893714
## [21] -0.0024905368 -0.0028280876 -0.0050296619 -0.0043292576 -0.0056626724
## [26] -0.0069305025 -0.0046563153 -0.0075840944 -0.0047785897 -0.0037497506
## [31] -0.0028289070 -0.0025619899 -0.0031936586 -0.0051968951 -0.0078308914
## [36] -0.0057088185 -0.0065654840 -0.0030642004 -0.0039137353 -0.0039907269
## [41] -0.0040871158 -0.0069019981 -0.0050741833 -0.0044884762 -0.0059634270
## [46] -0.0034491625 -0.0040918085 -0.0024121681 -0.0075404659 -0.0051378753
## [51] -0.0086086072

so it can be used inside a tibble pipeline as well:

transmute(g2016, st,
          ddc = ddc(mu = pct_djt_voters, 
                    muhat = cces_pct_djt_vv, 
                    N = tot_votes,
                    n = cces_n_vv))

## # A tibble: 51 x 2
##    st          ddc
##    <chr>     <dbl>
##  1 AL    -0.00595 
##  2 AK    -0.00623 
##  3 AZ    -0.00235 
##  4 AR    -0.00611 
##  5 CA    -0.000986
##  6 CO    -0.00257 
##  7 CT    -0.00354 
##  8 DE    -0.00340 
##  9 DC     0.00140 
## 10 FL    -0.00297 
## # … with 41 more rows

A negative ρ means ρ = Cor(Respond, 1(Trump Supporter)) < 0, i.e. Trump supporters were less likely to respond.
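
To make the sign interpretation concrete, here is a small toy population in which supporters respond at a lower rate than non-supporters. All numbers are made up for illustration and are not CCES data. The sketch computes ρ directly as the population correlation between the response indicator and the outcome, and again from the four summary quantities that ddc takes, assuming Meng's identity for a binary outcome.

```r
# Toy population (hypothetical numbers): 4,000 supporters and 6,000 non-supporters.
supporter <- rep(c(1, 0), times = c(4000, 6000))

# Supporters respond at 10%, non-supporters at 30% (deterministic, no sampling noise).
respond <- c(rep(c(1, 0), times = c(400, 3600)),
             rep(c(1, 0), times = c(1800, 4200)))

N     <- length(supporter)              # population size
n     <- sum(respond)                   # sample size
mu    <- mean(supporter)                # population share of supporters
muhat <- mean(supporter[respond == 1])  # share of supporters among respondents

# Direct definition: correlation between responding and the outcome.
rho_direct <- cor(respond, supporter)

# Same quantity recovered from the summary statistics alone.
rho_formula <- (muhat - mu) / (sqrt((N - n) / n) * sqrt(mu * (1 - mu)))

print(c(direct = rho_direct, formula = rho_formula))
```

Both values are negative and agree up to floating-point error, because supporters are under-represented among respondents.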
