Modeling of Ordinal Random Variables via Softmax Regression.
ohenery
Performs softmax regression for ordered outcomes under the Harville and Henery models.
-- Steven E. Pav, [email protected]
Installation
This package can be installed from CRAN, via drat, or from github:
# via CRAN:
install.packages("ohenery")
# via drat:
if (require(drat)) {
drat:::add("shabbychef")
install.packages("ohenery")
}
# get snapshot from github (may be buggy)
if (require(devtools)) {
install_github('shabbychef/ohenery')
}
What is it?
The softmax regression generalizes logistic regression, wherein only one of two possible outcomes is observed, to the case where one of many possible outcomes is observed. As in logistic regression, one models the log odds of each outcome as a linear function of some independent variables. Moreover, a softmax regression allows one to model data from events where the number of possible outcomes differs.
Some examples where one might apply a softmax regression:
- Model which among five sprinters takes first place in a race, having observed characteristics of the runners over many races.
- Model which film is awarded the Academy Award for Best Picture, based on genre information, and co-nomination information. (The number of nominees has varied over the years.)
- Model which major city in the U.S. experiences the most rain in a given calendar year.
Note that in the examples illustrated above, one might be better served by modeling some continuous outcome, instead of modeling the winner. For example, one might model the speed of each racer, or the number of votes each film garnered, or the total amount of rain each city experienced. Discarding this information in favor of modeling the binary outcome is likely to cause a loss of statistical power, and is generally discouraged. However, in some cases the continuous outcome is not observed, as in the case of the Best Picture awards, or in horse racing where finishing times are often not available. Softmax regression can be used in these cases.
The softmax regression can be further generalized to model the case where one observes place information about participants. For example, one might observe which of first through fifth place each sprinter claims in each race.
Or one might only observe some limited information, like the Gold, Silver and Bronze medal winners in an Olympic event.
There is more than one way to generalize the softmax to deal with ranked outcomes like these:
- The Harville model, where the ratio of probabilities of two participants taking first place is equal to the ratio of the conditional probabilities that they take second place, conditional on neither of them taking first place. Effectively in the Harville model once one has observed the first place finisher, the probabilities for second place simply rescale. I believe that if finishing times for racers are exponentially distributed, then their finishing places are distributed under a Harville model.
- The Henery model generalizes the Harville model to the case where the ratio of probabilities of two participants taking first place is equal to the ratio of the conditional probabilities that they take second place, conditional on neither of them taking first place, raised to some power, called gamma. The Harville model is the Henery model with all gamma constants equal to one. I believe that if finishing times for racers are (log) normally distributed, then their finishing places are nearly distributed like the Henery model.
This package supports fitting softmax regressions under both models.
Basic Usage
Best Picture
Here we use softmax regression to model the Best Picture winner odds as linear in some features: whether the film also was nominated for Best Director, Best Actor or Actress, or Best Film Editing, as well as genre designations for Drama, Romance and Comedy. We only observe the winner of the Best Picture award, and not runners-up, so we weight the first place finisher with a one, and all others with a zero. This seems odd, but it ensures that the regression does not try to compare model differences between the runners-up. We find the strongest absolute effect on odds from the Best Director and Film Editing co-nomination variables.
library(ohenery)
library(dplyr)
library(magrittr)
data(best_picture)
best_picture %<>%
mutate(place=ifelse(winner,1,2)) %>%
mutate(weight=ifelse(winner,1,0))
fmla <- place ~ nominated_for_BestDirector + nominated_for_BestActor + nominated_for_BestActress + nominated_for_BestFilmEditing + Drama + Romance + Comedy
osmod <- harsm(fmla,data=best_picture,group=year,weights=weight)
print(osmod)
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 59 iterations
## Return code 0: successful convergence
## Log-Likelihood: -92
## 7 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## nominated_for_BestDirectorTRUE 3.294 0.850 3.88 0.00011 ***
## nominated_for_BestActorTRUE 1.003 0.316 3.18 0.00149 **
## nominated_for_BestActressTRUE 0.048 0.348 0.14 0.89048
## nominated_for_BestFilmEditingTRUE 1.949 0.399 4.88 1e-06 ***
## Drama -0.659 0.605 -1.09 0.27580
## Romance 0.521 0.974 0.53 0.59291
## Comedy 1.317 1.013 1.30 0.19333
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
## R2: 0.77
## --------------------------------------------
Prediction
Here we use the predict
method to get back predictions under the Harville model produced above. Three different types of prediction are supported:
- Predictions of the odds in odds space, the 'eta'.
- Predictions of the probability of taking first place, the 'mu'.
- The expected rank under a Harville model.
We do not currently have the ability to compute the expected rank under the Henery model, as it is too computationally intensive. (File an issue if this is important.)
prd <- best_picture %>%
mutate(prd_erank=as.numeric(predict(osmod,newdata=.,group=year,type='erank',na.action=na.pass))) %>%
mutate(prd_eta=as.numeric(predict(osmod,newdata=.,group=year,type='eta'))) %>%
mutate(prd_mu=as.numeric(predict(osmod,newdata=.,group=year,type='mu')))
Horse Racing
The package is bundled with a dataset of three weeks of thoroughbred and harness race results from tracks around the world. First, let us compare the 'consensus odds', as computed from the win pool size, with the probability of winning a race. This is usually plotted as the empirical win probability over horses grouped by their consensus win probability.
We present this plot below. Clearly visible are the 'longshot bias' and 'sureshot bias' (or favorite bias). The longshot bias is the effect where longshots are overbet, resulting in elevated consensus odds. This appears as the point off the line in the lower left. One can think of these as points which were moved to the right of the plot from a point on the diagonal by overbetting. The sureshot bias is the complementary effect where favorites to win are underbet, effectively shifting the points to the left.
library(ohenery)
library(dplyr)
library(magrittr)
library(ggplot2)
data(race_data)
ph <- race_data %>%
group_by(EventId) %>%
mutate(mu0=WN_pool / sum(WN_pool)) %>%
ungroup() %>%
mutate(mubkt=cut(mu0,c(0,10^seq(-2,0,length.out=14)),include.lowest=TRUE)) %>%
mutate(tookfirst=as.numeric(coalesce(Finish==1,FALSE))) %>%
group_by(mubkt) %>%
summarize(winprop=mean(tookfirst),
medmu=median(mu0),
nrace=length(unique(EventId)),
nhorse=n()) %>%
ungroup() %>%
ggplot(aes(medmu,winprop)) +
geom_point(aes(size=nhorse)) +
geom_abline(intercept=0,slope=1) +
scale_x_log10(labels=scales::percent) +
scale_y_log10(labels=scales::percent) +
labs(x='consensus odds (from win pool)',
y='empirical win probability',
title='efficiency of the win pool')
print(ph)
Here we use softmax regression under the Harville model to capture the longshot and sureshot biases. Our model computes consensus odds as the inverse softmax function of the consensus probabilities, which are constructed from the win pool. We define a factor variable that captures longshot, sureshot, and 'vanilla' bets based on the win pool-implied probabilities. Here we are only concerned with the win probabilities, so we adapt a Harville model and weight only the winning finishers.
# because of na.action, make ties for fourth
df <- race_data %>%
mutate(Outcome=coalesce(Finish,4L))
# create consensus odds eta0
df %<>%
mutate(weights=coalesce(as.numeric(Finish==1),0) ) %>%
group_by(EventId) %>%
mutate(big_field=(n() >= 6)) %>%
mutate(mu0=WN_pool / sum(WN_pool)) %>%
mutate(eta0=inv_smax(mu0)) %>%
ungroup() %>%
dplyr::filter(big_field) %>%
dplyr::filter(!is.na(eta0)) %>%
mutate(bettype=factor(case_when(mu0 < 0.025 ~ 'LONGSHOT',
mu0 > 0.50 ~ 'SURESHOT',
TRUE ~ 'VANILLA')))
# Harville Model with market efficiency
effmod <- harsm(Outcome ~ eta0:bettype,data=df,group=EventId,weights=weights)
print(effmod)
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 44 iterations
## Return code 0: successful convergence
## Log-Likelihood: -6862
## 3 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## eta0:bettypeLONGSHOT 1.2602 0.0977 12.9 <2e-16 ***
## eta0:bettypeSURESHOT 1.1896 0.0417 28.5 <2e-16 ***
## eta0:bettypeVANILLA 1.1158 0.0257 43.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
## R2: 0.53
## --------------------------------------------
We see that in this model the beta coefficient for consensus odds is around 1.12 for 'vanilla' bets, slightly higher for sure bets, and much higher for longshots. The interpretation is that long shots are relatively overbet, but that otherwise bets are nearly efficient.
One would like to make a correction to inefficiencies in the win pool by generalizing this idea to multiple cut points along the consensus probabilities. However, the softmax model works by tweaking the consensus odds; the odds translate into win probabilities in a way that depends on the other participants, and so there is no one curve.
Offsets
The print display of the Harville softmax regression includes an 'R-squared' and sometimes a 'delta R-squared'. The latter is only reported when an offset is used in the model. The R-squared is the improvement in spread in the predicted ranks from the model compared to the null model which assumes all log odds are equal. When an offset is given, the R-squared includes the offset, but a 'delta' R-squared is reported which gives the improvement in spread of predicted ranks over the model based on the offset:
# Harville Model with offset
offmod <- harsm(Outcome ~ offset(eta0) + bettype,data=df,group=EventId,weights=weights)
print(offmod)
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 25 iterations
## Return code 0: successful convergence
## Log-Likelihood: -6872
## 2 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## bettypeSURESHOT 0.816 0.150 5.43 5.5e-08 ***
## bettypeVANILLA 0.437 0.126 3.46 0.00054 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
## R2: 0.1
## delR2: -0.86
## --------------------------------------------
Unfortunately, the R-squareds are hard to interpret, and can sometimes give negative values. The softmax regression is not guaranteed to improve the residual sum of squared errors in the rank.
Henery Model
Here we fit a Henery model to the horse race data. Note that in the data we only observe win, place, and show finishes, so we assign zero weight to all other finishers.
Again this is so the regression does not attempt to model distinctions between runners-up. Moreover, because of how na.action
works on the input data, the response variable has to be coalesced with tie-breakers.
# because of na.action, make ties for fourth
df <- race_data %>%
mutate(Outcome=coalesce(Finish,4L)) %>%
mutate(weights=coalesce(as.numeric(Finish<=3),0) ) %>% # w/p/s
group_by(EventId) %>%
mutate(big_field=(n() >= 5)) %>%
mutate(mu0=WN_pool / sum(WN_pool)) %>%
mutate(eta0=inv_smax(mu0)) %>%
ungroup() %>%
dplyr::filter(big_field) %>%
dplyr::filter(!is.na(eta0)) %>%
mutate(fac_age=cut(Age,c(0,3,5,7,Inf),include.lowest=TRUE)) %>%
mutate(bettype=factor(case_when(mu0 < 0.025 ~ 'LONGSHOT',
mu0 > 0.50 ~ 'SURESHOT',
TRUE ~ 'VANILLA')))
# Henery Model with ...
bigmod <- hensm(Outcome ~ eta0:bettype + fac_age,data=df,group=EventId,weights=weights,ngamma=3)
print(bigmod)
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 57 iterations
## Return code 0: successful convergence
## Log-Likelihood: -21720
## 8 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## fac_age(3,5] 0.2024 0.0697 2.90 0.0037 **
## fac_age(5,7] 0.1704 0.0794 2.14 0.0320 *
## fac_age(7,Inf] 0.1531 0.0848 1.80 0.0711 .
## eta0:bettypeLONGSHOT 1.5191 0.0641 23.70 <2e-16 ***
## eta0:bettypeSURESHOT 1.1096 0.0367 30.19 <2e-16 ***
## eta0:bettypeVANILLA 1.1107 0.0231 48.06 <2e-16 ***
## gamma2 0.7053 0.0215 32.84 <2e-16 ***
## gamma3 0.5274 0.0191 27.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
Note that the gamma coefficients are fit here as 0.71, 0.53. Values around 0.8 or smaller are typical. (In a future release it would probably make sense to make the gammas relative to the value 1, which would make it easier to reject the Harville model from the print summary.)
Diving
The package is bundled with a dataset of 100 years of Olympic Men's Platform Diving Records, originally sourced by Randi Griffin and delivered on kaggle.
Here we convert the medal records into finishing places of 1, 2, 3 and 4 (no medal), add weights for the fitting, make a factor variable for age, factor the NOC (country) of the athlete. Because Platform Diving is a subjective competition, based on scores from judges, we investigate whether there is a 'home field advantage' by creating a Boolean variable indicating whether the athlete is representing the host nation.
We then fit a Henery model to the data. Note that the gamma terms come out very close to one, indicating the Harville model would be sufficient. The home field advantage does not appear significant in this analysis.
library(forcats)
data(diving)
fitdat <- diving %>%
mutate(Finish=case_when(grepl('Gold',Medal) ~ 1,
grepl('Silver',Medal) ~ 2,
grepl('Bronze',Medal) ~ 3,
TRUE ~ 4)) %>%
mutate(weight=ifelse(Finish <= 3,1,0)) %>%
mutate(cut_age=cut(coalesce(Age,22.0),c(12,19.5,21.5,22.5,25.5,99),include.lowest=TRUE)) %>%
mutate(country=forcats::fct_relevel(forcats::fct_lump(factor(NOC),n=5),'Other')) %>%
mutate(home_advantage=NOC==HOST_NOC)
hensm(Finish ~ cut_age + country + home_advantage,data=fitdat,weights=weight,group=EventId,ngamma=3)
## --------------------------------------------
## Maximum Likelihood estimation
## BFGS maximization, 43 iterations
## Return code 0: successful convergence
## Log-Likelihood: -214
## 12 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## cut_age(19.5,21.5] 0.0303 0.4185 0.07 0.94227
## cut_age(21.5,22.5] -0.7276 0.5249 -1.39 0.16565
## cut_age(22.5,25.5] 0.0950 0.3790 0.25 0.80199
## cut_age(25.5,99] -0.1838 0.4111 -0.45 0.65474
## countryGBR -0.6729 0.8039 -0.84 0.40258
## countryGER 1.0776 0.4960 2.17 0.02981 *
## countryMEX 0.7159 0.4744 1.51 0.13126
## countrySWE 0.6207 0.5530 1.12 0.26172
## countryUSA 2.3201 0.4579 5.07 4.1e-07 ***
## home_advantageTRUE 0.5791 0.4112 1.41 0.15904
## gamma2 1.0054 0.2853 3.52 0.00042 ***
## gamma3 0.9674 0.2963 3.26 0.00109 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
Confirming Inference
The regression coefficients are fit by Maximum Likelihood Estimation, and the maxLik
package does the 'heavy lifting' of estimating the variance-covariance of the coefficients. However, it is not clear a priori that these error estimates are accurate, since the individual outcomes in a given event are not independent: it could be that the machinery for computing the Fisher Information matrix is inaccurate. In fact, the help from maxLik
warns that this is on the user. This is especially concerning when the logistic fits are used in situations where not all finishes are observed, and one uses zero weights to avoid affecting the fits.
Here we will try to establish empirically that the inference is actually correct.
Against Logistic Regression
As the Harville model generalizes logistic regression, we should be able to run a logistic regression problem through the harsm
function. Here we do that, testing data with exactly two participants per event. The key observation is that the usual logistic regression should be equivalent to the Harville form, but with independent variables equal to the difference in independent variables for the two participants. We find that the coefficients and variance-covariance matrix are nearly identical. We believe the differences are due to convergence criteria in the MLE fit.
library(dplyr)
nevent <- 10000
set.seed(1234)
adf <- data_frame(eventnum=floor(seq(1,nevent + 0.7,by=0.5))) %>%
mutate(x=rnorm(n()),
program_num=rep(c(1,2),nevent),
intercept=as.numeric(program_num==1),
eta=1.5 * x + 0.3 * intercept,
place=ohenery::rsm(eta,g=eventnum))
# Harville model
modh <- harsm(place ~ intercept + x,data=adf,group=eventnum)
# the collapsed data.frame for glm
ddf <- adf %>%
arrange(eventnum,program_num) %>%
group_by(eventnum) %>%
summarize(resu=as.numeric(first(place)==1),
delx=first(x) - last(x),
deli=first(intercept) - last(intercept)) %>%
ungroup()
# glm logistic fit
modg <- glm(resu ~ delx + 1,data=ddf,family=binomial(link='logit'))
all.equal(as.numeric(coef(modh)),as.numeric(coef(modg)),tolerance=1e-4)
## [1] TRUE
all.equal(as.numeric(vcov(modh)),as.numeric(vcov(modg)),tolerance=1e-4)
## [1] TRUE
print(coef(modh))
## intercept x
## 0.34 1.49
print(coef(modg))
## (Intercept) delx
## 0.34 1.49
print(vcov(modh))
## intercept x
## intercept 0.00071 0.00012
## x 0.00012 0.00091
print(vcov(modg))
## (Intercept) delx
## (Intercept) 0.00071 0.00012
## delx 0.00012 0.00091
Weighted Observations
We now examine whether inference is correct in the case when weights are used to denote that some finishes are not observed. The first release of this package defaulted to using 'normalized weights' in the Harville and Henery fits. I believe this leads to an inflated Fisher Information that then causes the estimated variance-covariance matrix to be too small. I am changing the default in version 0.2.0 to fix this issue.
Here I generate some data, and fit to the Harville model, using weights to signify that we have only observed three of the 8 finishing places in each event. I compute the the estimate minus the true value, divided by the estimated standard error. I repeat this process many times. Asymptotically in sample size, this computed value should be a standard normal. I check by taking the standard deviation, and Q-Q plotting against standard normal. The results are consistent with correct inference.
library(dplyr)
nevent <- 5000
# 3 of 8 observed values.
wlim <- 3
nfield <- 8
beta <- 2
library(future.apply)
plan(multiprocess,workers=7)
set.seed(1234)
zvals <- future_replicate(500,{
adf <- data_frame(eventnum=sort(rep(seq_len(nevent),nfield))) %>%
mutate(x=rnorm(n()),
program_num=rep(seq_len(nfield),nevent),
eta=beta * x,
place=ohenery::rsm(eta,g=eventnum),
wts=as.numeric(place <= wlim))
modo <- ohenery::harsm(place ~ x,data=adf,group=eventnum,weights=wts)
as.numeric(coef(modo) - beta) / as.numeric(sqrt(vcov(modo)))
})
plan(sequential)
# this should be near 1
sd(zvals)
## [1] 0.98
# QQ plot it
library(ggplot2)
ph <- data_frame(zvals=zvals) %>%
ggplot(aes(sample=zvals)) +
stat_qq(alpha=0.4) +
geom_abline(linetype=2,color='red') +
labs(title='putative z-values from Harville fits')
print(ph)