Alex D'Amour

November 25, 2014

**Why?**

- Stealth regularization (AIC, CV, LASSO)
- Legitimate regularization (nonparametric regression)
- Infer “relevant” parameters (vaguely causal?)

**This talk**: Find a model that provides the best predictive performance for our given sample size. Note that predictive performance includes estimation uncertainty, bias, and residual variation.

**Holy grail**: Eliminate high-dimensional nuisance without high-dimensional priors.

**Is the truth…**

- Finite dimensional and sparse? (Donoho et al. Compressed sensing, fundamentally parametric.)
- Infinite dimensional and dense? (Meng 2014, Everything is variation, fundamentally nonparametric.)

**This talk**: The latter. There may exist a sparse set of predictors, but there is no reason to believe that the predictors as collected define the proper basis for such sparsity.

**More reading**: Liu and Yang, 2009. “Parametric or nonparametric? A parametricness index for model selection.”

**Example**
Let \( X_i \) be a \( p \)-dimensional multivariate normal with covariance matrix \( \Sigma \) defined so that \( \Sigma_{k,l} = \rho^{|k-l|} \), \( 0 < \rho < 1 \).

Consider: \[ Y_i \sim X_{i,2} - \rho X_{i,1} + 0.2 X_{i,p} + \mathcal N(0,3). \]

“True” model includes covariates \( (1,2,p) \). But for any subset \( A \subset \{1,\cdots ,p\} \), \[ Y_i | X_{i,A} \sim \beta_A^{\top} X_{i,A} + \mathcal N(0, \sigma_A), \] because \( (Y_i,X_i) \) are jointly multivariate normal.

“Truth” only has special status because it has minimal residual variance.
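The submodel parameters can be written out explicitly. The following is a sketch under added assumptions: \( X_i \) has mean zero, \( b \) denotes the full coefficient vector of the generating model, \( \sigma^2 \) its noise variance, \( B \) the complement of \( A \), and the second argument of \( \mathcal N \) is read as a variance. The symbols \( b \), \( \sigma^2 \), and \( B \) are introduced here for illustration only.

\[ E[X_{i,B} \mid X_{i,A}] = \Sigma_{BA}\Sigma_{AA}^{-1} X_{i,A}, \qquad \beta_A = b_A + \Sigma_{AA}^{-1}\Sigma_{AB}\, b_B, \]

\[ \sigma_A = \sigma^2 + b_B^{\top}\left(\Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB}\right) b_B. \]

Since \( \Sigma \) is positive definite for \( 0 < \rho < 1 \), the second term is zero only when \( b_B = 0 \), that is, when \( A \) contains every covariate with a nonzero coefficient. Every other subset still yields a valid normal linear model, just with larger residual variance.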

Simulation: \( N = 100 \), \( p=25 \), \( \rho = 0.75 \).

For simplicity, consider only growing models \( A_k = \{1, \cdots, k\} \).
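A minimal sketch of this simulation setup, assuming NumPy, a mean-zero \( X_i \), and that \( \mathcal N(0,3) \) denotes noise with variance 3 (the slides do not say whether 3 is a variance or a standard deviation). The function names and the seed are illustrative choices, not part of the original talk. It compares the population residual variance of each growing model \( A_k \) with its in-sample counterpart.

```python
import numpy as np

def ar1_cov(p, rho):
    """Covariance with entries rho^|k-l|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def true_coefficients(p, rho):
    """Coefficients of Y = X_2 - rho*X_1 + 0.2*X_p (0-based indexing)."""
    b = np.zeros(p)
    b[0], b[1], b[p - 1] = -rho, 1.0, 0.2
    return b

def population_resid_var(cov, b, k, noise_var):
    """Residual variance of Y given the first k covariates.

    Noise variance plus the variance of the omitted signal that the
    first k covariates cannot explain (a Schur complement).
    """
    a, rest = np.arange(k), np.arange(k, len(b))
    if rest.size == 0:
        return noise_var
    s_aa = cov[np.ix_(a, a)]
    s_ra = cov[np.ix_(rest, a)]
    schur = cov[np.ix_(rest, rest)] - s_ra @ np.linalg.solve(s_aa, s_ra.T)
    return float(noise_var + b[rest] @ schur @ b[rest])

def sample_resid_var(x, y, k):
    """In-sample residual variance (RSS / n) of least squares on the first k columns."""
    coef, *_ = np.linalg.lstsq(x[:, :k], y, rcond=None)
    resid = y - x[:, :k] @ coef
    return float(resid @ resid) / len(y)

# Settings from the slides; noise_var = 3 and the seed are assumptions.
n, p, rho, noise_var = 100, 25, 0.75, 3.0
rng = np.random.default_rng(0)
cov = ar1_cov(p, rho)
b = true_coefficients(p, rho)
x = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = x @ b + rng.normal(0.0, np.sqrt(noise_var), size=n)

pop = [population_resid_var(cov, b, k, noise_var) for k in range(1, p + 1)]
emp = [sample_resid_var(x, y, k) for k in range(1, p + 1)]
```

The population curve `pop` only reaches the noise variance at \( k = p \), while the in-sample curve `emp` keeps shrinking with every added covariate, which is why residual variance alone cannot pick the model.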

**Is it a…**

- Inference problem? (LASSO)
- Decision problem? (Carvalho)

**This talk**: Design problem!

- Not a facet of the underlying system (so not inference).
- Done beforehand to define the problem, incorporating inferential constraints (so not decision).
- Which available conditional distribution can we reliably estimate?

**Estimands mean something.**

- \( \beta_1 \) is only meaningful in context of the rest of \( A \). Cross-model inference about \( \beta_1 \) is awkward, requires meaningless symmetries (Berk et al).

**Separation of selection and inference.**

- Arguably post-selection inference impossible without separation (Leeb and Pötscher).
- Where distributions exist, require strong assumptions, difficult hypotheses (Lockhart et al).

**Separation of selection and inference.**

- Garden of forking paths.
[M]odels become stochastic in an opaque way when their selection is affected by human intervention based on post-hoc considerations such as “in retrospect only one of these two variables should be in the model” or “it turns out the predictive benefit of this variable is too weak to warrant the cost of collecting it.” (Berk et al 2013).

**Wasserman's HARNESS**

- Response to Lockhart et al LASSO hypothesis testing paper.
- Randomly split data.
- Model selection with one half.
- Conditional on selected model, standard inference on other half.
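The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not Wasserman's procedure as published: the selector is a hypothetical stand-in (keep the `m` columns most correlated with the response) rather than the LASSO, the intervals use a normal quantile in place of the Student-t quantile, the regression has no intercept, and the toy data are invented for the example.

```python
import numpy as np
from statistics import NormalDist

def select_by_correlation(x, y, m):
    """Stand-in selector: indices of the m columns most correlated with y."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    corr = (xc.T @ yc) / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(-np.abs(corr))[:m])

def ols_inference(x, y, level=0.95):
    """Least-squares coefficients with normal-approximation intervals (no intercept)."""
    n, k = x.shape
    gram_inv = np.linalg.inv(x.T @ x)
    coef = gram_inv @ x.T @ y
    resid = y - x @ coef
    s2 = float(resid @ resid) / (n - k)
    se = np.sqrt(s2 * np.diag(gram_inv))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return coef, coef - z * se, coef + z * se

def split_select_infer(x, y, m, rng):
    """Random half split: select on one half, infer on the other."""
    perm = rng.permutation(len(y))
    half = len(y) // 2
    sel, inf = perm[:half], perm[half:]
    chosen = select_by_correlation(x[sel], y[sel], m)
    coef, lower, upper = ols_inference(x[inf][:, chosen], y[inf])
    return chosen, coef, lower, upper

# Hypothetical toy data: two active predictors out of ten.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 10))
y = 2.0 * x[:, 0] - x[:, 3] + rng.normal(size=100)
chosen, coef, lower, upper = split_select_infer(x, y, m=3, rng=rng)
```

Because the inference half never touches the selection step, the intervals are ordinary regression intervals conditional on `chosen`, whichever columns it contains.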

**Issues with Data Splitting**

Some statisticians are uncomfortable with data-splitting. There are two common objections. The first is that the inferences are random: if we repeat the procedure we will get different answers. The second is that it is wasteful. (Wasserman, in response to Lockhart et al.)

**Principled Data Splitting**

- Can we use design principles to improve data-splitting techniques?
- Splitting can be skewed to alleviate concerns.
- Optimization can be exchanged with randomization to navigate optimality/robustness tradeoff.

**Key idea**: Inference is already conditional on \( X \). “Splitting on observables” can be used to improve power and restrict randomization without biasing inference.

Assume \( (Y_i, X_i) \) multivariate normal, as before.

**Procedure**:

- Use a penalized log-likelihood information criterion to select a model (AIC, BIC, DIC, or other).
- Compute predictive intervals for \( Y^{rep} \) using selected model.
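A hedged sketch of the two steps, assuming nested models built from the first `k` columns, an information criterion of the form \( n \log \hat\sigma^2 + 2 g(k, n) \) with \( g = k \) for AIC and \( g = \tfrac{1}{2} k \log n \) for BIC, and normal-quantile predictive intervals standing in for Student-t ones. The function names, the fixed (non-random) split, and the toy data are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

def info_criterion(x, y, k, penalty):
    """n*log(sigma_hat^2) + 2*g(k, n) for the model using the first k columns."""
    n = len(y)
    coef, *_ = np.linalg.lstsq(x[:, :k], y, rcond=None)
    sigma2_hat = float(np.mean((y - x[:, :k] @ coef) ** 2))
    return n * np.log(sigma2_hat) + 2.0 * penalty(k, n)

def aic_penalty(k, n):
    return float(k)

def bic_penalty(k, n):
    return 0.5 * k * np.log(n)

def select_size(x, y, penalty, max_k):
    """Model size minimizing the criterion over k = 1..max_k."""
    return min(range(1, max_k + 1), key=lambda k: info_criterion(x, y, k, penalty))

def predictive_interval(x_fit, y_fit, x_rep, level=0.95):
    """Intervals for new responses at rows of x_rep (no intercept)."""
    n, k = x_fit.shape
    gram_inv = np.linalg.inv(x_fit.T @ x_fit)
    coef = gram_inv @ x_fit.T @ y_fit
    resid = y_fit - x_fit @ coef
    s2 = float(resid @ resid) / (n - k)
    # x_rep_i' (X'X)^{-1} x_rep_i for each new point
    lev = np.einsum("ij,jk,ik->i", x_rep, gram_inv, x_rep)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * np.sqrt(s2 * (1.0 + lev))
    center = x_rep @ coef
    return center - half, center + half

# Hypothetical toy data; first 100 rows select, last 100 rows infer.
rng = np.random.default_rng(0)
n, p = 200, 8
x = rng.normal(size=(n, p))
y = x[:, 0] - 0.5 * x[:, 1] + rng.normal(size=n)
sel, inf = np.arange(0, 100), np.arange(100, 200)
k_aic = select_size(x[sel], y[sel], aic_penalty, max_k=p)
k_bic = select_size(x[sel], y[sel], bic_penalty, max_k=p)
x_rep = rng.normal(size=(5, p))
lo, hi = predictive_interval(x[inf, :k_bic], y[inf], x_rep[:, :k_bic])
```

The interval width depends on the inference-set design only through the `lev` term, which is the quantity a well-chosen split can shrink.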

**Lemma**:
Under the multivariate normal model, for fixed sizes \( n_1 \) of the model selection set and \( n_2 \) of the inference set, the optimal (oracle) splitting policy maximizes the leverage of the points in the inference set with respect to the selected model.

**Proof**:
Linear regression information criteria have the form
\[ \mathrm{IC} = n_1\log \hat \sigma_A^2 + 2g(p_A, n_1) + C, \]
where \( g \) is a function of model size and sample size, and \( C \) is a constant shared by all models.

Because of multivariate normality, residuals for any set \( A \) are mean-zero normal given \( X_A \), so \[ n_1 \hat \sigma_A^2 \sim \sigma^2_A \chi^2_{n_1-p_A}, \] and the distribution of \( \mathrm{IC} \), hence every expectation of it, does not depend on \( X \).

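Meanwhile, the estimation part of the predictive variance has the form:
\[ \operatorname{Var}(\hat Y^{rep} \mid X_A) = X_A^{rep}(X_A^{\top}X_A)^{-1}X_A^{rep\top} \sigma^2_A, \]
where \( X_A \) is the design matrix of the inference set, with trace decreasing in the *leverage* of the inference set. The remaining residual term \( \sigma^2_A I \) does not depend on the split.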
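A small numerical illustration of the trace claim, assuming NumPy, \( \sigma^2_A = 1 \), a fixed selected model with three covariates, and a greedy rule that sends the highest-leverage points to the inference set. The greedy rule is a simple heuristic chosen for the example, not the oracle policy of the lemma, and all names and sizes are hypothetical.

```python
import numpy as np

def leverage_scores(x):
    """Diagonal of the hat matrix X (X'X)^{-1} X', via a thin QR factorization."""
    q, _ = np.linalg.qr(x)
    return np.sum(q ** 2, axis=1)

def predictive_trace(x_inf, x_rep):
    """trace of X_rep (X_inf' X_inf)^{-1} X_rep', with sigma^2 set to 1."""
    gram = x_inf.T @ x_inf
    return float(np.trace(x_rep @ np.linalg.solve(gram, x_rep.T)))

rng = np.random.default_rng(1)
n, k, n2 = 100, 3, 50
x = rng.normal(size=(n, k))        # covariates of the selected model
x_rep = rng.normal(size=(200, k))  # points at which predictions are wanted

lev = leverage_scores(x)

# Inference set made of the n2 highest-leverage points.
top = np.argsort(-lev)[:n2]
trace_leverage = predictive_trace(x[top], x_rep)

# Average over uniformly random inference sets of the same size.
trace_random = float(np.mean([
    predictive_trace(x[rng.choice(n, size=n2, replace=False)], x_rep)
    for _ in range(200)
]))
```

In this setup `trace_leverage` comes out below `trace_random`: high-leverage rows make \( X_A^{\top}X_A \) larger, so its inverse, and with it the predictive variance, is smaller.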

**Achieving the (leverage) oracle**:

- Sequential designs to maximize expected leverage for likely models.

**Relaxed assumptions**:

- Without MVN, model selection depends on \( X \).
- Selection/inference tradeoffs need to be formulated.

**Evaluation**:

- Oracle model recovery is not the goal.
- \( p \) is infinite for all \( N \)? Stochastic process perspective for finite samples.

**Cross-pollination**:

- Algorithms from CUR decompositions, algorithmic leveraging (Mahoney et al).
- Leverage-based sampling from surveys.
- Complementary model selection and inference methods.
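One way leverage-based sampling might be borrowed for splitting is to draw the inference set at random with weights that grow with leverage. The sketch below is a hypothetical construction, not a method from the cited work: the tuning parameter `gamma` and the function names are invented for illustration, with `gamma = 0` giving a uniform random split and large `gamma` approaching a deterministic highest-leverage split.

```python
import numpy as np

def leverage_scores(x):
    """Diagonal of the hat matrix, via a thin QR factorization."""
    q, _ = np.linalg.qr(x)
    return np.sum(q ** 2, axis=1)

def leverage_weighted_split(x, n_inf, gamma, rng):
    """Draw n_inf inference rows without replacement, weights = leverage^gamma.

    Sampling is sequential, so inclusion probabilities are only roughly
    proportional to the weights.
    """
    weights = leverage_scores(x) ** gamma
    probs = weights / weights.sum()
    inf = rng.choice(len(x), size=n_inf, replace=False, p=probs)
    sel = np.setdiff1d(np.arange(len(x)), inf)
    return sel, inf

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
lev = leverage_scores(x)
sel, inf = leverage_weighted_split(x, n_inf=50, gamma=2.0, rng=rng)

# Average leverage of the inference set under uniform and skewed splitting.
mean_uniform = float(np.mean([
    lev[leverage_weighted_split(x, 50, 0.0, rng)[1]].mean() for _ in range(200)
]))
mean_skewed = float(np.mean([
    lev[leverage_weighted_split(x, 50, 4.0, rng)[1]].mean() for _ in range(200)
]))
```

Intermediate values of `gamma` keep some randomization in the split while still tilting the inference set toward informative design points.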

**Goal**:

- Select model to minimize predictive risk given the current sample size.
- Report valid inferences conditional on this model.

**Don't care if**:

- Different model at different sample sizes.
- True/false positives.

**Achievable by**:

- Separating model selection and inference.
- Optimizing the split using design principles.