Alex D'Amour
May 12, 2015
A role model…
on the role of models.
Model predictions do not match the true distribution in all cases.
Easily formalized by discrepancy measures (e.g., empirical vs. nominal coverage, divergence between empirical and predicted distributions, etc.).
Despite errors, model predictions are a reasonable approximation to the truth.
“Usefulness” and “Reasonability” are task-specific.
What if the goal of our task is to generalize our inferences to new settings?
Data: Let \( Z = (Y, W) \) be the observed sample:
Region of Interest: Let \( \mathcal E \subset \mathcal W \) be a set of “generalization conditions of interest”.
Goal: On the basis of one or several samples \( \{Z_k\} \), generalize to \( P(Y' \mid W') \) across \( W' \in \mathcal E \).
Note: Special interest here where \( \mathcal E \) is not a singleton.
Simple linear regression. \( Z_k \) is vector of outcomes \( Y_k \), vector of predictors \( X_k \), and sample size \( N_k \).
Conditioning statistics \( W_k = (X_k, N_k) \).
Generalize to \( W' \in \mathcal E \) where each entry of \( X' \) is contained in an interval \( (a,b) \), and \( N' \in \mathbb N \).
Examples of generalization:
Social network analysis. \( Z_k \) is array of pairwise interaction records \( Y_k \) among actors \( V_k \).
Conditioning statistics \( W_k = (V_k) \) (i.e. number of and properties of actors).
Generalize across sample sizes, to \( W' \in \mathcal E \) where number of actors \( |V'| \in \mathbb N \).
Point process. \( Z_k \) is set of observed points \( Y_k \) in spatiotemporal observation window \( A_k \).
Conditioning statistic \( W_k = (A_k) \).
Generalize across observation windows, to \( W' \in \mathcal E \) where area \( |A'| \in [A_{\min}, \infty) \).
Causal inference from observational study. \( Z_k \) is observed outcomes \( Y_k \), treatment-invariant covariate matrix \( X_k \), and treatment assignments \( T_k \).
Conditioning statistics \( W_k = (X_k, T_k) \).
Generalize to the situation where treatment is set by intervention, i.e. \( T_k \perp\!\!\!\perp X_k \).
Without circularly invoking a model or parameter:
Single-sample: Defines summaries of \( Y_k \) that efficiently capture structure that remains stable across sampling perturbations under a given condition \( W_k \). (parameter)
Generalization: Defines summaries that remain stable across variation in the conditions of interest \( \mathcal E \). (superpopulation)
Parameters are stable in both cases when the model is right.
What happens when the model is wrong?
Single-sample: Focus of robustness and semiparametric literature.
Summaries of data with stability across resampling, subsampling, contamination.
Analytical tools: Influence curves; non-projective sequences with limits that represent a single sample (the model changes along the sequence, with “uniformity” arguments).
If stable, useful if \( \mathcal E = \{W_k\} \).
See, among others, Huber 1981, Hampel 1987, Bickel et al. 1998, Yu 2013.
Generalization: To my knowledge, little work.
Seek summaries with stability across conditions despite misspecification.
Analytical tools: Projective sequences governed by a single overarching stochastic process. Study relationships between elements of the whole sequence, not just the limit.
If stable, useful for more general \( \mathcal E \).
Maximum likelihood estimation
Model Family: \( \mathcal P_{\Theta, \mathcal W} = \{\mathbb P_{\theta, W}\}_{\theta \in \Theta, W \in \mathcal W} \) with parameter \( \theta \in \Theta \) and conditions \( W \in \mathcal W \), with stochastic consistency between conditions.
Model: Each \( \mathbb P_{\theta, W} = \mathbb P_\theta(Y \mid W) \), a candidate model under condition \( W \).
Truth: \( \mathcal P_{0, \mathcal W} = \{\mathbb P_{0, W}\}_{W \in \mathcal W} \), where \( \mathbb P_{0,W} = \mathbb P_0(Y \mid W) \).
Estimation: \( \hat \theta_{W_k} = \arg\max_{\Theta} \log \mathbb P_{\theta}(Y_k \mid W_k) \).
Correctly specified: \( \mathcal P_{0, \mathcal W} = \mathcal P_{\theta_0, \mathcal W} \) for some \( \theta_0 \in \Theta \).
In this case, for all \( W \) in \( \mathcal E \), \( \hat \theta \) is estimating \( \theta_0 \).
Misspecified: \( \mathcal P_{0, \mathcal W} \not\subset \mathcal P_{\Theta, \mathcal W} \).
For each \( W_k \), \( \mathbb P_{\hat \theta_{W_k}, W_k} \) is estimating the “best” approximation in the model family to \( \mathbb P_{0, W_k} \) based on the observed data.
Can we formalize this under a single operator?
Represent the estimator \( \hat \theta_{W_k}(Y_k) \) as a functional \( \Psi \) that operates on probability measures and returns a vector in \( \Theta \) (a standard device in the semiparametric literature).
Let \( \hat {\mathbb P}_{W_k} \) be the empirical distribution of \( W_k \). Then MLE can be written as:
\[ \hat \theta_{W_k} = \Psi_{\mathcal P_{\Theta}}(\hat {\mathbb P}_{W_k}) = \arg\max_{\Theta} \mathbb E_{\hat {\mathbb P}_{W_k}}[\log \mathbb P_{\theta}(Y \mid W_k)]. \]
Remark: Reduces to the MLE equation because \( \hat {\mathbb P}_{W_k} \) is a point mass at \( Y_k \).
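The functional view can be sketched in code. This is a minimal illustration, not from the talk: the names `Psi`, `make_log_lik`, `support`, and `weights` are invented here, and a Gaussian working likelihood with unit variance is assumed. The key point is that the same operator accepts any discrete measure over outcome vectors, so the empirical point mass at \( Y_k \) is just one input among many.

```python
import numpy as np
from scipy.optimize import minimize

def Psi(support, weights, log_lik):
    """argmax_theta of E_P[log p_theta(Y | W)], where P is the discrete
    measure placing weight w on each outcome vector y in `support`."""
    def neg_expected_loglik(theta):
        return -sum(w * log_lik(theta, y) for y, w in zip(support, weights))
    # theta is 2-dimensional here (intercept, slope); hardcoded for the sketch
    return minimize(neg_expected_loglik, x0=np.zeros(2)).x

# Gaussian working log-likelihood (unit variance, up to constants) for the
# linear model E_theta(Y) = theta_1 + theta_2 * x at a fixed design x.
def make_log_lik(x):
    return lambda theta, y: -0.5 * np.sum((y - (theta[0] + theta[1] * x)) ** 2)

x = np.array([-1.5, -0.5, 0.25])
y_obs = 10 - x**2 + 0.1 * np.random.default_rng(0).normal(size=3)

# The empirical distribution of Y_k is a point mass at the observed vector,
# so plugging it into Psi reproduces the ordinary MLE (here: OLS).
theta_hat = Psi([y_obs], [1.0], make_log_lik(x))
```

Plugging a discretized version of the truth \( \mathbb P_{0, W_k} \) (e.g., many simulated draws of \( Y \) with equal weights) into the same `Psi` approximates the effective estimand defined next.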
Plugging the truth \( \mathbb P_{0, W_k} \) into the operator, we obtain
\[ \bar \theta_{W_k} = \Psi_{\mathcal P_{\Theta}}(\mathbb P_{0, W_k}) = \arg\max_{\Theta} \mathbb E_{\mathbb P_{0, W_k}}[\log \mathbb P_{\theta}(Y \mid W_k)]. \]
Remark: Maximand is a negative KL-divergence plus constant.
Call \( \bar \theta_{W_k} \) the effective estimand, which \( \hat \theta_{W_k} \) targets.
Correctly specified: \( \bar \theta_{W_k} = \theta_0 \) for all \( W_k \), by properties of the KL divergence.
Invariant to \( W_k \) because of the ancillarity induced by the conditional inferential distribution.
Misspecified: \( \bar \theta_{W_k} \) is the best available approximation to \( \mathbb P_{0, W_k} \) in \( \mathcal P_{\Theta, W_k} \).
No guarantee of invariance. If not invariant, generalizing to new \( W_k \) is incoherent: the procedure is effectively estimating different quantities at each \( W_k \).
\( Y = 10 - X^2 + \epsilon \quad \textrm{ but } \quad E_{\theta}(Y) = \theta_1 + \theta_2 X \). \( X_1 = \{-1.5, -0.5, 0.25\}, X_2 = \{-0.25, 0.8, 1.8\} \). \( \mathcal E = \{(X_k, N_k): N_k = 3, X_k \in [-2,2]^3\} \)
Parameter of interest changes with \( W_k \).
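The drift in the effective estimand can be computed exactly for this example. A hedged sketch, assuming a Gaussian working likelihood: in that case the KL-best \( (\theta_1, \theta_2) \) is just the least-squares projection of the true conditional mean \( 10 - X^2 \) onto \( \{1, X\} \) at the design points, so no simulation is needed. The function name `effective_estimand` is illustrative.

```python
import numpy as np

def effective_estimand(x):
    """KL-best (theta_1, theta_2) when the truth is Y = 10 - x^2 + eps and the
    working model is Gaussian linear regression: under a Gaussian likelihood,
    this is the OLS projection of the true mean onto span{1, x}."""
    true_mean = 10 - x**2                       # E_0[Y | X = x]
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, true_mean, rcond=None)[0]

theta_bar_1 = effective_estimand(np.array([-1.5, -0.5, 0.25]))
theta_bar_2 = effective_estimand(np.array([-0.25, 0.8, 1.8]))
# The effective slope even flips sign between the two designs.
```

With \( X_1 \) sitting on the rising branch of the parabola and \( X_2 \) on the falling branch, the two designs yield effective slopes of opposite sign, so no single \( \theta \) is being targeted across \( \mathcal E \).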
For many tasks, propagating information between conditions when the effective estimand differs is counterproductive.
Correct model: Ancillarity of \( W_k \) grants stability because \( W_k \) carries no information about the parameter itself, only about how the parameter is measured.
Misspecified model: Ideally, \( W_k \) carries no information about the effective estimand.
We say \( W_k \) is effectively ancillary with respect to a true process \( \mathcal P_{0, \mathcal W} \), a proposed model family \( \mathcal P_{\Theta, \mathcal W} \), and a conditioning set \( \mathcal E \) iff the distribution of \( W_k \) does not depend on \( \bar \theta_{W_k} \) for all \( W_k \) in \( \mathcal E \).
In some sense, trivial. \( \bar \theta_{W_k} \) and \( W_k \) deterministically related.
Definition draws distinction between nominal and effective ancillarity.
Even if the model internally declares \( W_k \) to be ancillary, the effective estimand can still be determined by \( W_k \).
Single-sample goodness-of-fit analyses (e.g., consistency asymptotics) are not enough; this requires investigating structure across conditions.
Define population process as a stochastic process indexed by \( W_k \).
Assume it has known population properties (often asymptotic), specified separately from the proposed model family.
Investigate (perhaps through asymptotics) whether effective estimand is invariant to the finite dimensional distribution selected by any \( W_k \in \mathcal E \).
Complicate model to subsume inhomogeneity.
Change conditioning to obtain invariance to inhomogeneity.
When data are dependent, many single-sample summaries are inhomogeneous in \( W_k \).
Saturation in repulsive point processes: rate decreasing in the temporal dimension of \( A_k \). For uniform models, the effective estimand is decreasing in observation time if \( \mathcal E \) spans large ranges of \( A_k(t) \). Recover invariance with a model that includes repulsion.
Sparsity in networks: larger samples have smaller proportions of edges. For a proposed exchangeable actor model, the effective estimand is decreasing in the size of the actor sample if \( \mathcal E \) spans large \( |V_k| \). Recover by conditioning on less, with a truncated model.
Selection on observables: Differences in observed conditional expectations depend on assignment mechanism. Recover by conditioning on more (e.g. propensity scores, matching).
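The "condition on more" fix can be illustrated with a hypothetical simulation (not from the talk; the names `study` and `tau` and the assignment probabilities are invented). The raw treated-vs-control mean difference depends on the assignment mechanism, while the within-stratum difference stays invariant across mechanisms.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.integers(0, 2, n)                      # binary covariate
tau = 1.0                                      # true, x-invariant treatment effect

def study(p_treat_given_x):
    """One observational study; assignment probability depends on x."""
    t = rng.random(n) < p_treat_given_x[x]
    y = 2.0 * x + tau * t + rng.normal(size=n)
    naive = y[t].mean() - y[~t].mean()         # depends on the mechanism
    adjusted = np.mean([y[t & (x == v)].mean() - y[~t & (x == v)].mean()
                        for v in (0, 1)])      # stratify on x (the propensity score)
    return naive, adjusted

naive_a, adj_a = study(np.array([0.2, 0.8]))   # strong selection on x
naive_b, adj_b = study(np.array([0.5, 0.5]))   # randomized
```

Here the binary covariate is itself the coarsest sufficient conditioning statistic; with richer covariates, the propensity score plays the same role, collapsing the extra conditioning into one dimension.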
Introduced notion of stability that makes a model useful for generalization, even if it is misspecified.
Because internal model parameters are not “real”, evaluate stability as a property of the effective estimand.
Procedure is stable if conditioning statistics are effectively ancillary.
Projective analyses can be used to identify invariances that recover effective ancillarity.