Adaptive Data Analysis in a Balanced Adversarial Model
Wednesday 01.01 11:30 - 12:30
- Faculty Seminar
Bloomfield 424
Abstract:
Statistical validity is widely recognized as a crucial feature of modern science. Lack of validity, popularly known as the replication crisis in science, poses a serious threat to the scientific process and to the public’s trust in scientific findings. One of the factors leading to the replication crisis is the inherent adaptivity in the data analysis process. To illustrate adaptivity and its effect, consider a data analyst who is testing a specific research hypothesis. The analyst gathers data, evaluates the hypothesis empirically, and often finds that the hypothesis is not supported by the data, leading to the formulation and testing of further hypotheses. If these hypotheses are formed and tested on the same data (as acquiring fresh data is often expensive or even impossible), the process is one of adaptive data analysis (ADA), because the choice of hypotheses depends on the data. However, ADA no longer aligns with classical statistical theory, which assumes that hypotheses are selected independently of the data (and preferably before the data are gathered). As a result, ADA may lead to overfitting and hence to false discoveries.
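To make the overfitting risk concrete, here is a minimal toy simulation (an illustration only, not taken from the talk): every candidate feature is pure noise, yet a hypothesis selected after looking at the data appears well supported by that same data and fails to replicate on a fresh sample.

```python
# Minimal sketch of adaptive overfitting: all features are pure noise,
# yet a hypothesis chosen *after* inspecting the data looks significant
# on that same data, while a fresh sample reveals no effect.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                      # n samples, d candidate "features"

# Null world: every feature is an independent fair coin (true mean 0.5).
data = rng.integers(0, 2, size=(n, d))

# Adaptive step: inspect the data and pick the feature that looks most
# different from 0.5 -- a hypothesis formed *because of* the data.
empirical_means = data.mean(axis=0)
best = int(np.argmax(np.abs(empirical_means - 0.5)))
print(f"selected feature {best}: empirical mean {empirical_means[best]:.3f}")

# Fresh data tells the truth: the "discovery" does not replicate.
fresh = rng.integers(0, 2, size=(n, d))
print(f"same feature on fresh data: mean {fresh[:, best].mean():.3f} (true mean 0.5)")
```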
For simplicity, we model access to the data by a mechanism that gets n i.i.d. samples from an unknown distribution D and is required to provide accurate estimates for a sequence of adaptively chosen statistical queries with respect to D. Hardt and Ullman (FOCS '14) and Steinke and Ullman (COLT '15) showed that, in general, it is computationally hard to answer more than Θ(n^2) adaptive queries, assuming the existence of one-way functions. However, these negative results rely strongly on an adversarial model that significantly advantages the adversarial analyst over the mechanism: the analyst, who chooses the adaptive queries, also chooses the underlying distribution D. This imbalance raises questions about the applicability of the obtained hardness results, as an analyst with complete knowledge of the underlying distribution D would have little need, if any, to issue statistical queries to a mechanism that only holds a finite number of samples from D.
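The following sketch fixes the model's interface in code (class and variable names are illustrative, not the paper's): a mechanism holds n i.i.d. samples from an unknown distribution D and must estimate the expectation of each adaptively chosen statistical query q : X -> [0, 1]. The naive mechanism below simply returns empirical means; the hardness results mentioned above apply to any computationally bounded mechanism, not only this one.

```python
# Sketch of the query-answering model: the mechanism holds n i.i.d. samples
# from an unknown distribution D and must estimate E_{x~D}[q(x)] for a
# sequence of adaptively chosen statistical queries q : X -> [0, 1].
import numpy as np
from typing import Callable

class NaiveMechanism:
    """Answers each statistical query with its empirical mean on the sample.

    The simplest possible mechanism, for illustration only; the hardness
    results apply to any computationally bounded mechanism."""
    def __init__(self, samples: np.ndarray):
        self.samples = samples           # n i.i.d. draws from the unknown D

    def answer(self, query: Callable[[np.ndarray], np.ndarray]) -> float:
        # query maps each sample to a value in [0, 1]
        return float(np.mean(query(self.samples)))

# Usage: the analyst may choose the next query after seeing earlier answers.
rng = np.random.default_rng(1)
mech = NaiveMechanism(rng.normal(size=(100, 10)))              # hypothetical D
a1 = mech.answer(lambda x: (x[:, 0] > 0).astype(float))
a2 = mech.answer(lambda x: (x[:, 1] > (a1 - 0.5)).astype(float))  # adaptive
print(a1, a2)
```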
We consider more restricted adversaries, called balanced, where each such adversary consists of two separate algorithms: the sampler, who chooses the distribution and provides the samples to the mechanism, and the analyst, who chooses the adaptive queries but has no prior knowledge of the underlying distribution (and hence no a priori advantage over the mechanism). We improve the quality of previous lower bounds by revisiting them using an efficient balanced adversary, under standard public-key cryptography assumptions. We show that these stronger hardness assumptions are unavoidable, in the sense that any computationally bounded balanced adversary that has the structure of all known attacks implies the existence of public-key cryptography.
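A structural sketch of the balanced model, continuing the illustrative interfaces above (all names are assumptions, not the paper's): in the original, unbalanced model a single adversary both chooses D and issues the queries, whereas a balanced adversary is split into a sampler, which chooses D and hands the samples to the mechanism, and an analyst, which issues adaptive queries while seeing only the mechanism's past answers.

```python
# Structural sketch of the balanced adversarial model: the sampler picks D
# (kept private from the analyst) and supplies samples; the analyst chooses
# adaptive queries based only on the mechanism's previous answers.
from typing import Callable, Protocol, Sequence

class Sampler(Protocol):
    def choose_distribution_and_sample(self, n: int) -> Sequence:
        """Pick D (never revealed to the analyst) and return n i.i.d. samples."""

class Analyst(Protocol):
    def next_query(self, past_answers: Sequence[float]) -> Callable:
        """Choose the next statistical query; sees only past answers, not D."""

def run_balanced_game(sampler, analyst, make_mechanism, n, rounds):
    samples = sampler.choose_distribution_and_sample(n)  # analyst never sees D
    mechanism = make_mechanism(samples)                   # e.g. NaiveMechanism above
    answers = []
    for _ in range(rounds):
        q = analyst.next_query(answers)
        answers.append(mechanism.answer(q))
    return answers
```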
The talk is based on joint work with Kobbi Nissim and Uri Stemmer: https://arxiv.org/abs/2305.15452 (spotlight poster at NeurIPS '23).