Domain $$\mathcal{X}$$, label domain $$\mathcal{Y}$$, true labeling function $$f: \mathcal{X}\to\mathcal{Y}$$, training set $$S=\{(x_i, y_i)\}_{i=1}^{m}$$ where $$x_i \in \mathcal{X}$$, $$y_i = f(x_i)$$, and the $$x_i$$ are sampled i.i.d. from a distribution $$D$$ over $$\mathcal{X}$$.

The learner knows neither $$D$$ nor $$f$$; it only observes $$S$$. A learning algorithm $$A$$ takes $$S$$ as input and outputs a classifier/hypothesis/predictor $$h$$, denoted $$h=A(S)$$.

The goal is to minimize the generalization error $$L_{D,f}(h)=\mathbb{P}_{x\sim D}[h(x)\neq f(x)]$$, but only $$S$$ is available. A learner may therefore search for an $$h\in H$$ (a hypothesis class) that minimizes the empirical risk $$L_S(h)=\frac{1}{m}\left|\{i : h(x_i)\neq y_i\}\right|$$; this scheme is called empirical risk minimization (ERM). We denote such a learnt hypothesis by $$h_S$$.
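ERM over a finite class can be sketched by brute-force search. The setup below is illustrative and not from the text: $$H$$ is a small class of threshold classifiers on $$[0,1]$$, $$f$$ is the true threshold at 0.5, and $$D$$ is uniform.

```python
import random

# Illustrative ERM sketch: H is a finite set of threshold classifiers
# h_t(x) = 1 if x >= t else 0, and f is the (unknown to the learner)
# true threshold at 0.5. All names here are assumptions for this example.

def f(x):
    return 1 if x >= 0.5 else 0

def make_h(t):
    return lambda x: 1 if x >= t else 0

H = [make_h(t / 10) for t in range(11)]  # finite class: t in {0.0, 0.1, ..., 1.0}

def empirical_risk(h, S):
    # L_S(h): fraction of training points h gets wrong
    return sum(h(x) != y for x, y in S) / len(S)

random.seed(0)
S = [(x, f(x)) for x in (random.random() for _ in range(50))]

# ERM: pick any hypothesis in H minimizing the empirical risk L_S(h)
h_S = min(H, key=lambda h: empirical_risk(h, S))
print(empirical_risk(h_S, S))  # 0.0, since the true threshold 0.5 is in H
```

Because $$f$$ itself lies in $$H$$ here, the realizability assumption used later holds and ERM attains zero training error.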

It is easy to construct a training set $$S$$ on which an ERM learner achieves $$L_S(h_S)=0$$ yet incurs a large generalization error, e.g., a hypothesis that memorizes $$S$$ and predicts arbitrarily elsewhere. This phenomenon is called overfitting.
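A hypothetical "memorizer" makes this concrete: it echoes the training labels exactly (so $$L_S=0$$) but predicts a constant off-sample. The names and setup below are illustrative only.

```python
import random

# Overfitting sketch: a memorizer fits S perfectly yet generalizes poorly.
# f, S, and the uniform distribution D on [0, 1] are assumptions for this demo.

random.seed(0)
f = lambda x: 1 if x >= 0.5 else 0              # true labeling function
S = [(x, f(x)) for x in (random.random() for _ in range(20))]
lookup = dict(S)

def memorizer(x):
    # Echo the training label if x was seen; otherwise always predict 0.
    return lookup.get(x, 0)

train_err = sum(memorizer(x) != y for x, y in S) / len(S)

# Estimate the generalization error on fresh samples from D.
fresh = [random.random() for _ in range(10000)]
gen_err = sum(memorizer(x) != f(x) for x in fresh) / len(fresh)
print(train_err, round(gen_err, 2))  # training error 0, test error near 0.5
```

The memorizer is wrong on roughly every fresh $$x\geq 0.5$$, so its generalization error is about 0.5 despite zero empirical risk.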

Under which condition(s) can we avoid overfitting? It is natural to restrict the learner to a finite hypothesis class. Indeed, a finite class does avoid overfitting, provided the training set is large enough.

To prove this, let us first assume realizability: the learning algorithm always finds a hypothesis consistent with $$S$$, i.e., $$\forall S, \min_{h\in H} L_S(h) = 0$$. Our interest lies in the probability that such an ERM learner returns a bad $$h_S$$, i.e., $$L_{D,f}(h_S)\geq\epsilon$$ for a given $$\epsilon$$.

Denote the bad hypotheses by $$H_B=\{h\in H \text{ s.t. } L_{D,f}(h)\geq\epsilon\}$$. The question becomes: what is the probability of sampling an $$S$$ on which some bad hypothesis is consistent, i.e., $$\exists h\in H_B, L_S(h)=0$$?

Since $$\forall h\in H_B, L_{D,f}(h)\geq\epsilon$$, for any $$h\in H_B$$ the probability of sampling a single instance $$x\sim D$$ with $$h(x)=f(x)$$ is at most $$1-\epsilon$$, and thus at most $$e^{-\epsilon}$$ (using $$1-\epsilon\leq e^{-\epsilon}$$).

Since the $$m$$ instances in $$S$$ are drawn i.i.d., for each $$h\in H_B$$ the probability of sampling an $$S$$ on which $$h$$ is consistent is at most $$(1-\epsilon)^m\leq e^{-m\epsilon}$$.

By the union bound, $$P(A\cup B)\leq P(A)+P(B)$$, the probability of interest is at most $$\sum_{h\in H_B}e^{-m\epsilon}=|H_B|e^{-m\epsilon}\leq|H|e^{-m\epsilon}$$.

The theorem says that, given a finite hypothesis class $$H$$, if we sample $$m$$ instances i.i.d. from $$D$$ and use ERM with some algorithm $$A$$ to learn a consistent $$h_S$$ from $$H$$, the probability that $$L_{D,f}(h_S)\geq\epsilon$$ is upper bounded by $$|H|e^{-m\epsilon}$$.
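The bound can be sanity-checked empirically. The setup below is an illustrative assumption, not from the text: $$H$$ consists of 21 threshold classifiers on $$[0,1]$$, $$f$$ is the threshold at 0.5, and $$D$$ is uniform, so a threshold $$t$$ has true error $$|t-0.5|$$ and $$H_B=\{t : |t-0.5|\geq\epsilon\}$$.

```python
import math
import random

# Empirical check of the bound P[some eps-bad h is consistent with S]
# <= |H| * exp(-m * eps). The hypothesis class, f, and D are illustrative.

random.seed(1)
thresholds = [t / 20 for t in range(21)]            # |H| = 21
eps, m, trials = 0.15, 30, 2000
bad = [t for t in thresholds if abs(t - 0.5) >= eps]  # H_B

def consistent(t, S):
    # h_t agrees with f on every training point (L_S(h_t) = 0)
    return all((x >= t) == (x >= 0.5) for x in S)

hits = 0
for _ in range(trials):
    S = [random.random() for _ in range(m)]
    if any(consistent(t, S) for t in bad):          # some eps-bad h survives ERM
        hits += 1

bound = len(thresholds) * math.exp(-m * eps)
print(hits / trials, "<=", round(bound, 4))
```

The observed fraction of "bad" training sets falls well under the bound, as expected: the union bound and the $$e^{-\epsilon}$$ relaxation are both loose.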

A natural corollary: if we want the fraction of such bad $$S\in D^m$$ to be at most $$\delta$$, i.e., $$|H|e^{-m\epsilon}\leq\delta$$, the sample size $$|S|=m$$ must satisfy $$m\geq\frac{\log(|H|/\delta)}{\epsilon}$$.
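The corollary gives a concrete sample-size calculation; the numbers plugged in below are illustrative.

```python
import math

# Sample-size requirement from the corollary: m >= log(|H| / delta) / eps.

def sample_complexity(H_size, eps, delta):
    return math.ceil(math.log(H_size / delta) / eps)

# Illustrative numbers: |H| = 1000, eps = 0.1, delta = 0.05.
print(sample_complexity(1000, 0.1, 0.05))  # → 100
```

Note the sample complexity grows only logarithmically in $$|H|$$ and $$1/\delta$$, but linearly in $$1/\epsilon$$.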

This is the probably (i.e., with probability at least $$1-\delta$$ over the draw of $$S$$) approximately (i.e., predicting correctly with probability at least $$1-\epsilon$$) correct (PAC) guarantee.