Domain $\mathcal{X}$, label domain $\mathcal{Y}$, true mapping $f: \mathcal{X}\leftarrow\mathcal{Y}$, training set $S=\{(x_i, y_i)\},i=1,\ldots,m$ where $x_i \in \mathcal{X}$, $y_i = f(x_i)$, and $x_i$ is sampled from distribution $D$ i.i.d.

Learner has no idea about $D$ (as well $f$) but just observes from $S$, say an algorithm $A$ which is fed with $S$ and output a classifier/hypothesis/predictor $h$, denoted as $h=A(S)$.

The goal is to minimize generalization error $L_{D,f}(h)=E_D[h(x)!=f(x)]$ but there is only $S$ available. Thus, a learner may search for a $h\in H$ (a hypothesis class) to minimize empirical risk (MER) $L_S(h)=E_{x\in S}[h(x)!=f(x)]$. And we denote such a learnt hypothesis by $h_S$.

It is easy to construct a training set $S$ where the MER learner achieves $L_S(h_S)=0$ but results in a bad generalization risk. This phenomenon is called overfitting.

Under which condition(s), can we avoid overfitting? It is natural to consider a finite set of hypotheses. Actually, this avoid overfitting if given a large enough training set.

To prove this, let us first assume the learning algorithm is satisfactory (consistent with $S$), say that $\forall S, \min_{h\in H} L_S(h) = 0$. Our interests lie in the probability that such a MER learns a bad $h_S$, i.e., $L_{D,f}(h_S)\geq\text{a given }\epsilon$.

Denote bad hypotheses as $H_B=\{h\in H\text{s.t. }L_{D,f}(h)\geq\epsilon\}$, the question becomes the probability of sampled an $S$ so that $\exists h\in H_B, L_S(h_S)=0$.

As $\forall h\in H_B, L_{f,D}(h)\geq\epsilon$, for any $h\in H_B$, the probability of sampled an instance from $D$ that satisfies $h(x)=f(x)$ is less or equal to $1-\epsilon$ and thus $\leq e^{-\epsilon}$.

For each $h\in H_B$, the probability of sampled such an $S$ is less or equal to $e^{-m\epsilon}$.

As probability theory states that $P(A)+P(B)\geq P(A\cup B)$, the interested probability is less or equal to $\sum_{h\in H_B}e^{-m\epsilon}=\|H_B\|e^{-m\epsilon}\leq\|H\|e^{-m\epsilon}$.

The theorem says that, given a finite hypothesis class $H$, if we i.i.d. sample $m$ instances from $D$ and use MER scheme and some algorithm $A$ to learn a consistent $h_S$ from $H$, the probability that $L_{D,f}(h_S)\geq\epsilon$ is upper bounded by $\|H\|e^{-m\epsilon}$.

A natural corollary is that: if we want to ensure the fraction of such bad $S\in D^m$ to be $\leq\delta$, i.e., $\|H\|e^{-m\epsilon}\leq\delta$, the cardinality $\|S\|=m$ is required to be $\geq\frac{\log(\|H\|/\delta)}{\epsilon}$.

This is probably (i.e., with fraction $1-\delta$) approximately (i.e., making correct prediction with $1-\epsilon$) correct (PAC).