This article is my note on a gentle introduction to PAC learning.

Domain $\mathcal{X}$, label domain $\mathcal{Y}$, true mapping $f: \mathcal{X} \to \mathcal{Y}$, training set $S = ((x_1, y_1), \dots, (x_m, y_m))$ where $x_i \in \mathcal{X}$, $y_i = f(x_i)$, and each $x_i$ is sampled from a distribution $\mathcal{D}$ over $\mathcal{X}$ i.i.d.

The learner has no idea about $\mathcal{D}$ (as well as $f$) but just observes $S$ sampled from $\mathcal{D}$: say an algorithm $A$ is fed with $S$ and outputs a classifier/hypothesis/predictor $h: \mathcal{X} \to \mathcal{Y}$, denoted as $A(S)$.

The goal is to minimize the generalization error $L_{\mathcal{D},f}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq f(x)]$, but only $S$ is available. Thus, a learner may search within an $\mathcal{H}$ (a hypothesis class) to minimize the empirical risk $L_S(h) = \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}$, a scheme known as empirical risk minimization (ERM). We denote such a learnt hypothesis by $h_S \in \arg\min_{h \in \mathcal{H}} L_S(h)$, as in the sketch below.
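To make the ERM rule concrete, here is a minimal Python sketch over a finite hypothesis class; the names (`empirical_risk`, `erm`, the toy threshold class `H`) are illustrative, not from any standard library.

```python
def empirical_risk(h, S):
    """L_S(h): the fraction of training examples that h mislabels."""
    return sum(h(x) != y for x, y in S) / len(S)

def erm(hypothesis_class, S):
    """Return a hypothesis in the (finite) class minimizing the empirical risk on S."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, S))

# Toy usage: threshold classifiers on the integers as a finite hypothesis class.
H = [lambda x, t=t: int(x >= t) for t in range(11)]
S = [(1, 0), (3, 0), (6, 1), (8, 1)]
h_S = erm(H, S)  # any threshold between 4 and 6 attains zero empirical risk here
```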

It is easy to construct a case where the ERM learner achieves $L_S(h_S) = 0$ but suffers a large generalization error $L_{\mathcal{D},f}(h_S)$. This phenomenon is called overfitting; the memorizing predictor below is the classic example.
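As a concrete illustration (a sketch with hypothetical names), the memorizing predictor is consistent with any non-contradictory training set, yet on a large domain it outputs a constant label almost everywhere, so its generalization error can be arbitrarily bad.

```python
def memorizer(S, default=0):
    """Zero empirical risk by construction: repeat the stored label on training
    points and output a constant label everywhere else."""
    lookup = dict(S)
    return lambda x: lookup.get(x, default)
```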

Under which condition(s) can we avoid overfitting? It is natural to consider a finite hypothesis class $\mathcal{H}$. Indeed, this avoids overfitting given a large enough training set.

To prove this, let us first assume the learning algorithm is satisfactory (consistent with $S$), say that $L_S(h_S) = 0$. Our interest lies in the probability that such an ERM learner outputs a bad $h_S$, i.e., $L_{\mathcal{D},f}(h_S) > \epsilon$ for some accuracy parameter $\epsilon \in (0, 1)$.

Denote the set of bad hypotheses by $\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D},f}(h) > \epsilon\}$; the question becomes the probability of sampling an $S$ such that some $h \in \mathcal{H}_B$ is consistent with it, i.e., $L_S(h) = 0$.

As $L_{\mathcal{D},f}(h) > \epsilon$ for any $h \in \mathcal{H}_B$, the probability of sampling an instance $x$ from $\mathcal{D}$ that satisfies $h(x) = f(x)$ is less than or equal to $1 - \epsilon$, and thus, since the $m$ instances are drawn i.i.d., $\Pr_{S \sim \mathcal{D}^m}[L_S(h) = 0] \leq (1 - \epsilon)^m$.

For each $h \in \mathcal{H}_B$, the probability of sampling an $S$ on which $h$ is consistent is therefore less than or equal to $(1 - \epsilon)^m \leq e^{-\epsilon m}$, using the inequality $1 - \epsilon \leq e^{-\epsilon}$.

As probability theory states (the union bound) that $\Pr[\bigcup_i A_i] \leq \sum_i \Pr[A_i]$, the probability of interest is less than or equal to $|\mathcal{H}_B| \, e^{-\epsilon m} \leq |\mathcal{H}| \, e^{-\epsilon m}$.
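Chaining the steps above (under the consistency assumption), the whole argument reads in one line:

$$
\Pr_{S \sim \mathcal{D}^m}\big[\exists\, h \in \mathcal{H}_B : L_S(h) = 0\big]
\;\leq\; \sum_{h \in \mathcal{H}_B} \Pr_{S \sim \mathcal{D}^m}\big[L_S(h) = 0\big]
\;\leq\; |\mathcal{H}_B| \, (1 - \epsilon)^m
\;\leq\; |\mathcal{H}| \, e^{-\epsilon m}.
$$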

The theorem says that, given a finite hypothesis class $\mathcal{H}$, if we i.i.d. sample $m$ instances from $\mathcal{D}$ and use the ERM scheme with some algorithm to learn a consistent $h_S$ from $\mathcal{H}$, the probability that $L_{\mathcal{D},f}(h_S) > \epsilon$ is upper bounded by $|\mathcal{H}| \, e^{-\epsilon m}$.
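As a sanity check of the theorem, here is a small Monte Carlo sketch under an assumed toy setup: uniform $\mathcal{D}$ over eight points and $\mathcal{H}$ consisting of all binary labelings (so $|\mathcal{H}| = 256$); all names and numbers are illustrative.

```python
import itertools
import math
import random

X = list(range(8))                                   # domain; D is uniform over X
H = list(itertools.product([0, 1], repeat=len(X)))   # all binary labelings, |H| = 256
f = random.choice(H)                                 # true labeling, itself in H

def gen_error(h):
    """L_{D,f}(h): fraction of the (uniform) domain where h disagrees with f."""
    return sum(h[x] != f[x] for x in X) / len(X)

def erm_consistent(S):
    """Return the first hypothesis in H consistent with S (f always qualifies)."""
    return next(h for h in H if all(h[x] == y for x, y in S))

m, eps, trials = 40, 0.25, 1000
failures = sum(
    gen_error(erm_consistent([(x, f[x]) for x in random.choices(X, k=m)])) > eps
    for _ in range(trials)
)
print("empirical Pr[L_D(h_S) > eps]:", failures / trials)
print("bound |H| * exp(-eps * m):   ", len(H) * math.exp(-eps * m))
```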

A natural corollary is: if we want to ensure that the fraction of such bad samples $S$ is at most $\delta$, i.e., $|\mathcal{H}| \, e^{-\epsilon m} \leq \delta$, the size $m$ of the training set is required to be $m \geq \frac{\ln(|\mathcal{H}| / \delta)}{\epsilon}$.
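Plugging numbers into the corollary, a small helper (with illustrative names and values) computes the sample size the bound asks for:

```python
import math

def required_sample_size(h_size, eps, delta):
    """Smallest integer m satisfying m >= ln(|H| / delta) / eps."""
    return math.ceil(math.log(h_size / delta) / eps)

# E.g., |H| = 10**6 hypotheses, error eps = 0.05, confidence delta = 0.01:
print(required_sample_size(10**6, 0.05, 0.01))  # 369 examples suffice by the bound
```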

This is probably (i.e., with probability at least $1 - \delta$ over the sampling of $S$) approximately (i.e., making correct predictions up to error $\epsilon$) correct (PAC).