This article is my note on a gentle introduction to PAC (Probably Approximately Correct) learning.
Domain $\mathcal{X}$, label domain $\mathcal{Y}$ (say $\{0, 1\}$ for binary classification), true mapping $f: \mathcal{X} \to \mathcal{Y}$, training set $S = ((x_1, y_1), \dots, (x_m, y_m))$ where $x_i \in \mathcal{X}$, $y_i = f(x_i)$, and each $x_i$ is sampled from a distribution $\mathcal{D}$ over $\mathcal{X}$ i.i.d.
The learner has no idea about $\mathcal{D}$ (nor about $f$) but just observes $S$ sampled from $\mathcal{D}$: say an algorithm $A$ is fed with $S$ and outputs a classifier/hypothesis/predictor $h$, denoted as $h = A(S)$.
The goal is to minimize the generalization error $L_{\mathcal{D}, f}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq f(x)]$, but only $S$ is available. Thus, a learner may search a hypothesis class $\mathcal{H}$ for a hypothesis minimizing the empirical risk $L_S(h) = \frac{|\{i \in [m] : h(x_i) \neq y_i\}|}{m}$; this scheme is called Empirical Risk Minimization (ERM). We denote such a learnt hypothesis by $h_S \in \operatorname{argmin}_{h \in \mathcal{H}} L_S(h)$.
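The ERM scheme can be sketched in a few lines of code. The threshold-classifier setup below is purely illustrative (none of these specifics come from the text; the class is chosen finite so that later results apply):

```python
import random

# Illustrative setup: domain X = {0, ..., 99}, uniform distribution D,
# and an assumed true mapping f given by a threshold (for demonstration only).
f = lambda x: 1 if x >= 30 else 0

# Finite hypothesis class H: all integer-threshold classifiers on X.
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(100)]

def sample_training_set(m, seed=0):
    """Draw S = ((x_1, y_1), ..., (x_m, y_m)) with x_i ~ D i.i.d. and y_i = f(x_i)."""
    rng = random.Random(seed)
    return [(x, f(x)) for x in (rng.randrange(100) for _ in range(m))]

def empirical_risk(h, S):
    """L_S(h): fraction of training examples that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """Return some h_S in argmin_{h in H} L_S(h)."""
    return min(H, key=lambda h: empirical_risk(h, S))

S = sample_training_set(m=50)
h_S = erm(H, S)
print(empirical_risk(h_S, S))  # 0.0 here, since f itself lies in H
```

Note that because $f$ belongs to $\mathcal{H}$ in this toy setup, ERM always finds a hypothesis with zero empirical risk.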
It is easy to construct a learner that achieves $L_S(h_S) = 0$ (e.g., by memorizing the training set) but results in a bad generalization error $L_{\mathcal{D}, f}(h_S)$. This phenomenon is called overfitting.
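A minimal demonstration of overfitting, assuming a continuous domain so that fresh points are almost never repeats of training points (all specifics here are invented for illustration):

```python
import random

rng = random.Random(0)
f = lambda x: 1 if x >= 0.5 else 0            # assumed true mapping on X = [0, 1)
S = [(x, f(x)) for x in (rng.random() for _ in range(100))]

# A "memorizing" learner: perfect on S, predicts 0 on anything unseen.
memory = dict(S)
h = lambda x: memory.get(x, 0)

L_S = sum(1 for x, y in S if h(x) != y) / len(S)

# Monte Carlo estimate of the generalization error on fresh draws from D.
test = [rng.random() for _ in range(10_000)]
L_D = sum(1 for x in test if h(x) != f(x)) / len(test)

print(L_S, L_D)  # L_S = 0.0, while L_D is near 0.5 (no better than chance)
```

The empirical risk is exactly zero, yet on fresh samples the memorizer is no better than a coin flip.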
Under which condition(s) can we avoid overfitting? It is natural to consider a finite hypothesis class $\mathcal{H}$. Indeed, this avoids overfitting, given a large enough training set.
To prove this, let us first assume the learnt hypothesis is consistent with $S$, i.e., $L_S(h_S) = 0$; this is guaranteed for an ERM learner under the realizability assumption that some $h^* \in \mathcal{H}$ has $L_{\mathcal{D}, f}(h^*) = 0$. Our interest lies in the probability that such an ERM learner learns a bad $h_S$, i.e., $\Pr[L_{\mathcal{D}, f}(h_S) > \epsilon]$.
Denote the set of bad hypotheses by $\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D}, f}(h) > \epsilon\}$; the question becomes the probability of sampling an $S$ so that $L_S(h) = 0$ for some $h \in \mathcal{H}_B$.
As $L_{\mathcal{D}, f}(h) > \epsilon$ for any $h \in \mathcal{H}_B$, the probability of sampling an instance $x$ from $\mathcal{D}$ that satisfies $h(x) = f(x)$ is less than or equal to $1 - \epsilon$, and thus, by independence of the $m$ samples, $\Pr[L_S(h) = 0] \le (1 - \epsilon)^m$.
For each $h \in \mathcal{H}_B$, the probability of sampling such an $S$ is therefore less than or equal to $(1 - \epsilon)^m \le e^{-\epsilon m}$.
As probability theory states that $\Pr[A \cup B] \le \Pr[A] + \Pr[B]$ (the union bound), the probability of interest is less than or equal to $\sum_{h \in \mathcal{H}_B} e^{-\epsilon m} = |\mathcal{H}_B| \, e^{-\epsilon m} \le |\mathcal{H}| \, e^{-\epsilon m}$.
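Under the illustrative threshold-class setup (invented here, not from the text), the whole chain — including the elementary inequality $(1 - \epsilon)^m \le e^{-\epsilon m}$ — can be sanity-checked by simulation:

```python
import math
import random

rng = random.Random(1)
f = lambda x: 1 if x >= 30 else 0          # assumed true threshold on X = {0..99}
h = lambda t, x: 1 if x >= t else 0        # hypothesis class: thresholds 0..99
H = range(100)

eps, m, trials = 0.1, 100, 2000
bad = 0
for _ in range(trials):
    xs = [rng.randrange(100) for _ in range(m)]                   # S ~ D^m, D uniform
    t_S = next(t for t in H if all(h(t, x) == f(x) for x in xs))  # a consistent h_S
    true_err = sum(1 for x in range(100) if h(t_S, x) != f(x)) / 100
    bad += true_err > eps

bound = len(H) * math.exp(-eps * m)        # |H| e^{-eps m}
print(bad / trials, "<=", bound)
```

With these parameters the bound is about $100 \cdot e^{-10} \approx 0.0045$, and the observed frequency of a bad consistent hypothesis stays below it.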
The theorem says that, given a finite hypothesis class $\mathcal{H}$, if we i.i.d. sample $m$ instances from $\mathcal{D}$, and use the ERM scheme with some algorithm to learn a consistent $h_S$ from $\mathcal{H}$, the probability that $L_{\mathcal{D}, f}(h_S) > \epsilon$ is upper bounded by $|\mathcal{H}| \, e^{-\epsilon m}$.
A natural corollary is that: if we want to bound the probability of learning such a bad $h_S$ by $\delta$, i.e., $|\mathcal{H}| \, e^{-\epsilon m} \le \delta$, the training-set size $m$ is required to be at least $\frac{\log(|\mathcal{H}| / \delta)}{\epsilon}$.
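A one-liner makes the corollary concrete (the numbers below are arbitrary examples, not from the text):

```python
import math

def sample_complexity(H_size, eps, delta):
    # Smallest integer m satisfying |H| * exp(-eps * m) <= delta,
    # i.e., m >= log(|H| / delta) / eps.
    return math.ceil(math.log(H_size / delta) / eps)

m = sample_complexity(H_size=1000, eps=0.01, delta=0.05)
print(m)  # 991
assert 1000 * math.exp(-0.01 * m) <= 0.05
```

Note the pleasant dependence: $m$ grows only logarithmically in $|\mathcal{H}|$ and in $1/\delta$, but linearly in $1/\epsilon$.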
This is *probably* (i.e., with probability at least $1 - \delta$) *approximately* (i.e., making a correct prediction with probability at least $1 - \epsilon$) *correct* (PAC).