[latexpage]

this is a brief note of simplified NFL described in Machine Learning

*General Idea: *

“states that any two optimization algorithms are equivalent when their performance is averaged across all possible problems”

**Example:**

Suppose sample space $\mathcal{X}$ and hypothesis space $\mathcal{H}$ are all discrete, and the distribution of problems are even. Given training data $X$, learning algorithm $\mathcal{L}_a$, and denote the unknown truth function to be learned as $f$. Then the summed error on samples out of training set (i.e. in $\mathcal{X}-X$) of learning algorithm $\mathcal{L}_a$ is:

\[

E_{ote}(\mathcal{L}_a|X,f)=\sum_{h}\sum_{x\in \mathcal{X}-X}P(x)I(h(x)\neq f(x))P(h|X,\mathcal{L}_a)

\]

$I(\textit{stmt})$ is the indicator function, returns 1 if the statement included is true, returns 0 otherwise.

This is equation is pretty straight forward, the product of (1)the probability of each hypothesis (2) the probability of each example in $\mathcal{X}-X$ (3)the indicator function is summed.

In the case of binary classification, if we assume (for the sake of simplicity) **the truth object function can be any function** in $\mathcal{X} \mapsto\{0,1\}$, and the function space is $\{0,1\}^{|\mathcal{X} |}$. **Assume $f$ is uniformly distributed**, we sum $E_{ote}$ for all possible $f$(Note the implication of the highlighted assumption):

\begin{align}

\sum_{f}E_{ote}(\mathcal{L}_{a}|X,f)&=\sum_{f}\sum_{h}\sum_{x\in \mathcal{X}-X}P(x)I(h(x)\neq f(x))P(h|X,\mathcal{L}_a)\\

&=\sum_{x\in \mathcal{X}-X}P(x)\sum_{h}P(h|X,\mathcal{L}_{a})\sum_{f}I(h(x)\neq f(x))\\

&=\sum_{x\in \mathcal{X}-X}P(x)\sum_{h}P(h|X,\mathcal{L}_{a})\frac{1}{2} 2^{|\mathcal{X} |}\\

&=\frac{1}{2} 2^{|\mathcal{X} |}\sum_{x\in \mathcal{X}-X}P(x)\sum_{h}P(h|X,\mathcal{L}_{a})\\

&=2^{|\mathcal{X}|-1}\sum_{x\in \mathcal{X}-X} P(x)\cdot 1

\end{align}

**TL;DR**

As a result, under certain assumptions, $E_{ote}$ is not related to learning algorithm $\mathcal{L}_{a}$.

This is not saying that learning algorithm is not important. The takeaway is that comparing the efficiency of learning algorithm without the context of problem is meaningless. The selection of learning algorithm is highly correlated to *a priori* information about the problem we try to solve.