[latexpage]
Chapter 1 Introduction
- Data set
- Instance
- Attribute/Feature
- Attribute Value
- Attribute Space/Sample Space/Input Space
- Dimensionality (of attribute space)
- Learning/Training
- Training Data
- Training Set
- Hypothesis
- vs. Ground Truth
- Prediction
- Label
- Label Space
- Classification/Regression
- differ in label continuity: discrete for classification, continuous for regression
- Testing Sample
- Clustering
- Unlabeled training data
- Supervised/Unsupervised Learning
- Generalization
- Distribution
- Concept learning/Black Box Model
- Learning Process
- Searching the hypothesis space
- to fit the training set
- Version Space
- Set of hypotheses that fit the training data (see the sketch after this list)
- Inductive Bias
- Selects a single hypothesis from the version space
- Occam’s Razor
- Needs a priori knowledge about the specific problem; often determines the effectiveness of the algorithm
- No Free Lunch (NFL) Theorem
- averaged over all possible problems, every learning algorithm performs equally well, so problem-specific knowledge is essential
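To make "searching the hypothesis space to fit the training set" concrete, here is a minimal Python sketch (not from the book; the two boolean attributes and the conjunctive hypothesis representation are invented for illustration). It enumerates a tiny hypothesis space and keeps exactly the hypotheses consistent with the training data, i.e. the version space:

```python
from itertools import product

# Toy training set over two boolean attributes (invented for illustration)
data = [((1, 1), 1), ((0, 0), 0)]

# A hypothesis is a conjunction: each attribute is required to be 0 or 1,
# or ignored entirely (None acts as a wildcard).
def predict(hyp, x):
    return int(all(h is None or h == xi for h, xi in zip(hyp, x)))

hypotheses = list(product([0, 1, None], repeat=2))

# Version space = all hypotheses consistent with every training example
version_space = [h for h in hypotheses
                 if all(predict(h, x) == y for x, y in data)]
print(version_space)   # [(1, 1), (1, None), (None, 1)]
# Three hypotheses fit the data equally well; an inductive bias
# (e.g. Occam's razor) is what selects one of them.
```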
Chapter 2 Model Evaluation & Selection
- Accuracy/Error rate
- Training Error = Empirical Error
- Generalization Error
- Overfitting/Underfitting
- Overfitting is unavoidable as long as P$\neq$NP
- Testing Error
- approximates the generalization error
- Method of Evaluation
- Hold-out
- Stratified sampling
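A minimal sketch of a stratified hold-out split, assuming scikit-learn is available; the 70/30 split and the 80/20 class imbalance are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)      # toy features
y = np.array([0] * 80 + [1] * 20)      # toy imbalanced labels (80/20)

# stratify=y preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
print(y_tr.mean(), y_te.mean())        # both close to 0.20
```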
- k-fold cross validation
- Leave-One-Out (k = N)
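A minimal sketch of both schemes, again assuming scikit-learn; the 10-sample dataset and k = 5 are arbitrary:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)       # toy 10-sample dataset

# 5-fold: every sample lands in the test fold exactly once
for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print(test_idx)

# Leave-one-out is the k = N special case: N folds, one held-out sample each
print(sum(1 for _ in LeaveOneOut().split(X)))   # 10
```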
- Bootstrapping
- Bootstrap sampling
- Estimation bias: bootstrap sampling changes the distribution of the original dataset
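A minimal NumPy sketch of bootstrap sampling (the dataset size is arbitrary). It shows that roughly $(1-1/m)^m \approx 1/e \approx 36.8\%$ of the samples are never drawn and can serve as an out-of-bag test set, while the resampling itself is what shifts the training distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                               # toy dataset size (arbitrary)
idx = rng.integers(0, m, size=m)         # draw m samples with replacement
oob_frac = 1 - np.unique(idx).size / m   # fraction never drawn (out-of-bag)
print(oob_frac)                          # close to (1 - 1/m)^m -> 1/e ~ 0.368
```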
- Parameter Tuning
- Validation Set
- Performance Measure (of generalization ability)
- Mean Squared Error $E(f;D)=\frac{1}{m}\sum^{m}_{i=1}(f(x_i)-y_i)^{2}$
- Error rate $E(f;D)=\frac{1}{m}\sum^{m}_{i=1}I(f(x_i)\neq y_i)$
- Accuracy $acc(f;D)=1-E(f;D)$
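A minimal sketch computing these three measures straight from the definitions above; all labels and predictions are toy values:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])      # toy regression targets y_i
pred = np.array([2.5, 0.0, 2.0, 8.0])    # toy model outputs f(x_i)
print(np.mean((pred - y) ** 2))          # mean squared error: 0.375

y_cls = np.array([0, 1, 1, 0])           # toy classification labels
pred_cls = np.array([0, 1, 0, 0])
err = np.mean(pred_cls != y_cls)         # error rate: 0.25
print(err, 1 - err)                      # accuracy = 1 - error rate
```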
- Precision $P=\frac{TP}{TP+FP}$ (TP: true positives, FP: false positives)
- Recall $R=\frac{TP}{TP+FN}$ (FN: false negatives)
- P-R graph
- Break-Even Point $P=R$
- F1 Measure $F1=\frac{2\times P\times R}{P+R}=\frac{2\times TP}{m+TP-TN}$ (m: number of samples, TN: true negatives)
- $F_{\beta}$ measure, macro-{P, R, F1}, micro-{P, R, F1}, etc.
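A minimal sketch that builds the confusion-matrix counts from toy labels and verifies that the two F1 expressions above agree:

```python
import numpy as np

y    = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # toy ground truth
pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # toy predictions

tp = np.sum((pred == 1) & (y == 1))          # 3
fp = np.sum((pred == 1) & (y == 0))          # 1
fn = np.sum((pred == 0) & (y == 1))          # 1
tn = np.sum((pred == 0) & (y == 0))          # 3

p = tp / (tp + fp)                           # precision = 0.75
r = tp / (tp + fn)                           # recall    = 0.75
f1 = 2 * p * r / (p + r)
f1_alt = 2 * tp / (len(y) + tp - tn)         # identity from the note above
print(p, r, f1, f1_alt)                      # f1 == f1_alt == 0.75
```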
- threshold/cut point
- ROC: Receiver Operating Characteristic curve
- True Positive Rate $TPR=\frac{TP}{TP+FN}$
- False Positive Rate $FPR=\frac{FP}{TN+FP}$
- AUC = Area Under the ROC Curve
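A minimal sketch computing AUC through its rank-statistic interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count one half), which equals the area under the ROC curve. Scores and labels are toy values:

```python
import numpy as np

y      = np.array([1, 1, 0, 1, 0, 0])               # toy labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])   # toy classifier scores

pos, neg = scores[y == 1], scores[y == 0]
# Compare every positive score with every negative score
auc = np.mean((pos[:, None] > neg[None, :])
              + 0.5 * (pos[:, None] == neg[None, :]))
print(auc)   # 8/9 ~ 0.889
```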
- Unequal cost, Cost matrix, Total cost
- Cost-sensitive Error Rate (for binary classification)
\begin{align}
\nonumber
E(f;D;cost)=\frac{1}{m}\Big(\sum_{x_i \in D^{+}} I(f(x_i)\neq y_i)\times cost_{01} \\
\nonumber
+\sum_{x_i \in D^{-}} I(f(x_i)\neq y_i)\times cost_{10}\Big)
\end{align}
- Cost Curve*
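A minimal sketch of the cost-sensitive error rate above, with invented labels and an invented cost matrix in which missing a positive ($cost_{01}$) is five times worse than a false alarm ($cost_{10}$):

```python
import numpy as np

y    = np.array([1, 1, 0, 0, 1, 0])      # toy labels: 1 marks D+, 0 marks D-
pred = np.array([0, 1, 1, 0, 1, 0])      # toy predictions
cost01, cost10 = 5.0, 1.0                # invented unequal costs

pos_miss = np.sum((y == 1) & (pred != y))   # errors on D+ (here: 1)
neg_miss = np.sum((y == 0) & (pred != y))   # errors on D- (here: 1)
print((pos_miss * cost01 + neg_miss * cost10) / len(y))   # (5 + 1)/6 = 1.0
```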
- Hypothesis Test
- statistical analysis of a learner's generalization ability
- skipped; see pages 37-44
- Bias-variance Decomposition
- $E(f;D)=bias^2(x)+var(x)+\varepsilon^2$
- see page 45
- bias-variance dilemma
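A minimal simulation sketch of the bias-variance dilemma; the sine target, the noise level, the polynomial degrees, and the sample sizes are all invented. Fitting polynomials of growing degree on many resampled training sets, the low-degree model shows high bias and low variance, the high-degree model the reverse, and $bias^2(x)+var(x)+\varepsilon^2$ tracks the expected squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = np.sin                          # true underlying function (invented)
noise_sd = 0.3                           # label noise epsilon ~ N(0, 0.3^2)
x_test = np.linspace(0, np.pi, 50)

def fit_predict(degree):
    # Draw a fresh noisy training set, fit a polynomial, predict on x_test
    x = rng.uniform(0, np.pi, 30)
    y = f_true(x) + rng.normal(0, noise_sd, x.size)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 3, 9):
    preds = np.array([fit_predict(degree) for _ in range(500)])
    bias2 = np.mean((preds.mean(axis=0) - f_true(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2={bias2:.4f} var={var:.4f} "
          f"sum={bias2 + var + noise_sd**2:.4f}")
```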