[latexpage]

# Chapter 1 Introduction

• Data set
• Instance
• Attribute/Feature
• Attribute Value
• Attribute Space/Sample Space/Input Space
• Dimensionality (of attribute space)
• Learning/Training
• Training Data
• Training Set
• Hypothesis
• vs. Ground-truth
• Prediction
• Label
• Label Space
• Classification/Regression
• differ in whether the predicted output is continuous (regression) or discrete (classification)
• Testing Sample
• Clustering
• Unlabeled training data
• Supervised/Unsupervised Learning
• Generalization
• Distribution
• Concept learning/Black Box Model
• Learning Process
• Searching the hypothesis space for hypotheses that fit the training set
• Version Space
• Set of hypotheses that fit the training data
• Inductive Bias
• Selects a unique hypothesis from the version space
• Occam’s Razor
• Needs a priori knowledge about the specific problem; this often determines the effectiveness of the algorithm
• No Free Lunch (NFL) Theorem

# Chapter 2 Model Evaluation & Selection

• Accuracy/Error rate
• Training Error = Empirical Error
• Generalization Error
• Overfitting/Underfitting
• Overfitting is unavoidable as long as P $\neq$ NP
• Testing Error
• Approximates the generalization error
• Method of Evaluation
• Hold-out
• Stratified sampling
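A stratified hold-out split can be sketched in plain Python (an illustrative sketch only; the function and variable names are ours, not from the notes):

```python
import random

def stratified_holdout(samples, labels, test_ratio=0.3, seed=0):
    """Hold-out split that preserves class proportions (stratified sampling)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(round(len(xs) * test_ratio))  # per-class test share
        test += [(x, y) for x in xs[:cut]]
        train += [(x, y) for x in xs[cut:]]
    return train, test
```

With a 70/30 class mix and `test_ratio=0.3`, the test set keeps the same 70/30 class mix, which is the point of stratification.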
• k-fold cross validation
• Leave-one-out ($k=m$, the number of samples)
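k-fold cross-validation splits can be sketched as follows (illustrative stdlib-Python sketch; the function name is ours). Setting $k=m$ recovers leave-one-out:

```python
import random

def kfold_indices(m, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx
```

Each sample appears in exactly one test fold, so the k test errors average into one cross-validation estimate.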
• Bootstrapping
• Bootstrap sampling
• Estimation bias: bootstrap sampling changes the distribution of the original dataset
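Bootstrap sampling draws $m$ samples with replacement; the samples never drawn (out-of-bag) form the test set. A sketch in stdlib Python (names are ours):

```python
import random

def bootstrap_split(m, seed=0):
    """Draw m indices with replacement (bootstrap sampling);
    the never-drawn (out-of-bag) indices serve as the test set."""
    rng = random.Random(seed)
    picked = [rng.randrange(m) for _ in range(m)]
    oob = sorted(set(range(m)) - set(picked))
    return picked, oob
```

The expected out-of-bag fraction is $(1-\frac{1}{m})^{m} \rightarrow \frac{1}{e} \approx 0.368$, so roughly 36.8% of the data ends up in the test set.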
• Parameter Tuning
• Validation Set
• Performance Measure (of generalization ability)
• Mean Squared Error $E(f;D)=\frac{1}{m}\sum^{m}_{i=1}(f(x_i)-y_i)^{2}$
• Error rate $E(f;D)=\frac{1}{m}\sum^{m}_{i=1}I(f(x_i)\neq y_i)$
• Accuracy $acc(f;D)=1-E(f;D)$
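The three measures above translate directly into code (illustrative sketch; function names are ours):

```python
def mse(preds, ys):
    """Mean squared error E(f;D) for regression."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def error_rate(preds, ys):
    """Fraction of misclassified samples."""
    return sum(p != y for p, y in zip(preds, ys)) / len(ys)

def accuracy(preds, ys):
    """acc(f;D) = 1 - E(f;D)."""
    return 1.0 - error_rate(preds, ys)
```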
• Precision $P=\frac{TruePositive}{TruePositive+FalsePositive}$
• Recall $R=\frac{TruePositive}{TruePositive+FalseNegative}$
• P-R graph
• Break-Even Point $P=R$
• F1 Measure $F1=\frac{2\times P\times R}{P+R}=\frac{2\times TruePositive}{m+TruePositive-TrueNegative}$, where $m$ is the number of samples
• $F_{\beta}$ measure, macro-{P, R, F1}, micro-{P, R, F1}, etc.
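Precision, recall, and F1 from raw predictions, as a stdlib-Python sketch (the function name is ours):

```python
def prf1(preds, ys, positive=1):
    """Precision, recall, and F1 for a binary classifier's predictions."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, ys))
    fp = sum(p == positive and y != positive for p, y in zip(preds, ys))
    fn = sum(p != positive and y == positive for p, y in zip(preds, ys))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```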
• threshold/cut point
• ROC: Receiver Operating Characteristic curve
• $TruePosRate=\frac{TruePos}{TruePos+FalseNeg}$
• $FalsePosRate=\frac{FalsePos}{TrueNeg+FalsePos}$
• AUC= Area Under ROC Curve
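AUC can be computed without drawing the curve, via its rank-statistic interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative, counting ties as one half. A sketch (function name is ours):

```python
def roc_auc(scores, ys):
    """AUC as the pairwise rank statistic over positive/negative score pairs;
    this equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, ys) if y == 1]
    neg = [s for s, y in zip(scores, ys) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives AUC = 1; a random one gives about 0.5.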
• Unequal cost, Cost matrix, Total cost
• Cost-sensitive Error Rate (for binary classification)
\begin{align}
\nonumber
E(f;D;cost)=\frac{1}{m}\Big(\sum_{x_i \in D^{+}} I(f(x_i)\neq y_i)\times cost_{01} \\
\nonumber
+\sum_{x_i \in D^{-}}I(f(x_i)\neq y_i)\times cost_{10}\Big)
\end{align}
• Cost Curve*
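The cost-sensitive error rate above, as a stdlib-Python sketch for binary labels 1 (positive, $D^{+}$) and 0 (negative, $D^{-}$); the function name is ours:

```python
def cost_sensitive_error(preds, ys, cost01, cost10):
    """E(f;D;cost): misclassifying a positive (y=1) costs cost01,
    misclassifying a negative (y=0) costs cost10."""
    total = 0.0
    for p, y in zip(preds, ys):
        if p != y:
            total += cost01 if y == 1 else cost10
    return total / len(ys)
```

With equal costs ($cost_{01}=cost_{10}=1$) this reduces to the plain error rate.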
• Hypothesis Test
• statistical analysis of learner’s generalization ability
• Skipped; see pages 37-44
• Bias-variance Decomposition
• $E(f;D)=bias^2(x)+var(x)+\varepsilon^2$
• see page 45
• bias-variance dilemma
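For reference, the decomposition above spelled out in its standard regression form (stated here as a reminder, with $\bar{f}(x)=\mathbb{E}_D[f(x;D)]$ the expected prediction over training sets $D$, $y$ the true label, and $y_D$ the possibly noisy label observed in $D$):

\begin{align}
\nonumber
E(f;D)=\underbrace{\big(\bar{f}(x)-y\big)^2}_{bias^2(x)}
+\underbrace{\mathbb{E}_D\big[(f(x;D)-\bar{f}(x))^2\big]}_{var(x)}
+\underbrace{\mathbb{E}_D\big[(y_D-y)^2\big]}_{\varepsilon^2}
\end{align}

The dilemma: more flexible models shrink $bias^2(x)$ but inflate $var(x)$, so neither term can be driven to zero independently.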