Lecture Slides for INTRODUCTION TO MACHINE LEARNING, 3RD EDITION, ETHEM ALPAYDIN, The MIT Press, 2014
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

CHAPTER 19: DESIGN AND ANALYSIS OF MACHINE LEARNING EXPERIMENTS

Introduction

Questions:

- Assessment of the expected error of a learning algorithm: Is the error rate of 1-NN less than 2%?
- Comparing the expected errors of two algorithms: Is k-NN more accurate than MLP?
- Training/validation/test sets
- Resampling methods: K-fold cross-validation

Algorithm Preference

Criteria (application-dependent):

- Misclassification error, or risk (loss functions)
- Training time/space complexity
- Testing time/space complexity
- Interpretability
- Easy programmability
- Cost-sensitive learning

Factors and Response

- Response function based on the output to be maximized
- Depends on controllable factors

- Uncontrollable factors introduce randomness
- Find the configuration of controllable factors that maximizes the response and is minimally affected by uncontrollable factors

Strategies of Experimentation

How to search the factor space? Response surface design approximates and maximizes the response function in terms of the controllable factors.

Guidelines for ML Experiments

A. Aim of the study
B. Selection of the response variable
C. Choice of factors and levels
D. Choice of experimental design
E. Performing the experiment
F. Statistical analysis of the data
G. Conclusions and recommendations

Resampling and K-Fold Cross-Validation

The need for multiple training/validation sets: $\{X_i, V_i\}_i$ are the training/validation sets of fold $i$.

K-fold cross-validation: divide $X$ into $K$ parts $X_i$, $i = 1, \ldots, K$:

$V_1 = X_1$, $T_1 = X_2 \cup X_3 \cup \cdots \cup X_K$
$V_2 = X_2$, $T_2 = X_1 \cup X_3 \cup \cdots \cup X_K$
$\vdots$
$V_K = X_K$, $T_K = X_1 \cup X_2 \cup \cdots \cup X_{K-1}$

Any two training sets $T_i$ share $K - 2$ of the $K$ parts.

5×2 Cross-Validation

Five replications of 2-fold cross-validation (Dietterich, 1998), where $X_i^{(1)}$ and $X_i^{(2)}$ are the two folds of replication $i$:

$T_1 = X_1^{(1)}$, $V_1 = X_1^{(2)}$
$T_2 = X_1^{(2)}$, $V_2 = X_1^{(1)}$
$T_3 = X_2^{(1)}$, $V_3 = X_2^{(2)}$
$T_4 = X_2^{(2)}$, $V_4 = X_2^{(1)}$
$\vdots$
$T_9 = X_5^{(1)}$, $V_9 = X_5^{(2)}$
$T_{10} = X_5^{(2)}$, $V_{10} = X_5^{(1)}$
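As a concrete illustration, the fold construction above can be sketched in plain Python (the function name and the seeded shuffle are my own choices, not from the slides):

```python
import random

def k_fold_splits(n, k, seed=0):
    """K-fold cross-validation: shuffle once, cut the indices into K parts;
    fold i validates on part i and trains on the remaining K-1 parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for p in range(k) if p != i for j in parts[p]]
        yield train, parts[i]

folds = list(k_fold_splits(20, 5))

# 5x2 cv is simply 2-fold cross-validation repeated with 5 different shuffles:
five_by_two = [list(k_fold_splits(20, 2, seed=s)) for s in range(5)]
```

Each instance lands in exactly one validation part of a run, which is why any two training sets share $K - 2$ of the $K$ parts.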

Bootstrapping

Draw $N$ instances from a dataset of size $N$ with replacement. The probability that we do not pick a given instance after $N$ draws is
$$\left(1 - \frac{1}{N}\right)^N \approx e^{-1} = 0.368,$$
that is, only 36.8% is new!
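The 0.368 figure is easy to verify numerically; a quick check (the function name is mine):

```python
import math

def p_not_drawn(n):
    """Probability that a fixed instance is never picked in n draws
    with replacement from a dataset of size n."""
    return (1 - 1 / n) ** n

# (1 - 1/N)^N approaches e^{-1} = 0.368 from below as N grows.
print(p_not_drawn(10), p_not_drawn(1000), math.exp(-1))
```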

Performance Measures

- Error rate = # of errors / # of instances = (FN + FP) / N
- Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
- Precision = # of found positives / # of found = TP / (TP + FP)
- Specificity = TN / (TN + FP)
- False alarm rate = FP / (FP + TN) = 1 - Specificity

ROC Curve

(Figure slides.)
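A minimal sketch of these measures computed from the confusion-matrix counts (function and variable names are mine, and the counts are made up):

```python
def measures(tp, fp, tn, fn):
    """Binary-classification measures from the confusion matrix counts."""
    n = tp + fp + tn + fn
    return {
        "error_rate": (fn + fp) / n,
        "recall": tp / (tp + fn),        # sensitivity, hit rate
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "false_alarm": fp / (fp + tn),   # = 1 - specificity
    }

m = measures(tp=40, fp=10, tn=45, fn=5)
```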

Precision and Recall

(Figure slide.)

Interval Estimation

$X = \{x^t\}_t$ where $x^t \sim \mathcal{N}(\mu, \sigma^2)$, so the sample average $m \sim \mathcal{N}(\mu, \sigma^2/N)$ and
$$\frac{\sqrt{N}(m - \mu)}{\sigma} \sim Z$$

Two-sided:
$$P\left\{-1.96 < \frac{\sqrt{N}(m - \mu)}{\sigma} < 1.96\right\} = 0.95$$
$$P\left\{m - 1.96\frac{\sigma}{\sqrt{N}} < \mu < m + 1.96\frac{\sigma}{\sqrt{N}}\right\} = 0.95$$
In general, the $100(1 - \alpha)$ percent confidence interval is
$$P\left\{m - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} < \mu < m + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}\right\} = 1 - \alpha$$

One-sided:
$$P\left\{\frac{\sqrt{N}(m - \mu)}{\sigma} < 1.64\right\} = 0.95$$
$$P\left\{m - 1.64\frac{\sigma}{\sqrt{N}} < \mu\right\} = 0.95$$
and the $100(1 - \alpha)$ percent one-sided confidence interval is
$$P\left\{m - z_{\alpha}\frac{\sigma}{\sqrt{N}} < \mu\right\} = 1 - \alpha$$

When $\sigma^2$ is not known:
$$S^2 = \frac{\sum_t (x^t - m)^2}{N - 1}, \qquad \frac{\sqrt{N}(m - \mu)}{S} \sim t_{N-1}$$
$$P\left\{m - t_{\alpha/2, N-1}\frac{S}{\sqrt{N}} < \mu < m + t_{\alpha/2, N-1}\frac{S}{\sqrt{N}}\right\} = 1 - \alpha$$

Hypothesis Testing

Reject a null hypothesis if it is not supported by the sample with enough confidence.

$X = \{x^t\}_t$ where $x^t \sim \mathcal{N}(\mu, \sigma^2)$; test $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$.

Two-sided test: accept $H_0$ with level of significance $\alpha$ if $\mu_0$ lies in the $100(1 - \alpha)$ percent confidence interval, i.e., if
$$\frac{\sqrt{N}(m - \mu_0)}{\sigma} \in (-z_{\alpha/2},\, z_{\alpha/2})$$
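A sketch of the interval and the two-sided z test, assuming a known sigma and the tabulated value $z_{0.025} = 1.96$ (the data and names are illustrative):

```python
import math
from statistics import mean

Z_975 = 1.96  # tabulated z_{alpha/2} for alpha = 0.05

def z_interval(x, sigma):
    """95% two-sided confidence interval for mu when sigma is known."""
    m, n = mean(x), len(x)
    half = Z_975 * sigma / math.sqrt(n)
    return m - half, m + half

def z_test_two_sided(x, sigma, mu0):
    """Accept H0: mu = mu0 iff sqrt(N)(m - mu0)/sigma is in (-1.96, 1.96)."""
    z = math.sqrt(len(x)) * (mean(x) - mu0) / sigma
    return abs(z) < Z_975

x = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]
lo, hi = z_interval(x, sigma=0.15)
```

Accepting $H_0$ is exactly the same event as $\mu_0$ falling inside the interval.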

One-sided test: $H_0: \mu \le \mu_0$ vs. $H_1: \mu > \mu_0$. Accept $H_0$ if
$$\frac{\sqrt{N}(m - \mu_0)}{\sigma} \in (-\infty,\, z_{\alpha})$$

Variance unknown: use $t$ instead of $z$. Accept $H_0: \mu = \mu_0$ if
$$\frac{\sqrt{N}(m - \mu_0)}{S} \in (-t_{\alpha/2, N-1},\, t_{\alpha/2, N-1})$$

Assessing Error

$H_0: p \le p_0$ vs. $H_1: p > p_0$.

Single training/validation set: the Binomial Test. If the error probability is $p_0$, the probability that there are $e$ errors or fewer in $N$ validation trials is
$$P\{X \le e\} = \sum_{j=0}^{e} \binom{N}{j} p_0^j (1 - p_0)^{N-j}$$
Accept $H_0$ if this probability is less than $1 - \alpha$. (Example: $N = 100$, $e = 20$.)
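The binomial tail $P\{X \le e\}$ is straightforward to compute exactly; a sketch using the slide's $N = 100$, $e = 20$, with $p_0 = 0.15$ as an assumed null error rate:

```python
from math import comb

def binom_cdf(e, n, p0):
    """P{X <= e} for X ~ Binomial(n, p0) errors in n validation trials."""
    return sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(e + 1))

# N = 100, e = 20 as on the slide; p0 = 0.15 is an assumed value.
prob = binom_cdf(20, 100, 0.15)
accept = prob < 0.95  # accept H0: p <= p0 if the probability is below 1 - alpha
```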

Normal Approximation to the Binomial

The number of errors $X$ is approximately Gaussian with mean $Np_0$ and variance $Np_0(1 - p_0)$:
$$\frac{X - Np_0}{\sqrt{Np_0(1 - p_0)}} \sim Z$$
Accept $H_0$ if this value for $X = e$ is less than $z_{1-\alpha}$.

Paired t Test

Multiple training/validation sets: let $x_i^t = 1$ if instance $t$ is misclassified on fold $i$, so the error rate of fold $i$ is
$$p_i = \frac{\sum_{t=1}^{N} x_i^t}{N}$$

With $m$ and $S^2$ the average and variance of the $p_i$, we accept that the error is $p_0$ or less if
$$\frac{\sqrt{K}(m - p_0)}{S} \sim t_{K-1}$$
is less than $t_{\alpha, K-1}$.
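A sketch of this test over $K$ fold error rates, with the critical value $t_{0.05,9} = 1.833$ taken from a t table (the fold error rates are illustrative):

```python
import math
from statistics import mean, stdev

def error_t(p, p0):
    """sqrt(K) (m - p0) / S over the K fold error rates p_1..p_K."""
    return math.sqrt(len(p)) * (mean(p) - p0) / stdev(p)

# Hypothetical error rates on K = 10 folds, tested against p0 = 0.20.
p = [0.18, 0.21, 0.19, 0.17, 0.20, 0.22, 0.18, 0.19, 0.21, 0.20]
t = error_t(p, p0=0.20)
accept = t < 1.833  # tabulated t_{0.05, 9}: accept H0 that the error is <= p0
```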

Comparing Classifiers

$H_0: \mu_0 = \mu_1$ vs. $H_1: \mu_0 \ne \mu_1$.

Single training/validation set: McNemar's Test. Let $e_{01}$ be the number of examples misclassified by classifier 1 but not by 2, and $e_{10}$ the reverse. Under $H_0$, we expect $e_{01} = e_{10} = (e_{01} + e_{10})/2$, and
$$\frac{\left(|e_{01} - e_{10}| - 1\right)^2}{e_{01} + e_{10}} \sim \chi^2_1$$
Accept $H_0$ if this value is less than $\chi^2_{\alpha,1}$.
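A sketch of McNemar's statistic (the disagreement counts are made up for illustration):

```python
def mcnemar(e01, e10):
    """(|e01 - e10| - 1)^2 / (e01 + e10), approximately chi-square with 1 df.
    e01: misclassified by classifier 1 but not by 2; e10: the reverse."""
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

stat = mcnemar(e01=12, e10=5)
accept = stat < 3.84  # tabulated chi^2_{0.05, 1}: accept "equal error rates"
```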

K-Fold CV Paired t Test

Use K-fold cross-validation to get $K$ training/validation folds. Let $p_i^1$ and $p_i^2$ be the errors of classifiers 1 and 2 on fold $i$, and $p_i = p_i^1 - p_i^2$ their difference on fold $i$. The null hypothesis is whether $p$ has mean 0:

$H_0: \mu = 0$ vs. $H_1: \mu \ne 0$
$$m = \frac{\sum_{i=1}^{K} p_i}{K}, \qquad s^2 = \frac{\sum_{i=1}^{K} (p_i - m)^2}{K - 1}$$
$$\frac{\sqrt{K}(m - 0)}{s} \sim t_{K-1}$$
Accept $H_0$ if this value is in $(-t_{\alpha/2, K-1},\, t_{\alpha/2, K-1})$.
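A sketch over hypothetical per-fold error rates of two classifiers (all numbers are illustrative):

```python
import math
from statistics import mean, stdev

def cv_paired_t(p1, p2):
    """sqrt(K) m / s over the per-fold differences p_i = p_i^1 - p_i^2."""
    d = [a - b for a, b in zip(p1, p2)]
    return math.sqrt(len(d)) * mean(d) / stdev(d)

p1 = [0.14, 0.16, 0.15, 0.17, 0.13, 0.15, 0.16, 0.14, 0.15, 0.15]
p2 = [0.12, 0.15, 0.14, 0.15, 0.12, 0.14, 0.15, 0.13, 0.14, 0.13]
t = cv_paired_t(p1, p2)
same = abs(t) < 2.262  # tabulated t_{0.025, 9}: accept H0 of equal error
```

Here classifier 2 is consistently better, so the statistic falls outside the acceptance interval and $H_0$ is rejected.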

5×2 cv Paired t Test

Use 5×2 cv to get 2 folds of 5 training/validation replications (Dietterich, 1998). Let $p_i^{(j)}$ be the difference between the errors of classifiers 1 and 2 on fold $j = 1, 2$ of replication $i = 1, \ldots, 5$:
$$\bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2}, \qquad s_i^2 = \left(p_i^{(1)} - \bar{p}_i\right)^2 + \left(p_i^{(2)} - \bar{p}_i\right)^2$$
$$t = \frac{p_1^{(1)}}{\sqrt{\sum_{i=1}^{5} s_i^2 / 5}} \sim t_5$$
Two-sided test: accept $H_0: \mu_0 = \mu_1$ if $t \in (-t_{\alpha/2,5},\, t_{\alpha/2,5})$.
One-sided test: accept $H_0: \mu_0 \le \mu_1$ if $t < t_{\alpha,5}$.
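The 5×2 cv paired t statistic, together with the related F statistic, can be sketched as follows (the five difference pairs are made up):

```python
import math

def s2(pair):
    """s_i^2 = (p_i1 - pbar_i)^2 + (p_i2 - pbar_i)^2 for one replication."""
    a, b = pair
    m = (a + b) / 2
    return (a - m) ** 2 + (b - m) ** 2

def t_5x2(p):
    """t = p_1^(1) / sqrt(sum_i s_i^2 / 5), approximately t with 5 df."""
    return p[0][0] / math.sqrt(sum(s2(q) for q in p) / 5)

def f_5x2(p):
    """f = sum_ij (p_i^(j))^2 / (2 sum_i s_i^2), approximately F_{10,5}."""
    return sum(a * a + b * b for a, b in p) / (2 * sum(s2(q) for q in p))

# Hypothetical error differences on the two folds of five replications.
p = [(0.02, 0.01), (0.01, 0.03), (0.02, 0.02), (0.00, 0.02), (0.01, 0.01)]
t, f = t_5x2(p), f_5x2(p)
```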

5×2 cv Paired F Test

$$f = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left(p_i^{(j)}\right)^2}{2 \sum_{i=1}^{5} s_i^2} \sim F_{10,5}$$
Two-sided test: accept $H_0: \mu_0 = \mu_1$ if $f < F_{\alpha,10,5}$.

Comparing L > 2 Algorithms: Analysis of Variance (ANOVA)

$H_0: \mu_1 = \mu_2 = \cdots = \mu_L$

Errors of $L$ algorithms on $K$ folds:
$$X_{ij} \sim \mathcal{N}(\mu_j, \sigma^2), \quad j = 1, \ldots, L, \quad i = 1, \ldots, K$$
We construct two estimators of $\sigma^2$: one is valid only if $H_0$ is true, the other is always valid. We reject $H_0$ if the two estimators disagree.

If $H_0$ is true:
$$m_j = \frac{\sum_{i=1}^{K} X_{ij}}{K} \sim \mathcal{N}(\mu, \sigma^2/K)$$
$$m = \frac{\sum_{j=1}^{L} m_j}{L}, \qquad S^2 = \frac{\sum_{j=1}^{L} (m_j - m)^2}{L - 1}$$
Thus an estimator of $\sigma^2$ is $K \cdot S^2$, namely
$$K \frac{\sum_{j=1}^{L} (m_j - m)^2}{L - 1}$$
Defining the between-group sum of squares
$$SS_b \equiv K \sum_{j=1}^{L} (m_j - m)^2,$$
when $H_0$ is true we have
$$\frac{SS_b}{\sigma^2} \sim \chi^2_{L-1}$$

Regardless of $H_0$, our second estimator of $\sigma^2$ is the average of the group variances $S_j^2$:
$$S_j^2 = \frac{\sum_{i=1}^{K} (X_{ij} - m_j)^2}{K - 1}, \qquad \frac{\sum_{j=1}^{L} S_j^2}{L} = \frac{\sum_j \sum_i (X_{ij} - m_j)^2}{L(K - 1)}$$
Defining the within-group sum of squares
$$SS_w \equiv \sum_j \sum_i (X_{ij} - m_j)^2, \qquad \frac{SS_w}{\sigma^2} \sim \chi^2_{L(K-1)}$$
The ratio of the two estimators is
$$\frac{SS_b / (L - 1)}{SS_w / (L(K - 1))} \sim F_{L-1,\, L(K-1)}$$
Reject $H_0: \mu_1 = \cdots = \mu_L$ if this value is greater than $F_{\alpha, L-1, L(K-1)}$.

ANOVA Table

(Figure slide: the ANOVA table.)

If ANOVA rejects, we do pairwise posthoc tests, $H_0: \mu_i = \mu_j$ vs. $H_1: \mu_i \ne \mu_j$:
$$t = \frac{m_i - m_j}{\sqrt{2\sigma_w^2 / K}} \sim t_{L(K-1)}, \qquad \sigma_w^2 = \frac{SS_w}{L(K-1)}$$
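Putting the derivation together, a sketch of the F statistic for an $L \times K$ table of cross-validation errors (the table is made up):

```python
def anova_f(errors):
    """One-way ANOVA F for errors[j][i] = error of algorithm j on fold i:
    F = (SSb / (L-1)) / (SSw / (L(K-1)))."""
    L, K = len(errors), len(errors[0])
    m_j = [sum(row) / K for row in errors]
    m = sum(m_j) / L
    ss_b = K * sum((mj - m) ** 2 for mj in m_j)
    ss_w = sum((x - mj) ** 2 for row, mj in zip(errors, m_j) for x in row)
    return (ss_b / (L - 1)) / (ss_w / (L * (K - 1)))

errors = [
    [0.20, 0.22, 0.21, 0.19],  # algorithm 1 over K = 4 folds
    [0.15, 0.14, 0.16, 0.15],  # algorithm 2
    [0.21, 0.20, 0.22, 0.21],  # algorithm 3
]
F = anova_f(errors)
reject = F > 4.26  # tabulated F_{0.05, 2, 9}
```

Algorithm 2 is clearly better here, so the between-group spread dominates the within-group spread and $H_0$ is rejected.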

Comparison over Multiple Datasets

Comparing two algorithms: Sign test: count how many times A beats B over N datasets, and check whether this could have occurred by chance if A and B had the same error rate.

Comparing multiple algorithms: Kruskal-Wallis test: calculate the average rank of each algorithm over the N datasets, and check whether these could have occurred by chance if they all had equal error. If KW rejects, we do pairwise posthoc tests to find which pairs have a significant rank difference.
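A sketch of the sign test for two algorithms over N datasets (the win count is made up):

```python
from math import comb

def sign_test_p(wins, n):
    """One-sided p-value: probability of `wins` or more wins out of n
    datasets if A and B were equally likely to win each one."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

p = sign_test_p(12, 15)   # A beats B on 12 of 15 datasets
significant = p < 0.05    # unlikely under "equal error rates"
```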

Multivariate Tests

Instead of testing with a single performance measure, e.g., error, use multiple measures for better discrimination, e.g., [fp-rate, fn-rate]. Compare p-dimensional distributions. Parametric case: assume p-variate Gaussians.

Multivariate Pairwise Comparison

Paired differences: Hotelling's multivariate $T^2$ test. For $p = 1$, it reduces to the paired t test.

Multivariate ANOVA

Comparison of L > 2 algorithms.