Statistical Learning Methods in HEAP

Jens Zimmermann, Christian Kiesling
Max-Planck-Institut für Physik, München
MPI für extraterrestrische Physik, München
Forschungszentrum Jülich GmbH

Outline:
Statistical learning: introduction with a simple example
Occam's razor
Decision trees
Local density estimators
Methods based on linear separation
Examples: triggers in HEP and astrophysics
Conclusion

C. Kiesling, MPI for Physics, Munich - ACAT03 Workshop, KEK, Japan, Dec. 2003

Statistical Learning

Statistical learning uses no prior knowledge: no theory is required, the machine learns from examples only (trial and error, learning by reinforcement).

Two classes of statistical learning:
discrete output 0/1: classification
continuous output: regression

Applications in high-energy and astrophysics: background suppression and purification of event samples; estimation of parameters that are not directly measured.

A Simple Example: Preparing a Talk

[Scatter plot: # formulas (x10) vs. # slides (x10) per talk, experimentalists vs. theorists. Data base established by Jens during the Young Scientists Meeting at MPI.]

Discriminating Theorists from Experimentalists: A First Analysis

[Two scatter plots: "Talks a week before meeting" and "First talks handed in", # formulas vs. # slides.]

First Problems

A simple model gives no complete separation; complete separation is possible, but only via a complicated boundary. New talk by Ludger: 28 formulas on 31 slides.

At this point we cannot know which feature is real! Use a train/test split or cross-validation.

See Overtraining - Want Generalization - Need Regularization

[Plot: training error and test error E vs. training epochs; the test error rises again once overtraining sets in. Inset: training set and test set in the # formulas vs. # slides plane.]

The error on a sample of N events is

    E = (1/N) Σ_{i=1}^{N} (t_i - out(x_i))²

We want to tune the parameters of the learning algorithm depending on the overtraining seen.
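As a minimal sketch (illustrative Python, not code from the talk), the error measure above is just a mean squared error over a labelled sample:

```python
def mse(targets, outputs):
    """E = (1/N) * sum_i (t_i - out(x_i))^2, the error from the slide."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n
```

Evaluating `mse` separately on the training and test samples after each epoch gives exactly the two curves of the overtraining plot.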

See Overtraining - Want Generalization - Need Regularization

[Same plot, with the model parameters w made explicit:]

    E = (1/N) Σ_{i=1}^{N} (t_i - out(x_i, w))²

Regularization will ensure adequate performance (e.g. via VC dimensions): limit the complexity of the model. Factor-10 rule ("Uncle Bernie's rule #2"): use roughly ten times more training examples than free parameters.

Philosophy: Occam's Razor

"Pluralitas non est ponenda sine necessitate." - Do not make assumptions unless they are really necessary (14th century). From theories which describe the same phenomenon equally well, choose the one which contains the smallest number of assumptions.

First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself. Yes! But not of much use.

Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error. No! No-free-lunch theorem (Wolpert 1996).

Decision Trees

[Plot: # formulas vs. # slides with axis-parallel cuts.]

Cut on one variable at a time: #formulas < 20 → exp; #formulas > 60 → th; for the remaining subset with 20 < #formulas < 60, cut on the slides: #slides > 40 → exp, #slides < 40 → th. Starting from all events, each cut splits off a classified subset and the rest is split further.

Classify Ringaile: 31 formulas on 32 slides → th.

Regularization: pruning.

Local Density Estimators

Search for similar events already classified within a specified region, and count the members of the two classes in that region.

[Scatter plots: # formulas vs. # slides with the counting region indicated.]
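Returning to the decision tree: a minimal sketch in Python (cut values taken from the decision-tree slide; the function name is illustrative):

```python
def classify_talk(formulas, slides):
    """Toy decision tree from the slides: cut on #formulas, then #slides."""
    if formulas < 20:
        return "exp"          # few formulas: experimentalist
    if formulas > 60:
        return "th"           # many formulas: theorist
    # 20..60 formulas: decide on the number of slides
    return "exp" if slides > 40 else "th"

print(classify_talk(31, 32))  # Ringaile's talk -> "th"
```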

Maximum Likelihood

Histogram each input variable separately and multiply the bin frequencies of the projections. For the example event (31 formulas on 32 slides):

    p_Th  = (2/5) · (3/5) = 0.24
    p_Exp = (1/5) · (1/5) = 0.04

Regularization: binning. But the correlation gets lost completely by the projection!
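As a minimal sketch (illustrative Python), the projected-likelihood output is just a product of per-variable bin frequencies, reproducing the numbers above; the normalised combination at the end is an assumption, not taken from the slides:

```python
def projected_likelihood(bin_fracs):
    """Multiply per-projection bin frequencies (correlations are lost)."""
    p = 1.0
    for f in bin_fracs:
        p *= f
    return p

p_th = projected_likelihood([2 / 5, 3 / 5])   # 0.24, as on the slide
p_exp = projected_likelihood([1 / 5, 1 / 5])  # 0.04
out = p_th / (p_th + p_exp)  # hypothetical normalised output
```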

k-Nearest-Neighbour

[Plot: # formulas vs. # slides; the output for k = 1 ... 5 follows from the classes of the k nearest training events.]

Regularization: the parameter k. For every evaluation position the distances to each training position need to be determined!

Range Search
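A minimal sketch of the k-nearest-neighbour classifier just described (illustrative Python); note how the sort makes the cost per query grow with the size of the training sample, as remarked on the slide:

```python
import math

def knn_classify(train, query, k):
    """train: list of ((f, s), label); majority vote among the k nearest."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)
```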

[Plot: ten labelled training points in the (x, y) plane with two query boxes. Small box: only points 1, 2, 4, 9 checked. Large box: all points checked.]

Regularization: the box size. The tree needs to be traversed only partially if the box size is small enough!

Methods Based on Linear Separation

Divide the input space into regions separated by one or more hyperplanes. Extrapolation is done!
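Back to the range search: a brute-force sketch of the box query (illustrative Python; a k-d tree implementation would prune subtrees whose region misses the box, which is the partial traversal mentioned above):

```python
def box_query(points, center, half_width):
    """Return labelled points inside an axis-aligned square box (brute force)."""
    cx, cy = center
    return [(x, y, lab) for x, y, lab in points
            if abs(x - cx) <= half_width and abs(y - cy) <= half_width]

def rs_output(points, center, half_width, signal="th"):
    """Classifier output = signal fraction among the points in the box."""
    inside = box_query(points, center, half_width)
    if not inside:
        return 0.5  # empty box carries no information (assumed neutral output)
    return sum(1 for *_, lab in inside if lab == signal) / len(inside)
```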

LDA (Fisher discriminant): project onto the direction that best separates the two class means relative to the within-class spread.

[Plots: # formulas vs. # slides with the separating hyperplane of the LDA.]
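A minimal 2-D sketch of the Fisher direction w ∝ S_W⁻¹(m_a − m_b), in plain Python (function names are illustrative):

```python
def mean(pts):
    n = len(pts)
    return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

def scatter(pts, m):
    """Within-class scatter matrix sum_i (x_i - m)(x_i - m)^T, 2x2."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in pts:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx; s[0][1] += dx * dy
        s[1][0] += dy * dx; s[1][1] += dy * dy
    return s

def fisher_direction(a, b):
    """w = S_W^{-1} (m_a - m_b) for two 2-D point clouds."""
    ma, mb = mean(a), mean(b)
    sa, sb = scatter(a, ma), scatter(b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    d = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]
```

Cutting on the projection w·x then separates the classes; on well-separated clouds every point of one class projects below every point of the other.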

Neural Networks

Each neuron computes y = σ(Σ_i w_i x_i - s) with the sigmoid σ(a) = 1/(1 + e^(-a)); arbitrary numbers of inputs and hidden neurons are possible. Training minimizes

    E = (1/N) Σ_{i=1}^{N} (t_i - out(x_i))²

[Plot: # formulas vs. # slides with the decision boundary of a network with two hidden neurons, trained by gradient descent; the trained weights are shown in the figure.]

Regularization: the number of hidden neurons, weight decay.

Support Vector Machines

Find the separating hyperplane with maximum distance to each data point: the maximum-margin classifier. It is found by setting up the condition for correct classification,

    y_i (w·x_i + b) ≥ 1,

and minimizing ½|w|², which leads to the Lagrangian

    L = ½|w|² - Σ_i α_i [ y_i (w·x_i + b) - 1 ].

A necessary condition for a minimum is w = Σ_i α_i y_i x_i, so the output becomes

    out = sgn( Σ_i α_i y_i (x_i·x) + b ).

Only linear separation? No! Replace the dot products: x·y → Φ(x)·Φ(y). The mapping to feature space Φ: R^d → F
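Returning to the neural network: a minimal forward-pass sketch with one hidden layer (plain Python; the weights used in the test are made up, not the trained values shown on the slide):

```python
import math

def sigmoid(a):
    """The activation sigma(a) = 1 / (1 + e^(-a)) from the slide."""
    return 1.0 / (1.0 + math.exp(-a))

def nn_output(x, hidden_weights, out_weights):
    """One hidden layer: h_j = sigmoid(w_j . x - s_j), out = sigmoid(v . h - s).

    hidden_weights: list of [w_1, ..., w_d, s_j] per hidden neuron;
    out_weights: [v_1, ..., v_H, s].
    """
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) - thr)
         for *ws, thr in hidden_weights]
    v, thr = out_weights[:-1], out_weights[-1]
    return sigmoid(sum(vj * hj for vj, hj in zip(v, h)) - thr)
```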

is hidden in a kernel: K(x, y) = Φ(x)·Φ(y).

Non-separable case: introduce slack variables ξ_i and minimize

    ½|w|² + C Σ_i ξ_i

Physics Applications: Neural Network Trigger at HERA (H1)

Keep the physics, reject the background.

Trigger for J/ψ Events (H1)

[email protected]=95%:
    NN    99.6%
    SVM   98.3%
    k-NN  97.7%
    RS    97.5%
    C4.5  97.5%
    ML    91.2%
    LDA   82%

Triggering Charged Current Events
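A minimal sketch of the kernel trick just described (illustrative Python; the support vectors, coefficients α_i and bias b in the test are made up, not from a trained machine):

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * |x - y|^2); the mapping Phi stays implicit."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_output(support, alpha, labels, b, x, kernel=rbf_kernel):
    """out = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    s = sum(a * y * kernel(xi, x)
            for a, y, xi in zip(alpha, labels, support)) + b
    return 1 if s >= 0 else -1
```

Swapping `rbf_kernel` for a plain dot product recovers the linear machine, which is the whole point of hiding Φ inside K.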

[Event displays: charged current signal (e p → ν X via W exchange) vs. background.]

[email protected]=80%:
    NN    74%
    SVM   73%
    C4.5  72%
    RS    72%
    k-NN  71%
    LDA   68%
    ML    65%

Astrophysics: MAGIC - Gamma/Hadron Separation

Photon vs. hadron: training with data and Monte Carlo, evaluation with data. The signal (photon) enhancement factor is 93.3 for the random forest and 96.5 for the neural net.

Future Experiment XEUS: Position of X-ray Photons (Application of Statistical Learning in Regression Problems)

[Sketch: XEUS geometry (~10 m mirror, ~300 m focal length), charge transfer direction, electron potential. The photon position is estimated from the measured charges c_i, e.g. the centre of mass x_COM = Σ_i c_i x_i / Σ_i c_i, or a corrected centre of mass x_CCOM obtained by applying a position-dependent correction to x_COM.]

σ of reconstruction in μm:
    NN    3.6
    SVM   3.6
    k-NN  3.7
    RS    3.7
    ETA   3.9
    CCOM  4.0

Conclusion

Statistical learning theory is full of subtle details (models and statistics). Widely used statistical learning methods were studied: decision trees; local density estimators (ML, k-NN, RS); linear separation (LDA, neural nets, SVMs). Neural networks were found superior in the HEP and astrophysics applications (classification and regression) studied so far. Further applications (triggers, offline analyses) are under study.

From Classification to Regression

k-NN and range search carry over directly: the output becomes the average of the target values of the selected training events instead of a majority vote.

The neural network is trained on the regression error

    E = (1/N) Σ_{i=1}^{N} (y_i - out(x_i))²

Example: fit of a Gaussian with two hidden neurons,

    a = σ(-2.1x - 1),  b = σ(+2.1x - 1),  out = -12.7a - 12.7b + 9.4
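As a minimal sketch of the regression variant (illustrative Python, 1-D inputs): the k-NN output is simply the mean target of the k nearest training events.

```python
def knn_regress(train, x, k):
    """train: list of (x_i, y_i); output = mean target of the k nearest."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k
```

The same averaging applies to range search, with the box contents replacing the k nearest neighbours.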