Feature Engineering

Geoff Hulten

Overview
- Feature engineering overview
- Common approaches to featurizing with text
- Feature selection
- Iterating and improving (and dealing with mistakes)

Goals of Feature Engineering
- Convert context -> input to the learning algorithm.
- Expose the structure of the concept to the learning algorithm.
- Work well with the structure of the model the algorithm will create.
- Balance the number of features, complexity of the concept, complexity of the model, and amount of data.

Sample from SMS Spam
SMS Message (arbitrary text) -> 5 dimensional array of binary features:
- Long?               1 if the message is longer than 40 chars, 0 otherwise
- HasDigit?           1 if the message contains a digit, 0 otherwise
- ContainsWord(call)  1 if the message contains the word "call", 0 otherwise
- ContainsWord(to)    1 if the message contains the word "to", 0 otherwise
- ContainsWord(your)  1 if the message contains the word "your", 0 otherwise

Example message: SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info

Basic Feature Types
- Binary features: ContainsWord(call)?  IsLongSMSMessage?  Contains(*#)?  ContainsPunctuation?
- Categorical features: FirstWordPOS -> { Verb, Noun, Other }   MessageLength -> { Short, Medium, Long, VeryLong }   TokenType -> { Number, URL, Word, Phone#, Unknown }   GrammarAnalysis -> { Fragment, SimpleSentence, ComplexSentence }
- Numeric features: CountOfWord(call)   MessageLength   FirstNumberInMessage   WritingGradeLevel

Converting Between Feature Types
- Numeric feature => binary feature: single threshold.  Length of text + [ 40 ] => { 0, 1 }
- Numeric feature => categorical feature: set of thresholds.  Length of text + [ 20, 40 ] => { short or medium or long }
- Categorical feature => binary features: one-hot encoding.  { short or medium or long } => [ 1, 0, 0 ] or [ 0, 1, 0 ] or [ 0, 0, 1 ]
- Binary feature => numeric feature: { 0, 1 } => { 0, 1 }
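To make these conversions concrete, here is a minimal Python sketch (the function names are made up for illustration; the 20/40-character thresholds follow the example above):

def numeric_to_binary(value, threshold=40):
    """Numeric -> binary: a single threshold (e.g., IsLongSMSMessage)."""
    return 1 if value > threshold else 0

def numeric_to_categorical(value, thresholds=(20, 40)):
    """Numeric -> categorical: a set of thresholds (short / medium / long)."""
    categories = ["short", "medium", "long"]
    for i, t in enumerate(thresholds):
        if value <= t:
            return categories[i]
    return categories[len(thresholds)]

def categorical_to_binary(category, categories=("short", "medium", "long")):
    """Categorical -> binary features: one-hot encoding."""
    return [1 if category == c else 0 for c in categories]

length = 55                               # e.g., characters in an SMS message
print(numeric_to_binary(length))          # 1 (longer than 40 chars)
print(numeric_to_categorical(length))     # 'long'
print(categorical_to_binary("long"))      # [0, 0, 1]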

Sources of Data for Features
- System state: app in foreground? roaming? sensor readings
- Content analysis: the things we've been talking about, and the things we're going to talk about next
- User information: industry, demographics
- Interaction history: user's report-as-junk rate, # previous interactions with the sender, # messages sent/received
- Metadata: properties of phone #s referenced, properties of the sender
- Run other models on the content: grammar, language

Feature Engineering for Text
- Tokenizing
- TF-IDF
- Bag of Words
- Embeddings
- N-grams
- NLP

Tokenizing
Breaking text into words:
  Nah, I don't think he goes to usf -> [ Nah, I, don't, think, he, goes, to, usf ]
Some tips for deciding:
- If you have lots of data / optimization power:
  - Dealing with punctuation: Nah, -> [ Nah, ] or [ Nah, , ] or [ Nah ]    don't -> [ don't ] or [ don, ', t ] or [ don, t ] or [ do, n't ]
  - Normalizing: Nah, -> [ Nah, ] or [ nah, ]    1452 -> [ 1452 ] or [ ]
  - Keep as much information as possible; let the learning algorithm figure out what is important and what isn't.
- If you don't have much data / optimization power:
  - Reduce the number of features you maintain.
  - Normalize away irrelevant things.
  - Focus on things relevant to the concept.
  - Explore the data / use your intuition.
- Overfitting / underfitting: much more later.
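A minimal tokenizer sketch in Python showing a few of these choices (the option names and defaults are illustrative, not a standard API):

import re

def tokenize(text, lowercase=True, strip_punctuation=True, drop_numbers=False):
    tokens = text.split()                                        # break on whitespace
    if strip_punctuation:
        tokens = [re.sub(r"[^\w']", "", t) for t in tokens]      # "Nah," -> "Nah"
    if lowercase:
        tokens = [t.lower() for t in tokens]                     # "Nah" -> "nah"
    if drop_numbers:
        tokens = [t for t in tokens if not t.isdigit()]          # "1452" -> dropped
    return [t for t in tokens if t]

print(tokenize("Nah, I don't think he goes to usf"))
# ['nah', 'i', "don't", 'think', 'he', 'goes', 'to', 'usf']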

Bag of Words
One feature per unique token.

Training data:
  m1: A word of text.
  m2: A word is a token.
  m3: Tokens and features.
  m4: Few features of text.
Tokens: a, word, of, text, is, token, tokens, and, features, few -> one feature per token.

Bag of Words: Example
Test data:
  test1: Some features for a text example.

  Feature     m1   m2   m3   m4   test1
  a            1    1    0    0     1
  word         1    1    0    0     0
  of           1    0    0    1     0
  text         1    0    0    1     1
  is           0    1    0    0     0
  token        0    1    0    0     0
  tokens       0    0    1    0     0
  and          0    0    1    0     0
  features     0    0    1    1     1
  few          0    0    0    1     0

(The selected features come from the training data; m1-m4 are the training X, test1 is the test X. Test tokens that are out of vocabulary get no feature.)

Use bag of words when you have a lot of data and can use many features.
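A minimal Python sketch of bag-of-words featurization (the helper names are made up; the point is that the vocabulary is built from training data only, and out-of-vocabulary tokens are dropped at test time):

def build_vocabulary(tokenized_messages):
    vocab = {}
    for tokens in tokenized_messages:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)      # assign the next feature index
    return vocab

def bag_of_words(tokens, vocab):
    x = [0] * len(vocab)
    for token in tokens:
        if token in vocab:                     # out-of-vocabulary tokens are dropped
            x[vocab[token]] = 1
    return x

train = [["a", "word", "of", "text"], ["a", "word", "is", "a", "token"]]
vocab = build_vocabulary(train)                # built from training data only
print(bag_of_words(["some", "features", "for", "a", "text", "example"], vocab))
# [1, 0, 0, 1, 0, 0]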

N-Grams: Tokens
Instead of using single tokens as features, use series of N tokens ("down the bank" vs "from the bank").
  Message 1: Nah I don't think he goes to usf
  Message 2: Text FA to 87121 to receive entry

Bigram features: Nah I, I dont, dont think, think he, he goes, goes to, to usf, Text FA, FA to, 87121 to, to receive, receive entry
Message 2 featurized: 0 0 0 0 0 0 0 1 1 1 1 1

Use when you have a LOT of data and can use MANY features.

N-Grams: Characters
Instead of using series of tokens, use series of characters.
  Message 1: Nah I don't think he goes to usf
  Message 2: Text FA to 87121 to receive entry

Character bigram features (a sample): Na, ah, ..., do, ..., en, nt, tr, ry
Message 2 featurized: 0 0 0 0 0 0 0 1 1 1 1 1

Helps with out-of-dictionary words & spelling errors.
Fixed number of features for a given N (but it can be very large).
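A minimal Python sketch of extracting token and character n-grams (illustrative only); the resulting n-gram strings can then be treated exactly like single tokens in the bag-of-words code above:

def token_ngrams(tokens, n=2):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n=2):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(token_ngrams(["Text", "FA", "to", "87121", "to", "receive", "entry"]))
# ['Text FA', 'FA to', 'to 87121', '87121 to', 'to receive', 'receive entry']
print(char_ngrams("Nah I", n=2))
# ['Na', 'ah', 'h ', ' I']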

TF-IDF: Term Frequency - Inverse Document Frequency
Instead of using a binary ContainsWord(w) feature, use a numeric importance score.

Importance to the document:
  TermFrequency(w, d) = % of the words in document d that are w
Novelty across the corpus:
  InverseDocumentFrequency(w, corpus) = log( # documents / # documents that contain w )
TF-IDF score = TermFrequency * InverseDocumentFrequency

[Chart: term IDF score vs. % of documents containing the term. Words that occur in many documents have low scores (e.g., "to").]

Example:
  Message 1: Nah I don't think he goes to usf
  Message 2: Text FA to 87121 to receive entry

Message 2 featurized:

  Feature   Nah   I   don't  think  he  goes   to   usf  Text   FA   87121  receive  entry
  BOW        0    0     0      0     0    0     1     0    1     1     1       1       1
  TF-IDF     0    0     0      0     0    0     0     0  .099  .099  .099    .099    .099

("to" appears in both documents, so its IDF is log(2/2) = 0 and its TF-IDF score is 0.)
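A minimal Python sketch of the TF-IDF computation above (illustrative; it uses the natural log, which reproduces the .099 scores in the example):

import math

def tf_idf(document_tokens, corpus):
    scores = {}
    for word in set(document_tokens):
        tf = document_tokens.count(word) / len(document_tokens)   # % of words in the doc that are word
        docs_with_word = sum(1 for doc in corpus if word in doc)
        idf = math.log(len(corpus) / docs_with_word)               # novelty across the corpus
        scores[word] = tf * idf
    return scores

m1 = "Nah I don't think he goes to usf".split()
m2 = "Text FA to 87121 to receive entry".split()
print(tf_idf(m2, [m1, m2]))
# 'to' scores 0.0; the tokens unique to Message 2 each score ~0.099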

Embeddings -- Word2Vec and FastText
- Word -> coordinate in N dimensions; regions of the space contain similar concepts.
- Creating features, options: average the vector across the words, or count words falling in specific regions.
- Commonly used with neural networks.
- Replaces words with their meanings: sparse -> dense representation.
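A minimal Python sketch of the "average vector across words" option (the tiny 3-dimensional embedding table here is invented for illustration; in practice you would load pre-trained Word2Vec or FastText vectors with many more dimensions):

# Made-up example embedding table: word -> coordinate in 3 dimensions.
embedding = {
    "win":  [0.9, 0.1, 0.0],
    "cash": [0.8, 0.2, 0.1],
    "meet": [0.1, 0.9, 0.3],
}

def average_embedding(tokens, embedding, dims=3):
    vectors = [embedding[t] for t in tokens if t in embedding]
    if not vectors:
        return [0.0] * dims
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

print(average_embedding(["win", "cash", "today"], embedding))   # ~[0.85, 0.15, 0.05], a dense feature vector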

Normalization (Numeric => Better Numeric)

  Raw X:              36        74        22        81       105       113        77        91       (Mean: 74.875)
  Subtract mean:    -38.875    -0.875   -52.875     6.125    30.125    38.125     2.125    16.125    (Mean: 0, Std: 29.5188)
  Divide by stdev:   -1.31696  -0.02964  -1.79123   0.207495  1.020536  1.29155   0.071988  0.546262 (Mean: 0, Std: 1)

- Helps make the model's job easier: no need to learn what is "big" or "small" for the feature.
- Some model types benefit more than others.
- To use in practice: estimate the mean/stdev on the training data, then apply normalization with those parameters to the validation and test data.
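A minimal Python sketch of that practice: estimate mean/stdev on the training data, then apply the same parameters everywhere else (helper names are illustrative):

def fit_normalizer(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, std

def normalize(values, mean, std):
    return [(v - mean) / std for v in values]

train_x = [36, 74, 22, 81, 105, 113, 77, 91]
mean, std = fit_normalizer(train_x)            # mean 74.875, std ~29.5188, from training data only
print(normalize(train_x, mean, std))           # ~[-1.317, -0.030, -1.791, 0.207, ...]
print(normalize([60, 120], mean, std))         # validation/test data reuses the training parameters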

Feature Selection
- Which features to use? How many features to use?
- Approaches: Frequency, Mutual Information, Accuracy

Feature Selection: Frequency
Take the top N most common features in the training set.

  Feature   Count
  to         1745
  you        1526
  I          1369
  a          1337
  the        1007
  and         758
  in          400
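A minimal Python sketch of frequency-based selection (illustrative):

from collections import Counter

def select_by_frequency(tokenized_training_messages, n):
    counts = Counter(token for tokens in tokenized_training_messages for token in tokens)
    return [token for token, _ in counts.most_common(n)]   # keep the N most common tokens

train = [["to", "you", "to"], ["I", "a", "to"], ["you", "the"]]
print(select_by_frequency(train, 3))   # ['to', 'you', ...]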

Feature Selection: Mutual Information
Take the N features that contain the most information about the target on the training set:

  MI(x, y) = sum over x, y of  P(x, y) * log( P(x, y) / ( P(x) * P(y) ) )

Example: 10 training samples, contingency table of feature x against label y:

           x=0   x=1
  y=0       3     1
  y=1       2     4
  total     5     5    (10 training samples)

One term of the sum, for example:
  P(x=0, y=0) * log( P(x=0, y=0) / ( P(x=0) * P(y=0) ) ) = 0.3 * log( 0.3 / (0.5 * 0.4) )

Sum over all combinations (natural log): MI = 0.086
Use additive smoothing to avoid 0s in the table.
A perfect predictor has high MI; a feature with no information about the target has MI = 0.
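A minimal Python sketch of mutual information for a binary feature and label (illustrative; with the contingency table above and the natural log it reproduces MI = 0.086):

import math

def mutual_information(xs, ys, smoothing=0.0):
    n = len(xs) + 4 * smoothing                       # additive smoothing over the 4 joint cells
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            p_xy = (sum(1 for x, y in zip(xs, ys) if x == xv and y == yv) + smoothing) / n
            p_x = (sum(1 for x in xs if x == xv) + 2 * smoothing) / n
            p_y = (sum(1 for y in ys if y == yv) + 2 * smoothing) / n
            if p_xy > 0:
                mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# The contingency table from the slide: x=0/y=0: 3, x=1/y=0: 1, x=0/y=1: 2, x=1/y=1: 4.
xs = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(round(mutual_information(xs, ys), 3))   # 0.086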

Feature Selection: Accuracy (wrapper)
Take the N features that improve accuracy most on hold-out data.
Greedy search, adding or removing features:
- From the baseline, try adding (removing) each candidate feature.
- Build a model and evaluate it on hold-out data.
- Add (remove) the best candidate.
- Repeat until you get to N (a sketch follows below).

Example, baseline accuracy 88.2%:

  Remove   Accuracy
  claim     82.1%
  FREE      86.5%
  or        87.8%
  to        89.8%

Important note about feature selection
- Do not use validation (or test) data when doing feature selection.
- Use training data only to select features.
- Then apply the selected features to the validation (or test) data.
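A minimal Python sketch of one step of the greedy wrapper search above (illustrative; build_and_evaluate is a hypothetical helper that trains a model on the given feature subset and returns hold-out accuracy):

def remove_worst_feature(features, build_and_evaluate):
    best_accuracy, best_to_remove = build_and_evaluate(features), None
    for candidate in features:
        remaining = [f for f in features if f != candidate]
        accuracy = build_and_evaluate(remaining)
        if accuracy > best_accuracy:                 # removing `candidate` helped
            best_accuracy, best_to_remove = accuracy, candidate
    if best_to_remove is not None:
        features = [f for f in features if f != best_to_remove]
    return features, best_accuracy

# Repeated calls (e.g., until len(features) == N) implement the greedy search.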

Simple Feature Engineering Pattern
- FeaturizeTraining: takes the training context (raw data to featurize and do feature selection with) and produces TrainingX, TrainingY, and the featurize data.
- FeaturizeRuntime: takes the runtime context plus the featurize data and produces runtimeX, the input for the machine learning model at runtime.
- Featurize data: the info needed to turn raw context into features, e.g. the selected words / n-grams and their feature indexes, the TF-IDF weights to use for each word, and the normalization parameters for numeric features (means and stdevs).

Simple Feature Engineering Pattern: Pseudocode

for f in featureSelectionMethodsToTry:
    (trainX, trainY, featureData) = FeaturizeTraining(rawTrainX, rawTrainY, f)
    (validationX, validationY) = FeaturizeRuntime(rawValidationX, rawValidationY, f, featureData)

    for hp in hyperParametersToTry:
        model.fit(trainX, trainY, hp)
        accuracies[hp, f] = evaluate(validationY, model.predict(validationX))

(bestHyperParametersFound, bestFeaturizerFound) = bestSettingFound(accuracies)

(finalTrainX, finalTrainY, featureData) = FeaturizeTraining(rawTrainX + rawValidationX, rawTrainY + rawValidationY, bestFeaturizerFound)
(testX, testY) = FeaturizeRuntime(rawTestX, rawTestY, bestFeaturizerFound, featureData)

finalModel.fit(finalTrainX, finalTrainY, bestHyperParametersFound)
estimateOfGeneralizationPerformance = evaluate(testY, finalModel.predict(testX))

Understanding Mistakes
- Noise in the data: encodings, bugs, missing values, corruption.
- Noise in the labels:
    Ham: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
    Spam: Ill meet you at the resturant between 10 & 10:30 cant wait!
- Model being wrong: what is the reason?

Exploring Mistakes
- Examine N random false positives and N random false negatives.

    Reason        Count
    Label noise     2
    Slang           5
    Non-English     5

- Examine the N worst false positives and N worst false negatives:
  - Model predicts very near 1, but the true answer is 0.
  - Model predicts very near 0, but the true answer is 1.

Approach to Feature Engineering
- Start with the standard for your domain; roughly 1 parameter per ~10 samples.
- Try all the important variations on hold-out data: tokenizing, bag of words, n-grams.
- Use some form of feature selection to find the best, and evaluate.
- Look at your mistakes.
- Use your intuition about your domain and adapt standard approaches or invent new features.
- Iterate.
- When you want to know how well you did, evaluate on test data.

Feature Engineering in Other Domains
- Computer vision: gradients, histograms, convolutions
- Internet: IP parts, domains, relationships, reputation
- Time series: window-aggregated statistics, frequency domain transformations
- Neural networks: a whole bunch of other things we'll talk about later

Summary of Feature Engineering
- Feature engineering converts raw context into inputs for machine learning.
- Goals are: match the structure of the concept to the structure of the model representation; balance the number of features, amount of data, complexity of the concept, and power of the model.
- Every domain has a library of proven feature engineering approaches. For text these include: normalization, tokenizing, n-grams, TF-IDF, embeddings, & NLP.
- Feature selection removes less useful features and can greatly increase accuracy.
