(Network Security) / E-mail: [email protected] 1 Agenda Introduction

(Network Security)  / E-mail: nfhuang@cs.nthu.edu.tw 1 Agenda  Introduction

(Network Security) / E-mail: [email protected] 1 Agenda Introduction of Network Security Content Inspection Technologies Pattern Matching Algorithms Flow Classification by Stateful Mechanis m Machine Learning Based Application Iden tification Technologies Network Security Research Topics

Conclusions 2 -- - 2000/3 DDos Yahoo Amazon CNN eBay 2001/7 Amazon.com Bibliofind

2002 2003/1 SQL Slammer 2003/4 2003/8 Blaster 2003/9 SoBig 2003/9 2004/3 Netsky 2004/4 Sasser 2005/5 2005/6 3 ,

4 5 6 7


7x24 8 Denial of Service (DoS), Distributed Denia

l of Service (DDoS) Network Invasion Network Scanning Network Sniffing Torjan Horse and Backdoors Worm 9 (1) DoS/DDoS Prevent another user from using net work connection, or disable server or services: e.g. Smurf and Fraggle

attacks, Land, Teardrop, NewTe ar, Bonk, Boink, SYN flooding, Ping of death, IGMP Nuke, buffer ov erflow. Caused by protocol fault or program f ault. It damages the Availability. 10 DoS Ping Flooding ICMP echo Ping of Death 65,536 ICMP ech o (TCP/IP )

UDP flooding (Chargen) UDP Port 19, Character Generato r UDP 11 DoS Smurf Attack ICMP e cho ICMP reply SYN flooding SYN TCP SYN-ACK

TCP 12 Smurf attack (DoS) Dangerous attacks Network-based, fills access pipes Uses ICMP echo/reply (smurf) or UDP echo (frag gle) packets with broadcast networks to multipl y traffic Requires the ability to send spoofed packets Abuses bounce-sites to attack victims Traffic multiplied by a factor of 50 to 200 Low-bandwidth source can kill high-bandwidth connections

Similar to ping flooding, UDP flooding but more da ngerous due to traffic multiplication 13 Smurf Attack (contd) ICMP echo (spoofed source address of victim) Sent to IP broadcast address ICMP echo reply Internet Perpetrator Victim 14 SYN flooding Attack (DoS) Goal is to deny access to a TCP service

running on a host. Creates a number of half-open TCP connections which fill up a hosts listen queue; host stops accepting connections. Requires the TCP service be open to connections from the victim. 15 SYN flooding (contd) Spoofed SYN ACK to spoofed address : : Attacker

Victim The Innocents 16 DDoS Attack Attacker Handler Agent Agent Handler Agent

Agent Victim Handler Agent Agent Agent Control message Maybe encrypted or hidden in normal packets. Spoofed packets. 17

DDoS Attack Yahoo.com Amazon.com CNN.com buy.com ebay.com DDoS Hacker Host Host Host Host Organization A Internet Communication to Compromised machines Host

Host Host Host DDoS Attack Server Server Server Organization Under Attack Organization B 18

DDoS DDOS Trin00 ( ) Tribe Flood Network TFN ( ) TFN2K Stacheldraht Trin00 Trin00 Tr in00 Daemon UDP

I CMP port unreachable TFN Trin00 , TFN S YN flood UDP flood ICMP flood Smurf TFN 19 (2) Network Invasion Goal is to get into the target system and obtain information Account usernames, passwords Source code, business critical information Usually caused by improper configurations

or privilege setting, or program fault. Network invasion is diverse and various, knowledge about attack pattern may help to detect, but it is quite hard to detect all attacks. 20 Example of network invasion: IIS uni code buffer overflow For IIS 5.0 on windows 2000 without this security patch, a simple URL string: http://address.of.iis5.syst em/scripts/..%c1%1c../win nt/system32/cmd.exe?/c+

dir+c:\ will show the information of root directory. 21 (3) Network Scanning Goal is generally to obtain the chance, the t opology of victims network. The name and the address of hosts and network devices. The opened services. Usually uses technique of ICMP scanning, X mas scan, SYN-FIN scan, SNMP scan. There is an automatic and powerful tool: N

map. 22 (4) Sniffing Goal is generally to obtain the content of co mmunication Account usernames, passwords, mail account Network Topology Usually a program placing an Ethernet adapt er into promiscuous mode and saving inform ation for retrieval later Hosts running the sniffer program (e.g. NetB us) is often compromised using host attack

methods. 23 (5) Backdoor and Torjan horse Usually, the backdoor and torjan horse is the c onsequences of invasion or hostile programs. It may open a private communication channel and wait for remote commands. Available toolkits: Subseven, BirdSpy, Dragger It can be detected by monitoring known contr ol channel activities, but not with 100% precisi on.

24 (6) Worm The chief intention of worm is to propagate and survive. It takes advantages of system vulnerabilities to infect and then tries to infect any possible targets. It may decrease the production of system, leave back doors, steal confidential information and so on. 25 P2P/IM P2P (Peer-to-Peer) IM (Instant Messenger)

Spyware Adware Tunneling 26 P2P: A new paradigm Bottleneck of Server Powerful PC Flexible, efficient information sharing P2P changes the way of Web (Internet) 27 P2P P2P Soft Ether Skype

P2P 28 Famous P2P Examples

BitTorrent eZpeer Kuro eDonkey eMule MLdonkey Gnutella Kazaa/Morpheus Shareaza Direct-connec t

Gnutella Soulseek Opennap Worklink Opennext Jelawat PP

SoftEther iMESH MIB WinMix WinMule Skype 29 Instant Messenger (IM) MSN Yahoo Messenger ICQ YamQQ

AIM (AOL IM) 30 Firewall (Layer-4) VPN SSL VPN PKI IDS/IPS Defense-in-Depth Application Firewall (Layer-7) UTM (Unified Threat Management) NAC (Network Access Control) 31 Intrusion Detection System (IDS)

Intrusion Detection and Prevention System (IPS/IDP) 32 Intrusion Detection System Intrusion Detection System: a computer system that attempts to detect any set of actions that try to compromise the integrity, confidentiality, or availability of a resource. An IDS has much more knowledge and many delicate detection functions than common firewalls. (Remember that, the

main function of a firewall is to do access control). 33 IDS Types Host based vs. Network based. Misused detection vs. Anomaly detection Active vs. Passive Centralized vs. Distributed 34 Host based & Network based IDS Host based IDS: installed on target

host as a monitor service. It checks system activity, user privilege, user behavior. Network based IDS: installed on network node, usually in promiscuous mode to listen all passing traffic. It checks network traffic, nodes interactions. 35 Misused detection & Anomaly detection IDS Misused detection (signature-based): based on the assumption that intrusion attempts can be characterized by the comparison of

user activities against a database of known attacks. Anomaly detection (statistical-based): identify abusive behavior by noting and analyzing audit data that deviates from a predicted norm. 36 Active IDS vs. Passive IDS Active IDS: an participate in the system. Not only observe the events, but also involve in the necessary operation. Also called IPS or IDP (Intrusion Detection and Prevention System) Passive IDS: work on a monitor or bystander basis.

37 Active IDS v.s. Passive IDS ISP ISP Port Mirror

Passive IDS LAN (a) Passive IDS Active IDS LAN (b) Active IDS 38 Centralized IDS v.s. Distributed IDS

Centralized: The sensors are managed by a single analyzer or manager. Distributed: The sensors are managed by multiple automated analyzers or managers. And among analyzers and managers, they can communicate to each other. 39 Comparison between Firewall and Network based active IDS Same : Cant protect insider to insider attack.

Cant protect against connections that dont go through. Can do ACL and filtering. (For Active IDS) Different : IDS has the ability to detect new threats. IDS focuses on intrusion while Firewall focuses on access control and privacy. Firewalls use address as the passport while IDS will do much more checks. 40 The Challenge of IDS Speed limitation: NIDS cannot keep pace with the network speed. (NIDS need to check more fields of a packet than a firewall does.) The inability to see all the traffic: The switched Ethernet is getting largely

deployed. Fail-open/fail-close architecture: when a NIDS fails often without notification of the problem to the central console., leave the network as an open one. A fail-closed methodology means the network is out of service until the NIDS is brought back on-line. 41 IDS False Alarms 42 Content Inspection Technologies 43

A Generic Layer-7 Engine Packet Normalizer Makes sure the integrity of incoming packets Eliminates the ambiguity Decodes URI strings if necessary Pattern-Matching Engine Policy Engine Gather information from pattern-matching engine and issue the verdict to allow/drop the packets Logs, Reports

Policies Policy Engine Matched Signatures Events Verdicts Pattern-Matching Engine Normalized Traffic Filtered Traffic Packet Normalizer

Packet Stream Packet Stream Network 44 Packet Normalizer Integrity Checking IP Fragment Reassemble TCP Segment Reassemble TCP Segments may come out-of-order SEQ out of window size Segment Overlapping

URI Decode URI hex code obfuscation (a = %61) URI unicode/UTF-8 obfuscation self-referential directories obfuscation (/././././ = /) directories obfuscation (/abc/a/../a/../a/ = /abc/ a) 45 Pattern-Matching Engine The most computation-intensive task in packet processing. Normally the PM engine needs to process every single byte in packet payload. In Snort, the PM routine accounts for 31% of the total execution time

46 Pattern Matching is Expensive! ~50 Instructions/ 1500 Byte packet ~30 Instructions/ Byte. 45K Instructions/1500 Byte packet Source: Intel Corp. 47 Content Inspection Technologies Pattern-Matching Algorithms

Software Based Boyer-Moore Aho-Corasick (AC) Wu-Manber Hardware Based Bloom-Filter Reconfigure Hardware (FSM) TCAM-based 48 Pattern Matching Problem Definition Given an input text T = t0, t1, , tn ,and a finite set of strings P = {P1, P2, , Pr}, the string matching p roblem involves locating and identifying the subs tring of T which is identical to Pj = , 1 j r, where ts+i = , 0 i m-1. And this equation can be

j ai as also denoted tsts+m-1 = j j a0 ...am 1 Text G C A T C G C A G A G A G T A T A C A G T A A G G C A G A G A G 49 Aho-Corasick (AC) Algorithm AC is a classic solution to exact set matching.

It works in time O(n + m + z) where z is number of patterns occurrences in T. AC is based on a refinement of a keyword tree. AC is a deterministic algorithm. That is, the performance is independent of the number of patterns. 50 An Example of AC Algorithm Example: P = {ab, ba, babb, bb} 51 An example of AC Algorithm !={h,s}

Patterns: {h} h she e r s {hers} hers his {he}

i {his} s s {he, she} h {s} e {sh} Dashed: fail transitions; those not shown leads to the root 52

An example of AC Algorithm i h e h e r s i s

s s h h i s e Text: h e i s h i s 53 Reconfigure Hardware (FSM) Implement the AC FSM in configurable Logic Elements (LEs) of FPGA. Achieve multiple gigabit performance. (Depe

nds on the FPGA model) A powerful FPGA is necessary to accommodat e thousands of patterns, so that its not prac tical and visible in commercial market. 54 FPGA-based pattern matching FPGA-based 55 Bloom Filter Given a string X, the Bloom filter computes k hash functions on it producing k hash val ues ranging from 1 to m. The same procedu re is repeated for all the members of the pa

ttern set. The input text is verified by generating k h ash values in the same way. If at least one of these k bits is found not set then the stri ng is declared to be impossible to match. Patterns in Length n are grouped into Bn. 56 Bloom Filter (Cont.) 1 2 3 4

5 6 7 A B C D E F G 8

H 9 I J Payload Stream Group signature by length : B2 B3 B4 Bw G4 (X) G3 (X) Bloom Filter (B4)

Bloom Filter (B3) False positive : Bloom Filter (B2) H1 H2 G2 (X) H3 Hk 1 1 1 Mim

0 m 1 1 1 m 0 1 1

1 0 m 1 0 f = (0.5)K, while m = (k x n) / Ln2 1 So, total space, sum(Bi) = m x (w - 1) if k = 1, n = 2048, m = 3072 bits k = 1, n = 3072, m = 4608 bits

1 m K Hash functions H1, H2, , Hk if k = 4, f = 0.0625 k = 5, f = 0.0313 k = 6, f = 0.0156 57 TCAM fundamental TCAM stores data with three logic values: 0, 1, X (dont care) Multiple match modes are needed. 58

Policy Engine Collect the matching events from Pattern-Matching Engine. Clarify the relationship between matched patterns: Ordered: A policy may consists more than one pattern and should be matched in order. Offset, Depth: The matched position should be within a certain range or location. Distance, Within: The distance between two matched patterns should be taken into consideration also. Trace Application States Some applications are difficult to identify by using only one signature (e.g. P2P). Policy Engine needs to track the connection state like the Msg Exc Data

Request following diagram: hange Exchange File S0 S1 S2 S3 59 Fast Pattern Matching Algorithm A Pattern Matching Coprocessor for Deep and Large Signature Set in Network Security Syst

em (IEEE GLOBECOM 2005) Hierarchical Matching Algorithm (HMA) for Intrusion Detection Systems (IEEE GLOBECOM2 005) A Time and Memory Efficient String Matching Algorithm for Intrusion Detection Systems, (I EEE GLOBECOM 2006) A non-Computation Intensive Pre-filter for String Pattern Matching in Network Intrusion Det ection Systems, (IEEE GLOBECOM 2006) Smart Architecture for High-speed Intrusion Detection and Prevention Systems, Internation al Conference on Cryptology and Network Security (CANS 2006, Acceptance rate < 18%). A Deterministic Cost-effective String Matching Algorithm for Network Intrusion Detection S ystems, (IEEE ICC2007). A Novel Algorithm and Architecture for High Speed Pattern Matching in Resource-limited Si licon Solution, (IEEE ICC2007) Flow Digest: A State Synchronization Scheme for Stateful High Availability, (IEEE ICC2007). Performing Packet Content Inspection by Longest Prefix Matching Technology, (IEEE GLOB ECOM2007). 60

Security SoC BroadWeb Security SoC ARM922 RISC CPU (250Mhz) Hardware NAT (400Mbps) Hardware Content Inspection Engine (40Mbps) Two 10/100/1000 RJ-45 Ports

Embedded-Linux NSS and ICSA approved IPS signature database IPS/Anti-virus functions IM/P2P Management Turn-key solution (ASIC + Software module) 1-tier Customers 61 Security SoC (Cont.) BroadWeb Security SoC (2nd Generation)

ARM926EJ RISC CPU (300Mhz) Intelligent Hardware NAT (1Gbps) Hardware Content Inspection Engine (100Mbps) Embedded GbE Smart Switch and 4-port GPHY core NSS and ICSA approved IPS Technology IPS/Anti-virus functions IM/P2P Management Turn-key solution (ASIC+Software module) 1-tier Customers 62 Cisco/Linksys Wireless Security Router

IEEE 802.11n 108 Mbps EWC Wireless LAN IPS protection and IM/P2P management Firewall/VPN/Routing Gigabit Ethernet x 5 63 State Machine Based Technologies 64 State Machine Based Technologies 3 4 ACTIVE_Ok

ACTIVE_Req PASV_Req PASV_Ok 0 Login 1 PASS_Per Flow_Close 2

7 LIST_Req File_Req File_Ok LIST_Ok 5 6 Transitions Trans. ports Patterns

Login TCP, dport:21 PASS PASS_Per TCP, sport:21 230 , User , logged in ACTIVE_Req TCP, dport:21 PORT

ACTIVE_Ok TCP, sport:21 200 PORT command successful PASV_Req TCP, dport:21 PASV PASV_Ok TCP, sport:21 227 Entering Passive Mode

LIST_Req TCP, dport:21 LIST LIST_Ok TCP, sport:21 226 Transfer complete. File_Req TCP, dport:21 RETR

File_Ok TCP, sport:21 226 Transfer complete The FA Example : FTP 65 The FAs of BitTorrent protocols. Annouce 0 Transitions

Annouce Get_Peers Get_Peers 1 Connect_to_Peer 2 Connect Phase Patterns GET /announce HTTP/1.0 200 OK, e5:peers 0 1

Download Phase Transitions Patterns Connect_to_Peer 0x13, BitTorrent protocol 66 The FAs of Yahoo Messenger protocol. Msg_Service Login Phase 0 Auth_Resp

P2P_File_Tx 1 2 File_Tx_Status_ BRB Flow_Close 3 Transitions Auth_Resp Trans. ports Patterns TCP, sport:5050

YMSG , 0x54 Chat Phase Transitions Msg_Service P2P_File_Tx File_Tx_Status_BRB Trans. ports Patterns TCP, sport:5050 YMSG , 0x06 YMSG , 0x4d YMSG , 0x4d, 0x1

67 We can identify and manage Over 60 Applications IM MSN, Yahoo Messanger, AIM, Q Q, Google Talk, TM, ICQ, iChat, MIRC, Odigo, Rediff, Gadu-Gadu Web-IM Meebo.com, eBuddy.com, iLove IM.com, MSN, AIM, Yahoo, ICQ P2P eDonkey, BitTorrent, Gnutella, Foxy, FastTrack, Vagaa, Winny, BitComet, DirectConnect, PiGo, PP365, WInMX, POCO, iMesh, Cl

ubBox Streaming-Media QQLive, Podcast Bar, PPLive, R ealPlayer, Window Media Player , iTunes, WinAMP, Player 365, QuickTime, FlashMedia Video, T VAnts Webmail Yahoo, Hotmail, Gmail

VoIP Skype (3.6) File Transfer FTP, Web File Transfer, Thunde r, GetRight, FlashGet VPN VNN, SpftEther, Hamachi, Tiny VPN, PacketiX, HTTP-Tunnel, T or, Ping-Tunnel Terminal Control VNC, PCAnywhere Online Game QQGame, OurGame, Cga.com. cn, QQFO 68 Machine Learning Based Technologies

69 Application Traffic identification Traffic identification(or traffic classification) issues are focused in recently years since: The introducing of P2P application greatly impacts the network management task. Port number is not the best and efficient discriminator to identify these prevalent traffics. How about string matching method? Accurate! But It cannot identify the encrypted traffic. High cost on manually maintenance work for protocol signatures. High cost to match string in very high speed network.

Privacy issue is under debating. 70 How to resolve the problem? Heuristics methods(2004~2005) Based on some intrinsically different behavior, so me rule can be constructed. E.g. # dest ip == of dest port the host is ru nning P2P. To differentiate P2P or non-P2P traffic. Machine learning based techniques:(2004 ~ ) To construct the statistical signatures for differe nt categories/application protocols. Most machine learning techniques are directly em ployed to construct traffic signature. 71

The Milestone of Researches on Application Traffic Identification Before 2003: String matching and port number. 2003~2005: Heuristics Machine learning method. 2006~ : Machine learning method for real-time based traffic classification. First k data packet sizes and direction of TCP connection. Stage-based classification(Statistical data in each stage) 72 Different Objects of Application Traffic

Identification At different levels Category level or QoS class (Bulk data transfer - FTP&P2P, interactive, mail, web, streaming) Protocol level (Kazza, eMule/eDonkey, Bittorrent, MSN, FTP , POP3, SMTP, HTTP, Skype, Winny, Share,.) Behavior level (FTP control, FTP data, MSN file transfer, MS N message chatting, MSN voip, Skype Chatting, Skype voip , Skype File transfer, Skype Video conference,) All existing researches focus on classification in protocol o r category level. Application field Offline based: traffic trend analysis. Online based: traffic shaping, traffic engineering, security management. 73

The Classes of Applied Machine Learning Algorithms Supervised-Machine learning The model of traffic characteristics is constructed from the training instances with previously defined class label. Unsupervised-Machine learning (Clustering) The model of traffic characteristics is constructed from the training instances without previously defined class label. However, all the existing training set employed by both include pre-classified label. Because each cluster would contain several different classes/protocols. 74 The Discriminators (Attributes)

The key issues for machine-learning based traffic identification are: What are the most distinguishable characteristics (attributes/discriminators)? How to remove the expensive cost on training? Different discriminators: From L3/L4 layerpacket inter-arrival time, total packet size, number of packets,,etc. Combination of L3/L4 attributes with different perspectives. e.g. upload/download size ratio. 75 The Milestone of Researches (Applying Machine Learning techniques) 2003~2004:

[Matthew Roughan, IMC04] Class-of-Service Mapping for Qo S. 2005: [Sebastian Zander] Automated Traffic Classification. [Andrew W. Moore] Using Bayesian Analysis Techniques. 2006: [Sebastian Zander] Internet Archeology: Estimating Individual Application Trends in Incomplete Historic Traffic Traces. [Laurent Bernaille] Traffic classification on the fly. (first 5 pac kets of TCP with k-means clustering). [Jeffrey Erman] Internet Traffic Identification using Machine L earning (k-means, EM clustering). 76 The Milestone of Researches (Applying Machine Learning techniques) 2006 (cont.):

[Laurent Bernaille] Early Application Identification.(first 4 p ackets of TCP with k-means, GMM , and HMM clustering) 2007: Real time based methods [Zhu Li] Accurate Classification of the Internet Traffic Base d on the SVM Method. (TCP and UDP flow classification) [Laurent Bernaille] Early Recognition of Encrypted Applicat ion. (first 3 packets of TCP with GMM clustering) [Jeffrey Erman] Semi-Supervised Network Traffic Classifica tion. (Stage-based classification) 77 Class-of-Service Mapping for QoS: A Statisti cal Signature-based Approach to IP Traffic ACM SIGCOMM Internet Measurement Conference (IMC '04) Matthew Roughan1, Subhabrata Sen2, Oliver Spatscheck2, Nick Duffield2 1

School of Mathematical Sciences, University of Adelaide, Australia 2 AT&T Labs Research, Florham Park, NJ, USA 78 Introduction Before this paper: Traditional researches tried to find the model for trad itional protocol (FTP, web, mail). Most researches of traffic characteristics modeling which focus on P2P and IM are case studies. Features: This paper studied the requirements and proposed a framework of QoS for traffic which consists of traditi onal and novel P2P/IM application in QoS class level. Classification is based on utilizing the statistics of p articular applications in order to form signatures.

79 Ideas The statistical attributes are aggregated with respect to Server ports and Server IP addresses, separately. Employing machine learning techniques to construct the mappi ng from Server port aggregation/Server IP aggregation to differ ent QoS classes. Nearest Neighbor(NN) Linear Discriminant Analysis(LDA) Then, the port number of aggregation that belongs to particular QoS class can form one rule. Disadvantage: Applications that require different QoS might us e the same server port number.(e.g. P2P)

80 Nearest Neighbor To classify a data point x, lets find the nearest neighbor! The points with same property should be closely. The class of the nearest neighbor will be assigned to the data point x. K- Nearest Neighbor: To find the k nearest neighbors and let them vote. More information: http://neural.cs.nthu.edu.tw/jang/books/dcpr/4.2-knnr.asp?title=4-2 K-nearest-neighbor Rule 81 Linear Discriminant Analysis

To find the good projection for original points. Linear discriminant analysis finds a linear transformation ("discrimina nt function") of the two predictors, X and Y, that yields a new set of tra nsformed values that provides a more accurate discrimination than eit her predictor alone: Transformed Target = C1*X + C2*Y 2 features 3 features More information: http://www.dtreg.com/lda.htm http://neural.cs.nthu.edu.tw/jang/books/dcpr/index.asp 82 Evaluation Example Attributes for this evaluation: the average packet size, flow duration, bytes per flow, packets per flow, and Root Mean

Square (RMS) packet size. 83 Internet Traffic Classification Using Bayesian Analysis Techniques ACM SIGMETRICS'05 Andrew W. Moore1, Denis Zuev2 1 University of Cambridge 2 University of Oxford 84 Introduction Features:

Only TCP flows are considered. Category-level classification. Supervised-machine-learning Nave Bayesian algorithm ( ). Uniquely use data that has been hand-classified (based upon flow content) to one of a number of categories. Feature selection was applied to improved the ac curacy. 85 Ideas Discriminators: About 248 discriminators of each flow. E.g. Packet inter-arrival time (mean, variance, . . . ), Pa yload size (mean, variance, . . . ), Fourier Transform of the packet inter-arrival time, TTL value, Flow duration,

TCP Portetc. Nave Bayesian classifier For a flow with known statistical attributes, which class is most likely happened? To find the maximum probability Pr(Ci | X): Ci is i-th class X is the attributes of flow which will be classified. Only about 65% accuracy on flow level was achieved. 86 Ideas(cont.) Improvement: Nave Bayes Kernel estimation method. Kernel estimation was used instead of Gaussian di stribution model assumed by Nave Bayesian. Discriminator selection and dimension reduction.

The accuracy was improved upto 95% Disadvantages: All the discriminators are available after the flow is closed. Only TCP flows are considered for classification. Network management might need more finer classes (proto col level or behavior level). 87 Evaluation for Train and Test sets from traffic of different time 88 FCBF: Fast Correlation-Based Filter Traffic Classification on the Fly

ACM SIGCOMM Computer Communication Review Journal, Volume 36 , Issue 2, 200604 Laurent Bernaille, Renata Teixeira, Ismael Akodkenou, Augusti n Soule, Kave Salamatian LIP6, Universit e Pierre et Marie Curie, Thomson Paris Lab Paris, FRANCE 89 Introduction Features: The first paper focused on real-time flow-level application classification. To approximately model the L7 protocol handshaking. Protocol level classification. Unsupervised machine learning.

K-means clustering. (50 clusters are the best) Protocol assignment: for each cluster, the protocol of the largest proportion dominates the cluster. Discriminators: the first q data packet sizes (payload) and direction of each TCP connection. q = 5 is the best. (+300, -200, +100, +200, -400) 90 K-means Clustering For given number of clusters k, to iteratively find k centers of these k clusters and partition all the points into these k clusters until the nearest center does not change. Each data point is expressed as a vector, and Euclidean distance is the most common distance

computation function. 91 Evaluation Result Above 80% average accuracy can be achieved. Disadvantages: Only TCP connections are considered. Protocol assignment will result in classification starvation. The protocols which dont dominate any

cluster will be always classified as other protocol. 92 Early Application Identification 200612-ACM Conf-CONEXT06 (International Conference On Emerging Networking Experiments And Technologies) Laurent Bernaille, R. Teixeira and K. Salamatian, Universit e Pierre et Marie Curie LIP6, CNRS Paris, France 93 Introduction Features: Three unsupervised machine learning (clustering) algorithms

were used to evaluate cluster assignment accuracy and protocol labeling accuracy. K-means Gaussian Mixture Models (GMM) on an Euclidean space Spectral clustering on Hidden Markov Models (HMM, in order to consider order of packets) Discriminators: size and direction of first P data packets. To deal with the starvation problem in each group, a labeling heuristic method based on standard server port number (e.g. 25 for SMTP, 110 for POP3) is used to classify protocols in each cluster group. Only focus on TCP flows. Wireless traffic trace has been included for evaluation. 94 Discriminators Discussion about the discriminators:

The size and direction of each packet adds more information to di stinguish applications than arrival time related metrics. The range of packet sizes for each application is similar across tr aces. These models can be used to classify the same set of application s at another network. P = 4 packets for the three clustering methods. Clustering number: Kh = 30 for HMM, Kk = 40 for K-Means and Kg = 45 for GMM. 95 Packet size is a better attribute 96

On-line Classification 97 Labeling set of standard server ports std(S) ={FTP, SSH, SMTP, HTTP, POP3, NNTP, HTTPS, POP3S}. 98 Labeling Accuracy 99 Features

Pros: Easy, fast, and simple! Payload size and packet direction of first P data pack ets. Unsupervised training automatic learning mecha nism. Cons: In [Jeffrey Erman HP TR]: is unsuccessful classi fying application types with variable-length packets in their protocol handshakes such as Gnutella. Neit her of these studies access the byte accuracy of the ir approaches which makes direct comparisons to o ur work difficult. 100 Features

Cons: Only TCP are included for classification. According to the description of traces, there are un-ignorable fraction of flows which contain less than 4 data packets! And, the control flow might prevent the identification system from classifying detailed protocol behavior. Classification starvation is still exist for protocols which dont use standard port. 101 Early Recognition of Encrypted Applications 20070405-0406 Passive and Active Measurement Conference (PAM 2007)

Laurent Bernaille, Renata Teixeira Universite Pierre et Marie Curie - LIP6-CNRS Paris, France 102 Introduction Features: The classification of SSL-encrypted protocols. Two stages:SSL detection & Protocol identification. First 3 packets and 35 clusters for Gaussian Mixture Model. Size of original packet: Most accurate method is to look up the encryption method in the handshake packets and transform the size of application packets accordingly. For the five most common ciphers this method is overkill

because the increase varies from 21 to 33 bytes. Simple heuristic: subtract 21 from the size of the encrypted packet regardless of the cipher. Extending the Cluster+Port labeling heuristic SSL-specific ports: 443 for HTTPS, 993 for IMAPS and 995 for POP3S. 103 104 Accurate Classification of the Internet Traffic Based on the SVM Method IEEE ICC 2007 Zhu Li1, Ruixi Yuan1, and Xiaohong Guan1, 2 1 Center for Intelligent and Networked Systems (CFINS) Tsinghua University, Beijing 100084 , China 2

SKLMS Lab and MOE Key Lab for Intelligent Networks and Network Securit y Xian Jiatong University, Xian 710049, China 105 Introduction Features: Category level classification. Supervised-machine learning. Support Vector Machine. Feature selection (Discriminator selection) is employed to select the best set of attributes. Both TCP and UDP are considered. Discriminators: Statistical data of flows.

Disadvantages: the discriminators are available after the flow has finished the communication. 106 Feature Selection Sequential forward selection Begin with 0 feature chosen; sequentially append 1 feature which can arrive at the best classification result. Plus-m-minus-r algorithm Begin with 0 feature chosen; sequentially append m features into chosen ones and pop r features from them (m>r) each time. Plus-2-minus-1 was used in this paper.

107 Feature Selection (Cont.) 108 Accuracy After Feature selection For the data sample set with respect to original proportion in the traffic 109 Offline/Realtime Traffic Classification Using Semi-Supervised Learning 20070713-Technique Report-HP Presented at Performance 2007, 2-5 October 2007, Cologne, Germany, and published in Perfor mance Evaluation journal

(special issue on Performance 2007 for the Proceedings of IFIP Performance 2007) Jeffrey Erman, Anirban Mahanti, Martin Arlitt, Ira Cohen, Car ey Williamson Enterprise Systems and Software Laboratory HP Laboratories Palo Alto 110 Introduction Features: Semi-supervised learning techniques Allows classifiers to be designed from training data that consists of only a few labeled and many unlabeled flows. Both high byte accuracy and flow accuracy (i.e., > 90%). To examine traffic over an extended period of time, to assess the longevity of the classifiers. Focused on TCP only.

It would likely be advantageous to have a separate classier for the non-TCP traffic.(future work). Consideration about the elements in training set. Elephant vs. Mice Flows In order to obtain higher byte accuracy. 111 Introduction Semi-supervised Learning: Hypothesis: few flows are labeled in each cluster, we have a reasonable basis for creating the clusters to application type mapping. Step1: Clustering: K-Means Step 2: Mapping from the clusters to the different known q applications (Y) according to the fraction of labeled

application flows within the cluster. The clusters are unlabeled if they have no labeled flows. Use the unlabeled clusters to represent new or unknown applications. For most experiments, the number of clusters K = 400. 112 Discriminators 11 Discriminators: (After feature selection from 25 discriminators)

Total number of packets. Average packet size. Total bytes. Total header (transport plus network layer) bytes. Number of caller to callee packets. Total caller to callee bytes. Total caller to callee payload bytes. Total caller to callee header bytes. Number of callee to caller Packets. Total callee to caller payload bytes. Total callee to caller header bytes. 113

On-line Classification Online classification Layered classification system. A packet milestone is reached when the count of the total number of packets a flow (SYN/SYNACK packets are included) has sent or received reaches a specific value. Each layer is an independent model that classifies ongoing flows into one of the many class types using the flow statistics available at the chosen milestone. Each milestone's classification model is trained using flows that have reached each specific packet milestone. Reclassifying whenever a upper layer is reached: When a flow is reclassified, any previously assigned labels are disregarded.

114 Byte Accuracy April 13, 9 am trace 78% of the flows had correct labels after classifi 115 Features Pros: Semi-supervised mechanism reduces the cost to prepare large training data set. Considering sampling techniques to form the training set. Cons: Only TCP are included.

Is exponential packet milestone suitable for real-time classification? 116 A High Accurate MachineLearning Algorithm for Identifying Application Traffic in Early Stage Nen-Fu Huang+ , Gin-Yuan Jai+, and Han-Chieh Chao1 +Department of Computer Science, National Tsing Hua University, Taiwan *Department of Electronics, National Ilan University, Taiwan 117 Classification in Early Stage To get characteristics of protocol handshaking for e ach flow in L7 perspective. Flow idtuple (sip, sport, dip, dport, protocol) Statistical information of each flow at first k rounds.

Elapsed time, transmitted size, throughput, respo nse time, inter-arrival time. 118 119 Rule-based Machine Learning Rule-based ML (Supervised machine learning) Rules generated are suitable for intrinsic architect ure of firewall and IDS/IPS. Rules generated by ML algorithm provide informati on to understand potential characteristics of applic ation protocols One Rule, PART, Ripple down, DecisionTable, Conj unctiveRule, Ripper ML Name

Accuracy ML Name Accuracy ML Name Accuracy PART 85.58 % Ripple Down 82.94 %

Ripper 81.8 % One R 69.19% Conjunctive Rule 9.898 % 120 Experiment Architecture Traffic Dump (payload included)

Protocol signature Result Flow Preprocessing Flow Sets Random Split Average

Machine Learning Training Sets 10 Test Sets 1 Training Sets 1 Result 1

Result 10 Sample Set Flow Sampling Test Sets 10 10-fold cross validation 121 Accuracy Comparison with Respective to Sample Set L. Bernaille

2006 122 Accuracy Comparison with Respective to Sample Set(cont.) Zhu Li ICC2007 123 Accuracy After Discriminators Selection 124 Conclusions Machine learning based techniques to identify the N

etwork Applications are more and more important. Focus on real-time based, protocol level requiremen t of application traffic classification. No existing common traffic traces provided for com paring the performance in the same base line. Expensive training is still a problem. Identifying encrypted traffic (e.g. Skype, Winny, Encr ypted BT) is a new challenge. Identifying detailed behaviors of encrypted traffic is even a big challenge. 125

Recently Viewed Presentations

  • Reflexive verbs

    Reflexive verbs

    There are two ways to talk about washing in Spanish First, to wash something else Second, to wash part of one's body Lavarse Here are the forms: (Yo) Me lavo nos lavamos Te lavas os laváis Se lava se lavan...
  • English 10 - 2/13/12

    English 10 - 2/13/12

    Also learn how to use parenthetical notes in the essay. Homework - Complete your Works Cited page and make sure you have a parenthetical note in you paper for each work cited. English 10 - 2/13/15. Write on a subject...
  • Odyssey: Part 2 - Mrs. Szatkowski&#x27;s Awesome-ish Website

    Odyssey: Part 2 - Mrs. Szatkowski's Awesome-ish Website

    Odyssey: Part 2 What does an epic reveal about its culture? ... Epic Simile Simile Metaphor Personification Epithet "Land of the Dead" Read extended version In one paragraph, discuss some of the changes that the movie director made and explain...
  • 9 Nonlinear Functions and Models Copyright  Cengage Learning.

    9 Nonlinear Functions and Models Copyright Cengage Learning.

    Logarithmic Functions and Models. Today, computers and calculators have done away with that use of logarithms, but many other uses remain. In particular, the logarithm is used to model real-world phenomena in numerous fields, including physics, finance, and economics.
  • Do Now - Mrs. Battiste-Joseph&#x27;s World Cultures Class Website

    Do Now - Mrs. Battiste-Joseph's World Cultures Class Website

    Do Now . Number the paragraphs in the Background Essay in the DBQ. In the DBQ handout, go to page 525. Answer questions 1-3. (Understanding the Question) Independent Reading - Dialectical Journal. Choose a book from the class library. After...
  • Topics What will you research to find an

    Topics What will you research to find an

    Brown vs. Board of Education (1954): Case ruled against segregation in schools. Rosenberg Trial (1953): Immigrants Julius and Ethel Rosenberg were accused of passing atomic secrets to Russia during the McCarthy Era. They were found guilty and electrocuted.
  • Languages and Finite Automata - WPI

    Languages and Finite Automata - WPI

    Introduction to Theoretical Computer Science COMP 335 Fall 2004 Slides by Costas Busch, Rensselaer Polytechnic Institute, Modified by N. Shiri & G. Grahne, Concordia University
  • Being an effective consumer of preclinical research Wendy

    Being an effective consumer of preclinical research Wendy

    Consideration of relevant biological variables (e.g., sex) Factor into both design and analysis. Methods to ensure identity and validity of key biological and/or chemical resources used in the proposed studies. Cell lines, specialty chemicals, antibodies