FlowRoute: Inferring Forwarding Table Updates Using Passive Flow-level Measurements
Amogh Dhamdhere (CAIDA/UCSD) [email protected]
with Lee Breslau, Nick Duffield, Cheng Ee, Alexandre Gerber, Carsten Lund and Shubho Sen (AT&T Labs-Research)
01/25/20 IMC 2010, Melbourne Australia

Motivation
- Routing protocol performance during routing events can affect end-to-end performance
  - Transient loops and packet losses may occur during routing reconvergence
- Network operators need to monitor routing protocol performance
  - Do routers respond as expected?
  - Do they update their forwarding tables in a timely manner?
  - Do they update their forwarding tables to the expected state?

Monitoring Routing Events
- Control plane monitors (e.g., OSPFmon, BGPmon)
  - Monitor the control plane, but cannot measure when a router implemented a change in its forwarding table
- Active probing
  - Can only monitor paths that are probed
  - Spatial and temporal resolution limited by placement of probes and probing frequency
FlowRoute
- A data-plane monitoring tool to work in conjunction with control plane monitors
- Infers forwarding table updates using flow-level measurements
- Works offline, for after-the-fact forensics and analysis
- No additional overhead on routers
  - Uses flow-level measurements (e.g., NetFlow) that are already collected

Basic Method
[Figure: router R forwards flow f1 to N1 at time T1 and flow f2 to N2 at time T2]
- Single-packet flows f1 and f2 towards D
- f1 seen at N1: R is the previous hop at time T1
  - N1 is R's next hop towards D at T1
- f2 seen at N2: R is the previous hop at time T2
  - N2 is R's next hop towards D at T2
- R's next hop towards D changed in [T1, T2]
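The basic method above can be illustrated with a minimal Python sketch. The `Observation` type and the `change_window` function are hypothetical names introduced here for illustration, not part of FlowRoute itself; the sketch assumes each observation already identifies the router, the destination, the next hop at which the flow was seen, and a timestamp.

```python
from typing import NamedTuple, Optional, Tuple


class Observation(NamedTuple):
    """A single-packet flow towards `dest`, seen at `next_hop` with `router` as previous hop."""
    router: str
    dest: str
    next_hop: str
    time: float


def change_window(a: Observation, b: Observation) -> Optional[Tuple[float, float]]:
    """Return the window in which the router's next hop changed, or None if it did not."""
    if (a.router, a.dest) != (b.router, b.dest):
        raise ValueError("observations must share the same router and destination")
    first, second = sorted((a, b), key=lambda o: o.time)
    if first.next_hop == second.next_hop:
        return None  # same next hop at both times: no change inferred
    return (first.time, second.time)


f1 = Observation(router="R", dest="D", next_hop="N1", time=10.0)
f2 = Observation(router="R", dest="D", next_hop="N2", time=12.5)
window = change_window(f1, f2)
```

Here `window` bounds the time of the forwarding table update: the closer together the two flows are observed, the tighter the bound.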
Routing Flow Records
[Figure: router R with incoming interface i from previous hop Rp and outgoing interface o to next hop Rn]
- R sees a flow towards destination D from tf to tl
- NetFlow record: (R, i, o, tf, tl, D)
- Map outgoing interface o to the next hop router and duplicate the first packet timestamp: (R, tf, tf, D, Rn)
- Map incoming interface i to the previous hop router and subtract link propagation delays: (Rp, tf - delay, tl - delay, D, R)
- One flow record at R produces two routing flow records, giving the routing state of R and Rp

Inferring Forwarding Table Updates
- Collect NetFlow records from all routers
- Convert to Routing Flow Records (RFRs) for offline processing
- Case T2 < T3: (R, T1, T2, N1, D) and (R, T3, T4, N2, D)
[Figure: timeline with N1 over [T1, T2] and N2 over [T3, T4]; the gap [T2, T3] is the range of the forwarding table update]
- R changed next hop towards D in the time window [T2, T3]

Inferring Forwarding Table Updates
- Case T2 > T3: (R, T1, T2, N1, D) and (R, T3, T4, N2, D)
[Figure: timeline with N1 over [T1, T2] and N2 over [T3, T4], overlapping]
- The routing flow records overlap; this could be due to Equal Cost Multi-Path (ECMP)

ECMP
[Figure: R forwards f1 to N1 during [T1, T2] and f2 to N2 during [T3, T4], both towards D]
- Router R can forward flows destined to D to either N1 or N2
- RFRs generated at N1 and N2 can overlap, producing an inconsistency
- Non-overlapping RFRs can appear as a routing change for every flow

Filtering ECMP
- Observation: in 99% of next hop changes due to ECMP, a router routes fewer than 20 flows towards one next hop before routing a flow towards an equal-cost next hop
- Filtering heuristic: declare a routing change only if more than 20 flows were routed to the old next hop before a flow is routed to the new next hop
- Conservative: may miss routing changes that occur before 20 flows are forwarded to the old next hop

Sampling
- Both packet and flow sampling occur in high-speed networks
- Sampling does not affect correctness of inferred ranges
- Sampling affects the width of ranges; more sampling means lower temporal resolution
- More discussion in the paper
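The steps on the preceding slides (converting NetFlow records to RFRs, inferring update windows, and filtering ECMP) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `FlowRecord` and `RFR` types, the `neighbor` and `delay` lookup tables, and the function names are all hypothetical, and treating every overlapping pair of RFRs as "no change" is a simplification of the ECMP discussion.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FlowRecord:          # NetFlow-style record (R, i, o, tf, tl, D)
    router: str
    in_if: str
    out_if: str
    t_first: float
    t_last: float
    dest: str


@dataclass(frozen=True)
class RFR:                 # routing flow record (R, T_start, T_end, next hop, D)
    router: str
    t_start: float
    t_end: float
    next_hop: str
    dest: str


def to_rfrs(rec: FlowRecord,
            neighbor: Dict[Tuple[str, str], str],
            delay: Dict[Tuple[str, str], float]) -> List[RFR]:
    """One flow record at R yields two RFRs: one for R and one for its previous hop.

    `neighbor` maps (router, interface) to the adjacent router and `delay` maps
    (from_router, to_router) to link propagation delay; both are assumed inputs.
    """
    nxt = neighbor[(rec.router, rec.out_if)]
    prev = neighbor[(rec.router, rec.in_if)]
    d = delay[(prev, rec.router)]
    return [
        # Outgoing interface -> next hop; first packet timestamp duplicated.
        RFR(rec.router, rec.t_first, rec.t_first, nxt, rec.dest),
        # Incoming interface -> previous hop; propagation delay subtracted.
        RFR(prev, rec.t_first - d, rec.t_last - d, rec.router, rec.dest),
    ]


def infer_changes(rfrs: List[RFR], min_flows: int = 20):
    """Return (router, dest, old_hop, new_hop, window_start, window_end) tuples."""
    groups: Dict[Tuple[str, str], List[RFR]] = {}
    for r in rfrs:
        groups.setdefault((r.router, r.dest), []).append(r)
    changes = []
    for (router, dest), group in groups.items():
        group.sort(key=lambda r: r.t_start)
        hop, run, last_end = group[0].next_hop, 1, group[0].t_end
        for r in group[1:]:
            if r.next_hop == hop:
                run += 1
                last_end = max(last_end, r.t_end)
                continue
            # ECMP filter: require more than `min_flows` flows on the old next
            # hop, and no overlap between the old and new records.
            if run > min_flows and r.t_start > last_end:
                changes.append((router, dest, hop, r.next_hop, last_end, r.t_start))
            hop, run, last_end = r.next_hop, 1, r.t_end
    return changes
```

As on the slide, the threshold of 20 flows makes the filter conservative: a genuine routing change that happens before enough flows have used the old next hop is not reported.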
Timely Forwarding Table Updates
[Figure: forwarding table update ranges shown alongside the range of an OSPF event cluster]
- All ranges overlap with the OSPF event cluster
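The comparison above amounts to an interval-overlap check between each inferred update range and the time range of an OSPF event cluster. The following Python sketch assumes closed intervals and assumes that "consistent" simply means "overlapping"; the exact criterion used in the paper may differ, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end), with start <= end


def overlaps(a: Interval, b: Interval) -> bool:
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def split_by_cluster(update_ranges: List[Interval],
                     cluster: Interval) -> Tuple[List[Interval], List[Interval]]:
    """Split update ranges into those overlapping the cluster and the rest."""
    consistent = [r for r in update_ranges if overlaps(r, cluster)]
    other = [r for r in update_ranges if not overlaps(r, cluster)]
    return consistent, other


cluster = (100.0, 105.0)
ranges = [(99.0, 101.0), (104.0, 110.0), (120.0, 130.0)]
consistent, other = split_by_cluster(ranges, cluster)
```

Ranges that land in `other` are the candidates worth inspecting for delayed forwarding table updates.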
Delayed Forwarding Table Updates
- Forwarding table updates consistent with OSPF events
- Forwarding table updates delayed w.r.t. OSPF events
- Such behavior is not detectable using a control plane monitor alone!

Delayed Forwarding Table Updates
- Used FlowRoute on a 2-month dataset
  - 2666 OSPF event clusters
  - 97010 time ranges consistent with OSPF event clusters
  - 117 ranges that showed delayed forwarding table updates
- Two routers showed delayed updates 14 times in the 2-month dataset
  - Subsequently retired from the network

Loops
- Delayed forwarding table updates can cause transient loops
  - Example in the paper of how this can happen
- 392 instances of 1-hop loops during the 2-month dataset
  - Mostly short-lived (sub-second)
  - A few loops lasted tens of seconds
  - Long-lived loops were due to delayed updates by one or more routers

Summary
- FlowRoute: a data plane monitor to work in conjunction with control plane monitors for forensics and analysis of forwarding table updates
- Used to study forwarding table updates in a tier-1 ISP network
- Found cases of delayed forwarding table updates due to buggy routers
- Also found transient loops during routing convergence and spikes in link utilization

Thanks!
[email protected]
www.caida.org/~amogh

Practical Issues
- What should be the destination?
  - Can be either destination IP address, prefix, or MPLS tunnel endpoint
  - Need to observe sufficient flow volume
  - We choose MPLS tunnel endpoint
- Sampling
  - Both packet and flow sampling occur in high-speed networks
  - Sampling does not affect correctness of inferred ranges
  - Affects the width of the ranges; more sampling means lower temporal resolution

Existing Approaches
- Control plane monitors (e.g., OSPFmon, BGPmon)
  - Monitor the control plane, but cannot measure when a router implemented a change in its forwarding table
- Collect and process router logs
  - Large volume of data; transporting and processing is hard
  - Limited by polling frequency, e.g., 5 minutes with SNMP
- Active probing
  - Spatial and temporal resolution limited by placement of probes and probing frequency

Delayed Forwarding Table Updates
- Used FlowRoute on a 2-month dataset: 2666 OSPF event clusters
  - 97010 time ranges consistent with OSPF event clusters
  - 58 clusters, 117 ranges that showed delayed forwarding table updates
- Two routers showed delayed updates 14 times in the 2-month dataset
  - Subsequently retired from the network