(version 1.1)

2010.06.15 NANOG49 day 2 part 2 notes

Lightning talks
http://nanog.org/meetings/nanog49/abstracts.php?pt=MTYwNCZuYW5vZzQ5&nm=nanog49

Netdot: NETwork DOcumentation Tool
Carlos Vicente, University of Oregon

Reminder, voting for lightning talks starts at 6pm, notifications will be...

http://nanog.org/meetings/nanog49/presentations/Tuesday/Vicente-netdot-presentation-nanog49.pdf

Carlos talks about a tool they have been using for a while.
  discovers devices and their topology via SNMP
  IPAM (IPv4 and v6 address management)
  documents cable plant details
  organizes contact information
  role based access control

Centralize and integrate network information
  reduce manual procedures (manual == outdated)
  generate reports more easily
  delegate tasks to departments/customers
  no similar open source projects; other tools only do one thing

They use it as the source of truth; the sysmon, nagios, and cacti databases are all populated from it, as well as rancid.

Device documentation
  interfaces, IPs, subnets, VLANs, modules, ARP tables, forwarding tables, spanning tree
  manufacturer, model, OS
  uses SNMP::Info from the Netdisco project

Discovers layer 2 topology
  uses CDP/LLDP, spanning tree, and forwarding tables to come up with the layer two topology
  more information can be added manually

Screen snapshot of device inventory.

IP address management
  organizes v4 and v6 addresses hierarchically
  learns subnets from layer 3 devices
  gives graphical representation of usage
  tracks IP and MAC over time
  generates A/AAAA records for interfaces
  can delegate management of records to departments

Graphical IPv4 block view shows what's in use and what's reserved, and lets you drill in to see what's set aside for dynamic IPs; purple cells mark items discovered in the ARP cache but not documented. A hierarchy tree can be generated.

Cable plant
  documents interbuilding fiber and copper, inbuilding jacks, fibers
  uses pairs of fibers to create circuits, associated with device interfaces
They keep closet pictures in it.

Upcoming features:
  RESTful API
  IPv6 space visualization
    playing with hierarchical quadtree model
  IPv6 address collection via SNMP
  DHCPv6 config support
  DNS updates with nsupdate instead of just static zone files
    take advantage of BIND auto-resigning
  improve cable plant section
  inventory of equipment stock
    take advantage of barcode scanners
  more access control granularity

Q: This is very timely, given content on the list.
Q: Joel Jaeggli, this sure has come a long way since he left, wow, thanks!!

http://netdot.uoregon.edu/

Doug Madory
i-root instance in China: accidentally importing censorship
http://nanog.org/meetings/nanog49/presentations/Tuesday/Madory-I-root-lightning-talk.pdf

Users outside of China experienced DNS tampering similar to Chinese internet censorship, due to a route leak for the i-root instance in Beijing.

Great Firewall
  blocks access to some IPs
  returns incorrect DNS responses
  intercepts and resets TCP connections
DNS queries routed through the GFW can return bogus answers, which can impact users outside of China.

We don't know *who* does these things, but we can report on observed behaviours. You can do lookups through their machines to see the impact on names like www.facebook.com.

i-root, IP 192.36.148.17, has an instance in Beijing.

24 March 2010, an email on the list reported strange responses to queries for well known domains; instead of getting referrals from the root, the sender was getting bogus answer records.

There was a leaked route for the i-root instance out of Beijing, making it out of China to PacNet and on to its customers and peers. By crossing Chinese networks, these queries were subjected to the tampering behaviour.

To have this happen, all of the following had to be true:
  you'd have to query a blocked domain
  your result would have to not be cached
  .com would have to not be cached
  you'd have to ask i-root
  you'd have to be directed to China's instance of it
At that point, you'd get a poisoned set of results.
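The chain of conditions above can be written out as a small predicate, which makes it plain that every one of them has to hold at once. This is only an illustrative sketch: the `Lookup` record and its field names are hypothetical, and the instance label "beijing" is just a placeholder for the leaked anycast instance.

```python
from dataclasses import dataclass


@dataclass
class Lookup:
    """Hypothetical description of one resolver lookup (illustration only)."""
    domain_blocked: bool    # the name is one the filtering acts on
    answer_cached: bool     # resolver already holds the final answer
    tld_cached: bool        # resolver already holds the .com referral
    root_letter: str        # which root server the resolver chose
    anycast_instance: str   # which instance routing delivered the query to


def exposed_to_tampering(q: Lookup) -> bool:
    """True only when every condition listed in the talk holds together."""
    return (
        q.domain_blocked
        and not q.answer_cached
        and not q.tld_cached
        and q.root_letter == "i"
        and q.anycast_instance == "beijing"
    )
```

A warm cache for .com, or a query that happens to go to any other root letter, is enough to avoid the poisoned answer, which fits the talk's account of the behaviour running from January through March before it was reported.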
Saw this behaviour for the /24 starting in January and running through March, until the email came out; it also hit the NANOG list. The following day, the route was withdrawn. It was gone for a few months; while working on the talk last week, he saw the route had come back again. As of today, the leak again seems to be there, and answers seem to be legitimate now...

Q: Kurt Lindqvist, are you seeing incorrect responses outside of China now?
A: No, now responses are correct; that's later in the presentation.

Now they test blocked domains; it's not just facebook, it's others as well. From the Chinese client side, you get bad data back. After your query, you get two bad responses, your client tries to connect, and gets an unreachable. This is known behaviour; you see the same behaviour for almost any DNS query against different DNS servers.

Q: Multiple responses are often a fingerprint of multiple packet injectors responding; you can sometimes identify the injectors based on the data.

US clients now get valid data back when querying those i-root instances. But then the Chinese client got good data after that... then 45 minutes after that, they got bad data back again. Non-Chinese clients continue to see correct answers.

No recent evidence of bogus results being delivered outside of China. Netnod is routing properly, and serving data up correctly. This affects not just i-root, but f and j in China as well, for users inside China.

Risk still exists; the behaviour could come back in future. There are assurances this won't happen again; there was a misconfiguration that allowed it to happen.

Recommendations:
  root server operators need to keep a close eye on the routes people use to reach their servers in challenging environments
  a local domestic instance of a global service needs to be very carefully kept local

Q: Randy Bush, IIJ, I can get bogus DNS replies here, thanks to an ISP trying to protect children, from Pirate Bay, and others; it happens in other countries too.
If we keep this up, we will run out of countries to put root servers in.

Q: Jay Hennigan, this doesn't parse for him; no insight into how it happened, but "it won't happen again" is a bit of a mental conundrum.
A: I'll defer to Kurt on that.

OK, up next is

Wide BGP Communities
Robert Raszuk
http://nanog.org/meetings/nanog49/presentations/Tuesday/Raszuk-Wide_BGP_Communities_00_2.pdf

With 4-byte ASNs, customers cannot inject BGP communities, so there is a need for this to be solved.
  new proposed encoding to handle both 2-byte and 4-byte BGP communities
  chance to define community-wide well known communities

The current BGP community RFC defines 2 bytes for the AS number, and 2 bytes for the value.
  specifies that the AS field is for the source AS
  value field doesn't really allow for encoding a target AS

Wide communities are 8 octets, so there is room for source AS, target AS, and actions. Use a single encoding for both 2 and 4 byte communities. You can also encode that AS matches have to happen for the action to take place.

Up to 255 actions can be defined.
  0x00 default action, no special handling
  0x01 informational only support indicator
  0x02 mandatory
  0x03 - 07 Ability to reserve communities

From web pages, he grabbed the most common uses for BGP communities today. Needs more feedback from the community on what other uses for communities there are.
  BLACKHOLE (with source AS)
  SOURCE_BLACKHOLE (from cust)
  SOURCE_RPF
  HIGH_PRIORITY_PREFIX

Advertisement control
  to which should a prefix be advertised?
  which prefixes should I advertise a prefix for?

Another hint was the path hint from Bett Sweeney
  tell the remote peer to send traffic back through a path with a given AS
  also a negative hint: which ASes I don't want to use for transit

Also to allow path modifications
  REPLACE some paths
  PREPEND_UPSTREAM
  PREPEND_PEERS
  PREPEND_CUSTOMERS

Another group is for making geographically
Another group is for incrementing and decrementing localprefs; just to increase or decrease from a given value.
Tony Li: PATH TTL, drop if PATH is > X (1-15)

ftp://ftpeng.cisco.com/raszuk/bgp_wide_comms/

Q: RAS, nLayer, one problem with predefining actions: where they are applicable depends on policies. You may not want your customers to be able to set these in a given region. Adjusting localprefs and region encodings are all very network specific details. Can we use extended communities to specify actions to an ASN, rather than just using information attributes?
A: Each AS may have a whole bunch of their own local ones to use...he left 42000 for ASes to use in their own system. Negative feedback is also good.
RAS--supplying examples is good; but mandating them is a bit more problematic.

Q: Joel Jaeggli, OPSEC working group chair. RFC 5635, remote triggered blackholing; should these be standard communities? It's a bit dangerous to standardize when there is not broad agreement on a standard. On the flip side, this is the right room of people to help decide on that. If there were community consensus on standardizing some of these, that would be good to document.
A: These are still to be enabled by hand on a router, so they are not mandatory by any means!

Q: Dani Roisman, in every routing policy he worked on, you didn't transit communities, you stripped them before sending routes out; the more communities you have, the less packing you can do. It's a bad idea to assume communities will be transitive.
A: With 3.3 and 3.5 packing, that's less of a concern, but good to know.

Q: Leo Bicknell, they use the no-export community today; they are one of the few who try to use it with many providers. The RFC says if you have routers that know about communities, they MUST respond to no-export. But ISPs strip communities on ingress, and never honor it. Historically, it was hard to strip all but a small list of communities on neighbors. Now, that has become easier. The reason for leaks from DNS roots is that they tag routes with no-export...and then leak them.
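For reference, the classic 32-bit community that this discussion keeps returning to can be packed and unpacked as below. This is a minimal sketch, not any router's implementation: the helper names are made up for illustration. The NO_EXPORT and NO_ADVERTISE values are the well-known ones from RFC 1997. The range check shows why a customer with a 4-byte ASN cannot put its own AS number in the 2-byte AS field.

```python
import struct
from typing import Iterable, Tuple

NO_EXPORT = 0xFFFFFF01     # well-known community value (RFC 1997)
NO_ADVERTISE = 0xFFFFFF02  # well-known community value (RFC 1997)


def pack_community(asn: int, value: int) -> bytes:
    """Pack a classic community: 2 bytes of AS number, then 2 bytes of value."""
    if not 0 <= asn <= 0xFFFF:
        raise ValueError(f"AS{asn} does not fit in the 2-byte AS field")
    if not 0 <= value <= 0xFFFF:
        raise ValueError(f"value {value} does not fit in 2 bytes")
    return struct.pack("!HH", asn, value)


def unpack_community(raw: bytes) -> Tuple[int, int]:
    """Split 4 wire bytes back into (AS number, value)."""
    asn, value = struct.unpack("!HH", raw)
    return asn, value


def must_not_export(communities: Iterable[int]) -> bool:
    """Hypothetical helper: does a route's community list carry NO_EXPORT?"""
    return NO_EXPORT in set(communities)
```

For example, `pack_community(65000, 100)` yields the four bytes `fd e8 00 64`, while `pack_community(196608, 100)` raises `ValueError` because AS196608 needs more than 16 bits.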
Please, can we get no-export and no-advertise supported internet-wide? As an operator, there is much fear, because customers will expect these to be universally honored. People will send them, expect behaviour to happen, and will get upset when it doesn't.
A: Perhaps 'well known' was the wrong name. He was thinking more of these as templates for people to follow; no expectation that people will necessarily have to honor them.

Q: Leo, do you really expect your peers to honor no-export?
A: Yes, he expects people to follow the RFC, which means that people *MUST* accept the no-export community. The things being proposed here do have downstream implications. Core things like no-export have almost no pitfalls.

Q: If one were to produce a BCP that contained templates for well known communities, it would mean you could ask your upstream "do you support BCP X for BGP communities or not?"
A: Yes, that is exactly what was intended by this; it is a bit of opening Pandora's box.

OK, Peter is up next, from Huawei

IEEE 802.1aq shortest path bridging
http://nanog.org/meetings/nanog49/abstracts.php?pt=MTYwNSZuYW5vZzQ5&nm=nanog49

A couple of new acronyms; if you work with a layer 2 portion of your network, you have
  spanning tree
  ring based protocol
  split LAG type protocol
IEEE looked at these issues, and attempted to solve some of the problems there.

Challenges
  L2 networks that scale to 1000 bridges
  use of arbitrary mesh topologies
  use of (multiple) shortest paths
  efficient broadcast/multicast routing and replication points
  avoid address learning by tandem devices
  get recovery times into the 100s of ms range for larger topologies
  good scale without loops
  create logical layer 2 topologies within that
  maintain properties of layer 2 within that: ordering, symmetry, congruency of unicast and multicast
  maintain/reuse all existing ethernet OAM

Shows a slide example of a 36 node STP network. Useful paths can't be used, devices all the way up and down the tree have to learn addresses, etc.
802.1aq/SPB is built on 802.1 standards
  builds on Q-in-Q and MAC-in-MAC
    QinQ is SPBV mode
    MinM is SPBM mode
  backward compatible
  creates multiple shortest path routing trees
  head end or tandem replication models
  completely deterministic traffic flows mean offline tools can compute it
  you can thus do lightweight traffic engineering
    head-end assignment to 16 shortest paths

IS-IS protocol is already proven to more than 1000 devices.
  huge improvement over how STP scales
  good convergence with minimal fuss
  some pre-standard deployments which show good behaviour
  can operate as a separate IS-IS instance, or can work with existing IS-IS
  also supports multi-topology extensions for IS-IS
  membership advertised in same protocol as topology

In MAC-in-MAC mode, learning happens only at the edge. Unicast/multicast are populated at the same time, giving symmetry and congruence.

Some diagrams of paths through a network showing all the links between nodes being usable.

Application: (M|R)STP replacement
  many more nodes without regions
  low effort to get good routing
  fast convergence

Application: datacenter
  multiple shortest path routing (inter server traffic)
  deterministic traffic flows
  flexible subnet -- expand/shrink anywhere...
  address isolation for m-in-m
  totally compatible with vmware server functions
    OA&M, motion, backup, etc.
    apps that sit on vmware 'just work'
  totally compatible with microsoft load balancing (multicast over layer 2)
  VRRP transparent
  it just lets the L2 part of the DC grow larger and be better utilized

Application: lightweight L2VPNs in a metro region
  inexpensive form of VPLS
  E-LAN, E-LINE, E-TREE flavours
  can do VPLS style headend replication
  can do p2mp style transit replication

Operator perspective, how does it work?
  plug NNIs together
  group ports/C-VLANs at UNIs
  assign an I-SID to each group

Internally
  IS-IS reads MAC address, forms NNI adjacencies
  IS-IS advertises MAC addresses, no config needed
  IS-IS reads assignments of services to ports, advertises that along the path
  computations produce FIB

Data path (M mode)
  traffic arrives at UNI
  encapsulated with B-SA of bridge
  encapsulated with I-SID configured for group
  encapsulated with B-VID chosen for route
  C-DA is looked up

Unicast transit state
  consists of unique shortest path from this node to all other nodes
  forwarding state contains B-MAC for each other node, points to the interface out along the shortest path
  repeat for up to 16 shortest paths
  symmetry of routing is assured through 16 different tie-breaking algorithms

Some slides show the paths through the network for given B-VIDs through the core.

FDB (mcast M-in-M mode)
  if no services require tandem replication, there is no tandem FDB
  transit multicast: format is known by all nodes, no need for head-end replication

Visual demonstration shown for mcast m-in-m mode. For an E-LAN with 7 members in a 100 node network, paths are shown with calculated shortest path trees; paths are calculated in one direction, and are used in the reverse direction for multicast traffic.

Control plane is the IS-IS link-state protocol
  doesn't require IP to operate
  allows v4 or v6 to exist in same instance, but not required
  SYSID carries B-MAC address
  introduces no new PDUs to IS-IS
  Hello TLVs augmented to pass equal cost algorithm and new NLPID
  updated TLVs to advertise SPB data, etc.

Slide showing the details for a given link in terms of TLVs.

Loops: suppression and avoidance
  suppression is done on the data path using an SA check
    prevents 99.99% of loops if FDBs create one
    no impact on convergence rates
    exploits symmetry/congruence of routing
    uses reverse learning options of most h/w to discard
  avoidance is done by the control path
    ensures no loops are ever configured in FDB
    hellos augmented with topology digests
You bound forwarding updates to not move more than a certain distance, to not allow loops to form in the first place.

In terms of OAM, it inherits it by design; service, link, and network layer are inherited from 802.1ag, ah, and Y.1731.

Recovery mechanism where the multicast mechanism for UNI ports can be used to tie control planes together; that way updates can be placed on a multicast tree and sent simultaneously to every other node in the network, to speed up convergence dramatically.

Example included in the deck for those who like to see forwarding tables.

Keeping the wikipedia page up to date.
Will do an in-depth tutorial at next NANOG. He has an emulator on his laptop; he'll be happy to sit at the bar and show you.

Q: (mic isn't on, can't hear it)
A: They usually show up afterwards.

Q: Those of us with big datacenters would really like to see this happen. Why use 802.1aq vs TRILL; what is the benefit of this over TRILL?
A: He's a pragmatist; TRILL has adopted a certain class of ECMP behaviour, and 802.1aq has a different ECMP behaviour. It's too bad the two groups can't unify. Can we take the best of both, and get the two groups to compromise on a single, best of breed model? Some vendors are in the middle ground, using ethernet mac-in-mac with TTL, using IS-IS to control it. If you believe in it, please, by all means, help them.

Enterprise QoS
Tim Chung, Google corporate enterprise architecture team
http://nanog.org/meetings/nanog49/abstracts.php?pt=MTU2MyZuYW5vZzQ5&nm=nanog49

This is just about the google corporate network, NOT google.com!

Challenges
  WAN links are the bottleneck in terms of bandwidth, which makes QoS essential on the links
  latency sensitive apps (voip, video) need proper QoS
  many apps to classify; the classification ACL becomes a challenge
  different agreements with different transit providers about how much traffic can go into each queue
  diverse vendors, different queue support
  tend to run IPSec on the enterprise; performance is important
Classifications changed at least once a week.

Want classification as close to the edge as possible. May not always be possible. Do you trust, or not trust, end nodes? When you have creative users with root access, they can tag all their traffic as network control.

Port based classification, based on layer 4 ports. Define it at the WAN edge router rather than at switches, since that's where congestion really happens. Pay close attention to the classification ACL, to make sure it's as granular as possible. Layer four classification is imperfect; not all port 80 traffic is web. Specify IP address as well as ports, and put some counters on them.

Juniper commit scripts are used extensively to handle configs for forwarding classes, and to modify classifiers where there are specific routing instances on a certain subset of the routers. They tend to have policers and shapers, which becomes problematic if you don't use commit scripts. The commit script takes input from interface tags, and looks at the platform it's running on and the transit provider to figure out the right policer. They have traffic agreements with different providers; firewall rules are pre-policer and post-policer.

Operational aspects of QoS
  need to effect changes quickly
  want statistical counters around what was changed

Security team at google built an ACL generation program called capirca
http://code.google.com/p/capirca
  pretty much all ACLs at google are constructed using it
  network definition files have IP addresses in them
  create a policy file, then the generation process converts the metalanguage to vendor-specific output
  policy file is built from network definition files and service definition files
  ACLs can be generated for multiple platforms that way

BGP QoS over IPSec. They use BGP to tie offices together, and run BGP through IPSec tunnels. BGP was dropping; a Juniper feature sets precedence 6 for control traffic, but it was stuffed into queue 0, best effort. This happens once a service PIC is involved: control traffic is stuffed into the tunnel, and goes out the PIC with a rewrite rule.
The inner TOS bits are copied out to the outer TOS header, but when the packet exits from the service PIC the TOS gets rewritten back to effectively 0. As of 10.0, you can fix that, and override on the egress side.

You have to know your applications well, and know your platforms well. Don't expect to send 10G of throughput through a 100 term ACL. On Cisco, you have TCAM issues; just because you can generate the file, don't expect it will fit in TCAM. Test everything, use your traffic generator, and make sure you validate before pushing live. Netflow is running, and firewall counters are graphed.

Page of URLs is a bit hard to read from back here.

No questions from the room.

VCR giveaway will happen now...betamax? Thanks to everyone who stopped by the vendor collaboration room; they're giving away a netgear wireless router.

Now, it's break time, thanks to Network Hardware Resale. After break, the research forum will be in the room here; peering talk will go first, then research forum will be after.

[Matt has to attend a meeting from 4-5--upon return, they had just finished the ITER approach presentation]

http://nanog.org/meetings/nanog49/abstracts.php?pt=MTYwNyZuYW5vZzQ5&nm=nanog49

DNSSec visualization
http://nanog.org/meetings/nanog49/presentations/Tuesday/dnsviz.pdf

Part of this came from earlier work, signing zones at Sandia.gov last year. They set it up so they could get data from failures. Looked at failures in the first couple of weeks; failures were all over the map. So, he integrated a tool to diagnose and visualize failures for administrators.

In the old days, you could use dig and look at results. Now, with DNSSec, the results are harder to understand with the naked eye. When something goes wrong, you get a SERVFAIL message; it could be DNSSec, or it could be some other failure. When you get a failure, if you pull out dig, you might get lucky and be able to spot a broken trust chain delegation. With multiple key roles, it's harder to spot which role or which key fails. And DNSSec attributes may not follow roles.
dig +sigchase can give more information.

http://dnsviz.net
  visualizes a connected graph with nodes and edges
  different shapes depending on node type; the attributes set change the decorations on the nodes (shaded, dashed, filled, double border)
  diamond nodes are signed, but have insecure delegations
  connections between nodes are for digests, signatures, and status of those updates
  zones and dependencies are overlaid on top of the graph

Bottom line--is there a valid chain of trust from the trust anchor down to the node?

Looking at the medicare.gov example: there is a DS record for it, but it doesn't match up with any DNSKEY RR.

There are some options in the tool to change things around; you can change supported algos, and put in your own trust anchor. This lets you play around a bit to see how you'd look with different rules in place.

He shows some examples of real issues, non-anonymized, and walks through the breakage in each case. (At the top, you can see the root, with broken indicators, since it's not signed.)
  the .arpa zone from last week, showing the expired signature

DNSviz will map revoked records to the pre-revoked bit so they can recognize bad rollovers.

Some of the dependency complexities are shown, with alias dependencies through different names which do have problems.

Tool checks
  consistency of DNSKEY RRset, signature, serial
  PMTU on server side
  NSEC3 awareness

Future work
  documentation
  visual history of zone for reference, post mortem analysis
  regular polling, monitoring/alert services

ctdecci at sandia.gov
http://dnsviz.net

Q: Thanks for doing this, it is very helpful for administrators who are trying to diagnose issues!
Q: Kevin Oberman, thanks for giving the heads-up; the problem shown there wasn't an appliance issue, it was that the spam filter caught all the alerts from the appliance, so the updates were missed.
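The medicare.gov case from this talk (a DS record that matches no DNSKEY RR) comes down to two comparisons: key tag and digest. The sketch below follows the key tag algorithm in RFC 4034 Appendix B and the SHA-256 DS digest defined in RFC 4509. The function names and any key bytes fed to it are invented for illustration, and this is not a description of how DNSViz itself is implemented.

```python
import hashlib
import struct


def dnskey_rdata(flags: int, protocol: int, algorithm: int, public_key: bytes) -> bytes:
    """DNSKEY RDATA wire format: flags (2), protocol (1), algorithm (1), key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata: bytes) -> int:
    """Key tag per RFC 4034 Appendix B (all algorithms except 1, RSA/MD5)."""
    acc = 0
    for i, byte in enumerate(rdata):
        # Even offsets are the high octet of a 16-bit word, odd offsets the low.
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def wire_name(name: str) -> bytes:
    """Canonical (lower-cased) wire form of an ASCII owner name."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels) + b"\x00"


def ds_digest_sha256(owner: str, rdata: bytes) -> bytes:
    """DS digest type 2: SHA-256 over owner name followed by DNSKEY RDATA."""
    return hashlib.sha256(wire_name(owner) + rdata).digest()


def ds_matches(owner: str, ds_tag: int, ds_digest: bytes, rdata: bytes) -> bool:
    """True when the parent's DS record points at this DNSKEY."""
    return key_tag(rdata) == ds_tag and ds_digest_sha256(owner, rdata) == ds_digest
```

A broken delegation of the kind shown in the talk is the situation where `ds_matches` is false for every DNSKEY in the child zone's RRset.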
Lucas from UCLA
Tool for IP address allocation: EyeP, to visualize IPv4 allocation and usage
worked with Ricardo and Lixia
http://nanog.org/meetings/nanog49/presentations/Tuesday/EyeP.pdf

IPv4 addresses are running out, but how are they being used?
  relationship between allocation and usage is not clear
  are prefixes announced in the same size as they are allocated?
  tool to see allocated but unannounced prefixes
  are any unallocated prefixes announced in BGP?
  why would people announce space that is not allocated to them?

The amount of data is so huge, it's not feasible for humans to dig through it. This tool will help you not only get the big picture of the entire v4 space, but also drill down into the specific IP range you are interested in.

Provides a comprehensive picture of overall allocation, and of the relationship between allocation and routing announcements
  allocated but not observed in DFZ
  announced but not allocated
  allocated and announced in many different forms

Data sources:
  whois database from all 5 RIRs
  RIR address allocation records
    daily snapshots from Feb 2005
  RouteViews snapshots
  RIPE collection snapshots

It lets you do searches on org names or prefixes. Below the search bar, the entire IPv4 allocation space is shown, grouped by 32 /8 blocks. There are only 16 unallocated /8s left on the chart. The top part shows the legacy chunk, and the bottom shows the multicast space. Colour coded blocks show which RIR controls which pieces.

Sub-blocks are visualized showing the sub-allocations within the /8 when you zoom in, coloured based on which RIR does the suballocation. To visualize multiple /8s, it simply shrinks the view, making the rectangles into smaller bars.

He delves into 24/8, 1/8, and 193/8 to show where sub-allocations were given to orgs within ARIN, RIPE, and AFRINIC.

One way to turn your address into prefixes; you can see the allocation size, and then the children within BGP of the covered prefix announcements.
Putting them together, you get the allocation block above the line, with announcements below the line.

Would be happy to get feedback on how people could use the tool.

Will talk about allocated but unannounced blocks, and unallocated but announced blocks.

Nov 28, to Dec 6, 2008, SwissComm was announcing a bunch of /8s that didn't belong to it. After the announcement, smaller announcements were masked. At end of July 2009, the curve returned to normal when SwissComm stopped announcing those blocks.

In terms of measurement, he aggregates blocks together, even if they don't line up on CIDR boundaries. He has a mapping of DFZ-invisible blocks, allocated but unannounced.

Other chart shows unallocated prefixes in BGP (generally less than 2500 blocks). The amount of addresses in that range decreases drastically from the end of 2009.

Who announced prefixes that are not allocated? Origin AS distribution on a time axis; Y-axis is AS-ID. There are a LOT of ASes announcing space that is not allocated. Some ASes almost continuously do it; others just do it for a short period of time.
  July 12 2005, Aug 2, 2005, AS2905 announced 50 /8 prefixes; BGPMon captured the events.
  AS16215 announced more than 60 /8 prefixes; two monitors caught them.

EyeP helps you manage and visualize the allocation of IPs, as well as the structure in the routing tables. Results show that about 25 /8's out of allocated space are not visible. 1-1.5 /8's of unallocated space are observed in BGP on average.

Q: What is the significance of blocks not being visible in the DFZ?
A: This is just some measurement--after addresses are given out to an organization, they may be in use inside its own AS; no conjecture really about why that is.

Evaluating Potential routing diversity for internet failure recovery
http://nanog.org/meetings/nanog49/presentations/Tuesday/IER.pdf

Failure is part of everyday life in IP networks
  eg 675,000 excavation accidents in 2004
  Taiwan earthquake incident is an example; only two of nine cross-sea fibers not cut.
Internet is not as reliable as people expect.
  32% of ASes are vulnerable to a single critical provider link cut
  93.7% of tier 1 single-homed customers are subject to connectivity loss during depeering

Two places with more routing diversity:
  IXPs
    participants may not be connected via BGP yet
  Internet valley-free routing policy
    if peering rules are relaxed to allow a peer to carry traffic, the victim can find more routing paths to restore connectivity

Dataset for evaluation
  most complete AS topology graph
  collect data from RouteViews, RIPE, Abilene, CERnet
  P2P traceroute between 992,000 IPs in over 3,700 ASes
  in total, 120K AS links with AS relationships

Failure modes:
  peering link teardown
    tier 1 depeering (cogent, level 3)
  provider-customer link teardown
    several breakdowns of tier 1 provider customer links
  mixed types of link breakages

Evaluation metrics
  recovery ratio
  path diversity
  shifted path

Results: tier 1 depeering
  36 experiments for 9 tier-1 ASes
  recovery ratio: most of the lost AS peers can be recovered; minimum ratio was 23%
  path diversity: multiple AS paths between lost AS pairs
    mostly between 2 and 4; in only a few cases is path diversity less than 2
  shifted path: on average, 3.75 to 17.2 for all 36 experiments
    moderate traffic load shifted onto recovery links

How can we use these potential resources?

Economic model
  B pays A for recovery
  risk alliance (like airlines) -- price determined beforehand
  pay on bandwidth and duration of emergency

Communication channel for peers
  have direct connections to peers
  for co-located ASes in the same IXP
    ASes are connected by switches in modern IXPs
    messages are broadcast via switches
    message confidentiality through public key crypto

Diagram of automatic communications
  Query phase--who can connect to the destination AS?
  Reply phase--I can provide X bandwidth to AS
  ACK phase--I would like to buy X BW to AS
After the outage, messages can be sent to tear down the recovery path.
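The valley-free policy this talk relies on, and the relaxation that lets a peer carry recovery traffic, can be sketched as a simple path check. This is a sketch under stated assumptions: the relationship labels and the `max_peer_links` parameter are made up for illustration, and the notes do not give the exact relaxation rule the authors evaluated.

```python
def is_valley_free(links, max_peer_links=1):
    """Check an AS path given the relationship of each hop, source to destination.

    Each entry in `links` is one of:
      "c2p"  customer to provider (uphill)
      "p2p"  peer to peer
      "p2c"  provider to customer (downhill)

    The usual valley-free rule allows uphill hops, then at most one peer hop,
    then downhill hops. Raising `max_peer_links` is a hypothetical way to
    model relaxed peering rules.
    """
    phase = 0        # 0 = climbing, 1 = crossing peer links, 2 = descending
    peer_links = 0
    for rel in links:
        if rel == "c2p":
            if phase != 0:
                return False          # climbing again after a peer/downhill hop is a valley
        elif rel == "p2p":
            if phase == 2:
                return False          # no peer hop once the path has started descending
            phase = 1
            peer_links += 1
            if peer_links > max_peer_links:
                return False
        elif rel == "p2c":
            phase = 2
        else:
            raise ValueError(f"unknown relationship: {rel!r}")
    return True
```

With the default setting, a path that crosses two peer links in a row is rejected; with `max_peer_links=2` the same path is accepted, which is the sense in which relaxing the peering rule exposes additional recovery paths.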
Optimal selection of helper ASes
  buy from multiple transit providers

Summary
  point out a new venue for Internet failure recovery
  evaluate potential routing diversity via IXP and PR with the most complete AS topology graph
  40-80% of affected AS pairs can be recovered via IXP and PR, with multiple paths and moderate shifted paths

Q: Randy Bush, IIJ, seems to rely on the valley-free hypothesis, and seems to rely on ....[couldn't hear] and three, that the data plane follows the control plane; recent papers indicate you should not trust that.

Thanks everyone, please give the speakers feedback, and enjoy the social tonight!

Meeting wraps up at 1758 hours Pacific time.