Internet Exchange Point Switching "Wishlist"
Version 3.0.2
Mike Hughes <mike@linx.net>
London Internet Exchange
July 2005
Abstract
At the RIPE meeting held in
Amsterdam in February 2000, a number of participants agreed that the
group should produce a "wishlist" to guide equipment manufacturers
when producing boxes aimed at the core switching market. Over the following
months, ideas were collected from the EIXP community to form the basis
of this document.
In Europe, most Internet Exchange Points use a shared switch fabric to
which the participants connect. Organisations then arrange peering via
bi-lateral peering agreements. Participants are not required to peer with
every other participant (an arrangement known as multi-lateral peering).
Once two participants agree to peer, they will set up BGP4 sessions between
their routers connected to the Exchange to exchange routes and traffic.
In the majority of cases, the Exchange Point operator does not become
involved in the routing of any traffic across the Exchange, choosing
instead to leave this to the participants.
For this reason, switched Ethernet has become one of the most common
choices for Exchange Point media. The main reasons behind this are:
- Cost effectiveness
- Simplicity of setup
- Can use standard CAT5 wiring - easy to implement and maintain
- Interfaces available across a wide range of platforms
With the growth of the Internet, more and more traffic is being routed
to Internet Exchange Points, and the importance of IXPs has grown in line
with this, especially in Europe where private peering is less common than
in North America.
The IXP operators feel that having the right tools and features implemented
in the equipment they deploy will play an important part in scaling Ethernet
technology to meet the demands placed upon Exchange Points.
This is an informational document to outline the various features which
IXPs would like to see implemented in core Ethernet Switching products.
Security and Management Features:
a) Control of dynamic MAC learning
----------------------------------
Currently, switches are provided with two options, either statically
configured or dynamically learned forwarding information.
Exchange Points like to monitor and control how many MAC addresses are
connected to a participant's port. The XP operator generally does not
desire ad-hoc extensions connected to their network. The common way of
managing this is to enforce a "router-only" or "limited MAC address"
rule.
This is currently either enforced by statically configuring forwarding
information, or left unenforced and policed by counting the number of MAC
addresses learned on each port and taking action against offenders.
Static configuration of forwarding information is a somewhat inelegant
option, as this increases configuration overhead, and decreases flexibility,
especially in case of emergencies.
We propose a "maximum learning" limit, configurable on a per-port
basis. In this way, operators can configure participants' ports according
to their house rules, but retain the flexibility of dynamic learning.
The filtering should not impose a performance hit on those ports which
are MAC-limited.
An additional "twist" to this feature could be configurable
actions dependent on ethertype or other layer two header features. For
example, if a frame with a forbidden ethertype or forbidden destination
MAC (e.g. STP) is received, the port could be shut down. This could be
relaxed for annoying but less dangerous things like CDP - e.g. write a log
message, but do not shut down the port.
There should be multiple levels of locking:
- "Forwarding" limit - the maximum number of addresses you
will forward for on this port
- "Soft" limit - the limit at which you will record syslog
events
- "Hard" limit - the limit at which you shut down the port
(drop link if able) and record a syslog event
All the above locking, timeout, and reset rules should be configurable
by the network operator, such as:
- A hard limit may require manual intervention to reset the locking
and bring the port back up.
- The lockdown should automatically flush when there is a state transition
on the interface.
This feature must include a "first in-last out" mechanism in
the lockdown facility, to avoid forwarding information for valid addresses
being overwritten by addresses in excess of the house rules of the exchanges.
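The three limits and the "first in-last out" rule described above could be modelled roughly as follows. This is only a sketch under stated assumptions: the class name, field names and the list standing in for syslog are all hypothetical, and a real switch would implement this in its forwarding hardware rather than in software.

```python
from dataclasses import dataclass, field


@dataclass
class PortMacLimiter:
    """Per-port MAC learning limits (all names here are hypothetical)."""
    forwarding_limit: int  # max addresses forwarded for on this port
    soft_limit: int        # address count at which an event is logged
    hard_limit: int        # address count at which the port is shut down
    learned: list = field(default_factory=list)  # in order of first sighting
    excess: set = field(default_factory=set)     # seen, but not forwarded for
    shut_down: bool = False
    events: list = field(default_factory=list)   # stand-in for syslog

    def observe(self, mac: str) -> bool:
        """Record a source MAC; return True if it is forwarded for."""
        if self.shut_down:
            return False
        if mac in self.learned:
            return True
        if len(self.learned) < self.forwarding_limit:
            self.learned.append(mac)
        else:
            # First in-last out: earlier valid addresses are never evicted.
            self.excess.add(mac)
        seen = len(self.learned) + len(self.excess)
        if seen >= self.hard_limit:
            self.shut_down = True
            self.events.append(f"hard limit reached by {mac}: port shut down")
            return False
        if seen >= self.soft_limit:
            self.events.append(f"soft limit reached by {mac}")
        return mac in self.learned

    def link_transition(self) -> None:
        """Flush learned addresses on an interface state transition."""
        self.learned.clear()
        self.excess.clear()

    def manual_reset(self) -> None:
        """Operator intervention: flush state and re-enable the port."""
        self.link_transition()
        self.shut_down = False
```

Because the offending address is recorded in the event text, the operator can see what tripped the limit without re-enabling the port.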
One possible issue is that once a port has been shut down for breaking
the port security rules, it has to be re-enabled to find out what caused
the shutdown. Could details of the offending frame be included
in the syslog? Or copied to a mirror port? Or captured in some sort of
buffer on the switch, where it can be viewed?
Could this feature also be extended to operate on a per-VLAN basis on
a tagged port, so that each VLAN on a tagged port could have its own
lockdown limits? One assumes this would largely be dependent on the switch
architecture itself.
b) Disable acting on STP BPDU information
-----------------------------------------
Many exchange operators currently deploy Spanning Tree Protocol (STP)
in networks which contain redundant links/full meshing.
There is, however, a danger presented by STP information leaked from a
participant's network. The participant may have connected a poorly configured
switch/router product, and may be leaking their STP information into that
of the exchange.
We would wish to see a configurable option to allow STP information to
be ignored, and filtered in hardware on "edge ports", on a per
port basis.
There should also be an option to generate traps or log messages based
on transgressions of the policy.
c) Wire-speed ACL-type filtering based on L2/L3 header info
-----------------------------------------------------------
The ability to look into the layer 2 or 3 header information of a packet,
and selectively monitor, or filter, based on certain layer 2 or 3 criteria.
This could be done using pattern matching or masking.
A common example of an area where this is desirable is to permit frames
of only certain ethertypes to enter the network through an edge port.
This sort of filtering should be implemented in hardware wherever possible,
and not have an effect on the forwarding performance of the system. Where
this is not possible, it must be clearly documented.
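As an illustration of the ethertype example above, the check itself amounts to reading two bytes at a fixed offset in the frame. The sketch below is a software model only; the allow-list (IPv4, ARP, IPv6) and the function names are assumptions, and each exchange would set its own policy.

```python
import struct

# Assumed allow-list: IPv4, ARP and IPv6. A real exchange sets its own policy.
ALLOWED_ETHERTYPES = {0x0800, 0x0806, 0x86DD}
VLAN_TPID = 0x8100  # 802.1Q tag protocol identifier


def ethertype_of(frame: bytes) -> int:
    """Return the EtherType/length field, skipping one 802.1Q tag if present."""
    (value,) = struct.unpack_from("!H", frame, 12)
    if value == VLAN_TPID:
        (value,) = struct.unpack_from("!H", frame, 16)
    return value


def permit(frame: bytes) -> bool:
    """True if the frame's EtherType is on the allow-list."""
    try:
        return ethertype_of(frame) in ALLOWED_ETHERTYPES
    except struct.error:
        # Frame too short to carry an Ethernet header: treat as not permitted.
        return False
```

Frames using 802.3 length encapsulation, such as STP BPDUs, carry a length value rather than an EtherType in this field, so they fall outside the allow-list without needing a separate rule.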
d) L2 Broadcast/Multicast/Unknown Unicast control, ARP snooping
--------------------------------------------------------------
Many exchange points insist on participants using IP addresses assigned
to them by the exchange operator. It is desirable for the operator to
be able to monitor/restrict "off-net" ARP.
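A minimal sketch of what an "off-net" ARP check could look like is given below. The peering LAN prefix is a hypothetical value from the documentation address range, and the function name is an assumption.

```python
import ipaddress

# Hypothetical peering LAN prefix, taken from the documentation range.
PEERING_LAN = ipaddress.ip_network("192.0.2.0/24")


def is_off_net_arp(sender_ip: str, target_ip: str) -> bool:
    """True if either address in an ARP request lies outside the peering LAN."""
    sender = ipaddress.ip_address(sender_ip)
    target = ipaddress.ip_address(target_ip)
    return sender not in PEERING_LAN or target not in PEERING_LAN
```

Whether a match is merely logged or the frame is dropped would be a matter for the operator's configuration.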
As Ethernet is a broadcast medium, broadcast storms have been known to
bring exchanges to their knees, affecting the forwarding abilities of
both the switches of the exchanges, and the participants' routers. Monitoring/rate
limiting/control of Ethernet broadcast frames is desirable.
Most exchanges also forbid the speaking of interior routing protocols
across their peering network. Since these take the form of broadcast or
multicast frames on Ethernet, control would help monitor this type of
incident.
Such control should be able to distinguish (through appropriate configuration)
between legitimate ARP and genuine broadcast storms.
In addition, unknown unicast floods may also start to become an issue
when there is a range of different port speeds across a single layer 2
environment - it's possible for a large flow of unknown unicast from a
1G port to saturate a 100M port. Being able to do hardware rate-limiting/discard
of unknown unicast traffic will help maintain an uncongested port for
end stations with slower access speeds.
There should be suitable configuration knobs to be able to rate limit,
shut down, log exceptions, etc.
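One common way to express such a rate limit is a token bucket, sketched below for a single traffic class (for example, unknown unicast towards a 100M port). The class and parameter names are assumptions, and a token bucket is one possible technique rather than a stated requirement.

```python
class TokenBucket:
    """Token-bucket limiter; the caller supplies timestamps in seconds."""

    def __init__(self, rate_bps: float, burst_bytes: float) -> None:
        self.rate = rate_bps / 8.0  # refill rate in bytes per second
        self.burst = burst_bytes    # bucket depth in bytes
        self.tokens = burst_bytes   # start with a full bucket
        self.last = 0.0
        self.dropped = 0            # could drive a log or shutdown threshold

    def allow(self, frame_len: int, now: float) -> bool:
        """Return True to forward the frame, False to discard it."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if frame_len <= self.tokens:
            self.tokens -= frame_len
            return True
        self.dropped += 1
        return False
```

The `dropped` counter is the hook for the "knobs" mentioned above: an operator could log when it first increments, or shut the port down once it passes a configured threshold.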
e) Support for MARP?
--------------------
We had looked at seeking support for something such as MARP - "MultiAccess
Reachability Protocol". This was defined in an Internet Draft, which
allows for detection of failures at L2 across multiple switch hops.
However, this has failed to be published as an RFC. As Cisco has
now asserted IPR over the content of the draft, MARP would require
licensing. This seems to have killed off this promising idea.
f) Policy exception logging
---------------------------
In the above paragraphs, we have asked for some policy-based tools.
Operators need to know when these policies have been breached.
Good logging of policy exceptions needs to be implemented:
- Adequate documentation of possible syslog events/messages - e.g. to
help in configuring network monitoring systems.
g) Access to management interfaces
----------------------------------
In the past, security of management interfaces on Ethernet switching products
has often been lacking.
CLI or web interfaces should support authentication using username/password
pairs, to avoid the use of "password only" authentication which
implies shared passwords.
CLI interfaces should also support SSHv2 access, using either username/password
pair, or public key authentication.
Web interfaces should be HTTPS/SSL enabled, to avoid passwords being
passed in the clear over HTTP.
Support SCP/SFTP for config copy/upload/download, as well as existing
methods (TFTP/FTP).
Management interfaces should be able to perform authentication from an
external source, such as TACACS, RADIUS or LDAP services, as well as providing
locally held accounts (which have to be retained for emergencies).
All management interfaces, CLI, web and SNMP should be able to benefit
from access-list control. The access lists should be able to support variable-length
subnet masks.
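To illustrate why variable-length masks matter, the sketch below checks a source address against a list of prefixes of differing lengths. The prefixes are documentation ranges and the names are hypothetical.

```python
import ipaddress

# Hypothetical management ACL; the prefixes are documentation ranges.
MGMT_ACL = [
    ipaddress.ip_network("192.0.2.0/26"),
    ipaddress.ip_network("198.51.100.16/28"),
    ipaddress.ip_network("2001:db8:100::/48"),
]


def mgmt_access_permitted(source: str) -> bool:
    """True if the source address falls inside any permitted prefix."""
    addr = ipaddress.ip_address(source)
    return any(addr in net for net in MGMT_ACL)
```

A /28 entry admits only the sixteen addresses of a NOC subnet, which a classful or fixed-length mask could not express without over-permitting.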
It should be possible to disable management interfaces on a per-VLAN basis.
Many XP operators choose to configure a "management" VLAN, so that all
management is done out-of-band of the core peering traffic. It is desirable
to have the management interfaces listen on the management networks
only.
On devices which are designed to support high bandwidth per-slot, such
as high-density GigE or 10GigE, it is preferable to have a 10/100 Mb management
port provided on the system, to avoid burning a fast port for management.
h) Port mirroring
-----------------
It is sometimes necessary to mirror participants' ports, either because
a participant is suspected of some inappropriate activity, or to help
obtain information to debug a problem.
Not all exchange points have staff on site 24x7, and port mirroring may
need to be remotely set up, without hands-on intervention on-site.
The ability to allow any port to mirror any other port with a similar
or lower speed within the chassis would allow the operator to connect
a traffic collector/analyser device to a monitoring port, and simply configure
the switch to mirror a port as desired to the monitoring port.
Possibility to define a capture filter of things which need to be sent
to the mirror port. This would help if you are only looking for a specific
frame type.
A user-configurable "sampled" mirror port - a sampling rate
can be set up to
Both the latter features would permit a slower port to mirror items of
interest from a faster port.
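The combination of a capture filter and a sampled mirror could behave roughly as sketched below. The 1-in-N counting scheme, the function name and its parameters are all assumptions made for illustration; they are not taken from any particular switch.

```python
from typing import Callable, Iterable, Iterator


def mirror(frames: Iterable[bytes],
           capture_filter: Callable[[bytes], bool] = lambda frame: True,
           sample_every: int = 1) -> Iterator[bytes]:
    """Yield every Nth frame that matches the capture filter."""
    if sample_every < 1:
        raise ValueError("sample_every must be at least 1")
    matched = 0
    for frame in frames:
        if not capture_filter(frame):
            continue
        matched += 1
        if matched % sample_every == 0:
            yield frame
```

With `sample_every=10`, for instance, a 100M mirror port would see at most a tenth of the matching frames from a 1G source port.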
i) Statistics and Accounting
----------------------------
As well as implementing de facto SNMP counters/RMON, also consider implementing
the following:
- Per-VLAN traffic statistics
- sFlow export support (via management interface)
- Counters based on common ethertypes (IPv4, IPv6, multicast, ARP,
etc)
Scalability and Resilience:
a) Spanning Tree
----------------
Spanning Tree is currently the only cross-platform dynamic solution available
to operators of exchange points for dynamically managing multiple redundant
links in their architecture.
There are a number of problems with Spanning Tree:
As the routes collected at an Exchange Point can be routed all over the
world, any routing instability can act like dropping a pebble in a pond,
and will spread around the Internet.
It's desirable to maintain stable routing sessions across Exchange Point
LANs to minimise these routing flaps, because of the load they place on
routers, and the effects of route flap damping penalties.
We believe that being able to declare ports as "end-stations"
should avoid them being counted in the STP calculation, enable these ports
to start forwarding more rapidly, and speed overall STP convergence time.
Rapid Spanning Tree (IEEE 802.1w) should be implemented
(http://www.ieee802.org/1/pages/802.1w.html). Results from testing RSTP
on certain platforms show that for simple topologies with few redundant
links, sub-second failover and re-convergence is achievable with minimal
tuning or additional configuration.
b) Ring Restoration Protocols
-----------------------------
IEEE 802.17 - This is a standards-based version of the technology currently
used by Cisco called DPT (Dynamic Packet Transport). This consists of
a counter-rotating ring system, with spatial reuse and "ring wrapping"
circuit protection.
The Cisco version is currently implemented over SONET/SDH media, however,
the standardised version is being designed to be more media agnostic,
and the IEEE working group has already elected to provide support for
Gigabit Ethernet and 10 Gigabit Ethernet.
Proprietary Protocols - There are a number of proprietary ring protocols,
such as Extreme's EAPS (published as informational RFC3619), or Foundry's
MRP.
They are relatively similar in operation, in that they make assumptions
about the number of redundant links in a topology (i.e. only one), have
a concept of master and transit nodes, use a "heartbeat" sent
out by the master, and topology change messages are passed between the
nodes to speed network re-convergence (by triggering FDB flushing, and
backup port unblocking on the master node).
These recovery protocols may become less important as RSTP becomes more
graceful, however, it may be possible to enhance these protocols:
"Non-revertive" behaviour - the ring will only fail back to
the "worker" state from the protect state if it is manually
triggered by the operator, or by a failure elsewhere on the ring.
"Interlocked" behaviour with the link state or link state protection
mechanisms such as UDLD or LFN - there is a positive check that a link
is up and stable before triggering a topology re-convergence.
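The two enhancements above can be pictured as a small state machine on the master node. This is a hypothetical model only: the class, method names and the `link_check_ok` flag (standing in for a UDLD-style check) are assumptions, not part of EAPS, MRP or any other protocol.

```python
class NonRevertiveRing:
    """Master-node view of a protected ring (hypothetical model)."""

    def __init__(self) -> None:
        self.state = "worker"  # "worker" or "protect"
        self.primary_usable = True

    def primary_failed(self) -> None:
        self.primary_usable = False
        self.state = "protect"

    def primary_restored(self, link_check_ok: bool) -> None:
        # Interlock: only trust the link once a positive check has passed.
        # Non-revertive: the state is deliberately left unchanged here.
        self.primary_usable = link_check_ok

    def _fail_back(self) -> bool:
        if self.state == "protect" and self.primary_usable:
            self.state = "worker"
            return True
        return False

    def operator_revert(self) -> bool:
        """Manual fail-back; refused while the primary path is unusable."""
        return self._fail_back()

    def failure_elsewhere(self) -> bool:
        """A fault elsewhere on the ring also triggers fail-back, if possible."""
        return self._fail_back()
```

The point of the model is that `primary_restored` never changes the forwarding state on its own, so a flapping link cannot cause repeated re-convergence.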
c) Trunking and Link-Aggregation
--------------------------------
It has become increasingly common for exchange points to span multiple
switches and multiple sites, and many need to deploy link aggregation
to handle the volume of interswitch traffic, where it exceeds the maximum
speed of a single link.
Most equipment implements load-sharing using either round-robin or address-based
algorithms.
In exchange points, many pieces of equipment will have similar MAC addresses,
especially the first and last bytes (corresponding to vendor and slot
position on router).
This causes significant problems if the load-sharing hash does not use
enough significant bits in the frame. If the hash is only based on part
of the address, this can result in poor efficiency of load-sharing, and
"clutching" of traffic on a single link inside a group.
It's preferable that load-sharing algorithms should consider the whole
L2 address, and where possible the L3 header information, when calculating
the hash used.
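The effect is easy to demonstrate. The sketch below compares a hash that looks only at the last byte of each address with one that folds in every byte; both functions and the sample addresses are hypothetical, and the XOR fold is chosen for simplicity rather than as a recommended algorithm.

```python
from functools import reduce
from operator import xor


def link_by_last_byte(src: bytes, dst: bytes, links: int) -> int:
    """Weak hash: uses only the final byte of each address."""
    return (src[-1] ^ dst[-1]) % links


def link_by_whole_address(src: bytes, dst: bytes, links: int) -> int:
    """XOR-fold of every byte of both addresses."""
    return reduce(xor, src + dst) % links


# Hypothetical routers from one vendor, all in the same slot position:
# identical first and last bytes, differing only in the fifth byte.
routers = [bytes.fromhex(f"02000000{n:02x}01") for n in range(4)]
pairs = [(s, d) for s in routers for d in routers if s != d]

weak_links = {link_by_last_byte(s, d, 4) for s, d in pairs}
full_links = {link_by_whole_address(s, d, 4) for s, d in pairs}
```

For these addresses the weak hash places every flow on a single member link, while the whole-address hash spreads them over several. A production hash would also mix in L3 header fields where available.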
Load-sharing of broadcasts and multicast traffic should be implemented.
This is because behaviour such as forwarding all broadcast/multicast traffic
out of the "primary" port in a trunk has been observed when
load-sharing using destination MAC addresses has been implemented.
IEEE 802.3ad link-aggregation "LACP" should be implemented.
Port security features should also be applicable to trunks/link-aggregated
groups, and work across that group as though the group was a single port.
d) Multicast Control and Containment
------------------------------------
Most switches are configured with IGMP snooping for multicast control.
However, in an exchange point, with only routers attached, there is no
IGMP present, only PIM and MSDP, and all multicast packets are flooded
out of all ports.
An exchange point, however, is an ideal place for multicast peering to
happen: inject the traffic once, and it comes out several times (as much
as is needed, or in the current situation, as much as isn't needed!).
Cisco developed RGMP (Router Group Management Protocol). This is a proprietary
technology whereby the router can communicate to the switch which multicast
groups it wishes to see.
This remains, despite being released as an informational RFC (RFC3488),
a vendor specific feature, and a wide range of routing and switching
platforms are present at many exchange points - both in equipment used by
the operator, and the participants. These are true multi-vendor environments.
Therefore, this is not a workable solution for most exchange points,
whose principles often include "equal treatment" of participants.
While it may not solve all potential issues with multicast peering, implementing
PIM-SM snooping and pruning within the switches will achieve the traffic
containment requirements.
Where PIM snooping is available, this should not have a negative effect
on the overall forwarding performance of the system - e.g. PIM snooping
should be able to operate in concert with hardware flooding of the unicast
frames. Where there is a performance impact, this and its surrounding
caveats shall be clearly documented.
e) VLAN tagspace issues/overlapping
-----------------------------------
A serious emerging issue is VLAN tag space overlapping/clashing.
Most metro transport networks can solve this by using q-in-q (tag stacking),
however, this doesn't apply to shared networks like Internet Exchanges.
Current switches use a 1:1 mapping of 802.1q vlans to bridge groups,
which is the way 802.1q was probably intended. This mapping should be
loosened if not abandoned - nowadays there are so many ways to egress
an ethernet frame from a switch that more and more often we have to resort
to 'tricks' to put the right label on the right ethernet packet going
out the right interface.
This problem is being exacerbated by a number of issues:
- Increased use of switch router products (e.g. Cisco 7600)
- Use of switches as "channel-banks" - breaking out higher
speed router interfaces
- Use of metro-ethernet, lan extension or Ethernet over MPLS ("Martini")
circuits to connect to the IXP
We think there are two (fairly similar) approaches to solving this:
- Basic VLAN tag rewrite
- Separate the tag from the virtual bridge instance
VLAN tag rewrite is, as its name suggests, being able to rewrite a dot1q
tag on a specific interface to a VLAN ID on the switch. This would need
to be implemented on both ingress and egress.
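A basic tag rewrite can be thought of as a pair of per-port lookup tables, one consulted on ingress and one on egress. The sketch below is illustrative only; the class, method and port names are hypothetical.

```python
class VlanRewriteTable:
    """Per-port mapping between the VLAN ID on the wire and the switch's own."""

    def __init__(self) -> None:
        self._ingress = {}  # (port, wire_vid) -> internal_vid
        self._egress = {}   # (port, internal_vid) -> wire_vid

    def add(self, port: str, wire_vid: int, internal_vid: int) -> None:
        for vid in (wire_vid, internal_vid):
            if not 1 <= vid <= 4094:
                raise ValueError(f"VLAN ID out of range: {vid}")
        self._ingress[(port, wire_vid)] = internal_vid
        self._egress[(port, internal_vid)] = wire_vid

    def on_ingress(self, port: str, wire_vid: int) -> int:
        """Translate the tag received on a port; unmapped tags pass unchanged."""
        return self._ingress.get((port, wire_vid), wire_vid)

    def on_egress(self, port: str, internal_vid: int) -> int:
        """Translate the internal VLAN back to the tag this port expects."""
        return self._egress.get((port, internal_vid), internal_vid)
```

Two participants can then reach the same peering VLAN while each uses whatever tag their own transport network dictates.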
The other option is complete separation of VLAN ID from the virtual bridges
inside the switch. You assemble a framework where you can place untagged
ports, tagged ports, q-in-q tagged ports, MPLS endpoints and ATM VCs all
together in the same virtual bridge. Effectively a bridge group which
can contain any number of these sorts of entities.
f) Link failure detection
-------------------------
Link failure detection should be implemented, and should look like:
This avoids the risk of an ethernet link going "one-way" and
fooling the restoration protocols that the link is working, when really
it isn't.
The switch should also provide user configurable options for link aggregated
(trunked) ports - the option may be to shut down the entire link-agg group,
or keep operating on the remaining ports in the group.
Environmental Monitoring
There should be reasonable environmental monitoring provided:
- Temperature sensors
- Fan health sensors
- Power supply health sensors
There should be exception logging via SNMP trap and syslog (as specified
above) of any incidents.
It should also be possible to shut down a malfunctioning element in the
system (automatic, user configurable, or manual), in order to preserve
system health.
For example, a power supply failing in a system could cause an instability
in the device. If the system could make a decision to shut that power
supply down, and assuming a redundant configuration, the switch would
then operate in a stable condition until such time that the power supply
could be exchanged.
Physical Wishes
IXPs are high-uptime environments. The equipment used in an IXP needs
to be able to satisfy this requirement, in terms of redundancy, and hot-swappable
components.
- Hot swap of management/switch fabric cards with instantaneous failover
to any installed redundancy (not rebooting onto the "backup").
- Hitless upgrade of software/software components without forwarding
impact on the system.
- Full-redundancy of PSUs, and hot-swap (i.e. box should run on 50%
of PSUs).
- Rapid booting and card startup (after all, much functionality is
implemented in the ASIC hardware).
- GBIC/SFP-optics for flexibility, easy replacement, and maximised
port utilisation (freedom to choose SX/LX, etc).
- "Coloured" (DWDM/CWDM) GBIC/SFP/Xenpak/XFP (etc, etc)
support.
- Vendor "lockdown" of pluggable interfaces should either
not be implemented, or be able to be switched off in configuration.
- 220-240V AC power options. Unlike most telco-managed facilities,
the carrier-neutral facilities common in Europe do not provide indigenous
48V DC power. Power distribution is done using the regular utility supply
voltage in that country - usually ~230V AC in EU countries.
- Cable testing functionality in copper ports, and optical power metering
in optical ports.
In chassis based systems, the following are major considerations:
- Front to back cooling is preferred.
- Vertical orientation of slots makes for easier cable management.
- Cable management must be taken into account when designing the system
- it should be possible, if cabled correctly, to remove one module for
maintenance, without affecting the cables plugged into the adjacent
card. For this, cable management brackets are preferred, and it is significantly
easier if the cables are routable from the top of the chassis.
Acknowledgements and Thanks
Thanks are due to the "usual suspects" in the RIPE EIX Working
Group, but specifically Christian Panigl, Kurtis Lindqvist, Keith Mitchell,
Daniele Arena, Remco Van Mook, and Rob Lister for their contributions
to this document.