There are (at least) two ways to monitor the characteristics of the Internet:
Passive: by monitoring the amount of traffic that passes a certain point
and,
Active: by generating test traffic and measuring how much time it
takes to ship the test traffic over the network.
The advantage of an active measurement over a passive measurement is
that, for the active method, the test traffic can be generated under
well-defined and controlled conditions, whereas the passive measurement
depends on the traffic that happens to pass a certain point.
Another important advantage of active measurements is that they avoid
the privacy concerns associated with passive measurements. As we
want to deliver an objective measurement of the performance of the
Internet, we have decided to focus on active measurements only.
Figure 2: Experimental Setup
The principle of our measurements is shown in figure 2.
A test-box is connected 0 hops away from (or, if that is not feasible, as close
as possible to) the border router of each participating ISP. By connecting
the test-box 0 hops away from the border router, we exclude effects of the
internal network from our measurements.
If the ISP has more than 1 border router, we will either connect the
test-box in such a way that the delays to each border router are similar
or we will install multiple test-boxes. This
way, we avoid a bias for traffic in certain directions in our measurements.
The box at ISP-A generates a pre-defined pattern of test messages. The test
messages are sent through the border router and the network to a similar
box at ISP-B. Both test-boxes are connected to a clock. By reading
these clocks when a test message is sent and when it arrives,
one can determine the delay between these two providers. Other measurements
are possible with this setup; these will be discussed in section 2.1.
The test-boxes at both ISP's are identical, thus the process can be
reversed to measure the delay between ISP-B and ISP-A. Also, the boxes
at both ISP's can be used to send test traffic to similar boxes at
other ISP's.
It should be obvious from figure 2 that the clocks
at the two ISP's should be synchronized (in other words, the
time offset between the two clocks should be known and remain
constant) and provide the time with a high accuracy compared to the time
difference that we want to measure. This will be discussed
in more detail below.
In the extreme case where all O(1000) ISP's
in the RIPE-NCC (geographical) area participate in the project, the
test-boxes can be receiving test traffic from and sending it to 1000
other boxes, thus testing all $1000^2$ possible connections.
However, the topology of the Internet is such that it is likely that
one can still do reliable measurements of the performance of the
Internet without having to send test traffic for all 1000000
possible connections. This will be studied.
Although this method can be used to measure the performance of the
internal network of each provider, we want to limit our work to the
external networks only. For this reason, the test-boxes should be located
as close as possible to the border routers of the ISP's, in order to
eliminate any effects caused by the internal networks.
There are several possibilities for the test traffic that we can
generate:
One-way: A packet is sent from ISP-A to ISP-B, as described
in the previous section.
Two-way: A packet is sent from ISP-A to ISP-B and then
returned to the sender. This principle is used in ``ping'' and
other tools that determine the round trip time.
However, assuming that the paths are known, the result of
one two-way measurement will simply be the sum of
the two one-way measurements. For this reason, we plan to
use two-way traffic only for independent checks of the one-way
results.
Real life applications: A measurement that a user sees
in a real application, like fetching a WWW-page or downloading
a file by FTP, using a well defined TCP benchmark connection.
In the first instance (phase 1, section 2.1.1), we plan to do one-way
and two-way measurements.
At a later stage (phase 2, section 2.1.2), this
will be expanded to performance
measurements for real life applications. Already in phase 1, our measurements
will produce several observables; these will be discussed in section 2.1.3.
The one-way test traffic will consist of UDP data packets of 3 different
sizes: small (56 bytes, as used by ping and similar tools), medium
(576 bytes, the minimum packet
size that any router must handle as a single packet) and large (say,
2048 bytes, to see the effect of possibly splitting
data packets into smaller units, and related effects).
The packets will contain:
The address of the sender.
A time-stamp that shows when the packet was sent, together with the dispersion
of the clock in the sending machine. The dispersion is needed to determine the
overall error in the measurement.
A reference number, in order to keep track of the number of
packets sent and lost.
A hop-count (number of routers that the packet passed between sender
and receiver).
Administrative information, such as the version number of the software and a
checksum.
Padding to give the packet the desired size.
The receiving process will remove the padding and then
add the following information to the packet:
The address of the receiver.
A time-stamp that shows when the packet was received, together with the dispersion
of the clock in the receiving machine.
The result is raw-data that contains all information about this
particular delay measurement.
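The packet contents listed above could be laid out as in the following minimal sketch. The field order, the field widths, the toy checksum and the names `HEADER`, `build_packet` and `receive_packet` are assumptions made for illustration only; the design does not fix a wire format. The sketch also assumes that the quoted packet sizes refer to the UDP payload.

```python
import socket
import struct

# Hypothetical header layout (network byte order); the design note does not fix one:
#   4s sender IPv4 address, d send time-stamp (s), d sender clock dispersion (s),
#   I reference number, B hop-count, B software version, H checksum
HEADER = struct.Struct("!4sddIBBH")


def simple_checksum(data: bytes) -> int:
    """Illustrative 16-bit byte sum; a real tool would use a stronger check."""
    return sum(data) & 0xFFFF


def build_packet(sender_ip, seq, size, send_time, dispersion, hops=0, version=1):
    """Pack the sender fields and pad the packet to `size` bytes."""
    if size < HEADER.size:
        raise ValueError("requested size is smaller than the header")
    addr = socket.inet_aton(sender_ip)
    # The checksum is computed over the header with the checksum field set to 0.
    zeroed = HEADER.pack(addr, send_time, dispersion, seq, hops, version, 0)
    header = HEADER.pack(addr, send_time, dispersion, seq, hops, version,
                         simple_checksum(zeroed))
    return header + b"\x00" * (size - HEADER.size)


def receive_packet(packet, receiver_ip, recv_time, recv_dispersion):
    """Drop the padding and add the receiver's address and time-stamp."""
    addr, send_time, send_disp, seq, hops, version, checksum = HEADER.unpack(
        packet[:HEADER.size])
    zeroed = HEADER.pack(addr, send_time, send_disp, seq, hops, version, 0)
    return {
        "sender": socket.inet_ntoa(addr),
        "receiver": receiver_ip,
        "seq": seq,
        "hops": hops,
        "version": version,
        "send_time": send_time,
        "send_dispersion": send_disp,
        "recv_time": recv_time,
        "recv_dispersion": recv_dispersion,
        "checksum_ok": checksum == simple_checksum(zeroed),
    }
```

The record returned by `receive_packet` plays the role of the raw data: the delay is the difference of the two time-stamps, and the two dispersions feed the error estimate.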
The path or routing-vector
between ISP-A and ISP-B is defined as the collection of machines
and network between the border routers of these two ISP's. This path
may vary as a function of time. A tool like ``traceroute'' will be used
to determine the path for the test traffic at any given time. The
path information will be made available.
There are several potential problems while doing these measurements:
Set-up effects: even for Internet traffic, the routers have to be
set up before a connection between two points is established.
If this is not taken into account, then the measured delay
will be larger than the delay that can be attributed to the
network. Two possible solutions to circumvent this problem are:
Precede any measurement by a ``traceroute''. This determines the
path and provides the routing information that we are interested in
anyway.
Run the test more than once and compare the results. If the first
result is significantly different from the other measurements,
discard it.
Set-up effects are interesting in themselves and
this data will be recorded and studied.
No connectivity.
The fact that there is, contrary to what one expects, no connectivity
between two points is interesting information in itself. However, our
software should be written such that it will survive this case without
operator interference.
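The second remedy for set-up effects (run the test more than once and discard a deviating first result) can be sketched as follows. The criterion used here, a first delay larger than `factor` times the median of the later ones, is an assumption; the text only says "significantly different". The discarded value is returned rather than thrown away, since set-up effects are to be recorded and studied.

```python
from statistics import median


def drop_setup_effect(delays_ms, factor=2.0):
    """Return (kept_delays, discarded_first_delay_or_None).

    The first delay is discarded when it exceeds `factor` times the
    median of the remaining delays (hypothetical threshold).
    """
    if len(delays_ms) < 3:
        # Too few repetitions to judge the first measurement.
        return list(delays_ms), None
    first, rest = delays_ms[0], delays_ms[1:]
    if first > factor * median(rest):
        return list(rest), first
    return list(delays_ms), None
```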
For the two-way measurements, we plan to use data-packets that are similar to
the one-way measurements. These results will be used for consistency checks
only.
In the second phase of the project, we plan to do measurements with:
TCP-streams.
Simulations of applications.
Packet Trains. A packet train is a number of test messages
that are sent with very short intervals. They provide a way to
study if the packets are delivered out of order and/or merged together
into larger packets along the way.
The implementation of this will be discussed in a future design note.
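As a rough illustration of what a packet train could reveal, the sketch below counts out-of-order arrivals and lists missing packets from the reference numbers seen at the receiver. It assumes that the packets of a train carry consecutive reference numbers starting at 0; the function names are hypothetical and the real implementation is left to the future design note.

```python
def count_reordered(arrival_seq):
    """Count packets that arrive after a packet with a higher reference number."""
    highest = None
    reordered = 0
    for seq in arrival_seq:
        if highest is not None and seq < highest:
            reordered += 1
        else:
            highest = seq
    return reordered


def missing_packets(sent_count, arrival_seq):
    """Reference numbers 0..sent_count-1 that never arrived."""
    return sorted(set(range(sent_count)) - set(arrival_seq))
```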
These measurements will provide several observables:
Delay Information: The results of the delay
measurements can be stored in an
$N \times N$ matrix $D(t,s)$,
where each element $D_{sd}(t,s)$
represents the time that a packet
needed to travel from a source s to a destination d.
If there is no connection between two points, then
$D_{sd}(t,s)$ will be $\infty$.
The elements of $D$
are a function of the time t, as the network
characteristics will change over time, and of the packet size s.
Each element $D_{sd}(t,s)$ has an error
$\delta D_{sd}(t,s)$,
which gives the total error in this delay measurement.
Routing Vector or Path Information: Each delay
measurement is accompanied by a determination of the path between
the two locations using a tool like ``traceroute''.
These results can be written as an $N \times N$
matrix $P(t)$ of vectors.
This matrix provides, assuming that our test-boxes are installed at
a significant fraction of the ISP's, an up-to-date map of the
Internet as well as the history of the network.
If there is no connectivity between two points i and j, then
the element $P_{ij}$ will be 0. The number of zeroes therefore
gives a measure of the total connectivity.
Also, the two elements $P_{ij}$ and $P_{ji}$ should either both be
zero (no connection between these two points) or both be non-zero. If
only one of them is zero, this indicates a network configuration error.
Finally, the elements $P_{ij}$ can be used to detect routing
loops.
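The three checks on the path matrix described above (total connectivity, one-sided connectivity and routing loops) could look like the sketch below. The representation is an assumption: the matrix is a list of lists, each element is a list of router names as reported by a traceroute-like tool, and `None` stands for the zero element (no connectivity). The diagonal is ignored.

```python
def unreachable_count(P):
    """Number of off-diagonal elements without a path."""
    n = len(P)
    return sum(1 for i in range(n) for j in range(n)
               if i != j and P[i][j] is None)


def one_sided_pairs(P):
    """Pairs (i, j) where exactly one direction has a path."""
    n = len(P)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if (P[i][j] is None) != (P[j][i] is None)]


def has_routing_loop(path):
    """A router that appears twice in one path indicates a loop."""
    return path is not None and len(set(path)) != len(path)
```

A small usage example with three hypothetical test-boxes:

```python
P = [
    [None, ["r1", "r2"], None],
    [["r2", "r1"], None, ["r3", "r4", "r3"]],
    [["r4"], ["r3"], None],
]
print(unreachable_count(P), one_sided_pairs(P), has_routing_loop(P[1][2]))
```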
It should be noted that on any particular test-box, only the results of
delay measurements to that box and the path information from
that box are available (in other words:
the columns of the matrix $D$ and the
rows of the matrix $P$): the results of a delay
measurement between ISP-A and ISP-B will be collected at ISP-B, whereas
the traceroute information is only available at ISP-A.
It is undesirable
to do the traceroute at ISP-B, as the path from A to B
is not guaranteed to be the same as the path from B to A.
In order to get the full matrices, the results have to be
transferred to a central point. This is one of the reasons for
collecting all test results on a single machine.
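A central collector could assemble the delay matrix from the per-box results roughly as sketched below, for one time and one packet size. The input format (`columns[d][s]` is the delay from source s as measured at destination d) and the function name are assumptions. Missing measurements are stored as infinity, and the diagonal is set to 0 purely as a convention of this sketch.

```python
import math


def assemble_delay_matrix(boxes, columns):
    """Build D with D[s][d] = delay from box s to box d.

    boxes:   ordered list of test-box names
    columns: columns[d][s] = delay measured at d for traffic sent by s
    """
    index = {name: k for k, name in enumerate(boxes)}
    n = len(boxes)
    D = [[math.inf] * n for _ in range(n)]
    for k in range(n):
        D[k][k] = 0.0  # convention of this sketch; a box does not test itself
    for dest, measured in columns.items():
        for src, delay in measured.items():
            D[index[src]][index[dest]] = delay
    return D
```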
From these observables we can determine if there are trends in the
transfer times as a function of time and isolate special or suspicious
events. During the initial phase of the project we will concentrate on
developing (statistical) tools to analyze the matrices. In a
later phase we, or other researchers, may develop tools to correlate the
data in the matrices, analyze the
effect of the size of the packets or perform analysis of the
routing vector matrix itself.
Once enough data has been collected, we should be able to define
meaningful metrics, and we might tune the measurements to sample
these metrics optimally.
The test traffic should be small compared to the load on
the connection under consideration. If not, then the test traffic will
affect the performance and the measurement becomes useless.
The intervals should be small enough to study (interesting)
fluctuations in the performance of the network.
As the network changes over time, the amount and type of test traffic
should be easily configurable.
The interval between two measurements should be small enough to
detect and eliminate set-up effects.
As suggested by the IPPM [2], the measurements should be
randomly distributed to prevent synchronization of events due to weak
coupling, as demonstrated in [3]. The IPPM suggests
using so-called Poisson sampling. We will investigate the IPPM
suggestion and other possibilities to prevent weak coupling.
The first two requirements are contradictory: smaller time intervals mean
more test traffic, but more test traffic means a higher load on the network.
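Poisson sampling, as suggested by the IPPM, amounts to drawing the gaps between successive test packets from an exponential distribution, so that the send times are not periodic. A minimal sketch, assuming a mean interval of 60 s and using a hypothetical function name:

```python
import random


def poisson_send_times(mean_interval_s, duration_s, seed=None):
    """Send times (in seconds) within [0, duration_s) with exponential gaps."""
    rng = random.Random(seed)
    rate = 1.0 / mean_interval_s
    times = []
    t = rng.expovariate(rate)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# One hour of send times with, on average, one packet per minute.
schedule = poisson_send_times(60.0, 3600.0, seed=1)
```

The average traffic volume is the same as for a fixed one-minute interval; only the spacing changes.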
For the initial settings, suppose that we generate
the following amount of test traffic:
1 small packet per minute.
1 medium sized packet per minute.
1 large packet every 10 minutes.
These packets are used to see the effects
of splitting large packets into smaller ones; presumably this effect
is constant over time.
This then generates a data volume of approximately 14 bytes/s
for each connection that is measured. This number does not include overheads and the
data for the ``traceroute'' program.
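The figure of approximately 14 bytes/s follows directly from the packet sizes and intervals listed above, as this short calculation shows. It counts payload bytes only, in line with the statement that overheads are not included; the names are illustrative.

```python
# (name, size in bytes, interval in seconds) for the initial settings
INITIAL_SETTINGS = [
    ("small", 56, 60),
    ("medium", 576, 60),
    ("large", 2048, 600),
]


def payload_rate(settings=INITIAL_SETTINGS):
    """Average test traffic in bytes/s for one connection."""
    return sum(size / interval for _, size, interval in settings)


rate = payload_rate()  # roughly 13.9 bytes/s
```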
If this amount of data is too high, then one might consider increasing
the interval between two packets and adding a burst mode (a short time
with a higher test traffic volume) for occasional measurements with
a smaller interval between two test packets.
This leads to a number of questions that have to be answered:
Can we generate this amount of data without the ISP's objecting?
They will probably not object for 1 connection, but what if we have
installed test-boxes at 10, 100 or even 1000
sites, with $10^2$, $100^2$ or $1000^2$ possible connections?
Are we sure that the test traffic does not
affect the performance of the network?
How do we check that a packet has left the test-box before
the next one is sent out? We do not want packets to sit in a buffer if
the local network is at its maximum load.
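To put the first question in numbers, the sketch below scales the quoted 14 bytes/s per connection to N test-boxes that all test each other. It assumes that every box sends to the N-1 other boxes, so the number of connections is N(N-1), slightly below the rounded $N^2$ used in the text; overheads and traceroute traffic are again left out.

```python
RATE_PER_CONNECTION = 14.0  # bytes/s, the approximate value quoted in the text


def per_box_rate(n_boxes, rate=RATE_PER_CONNECTION):
    """Outgoing test traffic of one box in bytes/s (incoming is the same)."""
    return rate * (n_boxes - 1)


def total_rate(n_boxes, rate=RATE_PER_CONNECTION):
    """Test traffic summed over all N(N-1) connections in bytes/s."""
    return rate * n_boxes * (n_boxes - 1)


for n in (10, 100, 1000):
    print(n, per_box_rate(n), total_rate(n))
```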
In this section, we want to estimate the errors on the final results.
As we mentioned before, our plan is to do both one-way and two-way measurements.
One-way measurements involve sending a time-stamped message from one
machine to another. The second machine compares the time-stamp in the message
with its local clock on arrival and determines the transfer delay from that. This
requires that the local clocks are synchronized and provide the time
with a high enough accuracy. The synchronization is the difference
between the clock on this machine and an absolute time standard. The
accuracy is the error in the time measurement. The errors in
synchronization and accuracy determine the overall error in the
delay measurement.
Two-way measurements are a bit simpler in this respect: a message is sent
to another machine and echoed there. All time measurements can be
done on the same machine, so the clock has to be accurate and the
resolution of the clock has to be
high enough. The resolution is the smallest time interval that can
be measured by this clock.
In both cases, we will be subtracting two times, $t_1$ and $t_2$.
If we call the total error on these times $\delta t$, then the error in the
final result ($\Delta t = t_2 - t_1$) is equal to:
$$\delta(\Delta t) = \sqrt{(\delta t)^2 + (\delta t)^2} = \sqrt{2}\,\delta t$$
It should be noted that, even with a high resolution clock, $\delta(\Delta t)$ can
become relatively large for small time intervals. For example, if we measure a
50 ms time interval with a clock with a resolution of 10 ms,
then the error in the result will be 14 ms.
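The 14 ms in this example can be reproduced by adding the errors on the two time readings in quadrature, which assumes that the two errors are independent. The function name is illustrative; for a one-way measurement the two inputs would be the dispersions of the sending and the receiving clock.

```python
import math


def difference_error(err_t1, err_t2):
    """Error on t2 - t1 for independent errors on t1 and t2."""
    return math.hypot(err_t1, err_t2)


err_ms = difference_error(10.0, 10.0)  # about 14.1 ms
relative = err_ms / 50.0               # about 0.28 of a 50 ms interval
```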
Systematic effects of the equipment on the results have to
be studied and should be understood.