
Quality aspects of audio communication
Ian Marsh
A thesis submitted to KTH,
the Royal Institute of Technology,
in partial fulfilment of the requirements for
the Licentiate of Technology degree.
May 2003
Laboratory for Communication Networks
Department of Microelectronics and Information Technology
KTH, Royal Institute of Technology
Stockholm, Sweden
TRITA-IMIT-LCN AVH 03:01
ISSN 1651-4106
ISRN KTH/IMIT/LCN/AVH-03/01–SE
© Ian Marsh May 2003
Printed by Universitetsservice US-AB 2003
Abstract
The Internet is increasingly being used to carry real-time voice traffic.
Users of real-time voice services are sensitive to variable audio quality. The
quality of packet audio is largely determined by the mouth-to-ear delay and
the packet loss. The contribution of this thesis is to provide techniques to
improve the packet audio quality: dimensioning links specifically for packet
voice communication, modelling the packet audio arrival process at a re-
ceiver, measuring connectivity quality in wide area networks, and reducing
delays in end systems.
The first study investigates how to allocate capacity to voice traffic in a
purely packet switched network. We study an idealised case for VoIP ses-
sions, where the voice traffic is separated from the data traffic. A Markov
modulated Poisson process model simulates the superposition of VoIP flows
into a finite buffer. The model corresponds well with both packet level simu-
lations and laboratory experiments. A second study looks at the interaction
between voice and data traffic. We address the issue of how a constant
rate VoIP stream is affected when multiplexed together with data traffic in
router queues. We derive a Markov model which captures the effect of the
random delays experienced by packet audio data, plus the effect of silence
suppression at the sender and packet loss in the network.
Measurements made in 1999 and 2002 show that VoIP communication is
feasible between academic sites in Europe and the United States. However,
we show that network connectivity on a global scale still does not provide
sufficient quality for satisfactory real-time voice communication. The data
collated as part of this study is one of the largest publicly available reposi-
tories of VoIP data, containing over 18,000 sample sessions.
The end systems also contribute to the delay of interactive voice com-
munication. Absorption of the variable delay, or jitter, is necessary in a
packet switched network in order to replay voice samples smoothly with-
out glitches. We have shown that by moving the buffer used to absorb the
jitter into the operating system, significant time savings can be achieved.
We have implemented a VoIP tool, Sicsophone, which shows very low delay
characteristics.
Using the above techniques, we show that hundreds of milliseconds can
be saved in the delay budget of real-time voice communication, improving
the audio quality considerably. The traffic models and measurement data
presented in this thesis will also enable future research into quality aspects
of audio communication.
Keywords: Packetised voice, packetised audio, Voice over IP (VoIP),
Quality of Service (QoS), speech quality, network measurements
Preface
This thesis proposes methods to improve the perceived quality of real-time
voice communication over the Internet. The cost of providing and running
voice services using an IP infrastructure is considerably less than a tra-
ditional public exchange system. Therefore, new types of operators selling
voice services are emerging that use IP technology, allowing a broader choice
of operators for end users. Users, however, would like to receive voice quality
akin to that provided by the traditional telephony network. The IP voice
services in operation today focus on lower cost and offer fewer guarantees on
service quality. There is a common misconception that VoIP necessarily offers
lower quality; this need not be the case. The services offered today simply
use the Internet as a bearer for long haul links, and live with the implications
under the proviso that it is cheaper than PBX-based telephony. The voice
quality can be variable when using the IP infrastructure, especially in peak
hours. It is possible to improve the quality, but the improvements are often
not implemented. We¹ are proposing ideas that will improve the quality of
voice communication, at least initially, to that provided by the traditional
telephony network.
To achieve our goal of good quality audio communication, some changes
might be needed to the ubiquitous Internet. Providing strict quality guaran-
tees has plagued researchers for many years and now industry is facing the
same challenges. The degradation of voice quality which can occur when
using a multi-user packet switched network is the fundamental problem.
Unpredictable short term loads, lack of guarantees on network performance,
lack of control over the end systems and stringent requirements on the voice
quality make VoIP a challenging application to realise successfully on the
Internet.
Our Quality of Service (QoS) research is orthogonal to the investigations
being carried out by the network community. These investigations focus on
changing the packet switching techniques to be more reliable, more timely
and more fair. This is especially the case for time sensitive traffic such as
voice. Protocols have been developed to signal routers and end systems
that certain data types need to be treated differently, again in the case of
voice traffic often at higher priority. The techniques presented in this thesis
do not rely on any ongoing research within the network community. We
look at allocating resources given the current conditions of the network, or
at adapting to them. We also measure the current state of the network, so
that decisions can be based on these measurements rather than on the
assumption that certain functionality will be available.
¹ The authors of the publications and I.
Content notes
The thesis contains five papers that address four distinct areas within quality
aspects of packet voice communication: dimensioning links for VoIP traffic,
delay reduction at the end systems, the disruption of real-time voice streams
by traditional data traffic, and wide area measurements of VoIP quality.
These four areas therefore include investigations at the network layer, in the
operating system and at the application.
We start with a short introduction and then provide some background
on the subject of this thesis. A problem definition is given, explaining why
and how we tackled each problem. The contribution of this thesis is given
next, showing exactly what has been achieved during the course of this
investigation. There is also a précis of the five included articles. Since all
of the works are co-authored, my contribution to each of the publications is
stated. Finally we round off with some conclusions. The papers appear as
they were published.
Paper A Bengt Ahlgren, Anders Andersson, Olof Hagsand, and Ian Marsh.
Dimensioning Links for IP Telephony. In Proceedings of the 2nd IP-Telephony
Workshop, pages 14-24, New York, USA, April 2001.
Paper B Ingemar Kaj and Ian Marsh. Modelling the Arrival Process for
Packet Audio. In Quality of Service in Multiservice IP Networks, pages
35-49, Milan, Italy, February 2003.
Paper C Olof Hagsand, Ian Marsh and Kjell Hanson. Sicsophone: A Low-
delay Internet Telephony Tool. To appear at the 29th Euromicro Conference,
Belek, Turkey, September 2003.
Paper D Olof Hagsand, Kjell Hanson and Ian Marsh. Measuring Internet
Telephony Quality: Where are we today? In Proceedings of IEEE Globecom:
Global Internet, pages 1838-1842, Rio De Janeiro, Brazil, December 1999.
Paper E Ian Marsh and Fengyi Li. Wide Area Measurements of VoIP
Quality. To appear at Quality of Future Internet Services 2003, October,
2003, Stockholm, Sweden.
Acknowledgements
Writing this part of the thesis is actually enjoyable. First of all I must thank
my advisor Professor Gunnar Karlsson at KTH, and my manager Dr. Bengt
Ahlgren at SICS. Without their co-operation and assistance this Licentiate
would not be completed. I am very thankful to Gunnar for his creative spirit
during this licentiate ’life’, in particular I would like to thank him for our
fruitful discussions. To my manager Bengt, I am also very grateful, firstly
for getting me started on this academic path, and secondly for his continual
encouragement during the degree, particularly in the last stages ”Hur går
det med licen?”² I was frequently asked. The people at SICS are a fantastic
well of information. This includes all the people with whom I have had coffee
room discussions, in particular the members of the CNA group. I would like
to extend my special gratitude to Laura ’the Estonian’ Feeney and Herr
Doctor Engineer Thiemo Voigt for their careful reading and comments on
the text I loosely referred to as English. This also extends to Dr. Adrian
Bullock who, at least, has the same notion of spelling. Dr. Olof Hagsand,
who I have tracked from SICS to Dynarc and now at KTH, has set a great
example of how to work effectively. He is one of the rare people who can get
things done both quickly and with high quality. Finally in the SICS gang,
I would like to thank Björn ’200,000 volts’ Grönvall who has helped almost
everyone at SICS, not least myself. This extends from FreeBSD installation
questions to getting ID cards (yes plural). Outside of Stockholm, I would like
to thank Professor Ingemar Kaj at Uppsala University; his excellent course,
book and support have been a source of inspiration during my research and
are reflected in this thesis.
The other ’half’ of my working life revolves around KTH University.
Most of the people there have become friends rather than working colleagues.
Two weddings and a Christmas holiday in their home countries only goes to
illustrate this point. Our ’United Nations’ style lunch time gatherings are
always something I look forward to. The innumerable humoristic moments
kept me sane during the past three years of pseudo-student life. Particularly
I would like to say ”Dudey Wudey” to Iyad ’Diad’ Al-Khatib, I will never
forget our numerous classic moments, unfortunately not many of them are
printable in a licentiate thesis. Being in an academic environment allows
one to advise; with two Chinese masters students finished and a couple
more on the horizon, I would like to say (to you) it is a pleasure to be
involved in your education. In particular the latest and greatest Fengyi
Li, whose effort is also evident in this thesis, you should be awarded ”first
price”! Not too far from China, newly married Evgueni ’Dude’ Ossipov from
Siberia, is always a welcoming sight including a firm handshake. His presence
automatically provokes a wry smile from me (not the English though :-). On
the subject of English, I want to extend my deepest gratitude to Professora
Nil ’Neely Wheely’ Tarim for her many last minute proof readings of my
drafts. Even after working all night, she still has the time and energy to
correct my carelessness; her accent might be American but her ’English’ is
just perfect :-)
² How are things going with the licentiate work?
There is more to life than work and study but not much more! Outside
KTH and SICS circles I would like to say ”orange juice please” to the English
gang in the Loft. I must extend my thanks to the staff and ’students’ at the
local gym, called World Class (the WC). Performing (almost) mindless exer-
cise works wonders in alleviating stress, creating ideas and preparing oneself
for the next days of rigorous research. Alphabetically Christer, Elisabeth,
Eric, Eshan, Maria, Mia, Petra, Sabine and Sarah(Z), thanks! Of course I
have to thank my beloved mother and Ray for their never ending support,
this time you have been spared the manuscripts!
Contents
Preface
Content Notes
Acknowledgements
1 Introduction
2 Background
3 Problem definition
4 Contribution of this thesis
5 Quality aspects of audio communication: A 30 year perspective
5.1 A decade of research: 1973 - 1983
5.2 Emergence of Internet applications: 1990 - 1995
5.3 Times of measurement: 1996 - present
6 Summary of the individual papers and their contributions
6.1 Paper A
6.2 Paper B
6.3 Paper C
6.4 Paper D
6.5 Paper E
7 Conclusions
References
Paper A: Dimensioning Links for IP Telephony
Paper B: Modelling the Arrival Process for Packet Audio
Paper C: Sicsophone: A Low-delay Internet Telephony Tool
Paper D: Measuring Internet Telephony Quality: Where are we today?
Paper E: Wide Area Measurements of Voice over IP Quality
1 Introduction
The success of the Internet has been phenomenal. Since the introduction of
the Web, the benefit of a world wide data network has been truly realised.
There are many reasons for the Web’s success: a well designed protocol
suite, large colour displays, the Mosaic browser, interesting sites (even then)
plus no viable alternative. It is argued that the telephone companies also
designed and deployed a global network, but were very conservative by only
allowing voice to be carried. The telephone network could have been the
start of the Internet. Now the telephone network can be considered part of
the Internet as it partly carries data traffic via modems. The question posed
in this thesis is: Can the opposite be applied? Can the Internet be used
to carry the world’s voice traffic? The savings would be enormous, if one
network could carry all the voice and data of the world’s users. People would
retain their computers and phones in their homes and offices, but outside these
areas the data and voice would be carried along the same communication
lines, using the IP protocol.
2 Background
In a packet switched network, the voice is sampled, packetised, transmit-
ted, received, de-packetised and replayed. This sequence is not problematic
when using a packet switched network for voice per se. A functioning, well
provisioned packet switched network delivers voice data reliably and with
low jitter. We have extensive results that show this to be the case (Paper
E). The major problem faced by real-time traffic on the Internet is the
unpredictable competing load. If a network is neither totally reliable nor very
predictable, as is the case for an IP network, delay, jitter and packet loss
will affect the audio quality.
Quoting the definition of the IP protocol (RFC 791): ”The internet protocol
does not provide a reliable communication facility.” During high load
situations data may be discarded to keep the network operational, and the
IP specification is not violated by doing so. In the Internet, reliability is
normally addressed at the transport layer using TCP or an application specific
solution. Neither of these is perfectly suitable for real-time communication,
mainly due to the delay requirements of interactive voice. This clearly
has implications for the world’s telephony traffic if IP is to be its bearer.
Within the IP protocol, there is functionality defined to allow time-sensitive
packets to be transmitted with higher precedence in the case of high load.
However this has, until now, not been widely deployed and is not expected
to be for some time to come.
3 Problem definition
We are now ready to define the problem: How to achieve acceptable quality
voice communication on an unreliable, unpredictable packet switched net-
work? The requirements of the voice communication can also be defined.
The network and application should not delay the voice samples by more than a
well known number of milliseconds. Ideally, the network should also deliver a
sufficient number of packets for the voice stream to be reconstructed
with acceptable quality. Realising these two requirements effectively defines
the problem we address in this thesis.
4 Contribution of this thesis
Taken as a whole, the contribution of this thesis can be summarised as
techniques to improve the quality of real-time voice communication. Moreover,
should all the ideas presented in this thesis be implemented, they would
result in improved perceived quality. We have identified and addressed issues
within this research topic that have a direct bearing on the improvement
of real-time voice services on the Internet. We propose solutions to
real-world problems concerning packet audio communication. The thesis
looks at quality aspects of audio communication at different layers (in the
ISO sense) and from a theoretical and practical perspective. Since the out-
comes of real-time voice research are implementable, we also consider the
practical results of this work a valuable contribution.
5 Quality aspects of audio communication: A 30
year perspective
The problem addressed in this thesis is by no means new. Researchers have
investigated real-time voice communication on packet switched networks for
over 30 years now. This section covers chronologically the most significant
findings related to our work. It is important to emphasise this point, as there
are probably hundreds (but not many thousands) of articles which have some
relevance to real-time voice on packet switched networks. Therefore we in-
clude only the ones which are relevant and widely cited. We also include
survey articles and theses for the interested reader; they usually give
comprehensive lists of relevant publications. Finally, we should state that
there are a number of textbooks about Voice over IP and Internet Telephony;
however, none of them covers the quality aspects in sufficient detail.
5.1 A decade of research: 1973 - 1983
Advances in low data rate coders [1] and the deployment of a distributed
(viable) packet switched network led to early findings on real-time voice
being published. In 1974 William Naylor published ’A status report on the
real-time speech transmission work’ [2] and James Forgie published ’Speech
communications in packet-switched networks’ in the Journal of the Acoustical
Society of America in 1976 [3]. In 1977, Naylor published his PhD thesis
titled ’Stream traffic communication in packet switched networks’ [4].
Another researcher who published his early experiences with voice on
packet switched networks was Danny Cohen. In 1977, he suggested that
the packetisation algorithm and data rate should be varied according to
the network load [5]. Interestingly, this adaptive approach of reacting to
the network load has been popular in recent years. Cohen also states that
the time spent at the receiver (called ‘waiting period’ in his paper) should
be a function of the network performance. Tuning the size of the playout
buffer in voice systems to the network load has occupied researchers for
many years. Indeed, Cohen states that the parameters in a real-time voice
communication system are heavily dependent on the network performance
and a systematic method of predicting it must be developed.
In 1979 John Gruber looked at the issue of ’Variable delays in a shared
network environment handling voice traffic’ [6]. His vision was a packet and
circuit switched hybrid network called ’Transparent message switching’ for
handling both voice and data traffic. The ideas were novel (and preliminary):
The basic entities processed are messages rather than calls. The messages
do belong to an established call, however they may be completed or blocked
at the network periphery. Voice messages are given priority where delays
are being exceeded; where loss is being experienced, however, voice packets
may be discarded initially, since some loss in voice communication
is tolerable. Gruber suggests that channel contention can be resolved by
buffering messages at the edges only. Once sufficient capacity is obtained,
the network behaves as a circuit, switching the messages. Messages are
switched on the fly, thus eliminating the need for the whole message to
arrive. The rest of the paper explores the benefits of using this technique
for voice traffic. The paper includes 76 references, encompassing nearly all
of the early work on real-time voice on packet switched networks. The idea
of having a ’circuit switched’ core for the Internet has gained popularity
recently with schemes such as Multi Protocol Label Switching (MPLS).
In 1980 Giulio Barberis and Daniele Pazzaglia published their seminal
’Analysis and Optimal Design of a Packet-Voice Receiver’ [7]. They con-
clude that in order to obtain the optimal voice reconstruction, an accurate
estimation of the delay suffered in the communication network is required.
Their conclusions echo those by Cohen, however they suggest meeting this
requirement by using a synchronisation algorithm that reduces the gap
between the transmitter and receiver clocks to zero.
In December 1983 Warren Montgomery published ’Techniques for Packet
Voice Synchronization’ in an IEEE JSAC special edition on Packet Switched
Voice and Data Communication [8]. He considers the local and wide area
network situations separately. Round trip estimates are sufficient for the
local area case, while more sophisticated methods are needed for the wide
area case. He suggests that the addition of timing information and incor-
porating extra delay at the receiver should be sufficient to yield satisfactory
voice quality for the wide area case. This is the approach taken by most
modern real-time packet voice applications. It is effective, simple and cheap
to implement.
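The approach described above (timing information from the sender plus a
fixed extra delay at the receiver) can be sketched as follows. This is only
an illustration: the packet fields, the 60 ms playout delay and the assumption
that sender and receiver clocks are synchronised are ours, and are not taken
from [8].

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int
    send_time_ms: float     # timestamp written by the sender
    arrival_time_ms: float  # receiver clock (assumed synchronised with the sender)


def schedule_playout(packets, playout_delay_ms):
    """Schedule each packet at send time + fixed delay; report late packets."""
    played, late = [], []
    for p in packets:
        playout_time = p.send_time_ms + playout_delay_ms
        if p.arrival_time_ms <= playout_time:
            played.append((p.seq, playout_time))
        else:
            # The packet missed its playout point and counts as lost.
            late.append(p.seq)
    return played, late


# Hypothetical trace: 20 ms packet spacing, one packet delayed by 75 ms.
packets = [
    Packet(0, 0.0, 40.0),
    Packet(1, 20.0, 95.0),
    Packet(2, 40.0, 85.0),
]
played, late = schedule_playout(packets, playout_delay_ms=60.0)
```

A larger playout delay would rescue packet 1 at the cost of a longer
mouth-to-ear delay, which is the trade-off that the later work on adaptive
playout buffers addresses.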
The period from 1984 to 1990 was relatively inactive as far as real-time
voice over packet switched networks was concerned. Two notable excep-
tions include Prabandham Gopal’s ’Analysis of Playout Strategies for Voice
Transmission Using Packet Switching Techniques’ [9] and Mehmet Ali’s ’Re-
assembly Buffer Requirements in a Packet Voice Network’ [10].
5.2 Emergence of Internet applications: 1990 - 1995
In the early nineties, Domenico Ferrari’s group at UC Berkeley produced a
number of significant publications about the effect of jitter and delay on real-
time communication applications as part of the TENET suite [11][12]. Their
work proposed a distributed mechanism for controlling the delay jitter in a
packet-switching network. They argued that if the advantages are sufficient
to justify the higher costs of the distributed jitter control mechanism, then
implementing it is worthwhile. Although no such scheme was deployed, their
work is still referenced.
Research on IP multicast was actively being carried out in the early
nineties. Multicast was to be the vehicle on which multimedia sessions were
to be transmitted over the Internet. Indeed many people listened to the
early Mbone transmissions [13]. An array of real-time applications were pro-
duced, notably VIC, VAT and wb (whiteboard) from the Network Research
Group at LBL [14]. Other tools surfaced such as Nevot [15], Freephone [16]
and RAT [17]. These works led to a standardised synchronisation protocol,
RTP, the Real-time Transport Protocol, for use with real-time media flows. The
authors of the standard were those of the above mentioned applications. One
of them was Van Jacobson, who gave a Sigcomm tutorial in London in 1994
entitled ’Multimedia conferencing on the Internet’ [18]. In this presentation
he suggested using a simple synchronisation protocol to restore the original
timing information at the receiver and a small adaptable buffer to absorb
delay variations. This influential presentation moulded the approach taken
by researchers in real-time voice for many years.
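One common way to realise a small adaptable buffer of this kind is to track
the network delay and its variation with exponentially weighted averages, and
to set the playout delay to the delay estimate plus a multiple of the
variation. The sketch below is our illustration of that idea and is not taken
from the presentation; the smoothing factor alpha and the safety factor of 4
are assumed values.

```python
def adaptive_playout_delays(network_delays_ms, alpha=0.998, safety=4.0):
    """Return one playout delay estimate per observed network delay.

    d_hat tracks the mean delay and v_hat its variation; both are
    exponentially weighted averages with smoothing factor alpha (assumed).
    """
    d_hat = network_delays_ms[0]
    v_hat = 0.0
    estimates = []
    for n in network_delays_ms:
        d_hat = alpha * d_hat + (1.0 - alpha) * n
        v_hat = alpha * v_hat + (1.0 - alpha) * abs(d_hat - n)
        estimates.append(d_hat + safety * v_hat)
    return estimates


# With alpha = 0.5 the estimator reacts quickly, which makes the effect visible.
estimates = adaptive_playout_delays([50.0, 70.0], alpha=0.5)
```

In practice the playout delay is usually only changed between talkspurts, so
that the adjustment is not audible within a talkspurt.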
Henning Schulzrinne’s 1993 PhD thesis ’Reducing and characterizing
packet loss for high-speed computer networks with real-time services’ looked
at congestion control, scheduling, and loss correlation of real-time traffic [19].
Schulzrinne highlighted the practical importance of scheduling packet audio
in the context of the DARTnet project.
During 1993-1996 Jean Bolot produced a series of papers that reported
and characterised the loss and delay behaviour of packet audio on the In-
ternet [20, 21, 22]. They were theoretical works supported by experimental
evidence, advocating the use of techniques such as redundancy protection
against packet loss. Their publication also had an effect on the research
community, by highlighting the need to conduct theoretical, but applicable
research on the Internet’s behaviour. Comparisons of voice playout
algorithms were made by Ramachandran Ramjee et al. in 1994 [23]. This
work was extended to include performance bounds of the algorithms, with
Sue Moon as the primary investigator [24]; it was also later published in the
Multimedia Systems journal in January 1998 [25].
In 1997, Nicolas Maxemchuk and Shau-Ping Lo measured the loss and
delay variation for intra-state, inter-state and international links [26]. Two
important but unsurprising conclusions were that the quality depends on
the number of hops and the time of day. In 1999 Dong Lin also concluded
that even calls within the USA could suffer large jitter spikes [27]. Her
results on packet loss also agree with those in [22], which is interesting
as her measurements were taken some four years later. We did not
observe such effects in our measurements, which were conducted over purely
academic networks.
One piece of work which is worthy of note is Christian Sieckmeyer’s
master’s thesis entitled ’Evaluation of adaptive playout algorithms for packet
audio’ done at TU Berlin in 1995 [28]. This is a comprehensive evaluation
of jitter buffer playout algorithms using C++ implementations. Because it
is written in German, this thesis has not received the attention it deserves.
5.3 Times of measurement: 1996 - present
Recently, measurements have been very much in vogue. Three significant
theses presenting VoIP measurements were published at the turn of the
century. First, Dong Lin’s master’s thesis ’Real-time voice transmissions over the
Internet 1999’ undertaken at the University of Illinois in 1999 investigated
the use of interleaving and reconstruction for improved speech fidelity [27].
Second, Henning Sanneck’s PhD work ’Packet Loss Recovery and Control for
Voice Transmission over the Internet’ in 2000 looks at intra-flow hop-by-hop
schemes where VoIP flows are repaired within the network itself [29]. The
routers look into individual flows and attempt to interpolate any missing au-
dio packets. Third, Sue Moon’s ’Measurement and Analysis of End-to-End
Delay and Loss in the Internet’ in 2000 [30] looked at correcting the sys-
tematic errors introduced by clock skew between the sender and the receiver
[31]. Her findings on the playout delay incurred by a buffer at the receiver
are relevant to our work. In particular, she and her co-authors computed
the upper and lower bounds for the delay given a number of losses. They
also showed that these bounds are tight. A new ’spike detection’ algorithm
which takes into account the sudden delay peaks of audio transmissions is
presented in [25].
Robert Cole and Joshua Rosenbluth published ’Voice Over IP Perfor-
mance Monitoring’ in 2001 which advocates using the ITU’s E-model with
simplifications for use on packet switched networks [32]. In 2001 Mansour
Karam and Fouad Tobagi published ’Analysis of the Delay and Jitter of
Voice Traffic Over the Internet’ [33]. They looked at voice delay and the
effect of network parameters on the delay caused by voice traffic, assuming
it uses separate queues. They state the importance of bandwidth to reduce
the delay percentile incurred by voice. Additionally and unsurprisingly, for
networks over 10 Mbit/s, the transmission delay becomes negligible
within the end-to-end delay.
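The last observation follows from simple arithmetic: the transmission delay of
a packet is its size divided by the link rate. A short sketch, assuming a
200 byte voice packet (our figure, not one taken from [33]):

```python
def transmission_delay_ms(packet_bytes, link_bits_per_second):
    """Time needed to clock one packet onto a link, in milliseconds."""
    return packet_bytes * 8 / link_bits_per_second * 1000.0


PACKET_BYTES = 200  # assumed voice packet size including headers

# Delay per hop for a few link rates, from a 64 kbit/s line up to 100 Mbit/s.
for rate in (64_000, 2_000_000, 10_000_000, 100_000_000):
    print(f"{rate:>11} bit/s: {transmission_delay_ms(PACKET_BYTES, rate):.3f} ms")
```

At 64 kbit/s the packet occupies the link for 25 ms, whereas at 10 Mbit/s it
takes 0.16 ms, which is small compared with a delay budget measured in tens
or hundreds of milliseconds.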
Catherine Boutremans’ recent (December 2002) PhD thesis ’Delay Aspects
in Internet Telephony’ presents an adaptive error control scheme that
is delay aware [34]. The end-to-end delay is considered when choosing which
parameters should be used for forward error correction coding. She extends
this idea to include a playout buffer implementation that takes into account
the choice of the FEC scheme. Boutremans states that link and router fail-
ures are the dominant sources of degradation for VoIP sessions, despite IP
route protection.
6 Summary of the individual papers and their contributions
6.1 Paper A
Bengt Ahlgren, Anders Andersson, Olof Hagsand, and Ian Marsh. Dimen-
sioning links for IP telephony. In Proceedings of the 2nd IP-Telephony Work-
shop, pages 14-24, New York, USA, April 2001.
Summary: Currently many telephony providers use IP technology as a
bearer for voice data. One method to achieve good quality voice communi-
cation is to dimension the network, as is done in the traditional telephone
system. We propose that the capacity allocation for voice traffic on IP
networks can be performed as in traditional telephony or ATM voice net-
works. We argue that the research and models derived from the traditional
telephony and ATM fields are suitable for today’s IP networks.
Data network operators do not necessarily know how to dimension their
networks for voice traffic. A naive approach could be to allocate the number
of calls based on the capacity of the link and the maximum speech coder
rate. This will yield good quality, but will under-utilise the link due to the
lost capacity by not using the statistical multiplexing characteristics of the
voice calls. The other alternative is to allow as many calls onto the network
as possible, however in busy periods this may result in unpredictable or
poor quality. We can phrase one possible formulation of the problem as
’How many calls can an operator allow over a link (or portion thereof) with
a packet loss rate under 1%?’.
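The two extremes can be made concrete with a small calculation. The sketch
below is only an illustration under assumed figures (a 2 Mbit/s link, a
64 kbit/s coder, a voice activity factor of 0.4, packet header overhead
ignored); none of these values is taken from the paper.

```python
import math


def calls_peak_allocation(link_bps, codec_peak_bps):
    """Naive dimensioning: reserve the peak coder rate for every call."""
    return link_bps // codec_peak_bps


def calls_mean_rate_bound(link_bps, codec_peak_bps, activity):
    """Upper bound: calls that fit if only the mean rate is reserved.

    activity is the assumed fraction of time a source is in a talkspurt.
    """
    return math.floor(link_bps / (codec_peak_bps * activity))


LINK_BPS = 2_000_000   # assumed link capacity
CODEC_BPS = 64_000     # assumed peak coder rate
ACTIVITY = 0.4         # assumed voice activity factor

naive_calls = calls_peak_allocation(LINK_BPS, CODEC_BPS)
upper_bound = calls_mean_rate_bound(LINK_BPS, CODEC_BPS, ACTIVITY)
```

Under these assumptions the naive allocation admits 31 calls, while the mean
rate would allow at most 78. The number of calls that meets a given loss
target lies between the two, and finding it is the purpose of the model
described next.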
We propose a model based on the Markov modulated Poisson process
(MMPP) which calculates packet loss probabilities for a set of superposed
voice input sources. We assume the talk and silence periods are
exponentially distributed. Because of the exponentially distributed inter-
arrival times during a talkspurt (a series of voice packets), the emission of
packets in a talk period can be regarded as a Poisson process with a given
intensity. The superposition of Poisson processes is also a Poisson process.
We can therefore simply add the intensities of the sources that are currently
in a talkspurt and obtain a new Poisson process for the superposition. We
can use a two state birth-death process to describe the packet generation;
one state represents the idle periods and the other state the talkspurts. This
arrival process is fed into a */D/1/K queue. It is a single FIFO server with
deterministic service times and a buffer size K-1. The size of the buffer is
variable. This solution is sufficient to tackle the dimensioning problem given
a number of possible known and unknown quantities.
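To illustrate the arrival process, the sketch below enumerates the states of
the modulating chain for N independent on-off sources: state j means that j
sources are in a talkspurt, and the Poisson arrival intensity in that state is
j times the per-source packet rate. It is a minimal sketch of the standard
construction and not the Matlab implementation used in the paper; the
parameter values are assumptions, and the */D/1/K loss calculation itself is
not reproduced here.

```python
from math import comb


def mmpp_states(n_sources, mean_talk_s, mean_silence_s, packets_per_s):
    """Describe the birth-death chain that modulates the Poisson arrivals."""
    alpha = 1.0 / mean_silence_s   # rate at which a silent source starts talking
    beta = 1.0 / mean_talk_s       # rate at which a talking source falls silent
    p_talk = alpha / (alpha + beta)
    states = []
    for j in range(n_sources + 1):
        # Independent sources give a binomial stationary distribution.
        prob = comb(n_sources, j) * p_talk ** j * (1.0 - p_talk) ** (n_sources - j)
        states.append({
            "active": j,
            "probability": prob,
            "arrival_rate": j * packets_per_s,     # summed Poisson intensities
            "up_rate": (n_sources - j) * alpha,    # transition j -> j + 1
            "down_rate": j * beta,                 # transition j -> j - 1
        })
    return states


# Assumed values: 10 sources, 1.0 s talkspurts, 1.5 s silences, 50 packets/s.
states = mmpp_states(10, mean_talk_s=1.0, mean_silence_s=1.5, packets_per_s=50.0)
mean_rate = sum(s["probability"] * s["arrival_rate"] for s in states)
```

With these values each source talks 40% of the time, so the mean aggregate
rate is about 200 packets per second, while the peak rate is 500. The loss in
the finite buffer is determined by how long the chain stays in the states
whose arrival rate exceeds the service rate.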
Our simulations and laboratory measurements are in good agreement
with the Markov model chosen. This shows that many of the earlier efforts
on network dimensioning actually match the real environment they were
proposed for. A second contribution of this work is the inclusion of both
packet level simulations and laboratory measurements to verify the Markov
model. As far as we know, no other researchers have presented and
compared results from three different environments.
Contribution of this work: The contribution of this work is a planning
tool with which to dimension networks for voice traffic. We have established
relationships amongst the parameters needed to dimension a packet voice
network: the speech coding, the capacity of the voice network, the number
of users, the buffer sizes and the acceptable packet loss.
We propose an engineering solution to the problem of achieving good
quality using theoretical work as a basis. Unfortunately, many engineers do
not use results from theoreticians; in this work we attempted to show that
algorithms based on theoretical results are correct and implementable
in a real network. This work is a little unusual in that we compared a model,
a simulation and a laboratory testbed, which raises interesting questions in
itself. What are the differences between the three approaches, and how do
these differences manifest themselves in the results? As a simple example,
the time needed to transmit a packet is not included in the model, but it
is included in the simulation environment. In the testbed milieu, packets
might be sent in a bursty manner due to system behaviour. These effects are
difficult to account for in the model and the simulation. The work brought
differences such as these to our attention, and though it is not possible to
do anything about them specifically, it is useful to know of their existence.
My Contribution: The original idea to perform such a study was mine.
Within the project I supervised a masters student, Anders Andersson, who
implemented the MMPP model in Matlab and corresponding simulation
scripts in ns-2. I implemented most parts of the testbed environment and
the traffic generator that was used to simulate the superposed telephony
flows. I wrote the majority of the paper.
6.2 Paper B
Ingemar Kaj and Ian Marsh. Modelling the Arrival Process for Packet
Audio. In Quality of Service in Multiservice IP Networks, pages 35-49,
Milan, Italy, February 2003.
Summary: In this work, we model the arrival process of audio packets
that have passed through a number of routers. The first objective was
to gain an insight into the processes that determine this behaviour: Does
any theory exist that can explain the distribution of the arriving audio
packets? The second objective was to use a model of the arrival process for
the generation of artificial audio streams, thus resembling those that have
traversed the Internet. A model is clearly superior to static trace files as it
can produce variable temporal relationships between packets of a flow which
is not possible with a recorded session. To evaluate our work, we compare
the probability density functions of the gathered and generated data.
Normally packets are sent with constant time spacing from a sender,
and due to buffering at intermediate routers, arrive at the receiver with
non-constant time spacing. We separate the queueing delay caused by our
own packets (in front of us) from the delays induced by cross-traffic present
in the buffers of routers. This is because the contribution of the delay from
our packets is known and solvable. The delay contribution of the cross-traffic
however is more difficult and is the problem we address in this study. It is
complicated by the fact that most VoIP tools implement silence suppres-
sion and packets may be discarded. To make the problem more tractable,
we assume that the waiting times in the routers’ buffers are exponentially
distributed.
The solution, based on Markov theory, attempts to model the delay vari-
ation of audio packets sent at a constant rate. The packets are assumed to be
subjected independently (of each other) to delays when traversing the net-
work. The waiting time in intermediary buffers is, as stated, assumed to be
exponentially distributed. The observed delay of the packets can be shown
to be Markovian. The use of Markov theory allows silence suppression and
loss to be incorporated into the model as they are considered independent
from the delay variation introduced by the network.
Given a certain distribution of the delays in the network (e.g. exponen-
tial, Gaussian) it is possible to create a distribution of the arrivals which
mimics the arrival process of real data streams.
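As a rough illustration of generating an artificial stream, the sketch below
sends packets at a constant spacing and adds an independent, exponentially
distributed delay to each one. It is a deliberately simplified stand-in for
the Markov model of Paper B: silence suppression and loss are left out, and
because the delays are independent a packet may overtake its predecessor,
which shows up as a negative interarrival time. All names and parameter
values are assumptions made for the example.

```python
import random


def synthetic_arrivals(n_packets, spacing, mean_delay, seed=0):
    """Arrival times of packets sent every `spacing` seconds.

    Each packet receives an independent exponential delay with mean
    `mean_delay` (an assumption of this sketch).
    """
    rng = random.Random(seed)
    arrivals = []
    for i in range(n_packets):
        send_time = i * spacing
        delay = rng.expovariate(1.0 / mean_delay)
        arrivals.append(send_time + delay)
    return arrivals


def interarrival_times(arrivals):
    """Differences between consecutive arrivals, in sending order."""
    return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
```

A histogram of `interarrival_times(synthetic_arrivals(10000, 0.020, 0.005))`
could then be compared with the histogram of a measured trace, in the spirit
of the evaluation described above.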
Contribution of this work: The main contribution of this work is in-
sight into the problem of how packet audio streams become distorted when
traversing a network like the Internet. A Markov model that does not use
any transforms is developed from first principles. The model is extensible,
and therefore allows us to include both silence suppression at the sender and
packet loss during the transmission. A simple method to estimate packet loss
based on observed interarrival times is also given, independent of whether
silence suppression is activated or not.
My Contribution: The idea arose from Ingemar Kaj’s course ’Stochastic
Traffic Modelling’ and was conceived jointly with him. My contribution
was the original wide-area measurements of several VoIP flows. Using
these measurements I made some suggestions as to the possible processes acting
upon the streams. I also wrote several tools to process the data. I co-wrote
the paper.
6.3 Paper C
Olof Hagsand, Ian Marsh and Kjell Hanson. Sicsophone: A Low-delay In-
ternet Telephony Tool. To appear at the 29th Euromicro Conference, Belek,
Turkey, September 2003.
Summary: Users of interactive VoIP applications demand low latency
conversations. Replaying packetised audio requires that a sufficient number
of packets be available to the application in order to avoid audible glitches.
The standard method to solve this problem is to introduce a small inter-
mediary buffer between the decoded voice and the audio hardware. This
creates a dam and hence a temporary reservoir of packets for immediate
playout. We describe a Voice over IP system, called Sicsophone, that cou-
ples the low level features of audio hardware with a jitter buffer playout
algorithm. Using the sound card directly eliminates unnecessary buffering
as well as giving us fine control over timers needed by soft real-time applica-
tions such as VoIP. A soft real-time application is one which does not have
strict deadlines, but nevertheless requires low latency to obtain acceptable
performance. The delay in real-time voice communication is an example of
such a requirement.
Constructing a fully functional low delay VoIP system is non-trivial.
There is a clear tradeoff of speed against flexibility; for example different
audio formats make the system more difficult to optimise. PCM voice coding
is considered the ’fast path’ in Sicsophone. PCM encoded speech samples are
delivered to the loudspeaker the quickest. GSM is supported, but requires
extra buffers which have not been placed on the optimised path.
A standard statistical-based approach for inserting packets directly into
audio buffers is used in conjunction with low level control of the audio hard-
ware. The buffers in the sound card memory act as the playout buffers
rather than using memory in the application. This saves valuable time
when copying data to the application, as later it must be copied back to the
audio hardware after de-jittering. The application calculates the length of
the current playout buffer, since this requires some simple statistics.
Adjustments to the buffer length, however, are made by the operating system
during the silent periods. Late arrivals are detected in the hardware itself by
using two pointers, one which writes the incoming data and one which reads
(and hence replays) it immediately afterwards. If the read pointer passes
the write pointer, the next incoming data is simply not written to the (cir-
cular) buffer. In addition we developed a scheme for inhibiting unnecessary
corrections in the playout buffer size. Adjusting the buffer size in the very
short term only induces unwanted instability. This is especially undesirable
in our approach as we are dealing with the hardware directly.
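The late-arrival rule can be sketched in software. The class below is a
hypothetical model, not Sicsophone code: it imitates a circular buffer in
which a read position advances once per playout tick and an incoming frame is
discarded when its playout position has already been passed by the reader.
The class name, the use of absolute playout positions and the slot count are
all assumptions made for the illustration.

```python
class PlayoutRing:
    """Toy circular playout buffer with a read position and late-drop rule."""

    def __init__(self, n_slots):
        self.slots = [None] * n_slots
        self.read_pos = 0  # absolute position of the next frame to play

    def write(self, position, frame):
        """Store a frame at an absolute playout position.

        Returns False when the frame is late (the reader has already passed
        its position) or too far ahead to fit in the ring.
        """
        if position < self.read_pos:
            return False  # late arrival: simply not written
        if position >= self.read_pos + len(self.slots):
            return False  # would overwrite frames not yet played
        self.slots[position % len(self.slots)] = frame
        return True

    def read(self):
        """Play the next slot; None stands for silence."""
        index = self.read_pos % len(self.slots)
        frame = self.slots[index]
        self.slots[index] = None
        self.read_pos += 1
        return frame
```

In this model a frame written at position 0 is played on the first tick; if
nothing arrives for position 1 the reader plays silence and moves on, and a
frame for position 1 that arrives afterwards is rejected by `write`.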
Reducing unwanted changes to low level buffers on the sound card main-
tains good performance of the system. We found this combination of low
level access to the audio hardware plus a relatively simple technique of ad-
justing the playout buffer gave excellent delay characteristics for many test
cases. We performed live tests with the implemented software for compari-
son with off-the-shelf VoIP tools and found the system to exhibit much lower
delay characteristics.
Contribution of this work: The contribution of this work is a consid-
erable reduction in the delay incurred by VoIP end systems. This delay is
an important factor in determining the perceived quality. Engineering
solutions such as this are rarely seen in the research community. Others
have looked at optimising and reducing jitter buffer sizes, but have not
realised their ideas in real systems. Small theoretical improvements can be almost
negligible in a real system. A key artifact of this work is Sicsophone, a fully
functional VoIP application. It also has been used for the measurement
work, with some modifications.
My Contribution: I wrote the RTCP part of Sicsophone. I also made
some changes to the tool for measurement work. I wrote the conference pa-
per. I also performed comparisons between the playout delay of Sicsophone
and the optimal playout delay.
6.4 Paper D
Olof Hagsand, Kjell Hanson and Ian Marsh. Measuring Internet Telephony
Quality: Where are we today? In Proceedings of IEEE Globecom: Global
Internet, pages 1838-1842, Rio De Janeiro, Brazil, December 1999.
Summary: Users of Internet telephony applications demand good quality
audio playback. This quality depends on the instantaneous network condi-
tions and the time of day. In this paper, we describe a scheme for measuring
network connectivity and motivate the development of a new metric, asym-
metry, for judging quality. Work such as this gives useful feedback to users
and operators of IP telephony networks and important information for de-
velopers of Voice over IP applications.
This paper outlines a scheme to send a pre-recorded telephone call, in
PCM format, between a central site (Stockholm) and four satellite sites.
The call was simplex: it was transmitted in one direction only, i.e. either to
or from Stockholm. The call probes the links and routers of the intervening
connections, giving an estimation of the quality at the receiver. Using these
techniques we measured the quality of the intervening links. Our tests in-
cluded a wide range of geographically distributed sites. We concluded that
at four of the five sites we had access to, it was feasible to run successful
VoIP applications according to the ITU-T G.114 (delay specific) standard,
which states the end-to-end delay should not exceed 150 ms.
Contribution of this work: In 1999 we reported QoS measurements of
links between five remote sites and a central site. As far as we are aware the
jitter and asymmetry results were relatively new within the VoIP measure-
ment community. A further contribution of this work was ’updating’ the
available VoIP traces for research. The same ones were being used in many
different works and were becoming out of date. Three of the sites involved
in this study have been used in the work in Paper B and for comparison
of the VoIP quality in Paper E. This work can be seen as a precursor to the
work described next.
My Contribution: Although my name appears last on the (alphabetic)
list, the idea, work and text of the paper were mine.
6.5 Paper E
Ian Marsh and Fengyi Li. Wide Area Measurements of VoIP Quality. To ap-
pear at Quality of Future Internet Services 2003, October, 2003, Stockholm,
Sweden.
Summary: In Paper D from 1999, we reported experiments to quantify
the quality of Internet telephony. In this work we improved our measurement
methodology, included more hosts, probed more sessions and compared the
quality of the links that remained unchanged over the past years.
Once again, a pre-recorded telephone call was sent between nine sites
available to us, however this time the sites were connected as a full mesh
allowing us, in theory, to measure the quality of 72 different Internet paths.
In practice, some of the combinations were not usable due to certain ports
being blocked, thus preventing the audio from being sent to some sites. There were
four such cases. Bi-directional sessions were scheduled on an hourly basis
between any two given end systems. Calls were transferred only once per
hour due to load considerations on remote machines. This time nine sites
were carefully chosen with large variations in hops, geographic distances,
time zones and connectivity to obtain a better diversification of distributed
sites. One limitation of the sites was they were all located at academic
institutions, which are typically associated with well provisioned networks.
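The figure of 72 paths follows from counting ordered (sender, receiver) pairs
among nine sites. A minimal sketch of that count is given below; the site
names are placeholders, since the actual sites are not listed here.

```python
from itertools import permutations


def directed_paths(sites):
    """All ordered (sender, receiver) pairs in a full mesh of sites."""
    return list(permutations(sites, 2))


# Nine placeholder site names (hypothetical).
sites = [f"site{i}" for i in range(1, 10)]
paths = directed_paths(sites)
print(len(paths))  # 9 * 8 = 72
```

Each pair appears in both directions, which is what allows the asymmetry of a
connection to be examined.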
In order to gather more comprehensive measurement data, we included
four new test sites, and automated the process of sending and measuring the
test files. Other extensions included hourly bi-directional conversations on
a 24 hour basis. In contrast to the first set of experiments, where we set up
calls between a central site and the satellites, this time we used a full-mesh
scheme so each co-operating site could send and receive to all of the others.
Only PCM coding was used and call signalling was not included; we simply
started sending a UDP voice stream to an awaiting receiver, thus assuming
the signalling had been established between the two communicating parties.
Contribution of this work: The contribution of this work is a report
on the quality of Voice over IP in 2002. We defined the quality as one-way
delay, loss and jitter. In a large undertaking, we have gathered more than
24,000 sample sessions from nine globally distributed sites. For three sites,
we have been able to compare the quality with that of 1999.
A further contribution comes from combining the results of this work with
those of Paper C: we can estimate the mouth-to-ear delay of a VoIP system.
Without GPS and other specialised equipment this is a difficult quantity to
measure. Paper C accounts for the delay incurred by the end systems and
Paper E for the delay incurred by the network. Taken together, these two
results give an estimate of the total mouth-to-ear delay of a wide area VoIP
system.
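The combination amounts to a simple delay budget, sketched below. The 150 ms
limit is the G.114 figure quoted in the summary of Paper D; the function
names and the delay values in the usage note are hypothetical and are not
measurements from Papers C or E.

```python
# End-to-end delay limit quoted from ITU-T G.114 in the summary of Paper D.
G114_LIMIT_MS = 150.0


def mouth_to_ear_delay_ms(sender_ms, network_ms, receiver_ms):
    """Sum of the end-system delays and the one-way network delay."""
    return sender_ms + network_ms + receiver_ms


def within_limit(total_ms, limit_ms=G114_LIMIT_MS):
    """True when the total delay does not exceed the given limit."""
    return total_ms <= limit_ms
```

With illustrative values of 20 ms at the sender, 60 ms in the network and
30 ms at the receiver, `mouth_to_ear_delay_ms(20, 60, 30)` gives 110 ms,
which `within_limit` accepts.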
My Contribution: The idea to improve on the measurements from 1999
was mine. I advised a Masters student, Fengyi Li, to perform the measure-
ments using Sicsophone with my modifications. I wrote a tool to process
the session files and we jointly wrote the paper (based on Li’s master’s
thesis).
7 Conclusions
In the first phase of my doctoral studies I have investigated selected topics
within real-time audio communication. I have suggested techniques to im-
prove the quality of packet audio: dimensioning links specifically for packet
voice communication, modelling the packet audio arrival process at a re-
ceiver, measuring connectivity quality in wide area networks, and reducing
delays in end systems. The common theme in this work is an engineering ap-
proach to real-world issues. We have also tried to solve them independently
of current quality of service research agendas.
In the first study, we looked at the scenario where an operator uses a
link (or portion thereof) exclusively for voice over IP. Taking advantage of
the statistical properties of the calls, such as silence periods, allows
much higher utilisation of the link capacity. We used existing Markov
theory to ascertain how many calls can be allocated to the link according
to a specific quality. Given parameters we can define an appropriate opera-
tional range. Through modelling the scenario, simulation and a laboratory
implementation we are able to conclude the model and approach are valid.
In the second study, we investigated the effect that bulk TCP data has
on a single audio stream. This extends the previous work to include the
interaction between the two traffic types. Small audio packets multiplexed
with large data packets in the queues of routers can distort the original
timing of the speaker’s voice pattern. By identifying independence between
the delay experienced by each packet in the buffer and the observed network
delay we can show that the observed delay of the packets is Markovian. This
allows us to construct a Markov model in which we are able to model the
arrival process of packet audio streams. The interarrival histograms for the
model and the gathered data are similar, confirming that the model produces
an arrival process similar to those observed. The knowledge gained allows us
to generate representative and reproducible packet audio streams, which can
be used to test jitter buffer playout algorithms. The alternative is extensive
field testing, which although representative, is not very flexible. We have
used data from our measurement effort to construct the model, so realistic
data has been incorporated.
Measurement work has been an important part of this thesis. We conducted
two separate studies to report on the quality of VoIP on the Internet.
We implemented a real-time VoIP tool called Sicsophone (Paper C). This
tool was modified to enable us to measure the delay, loss and jitter between
globally distributed sites. Our findings show that the VoIP quality is accept-
able for communication between academic sites in Europe and the United
States, providing the end systems do not add excessive delay to the audio
streams. The measurements have also assisted us in gaining some insight
into the loss and delay processes at work on the Internet. We have gathered
over 18,000 traces and made them publicly available. This can be seen as a
contribution of this work in its own right.
Finally, we have looked more closely at the contribution of the end sys-
tems to the mouth-to-ear delay. The end systems are an important, and
often overlooked, part of a real-time voice communication system. They
are also one component of a VoIP system that can be finely tuned. We
show that the delay incurred by the end points can be reduced from hundreds
of milliseconds to tens of milliseconds per host. This is achieved by moving
the jitter buffer to the memory of the sound card, resulting in less copying
of the data and direct access to the sound samples. Because humans are
particularly sensitive to delays over 175 milliseconds, the reduction in delay
is critical to achieving good quality audio communication.
We believe that a combination of research and engineering solutions can
yield significant improvements for the quality of real-time voice services.
We have investigated four areas from different perspectives, with a par-
ticular focus on delay. Using the techniques we have proposed, we show
that hundreds of milliseconds can be saved in the delay budget of real-time
voice communication, improving the audio quality considerably. We have
also gained valuable insight into more fundamental issues through modelling
and measurements of real-time voice systems. In future work we will look at
implementing more efficient jitter reduction techniques. We will also investi-
gate the bound on the lowest possible delay that can be attained in a packet
audio system. This work will be complemented by further measurements to
ascertain the contribution of the (variable) queueing delay within the total
mouth-to-ear delay budget.
References
[1] D. T. Magill, “Adaptive speech compression for packet communication
systems,” in Conference record of the IEEE National Telecommunica-
tions Conference, pp. 29D–1 – 29D–5, 1973.
[2] W. E. Naylor, “A status report on the real-time speech transmission
work at UCLA.” NSC Note 52, Dec 1974.
[3] J. Forgie, “Speech communications in packet-switched networks,” Journal
of the Acoustical Society of America, vol. 59, no. 1, 1976.
[4] W. E. Naylor, Stream traffic communication in packet switched net-
works. PhD thesis, UCLA, 1977.
[5] D. Cohen, “Issues in transnet packetized voice communications,” in
Proceedings of the Fifth Data Communications Symposium, (Snowbird,
Utah), pp. 6–10 – 6–13, ACM, IEEE, Sept. 1977.
[6] J. G. Gruber, “Delay related issues in integrated voice and data net-
works — a review and some experimental work,” in 6th Data Commu-
nications Symposium (ACM Sigcomm Computer Communication Re-
view), (Pacific Grove, California), pp. 166–180, ACM/IEEE, Nov. 1979.
[7] G. Barberis and D. Pazzaglia, “Analysis and optimal design of a packet-
voice receiver,” IEEE Transactions on Communications, vol. COM-28,
pp. 217–227, Feb. 1980.
[8] W. A. Montgomery, “Techniques for packet voice synchronization,”
IEEE Journal on Selected Areas in Communications, vol. SAC-1,
pp. 1022–1028, Dec. 1983.
[9] P. M. Gopal, J. W. Wong, and J. C. Majithia, “Analysis of playout
strategies for voice transmission using packet switching techniques,”
Performance Evaluation, vol. 4, pp. 11–18, Feb. 1984.
[10] M. K. M. Ali, C. M. Woodside, and J. F. Hayes, “Re-assembly buffer
requirements in a packet voice network,” Computer Networks and ISDN
Systems, vol. 15, no. 2, pp. 109–120, 1988.
[11] D. C. Verma, H. Zhang, and D. Ferrari, “Delay jitter control for real-
time communication in a packet switching network,” Tech. Rep. TR-
91-007, University of California, Berkeley, CA, 1991.
[12] D. Ferrari and D. C. Verma, “A scheme for real-time channel estab-
lishment in wide-area networks,” IEEE Journal on Selected Areas in
Communications, vol. 8, no. 3, pp. 368–379, 1990.
[13] S. L. Casner and S. E. Deering, “First IETF Internet audiocast,” ACM
Computer Communication Review, vol. 22, pp. 92–97, July 1992.
[14] V. Jacobson and S. McCanne, “vat - LBNL audio conferencing tool,”
July 1992. Available at http://www-nrg.ee.lbl.gov/vat/.
[15] H. Schulzrinne, “Voice communication across the Internet: A network
voice terminal,” Technical Report TR 92-50, Dept. of Computer Sci-
ence, University of Massachusetts, Amherst, Massachusetts, July 1992.
[16] “http://www-sop.inria.fr/rodeo/fphone/,” 1999.
[17] V. Hardman, A. Sasse, M. Handley, and A. Watson, “Reliable audio for
use over the Internet,” in Proc. of INET’95, (Honolulu, Hawaii), June
1995.
[18] V. Jacobson, “Multimedia conferencing on the Internet,” in SIGCOMM
Symposium on Communications Architectures and Protocols, (London,
England), Aug. 1994. Tutorial slides.
[19] H. Schulzrinne, Reducing and characterizing packet loss for high-speed
computer networks with real-time services. PhD thesis, University of
Massachusetts, Amherst, Massachusetts, May 1993.
[20] J. C. Bolot, “Characterizing end-to-end packet delay and loss in the
Internet,” Journal of High Speed Networks, vol. 2, no. 3, pp. 305–323,
1993.
[21] J. C. Bolot, “End-to-end packet delay and loss behavior in the In-
ternet,” in SIGCOMM Symposium on Communications Architectures
and Protocols (D. Sidhu, ed.), (San Francisco, California), pp. 289–298,
ACM, Sept. 1993. also in Computer Communication Review 23 (4),
Oct. 1992.
[22] J. C. Bolot, H. Crepin, and A. Garcia, “Analysis of audio packet loss in
the Internet,” in Proc. International Workshop on Network and Operat-
ing System Support for Digital Audio and Video (NOSSDAV), Lecture
Notes in Computer Science, (Durham, New Hampshire), pp. 163–174,
Springer, Apr. 1995.
[23] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, “Adaptive play-
out mechanisms for packetized audio applications in wide-area net-
works,” in Proceedings of the Conference on Computer Communications
(IEEE Infocom), (Toronto, Canada), pp. 680–688, IEEE Computer So-
ciety Press, Los Alamitos, California, June 1994.
[24] S. B. Moon, J. Kurose, and D. Towsley, “Packet audio playout delay
adjustment algorithms: performance bounds and algorithms,” research
report, Department of Computer Science, University of Massachusetts
at Amherst, Amherst, Massachusetts, Aug. 1995.
[25] S. Moon, J. F. Kurose, and D. F. Towsley, “Packet audio playout de-
lay adjustment: performance bounds and algorithms,” Multimedia Sys-
tems, vol. 5, pp. 17–28, Jan. 1998.
[26] N. F. Maxemchuk and S.-P. Lo, “Measurement and interpretation of
voice traffic on the Internet,” in Conference Record of the International
Conference on Communications (ICC), (Montreal, Canada), June 1997.
[27] D. Lin, “Real-time voice transmissions over the Internet,” master thesis,
University of Illinois, Urbana-Champaign, 1999.
http://manip.crhc.uiuc.edu/Wah/papers/TM16/TM16.pdf.
[28] C. Sieckmeyer, “Bewertung von adaptiven Ausspielalgorithmen für
paketvermittelte Audiodaten (Evaluation of adaptive playout algo-
rithms for packet audio - in German),” Studienarbeit, Dept. of Electri-
cal Engineering, TU Berlin, Berlin, Germany, Oct. 1995.
[29] H. Sanneck, Packet Loss Recovery and Control for Voice Transmission
over the Internet. PhD thesis, Technical University of Berlin, Oct. 2000.
[30] S. Moon, Measurement and Analysis of End-to-End Delay and Loss in
the Internet. PhD thesis, University of Massachusetts, 2000.
[31] S. Moon, P. Skelly, and D. Towsley, “Estimation and removal of clock
skew from network delay measurements,” in Proceedings of the Con-
ference on Computer Communications (IEEE Infocom), (New York),
Mar. 1999.
[32] R. G. Cole and J. H. Rosenbluth, “Voice over IP performance
monitoring,” ACM Computer Communication Review, vol. 31, pp. 9–24, Apr. 2001.
[33] M. Karam and F. Tobagi, “Analysis of the delay and jitter of voice
traffic over the Internet,” in Proceedings of the Conference on Computer
Communications (IEEE Infocom), (Anchorage, Alaska), Apr. 2001.
[34] C. Boutremans, Delay Aspects in Internet Telephony. PhD thesis,
EPFL, Dec 2002. 2715.
Paper A
Bengt Ahlgren, Anders Andersson, Olof Hagsand, and Ian Marsh. Dimen-
sioning Links for IP Telephony. In Proceedings of the 2nd IP-Telephony
Workshop, pages 14-24, New York, USA, April 2001.
IPTEL2001 35
Dimensioning Links for IP Telephony
Bengt Ahlgren, Anders Andersson, Olof Hagsand and Ian Marsh
SICS
CNA Laboratory
Sweden
{bengta, olof, andersa, ianm}@sics.se
Abstract—
Packet loss is an important parameter for dimensioning
network links or traffic classes carrying IP telephony traf-
fic. We present a model based on the Markov modulated
Poisson process (MMPP) which calculates packet loss prob-
abilities for a set of superpositioned voice input sources and
the specified link properties. We do not introduce another
new model to the community; rather, we try to verify one of
the existing models via extensive simulation and a real world
implementation. A plethora of excellent research on queuing
theory is still in the domain of ATM researchers and we
attempt to highlight its validity to the IP Telephony
community.
Packet level simulations show very good correspondence
with the predictions of the model. Our main contribution is
the verification of the MMPP model with measurements in
a laboratory environment. The loss rates predicted by the
model are in general close to the measured loss rates and the
loss rates obtained with simulation. The general conclusion
is that the MMPP-based model is a tool well suited for di-
mensioning links carrying packetized voice in a system with
limited buffer space.
Keywords—Link Dimensioning, Markov Process, IP Tele-
phony, MMPP/D/1/K
I. INTRODUCTION
Voice applications, such as telephony, have been used
on the best effort service provided by the Internet for quite
some time. Currently many telephone operators have ad-
vanced plans to use IP technology as a bearer also for the
regular telephone service. This, however, requires that the
IP network can provide service guarantees.
Quality of Service (QoS) issues are being addressed by
many forums, committees and researchers. Research on IP
QoS has concentrated on the issues of classifying, schedul-
ing and admission of packets into a network. Less has been
done on how to dimension an IP network carrying real time
traffic.
This paper focuses on dimensioning IP network links
intended to carry packetized telephony or voice calls. It is
feasible that existing carriers would like to allocate a portion
of their bandwidth for this service and through mechanisms
like differentiated services [11] provide superior
service for this kind of data and subsequently levy higher
charges.
Fig. 1. Problem: dimensioning a link for voice sources over IP.
(Diagram labels: Node 0, Node 1, a0, a1, a3, Buffer, 1.536 Mbits/sec,
Sources (60-80), Voice, sink.)
Our approach is to look at work done in both the ATM
and traditional telephony communities as well as to use
tools and simulators from the IP community to verify these
ideas in an environment relevant for the Internet today. We
have seen very little work which has taken this approach.
The research community is largely divided into these two
camps, although this is changing as ATM and telephony researchers
become more engaged in Internet research.
Figure 1 illustrates the problem scenario we are address-
ing. A number of packet voice sources are multiplexed
onto a link. The link has a limited amount of buffering
which sometimes will result in the loss of packets with the
obvious consequences on sound quality. With a link of a
given bandwidth and a number of voice sources, what kind
of quality could be expected if we ran 60 sources? What if
we increased to 80—can we still expect adequate quality?
How will we affect the system by changing the amount of
buffering in the router?
We present a mathematical model based on a Markov
modulated Poisson process (MMPP) which can predict the
packet loss probability. We first verify the model using the
NS packet level simulator. The main contribution of this
paper is the verification of the MMPP model with mea-
surements in a lab network. These experiments show a
very good correspondence between the loss rate predicted
by the model and the loss rate measured in the lab.
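To give a feel for how the buffer size and the load influence the loss rate,
the sketch below simulates a single FIFO server with deterministic service
times and room for K packets. It is a simulation stand-in, not the analytical
MMPP solution of this paper, and it uses plain Poisson arrivals rather than
an MMPP. The function name, the parameters and the values in the usage note
are assumptions made for the example.

```python
import random
from collections import deque


def simulate_loss(arrival_rate, service_time, capacity, n_arrivals, seed=1):
    """Fraction of arrivals lost in a FIFO single-server queue.

    `capacity` is K: the packet in service plus K-1 buffer places.
    Arrivals are plain Poisson here (an assumption of this sketch).
    """
    rng = random.Random(seed)
    departures = deque()  # departure times of packets still in the system
    now = 0.0
    lost = 0
    for _ in range(n_arrivals):
        now += rng.expovariate(arrival_rate)
        # Remove packets that have finished service by this arrival.
        while departures and departures[0] <= now:
            departures.popleft()
        if len(departures) >= capacity:
            lost += 1  # system full: the packet is dropped
            continue
        # Service starts when the previous packet leaves, or immediately.
        start = departures[-1] if departures else now
        departures.append(start + service_time)
    return lost / n_arrivals
```

Calling `simulate_loss(900.0, 0.001, capacity, 100000)` for a range of
`capacity` values (a hypothetical load of 0.9) gives an impression of how the
loss rate changes as buffering is added.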
The rest of the paper is organized as follows. After
summarizing relevant related work in the next section, we
present the MMPP-based mathematical model and the rea-
soning leading to this model in Section III. Section IV
describes the parameters we used in the experiments. Sections
V and VI describe the NS simulations and the laboratory
experiments, respectively. The experimental results
are presented and discussed in Section VII and the paper
is concluded with Section VIII.
II. RELATED WORK
Link dimensioning for voice has been a research topic
for several decades in both academia and the telecommuni-
cations industry. Starting a little more than ten years back,
the research focus has been on link dimensioning for ATM
networks. Most of the results in the domain of ATM net-
works are also applicable in the domain of IP networks,
since both are packet switching systems. Most of the results
from previous research are theoretical or derived from
simulations. Our research adds results from measurements
of a real system.
Several approaches have been suggested in the litera-
ture to solve the problem of dimensioning links in packet
switched networks. Anick, Mitra and Sondhi [2] study a
multiplexer with infinite buffer with a stochastic fluid flow
model but it is shown by Zheng [14] that this model only
works for a multiplexer under heavy load. Tucker [15]
studies a multiplexer with finite buffer using the fluid
flow model, but the model does not fit well for small
buffers. Heffes and Lucantoni [7] use a two-state Markov
modulated Poisson process (MMPP) quite successfully to
estimate the delay in a multiplexer with infinite buffer
size. They suggest that the same approach for calculat-
ing the parameters of the MMPP can be used for a mul-
tiplexer with finite buffer size, but Nagarajan, Kurose and
Towsley [10] show that this does not work in the case of
finite buffer size. Instead, they develop a different method
for finding the parameters of the MMPP. Baiocchi et
al. [4] approximate the arrival process with a two-state
MMPP and suggest a method called asymptotic matching
for the calculation of the parameters of the MMPP. This
approach is used by Andersson [1] together with a proce-
dure to calculate the loss probabilities developed by Baioc-
chi, Melazzi and Roveri [3] to study a multiplexer loaded
with a superposition of voice sources.
III. MATHEMATICAL MODEL
In this section we develop a mathematical model for di-
mensioning a link carrying voice traffic. We start with the
arrival process of a single IP telephony source and pro-
ceed with the superposition of independent identically dis-
tributed sources. The sources are then multiplexed on a
bottleneck link through a queue of limited size. A more
detailed description of this model can be found in previous
work by one of the authors [1]. The model is based on a
model developed by Baiocchi, Melazzi and Roveri [3].
Fig. 2. Characteristics of a single source (packets of fixed size sent every T during ON periods, none during OFF periods; time axis t).
A. Single source properties
Most standard voice encodings have a fixed bit rate and
a fixed packetization delay. They thus produce a
stream of fixed size packets. This packet stream is, however,
only produced during talk-spurts; the voice coder sends
no packets during silence periods.
The behavior of a single source is easily modeled by a
simple on-off model (Figure 2). During talk-spurts (ON-
periods), the model produces a stream of fixed size packets
with fixed inter-arrival times T. Note that the first packet
is produced one packet time after the start of an on-period.
This is the result of the packetization—the voice coder
has to collect voice samples before it can produce the first
packet.
The number of packets in a talk-spurt, denoted with the
stochastic variable $N_b$, is assumed to be geometrically distributed
on the positive integers with mean n. This means
that we can never have zero packets in a talk-spurt. This
variant of the geometric distribution is sometimes called the
first success distribution (see for instance Gut [6, page
258]), and has the probability function:
$$P(N_b = k) = q\,p^{k-1}, \quad k = 1, 2, 3, \ldots \qquad (1)$$
where $q = 1 - p$ represents the probability that a packet is the
last one in a talk-spurt. Since the mean of this distribution is
$n = 1/q$, this means that $p = \frac{n-1}{n}$. This
fact implies that the ON-periods have an expected value of
α = nT, where n is the expected value of the number of
packets in a talk-spurt.
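As a quick numerical check of Equation 1, the following Python sketch evaluates the first success probability function and confirms that it sums to one and has mean n. The function name and the truncation point of the sums are illustrative choices and not part of the original model code; n = 22 and T = 16 ms are the values used in Section IV.

```python
def first_success_pmf(k, n):
    """P(N_b = k) = q * p**(k-1) for k = 1, 2, 3, ... with mean n."""
    q = 1.0 / n          # probability that a packet is the last one in a talk-spurt
    p = 1.0 - q          # equals (n - 1) / n
    return q * p ** (k - 1)

n = 22                   # mean number of packets per talk-spurt (Section IV)
T = 0.016                # packet inter-arrival time in seconds (Section IV)

# Truncate the infinite sums; the tail beyond k = 5000 is negligible for n = 22.
ks = range(1, 5001)
total = sum(first_success_pmf(k, n) for k in ks)
mean_packets = sum(k * first_success_pmf(k, n) for k in ks)
mean_on = mean_packets * T   # alpha = n * T

print(total, mean_packets, mean_on)
```

The mean ON-period computed this way should agree with the 352 ms talk-spurt length quoted in Section IV.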
We assume that the OFF-periods are exponentially distributed
with mean β, which is well documented and discussed
by Sriram and Whitt [13]. A voice source may be
viewed as a two-state birth-death process with birth rate 1/β
and death rate 1/α. The OFF state represents the idle periods
and the ON state represents the talk-spurts. While in a
talk-spurt, packets are generated at a rate of $\frac{1}{T}$ packets
per second.
B. Approximating the single source
We have chosen to approximate the above model using
exponentially distributed inter-arrival times with mean T
instead of fixed inter-arrival times. The purpose of the
approximation is to simplify the modelling of many sources.
Fig. 3. A single source approximated with exponentially distributed
inter-arrivals (inter-arrival times τ ∼ Exp(1/T) during ON periods; time axis t).
We let $\tau \in \mathrm{Exp}(\frac{1}{T})$ denote the stochastic variable
which describes the inter-arrivals during talk-spurts, and
$N_b$ be the geometrically distributed stochastic variable
with the probability function stated in Equation 1 with
mean n describing the number of packets in a talk-spurt.
Moreover τ and $N_b$ are assumed to be independent. It can
be easily seen that the ON-periods (denoted U) are exponentially
distributed and that the mean length of a talk-spurt
is the same as in the deterministic inter-arrival case
(nT). Figure 3 illustrates the behaviour of a single source
with exponentially distributed inter-arrivals.
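The statement that the ON-periods are exponentially distributed rests on a standard result: a geometrically distributed number of independent exponential variables sums to an exponential variable. The simulation sketch below illustrates this under the assumptions of this section; the function name, seed and sample size are illustrative. An exponential distribution has a squared coefficient of variation of 1, which is what the second printed value should approach.

```python
import random

def sample_on_period(n, T, rng):
    """Sum of N_b exponential inter-arrivals (mean T), N_b first-success with mean n."""
    q = 1.0 / n
    total = rng.expovariate(1.0 / T)      # a talk-spurt always has at least one packet
    while rng.random() >= q:              # with probability p = 1 - q another packet follows
        total += rng.expovariate(1.0 / T)
    return total

rng = random.Random(1)                    # fixed seed so the run is repeatable
n, T = 22, 0.016
samples = [sample_on_period(n, T, rng) for _ in range(50000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
scv = var / mean ** 2                     # squared coefficient of variation

print(mean, scv)
```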
As in the previous section the OFF-periods are assumed
to be exponentially distributed with mean β. Because of
the exponentially distributed inter-arrival times during a
talk-spurt, the emission of packets during an ON-period
can be regarded as a Poisson process with intensity 1/T. We
can use the two-state birth-death process to describe the
packet generation, with one state representing the idle periods
and the other state representing the talk-spurts, where
packets are generated as a Poisson process with intensity 1/T.
C. The superposition of independent voice sources
The superposition of voice sources can be viewed as a
birth-death process where the states represent the number
of sources that are currently in the ON-state. Here state
i represents that i sources are active in a talk-spurt. We
refer to the birth-death process as the phase process J(t).
The birth rate per idle source is the reciprocal of the mean of the
exponentially distributed idle periods, $\frac{1}{\beta}$. The
death rate per active source is the reciprocal of the mean duration of the
talk-spurts, $\frac{1}{\alpha}$. The probability $p_{on}$ that a
source is on is given by:
$$p_{on} = \frac{\alpha}{\alpha + \beta}.$$
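To make the formula concrete, the sketch below evaluates p_on for the parameter values of Section IV. It also computes the distribution of the number of simultaneously active sources, under the assumption (standard for independent, identically distributed two-state sources, but not spelled out in the text) that this number is binomially distributed. The function name is illustrative.

```python
from math import comb

alpha = 0.352            # mean talk-spurt length in seconds (Section IV)
beta = 0.650             # mean idle period in seconds (Section IV)
p_on = alpha / (alpha + beta)

def active_sources_pmf(N, p):
    """Binomial probabilities that i of N independent sources are in a talk-spurt."""
    return [comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(N + 1)]

N = 60
pmf = active_sources_pmf(N, p_on)
mean_active = sum(i * pi for i, pi in enumerate(pmf))

print(round(p_on, 3), round(mean_active, 2))
```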
D. Markov modulated Poisson process
The Markov modulated Poisson process (MMPP) is a
widely used tool for analysis of tele-traffic models (see,
e.g., Heffes and Lucantoni [7]). It describes the superposition
of sources of the type described in Section III-B.
When the phase process is in state i, i sources are on. The
model graph of the MMPP is shown in Figure 4.
Fig. 4. Superposition of N voice sources with exponentially
distributed inter-arrivals (phase states 0, 1, ..., N-1, N with transition rates labelled by α and β, and a Poisson rate for each state).
Fig. 5. k-interval squared coefficient of variation curves for
superposition of N voice sources (index of dispersion for intervals (IDI) versus number of consecutive intervals k, log10 axis from 10^0 to 10^4; curves for N=1, N=10, N=60, N=130 and Poisson).
The superposition of Poisson processes is also a Poisson
process. We can therefore simply add the intensities of the
sources that are currently in a talk-spurt and receive a new
Poisson process for the superposition.
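The structure described above can be written down as three lists indexed by the number of active sources. The sketch below is one possible parameterisation and rests on assumptions not stated explicitly in the text: in state i the phase process moves up at rate (N - i)/β and down at rate i/α, and packets arrive at rate i/T. The function names are illustrative. The stationary distribution follows from the detailed balance equations of a birth-death chain.

```python
def mmpp_parameters(N, alpha, beta, T):
    """Return (birth, death, poisson_rate) lists indexed by the number of active sources i."""
    birth = [(N - i) / beta for i in range(N + 1)]   # i -> i + 1: one of N - i idle sources starts talking
    death = [i / alpha for i in range(N + 1)]        # i -> i - 1: one of i active sources falls silent
    rate = [i / T for i in range(N + 1)]             # packets per second while in state i
    return birth, death, rate

def stationary_distribution(birth, death):
    """Stationary probabilities of a birth-death chain from the detailed balance equations."""
    weights = [1.0]
    for i in range(1, len(birth)):
        weights.append(weights[-1] * birth[i - 1] / death[i])
    total = sum(weights)
    return [w / total for w in weights]

N, alpha, beta, T = 60, 0.352, 0.650, 0.016
birth, death, rate = mmpp_parameters(N, alpha, beta, T)
pi = stationary_distribution(birth, death)
mean_rate = sum(p * r for p, r in zip(pi, rate))     # long-run packets per second

print(mean_rate)
```

Under these assumptions the long-run packet rate equals N times p_on divided by T, which gives a simple consistency check on the parameters.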
To validate the accuracy of approximating with an MMPP,
we calculated the index of dispersion of intervals
(IDI) using a formula from Sriram and Whitt [13]. The
IDI, also called the k-interval squared coefficient of variation, gives
us a measure of how bursty the traffic is. A value of 1 means
the traffic is as bursty as Poisson traffic, whereas a value of about 18
corresponds to the burstiness of a
single voice source. The high value reflects the fact
that a single source is indeed bursty. The time period over
which one observes this behaviour is very important.
Figure 5 shows $c^2_{kN}$, the IDI, versus k for k between 1
and 2000 and the number of sources, N, equal to 1, 10, 60
and 130. As a reference we have added the value of $c^2_{kN}$
for a Poisson process. Data was obtained from simulations
using a Matlab program. The solid line shows the $c^2_{kN}$
for sources with deterministic inter-arrival times between
packets during a talk-spurt, and the dashed lines show the
$c^2_{kN}$ for sources with exponentially distributed inter-arrival
times, i.e., the MMPP approximation.
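For readers who want to reproduce curves of this kind from their own data, the sketch below estimates the IDI from a list of inter-arrival times. It assumes the commonly used definition in which the IDI for k intervals is k times the variance of the sum of k consecutive intervals, divided by the squared mean of that sum. This is not the closed-form formula from Sriram and Whitt used in the paper, and the function name and test inputs are illustrative.

```python
import random

def idi(intervals, k):
    """Estimate c_k^2 = k * Var(S_k) / E[S_k]^2 from non-overlapping sums of k intervals."""
    sums = [sum(intervals[j:j + k]) for j in range(0, len(intervals) - k + 1, k)]
    mean = sum(sums) / len(sums)
    var = sum((s - mean) ** 2 for s in sums) / len(sums)
    return k * var / mean ** 2

rng = random.Random(7)
poisson_intervals = [rng.expovariate(1.0) for _ in range(100000)]
constant_intervals = [0.016] * 1000

print(idi(poisson_intervals, 1), idi(poisson_intervals, 10), idi(constant_intervals, 10))
```

For a Poisson process the estimate should stay near 1 for every k, and for a constant-rate stream it should be essentially 0, which brackets the behaviour of the voice sources discussed here.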
We see in the figure that the two descriptions of a sin-
gle source behave in a similar way when they are super-
positioned. The figure also shows that the superpositioned
arrival process behaves as a Poisson process if we look at
it for a short instant of time but it is much burstier if we
study it over a longer period of time.
E. The multiplexer: MMPP/D/1/K queue
The arrival process described by the MMPP model is fed
into a simple D/1/K queue. The queue has deterministic service times, a single
FIFO server and a finite buffer (waiting room) whose size we
vary. This kind of model is described in detail by Baiocchi
et al. [3], [4]. We use their method and formulas for
calculating the loss probability.
IV. PARAMETER VALUES
We used the following parameters to run the MMPP
model, simulations and lab experiments:
• 32 kb/s ADPCM voice encoding with 16 ms packet inter-arrival
time, which results in 64 bytes of voice payload per
packet
• A protocol header overhead consisting of 12 bytes for
RTP, 8 bytes UDP and 20 bytes IP. We do not include any
link layer headers. The resulting total packet size is 104
bytes, and the resulting bit rate is 52 kb/s.
• The number of successive packets in one talk-spurt is
geometrically distributed on the positive integers with a
mean of 22, which results in a mean talk-spurt length of
352 ms. The idle time between two successive bursts is
exponentially distributed with a mean of 650 ms. The re-
sulting average fraction of time a source is in a talk-spurt
is 0.351.
• The bottleneck is a T1 link with a bandwidth of
1.536 Mb/s.
These values coincide with Sriram and Whitt [13] as well
as previous work done by Zheng [14] whilst at SICS and
Andersson [1], except that in this paper we include protocol
header overhead for the RTP/UDP/IP protocol stack.
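The derived quantities in the list above follow directly from the codec rate, the packet interval and the header sizes. The short sketch below retraces that arithmetic; the variable names are illustrative.

```python
codec_rate = 32000            # bit/s, ADPCM
interval = 0.016              # s between packets in a talk-spurt
headers = 12 + 8 + 20         # RTP + UDP + IP header bytes

payload = codec_rate * interval / 8          # bytes of voice per packet
packet_size = payload + headers              # bytes per packet, without link layer headers
peak_rate = packet_size * 8 / interval       # bit/s while in a talk-spurt

mean_packets = 22                            # packets per talk-spurt
mean_on = mean_packets * interval            # s
mean_off = 0.650                             # s
on_fraction = mean_on / (mean_on + mean_off)

print(payload, packet_size, peak_rate, round(mean_on, 3), round(on_fraction, 3))
```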
Figure 6 shows loss curves computed with the MMPP
model for a sample set of buffer sizes. The next steps are
to compare these loss probabilities from the model with
results from NS simulations and measurements from a lab
network.
Fig. 6. Loss probabilities computed with the MMPP model (loss probability, log scale from 1e-05 to 1, versus number of buffers from 0 to 100; curves for 60, 65, 70, 75 and 80 sources).
Sources (N) Load (λ)
29 34.5 %
60 71.4 %
80 95.3 %
84 98 %
TABLE I
NETWORK LOAD FOR A NUMBER OF SOURCES.
A. Load
We use between 60 and 80 sources to load the link. To
define a load that is independent of the link bandwidth, the
load factor, λ, is used in the literature:
$$\text{Load}\ (\lambda) = \frac{N \times P_{on} \times \text{Rate}_{peak}}{C}$$
where N is the number of sources, C is the link capacity, $P_{on}$
is the probability that the source is on and $\text{Rate}_{peak}$
is the bit rate of a single source during a talk-spurt.
Table I shows loads for different numbers of
sources.
We decided to run between 60 and 80 sources, as 84
sources is where the mean bandwidth of the sources equals
the bandwidth of the link. The peak allocation is as low as
29 sources (100 % utilisation when $P_{on} = 1$), so taking advantage
of the probability that a source is off yields much
higher link utilisation.
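The load factor and the two allocation limits mentioned above can be recomputed with a few lines of Python. This is a sketch using the parameter values of Section IV; the function and variable names are illustrative. The printed percentages come out close to, though not in every case identical to, the entries of Table I, presumably because of rounding in the original calculation.

```python
def load(N, p_on, peak_rate, capacity):
    """Load factor: mean offered bit rate of N sources divided by the link capacity."""
    return N * p_on * peak_rate / capacity

p_on = 0.352 / (0.352 + 0.650)   # fraction of time a source is in a talk-spurt
peak_rate = 52000.0              # bit/s per source, including RTP/UDP/IP headers
capacity = 1536000.0             # bit/s, the bottleneck link

for N in (29, 60, 80, 84):
    print(N, round(100 * load(N, p_on, peak_rate, capacity), 1))

peak_allocation = int(capacity // peak_rate)              # sources that fit at peak rate
mean_allocation = int(capacity // (p_on * peak_rate))     # sources that fit at mean rate
print(peak_allocation, mean_allocation)
```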
B. Buffer size
We have chosen to simulate a multiplexer with an output
link capacity of 1.536 Mb/s and buffer sizes ranging from
2 to 100 packets. With this choice of parameters we in-
troduce a maximum queueing delay of 54 ms in the buffer.
According to ITU recommendation G.114 [8], a delay of 0-150 ms
is acceptable for telephony, between 150 and 400 ms
can also be acceptable, but over 400 ms is not. The total
acceptable delay must be divided into a delay budget for
each node in the path between the sender and receiver. If
the path has 15 hops, and half of the 400 ms delay budget can be
allocated to queueing delay, then we get 13.3 ms per hop.
This translates to approximately 24 buffers per hop. For
higher bandwidth links, the queueing delay per buffered
packet decreases in inverse proportion to the bandwidth.
# Source agent i, of the type "CBR/UDP"
set cbr($i) [new Agent/CBR/UDP]
# Exponential on/off traffic generator and its four parameters
set exp($i) [new Traffic/Expoo]
$exp($i) set packet-size 104
$exp($i) set burst-time 0.352s
$exp($i) set idle-time 0.65s
$exp($i) set rate 52K
$cbr($i) attach-traffic $exp($i)
Fig. 7. Tcl code fragment defining a source in NS-2.
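The delay figures in this subsection can be retraced as follows. The sketch assumes that each buffer holds one 104 byte packet and that the queueing delay is simply the transmission time of the packets ahead; the names are illustrative.

```python
packet_bits = 104 * 8            # packet size in bits
capacity = 1536000.0             # bit/s

per_packet_ms = 1000 * packet_bits / capacity     # time to transmit one queued packet

def max_queueing_delay_ms(buffers):
    """Worst-case wait when all buffers ahead of a packet are full."""
    return buffers * per_packet_ms

hops = 15
budget_ms = 400.0                # upper end of the range that can still be acceptable
per_hop_ms = budget_ms / 2 / hops                 # half the budget spent on queueing
buffers_per_hop = int(per_hop_ms // per_packet_ms)

print(round(max_queueing_delay_ms(100), 1), round(per_hop_ms, 1), buffers_per_hop)
```

The same helper gives the maximum queueing delay for any of the buffer sizes studied later, for example `max_queueing_delay_ms(40)` for a 40 packet buffer.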
V. NS SIMULATION
We used ns-2 [5], a packet level simulator, to verify
the MMPP model. Figure 1 shows the topology used in
the simulations and Figure 7 the Tcl code that is used to
start “agents”. They are constant rate sources, denoted by
“CBR/UDP”. Traffic/Expoo generates traffic based on an
exponential on/off distribution with the parameters specified
in the four lines that follow it. Each CBR source i uses a
different random number seed, hence the sources will start
independently of each other.
The simulation should run long enough for the system
to reach steady state. Ideally the system should be run for
an infinite amount of time, but this is not practical
due to time and resource constraints. A reasonable trade-off
is to use a duration of 1000 seconds in both the
simulation and the lab experiments. With a packet
interval of 16 ms and a source that is on about 35 % of the time,
1000 seconds generates about 22000 packets per source, that is
1.32 million packets for 60 sources or 1.76 million for 80
sources.
VI. LAB NETWORK MEASUREMENTS
A. Topology
Figure 8 shows the experimental setup. A single
machine acts as a traffic generator and emulates several
IP Telephony ’calls’ multiplexed together. The traffic is
then sent on a shared 100 Mb/s Ethernet and received by
two hosts: (1) a machine configured as a router; (2) a sink
machine for measurement purposes. An outgoing link of
the router is connected to the sink. In this configuration the
traffic is emitted by the generator, passes through the router
and is received by the sink. Since the sink can observe the
packets before they enter the router, it can directly compare
latency and loss of each individual packet. The outgoing
link of the router is constrained to 1.536 Mb/s using Dummynet [12],
which is explained in the next section. All the
machines in the experiment were running FreeBSD 3.4.
Fig. 8. Topology for Laboratory. The outgoing interface of the
router is also connected to the sink. (Labels: traffic generator with 60-80 sources, hub, router with dummynet queue, sink; 100 Mbit/s and 1.536 Mbit/s links; interfaces fxp0, fxp1, fxp2; SICS net.)
B. Dummynet
Dummynet is a link emulator which allows arbitrary
bandwidths and latencies to be specified. It is often used
for emulating a slower link than what is physically avail-
able. Buffer sizes can be set for a given link and loss rates
set to emulate the effect of lossy links. It is possible to
create the illusion for TCP/UDP and IP that the link is like
a WAN rather than a LAN. We are primarily interested
in the lower bandwidth and configurable queue sizes. We
modified the output functionality slightly to enable simpler
calculation of the total number of packets received as well
as the drop rate. Recording the total number of packets re-
ceived gives us an additional check if the traffic generator
or any system component lost/dropped packets during the
experiment.
The total number of sent packets remained the same for
a given source count and can be checked with the output
of the traffic generator. It is trivial with a script to divide
the loss by the total number of packets to obtain the loss
rate.
C. Packet capture
To verify the loss rate we gathered the packets on the
sink machine via a program that we developed¹ using the
Berkeley Packet Filter [9]. Figure 8 shows that the output
of the generator is attached directly to the sink machine
as well as the outgoing link of the router. This enables us
to capture both all the packets sent and the ones not dropped by the
router. A simple difference between the two should verify
the loss rate reported by Dummynet. Our bpf program
captures packets with a specific destination and port, and
prints the time of arrival, RTP src and seq fields.
¹ Not tcpdump. We wrote our own kernel filter to extract the packets
we wanted as well as a user space program to output headers from
2 interfaces simultaneously.
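The "simple difference" between the two captures can be sketched as a set difference over packet identifiers. The example below is not the authors' capture program; it assumes that each captured packet has already been reduced to an (RTP source, sequence number) pair, and the sample data is hypothetical.

```python
def loss_rate(sent, received):
    """Fraction of packets seen before the router that never appear after it.

    Each packet is identified by an (rtp_source, sequence_number) pair.
    """
    sent_ids = set(sent)
    lost = sent_ids - set(received)
    return len(lost) / len(sent_ids)

# Hypothetical capture output: sources 0 and 1 each send sequence numbers 0..9.
before_router = [(src, seq) for src in (0, 1) for seq in range(10)]
after_router = [p for p in before_router if p not in {(0, 3), (1, 7)}]

print(loss_rate(before_router, after_router))
```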
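#include <math.h>    /* log() */
#include <stdlib.h>  /* random() */

/* 1 / 2^31: scales the output of random() into [0, 1) */
#define INVERSE_M ((double) 4.6566128200e-10)

/* Draw an exponentially distributed length with mean burstlen,
   rounded to the nearest integer. */
int calc_length(double burstlen) {
    double u, logvalue;
    do {
        u = INVERSE_M * random();
    } while (u <= 0.0);  /* avoid log(0) */
    logvalue = burstlen * -log(u);
    return ((int)(logvalue + 0.5));
}
Fig. 9. C code to “randomize” a burst length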
D. Traffic generator
The idea of the traffic generator is to create a sequence
of packets that resemble many individual IP telephony
calls multiplexed together. Furthermore, it should perform
this job as accurately as possible with each packet emerg-
ing with a given deadline.
D.1 Trace file generation and playback
In order to be able to repeat experiments, we first pre-
calculate the sending times of the packets and generate
trace files. These files are then fed into the traffic generator
which sends packets according to the trace. The trace files
also allow us to test our setup to see if packets were be-
ing generated at the right times (such as inter-arrival times
and sequence). The files are generated on a per source basis.
The average length of a burst is calculated as shown in
Equation 2.
$$\text{burst length} = \mathrm{rand}\left(\frac{P_{on}}{\text{interval}}\right) \qquad (2)$$
The C-code for the rand function is shown in Figure 9.
Using the logarithm of the random variable generates
burst lengths which are exponentially distributed.
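The idea behind Equation 2 and Figure 9 can be illustrated with a small Python analogue that builds the on/off column of one source. This is a sketch and not the authors' generator: the function names, the seed and the choice to start in the ON state are assumptions, and the mean lengths are expressed in packets by dividing the mean ON and OFF times by the 16 ms interval.

```python
import math
import random

def calc_length(mean_len, rng):
    """Exponentially distributed length with the given mean, rounded to whole packets."""
    u = 1.0 - rng.random()                 # in (0, 1], so log(u) is always defined
    return int(mean_len * -math.log(u) + 0.5)

def source_column(steps, mean_on, mean_off, rng):
    """On/off flags (1 = send a packet) for one source, one flag per timestep."""
    column, on = [], True
    while len(column) < steps:
        length = calc_length(mean_on if on else mean_off, rng)
        column.extend([1 if on else 0] * length)
        on = not on
    return column[:steps]

rng = random.Random(3)
interval = 0.016
column = source_column(100000, 0.352 / interval, 0.650 / interval, rng)
print(sum(column) / len(column))           # fraction of timesteps in which the source sends
```

Note that rounding to the nearest integer can produce a burst of length zero, which a real generator may want to treat specially.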
The same calculation is applied for the idle period (with $P_{off}$).
The result is (reading vertically for each source) an
exponentially distributed series of ON and OFF sequences
with a mean ON of 0.352 seconds and a mean OFF of 0.65 seconds,
which results in a mean burst length of 22 packets. An example
of a trace file² with ten sources is shown in Figure 10.
The file shows for each time step (in this case 16 ms)
which of the 10 sources are on or off. In the example,
sources 1, 3, 5, 7 and 9 send packets in the first time step.
The traces of one source can be followed by reading a col-
umn downwards. Source 2, for example, sends no packet
in the first timestep, but then sends a packet in each of the
succeeding steps.
            source
        0 1 2 3 4 5 6 7 8 9
time 0  0 1 0 1 0 1 0 1 0 1
     1  0 1 1 0 1 1 0 1 1 1
     2  1 1 1 1 1 1 0 1 0 1
     3  1 0 1 0 1 0 1 0 0 1
     4  1 0 1 0 1 0 1 0 0 1
     5  0 1 1 0 1 1 0 1 1 0
Fig. 10. Traffic generator trace file.
² Actually it is converted into a binary format for more compact representation.
If there are n sources, each timestep is further subdivided
into n sub steps. Each sub step defines the sending
interval for each source. For example, with ten sources and
a time step of 16 ms starting at t, source 0 sends its packet
within [t, t + 1.6] ms; source 1 sends within [t + 1.6, t + 3.2] ms,
etc. If a source does not send its packet within its interval,
it is said to miss its deadline. Packets that miss their deadline
are recorded by the generator and printed when the
run has completed, together with the largest value by which a
packet was delayed.
Fig. 11. Traffic generator sending times (timesteps of 16 ms, each divided into source intervals 0 to 9).
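The sub step scheme can be expressed compactly. The following sketch computes the sending window of a source and the amount by which a late packet misses its deadline; the function names and the example send time are hypothetical, and times are in milliseconds relative to the start of the trace.

```python
def send_window(timestep_index, source, n_sources, step_ms=16.0):
    """Return (start, end) in ms of the sub step in which a source should send."""
    sub = step_ms / n_sources
    start = timestep_index * step_ms + source * sub
    return start, start + sub

def missed_deadline(actual_ms, timestep_index, source, n_sources, step_ms=16.0):
    """Lateness in ms if the packet left after its window closed, else 0.0."""
    _, end = send_window(timestep_index, source, n_sources, step_ms)
    return max(0.0, actual_ms - end)

print(send_window(0, 0, 10))
print(send_window(0, 1, 10))
print(missed_deadline(3.5, 0, 1, 10))
```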
For the trace file above, the first steps of a packet sequence
are shown in Figure 11. The sending of each packet
is depicted as a horizontal interval, corresponding to the
entering and leaving of the send system call, respectively.
In the picture, the packets of source 5 and 7 missed their
deadlines. The actual sending time on the link can be mea-
sured by an external mechanism, such as the packet cap-
ture program described previously.
D.2 Traffic generator verification
As a simple test for a trace file of 220000 packets, we
obtained values of 36.9 % for the on time and 63.1 % for the off
time by simply counting the ones and zeros in one column
of the file. The mean number of packets in a burst equalled
22.5. Using the trace files turned out to be more useful than
we first expected. Besides the performance gains of replaying
pre-calculated files, they also allowed us to test the performance
of our traffic generator (setting all the sources
on), cross check parameters as just stated, and generate
special sequences for analysing queue behaviour.
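The column check described above amounts to counting ones and measuring run lengths. The sketch below shows one way to do this for a single source column; it is an illustration and not the authors' script, and the sample column is the source 2 column from Figure 10 with one hypothetical idle step appended.

```python
from itertools import groupby

def column_stats(column):
    """Return (on_fraction, mean_burst_length) for one source column of 0/1 flags."""
    on_fraction = sum(column) / len(column)
    bursts = [len(list(run)) for flag, run in groupby(column) if flag == 1]
    mean_burst = sum(bursts) / len(bursts) if bursts else 0.0
    return on_fraction, mean_burst

# Source 2 from the example trace in Figure 10, extended by a hypothetical idle step.
column = [0, 1, 1, 1, 1, 1, 0]
print(column_stats(column))
```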
D.3 Traffic generator verification
We calculated the index of dispersion of intervals, or
IDI (see Section III-D), also for the lab traffic generator.
In Figure 12 we can see that the simulation and lab traffic generator
produce similar types of traffic. The larger the observation
time, the more the traffic deviates from Poisson traffic. The IDI of one voice source
is equal to about 18.1; a value is also given for Poisson
sources. The graphs show the result of a trace which was
10000 simulated seconds, resulting in 17.3 million packets
for the lab and 16.3 million for the simulation.
Fig. 12. IDI curves for superposition of 75 sources (IDI versus number of consecutive intervals k, log10 axis from 1 to 1000; curves for lab, simulation and Poisson).
The traffic generator was also tested to ensure that it (and the
machine on which we run it) was capable of outputting
packets as close to their deadlines as possible.
VII. RESULTS
In this section we present and discuss the results from
the MMPP model, the NS simulations and the measure-
ment in the lab network. Recall from Section IV that in all
three cases we used the 32 kb/s ADPCM voice encoding
with 16 ms packetization. This results in 64 bytes of voice
payload in each packet and a total packet size of 104 bytes
including the RTP, UDP and IP protocol headers.
Figures 13 and 14 show the packet loss probability as
a function of the number of buffers, with a logarithmic y-axis. We
can see in these graphs that both the MMPP model results
and the NS simulations in general compare well with the
measurements in the lab. The exception is for very small
buffer sizes and when the loss rate is small.
The MMPP model is most of the time closer to the lab
measurements than the NS simulations are, which is an in-
teresting result. The ns simulations consistently show the
lowest loss rates for more than 7–8 buffers. We analysed
the output from the traffic generators in NS and in the lab
to try to come up with an explanation. We found that there
is a small difference in mean total rate between the two
that can explain the difference in loss rate.
The second set of graphs, presented in Figures 15 to
18, plots the packet loss probability as a function of the
number of voice sources for four different buffer lengths
measured in packets (3, 5, 10 and 40). These buffer lengths correspond to
a maximum queueing delay of 1.6, 2.7, 5.4 and 21.7 ms,
respectively. We immediately see that the relationship between
the number of sources and loss rate is close to linear
for few buffers, but far from linear for many buffers. Visual
observation suggests an exponential relationship. In
the region above 10 buffers, the lab measurements often
have the highest loss rate. Below about 10 buffers, the lab
measurements have the lowest loss rate.
Fig. 13. 65 sources for model, NS and lab (log scale). Loss probability versus number of buffers (0 to 100); curves: MMPP model, ns simulation, lab measurement.
Fig. 14. 80 sources for model, NS and lab (log scale). Same axes and curves as Fig. 13.
Fig. 15. Loss probability Vs buffers (3). Loss probability versus number of sources (60 to 80); curves: MMPP model, ns simulation, lab measurement.
Fig. 16. Loss probability Vs buffers (5). Same axes and curves as Fig. 15.
Fig. 17. Loss probability Vs buffers (10). Same axes and curves as Fig. 15.
Fig. 18. Loss probability Vs buffers (40). Same axes and curves as Fig. 15.
One interesting detail is that for very small buffers, the
loss curve obtained in the NS simulation is shifted one
buffer to the right in the plots. Even though we have gone
to great lengths in ensuring that the three environments
have identical properties, there are nevertheless subtle dif-
ferences that can explain discrepancies like this.
One obvious difference in the models we used is that
the bandwidth offered by Dummynet is not exactly the
same as in ns. Using netperf we found there to be about a
3% difference between what netperf and dummynet report
as their measured and configured bandwidths respectively.
Perhaps more subtle is the amount of
buffering in the system. In NS we simply state the buffer
size in packets (between 2 and 100). In a real system this
is much harder to calculate, as buffers exist in many places
in the system, for example in the queue between the Ethernet
driver and the ip_input() routine on the input side.
Ethernet cards can also buffer packets on the output side.
This is the default configuration, as most Ethernet cards are
used on host systems where this is not an issue. Nevertheless,
the buffering in a real system is probably larger than in
the simulation and may account for differences in the
systems under comparison.
VIII. CONCLUSIONS AND FUTURE WORK
We have studied the packet loss behaviour when a num-
ber of homogeneous voice sources are multiplexed onto a
bottleneck link. The goal is to find an accurate mathemat-
ical model which can be used to dimension the link.
We have implemented a mathematical model based on
a Markov modulated Poisson process (MMPP) in Matlab.
The model was compared with both simulations using NS
and measurements in a lab environment. The comparison
shows that the model in general predicts the loss rate well.
The exceptions are for small loss rates in some cases. An
interesting result is that most of the time the model predicts
the loss rate better than the simulations in ns.
This result once more proves that the only way to reli-
ably verify a model is to make measurements of a real sys-
tem. We found that the relationship between the load and
loss rate is close to linear for few buffers (around three),
but looks exponential for many (10 and above) buffers.
The general conclusion is that the MMPP-based model
is well suited for predicting loss rates for superpositioned
voice sources in a system with limited buffer space. The
mathematical model is an important tool for conveniently
dimensioning network links. The lab environment is constrained
by physical limits as well as finite resources, whereas
the model clearly is not. Running a lab experiment consumes
resources and time: a lab experiment takes on average
12 hours to complete. For each number of sources
and each buffer size the experiment is re-started, which is
one reason why we use Dummynet: we can change the
buffer sizes without re-booting the router. The simulation
typically takes 2 hours, whereas the model consumes only
about 10 minutes as well as considerably fewer physical resources³.
There are a number of further work items that we are
currently addressing. The maximum delay is bounded by
the buffer length in the system studied in this paper, but
what is the resulting mean delay? We are experimenting
with higher bandwidth links. One challenge is to accu-
rately generate enough sources. The next step is to mea-
sure a system which has multiple traffic classes in the style
of diffserv [11]. How do different queue scheduling algorithms
affect the dimensioning of traffic classes? Can
the MMPP model presented in this paper be used to de-
scribe the loss and delay properties of a traffic class?
The ongoing work can be found at a web page⁴ which
has information about current experiments as well as data
which was not directly relevant for this paper.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the indispensable
work of Henrik Abrahamsson in helping us calculate
the IDI values presented in this paper. We would
like to thank Thiemo Voigt for his help in setting up Dummynet
and adding extra debug statements to make our loss
calculations considerably simpler. Finally we would like
to thank Telia AB for their financial support in the early
phases of this work.
³ These values were derived from an Athlon 600 MHz PC with
FreeBSD, a Fast SCSI-3 disk and 128 MB of RAM.
⁴ http://www.sics.se/~ianm/Experiment/experiment.html
REFERENCES
[1] Anders Andersson. Capacity study of statistical multiplexing for
IP telephony. Technical Report T2000:03, SICS – Swedish Institute
of Computer Science, January 2000.
[2] D. Anick, Debasis Mitra, and M. M. Sondhi. Stochastic theory of
a data-handling system with multiple sources. Bell System Technical
Journal, 61(8):1871–1894, October 1982.
[3] Andrea Baiocchi, Nicola Blefari-Melazzi, and Aldo Roveri.
Buffer dimensioning criteria for an ATM multiplexer loaded with
homogeneous on-off sources. In J. W. Cohen and Charles D.
Pack, editors, Queueing, Performance and Control in ATM —
Proceedings of the Workshop at the 13th International Teletraf-
fic Congress (ITC), pages 13–18, Copenhagen, Denmark, June
1991. North-Holland. Volume 15 of the North Holland Studies in
Telecommunication.
[4] Andrea Baiocchi, Nicola Blefari Melazzi, Marco Listanti, Aldo
Roveri, and Roberto Winkler. Loss performance analysis of an
ATM multiplexer loaded with high-speed on-off sources. IEEE
Journal on Selected Areas in Communications, 9(3):388–393,
April 1991.
[5] Kevin Fall and Kannan Varadhan. ns: Notes and documentation.
Technical report, Berkeley University, 1998.
[6] Allan Gut. An Intermediate Course in Probability. Springer-
Verlag, New York, 1995.
[7] Harry Heffes and David M. Lucantoni. A Markov modulated
characterization of packetized voice and data traffic and related
statistical multiplexer performance. IEEE Journal on Selected
Areas in Communications, SAC-4(6):856–867, September 1986.
[8] International Telecommunication Union (ITU). Transmission
systems and media, general recommendation on the transmission
quality for an entire international telephone connection; one-way
transmission time. Recommendation G.114, Telecommunica-
tion Standardization Sector of ITU, Geneva, Switzerland, March
1993.
[9] Steven McCanne and Van Jacobson. A BSD packet filter: A new
architecture for user-level packet capture. In Proc. of Usenix Win-
ter Conference, pages 259–269, San Diego, California, January
1993. Usenix.
[10] Ramesh Nagarajan, James F. Kurose, and Don Towsley. Approx-
imation techniques for computing packet loss in finite-buffered
voice multiplexers. IEEE Journal on Selected Areas in Commu-
nications, 9(3):368–377, April 1991.
[11] Kathie Nichols, Van Jacobson, and Lixia Zhang. A two-bit differ-
entiated services architecture for the internet. Internet draft, Bay
Networks, LBNL and UCLA, November 1997.
[12] L. Rizzo. Dummynet: A simple approach to the evaluation of net-
work protocols. Computer Communications Review, 27(1):31–
41, January 1997.
[13] Kotikalapudi Sriram and Ward Whitt. Characterizing super-
position arrival processes in packet multiplexers for voice and
data. IEEE Journal on Selected Areas in Communications, SAC-
4(6):833–846, September 1986.
[14] Zheng Sun. Capacity study of statistical multiplexing for IP telephony.
Technical Report LiTH-MAT-EX-98-12, Department of
Mathematics, Linköping University, December 1998.
[15] Roger C. F. Tucker. Accurate method for analysis of a packet-
speech multiplexer with limited delay. IEEE Transactions on
Communications, COM-36(4):479–483, April 1988.
Paper B
Ingemar Kaj and Ian Marsh. Modelling the Arrival Process for Packet
Audio. In Quality of Service in Multiservice IP Networks, pages 35-49,
Milan, Italy, February 2003.
© Springer-Verlag 2003
Reprinted with permission.
Modelling the Arrival Process for Packet Audio
Ingemar Kaj¹ and Ian Marsh²
¹ Dept. of Mathematics, Uppsala University, Sweden
[email protected]
² Ian Marsh, CNA Lab, SICS, Sweden
[email protected]
Abstract. Packets in an audio stream can be distorted relative to one another
during the traversal of a packet switched network. This distortion can be mainly
attributed to queues in routers between the source and the destination. The queues
can consist of packets either from our own flow, or from other flows. The con-
tribution of this work is a Markov model for the time delay variation of packet
audio in this scenario. Our model is extensible, and we show this by including sender
silence suppression and packet loss into the model. By comparing the model to
wide area traffic traces we show the possibility to generate an audio arrival pro-
cess similar to those created by real conditions. This is done by comparing the
probability density functions of our model to the real captured data.
Keywords: Packet delay, VoIP, Markov chain, Steady state
1 Introduction
The problem we address is modelling the arrival process of audio packets that have passed
through a series of routers. Figure 1 illustrates this situation. Packets containing
audio samples are sent at a constant rate from a sender, shown in step one. The
spacing between packets is compressed and elongated relative to each other. This is
due to the buffering in intermediate routers and mixing with cross-traffic, shown in step
two. In order to replay the packets with their original spacing, a buffer is introduced at
the receiver, commonly referred to as a jitter buffer, shown in step three. The objective
of the buffer is to absorb the variance in the inter-packet spacing introduced by the
delays due to cross traffic, and (potentially) its own data. In step four, using information
coded into the header of each packet, the packets are replayed with their original timing
restored.
[Figure 1: a sender and a receiver connected through two routers carrying cross traffic. Steps: 1. Original spacing; 2. Distortion due to queueing; 3. Jitter buffer; 4. Restored spacing.]
Fig. 1. The network's effect on packet audio spacing
The motivation for this work derives from the difficulty of using known arrival
processes to approximate the packet arrival process at the receiver. Using a known arrival
process, even a complex one, is not always realistic, as the model does not include
characteristics that real audio streams experience, for example the use of silence
suppression or the delay and jitter contribution of cross traffic. One alternative is to use
real traffic traces. Although they produce accurate and representative arrival processes,
they are inherently static and offer little flexibility. For example, the effect of different
packet sizes cannot be observed without re-running the experiments. When testing
the performance of jitter buffer playout algorithms this inflexibility is
undesirable. Thus, an important contribution of this paper is to address the deficiencies
of these approaches by combining the advantages of a model of the process with
data from real traces.
This paper presents, in a descriptive manner, a packet delay model based on the main
assumption that packets are subjected to independent transmission delays. It is intended
that readers not completely familiar with Markovian theory can follow the description.
We assume no prior knowledge of the model, as it is built from first principles starting in
section 2, where we also give results for the mean arrival and interarrival times of audio
packets. We add silence suppression to the model in section 3 and packet loss in
section 4. Real data is incorporated in section 5, related work follows in section 6,
and we round off with some conclusions in section 7.
2 The packet delay model
There are two causes of delay for packet audio streams. Firstly, there is the delay caused by
our own traffic, i.e. packets queued up behind ones from the same flow, which we refer
to as the sequential delay. Secondly, there is the delay contributed by cross traffic, usually
TCP Web traffic, which we call the transmission delay in this paper. It is important to state
that we consider these two delays as separate, but study their combined effect on the
observed delays and interarrivals. Propagation and scheduling delay are not modelled
as part of this work.
In this model packets are transmitted periodically using a packetisation time of 20
milliseconds. For convenience, the packetisation interval is used as the time unit for the
model. Saying that a packet is sent at time k signifies that this particular packet is sent
at clock time 20k ms into the data stream. The first packet is sent at time 0.
We begin with the transmission delay of a packet. Suppose that packet k could be
sent isolated from the rest of the audio stream and let
$$Y_k = \text{transmission delay of packet } k \text{ (no. of 20 ms periods)}.$$
To see the impact of the sequential delay, let
$$T_k = \text{the arrival time of packet } k \text{ at the jitter buffer}, \quad k \ge 1.$$
The model used in this paper is shown in Figure 2. The figure shows packets being
transmitted from a sender at regular intervals. They traverse the network, where, as stated,
their original spacing is distorted. Packet k arrives at time $T_k$ at the receiver. The
difference in time between when it departed and arrived we call the observed delay, which
we denote
$$V_k = \text{arrival time} - \text{departure time} = T_k - k + 1, \quad k \ge 1.$$
[Figure 2: time lines for transmitted packets at the sender and received packets at the receiver, marking $T_k$, $T_{k+1}$, $V_k$ and $U_k$.]
Fig. 2. $T_k$ arrival times before playout, $V_k$ observed delays, $U_k$ observed interarrival times
The time when the next packet (numbered k + 1) arrives is $T_{k+1}$, and so the observed
interarrival times are obtained as the differences between $T_{k+1}$ and $T_k$, denoted
$$U_k = T_{k+1} - T_k.$$
A packet k, sent at time k − 1, requires time $Y_k$ to propagate through the network and
therefore arrives at $T_k = k - 1 + Y_k$, as long as it is not delayed further by other audio
packets (which we call sequential delays). It may, however, catch up with audio packets
transmitted earlier (packets 1 to k − 1). Such a packet is forced to wait before being stored
in the playout buffer. This shows that the actual arrival times satisfy
$$T_1 = Y_1, \qquad T_k = \max(T_{k-1},\; k - 1 + Y_k), \quad k \ge 2. \tag{1}$$
Since $T_{k-1}$ and $Y_k$ are independent, we conclude from relation (1) that
$T_k$ forms a transient Markov chain. Moreover, the interarrival times satisfy
$$U_k = T_k - T_{k-1} = \max(0,\; k - 1 + Y_k - T_{k-1}), \quad k \ge 2. \tag{2}$$
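As an illustration of relations (1) and (2), the following minimal Python sketch computes arrival times, interarrival times and observed delays from a given sequence of transmission delays. The function name `simulate_arrivals` is our own choice for this sketch, and the delays are expressed in units of 20 ms as in the model.

```python
def simulate_arrivals(delays):
    """Return (T, U, V) for transmission delays Y_1..Y_n (units of 20 ms).

    T[k-1] is the arrival time of packet k from recursion (1),
    U holds the interarrival times T_k - T_{k-1} of relation (2), k >= 2,
    V[k-1] is the observed delay T_k - (k - 1).
    """
    T, U, V = [], [], []
    for k, y in enumerate(delays, start=1):
        candidate = (k - 1) + y  # arrival time if there were no sequential delay
        t = candidate if k == 1 else max(T[-1], candidate)
        if k >= 2:
            U.append(t - T[-1])
        T.append(t)
        V.append(t - (k - 1))
    return T, U, V


# Packet 2 has a short delay and catches up with packet 1,
# so it arrives back-to-back (interarrival time zero).
T, U, V = simulate_arrivals([2.0, 0.5, 3.0])
print(T, U, V)
```

In this small example the second packet would arrive at time 1.5 in isolation, but it is held behind the first packet and arrives at time 2.0, giving an interarrival time of zero and an observed delay of 1.0.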
[Figure 3: histogram; x-axis: interarrival times, ms (0 to 120); y-axis: frequency (0 to 250).]
Fig. 3. Histogram of the interarrival times ($U_k$)
The arrival times $(T_k)$, interarrival times $(U_k)$ and observed delays $(V_k)$ can be easily
observed from traffic traces. As an example, Figure 3 shows the histogram for an
empirical sequence of interarrival times. The data is from a recording of a Voice over
IP session between Argentina and Sweden; more details of the traffic traces are given
in section 5.1. The transmission delay sequence $(Y_k)$, on the other hand, should be
considered non-observable. The approach in this study is to consider $(Y_k)$ as having a
general (unknown) distribution and to investigate the resulting properties of the observed
delays $(V_k)$ and interarrival times $(U_k)$. Since the latter sequences can be empirically
observed, this leads to the question of whether the transmission delay distribution can be
reconstructed using statistical inference. In this direction we will indicate some
methods that could be used to compare the theoretical results with the gathered empirical
data.
To carry out the study, we assume from this point that the sequence $(Y_k)$ is independent
and identically distributed, with distribution function
$$F(x) = P(Y_k \le x), \quad k \ge 1,$$
and finite mean transmission delay $\nu = \int_0^\infty (1 - F(x))\,dx < \infty$. For the data in our
study, typical values of ν are 20-40, i.e. 400-800 ms. We consider these assumptions
justified for the purpose of studying a reference model, although it would obviously be
desirable to allow dependence over time.
2.1 Mean arrival and interarrival times
It is intuitively clear that in the long run $E(U_k) \approx 1$, as on average packets arrive with
20 ms spacing, which we will now verify for the model. The representation (1) for $T_k$
can be written
$$T_k = \max(Y_1,\; 1 + Y_2,\; \ldots,\; k - 1 + Y_k), \quad k \ge 1,$$
which gives the alternative representation
$$T_k = \max(Y_1,\; 1 + T'_{k-1}), \quad k \ge 2, \tag{3}$$
where on the right side
$$T'_{k-1} = \max(Y_2,\; 1 + Y_3,\; \ldots,\; k - 2 + Y_k)$$
has the same marginal distribution as $T_{k-1}$ but is independent of $Y_1$. From (3) it follows
that we can write $\{T_k > t\}$ as a union of two disjoint events,
$$\{T_k > t\} = \{1 + T'_{k-1} > t\} \cup \{Y_1 > t,\; 1 + T'_{k-1} \le t\}.$$
Hence, using the independence of $T'_{k-1}$ and $Y_1$,
$$\begin{aligned} P(T_k > t) &= P(1 + T'_{k-1} > t) + P(Y_1 > t,\; 1 + T'_{k-1} \le t) \\ &= P(1 + T_{k-1} > t) + P(Y_1 > t)\,P(1 + T_{k-1} \le t), \end{aligned}$$
and so
$$\begin{aligned} E(T_k) &= \int_0^\infty P(T_k > t)\,dt \\ &= E(1 + T_{k-1}) + \int_1^\infty P(Y_1 > t)\,P(T_{k-1} \le t - 1)\,dt. \end{aligned} \tag{4}$$
Therefore
$$E(U_k) = 1 + \int_1^\infty P(Y_1 > t)\,P(T_{k-1} \le t - 1)\,dt \to 1, \quad k \to \infty \tag{5}$$
(since $\nu = \int_0^\infty P(Y_1 > t)\,dt < \infty$ and $T_k \to \infty$, the dominated convergence theorem
applies, forcing the integral to vanish in the limit).
A further consequence of (4) is obtained by iteration,
$$E(T_k) = k - 1 + E(Y_1) + \int_1^\infty P(Y_1 > t) \sum_{i=1}^{k-1} P(T_i \le t - 1)\,dt.$$
If we introduce
$$N(t) = \text{the number of arriving packets in the time interval } (0, t],$$
so that $\{N(t) \ge n\} = \{T_n \le t\}$, this can be written
$$E(V_k) = E(Y_1) + \int_1^\infty P(Y_1 > t) \sum_{i=1}^{k-1} P(N(t - 1) \ge i)\,dt, \tag{6}$$
which, as $k \to \infty$, gives an asymptotic representation for the average observed delay as
$$E(V_k) \to \nu + \int_1^\infty P(Y_1 > t)\,E(N(t - 1))\,dt. \tag{7}$$
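The convergence in (5) can be checked numerically. The sketch below evaluates the integral in (5) with the trapezoidal rule, using the fact that, by (1) and independence, $P(T_{k-1} \le t - 1)$ is the product of $F(t - 1 - i)$ over $i = 0, \ldots, k - 2$. The exponential delay distribution, the function name `expected_interarrival` and the integration limits are assumptions made only for this illustration.

```python
import math


def expected_interarrival(k, nu, t_max=100.0, steps=20000):
    """Evaluate (5): E(U_k) = 1 + int_1^inf P(Y > t) P(T_{k-1} <= t - 1) dt, k >= 2.

    Assumes exponential transmission delays with mean nu (units of 20 ms).
    The integral is truncated at t_max and computed with the trapezoidal rule.
    """
    def F(x):
        return 1.0 - math.exp(-x / nu) if x > 0 else 0.0

    def integrand(t):
        prod = 1.0
        for i in range(k - 1):  # P(T_{k-1} <= t - 1) = prod_{i=0}^{k-2} F(t - 1 - i)
            prod *= F(t - 1 - i)
            if prod == 0.0:
                return 0.0
        return (1.0 - F(t)) * prod

    h = (t_max - 1.0) / steps
    total = 0.5 * (integrand(1.0) + integrand(t_max))
    total += sum(integrand(1.0 + j * h) for j in range(1, steps))
    return 1.0 + h * total


for k in (2, 5, 20):
    print(k, expected_interarrival(k, nu=2.0))
```

For a mean delay of two time units the excess over 1 shrinks quickly with k, since the integrand vanishes for t < k − 1 and is bounded there by the exponential tail.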
2.2 Steady state distributions
By (1),
$$P(T_k \le x) = \prod_{i=1}^{k} P(i + Y_i \le x + 1) = \prod_{i=0}^{k-1} F(x - i),$$
and therefore the sequence $(V_k)$, which we defined by $V_k = T_k - k + 1$, $k \ge 1$, satisfies
$$P(V_k \le x) = \prod_{i=0}^{k-1} F(x + k - 1 - i) = \prod_{i=0}^{k-1} F(x + i), \quad x \ge 0.$$
This shows that $(V_k)$ is a Markov chain with state space the positive real line and
asymptotic distribution given by
$$P(V_\infty \le x) = \prod_{i=0}^{\infty} F(x + i), \quad x \ge 0. \tag{8}$$
Furthermore, for $x \ge 0$,
$$\begin{aligned} P(U_k \ge x) &= P(k - 1 + Y_k - T_{k-1} \ge x) = P(V_{k-1} \le Y_k + 1 - x) \\ &= \int_0^\infty P(V_{k-1} \le y + 1 - x)\,dF(y), \end{aligned}$$
where in the step of conditioning over $Y_k$ we use the independence of $Y_k$ and $V_{k-1}$.
Therefore the sequence $(U_k)$ has the asymptotic distribution
$$P(U_\infty \le x) = 1 - \int_0^\infty \prod_{i=1}^{\infty} F(y - x + i)\,dF(y), \quad x \ge 0, \tag{9}$$
in particular a point mass in zero of size
$$P(U_\infty = 0) = 1 - \int_0^\infty \prod_{i=1}^{\infty} F(y + i)\,dF(y). \tag{10}$$
This distribution has the property that $E(U_\infty) = 1$ for any given distribution F of Y
with $\nu = E(Y) < \infty$. In fact, this follows from (5) under a slightly stronger assumption
on Y (uniform integrability), but can also be verified directly by integrating (9). Figure
4 shows numerical approximations of the (non-normalised) density function
$\frac{d}{dx} P(U_\infty \le x)$ of (9) for three choices of F. All three distributions show a
characteristic peak close to time 1, corresponding to the bulk of packets arriving with more
or less the correct spacing of 20 ms. A fraction of the probability mass is fixed at x = 0 in
accordance with (10), but is not shown explicitly in the figure. These features of the density
functions can be compared with the shape of the histogram in Figure 3 with its peak at the
20 ms spacing. Also, close to the origin is a small peak which corresponds to packets
arriving back-to-back, usually as a burst, probably due to a delayed packet ahead of them. In
Figure 4 the density function with the highest peak close to 1 time unit corresponds to a
Gaussian distribution with arbitrarily selected parameters mean 5 and variance 0.2. Of the two
exponential distributions, the one with the higher variance (Exp(3)) has a lower peak
and more mass at zero compared with the exponential with smaller variance (Exp(2)).
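The steady state formulas (8) and (10) are straightforward to evaluate numerically. The following sketch truncates the infinite product once its factors are indistinguishable from 1 and integrates (10) with the trapezoidal rule. The helper names, the truncation tolerance and the reading of Exp(2) and Exp(3) as exponential distributions with means 2 and 3 time units are assumptions made for this illustration.

```python
import math


def exponential(mean):
    """CDF and density of an exponential delay with the given mean (assumed example)."""
    def cdf(x):
        return 1.0 - math.exp(-x / mean) if x > 0 else 0.0

    def pdf(y):
        return math.exp(-y / mean) / mean

    return cdf, pdf


def steady_state_delay_cdf(x, F, tol=1e-12, max_terms=100000):
    """P(V_inf <= x) = prod_{i>=0} F(x + i), equation (8), truncated product."""
    prod = 1.0
    for i in range(max_terms):
        f = F(x + i)
        prod *= f
        if prod == 0.0 or 1.0 - f < tol:
            break
    return prod


def zero_interarrival_mass(F, density, y_max=60.0, steps=6000):
    """P(U_inf = 0) from equation (10), writing dF(y) = density(y) dy."""
    def g(y):
        # prod_{i>=1} F(y + i) equals P(V_inf <= y + 1)
        return steady_state_delay_cdf(y + 1.0, F) * density(y)

    h = y_max / steps
    total = 0.5 * (g(0.0) + g(y_max)) + sum(g(j * h) for j in range(1, steps))
    return 1.0 - h * total


for mean in (2.0, 3.0):
    F, f = exponential(mean)
    print(mean, zero_interarrival_mass(F, f))
```

The value returned by `zero_interarrival_mass` is the long-run fraction of packets that arrive back-to-back with their predecessor under the assumed delay distribution.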
[Figure 4: density curves; x-axis from 0 to 4 time units; y-axis from 0 to 1.5.]
Fig. 4. Density functions of U for N(5,0.2), Exp(2) and Exp(3)
3 Silence suppression mechanism
In this section we incorporate an additional source of random delays due to silence
suppression into the model. Silence suppression is employed at the sender so as not to
transmit packets when there is no speech activity. During a normal conversation this
accounts for about half of the total number of packets, considerably reducing the load
on the network. Assign to packet number k the quantity
$$X_k = \text{duration of silent period between packets } k - 1 \text{ and } k.$$
A silent period is the time interval during which the silence suppression mechanism is
in effect. We assume that the silence suppression intervals are independent of $(Y_k)_{k \ge 1}$
and are given by a sequence of independent random variables $X_1, X_2, \ldots$, such that
$$G(x) = P(X_k \le x), \quad 1 - \alpha = G(0) = P(X_k = 0) > 0, \quad \mu = E(X_k) < \infty.$$
The (small) probability $\alpha = P(X_k > 0)$ represents the case where silence suppression
is activated just after packet k − 1 is transmitted from the sender. Note that
$$S_k = \sum_{i=1}^{k} X_i = \text{total time of silence suppression affecting packet } k,$$
which implies that the delivery of packet k from the sending unit now starts at time
$k - 1 + S_k$. The representation (1) takes the form
$$T_1 = S_1 + Y_1, \qquad T_k = \max(T_{k-1},\; k - 1 + S_k + Y_k), \quad k \ge 2, \tag{11}$$
hence
$$U_k = T_k - T_{k-1} = \max(0,\; k - 1 + S_k + Y_k - T_{k-1}), \quad k \ge 2. \tag{12}$$
Similarly,
$$V_k = \text{arrival time} - \text{departure time} = T_k - k + 1 - S_k, \quad k \ge 1.$$
The alternative representation (3) is
$$T_k = X_1 + \max(Y_1,\; 1 + T'_{k-1}), \tag{13}$$
where
$$T'_{k-1} = \max(Y_2 + S_2 - X_1,\; 1 + Y_3 + S_3 - X_1,\; \ldots,\; k - 2 + Y_k + S_k - X_1)$$
has the same marginal distribution as $T_{k-1}$ but is independent of $X_1$ and $Y_1$. In analogy
with the calculation of the previous section leading up to (4), this relation gives
$$E(T_k) = E(X_1 + 1 + T_{k-1}) + \int_1^\infty P(X_1 + Y_1 > t,\; X_1 + T'_{k-1} \le t - 1)\,dt. \tag{14}$$
Exchanging the operations of integration and expectation shows that the last integral
can be written
$$E\left[\int_{1+X_1}^\infty 1\{Y_1 > t - X_1,\; T'_{k-1} \le t - X_1 - 1\}\,dt\right],$$
where we have also used that the integrand vanishes on the set $\{t \le 1 + X_1\}$. Apply
the change of variables $t \to t - X_1$ to get
$E\left[\int_1^\infty 1\{Y_1 > t,\; T'_{k-1} \le t - 1\}\,dt\right]$. Then
exchange integration and expectation again to obtain from (14) the relations
$$E(T_k) = 1 + E(X_1) + E(T_{k-1}) + \int_1^\infty P(Y_1 > t)\,P(T_{k-1} \le t - 1)\,dt$$
and
$$E(U_k) = 1 + E(X_1) + \int_1^\infty P(Y_1 > t)\,P(T_{k-1} \le t - 1)\,dt.$$
Hence with silence suppression, as $k \to \infty$,
$$E(U_k) \to 1 + \mu, \qquad E(V_k) \to \nu + \int_1^\infty P(Y_1 > t)\,E(N(t - 1))\,dt, \tag{15}$$
using the same arguments as in the simpler case of the previous section.
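The limit $E(U_k) \to 1 + \mu$ in (15) can be illustrated by simulating recursion (11). In the sketch below a silent period occurs before a packet with probability α, and both the non-zero silent periods and the transmission delays are drawn from exponential distributions. These distributions, the parameter values and the function name `simulate_with_silence` are assumptions made for the illustration only.

```python
import random


def simulate_with_silence(n, alpha, mean_silence, mean_delay, seed=0):
    """Simulate recursion (11) for n packets, all times in units of 20 ms.

    With probability alpha a silent period X_k > 0 precedes packet k.
    Returns the interarrival times U (k >= 2) and the observed delays V.
    """
    rng = random.Random(seed)
    t_prev = None
    s = 0.0  # S_k, the accumulated silence affecting packet k
    U, V = [], []
    for k in range(1, n + 1):
        x = rng.expovariate(1.0 / mean_silence) if rng.random() < alpha else 0.0
        s += x
        y = rng.expovariate(1.0 / mean_delay)
        send_time = (k - 1) + s
        t = send_time + y if t_prev is None else max(t_prev, send_time + y)
        if t_prev is not None:
            U.append(t - t_prev)
        V.append(t - send_time)
        t_prev = t
    return U, V


U, V = simulate_with_silence(100000, alpha=0.05, mean_silence=10.0, mean_delay=2.0, seed=1)
# Here mu = alpha * mean_silence = 0.5, so the sample mean should be close to 1.5.
print(sum(U) / len(U))
```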
4 Including packet loss in the model
We return to the original model without silence suppression but consider instead the
effect of lost packets. Suppose that each IP packet is subject to loss with probability p,
independently of other packet losses and of the transmission delays. Lost packets are
unaccounted for at the receiver and hence, in this section, the sequence $(T_k)$ records
arrival times of non-lost packets only. To keep track of their delivery times from the
sender introduce
$$K_k = \text{number of attempts required between successful packets } k - 1 \text{ and } k, \quad k \ge 1,$$
which gives a sequence $(K_k)_{k \ge 1}$ of independent, identically distributed random
variables with the geometric distribution
$$P(K_k = j) = (1 - p)\,p^{j-1}, \quad j \ge 1.$$
Moreover,
$$L_k = K_1 + \ldots + K_k = \text{number of attempts required for } k \text{ successful packets}$$
is a sequence of random variables with a negative binomial distribution. The arrival
times of packets are now given by
$$T_1 = K_1 - 1 + Y_{K_1}, \qquad T_k = \max(T_{k-1},\; L_k - 1 + Y_{L_k}), \quad k \ge 2.$$
Due to the independence we may re-index the sequence of $Y_{L_k}$'s to obtain
$$T_1 = K_1 - 1 + Y_1, \qquad T_k = \max(T_{k-1},\; L_k - 1 + Y_k), \quad k \ge 2,$$
and thus
$$T_k = K_1 - 1 + \max(Y_1,\; 1 + T'_{k-1}), \quad k \ge 2, \tag{16}$$
with $K_1$, $Y_1$ and $T'_{k-1}$ all independent, and again $T_{k-1}$ and $T'_{k-1}$ identically distributed.
This is the same relation as (13) with $X_1$ replaced by $K_1 - 1$ and hence, as in (15),
$E(U_k) \to 1 + E(K_1 - 1) = \frac{1}{1-p}$ as $k \to \infty$, which provides a simple method to
estimate packet loss based on observed interarrival times. Similarly, combining silence
suppression and packet loss,
$$E(U_k) \to 1 + E(X) + E(K_1 - 1) = \mu + \frac{1}{1 - p}, \quad k \to \infty. \tag{17}$$
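To illustrate how the limit $1/(1-p)$ can be used to estimate packet loss, the sketch below simulates independent losses on a periodic stream and recovers p from the mean interarrival time of the surviving packets. The exponential delay distribution and the function names are assumptions made for this illustration; times are in units of 20 ms.

```python
import random


def simulate_with_loss(n_sent, p_loss, mean_delay, seed=0):
    """Arrival times of the packets that survive independent loss.

    Packet j is sent at time j - 1. Survivors obey T = max(previous T, j - 1 + Y_j),
    with exponential delays Y_j assumed for this sketch.
    """
    rng = random.Random(seed)
    arrivals = []
    for j in range(1, n_sent + 1):
        if rng.random() < p_loss:
            continue  # lost packets never reach the receiver
        candidate = (j - 1) + rng.expovariate(1.0 / mean_delay)
        arrivals.append(candidate if not arrivals else max(arrivals[-1], candidate))
    return arrivals


def estimate_loss(arrivals):
    """Point estimate 1 - 1/u_bar, where u_bar is the mean interarrival time."""
    u_bar = (arrivals[-1] - arrivals[0]) / (len(arrivals) - 1)
    return 1.0 - 1.0 / u_bar


arrivals = simulate_with_loss(100000, p_loss=0.1, mean_delay=2.0, seed=7)
print(estimate_loss(arrivals))
```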
5 Incorporating Real Data
5.1 Trace data
We give a brief description of the experiments we performed in order to obtain estimates
for the parameters in the model. Pulse Code Modulated (PCM) packet audio streams
were sent from a site in Buenos Aires, Argentina to Stockholm, Sweden over a number
of weeks (see footnote 1). The streams are sent at a rate of 64 kbit/s in 160 byte payloads.
This implies the packets leave the sender with an inter-packet spacing of 20 ms. The remote
site is approximately 12,000 kilometres, 25 Internet hops and four time zones from
our receiver. The tool is capable of silence suppression, in which packets are not sent
when the speaker is silent. Without silence suppression, 3563 packets are sent during 70
seconds and with suppression 2064 are sent. We record the absolute times the packets
leave the sender and the absolute arrival times at the receiver. This gives an observed
sequence
$$v_k = \text{arrival time no } k - \text{departure time no } k$$
of the Markov chain $(V_k)$. In particular, the sample mean $\bar v$ is an estimate of the one-way
delay. Similarly,
$$u_k = \text{arrival time no } k - \text{arrival time no } (k - 1)$$
is a sample of the interarrival time sequence $(U_k)$.
Footnote 1: Available from http://www.sics.se/~ianm/COST263/cost263.html
[Figure 5: two panels plotted against packet sequence number (1700 to 1900); one panel shows interarrival times, ms (0 to 80), the other observed delays, ms (600 to 680).]
Fig. 5. Four second audio packet traces: a) delays b) interarrival times
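The quantities $v_k$ and $u_k$ defined above, and the 20 ms spacing implied by the stream parameters, can be computed as in the following sketch. The function names and the toy timestamps are hypothetical and are not taken from the trace files.

```python
def packet_interval_ms(payload_bytes, bitrate_bps):
    """Packetisation interval implied by a constant bit rate and a payload size."""
    return 1000.0 * payload_bytes * 8 / bitrate_bps


def observed_sequences(send_times_ms, arrival_times_ms):
    """Return (v, u): observed delays v_k and interarrival times u_k, in ms."""
    v = [a - s for s, a in zip(send_times_ms, arrival_times_ms)]
    u = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return v, u


print(packet_interval_ms(160, 64000))  # 160 bytes at 64 kbit/s gives 20.0 ms

# Toy timestamps (hypothetical): three packets sent 20 ms apart.
v, u = observed_sequences([0, 20, 40], [600, 625, 640])
print(v, u)
```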
A typical sequence of trace data used in this study without silence suppression is
shown in Figure 5, which shows $(v_k)$ and $(u_k)$ for a small sequence of 200 packets
($1700 \le k < 1900$), corresponding to four seconds of audio. To further illustrate such
trace data, Figure 6 shows a histogram of the delays $(v_k)$ and Figure 3 a histogram
of the interarrival times $(u_k)$. It can be noted that large interarrival times
are sometimes followed by very small ones, showing that a severely delayed packet
forces subsequent packets to arrive back-to-back. The fraction of packets arriving in
this manner corresponds to the height of the leftmost peak in the histogram of Figure 3.
Returning to traces with silence suppression, Figure 7 gives the statistics of the
recorded voice signal used. The upper part shows a histogram of the talkspurts and the
lower part the corresponding histogram for the non-zero part of the distribution G of
the silence intervals X discussed in section 3. The probability $\alpha = P(X > 0)$ and the
expected value $\mu = E(X)$ were estimated as
$$\alpha^* = 0.0456, \qquad \mu^* = 25.7171.$$
[Figure 6: histogram; x-axis: observed delay times, ms (500 to 800); y-axis: frequency (0 to 300).]
Fig. 6. Histogram of the observed delays ($V_k$)
5.2 Numerical estimates
In this section we indicate a few simple numerical techniques that give parameter esti-
mates based on trace data. In principle such methods based on the model presented here
can be used for systematic studies of the delays and losses and for comparison of traces
sampled in different environments.
Considering first the case of no silence suppression, it was pointed out in section 4
that given an observed realization $(u_k)_{k=1}^{n}$ of $(U_k)$, a point estimate of the packet loss
probability p is obtained from (17) (with µ = 0), using
$$p^* = 1 - \frac{20}{\bar u}, \qquad \bar u = \frac{1}{n} \sum_{k=1}^{n} u_k \text{ ms}.$$
Our measurements consistently gave $\bar u \approx 20.002$ to $20.005$ ms, indicating loss
probabilities of the order $10^{-4}$.
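The arithmetic behind this estimate is short enough to spell out. The sketch below applies the point estimate to the two ends of the measured range of mean interarrival times; the function name `estimate_loss_probability` is our own.

```python
def estimate_loss_probability(mean_interarrival_ms, packet_interval_ms=20.0):
    """Point estimate of the loss probability: 1 - (packet interval) / (mean interarrival)."""
    return 1.0 - packet_interval_ms / mean_interarrival_ms


for u_bar in (20.002, 20.005):
    print(u_bar, estimate_loss_probability(u_bar))
```

Both ends of the range give estimates between roughly 1 and 2.5 lost packets per 10,000, in line with the order of magnitude stated above.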
Next we look at an experiment where the pre-recorded voice is transmitted at seven
different times using silence suppression, and the interarrival times measured at the re-
ceiver during each transmission. Table 1 shows the expected silence interval E(X) and
the estimated µ from the trace files. The obtained estimates indicate a systematic bias
of the order 0.5 milliseconds in the mean values of the silence suppression intervals.
Packet losses do not seem to explain fully the observed deviation. A more comprehen-
sive statistical analysis might reveal the source of this slight mismatch. For the present
preliminary investigation we find the numerical estimates satisfactory.
We now consider the problem of estimating the distribution F of packet delays Y
given a fixed length sample observation $(v_k)$ of the Markov chain $(V_k)$ of observed
delays. One method for this can be based on the steady state analysis in section 2.2.
[Figure 7: two histograms; upper panel x-axis: length of talkspurts, ms (0 to 5000); lower panel x-axis: length of silent intervals, ms (0 to 5000); y-axes: no. of observations (0 to 40).]
Fig. 7. Lengths of talkspurts and silence periods

Table 1. Silence Interval Parameters
Trace      E(X)       E(X) − µ*
trace 1    25.7492    0.0321
trace 2    26.2204    0.4639
trace 3    26.2284    0.5113
trace 4    26.2164    0.4993
trace 5    26.2186    0.5015
trace 6    26.2124    0.4953
trace 7    26.2209    0.5038

Indeed, rewriting (8) as the simple relation
$$P(V_\infty \le x) = F(x) \prod_{i=1}^{\infty} F(x + i) = F(x)\,P(V_\infty \le x + 1)$$
shows that if we let $\bar F_V$ denote an empirical distribution function of V, then we obtain
an estimate $\bar F$ of F by taking
$$\bar F(x) = \frac{\bar F_V(x)}{\bar F_V(x + 1)}, \quad x \ge 0, \tag{18}$$
where we recall that the variable x is measured in units of 20 ms intervals. An application
of this numerical algorithm to the trace data of the previous figures (5 and 6) yields
an estimated density function for Y as in Figure 8. The numerical scheme is sensitive
to small changes in the data, so it is difficult to draw conclusions on the finer details of
the distribution F. As expected the graph is very similar to that of the observed delays,
Figure 6, but with certain differences due to the Markovian dependence structure
in the sequence $(V_k)$ as opposed to the independence in $(Y_k)$. The main difference is
the shift towards smaller values for Y in comparison to those of V. This corresponds to
the inequality $\bar F(x) \ge \bar F_V(x)$, valid for all x, which is obvious from (18).
[Figure 8: estimated density; x-axis: delay times, ms (500 to 800); y-axis: estimated frequency (0 to 1.2).]
Fig. 8. Estimated density of Y
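A minimal sketch of the estimator (18) is given below, under the assumption that the observed delays are available as a list of values in units of 20 ms. The function names are our own, and the treatment of a zero denominator (returning 0) is a choice made for this sketch rather than part of the method.

```python
import bisect


def empirical_cdf(samples):
    """Return the empirical distribution function of the given samples."""
    data = sorted(samples)
    n = len(data)

    def cdf(x):
        return bisect.bisect_right(data, x) / n

    return cdf


def estimate_delay_cdf(v_samples):
    """Estimate F from observed delays via (18): F_V(x) / F_V(x + 1)."""
    F_V = empirical_cdf(v_samples)

    def F_hat(x):
        denom = F_V(x + 1.0)
        # If no sample lies below x + 1, none lies below x either.
        return F_V(x) / denom if denom > 0 else 0.0

    return F_hat


# Toy observed delays (hypothetical values, units of 20 ms).
F_hat = estimate_delay_cdf([1.2, 1.9, 2.4, 2.6, 3.1, 3.3, 4.0, 4.8])
print(F_hat(2.0), F_hat(3.0))
```

Because the denominator is at most 1, the estimate is never smaller than the empirical distribution function of the observed delays, which reflects the inequality noted above.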
6 Related Work
Many researchers have looked at the buffer size needed for packet streams
characterised by Markov (semi or modulated) behaviour, especially in the case of
multiplexed sources. Their goal was to derive the waiting time of packets in the buffer,
shown as a probability density function of the waiting times. Relatively few, however,
have looked at the arrival process using a stage of buffers and identifying embedded
Markov chains from a single source. We concentrate on this scenario,
including streams both with and without silence suppression. Additionally, as far as we
know, no-one has used real trace data to enhance and verify their models to the level we
show.
Some early analytical work on the buffer size requirements for packetised voice is
summarised by Gopal et al. [1]. One often cited piece of work is Barberis [2]. As part
of this work he assumes the delays experienced by packets of the same talkspurt are
i.i.d. according to an exponential distribution $p(t) = \lambda e^{-\lambda t}$, where $1/\lambda$ is both the
mean and the standard deviation of the transmission delay. Mehmet Ali and Woodside, in
their work on buffer requirements [3], model the arrival process as a Bernoulli trial with
probability $[1 - F(j, n - j + 1)]$ of the event “no arrival yet” at each interval up to its
arrival. The outcome of the trial is represented by the random variable $k(j, n)$:
$$k(j, n) = \begin{cases} 1 & \text{if packet } j \text{ has arrived at or before time } n \\ 0 & \text{otherwise.} \end{cases}$$
Ferrandiz and Lazar [4] analyse a real time packet session over
a single channel node and compute its performance parameters as a function of their
model primitives. They do not use any Markovian assumptions, but rather an approach
which uses a series of overload and under-load periods. During overload packets are
discarded. They derive an admission control scheme based on an average of the packet
arrival rate. Van der Wal et al. derive a model for the end to end delay for voice packets
in large scale IP networks [5]. Their model includes different factors contributing to
the delay but not the arrival process of audio packets per se. The mathematical model
described here is also discussed in the book [6].
7 Conclusions
We have addressed the problem of modelling the arrival process of a single packet audio
stream. The model can be used to produce packet audio streams with characteristics, at
least, quite similar to the particular traces we have obtained. The model is suitable for
generating streams both with and without silence suppression applied at the source, in
addition the case where packets are lost has been included.
The work can be generally applied to research where modelling arriving packet
audio streams needs to be performed. A natural next step is to use the arrival model
presented here for evaluation of jitter buffer performance, such as investigating waiting
times and possible packet loss in the jitter buffer. We observed from our model that the
interarrival times are negatively correlated (as mentioned in Section 5.1). This will have
an impact on the dynamics and performance of a jitter buffer. With an accurate model,
based on real data measurements, a realistic traffic generator can be written. In separate
work we have gathered nearly 25,000 VoIP traces from ten globally dispersed sites
which we can utilise for parameterising the model, depending on the desired scenario.
References
1. Prabandham M. Gopal, J. W. Wong, and J. C. Majithia, “Analysis of playout strategies for
voice transmission using packet switching techniques,” Performance Evaluation, vol. 4, no.
1, pp. 11–18, Feb. 1984.
2. Giulio Barberis, “Buffer sizing of a packet-voice receiver,” IEEE Transactions on Communi-
cations, vol. COM-29, no. 2, pp. 152–156, Feb. 1981.
3. Mehmet M. K. Ali and C. M. Woodside, “Analysis of re-assembly buffer requirements in
a packet voice network,” in Proceedings of the Conference on Computer Communications
(IEEE Infocom), Bal Harbour (Miami), Florida, Apr. 1986, IEEE, pp. 233–238.
4. Josep M. Ferrandiz and Aurel A. Lazar, “Modeling and admission control of real time packet
traffic,” Technical Report 119-88-47, Center for Telecommunications
Research, Columbia University, New York 10027, 1988.
5. Walm van der Wal, Mandjes Michel, and Rob Kooij, “End-to-end delay models for interactive
services on a large-scale IP network,” in IFIP, June 1999.
6. Ingemar Kaj, Stochastic Modeling for Broadband Communications Systems, Society for
Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002, Prepared with LaTeX.
Paper C
Olof Hagsand, Ian Marsh and Kjell Hanson. Sicsophone: A Low-delay In-
ternet Telephony Tool. To appear at the 29th Euromicro Conference, Belek,
Turkey, September 2003.
© IEEE 2003
Reprinted with permission.
Sicsophone: A low-delay Internet telephony tool
Olof Hagsand
LCN Laboratory, IMIT
Royal Institute of Technology
Sweden
[email protected]
Ian Marsh
SICS AB
Stockholm
Sweden
[email protected]
Kjell Hanson
Prosilient Software AB
Stockholm
Sweden
[email protected]
Abstract
The end to end delay is a critical factor in the per-
ceived quality of service for Voice over IP applications.
Sicsophone is a complete VoIP system that couples the low
level features of audio hardware with a standard jitter buffer
playout algorithm. Using the sound card directly eliminates
intermediate buffering as well as providing fine control over
timers needed by a soft real-time application such as VoIP.
A statistical based approach for inserting packets into au-
dio buffers is used in conjunction with a scheme for inhibit-
ing unnecessary fluctuations in the system. We also present
mouth-to-ear delay measurements for selected VoIP appli-
cations and show that several hundreds of milliseconds can
be saved by using the techniques described in this paper. A
prototype for both UNIX and Windows platforms has been
implemented, demonstrating that our system adapts to net-
work conditions whilst maintaining low delays.
Keywords: Packet voice, playout buffer adaption,
operating systems
1 Introduction
Users of interactive VoIP applications demand low la-
tency conversations. Replaying packetised audio requires
that sufficient packets are available to the application in or-
der to avoid gaps or glitches. The digital to analog conver-
sion of sampled voice requires strict, synchronous timing
despite the fact that the network and operating system may
disrupt the process. The most common method to solve this
problem is to introduce a small intermediary buffer between
the decoded audio stream and the audio hardware which
allows packets to be “available” for playout. Of course
withholding packets instead of immediately playing them
increases the total delay of a VoIP application. However,
the longer packets can be delayed, the more resilient the receiver
is to adverse network conditions. We should point
out that Sicsophone is a working implementation, so
algorithmic complexity is an important factor. We motivate
this approach with real delay experiments and results.
Hence the goal is not to compare the merits of various playout
algorithms, which has been covered by many researchers,
but rather to give some insight into which issues are important when
realising these schemes, and into their effects.
In this paper we refer to the mouth-to-ear delay as the
total one way delay experienced by two speakers includ-
ing the analog-digital-analog conversion. By jitter we mean
the variability in the packet delay. This variability is the
reason we need to buffer packets, thus our work focuses
on how to detect and compensate for packet jitter in an ef-
ficient manner. Our solution is to insert packets into the
memory of sound cards relieving the need for data copying
or context switching. This approach saves precious time,
avoids scheduling problems but requires careful buffer man-
agement.
Figure 1 illustrates the complete path of audio samples
from a microphone at a sender to the loudspeaker at a re-
ceiver. Traditionally, a sender writes voice samples to the
operating system which are subsequently sent across the
network to a receiving host. At the receiver, data is read
from the operating system interface where it is the responsi-
bility of the application to adjust the buffer size as required,
this is shown by the solid lines in the illustration.
In our approach we use the buffering scheme in the operating
system and copy the packets directly into the memory
of the sound card. Therefore, we avoid copying the data
to and from the application, and avoid performing the
de-jittering in the application. We describe our approach in
the context of DirectSound [1] on the Windows platform. It
is important to point out that our approach is not confined to
this architecture; a ring buffer with pointer support is sufficient
to realise the ideas presented in this paper (alternatives
for UNIX include [9] or [8]). However, we describe the system
using DirectSound as it is known to many developers
and was used in our experimental evaluation. It is important
to state that we assume the systems are not under heavy
load, and we do not consider Sicsophone to be a hard real-time system.
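To make the notion of a ring buffer with pointer support concrete, the following is a minimal sketch in Python. It is not the DirectSound interface and not the Sicsophone implementation; the class `RingBuffer`, its method names and the byte-wise copying are assumptions chosen only to show how a write cursor and a play cursor wrap around a fixed block of memory.

```python
class RingBuffer:
    """Fixed-size circular buffer with separate play and write cursors (illustrative only)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.play = 0    # position of the next byte to be played out
        self.write = 0   # position of the next byte to be filled
        self.filled = 0  # bytes written but not yet played

    def put(self, data):
        """Copy a packet's samples in at the write cursor, wrapping at the end."""
        if len(data) > self.size - self.filled:
            raise OverflowError("write cursor would overrun the play cursor")
        for b in data:
            self.buf[self.write] = b
            self.write = (self.write + 1) % self.size
        self.filled += len(data)

    def get(self, n):
        """Consume up to n bytes from the play cursor, as the audio hardware would."""
        n = min(n, self.filled)
        out = bytearray()
        for _ in range(n):
            out.append(self.buf[self.play])
            self.play = (self.play + 1) % self.size
        self.filled -= n
        return bytes(out)


rb = RingBuffer(8)
rb.put(b"abcdef")
print(rb.get(4))   # the first four bytes are played out
rb.put(b"ghijk")   # this write wraps around the end of the buffer
print(rb.get(10))  # the remaining seven bytes, in order
```

The distance between the two cursors plays the role of the jitter buffer depth: the further the write cursor runs ahead of the play cursor, the more delay variation can be absorbed, at the cost of added playout delay.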
[Figure 1: sender (microphone, A/D, App, OS, IP, Ethernet) and receiver (Ethernet, IP, OS, Wave API or DirectSound, App, D/A, loudspeaker), showing the normal path and the optimized path.]
Figure 1. Sicsophone audio delivery path
If we now look in more detail at the steps a receiver must take to replay
packetised audio from the network, Table 1
shows four such typical steps. Firstly, de-packetisation
removes the IP and UDP headers and passes the datagram
together with a Real Time Protocol (RTP) payload to a
VoIP application. This step takes a few milliseconds on
most systems. Step two is to decode the sound samples,
which is dependent on the compression scheme as well as
the packet size used. Typically this takes from a few
milliseconds to tens of milliseconds. Steps three and four,
absorption of network delays through buffering and delivery
to the sound application, are usually performed as distinct steps.
Our goal was to consolidate these steps into a single
step, saving the time of intermediate buffering and context
switching. For convenience, we refer to this approach
by its software name, Sicsophone.
Step  Process           Overhead (ms)  Depends on
1     De-packetisation  10 - 50        Packet size
2     Decoding          10 - 50        Coding
3     Buffer delay      5 - 200        Network
4     Delivery          5 - 120        End system
Table 1. Typical receiver incurred delays (ms)
The remainder of this paper is organised as follows. Section 2
forms the main body of this work: low level adaption of playout
buffers using ring buffers. Section 3 presents Sicsophone’s
performance, giving mouth-to-ear delay results for different VoIP
tools and comparing the playout delay of Sicsophone against the
idealised case. Section 4 describes related efforts with which
this paper has commonalities, and Section 5 rounds off the paper
with some conclusions.
2 End-system adaption to jitter
2.1 Buffering issues
In this section we outline some issues associated with the data
buffering scheme we have chosen. Our goal is to save time by
avoiding data copying, setting up direct memory access (DMA)
transfers ahead of time, using simple data structures and
inserting de-jittered audio packets directly into the memory of
the sound card. Using the sound card as a buffer has the
advantage of not adding any extra buffering to the audio sample
path. It also avoids copying data from a kernel to an application
and back again. DMA moves data between memory and the sound card
without intervention of the CPU. Using DMA efficiently is not
trivial, as it can take some time to set up the transfer. Once
the setup is done, however, the transfer itself is much quicker
and more efficient. This offers significant time savings over
posting an interrupt for every packet, particularly in the older
(and non-DirectX) versions of the Windows operating systems.
One potential problem of using the sound card memory as a buffer
is that it could be overrun by packets arriving too quickly, for
example on a fast connection. Modern sound cards, however, are
equipped with megabytes of RAM to store down-loadable sound
samples, and DirectSound can allocate buffers up to this physical
size when a hardware buffer is initialised. Another potential
cause of overrun or underrun is misaligned or drifting clocks;
there is no mechanism in Sicsophone to detect this explicitly. We
do, however, keep the buffer from being overrun by the mechanisms
explained in Section 2.3.
Another important but often overlooked issue is mixing. For an
application like VoIP, where the voice channel is stopped and
started continuously, we would like to minimise the setup time.
Valuable time can be lost by setting up mixers for software and
hardware buffers when we normally do not want to mix audio from
different sources, e.g. the VoIP stream and an MP3 file. We
therefore allocate a DirectSound primary buffer, which gives
better delay characteristics as its contents do not need to be
mixed before being output to the D/A conversion.
The coding scheme used is another issue. Packets have to be
decoded before insertion into the buffer if they are not in PCM
format. Using PCM, however, allows us to DMA the payload into the
sound card memory without any audio format conversion. This is
the fastest path from packet reception to playout. Other audio
formats can be supported, but they require extra CPU cycles for
decoding the audio and a small buffer to hold the data before and
after decoding. Using buffers in this manner assumes that a
certain number of bytes in the buffer corresponds to a well
defined playout time. This is the case for coding such as PCM but
not for compressed audio formats.
Figure 2. DirectSound buffer structure [diagram of the DirectSound ring buffer showing the write cursor, the playout cursor and its direction, the safety margin, the effective length, and the starts of talkspurts 1 and 2]
Figure 2 shows the interface offered by DirectSound. Data is
written at the write pointer and replayed by the trailing playout
(read) pointer. The read and write pointers are updated by the
system and continuously circle the buffer. Reading and writing at
the pointers requires that the system updates their current
positions almost instantaneously. Some of the older operating
systems used in our measurements did not report the pointer
positions with fine granularity. In Sicsophone the pointers are
used as both timers and pointers. In fact we use the DMA
producing and consuming data as the only clocks in the delivery
system. The pointers function as a timer by indicating if a
packet is too late: if the read pointer has already passed the
point where a packet should be, and the packet has not been
written, then we know that it is late. Insertion is simply a
modulo operation and a buffer copy. In order not to replay old
data from the buffer when no packets are being sent, we write
‘silence’ samples that sound like background noise into the
buffer so that the listener is aware the connection is still
open.
Used as pointers, the read and write pointers give the memory
locations where data is read or written, depending on the
operation to be performed. Given these pointers it is easy to
adjust the buffer length: it is simply where packets are chosen
to be read from. The closer the read pointer is to the write
pointer, the smaller the effective buffer length will be. Note
that there is a small margin of 15ms in front of the read pointer
to allow data that has been written to be “ready” for playback.
Use of this safety margin is recommended by the developers of
DirectSound.
To give a concrete example, using an estimate of the network
delay and its variation, described previously, we insert packets
at a specified “distance” ahead of the read pointer. A
translation from milliseconds to bytes is therefore needed:

$\text{bytes} = (\text{samples/sec} \cdot \text{bits/sample} \cdot P_i) / 8000$,

where $P_i$ is the playout point in milliseconds. For example,
substituting 8000 samples/sec, 8 bits per sample and 200ms for
the playout point gives 1600 bytes. This means that the write
pointer can simply be set 1600 bytes in front of the read
pointer. The safety margin is also included in the length of the
playout buffer, but it is possible to simply subtract its value
(120 bytes in this case) from the calculation. To reiterate, once
the playout point has been calculated it is trivial to insert
packets into the buffer; no complex data operations are needed.
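The conversion from milliseconds to bytes can be checked with a few lines of code. The function name `playout_bytes` is hypothetical; the default values are the ones used in the example above.

```python
def playout_bytes(playout_ms: int,
                  samples_per_sec: int = 8000,
                  bits_per_sample: int = 8) -> int:
    """Convert a playout distance in milliseconds to a byte offset.

    Dividing by 8000 combines the conversion from bits to bytes (/8)
    with the conversion from milliseconds to seconds (/1000).
    """
    return samples_per_sec * bits_per_sample * playout_ms // 8000


offset = playout_bytes(200)   # 200ms playout point
margin = playout_bytes(15)    # 15ms safety margin
```

With these defaults the 200ms playout point maps to 1600 bytes and the 15ms safety margin to 120 bytes, matching the figures quoted in the text.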
2.2 Fast startup adaption
In an adaptive VoIP application we normally consider
changing the buffer size during a silence period so as not to
introduce audible glitches in the analog audio stream. Since
the goal of this work is to produce a low-level VoIP tool
we would like to keep the buffer length close to optimal.
However in the startup phase we have no idea of the network
condition and therefore have to use default values for the
network delay (Sicsophone uses min = 20ms, max = 60ms).
We therefore adjust the buffer length after monitoring only
a few packets to settle to an estimate quickly.
Figures 3 and 4 show, as an example, the packet delay in the
jitter buffer during the start up phase of Sicsophone. The y-axis
shows the waiting time in the buffer (in ms) and the x-axis shows
the number of packets received, sorted by the time spent in the
buffer; note that this is not the sequence number. The figures
thus show the number of packets and their respective waiting
times.
Figure 3. Jitter for the first ten packets [plot of jitter buffer delay (ms, 0 to 60) against number of packets (0 to 10)]
Figure 3 shows the buffer state after ten packets have been
received and Figure 4 after an additional 40 packets have
arrived; the original ten are shown with bolder lines. After ten
packets were stored, the time spent in the buffer varied between
14 and 30 ms, whereas after 50 packets the median delay incurred
is around 20 ms. This is not surprising as packets are sent with
a 20 millisecond separation.
Figure 4. Arrival of 10 and 50 packets [plot of jitter buffer delay (ms, 0 to 60) against number of packets (0 to 50), with series for 10 packets and 50 packets]
Fast adaption is worthwhile during the start up phase of a VoIP
session. The alternative approach is to be conservative in the
start up phase and have long playout buffers until a value for
the playout point can be calculated. In the presence of spikes
[7] we can re-estimate the jitter value quickly. Since the goal
of this paper is a low delay VoIP tool we chose to adapt quickly.
Furthermore, there are usually sufficient silence periods during
the startup phase of a conversation to perform fast adaption.
Where there are none (such as call waiting, i.e. music playing)
we adjust the buffer when a packet is excessively delayed or when
there is a loss. Failing these possibilities, we adjust the
buffer length and tolerate an audio glitch.
2.3 Bounding the estimated network delay
Sudden increases in the network delay can cause problems for VoIP
applications. A spike is a sudden and rapid increase in the
network delay which is typically short lived, often less than one
round trip time. One solution is to follow the increase in the
delay and adjust the buffer length accordingly. The alternative
is to include the spike values in the jitter estimate but not
adapt the buffer size. Because spikes are short lived, we have
chosen to be more conservative to network conditions and not to
adapt the DirectSound buffer length to sudden increases in
network delay.
We have implemented a system which bounds the delay jitter
estimate. As stated, the spikes are not completely ignored but we
do not react immediately to their presence. The estimated jitter
value $d_i$ should vary within a “corridor” between a lower bound
$Q^{\min}_i$ and an upper bound $Q^{\max}_i$ (see the key in
Figure 5), where $Q^{\min}_i < d_i < Q^{\max}_i$. If the running
estimate breaks either of the boundaries we re-calculate the new
buffer length, taking into account the value of the spike, but
reset the mean estimate to the middle value of this new
$(Q^{\min}_i, Q^{\max}_i)$. Figure 5 shows an example of a
receiver jitter buffer during a conversation between two machines
on a local network. The y-axis shows the jitter buffer length and
the x-axis the sequence number. The system starts with $Q^{\min}$
and $Q^{\max}$ set to the default values of 20ms and 60ms for the
minimum and maximum bounds respectively. At the bottom of the
figure the corridor breaks are marked with stars. We have found
this scheme to work well: in the given trace there were only 14
breaches of the corridor from over 3600 packets (less than 0.5%).
More importantly we did not make costly, unnecessary changes to
the DirectSound primary buffer.

Figure 5. Bounded jitter buffer playout delays [plot of jitter buffer delay (ms, 0 to 80) against sequence number (0 to 4000), with series $d_i$, $Q^{\min}_i$, $Q^{\max}_i$ and $Q^{\text{reset}}_i$]
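The corridor idea can be sketched in a few lines. The text does not give the smoothing rule or the exact way the new bounds are computed, so the following makes two explicit assumptions: the running estimate is an exponentially weighted moving average with weight `alpha`, and on a breach the corridor keeps its width and is re-centred on the delay sample that caused the breach. The class name and parameters are hypothetical.

```python
class BoundedJitterEstimate:
    """Running delay estimate confined to a corridor (q_min, q_max)."""

    def __init__(self, q_min: float = 20.0, q_max: float = 60.0,
                 alpha: float = 0.875):
        self.q_min, self.q_max = q_min, q_max   # default bounds from the text
        self.alpha = alpha                       # assumed smoothing weight
        self.d = (q_min + q_max) / 2             # start in the middle
        self.breaches = 0

    def update(self, delay_ms: float) -> bool:
        """Feed one delay sample; return True if the corridor was reset."""
        # Assumed smoothing: exponentially weighted moving average.
        self.d = self.alpha * self.d + (1 - self.alpha) * delay_ms
        if self.q_min < self.d < self.q_max:
            return False                         # inside corridor: no change
        # Breach: assumed rule, re-centre the corridor on the sample
        # (keeping its width) and reset the estimate to the midpoint.
        half = (self.q_max - self.q_min) / 2
        centre = max(delay_ms, half)             # keep the lower bound >= 0
        self.q_min, self.q_max = centre - half, centre + half
        self.d = centre
        self.breaches += 1
        return True
```

With the default values a single 150ms spike moves the estimate but does not break the corridor, whereas a second consecutive 150ms sample does, at which point the buffer length would be recalculated once rather than on every packet.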
3 Results
We divide this section into two parts. The first gives the total
delay of popular VoIP tools compared to Sicsophone in a
laboratory environment with little or no network delay. The
second shows the performance of the playout algorithm using trace
files taken over the Internet (two of the ones used in [7]). This
shows a typical WAN component due to the jitter on the Internet,
plus the best possible playout that could have been achieved by
post-processing the trace files. We chose to give the results in
this manner to estimate the real mouth-to-ear delay by including
both the WAN and LAN components; it can be seen as the sum of
these two quantities. Essentially the first set includes the
delay due to coping with the operating system and the second the
delay due to network conditions. Using these trace files we can
compare our implementation with those published.
3.1 Mouth-to-ear measurements
The delay contributed by the end systems is the main result of
this paper. We performed one way mouth-to-ear measurements with a
range of VoIP tools; the results are summarised in Table 2. No
parameter tweaking of these tools was done: we used their default
installation values. The experimental setup was as shown in
Figure 1. We used a signal generator which output a 1Hz square
wave. The square wave serves as a trigger: the signal is
packetised, sent over the IP network and played back through the
loudspeaker at the destination. The square wave is detected by an
oscilloscope and the difference in time between the original and
the replayed waves is measured.
Audio tool                       Latency (ms)
Sicsophone prototype             25-100
Ericsson Lanphone                300
Vocal Internet Phone 4.5 (SB)    450-550
Vocal Internet Phone 4.5 (PJ)    580-620
NetMeeting 2.1 (SB)              620
NetMeeting 2.1 (PJ)              750
VAT 3.4 (Solaris)                1200
RAT 3 (Solaris)                  1500
Table 2. Mouth-to-ear latency measurements (SB=SoundBlaster and PJ=PhoneJack)
The measurements were done using a signal generator
feeding a sender and an oscilloscope to measure the time
difference between the sender and receiver.
We can see that there are large variations between the various
applications. One important result of this paper is to highlight
the design of end systems for VoIP applications. Our goal is not
simply to state that Sicsophone is superior to other tools, but
rather to show that considerable time, tens to hundreds of
milliseconds, can be saved by using the approach described.
3.2 Comparison with ideal playout conditions
In the introduction we mentioned that a jitter buffer playout
algorithm essentially has to trade off low delay against packet
loss due to packets arriving too late. Low delay implies a short
playout buffer, incurring higher packet loss due to late
arrivals. When comparing the performance of algorithms it
therefore makes sense to consider both loss and delay. The values
given here should be taken as the delays incurred for a given
trace at a particular time, and are used to show the network
delay which would be added to the system delay, i.e. those in
Table 2.
Figures 6 and 7 show the results for two Internet trace files
(see footnote 1). To calculate the optimal playout point we order
all the packets and remove the 1% with the highest delay. We then
calculate the delay needed to play the remaining 99% of the
packets, resulting in the delay for 1% packet loss. This process
is repeated up to 25% packet loss (although in practice more than
10% would be deemed unacceptable). Studies have shown that 1%
packet loss is acceptable without packet loss concealment and up
to 10% with packet loss concealment [3]. Packet loss concealment
is the process of “filling in” missing or lost packets, which has
been shown to be preferable to glitches in the voice. Figures 6
and 7 show Sicsophone’s playout delay performance in comparison.
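The post-processing step described above can be sketched as follows. This is an illustrative reading of the procedure, not the code used for the figures: the function name is hypothetical, the trace is assumed to be a plain list of per-packet delays in milliseconds, and the returned value is simply the largest delay among the packets that are kept.

```python
import math


def optimal_playout_delay(delays_ms, loss_percent):
    """Smallest playout delay that plays all but loss_percent of packets.

    The packets are ordered by delay, the loss_percent with the highest
    delay are discarded, and the largest remaining delay is returned.
    """
    ordered = sorted(delays_ms)
    drop = math.floor(len(ordered) * loss_percent / 100)
    kept = ordered[:len(ordered) - drop]
    if not kept:
        raise ValueError("loss_percent discards every packet")
    return kept[-1]


# Synthetic trace (made-up values): delays of 1, 2, ..., 100 ms.
trace = list(range(1, 101))
curve = {loss: optimal_playout_delay(trace, loss) for loss in (1, 5, 10, 25)}
```

Repeating the call for loss rates from 1% to 25% gives one point per loss rate, which is the shape of the "optimal" curves in Figures 6 and 7.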
Figure 6. Playout delays for a trace from UCI, California to INRIA, France [plot of jitter buffer delay (ms, 0 to 100) against % packet loss (0 to 25), with series for optimal and Sicsophone]
In Figure 6 Sicsophone is about 50ms from the ideal playout
point, and the difference remains more or less constant as the
packet loss increases. For a given loss rate, e.g. 5%, Pinto and
Christensen [6] quote a slightly lower delay than we do, 72ms
compared to our 98ms, and similarly so for other loss rates. We
should reiterate that the focus of this paper is the
implementation and the absolute, measurable delays rather than
the playout algorithm itself. Nevertheless, the result in this
case is due to the large variation of jitter (± 20ms), which
makes it hard to settle to a constant value for the minimum
buffer length; this can be verified by looking at the absolute
jitter. The ITU G.114 document recommends that the one-way delay
should not exceed 150ms, so about one third of the recommended
delay is used in the buffer playout algorithm. A second test is
shown in Figure 7, which shows a trace from the University of
Massachusetts in Amherst to GMD in Berlin. In this case the
jitter is much lower and the difference between the optimal and
Sicsophone is only 5ms. We have included both the best and worst
test cases of the five traces available.

Figure 7. Playout delays for a trace from Amherst, Mass to GMD, Berlin [plot of jitter buffer delay (ms, 0 to 12) against % packet loss (0 to 25), with series for optimal and Sicsophone]

Footnote 1: http://gaia.cs.umass.edu/~sbmoon/traces.html
4 Related work
The early 90’s produced a surge in packet audio playout research.
One of the first efforts to implement a voice application on an
IP network with an adaptive buffer playout strategy was NeVoT
[11]. The playout algorithm implemented in Sicsophone is almost
identical to that of NeVoT. NeVoT uses a variation estimate
similar to the one given earlier, but makes a slight distinction
between the first packet in a talkspurt and subsequent ones. The
playout of the first packet is delayed longer due to lack of
information on the network state after the silence period. Our
work shares with NeVoT the choice of a ring buffer for buffering
packets, only we perform the copying by using DMA transfers
directly rather than copying the data from the application to the
operating system. The use of a ring buffer in Sicsophone is
identical to that described in [11], where the choice of a
circular buffer is motivated by performance reasons. VAT (Visual
Audio Tool) [2] is a well known VoIP tool that implements a
playout buffer similar to the one described, including a circular
buffer to hold the packets before playout. We use an additional
scheme to prevent the jitter estimates from varying too rapidly,
and focus on the efficient insertion of packets into the playout
buffer.

Moon et al. [7] present four different playout algorithms for
packet audio. All calculate an estimate of the network delay and
jitter as an average over all the packets measured. The authors
study jitter spikes in traces and also do not adapt the buffer
size to these spikes. Pinto and Christensen [6] describe an
algorithm for jitter compensation based on the target packet loss
rate. Their “gap based” approach compares the current playout
time with the arrival time and calculates a gap for both early
and late packets. They compare the current playout delay, for any
particular talkspurt in progress, with an optimal playout delay.
This optimal theoretical delay is defined as the minimum amount
of delay to be added to the creation time of each packet which
would result in the playout of a talkspurt at the given loss
rate. Our calculation of the optimal playout is similar to the
one described in their paper.

Luigi Rizzo describes a generic sound card driver for FreeBSD
[9]. Aspects of it resemble our work, in particular the handling
of timers, DMA transfers and buffer size allocation. The driver
includes hooks for VoIP applications; one example is a select()
call which can be scheduled to return only when a certain amount
of data is ready for consumption. Rosenberg et al. [10] looked at
combining target-based playout algorithms with FEC schemes, and
propose a number of new playout algorithms based on this
coupling. Kouvelas and Hardman [4] keep the flow of audio
constant during operating system load by using buffering in the
audio hardware. They also look at reducing the amount of
buffering by keeping the buffers in the application as small as
possible. In our case we try to eliminate application buffering
entirely by using only the hardware buffers.
5 Conclusions
In this paper we have shown how careful buffer management
combined with a simple statistical playout scheme can reduce
mouth-to-ear delay for VoIP applications. As stated at the start
of this paper, delay is one of the most important factors in the
perceived QoS and this has been the focus of this work. The
results are encouraging as the mouth-to-ear delay of Sicsophone
on a LAN is around 50ms on a Windows NT system with DirectX 8.0.
We also include an estimate of the delay induced by network
conditions using the playout algorithm, in order to help estimate
the network delay of a VoIP tool such as Sicsophone. We have
proposed a system which tries to reduce the perceived
mouth-to-ear delay of real-time packet audio communication.

It is well known that users are sensitive to delay. However, we
are not aware of any studies that have been conducted on the
effect of dynamically changing (even reducing) the perceived
delay by techniques such as the one suggested in this work. If
this is found to be important then one could consider it in the
design of the algorithm, e.g. by inhibiting too frequent changes
in the playout delay. We also plan to measure the delay with
operating systems such as Windows XP (with DirectX 9.0) and UNIX
platforms. We have gathered more than 25,000 VoIP trace files
from around the world using Sicsophone and plan to use them for
further jitter, loss and delay analysis [5].
References
[1] B. Bargen and P. Donnelly. Inside DirectX. Microsoft Press, 1998.
[2] V. Jacobson and S. McCanne. vat - LBNL audio conferencing tool, July 1992. Available at http://www-nrg.ee.lbl.gov/vat/.
[3] J. Janssen, D. D. Vleeschauwer, and G. H. Petit. Delay and distortion bounds for packetized voice calls of traditional PSTN quality. In Proceedings of the 1st IP Telephony Workshop (IPTel 2000), pages 105–110, Berlin, Germany, Apr. 2000. GMD Report 95.
[4] I. Kouvelas and V. Hardman. Overcoming workstation scheduling problems in a real-time audio tool. In Proc. of Usenix Winter Conference, Anaheim, California, Jan. 1997.
[5] I. Marsh and F. Li. Wide Area Measurements of VoIP Quality. In Quality of Future Internet Services (to appear), Stockholm, Sweden, Oct. 2003.
[6] J. Pinto and K. Christensen. An algorithm for playout of packet voice based on adaptive adjustment of talkspurt silence periods. In Proceedings of the IEEE 24th Conference on Local Computer Networks, pages 224–231. ACM, Oct. 1999.
[7] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne. Adaptive playout mechanisms for packetized audio applications in wide-area networks. In Proceedings of the Conference on Computer Communications (IEEE Infocom), pages 680–688, Toronto, Canada, June 1994. IEEE Computer Society Press, Los Alamitos, California.
[8] D. Reed. A new audio device driver abstraction. In Proc. International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV), 1998.
[9] L. Rizzo. The FreeBSD audio driver. Lecture Notes in Computer Science, 1356, 1997.
[10] J. Rosenberg, L. Qiu, and H. Schulzrinne. Integrating packet FEC into adaptive voice playout buffer algorithms on the internet. In Proceedings of the Conference on Computer Communications (IEEE Infocom), Tel Aviv, Israel, Mar. 2000.
[11] H. Schulzrinne. Voice communication across the Internet: A network voice terminal. Technical Report TR 92-50, Dept. of Computer Science, University of Massachusetts, Amherst, Massachusetts, July 1992.
Paper D
Olof Hagsand, Kjell Hanson and Ian Marsh. Measuring Internet Telephony
Quality: Where are we today? In Proceedings of IEEE Globecom: Global
Internet, pages 1838-1842, Rio De Janeiro, Brazil, December 1999.
1838 Global Telecommunications Conference - Globecom’99
Global Internet: Application and Technology
0-7803-5796-5/99/$10.00 © 1999 IEEE
Paper E
Ian Marsh and Fengyi Li. Wide Area Measurements of VoIP Quality. To appear
at Quality of Future Internet Services 2003, October, 2003, Stockholm,
Sweden.
SICS Technical Report T2003:08 ISRN: SICS-T–2003/08-SE ISSN: 1100-3154
Wide Area Measurements
of Voice Over IP Quality
Ian Marsh and Fengyi Li
Swedish Institute of Computer Science
Box 1263, SE-164 29 Kista, Sweden
15th May, 2003
Abstract
Time, day, location and instantaneous network conditions largely
dictate the quality of Voice over IP calls. In this paper we present
the results of over 18000 VoIP measurements, taken from nine sites
connected in a full-mesh configuration. We measure the quality of the
routes on an hourly basis by transmitting a pre-recorded call between
a pair of sites. We repeat the procedure for all nine sites during the
one hour interval. Based on the obtained jitter, delay and loss values
as defined in RFC 1889 (RTP), we conclude that the VoIP quality is
acceptable for all but one of the nine sites we tested. We also conclude
that VoIP quality has improved marginally since we last conducted a
similar study in 1998.
Wide Area Measurements of Voice Over IP Quality 82
1 Introduction
It is well known that the users of real-time voice services are sensitive and
susceptible to variable audio quality. If the quality deteriorates below an
acceptable level or is too variable, users often abandon their calls and retry
later. Since the Internet is increasingly being used to carry real-time voice
traffic, the quality provided has become, and will remain, an important issue.
The aim of this work is therefore to disclose the current quality of voice
communication at end-points on the Internet.
It is intended that the results of this work will be useful to many different
communities involved with real-time voice communication. We list some
potential groups to whom this work might have relevance. Firstly, end users
can determine which destinations are likely to yield sufficient quality. When
it is deemed insufficient they can take preventative measures such as adding
robustness to their conversations, for example in the form of forward error
correction. Operators can use findings such as these to motivate upgrading
links or adding QoS mechanisms where poor quality is being reported. Network
regulators can use this kind of work to verify that the quality level that was
agreed upon has indeed been deployed. Speech coder designers can utilise the
data as input for a new class of codecs; of particular interest are designs
which yield good quality in the case of bursty packet loss. Finally,
researchers could use the data to investigate questions such as, “Is the
quality of real-time audio communication on the Internet improving or
deteriorating?”
The structure of the paper is as follows: Section 2 begins with some
background on the quality measures we have used in this work namely, loss,
delay and jitter. Following on from the quality measures, section 3 gives a
description of the methodology used to ascertain the quality. In section 4
the results are presented, and due to space considerations we condense the
results into one table showing the delay, loss and jitter values for the paths we
measured. In section 5 the related work is given, comparing results obtained
in this study with other researchers’ work. This is considered important as it
indicates whether quality has improved or deteriorated since those studies.
Section 6 rounds off with some conclusions and a pointer to the data we
have collated.
2 What Do We Mean by Voice over IP Quality?
Ultimately, users judge the quality of voice transmissions. Organisations
such as ETSI, ITU, TIA, RCR plus many others have detailed mechanisms
to assess voice quality. These organisations are primarily interested in speech
coding. Assigning quality ‘scores’ involves replaying coded voice to both
experienced and novice listeners and asking them to judge the perceived
quality. Measuring the quality of voice data that has been transmitted
across a wide area network is more difficult. The network inflicts its own
impairment on the quality of the voice stream. By measuring the delay, jitter
and loss of the incoming data stream at the receiver, we can provide some
indication of how suitable the network is for real-time voice communication.
The two schemes can be combined, as was proposed by the ITU with the
E-model [1].
The quality of VoIP sessions can be quantified by the network delay,
packet loss and packet jitter. We emphasise that these three quantities are
the major contributors to the perceived quality as far as the network is
concerned. The G.114 ITU standard states that the end-to-end one way delay
should not exceed 150ms [2]. Delays over this value adversely affect the
quality of the conversation. An alternative study by Cole and Rosenbluth
states that users perceive a linear degradation in the quality up to 177ms [3].
Above this figure the degradation is also linear although markedly worse.
As far as packet loss is concerned, tests using simple speech coding such as
A-law or µ-law coding have shown that the mean packet loss should not
exceed 10% before glitches due to lost packets seriously affect the perceived
quality. Note that a loss rate such as this does not say anything about the
distribution of the losses. As far as the authors are aware, no results exist
that state how jitter alone can affect the quality of voice communication.
Work on jitter and quality is often combined with loss or delay factors.
When de-jittering mechanisms are employed, the network jitter is typically
transferred into application delay. The application must hold back a
sufficient number of packets in order to ensure smooth, uninterrupted playback
of speech. To summarise, we refer to the quality as a combination of delay,
jitter and loss. It is important to mention that we explicitly do not state how
these values should be combined. The ITU E-model is one approach but
others exist; we therefore refer the interested reader to the references given
as well as [4] and [5].
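The two-slope behaviour attributed to Cole and Rosenbluth can be written as a piecewise linear function of the one way delay. The sketch below uses the commonly quoted form of their delay impairment, with slopes of 0.024 and 0.11 per millisecond and a knee at 177.3ms. These coefficients are not given in this paper and should be treated as an assumption; the function name is hypothetical.

```python
def delay_impairment(delay_ms: float) -> float:
    """Piecewise linear delay impairment (assumed coefficients).

    Below the knee the impairment grows with one slope; above it an
    additional, steeper term is added, so the degradation remains
    linear but becomes markedly worse.
    """
    knee_ms = 177.3                      # assumed knee position
    impairment = 0.024 * delay_ms        # assumed slope below the knee
    if delay_ms > knee_ms:
        impairment += 0.11 * (delay_ms - knee_ms)  # assumed extra slope
    return impairment
```

Under these assumed values, increasing the delay from 50ms to 150ms adds 2.4 units of impairment, whereas the same 100ms increase from 200ms to 300ms adds 13.4.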
3 Simulating and Measuring Voice over IP Sessions
Our method to measure VoIP quality is to send pre-recorded calls between
globally distributed sites. Through the modification of our own VoIP tool,
Sicsophone, the intervening network paths are probed by a 70 second pre-
recorded ‘test signal’. The goal of this work is therefore to report in what
state the signal emerges after traversing the network paths. Incidentally,
we do not include the signalling phase, i.e. establishing a connection with
the remote host, rather we concentrate solely on the quality of the data (or
speech) transfer.
Nine sites have been carefully chosen with large variations in hops,
geographic distances and time-zones to obtain a diverse selection of
distributed sites. One important limitation of the available sites was that
they were all located at academic institutions, which are typically associated
with well provisioned networks. Their locations are shown in the map of
Figure 1.
Figure 1: The nine sites used in the 2002 measurements are shown with
circles. The six depicted with squares show those that were available to us
in 1998, three remained unchanged during the four years. [map legend:
Cooperating Sites in 1998, Cooperating Sites in 2002]
The sites were connected as a full mesh allowing us, in theory, to measure
the quality of 72 different Internet paths. In practice, some of the
combinations were not usable due to certain ports being blocked, thus
preventing the audio being sent to some sites. There were four such cases.
Bi-directional sessions were scheduled on an hourly basis between any two
given end systems. Calls were only transferred once per hour due to load
considerations on remote machines.
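The figure of 72 paths follows from counting ordered pairs of distinct sites in a full mesh of nine, i.e. 9 × 8. A minimal sketch, using hypothetical site names and assuming that each of the four blocked cases corresponds to one directed path:

```python
from itertools import permutations


def directed_paths(sites):
    """Return every ordered (source, destination) pair of distinct sites."""
    return list(permutations(sites, 2))


sites = [f"site{i}" for i in range(1, 10)]   # nine hypothetical site names
paths = directed_paths(sites)
blocked = 4                                   # unusable combinations
usable = len(paths) - blocked                 # assumes one path per case
```

Each pair appears once per direction, which is why a bi-directional session between two sites accounts for two of the 72 paths.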
In Table 1 below we list the characteristics of the call we used to probe
the Internet paths between the sites indicated on the map. Their locations,
separation in hops and time zones are given in the results section. As
stated, the call is essentially a fixed length PCM coded file which can be
sent between the sites; the length of the call and the payload size were
arbitrarily chosen. Over a 15 week period we gathered just over 18,000
recorded sessions. The number of sessions between the nine sites is not
evenly distributed due to outages at some sites, although we attempted to
ensure an even number of measurements per site. In total nearly 33 million
individual packets were transmitted during this work.
Test “signal”
  Call duration                  70 seconds
  Payload size                   160 bytes
  Packetisation time             20 ms
  Data rate                      64 kbits/sec
  With silence suppression       2043 packets
  Without silence suppression    3653 packets
  Coding                         8 bit PCM
  Recorded call size             584480 bytes
Obtained data
  Number of hosts used (2003)    9
  Number of traces obtained      18054
  Number of data packets         32,771,021
  Total data size (compressed)   411 Megabytes
  Measurement duration           15 weeks
Table 1: The top half of the table gives details of the call used to measure
the quality of links between the sites. The lower half provides information
about the data which was gathered.
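The entries in the top half of Table 1 are related by simple arithmetic, which the following sketch checks. The 8 kHz sampling rate is an assumption (it is implied by 64 kbits/sec at 8 bits per sample but is not stated in the table); the other constants are read from the table.

```python
SAMPLE_RATE_HZ = 8000          # assumption: implied by 64 kbit/s at 8 bits/sample
BITS_PER_SAMPLE = 8            # "8 bit PCM" in Table 1
PACKET_TIME_MS = 20            # packetisation time in Table 1
RECORDED_CALL_BYTES = 584480   # recorded call size in Table 1

bytes_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE // 8
data_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Payload carried by one packet covering PACKET_TIME_MS of audio.
payload_bytes = bytes_per_second * PACKET_TIME_MS // 1000

# Packets needed to carry the whole file when nothing is suppressed.
packets_without_suppression = RECORDED_CALL_BYTES // payload_bytes

# Playing time of the recorded file at the PCM byte rate.
playing_time_s = RECORDED_CALL_BYTES / bytes_per_second

print(payload_bytes, data_rate_bps, packets_without_suppression, playing_time_s)
```

Under these assumptions the payload size, data rate and packet count agree with the table. The computed playing time of the file comes out slightly above the 70 seconds listed as the call duration; the text does not explain the difference.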
3.1 A Networking Definition of Delay
We refer to the delay as the one way network delay. One way delay is
important in voice communication, particularly if it is not the same in each
direction. Measuring the one way delay of network connections without
synchronised clocks is a non-trivial task. Hence many methods rely on
round-trip measurements and halve the values to estimate the one
way delay. We measured the network delay using RTCP, which
is part of the RTP standard [6]. A brief description follows. At given
intervals the sender transmits a so called “report” containing the time the
report was sent. On reception of this report the receiver records the current
time. Therefore two times are recorded within the report. When returning
the report to the sender, the receiver subtracts the time it initially put in
the report, therefore accounting for the time it held the report. Using this
information the sender can calculate the round-trip delay and importantly,
discount the time spent processing the reports at the receiver. This can be
done in both directions to see if any significant anomalies exist. We quote
the network delay in the results section as it explicitly does not include
any contribution from the end hosts. Therefore it is important to state the
delay is not the end-to-end delay but the network delay. We chose not to
include the delay contributed by the end system as it varies widely from
operating system to operating system and with how the VoIP application itself
is implemented. The delay incurred by an end system can vary from 20ms
up to 1000ms, irrespective of the stream characteristics.
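The report exchange described above can be sketched as follows. This is an illustration of the arithmetic only, not the Sicsophone implementation: the function and parameter names are hypothetical, and all times are assumed to be in seconds, with the send and receive instants both read from the sender's own clock so that no clock synchronisation is needed. In RFC 1889 terms the hold time corresponds to the "delay since last SR" field of a receiver report.

```python
def round_trip_delay(report_sent, reply_received, receiver_hold_time):
    """Round-trip network delay in seconds.

    report_sent and reply_received are read from the sender's clock;
    receiver_hold_time is how long the receiver kept the report before
    returning it, as reported by the receiver.
    """
    return (reply_received - report_sent) - receiver_hold_time


def one_way_estimate(round_trip):
    """Halve the round trip, which assumes a symmetric path."""
    return round_trip / 2.0


# Hypothetical example: reply arrives 0.5 s after the report was sent,
# of which the receiver held the report for 0.25 s.
rtt = round_trip_delay(report_sent=10.0, reply_received=10.5,
                       receiver_hold_time=0.25)
one_way = one_way_estimate(rtt)
print(rtt, one_way)
```

Halving the round trip is only an estimate; on an asymmetric path the two directions can differ, which is why running the measurement from both ends is useful.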
3.2 Jitter - An IETF Definition
Jitter is the statistical variance of the packet interarrival time. The IETF in
RFC 1889 defines the jitter to be the mean deviation (the smoothed absolute
value) of the packet spacing change between the sender and the receiver [6].
Sicsophone sends packets of identical size at constant intervals which implies
that $S_j - S_i$ (the difference between the sending times of two consecutive
packets) is constant. The difference of the packet spacing, denoted $D$, is
used to calculate the interarrival jitter. According to the RFC, the
interarrival jitter should be calculated continuously as each packet $i$ is
received. For one particular packet $i$, the interarrival jitter $J_i$ is
calculated from the jitter $J_{i-1}$ of the previous packet $i-1$ thus:
$$J_i = J_{i-1} + \frac{|D(i-1, i)| - J_{i-1}}{16}.$$
According to the RFC “the gain parameter 1/16 gives a good noise
reduction ratio while maintaining a reasonable rate of convergence”. As
stated earlier, buffering due to jitter adds to the delay of the application.
This delay is not accounted for in the results we present. The “real” time needed
for de-jittering depends on how the original time spacing of the packets
should be restored. For example, if a single packet buffer is employed it
would result in an extra 20ms (the packetisation time) being added to the
total delay. Note that packets arriving with a spacing greater than 20ms
should be discarded by the application as being too late for replay. Multiples
of 20ms can thus be allocated for every packet held before playout in this
simple example. To summarise, the delay due to de-jittering the arriving
stream is implementation dependent, thus we do not include it in our results.
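The running jitter estimate of RFC 1889 can be illustrated with a short sketch. It assumes the RFC's definition of the spacing difference, D(i-1, i) = (R_i - R_{i-1}) - (S_i - S_{i-1}), where S are sending times and R are arrival times; the symbol R and the function name are introduced here for illustration and the timestamps in the example are made up.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Return the running jitter estimate after each packet from the second on."""
    jitter = 0.0
    estimates = []
    for i in range(1, len(send_times_ms)):
        sender_spacing = send_times_ms[i] - send_times_ms[i - 1]
        receiver_spacing = recv_times_ms[i] - recv_times_ms[i - 1]
        d = receiver_spacing - sender_spacing
        # Exponential smoothing with gain 1/16, as in the formula above.
        jitter += (abs(d) - jitter) / 16.0
        estimates.append(jitter)
    return estimates


# Packets sent every 20 ms; the third arrives 5 ms late, the fourth on time.
send = [0.0, 20.0, 40.0, 60.0]
recv = [50.0, 70.0, 95.0, 110.0]
estimates = interarrival_jitter(send, recv)
print(estimates)
```

Because only differences of timestamps are used, the sender and receiver clocks do not need to be synchronised for this calculation.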
3.3 Counting One's Losses in the Network
We calculate the lost packets exactly as defined in RFC 1889. It defines
the number of lost packets as the expected number of packets minus
the number actually received. The loss is calculated using expected values
so as to allow more significance for the number of packets received. For
example, 20 lost packets from 100 packets has a higher significance than 1
lost from 5. For simple measures the percentage of lost packets from the
total number of packets expected is stated. As stated, the losses in this work
do not include those incurred by late arrivals, as knowledge of the buffer
playout algorithm is needed; therefore our values are only the network loss.
Detailed analysis of the loss patterns is not given in the results section; we
simply state the percentages of single, double and triple losses.
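A minimal sketch of this loss accounting is given below, assuming a trace is available as a list of received sequence numbers. The function name and trace format are hypothetical; sequence-number wrap-around is ignored and duplicate packets are counted once, both simplifications relative to RFC 1889.

```python
from collections import Counter


def loss_statistics(received_seq, first_seq, last_seq):
    """Return (expected, lost, loss_percent, run_lengths) for one trace.

    run_lengths maps the length of a burst of consecutive losses
    (1 = single, 2 = double, 3 = triple, ...) to how often it occurred.
    """
    expected = last_seq - first_seq + 1
    received = set(received_seq)
    lost = expected - len(received)

    runs = Counter()
    run_length = 0
    for seq in range(first_seq, last_seq + 1):
        if seq in received:
            if run_length:
                runs[run_length] += 1
            run_length = 0
        else:
            run_length += 1
    if run_length:  # trace ended inside a loss burst
        runs[run_length] += 1

    return expected, lost, 100.0 * lost / expected, dict(runs)


# Hypothetical trace: packet 3 is a single loss, packets 6 and 7 a double loss.
stats = loss_statistics([1, 2, 4, 5, 8, 9, 10], first_seq=1, last_seq=10)
print(stats)
```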
4 Results
The results of 15 weeks of measurements are condensed into Figure 2. The
table should be interpreted as an 11 by 11 matrix. The locations listed
horizontally across the top of the table are the locations used as receivers.
Listed vertically they are configured as senders. The values in the rightmost
column and bottom row are the statistical means for all the connections from
the host in the same row and to the host in the same column respectively.
For example, the last column of the first row (directly under “Mean”) shows
that the average delay to all destinations from Massachusetts is 112.8ms.
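As a check on how the "Mean" column is formed, the following sketch averages the eight per-destination delays in the Massachusetts row. The values are read from Figure 2; the variable names are only illustrative.

```python
# Mean delays (ms) from Massachusetts to the eight other sites (Figure 2).
delays_from_massachusetts = {
    "Michigan": 38.0, "California": 54.2, "Belgium": 67.1, "Finland": 97.1,
    "Sweden": 99.5, "Germany": 58.4, "Turkey": 388.2, "Argentina": 99.7,
}

# Unweighted mean over the destinations, rounded as in the table.
row_mean = sum(delays_from_massachusetts.values()) / len(delays_from_massachusetts)
print(round(row_mean, 1))
```

Note that this is an unweighted mean over destinations; the single satellite-connected path to Turkey raises the row mean well above the delay of most individual paths.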
Each cell includes the delay, jitter, loss, number of hops and the time
difference for each of the connections, labelled with the letters D, J, L, H
and T respectively. The units are milliseconds for the delay and the jitter,
percent for the loss, hops as reported by traceroute, and hours for the time
differences. A ‘+’ indicates that the local time at a site is ahead of the
one in the corresponding cell and a ‘-’ that it is behind. The values in
parentheses are the standard deviations. An NA signifies “Not Available” for
this particular combination of hosts. The bottom rightmost cell contains the
mean for all 18054 calls made, both to and from all the nine hosts involved.
The most general observation is that the quality of the paths is generally
good. The average delay is just below the ITU’s G.114 recommendation for
the end-to-end delay. Nevertheless, at 136ms it does not leave much time
for the end systems to encode/decode and replay the voice stream. A small
buffer would absorb the 4.1ms jitter and a loss rate of 1.8% is more than
acceptable with PCM coding [4].
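The remark about the time left for the end systems can be made concrete with a small budget calculation. The 150ms one-way figure used for G.114 is an assumption here (it is the commonly cited target and is not stated in the text); the mean network delay is the overall mean from Figure 2 and the 20ms term is the packetisation time from Table 1.

```python
G114_ONE_WAY_TARGET_MS = 150.0   # assumption: commonly cited G.114 one-way target
MEAN_NETWORK_DELAY_MS = 136.2    # overall mean delay from Figure 2
PACKETISATION_MS = 20.0          # one packet of audio, from Table 1

# Time left for encoding, buffering and playout in the end systems.
end_system_budget_ms = G114_ONE_WAY_TARGET_MS - MEAN_NETWORK_DELAY_MS

# What remains if the receiver holds a single 20 ms packet before playout.
budget_after_one_packet_ms = end_system_budget_ms - PACKETISATION_MS

print(round(end_system_budget_ms, 1), round(budget_after_one_packet_ms, 1))
```

Under this assumption even a single-packet playout buffer would use up more than the remaining budget on an average path, which illustrates why end system delay matters.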
There are two clear groupings from these results: those within the EU and
the US and those outside. The connections in Europe and the United States
(and between them) are very good. The average delay between the US/EU
hosts is 105ms, the jitter is 3.76ms and the loss 1.16%. Those outside fare
less well. The Turkish site suffers from large delays, which is not surprising
as the Turkish research network is connected via a satellite link to Belgium
(using the Geant network). The jitter and loss figures however are low,
5.7ms and 4% respectively. The Argentinian site suffers from asymmetry
problems. The quality when sending data to it is significantly worse than
when receiving data from it. The delay is 1/3 higher, the jitter is more than
twice that in the opposite direction and the loss is nearly four times higher
than when receiving from it. Unfortunately we could not perform a traceroute
from the host in Buenos Aires, so we cannot say how the route contributed
to these values.
We now turn our attention to results which are not related to any particular
site. As far as loss is concerned, the majority of losses are single losses.
78% of all the losses counted in all trace files were single losses whereas 13%
were double losses and 4% triple losses. Generally the jitter is low
relative to the delay of the link, approximately 3-4%. This is not totally
unexpected as the loss rates are also low. With the exception of the
Argentinian site, the sites did not exhibit large asymmetries; the values
were normally within 5% of each other in each direction. It is interesting
to note that the number of hops could vary during the 15 week measurement
period, denoted by parentheses in the hops field. Only very few (< 0.001%)
out of sequence packets were observed. Within [7] there are details of other
tests, such as the effect of using silence suppression, differing payload sizes
and daytime effects. In summary, no significant differences were observed in
these tests. We can attribute this (and the good quality results) to generally
well-provisioned academic networks.
5 Related Work
Similar but less extensive measurements were performed in 1998 [8]. Only
three of the hosts remain from four years ago so comparisons can only be
made for these routes. An improvement on the order of 5-10% has been
observed for these routes. We should point out, though, that the number of
sessions recorded four years ago numbered only tens per host, whereas on
this occasion we performed hundreds of calls from each host. Bolot et al.
looked at consecutive loss for designing an FEC scheme [9]. They concluded
that the number of consecutive losses is quite low and stated that most losses
are one to five losses at 8am and between one and ten at 4pm. This is in broad
agreement with the findings in this work; however, we did not investigate
the times during the day of the losses. Maxemchuk and Lo measured both
loss and delay variation for intra-state connections within the USA and
international links [10]. Their conclusion was that the quality depends on the
length of the connection and the time of day. We did not try different
connection durations but saw much smaller variations (almost negligible)
during a 24 hour cycle (see [7]). We attribute this to the small 64kbits per
second VoIP session on well dimensioned academic networks. It is worth
pointing out that our loss rates were considerably lower than Maxemchuk's (3-4%).
Dong Lin had similar conclusions [11], stating that in fact even calls within
the USA could suffer from large jitter delays. Her results on packet loss also
agree with those in [9], which is interesting, as the measurements were taken
some four years later.
6 Conclusions
We have presented the results of 15 weeks of voice over IP measurements
consisting of over 18000 recorded VoIP sessions. We conclude that the quality
of VoIP is good, and in most cases exceeds the requirements of many speech
quality recommendations. Recall that all of the sites were at academic
institutions, which is an important factor when interpreting these results as
most universities have well provisioned links, especially to other academic
sites. Nevertheless, the loss, delay and jitter values are very low and,
compared with previous measurements, the quality trend is improving. We can
only attribute this to more capacity and better managed networks than those
four years ago. However, some caution should be expressed as the sample
period was only 15 weeks, the bandwidth of the flows was very small and the
paths were only used once per hour.
We have a large number of sample sessions so can be confident the findings
are representative of the state of the network at this time. One conclusion
is that VoIP is obviously dependent on the IP network infrastructure and
not only on the geographic distance. This can be clearly seen in the
differences between the Argentinian and Turkish hosts. Concerning the actual
measurement methodology, we have found that performing measurements on this
scale is not an easy task. Different access mechanisms, firewalls, NATs and
not having permissions on all machines complicate the work of obtaining
(and later validating) the measurements. Since it is not possible to envisage
all the possible uses for this data we have made it available for further
investigation at http://www.sics.se/~ianm/COST263/cost263.html.
References
[1] ITU-T Recommendation G.107, “The E-Model, a computational model
for use in transmission planning,” December 1998.
[2] ITU-T Recommendation G.114, “General Characteristics of International
Telephone Connections and International Telephone Circuits: One-Way
Transmission Time,” Feb. 1998.
[3] R. Cole and J. Rosenbluth, “Voice over IP Performance Monitoring,”
ACM Computer Communication Review, 2002.
[4] B. L. F. Sun, G. Wade and E. C. Ifeachor, “Impact of Packet Loss
Location on Perceived Speech Quality,” in Proceedings of 2nd IP-Telephony
Workshop (IPTEL ’01), (Columbia University, New York), pp. 114–122,
April 2001.
[5] N. Kitawaki, T. Kurita, and K. Itoh, “Effects of Delay on Speech Qual-
ity,” NTT Review, vol. 3, pp. 88–94, Sept. 1991.
[6] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A
Transport Protocol for Real-Time Applications,” RFC 1889, Internet
Engineering Task Force, Jan. 1996.
http://www.rfc-editor.org/rfc/rfc1889.txt.
[7] F. Li, “Measurements of Voice over IP Quality,” Master’s thesis, KTH,
Royal Institute of Technology, Sweden, 2002.
[8] O. Hagsand, K. Hansson, and I. Marsh, “Measuring Internet Telephone
Quality: Where are we today?,” in Proceedings of the IEEE Conference
on Global Communications (GLOBECOM), (Rio, Brazil), IEEE, Nov.
1999.
[9] J. Bolot, H. Crepin, and A. Garcia, “Analysis of audio packet loss in
the internet,” in Proc. International Workshop on Network and Operat-
ing System Support for Digital Audio and Video (NOSSDAV), Lecture
Notes in Computer Science, (Durham, New Hampshire), pp. 163–174,
Springer, Apr. 1995.
[10] N. F. Maxemchuk and S. Lo, “Measurement and interpretation of voice
traffic on the Internet,” in Conference Record of the International Con-
ference on Communications (ICC), (Montreal, Canada), June 1997.
[11] D. Lin, “Real-time voice transmissions over the Internet,” Master’s
thesis, Univ. of Illinois at Urbana-Champaign, 1999.
Rows are senders, columns are receivers; * marks the diagonal (same host).

Sender           Massachusetts Michigan      California    Belgium       Finland       Sweden        Germany       Turkey        Argentina     Mean

Massachusetts D  *             38.0(17.1)    54.2(15.8)    67.1(15.5)    97.1(2.6)     99.5(8.5)     58.4(5.0)     388.2(43.2)   99.7(4.9)     112.8
              J  *             2.4(1.7)      2.4(1.8)      3.6(1.5)      2.5(1.5)      3.2(1.7)      4.5(1.4)      10.4(4.9)     19.9(8.4)     6.1
              L  *             0.1(0.6)      0.1(0.9)      0.1(0.8)      0.1(0.8)      0.04(0.2)     0.0(0.0)      4.9(4.7)      8.9(7.2)      1.2
              H  *             14(+1)        19            11            15            21            17(+3)        20            25            17
              T  *             0             -3            +6            +7            +6            +6            +7            +1

Michigan      D  36.4(15.4)    *             40.4(4.5)     63.5(4.2)     88.2(8.0)     86.7(4.7)     63.6(8.2)     358.9(44.9)   112.1(10.6)   106.2
              J  4.7(0.8)      *             4.4(0.8)      4.3(0.7)      4.1(0.7)      5.2(0.6)      7.3(1.9)      5.6(1.7)      18.7(7.9)     6.8
              L  0.0(0.2)      *             0.2(1.1)      0.0(0.1)      0.1(1.1)      0.1(2.2)      0.2(0.9)      3.0(1.9)      6.5(7.0)      1.3
              H  14(+1)        *             20(+1)        11            17            23            16(+1)        20            25            18
              T  0             *             -3            +6            +7            +6            6             7             +1

California    D  54.5(16.7)    40.6(5.1)     *             81.0(2.2)     106.0(3.0)    108.0(2.4)    81.5(1.8)     386.9(60.5)   123.9(12.4)   122.2
              J  2.0(1.0)      1.2(0.6)      *             1.6(0.8)      1.4(0.8)      2.1(0.9)      4.9(1.5)      5.3(1.7)      18.1(9.9)     4.6
              L  0.1(0.36)     0.1(1.9)      *             0.2(0.8)      0.6(1.4)      0.2(0.3)      2.8(3.0)      4.4(2.4)      8.9(8.2)      2.2
              H  18(+1)        21            *             20            25(+1)        30(+2)        23            23            25            23
              T  +3            +3            *             +9            +10           +9            +9            +10           +4

Belgium       D  65.2(10.1)    63.4(3.3)     84.0(1.3)     *             31.3(0.6)     33.4(0.2)     16.6(10.4)    341.1(24.7)   136.5(7.1)    96.4
              J  1.6(0.6)      0.6(0.1)      0.9(0.8)      *             0.9(0.5)      1.6(0.9)      3.4(1.5)      6.9(2.0)      NA            2.0
              L  0.0(0.0)      0.0(0.0)      1.2(1.0)      *             0.0(0.0)      0.0(0.0)      0.21(0.7)     3.8(2.7)      NA            0.6
              H  16            17            23            *             17            22            13            16(+2)        19            17
              T  -6            -6            -9            *             +1            0             0             +1            -5

Finland       D  97.8(4.2)     86.8(1.9)     109.9(4.7)    30.7(0.3)     *             13.6(1.0)     26.8(7.3)     321.2(39.3)   161.5(12.2)   106.3
              J  1.7(0.8)      1.1(0.6)      1.4(0.8)      1.4(0.6)      *             1.9(0.9)      3.9(1.1)      3.4(1.7)      17.4(8.2)     4.1
              L  0.0(0.1)      0.0(0.3)      0.7(1.4)      0.1(0.3)      *             0.0(0.0)      0.0(0.0)      3.2(1.7)      7.5(6.5)      1.4
              H  15(+1)        17(+1)        24(+2)        16            *             20            20(+1)        17(+2)        19            18
              T  -7            -7            -10           -1            *             -1            -1            0             -6

Sweden        D  99.3(8.8)     84.9(1.9)     105.6(2.1)    33.3(0.4)     13.5(0.5)     *             29.8(12.8)    322.2(30.3)   165.6(17.9)   107.8
              J  3.0(1.9)      2.5(2.0)      3.2(1.96)     2.8(1.6)      2.4(1.8)      *             4.8(2.5)      3.2(1.49)     NA            2.8
              L  0.0(0.0)      0.03(0.4)     0.1(0.1)      0.1(0.3)      0.0(0.01)     *             0.0(0.0)      2.9(1.0)      NA            0.4
              H  22(+1)        25            30            24            21            *             25            26            41            26
              T  -6            -6            -9            0             +1            *             0             +1            -5

Germany       D  63.5(9.6)     60.4(0.5)     84.4(1.0)     11.1(0.2)     27.8(7.3)     29.2(7.6)     *             300.7(39.7)   149.8(15.6)   90.9
              J  1.72(0.7)     0.7(0.3)      1.8(0.7)      0.8(0.3)      1.0(0.5)      1.5(0.6)      *             4.8(2.1)      NA            1.6
              L  0.0(0.0)      0.0(0.0)      2.5(1.9)      0.0(0.0)      0.0(0.0)      0.0(0.0)      *             3.7(2.5)      NA            0.8
              H  15            16            22            12            17            22            *             16            18            17
              T  -6            -6            -9            0             +1            0             *             +1            -5

Turkey        D  379.1(47.1)   387.9(35.5)   410.9(43.9)   330.2(28.6)   318.9(42.4)   311.1(8.3)    378.2(49.3)   *             490.8(26.0)   375.9
              J  8.6(0.7)      8.9(1.2)      8.8(2.5)      9.2(2.0)      8.8(0.6)      9.1(0.7)      10.7(1.2)     *             NA            8.0
              L  8.1(2.8)      8.0(2.9)      7.6(6.8)      7.10(4.0)     7.8(2.7)      8.4(3.1)      8.0(3.1)      *             NA            6.9
              H  18(+1)        20            19            17            19            25            16            *             18            19
              T  -7            -7            -10           -1            0             -1            -1            *             -6

Argentina     D  117.0(30.8)   146.7(44.2)   152.0(47.8)   NA            164.1(27.2)   160.9(47.7)   180.5(50.5)   NA            *             115.2
              J  4.2(2.0)      4.3(2.3)      3.1(2.4)      4.2(2.0)      3.9(2.2)      2.9(0.8)      4.7(1.5)      6.0(1.2)      *             4.2
              L  0.5(1.4)      0.5(1.5)      0.6(1.8)      0.5(1.4)      0.5(1.4)      0.0(0.1)      0.1(0.1)      5.8(3.0)      *             1.1
              H  NA            NA            NA            NA            NA            NA            NA            NA            *             NA
              T  -1            -1            -4            +5            +6            +5            +5            +6            *

Mean          D  114.1         113.6         115.7         77.1          105.8         105.2         104.4         345.6         180.0         136.2
              J  3.4           3.4           3.2           3.5           3.1           3.4           5.5           5.7           9.3           4.1
              L  1.1           1.1           1.6           1.0           1.1           1.1           1.4           4.0           4.00          1.8
              H  14            16            19            13            16            20            16            17            23            18

Figure 2: A summary of 18000 VoIP sessions. The delay, jitter and loss
for the nine sites. The delay and jitter are in milliseconds, the losses are in
percentages. The number of hops and time zones (in hours) are also given.
The means for each site and all sites are stated and standard deviations are
in parenthesis.
