So far, many different interprocess communication facilities have been
described. How fast are they? From the performance measurements presented
in this chapter, conclusions can be drawn about which communication
facility to use in which situation.
A large number of programs were written to investigate different aspects
of the performance of all described IPC facilities.
All performance measurements were done on three different systems, SparcCenter 2000/Solaris 2.5, HP 9000/710/HP-UX 9.01, and i586/Linux 1.2.9, to allow more general conclusions.
The performance of all IPC facilities is measured with different message/buffer
sizes (where appropriate), starting at 1 byte and doubling up to the default
limit of the particular IPC facility. This limit sometimes differs between
the three systems. Where the limit is very large, an appropriate maximum value is chosen.
The performance graphs use double-logarithmic scales for two reasons: the buffer sizes double from step to step (and are thus equally spaced on a logarithmic axis), and the measured values span several orders of magnitude.
This chapter summarizes the results derived from the various performance
measurement programs. The Appendix of the WWW-Version of this document
includes pointers to the generated raw results.
Interesting performance measurement outcomes are:
1. the time needed to establish a connection,
2. the throughput when transferring larger amounts of data, and
3. the time needed per operation.
Number 1 is only interesting for interprocess communication facilities
where a connection has to be established. Only the TCP connection example
has this property. Similar measurements
for filesystem-dependent IPC facilities were not done for the following
reason: their speed depends heavily on the file system (and not
on the operating system), and the file systems used on aidan and
all pikes are distributed. This would have had an unpredictable impact on the results.
Number 2 is interesting for all examples that are able to transfer larger
amounts of data. Most of the interprocess communication examples have this property.
The result is measured in Kb (1024 bytes) per second.
Number 3 is useful for nearly all examples. The resulting graphs show
how much unavoidable overhead is involved in a certain operation.
The results are given in µs per operation (where appropriate, for a
specific buffer size). Some other results are also given in operations per
second.
For all measurements the elapsed clock (wall-clock) time is used to display the results, as this
is the time the user/program actually experiences. Where CPU time is measured,
the accumulated system and user CPU time is given.
To measure the time spent in the different interprocess communication facilities, some functions and classes were coded. They measure the elapsed clock time and the accumulated CPU time.
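The following sketch shows the underlying system interfaces such timer classes would typically wrap: gettimeofday() for the elapsed clock time and times() for the accumulated user and system CPU time. It is only an illustration of the technique; the actual interfaces of clocktimer and cputimer are not reproduced here.

/* Sketch of the system interfaces a clocktimer/cputimer class would
 * typically wrap: gettimeofday() for elapsed real time and times()
 * for accumulated user + system CPU time.  The actual class
 * interfaces of the measurement code are not reproduced here. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
    struct timeval start, stop;
    struct tms cpu_start, cpu_stop;
    long ticks = sysconf(_SC_CLK_TCK);   /* CPU time resolution: 1/100 s on the used systems */

    gettimeofday(&start, NULL);
    times(&cpu_start);

    /* ... operation to be measured ... */

    gettimeofday(&stop, NULL);
    times(&cpu_stop);

    long elapsed_us = (stop.tv_sec - start.tv_sec) * 1000000L
                    + (stop.tv_usec - start.tv_usec);
    long cpu_ticks  = (cpu_stop.tms_utime + cpu_stop.tms_stime)
                    - (cpu_start.tms_utime + cpu_start.tms_stime);

    printf("elapsed: %ld us, CPU: %ld ms\n",
           elapsed_us, cpu_ticks * 1000L / ticks);
    return 0;
}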
Time measurements on computers normally have an impact on the time they are supposed to measure. To ensure that the measurements do not influence the measured time too much, it is important to know how long the measurements themselves take. The overhead of the two described classes clocktimer and cputimer is measured with the program performance/timeroverhead.c. This is the result of running the program on SparcCenter 2000/Solaris 2.5:
performance/timeroverhead.log.SunOS5
It can be seen that the overhead is not very high. Therefore the overhead
of measuring the time is ignored in all performance measurements.
To reduce the influence of the measurement itself and of other temporary
disturbances, all measurements are repeated several times
(normally 1000 times or more).
The measurements were done in such a way that the influence of other
interferences (other users using the computer systems at the same time etc.)
was kept to a minimum.
The precision of the internal timers of a computer system has an
influence on the accuracy of the measurements. The precision of the reported
CPU time is normally much lower than the precision of the ``real time'' clock.
The CPU time precision is 1/100 second on all used systems,
whereas the real time clock is reported to be accurate to a few µs.
To reduce the replication of time measurement code, the class
measurement, defined in common/measurement.h, was developed.
It uses the classes cputimer and clocktimer internally to measure
time. The measured times are written to an internal memory buffer, which avoids
time-consuming I/O operations that could influence the measurements.
The measured times can later be written out in various ways.
UNIX system calls introduce an unavoidable overhead. This overhead was
measured with the program performance/systemcalloverhead.c, which
measures the time of a getpid() system call. This system call is
regarded as one of the cheapest system calls in a UNIX operating system,
as it only has to copy an integer value (the process ID) from the
kernel to the user process. The result is shown in Table 20.
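The following sketch illustrates how such a system call overhead measurement is typically structured; the loop count is an illustrative choice and the actual program performance/systemcalloverhead.c may differ in detail.

/* Sketch of a getpid() overhead measurement: time a tight loop of
 * getpid() calls and divide by the iteration count.  The loop count
 * is an illustrative choice, not taken from the original program. */
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define CALLS 100000L

int main(void)
{
    struct timeval start, stop;

    gettimeofday(&start, NULL);
    for (long i = 0; i < CALLS; i++)
        (void) getpid();                 /* the "cheapest" system call */
    gettimeofday(&stop, NULL);

    double us = (stop.tv_sec - start.tv_sec) * 1e6
              + (stop.tv_usec - start.tv_usec);
    printf("%.2f us per getpid() call\n", us / CALLS);
    return 0;
}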
In all following measurement descriptions the number of system calls made to obtain the measured data is included. This unavoidable overhead is not subtracted from the measured results, as programs/users experience these delays.
UNIX signals were used in some measurement programs to synchronize the measurement of different processes. The program performance/signaloverhead.c measures how long it takes to send a signal from one process to another and to get a response signal. The described operation involves two system calls: kill() and sigsuspend(). This is the result of the program performance/signaloverhead.c on SparcCenter 2000/Solaris 2.5.
performance/signaloverhead.log.SunOS5
Table 21 shows how often the described operation can be done on the different systems.
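The following sketch illustrates one such signal round trip between a parent and a child process using kill() and sigsuspend(); the timing loop and error handling of the actual measurement program are omitted.

/* Sketch of one signal round trip between a parent and a child process
 * using kill() and sigsuspend(), as in the described measurement.
 * The timing loop and error handling are omitted. */
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void handler(int sig) { (void) sig; }    /* only needed to interrupt sigsuspend() */

int main(void)
{
    sigset_t block, wait_mask;
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigaction(SIGUSR1, &sa, NULL);

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, NULL);       /* keep SIGUSR1 pending until sigsuspend() */
    sigemptyset(&wait_mask);                    /* mask used while waiting: nothing blocked */

    pid_t child = fork();
    if (child == 0) {                           /* child: wait for the signal, then answer */
        sigsuspend(&wait_mask);
        kill(getppid(), SIGUSR1);
        _exit(0);
    }
    kill(child, SIGUSR1);                       /* parent: one round trip = */
    sigsuspend(&wait_mask);                     /* signal to child + response signal */
    return 0;
}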
Signals are used in some of the local interprocess communication examples
to synchronize corresponding processes. Some functions defined in
common/synchronize.c hide the implementation details of
how to synchronize related processes via signals.
Program performance/pipes.c measures the performance
of pipes with different buffer
sizes for read() and write() operations.
Figure 32
shows the measured unsynchronized read() throughput,
while Figure 33 shows
the time for the (also unsynchronized) read() function calls.
The write counterpart is not shown, as the read() and write() results are nearly the same. It cannot be said how many read()/write() operations are involved in transferring the data, due to the different read() semantics for pipes, FIFOs, sockets, etc. The problem is that I/O operations can return less than the requested amount of data, so the system call might have to be issued again and again until all data is read/written (common/readn.c and common/writen.c handle these cases). Details about this problem are given in [Stev92, page 406].
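The following sketch shows the usual technique for handling short reads, as described in [Stev92]; common/readn.c is likely to be similar, but is not reproduced here verbatim.

/* Sketch of the usual technique for handling short reads: keep calling
 * read() until the requested number of bytes has arrived, end of file
 * is reached, or an error occurs. */
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>

ssize_t readn(int fd, void *buf, size_t n)
{
    size_t left = n;
    char *p = buf;

    while (left > 0) {
        ssize_t r = read(fd, p, left);
        if (r < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;              /* real error */
        }
        if (r == 0)
            break;                  /* end of file / peer closed */
        left -= (size_t) r;
        p += r;
    }
    return (ssize_t)(n - left);     /* number of bytes actually read */
}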
The performance of named pipes is measured with the client/server pair of
programs performance/fifoclient.c and
performance/fifoserver.c. Figures 34 and
35 show the results only for the read()
system calls, as the results for the corresponding write() system calls are
nearly identical.
At least one read()/write() system call is issued for each buffer size.
The program performance/message.c was used to measure the performance
of message queues. Figure 36 shows their throughput, and
Figure 37 the time per msgsnd() system call.
Please note that the maximum message size is quite small: it was less than 4 Kb on all three systems.
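The following sketch shows the msgsnd()/msgrcv() pair around which such a measurement is built; the message size and type are illustrative and do not necessarily match performance/message.c.

/* Sketch of the msgsnd()/msgrcv() pair around which the message queue
 * measurement is built.  The message size and type are illustrative. */
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct msg {
    long mtype;                     /* message type, must be > 0 */
    char mtext[2048];               /* illustrative payload size */
};

int main(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    struct msg m;

    m.mtype = 1;
    memset(m.mtext, 0, sizeof m.mtext);

    msgsnd(qid, &m, sizeof m.mtext, 0);            /* enqueue one message */
    msgrcv(qid, &m, sizeof m.mtext, 0, 0);         /* dequeue it again */

    msgctl(qid, IPC_RMID, NULL);                   /* remove the queue */
    return 0;
}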
Semaphores are not used to transfer information; they are used to
synchronize the access of different processes to other resources.
How many semaphore operations can be done per second? The rounded result
for one semop() system call derived with the program
performance/semaphore.c for
the different systems is shown in Table 22.
Please refer to Chapter 3.5.2 for the meaning of the undo facility of semaphores.
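The following sketch shows a single semop() operation, with SEM_UNDO marking the undo facility referred to above; the semaphore setup is illustrative and not taken from performance/semaphore.c.

/* Sketch of a single semaphore operation with semop().  SEM_UNDO is
 * the undo facility referred to above: the kernel reverses the
 * operation automatically if the process terminates. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    struct sembuf op;

    /* "up": increase the semaphore by one */
    op.sem_num = 0;
    op.sem_op  = 1;
    op.sem_flg = SEM_UNDO;          /* drop this flag to measure without undo */
    semop(semid, &op, 1);

    /* "down": decrease it again (would block if the value were 0) */
    op.sem_op = -1;
    semop(semid, &op, 1);

    semctl(semid, 0, IPC_RMID);     /* remove the semaphore set */
    return 0;
}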
The program performance/memory.c is used to measure whether there
is any time penalty in writing to shared memory instead of normal
memory. A buffer of 1 Mb in ``normal'' memory and an equally sized
shared memory region are allocated. The program writes zeros into both buffers.
A read test could not be performed, as the compilers optimized the read
code away to ``nothing''.
The best results are given in Table 23.
The results suggest that there is no difference in accessing ``normal''
and shared memory. No system call was used to write into the memory.
One might ask how large the overhead of establishing memory sharing is. Figure 38 shows the time needed to create, attach, detach, and remove a shared memory region of different sizes. Four system calls were used per operation: shmget(), shmat(), shmdt(), and shmctl().
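The following sketch shows the four system calls timed in Figure 38; the region size and the single write access are illustrative.

/* Sketch of the four system calls timed in Figure 38: create, attach,
 * detach and remove a shared memory region.  The size is illustrative. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t size = 1024 * 1024;                                 /* 1 Mb region */

    int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);   /* create */
    char *addr = (char *) shmat(shmid, NULL, 0);               /* attach */

    addr[0] = 0;                     /* shared memory is used like normal memory */

    shmdt(addr);                                               /* detach */
    shmctl(shmid, IPC_RMID, NULL);                             /* remove */
    return 0;
}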
As already described in Chapters 3.4 and 6.1.5
the socketpair() system call can be used to create a
bi-directional pipe. The
program performance/socketlocal.c is nearly identical
to the corresponding program performance/pipes.c
using the pipe() system call.
Figure 39 shows the throughput of local sockets,
Figure 40 the time for the read() system calls.
The performance (comparing Figure 39 with Figure 32) is roughly that of the pipe() version, and in fact most pipe() implementations today seem to be based on the socketpair() system call.
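The following sketch illustrates how socketpair() creates such a bi-directional channel; it is a minimal illustration, not an excerpt from performance/socketlocal.c.

/* Sketch of creating a bi-directional "pipe" with socketpair(). */
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    char buf[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fd) < 0)
        return 1;

    /* unlike pipe(), both ends can be read from and written to */
    write(fd[0], "ping", 4);
    read(fd[1], buf, sizeof buf);
    write(fd[1], "pong", 4);
    read(fd[0], buf, sizeof buf);

    close(fd[0]);
    close(fd[1]);
    return 0;
}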
The client program performance/udpsclient.c measures the time that
UDP messages need from one process to another. The sender waits for a short
acknowledgment message from the server performance/udpsserver.c.
Figure 41 shows the throughput if server and
client are running on the same system, and Figure 42
the throughput between aidan and pike17 and also the throughput
between two HP 9000/710 computers on the same physical Ethernet.
The time for the corresponding system calls can be seen in Figures
43 and 44. Per operation the
following system calls are executed both in the client and in
the server process: recvfrom() and sendto(). Please
note that the default maximum size for UDP datagrams differed between the
used systems: it was 16 Kb on SparcCenter 2000/Solaris 2.5 and i586/Linux 1.2.9, and less than 16 Kb on HP 9000/710/HP-UX 9.01.
The computers aidan and pike17 are not connected to the same
physical Ethernet, which explains why the throughput between them is lower
than between the two pikes.
UDP is not reliable, and this had an impact on the measurements between different computers. The number of repetitions for each buffer size had to be reduced, as otherwise none of the tests succeeded (the measurement program relies on the data being delivered; if a message is lost, both client and server wait for each other forever).
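The following sketch shows one client-side round trip of such a UDP measurement: a sendto() followed by a blocking recvfrom() for the acknowledgment. The server address and port are illustrative assumptions, not taken from performance/udpsclient.c.

/* Sketch of one client-side round trip of the UDP measurement: send a
 * datagram with sendto() and wait for the short acknowledgment with
 * recvfrom().  Server address and port are illustrative assumptions. */
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    char data[1024], ack[1];
    struct sockaddr_in srv;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&srv, 0, sizeof srv);
    srv.sin_family = AF_INET;
    srv.sin_port = htons(7777);                       /* illustrative port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);   /* illustrative address */

    memset(data, 0, sizeof data);
    sendto(fd, data, sizeof data, 0, (struct sockaddr *) &srv, sizeof srv);
    recvfrom(fd, ack, sizeof ack, 0, NULL, NULL);     /* blocks forever if the datagram is lost */

    close(fd);
    return 0;
}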
The client program performance/tcpsclientoverhead.c together with the
server program performance/tcpsserverconcurrentoverhead.c is
used to measure the overhead of establishing a TCP connection.
The connection establishment measurements include the execution of
the system calls socket(), connect(), and close()
on the client side and
accept(), fork() (the TCP server is implemented as a
concurrent server which can handle multiple requests simultaneously), and two
close()s on the server side.
The rounded results are given in Table 24.
For comparison: on average it takes about 6600 µs to establish a connection
from aidan to pike17, and 2400 µs between two pikes on
the same physical Ethernet.
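The following sketch shows the client-side system calls whose combined cost is given in Table 24; the server address and port are illustrative assumptions.

/* Sketch of the client-side system calls timed for Table 24:
 * socket(), connect() and close().  Server address and port are
 * illustrative assumptions. */
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in srv;

    memset(&srv, 0, sizeof srv);
    srv.sin_family = AF_INET;
    srv.sin_port = htons(7777);                       /* illustrative port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);   /* illustrative address */

    int fd = socket(AF_INET, SOCK_STREAM, 0);         /* one connection establishment: */
    connect(fd, (struct sockaddr *) &srv, sizeof srv);
    close(fd);                                        /* server side: accept(), fork(), 2 x close() */
    return 0;
}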
Figure 45 shows the throughput of a local TCP
connection. The executed system calls are (possibly several) write()s
on the client side and (possibly several) read()s on the server side.
The client program performance/tcpsclientthroughput.c and the server program performance/tcpsserverconcurrent.c, which measured the data for Figure 45, were also used to measure the TCP throughput between two HP 9000/710 computers on the same physical network and the TCP throughput between aidan and pike17. The result is shown in Figure 46. The time used for the corresponding read() system calls is shown in Figures 47 and 48.
The UDP and TCP measurements were also performed using the TLI interface on
aidan only, as the other systems unfortunately do not support the
Transport Layer Interface.
The results were similar to the ones from the socket interface (about 20% slower for TCP throughput and about 15% slower for UDP throughput), so they are not shown here.
The IPC facilities which were investigated have major differences in
performance.
The three different systems produce dramatically different performances,
but in principle their behaviour for a specific
IPC facility is very similar.
To synchronize processes on one computer system, semaphores should
be used instead of signals, as the overhead of signals is dramatic,
caused by the context switches that have to be performed.
The size of the buffer has a great impact on the throughput of a local
IPC facility. Up to about 4-8 Kb (depending on the system and the
IPC facility), larger buffers normally produce a higher throughput.
This is likely to be caused by the fixed overhead of the system calls.
Beyond that, larger buffer sizes cause no major performance gain or
loss.
For interprocess communication on one computer, shared memory is the
fastest way to transfer huge amounts of information from one process to
another. There is no penalty in using shared memory instead of normal
memory. The size of the shared memory region has nearly no impact on
the time that is needed to create, attach, detach, and remove a shared
memory region.
Message queues, originally designed to provide a faster form
of IPC, are now slower than other forms of IPC (possibly because of the
small maximum message sizes). They should therefore no longer be used for
high-speed information exchange. The possibility of selecting messages
by priority may still justify their use for some purposes.
For distributed IPC facilities the time to transfer the data has a stronger
influence than for the local ones. The time per operation is
therefore more nearly proportional to the amount of data transferred than
with the local IPC facilities.
UDP is faster than TCP at transferring data between different computers,
but the application has to take care of lost packets, congestion,
flow control, etc. On aidan and on the pikes the local use
of TCP has a much higher throughput than UDP.
The overhead of establishing a TCP connection is quite high. If only small
amounts of data have to be transferred, UDP should be used: in the time
needed to establish a TCP connection, UDP might already have transferred all
the data.
The speed of TCP used locally on a system is not dramatically lower
than that of equivalent local IPC facilities (e.g. pipes), at least on aidan
and the pikes. Therefore, where speed is not very critical,
applications that might later be distributed should use TCP instead of
pipes and local sockets wherever possible.
Most IPC facilities have a fixed time overhead per operation. Up to a
specific buffer size the time per operation is nearly independent of the
buffer size used.
Tables 25 and 26 present the most appropriate facilities for particular purposes. ``UDP'' and ``TCP'' should be read as using sockets or the TLI with the specified protocol.
Connection Type                 | Local                                         | Distributed
Connection-oriented, reliable   | pipes, FIFOs, stream sockets, (shared memory) | TCP
Connection-oriented, unreliable | -                                             | -
Connectionless, reliable        | message queues, semaphores, signals           | -
Connectionless, unreliable      | datagram sockets                              | UDP
Various measurements were done on three different systems to get
information about the performance of different interprocess communication
facilities.
The performance difference between the three systems is sometimes high,
but the characteristics are very similar. Shared memory is the
fastest mechanism to transfer large amounts of data between processes on one computer,
while UDP is the IPC mechanism with the highest throughput between different
computers. The size of buffers should not be less than 4 Kb if larger
amounts of data are exchanged.
The performance measurements demonstrated the unreliability of the UDP protocol. Therefore the next chapter examines what is necessary to ensure reliable datagram exchange.