6 Distributed IPC

This chapter covers the UNIX facilities for distributed interprocess communication. Since different computers do not share memory, they normally exchange information over networks and protocols. Two application program interfaces (APIs) to communication protocols are introduced:
  1. the socket interface
  2. the Transport Layer Interface (TLI)
The TCP/IP protocol suite is used as an example where appropriate.

6.1 Sockets

The socket interface to networking was developed at the University of California at Berkeley and was introduced with 4.2BSD. The basic idea of the socket interface was to make interprocess communication similar to file I/O.

However, network protocols are more complex than conventional I/O devices, so the interaction between user processes and network protocols has to be richer than that between user processes and conventional I/O facilities. Normal file I/O provides no way for server code to wait passively for connections while client code actively establishes them, and it offers no mechanism for sending datagrams to varying destinations: the open() system call always binds a fixed file name to a descriptor.

The basic building block for communication is the socket, an endpoint of communication. A name may be bound to a socket. Sockets can have different types, and a socket can be associated with one or more processes. A socket exists within a communication domain, an abstraction that bundles the properties of processes communicating through sockets. Normally, data can only be exchanged between sockets in the same domain.

6.1.1 Supported Protocols

The socket interface supports different communication protocols. The UNIX domain protocol (used for interprocess communication between processes on the same computer) and the Internet domain protocol (used for interprocess communication between processes on either the same or different computers) are described.

6.1.2 Connections and Associations

The term connection is used in this chapter for a communication link between two processes. The term association is used for an n-tuple that completely specifies the two endpoints of communication that make up a connection. For connections which use the TCP/IP protocol suite this n-tuple is the 5-tuple

{protocol, local IP address, local port number, foreign IP address, foreign port number}

For the UNIX domain the following 3-tuple specifies an association:
{protocol, local path, foreign path}

All members of an n-tuple have to be specified by both processes wishing to communicate before data can be exchanged. Different system calls fill in different parts of the n-tuples.

6.1.3 Asymmetry

The typical client/server relationship, described in Chapter 2.2.2, is not symmetrical. Establishing a network connection requires that both programs know which role, client or server, they are to play. This is reflected in the design of the socket interface: some system calls are only useful for clients, while others are only useful for servers.

6.1.4 Socket Types

A socket has a specific type which influences its communication properties. Processes are presumed to communicate only between sockets of the same type. Four different types of sockets are available: stream sockets, datagram sockets, raw sockets, and sequenced packet sockets.

A stream socket provides bi-directional, reliable, sequenced, unduplicated flow of data. Message boundaries are not visible. A datagram socket supports bi-directional flow of data. Record boundaries are visible. Messages may be duplicated, lost, or arrive in a different order than they were sent. A raw socket allows user processes to access the underlying communication protocols. Raw sockets are not intended to be used by normal applications and therefore not described further. A sequenced packet socket is similar to a stream socket, with the exception that record boundaries are preserved. Sequenced packet sockets are not supported by the Internet domain, therefore they are also not discussed further.

6.1.5 Basic Socket Function Calls

Socket Creation

A socket is created with the

int socket(int domain, int type, int protocol)
system call. The domain parameter specifies the communication domain within which communication will take place; it selects the address or protocol family to be used. For the UNIX domain AF_UNIX (AF for address family) has to be specified, for the Internet protocols AF_INET. The type indicates the socket type as described above: SOCK_STREAM for stream sockets and SOCK_DGRAM for datagram sockets. The protocol parameter can select a specific protocol; normally 0 is used to select the default protocol. With the Internet protocols, a stream socket then uses TCP and a datagram socket uses UDP. The returned integer is a new socket descriptor which is needed for further access to the socket. UNIX allocates socket descriptors and normal file descriptors from the same set of integers. This is part of the goal to make networking as similar to standard file I/O as possible; sockets behave exactly like UNIX files or devices where appropriate. The socket() system call fills in the protocol member of the association n-tuple.
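For example (a minimal sketch), stream and datagram endpoints in the Internet domain are created like this:

 int tcp_sd = socket(AF_INET, SOCK_STREAM, 0);  /* stream socket, default protocol TCP   */
 int udp_sd = socket(AF_INET, SOCK_DGRAM, 0);   /* datagram socket, default protocol UDP */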

Sockets for the UNIX domain can also be created with the

int socketpair(int domain, int type, int protocol, int sd[2])

system call. It creates a pair of connected sockets and places the two socket descriptors in the two elements of sd. This call is the socket equivalent of the pipe() system call. The first three parameters are the same as for the socket() system call.
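The following sketch shows socketpair() used much like pipe(), except that the connection is bi-directional: the parent sends a string to the child, which echoes it back (error handling is reduced to a minimum):

 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <unistd.h>

 int main(void)
 {
     int sd[2];
     char buf[64];
     ssize_t n;

     /* create two connected stream sockets in the UNIX domain */
     if (socketpair(AF_UNIX, SOCK_STREAM, 0, sd) < 0) {
         perror("socketpair");
         return 1;
     }
     if (fork() == 0) {                  /* child: echo one message back */
         n = read(sd[1], buf, sizeof(buf));
         write(sd[1], buf, n);
         return 0;
     }
     write(sd[0], "hello", 5);           /* parent: send and wait for the echo */
     n = read(sd[0], buf, sizeof(buf));
     printf("parent received: %.*s\n", (int) n, buf);
     return 0;
 }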

Binding

The local part of an association (local address, local port number for the Internet domain, and local pathname for the UNIX domain) has to be specified before a connection can be established. This is normally done by the

int bind(int sockfd, struct sockaddr *localaddr, int addrlen)
system call. Under certain circumstances the sendto() and connect() system calls (described later) bind a socket automatically. The pointer localaddr points to the data needed to fill in the local part of an association. For the Internet domain this is a pointer to a struct sockaddr_in, for the UNIX domain a pointer to a struct sockaddr_un. As the different domains use structures of different sizes, the length of the structure used has to be specified in addrlen. The structures sockaddr_in and sockaddr_un are defined as follows:
 struct sockaddr_in {
   short int          sin_family;  /* Address family (AF_INET)             */
   unsigned short int sin_port;    /* Port number, network byte order      */
   struct in_addr     sin_addr;    /* Internet address, network byte order */
   char               sin_zero[8]; /* unused                               */
 };

 struct sockaddr_un {
   unsigned short sun_family;      /* Address family (AF_UNIX)             */
   char sun_path[108];             /* pathname                             */
 };
There are still unspecified elements of the association: the foreign address and foreign port number for the Internet domain, and foreign pathname for the UNIX domain. These are filled in differently for connection-oriented and connectionless communication.
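As a sketch, the local half of an Internet domain association could be filled in like this (the port number 7500 is an arbitrary value chosen for illustration):

 #include <stdio.h>
 #include <string.h>
 #include <sys/socket.h>
 #include <netinet/in.h>

 int main(void)
 {
     int sd = socket(AF_INET, SOCK_STREAM, 0);
     struct sockaddr_in local;

     memset(&local, 0, sizeof(local));
     local.sin_family      = AF_INET;
     local.sin_port        = htons(7500);        /* port number, network byte order */
     local.sin_addr.s_addr = htonl(INADDR_ANY);  /* accept on any local interface   */

     if (bind(sd, (struct sockaddr *) &local, sizeof(local)) < 0) {
         perror("bind");
         return 1;
     }
     return 0;
 }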

Socket Termination

When a process finishes using a socket, it calls the normal close() system call. If the process was the last one using this socket, the socket is discarded.

6.1.6 Connection-oriented Communication

Figure 26 gives an overview of system calls used for client/server interaction using a connection-oriented protocol.

Figure 26: Socket System Calls, Connection-Oriented Protocol [Stev90]

Connection Establishment

is usually asymmetrical, with one process being the client and the other the server. The server first binds to its well-known port number as described above and then passively ``listens'' on its socket with the

int listen(int sd, int backlog)
system call. This puts the socket in a passive mode ready to accept incoming connections. An incoming connection is then accepted with the
int accept(int sd, struct sockaddr *addr, int *addrlen)
system call, which finally fills in the foreign parts of the n-tuple. The argument addr is a result parameter that is filled in with the address of the client. The addrlen argument is a value-result parameter; it should initially contain the amount of space pointed to by addr, and on return it contains the actual length (in bytes) of the address returned. This information can be used to verify that the server is willing to serve the client. accept() creates a new socket with the same properties as sd and allocates a new socket descriptor for it. accept() blocks until a connection is present. The original socket sd remains open and may be used to accept the next connection. The newly created socket descriptor, which can now be used for data transfer, is returned.
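A sketch of the passive (server) side, assuming sd is a stream socket that has already been created and bound as shown above; the backlog value of 5 is a conventional choice:

 #include <sys/socket.h>
 #include <netinet/in.h>
 #include <unistd.h>

 /* Sketch: serve one client after the other on the bound stream socket sd. */
 void serve(int sd)
 {
     listen(sd, 5);                        /* allow up to 5 pending connections */
     for (;;) {
         struct sockaddr_in client;
         int addrlen = sizeof(client);
         int newsd = accept(sd, (struct sockaddr *) &client, &addrlen);
         if (newsd < 0)
             continue;
         /* newsd is used for the data transfer with this client,
            while sd stays open and keeps listening                */
         close(newsd);
     }
 }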

The client calls

int connect(int sd, struct sockaddr *serv_addr, int addrlen )
after creating a socket, to bind it (to a temporary port number in the Internet domain) and to establish a connection with the server specified in serv_addr. All parts of the association n-tuple are now filled in and data exchange can start.
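The corresponding active (client) side could look like this sketch; the server address 192.0.2.1 and port 7500 are placeholders:

 #include <stdio.h>
 #include <string.h>
 #include <sys/socket.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>

 int main(void)
 {
     int sd = socket(AF_INET, SOCK_STREAM, 0);
     struct sockaddr_in server;

     memset(&server, 0, sizeof(server));
     server.sin_family      = AF_INET;
     server.sin_port        = htons(7500);             /* server's well-known port */
     server.sin_addr.s_addr = inet_addr("192.0.2.1");  /* server's IP address      */

     /* connect() binds sd to a temporary local port and fills in
        the foreign parts of the association.                       */
     if (connect(sd, (struct sockaddr *) &server, sizeof(server)) < 0) {
         perror("connect");
         return 1;
     }
     return 0;
 }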

Data Exchange

can be performed with the normal read(), readv(), write(), and writev() system calls. The new system calls send() and recv() may also be used. They are similar to read() and write(), but additionally take a flags argument, which can be used to send or receive out-of-band data, to look at incoming data without consuming it, or for other purposes.
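For instance, the MSG_PEEK flag lets a process look at pending data without consuming it (sketch; sd is assumed to be a connected stream socket):

 #include <sys/socket.h>

 /* Sketch: peek at incoming data on the connected socket sd. */
 void peek_example(int sd)
 {
     char buf[128];

     /* look at the queued data without removing it ...                  */
     int n = recv(sd, buf, sizeof(buf), MSG_PEEK);

     /* ... a subsequent recv() or read() delivers the same bytes again. */
     n = recv(sd, buf, sizeof(buf), 0);
 }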

Examples

The client/server pair of programs distSOCKET/tcpclient.c and distSOCKET/tcpserver.c demonstrate how to exchange data using a stream socket with TCP.

6.1.7 Connectionless Communication

Figure 27 gives an overview of system calls used for client/server interaction using a connectionless protocol.

Figure 27: Socket System Calls, Connectionless Protocol [Stev90]

Connection Establishment

is performed on a per datagram basis. A connection only exists for one datagram. Both server and client call bind() to set their local address in their association n-tuple after creating a socket. Note that this procedure is symmetrical.

Data Exchange

Data can be exchanged in both directions with the new sendto() and recvfrom() system calls. For each datagram the destination has to be stated. If several messages are to be sent to the same destination, the connect() system call may be used to register the destination. Data is then sent to the registered destination with the normal write() and writev() system calls. Also, only datagrams from this address will be received by the socket.
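A sketch of the server side of such a datagram exchange: each datagram is echoed back to wherever it came from (sd is assumed to be a bound datagram socket):

 #include <sys/socket.h>
 #include <netinet/in.h>

 /* Sketch: echo every datagram back to its sender. */
 void echo_datagrams(int sd)
 {
     char buf[512];
     struct sockaddr_in peer;
     int peerlen, n;

     for (;;) {
         peerlen = sizeof(peer);
         /* recvfrom() fills in the address of the sender ...        */
         n = recvfrom(sd, buf, sizeof(buf), 0,
                      (struct sockaddr *) &peer, &peerlen);
         if (n < 0)
             continue;
         /* ... which is then used as the destination of the reply.  */
         sendto(sd, buf, n, 0, (struct sockaddr *) &peer, peerlen);
     }
 }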

Examples

The client/server pair of programs distSOCKET/udpclient.c and distSOCKET/udpserver.c demonstrate how to exchange data using a datagram socket with UDP.

6.1.8 Options

For many applications the need arises to control the socket in more detail. The application might change a timeout parameter, or change the size of a buffer. The system call getsockopt() allows the application to request information about a specific socket, while the system call setsockopt() allows an application to set a socket option.
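As a sketch, the size of a socket's receive buffer could be queried and changed like this (the value of 32768 bytes is an arbitrary example):

 #include <sys/socket.h>

 /* Sketch: read and then enlarge the receive buffer of socket sd. */
 void tune_rcvbuf(int sd)
 {
     int size;
     int len = sizeof(size);

     /* ask for the current receive buffer size ...                      */
     getsockopt(sd, SOL_SOCKET, SO_RCVBUF, (char *) &size, &len);

     /* ... and request a larger one (the system may round or limit it). */
     size = 32768;
     setsockopt(sd, SOL_SOCKET, SO_RCVBUF, (char *) &size, sizeof(size));
 }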

6.1.9 Obtaining Local and Remote Socket Addresses

A process might need to determine the local address of a socket or the destination address to which a socket is connected. This can be done with the system calls getsockname() and getpeername().
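Sketch: after connect() the system has chosen a temporary local port for the client, which can be retrieved with getsockname(); getpeername() reports the remote end in the same way (sd is assumed to be a connected Internet domain socket):

 #include <stdio.h>
 #include <sys/socket.h>
 #include <netinet/in.h>

 /* Sketch: print the local and foreign port numbers of socket sd. */
 void show_ports(int sd)
 {
     struct sockaddr_in local, remote;
     int len;

     len = sizeof(local);
     getsockname(sd, (struct sockaddr *) &local, &len);    /* local address/port   */

     len = sizeof(remote);
     getpeername(sd, (struct sockaddr *) &remote, &len);   /* foreign address/port */

     printf("local port %d, foreign port %d\n",
            ntohs(local.sin_port), ntohs(remote.sin_port));
 }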

6.1.10 Obtaining and Setting Host Name and Domain

The gethostname() system call allows user processes to obtain the host name. The host name can be set by a privileged process with sethostname(). The host's domain name may be obtained with the getdomainname() system call and can be changed by privileged processes with setdomainname().

6.1.11 Library Functions

A number of library functions are available that handle the mapping of names to addresses or numbers for hosts, networks, protocols, and services. These functions are summarized in Table 15.

Library Call Purpose
gethostbyname() obtains network host entry for a host by its name
gethostbyaddr() obtains network host entry for a host by its address
getnetbyname() obtains network net entry by its name
getnetbyaddr() obtains network net entry by its address
getprotobyname() obtains protocol entry by its name
getprotobynumber() obtains protocol entry by its number
getservbyname() obtains service entry by its name
getservbyport() obtains service entry by its port number
Table 15: Library Calls for Database Services
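A sketch of how two of these functions are typically combined to fill in a sockaddr_in; the host name server.example.com and the service name daytime are hypothetical, and a real program has to check both calls for a NULL return:

 #include <string.h>
 #include <netdb.h>
 #include <netinet/in.h>

 /* Sketch: resolve a host and service name into a socket address. */
 void fill_address(struct sockaddr_in *addr)
 {
     struct hostent *hp = gethostbyname("server.example.com");
     struct servent *sp = getservbyname("daytime", "tcp");

     memset(addr, 0, sizeof(*addr));
     addr->sin_family = AF_INET;
     addr->sin_port   = sp->s_port;                /* already in network byte order */
     memcpy(&addr->sin_addr, hp->h_addr_list[0],   /* first address of the host     */
            hp->h_length);
 }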

In addition to these general, protocol independent library functions, Internet domain specific functions are available. Table 16 summarizes the functions which convert data from and to network byte order. Table 17 summarizes the Internet address manipulation functions.

Library Call Purpose
htonl(val) convert 32-bit quantity from host to network byte order
htons(val) convert 16-bit quantity from host to network byte order
ntohl(val) convert 32-bit quantity from network to host byte order
ntohs(val) convert 16-bit quantity from network to host byte order
Table 16: Library Calls for converting from/to Network Byte Order

Library Call Purpose
inet_addr(cp) converts the Internet host address cp from dotted decimal notation into binary data in network byte order
inet_network(cp) extracts the network number in network byte order from the address cp given in dotted decimal notation
inet_ntoa(in) converts the Internet host address in given in network byte order to a string in dotted decimal notation
inet_makeaddr(net, host) makes an Internet host address in network byte order by combining the network number net with the local address host in network net, both in local host byte order.
inet_lnaof(in) returns the local host address part of the Internet address in. The local host address is returned in local host byte order
inet_netof(in) returns the network number part of the Internet Address in. The network number is returned in local host byte order
Table 17: Internet Address Manipulation Library Routines
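For instance, a dotted decimal string and the 32-bit binary form in network byte order can be converted back and forth like this (sketch; 192.0.2.1 is a placeholder address):

 #include <stdio.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>

 int main(void)
 {
     struct in_addr in;

     /* dotted decimal string -> binary address in network byte order ... */
     in.s_addr = inet_addr("192.0.2.1");

     /* ... and back to a dotted decimal string                           */
     printf("%s\n", inet_ntoa(in));
     return 0;
 }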

Details about these functions can be found in the corresponding manual pages.

6.1.12 Socket Summary

The BSD interprocess communication facilities have become the de facto standard for UNIX, as they are available in nearly all UNIX derivatives. The socket interface is nowadays also used on other operating systems, e.g. Microsoft Windows.

Sockets were created to make interprocess communication look like standard UNIX file I/O. The socket function calls are based on the client/server model. Different ``domains'' exist which support local IPC or distributed IPC. The socket interface allows different types of sockets for different purposes.

Function Call  Protocol  local Addr.  local Port  foreign Addr.  foreign Port
socket()       yes
bind()                   yes          yes
accept()                                          yes            yes
recvfrom()                                        yes            yes
sendto()                                          yes            yes
Table 18: System Calls which fill in Parts of an Association (Internet domain 5-tuple assumed)

Figure 28: How Ports, Addresses, and Sockets fit together. The shaded path indicates the flow of data between the computers.

Table 18 summarizes which system calls fill in fields of an association. Figure 28 summarizes how data flows from one process over a socket descriptor via TCP, IP, and Ethernet to another process.

6.2 Transport Layer Interface

The Transport Layer Interface (TLI) was introduced with Release 3 of System V in 1986. The TLI is designed after the ISO Transport Service Definition. It is based on System V STREAMS (the principles of System V STREAMS are summarized in Chapter 9.2.2). Normal UNIX descriptors are used to refer to network data. As the TLI is similar to the socket interface, only the more important differences are stated.

The Transport Layer Interface is a collection of library routines, not system calls. Therefore, if the TLI is used, a special library has to be linked with the program. The function names and functionality of the TLI are similar to those of the socket interface (the names carry a t_ prefix). TLI functions offer far more options than their socket counterparts; this allows more flexibility, but prevents a detailed description in this document.

Figures 29 and 30 give an overview of the function calls used for an iterative server.

Figure 29: TLI Function Calls, Connection-Oriented Protocol [Stev90]

Figure 30: TLI Function Calls, Connectionless Protocol [Stev90]

Many of the parameters passed between user code and the TLI functions are structures, and most of them contain one or more netbuf structures, each of which holds a pointer to a buffer used for a specific purpose. The functions t_alloc() and t_free() simplify the dynamic allocation and release of these structures.

The t_listen() function does not accept an incoming connection; this has to be done with t_accept(). The t_accept() call does not create a new transport endpoint (the equivalent of a socket). To obtain the new file descriptor needed by a concurrent server, the functions t_open() and t_bind() have to be called.

The functions t_sndudata() and t_rcvudata() correspond to the socket system calls sendto() and recvfrom().

The example programs distTLI/udpclient.c and distTLI/udpserver.c demonstrate how to use TLI with UDP. The UDP server is implemented as an iterative server. The use of TCP with TLI is demonstrated with the example programs distTLI/tcpclient.c and distTLI/tcpserver.c. As TCP connections might exist for a longer time the TCP server is implemented in a concurrent fashion.

6.3 Remote Procedure Calls

Remote Procedure Calls (RPCs) are a higher-level communication abstraction that makes use of the distributed UNIX communication facilities described above. As remote procedure calls are becoming increasingly important and are a basic facility in most modern operating systems, their principles are sketched here.

The idea behind remote procedure calls is to hide the underlying I/O (message passing, ...) needed to invoke a service on a different computer. Calling a function or procedure on a different computer should be as easy and transparent as a local procedure call; the programmer should not have to be concerned with where a function is executed.

To achieve that goal, special programs are used to create code that handles the necessary message exchange: the client and the server stub code. The client stub code packs the arguments into a message (called parameter marshalling), sends the message, waits for the result message, unpacks the result, and returns the result to the caller. The server stub code unpacks the parameters, calls the right function, packs the result into a message, and sends the message back. Figure 31 (from [Tan92]) illustrates this.

Figure 31: Calls and Messages in an RPC. Each ellipse represents a single process, with the shaded portion being the stub.

RPC should be as transparent as possible, which requires a number of design issues to be considered.

Remote procedure calls make the development of distributed applications much easier.

Sun RPC

The Sun RPC protocol, defined in [RFC 1057], is common in the UNIX environment. The program rpcgen(1) is used to generate the appropriate client and server C stub code. XDR (eXternal Data Representation, defined in [RFC 1014]) is used for the representation of data. UDP and TCP can be used as network protocols. The portmapper daemon handles all binding issues.

The principle of remote procedure calls is described in more detail in [Tan92, Chapter 10.3], while existing implementations are covered in [Stev90, Chapter 18]. A modified RPC example from [Stev90] is included in Appendix A.7. An even more powerful method for developing distributed applications is the use of the Arjuna system.

6.4 Summary

Distributed interprocess communication is far more complex than local interprocess communication. Details about the underlying network protocols have to be considered. The number of system and library calls that are needed for distributed IPC is much higher than for local IPC.

Two different APIs for distributed IPC have been described: the socket interface and Transport Layer Interface. Sockets are widely used, although TLI provides more flexibility.

A high-level use of networking facilities, remote procedure calls, can make the development of distributed applications easier.

Distributed IPC is more general than local IPC, as it can be used not only on a single computer but also between distinct computers. Table 19 compares the function calls for some local and distributed IPC facilities used for connection-oriented communication. Chapter 7 compares the performance of local IPC and distributed IPC facilities.

                      Sockets       TLI            Messages    FIFOs
Server
  allocate space                    t_alloc()
  create endpoint     socket()      t_open()       msgget()    mknod(),
                                                               open()
  bind address        bind()        t_bind()
  specify queue       listen()
  wait for connection accept()      t_listen()
  get new fd                        t_open(),
                                    t_bind(),
                                    t_connect()
Client
  allocate space                    t_alloc()
  create endpoint     socket()      t_open()       msgget()    open()
  bind address        bind()        t_bind()
  connect to server   connect()     t_connect()

  transfer data       read(),       read(),        msgrcv(),   read(),
                      write(),      write(),       msgsnd()    write()
                      recv(),       t_rcv(),
                      send()        t_snd()
  datagrams           recvfrom(),   t_rcvudata(),
                      sendto()      t_sndudata()
  terminate           close(),      t_close(),     msgctl()    close(),
                      shutdown()    t_sndrel(),                unlink()
                                    t_snddis()
Table 19: Comparison of Socket, TLI, Message Queues, and FIFO Function Calls [Stev90]



Gerhard Müller