Session One: Intro/Demo, lonc/d, Replication and Load Balancing (Gerd)

Fig. 1.1.1 – Overview of Network

Overview

Physically, the Network consists of relatively inexpensive upper-PC-class server machines which are linked through the commodity internet in a load-balancing, dynamically content-replicating and failover-secure way. Fig. 1.1.1 shows an overview of this network.

All machines in the Network are connected with each other through two-way persistent TCP/IP connections. Clients (B, F, G and H in Fig. 1.1.1) connect to the servers via standard HTTP. There are two classes of servers: Library Servers (A and E in Fig. 1.1.1) and Access Servers (C, D, I and J in Fig. 1.1.1). Library Servers store all personal records of a set of users and are responsible for their initial authentication when a session is opened on any server in the Network. For authors, Library Servers also host their construction area and the authoritative copy of the current and previous versions of every resource published by that author. Library Servers can be used as backups to host sessions when all Access Servers in the Network are overloaded; otherwise, sessions for learners are hosted on Access Servers. Library Servers need to be strong on I/O, while Access Servers can generally be cheaper hardware. The Network is designed so that the number of concurrent sessions can be increased over a wide range by simply adding Access Servers before additional Library Servers become necessary. Preliminary tests showed that one Library Server can support up to 10 Access Servers running fully in parallel.

The Network is divided into so-called domains, which are logical boundaries between participating institutions. These domains can be used to limit the flow of personal user information across the network, set access privileges and enforce royalty schemes.

Examples of Transactions

Fig. 1.1.1 also depicts examples of several kinds of transactions conducted across the Network.

An instructor at client B modifies and publishes a resource on her Home Server A. Server A has a record of all server machines currently subscribed to this resource, and replicates it to servers D and I. However, server D is currently offline, so the update notification gets buffered on A until D comes online again. Servers C and J are currently not subscribed to this resource.

Learners F and G have open sessions on server I, and the new resource is immediately available to them.

Learner H tries to connect to server I for a new session, but the machine is not reachable, so he connects to another Access Server, J, instead. This server currently does not have all the resources needed to host learner H locally present, but subscribes to them and replicates them as they are accessed by H.

Learner H solves a problem on server J. Library Server E is H's Home Server, so this information gets forwarded to E, where the records of H are updated.

lonc/lond/lonnet

Fig. 1.1.2 elaborates on the details of this network infrastructure.

Fig. 1.1.2A depicts three servers (A, B and C) and a client who has a session on server C.

As the client accesses different resources in the system, different handlers, which are incorporated as modules into the child processes of the web server software, process these requests.

Our current implementation uses mod_perl inside of the Apache web server software. As an example, server C currently has four active web server software child processes. The chain of handlers dealing with a certain resource is determined by both the server content resource area (see below) and the MIME type, which in turn is determined by the URL extension. For most URL structures, both an authentication handler and a content handler are registered.
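
As an illustration, a minimal mod_perl 1.x content handler has the following shape; the package name and response body here are hypothetical, not the actual LON-CAPA modules, and such a handler would be registered for a URL area or extension in httpd.conf (for example with a PerlHandler directive).

# Minimal mod_perl 1.x content handler sketch; package name and
# response are illustrative, not a LON-CAPA module.
package Apache::ExampleHandler;
use strict;
use Apache::Constants qw(OK);

sub handler {
    my $r = shift;                       # the Apache request object
    $r->content_type('text/html');
    $r->send_http_header;
    $r->print("<html><body>processed resource</body></html>");
    return OK;                           # hand control back to Apache
}

1;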

Handlers use a common library, lonnet, to interact with both locally present temporary session data and data across the server network. For example, lonnet provides routines for finding the Home Server of a user, finding the server with the lowest loadavg, sending simple command-reply sequences, and sending critical messages such as a homework completion. For a non-critical message, the routines reply with a simple "connection lost" if the message could not be delivered. For a critical message, lonnet tries to re-establish the connection and re-send the command; if no valid reply could be received, it answers "connection deferred" and stores the message in buffer space to be sent at a later point in time. Failed critical messages are also logged.
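
The distinction between critical and non-critical messages can be sketched as follows; the function name, the buffer path, and reply() are illustrative stand-ins for lonnet's actual interface:

use strict;

sub reply { return 'connection lost' }   # stub standing in for one real
                                         # command-reply sequence over the network

sub send_message {
    my ($cmd, $server, $critical) = @_;
    my $answer = reply($cmd, $server);
    return $answer unless $answer eq 'connection lost';
    return 'connection lost' unless $critical;   # non-critical: report and give up

    $answer = reply($cmd, $server);              # critical: re-establish, re-send
    return $answer unless $answer eq 'connection lost';

    open(my $fh, '>>', "/tmp/buffer_$server") or die $!;  # buffer space (path made up)
    print $fh "$cmd\n";
    close($fh);
    # a failed critical message would also be logged here
    return 'connection deferred';
}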

The interface between lonnet and the Network is established by a multiplexed UNIX domain socket, denoted DS in Fig. 1.1.2A. The rationale behind this rather involved architecture is that httpd processes (Apache children) dynamically come and go on the timescale of minutes, based on workload and number of processed requests. Over the lifetime of an httpd child, however, it has to establish several hundred connections to several different servers in the Network.

On the other hand, establishing a TCP/IP connection is resource-consuming for both ends of the line. To optimize connectivity between different servers, connections in the Network are therefore designed to be persistent on a timescale of months, until either end is rebooted. This mechanism will be elaborated on below.

Establishing a connection to a UNIX domain socket is far less resource-consuming than establishing a TCP/IP connection. lonc is a proxy daemon that forks off a child for every server in the Network. Which servers are members of the Network is determined by a lookup table, of which Fig. 1.1.2B shows an example. In order, the entries denote an internal name for the server, the domain of the server, the type of the server, the host name and the IP address.
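
Reading this table takes only a few lines of Perl; the following sketch uses the file path from Fig. 1.1.2B, while the hash layout is an assumption:

use strict;

# Parse the lookup table into a hash keyed by internal server name.
# Field order follows the text: id, domain, type, host name, IP address.
my %hosts;
open(my $tab, '<', '/home/httpd/lonTabs/hosts.tab') or die "hosts.tab: $!";
while (my $line = <$tab>) {
    chomp $line;
    next if $line =~ /^\s*$/;            # skip blank lines
    my ($id, $domain, $type, $name, $ip) = split(/:/, $line);
    $hosts{$id} = { domain => $domain, type => $type,
                    name   => $name,   ip   => $ip };
}
close($tab);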

The lonc parent process maintains the population of children and listens for restart and shutdown signals, as well as USR1. Every child establishes a multiplexed UNIX domain socket for its server and opens a TCP/IP connection to the lond daemon (discussed below) on the remote machine, which it keeps alive. If the connection is interrupted, the child dies, whereupon the parent makes several attempts to fork another child for that server.
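
In outline, this fork-and-respawn pattern looks as follows; the host list and the child body are placeholders:

use strict;

my %hosts = map { $_ => 1 } qw(msul1 msua1);   # placeholder population
my %children;                                  # pid => server id

sub child_loop { my ($server) = @_; sleep 10 } # stands in for the connection loop

sub spawn {
    my ($server) = @_;
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) { child_loop($server); exit 0 }
    $children{$pid} = $server;
}

spawn($_) for keys %hosts;             # one child per server in the table

while (1) {
    my $pid = wait();                  # a child died: its connection broke
    last if $pid < 0;                  # no children left
    my $server = delete $children{$pid};
    spawn($server) if defined $server; # the real lonc makes several attempts
}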

When starting a new child (a new connection), an init sequence is carried out first, which includes receiving from the remote lond the information needed to establish the 128-bit encryption key; the key is different for every connection. Next, any buffered (delayed) messages for the server are sent.

In normal operation, the child listens to the UNIX socket, forwards requests over the TCP connection, gets the reply from lond, and sends it back through the UNIX socket. lonc also takes care of the encryption and decryption of messages.
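
Reduced to its essentials, and with encryption omitted, a child's forwarding loop has this shape; the socket path, port number and line-oriented framing are assumptions of the sketch:

use strict;
use IO::Socket::UNIX;
use IO::Socket::INET;

# Persistent TCP connection to the remote lond (port number illustrative).
my $tcp = IO::Socket::INET->new(PeerAddr => 'zaphod.lite.msu.edu',
                                PeerPort => 5663,
                                Proto    => 'tcp') or die "tcp: $!";

# Multiplexed UNIX domain socket for local handler processes.
unlink '/tmp/lonc_msul1';
my $listener = IO::Socket::UNIX->new(Local  => '/tmp/lonc_msul1',
                                     Listen => 5) or die "unix: $!";

while (my $local = $listener->accept()) {
    while (my $request = <$local>) {
        print $tcp $request;             # encryption would happen here
        my $reply = <$tcp>;              # ... and decryption here
        print $local $reply;
    }
    close($local);
}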

lonc was built by putting a non-forking multiplexed UNIX domain socket server into a framework that forks a TCP/IP client for every remote lond.

lond is the remote end of the TCP/IP connection and acts as a remote command processor. It receives commands, executes them, and sends replies. In normal operation, a lonc child is constantly connected to a dedicated lond child on the remote server, and the same is true vice versa (two persistent connections per server combination).

lond listens to a TCP/IP port (denoted P in Fig. 1.1.2A) and forks off enough child processes to have one for each other server in the network, plus two spare children. The parent process maintains the population and listens for signals to restart or shutdown. Client servers are authenticated by IP.
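
Its core is a command dispatcher of roughly the following shape; the command set here is reduced to the two commands used in the benchmarks below, and the reply strings are illustrative:

use strict;

my %dispatch = (
    ping => sub { return 'msul1' },      # reply with the server's short name
    put  => sub {
        my ($args) = @_;
        # here: write to the user's permanent record (dbm file plus
        # flat-text transaction history), then acknowledge
        return 'ok';
    },
);

sub process_command {
    my ($line) = @_;
    chomp $line;
    my ($cmd, $args) = split(/:/, $line, 2);
    my $handler = $dispatch{$cmd} or return 'unknown_cmd';
    return $handler->($args);
}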


Fig. 1.1.2A – Overview of Network Communication

When a new client server comes online, lond sends a USR1 signal to lonc, whereupon lonc tries again to re-establish all lost connections, even if it had given up on them before; a new client connecting could mean that the machine came back online after an interruption.

The gray boxes in Fig. 1.1.2A denote the entities involved in an example transaction on the Network. The client is logged into server C, while server B is her Home Server. Server C can be an Access Server or a Library Server, while server B is a Library Server. She submits a solution to a homework problem, which is processed by the appropriate handler for the MIME type "problem". Through lonnet, the handler writes information about this transaction to the local session data. To make a permanent log entry, lonnet establishes a connection to the UNIX domain socket for server B. lonc receives this command, encrypts it, and sends it through the persistent TCP/IP connection to the TCP/IP port of the remote lond. lond decrypts the command, executes it by writing to the permanent user data files of the client, and sends back a reply regarding the success of the operation. If the operation was unsuccessful or the connection broke down, lonc writes the command into a FIFO buffer to be sent again later. lonc then sends a reply regarding the overall success of the operation to lonnet via the UNIX domain socket, which is eventually received back by the handler.

Scalability and Performance Analysis

Scalability was tested in a test bed of servers spanning different physical network segments; Fig. 1.1.2B shows the network configuration of this test.

msul1:msu:library:zaphod.lite.msu.edu:35.8.63.51
msua1:msu:access:agrajag.lite.msu.edu:35.8.63.68
msul2:msu:library:frootmig.lite.msu.edu:35.8.63.69
msua2:msu:access:bistromath.lite.msu.edu:35.8.63.67
hubl14:hub:library:hubs128-pc-14.cl.msu.edu:35.8.116.34
hubl15:hub:library:hubs128-pc-15.cl.msu.edu:35.8.116.35
hubl16:hub:library:hubs128-pc-16.cl.msu.edu:35.8.116.36
huba20:hub:access:hubs128-pc-20.cl.msu.edu:35.8.116.40
huba21:hub:access:hubs128-pc-21.cl.msu.edu:35.8.116.41
huba22:hub:access:hubs128-pc-22.cl.msu.edu:35.8.116.42
huba23:hub:access:hubs128-pc-23.cl.msu.edu:35.8.116.43
hubl25:other:library:hubs128-pc-25.cl.msu.edu:35.8.116.45
huba27:other:access:hubs128-pc-27.cl.msu.edu:35.8.116.47

Fig. 1.1.2B – Example of Hosts Lookup Table /home/httpd/lonTabs/hosts.tab

In the first test, the simple ping command was used. The ping command is used to test connections and yields the server's short name as reply. In this scenario, lonc was expected to be the speed-determining step, since lond at the remote end does not need any disk access to reply. The graph in Fig. 1.1.2C shows the number of seconds until completion versus the number of processes issuing 10,000 ping commands each against one Library Server (a 450 MHz Pentium II with a single IDE hard disk in this test). For the solid dots, the processes were started concurrently on the same Access Server, and the time until the processes finished was measured; all processes finished at the same time. One Access Server (a 233 MHz Pentium II in the test bed) can process about 150 pings per second, and as expected, the total time grows linearly with the number of pings.

The gray dots were taken with up to seven processes running concurrently on different machines and pinging the same server. These processes ran fully parallel, and each process finished as if the others were not present (about 1,000 pings per second in aggregate).

In a second test, lond was the speed-determining step: 10,000 put commands each were issued, first from up to seven concurrent processes on the same machine, and then from up to seven processes on different machines. The put command requires data to be written to the permanent record of the user on the remote server.

In particular, one put request means that the process on the Access Server connects to the UNIX domain socket dedicated to the Library Server; lonc takes the data from there and shuffles it through the persistent TCP connection; lond on the remote Library Server takes the data, writes it to disk (both to a dbm file and to a flat-text transaction history file) and answers "ok"; lonc takes that reply and sends it to the domain socket; finally, the process reads the reply from there and closes the domain-socket connection.
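
From the point of view of the handler process, the whole round trip reduces to a short exchange over the domain socket; the socket path and wire format below are assumptions of the sketch:

use strict;
use IO::Socket::UNIX;

my $sock = IO::Socket::UNIX->new(Peer => '/tmp/lonc_msul1')
    or die "domain socket: $!";
print $sock "put:msu:student17:score=42\n";   # illustrative command
my $reply = <$sock>;                          # "ok" once lond has written to disk
close($sock);                                 # one connection per request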

Fig. 1.1.2C – Benchmark on Parallelism of Server-Server Communication (no disk access)

The graph in Fig. 1.1.2D shows the results. Series 1 (solid black diamonds) is the result of concurrent processes on the same server; all of these are handled by the same server-dedicated lond child, which makes the total amount of time grow linearly.

Fig. 1.1.2D – Benchmark on Parallelism of Server-Server Communication (with disk access as in Fig. 1.1.2A)

Series 2 through 8 were obtained by running the processes on different Access Servers against one Library Server, with one series per server. In this experiment, the processes did not finish at the same time, which is most likely due to disk caching on the Library Server: lond children whose data file was (partly) in the disk cache finished earlier. With seven processes from seven different servers, the operation took 255 seconds until the last process finished, for 70,000 put commands (about 270 per second), versus 530 seconds if the processes ran on the same server (about 130 per second).

Dynamic Resource Replication

Since resources are assembled into higher-order resources simply by reference, in principle it would be sufficient to retrieve them from the respective Home Servers of the authors. However, there are several problems with this simple approach. Since the resource assembly mechanism is designed to facilitate content assembly from a large number of widely distributed sources, individual sessions would depend on a large number of machines and network connections being available, and would thus be rather fragile. Also, frequently accessed resources could potentially drive individual machines in the network into overload situations.

Finally, since most resources depend on content handlers on the Access Servers to be served to a client within the session context, the raw source would first have to be transferred across the Network from the respective Library Server to the Access Server, processed there, and then transferred on to the client.

To enable resource assembly in a reliable and scalable way, a dynamic resource replication scheme was developed. Fig. 1.1.3 shows the details of this mechanism.

Any time a resource out of the resource space is requested, a handler routine is called, which in turn calls the replication routine (Fig. 1.1.3A). As a first step, this routine determines whether or not the resource is currently in replication transfer (Fig. 1.1.3A, Step D1a). During replication transfer, the incoming data is stored in a temporary file, and Step D1a checks for the presence of that file. If transfer of a resource is actively going on, the controlling handler receives an error message, waits for a few seconds, and then calls the replication routine again. If the resource is still in transfer, the client will receive the message "Service currently not available".

In the next step (Fig. 1.1.3A, Step D1b), the replication routine checks if the URL is locally present. If it is, the replication routine returns OK to the controlling handler, which in turn passes the request on to the next handler in the chain.

If the resource is not locally present, the Home Server of the resource author (as extracted from the URL) is determined (Fig. 1.1.3A, Step D2). This is done by contacting all Library Servers in the author's domain (as determined from the lookup table, see Fig. 1.1.2B). In Step D2b, a query is sent to the remote server asking whether or not it is the Home Server of the author; in our current implementation, an additional cache of already identified Home Servers is consulted first (not shown in the figure). In Step D2c, the remote server answers the query with true or false. If the Home Server was found, the routine continues; otherwise, it contacts the next server (Step D2a). If no server could be found, a "File not Found" error message is issued. In our current implementation, the identified Home Server is also written into the cache in this step, for faster access when resources by the same author are needed again (not shown in the figure).
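
Steps D1a through D2c can be summarized in the following control-flow sketch; the temporary-file naming, the URL layout and the helper routines are illustrative stand-ins, not the actual implementation (fetch() is sketched further below):

use strict;

sub library_servers_in { return () }       # stub: servers from hosts.tab
sub reply              { return 'false' }  # stub: command-reply sequence
sub fetch              { return 'ok' }     # subscription/transfer, see below

sub replicate {
    my ($url) = @_;
    return 'in_transfer' if -e "$url.in.transfer";   # D1a: transfer running
    return 'ok'          if -e $url;                 # D1b: already local

    my ($domain) = $url =~ m{^/res/([^/]+)/};        # author's domain from URL
    for my $server (library_servers_in($domain)) {   # D2a: next candidate
        return fetch($url, $server)                  # found the Home Server
            if reply("home:$url", $server) eq 'true';    # D2b/D2c: query
    }
    return 'file_not_found';
}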


Fig. 1.1.3A – Dynamic Resource Replication, subscription


Fig. 1.1.3B – Dynamic Resource Replication, modification

In Step D3a, the routine sends a subscribe command for the URL to the Home Server of the author. The Home Server first determines if the resource is present and if the access privileges allow it to be copied to the requesting server (Fig. 1.1.3A, Step D3b). If so, the requesting server is added to the list of subscribed servers for that resource (Step D3c). The Home Server replies with either OK or an error message, which is evaluated in Step D4. If the remote resource was not present, the error message "File not Found" is passed on to the client; if access was not allowed, the error message "Access Denied" is passed on. If the operation succeeded, the requesting server sends an HTTP request for the resource out of the /raw server content resource area of the Home Server.

The Home Server then checks if the requesting server is part of the network and if it is subscribed to the resource (Step D5b). If it is, it sends the resource via HTTP to the requesting server without any content handlers processing it (Step D5c). The requesting server stores the incoming data in a temporary data file (Step D5a); this is the file that Step D1a checks for. If the transfer could not complete, an appropriate error message is sent to the client (Step D6). Otherwise, the transferred temporary file is renamed as the actual resource, and the replication routine returns OK to the controlling handler (Step D7).
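
In outline, the subscription and transfer (Steps D3a through D7) might look as follows; the use of LWP, the /raw URL form and the command strings are assumptions of this sketch:

use strict;
use LWP::Simple qw(getstore);
use HTTP::Status qw(is_success);

sub reply { return 'ok' }                 # stub for the lonnet command-reply call

sub fetch {
    my ($url, $home) = @_;
    my $answer = reply("sub:$url", $home);              # D3a: subscribe
    return $answer if $answer ne 'ok';                  # D4: pass error on

    my $rc = getstore("http://$home/raw$url",           # D5: HTTP request
                      "$url.in.transfer");              # D5a: temporary file
    return 'failed_copy' unless is_success($rc);        # D6: transfer incomplete
    rename("$url.in.transfer", $url) or return 'failed_copy';
    return 'ok';                                        # D7: resource in place
}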

Fig. 1.1.3B depicts the process of modifying a resource. When an author publishes a new version of a resource, the Home Server contacts every server currently subscribed to the resource (Fig. 1.1.3B, Step U1), as determined from the list of subscribed servers for the resource generated in Fig. 1.1.3A, Step D3c. The subscribing servers receive and acknowledge the update message (Step U1c). The update mechanism finishes when the last subscribed server has been contacted (messages to unreachable servers are buffered).

Each subscribing server will check if the resource in question had been accessed recently, that is, within a configurable amount of time (Step U2).

If the resource had not been accessed recently, the local copy of the resource is deleted (Step U3a) and an unsubscribe command is sent to the Home Server (Step U3b). The Home Server will check if the server had indeed originally subscribed to the resource (Step U3c) and then delete the server from the list of subscribed servers for the resource (Step U3d).

If the resource had been accessed recently, the modified resource will be copied over using the same mechanism as in Step D5a through D7 of Fig. 1.1.3A (Fig. 1.1.3B, Steps U4a through U6).
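
On the subscribing server, Steps U2 through U6 amount to a recency check followed by either an unsubscribe or a re-fetch; the access-time test and the window constant below are illustrative (reply() and fetch() are stubs as in the earlier sketches):

use strict;

sub reply { return 'ok' }   # stubs, as in the sketches above
sub fetch { return 'ok' }

my $keep_window = 24 * 60 * 60;           # "recently" = one day, for example

sub handle_update {
    my ($url, $home) = @_;
    my $age = time() - (stat($url))[8];   # seconds since last access
    if ($age > $keep_window) {            # U2: not accessed recently
        unlink($url);                     # U3a: delete the local copy
        return reply("unsub:$url", $home);    # U3b: unsubscribe
    }
    return fetch($url, $home);            # U4a: copy the modified resource over
}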

Load Balancing

lond provides a function to query the server's current loadavg. As a configuration parameter, one can set the loadavg value that is to be considered 100%, for example 2.00.
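
On a Linux host, such a query can be answered from /proc/loadavg and scaled against the configured 100% value; a minimal sketch:

use strict;

my $loadavg_max = 2.00;                   # configured value considered 100%
open(my $fh, '<', '/proc/loadavg') or die "/proc/loadavg: $!";
my ($loadavg) = split(/ /, <$fh>);        # one-minute load average
close($fh);
printf "load: %.0f%%\n", 100 * $loadavg / $loadavg_max;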

Access Servers can have a list of spare Access Servers, /home/httpd/lonTabs/spares.tab, to which they can offload sessions depending on their own workload. This check is done by the login handler: if the server itself is overloaded, it redirects the login information and the session to the least busy spare server. An additional round-robin IP scheme is possible. See Fig. 1.1.4 for an example of a load-balancing scheme.

Fig. 1.1.4 – Example of Load Balancing
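
In outline, the login handler's choice of a spare server could look as follows; the spares.tab format (one server per line) and the load query command are assumptions of the sketch:

use strict;

sub reply { return rand(100) }            # stub: lond's loadavg query

sub pick_spare {
    my ($best, $best_load);
    open(my $tab, '<', '/home/httpd/lonTabs/spares.tab') or return;
    while (my $spare = <$tab>) {
        chomp $spare;
        next if $spare =~ /^\s*$/;
        my $load = reply('load', $spare);
        ($best, $best_load) = ($spare, $load)
            if !defined $best_load || $load < $best_load;
    }
    close($tab);
    return $best;                         # undef if no spare was reachable
}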