Chapter 4

UNITE Server

4.1 Overview

The central program of the UNITE project is the server. Everything in the UNITE project interacts with the server in one way or another. Its main purpose is to serve the client's requests, which raised the question of how the server and client should communicate: a protocol had to be defined. When we first started the project in June of 1993, the World-Wide Web was starting to grow, and its goals were very similar to ours. Therefore, we adopted the HTTP syntax as our protocol. In the intervening years, more features have been added to our server, and it has become an ideal solution for adding, searching, and browsing resources on the Internet. This chapter goes into detail on the structure of the server and how it can be used to its fullest potential.

4.2 Global Configuration File

The global configuration file provides the UNITE server and tools with the directory structure and other miscellaneous information. Appendix B shows the global configuration file that is actually used by the UNITE project. The first entry in the file is 'TopLevelDir', the directory in which the software was installed and under which everything should be stored.

The 'AuthDir', 'FileSetsDir', and 'UserGroupsDir' are variables used for authorization and authentication. 'AuthDir' is relative to 'TopLevelDir'; 'FileSetsDir' and 'UserGroupsDir' are relative to 'AuthDir'.
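
The relative-path rules just described can be sketched as follows. This is a minimal Python illustration, not the UNITE implementation: the function name resolve_dirs, the in-memory dictionary form of the configuration, and the sample values are all assumptions; only the entry names come from the configuration file.

```python
import posixpath

def resolve_dirs(config):
    """Return absolute paths for the authorization directories.

    `config` is a hypothetical dict holding the raw strings read from
    the global configuration file, keyed by entry name.
    """
    top = config["TopLevelDir"]
    # 'AuthDir' is relative to 'TopLevelDir'.
    auth = posixpath.join(top, config["AuthDir"])
    # 'FileSetsDir' and 'UserGroupsDir' are relative to 'AuthDir'.
    return {
        "AuthDir": auth,
        "FileSetsDir": posixpath.join(auth, config["FileSetsDir"]),
        "UserGroupsDir": posixpath.join(auth, config["UserGroupsDir"]),
    }

# Sample values are invented for illustration only.
example = resolve_dirs({
    "TopLevelDir": "/usr/local/unite",
    "AuthDir": "auth",
    "FileSetsDir": "filesets",
    "UserGroupsDir": "groups",
})
```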

The 'ContributionDir' is the location in which newly contributed resources are put after being reviewed. 'OldContributionDir' is a directory in which a copy of the original contributed resources is kept. This is done as a safety measure. The 'ReviewDir' is the location in which newly contributed resources are put before they have been reviewed. These three directories are relative to the 'TopLevelDir'.

'MirrorDir', 'MirrorNewFiles', 'MirrorRemovedFiles', 'MirrorUpdatedFiles', 'MirrorServers', and 'MirrorLogs' are directories in which mirroring information is stored.

'ConnectLogs' is a directory in which usage logs are stored. 'DeleteDir' is a directory to which resources deleted using the DELETE method are copied. This is a safety measure provided so that a file deleted by accident can be recovered. 'PutDir' is the directory to which resources are copied when the PUT method is used. This directory is usually the same as the 'ReviewDir' or the 'ContributionDir'.

'ResourceDir' is the directory in which all the HTML files should be stored (i.e. the Document Root). This is where the server will look for any files. 'ScriptDir' is the directory in which the CGI scripts are stored for the server to run. 'DefaultDir' is the name of the script to run when the URL is a slash ('/'). 'GenericDir' is a generic directory where anything can be stored (no special purpose). 'IconDir' is the directory in which the icons are stored. 'BrowserDir' is the directory in which the browser files are stored. 'AuxDir' is a directory used for storing miscellaneous information. 'HomeHTML' is the HTML home page. 'SearchHelp' is the HTML help page for the search interface. 'DeleteMessageFile' is the HTML page displayed to the user after a successful DELETE. 'PutMessageFile' is the HTML page displayed to the user after a successful PUT. 'DatabaseList' is a file containing a list of all the databases currently used.

'DatabaseCSO' is the directory in which all the databases' indexed files, using the CSO search engine, are stored. 'DatabaseWAIS' is the directory in which all the databases' indexed files, using the WAIS search engine, are stored. 'DatabasePG' is the directory in which all the databases' indexed files, using the Postgres search engine, are stored.

'DbConfigFile' is the database configuration file and is discussed in Section 4.4.1.

'UserDir' is the directory which is appended onto a user's home directory if a ~user request is received.

'DefaultPage' is the default home page used when a request comes in without a specific file. This is relative to 'ResourceDir'. 'DeletePermission' is the value that the 'From:' field of a request must carry for the DELETE method to succeed.

'serverPort' is the port number the server is running on. 'serverHost' is the host name of the server.

'defaultUserGroup' is the user group used when none is specified.

In the 'databaseLocation' table the 'dbName' is the name of the database. The 'engine' is the name of the search engine. The 'dbHost' is the host name on which the search engine is located. The 'dbPort' is the port on which the search engine is listening.

4.3 Structure of the UNITE Server

The structure of the server is shown in Figure 4-1.

Figure 4-1: UNITE Server Structure

From the client/server paradigm, we have seen that the server has to be waiting for a request to come. Therefore, the first milestone in the server design is to have it listen on a port for a request. This process is known as a daemon and can be achieved in two ways. The first is called a standalone daemon and means that the server is running with no help from any other applications. It performs its own startup tasks: create a socket, bind the server's well-known address to the socket, wait for a connection, then fork. The child process then performs the service while the parent waits for another request. The second method is called inetd and means that the server is running using the BSD UNIX super-server, inetd. The inetd daemon provides two features: (1) it allows a single process (inetd) to be waiting to service multiple connection requests, instead of one process for each potential service, which reduces the total number of processes in the system, and (2) it simplifies the writing of the daemon processes to handle the requests, since many of the start-up details are handled by inetd. However, there is a small price to pay for this in that the inetd daemon has to execute both a fork and exec to invoke the actual server process, while a self-contained daemon that does everything itself only has to execute a fork to handle each request.
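
The standalone start-up sequence described above (create a socket, bind, wait for a connection, fork) can be sketched in Python as follows. This is an illustration under assumptions, not the UNITE server's code: the function names make_listener and accept_and_fork and the handler callback are hypothetical, and os.fork limits the sketch to UNIX-like systems.

```python
import os
import socket

def make_listener(host, port, backlog=5):
    """Create a socket, bind it to the server's address, and listen."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(backlog)
    return listener

def accept_and_fork(listener, handler):
    """Wait for one connection, then fork a child to service it.

    Returns the child's process id to the parent.
    """
    conn, _addr = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: perform the service, then exit without returning.
        status = 0
        try:
            listener.close()
            handler(conn)
        except Exception:
            status = 1
        finally:
            conn.close()
            os._exit(status)
    # Parent: drop its copy of the connection and go back to waiting.
    conn.close()
    return pid
```

A real daemon would call accept_and_fork in a loop and reap finished children, for example with os.waitpid.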

Figure 4-2: Inetd

On startup, inetd reads the /etc/inetd.conf file. Therefore, to add a service, simply add the proper line to this file and make the daemon reread it (usually by sending it a hangup signal with kill -1). Figure 4-2 gives the steps followed by inetd for each service in the configuration file.

Once a connection is accepted, the server reads the content of the socket and parses the request. The protocol used is HTTP with a few modifications. Figure 4-3 gives the BNF for the protocol used by the UNITE server.

Request = Request-Line
        ( Request-Header 
          | Entity-Header )
        CRLF
        [Entity-Body]

Request-Line = Method SP Request-URI SP HTTP-Version CRLF
Method = "GET" | "PUT" | "POST" | "DELETE" | "SEARCH"
Request-Header = Accept
                 | From
                 | Pragma
                 | Referer 
                 | User-Agent 

Entity-Header = Content-Length
                | Content-Type

Figure 4-3: BNF for protocol

The first symbol is the method used by the client. This can either be a PUT, SEARCH, GET, POST, or DELETE.
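
As an illustration of the grammar in Figure 4-3, the following Python sketch splits a raw request into its Request-Line, header fields, and Entity-Body. It is a simplified reading of the BNF, not the server's actual parser; the function name parse_request and the dictionary layout of the result are assumptions.

```python
METHODS = {"GET", "PUT", "POST", "DELETE", "SEARCH"}

def parse_request(raw):
    """Split a raw request (bytes) into method, URI, version, headers, body."""
    # The empty line (CRLF CRLF) separates the headers from the Entity-Body.
    head, _sep, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("latin-1").split("\r\n")

    # Request-Line = Method SP Request-URI SP HTTP-Version
    parts = lines[0].split(" ")
    if len(parts) != 3:
        raise ValueError("malformed Request-Line")
    method, uri, version = parts
    if method not in METHODS:
        raise ValueError("unknown method: " + method)

    # Header names are stored in lower case for easy lookup.
    headers = {}
    for line in lines[1:]:
        if not line:
            continue
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Content-Length says how many bytes of Entity-Body to keep.
    if "content-length" in headers:
        body = body[:int(headers["content-length"])]
    else:
        body = b""
    return {"method": method, "uri": uri, "version": version,
            "headers": headers, "body": body}
```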

The PUT method tells the server that the client wants to add a new file to the server. The name of the new file is the Request-URI. The server creates the file containing the Entity-Body sent by the client. All the parameters concerning the Entity-Body are given in the Entity-Header: the Content-Length tells the server how many bytes to write to the file, and the Content-Type tells the server the type of the Entity-Body. A message is then returned to the client informing the user that the operation was successful. The message displayed to the client is the content of a file on the server, named in the global configuration file, and can therefore be easily modified. Figure 4-4 gives an example for this method.

 PUT /new_file.html HTTP/1.0 
 Accept: text/html
 Content-Type: text/html 
 Content-Length: 84 

 This is the content of new_file.html.  Whatever I type here
 will appear in the file.

Figure 4-4: Example for PUT method
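
The PUT handling described above can be sketched as follows. This is a hedged illustration rather than the server's code: the function name handle_put and its parameters are hypothetical, and keeping only the last path component of the Request-URI is an assumption made here so that the file cannot be written outside the target directory.

```python
import os

def handle_put(request_uri, headers, body, put_dir):
    """Store the Entity-Body under put_dir and return the path written.

    `headers` is a dict with lower-case field names; `put_dir` stands in
    for the 'PutDir' entry of the global configuration file.
    """
    # Content-Length tells the server how many bytes to write.
    length = int(headers["content-length"])
    # Assumption: keep only the file name so the file stays inside put_dir.
    name = os.path.basename(request_uri)
    if not name or name in (".", ".."):
        raise ValueError("Request-URI does not name a file")
    path = os.path.join(put_dir, name)
    with open(path, "wb") as out:
        out.write(body[:length])
    return path
```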

The DELETE method tells the server that the client wants to remove a file from the server. Figure 4-5 gives an example for this method. This method could be dangerous; we would not want just anybody to be able to remove files from the server. To make this method more secure, the From field in the Request-Header is compared with the value given in the global configuration file. If the two values match, the file is copied to a delete directory. At this point, it is up to the system administrator to delete the file. The response to the client is again the content of a file defined in the global configuration file (similar to the PUT method).

 DELETE /new_file.html HTTP/1.0
 Accept: text/html
 From: CedricDeniau

Figure 4-5: Example for DELETE method
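
The From check for DELETE can be sketched as follows. The function name handle_delete and its parameters are hypothetical, and the sketch performs only the copy step that the text describes; whether and when the original is removed from the resource directory is left out.

```python
import os
import shutil

def handle_delete(request_uri, headers, resource_dir, delete_dir,
                  delete_permission):
    """Copy the named file to delete_dir if the From field is authorized.

    Returns the destination path, or None when the request is refused.
    `delete_permission` stands in for the 'DeletePermission' entry.
    """
    # Refuse unless From matches the configured value exactly.
    if headers.get("from") != delete_permission:
        return None
    # Assumption: only the file name is used, to stay inside resource_dir.
    name = os.path.basename(request_uri)
    source = os.path.join(resource_dir, name)
    if not os.path.isfile(source):
        return None
    destination = os.path.join(delete_dir, name)
    shutil.copy2(source, destination)
    return destination
```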

The GET method is used to retrieve documents from the server and to run CGI scripts. If the Request-URI is a valid file on the server, the server returns the content of the file. If the Request-URI is a directory, the server checks for the default page in that directory; the default page is defined in the global configuration file. If the Request-URI is a user path (using the ~ symbol in the URL), the server looks up the absolute pathname of the user's home directory in the /etc/passwd file, appends the 'UserDir' value from the global configuration file to it, and then appends the name of the file the client requested; it returns the content of the resulting file. If the Request-URI specifies a valid CGI script, the server executes the script, passing the program the proper environment variables. For CGI scripts, the GET method appends any client information to the Request-URI. Therefore, the server has to parse the Request-URI for the script name and the client information. This is possible because the two are separated by a question mark ('?'). The server then passes the client information to the CGI script through an environment variable called QUERY_STRING. Figure 4-6 shows examples for the different points just discussed.

1. Client requesting:

(a) A file

 GET /file.html HTTP/1.0
 Accept: text/html

(b) A user file

 GET /~deniau/file.html HTTP/1.0
 Accept: text/html

(c) A directory

 GET /~deniau HTTP/1.0
 Accept: text/html

(d) CGI script

 GET /cgi-bin/script_name?name=cedric
 Accept: text/html

2. Server response:

(a) The content of file.html.

(b) The content of /users/deniau/.public_html/file.html, where /users/deniau came from the /etc/passwd file and .public_html came from the global configuration file entry specifying the location of the user's HTML files.

(c) The file /users/deniau/.public_html/index.html. The name index.html comes from the global configuration file and therefore varies from server to server.

(d) Executes the script script_name with the environment variable QUERY_STRING set to name=cedric.

Figure 4-6: Example for GET method
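
The Request-URI handling for GET can be sketched as follows. This is an illustration under assumptions: the function name resolve_get, its parameters, and the injectable home_of and is_dir callables are hypothetical, and the sketch neither dispatches CGI scripts nor guards against ".." components in the path.

```python
import os
import posixpath

def resolve_get(request_uri, resource_dir, user_dir, default_page,
                home_of, is_dir=os.path.isdir):
    """Map a Request-URI to (file_path, query_string).

    `home_of` returns a user's home directory; a real server might back
    it with pwd.getpwnam(name).pw_dir, which reads /etc/passwd.
    """
    # Client information for CGI scripts follows the question mark.
    path, _mark, query = request_uri.partition("?")
    if path.startswith("/~"):
        # User path: home directory plus 'UserDir', then the file name.
        user, _slash, rest = path[2:].partition("/")
        base = posixpath.join(home_of(user), user_dir)
    else:
        base, rest = resource_dir, path.lstrip("/")
    full = posixpath.join(base, rest) if rest else base
    # A directory is answered with its default page.
    if is_dir(full):
        full = posixpath.join(full, default_page)
    return full, query
```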

The POST method is used to run CGI scripts. When a POST request comes in, the server first checks whether the Request-URI is a valid CGI script program. The client places any information sent by the user in the Entity-Body and sets the Entity-Header fields accordingly. If the Request-URI is a valid CGI script, the server executes the specified program, passes it the proper environment variables, and supplies the Entity-Body on standard input (stdin in C). The CGI script knows how much data to read from standard input from the environment variable CONTENT_LENGTH, which the server passes to it. Figure 4-7 shows an example of a POST method.

 POST /cgi-bin/program_name HTTP/1.0 
 Accept: text/html 
 Content-Type: text/html 
 Content-Length: 83 

 firstName=Cedric+lastName=Deniau

Figure 4-7: Example for POST method
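
From the CGI script's side, reading the POST data might look like the following sketch. The function name read_post_body is hypothetical; the only interface taken from the text is the CONTENT_LENGTH environment variable and standard input.

```python
import io

def read_post_body(environ, stdin):
    """Read exactly CONTENT_LENGTH bytes of client data from stdin.

    `environ` is a mapping such as os.environ; `stdin` is a binary
    stream such as sys.stdin.buffer.
    """
    # A missing or empty CONTENT_LENGTH is treated as zero bytes.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    return stdin.read(length)

# Demonstration with an in-memory stream standing in for stdin.
demo_body = b"firstName=Cedric+lastName=Deniau"
demo = read_post_body({"CONTENT_LENGTH": str(len(demo_body))},
                      io.BytesIO(demo_body + b"ignored"))
```

In a real CGI script the call would be read_post_body(os.environ, sys.stdin.buffer).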

The SEARCH method is a unique feature of the UNITE server. It was created to allow the server to directly respond to queries from the client rather than via CGI scripts. It also defines a search syntax, which has yet to be done by the Web community. Appendix A shows the BNF for this method.

The protocol-type is the protocol the server understands and is used for version control.

The DBSpec gives the name of one or more databases to search. If an invalid database name is specified, an error message is returned informing the client of the error. If the database specified is valid but unavailable, a different error message is returned. In either case, an error on one database does not prevent the search from continuing on the other databases specified in the same query.

The SessionSpec gives resource control parameters. The time-pair specifies the maximum number of seconds a search may take, the cost-pair specifies the maximum connection cost a search may incur, and the distance-pair specifies the maximum distance at which a database may be and still be searched. If any of these maximums is exceeded, a message is returned to the client. The user then has the option of overriding the maximum and continuing the search, up to a true maximum, or of seeing the incomplete search results.

The SearchSpec specifies which records get returned. The operators defined are and, or, andnot, and contains. The and, or, and andnot operators perform the standard Boolean operations, while contains allows the user to search for specific values.

The ReturnSpec indicates how to present the identified records, including how to present extremely large retrieval sets. The max_num_full specifies the number of full records to present, max_num_sum specifies the number of summary records to present, max_size_full specifies the maximum size of the full record set in bytes, max_size_sum specifies the maximum size of the summary record set in bytes, sort_method is a sort specification with the primary sort key being the one most nested, show_full_method specifies how to present the full records, and show_sum_method specifies how to present the summary records. Figure 4-8 shows an example query sent by the UNITE client.

 SEARCH Unite-2.0
  ((UNITEResource)
   (:maxTime 500) 
  (and  (contains "Title"  "animal") 
        (contains "Grades"  "6") 
  )
  (:maxNumSummaryRecords 200
   :sort-by (:alpha "ResourceType" (:alpha "Title") )
   :show-summary ( "ResourceType" "Title" )) 
  )
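
Because the query above is a nested, parenthesized expression, a server can read it into a tree before evaluating it. The following Python sketch is a generic reader inferred only from the shape of this example; the authoritative grammar is the BNF in Appendix A, and the name read_query and the use of nested lists are assumptions.

```python
import re

# A token is a parenthesis, a double-quoted string, or a bare word.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def read_query(text):
    """Parse one parenthesized expression into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while True:
                if pos >= len(tokens):
                    raise ValueError("missing ')'")
                if tokens[pos] == ")":
                    pos += 1  # consume the closing parenthesis
                    return items
                items.append(read())
        if token == ")":
            raise ValueError("unexpected ')'")
        # Quoted strings lose their quotes; bare words are kept as-is.
        return token.strip('"')

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return result
```

The reader would be applied to the text that follows the method name and protocol-type on the first line of the request.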

The DATABASE OBJECT section defines a UNITE resource's fields and field attributes, using one line per field. The section begins with the keywords DATABASE OBJECT, followed by a STRING, which is the name of the database, and then a NUMBER, which is the version number. The resource's fields follow, enclosed in braces and delimited by semicolons. The first attribute of an entry is the field type, which can be either a predefined or a user-defined type. The predefined types are string, integer, uid, and freetext. A string is a sequence of characters enclosed in double quotes, and an integer is a sequence of digits from 0 to 9. Freetext is the same as a string except that it can contain line feeds. The user-defined types are either enumerations or records. The next attribute specifies how many items the field can contain: One, OneOrMore, ZeroOrMore, or Zero. The third attribute specifies how the field is used during a search, and the last attribute is the name of the field used by the database.

In the example of Appendix C, the name of the database is UNITEResource, and the version number is 1994092001. The last entry of the DATABASE OBJECT section specifies that the field "Reviewers" is of type "string", can hold one or more values and is not searchable. As another example, the field "Curriculum" is of type "CurriculumT" which is an ENUMERATION representing a set of values that are hierarchically defined. Therefore, the "Curriculum" field can only contain values that are explicitly defined in the ENUMERATION "CurriculumT". Some possible values could be: "Mathematics", "Mathematics/Problem Solving and Reasoning/Generalize" and "Natural Science/Life Science". "Curriculum" can hold "OneOrMore" values which means that there has to be at least one value defined and it is a "KeywordValue" meaning that it is searchable through a keyword based search engine like CSO.

The other user defined type is a RECORD. This RECORD object uses the same set of parameters as the DATABASE OBJECT. However, the record defined is used as a type for a field in the DATABASE OBJECT rather than defining an object directly. This allows for a more flexible definition of the database. Following our example in Appendix C, the field "FileDescriptions" is of type "FileDescriptionsT" which is a RECORD. This RECORD contains a field "FileDescription" which is of type "FileDescriptionT" which is also a RECORD. This record contains five fields: "FileSizeInKBytes", "FileFormat", "FileName", "FileSet", and "FileEncoding".

The TABLE section gives extra flexibility to the system by defining a mapping from one set of values to another. The section begins with the keyword TABLE followed by a STRING, which is the name of the table. The table entries then follow, enclosed in braces. Each entry consists of two STRINGs, and entries are delimited by semicolons. The first STRING in an entry is used as the index, and the second STRING is the value to which it is mapped.

The ENUMERATION section defines a set of valid values a database field is allowed to have. The syntax for this section is first defined by the keyword ENUMERATION followed by a STRING which is the enumeration name. The content of the enumeration then follows enclosed in braces. All enumerations are hierarchic. Some hierarchies may just be one level deep making them look like simple lists. For example, the ENUMERATION "ResourceTypeT" is a simple list of valid values for the field "ResourceType". On the other hand, the ENUMERATION "FileFormatT" is a hierarchic list of valid values for the field "FileFormat". Internally, both of these enumerations are represented in the same manner.
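
One way to see why a simple list and a hierarchy can share a single internal representation is the sketch below, in which a flat list is just a tree one level deep. The representation is an assumption made for illustration (nested dictionaries keyed by the slash-separated parts of a value, as in the "Curriculum" examples); the function names are hypothetical and the actual UNITE data structures are not shown here.

```python
def build_enumeration(paths):
    """Build a nested dict from slash-separated enumeration values."""
    root = {}
    for path in paths:
        node = root
        for part in path.split("/"):
            # Create the level on first sight, then descend into it.
            node = node.setdefault(part, {})
    return root

def is_valid(enumeration, value):
    """Return True if value names a node of the enumeration tree."""
    node = enumeration
    for part in value.split("/"):
        if part not in node:
            return False
        node = node[part]
    return True

# Values taken from the "Curriculum" examples in the text.
curriculum = build_enumeration([
    "Mathematics/Problem Solving and Reasoning/Generalize",
    "Natural Science/Life Science",
])
```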

Following the definition of a database, the records need to be entered and ultimately presented to the user. The records are indexed using the CSO database and are rendered in HTML. The HTML generation is currently done at contribution time but could be done at runtime if it were desirable to trade time for space.

4.4.2 CSO

The current search engine used for UNITE is CSO. CSO was originally written for a simple name service, a computer resident phone book, but required only slight modifications to fit UNITE’s needs. It can keep relatively small amounts of information about a relatively large number of objects, and provide fast access to that information over the Internet [4]. CSO also allows for wild card expansion which permits users to be conveniently vague when formulating queries. The main problem with CSO is that it is inappropriate for large target text items and it does not have boolean search capabilities. This motivated us to implement set operations (i.e. and, or, contains, ... ).
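
Layering Boolean operators over an engine that only answers simple lookups can be sketched with set operations on record identifiers. This is an illustration, not the UNITE implementation: the function name evaluate, the nested-list form of the expression, and the lookup callback standing in for a CSO query are all assumptions.

```python
def evaluate(expr, lookup):
    """Evaluate a nested search expression to a set of record ids.

    expr is ["contains", field, value] or [op, sub1, sub2, ...] with
    op one of "and", "or", "andnot". lookup(field, value) returns the
    matching record ids and stands in for a query to the search engine.
    """
    op, args = expr[0], expr[1:]
    if op == "contains":
        field, value = args
        return set(lookup(field, value))
    results = [evaluate(sub, lookup) for sub in args]
    if not results:
        raise ValueError("operator needs at least one operand")
    if op == "and":
        return set.intersection(*results)
    if op == "or":
        return set.union(*results)
    if op == "andnot":
        # Records in the first operand that appear in none of the others.
        return results[0].difference(*results[1:])
    raise ValueError("unknown operator: " + op)
```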

4.4.3 Adding Database Engines

To add a new search engine to the UNITE system, only a handful of functions need to be written. First, functions to format the query and send it to the new search engine are needed. Then, functions have to be written to parse the results returned by the search engine into the proper data structures. Finally, the global configuration file has to be modified by adding an extra line to the "databaseLocation" section (refer to the example in Appendix B), and the database has to be built.
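
The division of labour described above can be pictured as a small interface that each engine fills in. The sketch below is purely illustrative: the class and function names, the registry keyed by engine name, and the toy line-based wire format are all assumptions, and the step of sending the query over the network is omitted.

```python
from abc import ABC, abstractmethod

class SearchEngine(ABC):
    """Hypothetical interface for the functions a new engine supplies."""

    @abstractmethod
    def format_query(self, field, value):
        """Turn a lookup into the text the engine expects."""

    @abstractmethod
    def parse_results(self, raw):
        """Turn the engine's reply into a list of record ids."""

# Registry keyed by the 'engine' name used in 'databaseLocation'.
ENGINES = {}

def register_engine(name, engine):
    ENGINES[name] = engine

class ToyEngine(SearchEngine):
    """Invented engine with a one-line query and one id per reply line."""

    def format_query(self, field, value):
        return "query %s=%s\n" % (field, value)

    def parse_results(self, raw):
        return [line.strip() for line in raw.splitlines() if line.strip()]

register_engine("toy", ToyEngine())
```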