Systems are emerging upon which applications with varied behavioral semantics run. Single systems capable of supporting co-existing application semantics offer advantages over specialized systems created for specific applications by simplifying system management, and reducing redundant computing resources. For example, multi-purpose systems may be used simultaneously for web-browing and multimedia, or in a laboratory setting a single system may be used for a wide variety of scentific applications, each requiring their own set of unique semantics. The trend of moving away from specialized systems to general purpose systems capable of supporting a variety of applicatino semantics is driven by the cost and simplicity of managing fewer systems. But as the set of co-existing behaviors on a system continues to grow, the use of shared system resources must be controlled through a unified system policy.
Software developers express the semantics of their applications through programming models using semantic mappings. A semantic mapping is a function which expresses one semantics in terms of another. A programming model expresses a semantic mapping through scheduling parameters, as well as other components of the model, such as concurrency control policy. For instance, the Linux kernel implements a hard-wired, static-priority scheduler, and exports a priority based programming model. Thus, developers using Linux are required to express the behavior of their software by mapping application smeantics onto a priority based scheduling model, regardless of the difficulty and precision involved in creating the mapping. It would be beneficial if developers were able to easily create their own programming models using clear semantic mapping functions.
Group Scheduling is a hierarchical scheduling framework used for creating thread schedulers with arbitrary semantics, by allowing applications to represent scheduling information directly within the scheduler. The direct representation of scheduling information allows a developer to choose the level at which semantic mappings are expressed. For example, Group Scheduling has been used to express progress-based semantics for a multi-tier, pipeline-based processing application, by allowing threads to publish their progress directly into the scheduler. Achieving these semantics using using only a priority-based programming model is significantly more challenging, and error prone. While Group Scheduling allows developers to create arbitrary scheduling semantics, supporting co-existing, arbitrary semantics is difficult due to the policy conflicts that arise as computations interact through their use of shared system resources.
In general, access to a shared resource is controlled by affecting the runnabilty of computations. Associated with each shared resource is thus a semantics specifying the rules by which access is granted, in terms of computation runnability. We use the term interaction channel to refer to the general idea that application interact through shared resources, and that this interaction has an associated semantics affecting the runnability of computations. Because the semantics of interaction channels affect the runnability of computations, they in turn can affect the abiilty of the system to construct schedules obeying the behavioral constraints of computations. In order to support arbitrary programming models we must be able to specify arbitrary scheduling semantics. Thus, the semantics of all system components which affect scheduability, including interaction channels, must be integrated with scheduling. While Group Scheduling facilitates the creation of arbitrary scheduling semantics, this paper presents a general solution to the problem of resolving scheduling policy conflicts for a common interaction channel, the binary semaphore. In doing so, we successfully integrate the semantics of the binary semaphore interaction channel with the Group Scheduling framework in support of arbitrary programming models.
Interaction channels can take the form of
- HW shared among computation threads, most notably CPUs
- Threads providing services to computation threads, X-server most obvious example, SIRQ and HIRQ multiplexed among computation threads are extremely important.
- Data structures shared among 2 or more threads, CC used to coordinate sharing is the real IC
- CC used for scheduling effects, especially to implement desired scheduling semantics when priority is inappropriate.
Exclusive use interaction channels are among the most common, although not the only type. The Linux kernel mutex is among the most commonly used and represents the first component of a general solution to the problem of fully integrating interaction management under the group scheduling framework.
The PREEMPT_RT patch has done a considerable amount of relevant work. They have done considerable code cleanup related to creating a standard interface for the basic kernel mutex, or lock, that can also be configured to operate under two major semantics: non-preemptible and preemptible. The standard Linux kernel uses the non-preemptible semantics for locks because it generally produces higher performance for the system as a whole since its overhead is lower.
In general terms, the PREEMPT_RT patch solves the problem of how to minimize context switch latency by permitting threads holding lock to be preempted. However, since the locks are interaction channels among arbitrary sets of threads, management of lock semantics must be integrated with thread scheduling semantics. In the PREEMPT_RT patch this integration takes the form of support for Priority Inheritance by lock owners.
The PREEMPT_RT patch adds preemptible lock semantics (rt_mutex) in service of improving RT performance by lowering worst case latency of thread context switch, but at the cost of higher overhead in the form of the accounting required to track lock ownership and waiting relationships among tasks requesting a lock and the lock owners. Worst case context switch latency is lowered because unless the currently executing thread is in one of a few short non-preemptible critical sections, the CPU can be immediately switched to the newly runnable RT thread.
The ownership and waiter accounting is required because permitting lock owneres to be preempted introduces the problem of Priority Inversion under priority scheduling semantics, because a higher priority task can become blocked for arbitrary periods of time by lower priority tasks when lower priority tasks are preempted while holding locks. In more general terms, a more desirable thread cannot run because a less desirable thread is holding a lock required by the more desirable thread. Solving this problem requires some form of integration between the scheduling and concurrency control semantics of the lock.
The accounting associated with rt_mutex use under the PREEMPT_RT patch to track ownership and waiter status can be viewed as a method of integrating concurrency control and scheduling semantics under priority scheduling. The decisions made in the lock code are consistent with and coordinated with the decisions made by the scheduling subsystem. When a lock is released the waiter with the best priority is chosen. Owners of locks inherit the highest priority of those threads waiting on the lock to ensure those higher priority waiters spend a minimum amount ot time waiting to acquire the lock. The result is that both context switch latency and waiting time of more desirable threads blocked by less desirable threads is minimized.
However, Priority Inheritance is a method of integrating lock and scheduling semantics which is only approriate to priority based scheduling semantics. It is difficult or impossible to use under a number of other important approaches to scheduling, including: CPU Share, Explicit Plan, Earliest Deadline First, or a wide variety of application specific semantics.
The goal of the Group Scheduling framework is to permit developers to specify arbitrary semantics for their applications, and through hierarchical composition of groups representing different aspects of system activities, to specify the semantics for the system as a whole. Since Priority Inheritance is limited to priority scheudling schemes another, more general, approach to integrating scheduling and concurrency control is required.
Proxy Management is a new subsystem which keeps track of lock ownership and waiting relationships among tasks in as general a way as possible and to provide an interface to the scheduling layer that constrains the scheduling semantics as little as possible. The basic model used by Proxy Management is that when a task T1 blocks on a lock owned by task T2, T1 becomes a waiter on T2, and T2 becomes the proxy of T1. In general, the scheduling layer using the information can permit tasks blocked because they are waiting on a lock to remain selectable, and when selected, to run their proxy instead.
This approach assumes no specific scheduling semantics, yet achieves the goals of minimizing both latency and waiting time of a more desirable thread being blocked by a less desirable thread. It has a variety of differences in implementation detail from Priority Inheritance, however, which make the tradeoff of overhead against semantic flexibility reasonably complex. In comparing Proxy Management to Priority Inheritance, however, it is important to note that the proxy Management provides the same results as Priority Inheritance when priority scheduling semantics are used to control thread selection.