Proxy Management

The Proxy Management (PMGT) subsystem is responsible for all accounting required to track: (1) all owners of locks, (2) all waiters on locks, (3) all blocking relations among tasks and locks, and (4) all proxy relations among tasks. However, the PMGT layer is explicitly designed to concentrate solely on managing this information, avoiding any aspect of how the information is used. The PMGT code is designed under the assumption that the Scheduling Layer (SL) of the system, logically above PMGT, is responsible for making use of the information in ways appropriate to the SL semantics. The PMGT layer is explicitly designed to permit different scheduling layers to be chosen at kernel configuration time.

This section first concentrates on the basic concepts underlying PMGT, and then presents the specific data structures and algorithms used to implement PMGT.

Blocking and Proxy Relations

This section presents three basic concepts fundamental to understanding PMGT: direct blocking, indirect blocking, and proxy relations. These concepts are important when considering the data structures supporting PMGT and the procedures using them in implementing the PMGT algorithms.

Direct Blocking Relation

A direct blocking relation between two tasks T1 and T2, DBlock(T2, T1, L1), exists when a lock L1 is owned by T1 and T2 has blocked in a request to own L1. We can also note the relation between the tasks without noting the lock involved: DBlock(T2, T1). A direct blocking relation exists between a task T2 and a lock L1, DBlock(T2, L1), when T2 is blocked in a request to own L1 but the lock is already owned by another task.

These relations are illustrated in Figure 1.

The following are true statements about the direct blocking relation illustrated:

  • T1 is the owner of lock L1
  • DBlock(T2, L1): T2 is directly blocked on L1
  • DBlock(T2, T1, L1): T2 is directly blocked on T1 through L1
_images/direct-blocking-relation-oo.png

Figure 1: A Direct Blocking Relation

Indirect Blocking Relation

An indirect blocking relation between two tasks Ta and Tb, IBlock(Ta, Tb), exists when some set of direct blocking relations involving an arbitrary set of tasks and locks exist which transitively link Ta and Tb.

An inductive definition is useful, since the number of tasks and locks involved is unbounded.

Base Case

The simplest indirect blocking relation involves three tasks T1, T2, and T3, and two locks L1 and L2. This relation is illustrated in Figure 2. In the figure, the following statements are true:

  • DBlock(T2, T1, L1): T2 is directly blocked on task T1 through lock L1
  • DBlock(T3, T2, L2): T3 is directly blocked on task T2 through lock L2
  • IBlock(T3, T1): T3 is indirectly blocked on T1
  • IBlock(T3, T1, (L1, L2)): T3 is indirectly blocked on task T1 through locks L1 and L2
  • IBlock(T3, T1, (T2)): T3 is said to be indirectly blocked on task T1 through task T2
_images/indirect-blocking-relation-base-case-oo.png

Figure 2: Indirect Blocking Relation (Base Case)

Inductive Case

An indirect blocking relation involving more than three tasks and two locks is illustrated in Figure 3, where the following statements are true:

  • IBlock(T3, T1): T3 is indirectly blocked on task T1 by the base case

  • DBlock(T4, T3, L3): T4 is directly blocked on T3 through lock L3

  • T4 is indirectly blocked on all tasks and locks that task T3 is blocked
    on, either directly or indirectly. Thus, the following relations are true:

    • IBlock(T4, T2)
    • IBlock(T4, T1)
    • IBlock(T4, L2)
    • IBlock(T4, L1)
_images/indirect-blocking-relation-inductive-case-oo.png

Figure 3: Indirect Blocking Relation (Inductive Case)

For convenience, when we do not care whether the blocking relation between Ta and Tb is direct or indirect, we use the notation Block(Ta, Tb).

Proxy Relation

A proxy relation between two tasks Ta and Tb, Proxy(Ta, Tb), exists when (1) a blocking relation, Block(Ta, Tb), either direct or indirect, exists between Ta and Tb; and (2) Tb is not blocked on a lock. In this case, we say that the blocking relation induces the proxy relation: Block(Ta, Tb) -> Proxy(Ta, Tb), iff Tb is not blocked on a lock. In general, knowledge of the proxy relation is useful only when the proxy is in the runnable state. If the proxy is blocked for any of a number of reasons other than blocking on a lock, there is nothing that can be done until it becomes runnable again.

When the proxy task is runnable, the proxy relation indicates that if the system wishes to run Ta, it is helpful to run Tb instead until Tb releases the lock it owns that is involved in the direct or indirect blocking relation, because doing so removes the blocking relation between Ta and Tb.

For example, in Figure 1 a single, direct blocking relation exists, DBlock(T2, T1), and this relation induces the proxy relation Proxy(T2, T1). In Figure 3, however, proxy relations are induced by both direct and indirect blocking relations:

  • Proxy(T2, T1)
  • Proxy(T3, T1)
  • Proxy(T4, T1)

Notice that in Figures 1, 2, and 3 each blocked task is associated with exactly one proxy relation. Because a task can be blocked on exactly one lock at a time, and a lock can have exactly one owner at a time, a sequence of DBlock relations always exists describing a path between a blocked task and its proxy.
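
That path can be followed mechanically. The sketch below is illustrative only: it ignores all locking and reference counting that the real PMGT code must perform, assumes the blocking graph contains no cycle (no deadlock), and uses the pi_blocked_on pointer and rt_mutex_owner() helper that appear later in this section.

/*
 * Illustrative sketch only: follow the chain of DBlock relations from a
 * blocked task to its proxy. All locking, reference counting, and cycle
 * (deadlock) handling performed by the real PMGT code is omitted.
 */
static struct task_struct *find_proxy_sketch(struct task_struct *waiter)
{
        struct task_struct *t = waiter;

        /* Each step follows DBlock(t, owner, lock): the lock t is blocked
         * on, and then that lock's owner. A task with no blocked-on lock
         * is the proxy. */
        while (t->pi_blocked_on != NULL)
                t = rt_mutex_owner(t->pi_blocked_on->lock);

        return t;
}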

_images/complex-chain-example-oo.png

Figure 4: Complex Lock Chain

Consider, for example, the complex set of blocking relations shown in Figure 4. In this figure the translucent green rectangle emphasizes that the direct blocking relations DBlock(T3, T2) and DBlock(T6, T5) could instead be arbitrarily complex sets of relations resulting in IBlock(T3, T2) and IBlock(T6, T5). In Figure 4, T1 is the proxy of all other tasks as a result of a variety of direct and indirect blocking relations. The set of blocking relations and the proxy relations they induce are thus:

  • DBlock(T2, T1) -> Proxy(T2, T1)
  • DBlock(T5, T1) -> Proxy(T5, T1)
  • DBlock(T7, T1) -> Proxy(T7, T1)
  • IBlock(T3, T1) -> Proxy(T3, T1)
  • IBlock(T6, T1) -> Proxy(T6, T1)
  • IBlock(T4, T1) -> Proxy(T4, T1)
  • IBlock(T8, T1) -> Proxy(T8, T1)

Finally, it is worth noting that while it is not unreasonable to consider defining proxy relations among tasks whenever a blocking relation holds, we do not do this because these additional proxy relations are not useful until the proxy task in the relation is runnable. We have thus preferred to add the second condition for a task to be a proxy, which specifies that the proxy cannot be blocked as a waiter on another lock. It can be blocked for any of a number of other reasons associated with normal task execution, since we have no control over the actions of a generic task.

Data Structures

This section describes the data structures used by PMGT to keep track of the blocking and proxy relations. The SL uses this knowledge to make decisions of its own regarding which threads to run, and to provide decisions back to PMGT, based on the relevant scheduling semantics, for operations such as selecting the best waiter or deciding whether to permit stealing.

PMGT Waiter Structure

The PMGT Waiter structure struct pmgt_waiter is primarily used to represent the proxy relation between two tasks, but during PMGT algorithm execution it temporarily represents DBlock and IBlock relations while the proxy is being found. The task in the relation that is blocked on a lock is referred to as a waiter, while the other task is the proxy of the waiter. The proxy task may have many waiters, and a separate pmgt_waiter structure will be used to represent the proxy relation between the proxy and each of its waiters.

It is important to note that the only constraint on the state of the proxy is that it not be blocked on a lock. In general, we tend to think of the proxy task as being in the runnable state, as that is when executing it as the proxy is useful. However, the proxy can block for other reasons in the normal course of its execution on the system. When it does, that is outside the purview of PMGT, which integrates concurrency control and scheduling. If a proxy task P blocks on a lock L, then P ceases to be the proxy of its waiters, and the owner of L becomes the proxy for P and all of its waiters.

A fairly subtle point of implementation is that when a task T first blocks on a lock, it is not always possible to know which task is going to be the proxy of T. For example, consider a scenario starting with the situation shown in Figure 5. In this initial state task T3 is in the process of blocking on task T2. The struct pmgt_waiter is allocated and initialized, but since the entire locking chain must be examined, the proxy is not immediately known. In the figure, the fact that the rest of the chain is unknown is illustrated by the lack of color for nodes L1 and T1. Given the information known at the moment of locking, PMGT associates T3’s struct pmgt_waiter, illustrated by the green square in the figure, with the best choice given the available knowledge of the locking chain, which is T2.

_images/sprx-waiter-not-yet-proxy1-oo.png

Figure 5: PMGT Waiter Associated With New Blocking Relation

At the completion of this initial stage of lock processing, where the direct blocking relation DBlock(T3, T2, L2) is known, a secondary stage begins which iteratively traces the locking chain to discover the identity of the proxy. Such a proxy must exist, as long as no cycle exists in the graph describing the blocking relations. In Figure 6 we illustrate the point at which the iterative process has found the next blocking relation in the chain, which also happens to be the last. This is the point at which the PMGT code knows the entire blocking chain for T3 but has not yet updated the struct pmgt_waiter to reflect the identity of the proxy.

_images/sprx-waiter-not-yet-proxy2-oo.png

Figure 6: Proxy Waiter Needs Update

In Figure 7, we see the situation after the proxy of T3 has been identified and the relation Proxy(T3, T1) has been reflected in the contents of the struct pmgt_waiter.

_images/sprx-waiter-as-proxy-oo.png

Figure 7: Proxy Waiter Associated With Proxy Relation

struct pmgt_waiter

The struct pmgt_waiter structure represents a task waiting on a lock, and is defined in include/linux/rtmutex.h.

struct pmgt_waiter {
        atomic_spinlock_t       lock;
        struct task_struct      *proxy_task;
        struct list_head        proxy_ent;
        struct list_head        nodes;
        struct pmgt_node        *tnode;

        pmgt_sched_waiter_t     sched_waiter;
};

lock

An atomic (traditional) spinlock controlling access to the waiter.

proxy_task

The proxy_task field points to the owner of the lock involved in the blocking relation that the struct pmgt_waiter is describing at the moment. The proxy_task pointer is depicted by the dashed, red arrow in Figure 6 and Figure 7. When the proxy identification is complete, the proxy_task pointer refers to the proxy task, but while proxy identification is in progress, it can refer to any task with which the waiter has a blocking relation which lies along the chain from the waiter to the proxy.

proxy_ent

A list entry structure used to place the struct pmgt_waiter on the list of all its waiters maintained by the proxy task, whose struct task_struct is pointed to by proxy_task. The list root is the pmgt_waiters field in the struct task_struct. The per-task pmgt_waiters list is represented in Figure 8 as a dashed, green line. Note that the task T1 has 4 waiters, and thus 4 struct pmgt_waiter nodes on the list created using this field.

nodes

This is the root of an ordered list of struct pmgt_node structures which represents the sequence of blocking relations forming the chain between a waiter and its proxy. The (struct pmgt_node *)->waiter_ent field is used to place a node on the list representing the locking chain. A lock chain represented by a sequence of struct pmgt_node elements rooted at the nodes field begins with the proxy task’s direct blocking relation on the chain, and ends with the blocking relation associated with the waiter. This list is represented in Figure 8 as a dashed, black line. For example, in the figure, the nodes N5,1 and N5,4 on the nodes list of struct pmgt_waiter W5 represent the relations IBlock(T5, T1) and DBlock(T5, T4), respectively.
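
As an illustration, the chain recorded for a waiter can be examined by walking this list with the standard kernel list iterator. The sketch below is hypothetical helper code, not part of PMGT, and assumes the caller already holds whatever locks protect the list.

/*
 * Sketch: walk a waiter's chain of blocking relations from the proxy end
 * (the first entry, the tnode) to the waiter end (the last entry).
 * Assumes the protecting locks are already held by the caller.
 */
static void pmgt_print_chain_sketch(struct pmgt_waiter *waiter)
{
        struct pmgt_node *node;

        list_for_each_entry(node, &waiter->nodes, waiter_ent)
                printk(KERN_DEBUG "via lock %p owned by %s (arrived via %s)\n",
                       node->via_lock, node->owner_task->comm,
                       node->via_task ? node->via_task->comm : "direct");
}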

tnode

This is referred to as the top node, or last node, in the chain of blocking relations that generates the proxy relation between a waiting task and its proxy. This relation may be direct or indirect. In Figure 8, the tnodes, paired with their waiters, are: (W2, N2,1), (W3, N3,1), (W4, N4,1), (W5, N5,1). The most useful property of the tnode is that it identifies the lock held by the proxy of a given waiter, through which the proxy relation is established. In Figure 8, L1 is the lock identified by all of the tnodes, because L1 is the lock held by T1 through which the proxy relations to all the other tasks are created. Finally, note that the information provided by this field is at least partially redundant, since the tnode is always the first entry on the (struct pmgt_waiter *)->nodes list.

sched_waiter

This is a structure whose type is determined by the scheduling layer configuration and whose purpose is to serve as a “black box” for the scheduling layer, holding whatever information the SL wants to keep about a waiting task. The PMGT layer neither knows nor cares how to examine the contents of the structure.
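
For illustration only, a priority-based scheduling layer might define the black box along the following lines; the field names are hypothetical and are not the actual KUSP definitions.

/*
 * Hypothetical example of a scheduling-layer-defined waiter record.
 * PMGT stores it inside struct pmgt_waiter but never interprets it.
 */
typedef struct {
        int                     prio;   /* effective priority of the waiter */
        struct list_head        ent;    /* entry on an SL-private list      */
} pmgt_sched_waiter_t;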

PMGT Node Structure

The struct pmgt_node is the other major data structure involved in PMGT. The struct pmgt_waiter represents each waiter on a proxy task. Thus, in Figure 8, task T1 has four waiter structures associated with it, which represent the four proxy relations in the scenario: Proxy(T5, T1), Proxy(T4, T1), Proxy(T3, T1), and Proxy(T2, T1). While these relations are the main point of PMGT, they are not all of the accounting information needed under all circumstances. Tasks can perform lock operations or abort their waiting on a lock at any time. The responsibility of PMGT is to track the implications of every relevant action, which requires additional information beyond that provided by the struct pmgt_waiter nodes. The struct pmgt_node provides the additional information by representing each direct relation along the locking chain from the waiter to the proxy.

Lock Chain Representation

A locking chain is represented by a sequence of struct pmgt_node instances arranged in a linked list and rooted at the (struct pmgt_waiter *)->nodes field. The node order on the list reflects the order in which the blocking relations occur on the chain, starting at the proxy end. Thus, the first node always represents a direct blocking relation between the proxy and a task waiting on a lock held by the proxy. The last node in the chain represents a direct blocking relation between the waiter and whatever lock it is blocked on that leads, through one or more relations, to the proxy. In Figure 8 the locking chain associated with each waiter task is shown as a dashed black line attached to a waiter node representing the waiting task.

_images/proxy-mgmt-linked-lists-oo.png

Figure 8: Linking Among Proxy Management Data Structures (empty lists are not shown)

Lock Chain Distinction

PMGT algorithms must be able to distinguish between waiters in distinct locking chains in different contexts. For example, in Figure 8 tasks T4 and T2 are both blocked on lock L1, and represent two distinct locking chains that merge at lock L1. The merging of the two chains effectively masks the locking chain membership of indirectly blocked tasks, if we consider only the information provided by the struct pmgt_waiter structures. For example, without additional information, given the nodes W3 and W5 we know that tasks T5 and T3 are waiters on T1, and that T1 is their proxy, but we do not know the set of blocking relations through which the proxy relation was induced.

However, the struct pmgt_node provides information from which the detailed set of relations can be deduced. First, the set of struct pmgt_node elements on each (struct pmgt_waiter *)->nodes list traces the lock chain from the proxy back to the waiter. Second, via_task and via_lock fields in the struct pmgt_node provide information about how a given chain arrived at a given point.

For example, the nodes N3,1 and N5,1 represent the first step in the chains of blocking relations between the proxy T1 and the waiters T3 and T5, respectively. Note, however, that the two chains of blocking relations split at L1. Looking at the struct pmgt_node structures N3,1 and N5,1, we can see from the via_lock field that the two waiters have arrived at the same lock L1, but that the chain splits at this point, since the via_task fields are different, pointing to T2 and T4, respectively.

_images/arrived-via-oo.png

Figure 9: struct pmgt_node Property arrived_via Using Pointer via_task

The arrived_via property of a blocking relation is implemented using the via_task pointer. The pointer is illustrated in Figure 9 as a dashed black arrow. In the figure tasks T3 and T5 are indirectly blocked on lock L1, and arrived via tasks T2 and T4, respectively. All indirect blocking relations by definition have an associated arrived_via property that identifies a task. The arrived_via property of a direct blocking relation is NULL by convention, to identify it as a direct blocking relation.
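
These conventions can be captured in two small predicates. The helpers below are a sketch for illustration (the names are not part of the PMGT code) and assume only the via_task and via_lock fields described here.

/* Sketch: a NULL via_task marks a node representing a direct blocking
 * relation, by the convention described above. */
static inline bool pmgt_node_is_direct(const struct pmgt_node *node)
{
        return node->via_task == NULL;
}

/* Sketch: two nodes at the same position in their chains share that
 * portion of a chain only if they arrived at the same lock via the same
 * task; differing via_task values mean the chains split below this lock. */
static inline bool pmgt_same_chain_here(const struct pmgt_node *a,
                                        const struct pmgt_node *b)
{
        return a->via_lock == b->via_lock && a->via_task == b->via_task;
}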

struct pmgt_node

struct pmgt_node {
        raw_spinlock_t          lock;
        struct task_struct      *owner_task;
        struct list_head        owner_ent;
        struct pmgt_waiter      *waiter;
        struct list_head        waiter_ent;
        struct task_struct      *via_task;
        struct rt_mutex         *via_lock;
        struct list_head        via_lock_ent;
};

lock

An atomic (traditional) spinlock controlling access to the structure.

owner_task

The owner of the lock associated with the blocking relation represented by a node. The blocked task associated with the relation is blocked on the owner_task, either directly or indirectly. This field is used exclusively for assertions in Proxy Management algorithms. The owner_task pointer is illustrated in Figure 10 as the dashed black line.

owner_ent

This field is used to place a node on the (struct task_struct *)->pmgt_nodes list of the task pointed to by (struct pmgt_node *)->owner_task. This list represents all direct and indirect waiters on a task. The pmgt_nodes list is shown in Figure 10 as the dashed red line.
_images/node-owner-task-oo.png

Figure 10: struct pmgt_node Node’s owner_task, and pmgt_nodes list

waiter

Pointer to the struct pmgt_waiter associated with the locking chain containing a node. Note that deferred actions are used in PMGT to clean up unused struct pmgt_node instances in some circumstances. An unused node is identified in PMGT by a NULL value assigned to the waiter field. The black dashed arrow in Figure 11 shows the use of the waiter pointer in an example lock chain.
_images/node-waiter-oo.png

Figure 11: struct pmgt_node Node waiter Field

waiter_ent

This field is used to place a struct pmgt_node on the ordered list (struct pmgt_waiter *)->nodes that represents the sequence of blocking relations beginning with the proxy and ending with the waiter. The nodes list of the waiter structure is shown as a dashed black line in Figure 8.

via_task

This pointer refers to the task via which the waiter is related to the proxy at this link in the blocking chain. The via_task pointer, as discussed in the Lock Chain Distinction section, allows PMGT to distinguish between blocked tasks based on their involvement in separate locking chains, which involve different tasks and locks after the splitting point, if present.

via_lock

A pointer to the struct rt_mutex lock associated with the blocking relation represented by a node. Since the rt_mutex has a list of all direct waiters’ struct pmgt_node elements rooted at (struct rt_mutex *)->pmgt_nodes, it is fairly easy to find all direct waiters on a lock.
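
A sketch of that traversal follows, using the fields shown in the struct pmgt_node definition above; it assumes the lock’s wait_lock is held, and the via_task and waiter checks skip any indirect relations and unused nodes that may appear on the list.

/*
 * Sketch only: visit every direct waiter on a lock by walking the lock's
 * pmgt_nodes list. Assumes lock->wait_lock is held. Indirect relations
 * (non-NULL via_task) and unused nodes (NULL waiter) are skipped.
 */
static void pmgt_for_each_direct_waiter_sketch(struct rt_mutex *lock)
{
        struct pmgt_node *node;

        list_for_each_entry(node, &lock->pmgt_nodes, via_lock_ent) {
                if (node->via_task || !node->waiter)
                        continue;
                /* node->waiter is directly blocked on this lock */
        }
}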

via_lock_ent

This field is used to place the struct pmgt_node on a list of direct waiters on a lock, rooted at (struct rt_mutex *)->pmgt_nodes, and is depicted in Figure 8 as a dashed blue line. The relevant struct rt_mutex is found using the via_lock pointer in the same structure.

Task Structure

The Linux struct task_struct is modified to track direct and indirect waiters on a task.

struct task_struct {

        /* ... omitted ... */

# ifdef CONFIG_KUSP_PMGT
        /* holds the trackers for tasks proxied to this task */
        struct list_head pmgt_waiters;
        /* holds nodes for a locking chain */
        struct list_head pmgt_nodes;
        /* information stored on a proxy task by the scheduling layer */
        pmgt_sched_proxy_t sched_proxy;
# else /* CONFIG_KUSP_PMGT */
        /* PI waiters blocked on a rt_mutex held by this task */
        struct plist_head pi_waiters;
# endif /* CONFIG_KUSP_PMGT */

        /* ... omitted ... */

};

pmgt_waiters

A list of struct pmgt_waiter instances representing the set of tasks blocked on this task, and thus the set of waiters for which this task is the proxy. This list is shown as a dashed green line in Figure 8. Elements of the list are connected using the (struct pmgt_waiter *)->proxy_ent field.
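
For example, a scheduling layer callback might visit every waiter for which a task currently acts as proxy roughly as follows; this is a sketch, assuming the proxy’s pi_lock is held, since that lock protects the list.

/*
 * Sketch: visit each struct pmgt_waiter for which @proxy is currently
 * the proxy. Assumes proxy->pi_lock is held by the caller.
 */
static void pmgt_for_each_proxy_waiter_sketch(struct task_struct *proxy)
{
        struct pmgt_waiter *waiter;

        list_for_each_entry(waiter, &proxy->pmgt_waiters, proxy_ent) {
                /* waiter->tnode->via_lock is the lock held by the proxy
                 * through which this proxy relation is established */
        }
}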

pmgt_nodes

A list of nodes representing all blocking relations involving this task. This list is shown as a dashed red line in Figure 8. It uses the (struct pmgt_node *)->owner_ent field to hold elements of the list. This represents the set of tasks directly or indirectly blocked on the task T represented by the structure, Block(T*, T).

sched_proxy

This field represents the set of information which the system’s scheduling layer keeps for each task that serves as a proxy for other tasks. Its type is pmgt_sched_proxy_t, a typedef whose definition is provided by the scheduling layer configured for the system, since different scheduling layers may wish to keep different sets of information. The key point is that while PMGT is peripherally related to the management of this information, since it is used by the PMGT routines that interface to the scheduling layer, from the point of view of PMGT this field is a black box containing arbitrary information. How the type is defined, and other topics related to the interface between PMGT and the scheduling layer, are discussed Here.

pi_waiters

This field is the root of a list ordered by the priority of the tasks each element represents. It is used by the PREEMPT_RT patch, when PMGT is not configured, to track the set of waiters on a task using a priority-sorted list.

RT-Mutex

The RT-Mutex is defined in include/linux/rtmutex.h and is modified to support either Group Scheduling (GS) or Priority Inheritance (PI) on top of PMGT.

struct rt_mutex {
        atomic_spinlock_t       wait_lock;
#ifdef CONFIG_KUSP_PMGT
        union {
                struct list_head list;
                struct plist_head plist;
        }                       waiters;
        struct pmgt_arbiter     *arbiter;
        struct list_head        pmgt_nodes;
#else
        struct plist_head       wait_list;
#endif
        struct task_struct      *owner;
#ifdef CONFIG_DEBUG_RT_MUTEXES
        int                     save_state;
        const char              *name, *file;
        int                     line;
        void                    *magic;
#endif
};

waiters

This is a union, open to expansion by adding additional members in the future. The union is used by a kernel component called an arbiter, introduced under KUSP, which makes the policy used to manage the set of waiters on a struct rt_mutex configurable on a lock-by-lock basis. An arbiter defines the semantics used to manage a set of waiters by providing a set of function pointers representing all the operations on the set. A new type of arbiter is specified by creating an instance of struct pmgt_arbiter and filling in its function pointer fields. A specific arbiter is associated with a specific rt_mutex by filling in the (struct rt_mutex *)->arbiter field. The definition and use of arbiters is discussed in greater detail Here.
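
Purely to illustrate the idea, a FIFO arbiter might be sketched as below. The callback names and the example structure are invented for this illustration and are not the actual struct pmgt_arbiter definition; the sketch uses the waiters union above and the waiters_ent field described in the RT-Mutex Waiter section below.

/*
 * Hypothetical illustration of the arbiter concept. The callback names
 * and the example structure are invented for this sketch; the real
 * struct pmgt_arbiter is defined elsewhere and differs.
 */
struct pmgt_arbiter_example {
        void (*add_waiter)(struct rt_mutex *lock, struct rt_mutex_waiter *w);
        struct rt_mutex_waiter *(*pick_waiter)(struct rt_mutex *lock);
};

/* FIFO policy: append a new waiter, hand the lock to the oldest one. */
static void fifo_add_waiter(struct rt_mutex *lock, struct rt_mutex_waiter *w)
{
        list_add_tail(&w->waiters_ent.node, &lock->waiters.list);
}

static struct rt_mutex_waiter *fifo_pick_waiter(struct rt_mutex *lock)
{
        /* assumes at least one waiter is queued on the lock */
        return list_first_entry(&lock->waiters.list,
                                struct rt_mutex_waiter, waiters_ent.node);
}

static struct pmgt_arbiter_example fifo_arbiter_example = {
        .add_waiter     = fifo_add_waiter,
        .pick_waiter    = fifo_pick_waiter,
};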

arbiter

This is used as part of the interface between the PMGT layer and the scheduling layer using it. It gives access to the structure containing the scheduling layer callbacks used to make decisions when required in the course of lock operations. This is discussed in greater detail Here.

pmgt_nodes

A list of nodes representing blocking relations that involve this lock. The via_lock pointer of all nodes on this list point to this RT-Mutex. This list is shown as a dashed blue line in Figure 8.

RT-Mutex Waiter

The RT-Mutex Waiter (struct rt_mutex_waiter) is used by PREEMPT_RT to represent a task waiting on a mutex. The KUSP PMGT code modifies it to support the additional capabilities required. The first modification was the addition of the pmgt_waiter field, which provides the corresponding struct pmgt_waiter structure. The PMGT code maintains its own set of information, but the original RT-Mutex information is also important. We currently use a pair of struct rt_mutex_waiter and struct pmgt_waiter structures rather than merging them or replacing the struct rt_mutex_waiter with the struct pmgt_waiter. It is important to note that the two structures have the same lifetime: they exist only while a task is blocked. It is beginning to seem attractive to either merge the structures or place an instance of struct pmgt_waiter as a member of the struct rt_mutex_waiter, but we have not yet committed to that. Different parts of the code use the different structures, so keeping them separate does make some parts a bit clearer, in our opinion.

struct rt_mutex_waiter {
#ifdef CONFIG_KUSP_PMGT
        struct pmgt_waiter       pmgt_waiter;
        union {
                struct list_head node;
                struct plist_node pnode;
        }                       waiters_ent;
#else
        struct plist_node       pi_list_entry;
        struct plist_node       list_entry;
#endif
        struct task_struct      *task;
        struct rt_mutex         *lock;
#ifdef CONFIG_DEBUG_RT_MUTEXES
        unsigned long           ip;
        struct pid              *deadlock_task_pid;
        struct rt_mutex         *deadlock_lock;
#endif
};

pmgt_waiter

An instance of the struct pmgt_waiter structure representing the information about a blocked task maintained by PMGT. The majority of the PMGT code uses the struct pmgt_waiter instance directly, but when it wishes to know the identity of the waiting task, it can use the container_of macro to obtain the pointer to the struct rt_mutex_waiter instance from the pointer to the struct pmgt_waiter instance.
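
A minimal sketch of that recovery, assuming the embedding shown in the structure definition above; the helper name is illustrative.

/*
 * Sketch: recover the waiting task from a struct pmgt_waiter pointer by
 * stepping back to the enclosing struct rt_mutex_waiter.
 */
static inline struct task_struct *pmgt_waiter_task(struct pmgt_waiter *w)
{
        struct rt_mutex_waiter *rtw;

        rtw = container_of(w, struct rt_mutex_waiter, pmgt_waiter);
        return rtw->task;
}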

waiters_ent

This union is the counterpart to the (struct rt_mutex *)->waiters union. That union enables the struct rt_mutex to maintain a list of struct rt_mutex_waiter instances under an arbitrary policy implemented by the arbiter associated with the rt_mutex. This field enables the instances of struct rt_mutex_waiter to be members of the list representing the set of waiters on a lock. The ordering of the list and all list operations are handled under the semantics specified by the set of callbacks in (struct rt_mutex *)->arbiter, as discussed in greater detail Here.

Concurrency Control Issues

Concurrency control is vital for both the correctness of what happens, and the efficiency with which it occurs, in a wide range of kernel subsystems. The PMGT code tracks sets of tasks waiting on kernel mutex locks using a set of data structures that are accessed in different contexts for several different purposes. Whenever a kernel mutex operation is executed, the PMGT code has to perform some actions. In addition, whenever a proxy task changes scheduling class or CPU affinity, the information describing the set of waiting tasks maintained by the PMGT layer must be used by the scheduling layer to take appropriate actions. The frequency of their use and the variety of contexts in which they are used give the data structures for PMGT greater influence on system performance than might at first seem apparent.

Two aspects of the data structures are key in determining their influence on system performance: (1) the efficiency with which specific operations can be performed on them and (2) the concurrency constraints under which the operations must be performed to ensure their data integrity. With respect to the first issue, the situation is slightly complicated by the fact that the PMGT layer is designed to be as configurable as possible, permitting the scheduling layer to use its preferred data structures for maintaining the sets. The basic implementation of PMGT anticipates the use of the standard Linux linked list and of the priority-sorted list implemented by PREEMPT_RT as part of its support for Priority Inheritance. It is important to remember, however, that the PMGT design permits the addition of as many other data structures as desired for a specific scheduling layer, or to support a set of scheduling layers.

Different sets of concurrency control constraints are represented by different locks in Linux and the PMGT code, each of which has an associated set of conditions under which it is used, and a set of constraints it imposes on the concurrency of the system. The rest of this section discusses each of the locks relevant to PMGT and the conditions and constraints related to their use.

(struct rq *)->lock

This lock is a standard Linux component used to protect the set of task structures representing the set of tasks assigned to a specific CPU. Holding it obviously constrains everything that happens on the CPU, and periods during which it is held should thus be kept as short as possible. The struct rq is defined in kernel/sched.c as it is directly used only in that file, but routines such as task_rq_lock, which locks the runqueue on which a task resides, or this_rq_lock, which locks the runqueue associated with the CPU on which the calling thread is executing, are used throughout the kernel.

This lock is relevant to PMGT because the runqueue lock must be held whenever data within a struct task_struct is being changed. There are several PMGT related fields in the struct task_struct structure, as discussed in the Task Structure Section.

(struct task_struct *)->pi_lock

This lock was created by the PREEMPT_RT patch to control all the data structures related to tracking the set of information related to Priority Inheritance. We generalize its use by letting it control all the information related to PMGT, which can itself be viewed as a generalization of PI. Under PMGT the name of the lock could be changed, but we have not bothered.

Under PMGT this lock controls access to three sets of information. The (struct task_struct *)->pmgt_nodes list represents the set of tasks blocked on the task represented by the struct task_struct instance, in general. The (struct task_struct *)->pmgt_waiters list represents the tasks blocked on the task represented by the struct task_struct instance, when it is a proxy. Recall that this list is empty for non-proxy tasks except during the process of finding the proxy of a task. The (struct task_struct *)->pi_blocked_on field is a pointer to the struct rt_mutex_waiter instance representing this task when it is blocked on a lock and thus a waiter.

This lock is held for the duration of many PMGT operations and is assumed to be held by the calling context of many PMGT routines.

(struct rt_mutex *)->wait_lock

This lock protects the integrity of the struct rt_mutex instance, which is a kernel mutex instance. The lock protects two elements of struct rt_mutex related to PMGT. The first is the (struct rt_mutex *)->pmgt_nodes list root, which represents the set of tasks blocked on a lock in the PMGT code, whether the blocking relation is direct or indirect. The second is the (struct rt_mutex *)->waiters list, which is the set of struct rt_mutex_waiter instances representing the set of direct waiters on the lock.

The two lists provide different information, but they are also obviously at least somewhat redundant. They exist in part because PMGT is being developed as a generalization of the PREEMPT_RT PI implementation. The struct rt_mutex_waiter is part of the PI implementation and is kept separate in part to make the generalization of PI by PMGT more obvious. As PMGT development continues, it may well be reasonable to merge some currently separate data structures.

(struct pmgt_waiter *)->lock

This lock protects the integrity of the information contained in the struct pmgt_waiter structure, which is the PMGT component representing each task that is waiting on a lock. These structures are associated with the proxy task using the (struct task_struct *)->pmgt_waiters field.

It is important to note that three elements of the structure are not protected by this lock: the nodes, proxy_ent, and sched_waiter elements. The nodes element represents the set of tasks waiting on the proxy task, and access to this list is already controlled by the (struct task_struct *)->pi_lock, typically obtained at higher levels of the call stack. The proxy_ent field is the element holding the struct pmgt_waiter on the list rooted at (struct task_struct *)->pmgt_waiters, and is thus controlled by the (struct task_struct *)->pi_lock. The sched_waiter element is the black box of scheduling layer information which is held by PMGT, but not used by it, and which thus requires no PMGT-level concurrency control.

FIXME: “Proxy lock is for proxy level operations, pi_lock is for access to the task->pmgt_waiters roots of the list using proxy_ent” - this statement was here but I have no idea what it means or if it is trying to make a point that has already been covered or not.

(struct pmgt_node *)->lock

The node structure is controlled by two locks: the lock field within the structure and the wait_lock for the instance of the struct rt_mutex held by the proxy represented by the node. Either lock can be used to read from the node structure, but both locks must be held to write to the node structure.
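
A sketch of those rules follows, using the atomic_spin_lock calls that the code later in this section applies to these locks; it assumes the relevant wait_lock is that of the node’s via_lock, ignores interrupt state, and the helper names are illustrative.

/* Sketch: a read may hold either protecting lock; here the node's own
 * lock is used. */
static struct task_struct *pmgt_node_read_owner_sketch(struct pmgt_node *node)
{
        struct task_struct *owner;

        atomic_spin_lock(&node->lock);
        owner = node->owner_task;
        atomic_spin_unlock(&node->lock);
        return owner;
}

/* Sketch: a write must hold both the rt_mutex wait_lock and the node's
 * lock (wait_lock outermost, matching the order used by the release and
 * prune paths shown later in this section). */
static void pmgt_node_set_owner_sketch(struct pmgt_node *node,
                                       struct task_struct *new_owner)
{
        atomic_spin_lock(&node->via_lock->wait_lock);
        atomic_spin_lock(&node->lock);
        node->owner_task = new_owner;
        atomic_spin_unlock(&node->lock);
        atomic_spin_unlock(&node->via_lock->wait_lock);
}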

Lock ordering

When more than one lock is used in a series of operations, an acquisition order is specified. In any operation involving both the (struct task_struct *)->pi_lock and the (struct rq *)->lock, the pi_lock is obtained first. One of the more obvious reasons for this order is that the pi_lock constrains concurrency on the CPU far less completely than the (struct rq *)->lock.
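
A sketch of that ordering, as it might appear inside kernel/sched.c where struct rq is visible; interrupt-state handling and the lookup of the runqueue the task resides on are omitted, and the use of the atomic_spinlock API for rq->lock is our assumption.

/*
 * Sketch only: the documented acquisition order, pi_lock before
 * rq->lock. Interrupt state handling and the code locating the
 * runqueue are omitted.
 */
static void pmgt_lock_ordering_sketch(struct task_struct *task, struct rq *rq)
{
        atomic_spin_lock(&task->pi_lock);       /* taken first            */
        atomic_spin_lock(&rq->lock);            /* then the runqueue lock */

        /* ... update PMGT state in the task_struct, notify the SL ... */

        atomic_spin_unlock(&rq->lock);
        atomic_spin_unlock(&task->pi_lock);
}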

Mutex and PMGT Operations

This section discusses basic operations on RT-Mutexes and the PMGT operations used to track the implications of RT-Mutex use. At the most fundamental level, these operations are:

  • Blocking on Mutex
    • Creating Blocking Relation
    • Extend Blocking Chains
    • Phase 1: Prepare New Relations
    • Phase 2: Finalize/Install New Relations
    • Walking a Lock Chain
  • Releasing a Mutex
    • Prune Node Chain
  • Stealing a Mutex
  • Mutex Waiter Removal
    • Level 1: struct rt_mutex_waiter
    • Level 2: struct pmgt_waiter
  • Waiter Movement
  • Deferred Cleanup: struct pmgt_node

Blocking On RT-Mutex

Blocking on an RT-Mutex is an operation that results in the creation of at least one new proxy relation. When a task TB blocks on a mutex L, a direct blocking relation is always created between the blocking task and the owner of L, TO. In addition to the direct blocking relation, zero or more indirect relations are also created. Indirect relations are created when TB has waiters on one or more locks it owns, or when TO is itself blocked on a mutex owned by another task. In the former case, the waiters of TB, which had TB as their proxy, become waiters of TO, which becomes their proxy as well as the proxy of TB.

In the latter case, TB “walks” the lock chain by iteratively examining blocking relations until a task is found that is not blocked on a mutex. As it walks the chain, it carries with it all the struct pmgt_waiter instances representing the waiters on TB, as well as the instance of the structure representing TB as a waiter. When a task not blocked on a mutex is found, that task is the proxy for TB and its waiters.

The SL must be informed of changes to blocking relations in order for it to function correctly. During the blocking process the SL is notified of all new relations created.

static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
                           struct rt_mutex_waiter *pi_waiter,
                           struct task_struct *task,
                           int detect_deadlock, unsigned long flags,
                           struct pmgt_waiter *waiter)
{
        struct list_head tmp_waiters;
        struct list_head tmp_nodes;
        struct task_struct *owner = rt_mutex_owner(lock);  /* current lock owner */

        atomic_spin_lock(&task->pi_lock);
        pi_waiter->task = task;
        pi_waiter->lock = lock;
        list_add(&pi_waiter->list_entry, &lock->wait_list);

        task->pi_blocked_on = pi_waiter;

17      pmgt_init_waiter(waiter);

19      INIT_LIST_HEAD(&tmp_waiters);
20      INIT_LIST_HEAD(&tmp_nodes);

22      pmgt_advance_waiter(waiter, NULL, lock, owner, &tmp_waiters,
                        &tmp_nodes);

25      pi_waiter->pmgt_waiter = waiter;

Line 17

Initialize the struct pmgt_waiter, which is stored on the stack of the blocking task.

Lines 19-20

Initialize temporary lists on the stack to store proxy management structures modified while holding task->pi_lock which will later be used when holding owner->pi_lock. These two locks cannot be held at the same time, but the spinlock lock->wait_lock protects these data structures during the entire blocking operation.

Line 22

Create a direct blocking relation between the blocking task and the owner of the lock. See Create Blocking Relation.
27      pmgt_advance_waiters(task, lock, owner, &tmp_waiters, &tmp_nodes);
        atomic_spin_unlock(&task->pi_lock);

Line 27

Create indirect blocking relations between the waiters on the task blocking (if any), and the owner of the lock being blocked on. The data structures that are placed on the lock owner’s lists are stored on temporary lists until we are able to acquire owner->pi_lock. See Phase 1: Prepare New Relations for more information regarding the pmgt_advance_waiters routine.
30      atomic_spin_lock(&owner->pi_lock);
31      pmgt_advance_waiters_finish(owner, &tmp_waiters, &tmp_nodes);
        atomic_spin_unlock(&owner->pi_lock);

        get_task_struct(owner);
        atomic_spin_unlock_irqrestore(&lock->wait_lock, flags);

37      pmgt_adjust_rt_mutex_chain(owner);
        atomic_spin_lock_irq(&lock->wait_lock);

        return 0;
}

Line 31

Finish the creation of indirect blocking relations by associating the waiters on the blocking task with the owner of the lock, as well as finalizing the new direct blocking relation. See Phase 2: Finalize/Install New Relations for more information on the pmgt_advance_waiters_finish() routine.

Line 37

In cases where the owner of the mutex being blocked on is itself blocked on a mutex, indirect blocking relations must be extended to represent the entire length of the lock chain. This is done by iteratively “walking” the chain. The routine pmgt_adjust_rt_mutex_chain is responsible for this, and is described in Walking a Lock Chain.

Create Blocking Relation

A blocking relation represents various relationships between a blocked task and other Proxy Management structures, as described in pmgt_waiter Section. Creating a new relation involves representing these relationships using pointers and linked lists.

static void pmgt_advance_waiter(struct pmgt_waiter *waiter,
                struct task_struct *via_task, struct rt_mutex *lock,
                struct task_struct *owner, struct list_head *waiters,
                struct list_head *nodes)
{
6       struct pmgt_node *node = pmgt_alloc_node();

        /* Setup blocking relation */
9       node->via_task = via_task;
10      node->via_lock = lock;
11      node->owner_task = owner;
12      list_add(&node->via_lock_ent, &lock->pmgt_nodes);
13      list_add(&node->owner_ent, nodes);

        /* The waiter is accessible via nodes */
16      node->waiter = waiter;

        /* List of nodes in same order as blocked tasks in chain */
19      list_add(&node->waiter_ent, &waiter->nodes);

        /* Advance waiter to next position in the chain */
22      list_add(&waiter->proxy_ent, waiters);

        /* Used for assertions */
        waiter->proxy_task = owner;

        waiter->tnode = node;
}

Line 6:

Allocates a node for the new blocking relation. Memory for nodes is created at compile time, and concurrent access to the cache of nodes is protected by a global spinlock. More advanced allocation schemes and concurrency control schemes are possible.

Lines 9-11:

Record information describing a blocking relation’s position in a lock chain.

  • Line 9: As described in section Lock Chain Distinction, the via_task field points to the task, directly blocked on the lock, through which an indirect blocking relation is constructed. By convention the via_task field is set to NULL for direct blocking relations. For example, in Figure 9 task T5 records, using the via_task field, that it is blocked on task T1 via task T4.
  • Line 10: via_lock is the lock associated with this relation. The task via_task is directly blocked on via_lock. For direct blocking (via_task is NULL) relations the waiter associated with this relation is directly blocked on via_lock.
  • Line 11: The owner of via_lock.

Lines 12-13:

Inserts the blocking relation into sets of other relations occupying the same position in the locking chain. For example, in Figure 9 the top waiter nodes N5, N4, N3, and N2 are on the same lists associated with lock L1 and task T1.

  • Line 12: Inserts the node into a list of waiters on via_lock.
  • Line 13: Inserts the node into the list of waiters blocked on the owner of via_lock. A pointer to the target list is passed in because the list is actually a temporary list allocated by the caller. This is done because concurrency control restrictions prevent us from immediately accessing the owner’s real list. The calling context resolves this discrepancy.

Line 16

Save a pointer to the waiter. Other Proxy Management routines must be able to locate the waiter from a given reference to a node.

Line 19

Store the node on the waiter’s list of nodes. This is an ordered list representing the structure of the locking chain.

Line 22

The waiter goes onto the temporary list of waiters that will later be added to the per-task list of the lock’s owner.

Extend Blocking Chains

This section outlines the three routines that are used to extend the blocking chains represented in Proxy Management. The first two routines are used together to move waiters from a blocked task to the owner of a lock being blocked on. The last operation is used to create additional indirect blocking relations by walking a lock chain during the blocking process. The creation of new relations during a lock chain traversal uses the first two routines.

Phase 1: Prepare New Relations

The routine pmgt_advance_waiters is the first phase in transferring waiters from one task to another. This routine results in the removal of one proxy relation, and the creation of a new proxy relation (when a waiter is advanced). Group Scheduling must be notified of the removal and creation operations in order to properly represent this new state within the scheduling framework.

static void pmgt_advance_waiters(struct task_struct *task,
                struct rt_mutex *lock, struct task_struct *owner,
                struct list_head *waiters, struct list_head *nodes)
{
        struct pmgt_waiter *waiter, *t;

7       list_for_each_entry_safe(waiter, t, &task->pmgt_waiters, proxy_ent) {
                atomic_spin_lock(&waiter->lock);
9               list_del(&waiter->proxy_ent);
10              pmgt_advance_waiter(waiter, task, lock, owner, waiters, nodes);
                atomic_spin_unlock(&waiter->lock);
        }
}

Line 7

Loop over task’s set of waiter structures. Each waiter represents a blocking chain that must be extended further up-chain.

Line 9

The waiter is removed from task’s set. It will be moved to the next task in the lock chain (the owner task).

Line 10

Create a new indirect blocking relation, extending each of task’s waiters’ locking chains (see Create Blocking Relation). The associated proxy relation is set up at this step, and finalized in the second phase.

Phase 2: Finalize/Install New Relations

This routine is called following the first phase of extending proxy relations, described in Phase 1: Prepare New Relations. At this point during execution a new set of proxy relations is ready to be installed. To install a relation, it must be added to the proper linked lists associated with the task the relation is being transferred to, and Group Scheduling must be notified.

static void pmgt_advance_waiters_finish(struct task_struct *task,
                struct list_head *waiters, struct list_head *nodes)
{
        struct pmgt_waiter *waiter, *t;

6       list_for_each_entry_safe(waiter, t, waiters, proxy_ent) {
7               list_del(&waiter->proxy_ent);
8               list_add(&waiter->proxy_ent, &task->pmgt_waiters);
        }

11      list_splice(nodes, &task->pmgt_nodes);
}

Line 6

Loop over each waiter, removing it from the temporary list we are looping over, and adding it to the new proxy’s list of waiters.

Line 11

Add the set of nodes on the temporary list to task’s set of nodes. This list is not currently used for the correct operation of any algorithms, but may be useful in the future: it is the set of all relations involving a particular task.

Walking a Lock Chain

This routine is called as the last step of a task blocking on an RT-Mutex (Blocking On RT-Mutex) and is used to extend the locking chains maintained in Proxy Management when the owner of the lock being blocked on is itself blocked on a lock. This is an iterative process that examines a sequence of blocking relations until a task is found that is not blocked on an RT-Mutex. This final, non-blocked task is at the head of the lock chain being traversed.

static void pmgt_adjust_rt_mutex_chain(struct task_struct *task)
{
        struct rt_mutex *lock;
        struct rt_mutex_waiter *pi_waiter;
        struct list_head tmp_waiters;
        struct list_head tmp_nodes;
        unsigned long flags;
        static int maxdepth = 0;
        int depth = 0;

11 again:
        atomic_spin_lock_irqsave(&task->pi_lock, flags);

14      pi_waiter = task->pi_blocked_on;
15      if (!pi_waiter)
16              goto out_unlock_pi;

17      lock = pi_waiter->lock;
18      if (!atomic_spin_trylock(&lock->wait_lock)) {
19              atomic_spin_unlock_irqrestore(&task->pi_lock, flags);
20              cpu_relax();
21              goto again;
        }

Line 11

The beginning of a routine-long loop that, for each iteration, considers the next task in the lock chain.

Lines 14-16

Check if the task is blocked on an RT-Mutex. If not, we have found the front of the lock chain, and have completed the chain traversal.

Lines 17-21

Try to lock the RT-Mutex that the task is blocked on. A trylock is used because this locking order is the reverse of the normal order. To prevent deadlock, the entire process is repeated if the trylock fails.
24      INIT_LIST_HEAD(&tmp_waiters);
25      INIT_LIST_HEAD(&tmp_nodes);

27      pmgt_advance_waiters(task, lock, rt_mutex_owner(lock),
                        &tmp_waiters, &tmp_nodes);

        atomic_spin_unlock(&task->pi_lock);
        put_task_struct(task);

Lines 24-25

Initialize the temporary lists used to store data structures allocated and initialized while holding task->pi_lock, and then transferred to the owner while holding owner->pi_lock. These two locks cannot be held at the same time.

Line 27

Execute the first phase in transferring waiters on task to the owner of the lock that task is blocked on.
33      task = rt_mutex_owner(lock);
        get_task_struct(task);

        atomic_spin_lock(&task->pi_lock);

38      pmgt_advance_waiters_finish(task, &tmp_waiters, &tmp_nodes);

        atomic_spin_unlock(&task->pi_lock);

        atomic_spin_unlock_irqrestore(&lock->wait_lock, flags);

44      goto again;

46 out_unlock_pi:
        atomic_spin_unlock_irqrestore(&task->pi_lock, flags);
        put_task_struct(task);
}

Line 33

Advance to the next task in the lock chain, the owner of the lock being considered. This also initializes the task variable for the next iteration of the loop.

Line 38

Execute the second phase of transferring the waiters.

Line 44

This returns to the beginning of the routine starting a new iteration of the traversal loop.

Line 46

The only exit point of the lock chain traversal is reached when we encounter a task not blocked on an RT-Mutex. When the head of the lock chain is found, execution is transferred to this label.

Releasing RT-Mutex

An RT-Mutex is released when its owner no longer requires the resource that the RT-Mutex protects. After an RT-Mutex is released, a new task may acquire the resource. The RT-Mutex primitive is a blocking semaphore, so upon releasing an RT-Mutex the owner must wake up a waiter to ensure that a new task will make progress by running to acquire the resource it is blocked on.

The waiter on an RT-Mutex chosen to be awoken by the owner during release is called the pending owner. The name pending owner refers to the state that the task is in with respect to being the real owner of the RT-Mutex. Specifically, before a task can become the real owner it must run in its own context to fully acquire the lock. Since a delay exists between a task being woken up and being scheduled on a CPU, the state is called pending. We will see later that this state is used in an optimization called stealing in which this delay is exploited to allow high priority tasks to steal an RT-Mutex from a pending owner.

static void wakeup_next_waiter(struct rt_mutex *lock, int savestate)
{
        struct rt_mutex_waiter *pi_waiter;
        struct task_struct *pendowner;
        struct pmgt_waiter *waiter;
        struct pmgt_node *node, *t;

        atomic_spin_lock(&current->pi_lock);

10      pi_waiter = rt_mutex_top_waiter(lock);
11      list_del(&pi_waiter->list_entry);

12      pendowner = pi_waiter->task;
13      pi_waiter->task = NULL;

15      if (!savestate)
16              wake_up_process(pendowner);
17      else {
18              smp_mb();
19              if (pendowner->state & ~TASK_RUNNING_MUTEX)
20                      wake_up_process_mutex(pendowner);
21      }
22
23      rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING);

Lines 10-11

Select a waiter that is directly blocked on this RT-Mutex to become the pending owner. This selection must be integrated with the system scheduling policy. The selected waiter structure is removed from the list of waiters on the RT-Mutex.

Lines 12-13

Get a reference to the pending owner’s task struct, and mark the PI waiter as not being blocked (blocked within the context of the RT-Mutex framework).

Lines 15-23

Wake-up the pending owner, and set the new owner of the lock to be the pending owner.
25      waiter = pi_waiter->pmgt_waiter;
26      pi_waiter->pmgt_waiter = NULL;

        atomic_spin_lock(&waiter->lock);

30      list_del(&waiter->proxy_ent);

        atomic_spin_lock(&waiter->tnode->lock);
33      waiter->tnode->waiter = NULL;
34      list_del(&waiter->tnode->waiter_ent);
        atomic_spin_unlock(&waiter->tnode->lock);

35      list_del(&waiter->tnode->owner_ent);
36      list_del(&waiter->tnode->via_lock_ent);
37      pmgt_free_node(waiter->tnode);

        atomic_spin_unlock(&waiter->lock);

Lines 25-26

Get a reference to the Proxy waiter for this pending owner. In this context the owner of the lock is releasing the lock, thus the owner cannot be blocked on any other RT-Mutex. As a result, the Proxy waiter for the pending owner should be associated with the owner.

Line 30

Remove the Proxy waiter from the owner’s list of waiters.

Lines 33-34

Mark the top node as unused and remove it from the waiter’s list of nodes.

Lines 35-37

Finish removal of the top node. Return node to pool of free nodes.
41      list_for_each_entry(node, &lock->pmgt_nodes, via_lock_ent) {
                waiter = node->waiter;
43              list_del(&waiter->proxy_ent);
44              list_del(&node->owner_ent);
        }

        atomic_spin_unlock(&current->pi_lock);

Concurrency control commentary

At this point we have deleted the waiter structure from the old proxy’s pmgt_waiters list and deleted the pmgt_node from the mutex owner’s pmgt_nodes list, but we have not set either of those list entries to NULL, nor have we reset the waiter->proxy_task field. As a result, while the struct pmgt_waiter instance should no longer represent a proxy relationship between a waiter and the old proxy, it still appears to do so. However, while we have released the pi_lock for the old proxy, we have not released the mutex->wait_lock, which we hold until the new proxy is identified. Therefore, any other context that needs to determine the proxy must hold the wait_lock for the top node of the waiter involved. Not doing so would permit a race condition between a thread identifying a proxy and a thread releasing a lock, which could make it appear that a task which has already released a lock is still the proxy of a task waiting on the mutex. These contexts include:

  • wakeup_next_waiter
  • try_to_steal_lock
  • pmgt_remove_waiter
  • pmgt_move_waiters

Line 41

Loop over each relation involving this lock. All relations involving this lock currently will reflect that the owner is the current task.

Lines 43-44

Remove all waiters and nodes from the task releasing the lock.
        atomic_spin_lock(&pendowner->pi_lock);

        pendowner->pi_blocked_on = NULL;

53      list_for_each_entry_safe(node, t, &lock->pmgt_nodes, via_lock_ent) {
                waiter = node->waiter;
                atomic_spin_lock(&waiter->lock);

57              if (node->via_task == pendowner) {
58                      pmgt_prune_node_chain(waiter, node, pendowner);
59                      pmgt_free_node(node);
60              } else {
61                      list_add(&node->owner_ent, &pendowner->pmgt_nodes);
62                      node->owner_task = pendowner;
                }

65              waiter->proxy_task = pendowner;
66              list_add(&waiter->proxy_ent, &pendowner->pmgt_waiters);

                atomic_spin_unlock(&waiter->lock);
        }

        atomic_spin_unlock(&pendowner->pi_lock);
}

Line 53

Loop over all nodes on the lock to setup new relations between waiters on the lock and the pending owner that has been chosen by the releasing task.

Lines 57-59

When the node being considered arrived via the pending owner, a relationship between the waiter associated with the node and the pending owner already exists through some other RT-Mutex that the pending owner owns. In this case the chain of nodes for the waiter is pruned of its top node, leaving the node representing the relation between the waiter and the pending owner.

Lines 60-62

If the waiter represented by the node is directly blocked on the RT-Mutex, or arrived via another task, then the node is reused.

Lines 65-66

Finalize the movement by adding the waiter to the pending owner.

Prune Node Chain

When a lock is stolen or released, the new owner (or pending owner) will already have relationships established with waiters that are blocked on the lock indirectly through the new owner. For these waiters, the waiter structure is associated with the old owner of the lock being acquired, and thus its top node must be removed. This routine removes the top node from the waiter and updates the waiter’s top node to the next node in the lock chain.

static void pmgt_prune_node_chain(struct pmgt_waiter *waiter,
                struct pmgt_node *node, struct task_struct *next_owner)
{
        atomic_spin_lock(&node->lock);
5       node->waiter = NULL;
6       list_del(&node->waiter_ent);
7       list_del(&node->via_lock_ent);

9       waiter->tnode = list_first_entry(&waiter->nodes, struct pmgt_node,
                        waiter_ent);
        atomic_spin_unlock(&node->lock);
}

Lines 5-7

Mark the node as dead and remove it from the linked lists it is on.

Line 9

Update the new top node of the waiter with the next node in lock chain.

Stealing RT-Mutex

Stealing an RT-Mutex refers to an optimization by which a task may steal ownership of an RT-Mutex from a pending owner. The optimization takes advantage of the scheduling latency between a pending owner being awoken, and it running to acquire an RT-Mutex.

_images/pending-owner-1.png

Figure 12: The dashed arrow in the left sub-figure depicts a pending owner. The right side shows the result of T4 stealing L1. On the right, T1 is running to re-block on the lock, depicted by the dashed arrow.

The dashed arrow in the left side of Figure 12 leaving L1 and entering T1 indicates a pending owner relation. There are three possible scenarios that can occur when the system is in a state such as that shown in the left side of Figure 12.

In the first scenario the pending owner T1 runs in its own context to become the full owner of the lock L1. This is the simplest and most common case.

The second scenario is that a task unrelated to the locking chain such as T4 steals the lock from pending owner T1. The result of this scenario is shown on the right side of Figure 12. The dashed line indicates a blocking relationship between T1 and L1 that is in the process of completing. Recall that when a task is selected to be a pending owner it is woken up. Thus, T1 is running when T4 steals the lock, and T1 will re-block on lock L1.

_images/pending-owner-2.png

Figure 13: The dashed arrow in the left sub-figure depicts a pending owner. The right side shows the result of T2 stealing L1. On the right, T1 is running to re-block on the lock, depicted by the dashed arrow.

The third scenario occurs when a waiter on the lock receives a non-mutex wake-up and runs to steal the lock from the pending owner T1. Figure 13 shows the result of this scenario when task T2 runs to steal the lock from task T1.

static inline int try_to_steal_lock(struct rt_mutex *lock,
                                    struct task_struct *task, int mode)
{
        struct task_struct *pendowner = rt_mutex_owner(lock);
        struct pmgt_waiter *waiter;
        struct pmgt_node *node, *t;

8       if (!rt_mutex_owner_pending(lock))
                return 0;

11      if (pendowner == task)
                return 1;

        atomic_spin_lock(&pendowner->pi_lock);

16      if (!lock_is_stealable(task, pendowner, mode)) {
                atomic_spin_unlock(&pendowner->pi_lock);
                return 0;
        }

21      if (likely(!rt_mutex_has_waiters(lock))) {
                atomic_spin_unlock(&pendowner->pi_lock);
                return 1;
        }

Line 8

If the lock already has a real (non-pending) owner, it cannot be stolen.

Line 11

If the pending owner is the task trying to steal then let it succeed.

Line 16

Make a policy decision about whether the lock may be stolen from the pending owner. This decision must be integrated with the system scheduling policy.
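
For illustration only, one plausible priority-based policy is sketched below. The actual check belongs to the configured Scheduling Layer; the STEAL_NORMAL and STEAL_LATERAL modes and the prio comparison shown here are assumptions, not the documented implementation.

/*
 * Sketch of one possible stealing policy (not the documented
 * implementation).  STEAL_NORMAL and STEAL_LATERAL are assumed modes;
 * a lower ->prio value means a higher priority.
 */
#define STEAL_NORMAL    0
#define STEAL_LATERAL   1

static inline int lock_is_stealable(struct task_struct *task,
                                    struct task_struct *pendowner, int mode)
{
        if (mode == STEAL_LATERAL)
                return task->prio <= pendowner->prio;   /* equal priority allowed */

        return task->prio < pendowner->prio;            /* strictly higher priority */
}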

Line 21

If there are no waiters on the RT-Mutex being stolen then there are no other required operations. The lock may be acquired.

26      list_for_each_entry(node, &lock->pmgt_nodes, via_lock_ent) {
27              if (!node->waiter)
                        continue;
                waiter = node->waiter;
30              list_del(&waiter->proxy_ent);
31              list_del(&node->owner_ent);
        }

        atomic_spin_unlock(&pendowner->pi_lock);

Line 26

Loop over all waiters on the lock and remove them from the pending owner. These were originally set up by the previous owner of this lock when the lock was being released.

Line 27

Skip dead nodes that will be reclaimed during lock release.

Concurrency control commentary

We are in this routine because a waiter on this lock is aborting its waiting relation. As part of the abort, we need to know the proxy of the aborting task so that its waiting relationship with the proxy can be removed. However, since this is an unusual situation, the data structures and concurrency control are designed to make the more common case of a proxy releasing a lock more efficient, by requiring it to lock and unlock fewer locks. The implication of this approach is that the pmgt_waiter->proxy_task field can be incorrect unless the rt_mutex->wait_lock of the relevant lock held by the proxy is itself held when the proxy_task field is used to identify the proxy task in order to operate on its pmgt data structures. The wakeup_next_waiter routine leaves the proxy_task field incorrect in this way, but holds the rt_mutex->wait_lock for the relevant mutex until the new proxy is identified. The try_to_steal_lock routine also leaves the pmgt_waiter->proxy_task field incorrect for a similar reason, but holds the rt_mutex->wait_lock until the update is complete. The routines involved are:

  • wakeup_next_waiter
  • try_to_steal_lock
  • pmgt_remove_waiter
  • pmgt_move_waiters

Lines 30-31

Remove node and waiter from the pending owner.

        atomic_spin_lock(&current->pi_lock);

38      list_for_each_entry_safe(node, t, &lock->pmgt_nodes, via_lock_ent) {
39              if (node->waiter == NULL)
                        continue;

                waiter = node->waiter;
                atomic_spin_lock(&waiter->lock);

45              if (node->via_task == current) {
46                      pmgt_prune_node_chain(waiter, node, current);
47                      pmgt_free_node(node);
48              } else {
49                      list_add(&node->owner_ent, &current->pmgt_nodes);
50                      node->owner_task = current;
                }

53              waiter->proxy_task = current;
54              list_add(&waiter->proxy_ent, &current->pmgt_waiters);

                atomic_spin_unlock(&waiter->lock);
        }

        atomic_spin_unlock(&current->pi_lock);

        return 1;
}

Line 38

Loop over all nodes on the lock to setup new relations between waiters on the lock and the stealing task.

Line 39

Skip dead nodes that will be reclaimed during lock release.

Lines 45-47

When the node being considered arrived via the stealing task then a relationship between the waiter associated with the node and stealing task already exists for some other RT-Mutex that the stealing task owns. In this case the chain of nodes for the waiter is pruned of its top node, leaving the node representing the relation between the waiter and the stealing task.

It is important to note here that the stealing task can be either a waiter on the lock, or another task trying to acquire the lock for the first time. In the former case the stealing task is running even though it has not been chosen as the pending owner. This is a special case because the configuration of the stealing task’s data structures is that of a blocked task.

To handle this case the stealing task treats its own node like that of any other waiter; the actual clean-up of this inconsistent state is performed in remove_waiter before returning from the RT-Mutex acquisition routine.

Lines 48-50

If the waiter represented by the node is directly blocked on the RT-Mutex, or arrived via another task, then the node is reused.

Lines 53-54

Finalize the movement by adding the waiter to the stealing task.

RT-Mutex Waiter Removal

There are two instances in which a task trying to acquire a lock must explicitly remove itself as a waiter on the lock. The first occurs when a waiter runs unexpectedly and steals the lock from a pending owner. Having not been chosen by the owner during a normal release, the waiter acquires the RT-Mutex while still being registered as a waiter. The second occurs when a waiter receives a signal, or a timeout expires, while acquiring an interruptible mutex. In this case the waiter does not acquire the lock, but must still remove itself as a waiter.

In the latter case, in which the lock acquisition is aborted, the result is a lock chain severed at an arbitrary location. This requires a special clean-up routine that updates the Proxy Management data structures to represent the split.

Removing a Waiter - RT-Mutex Level

The routine remove_waiter() is called by a waiter in both the case that it steals the RT-Mutex it blocked on, and the case in which it receives a signal or a timeout occurs.

static void remove_waiter(struct rt_mutex *lock,
                          struct rt_mutex_waiter *waiter,
                          unsigned long flags)
{
        struct task_struct *owner = rt_mutex_owner(lock);
        struct pmgt_waiter *pwaiter;

        atomic_spin_lock(&current->pi_lock);
9       list_del(&waiter->list_entry);
10      waiter->task = NULL;
11      current->pi_blocked_on = NULL;

        pwaiter = waiter->pmgt_waiter;
        waiter->pmgt_waiter = NULL;

        atomic_spin_unlock(&current->pi_lock);

18      pmgt_remove_waiter(pwaiter, lock);

20      if (owner != current)
21              pmgt_move_waiters(lock);
}

Line 9

The waiter structure is removed from the list of waiters on the lock.

Lines 10-11

Mark this waiter and task as being no longer blocked.

Line 18

Remove the pmgt_waiter associated with the rt_mutex_waiter being removed.

Lines 20-21

If the current task (the waiter) is not the owner, then the acquisition is being aborted because of a signal or timeout. In this case the split can occur at an arbitrary location in a locking chain and requires special clean-up.

Note that when the waiter is the owner (as a result of stealing the lock) then it cannot be blocked on any other RT-Mutex, and thus such a situation will always occur at the head of a locking chain.

Removing a Waiter - PMGT Level

The removal of a waiter results in a change to the blocking relations involving the waiter and the tasks ahead of it in its lock chain. Since waiters represent proxy relations that affect scheduling state, they must be removed. Nodes, however, can simply be marked as dead, in which case they are ignored by the rest of the Proxy Management code. This ability is important because it allows us to avoid acquiring a large number of locks to clean up all data structures. The routine pmgt_remove_waiter is used to remove a waiter from an arbitrary location and to mark all nodes associated with the waiter as dead.
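
A node is considered dead once its back-pointer to its waiter has been cleared, which is exactly the test the surrounding code performs before skipping a node. The predicate below only names that convention; pmgt_node_is_dead is a hypothetical helper, not part of the implementation.

/*
 * Hypothetical helper, for illustration only: a node is dead once its
 * waiter back-pointer has been cleared.  Dead nodes are skipped by the
 * walkers in this section and reclaimed later by pmgt_clear_dead_nodes().
 */
static inline int pmgt_node_is_dead(struct pmgt_node *node)
{
        return node->waiter == NULL;
}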

static void pmgt_remove_waiter(struct pmgt_waiter *waiter,
                struct rt_mutex *lock)
{
        struct pmgt_node *node, *t;
        struct rt_mutex *via_lock;
        struct task_struct *proxy_task;

retry:
        atomic_spin_lock(&waiter->lock);

11      proxy_task = waiter->proxy_task;
12      if (!atomic_spin_trylock(&proxy_task->pi_lock)) {
                atomic_spin_unlock(&waiter->lock);
                cpu_relax();
                goto retry;
        }

18      via_lock = waiter->tnode->via_lock;
19      if (via_lock != lock && !atomic_spin_trylock(&via_lock->wait_lock)) {
                atomic_spin_unlock(&proxy_task->pi_lock);
                atomic_spin_unlock(&waiter->lock);
                goto retry;
        }

25      list_del(&waiter->proxy_ent);

27      list_for_each_entry_safe(node, t, &waiter->nodes, waiter_ent) {
                atomic_spin_lock(&node->lock);
                BUG_ON(node->waiter == NULL);
                node->waiter = NULL;
                list_del(&node->waiter_ent);
                atomic_spin_unlock(&node->lock);
        }

        if (via_lock != lock)
                atomic_spin_unlock(&via_lock->wait_lock);

        atomic_spin_unlock(&proxy_task->pi_lock);
        atomic_spin_unlock(&waiter->lock);
}

Lines 11-12

Given the waiter being removed, we attempt to lock the proxy task that the waiter is associated with. Note that it is possible that the proxy task is the owner of the current lock (i.e., at the current location in the chain).

Lines 18-19

The RT-Mutex at the location of the waiter must also be locked. However, if via_lock is equal to the current lock, then we are already holding the relevant lock.

Concurrency control commentary

We are in this routine because a waiter on this lock is aborting its waiting relation. As part of the abort, we need to know the proxy of the aborting task so that its waiting relationship with the proxy can be removed. However, since this is an unusual situation, the data structures and concurrency control are designed to make the more common case of a proxy releasing a lock more efficient, by requiring it to lock and unlock fewer locks. The implication of this approach is that the pmgt_waiter->proxy_task field can be incorrect unless the rt_mutex->wait_lock of the relevant lock held by the proxy is itself held when the proxy_task field is used to identify the proxy task in order to operate on its pmgt data structures. The wakeup_next_waiter routine leaves the proxy_task field incorrect in this way, but holds the rt_mutex->wait_lock for the relevant mutex until the new proxy is identified. The try_to_steal_lock routine also leaves the pmgt_waiter->proxy_task field incorrect for a similar reason, but holds the rt_mutex->wait_lock until the update is complete. The routines involved are:

  • wakeup_next_waiter
  • try_to_steal_lock
  • pmgt_remove_waiter
  • pmgt_move_waiters

Line 25

Remove the waiter from the proxy it is associated with.

Line 27

Mark all of the nodes associated with the waiter as dead. These will be removed during the next release of the RT-Mutex, which allows this routine to avoid acquiring multiple locks along the lock chain in order to completely clean up the nodes.

Waiter Movement

A waiter that receives a signal or timeout may abort acquisition of an RT-Mutex. In such a situation a locking chain can be ‘cut’ at an arbitrary location. Because the aborting task will become the new proxy of all its waiters (because it is now not blocked), special clean-up is required.

static void pmgt_move_waiters(struct rt_mutex *lock)
{
        struct pmgt_node *node, *node2, *t;
        struct pmgt_waiter *waiter, *tw;
        struct list_head tmp_waiters;
        struct rt_mutex *via_lock;
        struct task_struct *proxy_task;

9       INIT_LIST_HEAD(&tmp_waiters);

retry:
        list_for_each_entry(node, &lock->pmgt_nodes, via_lock_ent) {

                atomic_spin_lock(&node->lock);

16              if (!node->waiter) {
                        atomic_spin_unlock(&node->lock);
                        continue;
                }

21              if (node->via_task != current) {
                        atomic_spin_unlock(&node->lock);
                        continue;
                }

Line 9

Initialize a list that is used to hold a set of up-chain waiters that are to be moved down-chain to the location of the aborting waiter.

Note that it is important to consider other concurrent operations that may be taking place. The most important is another task walking the lock chain and advancing waiters up-chain while we are trying to move them down-chain. This situation is made safe by two properties. The first is that a task walking the lock chain cannot proceed once it reaches a task that is not blocked. At this point in the execution of the lock-abort code path, the RT-Mutex data structures have already been set so that the aborting waiter is considered not blocked (task->pi_blocked_on = NULL).

The second property concerns tasks that are walking the lock chain and are already at a location ahead of the current task. A problem would arise if such a task tried to advance waiters forward while we are moving them backwards. This case is made safe because this routine removes all waiters from up-chain tasks, thereby ensuring that no up-chain task can advance a waiter (because the waiters no longer exist at up-chain locations).
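
The sketch below illustrates the first property. The helper pmgt_chain_walk_may_continue is hypothetical; only the pi_blocked_on convention is taken from the code in this document.

/*
 * Hypothetical helper, for illustration only: a chain walker may advance
 * past a task only while that task is still blocked.  The aborting
 * waiter clears its pi_blocked_on field before moving waiters, so no
 * walker can proceed past it.
 */
static int pmgt_chain_walk_may_continue(struct task_struct *task)
{
        int blocked;

        atomic_spin_lock(&task->pi_lock);
        blocked = (task->pi_blocked_on != NULL);
        atomic_spin_unlock(&task->pi_lock);

        return blocked;
}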

Line 16

Ignore dead nodes or any concurrent down-chain lock breaks.

Line 21

Ignore waiters that are not on the same lock chain as the aborting task.

25              waiter = node->waiter;
                if (!atomic_spin_trylock(&waiter->lock)) {
                        atomic_spin_unlock(&node->lock);
                        goto retry;
                }

31              proxy_task = waiter->proxy_task;
                if (!atomic_spin_trylock(&proxy_task->pi_lock)) {
                        atomic_spin_unlock(&waiter->lock);
                        atomic_spin_unlock(&node->lock);
                        goto retry;
                }

38              via_lock = waiter->tnode->via_lock;
                if (via_lock != lock && !atomic_spin_trylock(&via_lock->wait_lock)) {
                        atomic_spin_unlock(&proxy_task->pi_lock);
                        atomic_spin_unlock(&waiter->lock);
                        atomic_spin_unlock(&node->lock);
                        goto retry;
                }

Line 25

Acquire the lock on the waiter associated with the node being considered.

Line 31

Acquire the lock of the proxy task at the waiter’s location in the lock chain. Note that the proxy task may be the owner of the current RT-Mutex.

Line 38

Acquire the lock on the RT-Mutex at the waiter’s location. Note that it might be the case that the RT-Mutex is the same as the one the current task is aborting acquisition of.

At this point we have deleted the waiter structure from the old proxy’s pmgt_waiters list and deleted the pmgt_node from the mutex owner’s pmgt_nodes list, but we have not put NULL in either of those list entries, nor have we reset the waiter->proxy_task field. As a result, the struct pmgt_waiter instance still appears to represent a proxy relationship between the waiter and the old proxy, even though it no longer should. However, while we have released the pi_lock for the old proxy, we have not released the mutex->wait_lock, which we hold until the new proxy is identified. Therefore, any other context that needs to determine a proxy must hold the wait_lock of the RT-Mutex at the top node of the waiter involved. Without it, a race between a thread identifying a proxy and a thread releasing a lock could make a task that has already released the lock appear to still be the proxy of a task waiting on the mutex. The routines affected by this discipline are:

  • wakeup_next_waiter
  • try_to_steal_lock
  • pmgt_remove_waiter
  • pmgt_move_waiters
46              list_del(&waiter->proxy_ent);

48              list_for_each_entry_safe(node2, t, &waiter->nodes, waiter_ent) {

50                      if (node2 != node)
                                atomic_spin_lock(&node2->lock);

53                      node2->waiter = NULL;
54                      list_del(&node2->waiter_ent);

56                      if (node2 != node)
                                atomic_spin_unlock(&node2->lock);
                        else
                                break;
                }

Line 46

Remove the waiter from the proxy task at its location.

Line 48

Prune the set of nodes so that all nodes up-chain from the current location are marked dead, as these represent stale blocking relations. Note that we traverse the list from head to tail; that is, we start with the proxy relation and move backwards towards the waiter at a down-chain location.

Line 50

The node lock for the variable node is already held, thus we do not need to acquire it.

Lines 53-54

Remove the node from the waiter, and mark the node as dead.

Line 56

When the current node is reached, pruning is complete. The next node on the lock chain will be the relation between the waiter aborting acquisition and another waiter at a down-chain location.

62              waiter->tnode = list_first_entry(&waiter->nodes,
                                struct pmgt_node, waiter_ent);

65              waiter->proxy_task = current;
66              list_add(&waiter->proxy_ent, &tmp_waiters);

                if (via_lock != lock)
                        atomic_spin_unlock(&via_lock->wait_lock);

                atomic_spin_unlock(&proxy_task->pi_lock);
                atomic_spin_unlock(&waiter->lock);
                atomic_spin_unlock(&node->lock);
        }

        atomic_spin_lock(&current->pi_lock);

78      list_for_each_entry_safe(waiter, tw, &tmp_waiters, proxy_ent) {
79              list_del(&waiter->proxy_ent);
80              list_add(&waiter->proxy_ent, &current->pmgt_waiters);
        }

        atomic_spin_unlock(&current->pi_lock);
}

Line 62

Set up the new top node of the waiter.

Lines 65-66

Add the waiter to the temporary list.

Lines 78-80

Move all waiters to the current task.

Proxy Node Deferred Clean-up

Nodes that are marked dead must eventually be reclaimed. This is performed when an owner releases the lock. All nodes that are marked dead have already been removed from the relevant waiter; the only remaining work is to remove them from the owner’s list and from the lock’s list of nodes. In the context of releasing an RT-Mutex, all of the required locks can safely be acquired. A sketch of the intended call site follows the routine.

static void pmgt_clear_dead_nodes(struct rt_mutex *lock)
{
        struct pmgt_node *node, *t;

        atomic_spin_lock(&current->pi_lock);

        list_for_each_entry_safe(node, t, &lock->pmgt_nodes, via_lock_ent) {
                if (node->waiter == NULL) {
                        list_del(&node->owner_ent);
                        list_del(&node->via_lock_ent);
                        pmgt_free_node(node);
                }
        }

        atomic_spin_unlock(&current->pi_lock);
}
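
For orientation, the sketch below shows the intended calling context for the deferred clean-up. The wrapper pmgt_release_sketch is hypothetical and the surrounding release code is elided; its exact form is an assumption based on the description above, not the documented release path.

/*
 * Hypothetical sketch of the calling context: dead nodes are reclaimed
 * from the release path while the lock's wait_lock is held, before the
 * next (pending) owner is chosen and the waiters are moved to it.
 */
static void pmgt_release_sketch(struct rt_mutex *lock)
{
        atomic_spin_lock(&lock->wait_lock);

        pmgt_clear_dead_nodes(lock);

        /* ... choose and wake the next waiter, move the waiters to the
         * new (pending) owner, then release ownership of the lock ... */

        atomic_spin_unlock(&lock->wait_lock);
}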