GS Framework

Data Structures

There are two primary data structures for use with the Group Scheduling framework, one for groups and one for group members. Group members may be other groups or tasks, so another structure is required to create a generic member pointer which can include either member type. Further, a series of structures are needed to support the supplementary ioctl calls which the user side Group Scheduling library uses to communicate with the kernel side implementation.

gsched_sdf

struct gsched_sdf {
        char *name;
        struct module *owner;
        struct list_head schedlist_entry;

        void (*insert_group)(struct gsched_group *);
        void (*remove_group)(struct gsched_group *);

        void (*setup_member)(struct gsched_group *, struct gsched_member *);
        void (*insert_member)(struct gsched_group *, struct gsched_member *);
        void (*remove_member)(struct gsched_group *, struct gsched_member *);
        void (*release_member)(struct gsched_group *, struct gsched_member *);
        void (*release_group)(struct gsched_group *);
        struct gsched_member *(*find_member)(struct gsched_group *group, char *member_name);

        int (*set_member_params)(struct gsched_group *, struct gsched_member *, void *, size_t);
        int (*get_member_params)(struct gsched_group *, struct gsched_member *, void *, size_t);
        int (*set_group_params)(struct gsched_group *, void *, size_t);
        int (*get_group_params)(struct gsched_group *, void *, size_t);

        void (*iterator_prepare)(struct gsched_group *, struct rq *rq);
        struct gsched_member *(*iterator_next)(struct gsched_group *, struct gsched_member *, struct rq *rq);
        void (*iterator_finalize)(struct gsched_group *, struct gsched_member *, struct rq *rq);
        int (*is_runnable)(struct gsched_group *, struct gsched_member *);

        int (*fork_member)(struct gsched_group *, struct gsched_member *, struct gsched_member *);
        void (*start_member)(struct gsched_group *, struct gsched_member *, struct rq *, int);
        void (*stop_member)(struct gsched_group *, struct gsched_member *, struct rq *, int);

        size_t per_group_datasize;
        size_t per_member_datasize;
        size_t per_cpu_datasize;
};
  • name - The name of the SDF.

  • owner - The module that the SDF is part of.

    Constant after creation.

  • schedlist_entry - The entry on the global group scheduling list of SDFs (gsched_schedlist).

    Protected by gsched_lock.

  • per_group_datasize - The size of the data to allocate for each group that is controlled by this SDF.

    Constant after creation.

  • per_member_datasize - The size of the data to allocate for each member added to this SDF.

    Constant after creation.

  • per_cpu_datasize - The size of the data to allocate for each CPU for each group that is controlled by this SDF.

    Constant after creation.

The remainder of the members are function pointers filled in by SDF modules. The callbacks are explained here:

Writing an SDF

gsched_group

The gsched_group structure defines a group scheduling group. A group scheduling group is controlled by a single sdf but it may be a member of multiple groups. A group holds a reference to the module of the sdf it uses. The module cannot be unloaded while this reference is held.

struct gsched_group {
        atomic_spinlock_t lock;

        struct ccsm_set *ns;
        int opts; /* options settable by user code */
        struct list_head memberships;
        struct list_head grouplist_entry;
        struct list_head all_members;
        struct gsched_sdf *sdf;
        void *sched_data;
        void *cpu_sched_data[NR_CPUS];

        atomic_t usage;
};
  • lock - non-preemptable spinlock that controls access to fields and group members

  • ns - A ccsm set that represents this group.

    Constant after creation. No CC required to dereference field.

  • opts - per-group options; can be set by user-side programs

    Protected by group->lock.

  • memberships - Each group may be a member of zero or more other groups. When a group becomes a member a struct gs_member is allocated and associated with the group being made a member. This member then represents the group’s membership in another group. The memberships list is a list of struct gs_members which represent this group as a member of another group. This list is protected by group->lock.

  • grouplist_entry - global group list entry

    Protected by gsched_lock.

  • all_members - The Group Scheduling framework manages members on a

    per-group basis, in addition to any member management that a given SDF implements. The framework automatically adds all members to this list, regardless of whether the member represents a task or another group. Protected by group->lock.

  • sdf - associated scheduling decision function

  • sched_data - framework managed scheduling data for use by SDF

    Constant after creation. No CC required to dereference field.

  • cpu_sched_data - Each SDF may request that groups automatically

    allocate a given amount of storage to be available via this per-cpu array. The most common use of this is per-cpu runqueues of tasks. If a per-cpu separation of data is not required then it is recommended that an SDF use the group->sched_data per-group scheduling data. Constant after creation. No CC required to dereference field.

  • usage - usage/reference count

    Used with atomic ops.

gsched_member

The gsched_member structure defines a group scheduling member which represents the membership of a group or task within a group.

struct gsched_member {
    int flags;
    int type;
    void *sched_data;

    struct ccsm_set *ns;
    struct list_head owner_entry;
    struct list_head samecomputation_entry;
    struct gsched_group *owner;
    atomic_t usage;

    union {
            struct gsched_member_task task;
            struct gsched_member_group group;
    };
};
  • flags - framework level member flags

    Protected by owner->lock.

  • type - type of member (e.g. task or group)

    Constant after creation. No CC required to access field.

  • sched_data - framework managed scheduling data for use by SDF

    Constant after creation. No CC required to dereference field.

  • ns - A pointer to a ccsm set representing a task or a group.

    Constant after creation. No CC required to dereference field.

  • owner_entry - Entry on the all_members list of the group that this member belongs to.

    Protected by owner->lock.

  • samecomputation_entry - Entry in the computation's membership list.

    For task: protected by rq lock. For group: protected by owner->lock.

  • owner - Group that this member is a member of

    Constant after creation. No CC required to dereference field.

  • usage - member usage count

    Used with atomic ops.

  • task - task specific data

  • group - group specific data

gsched_member_task

Structure gsched_member_task is defined in include/linux/kusp/sched_gsched.h.

struct gsched_member_task {
        struct task_struct *ptr;
        struct task_struct *proxy;
        struct list_head waiter_ent;
        struct list_head waiting_members_ent;
        int task_cpu;
        int proxy_cpu;
        enum gsched_task_state state;
        enum gsched_task_state blocked_state;
};

The gsched_member structure holds all information for the member of a group. Its anonymous union supplies either task-specific or group-specific data, depending on the member's type. The gsched_member_task structure is the type used to represent tasks as members of a group.

It holds proxy information and CPU assignment.

  • ptr - Pointer to the task struct of the task represented by this member.

  • proxy - Pointer to the proxy for the task represented by the

    member. Proxy must be on the same CPU. NULL when a process is blocked on a lock but we are still searching for the proxy.

  • waiter_ent - Entry on the list rooted at gsched_task_members

    field in the pmgt_waiter structure which represents all group memberships of this task. When a task is blocking on a lock the list is protected by the pi_lock of the task. After the proxy has been discovered, the list is protected by the pi_lock of the proxy.

  • waiting_members_ent - Entry on list rooted at

    gsched_waiting_members field in task structure of proxy. This list represents all of the tasks waiting on the proxy. Protected by the rq lock of proxy. Contains the default values indicating an empty list if the proxy field is NULL.

  • task_cpu - The cpu that the task represented by this member is on.

  • proxy_cpu - The cpu of the proxy. -1 if proxy is NULL

  • state - One of the values from gsched_task_state.

  • blocked_state - The state that a task will take on if it is runnable

    and then blocks. The blocked state can never be GSCHED_TASK_RUNNABLE. If the state of a task is not GSCHED_TASK_RUNNABLE then the state and blocked_state will have the same value. The reason for this is in the implementation of proxy accounting there are some periods during which the state of the thread is blocked but it must continue running to complete some of the accounting code.

gsched_member_group

Portions of the member structure that only exist for groups.

struct gsched_member_group {
        struct gsched_group *ptr;
};
  • ptr - Pointer to the group represented by this member.

    Constant after creation. No CC required to dereference field.

Lists

gsched_memberships

Root: task_struct->gsched_memberships

Element: gsched_member->samecomputation_entry

For group members, this element field is re-used to represent all the memberships of a group in other groups and is rooted in the group structure at group->memberships

Use: Holds all of the group memberships for a task.

CC: rq lock of task’s cpu

gsched_grouplist

Root: gsched_grouplist

Element: gsched_group->grouplist_entry

Use: The list of all groups that exist

CC: gsched_lock

gsched_schedlist

Root: gsched_schedlist

Element: gsched_sdf->schedlist_entry

Use: The list of all SDFs that exist

CC: gsched_lock

all_members

Root: gsched_group->all_members

Element: gsched_member->owner_entry

Use: The list of all members in a group

CC: gsched_group->lock

memberships

Root: gsched_group->memberships

Element: gsched_member->samecomputation_entry

Use: The list of all members that a group is represented by. For example: if a group is a member of two other groups then there will be two gsched_member entries on this list.

CC: gsched_group->lock

Concurrency Control

runqueue Lock

The Linux scheduling subsystem calls some group scheduling functions with the runqueue lock already held, so the context in which rq locks are being used is important. The runqueue lock is assumed to be held by the calling context in these functions.

task_rq_lock is the routine most commonly used to acquire the runqueue lock; it saves the interrupt flags and disables interrupts. task_rq_unlock restores the saved interrupt flags.

There is one rq lock per CPU, shared across scheduling classes.

GS Framework Functions called in context controlled by rq lock:

  • gsched_pick_next_task
  • gsched_run_scheduler
  • gsched_schedule_proxy
  • gsched_record
  • gsched_enqueue_task
  • gsched_dequeue_task

SDF hooks called in context controlled by rq lock:

  • iterator_prepare
  • iterator_next
  • iterator_finalize
  • start_member
  • stop_member

reference counts

Structures for members and groups use usage counts to keep track of how many places the structures are being used. The reference count fields are called usage. This is useful for keeping structures around when no locks are held. When the usage count drops to zero, the structures are freed.

The framework holds a reference to each group and member after it is created. The reference is not released until gsched_destroy_group/gsched_leave_group are called.

The GS framework tries to permit arbitrary SDF semantics, which can include arbitrary data structures for scheduling data with arbitrary semantics, possibly including blocking operations. A reference count for each group held by the GS framework is helpful in achieving this. It is important to note that the GS framework uses group->lock, an atomic spinlock, to control working access to the group, which forbids blocking operations while it is held.

When a group must be destroyed, the GS framework provides a two-step process. The GS framework routine destroy_group is called when the group is no longer needed and calls two SDF hooks to help with the process after performing some generic group destruction. The first, remove_group, assumes group->lock is held and is therefore limited to non-blocking operations related to SDF-specific group destruction.

At this point, destroy_group decrements the group reference count and if this has reached zero then calls release_group which is the SDF hook related to group destruction which permits blocking operations. Note that under current framework assumptions the usage count should always go to zero at this point.

Reference counts are also used for the get/set_member_params and get/set_group_params SDF callbacks. When these callbacks are invoked the framework holds a reference to the member or group involved but it does not hold the group->lock. Therefore, the code inside these callbacks can block. However, to actually access the SDF-specific data associated with the member or group, the code inside the callback must utilize the appropriate concurrency control, typically the group->lock.

struct gsched_group

The group scheduling groups are protected by the lock atomic spinlock member of the gsched_group structure. Groups are locked when members are added, when the group becomes a member of another group, and when members are removed from the group. Also, it is common for an SDF to lock a group at scheduling time.

Interrupts should be disabled when acquiring the group lock.

Sometimes interrupts are disabled in the calling context when the runqueue lock is acquired and they do not need to be explicitly disabled.

The group lock is held during some of the SDF callback functions:

  • insert_member
  • remove_member
  • remove_group

Internally, the group lock is acquired in:

  • gsched_group_join_group
  • gsched_pid_join_group
  • gsched_task_exit
  • gsched_remove_member
  • gsched_remove_group
  • gsched_set_member_params
  • gsched_get_member_params

In cases where multiple group locks must be acquired, the locks must be acquired using transactions, via atomic_spin_lock_trylock, to protect against deadlocks. The group locks do not have a definite ordering because the hierarchy is dynamic.

retry:
        atomic_spin_lock(&group1->lock);

        if (!atomic_spin_lock_trylock(&group2->lock)) {
                /* Could not get the second lock: back off completely
                 * and retry, rather than risk a deadlock. */
                atomic_spin_unlock(&group1->lock);
                goto retry;
        }

struct task_struct

Group scheduling uses the gsched_memberships variable inside the task structure to keep track of what members a task has. This list is protected by the runqueue lock.

gsched_lock

The gsched_lock is a preemptable mutex. The gsched_lock must be held when accessing the gsched_grouplist (the list of all groups) and gsched_schedlist (the list of all sdfs) lists.

The gsched_lock controls the group scheduling API. However, the hierarchy can be accessed at scheduling time without using the gsched_lock. Additionally, members are removed from the hierarchy when a task exits without acquiring the gsched_lock.

ccsm_mutex

Group scheduling relies on CCSM for a namespace and CCSM is controlled by a combination of a preemptable mutex and RCU. Currently, group scheduling’s use of CCSM is constrained to a specific lock ordering. Group scheduling uses a CCSM callback to be notified when a thread announces itself to CCSM. The callback function executes with the ccsm_mutex already acquired. Inside the callback, the gsched_lock must be acquired.

All group scheduling functions must currently adhere to the lock ordering of ccsm_mutex followed by gsched_lock to avoid deadlock.

Ordering of SDF callbacks

Some SDF callbacks need to be ordered

Ordering enforced by group->lock

insert_member before remove_member

insert_group before remove_group

Requires entire system to wait on some SDF handlers to complete

If the top group is locked for a callback then no scheduling can occur.

Group API Functions

create_group

Create a new group controlled by a specific sdf.

int gsched_create_group(char *group_name, char *sched_name)
  • group_name - unique name for this group in the group scheduling framework
  • sched_name - sdf this group will be associated with

Acquire ccsm_mutex to preserve lock ordering and create namespace entry later.

Acquire gsched_lock to look up SDF.

Create group.

Get reference to SDF module.

Held by created group for its lifetime.

Ensures that SDF does not get unloaded while in use.

Create namespace entry

ccsm_mutex already acquired

no locking needed

Drop ccsm_mutex

No more CCSM operations needed.

Lock ordering not violated.

Call insert_group SDF handler.

get/set_group_params

Call the SDF scheduling hooks to get and set group parameters.

int gsched_get_group_params(char *name, void *value, size_t size)
int gsched_set_group_params(char *name, void *value, size_t size)
  • name - The name of the group whose parameters are being get or set.
  • value - The parameter value to get or set.
  • size - The size of the value.

Acquire gsched_lock

Possibly unneeded now that the group is looked up using CCSM?

Look up group in CCSM using RCU.

Acquire reference to group.

Drop gsched_lock.

Call group_params SDF handler with only reference held.

Release reference.

get/set_group_opts

Presently used to mark a group as requiring automatic cleanup when it becomes empty.

int gsched_get_group_opts(char *name, int *opts)
int gsched_set_group_opts(char *name, int *opts)
  • name - The name of the group to get the options from.
  • opts - A pointer to store the options in.

Acquire gsched_lock.

Possibly unneeded now that the group is looked up using CCSM?

Look up group in CCSM using RCU.

Acquire group->lock to access options.

destroy_group

Delete a group from the system.

int gsched_destroy_group(char *name)
  • name - The name of the group to destroy

A group cannot be deleted until it has been uninstalled from the system and it has no members.

Acquire ccsm_mutex to preserve lock ordering.

Acquire gsched_lock.

Access CCSM using RCU to find the group

RCU unnecessary with ccsm_mutex held

Need extension to CCSM API

Acquire group->lock to check if there are group members.

Call remove_group SDF handler.

Drop group->lock.

Delete namespace entry.

CCSM mutex already acquired.

Member API Functions

The functions below operate on members within a group.

pid_join_group

Make a task, identified by its pid, a member of a group.

int gsched_pid_join_group(char *group_name, pid_t pid, char *member_name)
  • group_name - name of the group being joined
  • pid - process id of the task joining a group
  • member_name - unique member name for the task within the joined group

group_join_group

Make a group a member of another group.

int gsched_group_join_group(char *group_name, char *addgroup_name)
  • group_name - The name of the group to add the addgroup_name group to.
  • addgroup_name - The name of the group being added to group_name.

Acquire ccsm_mutex to preserve lock ordering and access CCSM.

Acquire gsched_lock.

Look up group in CCSM using RCU.

RCU unnecessary with ccsm_mutex held

Need extension to CCSM API

Acquire group->lock for the containing group.

Acquire group->lock for the member group

insert member

Drop both group locks.

Add member group to containing group’s CCSM set.

Already hold ccsm_mutex

get/set_member_params

int gsched_get_member_params(char *gname, char *mname, void *value, size_t size)
int gsched_set_member_params(char *gname, char *mname, void *value, size_t size)
  • gname - The name of the group that mname belongs to.
  • mname - The name of the member in gname to get the parameters from.
  • value - The value to set or get.
  • size - The size of the value.

Acquire gsched_lock

Look up group in CCSM using RCU

Acquire group->lock to search group’s member list.

Get group and member reference to keep both around.

Drop gsched_lock.

Call SDF handler holding only the references.

Release group and member references.

leave_group

int gsched_leave_group(char *group_name, char *member_name)
  • group_name - The group to remove the member from.
  • member_name - The name of the member within the group.

remove_group

Acquire group lock for containing group.

Acquire group lock for member group.

remove member

Drop both group locks.

check for deferred group cleanup

add to workqueue

put framework reference for member to delete member

remove_task

Acquire rq lock for task to access task->gsched_memberships.

Acquire group lock for containing group. The framework holds a reference to each group and member after it is created; the reference is not released until gsched_destroy_group/gsched_leave_group is called.

remove member

Drop rq lock and group lock.

check for deferred group cleanup

add to workqueue

put framework reference for member to delete member

set_exclusive_control

Just a wrapper around sched_setscheduler.

May be removed in the future.
int gsched_set_exclusive_control(pid_t pid)
  • pid - The pid of the task to set exclusive on.

clear_exclusive_control

Just a wrapper around sched_setscheduler.

May be removed in the future.
int gsched_clear_exclusive_control(pid_t pid)
  • pid - The pid of the task to set exclusive off.

Special Functions

default_find_member

struct gsched_member *gsched_default_find_member(struct gsched_group *group, char *name)
  • group - group to search within
  • name - name of member to search for

Searches the framework-level member lists for a match and returns a pointer to the member if found, NULL otherwise. Can be used by SDFs as a drop-in replacement for the find_member SDF hook, if customized member management is not part of the SDF.

task_exit

void gsched_task_exit(struct task_struct *task)
  • task - A task that is exiting.

Acquire runqueue lock to access gsched_memberships for the task.

Acquire group lock for each member to remove the member.

Put members on temporary list. Members only accessible locally.

Check each member for empty group cleanup.

Put framework reference to member.

This function is called by do_exit to indicate to group scheduling that a task is exiting. When a task exits, the members of the task are removed from the groups that they belong to. The reference count of each member is decremented to match the increment when the member was created. The member will be deleted if its reference count has reached 0.

Scheduling Subsystem Interaction

pick_next_task

Called from __schedule.

Executes with runqueue locked.

run_scheduler

static struct gsched_member *gsched_run_scheduler(struct gsched_group *group, struct rq *rq)
  • group - group whose scheduler is being invoked
  • rq - per-cpu runqueue being scheduled for

Evaluates the hierarchy.

Each group acquires group lock in iterator_prepare and iterator_finalize.

Group locks protect against concurrent hierarchy modification by different CPUs.

This is the all-important scheduling function for the Group Scheduling framework: here tasks are actually scheduled by the framework and chosen to run. The function itself is deceptively simple.

First the SDF iterator_prepare hook is called, in case the developer wishes to perform some bookkeeping prior to choosing a member. Then a call is made to the SDF iterator_next hook. Note that this is the only SDF hook that MUST be implemented, as it is here that the scheduling decision is returned to Linux proper. If no decision is reached (NULL is returned by the SDF call) the subroutine exits and vanilla Linux makes a scheduling decision.

Otherwise, the type of the returned member is checked. If the member is a group, the scheduler is called recursively on that group so that its associated SDF may try to return a member. If the member is a task, the SDF is_runnable hook is consulted, and the framework verifies that the selected task is on the run queue and on the proper CPU. If all conditions pass, the SDF iterator_finalize hook is called, in case the SDF has some final bookkeeping to perform, and the member's task is returned to Linux to be scheduled. Otherwise, the whole loop is run again.

dequeue_task

void gsched_dequeue_task(struct rq *rq, struct task_struct  *task, int sleep)

Called from dequeue_task

Executes with runqueue locked.

call group scheduling SDF handler stop_member.

enqueue_task

void gsched_enqueue_task(struct rq *rq, struct task_struct  *task, int wakeup)

Called from enqueue_task

Executes with runqueue locked.

call group scheduling SDF handler start_member.