There are two primary data structures for use with the Group Scheduling framework, one for groups and one for group members. Group members may be other groups or tasks, so another structure is required to create a generic member pointer which can include either member type. Further, a series of structures are needed to support the supplementary ioctl calls which the user side Group Scheduling library uses to communicate with the kernel side implementation.
struct gsched_sdf {
char *name;
struct module *owner;
struct list_head schedlist_entry;
void (*insert_group)(struct gsched_group *);
void (*remove_group)(struct gsched_group *);
void (*setup_member)(struct gsched_group *, struct gsched_member *);
void (*insert_member)(struct gsched_group *, struct gsched_member *);
void (*remove_member)(struct gsched_group *, struct gsched_member *);
void (*release_member)(struct gsched_group *, struct gsched_member *);
void (*release_group)(struct gsched_group *);
struct gsched_member *(*find_member)(struct gsched_group *group, char *member_name);
int (*set_member_params)(struct gsched_group *, struct gsched_member *, void *, size_t);
int (*get_member_params)(struct gsched_group *, struct gsched_member *, void *, size_t);
int (*set_group_params)(struct gsched_group *, void *, size_t);
int (*get_group_params)(struct gsched_group *, void *, size_t);
void (*iterator_prepare)(struct gsched_group *, struct rq *rq);
struct gsched_member *(*iterator_next)(struct gsched_group *, struct gsched_member *, struct rq *rq);
void (*iterator_finalize)(struct gsched_group *, struct gsched_member *, struct rq *rq);
int (*is_runnable)(struct gsched_group *, struct gsched_member *);
int (*fork_member)(struct gsched_group *, struct gsched_member *, struct gsched_member *);
void (*start_member)(struct gsched_group *, struct gsched_member *, struct rq *, int);
void (*stop_member)(struct gsched_group *, struct gsched_member *, struct rq *, int);
size_t per_group_datasize;
size_t per_member_datasize;
size_t per_cpu_datasize;
};
name - The name of the sdf
Constant after creation.
Protected by gsched_mutex.
Constant after creation.
Constant after creation.
Constant after creation.
The remainder of the members are function pointers filled in by SDF modules. The callbacks are explained here:
The gsched_group structure defines a group scheduling group. A group scheduling group is controlled by a single sdf but it may be a member of multiple groups. A group holds a reference to the module of the sdf it uses. The module cannot be unloaded while this reference is held.
atomic_spinlock_t lock;
struct ccsm_set *ns;
int opts; /* options settable by user code */
struct list_head memberships;
struct list_head grouplist_entry;
struct list_head all_members;
struct gsched_sdf *sdf;
void *sched_data;
void *cpu_sched_data[NR_CPUS];
atomic_t usage;
lock - non-preepmtable spinlock that controls access to fields and group members
Constant after creation. No CC required to dereference field.
Protected by group->lock.
memberships - Each group may be a member of zero or more other groups. When a group becomes a member a struct gs_member is allocated and associated with the group being made a member. This member then represents the group’s membership in another group. The memberships list is a list of struct gs_members which represent this group as a member of another group. This list is protected by group->lock.
Protected by gsched_mutex.
per-group basis, in addition to any member management that a given SDF implements. The framework automatically adds all members to this list, irregardless of whether the member represents a task or another group. Protected by group->lock.
sdf - associated scheduling decision function
Constant after creation. No CC required to dereference field.
allocate a given amount of storage to be available via this per-cpu array. The most common use of this is per-cpu runqueues of tasks. If a per-cpu separation of data is not required then it is recommeneded that an SDF use the group->sd per-group scheduling data. Constant after creation. No CC required to dereference field.
Used with atomic ops.
The gsched_member structure defines a group scheduling member which represents the membership of a group or task within a group.
struct gsched_member {
int flags;
int type;
void *sched_data;
struct ccsm_set *ns;
struct list_head owner_entry;
struct list_head samecomputation_entry;
struct gsched_group *owner;
atomic_t usage;
union {
struct gsched_member_task task;
struct gsched_member_group group;
};
};
Protected by owner->lock.
Constant after creation. No CC required to access field.
Constant after creation. No CC required to dereference field.
Constant after creation. No CC required to dereference field.
Protected by owner->lock.
For task: protected by rq lock. For group: protected by owner->lock.
Constant after creation. No CC required to dereference field.
Used with atomic ops.
task - task specific data
group - group specific data
Structure gsched_member_task is defined in include/linux/kusp/sched_gsched.h.
struct gsched_member_task {
struct task_struct *ptr;
struct task_struct *proxy;
struct list_head waiter_ent;
struct list_head waiting_members_ent;
int task_cpu;
int proxy_cpu;
enum gsched_task_state state;
enum gsched_task_state blocked_state;
}
The gsched_member structure holds all information for the member of a group. When a group member is a task the gsched member structure’s anonymous union members ‘task’ and ‘group’ are of structure types appropriate to representing tasks and groups. The gsched_member_task structure is the type used to represent tasks as members of a group.
It holds proxy information and CPU assignment.
ptr - Pointer to the task struct of the task represented by this member.
member. Proxy must be on the same CPU. NULL when a process is blocked on a lock but we are still searching for the proxy.
field in the pmgt_waiter structure which represents all group memberships of this task. When a task is blocking on a lock the list is protected by the pi_lock of the task. After the proxy has been discovered, the list is protected by the pi_lock of the proxy.
gsched_waiting_members field in task structure of proxy. This list represents all of the tasks waiting on the proxy. Protected by the rq lock of proxy. Contains the default values indicating an empty list if the proxy field is NULL.
task_cpu - The cpu that the task represented by this member is on.
proxy_cpu - The cpu of the proxy. -1 if proxy is NULL
state - One of the values from gsched_task_state.
and then blocks. The blocked state can never be GSCHED_TASK_RUNNABLE. If the state of a task is not GSCHED_TASK_RUNNABLE then the state and blocked_state will have the same value. The reason for this is in the implementation of proxy accounting there are some periods during which the state of the thread is blocked but it must continue running to complete some of the accounting code.
Portions of the member structure that only exist for groups.
struct gsched_group *ptr;
Constant after creation. No CC required to dereference field.
Root: task_struct->gsched_memberships
element: gsched_member->samecomputation_entry
For group members, this element field is re-used to represent all the memberships of a group in other groups and is rooted in the group structure at group->memberships
Use: Holds all of the group memberships for a task.
CC: rq lock of task’s cpu
Root: gsched_grouplist
Element: gsched_group->grouplist_entry
Use: The list of all groups that exist
CC: gsched_lock
Root: gsched_schedlist
Element: gsched_sdf->schedlist_entry
Use: The list of all sdfs taht exist
CC: gsched_lock
Root: gsched_group->all_members
Element: gsched_member->owner_entry
Use: The list of all members in a group
CC: gsched_group->lock
Root: gsched_group->memberships
Element: gsched_member->samecomputation_entry
Use: The list of all members that a group is represented by. For example: if a group is a member of two other groups then there will be two gsched_member entries on this list.
CC: gsched_group->lock
Linux scheduling subsystem calls some group scheduling functions
the context rq locks are being used is important
The runqueue lock is assumed to be held by the calling context in these functions.
task_rq_lock is the routine most commonly used to acquire the runqueue lock
this routine disables and saves the interrupt flagstask_rq_unlock
restores the saved interrupt flagsone rq lock per cpu and shared across scheduling classes
GS Framework Functions called in context controlled by rq lock:
- gsched_pick_next_task
- gsched_run_scheduler
- gsched_schedule_proxy
- gsched_record
- gsched_enqueue_task
- gsched_dequeue_task
SDF hooks called in context controlled by rq lock:
- iterator_prepare
- iterator_next
- iterator_finalize
- start_member
- stop_member
Structures for members and groups use usage counts to keep track of how many places the structures are being used. The reference count fields are called usage. This is useful for keeping structures around when no locks are held. When the usage count drops to zero, the structures are freed.
The framework holds a reference to each group and member after it is created. The reference is not released until gsched_destroy_group/gsched_leave_group are called.
The GS framework tries to permit arbitrary SDF semantics which can include arbitrary data structures for scheduling data with arbitrary sematnics, which might include blocking operations. A reference count for each group held by the GS framework is helpful in achieving this. It is important to note that the GS framework uses group->lock, an atomic spinlock, to control working access to the group, which forbids blocking operations while it is held.
Shen a group must be destroyed, the GS framework provides a two step process.The GS framework rountine destroy_group is called when the group is no longer needed and calls two SDF hooks to help with the process after performing some generic group destruction. The first, remove_group, assumes group->lock is held and this is lmiited to non-blocking operations related to SDF specific group destruction operations.
At this point, destroy_group decrements the group reference count and if this has reached zero then calls release_group which is the SDF hook related to group destruction which permits blocking operations. Note that under current framework assumptions the usage count should always go to zero at this point.
Reference counts are also used for the get/set_member_params and get/set_group_params SDF callbacks. When these callbacks are invoked the framework holds a reference to the member or group involved but it does not hold the group->lock. Therefore, the code inside these callbacks can block. However, to actually access the SDF specific data assocaited with the member or group, the code inside the callback must utilize the appropriate concurrency control, typically the group->lock.
The group scheduling groups are protected by the lock atomic spinlock member of the gsched_group structure. Groups are locked when members are added, when the group becomes a member of another group, and when members are removed from the group. Also, it is common for an SDF to lock a group at scheduling time.
Interrupts should be disabled when acquiring the group lock.
Sometimes interrupts are disabled in the calling context when the runqueue lock is acquired and they do not need to be explicitly disabled.
The group lock is held during some of the SDF callback functions:
- insert_member
- remove_member
- remove_group
Internally, the group lock is acquired in:
- gsched_group_join_group
- gsched_pid_join_group
- gsched_task_exit
- gsched_remove_member
- gsched_remove_gruop
- gsched_set_member_params
- gsched_get_member_params
In cases where multiple group locks must be acquired, the locks must be acquired using transactions, via atomic_spin_lock_trylock, to protect against deadlocks. The group locks do not have a definite ordering because the hierarchy is dynamic.
retry:
atomic_spin_lock(&group1->lock);
if (!atomic_spin_lock_trylock(&group2->lock)) {
atomic_spin_unlock(&group1->lock);
goto retry;
}
Group scheduling uses the gsched_memberships variable inside the task structure to keep track of what members a task has. This list is protected by the runqueue lock.
The gsched_lock is a preemptable mutex. The gsched_lock must be held when accessing the gsched_grouplist (the list of all groups) and gsched_schedlist (the list of all sdfs) lists.
The gsched_lock controls the group scheduling API. However, the hierarchy can be accessed at scheduling time without using the gsched_lock. Additionally, members are removed from the hierarchy when a task exits without acquiring the gsched_lock.
Group scheduling relies on CCSM for a namespace and CCSM is controlled by a combination of a preemptable mutex and RCU. Currently, group scheduling’s use of CCSM is constrained to a specific lock ordering. Group scheduling uses a CCSM callback to be notified when a thread announces itself to CCSM. The callback function executes with the ccsm_mutex already acquired. Inside the callback, the gsched_lock must be acquired.
All group scheduling functions must currently adhere to the lock ordering of ccsm_mutex followed by gsched_lock to avoid deadlock.
Some SDF callbacks need to be ordered
Ordering enforced by group->lock
insert_member before remove_member
insert_group before remove_group
Requires entire system to wait on some SDF handlers to complete
If the top group is locked for a callback then no scheduling can occur.
Create a new group controlled by a specific sdf.
int gsched_create_group(char *group_name, char *sched_name)
Acquire ccsm_mutex to preserve lock ordering and create namespace entry later.
Acquire gsched_lock to look up SDF.
Create group.
Get reference to SDF module.
Held by created group for its lifetime.
Insures that SDF does not get unloaded while in use.
Create namespace entry
ccsm_mutex already acquired
no locking needed
Drop ccsm_mutex
No more CCSM operations needed.
Lock ordering not violated.
Call insert_group SDF handler.
Call the sdf scheudling hooks to get and set group parameters.
int gsched_get_group_params(char *name, void *value, size_t size)
int gsched_set_group_params(char *name, void *value, size_t size)
Acquire gsched_lock
Possibly unneded now that the group is looked up using CCSM?
Look up group in CCSM using RCU.
Acquire reference to group.
Drop gsched_lock.
Call group_params SDF handler with only reference held.
Release reference.
Presently used to mark a group as requiring automatic cleanup when it becomes empty.
int gsched_get_group_opts(char* name, int* opts)
int gsched_set_group_opts(char* name, int* opts)
Acquire gsched_lock.
Possibly unneded now that the group is looked up using CCSM?
Look up group in CCSM using RCU.
Acquire group->lock to access options.
Delete a group from the system.
int gsched_destroy_group(char *name)
A group cannot be deleted until it has been uninstalled from the system and it has no members.
Acquire ccsm_mutex to preserve lock ordering.
Acquire gsched_lock.
Access CCSM using RCU to find the group
RCU unnecessary with ccsm_mutex held
Need extension to CCSM API
Acquire group->lock to check if there are group members.
Call remove-group
Drop group-lock
Delete namespace entry.
CCSM mutex already acquired.
the functions below operate on members within a group.
int gsched_pid_join_group(char *group_name, pid_t pid, char *member_name)
Make a group a member of another group.
int gsched_group_join_group(char *group_name, char *addgroup_name)
Acquire ccsm_mutex to preserve lock ordering and access CCSM.
Acquire gsched_lock.
Look up group in CCSM using RCU.
RCU unnecessary with ccsm_mutex held
Need extension to CCSM API
Acquire group->lock for the containig group
Acquire group->lock for the member group
inset member
Drop both group locks.
Add member group to containing group’s CCSM set.
Already hold ccsm_mutex
int gsched_get_member_params(char *gname, char *mname, void *value, size_t size)
Acquire gsched_lock
Look up group in CCSM using RCU
Acquire group->lock to search group’s member list.
Get group and member reference to keep both around.
Drop gsched_lock.
Call SDF handler holding only the references.
Release group and member references.
int gsched_leave_group(char *group_name, char *member_name)
Acquire group lock for containing group.
Acquire group lock for member group.
remove member
Drop both group locks.
check for deferred group cleanup
add to workqueue
put framework reference for member to delete member
Acquire rq lock for task to access task->gsched_memberships.
Acquire group lock for containing group. The framework holds a reference to each group and member after it is created. Reference is not released until gsched_destroy_group/gsched_leave_group are called. remmove member
Drop rq lock and group lock.
check for deferred group cleanup
add to workqueue
put framework reference for member to delete member
Just a wrapper around sched_setscheduler
May be removed in the future.
int gsched_set_exclusive_control(pid_t pid)
Just a wrapper around sched_setscheduler.
May be removed in the future.
int gsched_clear_exclusive_control(pid_t pid)
struct gsched_member *gsched_default_find_member(struct gsched_group *group, char *name)
Searches the framework-level member lists for a match and returns a pointer to the member if found, NULL otherwise. Can be used by SDFs as a drop in replacement for the find_member SDF hook, if customized member management is not part of the SDF.
void gsched_task_exit(struct task_struct *task)
Acquire runqueue lock to access gsched_memberships for the task.
Acquire group lock for each member to remove the member.
Put members on temporary list. Members only accessible locally.
Check each member for empty group cleanup.
Put framework reference to member.
This function is called by do_exit to indicate to group scheduling that a task is exiting. When a task exits, the members of the task are removed from the groups that they belong to. The reference count of each member is decremented to match the increment when the member was created. The member will be deleted if its reference count has reached 0.
Called from __schedule.
Executes with runqueue locked.
static struct gsched_member *gsched_run_scheduler(struct gsched_group *group, struct rq *rq)
Evaluates the hierarchy.
Each group acquires group lock in iterator_prepare and iterator_finalize.
Group locks protect against concurrent hierarchy modification by different CPUs.
This is the all-important scheduling function for the Group Scheduling Framework. Here is where tasks are actually scheduled by the framework and chosen to run. The function itself is deceptively simple. A check is made to the SDF iterator_prepare hook in case the developer wishes to perform some book-keeping prior to choosing a member. Then a call is made to the SDF iterator_next hook. Note that this is the only SDF hook that MUST be implemented, as it is here the scheduling decision is returned to Linux proper. If no decision is reached (NULL is returned by the SDF call) the subroutine exits and vanilla Linux makes a scheduling decision. Otherwise, the type of member returned is checked. If the member is a group, the scheduler is recursively called on that group so that its associated SDF may try to return a member. If the member is a task, checks are made to the user implemented SDF is_runnable hook, as well as ensuring the task selected is on the run queue and on the proper cpu. Assuming all conditions are passed, a check is made to the user implemented SDF iterator finalize hook, in case the user has some final book-keeping to perform, and the member-task is returned to Linux to be scheduled. Otherwise, the whole loop is called again.
void gsched_dequeue_task(struct rq *rq, struct task_struct *task, int sleep)
Called from dequeue_task
Executes with runqueue locked.
call group scheduling SDF handler start_member.
void gsched_enqueue_task(struct rq *rq, struct task_struct *task, int wakeup)
Called from enqueue_task
Executes with runqueue locked.
call group scheduilng SDF handler stop_member.