HGS use cases
group A is added/removed from group B
kusp/hgs_core.c
group A must be added/removed from every CPU list of group B
group lock for A and group lock for B are held
gsched_lock
task A is added/removed from group B
kusp/hgs_core.c
gsched_lock
rq lock for task A
group lock for B
if task A is in the EC scheduling class
- EC scheduling class must be informed that
the memberships for task A have changed becuase it may need to stop/start scheduling task A
currently
- if the task is on the rq then
it is removed from the rq before the membership change and added after the membership change
- EC examines the
memberships for task A when it is added to the rq and updates its information as appropriate
avatar A, for task A, must be added/removed from the approprite CPU list
if task A does not have a proxy
- avatar A must be added/removed from the list
associated with the CPU to which task A is assigned
if task A does have a proxy
- avatar A must be added/removed from the list for the CPU that the proxy for A is assigned to
- avatar A must also be added to the list of avatars
stored on the PMGT waiter structure for task A
currently
on addition
- avatar A is temporarily
placed on the CPU list that task A is assigned to
- cannot be selected
- slightly later, avatar A
moved to the CPU list of the proxy and assocaited with the PMGT waiter for task A in a context in which the PMGT CC can be accessed easier
- pmgt_update(task, member);
on removal
- avatar A is removed from
whichever CPU it is assigned to
- slightly later, avatar A is removed
from the PMGT waiter structure for task A
- pmgt_update(task, member);
- there is a short period of time between
when the avatar is removed from the SDF and when the avatar is removed from the PMGT waiter structure
- avatar A is marked as
GSCHED_TASK_DESTROYED during this period and it is ignored
group parameters are changed
- kusp/hgs_core.c
- int gsched_get_group_params(const char *name, void *value, size_t size)
- int gsched_set_group_params(const char *name, void *value, size_t size)
- CCSM lookup group name using RCU
- group reference count
- for future development it may be possible to move the group level parameters
into the CCSM set as attributes to take advantage of the RCU control of access to parameter values thuse potentially resolving effects of race conditions
member parameters are changed
- kusp/hgs_core.c
- int gsched_set_member_params(const char *gname, const char *mname, void *value, size_t size)
- int gsched_get_member_params(const char *gname, const char *mname, void *value, size_t size)
- CCSM lookup group name using RCU
- group reference count
- group lock is used to lookup member name on HGS maintained
list of members in the group
- member reference count
a group is created
- kusp/hgs_core.c
- int gsched_create_group(const char *group_name, const char *sched_name)
- static struct gsched_group *__gsched_create_group(const char *name, const char *sdf_name)
- group structure is initially only visible locally
- is exposed through CCSM
- ccsm_mutex
- doesn’t use gsched_lock since the group is not visible
until after it has been initialized
- ie: the rest of HGS doesn’t know about it
a group is destroyed
kusp/hgs_core.c
int gsched_destroy_group(const char *name)
static void hgs_group_destroy(struct ccsm_set *set)
ccsm_mutex
gsched_lock
mainly jsut calls destroy_group
- currently holds the per group lock when calling this function
to be compatible with SDFs written for earlier implementations
- this lock can be removed and it should be to allow
SDFs to perform blocking operatins during destroy
- all SDFs have to be adjusted to use the group lock
internally in destroy
an SDF is installed
kusp/hgs_core.c
int gsched_register_scheduler(struct gsched_sdf *sched)
gsched_lock
reference to the module that the SDF is in
- module subsystem keeps a reference count ot modules so HGS
holding a reference to the module means it cannot be unloaded
an SDF is removed
- kusp/hgs_core.c
- void gsched_unregister_scheduler(struct gsched_sdf *sched)
- gsched_lock
- put back the reference to the module that the SDF is in
task movement between CPUs
kusp/hgs_core.c
void gsched_enqueue_task(struct rq *rq, struct task_struct *task, int wakeup)
when a task is not on the rq
only source rq lock acquired
can occur as a result of “exec”
can occur immediately before a task is awaken
- scheduling class decides which CPU the task wakes up on
user specified CPU assignment change
never deferred to migration thread when not on the rq
when a task is on the rq
- source rq lock and destination rq lock acquired
- load balancing
- user specified CPU assignment change
- deferred (always? sometimes?) to migration thread
complications
- if a task is not m-blocked then the avatars
for the task must follow the task to its new CPU assignment
- if a task is m-blocked then the avatars do
not move
- if a task is a proxy then the avatars for
which it is a proxy must follow it to its new CPU assignment
currently
- all of this is handled by the enqueue HGS routine called from enqueue_task
a task awakens
- kusp/hgs_core.c
- void gsched_dequeue_task(struct rq *rq, struct task_struct *task, int sleep)
- deal with CPU assignment changes that may be happening
as a result of awakening
- if a task has a proxy and it is unexpectedly waking
up for a signal or timer then the avatars for the task must be moved to the CPU of the task
- this is an abort of a mutex lock operation
- tell schedulers that the task’s avatars are
selectable on the CPU of the task
a task blocks
- kusp/hgs_core.c
- void gsched_dequeue_task(struct rq *rq, struct task_struct *task, int sleep)
- tell schedulers that task’s avatars are no longer selectable
- note that this is an example of where we might
consider using CCSM to track the proxy and its set of waiters for state change purposes as a way of notifying the scheduling data structures for the waiters that the proxy is blocked rather than finding out the proxy is blocked at scheduling time
a task m-blocks
- kusp/hgs_core.c
- void gsched_dequeue_task(struct rq *rq, struct task_struct *task, int sleep)
- move task’s avatars to the CPU of the proxy
- tell schedulers the avatars are selectable on the
CPU of the proxy
a task tries to acquire an owned mutex
a task finds a proxy
the proxy for a task changes
kusp/hgs_core.c
static void hgs_pmgt_move_prepare(struct pmgt_waiter *waiter, struct task_struct *proxy)
static void hgs_pmgt_move(struct pmgt_waiter *waiter, struct task_struct *proxy)
the task is marked as searching
- the avatars for the task remain on the CPU
of the old proxy
when the new proxy is found
- the task is then marked as having found a proxy
- the avatars are moved to the CPU of the new
proxy
a task is no longer m-blocked
HGS CC
‘gsched_lock’
- the global lock for the HGS hierarchy
- never used at scheduling time because mutexes
involve blocking and you cannot block at scheudling time
- also cannot use RT spinlocks
- can use atomic spinlocks and RCU reading
mutex
protects list of SDFs
- static LIST_HEAD(gsched_schedlist);
serializes most API calls with a few exceptions
- get/set_member_params
- get/set_group_params
- these exceptions use reference counting
- other API calls can occur while these
calls are occuring
- mostly not an issue since the group’s
data is protected by the per group lock in all SDF implementations
- however, it is possible for parameter
functions to be called on a group that has been destroyed but not yet freed
- the memory is still valid
- changes to the memory effectively
do nothing but care may be needed if more than just writing to variables is taking place
- worth considering holding the ‘gsched_lock’
for the parameter functions, though this would increase the constraint on concurrency
per group locks
- group->lock
- atomic spinlock
- protect the list of group memberships maintained
by the framework for each group
- group->all_members
- this is not a per-CPU list(and shouldn’t be)
- it is needed to find member structures by name
- member structures are not listed in CCSM
but the things they represent(tasks/groups) are
consider making this an RCU list
- because the RCU macros for list operations
permit multiple readers and we would reduce concurrency constraint as a result
protects the list of groups that a group is a member of
- group->memberships
- consider making this an RCU Sched list
- this isn’t really used anywhere currently
protects group and membership data
accessed in many contexts
- scheduling
- locking time(PMGT callbacks)
- task state changes(enqueue_task/dequeue_task)
runqueue lock
atomic spinlock
protects the list of groups that a task is a member of
task->gsched_memberships
off the task structure
duplicated in a list off of the PMGT waiter
hgs_pmgt_waiter->hgs_task_members
- include/linux/kusp/hgs_pmgt.h
remove duplicate list and read the original list
off of the task structure using RCU
consider making this an RCU Sched list
- if this list is RCU then we probably
wont have to duplicate the list on the PMGT waiter and that would be a lot less painful
- semantics of the current
duplicate list have the same stale data implications as RCU sched, so this seems feasible
- this list is only read at
locking time
used internally by HGS to interact with the
list of tasks that the EC scheduler should schedule
- scheduling classes rely primarily on runqueue locks to protect data
- used internally by HGS at locking time to access the
CPU assignment of a proxy to move avatars to that CPU
- protects the list of avatars on a proxy that are m-blocked on
that proxy
- task->gsched_waiting_members
- a list of waiting avatars of of the task structure
- used when a proxy task changes CPUs and the avatars
that are m-blocked on it need to follow it
- updated by add/remove member API
held by Linux in several calling contexts
- scheduling
- task state changes
task->pi_lock
- atomic spinlock
- the pi_lock of a proxy task controls access to the
list of avatars stored on every PMGT waiter representing tasks m-blocked on the proxy task
- pmgt_waiter->hgs_task_members
- PMGT holds the pi_lock for a proxy when it makes
calls into HGS that involve that proxy
- accessed by the add/remove member API to update
PMGT waiters when avatars are added/removed
HGS issues
- the view of the hierarchy for a CPU should not change whlie
while the CPU is making a decision
the hierarchy should be reasonably synchronized for all CPUs
group data may need to change at scheduling time
- group data is often shared among CPUs
holding the group lock for a group as the hierarchy below
the group is evaluated constrains physical concurrency at scheduling time
HGS modifications
per CPU locks could protect an entire view of the hierarchy
- the linux runqueue lock of the cpu or
- HGS specific per-cpu locks
- each group should maintain arbitrary runqueues that determine
which members are selectable on each CPU
- groups will be listed on every CPU
- avatars will only be listed on a single CPU
- HGS will use enqueue/dequeue callbacks to instruct the SDF about
the CPU that an avatar should be on
- the tricky part is dealing with some of the CPU assignemtn modifications/proxy scenarios
- when designing this locking it is instructive to
look at “task_rq_lock” function which shows how to read a CPU assignment without already holding the CPU runqueue lock
- locking is repeated until the read before
the lock operation agrees with a read after the lock operation
- ie: an HGS version could be created
task membership data could be protected by a CPU lock if the
data is only accessed on the CPU that the task member is on
- group data and group membership data could be protected using
RCU, for reading, and the group lock, for writing,
sched RCU
no explicit rcu_read_lock/rcu_read_unlock needed
- implicitly uses
preempt_disable/enable as a read section(preemption is already disabled at scheduling time)
still need rcu_dereference
must use synchronize_sched for updates
group lock should not be held as the hierarchy
is evaluating below the group
- other CPUs have to wait when this happens!
- When a group A is being added to or removed from the hierarchy,
the hierarchy may not be identical on every CPU for a brief time while A is added to each runqueue in group B, the group it is being added to.
- the section where all of the runqueues are changed
can be uninterruptable to help ensure it completes quickly
- The gsched_lock mutex will be used to control hierarchy modification
to ensure that multiple groups can’t be added at the same time. Therefore, the hiearchy on different CPUs will be different by at most one group at any time.
tasks may be a different story
- it would be nice to add different tasks on
different CPUs at the saem time since they are only associated with that CPU (or the CPU of the proxy for the task)
HGS debugging
infinite loop prevention
- add a debugging option that prevents an SDF from
unintentionally infinite looping using the framework iterator loop
- create a config option that is an integer that
sets the number of permitted loops
- use 0 for turning off the option
- use a local variable in gsched_run_scheduler as
a loop counter
- output an event when too many loops have
been performed
- return a NULL or Linux value
shell interface
- Allows the hierarchy to be examined and modified
- already exists, just needs to be updated
- code can be re-used by other tools
graphical debugger
- graphically display the hierarchy
- read events generated from a single invocation of the hierarchy
and allow users to graphically step through what happened
- events should mostly already exist
- can leverage shell interface code to allow the hierarchy to be modified
in the debugger
kernel support options
debug SDF
- simple
- easy to implement
- a hierarchy is placed beneath the
debug SDF and the hierarchy it controls is evaluated when the debugger wants to see another invocation
step group parameter
- indicates to the SDF that
the hierarchy should be invoked to give more events to the debugger
- the CPU that the hierarchy
should be invoked on should also be specified in the group command
- SDF is originally in a stop
mode that indicates it is not making decisions
- if the desired CPU is a
different CPU then the current CPU then defer to the desired CPU using the per-cpu standard Linux workqueue
- IPI smp_send_reschedule() is also an option
- SDF enters step mode
- SDF calls schedule()
- SDF enters stop mode to prevent additional
invocations
- SDF outputs an event to
indicate the start of the hierarchy invocation for debugging
- SDF picks the top group of the
hierarchy it controls
- SDF outputs an event to indicate
the end of the invocation in iterator_finalize
invoking a hierarchy that is disjoint from the system hierarchy
in a non-scheduling context
- more complicated and difficult to implement
- doesn’t interfere with system hierarchy
- defer work to the CPU that should
simulate the hierarchy
- output an event to indicate the
start of the debugging invocation
- invoke HGS hierarchy evaluation
functions with runqueue lock held using the hierarchy being debugged
PMGT optimizations
proxy state
- this actually applies more to how HGS handles proxies
- If a proxy is not runnable then there is no reason
to be able to select the tasks that are blocked on the proxy
- These tasks could be dequeued while the proxy is blocked
and enqueued when the proxy is wakes up
- Currently, tasks that are mblocked on a blocked proxy
are rejected at scheduling time
- This would cause more computation to occur when the proxy is changing
state instead of at scheduling time
- I suspect that it is generally accepted that
the scheduler is invoked more often than the state of a task changes.
modifications to CC (speculative)
CC may be too fine grained
- sometimes information cannot be modified when it would be
conveniant too
- dead nodes
- chain walking
there is a lot of locking
- 5 locks are held in some places
performance may be improved if locking is more coarse
grained
- lock operations are not free
- locking still should remain specific
to a blocking chain, so the system as a whole is still fine grained
- fine grained locking is most beneficial when there is a lot
of concurrency
- but most tasks in a blocking chain aren’t doing anything
- everyone eventually has to access
the proxy data structures anyway and only one task can currently do that at a time
collecting waiting relations in a single data
structure located at the proxy to reduce the amount of locking being performed
- create a table of waiting relations located
at the proxy
- could be controlled by the pi_lock
of the proxy but there may be structures that allow for more concurrency
- would probably be a hashed structure
where you could look up the waiting relations involving a mutex
- and potentially the waiting
relations involving a task but I’m not yet sure if that is needed
eliminates the need for locks for each waiting relation(node)
nice for debugging
- can dump a full graph of the blocking chain
this could be fruitless
- it depends on the overhead of the data structure vs the
overhead of the locking
HGS Callback reorganization
- Currently, there are three callbacks( “start”, “stop”, and “move_member”( which
should probably be reduced to two callbacks (“enqueue” and “dequeue”).
- When a member is added to the group, “insert_member” would be called and then
“enqueue” would be called for a particular CPU if the member is selectable
- enqueue would be callled for every CPU for a group
- When a member moves between CPUs then the member would be “dequeued” from the
CPU it is moving from and “enqueued” on the CPU it is moving to.
- When a member is removed from a group then it would first be “dequeued” if it is
selectable and then “remove_member” would be called
- a group would be dequeued from every CPU
- When a member enters a “blocked” state, ie: blocked for reasons other
than a mutex, it would be “dequeued”
When a member leaves the “blocked” state it would be “enqueued”
- blocked -> mblocked never happens
network queues
netif_receive_skb
- This function sends received packets up the network stack.
- Prior to sending packets up the network stack, this
function interacts with Traffic Control to potentially drop packets by consulting a ingress qdisc with associated filters. This interaction has been changed slightly to allow a queueing discipline to indicate to netif_receive_skb that it has enqueued a packet.
This function is split to use a second receive softirq for queueing.
- I think this could really just be per-cpu regular kernel
threads but Niehaus wanted it to be a softirq initially because that is how it was done in the past
- If a packet is enqueued then the function exits.
- If a packet is not enqueued then netif_receive_skb calls
the second half of the original netif_receive_skb.
- So the receive queueing code doesn’t make
much of a difference if the appropriate qdisc/classes/filters are not set up
receive softirq 2
- This code is very similar to the transmit queueing code except
it calls the second half of netif_receive_skb instead of sending a packet to a network device when a packet is dequeued from the ingress qdisc.
optional stopping
- both the transmit and receive 2 softirqs were modified
to allow a qdisc to indicate that it does not want to dequeue a packet at the moment but it may still have packets waiting to be dequeued
- if the qdisc has packets still waiting
to be dequeued then it is scheduled for later processing by the same softirq
sch_hgnf
- This is a qdisc which is smart enough to be used as
both an egress qdisc and an ingress qdisc.
- The concept of a flow translates into two separate
concepts in Traffic Control terms.
- A “class” inside the sch_hgnf qdisc is basically a container for packets.
- A “filter” specifices which packets are directed into which class.
- TC also has a concept of “actions” which
drop packets but those haven’t been utilized at present by this work
- This qdisc uses a default FIFO class for all unclassified traffic.
- This qdisc allows a scheduler to specify which
classes should be dequeued from the next time a softirq asks the qdisc for a packet (atomic flag).
- Allows a scheduler to specify specific
classes to the qdisc and then invoke the softirq that the qdisc is scheduled on to dequeue from those classes.
- A scheduler can also query the qdisc,
to find out if a particular class is empty (atomic flag).
- priorities can also be assigned to the classes to determine
the order in which they are dequeued from
knqm
- A user side library to talk to traffic control.
- Based off of code in the ‘iproute’ package that comes with
most Linux distros.
- iproute contains the ‘tc’ command to talk to Traffic control using shell.
- http://tldp.org/HOWTO/Adv-Routing-HOWTO/
- http://linux-ip.net/articles/Traffic-Control-HOWTO/
- http://blog.edseek.com/~jasonb/articles/traffic_shaping/
- Before use, all network interfaces should be initialized
to use the sch_hgnf qdisc for egreess and ingress
- knqm_init_all()
- It is possible to initialize only one device
using “knqm_init” but that doesn’t seem particularly useful.
API to create/destroy classes for the sch_hgnf ingress/egress qdiscs
- knqm_flow_create/destroy
API to create/destroy filters
knqm_filter_ip_sport
- This function will set up the
appropriate filtering for queueing packets that come from a socket(egress) and go to a socket(ingress)
- It utilizes the “u32” filter in the
kernel to compare the local port of the socket to the source port of the packet(for egress) and the destination port of the packet(for ingress).
- There was some noise about
using CCSM for egress but it seems needlessly complicated to use two separate mechanisms when one mechanism works. Particularly when there are many many more options for filters.
- Each filter has a priority assigned
to it that both identifies the filter and determines the order in which the filters associated with a qdisc will be consulted when a packet must be classified to a class
- this is a standard part of TC
knqm_filter_destroy
- get rid of a filter
Classes/filters for each network device have
to be configured separately because TC is completely disjoint between devices.
Flow SDF
- This SDF has been designed as a utility SDF which
manages network flows on behalf of other SDFs. This allows other SDFs that consider network flows to be simpler and to not replicate code.
- It could eventually be replaced with management
routines and that would probably be more efficient
- Network flows are associated with groups that use
this SDF.
- A flow may be a member of more than one such group.
- In effect, this allows flows to be placed in the hierarchy.
- All of the transmit and receive 2 softirqs must be
placed in each Flow SDF group.
- When the Flow SDF group is invoked, it will mark all
of the flows with data that it is managing as needing to be processed and try to schedule a softirq that corresponds to the qdisc containing one of those flows.
- Other flows that are not marked will
be ignored until there are no longer marked flows
The Flow SDF is not necessary to create network flows.
- It is for controlling how the created flows are managed more intelligently.
Current explicit flow examples do not use the Flow SDF
discovered network flows
discovering network flows involves the following events
- SOCKET = [
ACCEPT, CONNECT_BEGIN, TCP_CONNECT, RELEASE,
] EXIT = [
DO_EXIT,
]
generally easist to initialize all network interfaces to use sch_hgnf
prior to utilizing discovery
- when a socket is created due to an accept or due to
a tcp connection then a “filter” needs to be created for the socket
- a “class” at the traffic control layer may also need
to be created to represent the process(or using whatever other criteria) prior to creating the filter becuase the filter will need to be assocaited with the class
This requires communicating with Traffic Control.
- Traffic control is talked to through netlink sockets.
- The kernel can use sockets internally
- However, this is problematic with the
current active filter implementation.
- Active filters execute inside of RCU
read sections in which the reader is not supposed to block.
- Communication to TC involves memory
allocations that could block and, thus, shouldn’t be used inside an active filter.
- This active filter issue should be
solved anyway becuase discovery in general is interacting with CCSM which also allocates memory.
- So far, this has been ignored,
because it hasn’t caused any (obvious) problems but that seems like an unwise policy to continue with.
- Currently, there is a userspace daemon in
“kusp/subsystems/knqm/tools/knqmd’ which performs userspace discovery by reading directly from RelayFS and communicates to TC using the knqm library.
Unfortunately, this approach also has issues.
- select() doesn’t seem to
work correctly with the RelayFS buffers and so event processing threads for the knqmd are using polling instead of select()
- obviously this is inefficient
To add control
- Each flow that is created should be placed in a
Flow SDF group that is placed in the hierarchy.
There is a tool in ‘kusp/subsystems/discovery/tools/dscvrnctrl’
which wraps the execution of an application by placing the application in an HGS hierarchy and discovery network flows
- currently just utilizes the static priority group for
the application threads
- creates a datastream to record what happens
- knqmd should be running before dscvrnctrl is used
- softirqs, knqmd, DSKI reader threads also glaced
in the hierarchy
open questions
other areas for customizing semantics
- RCU
- waitqueues
- I/O schedulers
current network flows apply to the end points of a network connection
but it is not really clear that it is an end-to-end solution becuase it does not consider the intermediate systems
- IPv6 has some notion of “flows” and it might be interesting to
try and utilize that to deal with intermediate systems
- MPLS is very focused towards the concept of “flows”
as expressed using “labels”