Future Extensions of GSΒΆ

  • HGS use cases

    • group A is added/removed from group B

      • kusp/hgs_core.c

      • static void hgs_add_group(struct gsched_group *group,

        struct ccsm_set *add_set)

      • static void hgs_remove_group(struct gsched_group *group,

        struct gsched_member *member)

      • group A must be added/removed from every CPU list of group B

      • group lock for A and group lock for B are held

      • gsched_lock

    • task A is added/removed from group B

      • kusp/hgs_core.c

      • static void hgs_add_task(struct gsched_group *group,

        struct ccsm_set *add_set)

      • static void hgs_remove_task(struct gsched_group *group,

        struct gsched_member *member)

      • gsched_lock

      • rq lock for task A

      • group lock for B

      • if task A is in the EC scheduling class

        • EC scheduling class must be informed that

        the memberships for task A have changed becuase it may need to stop/start scheduling task A

        • currently

          • if the task is on the rq then

          it is removed from the rq before the membership change and added after the membership change

          • EC examines the

          memberships for task A when it is added to the rq and updates its information as appropriate

      • avatar A, for task A, must be added/removed from the approprite CPU list

      • if task A does not have a proxy

        • avatar A must be added/removed from the list

        associated with the CPU to which task A is assigned

      • if task A does have a proxy

        • avatar A must be added/removed from the list for the CPU that the proxy for A is assigned to
        • avatar A must also be added to the list of avatars

        stored on the PMGT waiter structure for task A

        • currently

          • on addition

            • avatar A is temporarily

            placed on the CPU list that task A is assigned to

            • cannot be selected
            • slightly later, avatar A

            moved to the CPU list of the proxy and assocaited with the PMGT waiter for task A in a context in which the PMGT CC can be accessed easier

            • pmgt_update(task, member);
          • on removal

            • avatar A is removed from

            whichever CPU it is assigned to

            • slightly later, avatar A is removed

            from the PMGT waiter structure for task A

            • pmgt_update(task, member);
            • there is a short period of time between

            when the avatar is removed from the SDF and when the avatar is removed from the PMGT waiter structure

            • avatar A is marked as

            GSCHED_TASK_DESTROYED during this period and it is ignored

    • group parameters are changed

      • kusp/hgs_core.c
      • int gsched_get_group_params(const char *name, void *value, size_t size)
      • int gsched_set_group_params(const char *name, void *value, size_t size)
      • CCSM lookup group name using RCU
      • group reference count
      • for future development it may be possible to move the group level parameters

      into the CCSM set as attributes to take advantage of the RCU control of access to parameter values thuse potentially resolving effects of race conditions

    • member parameters are changed

      • kusp/hgs_core.c
      • int gsched_set_member_params(const char *gname, const char *mname, void *value, size_t size)
      • int gsched_get_member_params(const char *gname, const char *mname, void *value, size_t size)
      • CCSM lookup group name using RCU
      • group reference count
      • group lock is used to lookup member name on HGS maintained

      list of members in the group

      • member reference count
    • a group is created

      • kusp/hgs_core.c
      • int gsched_create_group(const char *group_name, const char *sched_name)
      • static struct gsched_group *__gsched_create_group(const char *name, const char *sdf_name)
      • group structure is initially only visible locally
      • is exposed through CCSM
      • ccsm_mutex
      • doesn’t use gsched_lock since the group is not visible

      until after it has been initialized

      • ie: the rest of HGS doesn’t know about it
    • a group is destroyed

      • kusp/hgs_core.c

      • int gsched_destroy_group(const char *name)

      • static void hgs_group_destroy(struct ccsm_set *set)

      • ccsm_mutex

      • gsched_lock

      • mainly jsut calls destroy_group

        • currently holds the per group lock when calling this function

        to be compatible with SDFs written for earlier implementations

        • this lock can be removed and it should be to allow

        SDFs to perform blocking operatins during destroy

        • all SDFs have to be adjusted to use the group lock

        internally in destroy

    • an SDF is installed

      • kusp/hgs_core.c

      • int gsched_register_scheduler(struct gsched_sdf *sched)

      • gsched_lock

      • reference to the module that the SDF is in

        • module subsystem keeps a reference count ot modules so HGS

        holding a reference to the module means it cannot be unloaded

    • an SDF is removed

      • kusp/hgs_core.c
      • void gsched_unregister_scheduler(struct gsched_sdf *sched)
      • gsched_lock
      • put back the reference to the module that the SDF is in
    • task movement between CPUs

      • kusp/hgs_core.c

      • void gsched_enqueue_task(struct rq *rq, struct task_struct *task, int wakeup)

      • when a task is not on the rq

        • only source rq lock acquired

        • can occur as a result of “exec”

        • can occur immediately before a task is awaken

          • scheduling class decides which CPU the task wakes up on
        • user specified CPU assignment change

        • never deferred to migration thread when not on the rq

      • when a task is on the rq

        • source rq lock and destination rq lock acquired
        • load balancing
        • user specified CPU assignment change
        • deferred (always? sometimes?) to migration thread
      • complications

        • if a task is not m-blocked then the avatars

        for the task must follow the task to its new CPU assignment

        • if a task is m-blocked then the avatars do

        not move

        • if a task is a proxy then the avatars for

        which it is a proxy must follow it to its new CPU assignment

      • currently

        • all of this is handled by the enqueue HGS routine called from enqueue_task
    • a task awakens

      • kusp/hgs_core.c
      • void gsched_dequeue_task(struct rq *rq, struct task_struct *task, int sleep)
      • deal with CPU assignment changes that may be happening

      as a result of awakening

      • if a task has a proxy and it is unexpectedly waking

      up for a signal or timer then the avatars for the task must be moved to the CPU of the task

      • this is an abort of a mutex lock operation
      • tell schedulers that the task’s avatars are

      selectable on the CPU of the task

    • a task blocks

      • kusp/hgs_core.c
      • void gsched_dequeue_task(struct rq *rq, struct task_struct *task, int sleep)
      • tell schedulers that task’s avatars are no longer selectable
      • note that this is an example of where we might

      consider using CCSM to track the proxy and its set of waiters for state change purposes as a way of notifying the scheduling data structures for the waiters that the proxy is blocked rather than finding out the proxy is blocked at scheduling time

    • a task m-blocks

      • kusp/hgs_core.c
      • void gsched_dequeue_task(struct rq *rq, struct task_struct *task, int sleep)
      • move task’s avatars to the CPU of the proxy
      • tell schedulers the avatars are selectable on the

      CPU of the proxy

    • a task tries to acquire an owned mutex

      • kusp/hgs_core.c
      • static void hgs_pmgt_waiter_init(struct pmgt_waiter *waiter, struct task_struct *task)
      • static void hgs_pmgt_move_prepare(struct pmgt_waiter *waiter, struct task_struct *proxy)
      • task’s avatars are marked as searching for a proxy
    • a task finds a proxy

      • kusp/hgs_core.c
      • static void hgs_pmgt_move(struct pmgt_waiter *waiter, struct task_struct *proxy)
      • task’s avtars are moved to the CPU of the proxy
      • task’s avtars are marked as having found a proxy
    • the proxy for a task changes

      • kusp/hgs_core.c

      • static void hgs_pmgt_move_prepare(struct pmgt_waiter *waiter, struct task_struct *proxy)

      • static void hgs_pmgt_move(struct pmgt_waiter *waiter, struct task_struct *proxy)

      • the task is marked as searching

        • the avatars for the task remain on the CPU

        of the old proxy

      • when the new proxy is found

        • the task is then marked as having found a proxy
        • the avatars are moved to the CPU of the new


    • a task is no longer m-blocked

      • kusp/hgs_core.c
      • static void hgs_pmgt_destroy(struct pmgt_waiter *waiter, struct task_struct *proxy)
      • avatars must be moved back to the CPU of the task
  • HGS CC

    • ‘gsched_lock’

      • the global lock for the HGS hierarchy
      • never used at scheduling time because mutexes

      involve blocking and you cannot block at scheudling time

      • also cannot use RT spinlocks
      • can use atomic spinlocks and RCU reading
      • mutex

      • protects list of SDFs

        • static LIST_HEAD(gsched_schedlist);
      • serializes most API calls with a few exceptions

        • get/set_member_params
        • get/set_group_params
        • these exceptions use reference counting
        • other API calls can occur while these

        calls are occuring

        • mostly not an issue since the group’s

        data is protected by the per group lock in all SDF implementations

        • however, it is possible for parameter

        functions to be called on a group that has been destroyed but not yet freed

        • the memory is still valid
        • changes to the memory effectively

        do nothing but care may be needed if more than just writing to variables is taking place

        • worth considering holding the ‘gsched_lock’

        for the parameter functions, though this would increase the constraint on concurrency

    • per group locks

      • group->lock
      • atomic spinlock
      • protect the list of group memberships maintained

      by the framework for each group

      • group->all_members
      • this is not a per-CPU list(and shouldn’t be)
      • it is needed to find member structures by name
      • member structures are not listed in CCSM

      but the things they represent(tasks/groups) are

      • consider making this an RCU list

        • because the RCU macros for list operations

        permit multiple readers and we would reduce concurrency constraint as a result

      • protects the list of groups that a group is a member of

        • group->memberships
        • consider making this an RCU Sched list
        • this isn’t really used anywhere currently
      • protects group and membership data

      • accessed in many contexts

        • scheduling
        • locking time(PMGT callbacks)
        • task state changes(enqueue_task/dequeue_task)
    • runqueue lock

      • atomic spinlock

      • protects the list of groups that a task is a member of

        • task->gsched_memberships

        • off the task structure

        • duplicated in a list off of the PMGT waiter

          • hgs_pmgt_waiter->hgs_task_members

            • include/linux/kusp/hgs_pmgt.h
        • remove duplicate list and read the original list

        off of the task structure using RCU

        • consider making this an RCU Sched list

          • if this list is RCU then we probably

          wont have to duplicate the list on the PMGT waiter and that would be a lot less painful

          • semantics of the current

          duplicate list have the same stale data implications as RCU sched, so this seems feasible

          • this list is only read at

          locking time

      • used internally by HGS to interact with the

      list of tasks that the EC scheduler should schedule

      • scheduling classes rely primarily on runqueue locks to protect data
      • used internally by HGS at locking time to access the

      CPU assignment of a proxy to move avatars to that CPU

      • protects the list of avatars on a proxy that are m-blocked on

      that proxy

      • task->gsched_waiting_members
      • a list of waiting avatars of of the task structure
      • used when a proxy task changes CPUs and the avatars

      that are m-blocked on it need to follow it

      • updated by add/remove member API
      • held by Linux in several calling contexts

        • scheduling
        • task state changes
    • task->pi_lock

      • atomic spinlock
      • the pi_lock of a proxy task controls access to the

      list of avatars stored on every PMGT waiter representing tasks m-blocked on the proxy task

      • pmgt_waiter->hgs_task_members
      • PMGT holds the pi_lock for a proxy when it makes

      calls into HGS that involve that proxy

      • accessed by the add/remove member API to update

      PMGT waiters when avatars are added/removed

  • HGS issues

    • the view of the hierarchy for a CPU should not change whlie

    while the CPU is making a decision

    • the hierarchy should be reasonably synchronized for all CPUs

    • group data may need to change at scheduling time

      • group data is often shared among CPUs
    • holding the group lock for a group as the hierarchy below

    the group is evaluated constrains physical concurrency at scheduling time

  • HGS modifications

    • per CPU locks could protect an entire view of the hierarchy

      • the linux runqueue lock of the cpu or
      • HGS specific per-cpu locks
      • each group should maintain arbitrary runqueues that determine

      which members are selectable on each CPU

      • groups will be listed on every CPU
      • avatars will only be listed on a single CPU
      • HGS will use enqueue/dequeue callbacks to instruct the SDF about

      the CPU that an avatar should be on

      • the tricky part is dealing with some of the CPU assignemtn modifications/proxy scenarios
      • when designing this locking it is instructive to

      look at “task_rq_lock” function which shows how to read a CPU assignment without already holding the CPU runqueue lock

      • locking is repeated until the read before

      the lock operation agrees with a read after the lock operation

      • ie: an HGS version could be created
    • task membership data could be protected by a CPU lock if the

    data is only accessed on the CPU that the task member is on

    • group data and group membership data could be protected using

    RCU, for reading, and the group lock, for writing,

    • sched RCU

      • no explicit rcu_read_lock/rcu_read_unlock needed

        • implicitly uses

        preempt_disable/enable as a read section(preemption is already disabled at scheduling time)

      • still need rcu_dereference

      • must use synchronize_sched for updates

    • group lock should not be held as the hierarchy

    is evaluating below the group

    • other CPUs have to wait when this happens!
    • When a group A is being added to or removed from the hierarchy,

    the hierarchy may not be identical on every CPU for a brief time while A is added to each runqueue in group B, the group it is being added to.

    • the section where all of the runqueues are changed

    can be uninterruptable to help ensure it completes quickly

    • The gsched_lock mutex will be used to control hierarchy modification

    to ensure that multiple groups can’t be added at the same time. Therefore, the hiearchy on different CPUs will be different by at most one group at any time.

    • tasks may be a different story

      • it would be nice to add different tasks on

      different CPUs at the saem time since they are only associated with that CPU (or the CPU of the proxy for the task)

  • HGS debugging

    • infinite loop prevention

      • add a debugging option that prevents an SDF from

      unintentionally infinite looping using the framework iterator loop

      • create a config option that is an integer that

      sets the number of permitted loops

      • use 0 for turning off the option
      • use a local variable in gsched_run_scheduler as

      a loop counter

      • output an event when too many loops have

      been performed

      • return a NULL or Linux value
    • shell interface

      • Allows the hierarchy to be examined and modified
      • already exists, just needs to be updated
      • code can be re-used by other tools
    • graphical debugger

      • graphically display the hierarchy
      • read events generated from a single invocation of the hierarchy

      and allow users to graphically step through what happened

      • events should mostly already exist
      • can leverage shell interface code to allow the hierarchy to be modified

      in the debugger

      • kernel support options

        • debug SDF

          • simple
          • easy to implement
          • a hierarchy is placed beneath the

          debug SDF and the hierarchy it controls is evaluated when the debugger wants to see another invocation

          • step group parameter

            • indicates to the SDF that

            the hierarchy should be invoked to give more events to the debugger

            • the CPU that the hierarchy

            should be invoked on should also be specified in the group command

            • SDF is originally in a stop

            mode that indicates it is not making decisions

            • if the desired CPU is a

            different CPU then the current CPU then defer to the desired CPU using the per-cpu standard Linux workqueue

            • IPI smp_send_reschedule() is also an option
            • SDF enters step mode
            • SDF calls schedule()
            • SDF enters stop mode to prevent additional


            • SDF outputs an event to

            indicate the start of the hierarchy invocation for debugging

            • SDF picks the top group of the

            hierarchy it controls

            • SDF outputs an event to indicate

            the end of the invocation in iterator_finalize

        • invoking a hierarchy that is disjoint from the system hierarchy

        in a non-scheduling context

        • more complicated and difficult to implement
        • doesn’t interfere with system hierarchy
        • defer work to the CPU that should

        simulate the hierarchy

        • output an event to indicate the

        start of the debugging invocation

        • invoke HGS hierarchy evaluation

        functions with runqueue lock held using the hierarchy being debugged

  • PMGT optimizations

    • proxy state

      • this actually applies more to how HGS handles proxies
      • If a proxy is not runnable then there is no reason

      to be able to select the tasks that are blocked on the proxy

      • These tasks could be dequeued while the proxy is blocked

      and enqueued when the proxy is wakes up

      • Currently, tasks that are mblocked on a blocked proxy

      are rejected at scheduling time

      • This would cause more computation to occur when the proxy is changing

      state instead of at scheduling time

      • I suspect that it is generally accepted that

      the scheduler is invoked more often than the state of a task changes.

    • modifications to CC (speculative)

      • CC may be too fine grained

        • sometimes information cannot be modified when it would be

        conveniant too

        • dead nodes
        • chain walking
        • there is a lot of locking

          • 5 locks are held in some places
        • performance may be improved if locking is more coarse


        • lock operations are not free
        • locking still should remain specific

        to a blocking chain, so the system as a whole is still fine grained

        • fine grained locking is most beneficial when there is a lot

        of concurrency

        • but most tasks in a blocking chain aren’t doing anything
        • everyone eventually has to access

        the proxy data structures anyway and only one task can currently do that at a time

      • collecting waiting relations in a single data

      structure located at the proxy to reduce the amount of locking being performed

      • create a table of waiting relations located

      at the proxy

      • could be controlled by the pi_lock

      of the proxy but there may be structures that allow for more concurrency

      • would probably be a hashed structure

      where you could look up the waiting relations involving a mutex

      • and potentially the waiting

      relations involving a task but I’m not yet sure if that is needed

      • eliminates the need for locks for each waiting relation(node)

      • nice for debugging

        • can dump a full graph of the blocking chain
      • this could be fruitless

        • it depends on the overhead of the data structure vs the

        overhead of the locking

  • HGS Callback reorganization

    • Currently, there are three callbacks( “start”, “stop”, and “move_member”( which

      should probably be reduced to two callbacks (“enqueue” and “dequeue”).

    • When a member is added to the group, “insert_member” would be called and then

      “enqueue” would be called for a particular CPU if the member is selectable

      • enqueue would be callled for every CPU for a group
    • When a member moves between CPUs then the member would be “dequeued” from the

      CPU it is moving from and “enqueued” on the CPU it is moving to.

    • When a member is removed from a group then it would first be “dequeued” if it is

      selectable and then “remove_member” would be called

      • a group would be dequeued from every CPU
    • When a member enters a “blocked” state, ie: blocked for reasons other

      than a mutex, it would be “dequeued”

    • When a member leaves the “blocked” state it would be “enqueued”

      • blocked -> mblocked never happens
  • network queues

    • netif_receive_skb

      • This function sends received packets up the network stack.
      • Prior to sending packets up the network stack, this

      function interacts with Traffic Control to potentially drop packets by consulting a ingress qdisc with associated filters. This interaction has been changed slightly to allow a queueing discipline to indicate to netif_receive_skb that it has enqueued a packet.

      • This function is split to use a second receive softirq for queueing.

        • I think this could really just be per-cpu regular kernel

        threads but Niehaus wanted it to be a softirq initially because that is how it was done in the past

        • If a packet is enqueued then the function exits.
        • If a packet is not enqueued then netif_receive_skb calls

        the second half of the original netif_receive_skb.

        • So the receive queueing code doesn’t make

        much of a difference if the appropriate qdisc/classes/filters are not set up

    • receive softirq 2

      • This code is very similar to the transmit queueing code except

      it calls the second half of netif_receive_skb instead of sending a packet to a network device when a packet is dequeued from the ingress qdisc.

    • optional stopping

      • both the transmit and receive 2 softirqs were modified

      to allow a qdisc to indicate that it does not want to dequeue a packet at the moment but it may still have packets waiting to be dequeued

      • if the qdisc has packets still waiting

      to be dequeued then it is scheduled for later processing by the same softirq

    • sch_hgnf

      • This is a qdisc which is smart enough to be used as

      both an egress qdisc and an ingress qdisc.

      • The concept of a flow translates into two separate

      concepts in Traffic Control terms.

      • A “class” inside the sch_hgnf qdisc is basically a container for packets.
      • A “filter” specifices which packets are directed into which class.
      • TC also has a concept of “actions” which

      drop packets but those haven’t been utilized at present by this work

      • This qdisc uses a default FIFO class for all unclassified traffic.
      • This qdisc allows a scheduler to specify which

      classes should be dequeued from the next time a softirq asks the qdisc for a packet (atomic flag).

      • Allows a scheduler to specify specific

      classes to the qdisc and then invoke the softirq that the qdisc is scheduled on to dequeue from those classes.

      • A scheduler can also query the qdisc,

      to find out if a particular class is empty (atomic flag).

      • priorities can also be assigned to the classes to determine

      the order in which they are dequeued from

    • knqm

      • A user side library to talk to traffic control.
      • Based off of code in the ‘iproute’ package that comes with

      most Linux distros.

      • Before use, all network interfaces should be initialized

      to use the sch_hgnf qdisc for egreess and ingress

      • knqm_init_all()
      • It is possible to initialize only one device

      using “knqm_init” but that doesn’t seem particularly useful.

      • API to create/destroy classes for the sch_hgnf ingress/egress qdiscs

        • knqm_flow_create/destroy
      • API to create/destroy filters

        • knqm_filter_ip_sport

          • This function will set up the

          appropriate filtering for queueing packets that come from a socket(egress) and go to a socket(ingress)

          • It utilizes the “u32” filter in the

          kernel to compare the local port of the socket to the source port of the packet(for egress) and the destination port of the packet(for ingress).

          • There was some noise about

          using CCSM for egress but it seems needlessly complicated to use two separate mechanisms when one mechanism works. Particularly when there are many many more options for filters.

          • Each filter has a priority assigned

          to it that both identifies the filter and determines the order in which the filters associated with a qdisc will be consulted when a packet must be classified to a class

          • this is a standard part of TC
        • knqm_filter_destroy

          • get rid of a filter
        • Classes/filters for each network device have

        to be configured separately because TC is completely disjoint between devices.

    • Flow SDF

      • This SDF has been designed as a utility SDF which

      manages network flows on behalf of other SDFs. This allows other SDFs that consider network flows to be simpler and to not replicate code.

      • It could eventually be replaced with management

      routines and that would probably be more efficient

      • Network flows are associated with groups that use

      this SDF.

      • A flow may be a member of more than one such group.
      • In effect, this allows flows to be placed in the hierarchy.
      • All of the transmit and receive 2 softirqs must be

      placed in each Flow SDF group.

      • When the Flow SDF group is invoked, it will mark all

      of the flows with data that it is managing as needing to be processed and try to schedule a softirq that corresponds to the qdisc containing one of those flows.

      • Other flows that are not marked will

      be ignored until there are no longer marked flows

      • The Flow SDF is not necessary to create network flows.

        • It is for controlling how the created flows are managed more intelligently.
      • Current explicit flow examples do not use the Flow SDF

  • discovered network flows

    • discovering network flows involves the following events

      SOCKET = [


      ] EXIT = [



    • generally easist to initialize all network interfaces to use sch_hgnf

    prior to utilizing discovery

    • when a socket is created due to an accept or due to

    a tcp connection then a “filter” needs to be created for the socket

    • a “class” at the traffic control layer may also need

    to be created to represent the process(or using whatever other criteria) prior to creating the filter becuase the filter will need to be assocaited with the class

    • This requires communicating with Traffic Control.

      • Traffic control is talked to through netlink sockets.
      • The kernel can use sockets internally
      • However, this is problematic with the

      current active filter implementation.

      • Active filters execute inside of RCU

      read sections in which the reader is not supposed to block.

      • Communication to TC involves memory

      allocations that could block and, thus, shouldn’t be used inside an active filter.

      • This active filter issue should be

      solved anyway becuase discovery in general is interacting with CCSM which also allocates memory.

      • So far, this has been ignored,

      because it hasn’t caused any (obvious) problems but that seems like an unwise policy to continue with.

      • Currently, there is a userspace daemon in

      “kusp/subsystems/knqm/tools/knqmd’ which performs userspace discovery by reading directly from RelayFS and communicates to TC using the knqm library.

      • Unfortunately, this approach also has issues.

        • select() doesn’t seem to

        work correctly with the RelayFS buffers and so event processing threads for the knqmd are using polling instead of select()

        • obviously this is inefficient
    • To add control

      • Each flow that is created should be placed in a

      Flow SDF group that is placed in the hierarchy.

    • There is a tool in ‘kusp/subsystems/discovery/tools/dscvrnctrl’

    which wraps the execution of an application by placing the application in an HGS hierarchy and discovery network flows

    • currently just utilizes the static priority group for

    the application threads

    • creates a datastream to record what happens
    • knqmd should be running before dscvrnctrl is used
    • softirqs, knqmd, DSKI reader threads also glaced

    in the hierarchy

  • open questions

    • other areas for customizing semantics

      • RCU
      • waitqueues
      • I/O schedulers
    • current network flows apply to the end points of a network connection

    but it is not really clear that it is an end-to-end solution becuase it does not consider the intermediate systems

    • IPv6 has some notion of “flows” and it might be interesting to

    try and utilize that to deal with intermediate systems

    • MPLS is very focused towards the concept of “flows”

    as expressed using “labels”

Previous topic

Implementing Customized Programming Models

This Page