ResiliNets:
Multilevel Resilient and Survivable Networking Initiative

James P.G. Sterbenz and David Hutchison
The University of Kansas (US) and Lancaster University (UK)

The resilient and survivable networking initiative (ResiliNets) is investigating the architecture, protocols, and mechanisms to provide resilient, survivable, and disruption-tolerant networks, services, and applications.

[Figure: ResiliNets model]

Scope and Definition

Resilience is the ability of the network to provide and maintain an acceptable level of service in the face of various challenges to normal operation:

Resilient networks aim to provide acceptable service to applications:

Resilient network services must:

Resilient networks are engineered and have emergent behaviour to:

Note that while attack detection is an important endeavor, it is in some sense futile, since a sufficiently sophisticated distributed denial of service (DDoS) attack is indistinguishable from legitimate traffic. Thus traffic anomaly detection that attempts to detect and resist DDoS attacks simply raises, incrementally, the bar over which crackers must pass. Since both attacks and legitimate overload adversely affect servers and cross traffic and exhaust network resources, our goal is resilience regardless of whether or not an attack is occurring.

We are exploiting new architectures, algorithms, and protocols, as well as techniques in programmable, active, and cognitive networking, to achieve these goals. Three key themes are knobs and dials, adaptive composable protocol mechanisms, and intelligent resource tradeoffs.

  1. Knobs and dials provide influence downward and instrumentation upward, respectively, between the layers in the form of vertical control loops. Thus, we believe in the benefits of layers as applied to network structure and role (physical/link: hop-by-hop, network: path, transport: end-to-end, application), but also in softening the boundaries and providing cross-layer optimisations. Knobs and dials are also necessary between the data, control, and management planes.
  2. Context-aware, adaptive, and composable protocol mechanisms understand the current environment and the characteristics of the layers below (via cross-layer dials), and apply the appropriate mechanisms to achieve resilience and survivability at each protocol layer. It is essential to keep mechanisms logically distinct for correct operation, for example by discriminating between congestion (throttle), corruption (retransmit), and delay (wait).
  3. Resource tradeoffs consist of properly understanding and trading resources (and constraints) against one another. The resources include processing, memory, bandwidth, energy, and latency.
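
The first two themes can be illustrated with a minimal sketch of one vertical control loop. It is not the ResiliNets design: the dial fields (`queue_occupancy`, `bit_error_rate`), the single knob (`fec_enabled`), the thresholds, and all function names are hypothetical, chosen only to show dials read upward, a knob set downward, and the three loss causes kept logically distinct.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    THROTTLE = "throttle"      # response to congestion
    RETRANSMIT = "retransmit"  # response to corruption
    WAIT = "wait"              # response to delay


@dataclass
class LinkDials:
    """Dial readings exported upward by a lower layer (hypothetical fields)."""
    queue_occupancy: float  # fraction of the queue in use, 0.0 to 1.0
    bit_error_rate: float   # observed fraction of corrupted bits


@dataclass
class TransportKnobs:
    """Knob settings pushed downward by the transport layer (hypothetical)."""
    fec_enabled: bool = False


def classify_loss(dials: LinkDials,
                  congestion_threshold: float = 0.9,
                  corruption_threshold: float = 1e-5) -> Action:
    """Choose a distinct mechanism for a missing acknowledgement.

    The thresholds are illustrative assumptions, not measured values.
    """
    if dials.queue_occupancy >= congestion_threshold:
        return Action.THROTTLE
    if dials.bit_error_rate >= corruption_threshold:
        return Action.RETRANSMIT
    return Action.WAIT


def control_loop_step(dials: LinkDials, knobs: TransportKnobs) -> Action:
    """One pass of a vertical control loop: read dials, decide, set knobs."""
    action = classify_loss(dials)
    # Request forward error correction only while the channel corrupts data.
    knobs.fec_enabled = action is Action.RETRANSMIT
    return action
```

Without the dials, a transport protocol sees only a missing acknowledgement and must guess the cause; with them, a congested queue leads to throttling while a noisy link leads to retransmission rather than an unnecessary rate reduction.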

Relationship of resilience to survivability and disruption tolerance

The primary difference between our definition of resilience and those of survivability and disruption tolerance is that resilient networks are engineered to tolerate legitimate but unpredictably high traffic loads (such as flash crowds), while maximising the service provided to other users of the network, as well as being resistant to attack.

Survivability is the capability of a system to fulfill its mission in a timely manner, even in the presence of attacks or failures [CMU SEI], including large scale natural disasters.

Disruption tolerance is the ability of end-to-end applications to operate even when network connectivity is weak, episodic, or asymmetric and the network is unable to provide stable end-to-end paths.

Thus survivability and disruption tolerance are necessary but not sufficient for resilience.

Relationship of resilience to fault tolerance

Fault tolerance is the ability of a system or component to continue normal operation despite the presence of hardware or software faults [IEEE].

Fault tolerant systems are generally engineered only to tolerate isolated random natural failures. Thus, fault tolerance is necessary but not sufficient for survivability (and therefore resilience). We do believe that we can learn from past work in fault tolerance, particularly by extending work in design methodology and metrics.

Multi-Level Resilience and Survivability

We believe that it is essential to solve the problem of resilience on all levels, both from a network architectural perspective as well as from a protocol layering and plane viewpoint. Starting from the bottom-up, each level is made as resilient as practical (understanding cost and resource tradeoffs). Higher levels are themselves organised into resilient structures using the resilient lower-level building blocks.

Network architecture view

From a network architecture perspective, auto-configured fault tolerant components are self-organised into resilient network structures.

Protocol layer view

From a protocol layer perspective, it is essential to make each layer, in a bottom-up manner, as resilient and survivable as practical, given economic and policy constraints. In every case this is a necessary but not sufficient condition for resilience at the layer above. Traditional research has emphasised the lower layers (physical and link); we believe that new emphasis must be placed on the network and transport layers, as well as on services and applications.
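
The bottom-up dependency can be sketched as follows. This is only an illustration under assumed names: the `LAYERS` list and the `resilient_layers` function are hypothetical, and each layer's own resilience is reduced to a single boolean supplied by the caller.

```python
from typing import Dict, List

# Layers ordered bottom-up (an assumed, simplified stack).
LAYERS: List[str] = ["physical", "link", "network", "transport", "application"]


def resilient_layers(own_resilience: Dict[str, bool]) -> List[str]:
    """Return the layers, bottom-up, that can be considered resilient.

    A layer counts only if its own mechanisms are resilient and every
    layer below it also counts: resilience below is necessary for the
    layer above, but does not by itself guarantee it.
    """
    result: List[str] = []
    for layer in LAYERS:
        if not own_resilience.get(layer, False):
            break  # nothing above a non-resilient layer can be resilient
        result.append(layer)
    return result
```

For example, if only the network layer lacks resilience, the function returns just the physical and link layers, however well the transport and application layers are engineered.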

Protocol plane view

From a protocol plane perspective, it is necessary that data, control, and management planes each be resilient, as well as their interactions and collective behaviour.

ResiliNets Strategy

Resilient and survivable networking depends on a strategy of layers of resistance (D2R2): defence / defense, detection, remediation, and recovery.

Defense

It is first essential that the network architecture, protocols, and service mechanisms be as resistant as possible to attack and to the effects of large-scale natural disasters and environmental challenges. For example, secure network infrastructure protocols are less likely to be compromised; spatial diversity lowers the impact when part of the infrastructure is attacked.

Automatic Detection

Even though a network is resistant to attacks and challenges, we must assume that they will occur. Therefore, a resilient survivable network must be context-aware and automatically detect when it is threatened or under attack.

Adaptive Remediation

Once the network has detected a challenge, compromise, or attack, it must adapt its topology and behavior to remediate the effects and minimise, as much as possible, the impact on the rest of the network and its users.

Autonomic Recovery

As a particular attack ends or new infrastructure is deployed after a natural disaster, the mitigation can end and the network must autonomically self-organise and self-repair back to normal operation.
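
One way to picture how the four parts of the strategy fit together is as a small state machine. The sketch below is an assumption made for illustration, not the ResiliNets mechanism: the `Phase` names, the two boolean inputs, and the transition rules are all hypothetical, and detection is modelled simply as an input signal that runs in every phase.

```python
from enum import Enum, auto
from typing import Iterable, List, Tuple


class Phase(Enum):
    DEFEND = auto()     # normal operation behind passive defences
    REMEDIATE = auto()  # challenge detected, mitigation active
    RECOVER = auto()    # challenge over, restoring normal operation


def next_phase(phase: Phase, challenge_detected: bool, repaired: bool) -> Phase:
    """Advance one step; detection can pre-empt any phase."""
    if challenge_detected:
        return Phase.REMEDIATE
    if phase is Phase.REMEDIATE:
        return Phase.RECOVER  # the challenge has ended, begin recovery
    if phase is Phase.RECOVER:
        return Phase.DEFEND if repaired else Phase.RECOVER
    return Phase.DEFEND


def run(observations: Iterable[Tuple[bool, bool]]) -> List[Phase]:
    """Apply a sequence of (challenge_detected, repaired) observations."""
    phase = Phase.DEFEND
    history: List[Phase] = []
    for challenge_detected, repaired in observations:
        phase = next_phase(phase, challenge_detected, repaired)
        history.append(phase)
    return history
```

A new challenge that arrives during recovery sends the network straight back to remediation, which reflects the assumption stated above that challenges will occur even in a well-defended network.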

Research Projects and Activities

Details on the activities in the ResiliNets initiative are described in the ResiliNets Wiki.


Last updated 15 August 2006
©2003–2006 James P.G. Sterbenz <jpgs@ittc.ku.edu> <jpgs@comp.lancs.ac.uk> and David Hutchison <dh@comp.lancs.ac.uk>