DSPP Filter Library Manual

This document presents the set of commonly used filters in the Data Streams Post-Processing (DSPP) filter library. The filters described here have proved useful in a variety of applications, either directly, to help process a set of data, or indirectly, as a starting point for writing a customized filter that precisely implements semantics specific to a particular situation. The DSPP User Manual describes how to use post-processing in general, while How to Write a Custom Filter describes writing customized filters for particular situations. The DSPP Internals Manual describes the internals of the post-processing framework for those interested in that level of detail, or needing it to modify or extend the framework. Most users will not need that level of detail.

There is also a set of tutorial examples that provides a reasonable introductory sequence:

  1. Simple DSUI: The first KUSP example, explaining instrumentation and analysis of program behavior using events from the user level.
  2. Signal Pipeline DSUI: A more complex instrumentation example, still at the user level.
  3. Signal Pipeline DSKI: Adds the simplest possible set of kernel events to the Signal Pipeline DSUI example to note the execution intervals of threads implementing pipeline stages.
  4. Active Filtering: Adds use of an active filter to the previous example to filter out context-switch events for all threads except those specified as part of the experiment, thus greatly reducing the volume of data that must be processed.

Introduction

Postprocess is a Python-based framework for filtering and generating reports of instrumentation data gathered by DSUI/DSKI. It works by instantiating an arbitrary number of pipelines, each with its own series of filters, through which the raw instrumentation data is directed and processed in various ways, finally producing output in various forms. This document describes the post-processing filter library, containing some commonly used filters grouped into various modules. These modules are located in the $KUSPROOT/datastreams/src/datastreams/postprocess/filters directory.

Note: Filters with (U) are currently untested.

List of filter categories:

Reduction Filters

Graphing Filters

These filters, located in the module graph.py, create graphs of the specified entities.

Discovery Filters

Filters related to discovering the components of a computation by observing a variety of OS and system software events, and to discovering how system resources are used, and by what processes running what programs.

Global Time Line Filters

Filters related to taking data from both the user and OS levels in a set of machines supporting distributed computations. These filters take information from the clock synchronization subsystem and map all events on each machine onto an abstract common global time line, expressed as an interval within which each event occurred.

Output Filters

Network Filters

Filters that provide a glimpse of how networks function.

SDF Filters

Filters related to how processes are scheduled under various SDFs

Cluster Filters

Filters related to analyzing how clusters of computers support distributed computations. Note that these should generally be higher-level analyses of sets of events after they have been mapped onto a common Global Time Line.

Conversion Filters

Raw data collected from application and kernel level instrumentation points usually takes the form of an extremely large number of event records. However, there are other types of data present, so we use the word entity to refer to a data record in a Data Stream binary file. When faced with a raw data file containing a large number of entities, the user often has a number of different analyses in mind, which individually depend on different subsets of the entities in the raw data. The utility module (FIXME.D cross reference) contains many filters that permit gathering a subset of the raw data relevant to a particular analysis. The filters in this module are used when the relevant data has been obtained but is not in the most convenient form for a given analysis.

For many kinds of analyses, separate raw events denote the beginning and ending of some interval of interest. For example, events may mark when processing of a particular kind begins and ends, when the actual data of interest is how long the processing takes. In this instance, it is convenient to convert the pairs of begin and end events into single entities that represent the intervals directly. Such intervals are “synthesized” or “derived” from the raw data, but it is worth noting that Data Streams also supports the generation of interval entities directly. In other cases, the best representation of the data of interest is as a distribution, which people most often represent using a histogram.

This section discusses the three filters that support converting pairs of events into intervals, a stream of events into a histogram, and a stream of intervals into a histogram.

FIXME.Devin: The current filters are fully documented and work as advertised, but internally are not always as efficient as we might like, and so we hope to either modify these filters, or add additional choices in the future.

Event to Histogram

This filter supports creating distributions from event streams where individual data elements are contained within single events: for example, events that record the number of bytes submitted to a write operation or returned by a read operation. Other examples include the size of messages coming into or out of network interfaces, and the size of disk I/O operations in device drivers.

The settings for the histogram include the upper and lower bound and the number of buckets used to represent the distribution. From this information, the bucket size can be calculated. Conceptually, when a given entity flows through this filter, its data is examined to see into which bucket it falls, and the corresponding bucket is incremented.
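
This per-entity bucket selection can be sketched in a few lines of Python; the function below is a minimal illustration of the arithmetic described above, with hypothetical names, not the filter's actual code:

    def bucket_index(value, lowerbound, upperbound, buckets):
        """Return the index of the bucket into which value falls."""
        bucket_size = (upperbound - lowerbound) / buckets
        index = int((value - lowerbound) // bucket_size)
        # Clamp boundary values into the valid range [0, buckets - 1].
        return max(0, min(buckets - 1, index))

    counts = [0] * 10
    for value in (3.2, 7.7, 9.99):
        counts[bucket_index(value, 0.0, 10.0, 10)] += 1  # increments buckets 3, 7, 9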

In its current implementation, the raw data is accumulated in a list, and only when the entity stream is complete is that list inserted into the histogram and a histogram entity produced. This seems to be required for two reasons: (1) if the user does not know the upper and lower bounds of the data, or is unwilling to specify them ahead of time, the entire data set must be accumulated to determine the bounds and thus configure the histogram; and (2) many histogram APIs in graphing packages assume that a vector of raw data will be submitted, and this filter was likely written in imitation of them.

Future development should probably include a variant on this filter that can take lower and upper bound specifications and insert data directly into the histogram, without internally creating a copy of very large data sets.

Module: dspp_conversion

Name: event_to_histogram

Parameters: event, histogram, data, lowerbound, upperbound, buckets, consume

  1. event (string): Family/entity name of event to convert.
  2. histogram (string): Family/entity name of histogram to output.
  3. data: FIXME.Devin: Confusing parameter.
  4. lowerbound (real): Lower bound of histogram. Leave blank to auto-compute.
  5. upperbound (real): Upper bound of histogram. Leave blank to auto-compute.
  6. buckets (integer): Number of buckets in histogram
  7. consume (boolean): Whether to delete matching entities after processing.

Event to Interval

This filter processes event data logged during program execution into interval entities. This is particularly useful when measuring the duration of time between two points in a program’s execution: for example, the duration of a loop, or the execution interval of a context switch. The resulting stream of intervals can then be converted to a histogram of durations, and then graphed.
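
The pairing logic amounts to remembering unmatched start events until their end events arrive. The sketch below illustrates the tag-matching idea with hypothetical event tuples; it is not the filter's actual implementation:

    def events_to_intervals(events, start_name, end_name):
        """Pair start/end events by tag, yielding (tag, start, duration)."""
        pending = {}  # tag -> timestamp of an unmatched start event
        for name, tag, timestamp in events:
            if name == start_name:
                pending[tag] = timestamp
            elif name == end_name and tag in pending:
                start = pending.pop(tag)
                yield (tag, start, timestamp - start)

    stream = [("LOOP/START", 1, 100), ("LOOP/START", 2, 105),
              ("LOOP/END", 1, 150), ("LOOP/END", 2, 170)]
    print(list(events_to_intervals(stream, "LOOP/START", "LOOP/END")))
    # [(1, 100, 50), (2, 105, 65)]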

Module: dspp_conversion

Name: event_to_interval

Parameters: start_event, start_machine, end_machine, end_event, interval, consume, tag_match, ignore_missing

  1. start_event (string): Family/event name of start event to convert.
  2. start_machine (string): Machine the start event occurred on.
  3. end_machine (string): Machine the end event occurred on.
  4. end_event (string): Family/event name of end event to convert.
  5. interval (string): Family/event name of interval to output.
  6. consume (boolean): Whether to delete matching entities after processing.
  7. tag_match (boolean): If True, reject start/end pairs whose tags do not match. If False, pair events without regard to their tags.
  8. ignore_missing (boolean): If True, do not warn about missing start/end events. If False, give a warning.

DSKI Event to Interval

In order to match start and end events, the normal event-to-interval filter (above) uses the tag values. However, some events do not use the tag value, so that filter simply will not work for them. In particular, DSKI events use the PID, so this filter matches start and end events by their PID values instead.

Module: dspp_conversion

Name: dski_event_to_interval

Parameters: start_event, end_event, interval, consume

  1. start_event (string): Family/event name of start event to convert.
  2. end_event (string): Family/event name of end event to convert.
  3. interval (string): Family/event name of interval to output.
  4. consume (boolean): Whether to delete matching entities after processing.

Interval to Histogram

This filter creates a histogram representing the distribution of durations of the specified interval entities. The result can then be graphed using one of the graphing filters for a useful summary of the data set.
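
The conversion reduces to collecting interval durations and inserting them into a histogram. The sketch below, using hypothetical data, also shows why auto-computing the bounds requires the full list of durations first:

    # Durations extracted from interval entities, in the chosen units.
    durations = [end - start for start, end in [(100, 150), (105, 170), (200, 230)]]
    # Auto-computed bounds need the whole data set before bucketing can begin.
    lower, upper, buckets = min(durations), max(durations), 8
    width = (upper - lower) / float(buckets)
    counts = [0] * buckets
    for d in durations:
        counts[min(buckets - 1, int((d - lower) // width))] += 1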

Module: dspp_conversion

Name: interval_to_histogram

Parameters: interval, histogram, lowerbound, upperbound, buckets, units, consume

  1. interval (string): Family/entity name of interval to convert.
  2. histogram (string): Family/entity name of generated histogram.
  3. lowerbound (real): Lower bound of histogram. Leave blank to auto-compute.
  4. upperbound (real): Upper bound of histogram. Leave blank to auto-compute.
  5. buckets (integer): Number of buckets in histogram.
  6. units (string): Interval time coordinates to use.
  7. consume (boolean): Whether to delete intervals after processing.

Input/Output Filters

By their nature, filters constantly change the data set you are working on. Sometimes it can be useful to store the state of the data set for later use. This is achieved with these pickling filters, which can either save the current state of a data set or retrieve a previous state.

The state is pickled to a file specified by the user, which can then be unpickled at a later stage in postprocessing.

There are various protocols that can be used for pickling, specified by the user. Protocol 0 is ASCII; protocol 1 is the old binary format; protocol 2 provides more efficient pickling of new-style classes, such as list and dict. The default value is 0, and if a negative value is supplied, the highest available protocol version is chosen.
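
In Python terms this corresponds to the standard pickle module; the following is a minimal sketch of saving and restoring a list of entities, with a hypothetical file name and entity layout:

    import pickle

    entities = [{"family": "LOOP", "event": "START", "tag": 1}]

    # Protocol 0 is ASCII; a negative value such as -1 selects the highest
    # protocol available.
    with open("dataset.pkl", "wb") as f:
        pickle.dump(entities, f, 0)

    with open("dataset.pkl", "rb") as f:
        restored = pickle.load(f)
    assert restored == entities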

In its current implementation, the unpickle filter works correctly, but not as elegantly as it could. The best option would be to remove the unpickle filter completely and modify the head filter to allow an option specifying an input file as pickled, rather than C binary.

It should also be noted that the process is currently relatively slow, because the namespace for the entities is not cleared before pickling. However, unless it is possible to restore the namespace after unpickling, it is unclear how clearing it beforehand would be realistic.

FIXME.Devin: Tentative name. We want to group the pickle and unpickle together.

FIXME.D: Pickling does not appear to work currently. Needs a revamp.

Pickle

FIXME.Devin: Should I add a boolean parameter: whether or not to pass on the data as well as pickling it?

This filter pickles the data set. This is useful when you want to store the current state of the data set for later postprocessing. It can then be unpickled using the unpickle filter.

Module: dspp_input_output

Name: pickle

Parameters: protocol_version, filename

  1. protocol_version (integer): Protocol version for pickling the data set.
  2. filename (string): File name to pickle data set in.

FIXME.D: The unpickle filter may not be necessary anymore (we can specify pickled files as input in the head filter)

Unpickle

This filter unpickles a data set. This is used when you had previously pickled a data set, and want to retrieve its previous state. Every entity in the pickled file is then passed on sequentially along the pipeline.

Module: dspp_input_output

Name: unpickle

Parameters: filename

  1. filename (string): File name to unpickle data set from.

Sanity Filters

Sometimes, through errors originating elsewhere, the data set can be malformed. The sanity filters check the data set for common errors, such as holes and out-of-order entities. Ordering issues can be fixed with a sort filter, but holes cannot be patched.

Trouble Maker

This filter’s only purpose is to test the actual sanity filters. In its current implementation it is rather simplistic: it randomly deletes and re-orders events in the datastream.

Module: dspp_sanity

Name: trouble_maker

Parameters: None

Error Detection

This filter is the main sanity filter. It checks whether the data set has any problems, such as holes or out-of-order entities, and is a useful sanity check before running any other filters on the data set. However, it is important to note that “ns” cannot be used as input for order_key unless the filter TSC_conversion in the module dspp_utility has been run first. If this filter detects any ordering problems, one of the sort filters should be run.
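
The two checks are conceptually simple; the following is a hypothetical sketch of detecting sequence-number holes and ordering violations on a chosen time key:

    def check_stream(entities, order_key="tsc"):
        """Report sequence-number holes and out-of-order time values."""
        problems = []
        prev_seq = prev_time = None
        for e in entities:  # each entity: a dict with 'seq' and time values
            if prev_seq is not None and e["seq"] != prev_seq + 1:
                problems.append("hole between seq %d and %d" % (prev_seq, e["seq"]))
            if prev_time is not None and e[order_key] < prev_time:
                problems.append("out of order at seq %d" % e["seq"])
            prev_seq, prev_time = e["seq"], e[order_key]
        return problems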

Module: dspp_sanity

Name: error_detect

Parameters: hole, order, order_key, output

  1. hole (boolean): If True, it will check for holes. If False, it will not.
  2. order (boolean): If True, it will check for order. If False, it will not.
  3. order_key (string): Time coordinates to determine order by.
  4. output (string): Output file to print error information. If none is provided, it will be printed in the terminal.

Hole and Order

FIXME.Devin: Most likely redundant.

These were individual filters that served, separately, the same purposes that error_detect now serves combined. They can most likely be deleted.

Sort Filters

Intro

Sort by Time

This filter sorts the data set by the log time of the entities. This is useful when, after passing the data set through the sanity filters, you discover that entities are out of order.
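
A sketch of the operation, assuming each entity carries a dictionary of time values keyed by coordinate name (the field names are hypothetical):

    entities = [{"seq": 2, "time": {"tsc": 900}},
                {"seq": 1, "time": {"tsc": 400}}]
    # sorted() is stable, so entities with equal timestamps keep their
    # relative order.
    by_time = sorted(entities, key=lambda e: e["time"]["tsc"])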

Module: dspp_sort

Name: sort_time

Parameters: sort_key

  1. sort_key (string): Time coordinates to sort on.

Sort by Sequence Number

This filter sorts the data set by the sequence number of the entities. This is useful when, after passing the data set through the sanity filters, you discover that entities are out of order.

Module: dspp_sort

Name: sort_time

Parameters: None

Utility Filters

Intro

Filter by Events

This filter is useful when you have many events in a data set, but you are interested only in a subset of these events. It has two modes of use: discard and retain. Discard mode is useful when you can make a list of events you know you do not want, and whose elimination would simplify the data set. Retain mode is useful when it is convenient to explicitly list all the events necessary for a given analysis.
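
Both modes reduce to a membership test against the listed names; the following is a minimal sketch with hypothetical entity fields:

    def filter_by_events(entities, names, discard):
        """Drop or keep entities whose 'Family/Event' name is listed."""
        listed = set(names)
        if discard:
            return [e for e in entities if e["name"] not in listed]
        return [e for e in entities if e["name"] in listed]

    stream = [{"name": "LOOP/START"}, {"name": "NOISE/TICK"}]
    print(filter_by_events(stream, ["NOISE/TICK"], discard=True))
    # [{'name': 'LOOP/START'}]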

Module: dspp_utility

Name: filter_by_events

Parameters: events, discard

  1. events (list): List of event names.
  • Format: [“Family Name1/Event Name1”, “Family Name2/Event Name2”, ...]
  2. discard (boolean): If True, the filter discards all instances of the listed events. If False, the filter retains all instances of the listed events, discarding all others.

Filter by Tag

This filter is useful when you have many events in a data set, but you are interested only in a subset of these events with certain tag numbers. It has two modes of use: discard and retain. Discard mode is useful when you are uninterested in events with certain tag numbers. Retain mode is useful when you are only interested in events with certain tag numbers. For example, if you have labelled events of interest with a particular tag number, you can retain only those events with this filter.

Module: dspp_utility

Name: filter_by_tag

Parameters: tag_values, discard

  1. tag_values (list): List of tag values.
  2. discard (boolean): If True, the filter discards all events with the given tag values. If False, the filter retains all events with the given tag values.

Filter by Time

This filter is useful when you have many events in a data set, but you are interested only in a subset of these events that occur at certain times. It has two modes of use: discard and retain. Discard mode is useful when you are uninterested in events that occur at certain times. For example, if you have errors or data drops that make the data set incomplete, you may discard those time periods. Retain mode is useful when you are only interested in events that occur at certain times. For example, if other analysis of the data set has identified time intervals of interest, you may now only retain those time periods in another pipeline.
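
The underlying test is interval membership on the chosen time key; a minimal hypothetical sketch:

    def in_any_interval(t, intervals):
        """True if time t falls within any [start, end] pair."""
        return any(start <= t <= end for start, end in intervals)

    intervals = [[100, 200], [500, 600]]
    events = [{"ns": 150}, {"ns": 300}, {"ns": 550}]
    retained = [e for e in events if in_any_interval(e["ns"], intervals)]
    # With discard=True the condition would be negated instead.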

Module: dspp_utility

Name: filter_by_time

Parameters: time_intervals, time_key, discard

  1. time_intervals (list): List of time intervals.
  • Format: [[start1, end1], [start2, end2], ...]
  • Start and end times are values consistent with the time_key chosen.
  2. time_key (string): Time coordinate to use for filtering.
  3. discard (boolean): If True, the filter discards all events inside the listed time intervals. If False, the filter retains all events within the listed time intervals.

Filter between Events

This filter is useful when you have many events in a data set, but you are interested only in a subset of these events that are bounded by particular events. It has two modes of use: discard and retain. Discard mode is useful when you are uninterested in events between these two events, while retain mode is useful when you are interested in the events between the start and end events.

Module: dspp_utility

Name: filter_btwn_events

Parameters: start_event, end_event, discard

  1. start_event (string): Start event to filter on.
  2. end_event (string): End event to filter on.
  3. discard (boolean): If True, the filter discards all events between the given events. If False, the filter retains all events between the given events, discarding all others.

Filter during an Interval

This filter is useful when you have many events in a data set, but you are interested only in a subset of these events that are bounded by a particular interval. It has two modes of use: discard and retain. Discard mode is useful when you are uninterested in events that occur during the specified interval. Retain mode is useful when you are interested in events that occur during the specified interval.

Module: dspp_utility

Name: filter_by_interval

Parameters: interval, discard

  1. interval (string): Interval to filter on.
  2. discard (boolean): If True, the filter discards all events occurring during the given interval. If False, the filter retains all events occurring during the given interval, discarding all others.

Filter by Machine

This filter is useful when you have many events in a data set, but you are interested only in a subset of these events which occur on a certain machine. It has two modes of use: discard and retain. Discard mode is useful when you are uninterested in events that occur on a particular machine, while retain mode is useful when you are interested only in events that occur on the specified machine.

Module: dspp_utility

Name: filter_by_machine

Parameters: machine, discard

  1. machine (string): Machine to filter on.
  2. discard (boolean): If True, the filter discards all events that occurred on the given machine. If False, the filter retains all events that occurred on the given machine.

Sink

This filter is useful when you want to halt the datastream. It will not send any more data further along the pipeline. FIXME.Devin: This is tentative. For example, if the pipeline is incomplete further along...

Module: dspp_utility

Name: sink

Parameters: None

Null

This filter is useful when you do not want to do anything to the data set. It will pass on the exact same data it receives. FIXME.D: WHY IS THIS GOOD?

Module: dspp_utility

Name: null

Parameters: None

Timestamp Conversion

FIXME.D: It may be useful to have a filter which converts absolute time to elapsed time since beginning of a recording interval.

This filter converts TSC (timestamp counter) to nanoseconds. However, TSC remains a part of the data set’s dictionary of time values. This filter achieves the conversion by using data from clock administrative events. It will keep track of TSC-to-nanosecond correspondence for multiple machines. This filter must be used before using any other filter which specifies “ns” as an input for a time coordinate.
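
Conceptually, the conversion is a linear mapping anchored by the clock correspondence recorded in the administrative events; the following is a minimal sketch with hypothetical values:

    def tsc_to_ns(tsc, tsc_base, ns_base, ns_per_tick):
        """Convert a raw TSC value to nanoseconds using one clock sample."""
        return ns_base + (tsc - tsc_base) * ns_per_tick

    # Suppose an administrative event recorded TSC 1000 at 0 ns, with
    # 0.5 ns per TSC tick on this machine.
    print(tsc_to_ns(5000, 1000, 0, 0.5))  # 2000.0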

Module: dspp_utility

Name: TSC_conversion

Parameters: redo_clock, consume

  1. redo_clock (boolean): Whether to recompute timestamps for entities that already have a nanosecond value.
  2. consume (boolean): Whether to consume clock administrative events.

Narrate

This filter outputs the whole data set to a file or to the terminal. It prints out all relevant data of each entity, in the format specified by the parameters. This is useful to get a quick snapshot of the current state of the data set.

Module: dspp_utility

Name: narrate

FIXME.Devin: I am not sure if all these parameters are necessary, and need to fix histogram output.

Parameters: output, divisor, line_every_n_us, print_extra_data, print_description, ignore_time, absolute_time, show_admin

  1. output (string): Output file to send narration text to.
  2. divisor (real): Value to divide converted timestamps by.
  3. line_every_n_us (integer): Place a line every n microseconds.
  4. print_extra_data (boolean): Print extra data associated with events.
  5. print_description (boolean): Print descriptions associated with events.
  6. ignore_time (boolean): Ignore all time-related operations.
  7. absolute_time (boolean): Format timestamps as time-of-day.
  8. show_admin (boolean): Show admin events.

Count Events

This filter counts the number of times specified events occur. This filter is useful when the only relevant information is the number of occurrences of particular events. If the list of events given to this filter is empty, it will count the total number of entities in the data stream.
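
A minimal sketch of the counting behavior, including the empty-list case (entity fields hypothetical):

    from collections import Counter

    def count_events(entities, names):
        """Count listed events; with no names, count all entities."""
        if not names:
            return {"total": len(entities)}
        wanted = set(names)
        return Counter(e["name"] for e in entities if e["name"] in wanted)

    stream = [{"name": "LOOP/START"}, {"name": "LOOP/START"}, {"name": "LOOP/END"}]
    print(count_events(stream, ["LOOP/START"]))  # Counter({'LOOP/START': 2})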

Module: dspp_utility

Name: count

Parameters: events, output

  1. events (list): List of events to count.
  • Format: [“Family Name1/Event Name1”, “Family Name2/Event Name2”, ...]
  2. output (string): Output file to print information to. If none is given, it is printed in the terminal.

Split Output

Intro

Module: dspp_utility

Name: split_outputs

Parameters: outputs

  1. outputs (list): List of outputs to send data to.
