StreamInsight Resiliency

Article
07/28/2016

When a system outage interrupts the processing of events by a StreamInsight application, you typically have requirements for the quality and timeliness of the application's output after recovery:

You want the content of output streams to match what it would have been if no outage had occurred.
You want the duration of the outage to be as brief as possible.

The Premium edition of StreamInsight provides a checkpointing feature that can periodically save the state of queries to disk. You can use this feature along with related features of sources and sinks to achieve equivalence of the output stream after recovery from an outage. Because properly written sources replay only the events that were missed since the last captured checkpoint, the time spent in recovery is kept to a minimum.

Checkpoints

A StreamInsight checkpoint operation persists the state of a query to disk in a consistent manner. After an outage, the query can then be restored to its state as of the checkpoint.
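
The idea of persisting state "in a consistent manner" can be illustrated with a small, generic sketch. This is not how StreamInsight implements checkpoints internally, and none of the names below (`save_checkpoint`, `load_checkpoint`, `checkpoint.json`) come from the product; they are hypothetical. The sketch assumes the query state fits in a JSON-serializable dictionary and uses a write-then-rename pattern so that a crash never leaves a half-written checkpoint on disk.

```python
import json
import os
import tempfile


def save_checkpoint(state, directory):
    """Write `state` to <directory>/checkpoint.json without exposing partial writes."""
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    final_path = os.path.join(directory, "checkpoint.json")
    # Replacing the old file in one step keeps the previous checkpoint
    # readable until the new one is complete.
    os.replace(tmp_path, final_path)
    return final_path


def load_checkpoint(directory):
    """Return the last saved state, or None if no checkpoint exists yet."""
    path = os.path.join(directory, "checkpoint.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

After an outage, a process that calls `load_checkpoint` sees either the previous complete checkpoint or the new complete checkpoint, never a mixture of the two.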

Checkpointing alone does not guarantee that the stream of events produced by a query in the absence of an outage will be equivalent to the stream of events produced if an outage occurs. Two problems can affect the equivalence:

Events can be missed. The events received by StreamInsight after the last checkpoint, and the events that occurred between an outage and recovery, are not captured by a checkpoint. These events must be presented to the server again to be included in query output. Solving this problem requires the participation of sources that are capable of replaying the missed events.
Events can be duplicated. The events produced by StreamInsight after the last checkpoint before the outage will be produced again during recovery from the outage when sources replay events as expected. Solving this problem requires the participation of sinks that are capable of eliminating these duplicate events.

The checkpoint log is the set of files that contain the persisted checkpoint information. The log is saved to a directory that you specify when you configure the server for resiliency. This directory should be reserved for use only by StreamInsight and should not be modified. In addition, it must be unique for each StreamInsight instance.

Three Levels of Resiliency

There are three levels of resiliency that you can achieve with StreamInsight. Selecting a level depends on your requirements and your ability to change existing applications, sources, and sinks.

State retention. You can use checkpoints to save the state of queries without making any changes to sources or sinks. This level of resiliency does not guarantee that the output stream after recovery is equivalent to the stream that would have been produced without an outage, because the events that arrived after the last captured checkpoint and the events that arrived during the outage are lost. However, this may be acceptable in situations where equivalent results are not needed, and where approximately correct output can be achieved with partial input.
Complete output. You can guarantee that no events will be missed by changing sources so that they can replay events. The output stream from a recovered query will be logically equivalent to a superset of the output stream from an uninterrupted query, and the additional events will be duplicates of events in the uninterrupted stream.
Equivalent output. You can guarantee logically equivalent output by changing sources and also changing sinks to eliminate duplicate events.
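
The difference between the three levels can be made concrete with a toy simulation. The sketch below does not use any StreamInsight API. It assumes a hypothetical query that emits a running total for each event, a checkpoint taken after the event at time 3, an outage after the output for time 5 was delivered, one event (time 6) arriving during the outage, and one event (time 7) arriving after recovery. All names are illustrative.

```python
def run_query(events, state=0):
    """Toy query: emit (time, running_total) per event; return (outputs, final_state)."""
    outputs = []
    for time, value in events:
        state += value
        outputs.append((time, state))
    return outputs, state


events = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60), (7, 70)]

# Reference: what the output would be with no outage at all.
expected, _ = run_query(events)

# Split the input according to the assumed outage scenario.
before_checkpoint = events[:3]   # times 1-3
before_outage = events[3:5]      # times 4-5, processed but not checkpointed
during_outage = events[5:6]      # time 6
after_recovery = events[6:]      # time 7

out1, checkpoint_state = run_query(before_checkpoint)
out2, _ = run_query(before_outage, checkpoint_state)
delivered = out1 + out2          # output that reached the sink before the outage

# Level 1, state retention: state is restored, but times 4-6 are never replayed.
out_l1, _ = run_query(after_recovery, checkpoint_state)
state_retention = delivered + out_l1

# Level 2, complete output: sources replay everything after the checkpoint.
replayed = before_outage + during_outage + after_recovery
out_l2, _ = run_query(replayed, checkpoint_state)
complete = delivered + out_l2

# Level 3, equivalent output: the sink also drops what it already delivered.
already_seen = set(delivered)
equivalent = delivered + [e for e in out_l2 if e not in already_seen]
```

In this scenario `state_retention` ends with an incorrect total for time 7, `complete` contains every expected event plus two duplicates (times 4 and 5), and `equivalent` matches the uninterrupted output.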

High-Water Marks

The high-water mark is the highest application time seen up to a specific point in the stream of events. When a checkpoint is requested, StreamInsight captures a checkpoint on the next high-water mark change on each of the inputs.
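
As an illustration of the definition above, the following sketch tracks the high-water mark over a single input whose events may arrive out of order, and finds the position at which a requested checkpoint would be captured. It assumes application times are comparable values (integers here) and models only one input; the function names are hypothetical and are not part of StreamInsight.

```python
def high_water_marks(app_times):
    """Yield the high-water mark after each event: the highest application time seen so far."""
    mark = None
    for t in app_times:
        if mark is None or t > mark:
            mark = t
        yield mark


def next_mark_change(app_times, requested_at):
    """Return the index of the first event at or after `requested_at`
    whose arrival changes the high-water mark, or None if there is none."""
    marks = list(high_water_marks(app_times))
    for i in range(requested_at, len(marks)):
        previous = marks[i - 1] if i > 0 else None
        if marks[i] != previous:
            return i
    return None


times = [1, 3, 2, 3, 5, 4]
marks = list(high_water_marks(times))         # [1, 3, 3, 3, 5, 5]
capture_index = next_mark_change(times, 2)    # 4: the event with time 5 raises the mark
```

A checkpoint requested while late events (times 2 and 3 above) are arriving is not captured until an event advances the mark.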

Events and State That Cannot Be Saved by Checkpointing

To understand the prerequisites for output that is complete and equivalent after recovery from an outage, it is useful to recognize the events and state that cannot be saved by StreamInsight checkpointing. These events and this state must be persisted separately to be available after recovery from an outage.

Events or state that cannot be saved by checkpointing	Solution
Events that arrive after the last checkpoint and before the outage	To be available for replay after recovery from an outage, these events must be persisted in a data store.
Events that arrive during the outage	To be available after recovery from an outage, these events must be persisted in a data store.
Knowledge of events that were produced as output after the last checkpoint and before the outage	To support the removal of duplicate events by sinks after recovery, these events must be persisted in a data store.
Any state that is maintained by custom sources or sinks	To be available after recovery from an outage, this state must be persisted in a data store by the custom sources or sinks.

Replay by Sources

When a StreamInsight application restarts after an outage, the call to initialize a resilient source provides the high-water mark to the source. The source is expected to replay its input stream from the high-water mark.

Correct replay by all sources guarantees output that is complete.

Therefore, output that is complete has the following requirements:

A source that uses the high-water mark provided to identify the events that must be replayed.
The availability after recovery of all the events that occurred after the last checkpoint that was captured before the outage.
The availability after recovery of all the events that occurred during the outage.
The correct replay of these events by all sources.
Checkpointing and the recovery of query state.
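
The requirements above can be sketched with a hypothetical replayable source. This is not the StreamInsight source interface: the class name, the constructor arguments, and the use of an in-memory list as the durable event log are all assumptions made for illustration. The sketch also assumes that "from the high-water mark" means events whose application time is at or after the mark; the exact boundary should be confirmed against the product documentation.

```python
class ReplayableSource:
    """Hypothetical source backed by a durable, append-only event log."""

    def __init__(self, event_log, high_water_mark=None):
        # `event_log` stands in for a data store of (app_time, payload) pairs
        # that survives the outage and keeps receiving events during it.
        self._log = event_log
        # None on a fresh start; supplied on restart after an outage.
        self._high_water_mark = high_water_mark

    def events(self):
        """Yield the events to feed to the query."""
        for app_time, payload in self._log:
            # Assumption: skip everything strictly before the mark.
            if self._high_water_mark is not None and app_time < self._high_water_mark:
                continue
            yield app_time, payload


log = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
fresh = list(ReplayableSource(log).events())
recovered = list(ReplayableSource(log, high_water_mark=3).events())
```

On a fresh start the source delivers the whole log. On recovery with a mark of 3 it delivers only the events at times 3 and 4, which keeps the replay short.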

Elimination of Duplicates by Sinks

To identify the location of the checkpoint in the output stream, the call to initialize a resilient sink provides both the high-water mark and an offset to the checkpoint marker from the high-water mark.

If sources replay events correctly, the query restarts from its internal state as of the last checkpoint and produces again every event that it produced after that checkpoint. This means that any events that were produced as output after the last checkpoint but before the outage will be produced a second time during recovery. These are the duplicates that the sink must remove. How they are removed is up to the sink; for example, the duplicate copies can be ignored.

Correct elimination of duplicates by all sinks (after correct replay by all sources) guarantees output that is equivalent.

Therefore, output that is equivalent has the following requirements in addition to the requirements listed previously for complete output:

A sink that uses the high-water mark and offset provided to identify the events that must be removed or ignored.
The availability after recovery of knowledge of all the events that were produced as output after the last checkpoint that was captured before the outage. (This location in the stream is identified by the high-water mark and offset that are provided to the sink when it is recreated after recovery.)
The correct removal of duplicate events by all sinks.
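
One way a sink could meet these requirements is sketched below. It is not the StreamInsight sink interface. The class name, the in-memory list standing in for a durable output log, and the interpretation of the offset are assumptions: the sketch locates the first logged event at or after the high-water mark, moves `offset` events further to find the checkpoint position, and treats everything logged after that position as output that will be produced again. It also assumes the recovered query reproduces those events in the same order.

```python
class DedupSink:
    """Hypothetical sink that ignores events it already delivered before the outage."""

    def __init__(self, output_log, high_water_mark=None, offset=0):
        # `output_log` stands in for a durable record of delivered (app_time, payload) events.
        self._log = output_log
        self._pending_duplicates = []
        if high_water_mark is not None:
            # Assumed meaning of the pair: first logged event at or after the
            # high-water mark, then `offset` events further into the log.
            start = next(
                (i for i, (t, _) in enumerate(output_log) if t >= high_water_mark),
                len(output_log),
            )
            checkpoint_pos = min(start + offset, len(output_log))
            # Everything delivered after the checkpoint will be produced again.
            self._pending_duplicates = list(output_log[checkpoint_pos:])

    def write(self, event):
        """Deliver `event` unless it is an expected duplicate. Return True if delivered."""
        if self._pending_duplicates and self._pending_duplicates[0] == event:
            self._pending_duplicates.pop(0)  # duplicate copy is ignored
            return False
        self._log.append(event)
        return True


# Output delivered before the outage; the checkpoint was taken after time 3.
delivered_log = [(1, 10), (2, 30), (3, 60), (4, 100), (5, 150)]
sink = DedupSink(delivered_log, high_water_mark=3, offset=1)
results = [sink.write(e) for e in [(4, 100), (5, 150), (6, 210), (7, 280)]]
```

Here the first two events written after recovery are recognized as duplicates and ignored, so the log ends up identical to the output of an uninterrupted run.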

More Information

For more information about building, monitoring, and troubleshooting resilient StreamInsight applications, see the following topics:

For an end-to-end code sample of a resilient application that includes replay and de-duplication, see the StreamInsight 2.1 Checkpointing Sample on the StreamInsight Samples page on CodePlex.
