Skip to content

A header-only C++ library implementing seven reservoir sampling algorithms for streaming data.

License

Notifications You must be signed in to change notification settings

LiorKogan/StreamSampler

Repository files navigation

StreamSampler

A header-only C++ library implementing seven reservoir sampling algorithms for streaming data.

Copyright © 2015 Lior Kogan (koganlior1 [at] gmail [dot] com)

Released under the Apache License, Version 2.0

Overview

A data stream is a sequence of elements made available over time, where the total number of elements is typically unknown in advance and may be very large. In this setting, it is often desirable to maintain a simple random sample — a fixed-size subset of the stream such that each element observed so far has an equal probability of being included.

A stream sampler maintains one or more such samples and updates them incrementally as new elements arrive. Crucially, the samples remain simple random samples at all times, regardless of the number of elements processed. Stream sampling relies on online algorithms: the stream size is unknown, only a single pass over the data is possible, and memory usage must remain constant. Depending on the algorithm, update time is linear or sub-linear in the number of processed elements.

This library provides a collection of unweighted reservoir sampling algorithms without replacement, each offering different performance tradeoffs while preserving correct sampling guarantees.

Implemented Algorithms

Performance highlights

  • Algorithms X, Y, Z, K, L, and M improve performance by calculating how many elements to skip at each step, generating far fewer random numbers than R (hence the sub-linear time complexity).
  • Z, K, L, and M are typically much faster than R, with M usually being the most performant.

Practical API design

In the original papers, algorithms were formulated to control element fetching via an external function (GetNextElement()), which can be cumbersome in practice.

This library reformulates the algorithms to a user-friendly API: the user fetches elements from the stream and calls the sampler’s member function:

AddElement(const ElementType& element); // copy semantics
AddElement(ElementType&& element);      // move semantics

Both functions return the number of stream elements to skip before the next call, maintaining algorithm efficiency.

Multiple independent samples are supported within the same sampler.

Testing and examples

StreamSamplerTest provides:

  • StreamSamplerExample() — usage example
  • StreamSamplerPerformanceBenchmark() — comparative performance tests
  • StreamSamplerTestUniformity() — verifies that samples remain uniform and unbiased

About

A header-only C++ library implementing seven reservoir sampling algorithms for streaming data.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published