Peter Teichman
Sunday, September 4, 2022

Samp: Stream Sampling on the Command Line

I needed a tool for random sampling that fit the Unix philosophy for modular program design. I needed a lossy cat(1). I wrote samp.

It reads lines from stdin as its input, discards all but K lines, and writes the K lines to stdout. I use it all the time.

If you have a local Go toolchain, you can install it with go install github.com/pteichman/samp@latest.

Reservoir sampling

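Reservoir sampling is an algorithm that only needs to keep the K output items in memory rather than buffering the full stream. It can be described succinctly: keep the first K items, then for each later item i (counting from 1), replace a randomly chosen kept item with it with probability K/i.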

Better algorithms exist; most of them improve on this by generating fewer random numbers while processing the stream. Benchmarking on my laptop hasn’t shown them to be worth the complexity for the sizes of streams I work with on the command line (up to millions of items).

Design decisions