I needed a tool for random sampling that fit the Unix philosophy of modular program design. I needed a lossy cat(1). I wrote samp.
It reads lines from stdin, keeps K of them chosen at random, and writes those K lines to stdout. I use it all the time.
If you have a local Go toolchain, you can install it with go install github.com/pteichman/samp@latest.
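Reservoir sampling is an algorithm that only needs to keep the K output items in memory rather than buffering the full stream. In its simplest form it can be described succinctly: keep the first K items, then for the i-th item after that point (counting from the start of the stream), keep it with probability K/i and have it replace a uniformly chosen item already in the reservoir.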
Better algorithms exist: mostly these generate fewer random numbers while processing the stream. Benchmarking on my laptop hasn’t shown them to be worth the complexity for the sizes of streams I work with on the command line (up to millions of items).