Random sampling¶

>>> from cryptorandom.cryptorandom import SHA256
>>> from cryptorandom.sample import random_sample
>>> import numpy as np

We provide a sampling module compatible with any pseudorandom number generator that has randint and random methods. The module includes a variety of algorithms for weighted or unweighted sampling, with or without replacement.

The main workhorse is the random_sample function. The default sampling algorithm is sample_by_index, sampling indices without replacement.

>>> fruit = ['apple', 'banana', 'cherry', 'pear', 'plum']
>>> s = SHA256(1234567890)
>>> random_sample(fruit, 2, prng=s)
    array(['plum', 'cherry'], dtype='<U6')

Numpy and the base random module offer methods for drawing simple random samples with and without replacement, but don’t allow you to choose the pseudorandom number generator. Numpy’s choice method uses the Fisher-Yates method to draw a random sample.

>>> np.random.choice(fruit, 2)
array(['plum', 'apple'], dtype='<U6')

The sampling methods available in cryptorandom are below.

Method	weights	replacement
Fisher-Yates	no	without replacement
PIKK	no	without replacement
recursive	no	without replacement
Waterman_R	no	without replacement
Vitter_Z	no	without replacement
sample_by_index	no	without replacement
Exponential	yes	either
Elimination	yes	without replacement

>>> %timeit random_sample(fruit, 2, method="Fisher-Yates", prng=s)
10000 loops, best of 3: 46.4 µs per loop
>>> %timeit random_sample(fruit, 2, method="PIKK", prng=s)
10000 loops, best of 3: 41.3 µs per loop
>>> %timeit random_sample(fruit, 2, method="recursive", prng=s)
100000 loops, best of 3: 15 µs per loop
>>> %timeit random_sample(fruit, 2, method="Waterman_R", prng=s)
100000 loops, best of 3: 16.9 µs per loop
>>> %timeit random_sample(fruit, 2, method="Vitter_Z", prng=s)
10000 loops, best of 3: 22 µs per loop
>>> %timeit random_sample(fruit, 2, method="sample_by_index", prng=s)
100000 loops, best of 3: 15 µs per loop

Some sampling methods (Fisher-Yates, PIKK, sample_by_index, Exponential, and Elimination) return ordered samples, i.e. they are equally likely to return [1, 2] as they are to return [2, 1].

>>> s = SHA256(1234567890)
>>> counts = {}
>>> for i in range(10000):
>>>     sam = pikk(5, 2, prng=s)
>>>     if str(sam) in counts.keys():
>>>         counts[str(sam)]+=1
>>>     else:
>>>         counts[str(sam)]=0
>>> counts
    {'[1 2]': 549,
     '[1 3]': 528,
     '[1 4]': 512,
     '[1 5]': 502,
     '[2 1]': 515,
     '[2 3]': 485,
     '[2 4]': 487,
     '[2 5]': 482,
     '[3 1]': 484,
     '[3 2]': 482,
     '[3 4]': 466,
     '[3 5]': 525,
     '[4 1]': 468,
     '[4 2]': 512,
     '[4 3]': 490,
     '[4 5]': 490,
     '[5 1]': 547,
     '[5 2]': 460,
     '[5 3]': 507,
     '[5 4]': 489}

The reservoir algorithms (Waterman_R and Vitter_Z) and the recursive method aren’t guaranteed to randomize the order of sampled items.

>>> s = SHA256(1234567890)
>>> counts = {}
>>> for i in range(10000):
>>>     sam = recursive_sample(5, 2, prng=s)
>>>     if str(sam) in counts.keys():
>>>         counts[str(sam)]+=1
>>>     else:
>>>         counts[str(sam)]=0
>>> counts
    {'[1 2]': 492,
     '[1 3]': 499,
     '[1 4]': 503,
     '[1 5]': 1016,
     '[2 1]': 462,
     '[2 3]': 487,
     '[2 4]': 525,
     '[2 5]': 985,
     '[3 1]': 481,
     '[3 2]': 485,
     '[3 4]': 507,
     '[3 5]': 984,
     '[4 1]': 524,
     '[4 2]': 475,
     '[4 3]': 516,
     '[4 5]': 1043}