Random sampling¶
>>> from cryptorandom.cryptorandom import SHA256
>>> from cryptorandom.sample import random_sample
>>> import numpy as np
We provide a sampling module compatible with any pseudorandom number generator that has randint and random methods. The module includes a variety of algorithms for weighted or unweighted sampling, with or without replacement.
The main workhorse is the random_sample function. The default sampling algorithm is sample_by_index, sampling indices without replacement.
>>> fruit = ['apple', 'banana', 'cherry', 'pear', 'plum']
>>> s = SHA256(1234567890)
>>> random_sample(fruit, 2, prng=s)
array(['plum', 'cherry'], dtype='<U6')
Numpy and the base random module offer methods for drawing simple random samples with and without replacement, but don’t allow you to choose the pseudorandom number generator. Numpy’s choice method uses the Fisher-Yates method to draw a random sample.
>>> np.random.choice(fruit, 2)
array(['plum', 'apple'], dtype='<U6')
The sampling methods available in cryptorandom are below.
Method |
weights |
replacement |
---|---|---|
Fisher-Yates |
no |
without replacement |
PIKK |
no |
without replacement |
recursive |
no |
without replacement |
Waterman_R |
no |
without replacement |
Vitter_Z |
no |
without replacement |
sample_by_index |
no |
without replacement |
Exponential |
yes |
either |
Elimination |
yes |
without replacement |
>>> %timeit random_sample(fruit, 2, method="Fisher-Yates", prng=s)
10000 loops, best of 3: 46.4 µs per loop
>>> %timeit random_sample(fruit, 2, method="PIKK", prng=s)
10000 loops, best of 3: 41.3 µs per loop
>>> %timeit random_sample(fruit, 2, method="recursive", prng=s)
100000 loops, best of 3: 15 µs per loop
>>> %timeit random_sample(fruit, 2, method="Waterman_R", prng=s)
100000 loops, best of 3: 16.9 µs per loop
>>> %timeit random_sample(fruit, 2, method="Vitter_Z", prng=s)
10000 loops, best of 3: 22 µs per loop
>>> %timeit random_sample(fruit, 2, method="sample_by_index", prng=s)
100000 loops, best of 3: 15 µs per loop
Some sampling methods (Fisher-Yates, PIKK, sample_by_index, Exponential, and Elimination) return ordered samples, i.e. they are equally likely to return [1, 2] as they are to return [2, 1].
>>> s = SHA256(1234567890)
>>> counts = {}
>>> for i in range(10000):
>>> sam = pikk(5, 2, prng=s)
>>> if str(sam) in counts.keys():
>>> counts[str(sam)]+=1
>>> else:
>>> counts[str(sam)]=0
>>> counts
{'[1 2]': 549,
'[1 3]': 528,
'[1 4]': 512,
'[1 5]': 502,
'[2 1]': 515,
'[2 3]': 485,
'[2 4]': 487,
'[2 5]': 482,
'[3 1]': 484,
'[3 2]': 482,
'[3 4]': 466,
'[3 5]': 525,
'[4 1]': 468,
'[4 2]': 512,
'[4 3]': 490,
'[4 5]': 490,
'[5 1]': 547,
'[5 2]': 460,
'[5 3]': 507,
'[5 4]': 489}
The reservoir algorithms (Waterman_R and Vitter_Z) and the recursive method aren’t guaranteed to randomize the order of sampled items.
>>> s = SHA256(1234567890)
>>> counts = {}
>>> for i in range(10000):
>>> sam = recursive_sample(5, 2, prng=s)
>>> if str(sam) in counts.keys():
>>> counts[str(sam)]+=1
>>> else:
>>> counts[str(sam)]=0
>>> counts
{'[1 2]': 492,
'[1 3]': 499,
'[1 4]': 503,
'[1 5]': 1016,
'[2 1]': 462,
'[2 3]': 487,
'[2 4]': 525,
'[2 5]': 985,
'[3 1]': 481,
'[3 2]': 485,
'[3 4]': 507,
'[3 5]': 984,
'[4 1]': 524,
'[4 2]': 475,
'[4 3]': 516,
'[4 5]': 1043}