Randomness versus Reproducibility

Bill Evans

In the last several years, there has been considerable effort by journals and other publications to enforce reproducibility of the research they publish. In short, if a reader is unable to reproduce the mathematical results of a paper, where is the rigor of the review process or the power of the conclusion? Although certainly not all technical authors modify data or manipulate results, there is a long history of accusations of fixing data or doctoring results for publication or commercial purposes.

A recent and notable aid to reproducibility, at least where R is concerned, is the “R-markdown” file. Markdown is a formatting language that provides simple mechanisms for turning plain, human-readable text into many different formats, including HTML, DOCX, LaTeX (and PDF), and EPUB. Though there are several implementations with various incompatibilities, they all share the same goal: producing standardized output without having to wrestle with the nuances, intricacies, and inconsistencies of professional formatting tools. (Just try to move multiple images and tables, numbered and with captions, between chapters in a Word document and you will quickly see my point. Likewise for looking through the forest of backslashes in LaTeX.)

“R-markdown” extends this by embedding actual code (R code, in this case) in the report document; when the document is “compiled” into the target format, the code is executed and its results are included as well. Though this might sound like very little utility for the required overhead, imagine the copy/paste loop in which authors are caught when fixing code snippets: it is very common for results to be updated but not the code that generated them, an obstacle with which teachers and article reviewers must often deal.
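For concreteness, here is a minimal sketch of what such an embedded chunk might look like in an R-markdown file (the chunk label and the toy simulation are placeholders of my own, not from any particular report):

```{r coin-flips}
flips = sample(c("H", "T"), size = 100, replace = TRUE)  # simulate 100 coin flips
table(flips)                                             # tally heads and tails
```

When the document is compiled, the code runs and its output appears directly beneath the chunk, so the prose, the code, and the results cannot drift apart.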

I’ve been participating in a discussion recently where the author needs to reproduce arbitrary iterations within a loop. His n (the number of simulations) is on the order of 1e7, which can quickly overwhelm even well-equipped computers. Because of the computation time, as well as his need to re-execute arbitrary iterations within the whole shebang, he chooses to save the PRNG state during each iteration.
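As I understand it, “saving the state” amounts to snapshotting R’s .Random.seed at the top of each iteration so that any single iteration can be restored later. A minimal sketch of that idea (my own reading of the approach, not his code):

n = 1e7                             # the number of simulations
states = vector("list", n)          # one saved PRNG state per iteration
invisible(runif(1))                 # ensure .Random.seed exists before copying it
for (idx in seq_len(n)) {
    states[[idx]] = .Random.seed    # snapshot the generator state for this iteration
    ## ... the simulation work for iteration idx ...
}
## later, to replay iteration k exactly:
## assign(".Random.seed", states[[k]], envir = globalenv())

Each saved Mersenne-Twister state is a 600-plus-integer vector, so 1e7 of them costs tens of gigabytes, and poking at .Random.seed directly is presumably part of why the implementation is discouraged.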

Much of the discussion (both within his post and in related articles/blogs) revolves around the lack of scientific cleanliness that comes from forcing specific sequences of randomness. For instance, when dealing with uncertainty, if we use known/predictable sequences of random numbers, then the process is arguably not purely stochastic (random). In fact, it is certainly possible to hand-pick a seed for the pseudo-random number generator (PRNG) such that the process produces the results we want. At best, this intentional bias provides false insight into the problem, and it is likely scientifically impure and even ethically questionable.

However, the author’s intent was not to use predictable sequences of random numbers; it was merely to be able to go back in time and reproduce the random numbers from a particular iteration of the loop, in order to dive deeper into the analysis. His method actually does not set any PRNG seeds until he needs to reproduce an iteration, so it is theoretically a more robust method (even if his R implementation is discouraged for technical reasons). I made suggestions that comply with R’s recommendations while still providing the analytical purity of pseudo-random numbers and accommodating the author’s need to “go back in time.” The more I think about this discussion, the more I feel I need to revise my own efforts in Reproducible Reporting (specifically in R or similar languages).

Historically, I’ve merely made a simple call to R’s set.seed with a hard-coded integer. I usually try to use an honest-to-goodness random number for this, but that is neither obvious nor a solid alibi. How can I “seed” the execution of a stochastic model without implying possible analyst’s bias?

My suggestion:

(seed = sample(.Machine$integer.max, size = 1))  # the outer parentheses auto-print the randomly chosen seed
set.seed(seed)                                   # seed the PRNG with that value

In an article or report, this randomly generates a seed (one that I do not control), prints that seed in the rendered report, and then uses it. Every time I re-execute the code in the document (to add chapters, for example), a different seed is produced at random, but the report still enables the reader to reproduce the results. Voilà!

When we need to go “back in time” to specific iterations of a massive loop, I suggest the following:

(seed = sample(.Machine$integer.max, size = 1))  # one master seed, printed in the report
set.seed(seed)
n = 1e7                                          # some large number of simulations
seeds = sample(.Machine$integer.max, size = n)   # one reproducible seed per iteration
for (idx in seq_len(n)) {
    set.seed(seeds[idx])
    ## ...
}

Technically, the only number you need to record in order to reproduce this massive consumer of pseudo-random numbers is the initial seed. All of the other 1e7 random integers used as seeds for the individual loop iterations can be regenerated from it. Generating 1e7 random integers takes only a few seconds on most computers, so performance is not a problem here. Whether to sample with or without replacement is a different mathematical conversation, and one that I am not qualified to answer.
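To make the “back in time” step explicit, a sketch of the replay (the variable names follow the snippet above; iteration 12345 is just an arbitrary example):

set.seed(seed)                                   # the single recorded master seed
seeds = sample(.Machine$integer.max, size = n)   # regenerates the identical per-iteration seeds
k = 12345                                        # whichever iteration we want to revisit
set.seed(seeds[k])
## ... rerun the body of that iteration ...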

The last thing to consider is the PRNG algorithm in use. R, like many other languages, uses the Mersenne-Twister algorithm by default. For more robust reproducibility, the call to set.seed should also identify the “kind” (and “normal.kind”) of PRNG, in case the default generator changes while reproducibility is still a concern. In that case, set.seed(seed, kind = "Mersenne-Twister", normal.kind = "Inversion") is the most robust solution so far.
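As a final, optional habit (my own suggestion, not a requirement of R), the report can also print the generator actually in use so a future reader can confirm it:

RNGkind()   # reports the uniform, normal (and, in recent R versions, sample) generators in use
set.seed(seed, kind = "Mersenne-Twister", normal.kind = "Inversion")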

Happy Simulating!

