
Randomness of Uninitialized Variables

The other day I came across this interesting question on StackOverflow. The OP asks whether it is a good idea to use an Uninitialized Variable (UV) as a random number generator. There are several answers to this question; from the Software Engineering point of view, it might be a terrible idea, because this kind of code introduces “undefined behavior” into our program [1].

One Variable at a Time

For me, the Statistical point of view on this question is far more interesting than the Software Engineering one. A pseudo-random number generator (PRNG) must have certain properties, including repeatability and randomness. Repeatability means the algorithm must return the same output sequence given the same initial seed value. Of course, we can’t set a seed for an Uninitialized Variable (to the best of my knowledge), and this is the first problem with UVs. The second problem, which is far more important, is their distribution! A PRNG should produce independent and identically distributed (IID) outputs from a uniform or other known distribution.
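As a quick illustration of what repeatability means, here is an ordinary seeded PRNG (numpy here, purely as an example, not a UV):

import numpy as np

# Two generators seeded with the same value produce exactly the same stream.
gen1 = np.random.default_rng(42)
gen2 = np.random.default_rng(42)

a = gen1.integers(0, 100, size=5)
b = gen2.integers(0, 100, size=5)
print(a, b, (a == b).all())   # the two sequences are identical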

So what is the distribution of UVs? That is the question I address in this post. I created a C program with only one UV in it:

#include "stdio.h"

int main() {
    int a;  /* deliberately left uninitialized; reading it is undefined behavior */
    printf("%d\n", a);
    return 0;
}

I compiled this code on macOS with Apple clang version 11.0.0, then ran the executable 1,000,000 times and plotted a histogram of its output. (Note: the same code only ever prints 0 on my Ubuntu machine; more on this later.)
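If you want to reproduce this, a minimal Python driver could look like the following (the binary name ./uv_one and the output file name are just placeholders):

import subprocess

# Run the compiled binary many times and collect the printed integers.
N_RUNS = 1_000_000
values = []
for _ in range(N_RUNS):
    out = subprocess.run(["./uv_one"], capture_output=True, text=True)
    values.append(int(out.stdout.strip()))

# Store the values for later plotting and analysis.
with open("uv_values.txt", "w") as f:
    f.write("\n".join(map(str, values)))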

Histogram of one UV

There are two interesting things here. The first is the triangular shape of the histogram, which strongly suggests that the distribution of UV outputs is not uniform at all. The second is that I got only positive numbers, all between 667685 and 536727589 (roughly one fourth of INT_MAX, which is 2147483647).
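Loading the collected values back (assuming they were saved to uv_values.txt as in the sketch above), the range and the histogram are easy to check:

import numpy as np
import matplotlib.pyplot as plt

values = np.loadtxt("uv_values.txt", dtype=np.int64)
print(values.min(), values.max())   # observed range of the UV outputs

plt.hist(values, bins=200)          # the triangular-looking histogram
plt.xlabel("UV value")
plt.ylabel("count")
plt.show()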

For a moment, let’s forget all the fancy statistical tools we have and just use these numbers in two use cases. First, let’s generate a random texture with them:

Generate Texture with UVs

All three images above have the same 100x100 pixel dimensions. For each pixel, I assigned a value between 0 and 255 and plotted the result as an image. For the leftmost image, I first sampled 10,000 values from my previous experiment and then linearly scaled them into the range 0 to 255. In the middle image, instead of scaling, I simply took each value’s remainder when divided by 255. And in the rightmost image, I used numpy’s random module, just for comparison. It is immediately visible that the leftmost image has more red in it! One could argue that taking the remainder gives better randomness than scaling (after all, Blum Blum Shub generates random numbers with nothing more than modular arithmetic!).
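Roughly, the textures boil down to this kind of mapping (a sketch; the file name and the plotting details are assumptions, the point is the scaling versus remainder step):

import numpy as np
import matplotlib.pyplot as plt

uv = np.loadtxt("uv_values.txt", dtype=np.int64)[:10_000]

# Left: linearly rescale the raw values into [0, 255].
scaled = ((uv - uv.min()) / (uv.max() - uv.min()) * 255).astype(np.uint8)
# Middle: take the remainder by 255 instead of rescaling.
remainder = (uv % 255).astype(np.uint8)
# Right: numpy's own generator, for comparison.
reference = np.random.randint(0, 256, size=10_000).astype(np.uint8)

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, img, title in zip(axes, [scaled, remainder, reference],
                          ["scaled", "remainder", "numpy"]):
    ax.imshow(img.reshape(100, 100))
    ax.set_title(title)
    ax.axis("off")
plt.show()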

In the next experiment, let’s use our UVs to draw some random points on a 100x100 canvas. Drawing 1,000 random points (10% of the total space) gives us this:

Filling Space with UVs
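Here is a minimal sketch of this kind of scatter plot, assuming consecutive UV values are paired up into (x, y) coordinates (that pairing is my assumption):

import numpy as np
import matplotlib.pyplot as plt

uv = np.loadtxt("uv_values.txt", dtype=np.int64)[:2000]
xy = uv.reshape(-1, 2)                      # 1,000 (x, y) pairs

# Map the raw values onto the 100x100 canvas in the two ways described above.
scaled = ((xy - xy.min()) / (xy.max() - xy.min()) * 99).astype(int)
remainder = xy % 100
reference = np.random.randint(0, 100, size=(1000, 2))

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, pts, title in zip(axes, [scaled, remainder, reference],
                          ["scaled", "remainder", "numpy"]):
    ax.scatter(pts[:, 0], pts[:, 1], s=2)
    ax.set_title(title)
    ax.set_aspect("equal")
plt.show()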

Again, the leftmost image uses a scaled version of the UVs to choose the coordinates of the points, the middle one uses the remainder version, and in the right image I am using numpy’s randint method. This picture makes it clear what is wrong with the distribution of UVs. Not only do UVs fail to cover the space evenly, which one might expect, they never visit certain points in the space at all. In the graph below, you can see that particular digit combinations never occur in the last two digits of the UVs. Maybe this is due to the fact that each UV represents an address in memory, and not all addresses are valid?!

Grid representation of last two digits of UVs
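The grid itself boils down to counting which of the 100 possible last-two-digit pairs ever appear in the data; a sketch:

import numpy as np

uv = np.loadtxt("uv_values.txt", dtype=np.int64)
last_two = np.abs(uv) % 100                 # last two decimal digits

# Count occurrences of every (tens, ones) digit pair on a 10x10 grid.
grid = np.zeros((10, 10), dtype=int)
np.add.at(grid, (last_two // 10, last_two % 10), 1)

print("digit pairs that never occur:")
print(np.argwhere(grid == 0))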

More Variables, More Weirdness

The above example was rather boring: only one variable in the executable. Let’s add more!

#include "stdio.h"

int get1() {
    int r;  /* uninitialized local, returned from its own stack frame */
    return r;
}

int get2() {
    int g;  /* uninitialized local, returned from its own stack frame */
    return g;
}

int main() {
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;
    int g;

    printf("%d, %d, %d, %d, %d, %d, %d, %d, %d\n", a, b, c, d, e, f, g, get1(), get2());

    return 0;
}

I once again compiled and ran the executable 1,000,000 times and stored the results. Here are some interesting statistics from these runs:

variable mean std min max
a
b
c
d
e
f
g
get1
get2

Variable a got only positive values and c only negative values, which suggests that these numbers are not memory addresses! The other variables are less interesting: they held almost the same value, 0, across all runs. I tried to fit a triangular distribution to a’s data using seaborn. All of the data lies perfectly inside the triangle! Very amazing!

Distribution of Uninitialized Variable ``a``

For variable c, things were a little different. This time I had a uniform distribution on my hands, and you can see that the distribution fitted by seaborn again covers almost all of the data.

Distribution of Uninitialized Variable ``c``
Side by Side Comparison
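These fits can be reproduced roughly as follows, using scipy for the actual distribution fit and seaborn for the histogram (the file name is a placeholder; swap stats.triang for stats.uniform to redo the fit for c):

import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

a = np.loadtxt("values_a.txt")              # hypothetical file with a's samples

# Fit a triangular distribution and overlay its density on the histogram.
c, loc, scale = stats.triang.fit(a)
x = np.linspace(a.min(), a.max(), 500)

sns.histplot(a, stat="density", bins=100)
plt.plot(x, stats.triang.pdf(x, c, loc=loc, scale=scale))
plt.show()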

I compiled the same code on my Ubuntu machine; the results were somewhat similar, but not identical. This time I got all-zero values for b and c.

variable mean std min max
a
b
c
d
e
f
g
get1
get2

Drawing more plots shows that variables a, d, e, f, and g have an almost uniform distribution. Unlike in the previous example, none of the variables follows a triangular distribution. Here are the plots for each variable:

Distribution of Variables on Ubuntu - ``1,000,000`` data points

Last two digits on Ubuntu - ``10,000`` sampled values

Conclusion

I drew some plots and ran a few simple experiments, and it is clear that Uninitialized Variables do not follow any single, reliable pattern. The behavior of these variables depends entirely on the underlying compiler and platform (hence the term “undefined behavior”). As an out-of-the-box random number generator, their properties are not desirable and do not satisfy the characteristics of a standard PRNG. I think that from both the Statistical and the Software Engineering points of view, using UVs in a program is a terrible idea and should be avoided.

Further Reading

  1. The common and false assumption about existing PRNGs is that they produce IID results. Read more on Random Sampling: Practice Makes Imperfect