The other day I came across this interesting question on StackOverflow. The OP is merely asking whether it is a good idea to use an Uninitialized Variable (UV) as a random number generator. There are several answers to this question, from the Software Engineering point of view, it might be a terrible idea; because this kind of code introduces “**undefined behavior**” into our program [1].

For me, the Statistical Point of view of this question is far more interesting than of its Software Engineering one. A *pseudo-*random number generator (PRNG) must have some properties, including **Repeatability** and **Randomness**. Repeatability means your algorithm must return the same output given equal initial seed value. Of course, we can’t set a seed for Uninitialized Variable (to best of my knowledge), and this is the first problem with UVs. The second problem, which is far more important, is their distribution! A PRNG should produce Independent and identically distributed (IID) outputs from a uniform or other known distribution.

So what is the distribution of UVs ? I am going to address this question in this post. I created a C program with only one UV in it.

```
#include "stdio.h"
int main() {
int a;
printf("%d\n", a);
return 0;
}
```

I compiled this code on MacOs with `Apple clang version 11.0.0`

, and after that, I ran the executable for `1,000,000`

times and plotted its histogram. (**Note**: The same above code only generates `0`

on my Ubuntu machine, more on this later.)

There are two interesting parts here. The first is the triangular shape of the histogram, which strongly suggests the output distribution UVs is not *uniform* at all. The second is that I got all positive numbers between `667685`

and `536727589`

(One forth of `INT_MAX`

, which is `2147483647`

)

For a moment, let’s forget we have all the fancy statistical tools we have and just use these numbers in two use-cases. First, let’s generate a random texture with them:

All three images above have the same `100x100`

pixels dimension. For each pixel, I assigned a value between `0`

to `255`

and plotted it as an image. For the leftmost image, I first sampled `10,000`

value from my previous experiment then linearly scaled them between `0`

and `255`

. In the middle image, Instead of scaling, I simply calculated the remainder by `255`

for each value. And in the rightmost image, I used numpy’s random, just for comparison. It is sharply visible that the leftmost image has more red in it! One could think that calculating the remainder gives better *randomness* than scaling (After all Blum Blum Shub was a method of generating random numbers!).

In the next experiment, let’s use our UVs to draw some random points on a `100x100`

canvas. Drawing `1000`

random points (`10%`

of total space) will give us this:

Again, the leftmost image contains a scaled version of UVs for choosing coordinates of points. The middle is the remainder version, and in the right image, I am using numpy’s `randint`

method. This image gives a clear picture about what is wrong with UVs distribution. Not only UVs not covering space evenly, which one might assume, UVs don’t generate visit specific points in space at all. In the below graph, you can see a particular combination of digits won’t happen at all in the last two digits of UVs. Maybe this is due to the fact each UV represents an address in the memory, and all addresses are not valid?!

The above example was very boring, only one variable in an executable. Let’s add more!

```
#include "stdio.h"
int get1() {
int r;
return r;
}
int get2() {
int g;
return g;
}
int main() {
int a;
int b;
int c;
int d;
int e;
int f;
int g;
printf("%d, %d, %d, %d, %d, %d, %d, %d, %d\n", a, b, c, d, e, f, g, get1(), get2());
return 0;
}
```

I once again complied and rand the executable for `1,000,000`

times and stored the results. Here are some interesting statistics from these runs:

variable | mean | std | min | max |
---|---|---|---|---|

`a` |
\(2.68 * 10^8\) | \(1.09 * 10^8\) | \(56.0\) | \(5.36 * 10^8\) |

`b` |
\(3.27 * 10^4\) | \(0\) | \(32.0\) | \(32.0\) |

`c` |
\(-4.06 * 10^8\) | \(7.74 * 10^7\) | \(-5.41 * 10^8\) | \(-2.72 * 10^8\) |

`d` |
\(0\) | \(0\) | \(0\) | \(0\) |

`e` |
\(0\) | \(0\) | \(0\) | \(0\) |

`f` |
\(0\) | \(0\) | \(0\) | \(0\) |

`g` |
\(0\) | \(0\) | \(0\) | \(0\) |

`get1` |
\(1\) | \(0\) | \(1\) | \(1\) |

`get2` |
\(1\) | \(0\) | \(1\) | \(1\) |

Variable `a`

got only positive values and `c`

only negative values, which suggests that these numbers are not memory addresses! Other variables are less attractive, almost the same `0`

value for all runs. I tried to fit a Triangular distribution on `a`

’s data using seaborn. All of the data has perfectly lay inside the triangle! Very amazing!

For variable `c`

, things were a little different. I had Unifrom distribution on my hand, and you can see the fitted distribution by seaborn is almost covering the whole data again.

I compiled the same code on my Ubuntu machines; their results were kind of similar, but not too similar. I’ve got all zero values for `b`

and `c`

this time.

variable | mean | std | min | max |
---|---|---|---|---|

`a` |
\(3.26 * 10^4\) | \(73.92\) | \(3.25 * 10^4\) | \(3.27 * 10^4\) |

`b` |
\(0\) | \(0\) | \(0\) | \(0\) |

`c` |
\(0\) | \(0\) | \(0\) | \(0\) |

`d` |
\(-3.13 * 10^6\) | \(1.24 * 10^9\) | \(-2.14 * 10^9\) | \(2.14 * 10^9\) |

`e` |
\(3.263 * 10^4\) | \(73.92\) | \(3.25 * 10^4\) | \(3.27 * 10^4\) |

`f` |
\(-3.13 * 10^6\) | \(1.24 * 10^9\) | \(-2.14 * 10^9\) | \(2.14 * 10^9\) |

`g` |
\(3.26 * 10^4\) | \(73.92\) | \(3.25 * 10^4\) | \(3.27 * 10^4\) |

`get1` |
\(0\) | \(0\) | \(0\) | \(0\) |

`get2` |
\(0\) | \(0\) | \(0\) | \(0\) |

Drawing more plots show that variables `a`

, `d`

, `e`

, `f`

and `g`

have an almost uniform distribution. Unlike in the previous example, none of the variables have a Triangular distribution. Here are plots for each variable:

I drew some plots and other things, and it is self-evident that Uninitialized Variables do not follow any strict pattern. The behavior of these variables wholly depends on the underlying compiler (thus the word “**undefined behavior**”). As an out of the box random number generator, the properties of these variables are not desirable and don’t satisfy the characteristics of standard PRNG. I think from both Statistical and Software engineering points of view, usage of UV in a program is a terrible idea and should be avoided.

- The common and
**false**assumption about existing PRNGs is that they produce IID results. Read more on Random Sampling: Practice Makes Imperfect

Feel free to share!