Another way of thinking about global variables

(This is a computer programming topic.)

The problem

So “global variables” are bad, right?

Well, yes. They suffer from initialisation order problems: if x is a global and f is a function which uses it, and f gets called before x is given an initial value, then something is going to go wrong. More generally, if one function uses a value while another changes it, something may go wrong. This can be a problem even without multiple threads.

One “alternative” is to use “instance” variables which either store a value or are empty. A special “access” function (often called instance in Java) can return the stored variable or stop the program with a suitable message when the variable has not yet been initialised. But this adds run-time overhead and doesn’t solve the more general problem of simultaneous access.

A solution which solves many of these problems is to put your variables in a caller function (or an object held by a caller), then pass a reference to the variable into each function using it. This solves the “initialisation” problem since one cannot take a reference to a variable which has not yet been created (though this still allows errors if the variable isn’t initialised as it is created). But it doesn’t solve the simultaneous access problem.

Add in lifetime analysis, as the Rust compiler does, and the simultaneous access problem is solved too.

But is this a panacea? No. In smaller code bases this is fine, but in large projects there can be many variables which need to get passed about, sometimes being passed through several functions just to reach the place they are needed. This can result in function signatures of unwieldy length just to pass all the needed references in. It also increases code churn: any time a new variable is needed by f, which is called by g, which is called by (and so on, up to) some top-level caller F, the signatures of g and of every other function in that chain need to be updated in addition to that of f. And not just the signatures, also the calls.

Further, this promotes a hierarchy of data ownership. In some cases this may be perfectly appropriate, e.g. (to use a common computer science example) a car has four wheels which each have one tire, so it may be perfectly appropriate to have some structure like this (to use Rust syntax):

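    struct Wheel {
        traction: f32,
        tread_depth: f32,
    }
    struct SteeringWheel; // placeholder type; its contents don't matter here
    struct Car {
        wheels: [Wheel; 4],
        steering: SteeringWheel,
    }
    impl Car {
        fn new() -> Car { // hypothetical constructor; the values are arbitrary
            let wheels = [(); 4].map(|_| Wheel { traction: 1.0, tread_depth: 8.0 });
            Car { wheels, steering: SteeringWheel }
        }
    }
    fn main() {
        let my_car = Car::new();
    }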

Now, say you put six cars in a race. Before the race starts, five commentators each give their opinion on each car. The logical way to store those comments, according to the data hierarchy, is to add a list of comments within each Car struct, or maybe a list of car comments within each commentator’s memory. But do you really want lists embedded in lists? A better option might be to have a matrix of comments (a table, with one axis corresponding to cars and the other to commentators).

Adding a table like this, however, has its own problems. It goes against the grain of object-oriented design and the data hierarchy. Why is this bad? Well, for one thing, there are more variables to pass around. If a function wants to list each car alongside its comments, it needs to be passed not just the cars but the comments too. Etc. It can also make it hard to reason about how variables are used (e.g. where they are modified).

Okay, enough cheesy examples. Some real ones. I have had to choose between the dangers of global variables and the evil complexity of deep hierarchies many times while working on the C++ simulator OpenMalaria. One example is interventions affecting mosquitoes (which, for those who know nothing about malaria, are the vectors which transmit the disease). The simulator has two transmission models, a simplistic “non-vector” model and the more detailed “vector” model which simulates populations of several mosquito species. An intervention affecting mosquitoes, for example larviciding (treating pools of stagnant water where mosquitoes breed in order to kill off their larvae), has a parameter describing its effectiveness against each type of mosquito as well as a parameter describing what portion of breeding sites are currently affected. The former is dependent on the mosquito species, so hierarchically belongs within the “species” object within the “vector” object.

The “portion affected” parameter hierarchically does not belong in the “species” object, but if it is not stored there it must be passed into the function which calculates emergence of new mosquitoes from breeding sites. If it is stored there, the function deploying larvicide must have access to each “species” object to update it, which in an object-oriented program means that the “species” type should have a function to deploy the intervention, and the “vector” type should have a function to call that. One intervention on its own is not a problem, but there are several interventions, some with effects within multiple modules of the code, and many other things going on at the same time. This leads to bloated interfaces (e.g. the abstract class which is the interface for the transmission model), many parameters needing to be passed into functions, and data associated with one thing (in this case, the larviciding intervention) being split between several different modules.

A solution

Global variables would make all this a lot easier, but they have their own problems, highlighted above.

However, if the problem is not the global variables themselves, but the way in which they are accessed, then why can’t we just control the access?

Let global variables be created normally, but alongside each create a token (an access key). Let any function using that variable require the associated token (in read-only or read-write form); this could be written explicitly in the function signature but for the most part the compiler could work this out automatically. Let any function calling a function requiring a particular token require that token itself, and within the function guard usage of the token as Rust currently guards lifetimes: if a token is in use by a function, it is locked and cannot be used by another function (or the variable accessed). These token signatures should bubble up all the way to the program’s main function.
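No compiler implements this, but the flavour of it can be approximated by hand in today’s Rust. What follows is only a rough sketch under that assumption, not the proposal itself: the names COUNTER, Guarded, CounterToken, bump and so on are all invented for illustration, and the token is passed explicitly, which is exactly the boilerplate the proposal would have the compiler infer. The one-off run-time check in take is an artefact of the emulation (it stands in for the compiler creating a single token alongside the variable).

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

/// Zero-sized access key for the global counter (hypothetical name).
pub struct CounterToken(());

/// Wrapper that only hands out the value in exchange for a token borrow.
pub struct Guarded(UnsafeCell<u32>);

// SAFETY: at most one `CounterToken` ever exists (see `take`), and every
// access borrows it, so the borrow checker excludes conflicting accesses.
unsafe impl Sync for Guarded {}

static COUNTER: Guarded = Guarded(UnsafeCell::new(0));
static TOKEN_TAKEN: AtomicBool = AtomicBool::new(false);

impl CounterToken {
    /// Returns the token on the first call and `None` afterwards.
    pub fn take() -> Option<CounterToken> {
        if TOKEN_TAKEN.swap(true, Ordering::AcqRel) {
            None
        } else {
            Some(CounterToken(()))
        }
    }
}

impl Guarded {
    fn get<'a>(&'a self, _key: &'a CounterToken) -> &'a u32 {
        // SAFETY: a shared token borrow means no exclusive borrow is live.
        unsafe { &*self.0.get() }
    }
    fn get_mut<'a>(&'a self, _key: &'a mut CounterToken) -> &'a mut u32 {
        // SAFETY: an exclusive token borrow means no other borrow is live.
        unsafe { &mut *self.0.get() }
    }
}

/// Read-only use of the global: needs the token in shared form.
fn read_counter(key: &CounterToken) -> u32 {
    *COUNTER.get(key)
}

/// Read-write use of the global: needs the token in exclusive form.
fn bump(key: &mut CounterToken) {
    *COUNTER.get_mut(key) += 1;
}

/// Calls `bump`, so the token requirement bubbles up into this signature.
fn bump_twice(key: &mut CounterToken) -> u32 {
    bump(key);
    bump(key);
    read_counter(key)
}

fn main() {
    let mut key = CounterToken::take().expect("token already taken");
    assert_eq!(bump_twice(&mut key), 2);
    // A second token cannot be obtained, so nothing can bypass `key`.
    assert!(CounterToken::take().is_none());
}
```

Holding a reference returned by get while calling bump is rejected by the borrow checker, which is the “locked while in use” behaviour described above.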

Additionally, any function assigning to a variable before doing anything else with the variable would be marked as initialising that token. It would be required that the main function initialise every token which is used.

Similarly, any function destroying a global variable without re-assigning it would be marked as a de-initialiser. It should be required that main is also a de-initialiser for every token used, and that a de-initialised token cannot be used without being initialised again. To allow re-assignment of globals it may be required that new-assignment vs re-assignment be explicitly differentiated, or this could be automatically checked at run-time.
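The initialiser and de-initialiser rules can be imitated by hand with two token types, one per state. Again this is only a sketch under my own assumptions: GREETING, Uninit, Ready and the three functions are made-up names, the tokens are passed explicitly rather than inferred, and the take function (with its single run-time check) merely stands in for the compiler handing the token to main. Nothing here forces main to finish with the token de-initialised; that part would need the compiler support proposed above.

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicBool, Ordering};

/// Global slot with no "is it set?" flag; the token types carry that state.
struct Slot(UnsafeCell<MaybeUninit<String>>);

// SAFETY: exactly one token exists at a time (in one of the two states) and
// every access requires it, so conflicting accesses cannot be written.
unsafe impl Sync for Slot {}

static GREETING: Slot = Slot(UnsafeCell::new(MaybeUninit::uninit()));
static TOKEN_TAKEN: AtomicBool = AtomicBool::new(false);

/// Token state: `GREETING` holds no value.
pub struct Uninit(());
/// Token state: `GREETING` holds a value.
pub struct Ready(());

impl Uninit {
    /// Stand-in for the compiler handing the token to `main`; works once.
    pub fn take() -> Option<Uninit> {
        if TOKEN_TAKEN.swap(true, Ordering::AcqRel) {
            None
        } else {
            Some(Uninit(()))
        }
    }
}

/// Initialiser: assigns before any other use, turning `Uninit` into `Ready`.
fn init_greeting(_t: Uninit, value: &str) -> Ready {
    // SAFETY: holding `Uninit` by value proves nothing else can touch the slot.
    unsafe { (&mut *GREETING.0.get()).write(value.to_owned()) };
    Ready(())
}

/// Ordinary read-only user: only callable while a `Ready` token exists.
fn greeting(_t: &Ready) -> &str {
    // SAFETY: `Ready` is only ever produced by `init_greeting`.
    unsafe { (&*GREETING.0.get()).assume_init_ref().as_str() }
}

/// De-initialiser: destroys the value and hands back an `Uninit` token.
fn deinit_greeting(_t: Ready) -> Uninit {
    // SAFETY: `Ready` is held by value, so no borrow from `greeting` is live.
    unsafe { (&mut *GREETING.0.get()).assume_init_drop() };
    Uninit(())
}

fn main() {
    let t = Uninit::take().expect("token already taken");
    let t = init_greeting(t, "hello");
    assert_eq!(greeting(&t), "hello");
    let _t: Uninit = deinit_greeting(t);
    // greeting(&t); // would not compile: `t` was moved into `deinit_greeting`
}
```

Because init_greeting consumes an Uninit token, a second plain assignment cannot be written by accident; re-assignment would need its own function taking Ready, which is one way to make the new-assignment versus re-assignment distinction explicit.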

It is worth noting that, for all its complexity, this proposal only ensures correct initialisation, prevents simultaneous access (including from multiple threads) and may prevent certain dodgy practices. Some details are still open (e.g. I am not quite sure how passing references to functions requiring a token would work).

It may be useful additionally to allow “locking” a token with a key (e.g. a private member of a struct) by usage of special function signatures locking/unlocking with the given key.

Conclusion

Initialisation of global variables can be enforced before usage, as can correct deinitialisation. Certain types of simultaneous access, including from multiple threads and where explicitly locked by one user, can be prevented. And this can be done without run-time overhead.

The cost is that function signatures may grow quite substantially in their full form, but other than at API boundaries and maybe in a few special cases this can all be handled implicitly by the compiler.

Cell, RefCell

As I typed this, I was not really sure whether the cost-benefit ratio of adding tokens to control access to global variables was worthwhile. An existing solution in Rust would be to use Cell and RefCell (see Manish Goregaokar’s blog post on this). This, combined with Option to control initialisation, would be a good alternative for mutable global state, adding only a little run-time cost.
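A minimal sketch of what that might look like follows. The names SETTING, init_setting and setting_len are invented for the example, and the use of thread_local! is my assumption: RefCell is not Sync, so it cannot sit in an ordinary static, and a per-thread global is the simplest way around that.

```rust
use std::cell::RefCell;

thread_local! {
    // `None` plays the role of "not yet initialised".
    static SETTING: RefCell<Option<String>> = RefCell::new(None);
}

/// Initialise (or re-assign) the global.
fn init_setting(value: &str) {
    SETTING.with(|s| *s.borrow_mut() = Some(value.to_owned()));
}

/// Read the global; panics with a clear message if it was never initialised.
fn setting_len() -> usize {
    SETTING.with(|s| {
        s.borrow()
            .as_deref()
            .map(str::len)
            .expect("SETTING used before initialisation")
    })
}

fn main() {
    init_setting("hello");
    assert_eq!(setting_len(), 5);
    // Simultaneous access is caught at run time rather than at compile time:
    SETTING.with(|s| {
        let _reader = s.borrow();
        assert!(s.try_borrow_mut().is_err());
    });
}
```

The run-time cost is the Option check on each read plus RefCell’s borrow counter, and mistakes show up as panics rather than compile errors.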

About dhardy

A software developer who landed in Switzerland, I love conjecturing over a few things computer-related, open collaboration, and quietly promoting linux/KDE as a desktop OS.