Cosmic Rays

Tox4j primarily targets mobile devices (Android natively and iOS via RoboVM). The sheer number of users that tox4j will eventually be exposed to poses a special challenge: how to write robust code. We have a code review and design review process outlined below, but it is not enough. To illustrate, consider the following: there are over 1 billion (10^9) Android devices, of which we target about 90% (the remaining 10% are either too old or locked down). For simplicity, let's keep 1 billion as the number. In 2013, there were 300 million iPhones and 800 million Android phones, so assuming roughly equal growth in absolute terms (about 200 million each), we can expect to be targeting 500 million iPhones. That is 1.5 billion devices, and the number will only grow.

Tox is a peer-to-peer distributed system with algorithms that make it scale to a theoretically unbounded number of users, so scalability is not our concern for now (it may become one once we start supporting load-balanced cloud services in the Tox network). Our problem is that these devices are all subject to cosmic radiation and other environmental influences. Some of these devices are used in very hot areas (many people in developing countries have a smartphone, even before having a house), some in very cold areas. These conditions cause issues with memory and CPU.

All Android devices are equipped with ECC RAM (error correcting codes), mitigating the problem, but it is not enough. We have seen environmental influences on file systems causing incorrect file names to be written. In one example we recently ran into, saving the state file “state.dat” failed. We save by writing a new “state.dat.new”, removing the old “state.dat”, and renaming “state.dat.new” to “state.dat”. Our code did not expect what had happened: the first 't' in “state.dat.new” had been shot to pieces by cosmic rays. The letter 't' in binary is 1110100, and there had been enough influence to defeat ECC and parity checks, turning it into 1101110, which is the letter 'n'. We now had a file named “snate.dat.new” and our code failed. On another occasion, a single bit was flipped in the second 't' (1110100 → 1110000 = 'p'), turning “state.dat” into “stape.dat”. Imagine if these two things had happened at the same time: we would have a file “snape.dat”. That is not something you'd want on your phone.
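The bit patterns quoted above, and the write-then-rename save procedure, can be sketched in a few lines. This is an illustrative sketch only, written in Python rather than in tox4j's own language; flipped_bits and save_state are hypothetical names and not part of tox4j. The sketch assumes that the separate "remove the old file" step can be folded into a single overwriting rename.

```python
import os
import tempfile


def flipped_bits(a: str, b: str) -> int:
    """Count the bit positions in which two single characters differ."""
    return bin(ord(a) ^ ord(b)).count("1")


def save_state(directory: str, data: bytes, name: str = "state.dat") -> str:
    """Write `data` to `<name>.new`, then rename it over `<name>`.

    Hypothetical helper, not tox4j's actual code. It checks that the
    temporary file exists under the expected name before renaming, so a
    mangled name is reported instead of silently breaking the save.
    """
    final_path = os.path.join(directory, name)
    temp_path = final_path + ".new"

    with open(temp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming

    if not os.path.exists(temp_path):
        raise OSError(f"expected {temp_path!r} after writing, but it is missing")

    # os.replace overwrites the destination in one step, so the separate
    # "remove the old file" step from the text is not needed here.
    os.replace(temp_path, final_path)
    return final_path


if __name__ == "__main__":
    for original, corrupted in (("t", "n"), ("t", "p")):
        print(
            f"{original!r} {ord(original):07b} -> "
            f"{corrupted!r} {ord(corrupted):07b}: "
            f"{flipped_bits(original, corrupted)} bit(s) differ"
        )
    with tempfile.TemporaryDirectory() as directory:
        print(os.path.basename(save_state(directory, b"example state")))
```

The helper confirms that 't' and 'n' differ in three bit positions, while 't' and 'p' differ in exactly one. The existence check before the rename is only one possible defence; it reports a missing “state.dat.new” but does not try to guess what the file was renamed to.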

How often does this happen? Well, perhaps once every 30 years on any one device, which happens to be less than 1 billion seconds. With more than a billion devices, this means it happens somewhere at least once per second. This is an unacceptable loss of robustness, and the tox4j team is determined to provide a failure-free user experience. Therefore, we have agreed on the following guidelines:

We believe that these guidelines, in combination with careful defence-in-depth design, will help us reduce the effect of cosmic rays on our code by at least an order of magnitude. This means our global user base will experience a malfunction only once every 10 seconds. Future efforts will be aimed at reducing the error rate even further.
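The back-of-the-envelope failure rate used above can be reproduced with a short calculation. This is a rough sketch, assuming every device fails independently and on average once every 30 years; fleet_failures_per_second is a hypothetical name introduced for illustration, and the device count is the 1.5 billion estimate from the text.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, leap days included


def fleet_failures_per_second(devices: float, years_between_failures: float) -> float:
    """Expected failures per second across all devices.

    Assumes each device fails independently, on average once every
    `years_between_failures` years.
    """
    seconds_between_failures = years_between_failures * SECONDS_PER_YEAR
    return devices / seconds_between_failures


devices = 1.0e9 + 0.5e9  # Android plus iPhone estimates from the text
rate = fleet_failures_per_second(devices, years_between_failures=30)
reduced_rate = rate / 10  # the hoped-for order-of-magnitude improvement

print(f"30 years is {30 * SECONDS_PER_YEAR:,.0f} seconds")
print(f"failures per second today: {rate:.2f}")
print(f"seconds between failures after a 10x reduction: {1 / reduced_rate:.1f}")
```

Under these assumptions, 30 years is about 947 million seconds, the fleet sees roughly 1.6 failures per second, and a tenfold reduction leaves one failure about every 6 seconds. The "once per second" and "once every 10 seconds" figures in the text are therefore round numbers, and the second holds only if the starting rate is rounded down to one per second.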