Cosmic Rays

Tox4j primarily targets mobile devices (Android natively and iOS via RoboVM). The sheer number of users who will eventually be exposed to tox4j poses a special challenge: how to write robust code. We have a code review and design review process outlined below, but it is not enough. To illustrate, consider the following: there are over 1 billion (10^9) Android devices, of which we target about 90% (the remaining 10% are either too old or locked down). For simplicity, let's keep 1 billion as the number. In 2013, there were 300 million iPhones and 800 million Android phones, so given roughly equal growth, we can expect to be targeting 500 million iPhones. That is 1.5 billion devices, and the number will only grow.

Tox is a peer-to-peer distributed system with algorithms that make it scale to a theoretically infinite number of users, so scalability is not our concern for now (it may become one once we start supporting load-balanced cloud services in the Tox network). Our problem is that these devices are all subject to cosmic radiation and other environmental influences. Some of these devices are used in very hot areas (many people in third world countries have a smartphone, even before having a house), some in very cold areas. These conditions cause memory and CPU faults.

All Android devices are equipped with ECC RAM (error correcting codes), which mitigates the problem, but it is not enough. We have seen environmental influences on file systems causing incorrect file names to be written. In an example we recently ran into, saving the state file “state.dat” failed. We save it by writing a new “state.dat.new”, removing the old “state.dat” and renaming “state.dat.new” to “state.dat”. Our code did not expect what had happened: the first 't' in “state.dat.new” had been shot to pieces by cosmic rays. The letter 't' in binary is 1110100, and there had been enough influence (three flipped bits) to defeat ECC and parity checks, turning it into 1101110, which is the letter 'n'. We now had a file named “snate.dat.new” and our code failed. On another occasion, a single bit was flipped in the second 't' (1110100 → 1110000 = 'p'), turning “state.dat” into “stape.dat”. Imagine if these two things had happened at the same time: we would have a file “snape.dat”. That is not something you'd want on your phone.
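The bit patterns quoted above can be checked with a few lines of code. This is only a sketch in Python (chosen for brevity; it is not tox4j code), and the helper name `flipped_bits` is made up for this illustration.

```python
def flipped_bits(a: str, b: str) -> int:
    """Count the bit positions in which two characters differ (Hamming distance)."""
    return bin(ord(a) ^ ord(b)).count("1")


# Compare the original letter 't' with the two corrupted letters from the text.
for original, corrupted in [("t", "n"), ("t", "p")]:
    print(
        f"{original} = {ord(original):07b}, "
        f"{corrupted} = {ord(corrupted):07b}, "
        f"bits flipped: {flipped_bits(original, corrupted)}"
    )
```

The first case needs three flipped bits, the second only one.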

How often does this happen? Well, perhaps once every 30 years per device. That happens to be less than 1 billion seconds, and we have more than 1 billion devices, meaning that this thing happens at least once per second somewhere among our users. This is an unacceptable loss of robustness, and tox4j-team is determined to provide a failure-free user experience. Therefore, we have agreed on the following guidelines:
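The estimate can be reproduced as a back-of-the-envelope calculation. The sketch below assumes that the 30-year figure applies to each device independently and that a year has 365.25 days; neither assumption is stated in the text, and the function name is invented for this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: Julian year


def fleet_failures_per_second(devices: float, years_between_failures: float) -> float:
    """Expected failures per second across all devices, assuming each device
    fails independently once per `years_between_failures`."""
    return devices / (years_between_failures * SECONDS_PER_YEAR)


seconds_in_30_years = 30 * SECONDS_PER_YEAR
print(f"30 years = {seconds_in_30_years:,.0f} seconds")
print(f"1.0 billion devices: {fleet_failures_per_second(1.0e9, 30):.2f} failures/s")
print(f"1.5 billion devices: {fleet_failures_per_second(1.5e9, 30):.2f} failures/s")
```

Under these assumptions, both device counts give a little more than one failure per second.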

  • We store every object at least 3 times. When operating on objects, we compare them and if one of them doesn't agree, we throw it away. The number 3 (N) is configurable, and the configuration is stored N+2 times for additional robustness.
  • Every file is written 6 times: 3 times forward and 3 times backward.
  • Every file name is a palindrome and written twice, e.g. “status.dat” becomes “status.dattad.sutatsstatus.dattad.sutats”. This reduces the valid file name length to 63 (floor(255/4)), but the robustness is worth it.
  • Instead of opening a file by name, which can obviously fail if one letter is shot to pieces by cosmic rays, every file open operation is preceded by a glob listing, so we can select the closest match from the many cosmic-ray-attacked palindrome file names.
  • The use of built-in types smaller than Int (i.e. Boolean, Byte, Short) is not permitted. Instead, we use wrapper value classes that store the value in a duplicated format. E.g. a Boolean value is stored 32 times in an Int, so “true” is 11111… and “false” is 00000…, and for testing the value of a Boolean, we count the bits set to 1. If more than half the bits are 1, the value is considered “true”, otherwise “false”. This makes Boolean very robust against bit flips. For Byte we store each bit 4 times; for Short we use error-correcting Hamming codes instead of redundancy. The choice of Int originates in the fact that JVM stack slots are 32 bits, so using anything less than that would be a waste.
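A few of the guidelines above can be sketched in code. The following is a minimal illustration in Python rather than on the JVM, and it is not the tox4j implementation: all function names are hypothetical, the 32-bit Int is simulated with a mask, and `difflib` similarity stands in for "closest match" because the text does not say which metric is used.

```python
import difflib
from collections import Counter

INT_BITS = 32
INT_MASK = (1 << INT_BITS) - 1  # simulates a 32-bit Int


def majority(copies):
    """Return the value held by a strict majority of the copies; dissenters are discarded."""
    value, count = Counter(copies).most_common(1)[0]
    if count * 2 <= len(copies):
        raise ValueError("no majority among copies")
    return value


def encode_boolean(value: bool) -> int:
    """Store one Boolean 32 times, once in every bit of an Int."""
    return INT_MASK if value else 0


def decode_boolean(word: int) -> bool:
    """True only if more than half of the 32 bits are set, otherwise False."""
    return bin(word & INT_MASK).count("1") > INT_BITS // 2


def robust_name(name: str) -> str:
    """Turn a file name into a palindrome and write it twice."""
    palindrome = name + name[::-1]
    return palindrome * 2


def find_file(name: str, directory_listing):
    """Pick the listing entry closest to the robust name (assumed metric: difflib ratio)."""
    matches = difflib.get_close_matches(
        robust_name(name), directory_listing, n=1, cutoff=0.0
    )
    return matches[0] if matches else None


# Three copies of an object, one of them hit by a cosmic ray.
print(majority(["state", "state", "snate"]))

# A "true" value survives five flipped bits.
print(decode_boolean(encode_boolean(True) ^ 0b11111))

# A damaged palindrome file name is still found in the directory listing.
listing = [robust_name("status.dat").replace("t", "n", 1), robust_name("other.dat")]
print(find_file("status.dat", listing))
```

In this sketch a tie of exactly 16 set bits decodes to “false”, following the "more than half" rule in the guideline.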

We believe that these guidelines, in combination with careful defence-in-depth design, will help us reduce the effect of cosmic rays on our code by at least an order of magnitude. This means our global user base will experience a malfunction only once every 10 seconds. Future efforts will be aimed at reducing the error rate even further.
