Section1







Leakage Detection Tutorial, Part 1: Motivation

Download python scripts presented in this tutorial

The first DPA attack exploited the fact that a guessable bit of the intermediate state impacted differently on the average power consumption of a cryptographic device depending on whether it took the value 1 or 0. A set of traces, collected as plaintext inputs varied, were repeatedly split into two, according to subkey-dependent guesses at the value of the bit. The partition which revealed the biggest difference in mean power consumption was taken to reveal the correct subkey; the index associated with this maximum was taken to reveal one of possibly several locations in the trace at which the bit was manipulated.

A large class of leakage detection tests follow a similar rationale, except that the (unprotected) intermediate values are known rather than guessed. (In the presence of countermeasures such as masking, sometimes the protected values are unknown to the evaluator; we leave this case aside for the time being.) A set of traces is generated by running the algorithm repeatedly for different inputs and measuring the power consumption. The data are then partitioned according to a known value, and compared at each index to look for ‘significant’ differences in the mean. (We will define ‘significant’ formally later).

Contents

  1. Simulating Hamming Weight Leakage
  2. What Happens When We Change the Input?
  3. Fixed-Versus-Random Test
  4. Reasons the Fixed-Versus-Random Test Might Fail
  5. Correlation-Based Leakage Detection
  6. Comparing the Data Complexity of Correlation and the t-Test

Simulating Hamming Weight Leakage

Suppose a device leaks the Hamming weight of the intermediate values, and that it is running an unprotected implementation of
AES. Then at some point in time it will compute the value SBox(subkey XOR plaintext) for a given byte of plaintext. The power
consumption at this point in time will be a multiple of HW(SBox(subkey XOR plaintext)), plus a constant and some random noise,
neither of which depend on the key or the plaintext. We can simulate this as follows:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from LUTs import HW_LUT, SBox_LUT
In [2]:
# Arbitrary (fixed) key byte.
subkey = 174
# Arbitrary (fixed) plaintext byte.
plaintext = 211

# Baseline power consumption independent of data.
P_const = 50.0
# Coefficient on the data-dependent part of the power consumption.
coeff = 2.0
# Noise standard deviation. (In real life you do not know the m agnitude of the noise).
sigma = 3.0

P_total = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintext))) + sigma*np.random.randn()

N.B. It will become clear that P_const is irrelevant to the outcome of the tests, as it is cancelled out in the subtraction.
Similarly, re-scaling coeff and sigma together will leave the outcome unchanged. This means that we could, without loss of generality, simulate leakages ignoring the constant and coefficient terms and controlling the signal-to-noise ratio via
sigma only. However, we leave all the terms in for now to make the processes in operation more transparent.

What Happens When We Change the Input?

Compare this with the leakage produced by a second plaintext input. Even though 211 and 31 both have the same Hamming weight
(5), the Hamming weights of their S-box outputs after combination with the subkey of 174 are 8 and 3 respectively. So we expect
them to lead to different power consumptions:

In [3]:
plaintext = np.array([211, 31], dtype=np.uint8)
P_total = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintext))) + sigma*np.random.randn(2)

str_stdout = 'Absolute difference in power consumption for plaintexts %i and %i: %0.2f'% (plaintext[0], 
                                                                                          plaintext[1], 
                                                                                          np.abs(P_total[0]- P_total[1]))

print(str_stdout)
Absolute difference in power consumption for plaintexts 211 and 31: 4.30

However, because the noise is a random process, we will also get a different power consumption just by repeating the computation
with the same input. This may sometimes be larger than the difference produced by different plaintexts.

In [4]:
plaintext = np.array([211, 211], dtype=np.uint8)
P_total = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintext))) + sigma*np.random.randn(2)

str_stdout = 'Absolute difference in power consumption for plaintexts %i and %i: %0.2f' % (plaintext[0], 
                                                                                           plaintext[1], 
                                                                                           np.abs(P_total[0]- P_total[1]))      

print(str_stdout)
Absolute difference in power consumption for plaintexts 211 and 211: 6.24

So the question is, how can we tell if the difference we are seeing is caused by changing the plaintext or is simply a result of
the random noise? This is where the t-test comes in, as proposed for use in leakage evaluation by Goodwill, Jun, Jaffe and
Rohatgi in 2011 (“A testing methodology for side-channel resistance validation”).

Fixed-Versus-Random Test

A so-called “fixed-versus-random” experiment compares a set of traces all associated with the same fixed input against a set in
which the inputs have been allowed to vary at random. Typically the entire state is fixed (conversely, random) but since we are
considering the power consumption of a given byte-wise operation we only need consider one byte. The key is kept fixed and is
the same in both sets.

In [5]:
# Total sample size for the two samples.
N = 100
# Vector of fixed plaintexts.
plaintextF = 211*np.ones((N//2, 1), dtype=np.uint8)
# Vector of random plaintexts.
plaintextR = np.random.randint(0, 256, (N//2, 1), dtype=np.uint8)
P_totalF = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextF))) + sigma*np.random.randn(N//2,1)
P_totalR = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextR))) + sigma*np.random.randn(N//2,1)

str_stdout = 'Absolute difference in mean power consumption between %i fixed and %i random plaintexts: %0.2f'  \
             % (N/2, N/2, np.abs(np.mean(P_totalF) - np.mean(P_totalR)))
print(str_stdout)
Absolute difference in mean power consumption between 50 fixed and 50 random plaintexts: 7.17

There is a difference … but what does it mean? Random variation in both sets implies that the difference of means will never be exactly zero even if there is no data-dependent part to the traces at all. (Note, though, that the value of P_const has no bearing on the difference as it cancels out; this is the rationale for simulating and/or modelling leakage with an assumed constant of zero).

The t-test is a statistical method designed to answer the question of whether an observed difference between two sample means is ‘large enough’ (relative to the variation observed in the samples) to conclude that there is a real, underlying difference. The t-statistic is defined as the difference in sample means divided by an estimate of the standard error of that observed difference relative to the ‘real’ population difference. There are different options for computing this latter quantity depending on the assumptions one makes about the data. The original TVLA paper uses Welch’s version, which allows or unequally sized samples with different population variances.

We will return to the formalities of the test later. For now, we will rely on the application of it as proposed by Goodwill et al. They derive a threshold of +/- 4.5 as the decision criteria or whether or not an observed t-statistic is ‘large enough’ to be unlikely to occur if the two populations actually have the same mean after all. We apply this to our example scenario:

In [6]:
# Sample variance of the 'fixed plaintext' traces
varF = np.var(P_totalF)
# Sample variance of the 'random plaintext' traces
varR = np.var(P_totalR)

# The formula for Welch's t-statistic:
tStatistic = (np.mean(P_totalF) - np.mean(P_totalR)) / np.sqrt(varF/(N/2) + varR/(N/2))

if tStatistic > 4.5:
    std_out = 't = %0.2f > 4.5: leakage detected.' % (tStatistic)
    print(std_out)
else:
    if tStatistic < -4.5:
        std_out = 't = %0.2f < -4.5: leakage detected.' % (tStatistic)
        print(std_out)
    else:
        std_out = '-4.5 < t = %0.2f < 4.5: no leakage detected.' % (tStatistic)
        print(std_out)
t = 12.44 > 4.5: leakage detected.

Reasons the Fixed-Versus-Random Test Might Fail

The parameters so far have been deliberately chosen such that the test detects the leakage that is present with high probability — although, any statistical procedure has a certain probability of error (that is, a false negative in our constructed example). We now look at some of the ways that a t-test can fail to detect leakage even though it is present.

Insufficient Data

First, the test might fail because the sample size is too small.

In [7]:
# Try playing around with this number to get a sense of the range in which the 
# test does and doesn't succeed.
Nnew = 10
# Sample variance of the first Nnew/2 'fixed plaintext' traces.
VarF = np.var(P_totalF[0: Nnew//2: 1])
# Sample variance of the first Nnew/2 'random plaintext' traces.
VarR = np.var(P_totalR[0: Nnew//2: 1])

# The formula for Welch's t-statistic:
tStatistic = (np.mean(P_totalF[0: Nnew//2: 1]) - np.mean(P_totalR[0: Nnew//2: 1])) / np.sqrt(varF/(Nnew/2) + varR/(Nnew/2))

if tStatistic > 4.5:
    std_out = 't = %0.2f > 4.5: leakage detected.' % (tStatistic)
    print(std_out)
else:
    if tStatistic < -4.5:
        std_out = 't = %0.2f < -4.5: leakage detected.' % (tStatistic)
        print(std_out)
    else:
        std_out = '-4.5 < t = %0.2f < 4.5: no leakage detected.' % (tStatistic)
        print(std_out)
t = 5.28 > 4.5: leakage detected.

Low Signal-to-Noise Ratio

Second, the test might fail because there is too much noise relative to the variance of the data-dependency — that is, the ‘signal’. (In fact, this problem can be overcome by increasing the sample size, as the above experiment perhaps hints towards. We will discuss this in a later section of the Tutorial).

There are two ways of demonstrating this in our simulated experiment: increasing the noise magnitude, or decreasing the signal strength, a.k.a. the ‘effect size’. We begin with the former:

In [8]:
# Try playing around with this number to get a sense of the impact of noise.
sigma_new = 20.0

# Simulate the 'noisier' traces using the same intermediate values:
P_totalF_noisier = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextF))) + sigma_new*np.random.randn(N//2,1)
P_totalR_noisier = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextR))) + sigma_new*np.random.randn(N//2,1)

VarF = np.var(P_totalF_noisier)
VarR = np.var(P_totalR_noisier)
# The formula for Welch's t-statistic:
tStatistic = (np.mean(P_totalF_noisier) - np.mean(P_totalR_noisier)) / np.sqrt(varF/(N/2) + varR/(N/2))
              
if tStatistic > 4.5:
    std_out = 't = %0.2f > 4.5: leakage detected.' % (tStatistic)
    print(std_out)
else:
    if tStatistic < -4.5:
        std_out = 't = %0.2f < -4.5: leakage detected.' % (tStatistic)
        print(std_out)
    else:
        std_out = '-4.5 < t = %0.2f < 4.5: no leakage detected.' % (tStatistic)
        print(std_out)
t = 28.67 > 4.5: leakage detected.

Returning to our original value of sigma we next consider what it means to decrease the signal strength. Most obviously, we could reduce the value of coeff — we leave this as an experiment for the reader. What is maybe less obvious is that a different choice of plaintext input for the ‘fixed’ acquisition can lead to a smaller margin between the population means. In the language of statistical hypothesis testing this difference is known as the ‘effect size’ and plays an important role in test design and in the interpretation of outcomes, as we shall explore more formally in a later section of the tutorial.

Our initial choice of fixed input was chosen to produce an intermediate value with a Hamming weight of 8, maximising the distance between the population means of the fixed and random acquisitions. (The latter is constructed to have uniformly distributed byte intermediates, and therefore a mean Hamming weight of 4).

But what if we now choose a fixed input that leads to an intermediate with a Hamming weight of 5, minimising the true underlying distance between the two samples?

In [9]:
# Alternative input byte for the fixed plaintext acquisition.
plaintextF_close = 51*np.ones((N//2,1),  dtype=np.uint8)
P_totalF_close = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey,plaintextF_close))) + sigma*np.random.randn(N//2,1)
      
# Sample variance of the 'fixed plaintext' traces
varF = np.var(P_totalF_close)
# Sample variance of the 'random plaintext' traces
varR = np.var(P_totalR)
# The formula for Welch's t-statistic:
tStatistic = (np.mean(P_totalF_close) - np.mean(P_totalR)) / np.sqrt(varF/(N/2) + varR/(N/2))
              
if tStatistic > 4.5:
    std_out = 't = %0.2f > 4.5: leakage detected.' % (tStatistic)
    print(std_out)
else:
    if tStatistic < -4.5:
        std_out = 't = %0.2f < -4.5: leakage detected.' % (tStatistic)
        print(std_out)
    else:
        std_out = '-4.5 < t = %0.2f < 4.5: no leakage detected.' % (tStatistic)
        print(std_out)
-4.5 < t = 3.90 < 4.5: no leakage detected.

The parameters of the experiment have been chosen so that that the detection test is likely to fail for a ‘close to average’ Hamming weight. This can be mitigated for by increasing the sample size (try it!) However, it is possible to choose a plaintext input such that there is no underlying difference between the means of the fixed and random distributions, as we consider next.

Unfortunate Choice of Fixed Input

Thirdly, the test might fail because the chosen plaintext for the ‘fixed’ acquisition leads to an intermediate value with a Hamming weight equal to the average Hamming weight for a uniform random plaintext. Consider, for example, the plaintext byte 223, which itself has a Hamming weight of 7 but, when combined with the subkey we have chosen for our experiment (174) leads to an S-box output with a Hamming weight of 4.

In such a case, there truly is no difference in the population means of the two samples generated by the fixed-versus-random experiment. The distributions aren’t the same — the overall variance of the total power consumption is increased when the inputs are allowed to vary — but the t-test is designed to detect mean differences only. Increasing the sample size cannot help us here (try it!) as no amount of data will enable the t-test to detect a difference that isn’t there (except by chance; see below).

This hints towards a limitation of the t-test, which we will return to in a later section of the tutorial. More immediately, it highlights a shortcoming in the fixed-versus-random experiment design, as its effectiveness can be seen to depend on the choice of plaintext for the fixed-input acquisition.

However, this can be mitigated for by repeating the experiment with a number of different fixed inputs. Moreover, in a typical real-world scenario, it is not just one trace point being tested but a progression of measurements associated with a code sequence. Any operation depending on the input and the secret key has the potential to leak information, and it is highly unlikely (if not impossible) for all of the intermediates derived from the fixed input to have a Hamming weight
equal to the average Hamming weight. So leakage will be detected somewhere in the trace, even if for a given fixed input it is overlooked at certain indices.

In [10]:
# Alternative input byte for the fixed plaintext acquisition.
plaintextF_alt = 223*np.ones((N//2,1), dtype = np.uint8)    
P_totalF_alt = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey,plaintextF_close))) + sigma*np.random.randn(N//2,1)
  
     
# Sample variance of the 'fixed plaintext' traces
varF = np.var(P_totalF_alt)
# Sample variance of the 'random plaintext' traces
varR = np.var(P_totalR)
# The formula for Welch's t-statistic:
tStatistic = (np.mean(P_totalF_alt) - np.mean(P_totalR)) / np.sqrt(varF/(N/2) + varR/(N/2))

if tStatistic > 4.5:
    std_out = 't = %0.2f > 4.5: leakage detected.' % (tStatistic)
    print(std_out)
else:
    if tStatistic < -4.5:
        std_out = 't = %0.2f < -4.5: leakage detected.' % (tStatistic)
        print(std_out)
    else:
        std_out = '-4.5 < t = %0.2f < 4.5: no leakage detected.' % (tStatistic)
        print(std_out)
-4.5 < t = 1.41 < 4.5: no leakage detected.

Nature of Statistical Hypothesis Tests

Fourthly, the test might fail simply by chance. Statistical hypothesis testing provides a framework for controlling error rates — that is, reducing false positives and negatives to within acceptable bounds. But the errors are traded off against each other and are never eliminated. We will talk more about how to design experiments with formal statistical criteria in mind, but even a well-designed test has a certain probability of failing to detect leakage when it is present, and (conversely) of concluding that leakage is present when it isn’t.

The test scenario we have chosen as our main example can be experimentally shown to have a very low probability of failing to detect (in formal statistical language, we say that it has ‘high power’)…

In [11]:
N = 100
numReps = 1000000
countDetect = 0

for rep in range(0, numReps, 1):
    # Vector of fixed plaintexts.
    plaintextF = 211*np.ones((N//2,1), dtype = np.uint8)
    # Vector of random plaintexts.
    plaintextR = np.random.randint(0, 256, (N//2, 1), dtype=np.uint8)
    P_totalF = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextF))) + sigma*np.random.randn(N//2,1)
    P_totalR = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextR))) + sigma*np.random.randn(N//2,1)

    # Sample variance of the 'fixed plaintext' traces.
    varF = np.var(P_totalF)
    # Sample variance of the 'random plaintext' traces.
    varR = np.var(P_totalR)
    # The formula for Welch's t-statistic:
    tStatistic = (np.mean(P_totalF) - np.mean(P_totalR)) / np.sqrt(varF/(N/2) + varR/(N/2))
    countDetect = countDetect + (np.abs(tStatistic) > 4.5)

std_out = 't-test failed to detect leakage %d out of %d times' % (numReps-countDetect, numReps)
print(std_out)
t-test failed to detect leakage 0 out of 1000000 times

However, changing the sample size or the effect size (try it!) will reduce the power of the test and result in more false negatives. The key thing to note here is that the outcome is not wholly determined by the sample size and signal-to-noise ratio; whilst there are configurations which will almost always succeed and configurations which will almost always fail, there are scenarios in between these two extremes where the probabilistic nature of the test procedure is more readily appreciated. For example, see if you can experimentally find a sample size at which the above test succeeds in detecting leakage about 4 in every 5 attempts. We say that such a test has a power of 80%; a forthcoming section of the tutorial will show how to ascertain and/or control the power formulaically rather than experimentally.

To illustrate the possibility of false positives we now consider a scenario where both subsamples have exactly the same distribution, so that ideally we should not be able to ‘detect’ any difference between them.

In [12]:
N = 100
numReps = 1000000
countDetect = 0

for rep in range(0, numReps, 1):
    # Vector of fixed plaintexts.
    plaintextF = 211*np.ones((N//2,1), dtype = np.uint8)
    # Simulate two trace points, both based on the same plaintext input
    P_total1 = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextF))) + sigma*np.random.randn(N//2,1)
    P_total2 = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextF))) + sigma*np.random.randn(N//2,1)

    # Sample variance of trace 1.
    var1 = np.var(P_total1)
    # Sample variance of trace 2.
    var2 = np.var(P_total2)
    # The formula for Welch's t-statistic:
    tStatistic = (np.mean(P_total1) - np.mean(P_total2)) / np.sqrt(var1/(N/2) + var2/(N/2))
    countDetect = countDetect + (np.abs(tStatistic) > 4.5)

std_out = 't-test "falsely" detected leakage %d out of %d times' % (countDetect, numReps)
print(std_out)
t-test "falsely" detected leakage 20 out of 1000000 times

In a later part of the tutorial we will explain how the threshold for detection is set (implicitly, in the case of standard TVLA) according to a chosen ‘acceptable’ rate of false positives which can be adjusted to produce a lower rate of false negatives.

Correlation-Based Leakage Detection

A Eurocrypt 2016 paper by Durvaux and Standaert (“From Improved Leakage Detection to the Detection of Points of Interests in Leakage Traces”) proposes a method which bypasses the risk of poorly-chosen inputs to the fixed-versus-random detection test. Instead of testing for a difference in average power consumption between two specially constructed samples, it tests for correlation between measured power consumption and predicted power consumption in one random sample.

However, the advantages of this approach do not come ‘for free’. It requires detailed knowledge about the form of the leakage — at each point in the trace! — in order to make the predictions that lead to high correlations at the vulnerable indices. This information is obtained in a profiling phase, where traces taken from an identical device are used to estimate a power model. The correlation coefficient can be used without this added knowledge, for example by correlating the leakage with the known value of a single bit, bypassing the need for a profiled power model. But this is essentially equivalent to a t-test where the partition is made according to that same targeted bit (that is, there is nothing superior about the correlation coefficient per se; it is taking advantage of the scope to incorporate more information that produces the improvement. See, e.g., the 2010 paper “One for all – all for one: Unifying standard differential power analysis attacks” by Mangard, Oswald and Standaert, in IET Information Security).

We step through the basic methodology in the following, leaving some of the statistical formalities for a future section of the tutorial.

The Profiling Phase

Keeping all the same leakage assumptions as before, suppose that we have 1000 measurements per intermediate value, from which to learn the average power consumption (remembering that, in real life, this is unknown to the evaluator):

In [13]:
Np = 1000
# Initialise a vector for the power model look-up table, which will be 
# constructed by acquiring the average power consumption per byte value:
PM_LUT = np.zeros((256,1), dtype=np.float64)

# Iterate over every byte value
for v in range(0,256,1):
    # To minimise unnecessary storage complexity, we overwrite the
    # byte-dependent simulated leakages on each iteration:
    P_total_v = P_const + coeff*HW_LUT(v*np.ones((Np,1), dtype=np.uint8)) + sigma*np.random.randn(Np, 1)
    
    # Compute the mean, which will be used to predict the leakage associated
    # with intermediates taking the value v:
    PM_LUT[v] = np.mean(P_total_v)

It is easy to see that the derived power model tracks with the Hamming weight of the intermediates, as we would expect. (Note that correlation measures the linear relationship between two variables, which is invariant to changes in scale and location).

In [14]:
int_byte_axis = np.linspace(0, 255, 256, dtype=np.uint8)    
plt.figure(1)
plt.clf()
plt.plot(int_byte_axis, PM_LUT, 'b', label='Profiled power model')
plt.plot(int_byte_axis, HW_LUT(int_byte_axis), 'r', label='Hamming Weight')
plt.xlim((0,255))
plt.xlabel('Intermediate byte')
plt.ylabel('LUT output')
plt.title('Profiled power model')
plt.legend(loc='right')
plt.show()

The Detection Phase

For the correlation-based detection, measurements are taken from the target device as it operates consecutively on just one set of uniformly random inputs. The power model, and knowledge of the plaintext and intermediate operations, are together used to make a prediction for the data-dependent part of the leakage, which is correlated with the measurements at each point in the trace to determine the indices (if any) at which a particular target intermediate leaks. As with the t-test, distributional assumptions are used to derive a suitable threshold at which to conclude that there is leakage. (The precise details are outside
the scope of this present tutorial, but can be found in Durvaux and Standaert, 2016).

As before, for now we consider just a single simulated point of interest in order to demonstrate the methodology.

In [15]:
# Generate random plaintext bytes.
plaintext_corr = np.random.randint(0, 256, size=(N, 1), dtype=np.uint8)
# Simulate the associated leakage:
P_total_corr = P_const + coeff*HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintext_corr)))
# Predict the data-dependent part of the leakage:
P_predicted = PM_LUT[SBox_LUT(np.bitwise_xor(subkey, plaintext_corr)), 0]

# Compute the correlation between the predictions and the simulated leakages:
r = np.corrcoef(P_predicted, P_total_corr, rowvar= False)
r = r[0,1]

The following transformation of the correlation is approximately distributed accoriding to a standard normal, which can be used to provide a suitable threshold for deciding whether or not there is evidence for leakage.

In [16]:
rZ = 0.5*np.log((1+r)/(1-r))*np.sqrt(N-3)

if rZ > 4.5:
    std_out = 'r = %0.2f; rZ = %0.2f > 4.5: leakage detected.' % (r, rZ)
    print(std_out)
else:
    if rZ < -4.5:
        std_out = 'r = %0.2f; rZ = %0.2f < -4.5: leakage detected.' % (r, rZ)
        print(std_out)
    else:
        std_out = 'r = %0.2f; -4.5 < rZ = %0.2f < 4.5: no leakage detected.' % (r, rZ)
        print(std_out)
r = 1.00; rZ = 38.51 > 4.5: leakage detected.

As mentioned already, the correlation test has the advantage of not depending on the appropriate choice of a partition. There is therefore no requirement to repeat it for different inputs. It also ‘ties’ the discovered leakage to a particular intermediate value (that is, the one that is computed in order to make the leakage prediction). The fixed-versus-random t-test, by contrast, highlights all points which are in any way dependent on the input — even those which occur before key mixing (and so do not depend on secret information) and those which occur after diffusion (and are therefore not susceptible to standard DPA attack methodologies). It is much easier to exploit or correct for the vulnerabilities discovered by correlation-based detection — but, on the other hand, it becomes necessary to test all interesting intermediates separately, which typically adds more to the complexity of the evaluation process than the obligation to repeat the fixed-versus-random test for different inputs.

Comparing the Data Complexity of Correlation and the t-Test

Since one-off statistical experiments are subject to error, in order to compare the average performance of the two tests fairly we will need to run repeat experiments. We consider the same leakage conditions as before, and use the leakage model already obtained. We try four different fixed plaintext inputs, each producing a different average distance from the global average over all bytes. To simplify the presentation of the code we now take advantage of the fact that the detection tests are invariant to scale and location, and omit the constant and the coefficient on the data-dependent part. Note that this code has not been optimised for computational efficiency but rather we have tried to keep it easy to follow.

In [17]:
# Number of repetitions.
numReps = 500
# Maximum number of traces.
maxN = 2000
# Arbitrary (fixed) key byte.
subkey = 174
# Four different choices of fixed plaintext byte.
plaintextF = np.array([223, 51, 80, 211], dtype=np.uint8)
# Initialise counter for correlation based detections.
detect_corr = np.zeros(maxN//2)
# Initialise counter for t-test based detections.
detect_FvR = np.zeros((maxN//2, len(plaintextF)))

# Keep the same sigma as previously (and thus the same signal signal strength,
# as we had coeff = 1.0).
sigma = 6.0

for rep in range(0, numReps, 1):
    # All of the plaintexts are random for the correlation-based test; we will
    # use the first maxN/2 of them as the random set in the t-test detection.
    plaintextR = np.random.randint(0, 256, size=(maxN,1), dtype=np.uint8)
    P_R = HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextR))) + sigma*np.random.randn(maxN, 1)
    P_F = np.zeros((maxN//2, len(plaintextF)))
    for pt in range(0, len(plaintextF), 1):
        P_F[:,pt] = np.reshape((HW_LUT(SBox_LUT(np.bitwise_xor(subkey, plaintextF[pt]*np.ones((maxN//2, 1), dtype=np.uint8)))) + sigma*np.random.randn(maxN//2,1)), newshape=(1000,))
        
    for n in range(4, maxN, 2):
        # Predict the data-dependent part of the leakage for the correlation-
        # based detection:
        P_predicted = PM_LUT[SBox_LUT(np.bitwise_xor(subkey, plaintextR[0:n])),0]
        # Compute the correlation between the predictions and the simulated
        # leakages:
        r = np.corrcoef(P_predicted, P_R[0:n], rowvar=False)
        r = r[0,1]
        # Take the Fisher transform of this correlation:
        rZ = 0.5*np.log((1+r)/(1-r))*np.sqrt(n-3)

        # Compare rZ against the threshold for the detection and update the
        # tally:
        detect_corr[n//2] = detect_corr[n//2] + (np.abs(rZ) > 4.5)
        # Now compute the three different fixed-versus-random test statistics
        # and similarly store the detection result:
        for pt in range(0, len(plaintextF), 1):
            # Sample variance of the 'random plaintext' traces.
            varR = np.var(P_R[0:n//2], ddof=1)
            # Sample variance of the 'fixed plaintext' traces.
            varF = np.var(P_F[0:n//2, pt], ddof=1)
            # The formula for Welch's t-statistic:
            tStatistic = (np.mean(P_F[0:n//2, pt]) - np.mean(P_R[0:n//2])) / (np.sqrt(varF/(n/2) + varR/(n/2)))
            # Compare tStatistic against the threshold for detection and update
            # the tally:
            detect_FvR[n//2, pt] = detect_FvR[n//2, pt] + (np.abs(tStatistic) > 4.5)
        
    
# Visually compare the detection rates for each strategy:
labelFvR = ['F-v-R with HW(IV) = 4','F-v-R with HW(IV) = 5','F-v-R with HW(IV) = 6', 'F-v-R with HW(IV) = 8']
plt.figure(2)
plt.clf()
x_axis = np.linspace(2, maxN, len(detect_FvR))
for j in range(0,len(plaintextF),1):
    plt.plot(x_axis, detect_FvR[:,j]/numReps, label=labelFvR[j])

plt.plot(x_axis, detect_corr/numReps, color='violet', label='Correlation')
plt.title('Detection rates as sample size increases')
plt.xlabel('Sample size')
plt.ylabel('Detection rate')
plt.legend(loc='best')
plt.show()

The figure shows that the relative performance of the correlation-based detection and the fixed-versus-random t-test depends on the choice of fixed input for the latter. In particular, when the distance between the fixed and the random means is maximised, correlation requires roughly twice the number of traces to achieve a similar detection rate as the t-test. On the other hand, the detection rate when the distance between the fixed and random means is non-zero but as small as possible (i.e. the Hamming weight of the intermediate is only one away from 4) is considerably reduced and grows slowly with the number of traces. (In this scenario, it takes around 6 times as many traces as the correlation to achieve full detection; we have truncated the experiment to reduce the computational complexity of the tutorial material.)

In the next part of this tutorial we will consider the more realistic scenario in which the detection tests are performed against whole sequences of traces, measured (for example) during the execution of a cryptographic algorithm.