
# List of QUBO formulations

Below is a list of 48 optimization problems with a reference to a QUBO formulation for each. While many of these are classical optimization problems from mathematics (mostly NP-hard ones), there are, interestingly, already formulations for Machine Learning, such as the L1 norm or linear regression. Graph based optimization problems often encode the graph structure (adjacency matrix) in the QUBO, while others, such as Number Partitioning or Quadratic Assignment, encode the problem matrices s.t. a non-linear system of equations can be found in the QUBO. The quadratic unconstrained binary optimization (QUBO) problem itself is an NP-hard optimization problem that is solved by finding the ground state of the corresponding Hamiltonian on a Quantum Annealer. The ground state is found by adiabatically evolving from an initial Hamiltonian with a well-known ground state to the problem Hamiltonian.

This is by no means a definitive list of all QUBO formulations out there. This list will grow over time.

For 14 of these problems the QUBO formulation is implemented in Python (including unit tests) in the QUBO-NN project. The GitHub repository can be found here.

| Problem | QUBO formulation |
| --- | --- |
| Number Partitioning (NP) | Glover et al. [2] |
| Maximum Cut (MC) | Glover et al. [2] |
| Minimum Vertex Cover (MVC) | Glover et al. [2] |
| Set Packing (SP) | Glover et al. [2] |
| Set Partitioning (SPP) | Glover et al. [2] |
| Maximum 2-SAT (M2SAT) | Glover et al. [2] |
| Maximum 3-SAT (M3SAT) | Dinneen et al. [9] |
| Graph Coloring (GC) | Glover et al. [2] |
| General 0/1 Programming (G01P) | Glover et al. [2] |
| Quadratic Assignment (QA) | Glover et al. [2] |
| Quadratic Knapsack (QK) | Glover et al. [2] |
| Graph Partitioning | Lucas [7] |
| Decisional Clique Problem | Lucas [7] |
| Maximum Clique Problem | Chapuis [19] |
| Binary Integer Linear Programming | Lucas [7] |
| Exact Cover | Lucas [7] |
| 3SAT | Lucas [7] |
| Maximal Independent Set | Djidjev et al. [8] |
| Minimal Maximal Matching | Lucas [7] |
| Set Cover | Lucas [7] |
| Knapsack with Integer Weights | Lucas [7] |
| Clique Cover | Lucas [7] |
| Job Sequencing Problem | Lucas [7] |
| Hamiltonian Cycles Problem | Lucas [7] |
| Minimal Spanning Tree | Lucas [7] |
| Steiner Trees | Lucas [7] |
| Directed Feedback Vertex Set | Lucas [7] |
| Undirected Feedback Vertex Set | Lucas [7] |
| Feedback Edge Set | Lucas [7] |
| Traveling Salesman (TSP) | Lucas [7] |
| Traveling Salesman with Time Windows (TSPTW) | Papalitsas et al. [1] |
| Graph Isomorphism | Calude et al. [4] |
| Subgraph Isomorphism | Calude et al. [4] |
| Induced Subgraph | Calude et al. [4] |
| Capacitated Vehicle Routing (CVRP) | Irie et al. [5] |
| Multi-Depot Capacitated Vehicle Routing (MDCVRP) | Harikrishnakumar et al. [6] |
| L1 norm | Yokota et al. [3] |
| k-Medoids | Bauckhage et al. [10] |
| Contact Map Overlap Problem | Oliveira et al. [11] |
| Minimum Multicut Problem | Cruz-Santos et al. [12] |
| Broadcast Time Problem | Calude et al. [13] |
| Maximum Common Subgraph Isomorphism | Huang et al. [14] |
| Densest k-subgraph | Calude et al. [15] |
| Longest Path Problem | McCollum et al. [16] |
| Airport Gateway Assignment | Stollenwerk et al. [18] |
| Linear regression | Date et al. [17] |
| Support Vector Machine | Date et al. [17] |
| k-means clustering | Date et al. [17] |

Note that a few of these references define Ising formulations and not QUBO formulations. However, these two formulations are very close to each other. As a matter of fact, converting from Ising to QUBO is as easy as setting $S_i \rightarrow 2x_i - 1$, where $S_i$ is an Ising spin variable and $x_i$ is a QUBO variable.
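In code, the conversion in both directions is a one-liner each (a minimal sketch):

```python
# Ising spin s in {-1, +1} <-> QUBO variable x in {0, 1} via s = 2x - 1.
def qubo_to_spin(x):
    return 2 * x - 1

def spin_to_qubo(s):
    return (s + 1) // 2

assert [qubo_to_spin(x) for x in (0, 1)] == [-1, 1]
assert [spin_to_qubo(s) for s in (-1, 1)] == [0, 1]
```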

Published on

# Quantum Annealing Hamiltonian Example Calculation

Let's quickly calculate what happens.

We have an extremely simple Number Partitioning problem consisting of the set of numbers $\{1, 2, 3\}$. It has an optimal solution, which is splitting the set into the two subsets $\{1, 2\}$ and $\{3\}$.

The Ising formulation for Number Partitioning is given as $A(\sum_{j=1}^{m} n_j x_j)^2$, where $n_j$ is the j-th number in the set. Thus, with $A=1$, for the given set the result is $(x_1 + 2x_2 + 3x_3)^2 = x_1x_1 + 4x_1x_2 + 6x_1x_3 + 4x_2x_2 + 12x_2x_3 + 9x_3x_3$. Now, the Hamiltonian for an Ising formulation is very simple: replace $x_i$ by a Pauli-Z gate. We can now write the Hamiltonian using qubits (with an exemplary qubit state):


Thus, we can evaluate the cost of a possible set of subsets by using qubits. For instance, the state $|010\rangle$ means that the numbers 1 and 3 are in one subset and the number 2 is in the other subset. It should be noted that $\langle 1|Z_i|1\rangle = -1$ and $\langle 0|Z_i|0\rangle = 1$.

More specifically (where $x_i = Z_i$), the quantum formulation for our problem is:

$\langle 010|x_1x_1 + 4x_1x_2 + 6x_1x_3 + 4x_2x_2 + 12x_2x_3 + 9x_3x_3|010\rangle$

Here, we have $\langle 0|x_1|0\rangle = 1$, $\langle 1|x_2|1\rangle = -1$ and $\langle 0|x_3|0\rangle = 1$. Thus, the total cost is:

$\langle 010|x_1x_1 + 4x_1x_2 + 6x_1x_3 + 4x_2x_2 + 12x_2x_3 + 9x_3x_3|010\rangle = 1 - 4 + 6 + 4 - 12 + 9 = 4$.

Note that one of the optimal solutions is:

$\langle 001|x_1x_1 + 4x_1x_2 + 6x_1x_3 + 4x_2x_2 + 12x_2x_3 + 9x_3x_3|001\rangle = 1 + 4 - 6 + 4 - 12 + 9 = 0$.

Another one would have been the state $|110\rangle$.
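The expectation values above can be reproduced classically by mapping $|0\rangle \mapsto +1$ and $|1\rangle \mapsto -1$; a minimal sketch:

```python
numbers = [1, 2, 3]  # the Number Partitioning instance

def cost(spins):
    # A * (sum_j n_j * s_j)^2 with A = 1; spins are +1/-1.
    return sum(n * s for n, s in zip(numbers, spins)) ** 2

# |010> -> (+1, -1, +1), |001> -> (+1, +1, -1), |110> -> (-1, -1, +1)
assert cost((1, -1, 1)) == 4   # the sub-optimal state from the text
assert cost((1, 1, -1)) == 0   # optimal: {1, 2} vs {3}
assert cost((-1, -1, 1)) == 0  # the other optimal state
```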

In summary, we can use a set of qubits to evaluate the goodness of a solution. This in turn opens up interesting possibilities.


  • Lucas, Andrew. "Ising formulations of many NP problems.", 2014
Published on

# A note on Adiabatic Evolution in Quantum Annealing

In a lot of introductory literature about Quantum Annealing, Adiabatic Quantum Computing (AQC) and Quantum Annealing (QA) are explained very well, but one crucial connection always seems to be missing: what exactly is evolution, or adiabatic evolution?

First, the Hamiltonian of a system gives us the total energy of that system. It basically describes the whole system (what kinds of kinetic or potential energy exist, etc.). The ground state of a given Hamiltonian is the state associated with the lowest possible total energy in the system. Note that finding the ground state of a system is extremely tough in general. A problem such as the Traveling Salesman Problem (TSP) is therefore converted into a formulation (QUBO or Ising) for which a Hamiltonian can be set up easily.

Quantum Annealing works as per AQC as follows:

  • Set up the problem Hamiltonian for our problem, s.t. the ground state of the Hamiltonian corresponds to the solution of our problem.
  • Set up an initial Hamiltonian for which the ground state is easily known and make sure the system is in the ground state.
  • Adiabatically evolve the Hamiltonian to the problem Hamiltonian.

Adiabatic evolution follows the adiabatic theorem in that the evolution should not be too fast, else the system could jump into the next excited state above the ground state, which would result in a sub-optimal solution in the end. The general formula is as follows:

$\mathcal{H}_t = s(t)\,\mathcal{H}_i + (1 - s(t))\,\mathcal{H}_f$,

where $\mathcal{H}_i$ is the initial Hamiltonian and $\mathcal{H}_f$ is the problem Hamiltonian. The parameter $s(t)$ is modified over time. In the beginning, the system (or time-dependent) Hamiltonian consists solely of the initial Hamiltonian. In the end, it consists solely of the problem Hamiltonian (and is hopefully still in the ground state, if the adiabatic theorem was followed).
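To build intuition for the interpolation, here is a toy sketch with $2 \times 2$ matrices standing in for the initial and problem Hamiltonians (the concrete Hamiltonians and the 5-step sweep are illustrative assumptions, not a real annealer schedule):

```python
import numpy as np

# Toy Hamiltonians: initial H_i = -sigma_x, problem H_f = -sigma_z.
sigma_x = np.array([[0.0, 1.0], [1.0, 0.0]])
sigma_z = np.array([[1.0, 0.0], [0.0, -1.0]])
H_i, H_f = -sigma_x, -sigma_z

# s(t) sweeps from 1 (pure initial Hamiltonian) to 0 (pure problem Hamiltonian).
for s in np.linspace(1.0, 0.0, 5):
    H_t = s * H_i + (1 - s) * H_f
    ground = np.linalg.eigvalsh(H_t).min()  # lowest eigenvalue = ground state energy
    print(f"s={s:.2f}  ground state energy={ground:.3f}")
```

The energy landscape (here: the spectrum of `H_t`) morphs continuously from one Hamiltonian to the other, which is exactly the picture described below.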

Setting up Hamiltonians consists of translating the TSP instance to an Ising spin glass problem instance. The Hamiltonian of that problem is very well known, and by translating to an Ising problem, one automatically encodes the problem in a way s.t. minimizing the Hamiltonian (i.e. minimizing the total energy and thus getting into the ground state) automatically minimizes the Ising spin glass instance and thus the TSP instance.

With the Hamiltonians out of the way, a huge question mark sits with the adiabatic evolution part. What is going on here? The figure below shall help build an intuition.

One can see different system Hamiltonians over time and the corresponding ground state (black dot) of all possible states. The experiment would now start with the light blue Hamiltonian and over time slowly change to the problem Hamiltonian (dark purple). Thus, we are modifying the energy landscape over time! Looking back at the time-dependent Hamiltonian equation further above, this suddenly makes sense. The time-dependent Hamiltonian changes over time in that it consists of both the initial Hamiltonian and the problem Hamiltonian in different fractions. Thus, the energy landscape changes over time in that it first consists solely of the initial Hamiltonian and later more and more of the problem Hamiltonian. So the literature is not missing a description of adiabatic evolution after all. Hopefully though, this visualization helps people out there grasp the intuition behind it more easily.

There are quite a few more topics to cover, but this post ends here to keep things short and concise. More will likely follow.

A project this was related to is QUBO-NN.


  • McGeoch, Catherine C. "Adiabatic quantum computation and quantum annealing: Theory and practice."
  • Lucas, Andrew. "Ising formulations of many NP problems.", 2014
Published on

# Multiplying large numbers with Neural Networks

When working on QUBO-NN, one of the problem types (specifically: Quadratic Knapsack) was causing issues in that a good neural network (with a decent $R^2$ coefficient) could not be trained. One of my ideas was that a possible cause could be the large numbers that had to be multiplied to get to a solution. Judging from this stackexchange post, I was kind of right. In the end, it was not the main cause, but it did influence the number of nodes in the neural network needed to train decent models.

I decided to do a small experiment to test this out.

First, and this is also known in the literature (though the publication is rather old): the larger the numbers become, the tougher it gets to train a neural network to multiply them. The first experiment is set up as follows. The dataset consists of 10000 randomly generated numbers in the range 1 to n, where n is varied across configurations (50, 200, 500, 2000). The neural network architecture consists of just one hidden layer of size 20 (and this is fixed). The optimizer of choice is Adam with a learning rate of 0.005, and the activation function is ReLU. I always train for 500 epochs and use a batch size of 10. The dataset is normalized between 0 and 1.

The next figure shows an interesting result.

In the first 300 epochs, one would assume that the model for n=2000 is a failure, with an $R^2$ coefficient below 0.96. The jumps are also extremely interesting --- each model has its own jump in the $R^2$ coefficient, and the lower n, the earlier the jump happens.

Future work would include training for thousands of epochs and observing whether the worst model (with n=2000) still improves further.

A next experiment shows that including more nodes helps immensely (c.f. the next figure). The worst model (n=2000) is trained on a neural network with hidden layer size 100 (instead of 20).

The solution proposed in the previously linked stackexchange post to use the logarithm works extremely well: For the worst model (with n=2000), after just 6 epochs the $R^2$ coefficient is at 0.9999999999987522. Damn. The previously cited publication supports this idea of using the logarithm with similar results. The idea is simple: Before training, transform the input and target using the natural logarithm. Afterwards, if one requires the actual numbers, simply take the exponential of the result.
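The transform itself can be sanity-checked without any neural network; a minimal sketch:

```python
import math
import random

random.seed(0)

# log(a*b) = log(a) + log(b): after the log transform the network only
# needs to learn addition; exp() recovers the actual product.
def multiply_via_logs(a, b):
    return math.exp(math.log(a) + math.log(b))

for _ in range(100):
    a, b = random.randint(1, 2000), random.randint(1, 2000)
    assert abs(multiply_via_logs(a, b) - a * b) / (a * b) < 1e-9
```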

In summary, training neural networks to multiply large numbers quickly leads to huge training times (with a large number of epochs required to reach a good $R^2$ coefficient). The logarithm trick helps, since addition is trivial for a neural network to learn. Adding more nodes also helps.

The source code for the experiments can be found here.

Published on

# QUBO NN - Reverse Engineering QUBO matrices

I've been trying to create a reverse-engineering pipeline for QUBO matrices, and it has been quite interesting - not as easy as one would think. The story here is quite simple: Since the current global quantum compute is concentrated across a few companies and institutions, what if these institutions read your QUBO matrices and infer secret information? For instance, a hedge fund that encodes some secret alpha in its QUBO matrices has to trust those institutions not to steal it and thereby instantly devalue that alpha.

In classical compute (with cloud computing) there are interesting ideas such as homomorphic computation (using homomorphic encryption). The goal of the project is to create a pipeline that can classify the problem a given QUBO matrix is based on and then infer the parameters that the QUBO matrix has been generated with. Below a schematic overview is given.

I used neural networks for both steps: multi-class classification (step a) and regression (step b). Now, the results are quite promising.

The first step, i.e. classifying, is pretty straightforward. I set up 11 problems and generated 100000 problem instances each. The chart below shows how after just 40 epochs we are already near 0 in terms of error rate:

The second step is not as straightforward. 5 of the problems turned out to be trivially reversible. The $R^2$ coefficient comes very close to 1.0:

Maximum Cut needs additional thoughts (the blog post about that can be found here) - specifically, the output needs to be an adjacency matrix and not an edge list. This is actually very important:

As for the rest, I am actively looking into it. A table with all 11 problems and their reversibility can be found here. Some problems such as Quadratic Assignment (QA) I assumed to be non-reversible, but I was wrong. The QUBO of a QA problem defines an over-determined linear system of equations, which can be solved by a neural network. However, the best I got so far is an $R^2$ coefficient of 99%. Not bad, but could be better. The problem I have really struggled with so far is Maximum 2-SAT. I am not 100% sure yet, but my hunch is that it is simply too complex for a standard fully connected neural network.

For anyone interested, the Github repository can be found here.

Published on

# Learning with Errors Problem Example in Post Quantum Cryptography

There are a multitude of approaches in post-quantum cryptography that promise security in a world where Quantum Computers are powerful enough to break RSA using Shor's algorithm. Some interesting ones include isogeny based ones or lattices, to which Learning with Errors (LWE) belongs. The contribution of this post is a simple and easily modifiable example on how asymmetric encryption based on LWE works.

Initially, B creates an over-defined and solvable system of equations such as:

5a + 4b + c = 17 mod 20
3a + 10b + c = 19 mod 20
10a + 5b + 3c = 14 mod 20

with a=2, b=1 and c=3. Those three numbers are the private key of B.

Next, B adds errors to the equation system:

5a + 4b + c = 17+1 mod 20
3a + 10b + c = 19-2 mod 20
10a + 5b + 3c = 14+1 mod 20

and denotes the coefficients and the wrong values the public key:

5   4   1   18
3   10  1   17
10  5   3   15

with modulo 20.

A now wants to encrypt a message using B's public key and send it to B. She goes bit by bit, starting with the very first bit of the message, which in our example is 1. For each row of the public key she flips a coin and sums up the rows that came up tails. Let's assume rows 1 and 3 came up. She sums up the coefficients and values and ends up with:

15a + 9b + 4c = 18+15 mod 20 = 13 mod 20

Depending on the message bit, which in our example is 1, she adds an error:

  • a large one if the bit is 1
  • a small one if the bit is 0

In our case we change 13 mod 20 to, for example, 2 mod 20. A sends to B the coefficients and the altered value:

15  9   4   2

B now inserts the private key and calculates:

15a + 9b + 4c = 15*2 + 9*1 + 4*3 = 51 mod 20 = 11 mod 20

Since the value transmitted by A is 2, which is far from 11 (a large error), B assumes the bit was 1. Likewise, if the bit had been 0, A would have sent, for example, 12 mod 20 instead of 2 mod 20.

The reason this scheme works is that it is very easy to add errors to the system of equations, but very tough to find the errors when a, b, c are unknown.
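The exchange above can be replayed in a few lines. A toy sketch (the exact error sizes and the decoding threshold are assumptions for illustration; this is in no way a secure implementation):

```python
q = 20
public_key = [([5, 4, 1], 18), ([3, 10, 1], 17), ([10, 5, 3], 15)]
private_key = [2, 1, 3]  # B's secret a, b, c

# A's coin flips selected rows 1 and 3; she sums coefficients and values.
rows = [public_key[0], public_key[2]]
coeffs = [sum(c) % q for c in zip(*(r[0] for r in rows))]  # [15, 9, 4]
value = sum(r[1] for r in rows) % q                        # 13

bit = 1
# Large error for bit 1, small error for bit 0.
sent = (value + (q // 2 if bit else 1)) % q

# B evaluates with the private key and checks how far off the value is.
expected = sum(c * s for c, s in zip(coeffs, private_key)) % q  # 11
diff = (sent - expected) % q
decoded = 1 if min(diff, q - diff) > q // 4 else 0
assert decoded == bit
```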

And where is the lattice in this? The public key of B can be seen as a set of basis vectors in a lattice and the attacker needs to solve the Closest Vector Problem to find the good (real) vectors, which is very tough (polynomial runtime algorithms approximate the solution only within exponential quality).

Published on

# I accidentally tried to train 1 Quadrillion parameters (and a story of structural bias)

What a rookie mistake, but a funny one indeed. So it all started with me seeing what I'd mistakenly called avoidable bias.

The train error rate bottoms out at roughly 28, which is NOT ~0 and definitely not satisfying. This looks like underfitting and a big chunk of avoidable bias, at first. As per theory, one should increase the number of parameters to deal with underfitting, so that's exactly what I did. One thing to keep in mind - statistical error (or variance), i.e. the difference between train and test errors, increases proportionally with the number of parameters. So if we overshoot, we overfit badly. In that case, increasing the number of data points might help.

With all this in mind, I just went ahead. Now, what I kind of forgot is how quickly the number of parameters grows in fully connected networks. The issue with the data is that it is very high dimensional - the input size is 4096 and the output size was 180 at this stage. With literally zero hidden layers and just that input and output, we already have 4096 * 180 = 737280 parameters. So anyways, I started with a hidden layer of 2000, 5000, 10000 and at some point ended up with two massive 50000 node layers. I also tested 10 × 10000 layers at some point. Let's do some quick maths:

4096 * 50000 ** 2 * 180 = 1.8432e+15

That's ~1.8 Quadrillion parameters. I was astonished as to why the so called 'avoidable' bias was not going away. And of course, training the models became very slow too. Further, the bias stayed at an even higher level with a higher number of parameters!

Two main takeaways here:

  • This was structural bias (unavoidable)
  • Training 1.8 Quadrillion parameters leads to the bias staying elevated since training is too inhibited. After the same amount of time (when compared to the simple 700k parameter model) we simply haven't learnt anything and thus the train error stays high.

After changing my data I ended up with no structural bias and a bit of avoidable bias. I upped the number of training epochs and got to close to 0 error loss.

Observe below the relationships between different types of errors and model complexity (number of parameters) or number of training epochs.

The data was related to this post and this project. Specifically, I tried to predict the parameters (i.e. the exact graph) that led to a 64 × 64 QUBO matrix of a Maximum Cut problem. At first, I encoded the output as the list of edge tuples (since we had 90 edges pre-defined, that's 180 output nodes). I am not 100% sure where the structural bias is coming from, but my hunch is that it is the ordering of edges. Now, it's not as simple as saying that the edges are stored in a dict and thus they're ordered in a random way - that's not the case since Python 3.7. The edges (=dict) keep insertion order. But what if the insertion itself was random? That is exactly the case here - I am using gnm_random_graph, which generates edges randomly in random order and inserts them randomly. And randomness is structural bias that cannot be learnt by a neural network. So there you have it. I ended up predicting a full adjacency matrix, which gets rid of the structural bias. Works now, nice!
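The dict point is easy to verify in isolation (a tiny sketch with made-up edges):

```python
# Python 3.7+ dicts preserve insertion order, so the ordering of the
# edge list is not randomized by the dict itself - it reflects the
# (here: random) order in which edges were inserted.
edges = {}
for e in [(2, 3), (0, 1), (1, 2)]:  # imagine these arriving in random order
    edges[e] = True

assert list(edges) == [(2, 3), (0, 1), (1, 2)]  # insertion order preserved
```

So if the generator inserts edges in random order, the target ordering the network sees is random, and that randomness is exactly the unlearnable structural bias described above.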

Published on

# Oblivious Transfer Example With Physically Unclonable Function (PUF)

Physically Unclonable Functions (PUFs) are an extremely interesting and hot topic in cryptography. The basic gist is: you have a certain physical object that cannot be cloned due to its extremely intricate properties (e.g. the atomic structure of a DVD), which depend on the manufacturing process. The assumption is that each DVD is unique when looking at its atomic structure. Now, when measuring the reflection of a DVD with a high-precision device of some sort, these differences show up. Imagine you put a chip in a capsule whose inner walls consist of such reflective material. The chip has a device that measures the reflection and based on that generates a key pair (a public and a private one). The public key can be used to communicate with the device and identify it, and if someone were to try to tamper with the device by drilling into it - the reflective material would break, changing the measurement and thus the public key. Lastly, an additional digital signature on the device ensures that anyone can check for tampering (the public key is contained in the digital signature). The most important part is that the device is secret-free. The 'secret' is the material, and there is no way to clone that material to re-use the secret, and there is no way to find out the private key without breaking the whole thing. Pretty cool concept.

PUFs can be used for many cryptographic primitives, one of which is Oblivious Transfer (OT). The point of OT is: A wants to get a certain piece of information from B based on A's choice. B should stay oblivious of the choice of A in the process, however, and A should not be able to get all of the options from B. Pretty specific, I know, but cryptographic primitives have lots of uses. See the paper Founding Cryptography on Oblivious Transfer - Efficiently, which builds secure two-party computation based on OT.

Anyways, enough of introduction. The main contribution of this small post is to show an example calculation of how Oblivious Transfer would work with PUFs. In the figure below you can see how the scheme works. A uses the PUF to create a list of challenge-response pairs (CRPs) and sends the PUF over to B. B chooses two values x0 and x1 and sends those to A. A now chooses one of those and sends back its challenge XOR'd with the chosen value. B measures the response of the PUF for the two challenges v XOR x0 and v XOR x1. The interesting thing here is that depending on the choice of A, one of those challenges ends up being the real challenge C of A (e.g. if A chooses x0, then v XOR x0 = C XOR x0 XOR x0 = C). B sends both choices (s0, s1) back, but XOR'd with the two responses. A picks the correct one based on its choice and, since it knows the response R for C, can compute s0 = y0 XOR R (if continuing with the example of choosing x0).

Now, an example calculation.

s0 = 5
s1 = 6

# A chooses random C and a choice b (can be 0 or 1).
C = 5412
R = 7777
b = 1

# B chooses random x0, x1.
x0 = 44
x1 = 66

# A calculates (remember, A chose 1):
v = C ^ x1 = 5412 ^ 66 = 5478

# B measures the PUF responses for the two candidate challenges.
R0 = PUF(v ^ x0) = PUF(5478 ^ 44) = PUF(5450) = 9999
R1 = PUF(v ^ x1) = PUF(5478 ^ 66) = PUF(5412) = 7777

# B sends back both choices XOR'd with both responses.
y0 = s0 ^ R0
y1 = s1 ^ R1

# A XORs the second reply from B (first one is garbage) and gets the choice value they wanted.
s1 = y1 ^ R = 6

In summary, what we just observed: A lets B compute both options so B doesn't know the chosen bit b. And B computes both options in a way that the option not chosen is garbled, while the option chosen can be deduced by A.
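The example calculation can be scripted directly. A minimal sketch, where `puf()` is a hypothetical stand-in for the physical measurement, hard-coded with the two responses from the example:

```python
# Hypothetical PUF: maps the two challenges from the example to their responses.
def puf(challenge):
    return {5450: 9999, 5412: 7777}[challenge]

s0, s1 = 5, 6            # B's two secrets
C, R, b = 5412, 7777, 1  # A's challenge-response pair and choice bit b
x0, x1 = 44, 66          # B's random values

v = C ^ (x1 if b else x0)          # A hides its real challenge
R0, R1 = puf(v ^ x0), puf(v ^ x1)  # B measures both candidate challenges
y0, y1 = s0 ^ R0, s1 ^ R1          # B garbles both secrets

recovered = (y1 if b else y0) ^ R  # A can ungarble only the chosen secret
assert recovered == s1             # A obtains s1 = 6
```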

Published on

# QUBOs for TSP and Maximum-3SAT

Generating QUBOs for Maximum 3-SAT and the Traveling Salesman Problem (TSP) was a bit tougher than for the other 9 problems that were outlined in https://arxiv.org/pdf/1811.11538.pdf. In this post I would like to give a quick example of how one would do so. For reference, the context of this is this project. To understand this post, it is required to read the arXiv paper above; otherwise I believe the code won't make much sense.

For Maximum 3-SAT, the following clauses are given:

$(x_1 \lor x_2 \lor x_3), (\overline{x_1} \lor x_2 \lor x_3), (x_1 \lor \overline{x_2} \lor x_3), (\overline{x_1} \lor x_2 \lor \overline{x_3})$

In code (c.f. here) it looks like this, where the number is the variable index and the boolean denotes the state of the variable:

problem = Max3SAT(
        ((0, True), (1, True), (2, True)),
        ((0, False), (1, True), (2, True)),
        ((0, True), (1, False), (2, True)),
        ((0, False), (1, True), (2, False)),
)

The QUBO matrix we are looking for looks like this:

    [0.0, -1.0, 1.5, -0.5, 0.5, -0.5, 0.5],
    [-1.0, 1.0, 0.0, -0.5, -0.5, 0.5, -0.5],
    [1.5, 0.0, -1.0, -0.5, -0.5, -0.5, 0.5],
    [-0.5, -0.5, -0.5, 2.0, 0.0, 0.0, 0.0],
    [0.5, -0.5, -0.5, 0.0, 1.0, 0.0, 0.0],
    [-0.5, 0.5, -0.5, 0.0, 0.0, 1.0, 0.0],
    [0.5, -0.5, 0.5, 0.0, 0.0, 0.0, 0.0]

And as a normalized heatmap:

For Maximum 2-SAT, the QUBO formulation is really space-efficient (which is important for current architectures) and scales only with the number of variables (see the arXiv link above). Awesome! For Maximum 3-SAT, unfortunately, we scale with both the number of variables and the number of clauses.

The code to generate a QUBO for Maximum 3-SAT is shown below. The main loop is split into two parts: one that generates the blue part of the formula below and one that generates the green part. To be specific, the formula is written for 3 variables per clause, so it's not general. The green part is simply the pairwise combinations of the clause variables (e.g. if you have 5 variables, there are 10 combinations).

$\sum^{m}_{i=1} \left[ \textcolor{blue}{(1 + w_i)(y_{i1}+y_{i2}+y_{i3})} \textcolor{green}{{} - y_{i1}y_{i2} - y_{i1}y_{i3} - y_{i2}y_{i3} - 2w_i} \right]$

Here's the code:

def gen_qubo_matrix(self):
    n = self.n_vars + len(self.clauses)
    Q = np.zeros((n, n))

    # Clauses is a list of 3-tuples of 2-tuples.
    # Eg: [((0, True), (2, True), (3, False)), ((1, True), (3, False), (4, True))]
    # So we loop through all tuples and find the type of it.
    # There are 4 types: (True, True), (True, False), etc.
    for i, clause in enumerate(self.clauses):
        clause_idx = self.n_vars + i
        Q[clause_idx][clause_idx] += 2

        # (1 + w1) * (x1 + x2 + x3) = x1 + x2 + x3 + w1x1 + w1x2 + w1x3
        # (1 + w1) * ((1 - x1) + x2 + x3) = -x1 + x2 + x3 + w1 -w1x1 +w1x2 +w1x3
        # (1 + w1) * ((1 - x1) + x2 + (1 - x3)) = -x1 + x2 -x3 + 2w1 -w1x1 +w1x2 -w1x3
        for item in clause:
            item_idx = item[0]
            val = item[1]
            if val:
                Q[item_idx][item_idx] -= 1
                Q[clause_idx][item_idx] -= .5
                Q[item_idx][clause_idx] -= .5
            if not val:
                Q[item_idx][item_idx] += 1
                Q[clause_idx][clause_idx] -= 1
                Q[clause_idx][item_idx] += .5
                Q[item_idx][clause_idx] += .5

        # -x1x2 -x1x3 -x2x3
        # -(1-x1)x2 -x1x3 -x2x3 = -1 +x1x2
        # -(1-x1)(1-x2) -x1x3 -x2x3 = -1 -2x1x2 +x1 +x2
        for (item1, item2) in itertools.combinations(clause, 2):
            idx1 = item1[0]
            idx2 = item2[0]
            val1 = item1[1]
            val2 = item2[1]
            if val1 and val2:
                Q[idx1][idx2] += .5
                Q[idx2][idx1] += .5
            if not val1 and val2:
                Q[idx2][idx2] += 1.
                Q[idx1][idx2] -= .5
                Q[idx2][idx1] -= .5
            if val1 and not val2:
                Q[idx1][idx1] += 1.
                Q[idx1][idx2] -= .5
                Q[idx2][idx1] -= .5
            if not val1 and not val2:
                Q[idx1][idx2] += 1.
                Q[idx2][idx1] += 1.
                Q[idx1][idx1] -= 1.
                Q[idx2][idx2] -= 1.

    return Q

As you can see, it's longer than most other QUBO generators in the project. The matrix that is generated also looks very intricate. A small note for newcomers to QUBOs: whenever there is a product of two different variables, e.g. $x_1x_2$, one splits the coefficient in two across the two off-diagonal entries. This is why you see these .5 additions and subtractions. A lone variable $x_3$ would be on the diagonal at Q[3][3] and leads to no such split. What about $2x_1x_2$? Well, we end up with $\pm 1$ on each side.
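To make the splitting rule concrete, here is a pure-Python check that encoding $2x_1x_2 + x_3$ this way reproduces the polynomial via $x^T Q x$:

```python
# Q for 2*x1*x2 + x3: the off-diagonal coefficient 2 is split into 1 + 1,
# the lone x3 goes on the diagonal.
Q = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

def energy(x):
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

assert energy([1, 1, 0]) == 2  # 2*x1*x2
assert energy([0, 0, 1]) == 1  # x3
assert energy([1, 1, 1]) == 3  # both terms
```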

As for TSP, the problem is defined by a distance matrix between the nodes of the graph and a constraint c. For instance, below we see three nodes with distances normalized to the range 0 to 1.

dist_matrix = [
    [0, 1/3., 2/3.],
    [1/3., 0, 1/3.],
    [1, 1/3., 0]
]

The resulting QUBO matrix looks like this (note: the constraint is set to 400):

    [-800.00, 800.00, 800.00, 800.00, 0.00, 0.00, 800.00, 3.33, 10.00],
    [800.00, -800.00, 800.00, 0.00, 800.00, 0.00, 3.33, 800.00, 3.33],
    [800.00, 800.00, -800.00, 0.00, 0.00, 800.00, 6.67, 3.33, 800.00],
    [800.00, 3.33, 10.00, -800.00, 800.00, 800.00, 800.00, 0.00, 0.00],
    [3.33, 800.00, 3.33, 800.00, -800.00, 800.00, 0.00, 800.00, 0.00],
    [6.67, 3.33, 800.00, 800.00, 800.00, -800.00, 0.00, 0.00, 800.00],
    [800.00, 0.00, 0.00, 800.00, 3.33, 10.00, -800.00, 800.00, 800.00],
    [0.00, 800.00, 0.00, 3.33, 800.00, 3.33, 800.00, -800.00, 800.00],
    [0.00, 0.00, 800.00, 6.67, 3.33, 800.00, 800.00, 800.00, -800.00]

And as a normalized heatmap:

TSP is unfortunately rather inefficient to encode. For $n$ nodes, we have an $n \times n$ distance matrix and an $n^2 \times n^2$ QUBO matrix! Below is the code for generating a TSP QUBO (c.f. here for more context):

def gen_qubo_matrix(self):
    n = len(self.dist_matrix)
    Q = np.zeros((n ** 2, n ** 2))

    quadrants_y = list(range(0, n ** 2, n))
    quadrants_x = quadrants_y[1:] + [quadrants_y[0]]

    # The diagonal positive constraints
    for start_x in quadrants_y:
        for start_y in quadrants_y:
            for i in range(n):
                Q[start_x + i][start_y + i] = 2 * self.constraint

    # The distance matrices
    for (start_x, start_y) in zip(quadrants_x, quadrants_y):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                Q[start_x + i][start_y + j] = self.P * self.dist_matrix[j][i]
            Q[start_x + i][start_y + i] = 2 * self.constraint

    # The middle diagonal negative constraints
    for start_x in quadrants_x:
        for i in range(n):
            Q[start_x + i][start_x + i] = -2 * self.constraint
            for j in range(n):
                if i != j:
                    Q[start_x + i][start_x + j] += 2 * self.constraint

    return Q

Note: P is set to 10.

This wraps up the post. I've shown an example problem instance for Maximum 3-SAT and one for TSP and code on how to generate QUBO matrices for each.

Published on

# PPO - a Note on Policy Entropy in Continuous Action Spaces

I've always wondered what policy entropy really means in the context of PPO. From other posts (e.g. here by AurelianTactics - btw, very cool guy, saw him on the official Reddit RL Discord) I know that it should continuously go down. It also corresponds to the exploration the PPO agent is doing --- high entropy means high exploration.

In this case I will be looking at continuous action spaces. Here, each action in the action space is represented by a gaussian distribution. The policy is then called a gaussian policy. In my case, I have two continuous actions (for more information, check out the post Emergent Behaviour in Multi Agent Reinforcement Learning - Independent PPO).

Now, the entropy of a Gaussian distribution is defined as follows: $\frac{1}{2} \log(2\pi e \sigma^2)$. What baselines PPO does is to simply sum the entropies across the actions axis (e.g. given a batch of 700 actions s.t. the shape is (700, 2), this becomes (700,)) and then take the mean over the batch. Their inner code is equal to the definition above since $\log(ab) = \log(a) + \log(b)$. For reference, this is using the natural log.

This is a single entropy calculation:

tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)

And this is taking the mean over the whole batch:

entropy = tf.reduce_mean(pd.entropy())

The right side (.5 * np.log(2.0 * np.pi * np.e)) ends up being roughly 1.41. You can now deduce, given an entropy reading, how uncertain (and thus exploratory) each action is. High entropy equals high exploration and vice versa. To quantify the above, check out the figure below:

You can see that at roughly a standard deviation of 0.242 (i.e. a variance of about 0.059) the entropy is 0.
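This can be double-checked in a few lines by just evaluating the entropy formula from above - a small sketch:

```python
import math

def gaussian_entropy(std):
    # Entropy of a 1-D Gaussian: 0.5 * log(2*pi*e*std^2) = log(std) + 1.419.
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)

print(gaussian_entropy(1.0))    # the constant term, ~1.419
print(gaussian_entropy(0.242))  # ~0, the zero-entropy point
```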

This is how policy entropy looks for me when training two predator PPOs to catch prey:

As you can see, exploration decreases continuously. It starts with variance = 1 for both actions and ends up at variance = 0.3 after training. (The entropy starts at 2.82 = 1.41 * 2, which is just the entropy of the 2 actions summed up, given both have variance 1.) Nice!

Update: Some notes regarding why we take the mean.

One could argue that the logstd does not change over the 700 batch, since the policy did not change. And that is true, we actually get 700 times the same entropy number. The reason why every PPO implementation takes the mean here is two-fold.

First of all, check out the loss PPO optimizes (see paper):

L_t^{CLIP+VF+H}(\theta) = \mathbb{E}_t[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + \textcolor{blue}{c_2 H[\pi_{\theta}](s_t)}]

The blue part is the entropy, and observe how it sits inside the expectation \mathbb{E}, for which you usually take the mean.

Secondly, autograd. One optimization idea would be to just take the first entropy number and throw away the others (so an indexing operation instead of the mean), but that would not play well with autograd, which uses the gradients to update all parameters (including the logstd parameter and thus your policy itself) in the backwards pass. Calculating the gradient of a mean is easy, but the gradient of an indexing operation? The key word here is 'differentiable indexing' and it is possible, but you only update part of your parameters (the part you indexed). In this case my hunch is that you would end up updating your policy only once instead of 700 times, which defeats the purpose of a batch of 700. At that point you could just do singular updates and no batching.

Published on

# Burry is sharing Heavy Metal again - his TSLA short must be bleeding

This is so hilarious I just gotta save this for the books. Michael Burry, the famous main character from The Big Short who predicted the housing market crash and made a nice 100M for himself (total net worth back then 300M), is short on TSLA. And he has started sharing heavy metal on his Twitter, which is interesting - this is portrayed in the movie as something he uses to get rid of the stress of BLEEDING CASH EVERY MONTH lol. The position must be pretty hard on his nerves. On the other hand, he just made some very good cash on his GME play.

For the record, I totally agree with him. There are quite a few indicators that the market is devolving into a bubble. Wallstreetbets for instance has gained over 6M subscribers over the GME fiasco (which I missed, but I made some money shorting it). It had 2M subscribers before, and has 8.8M now. I have personally heard of some laymen in my circle of friends that they are becoming interested in the stock market. So yea, interesting times. And then there is Bitcoin and TSLA..

So let's see how 2021 plays out. I now know that my post There is no recession coming was technically right. We basically did not have a recession in 2020, that was like a small shake-out! So maybe I will look back at this post some time in the future and agree with my past self.

While writing this post I was listening to the Heavy Metal Michael Burry shared - my god that is some good shit. I might just take a dive into that genre off of his recommendation.

Here you go: link

Update: Just found a satirical comment I posted literally at the bottom of the crash in March 2020. Everyone on WSB was balls deep in puts. Fucking hilarious.

Published on

# Data Preparation is Everything - Parsing The German Commercial Register (Handelsregister)

There are three major issues with the commercial register (Handelsregister) data. First of all, every company has its own rules, e.g. in terms of appointments: who is allowed to be appointed, when and why and with whose permission.. Thousands of combinations. Secondly, our clerks do not seem to follow the format 100%. And they sometimes make spelling mistakes. But the most glaring one is: each register entry is ONE line of TEXT. And there seems to be no official delimiter. The clerks all kind of go with a few loose rules but yea - it is definitely not defined by an RFC or something lol. So what does this leave us with? LOTS of Regex and special rules, and it is pretty painful. But any serious ML practitioner knows this: most of the work lies in data preparation.

For instance, I want you to look at this regex (which is just a small part of the data preparation pipeline):

(Die |Ist).*(ein|zwei)?.*(Gesellschaft|Vereinigung|Genossenschaft|vertritt|Vorstand|Geschäftsführer).*(ein|zwei)?.*
(vertreten|Prokuristen|vertreten.*gemeinsam|vertreten.*gemeinschaftlich|einzelvertretungsberechtigt|Vertretungsbefugnis von der Generalversammlung bestimmt)

It might not be perfect, but it gets the job done. So anyways, enough of the data preparation pipeline, how about a first result. While doing some descriptive statistics (data exploration), the first thing that interested me was the number of registrations per year over time. Check it out:

In 2015 we had a near recession with oil dropping and the Chinese market crashing, so it makes sense that fewer businesses were registered in bad times. But what is extremely surprising - in 2020 we had slightly more registrations than in 2019! What the hell? Seems like the 2020 global recession was kind of a fake recession for a certain chunk of the population.

Another thing that is interesting, but not plotted here (since it would look boring) - month over month the number of registrations is roughly the same. So there is no seasonality here. Quite interesting!

Just a note regarding the 2007 data (which is roughly 130 registrations) - the available data starts at the very end of 2007. Thus it looks like there are nearly no registrations in 2007, but in reality it's not a full year.

I will be working with this data some more and I will update this post (or create a new one). Specifically I am looking at doing Latent Semantic Analysis (LSA) and using the gensim library. Stay tuned!

Update: I have done some more analysis, specifically LSA. Note: The next few paragraphs will contain a lot of German words.

An important task is getting rid of small and common words such as 'der' (engl.: the), else one would throw off (dis)similarity measures. So for me this currently looks like that:

und oder der die das von zur nur so einer einem ein allein darum dazu für dafür
zum zu dort da grundkapital stammkapital eur dem usd gbp im mit an dem bei art
rahmen vor allem aller sowie aus in den als

These are often called 'stopwords', and we simply filter them out, which improves the LSA model considerably - we get more confident topic assignments.

Notice words like 'Stammkapital' (base capital) - they may look like they are important to the text, but if literally every text has Stammkapital included, it does not really give us any more information.
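In code, the filtering is just a set lookup before building the corpus. A minimal sketch (stopword list abbreviated, tokenization simplified):

```python
# Abbreviated stopword list from above.
STOPWORDS = {'und', 'oder', 'der', 'die', 'das', 'von', 'dem', 'stammkapital', 'eur'}

def tokenize(text):
    # Lowercase and keep purely alphabetic tokens.
    return [tok for tok in text.lower().split() if tok.isalpha()]

def filter_stopwords(tokens):
    return [tok for tok in tokens if tok not in STOPWORDS]

doc = 'Gegenstand der Gesellschaft ist das Halten von Beteiligungen'
print(filter_stopwords(tokenize(doc)))
# ['gegenstand', 'gesellschaft', 'ist', 'halten', 'beteiligungen']
```

The filtered token lists then go into the usual dictionary/corpus building step of the LSA pipeline.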

A seemingly important hyperparameter (similar to the number of clusters in clustering) is the number of topics. I am currently experimenting a bit here. Too small a number of topics seems to make most topics include stuff like 'Halten' (holding), which is a very important word since there are a lot of holding companies. Too many topics yields duplicates as follows:

66 ['it', 'consulting', 'management', 'marketing', 'insbesondere', 'bereichen', 'it-dienstleistungen', 'gegenstand', 'vermarktung', 'zubehör']
67 ['marketing', 'insbesondere', 'dienstleistungen', 'it', 'in-', 'gegenstand', 'ausland', 'consulting', 'it-beratung', 'deren']

And then there are weird combinations such as the following. IT services and construction?!

81 ['an-', 'it-dienstleistungen', 'verbundenen', 'allen', 'trockenbau', 'alle', 'geschäften', 'nebst', 'damit', 'kauf']

Anyways, I will keep updating this post with new findings.

Published on

# Hadamard Gate Transformation for 3 or more QuBits

While trying to understand Grover's Algorithm, I started wondering how exactly the Hadamard gate transformation looks for not just one QuBit or even two, but for three or more. Ekert et al. in their paper Basic concepts in quantum computation show with Equation 12 a very nice general formula for the Hadamard gate transformation:

H^{\bigotimes n}|x\rangle = \frac{\sum_z (-1)^{x \cdot z}|z\rangle}{\sqrt{2^n}},

where \cdot is the bitwise dot product. The sum over z runs over all possible basis states (we're in a superposition!). For an n-length state there are 2^n possible states, so for 3 there are 8 states.

Assuming I'd like to apply Grover's algorithm to 3 QuBits as shown on the Qiskit page on Grover's algorithm, there is a point at which Hadamard is applied to 3 QuBits on a non-trivial state (not just |000\rangle). This is where the formula above comes in. Let's first calculate it through for |000\rangle. Below I calculate the dot products x \cdot z.

\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = 0*0 + 0*0 + 0*0 = 0

\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = 0*0 + 0*0 + 0*1 = 0 \\ \dots

As you can see, they all amount to 0 here. Now, the general equation from above becomes trivial (note: n=3 and x=|000\rangle):

H^{\bigotimes 3}|000\rangle = \frac{\sum_z (-1)^0|z\rangle}{\sqrt{2^3}}

H^{\bigotimes 3}|000\rangle = \frac{\sum_z 1|z\rangle}{\sqrt{8}}

H^{\bigotimes 3}|000\rangle = \frac{1}{\sqrt{8}} (|000\rangle + |001\rangle + |010\rangle + |011\rangle + |100\rangle + |101\rangle + |110\rangle + |111\rangle)

In the last step the sum was evaluated over its 8 possible states. This probably looks very familiar now! Next, let's try this for a non-trivial state, such as |101\rangle. Again, at first I calculate the dot products x \cdot z.

\begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = 1*0 + 0*0 + 1*0 = 0

\begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = 1*0 + 0*0 + 1*1 = 1 \\ \dots

I've again only shown the first two. But there is an important difference here already: the result of the dot product for states x=|101\rangle and z=|001\rangle is 1! And this is where the negative signs come from when evaluating the whole sum, since (-1)^1 = -1.

H^{\bigotimes 3}|101\rangle = \frac{1}{\sqrt{8}} (|000\rangle - |001\rangle + |010\rangle - |011\rangle - |100\rangle + |101\rangle - |110\rangle + |111\rangle)
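The sign pattern can also be verified in plain Python without Qiskit - a small sketch of the formula from above, where the bitwise dot product is just a popcount of x & z:

```python
import math

def hadamard_amplitudes(x, n):
    # Amplitude of |z> in H applied n times to |x> is (-1)^(x.z) / sqrt(2^n).
    amps = []
    for z in range(2 ** n):
        dot = bin(x & z).count('1')  # bitwise dot product x.z
        amps.append((-1) ** dot / math.sqrt(2 ** n))
    return amps

signs = ['+' if a > 0 else '-' for a in hadamard_amplitudes(0b101, 3)]
print(signs)  # ['+', '-', '+', '-', '-', '+', '-', '+']
```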

If you want to calculate this by hand and compare, go right ahead. To make sure your calculations are correct, here is a short code snippet using Qiskit that does basically all the calculations for you. To customize it, change the index in circuit.initialize(). Just imagine the state from before as a binary number (101_2 = 5_{10}), which means we were looking at 5. Now set the number at index 5 to 1 in the list [0, 0, 0, 0, 0, 0, 0, 0].

import qiskit as qk

q = qk.QuantumRegister(3)
c = qk.ClassicalRegister(3)
circuit = qk.QuantumCircuit(q, c)

# We want 101, so the fifth. This is 0-indexed. So the sixth place:
circuit.initialize([0, 0, 0, 0, 0, 1, 0, 0], [q[0], q[1], q[2]])

simulator = qk.BasicAer.get_backend('statevector_simulator')
job = qk.execute(circuit, simulator)
ket = job.result().get_statevector()
for amplitude in ket:
    print(amplitude)

This will print the amplitudes, each with magnitude \frac{1}{\sqrt{8}} \approx 0.3535. The negative signs are in the exact same places as in the result of our manual calculation above.


Coming back to the example shown on Qiskit's page, in step 3.1 they apply Hadamard on the following state:

|\phi_{2}\rangle = \frac{1}{\sqrt{8}} (|000\rangle + |001\rangle + |010\rangle + |011\rangle + |100\rangle - |101\rangle - |110\rangle + |111\rangle)

This is the point of this blog post: To calculate the resulting state (after the Hadamard gate transformation), you would have to calculate H(|z\rangle) for each of the sub-states above (so for |000\rangle, |001\rangle, etc.) and then sum them up1. Some of them would then cancel each other out. This is why in their example, after step 3.1, the state looks so short:

|\phi_{3a}\rangle = \frac{1}{2} (|000\rangle + |011\rangle + |100\rangle - |111\rangle)

Anyways, this is it for this post. I hope I could give some more intuition into the Hadamard gate transformation with more than just 1 or 2 QuBits. Maybe the script above will also help you in your endeavours.

Update: I've created a script to prove that the states indeed cancel each other out. Find it here. Observe in the output at the very end that only 4 states have a non-zero amplitude. Those are exactly the four states that we end up with!

Also, for the interested reader here are a few related papers that may help in understanding:

  1. I got that idea from here. Notice how the Hadamard gate transformation evolves a complex state: H(\frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle) = \frac{1}{\sqrt{2}}(\frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle) + \frac{1}{\sqrt{2}}(\frac{1}{\sqrt{2}}|0\rangle - \frac{1}{\sqrt{2}}|1\rangle) = \frac{1}{2}(|0\rangle + |1\rangle) + \frac{1}{2}(|0\rangle - |1\rangle) = |0\rangle

Published on

# Parsing the USP search path BNF using pyparsing

Let's parse the search path grammar! The USP (User Services Platform) committee has given us a BNF in their pdf that describes the object path model (including search path) using a context-free language.

This is what the BNF of path objects looks like:

idmpath ::= objpath | parampath | cmdpath | evntpath | searchpath
objpath ::= name '.' (name (('.' inst)|(reffollow '.' name) )? '.')*
parampath ::= objpath name
cmdpath ::= objpath name '()'
evntpath ::= objpath name '!'
inst ::= posnum | expr | '*'
expr ::= '[' (exprcomp ( '&&' exprcomp )*) ']'
exprcomp ::= relpath oper value
relpath ::= name (reffollow? '.' name )*
reffollow ::= ( '#' (posnum | '*') '+' )| '+'
oper ::= '==' | '!=' | '<' | '>' | '<=' | '>='
value ::= literal | number
name ::= [A-Za-z_] [A-Za-z_0-9]*
literal ::= '"' [^"]* '"'
posnum ::= [1-9] [0-9]*
number ::= '0' | ( '-'? posnum )

I've interpreted a searchpath based on the HTML specification, which unfortunately is not defined above, as follows:

searchpath ::= objpath name?

The pyparsing PEG language looks like this:

import pyparsing
from pyparsing import (
    Char, CharsNotIn, Literal, Optional, Suppress, Word, ZeroOrMore)

# Helpers.
alphas = pyparsing.alphas.upper() + pyparsing.alphas.lower()
dot = Suppress(Char('.'))

# Basic stuff such as names and operators.
name = Word(alphas + '_', alphas + pyparsing.nums + '_')
literal = Char('"') + CharsNotIn('"') + Char('"')
posnum = Word(pyparsing.nums[1:], pyparsing.nums)
number = Char('0') | (Optional(Char('-')) + posnum)
value = literal | number
# '<=' and '>=' must come before '<' and '>' - the choice is ordered.
oper = Literal('==') | Literal('!=') | Literal('<=') | Literal('>=') | Literal('<') | Literal('>')

# The real stuff - references, expressions, paths.
reffollow = (Char('#') + (posnum | Char('*')) + Char('+')) | Char('+')
relpath = name + ZeroOrMore(Optional(reffollow) + dot + name)
exprcomp = relpath + oper + value
expr = Char('[') + exprcomp + ZeroOrMore(Literal('&&') + exprcomp) + Char(']')
inst = posnum | expr | Char('*')
objpath = name + dot + ZeroOrMore(name + Optional((dot + inst) | (reffollow + dot + name)) + dot)
parampath = objpath + name

searchpath = objpath + Optional(name)

Note: PEGs are not equivalent to context-free grammars. Their choice operator is ordered, which makes them stricter in terms of ambiguity. E.g. the following rule cannot be parsed by a PEG:

S ::= 'x' S 'x' | 'x'

So why not simply do a Regex? BNFs describe context-free languages, which often cannot be implemented using a regular language. Specifically, a context-free grammar contains only context-free rules (i.e. the left side of each rule in the BNF further above has only one variable), and they can have recursion on both sides. Regular languages do not allow that and are even more strict - each rule may only end in a terminal (e.g. a character) or a terminal combined with a variable (given terminal a and variable B: aB or Ba; in the above example it could be '*' posnum or posnum '*'). For instance, the rule for exprcomp as it stands is not allowed in regular languages, since it contains three variables. Now, these rules can still possibly be converted into a regular language by enumerating all possible combinations of a variable and a terminal. E.g. oper consists of 6 possible values, so why not add 6 rules a → '==' value, a → '>=' value, and so on. However, this is a lot of work. Let's just stay context-free and use pyparsing!
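To make the enumeration idea concrete, here is what a regular (regex) version of a simplified exprcomp could look like - a sketch only, ignoring reffollow in relpath:

```python
import re

NAME = r'[A-Za-z_][A-Za-z_0-9]*'
# Enumerate the six operators directly; longest alternatives first,
# so '<=' is tried before '<'.
OPER = r'(?:==|!=|<=|>=|<|>)'
NUMBER = r'(?:0|-?[1-9][0-9]*)'
LITERAL = r'"[^"]*"'
# Simplified relpath (no reffollow): name ('.' name)*
EXPRCOMP = re.compile(rf'{NAME}(?:\.{NAME})*{OPER}(?:{LITERAL}|{NUMBER})$')

print(bool(EXPRCOMP.match('Enable!=0')))     # True
print(bool(EXPRCOMP.match('Status=="Up"')))  # True
```

This works for the flat rules, but the point above stands: once rules recurse (like expr inside inst inside objpath), a plain regex is no longer the right tool.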

There are lemmas to prove that a language is not regular, such as the Pumping Lemma. However, I haven't been able to find a contradiction (i.e. every word in the above language can be pumped). This is not a proof of regularity, but I have a gut feeling the above language is in fact regular. One way to actually prove it would be to construct a finite automaton, i.e. a DFA or NFA, from the BNF. Another way would be to try to convert every single rule into a left-/right-regular rule. If that succeeds, it is indeed a regular language.

Published on

# Flap List - Cable Modems with intermittent connectivity problems

This post is a summary of the flap list feature. It is not necessarily just a Cisco-only feature, even though I will be focusing on Cisco here. There are other vendors such as Casa, Arris, Huawei.. The list goes on.

A flap list keeps track of CMs with connectivity problems (they're flapping back and forth). The flap list can help in deciding between whether there is a problem with the given CM, or whether there is a problem with the upstream or downstream cable.

There are three classes of problems:

  • Reinsertions: A CM re-registers more often than a specified insertion interval allows. Too many reinsertions may indicate problems in the downstream cable or that the CM is provisioned wrongly.
  • Hits and misses: A hit occurs when a CM successfully responds to the MAC-layer keepalive messages that the CMTS sends out. A miss occurs when the CM does not respond within a timeout. Too many misses followed by hits may indicate problems in the up/downstream cable.
  • Power adjustments: A CM can adjust their upstream transmission power up to a maximum power level. Too many adjustments may indicate a problem with an amplifier in the upstream direction.

Use cases:

  • If a customer reports a problem, but the flap list does not contain the customer's modem, problems with the cable can be ruled out. The issue is then most likely local.
  • CMs with more than 50 power adjustments a day have a potential issue in their upstream path.
  • CMs with roughly the same number of hits and misses and with a lot of insertions have a potential issue in their downstream path.
  • All CMs incrementing their insertion numbers at the same time indicates a problem with the provisioning servers.
  • CMs with high CRC errors have bad upstream paths.
  • Correlating CMs on the same upstream port with similar flap list statistics (same number of hits/misses/insertions) may show cable or node-wide issues.

References: Cisco Flap List pdf (use cases are from this pdf)


| KPI (per CM) | Description |
| --- | --- |
| ccsFlapInsertionFailNum | If a CM registered more than once in a certain period (default: 90s), the first registration is considered failed, which increments the insertion fail KPI. |
| ccsFlapHitsNum | The CMTS sends a request every 10 secs; a successful response from the CM within 25ms increases flap hits by one. |
| ccsFlapMissesNum | If the response is completely missing or takes more than 25ms, flap misses increases by one. |
| ccsFlapPowerAdjustmentNum | If upstream power is adjusted by more than X dB (default: 1 dB, but they say it often should be more like 6 dB), this KPI is increased. |
| ccsFlapCreateTime | Time when this modem was added to the flap list. After the max age (default: 7 days) they get removed again. |

If any of the main KPIs (insertion fail, hit/miss, power adjustment) is significantly higher than the others, this is a very important signal.

If hit/miss is highest → the modem keeps going up and down.

If power adjustments are highest → "improper transmit power level setting at the modem end" (personal story: I used to have that with M-Net. The internet kept going down completely every few days. They reduced the power level (which capped our downstream MBit at a lower level..), and everything was fine.)

The point is that these KPIs tell a lot.
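As a toy illustration of that triage logic (hypothetical function and values, not from the Cisco docs):

```python
def diagnose(insertion_fails, misses, power_adjustments):
    # Hypothetical triage: whichever flap-list KPI dominates hints at the issue.
    kpis = {
        'provisioning or downstream issue (reinsertions)': insertion_fails,
        'modem keeps going up and down (hit/miss)': misses,
        'improper transmit power level (power adjustments)': power_adjustments,
    }
    return max(kpis, key=kpis.get)

print(diagnose(insertion_fails=3, misses=1650, power_adjustments=12))
# modem keeps going up and down (hit/miss)
```

In practice one would of course normalize these counters by their collection period and compare against per-KPI thresholds instead of each other.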

Below is an illustration of misses increasing live on some cable modem. The snmpget was executed over the course of an hour.

snmpget -v2c -c 'REDACTED' IP iso.
iso. = Gauge32: 1580

snmpget -v2c -c 'REDACTED' IP iso.
iso. = Gauge32: 1634

snmpget -v2c -c 'REDACTED' IP iso.
iso. = Gauge32: 1650

It should be noted that there are some differences between vendors in terms of their Flap List support.

| Vendor | reinsertions | hits | misses | power adjustments | last flap time | creation time |
| --- | --- | --- | --- | --- | --- | --- |
| Cisco | Yes | Yes | Yes | Yes | Yes | Yes |
| Harmonic | Yes | Yes | Yes | Yes | Yes | No |
| Casa | Yes | Yes | Yes | Yes | Yes | No |
| Arris | Yes | No | No | No | Yes | No |

The latest Arris Flap List MIB can be found here. The OID to look for is called cadIfCmtsCmStatusInsertionFlaps ( Unfortunately, hits, misses and power adjustments do not seem to be supported. The Cisco Flap List MIB can be found here. The Casa Flap List MIB can be found here. And the Harmonic (CableOS) Flap List MIB can be found here. One OID to look for here is hrmFlapInsertionFails (

Published on

# Emergent Behaviour in Multi Agent Reinforcement Learning - Independent PPO

I've been trying to get a bunch of sharks to cooperate in hunting using independent learning. And it has been quite a blast. Specifically, I've adapted the baselines repo to add support for multiple agents (MARL) and some other things to PPO: https://github.com/Instance-contrib/baselines/tree/tf2-instance.

Now, the environment consists of fishes (green circles) and sharks (orange circles):

Fishes use a pretty good static algorithm to evade sharks, while sharks are trained independently using PPO. The environment is a POMDP - the observation space only includes the nearest few fishes and sharks. The reward per eaten fish is 10. If there are no sharks in sight and the time is right, fishes will also procreate. Thus, with smart ecosystem management the sharks could eat infinitely.

A few new additions were made: Not only are there two or three sharks instead of just one, there is also starvation and thus pressure to eat. There is also a stun move that enables a shark to stun another for a few seconds, and a killzone: If two sharks are in that zone when one of them kills a fish, the rewards are split (so each shark gets 5). The question here is: what do the sharks learn to do? Will they completely greedily swim around, stun each other and steal fishes until none are left? Or will they learn ecosystem management together, i.e. leave the fish population intact such that they can eat for a long time? Will they learn to hunt together in some settings?
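The killzone rule could be sketched like this (hypothetical names; the actual environment code lives in the repo):

```python
def killzone_rewards(sharks_in_zone, catcher, reward=10.0):
    # If other sharks are inside the killzone when a fish is killed,
    # the reward is split evenly (two sharks -> 5 each).
    if len(sharks_in_zone) >= 2:
        share = reward / len(sharks_in_zone)
        return {shark: share for shark in sharks_in_zone}
    return {catcher: reward}

print(killzone_rewards({'shark_a', 'shark_b'}, 'shark_a'))
print(killzone_rewards({'shark_a'}, 'shark_a'))
```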

The chart below shows that indeed ecosystem management, or what I call herding, is possible.

This is the left-over fish population at the end of an episode, over all training steps. One can see that while at the beginning the sharks aren't able to hunt well yet (section a), in section b they learn to greedily eat all fishes (and become quite proficient at that). What is interesting is that in section c they learn to keep 2 fishes around. And that is with two independent neural networks (PPOs)! Quite awesome.

The second awesome emergent behaviour is cooperation. One way of creating that is reward shaping, but I did not go down this path. Instead, I forced cooperation by making it much tougher to hunt. The sharks are now slower by half, which makes it pretty much impossible for one to catch a fish on its own. The next chart shows how two sharks learn to hunt a fish from two sides. This is completely learned behaviour, in spite of the split killzone reward and the stun move!

There is a lot more analysis to be done here. For instance, the exact parameters that induce cooperation can be quantified in much more detail. See the chart below - there are 4 dimensions explored here, 3 of which have a significant influence on the average cooperation rate (over 20 models, with each model evaluated 20 times). First, a trivial influence is the radius in which cooperation is registered. Increasing it increases the cooperation rate. More interesting are the number of initial fishes in the population and the speed of the sharks. Especially setting the speed of the sharks to 0.03, which effectively halves it, increases the rate to a point where they only cooperate. Having a lower number of fishes initially makes the aquarium less crowded and thus catching a fish tougher; this increases the cooperation rate.

There is a lot more to write here, but for now I will have to leave it at that. The repository can be found here. It also contains a summary of the results and a few nice animations.

Published on

# Zoho Mail API example in Python Flask

Since I could not find an example out there, here's a comprehensive and complete example of using the Zoho Mail API with Python Flask.

Some pointers: It was very important to make sure that the top level domains of all API URLs match. In my example, they're all dot eu. How it works is as follows: The application prints a register link that needs to be clicked only once per run. It gets an initial access token and a refresh token. A second thread uses the refresh token every hour to get a new valid access token (since they're only valid for one hour). There is also a Flask route to actually send a HTML generated E-Mail for testing.

Now, a possible improvement could be saving the refresh token to the filesystem so it persists across restarts. But that is an exercise for the developer.

The project can be found here: https://github.com/instance01/zoho-mail-api-example.

Excerpt with the main functionality:

import json
import time
import requests
import threading

from flask import Flask
from flask import request
from flask import render_template

app = Flask(__name__)

# Configure this.
FROM_EMAIL_ADDR = 'mail@yourdomain.de'
TO_EMAIL_ADDR = 'mail@otherdomain.de'
BASE_OAUTH_API_URL = 'https://accounts.zoho.eu/'
BASE_API_URL = 'https://mail.zoho.eu/api/'

    "access_token": "",
    "refresh_token": "",
    "api_domain": "https://www.zohoapis.eu",
    "token_type": "Bearer",
    "expires_in": 3600,
    "account_id": ""

def req_zoho():
    url = (
    print('CLICK THE LINK:')
    print('This only has to be done once.')

def get_access_token(code):
    state = request.args.get('state')
    url = '%soauth/v2/token' % BASE_OAUTH_API_URL
    data = {
        'code': code,
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'redirect_uri': REDIRECT_URL,
        'scope': 'ZohoMail.messages.CREATE,ZohoMail.accounts.READ',
        'grant_type': 'authorization_code',
        'state': state
    }
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    r = requests.post(url, data=data, headers=headers)
    data = json.loads(r.text)
    ZOHO_DATA['access_token'] = data['access_token']

def get_account_id():
    url = BASE_API_URL + 'accounts'
    headers = {
        'Authorization': 'Zoho-oauthtoken ' + ZOHO_DATA['access_token']
    }
    r = requests.get(url, headers=headers)
    data = json.loads(r.text)
    ZOHO_DATA['account_id'] = data['data'][0]['accountId']

def send_mail(body, email_address):
    url = BASE_API_URL + 'accounts/%s/messages'
    url = url % ZOHO_DATA['account_id']
    data = {
       "fromAddress": FROM_EMAIL_ADDR,
       "toAddress": email_address,
       "ccAddress": "",
       "bccAddress": "",
       "subject": "Test E-Mail",
       "content": body,
       "askReceipt": "no"
    headers = {
        'Authorization': 'Zoho-oauthtoken ' + ZOHO_DATA['access_token']
    r = requests.post(url, headers=headers, json=data)

def refresh_auth():
    # Update the access token every 50 minutes using the refresh token.
    # The access token is valid for exactly 1 hour.
    while True:
        url = (
        ) % (BASE_OAUTH_API_URL, ZOHO_DATA['refresh_token'], CLIENT_ID, CLIENT_SECRET)
        r = requests.post(url)
        data = json.loads(r.text)
        if 'access_token' in data:
            ZOHO_DATA['access_token'] = data['access_token']
            print('refreshed', ZOHO_DATA)
            time.sleep(3000)  # 50 minutes
        else:
            time.sleep(60)  # Retry after 1 minute

@app.route('/callback/', methods=['GET', 'POST'])
def zoho_callback_route():
    code = request.args.get('code', None)
    if code is not None:
        get_access_token(code)
        get_account_id()
    return 'OK', 200

@app.route('/sendmail/', methods=['GET', 'POST'])
def send_mail_route():
    # Send a HTML email!
    data = ['1', '2', '3']
    mail = render_template('mail_template.j2', data=data)
    send_mail(mail, TO_EMAIL_ADDR)
    return 'OK', 200

def main():
    req_zoho()
    t = threading.Thread(target=refresh_auth)
    t.start()
    app.run()

if __name__ == '__main__':
    main()
Published on

# box2d - making b2Body clonable or copyable

I've been porting LunarLander from Openai/gym (Python) to C++. Since I'm using MCTS and related planners, the environment needs to be copyable. Well, box2d does not have copy-constructors, so at first glance we're screwed. Fortunately the environment is rather simple - it has three dynamic bodies (lander and the two legs), two joints and a static ground, the moon. It turns out, with a minor modification to box2d itself and a small hack that fixes sleeping it is possible to copy the whole LunarLander world.

Now, to clone the world and its bodies, I use the reset() method of the environment, i.e. we start with a clean initial world where the lander is still at its initial position. The shape of the moon can be copied easily, since it is generated from a random float vector. Simply copy the vector and generate the exact same moon. Finally, each body is cloned with the following function:

void LunarLanderEnv::clone_body(b2Body* current, b2Body* other) {
  auto transform = other->GetTransform();
  auto vel = other->GetLinearVelocity();
  current->SetTransform(transform.p, transform.q.GetAngle());
  current->SetLinearVelocity(vel);
  current->SetAngularVelocity(other->GetAngularVelocity());
  // SetLinearVelocity resets the sleep time, so copy it last (see below).
  current->SetSleepTime(other->GetSleepTime());
}

Basically, all properties like position (transform), angle and velocity are copied. These are all public functions of b2Body except for GetSleepTime and SetSleepTime, which I had to add myself. The change to box2d itself can be found here.

The gist is - when setting the velocity, sleep time is discarded (set to 0.0f). That is very bad, since the environment ends when the lander is asleep (IsAwake() returns false, i.e. sleep time exceeds some threshold). Thus, sleep time needs to be copied too.

inline float b2Body::GetSleepTime() const
{
  return m_sleepTime;
}

inline void b2Body::SetSleepTime(float newSleepTime)
{
  m_sleepTime = newSleepTime;
}

It's unfortunate that the sleep time is not public, but it makes sense - we are now going rather deep into box2d. They do not support cloning/copying bodies, it is what it is. But for my purposes this works.

Lastly, the hack I unfortunately had to introduce (and I hope at some point I can get rid of it), is as follows.

clone_body(leg1, other.leg1);
clone_body(leg2, other.leg2);
clone_body(lander, other.lander);

for (int i = 0; i < 2; ++i) {
  world->Step(1.0 / FPS, 6 * 30, 2 * 30);
  world->Step(1.0 / FPS, 6 * 30, 2 * 30);
  world->Step(1.0 / FPS, 6 * 30, 2 * 30);
  world->Step(1.0 / FPS, 6 * 30, 2 * 30);
  world->Step(1.0 / FPS, 6 * 30, 2 * 30);
  clone_body(leg1, other.leg1);
  clone_body(leg2, other.leg2);
  clone_body(lander, other.lander);
}

So, of course you can do the 5 world->Step calls in a for loop. But the gist is, we need to give the bodies a chance to settle down and fall asleep. It seems when forcefully copying all the public properties of bodies (velocity etc.) the bodies still jiggle around. Maybe it's due to the joints, I don't know. Apparently there is still some state I missed copying. The hack makes it all work for now, but maybe at some point I will be able to solve this in a cleaner way.

The project this is relevant to is GRAB0. The full implementation of the LunarLander environment can be found here. This includes the above code snippets in a wider context.

A related post is the Torch C++ tutorial with a few more code snippets.

Published on

# Productivity

I saw this great list of productivity tips from a year ago by guzey. And it's awesome.

Now, you may indulge in that list if you like. There's lots of these little productivity posts from a lot of different, very productive people on Hackernews. This will simply be another one of those. The reason is two-fold: It's for me to reflect on later. And it's just another combination. You can combine productivity tricks in many ways until you find your combination that just works. So here's mine. Maybe it works for you.

It consists of two rules.

  1. Prepare very small tasks (split up big tasks into small ones) the day before. I usually have a few (like 3?) well defined big tasks and a few smaller administrative tasks. Well defined means: They are broken into small tasks. It's like prepping food the night before work. Some productive people do that. (the food thing - I don't do that. But you get the gist.)

  2. If you notice you are starting to procrastinate, get in the habit of questioning what the root cause may be. Spoiler: It's emotions. Your task feels insurmountably huge. Your task feels useless. Your task feels too difficult. Your task feels like it's not getting you anywhere. Your task feels too easy (yes, that's possible too. Too easy == useless. For example when preparing for an exam: you have a few tasks to redo some exercises you deem easy -> that may result in procrastination, since you feel that redoing them again is not really getting you anywhere.)

See sources on the emotions (contains also related discussion): HN:22124489 HN:23537317

These are the two rules that have been driving me for more than a year now. guzey says procrastination tricks stop working from time to time. Let's see when these two rules stop working for me.

Note: They fail for me too, sometimes. Some days are just simply garbage. Keep in mind though - it's fine, it's a human thing. It is what it is.

I may update this post at some point with further experiences and tricks.

Update: Now that I think of it, I actually use more tricks from time to time. But the two rules above are the main ones. Mostly even just rule 1. Other tricks I sometimes use include: Win the morning (Aurelius) and some kind of lax Pomodoro technique based on tasks - finish a task (however long it takes, be it 15 min or 1 hour), take a small break. The Pomodoro one I only use for studying, not for work. For work or development I usually do not need any productivity tricks except for the two rules I mentioned. I don't actually like Pomodoro because sometimes the break leads to a 2 hour odyssey. After that, productivity is very impaired due to guilt.

Published on

# Failure of k-fold cross validation - The parity problem

Sometimes cross validation may fail. In the following, a bit more intuition into an exercise that shows a pitfall of leave-one-out cross validation is presented.

For the reference of the task at hand, refer to this book, exercise 11.5.1.

Failure of k-fold cross validation: Consider a case in which the label is chosen at random according to P[y = 1] = P[y = 0] = 1/2. Consider a learning algorithm that outputs the constant predictor h(x) = 1 if the parity of the labels on the training set is 1 and otherwise the algorithm outputs the constant predictor h(x) = 0. Prove that the difference between the leave-one-out estimate and the true error in such a case is always 1/2.

The gist of the solution is as follows. Be advised, this is just for intuition. The rigorous proof can be found here.

Firstly, parity is defined as for binary numbers: A list of labels \{x_0, .., x_n\} has the parity (\sum_i^n x_i) \bmod 2.

Imagine the sample consists of 3 items with the following labels: \{1, 0, 0\}. This set has a parity of 1. Now, when taking one of the items out (leave-one-out) for cross validation, there are two cases:

  1. Labels are now \{1, 0\}. The parity of this pair is still 1. The predictor is trained on this and will always return 1. However, the validation set consists of \{0\}, which has a parity of 0. Thus, the estimated error by CV is 1.

  2. Labels are now \{0, 0\}. The parity of this pair is 0. The predictor will always return 0. The validation set consists of \{1\}. Again, an estimated error of 1.

In essence, for this sample the estimated error will always be 1 (keep in mind, we take the average of all estimated errors).

Now, imagine the same for a sample of a much higher size, such as 1001, i.e. with 501 1's and 500 0's. The true error is then roughly 0.5: The predictor always predicts 1 due to the parity 1, but close to half of the samples are 0. You may keep going ad infinitum.

Finally, as the estimated error is 1.0 (as shown above), we thus get the difference of 0.5, as required in the exercise.
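
The argument can be checked with a small sketch: the constant predictor disagrees with every held-out label exactly when the total parity of the sample is odd (as in the example above), and never when it is even; since the true error of any constant predictor is 1/2 under uniform labels, the gap is 1/2 in both cases.

```python
def loo_estimate(labels):
    """Leave-one-out error estimate for the parity predictor."""
    n = len(labels)
    errors = 0
    for i in range(n):
        prediction = (sum(labels) - labels[i]) % 2  # parity of the n-1 training labels
        errors += int(prediction != labels[i])
    return errors / n

print(loo_estimate([1, 0, 0]))  # 1.0 (odd total parity)
print(loo_estimate([1, 1, 0]))  # 0.0 (even total parity)
```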

Published on

# Kernels - Additional Intuition

If you're here, you probably checked out a few blog posts or Wikipedia to understand Kernels. So this is just a few additional thoughts. Firstly, why even do this 'kernel trick'?

The application we will be looking at in this post is clustering a set of points. For this, it is of course important to have some kind of distance measure between the points, in fact that is the main function needed.

Consider a dataset of 4-dimensional points. Pick a point (x_1, x_2, x_3, x_4). For the sake of this example you find that to separate the points into multiple clusters you will have to move them into a 10-dimensional space, where the similarity of two points x and y is given by the following formula:

(\sum_i^4 x_i y_i)^2

Fully expanded, the similarity of two 4-dimensional points decomposes into the following 10 terms (one per dimension of the implicit 10-dimensional space):

\{x_1^2 y_1^2, 2 x_1 x_2 y_1 y_2, 2 x_1 x_3 y_1 y_3, 2 x_1 x_4 y_1 y_4, x_2^2 y_2^2, x_3^2 y_3^2, x_4^2 y_4^2, 2 x_2 x_3 y_2 y_3, 2 x_2 x_4 y_2 y_4, 2 x_3 x_4 y_3 y_4\}

You could just convert all points into this 10-dimensional representation and then apply a clustering algorithm, which will use some kind of distance measure on those points, such as euclidean distance. But observe all these calculations: for each point you have to compute 10 features. Now imagine you did not have to convert any points. You use the polynomial kernel:

(\sum_i^4 x_i y_i)^2 = (x^T y)^2

The crucial point here is that the kernel gives us a kind of similarity measure (due to the dot product), which is what we want. Since this is basically the only thing we need for a successful clustering, using a kernel works here. Of course, if you needed the higher dimension for something more complicated, a kernel would not suffice any longer. The second crucial point is that calculating the kernel is much faster. You do not have to first convert all the points to 10 dimensions and then apply some kind of distance measure; you do that in just one dot operation, i.e. one sum loop. To be precise, you iterate N = 4 times with a kernel, but you do roughly N \cdot 2.5 = 10 operations with the naive way of converting points to the higher dimension.
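
The identity is easy to verify numerically. A minimal sketch, using one standard choice of explicit feature map (squares plus sqrt(2)-scaled pairwise products; equivalent to the expansion above):

```python
from itertools import combinations
from math import sqrt, isclose

def kernel(x, y):
    """Polynomial kernel (x . y)^2 -- one pass over the n = 4 entries."""
    return sum(xi * yi for xi, yi in zip(x, y)) ** 2

def phi(x):
    """Explicit degree-2 feature map into 10 dimensions."""
    return [xi ** 2 for xi in x] + \
           [sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 0.5, 1.0, 3.0]

# Same similarity, with and without the explicit 10-dimensional detour.
assert isclose(kernel(x, y), sum(a * b for a, b in zip(phi(x), phi(y))))
print(len(phi(x)))  # 10
```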

Published on

# Advanced Torch C++ Tutorial (multiple objectives) - Code Snippets

This is a collection of notes regarding the C++ frontend of PyTorch. I'm a proponent of reading code to understand concepts in computer science, so this 'tutorial' is most likely not for everyone. Still, I recommend to just go ahead and read the goddamn code. In this post I'll just highlight a few specific topics to get one started. Documentation is in my opinion lacking, so anyone developing a project using the C++ frontend should basically download PyTorch from Github to read the code as needed. Let's get started.

Modules always need to be named XXImpl. For instance, your module could look like this in your header:

struct A2CNetImpl : public torch::nn::Cloneable<A2CNetImpl> {
    A2CNetImpl() {};
    ~A2CNetImpl() {};
};

It is possible to train multiple different objectives at once. For instance, let's assume you want to both learn to predict the policy and the value of a given state, you can do just that by simply returning a std::pair in the forward method of your custom module:

std::pair<torch::Tensor, torch::Tensor>
A2CNetImpl::forward(torch::Tensor input) {
  auto x = input.view({input.size(0), -1});
  x = seq->forward(x);
  auto policy = F::softmax(action_head(x), F::SoftmaxFuncOptions(-1));
  auto value = value_head(x);
  return std::make_pair(policy, value);
}
It is possible to create a network architecture dynamically from a configuration file. Here I pass a std::vector net_architecture (e.g. {64, 64}) and then iteratively create a linear layer of size 64 (as passed) and a ReLU activation layer. At the end I create two heads, a policy and a value head.

seq = register_module("seq", torch::nn::Sequential());
int n_features_before = n_input_features;
int i = 0;
for (int layer_features : net_architecture) {
  auto linear = register_module(
      "l" + std::to_string(i),
      torch::nn::Linear(n_features_before, layer_features));
  auto relu = register_module("r" + std::to_string(i + 1), torch::nn::ReLU());
  seq->push_back(linear);
  seq->push_back(relu);
  n_features_before = layer_features;
  i += 2;
}
action_head = register_module("a", torch::nn::Linear(n_features_before, 3));
value_head = register_module("v", torch::nn::Linear(n_features_before, 1));

The policy optimizers of course support polymorphism, so creating different ones based on configuration is possible too:

A2CNet policy_net = A2CNet();

std::shared_ptr<torch::optim::Optimizer> policy_optimizer;
std::string optimizer_class = params["optimizer_class"];
if (optimizer_class == "adam") {
  auto opt = torch::optim::AdamOptions(lr);
  if (params["use_weight_decay"])
    opt.weight_decay(weight_decay);  // weight_decay read from the config
  policy_optimizer = std::make_shared<torch::optim::Adam>(policy_net->parameters(), opt);
} else if (optimizer_class == "sgd") {
  auto opt = torch::optim::SGDOptions(lr);
  policy_optimizer = std::make_shared<torch::optim::SGD>(policy_net->parameters(), opt);
}

Actually training two objectives:


// Forward.
torch::Tensor action_probs;
torch::Tensor values;
std::tie(action_probs, values) = policy_net->forward(samples);
auto mcts_actions = attached_mcts_actions.detach_();

// Calculate losses.
torch::Tensor cross_entropy;
if (params["tough_ce"]) {
  auto err = -(torch::log(action_probs) * mcts_actions).sum({1});
  cross_entropy = err.sum({0});
} else {
  auto argmax_mcts_actions = mcts_actions.argmax({1});
  cross_entropy = F::cross_entropy(action_probs, argmax_mcts_actions);
}
cross_entropy /= mcts_actions.size(0);

// expected_values: the value targets (e.g. returns from MCTS); name assumed here.
torch::Tensor value_loss = F::smooth_l1_loss(values, expected_values);

// Retain the graph so the second objective can backpropagate too.
cross_entropy.backward({}, true, false);
value_loss.backward();

More interesting code snippets and the whole code can be found in the following two repositories:

Find the files a2c.cpp and a2c.hpp in both.

Published on

# Conflicting Samples and More Exploration

You should know a local minimum when you see one. Consider the following example that I've been battling with for a while. In my opinion it's a symptom of not enough exploration.

Below is a separated list of 3-item tuples that represent the likelihoods of 3 actions: rotating left, rotating right and going forward. The fat tuple is interesting, as it's not confident at all between right and forward. The cause is the sample we're learning from: 2, 2, 2, 2, 2, 1, 0, 2, 2, .. That sample rotates to the right, then to the left and then goes forward. Basically the two rotations are useless. Even more, we take two different actions in the same state, rotating right (1) and going forward (2). See where I'm going with this? Those are exactly the two actions that the fat tuple is not confident about.

0.064 0.199 0.739 | 0.029 0.011 0.962 | 0.03 0.06 0.912 | 0.018 0.023 0.961 | 0.013 0.022 0.967 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495 | 0.996 0.003 0.002 | 0.008 0.498 0.495

The result of a greedy evaluation of the above policy is: 22222101010101010101010101010101010101010... So you end up rotating until the end, not reaching the goal. Below is an ever so slightly different policy (check out the unconfident tuple - this time we go forward instead of rotating right when greedily evaluating).

0.064 0.188 0.749 | 0.029 0.01 0.962 | 0.03 0.057 0.914 | 0.018 0.022 0.962 | 0.013 0.021 0.968 | 0.008 0.486 0.507 | 0.002 0.001 0.998 | 0.002 0.001 0.999 | 0.001 0.999 0.002 | 0.001 0.001 1 | 0.001 0.001 0.999 | 1 0.001 0.001 | 0.001 0.001 1 | 0.001 0.001 1 | 0.001 1 0.001 | 0.001 0.001 1 | 0.001 0.001 1 | 1 0.001 0.001 | 0.001 0.001 1 | 0.001 1 0.001 | 0.001 0.001 1 | 0.001 0.001 1 | 0.498 0.001 0.503 | 0.001 0.001 1 | 0.001 0.001 1 | 0.001 0.001 1 | 0.001 0.001 1 | 1 0.001 0.001 |

Evaluation: 222222221220221220212222222022122

So, this training sample sucks and I need more exploration based on it. Let's see if my hypothesis is correct. I might update this post at some point.

Update: My theory was right, but the reason was wrong. See the issue here and the resolving commit here. What I saw above was this off by one mistake, as mentioned in the issue. Simply a logical mistake by myself. What finally fixed it however, was 'fuzzing' the gradient bandit in GRAB0 (bear in mind, the above topic came from that project) and noticing that when given a policy, the gradient bandit needs at least 2000 simulations to find a better policy (and even then, only most of the time, but that's good enough).

Published on

# PreDeCon - Density based Projected Subspace Clustering

PreDeCon1 is a projected subspace clustering method based on DBSCAN. This will be a short post, mostly to illustrate the concept with an example, since I believe code is worth a thousand words. The gist is to weight distances based on the variance of a point's neighborhood. I.e., if your neighbors are all distributed along a line, you assign a very small weight to distances along the line and a very large weight to the perpendicular dimension. So, if a point is slightly off the line, it's already not part of your cluster (since that distance is weighted heavily). The weight ratio can be as large as 1 to 100! So going off the line penalizes (= weighs more) the distance by a factor of 100.

See for yourself in the following picture of an example data set. PreDeCon will find two clusters, both the lines in the picture.

The way this works is that the red point is NOT a core point under the weighted distance, even though it would be with naive DBSCAN. Thus, DBSCAN stops here, instead of continuing and adding all points to one single cluster. And why is it not a core point? Well, the neighborhood of the red point's right neighbor lies on the vertical line, not on the horizontal line as with the other neighbor! Thus, the weight given to the horizontal distance between the red point and the right neighbor will be 100: to the right neighbor, the red point looks like an outlier, since it's obviously not on the vertical line. The weight given to the vertical distance is simply 1. But the red point and the right neighbor have a vertical distance of 0 and a horizontal distance of 1. Thus, the weighted (squared) distance here is 0 * 1 + 1 * 100. There's more to it, but that gets the method across.

Finally, here's Python code explaining the above concept, including all details (such as the correct euclidean distance calculation, not the simplification we used).

# PreDeCon - Projected Clustering
# Paper: Density Connected Clustering with Local Subspace Preferences, Böhm et al., 2004
# This code is not optimized.
import numpy as np

def get_neighbor(X, candidate):
    """Return the eps-neighbors of candidate."""
    neighbors = []
    for pt in X:
        if ((pt - candidate) ** 2).sum() ** .5 <= eps:
            neighbors.append(pt)
    return neighbors

def get_weights(X, candidate):
    dists_x = []
    dists_y = []
    for neighbor in get_neighbor(X, candidate):
        dists_x.append((neighbor[0] - candidate[0]) ** 2)
        dists_y.append((neighbor[1] - candidate[1]) ** 2)

    var_x = sum(dists_x) / len(dists_x)
    var_y = sum(dists_y) / len(dists_y)

    weight_x = 1 if var_x > delta else K
    weight_y = 1 if var_y > delta else K

    return weight_x, weight_y

def pref_weighted_dist(X, neighbor, candidate):
    weights = get_weights(X, neighbor)
    dist = 0
    for i in range(2):
        dist += weights[i] * (neighbor[i] - candidate[i]) ** 2
    return dist ** .5

def is_core(X, candidate):
    good_ones = []
    for neighbor in get_neighbor(X, candidate):
        dist = max(
            pref_weighted_dist(X, neighbor, candidate),
            pref_weighted_dist(X, candidate, neighbor)
        )
        if dist <= eps:
            good_ones.append(neighbor)
    return len(good_ones) >= minpts

X = np.array([
    [1, 5],
    [2, 5],
    [3, 5],  # p3
    [4, 5],
    [5, 5],
    [6, 5],  # p6, red point
    [7, 5],
    [7, 6],
    [7, 7],
    [7, 4],
    [7, 3],
    [7, 2]
])

minpts = 3
eps = 1
delta = 0.25
K = 100

print('p3', is_core(X, X[2]))
print('p6', is_core(X, X[5]))

Can also be found here.


  1. Density Connected Clustering with Local Subspace Preferences, Böhm et al., 2004 

Published on

# SUBCLU by example

SUBCLU1 is a subspace clustering algorithm based on DBSCAN2 and A-priori3.

Assume the following small exemplary data set with three dimensions. DBSCAN is parameterized with eps=3 and minpts=2 (including the core point).

2 10 2
1 20 1
9 30 9
8 40 9

The algorithm is bottom-up, so let's start with singular dimensions 1 to 3. D(x,y) is the euclidean distance between two points.

Dimension 1 (points: 2, 1, 9, 8): D(2,1) = 1 < eps. Since we have two points, 2 is a core point. Same can be said about 1. So, the first cluster consists of 2 and 1. Same can be applied to 9 and 8. So dimension 1 contains two clusters.

Dimension 2 (points: 10, 20, 30, 40): All points are too far away from each other (the distance is at least 10 between each pair), so no cluster can be found here.

Dimension 3 (points: 2, 1, 9, 9): Analogous to dimension 1. Two clusters.

Next, we generate the k+1 subspace candidates (where k=1, since we just did singular dimensions). Keep A-priori in mind: The k+1 candidates need to have k-1 attributes in common. Not relevant for 2 dimensions, but will be relevant for 3 dimensions later.

So here we have the following combinations of dimensions: 1,2; 1,3; 2,3. We may now prune some of those candidates using the A-priori principle, i.e. if k dimensions of a given candidate do not contain a cluster, that candidate will also not contain a cluster in k+1 dimensions. For proofs, refer to the original paper1. In this case, with k=1, dimension 2 has no clusters and thus both the candidates 1,2 and 2,3 can be pruned.

Next, we apply DBSCAN on subspace 1,3 (points: (2,2), (1,1), (9,9), (8,9)). D((2,2), (1,1)) = D((1,1), (2,2)) = sqrt(2) < eps. That's a cluster with points (2,2) and (1,1) (since we have 2 points and minpts=2). D((9,9), (8,9)) = D((8,9), (9,9)) = 1 < eps. Again, another cluster.

To generate three dimensions, we would need more candidates. For example, with 1,2 and 1,3 one could generate 1,2,3 (since both share the first dimension: 1). However, currently the only 2-dimensional candidate with clusters is 1,3. Thus, we stop here.
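
The candidate generation with A-priori pruning described above can be sketched in a few lines of Python (a simplified sketch, not the full SUBCLU implementation; subspaces are tuples of 0-indexed dimensions, and the input holds only the k-dimensional subspaces in which DBSCAN found clusters):

```python
from itertools import combinations

def generate_candidates(clustered):
    """Join k-dim subspaces sharing k-1 dims; prune with A-priori."""
    if not clustered:
        return []
    k = len(clustered[0])
    cands = set()
    for a, b in combinations(clustered, 2):
        merged = tuple(sorted(set(a) | set(b)))
        if len(merged) != k + 1:
            continue  # the two subspaces share fewer than k-1 dimensions
        # A-priori: every k-dim projection must itself contain a cluster.
        if all(sub in clustered for sub in combinations(merged, k)):
            cands.add(merged)
    return sorted(cands)

# Dimensions 1 and 3 (0-indexed: 0 and 2) contain clusters, dimension 2 does not:
print(generate_candidates([(0,), (2,)]))  # [(0, 2)], i.e. subspace 1,3
```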

Result: The two dimensions 1 and 3 contain two subspace clusters:

2 2
1 1


9 9
8 9


  1. Density-Connected Subspace Clustering for High-Dimensional Data, Kailing et al., 2004 

  2. A Density-Based Algorithm for Discovering Clusters, Ester et al., 1996 

  3. Fast Algorithms for Mining Association Rules, Agrawal et al., 1994 

Published on

# Subspace and Correlation Clustering

I've compiled a list of ELKI invocations for subspace/correlation clustering with nicely fit parameters (except for PROCLUS). This post is actually more of a reminder for myself. A note on PROCLUS: Due to the randomness of k-medoids it seems I only get good clusters half of the time. It's something to consider.

java -jar elki.jar KDDCLIApplication -algorithm clustering.subspace.SUBCLU -dbc.in exampledata2.txt -subclu.epsilon 3 -subclu.minpts 1 -subclu.mindim 1 -distance.dims 1,2,3

java -jar elki.jar KDDCLIApplication -algorithm clustering.subspace.CLIQUE -dbc.in exampledata3.txt -clique.tau 0.3 -clique.xsi 8

java -jar elki.jar KDDCLIApplication -algorithm clustering.subspace.PROCLUS -dbc.in exampledata3.txt -proclus.mi 10 -projectedclustering.k 2 -projectedclustering.l 2

java -jar elki.jar KDDCLIApplication -algorithm clustering.subspace.PreDeCon -dbc.in exampledata5.txt -predecon.delta 0.25 -predecon.kappa 100 -predecon.lambda 2 -dbscan.minpts 3 -dbscan.epsilon 1

java -jar elki.jar KDDCLIApplication -algorithm clustering.correlation.ORCLUS -dbc.in exampledata5.txt -orclus.alpha 0.5 -projectedclustering.k 2 -projectedclustering.l 1

java -jar elki.jar KDDCLIApplication -algorithm clustering.correlation.FourC -dbc.in exampledata5.txt -dbscan.epsilon 1 -dbscan.minpts 3 -predecon.kappa 100 -pca.filter.delta 0.5

java -jar elki.jar KDDCLIApplication -algorithm clustering.Leader -dbc.in exampledata5.txt -leader.threshold 5

The example data sets used above:

2 5 2
1 5 1
9 5 9
8 5 9

4.5, 4.0, 1.2
4.1, 2.0, 1.3
4.2, 3.0, 1.3
4.4, 2.0, 1.2
4.3, 5.0, 1.1
1.3, 4.0, 1.5
1.2, 5.0, 1.6
1.3, 2.0, 1.5
1.4, 3.0, 1.6

1 4
2 4
3 4
4 4
5 4
6 4
7 1
7 2
7 3
7 4
7 5
7 6

Published on

# Real-time Dynamic Programming (RTDP) applied to Frozen Lake

Real-time dynamic programming1 (RTDP) uses Markov Decision Processes (MDPs) as a model of the environment. It samples paths through the state space based on the current greedy policy and updates the values along the way. It's an efficient way of real-time planning, since not necessarily the whole state space is visited, and it works well for stochastic environments. As such, with some additional work, it is possible to apply RTDP to stochastic gym environments such as frozen lake. In my case, I tried just that. Given a 20x20 map and full observability, RTDP successfully gets a decent reward after a few minutes of training on one core. The source code can be found at the end of the post.

Here's the algorithm:

And this is the environment:


For the MDP we need four components: States, actions, transition matrix and costs. Fortunately, since we're using a gym environment, most of this is already given. There are four actions, the states are known and the gym environment exports the transition matrix (including the probabilities). That is, only the costs are missing. Out of the box, frozen lake only gives one positive reward at the very goal. In all other cases, e.g. falling into a hole, zero. One idea for a custom cost mapping could be: 0.1 for each action, 1 for a hole, -1 for the goal (negative cost). A keyword related to negative cost is self-absorbing goals, i.e. goals that have just a single action going back to the goal itself with 0 or even negative cost. A self-absorbing goal is a requirement of stochastic shortest path MDPs (SSP MDPs)4, which is pretty much what we have here. We want the path with the lowest cost (which is the same as a shortest path with lowest cost=time). By the way, gamma needs to be set to 1.
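
The trial loop (greedy action selection via a Bellman backup, value updates along the sampled path) can be sketched as follows. This is a minimal sketch on a hypothetical toy MDP, not the frozen lake setup itself; the transition structure mirrors gym's P[s][a] format, with rewards replaced by costs and a self-absorbing goal:

```python
import random

def rtdp_trial(P, V, start, goal, gamma=1.0, max_steps=100):
    """One RTDP trial. P[s][a] = [(prob, next_state, cost, done), ...]."""
    s = start
    for _ in range(max_steps):
        if s == goal:
            break
        # Greedy action via a one-step Bellman backup (minimize expected cost).
        q = {a: sum(p * (c + gamma * V[ns]) for p, ns, c, _ in P[s][a])
             for a in P[s]}
        a = min(q, key=q.get)
        V[s] = q[a]  # update the value along the way
        # Sample the successor state from the transition distribution.
        r, acc = random.random(), 0.0
        for p, ns, c, _ in P[s][a]:
            acc += p
            if r <= acc:
                s = ns
                break
    return V

# Toy 2-state chain: state 0 -> goal 1 with cost 0.1; the goal absorbs at cost 0.
P = {0: {0: [(1.0, 1, 0.1, False)]}, 1: {0: [(1.0, 1, 0.0, True)]}}
V = {0: 0.0, 1: 0.0}
rtdp_trial(P, V, 0, 1)
print(V[0])  # 0.1
```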

As a result of the above mapping, RTDP achieves an average reward of 0.48 (over 10000 evaluations). This is great, since this is not an easy stochastic environment - you only have a chance of 33% of actually ending up where you intend to be! But this is also where RTDP excels. A faster alternative to RTDP would be discretizing the transition probabilities and applying D* Lite.

There are also improvements to RTDP, such as BRTDP2. For more information, refer to Mausam and Kolobov3. Implementations (including RTDP on frozen lake) are to be found here:

  • https://github.com/instance01/RTDP/
  • https://github.com/instance01/BRTDP-DS-MPI/
References:

  1. Learning to act using real-time dynamic programming, 1994, Barto et al.

  2. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees, 2005, McMahan et al.

  3. Planning with Markov Decision Processes: An AI Perspective, 2012, Mausam and Kolobov

  4. Solving Stochastic Shortest-Path Problems with RTDP, 2002, Bonet and Geffner

Published on

# pb_c_init and pb_c_base

Just a short note on those two variables of the AlphaZero pseudo code. Let's dive right into it.

Decreasing pb_c_base gives visits a higher priority. Check out the C++ snippet below. (it's from here btw, a research project I am working on)

double pb_c = std::log((parent_node->visits + base + 1) / base) + pb_c_init;
pb_c *= std::sqrt(parent_node->visits) / (child_node->visits + 1);
double prior_score = pb_c * action_probs[0][child_node->action].item<double>();
return mean_q + prior_score;

You see that visits is getting normalized with base. If base is the default 19652 from the AlphaZero pseudo code, but your MCTS only does 50 simulations (like in my case), you get at most 50 visits and thus this term never becomes very significant:

double pb_c = std::log((50 + 19652 + 1) / 19652.0) + pb_c_init;
// = pb_c = 0.0025919 + pb_c_init;

With pb_c_base set to 500 we suddenly get:

double pb_c = std::log((50 + 500 + 1) / 500.0) + pb_c_init;
// = pb_c = 0.0971267 + pb_c_init;

Definitely a difference.

Now, pb_c_init on the other hand sets the baseline of the prior score relative to mean_q. The default here is 1.25, i.e. with the above results we get a pb_c of roughly 1.35 (for a base of 500) before multiplying with the visit ratio and the action probability. Basically, one can adjust the significance of the prior score in relation to the mean Q value. If pb_c_init is set lower, the MCTS will base its exploration more on the mean Q value. If it is set higher, the Q value is less significant.
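
The same arithmetic as a quick standalone check in Python (not part of the project code; note that std::log is the natural log):

```python
import math

def pb_c(parent_visits, pb_c_base, pb_c_init):
    # Exploration weight from the AlphaZero pseudo code.
    return math.log((parent_visits + pb_c_base + 1) / pb_c_base) + pb_c_init

# With only 50 visits, the log term barely matters for a large base:
print(round(pb_c(50, 19652, 1.25), 4))  # 1.2526
print(round(pb_c(50, 500, 1.25), 4))    # 1.3471
```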

For the interested reader, I have both a Python and C++ version of AlphaZero implemented here: https://github.com/instance01/GRAB0/

Published on

# You should take screenshots of your Desktop from time to time

This is for all the people who work most of the day on their computer. If you use your Desktop as a temporary dumping ground, for ideas, or for your current projects, consider screenshotting it from time to time. If it's some other dedicated folder, screenshot that one.

The Desktop or the folder is a current snapshot of most things you're up to. Well, maybe your open tabs in your favorite browser could also be added. Anyways, doing this every few weeks will give you some excellent material to review at the end of the year. It is very low effort and doesn't even have to happen at some given interval. Whenever you feel like it - do it.

It's like taking pictures while being in Italy on vacation. At some point later in life you might take that album full of pictures and reminisce over it. I assure you, you will definitely appreciate those screenshots later. You will remember some parts of the year and possibly indulge in some nostalgia. You might laugh about the weird things you considered. Failed projects or maybe that one success - you will see its beginnings.

Makes you think though. Life has become insanely digital, at least for me.

# Resolution and Indexing

This is a seminar paper I did in December 2019.

The post surveys the origins of resolution and a selected set of variants: Binary resolution with theory unification, hyper-resolution and parallel unit resulting resolution. Performance improvements enabled by indexing data structures such as substitution trees or abstraction trees are explored. Finally, related work and further topics are listed.


As early as in the 18th century, Leibniz formulated the idea of automatic resolution with his 'calculus ratiocinator', in which he described a vision of machines being able to automatically resolve any logical problem51, i.e. for example the satisfiability of a propositional formula or a theorem of logic. Now, centuries later, resolution, an essential part of solving these kinds of problems, enjoys a broad range of applications, though Leibniz' vision has not become reality so far: Theorem provers do not have the ability to solve any such given problem. An excellent case suggesting that this may never become reality is Hilbert's 10th problem, in which an algorithm is to be found that can decide whether a diophantine equation is solvable: Matiyasevich showed in 1970 that it is indeed impossible to devise such an algorithm31 30.

Still, since its first formal definition by Robinson in 1965, many areas have been found to profit from the resolution principle41. Examples include automatic theorem provers36 such as Vampire, SAT solvers such as BerkMin17, or logic programming. For the latter, PROLOG is a prominent example: A PROLOG program is defined as a logic formula, and refutation using SLD resolution15 21 is the 'computation' of the program25. As for SAT solvers, resolution helps in averting combinatorial explosion and is used as a preprocessing step35. By creating new clauses that may show the unsatisfiability earlier, the total state space that is visited is reduced. Specifically, hyper-resolution has been shown to successfully improve the efficiency of SAT solvers5 20. Notable examples include HyPre5, HyperBinFast16, PrecoSAT7 or CryptoMiniSAT46.

The remainder of this post is structured as follows. First, binary resolution is presented. On top of that, three variants are explored: Equational-theory resolution, hyper-resolution and unit resulting resolution. In each chapter a note on possible performance improvements through indexing is made. Lastly, concluding remarks and references to related topics are listed.

Binary Resolution

To derive a proof for a hypothesis given certain axioms, resolution can be of particular use. The hypothesis is negated and encoded as a proof by contradiction, and resolution is applied repeatedly. If the result is the empty clause, the axioms together with the negated hypothesis cannot be satisfied; this contradiction is the proof that the hypothesis is actually correct. The opposite is true if the result is not the empty clause. Another classic application is the satisfiability of a problem that consists of multiple premises (consisting of multiple boolean variables) that have to be fulfilled. The resolution principle defines an elementary rule to confirm the (un-)satisfiability of that problem, i.e. whether there is a configuration of variables that can fulfill the problem. In the following, the rule is presented in detail.

Binary resolution41 is especially of interest for its support of predicate logic. With predicate logic it is possible to define relations between variables, which makes it more expressive than (finite) propositional logic6. While propositional resolution works analogously to binary resolution, using a similar propositional resolution principle11, there is a major difference: binary resolution makes use of unification. Propositional resolution is not covered here; in what follows we reason in predicate logic exclusively. To dissect binary resolution, a few definitions are needed.

    • A term is a constant or a variable or an n-ary function consisting of n terms.
    • An n-ary predicate symbol on n terms is a literal. The negation of a literal is also a literal.
    • A finite set of literals is a clause. The set can also be seen as a disjunction of the literals.

    Binary resolution introduces the following rule (see Example 1): If, given two clauses, in this case {A(x), C(x)} and {¬A(x), D(x)}, one contains a positive and the other a corresponding negative literal, all else being equal, the two clauses resolve into a new clause, also called the resolvent. In this case the resolvent is {C(x), D(x)}. So far this is similar to propositional resolution.
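    The rule can be sketched in a few lines of Python, for the ground (propositional-style) case without unification. Clauses are represented as frozensets of string literals, negation is marked with `~`, and the names `negate` and `resolve` are illustrative, not taken from any particular prover:

    ```python
    # A minimal sketch of a single ground resolution step. Clauses are
    # frozensets of literals; "~" marks a negative literal.

    def negate(lit):
        return lit[1:] if lit.startswith("~") else "~" + lit

    def resolve(c1, c2):
        """Return all resolvents of the two clauses."""
        resolvents = []
        for lit in c1:
            if negate(lit) in c2:
                # drop the complementary pair, keep everything else
                resolvents.append((c1 - {lit}) | (c2 - {negate(lit)}))
        return resolvents

    # {A(x), C(x)} and {~A(x), D(x)} resolve to {C(x), D(x)}:
    print(sorted(resolve(frozenset({"A(x)", "C(x)"}),
                         frozenset({"~A(x)", "D(x)"}))[0]))  # -> ['C(x)', 'D(x)']
    ```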

    A positive literal and its negated version are also called complements. Now, binary resolution additionally relaxes the notion of a complement by allowing a negative and its positive literal to be resolved if they can merely be unified. See Example 2 below.

    It can be seen that A(x) and ¬A(y) are unifiable by the substitution {y ↦ x}. This substitution is then applied to the resolvent, yielding {C(x), D(x)}. It is important to note that this is always the substitution of a most general unifier19.
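    A rough sketch of syntactic unification returning a most general unifier is given below. It assumes variables are lowercase strings and compound terms are tuples like `("A", "x")`; the helper names are hypothetical and the occurs check is omitted for brevity:

    ```python
    # Sketch of Robinson-style syntactic unification (no occurs check).

    def is_var(t):
        return isinstance(t, str) and t.islower()

    def substitute(t, subst):
        """Apply a substitution to a term, following chains of bindings."""
        if is_var(t):
            return substitute(subst[t], subst) if t in subst else t
        if isinstance(t, tuple):
            return tuple(substitute(a, subst) for a in t)
        return t

    def unify(t1, t2, subst=None):
        """Return an mgu of t1 and t2, or None if they are not unifiable."""
        subst = dict(subst or {})
        t1, t2 = substitute(t1, subst), substitute(t2, subst)
        if t1 == t2:
            return subst
        if is_var(t1):
            subst[t1] = t2
            return subst
        if is_var(t2):
            subst[t2] = t1
            return subst
        if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
            for a, b in zip(t1, t2):
                subst = unify(a, b, subst)
                if subst is None:
                    return None
            return subst
        return None

    # A(x) and A(y) unify via {y -> x} (or, equivalently, {x -> y}):
    print(unify(("A", "x"), ("A", "y")))  # -> {'x': 'y'}
    ```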

    There is still a limitation. For example, the two clauses {A(x, a)} and {¬A(b, x)} cannot be satisfied together, since this is akin to ∀x A(x, a) and ∀x ¬A(b, x) (where the two x variables are different), and assuming x in the first clause becomes b and x in the second clause becomes a, {A(b, a)} and {¬A(b, a)} are unsatisfiable together. Yet, binary resolution in its current form is not able to resolve to an empty clause. To fix this, a renaming step in the first clause is added32. Then, x may become y (resulting in {A(y, a)}) and thus unification and resolution succeed.

    There is still a case that cannot be resolved correctly: The two clauses {A(x), A(y)} and {¬A(x), ¬A(y)} are unsatisfiable together, yet an empty clause cannot be derived, as all literals of the first clause are positive and all literals of the second clause are negative. For this purpose, factoring is introduced. The main point of factoring is deduplication. If some of the literals of a clause are unifiable, the clause after substitution of the most general unifier is a factor. A resolution can then commence using factors. In the above case, the substitution {x ↦ y} is a most general unifier. Thus, after applying it, the result is the factors {A(y)} and {¬A(y)}, since a unification removes duplicates. Now, in a simple resolution step an empty clause is derived.

    It is possible to use indexing to speed up resolution by a small factor. With indexing, redundant terms are shared, decreasing the total amount of memory used, and the lookup of unifiable terms is faster. To realize indexing, structures such as discrimination trees19 or substitution trees18 are used. As Graf mentions19, an index could be built in advance, for instance by adding all literals into a discrimination tree, thus enabling fast retrieval of complementary literals for unification in a resolution step. An exemplary tree of a set of clauses can be seen in Figure 1. Traversing the tree, for example to look for a complementary literal for B(x), is faster than testing all possible literals each time a complementary candidate is required.

    Binary Resolution with Theory Unification

    Standard binary resolution does not have the power to resolve clauses for which certain axioms or theorems are available, such as associativity or commutativity in an equation. Consider the two clauses {A(a, b)} and {¬A(b, a)}. If commutativity holds, i.e. A(a, b) = A(b, a), binary resolution cannot resolve the clauses, since syntactic unification3, the theory-free unification binary resolution is based on, cannot change constants and does not know about the rules of commutativity. A naive approach would be to generate all possible clauses that arise from the theory and then try them all until one succeeds. However, for the clause {A(x, y, z)} that would already result in six clauses, considering that all combinations that could arise from commutativity need to be covered. It is immediately clear that this does not scale. To this end, binary resolution can be coupled with equational theory unification37, also called E-unification: If a clause including an equation is given together with a certain theory that holds, a resolution can be found efficiently.

    Next, E-unification is explored in more detail. The notion of a most general unifier carries over from normal unification, though only to a limited extent. There is an ordering of substitutions, i.e. some are more general than others: For two given substitutions σ and τ, σ is more general if there is a υ where xτ = xσυ and the given theory still holds. A substitution is the most general E-unifier if it is also the most general such substitution. Syntactic unification can have multiple most general unifiers, but only when taking into account variants, i.e. renamings of variables. Ignoring these, a syntactically unifiable set of clauses has exactly one most general unifier13. This is comparable to unitary problems, which also have only one most general E-unifier. However, and this is the main difference between syntactic unification and E-unification, there are cases in which E-unifiable clauses do not have a single most general E-unifier3. For example, the equational clause A(x, y) = A(a, b) (with commutativity) has two most general E-unifiers, {x ↦ a, y ↦ b} and {x ↦ b, y ↦ a}. Strictly speaking, neither of them is then the most general E-unifier: Both are unrelated (neither is an instance of the other), both have the same number of substitutions, and for both no more general substitution exists. Thus, E-unification reasons with sets of E-unifiers. A set of ordered E-unifiers is complete if it contains all E-unifiers that are applicable to the problem, or at least all most general ones, i.e. for each applicable E-unifier there is at least one more general unifier in the set. A minimal set of E-unifiers only contains unifiers that cannot be compared, i.e. for a given E-unifier there is no more or less general one in the set. For a given minimal set there are four types of problems3:

    1. Unitary, if the theory is empty (so no additional theory holds, just equality) and there is just one most general E-unifier (which is a syntactic one, as no theory holds).
    2. Finitary, if the set is finite and has more than one E-unifier. For example, commutativity is finite: There are only so many combinations for a given problem43.
    3. Infinitary, if the set is infinite.
    4. Zero, if the set is empty.
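    The commutativity example from above can be checked mechanically. The snippet below is purely illustrative (it is not an E-unification algorithm): it verifies that both incomparable most general E-unifiers of A(x, y) = A(a, b) hold when equality is taken modulo argument order:

    ```python
    # Illustrative check: under commutativity of A, both substitutions
    # solve A(x, y) = A(a, b).

    def eq_commutative(t1, t2):
        """Equality of A-terms modulo commutativity: compare arguments as sets."""
        return set(t1) == set(t2)

    def apply(subst, args):
        """Apply a variable-to-constant substitution to an argument tuple."""
        return tuple(subst.get(a, a) for a in args)

    goal = ("a", "b")                  # the arguments of A(a, b)
    unifier1 = {"x": "a", "y": "b"}
    unifier2 = {"x": "b", "y": "a"}
    print(eq_commutative(apply(unifier1, ("x", "y")), goal))  # -> True
    print(eq_commutative(apply(unifier2, ("x", "y")), goal))  # -> True
    ```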

    The type of problem is interesting since an algorithm might have to deal with infinitely many possible E-unifiers. A list of theories is given in Table 1. It is interesting to note that some theories are not decidable. Also, not every theory has published algorithms for E-unification. These are potentially open problems.

    | Theory | Type | Decidability | Algorithm |
    |---|---|---|---|
    | Associativity (A) | Infinitary | Decidable | 37 |
    | Commutativity (C) | Finitary | Decidable | 43 |
    | Distributivity (D) | Infinitary* | (Un)decidable | 42 |
    | (A) and (C) | Unitary/Finitary | Decidable | 47 |
    | (A), (C) and idempotency (I) | Unitary/Finitary | Decidable | 2 |
    | Abelian groups | Unitary/Finitary | Decidable | 22 |
    | Commutative rings | Zero | Undecidable | None |
    | Boolean rings | Unitary/Finitary | Decidable | 8 |
    | Endomorphisms | Unitary/Finitary | (Un)decidable | None |

    Table 1: List of theories with respective facts3

    Types denoted by * only apply to some problems such as ones containing constants or arbitrary function symbols.

    Now, equational theory unification is restricted in that clauses need to contain some kind of equation. However, it is possible to create specific unification algorithms, similar to the ones referenced in Table 1: Z-resolution12 compiles an axiom by generating a Lisp program that can then automatically deduce that axiom. An axiom that can be Z-resolved has to be in the form of a two-literal clause (also called a Z-clause), which together with another clause resolves to some other clause. It can be seen immediately that this gives considerable freedom in what the clauses contain and thus the axioms they represent. In fact, they can represent theories from E-unification, such as commutativity. Compiling an axiom can be compared to a human remembering that if A = B, then B = A, and thus only one rule needs to be saved in memory. The other one is deduced when needed.

    The idea of such specific unification algorithms is not applicable to recursive clauses. For instance, one such clause is {A(x), ¬A(f(x))}. It resolves with itself to the new clause {A(x), ¬A(f(f(x)))}, also called a self-resolvent. It is immediately clear that there are infinitely many such resolvents. Generating a unification algorithm for them all is infeasible; however, Ohlbach showed that abstraction trees can be used to encode those resolvents efficiently34. Furthermore, standard operations such as merge do not have to be adapted for infinite clauses. Self-resolvents can be compared to instances of the original clause. With this, a deterministic procedure for generating new ones is possible. The clauses can be added into an abstraction tree with continuations, i.e. when a node with a continuation is reached while traversing, the next level can be generated with the procedure by appending a new tree of the same structure. Thus, it is possible to represent infinite sets of unifiers.
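    The deterministic generation of self-resolvents can be sketched as follows (illustrative code, with literals as plain strings): each resolution of {A(x), ¬A(f(x))} with itself nests f one level deeper, so the n-th step yields {A(x), ¬A(f^(n+1)(x))}:

    ```python
    # Generate the n-th self-resolvent of {A(x), ~A(f(x))} deterministically.

    def self_resolvent(n):
        """Return the clause {A(x), ~A(f^n(x))} as a pair of literal strings."""
        term = "x"
        for _ in range(n):
            term = f"f({term})"   # nest f one level deeper per step
        return ("A(x)", f"~A({term})")

    for n in range(1, 4):
        print(self_resolvent(n))
    # ('A(x)', '~A(f(x))')
    # ('A(x)', '~A(f(f(x)))')
    # ('A(x)', '~A(f(f(f(x))))')
    ```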

    Refer to Figure 2 for an example. The clauses {A(x, a)} and {A(a, x)} with associativity over A can be resolved with an infinite set of E-unifiers, e.g. {{x ↦ a}, {x ↦ A(a, a)}, {x ↦ A(a, A(a, a))}, ...}. Now, when traversing the tree, the continuation is generated whenever a continuation identifier, in this case denoted by a C, is encountered.


    In the context of theorem proving, especially for huge problems, the number of clauses generated by resolution may become very large. In fact, the number grows exponentially in the number of steps until unsatisfiability can be confirmed4. (The attentive reader might question why only unsatisfiability is mentioned: Predicate logic is undecidable10 49, so it is not clear which complexity to assign to satisfiability4.) Additionally, many of the generated clauses are tangential to the proof9. Storing them wastes memory and considering them as proof candidates wastes CPU cycles, slowing the theorem prover down. In the following, variants of binary resolution that deal with this are explored.

    A prominent one is hyper-resolution. It offers improvements in terms of speed: Fewer resolvents are generated, and thus fewer steps are needed to get to a result23. Published in 1965 by Robinson40, it has since found wide use as part of SAT solvers, going as far as making problems solvable that were previously unsolvable5. Next, a short overview is given. First, a few definitions are needed. A positive clause is a clause containing only positive literals (i.e. no literal with a negation sign); a negative clause only contains negative ones. The hyper-resolution rule is as follows. Given a list of strictly positive clauses, also called electrons, and a clause that is not positive (i.e. either completely or partly negative), also called the nucleus, all of these clauses can be resolved in one step: The nucleus and the electrons are unified simultaneously. After unification, the negative literals of the nucleus are required to have counterparts in the electrons, causing all these complements to be removed. Since all negative literals are gone, a positive clause (or an empty clause) remains. For this reason this is also called positive hyper-resolution; the corresponding negative hyper-resolution would result in a negative resolvent. Figure 3 illustrates this. Given are the four clauses {¬A(x), B(x)}, {C(y)}, {A(x), B(x)} and {¬B(z), ¬C(z)}. Bold clauses are the respective nuclei, and there are two steps until the empty clause is reached. Multiple electrons are resolved at each step, and after each step the new clause is added to the set of clauses. It can immediately be seen that binary resolution would require three steps in this case.
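    For intuition, a single positive hyper-resolution step can be sketched in the ground (propositional) case, where unification degenerates to literal matching. The function name is illustrative, and `~` marks negation:

    ```python
    # Ground sketch of one positive hyper-resolution step.

    def hyper_resolve(nucleus, electrons):
        """Resolve all negative literals of the nucleus against positive
        electrons in one step; returns the (positive) resolvent, or None if
        some negative literal has no complementary electron."""
        resolvent = {l for l in nucleus if not l.startswith("~")}
        for neg in (l for l in nucleus if l.startswith("~")):
            for electron in electrons:
                if neg[1:] in electron:
                    resolvent |= electron - {neg[1:]}
                    break
            else:
                return None  # no electron complements this negative literal
        return resolvent

    # Nucleus {~B, ~C} with electrons {B} and {C} resolves to the empty clause:
    print(hyper_resolve({"~B", "~C"}, [{"B"}, {"C"}]))  # -> set()
    ```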

    At this point, however, finding the unifier is the main computational issue, as the possible combinations grow exponentially with the number of electrons for each negative literal in the nucleus19. For this, indexing comes into play, specifically substitution trees. In fact, many theorem provers such as LEO-III50 24 rely on substitution trees, among others, for speed-up. A substitution tree allows for efficient retrieval of terms unifiable with a given term. Given two substitution trees, the merge operation returns the substitutions that are compatible, i.e. where the codomains of the same variables can be unified. This can be extended to multiple substitution trees and is then called a multi-merge, which is especially relevant for hyper-resolution, since a set of electrons and a nucleus need a simultaneous unifier. Graf19 proposes to keep a substitution tree for each literal of the nucleus, i.e. all required substitutions for that literal can be looked up. When the simultaneous unifier is needed, a multi-merge on the substitution trees of all negative literals of the nucleus together with the respective electrons is executed.

    Before diving deep into Unit Resulting Resolution, a tangential improvement should be noted. Often, there are multiple clauses that can be used as nucleus at the current step. However, some result in a faster resolution than others. An efficient selection can be achieved using an ordering of the literals (after unification). The idea of an ordering can be traced back to Slagle and Reynolds in 196744: For instance, clauses containing the conclusion (if looking for a certain theorem to be proven) can be prioritized that way, increasing the speed of resolution27.

    Unit Resulting Resolution

    Unit resulting resolution (UR resolution) works according to the following rule. Given are a set of n unit clauses (i.e. clauses containing only one literal) and another clause with n+1 literals, also called the nucleus, in which each unit clause has a complementary literal (with unification). After resolution with a simultaneous unifier, only one unit clause remains. For example, given the clauses {¬A(z), B(x), C(y)}, {A(x)} and {¬C(z)}, UR resolution resolves to {B(x)}. In contrast to hyper-resolution, the resulting unit clause may be positive or negative, since the multi-literal clause may have positive or negative literals. Now, UR resolution is a good candidate for parallelization, as Aßmann described in his PhD thesis in 19931. As a matter of fact, using multiple processes is very relevant to current developments: The latest CPUs no longer improve in performance through higher clock speeds, but through higher core counts48. Thus, focusing on this area yields favorable runtimes. Additionally, redundant calculations are reduced, for example when multiple big clauses overlap in terms of their literals.
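    The rule can again be sketched in the ground case (illustrative code, with `~` marking negation): n unit clauses are resolved against an (n+1)-literal nucleus, leaving exactly one unit clause:

    ```python
    # Ground sketch of one unit resulting resolution step.

    def ur_resolve(nucleus, units):
        """Remove from the nucleus every literal complemented by a unit
        clause; a unit clause results if exactly one literal is left."""
        def negate(lit):
            return lit[1:] if lit.startswith("~") else "~" + lit
        remaining = set(nucleus)
        for (unit,) in units:          # each unit clause holds a single literal
            remaining.discard(negate(unit))
        return remaining if len(remaining) == 1 else None

    # {~A, B, C} with the units {A} and {~C} resolves to the unit clause {B}:
    print(ur_resolve({"~A", "B", "C"}, [("A",), ("~C",)]))  # -> {'B'}
    ```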

    Aßmann's idea consists of three steps. First, a clause graph is generated. It connects the literals of the nucleus to the complementary unit clauses, while also keeping track of the unifier between both. More specifically, each edge in the graph is enriched by the unifier split into two components: The positive component contains the substitutions that are applied to the positive literal of a complementary pair, the negative component the ones that are applied to the negative literal. Splitting the unifier up is useful for the actual search algorithm. A simple example can be seen in Figure 4. The dashed box represents the nucleus.

    Now, for each clause (or node in the graph) a process is started that executes Algorithm 1. It should be noted that the function `smgu' returns the simultaneous most general unifier. Additionally, the part of the unifier between the nucleus and a unit clause that belongs to the nucleus is called test substitution, while the one belonging to the unit clause is the send substitution. Finally, the core of the algorithm is a two-step update from the nucleus towards the unit clauses. After the nucleus receives the unifier from a given unit clause, all other unit clauses are sent their send substitution modified with the smgu of the currently received substitutions and the test substitution (see line 11). The intuition of this is as follows. The clauses need to be kept updated with currently known substitutions. To do so, the substitution that operates on their respective literal is updated. Lastly, this loop is repeated until a simultaneous most general unifier between all substitutions is found (see line 5).

    Parallel UR Resolution (PURR) by Graf and Meyer33 improves upon Aßmann's clause process by increasing the degree of parallelization even further. Now, each edge between a nucleus literal and a unit clause (instead of just a clause) in the clause graph is assigned a process — the resolution process. Its task can be compared to the innermost loop in Aßmann's clause process. Additionally, substitution trees instead of single substitutions are shared across processes. This enables the use of efficient operations such as the multi-merge operation. Lastly, the terminator process, which runs on a single clause, confirms whether a simultaneous most general unifier has been found. In detail, the resolution process now keeps substitution trees for each literal of the nucleus (except for the one its edge is connected to) cached. Whenever a substitution tree is received, it subsumes19 the cached version of it, i.e. all instances of the substitutions in the cached substitution tree are removed in the new version. The same is done the other way around: The cached version is updated in that instances of a given new substitution are removed. The union of both (so no substitutions are lost) is then merged with all other cached substitution trees, resulting in a substitution tree that contains most general unifiers between the cached substitutions and the new substitution. The unit clause literal is finally updated by applying the lightest substitutions (i.e. substitutions with the smallest number of symbols) from the new substitution tree to its part of the substitution (recall that substitutions are kept split in two on each edge of the graph). Finally, terminator processes check for the desired simultaneous most general unifier of all clauses (similar to line 5 in Algorithm 1): If all collected substitutions for a clause can be merged, such a unifier is found. Subsumption is also applied here.

    It is clear at this point that PURR heavily depends on substitution trees. However, this also enables efficient operations such as subsumption, which aids the merge operation in that unnecessary instances of substitutions are filtered beforehand, improving the performance of PURR.


    The introduction of the resolution principle has had a major impact: It is now a widely used technique in many areas of computer science and related fields. Performance is of particular concern, and work towards improving it has been plentiful: Much attention was devoted to it by Graf in his book Term Indexing. This includes data structures such as discrimination trees, abstraction trees and substitution trees. Additionally, many algorithmic adaptations or extensions of binary resolution have since been proposed, such as hyper-resolution, resolution with equational theory, or unit resulting resolution.

    This post covered much of that. First, the resolution principle was presented. An extension to it, binary resolution with equational theory, was explored: Knowledge about axioms in equational clauses can be used to create specific algorithms for fast resolution. Hyper-resolution, which reduces the number of clauses generated, was demonstrated. And finally, PURR was reviewed: Unit resulting resolution combined with parallelization leads to good resource utilization and thus efficient resolution of unit clauses.

    Still, more areas can be surveyed. General improvements to resolution are possible, such as faster unification algorithms29 14. As for further adaptations of classic binary resolution, there are multiple: Examples include paramodulation39, in which a clause containing an equation can be used as a substitution in another clause, linear resolution26 28 and demodulation52. Lastly, the reader interested in more related work on theorem provers could explore alternative proof procedures for (un)satisfiability, like analytic tableaux45 and its variants.


    1. Ulrich Aßmann. A model for parallel deduction.

    2. Franz Baader and Wolfram Büttner. Unification in commutative idempotent monoids.

    3. Franz Baader and Wayne Snyder. Unification theory.

    4. Matthias Baaz and Alexander Leitsch. Complexity of resolution proofs and function introduction.

    5. Fahiem Bacchus and Jonathan Winter. Effective preprocessing with hyper-resolution and equality reduction.

    6. Jon Barwise. Handbook of mathematical logic, volume 90.

    7. Armin Biere. P{re,i}coSAT@SC'09.

    8. Wolfram Büttner and Helmut Simonis. Embedding boolean expressions into logic programming.

    9. Chin-Liang Chang and Richard Char-Tung Lee. Symbolic logic and mechanical theorem proving.

    10. Alonzo Church. A note on the Entscheidungsproblem.

    11. Martin Davis and Hilary Putnam. A computing procedure for quantification theory.

    12. John K. Dixon. Z-resolution: theorem-proving with compiled axioms.

    13. Elmar Eder. Properties of substitutions and unifications.

    14. Gonzalo Escalada-Imaz and Malik Ghallab. A practically efficient and almost linear unification algorithm.

    15. Jean H. Gallier. Logic for Computer Science: Foundations of Automatic Theorem Proving.

    16. Roman Gershman and Ofer Strichman. Cost-effective hyper-resolution for preprocessing CNF formulas.

    17. Eugene Goldberg and Yakov Novikov. BerkMin: A fast and robust SAT-solver.

    18. Peter Graf. Substitution tree indexing.

    19. Peter Graf and D. Fehrer. Term indexing.

    20. Marijn J. H. Heule, Matti Järvisalo, and Armin Biere. Revisiting hyper binary resolution.

    21. Robert Kowalski. Predicate logic as programming language.

    22. D. S. Lankford, G. Butler, and B. Brady. Abelian group unification algorithms for elementary terms.

    23. Theodor Lettmann. Aussagenlogik: Deduktion und Algorithmen.

    24. Tomer Libal and Alexander Steen. Towards a substitution tree based index for higher-order resolution theorem provers.

    25. John W. Lloyd. Foundations of logic programming.

    26. Donald W. Loveland. A linear format for resolution.

    27. Donald W. Loveland. Automated Theorem Proving: a logical basis.

    28. David Luckham. Refinement theorems in resolution theory.

    29. Alberto Martelli and Ugo Montanari. An efficient unification algorithm.

    30. Yuri V. Matiyasevich and Jens Erik Fenstad. Hilbert's tenth problem, volume 105.

    31. Yuri Vladimirovich Matiyasevich. The diophantineness of enumerable sets.

    32. Bernard Meltzer. Theorem-proving for computers: some results on resolution and renaming.

    33. Christoph Meyer. Parallel unit resulting resolution.

    34. Hans Jürgen Ohlbach. Compilation of recursive two-literal clauses into unification algorithms.

    35. Duc Nghia Pham. Modelling and Exploiting Structures in Solving Propositional Satisfiability Problems.

    36. Alain Pirotte. Automatic theorem proving based on resolution.

    37. Gordon Plotkin. Building-in equational theories.

    38. Alexandre Riazanov and Andrei Voronkov. Vampire.

    39. G. Robinson and L. Wos. Paramodulation and theorem-proving in first-order theories with equality.

    40. John Alan Robinson. Automatic deduction with hyper-resolution.

    41. John Alan Robinson. A machine-oriented logic based on the resolution principle.

    42. Manfred Schmidt-Schauß. A decision algorithm for distributive unification.

    43. Jörg H. Siekmann. Unification of commutative terms.

    44. James R. Slagle. Automatic theorem proving with renamable and semantic resolution.

    45. Raymond M. Smullyan. First-order logic.

    46. Mate Soos. The CryptoMiniSat 5 set of solvers at SAT Competition 2016.

    47. Mark E. Stickel. A complete unification algorithm for associative-commutative functions.

    48. Thomas N. Theis and H.-S. Philip Wong. The end of Moore's law: A new beginning for information technology.

    49. Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem.

    50. Max Wisniewski, Alexander Steen, and Christoph Benzmüller. The Leo-III project.

    51. Bruno Woltzenlogel Paleo. Leibniz's Characteristica Universalis and Calculus Ratiocinator Today.

    52. L. T. Wos, George A. Robinson, Daniel F. Carson, and Leon Shalla. The concept of demodulation in theorem proving.

    Published on

    # Is this the catalyst?

    It cannot be denied that something weird is going on. A whole lot of things currently don't add up in the financial system. The FED having to maintain a standing repo facility would be one thing. Why are banks not injecting liquidity? Why are they holding off? Covid-19, it should honestly be mostly a non-event. Sure, a correction is warranted. But not at this volatility. Retail (which includes me) is of course as always dumbfounded.

    I think next week will show if this is the real deal. See the picture below (10 day variance of QQQ). Will we reach 2008 levels?

    Published on

    # What is the gist of LOLA?

    LOLA is the shorthand for Learning with Opponent-Learning Awareness. Let's get some intuition.

    First of all, policy gradient, what is that. Well, you repeatedly update the policy in the direction the total expected reward is going when following the policy in a given state s (resulting in some action a). Wew what a sentence !!! So, let's assume our policy is represented by something that can be parameterized, such as a neural net with softmax output (so you have a probability for each action). Well, obviously by watching the total reward go up and down when taking different actions you can adjust the weights of that neural net.
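    The intuition above can be sketched as a toy REINFORCE loop. Everything here (the three-action world, the rewards, the learning rate) is made up for illustration; the point is just the update direction: increase the log-probability of actions in proportion to the reward they produced:

    ```python
    # Toy policy gradient (REINFORCE) with a softmax policy over 3 actions.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(3)                  # one parameter per action (toy policy)

    def policy(theta):
        e = np.exp(theta - theta.max())  # numerically stable softmax
        return e / e.sum()

    lr = 0.1
    for _ in range(200):
        probs = policy(theta)
        a = rng.choice(3, p=probs)           # sample an action from the policy
        reward = [0.0, 1.0, 0.0][a]          # action 1 is best in this toy world
        grad_log = -probs
        grad_log[a] += 1.0                   # gradient of log softmax(theta)[a]
        theta += lr * reward * grad_log      # REINFORCE update

    print(policy(theta).argmax())  # -> 1 (the policy learns to prefer action 1)
    ```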

    In LOLA, when you update your policy gradient, you don't just take into account the world as is (statically). In a static scenario your value function returns the expected value given two parameters (one for each agent), and you simply update your own parameter. But the other parameter is static: the other agent is treated as inactive. In a dynamic scenario you take into account how you would change the update step of the other agent. You suddenly notice what the other agent is learning based on what you're doing. So the total reward now changes based on your own actions, but also on what effect they have on the other agent's learning.

    Check out the official video at roughly 12:05. The LOLA learning rule is defined as:

    (our agent's parameter update) + (how our agent changes the learning step of the other agent) * (how much our agent's return depends on the other agent's learning step)

    This post is a WIP. As I get more intuition I will update this post.

    Published on

    # Policy Gradient Baseline - The Intuition

    If you are in a bad state, the agent should still try and do the best action given the circumstances. If you don't use baseline, you will get a pretty bad reward just because you're in a bad state. However, we want to reward good actions in bad states, still.

    Example: Let's take V(s) = E_π[G_t | s] for state s as our baseline b. Basically you take the mean return over all possible actions at s. You would expect the return of your action to be slightly better or worse than b. So if V(s) = 5 and the reward of our action = 4: 4 - 5 = -1. If V(s) = -5 and the reward of our action = -6: -6 - (-5) = -1. So these are two actions that give wildly different returns as is (4 vs -6), but in the context of their situation they are only a bit bad (-1). Without a baseline the second action would be extremely bad (-6) even though given the context it is only slightly bad (-1).
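    The numeric example can be reproduced in a couple of lines (a minimal sketch; `advantage` is just an illustrative name for return minus baseline):

    ```python
    # Subtracting the state value V(s) as a baseline centers returns, so
    # actions are judged relative to their situation, not by raw return.

    def advantage(ret, v_s):
        return ret - v_s

    # Good state (V=5), action returned 4; bad state (V=-5), action returned -6:
    print(advantage(4, 5))    # -> -1
    print(advantage(-6, -5))  # -> -1 (same relative quality despite 4 vs -6)
    ```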

    Essentially this is like centering our data.

    By the way, V(s) can come from a neural network that has learnt it.

    Published on

    # MiFID II - Contact your broker

    MiFID II gets my pulse going. Still, this does not mean that you can only trade UCITS ETFs from now on. Recently I've successfully enabled a non-UCITS ETF for trading for myself, specifically QQQ3, by contacting my broker and telling them that the KID for QQQ3 is available online (since September 2019 apparently). After a few emails back and forth it seems they are now in the process of enabling access to it.

    A note on MiFID II. Why the fuck are ETFs not allowed, but derivatives and futures are all good? Political bullshit. The people benefitting from this are most likely a minority.

    A note on QQQ3. I am starting an experiment with this. I know this could be FOMO, but still. I will start building a position in it with a minor percentage of my account. The thing I always need, and this might be something psychological, is a position in something. Else I won't track it. If money is on the line, it gets my attention. And even with a drawdown of like 90% (on QQQ3), I will still have the motivation to fix the position. And this is important for the next recession.

    Bigger backtests incl. optimum leverage: http://ddnum.com/articles/leveragedETFs.php

    Published on

    # FFT of a line is another line

    The fast Fourier transform of a vertical line is a horizontal line. WTF?

    One interpretation of the FFT is summing up many sine curves, i.e. for a given function there exists some infinite set of sine curves that, when added up, can represent an approximation of the function. The function that we want to represent in the picture below1 is a popular example - the square wave.

    Another point to understand: The power of the frequency tells us the weighting of one sine curve. High power? That sine curve is important.

    Now, back to the vertical line. The FFT is telling me: Sum up the sine curves of ALL frequencies there are. And that shall result in a vertical line. It needs ALL frequencies to represent that line, and at the same power. But why?!

    And wait, those are sine curves.. Should we not get some kind of residuals around the vertical line, resulting in something similar to the shape of Mt. Fuji? I can immediately tell you - this is completely wrong. You need to think in an infinite sense. Assume we want to start creating such a vertical line.

    The brown curve on top is the sum of all sine curves. As you can see (on the right side I simply shifted the curves a bit on the y axis for easier comprehension), a line appears! The trick here is that different sine curves cancel each other out. But coincidentally, there is a point where all curves do the same thing: They are at their local maximum. This is where our vertical line forms. By the way, don't mind the y axis, the wave can be normalized to 0 back from 20 easily. This is the basic idea and can be followed into infinity.

    Now, to actually get a straight vertical line, apparently all frequencies have exactly the same power, which even depends on the height of the line. If its height is 1, the line in the frequency domain lies at y = 1. For 2, it is 2. And so on. I have not been able to understand intuitively yet why this is the case. However, in the following, the mathematical proof shows that this is indeed always the case for any given vertical line.

    First a definition of the DFT is needed:

    X_k = \sum_{n = 0}^{N - 1}x_n * e^{\frac{-i2\pi}{N}kn}

    Assume the time domain data is as follows: [1, 0, ..., 0]. It is immediately clear that only the first summand is non-zero, since the rest is multiplied with x_n = 0. Additionally, for the first summand, since n = 0 and x_0 = 1, it holds that 1 * e^0 = 1 and thus X_k is 1. This holds for all k.

    This works analogously for all multiples, such as [9, 0, ..., 0]: 9 * e^0 = 9.

    >>> import numpy as np
    >>> np.fft.fft([9, 0, 0, 0, 0, 0, 0, 0])
    array([9.+0.j, 9.+0.j, 9.+0.j, 9.+0.j, 9.+0.j, 9.+0.j, 9.+0.j, 9.+0.j])

    Further references:




    Published on

    # The product rule in RNNs

    I find that the product rule is always forgotten in popular blog posts (see 1 and 2) discussing RNNs and backpropagation through time (BPTT). It is clear what is happening in those posts, but WHY exactly, in a mathematical sense, does the last output depend on all previous states? For this, let us look at the product rule3.

    Consider the following unrolled RNN.

    Assume the following:

    h_t = \sigma(W * h_{t-1} + Ux_t)

    y_t = \mathrm{softmax}(V * h_t)

    Using a mix of Leibniz's and Lagrange's notation, I now derive:

    \frac{\partial h_3}{\partial W} = \frac{\partial \sigma(Wh_2 + Ux_3)}{\partial W} =

    \sigma' * [Wh_2 + Ux_3]' = // Chain rule

    \sigma' * [Wh_2]' =

    \sigma' * [W * \sigma(Wh_1 + Ux_2)]' =

    \sigma' * (h_2 + W * h_2') = // Product rule

    \sigma' * (h_2 + W * \sigma' * [Wh_1 + Ux_2]') =

    \sigma' * (h_2 + W * \sigma' * (h_1 + W * \sigma' * (h_0 + W * h_0'))) =

    \sigma_{h_3}' * h_2 + \sigma_{h_3}' * W * \sigma_{h_2}' * h_1 + \sigma_{h_3}' * W * \sigma_{h_2}' * W * \sigma_{h_1}' * h_0 + \sigma_{h_3}' * W * \sigma_{h_2}' * W * \sigma_{h_1}' * W * h_0'

    The chain rule is applied from line 1 to 2, the product rule from line 4 to 5. Line 3 follows because Ux_3 does not contain W (the variable we are differentiating with respect to). Now it can be immediately seen that each summand of the last result reaches further and further into the past.
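    To convince yourself numerically, here is a scalar sketch (my own construction, with tanh as sigma) comparing the recursive gradient from the derivation against a finite-difference estimate:

```python
import numpy as np

def forward(w, u, xs, h0):
    # h_t = tanh(w * h_{t-1} + u * x_t); returns all hidden states
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def grad_w(w, u, xs, h0):
    # Recursive chain + product rule:
    # dh_t/dw = sigma'(a_t) * (h_{t-1} + w * dh_{t-1}/dw)
    hs = forward(w, u, xs, h0)
    dh = 0.0  # h_0 does not depend on w
    for t, x in enumerate(xs):
        a = w * hs[t] + u * x
        dh = (1 - np.tanh(a) ** 2) * (hs[t] + w * dh)
    return dh

w, u, h0 = 0.5, 0.3, 0.1
xs = [1.0, -0.5, 2.0]
eps = 1e-6
numeric = (forward(w + eps, u, xs, h0)[-1]
           - forward(w - eps, u, xs, h0)[-1]) / (2 * eps)
print(abs(numeric - grad_w(w, u, xs, h0)))  # tiny: both gradients agree
```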

    Lastly, since this assumes the reader is familiar with the topic, a really nice further explanation of BPTT for the interested reader can be found here.

    Published on

    # Dissecting the wave algorithm for detecting FBC downstream impairments

    A more recent and editorialized version of this post can be found here.

    The wave impairment is one of many impairments that can be detected on a FBC downstream spectrum. In the following, we take a cursory look at an implementation of the wave impairment detection algorithm.

    To start trying to fit a wave onto a spectrum, the data has to be stitched together. Only the channels actually carry data (these are the long, upside-down U-shapes in the spectrum). The black line in the following picture shows how the algorithm stitches those channels together.

    Based on a smoothed version of the resulting data (using rolling averages) an inverse FFT (IFFT) is calculated. This is the meat of the algorithm.

    Now, those huge impulses (denoted by a)) that appear in steps of 160 in the chart are the U-shapes. The algorithm guesses that harmonic based on the channel width and the spectrum width. The impulses are supposed to decrease with higher frequency. If they don't, i.e. one of the impulses is bigger than the previous one, a wave impulse is detected. This covers waves that by coincidence fall on exactly the same frequency as the U-shapes. If that were the case (unfortunately I do not have an example spectrum), the U-shape would continue where there are no channels (just noise), though with less power (the height of the shape would be much smaller). These are called harmonic impulses. Non-harmonic impulses, see b), are the ones on a different frequency than the guessed one. Most of the time, from what I've seen, they appear before 160. A significant impulse is one that has at least 45% of the height of the last harmonic impulse. Now, all non-harmonic impulses that are higher than the last significant harmonic impulse are considered wave impulses.
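    As a hedged sketch, the classification just described could look roughly like this. The function name and the exact thresholds are my own illustrative choices; the production implementation surely differs in detail.

```python
import numpy as np

def classify_impulses(ifft_mag, step=160, significant_ratio=0.45):
    """Split IFFT magnitude peaks into harmonic impulses (at multiples of
    step) and non-harmonic ones, then flag wave impulses as described."""
    n = len(ifft_mag)
    harmonic_idx = list(range(step, n, step))
    harmonics = [ifft_mag[i] for i in harmonic_idx]
    # Harmonic impulses should decrease with frequency; a rise hints at a
    # wave sitting on the same frequency as the U-shapes.
    harmonic_wave = any(b > a for a, b in zip(harmonics, harmonics[1:]))
    # Significant impulses have at least significant_ratio of the height
    # of the last harmonic impulse; non-harmonic peaks above the last
    # significant harmonic count as wave impulses.
    significant = [h for h in harmonics
                   if h >= significant_ratio * harmonics[-1]]
    threshold = significant[-1]
    wave_impulses = [i for i in range(1, n)
                     if i not in harmonic_idx and ifft_mag[i] > threshold]
    return harmonic_wave, wave_impulses
```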

    A few additional constraints were added. First, a definition: The main harmonic is the first (and highest/most powerful) harmonic impulse. Now, the following three conditionals deal with false positives:

    • If more than 75% of all channels are flat, the candidate wave is ignored

    • Low frequency waves are ignored (below index 5, so a few hundred kHz)

    • Low power waves are ignored (if the power is less than 70% of the main harmonic)

    Future work: Describe how to calculate the location (in metres) of the issue in the cable. Yes, that's possible!

    Published on

    # Algorithmic Design for pattern matching in 2-dimensional data

    You have some data, you need to find a pattern in it. The pattern is easy, your trained eye immediately finds it - so it seems. Getting to that 0% error rate in reality however is very tough, since there are always edge cases. So when you see the following pattern, what kind of conditionals do you think of first?

    You're using Python with its great scientific ecosystem. You have many tools at your disposal; numpy and scipy give you a simple derivative in one line. What do?

    For reference, this is a Resonant Peak impairment in a FBC downstream spectrum. A normal scan would not have that peak in the middle-right, but would simply be flat. Taking a look at the official DOCSIS PNM best practices guide, causes for a Resonant Peak on a cable line could be: 'defective components, cold solder joints, loose modules (or module covers), loose or missing screws.'

    Anyways, here's my methodology. Before we dig deep, what is meant by a conditional? In above example, one such conditional could be the flatness of the channel in which the resonant peak has its peak. Basically the range around the peak. There are cases of very flat 'resonant peaks' which are not resonant peaks. When the algorithm was still much too general, the following was getting classified as a resonant peak.

    Let's fit a small linear regression in that range and check that the resulting line is flat - the slope should be something very low such as -0.1 or 0.2. The fitting error of the linear regression should be small too - we ought to be sure that the line represents the range. If it does not and the error is huge, maybe the channel is not flat at all. Maybe the linear regression got distorted by outliers that cannot be fixed by L2 regularization. Lastly, whenever I talk about separating, I mean separating a correct detection of a pattern (true positive) from a wrong detection of any kind (false positive, false negative). A good separation should correctly return whether a sample contains a pattern (or some part of it) or not. With this out of the way, onto the methods.
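    Such a flatness check could look like this - a minimal sketch, with illustrative thresholds rather than the production values:

```python
import numpy as np

def is_flat(amplitudes, max_slope=0.2, max_err=0.3):
    # Fit a line over the range around the peak; the range is flat if the
    # slope is near zero AND the fitting error is small (so the line
    # actually represents the data).
    x = np.arange(len(amplitudes))
    slope, intercept = np.polyfit(x, amplitudes, 1)
    err = np.mean((slope * x + intercept - np.asarray(amplitudes)) ** 2)
    return abs(slope) <= max_slope and err <= max_err
```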

    1. Create unittests. Find 'case studies' of interesting samples in your data that show off normal cases and edge cases. Do not forget to include important true negatives - scans that do not contain a pattern, but might (if your conditionals are not specific enough). Those true negatives come up while testing, when you find that your algorithm is too general and returns false positives. After fixing the false positives they turn into true negatives.
    2. Your conditionals need to be as specific as possible, but also as general as possible. This is a contradiction, but bear with me. There are many possibilities: you can use derivatives, averages, percentiles, fitting to curves, anything your heart desires. But which one is best? Use the one that separates best. Often I see that an idea, like making sure the range around the resonant peak in the above example is not flat, just does not separate well enough. There are true positives with flatness 1 and error 1, and then there are false positives with flatness 1.1 and error 1 up to flatness 8. See what I mean? It is too close. Since I only see a small subset of the data, I deem this too risky.
    3. The conditionals do not need to be perfect. It is good if you can simply reduce false positives. For example, the flatness above can be made very strict. This gets rid of half of the false positives, which is good enough. Maybe this particular conditional simply does not separate perfectly between true positive and true negative? It is fine. Of course, if you come up with a better conditional, you can get rid of the earlier, worse one. Do not forget that. With that, on to the last point.
    4. Keep it as simple as possible. Get rid of conditionals that once worked but got replaced by better ones. There is more than just the error rate, such as efficiency - an algorithm that is too slow can only see limited use.

    Here's a few additional notes on each point. Regarding point 1: This can also be compared to labelling data and machine learning, but on an extremely small scale. Each unittest is a label - I looked at the sample manually and detected whether it contains my pattern. Using initial unittests I compose a few conditionals with hardcoded constants. Afterwards, when I see a wrong detection, I adjust my conditionals. Regarding point 2: A conditional that separates very well was the flatness one for the wave impairment. On the left side we see a correct wave. On the right side - a false positive. The algorithm sees the black line (since channel blocks are stitched together) and in the FFT this is apparently good enough to count as a wave. But the majority of channels are just flat. Thus, checking for flatness across the spectrum gives us another good conditional.

    Another good separator was the new filter algorithm. When fitting a tanh, the parameters of the tanh that represented a filter were wildly different from the ones without a filter. A perfect candidate for a conditional that separates very well. Lastly, using percentiles rather than averages with noisy data is probably a better idea. It separates better!

    Update: Checking the channels at the edges of the resonant peak too and making the flatness check much more strict (flatness < .6 and err < .3) makes this a much better separator on the given unittests. There is a lot of difference between true negatives (usually at least one channel at flatness < .4) and true positives (usually all channels above 1, often above 2). This is what we want: A lot of room for error.

    Published on

    # FBC Filter Algorithm

    A more recent and editorialized version of this post can be found here.

    In the course of my work I had to detect filters in FBC (Full Band Capture) downstream spectrum scans. These spectrum scans may contain impairments which need to be detected. Usually, the operator implements a filter on purpose, to simply disable part of the spectrum. Example: Maybe there is a known impairment in the neighborhood that occurs whenever the cable is used close to its full capacity. Filtering part of the spectrum decreases the maximum possible bandwidth, but fixes the issue. Now, a filter is not exactly an impairment. Still, detecting it is useful. Since the shape is very pronounced, it is also not too difficult to achieve. Thus, let's take a look.

    In the following the algorithm for detecting filters is presented. Then, it is shown that using a tanh is appropriate for representing a filter. Finally, an idea for future work is noted. The filter algorithm works as follows. First, a definition of tanh is needed. Since tanh is a hyperbolic function, the usual transformation rules for shifting and scaling apply.

    import numpy as np

    def tanh(x, a, b, c, d):
        """A tanh. Parameters:
        a: Amplitude. Height.
        b: Phase/horizontal shift.
        c: Period. How flat/straight up is the tanh?
        d: Vertical shift.
        """
        return a * (1 + np.tanh((x - b) / c)) + d

    Now, using scipy's curve_fit and supplying a few carefully crafted initial parameters for a, b, c and d (simply done by fitting a curve on a few scans with known filter impairments and using the average as the initial guess), a tanh can be fitted on a given sample. Then, the resulting tanh (its parameters) and the fitting error can be compared to a hard coded constant. In this case, this would be the initial guess extended by a range; e.g. the parameter c could then be in the range 80 to 150 to be a tanh that represents a filter impairment.
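    A hedged sketch of that fitting-and-range-check step, using scipy's curve_fit as in the post (the helper names are mine):

```python
import numpy as np
from scipy.optimize import curve_fit

def tanh_model(x, a, b, c, d):
    return a * (1 + np.tanh((x - b) / c)) + d

def looks_like_filter(x, y, guess, max_dev):
    # Fit a tanh starting from the guessed parameters, then require every
    # fitted parameter to stay within the allowed deviation of the guess.
    params, _ = curve_fit(tanh_model, x, y, p0=guess)
    return bool(np.all(np.abs(np.asarray(params) - np.asarray(guess))
                       <= max_dev))
```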

    In this example, the following parameters and ranges make sense:

    # The parameters a, b, c, d for changing the shape of a tanh function.
    # See the tanh function above for detailed information on each parameter.
    # These magic numbers were found empirically. Refer to the unit tests.
    GUESSED_PARAMS = [23., 367., 128., -61.]
    # Allowed deviation in the parameters of the tanh function and the guessed
    # initial one, in both directions. For example, parameter a is in the range
    # [17,29].
    MAX_DEVIATION_PER_PARAM = np.array([6, 80, 70, 20])

    To validate this algorithm, a clustering of 200000 scans was done using sklearn's MiniBatchKMeans, partially fitting 10000 samples at a time. This results in roughly 2GB RAM usage, compared to over 64GB (unknown how much was actually needed, as it crashed) with vanilla KMeans.
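    The partial-fit loop is roughly this sketch (sizes shrunk for illustration; the random data is a stand-in for the real scans):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
scans = rng.random((1000, 50))  # stand-in for the 200000 spectrum scans

kmeans = MiniBatchKMeans(n_clusters=12, random_state=0, n_init=3)
for batch in np.array_split(scans, 10):  # 10000-sample batches in the post
    kmeans.partial_fit(batch)            # memory stays roughly constant
labels = kmeans.predict(scans)
```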

    Assuming 12 clusters (more ideas were tried – 12 seems roughly the best) the following cluster center for filters can be observed:

    A small dataset of 208 samples was drawn from scans that the given MiniBatchKmeans model predicts to be in the above cluster. This dataset was used for adapting the algorithm (to support steeper filters) and added as a unittest.

    Drilling down, it can be seen that most filters have this form, which can be represented by a tanh.

    Unfortunately it seems the above cluster still contains a few false positives – a few samples are not filters at all. To make this exact, the 12 clusters would need adjusting. This could be future work. However, for the purpose of creating a small dataset of filters, this is irrelevant.

    Published on

    # A note on DBSCAN for 2-dimensional data

    This is a short note on the parameters of DBSCAN for 2-dimensional data, e.g. FBC spectrum scans.

    The data is a vector of amplitudes (in dB) of length 8704. DBSCAN needs to find neighbors for each point and thus needs a distance metric between two vectors.

    The default metric used by scikit-learn's DBSCAN is euclidean, i.e. each point q_i of vector 1 is compared to point p_i of vector 2 in the following manner:

    \sqrt{\sum_{i = 1}^{n}(q_i - p_i)^2}

    The eps parameter of DBSCAN is exactly that value above (when using the euclidean metric). Now, to find a sensible eps, a simple rough calculation shows, given that 5 dB is the maximum distance allowed between two points to be neighbors: (8704 * 5**2) ** .5 ~= 466. This is a first candidate for a good eps value.
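    In code, with the eps from the back-of-the-envelope above and two synthetic groups of "scans" (my own toy data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# At most 5 dB distance per point, over vectors of length 8704.
eps = (8704 * 5 ** 2) ** 0.5  # ~466

# Two groups of identical vectors, far apart in euclidean distance.
X = np.vstack([np.zeros((3, 8704)), np.full((3, 8704), 100.0)])
labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
print(len(set(labels)))  # 2 clusters, no noise points
```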

    Empirically it can be seen that any value between 100 and 2000 works fine for vectors of length 8704.

    As for the second important parameter, min_samples, the following can be said. First, a core point needs to be defined. This is a point which has more than min_samples neighbors in its range (restricted by eps). The important thing to remember here is that only the neighbors of core points are considered in each iteration as candidates for expansion of a cluster. So if min_samples is huge, the clusters will be smaller and more numerous and vice versa. In conclusion, this parameter is the one to tune in terms of the resulting number of clusters. It can be compared to the parameter k of kmeans.

    Published on

    # There is no recession coming

    I just want to be contrarian. Fearmongering all across media1, yield curve and foreign investment, trade war, Germany close to a recession2, corporate debt37. Both sides have valid points. But mostly I see caution and restraint all around. Where is the irrational exuberance4?

    The good thing is, I am not that heavily invested right now. So it is really exciting to be part of history and follow the markets, but with not that much risk.

    I am extremely interested in what the black swan event will be this time.

    Since yields are down everywhere5, which is unprecedented, pension funds need to get their yield from somewhere. Maybe this is where irrational exuberance may come from? But this would probably be years away..

    Published on

    # Meta Tasks

    It's funny, I've started doing some kind of 'meta' tasks: If I recognize that I'm procrastinating on a task, I write down: 'Find subtask for $bigtask'. And then some time later I come up with a subtask and write it down: 'Do $subtask'. And that's it: after a while I do the subtask (since it's usually small) and effectively get started on the bigger task. And when you've started, you've already done the grunt work of defeating procrastination.

    It's important to put in thought while creating todo points. If you invest 5 minutes in thinking about what subtasks you write down, it'll potentially save you hours in the long run.

    Published on

    # The Elephant In The Brain

    This is a small summary of part 2 of the book "The Elephant in the Brain".

    Body language: We are blind to our body language because it betrays our selfish, competitive nature

    Laughter: Signal "I feel safe" while playing games (also comedians play: they violate norms and talk about taboos but do this in playful manner)

    Conversation: Signal how knowledgeable one is, for mating and alliance purposes

    Consumption: Signalling to others, mating purposes

    Art: Signalling that resources can be wasted on art, mating and alliance purposes

    Charity: Signalling, alliance purposes (as friend one is more likely to help if very charitable)

    Education: Signalling, just get the certificate

    Medicine: Often just to show loyalty to the patient (e.g. a neighbor brings a selfmade cake, not a store-bought one, to show care), signals to others how much political support one has (in an evolutionary context, e.g. 1 million years ago), supporters hope for return support if they ever get sick

    Religion: Community, costly rituals and giving up resources to show that one's committed so one becomes trustworthy and part of community, in community best chance of survival

    Politics: Voters vote for their group, not for oneself, they signal loyalty to their group (to various degree)

    Published on

    # PCA by example with exercises and solutions

    The topic is PCA - principal component analysis. We'll look at an example and have a few exercises + solutions at the end of the post.


    If we want to reduce the dimension of data (e.g. from 2D to 1D) due to space constraints, easier calculations or visualization, we want to lose as little information about our data as possible.

    Take a look at the charts below. We will project (the cyan lines) our data (the cyan points) onto the blue line, i.e. we will reduce the 2D coordinate system to the 1-dimensional line. Now which projection explains our data better? After the projection, on chart A we retain the information that the two pairs are far away from each other. On chart B we do not. Another way of reasoning about this is to say: The variance of the distances (black arrows) is higher on chart A. So the higher the variance, the better our projection explains the data.

    Let's define a few things.

    • A principal component (PC) is an axis. In our example: The blue line is our axis.

    • The blue line in chart A is PC1, the one in chart B is PC2. PC1 is the best principal component in terms of data explanation.

    • PCA maximizes the sum of squared distances of the projected points to the origin to find the PC with highest variance.

    • That sum is the eigenvalue of the PC. Variance of a PC: Eigenvalue / (n-1). Thus the highest eigenvalue equals the best PC.

    • The normalized vector on the PC line is the eigenvector, representing the direction our PC is going.

    So in conclusion: The eigenvector with the largest eigenvalue is the direction along which our data has maximum variance, i.e. the data is maximally explained.

    If we have two eigenvalues 6 and 2, then the PC for eigenvalue 6 explains 6/(6+2) = 75%. (because eigenvalues are connected to variance)

    In our example: PC2 is perpendicular to PC1. In a 3D example: PC3 would be perpendicular to the plane of PC1 and PC2.

    Steps of PCA

    1. Mean normalization
    2. Compute Covariance matrix
    3. Compute eigenvectors/values with said matrix
    4. Select top k eigenvalues and their eigenvectors
    5. Create orthogonal base with the eigenvectors
    6. Transform data by multiplying with said base


    1. Calculate covariance.
            a = A - 11*A / n    (11: the n×n all-ones matrix, so 11*A/n holds the column means)
            Cov(A) = a'a / n
    2. Solve det(M-Iλ) = 0.
    3. For both λ₁ and λ₂ solve:
        M * [ x   = λ₁ * [ x
              y ]          y ]
    4. Put x=1 and convert to unit vector: (x**2 + y**2)**.5 = 1.
    5. Orthogonal base consists of both eigenvectors side by side:
        [ x₁ x₂
          y₁ y₂ ]
    6. Take first eigenvector and apply to a (from step 1).
        a * [ x₁
              y₁ ]
    This is the transformed data.


    Data D:
    [ 1 0
      2 0
      3 0
      5 6
      6 6
      7 6 ]
    1. Cov
    a =
    [ 1 0   [ 4 3   [ -3 -3
      2 0     4 3     -2 -3
      3 0  -  4 3  =  -1 -3
      5 6     4 3      1  3
      6 6     4 3      2  3
      7 6 ]   4 3 ]    3  3 ]
    Cov(D) = M = a'a / n = [ 4.66 6
                             6    9 ]
    2. Solve det(M-Iλ) = 0
    (4.66-λ) * (9-λ) - 6*6 = 0
    λ₁ = 13.213
    λ₂ = 0.454
    [ 4.66 6   * [ x   = 13.213 * [ x
      6    9 ]     y ]              y ]
    4.66x + 6y = 13.213x
    6x + 9y = 13.213y
    => y = 1.4244x
    x = 1
    y = 1.4244
    Unit vector:
    x₁ = 1 / sqrt(1 + 1.4244**2) = 0.57
    y₁ = 1.4244 / sqrt(1 + 1.4244**2) = 0.82
    Repeat steps 3 and 4 for λ₂ = 0.454. Solution:
    x₂ = 0.82
    y₂ = -0.57
    [ 0.57  0.82
      0.82 -0.57 ]
    [ -3 -3               [ -4.18
      -2 -3                 -3.6
      -1 -3   * [ 0.57   =  -3.03
       1  3       0.82 ]     3.03
       2  3                  3.6
       3  3 ]                4.18 ]
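    The worked example can be checked with numpy (the sign of an eigenvector is arbitrary, so the projected values may come out flipped):

```python
import numpy as np

D = np.array([[1, 0], [2, 0], [3, 0], [5, 6], [6, 6], [7, 6]], dtype=float)
a = D - D.mean(axis=0)                  # step 1: mean normalization
cov = a.T @ a / len(D)                  # population covariance, as above
eigvals, eigvecs = np.linalg.eigh(cov)  # step 3: eigendecomposition
order = np.argsort(eigvals)[::-1]       # step 4: sort by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
projected = a @ eigvecs[:, 0]           # step 6: transform
print(np.round(eigvals, 3))             # 13.213 and 0.454, as by hand
print(np.round(np.abs(projected), 2))   # 4.18, 3.6, 3.03, ... as by hand
```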


    1. a) Consider we conduct a PCA on a two-dimensional data set and we get the eigenvalues 6 and 2. Draw a distribution of sample points that may give rise to this result. Also, draw the two eigenvectors.

    b) Consider 3 data points in a 2-dimensional space R**2: (-1, 1), (0, 0), (1, 1). What's the first principal component of the given dataset?

    If we project the original data points onto the 1-dimensional subspace spanned by the principal component you choose, what are their coordinates in this subspace? What is the variance of the projected data?

    For the projected data you just obtained above, now if we represent them in the original 2-dimensional space and consider them as the reconstruction of the original data points, what is the reconstruction error (squared)? Compute the reconstruction of the points.

    2. a) Name four steps for performing a PCA.

    b) Suppose we perform PCA on a two-dimensional dataset and it yields two eigenvalues which are equal. What does it mean regarding the importance of the dimension? Would pursuing a dimensionality reduction be a good choice? Please explain. Sketch a dataset where its two eigenvalues would have the same size.

    c) Given the data points below, would PCA be capable of identifying the two lines? Sketch the principal axes. (please view with monospace font) i)

    x           x
      x       x
        x   x
        x   x
      x       x
    x           x


    x                   x


    1. a) Points: [-1, .5**.5], [-1, 0], [-1, -(.5**.5)], [1, 0], [1, .5**.5], [1, -(.5**.5)]

    >>> pca = PCA(n_components=2)
    >>> pca.fit(np.array([[-1, .5**.5], [-1, 0], [-1, -(.5**.5)], [1, 0], [1, .5**.5], [1, -(.5**.5)]]))
    PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
      svd_solver='auto', tol=0.0, whiten=False)
    >>> pca.singular_values_
    array([2.44948974, 1.41421356])

    How? I divided the eigenvalue 6 into 6 points (because 4 points would've been weird to calculate). And set their distance to 1 from the proposed principal component line, i.e. they're at x = -1 and x = 1. This way their SS is 1+1+1+1+1+1 = 6. Now to get the eigenvalue 2, we need to set the y values to sqrt(.5). Why? Because two points are ON the second PC, so we only care about four points. So to get eigenvalue 2: sqrt(.5)**2 = 0.5. And 0.5 * 4 = 2.

    b) It's the x-axis, i.e. the line f(x) = 0. Their coordinates: [-1, 0], [0, 0], [1, 0]. Variance: 1. Reconstruction error: 2.

    2. a) Normalize the input. Calculate the covariance matrix. Calculate the eigenvectors of said matrix. The eigenvector with the greatest eigenvalue is the first principal component.

    b) Both dimensions are equally important. Pursuing dimensionality reduction would be bad because we lose a lot of data/the data becomes very skewed. We lose 50% explanation of the data. Dataset: [0, 0], [1, 0], [0, 1], [1, 1]

    c) If we fit the PCs to the apparent 'lines', one line will contribute ~0 variance to the total variance of the PC. Now if we set the PCs as horizontal/vertical axes, both lines will contribute, maximizing the total variance. So PCA will not find the apparent 'lines'. Also, both principal components will have the same importance.

    Published on

    # Twitter Snowflake Format

    Following scenario: We have 4 workers and they want to generate unique ids. Usually you'd need a central server that they get the ids from. That central server would, for example, just use the current time in ms plus a counter if we get multiple messages in the same ms. With snowflake format, we additionally add a worker id.

    So let's say each worker has same time in ms. Due to the additional worker id, we already have 4 unique ids! If time in ms is different, we already have unique ids no matter what. And if we get multiple messages at the same time, we have our additional counter. If that counter runs out (it's usually rather small, like 4096), we simply wait 1 ms. But think about it, 4000 messages per millisecond?
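    A hedged sketch of such an id generator; the 41/10/12 bit split (timestamp/worker/counter) and the epoch are the commonly cited Twitter values, while the class itself is my own illustration:

```python
import time

class SnowflakeGenerator:
    def __init__(self, worker_id, epoch_ms=1288834974657):
        assert 0 <= worker_id < 1024  # worker id must fit in 10 bits
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self.last_ms = -1
        self.counter = 0

    def next_id(self):
        now = int(time.time() * 1000)
        if now == self.last_ms:
            self.counter = (self.counter + 1) % 4096  # 12-bit counter
            if self.counter == 0:
                # Counter exhausted within this millisecond: wait 1 ms.
                while now <= self.last_ms:
                    now = int(time.time() * 1000)
        else:
            self.counter = 0
        self.last_ms = now
        return ((now - self.epoch_ms) << 22) \
            | (self.worker_id << 12) | self.counter
```

    Two workers generating an id in the very same millisecond still produce distinct ids, because the worker bits differ.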

    So there you have it, distributed unique ids!

    For a technical breakdown on which bits are used for what in the snowflake format, see the sources below.


    Published on