Below a list of 48 optimization problems and a reference to the QUBO
formulation of each problem is shown. While a lot of these problems are
classical optimization problems from mathematics (also mostly NP-hard
problems), there are interestingly already formulations for Machine Learning,
such as the L1 norm or linear regression.
Graph based optimization problems often encode the graph structure (adjacency
matrix) in the QUBO, while others, such as Number Partitioning or Quadratic
Assignment encode the problem matrices s.t. a non-linear system of equations
can be found in the QUBO.
The quadratic unconstrained binary optimization (QUBO) problem itself is a
NP-hard optimization problem that is solved by finding the ground state of the
corresponding Hamiltonian on a Quantum Annealer. The ground state if found by
adiabatically evolving from an initial Hamiltonian with a well-known ground
state to the problem Hamiltonian.
This is by no means the definite list of all QUBO formulations out there.
This list will grow over time.
For 14 of these problems the QUBO formulation is implemented using Python
(including unittests) in the
QUBO-NN
project. The Github can be found here.
Problem
QUBO formulation
Number Partitioning (NP)
Glover et al.^{2}
Maximum Cut (MC)
Glover et al.^{2}
Minimum Vertex Cover (MVC)
Glover et al.^{2}
Set Packing (SP)
Glover et al.^{2}
Set Partitioning (SPP)
Glover et al.^{2}
Maximum 2-SAT (M2SAT)
Glover et al.^{2}
Maximum 3-SAT (M3SAT)
Dinneen et al.^{9}
Graph Coloring (GC)
Glover et al.^{2}
General 0/1 Programming (G01P)
Glover et al.^{2}
Quadratic Assignment (QA)
Glover et al.^{2}
Quadratic Knapsack (QK)
Glover et al.^{2}
Graph Partitioning
Lucas^{7}
Decisional Clique Problem
Lucas^{7}
Maximum Clique Problem
Chapuis^{19}
Binary Integer Linear Programming
Lucas^{7}
Exact Cover
Lucas^{7}
3SAT
Lucas^{7}
Maximal Independent Set
Djidjev et al.^{8}
Minimal Maximal Matching
Lucas^{7}
Set Cover
Lucas^{7}
Knapsack with Integer Weights
Lucas^{7}
Clique Cover
Lucas^{7}
Job Sequencing Problem
Lucas^{7}
Hamiltonian Cycles Problem
Lucas^{7}
Minimal Spanning Tree
Lucas^{7}
Steiner Trees
Lucas^{7}
Directed Feedback Vertex Set
Lucas^{7}
Undirected Feedback Vertex Set
Lucas^{7}
Feedback Edge Set
Lucas^{7}
Traveling Salesman (TSP)
Lucas^{7}
Traveling Salesman with Time Windows (TSPTW)
Papalitsas et al.^{1}
Graph Isomorphism
Calude et al.^{4}
Subgraph Isomorphism
Calude et al.^{4}
Induced Subgraph
Calude et al.^{4}
Capacitated Vehicle Routing (CVRP)
Irie et al.^{5}
Multi-Depot Capacitated Vehicle Routing (MDCVRP)
Harikrishnakumar et al.^{6}
L1 norm
Yokota et al.^{3}
k-Medoids
Bauckhage1 et al.^{10}
Contact Map Overlap Problem
Oliveira et al.^{11}
Minimum Multicut Problem
Cruz-Santos et al.^{12}
Broadcast Time Problem
Calude et al.^{13}
Maximum Common Subgraph Isomorphism
Huang et al.^{14}
Densest k-subgraph
Calude et al.^{15}
Longest Path Problem
McCollum et al.^{16}
Airport Gateway Assignment
Stollenwerk et al.^{18}
Linear regression
Date et al.^{17}
Support Vector Machine
Date et al.^{17}
k-means clustering
Date et al.^{17}
Note that a few of these references define Ising formulations and not QUBO
formulations. However, these two formulations are very close to each other. As
a matter of fact, converting from Ising to QUBO is as easy as setting
$S_{i} \rightarrow 2x_i - 1$, where $S_i$ is an Ising spin variable and
$x_i$ is a QUBO variable.
We have an extremely simple Number Partitioning problem consisting of the set
of numbers $\{1, 2, 3\}$. It has an optimal solution, which is splitting the
set into two subsets $\{1, 2\}$ and $\{3\}$.
The Ising formulation for Number Partitioning is given as
$A(\sum_{j=1}^{m}n_{j}x_{j})^2$, where $n_j$ is the j-th number in the set.
Thus, with $A=1$, for the given set the result is
$(x_1 + 2x_2 + 3x_3)^2 = x_1x_1 + 4x_1x_2 + 6x_1x_3 + 4x_2x_2 + 12x_2x_3 + 9x_3x_3$.
Now, the Hamiltonian for an Ising formulation is very simple: Replace $x_i$
by a Pauli Z Gate. We can now write the Hamiltonian using QuBits (with an
exemplary QuBit state):
$\langle{}010|Z_1Z_2Z_3|010\rangle{}$
Thus, we can evaluate the cost of a possible set of subsets by using QuBits.
For instance, a state $|010\rangle{}$ means that the numbers 1 and 3 are in
one subset and the number 2 is in another subset.
It should be noted that $\langle{}1|Z_i|1\rangle{} = -1$ and
$\langle{}0|Z_i|0\rangle{} = 1$.
More specifically (where $x_i = Z_i$), the Quantum formulation for our problem is:
In a lot of introductory literature about Quantum Annealing, Adiabatic Quantum
Computing (AQC) and Quantum Annealing (QA) are explained very well, but one
crucial connection seem to always be amiss.. What exactly is evolution,
or adiabatic evolution?
First, the Hamiltonian of a system gives us the total energy of that system. It
basically explains the whole system (what kind of kinetic or potential energy
exists, etc). The ground state of a given Hamiltonian is the state associated
with the lowest possible total energy in the system. Note that finding the
ground state of a system is extremely tough.
A problem such as the Traveling Salesman Problem (TSP) is then converted into a
formulation (QUBO or Ising) for which a Hamiltonian is set up easily.
Quantum Annealing works as per AQC as follows:
Set up the problem Hamiltonian for our problem, s.t. the ground state of the Hamiltonian corresponds to the solution of our problem.
Set up an initial Hamiltonian for which the ground state is easily known and make sure the system is in the ground state.
Adiabatically evolve the Hamiltonian to the problem Hamiltonian.
Adiabatic evolution follows the adiabatic theorem in that the evolution should
not be too fast, else the system could jump into the next excited state above
the ground state, which would result in a sub-optimal solution in the end.
The general formula is as follows:
where $\mathcal{H}_i$ is the initial Hamiltonian and $\mathcal{H}_f$ is the
problem Hamiltonian. The paramater $s(t)$ is modified over time. In the
beginning, the system (or time-dependent) Hamiltonian consists solely of the
initial Hamiltonian. In the end, it consists solely of the problem Hamiltonian
(and is hopefully still in the ground state, if the adiabatic theorem was
followed).
Setting up Hamiltonians consists of translating the TSP instance to an Ising
spin glass problem instance. The Hamiltonian of that problem is very well
known, and by translating to an Ising problem, one automatically encodes the
problem in a way s.t. minimizing the Hamiltonian (i.e. minimizing the total
energy and thus getting into the ground state) automatically minimizes the
Ising spin glass instance and thus the TSP instance.
With the Hamiltonians out of the way, a huge question marks sits with the
adiabatic evolution part. What is going on here? The figure below shall help
build an intuition.
One can see different system Hamiltonians over time and the corresponding
ground state (black dot) of all possible states. The experiment would now start
with the light blue Hamiltonian and over time slowly change to the problem
Hamiltonian (dark purple). Thus, we are modifying the energy landscape over
time! Looking back at the time-dependent Hamiltonian equation further above,
this suddenly makes sense. The time-dependent Hamiltonian changes over time in
that it consists of both the initial Hamiltonian and problem Hamiltonian with
different fractions. Thus, the energy landscape changes over time in that it
first consists solely of the initial Hamiltonian and later more and more of the
problem Hamiltonian.
So the literature is not missing a description of adiabatic evolution after
all. Hopefully though, this visualization helps people out there understand
easier the intuition behind it.
There are quite a few more topics to cover, but this post ends here to keep
things short and concise. More will likely follow.
When working on QUBO-NN,
one of the problem types (specifically: Quadratic Knapsack) was causing issues
in that a good neural network (with a decent $R^2$ coefficient) could not be
trained. One of my ideas was that a possible cause could be the large numbers
that had to be multiplied to get to a solution. Judging from
this stackexchange post,
I was kind of right. In the end, it was not the main cause, but it did
influence the number of nodes in the neural network needed to train decent
models.
I decided to do a small experiment to test this out.
First, and this is also known in the literature
(though the publication is rather old), it gets tougher to train a neural
network to multiply numbers, the larger they become.
The first experiment is set up as follows. The dataset consists of 10000
randomly generated numbers in the range 1 to n, where n is varied across
configurations (50, 200, 500, 2000). The neural network architecture consists
of just one hidden layer of size 20 (and this is fixed). The optimizer of
choice is Adam with a learning rate of 0.005, and the activation function is
ReLU. I always train for 500 epochs and use a batch size of 10. The dataset is
normalized between 0 and 1.
The next figure shows an interesting result.
In the first 300 epochs, one would assume that the model for n=2000 is a
failure with a $R^2$ coefficient below 0.96. The jumps are also extremely
interesting --- each model has its own jump in the $R^2$ coefficient, and the
lower n, the earlier the jump happens.
Future work would include training for thousands of epochs and observing
whether the worst model (with n=2000) still improves further.
A next experiment shows that including more nodes helps immensely (c.f. the
next figure). The worst model (n=2000) is trained on a neural network with
hidden layer size 100 (instead of 20).
The solution proposed in
the previously linked stackexchange post
to use the logarithm works extremely well:
For the worst model (with n=2000), after just 6 epochs the $R^2$ coefficient is at $0.9999999999987522$. Damn.
The same previously cited publication
supports this idea of using the logarithm with similar results.
The idea is simple: Before training, transform the input and target using the
natural logarithm. Afterwards, if one requires the actual numbers, simply
take the exponential of the result.
In summary, training neural networks to multipy large numbers quickly leads to
huge training times (with a large number of epochs required to reach a good
$R^2$ coefficient). The logarithm trick helps, since addition is trivial for
a neural network to learn. Adding more nodes also helps.
The source code for the experiments can be found here.
I've been trying to create a reverse-engineering pipeline for QUBO matrices,
and it has been quite interesting.. Not as easy as one would think. So the
story here is quite simple: Since the current global Quantum Compute is
concentrated across a few companies and institutions, what if these
institutions read your QUBO matrices and infer secret information. For
instance, say you're a hedge fund that quantifies some secret alpha - they
trust you not to steal that and thereby instantly devalue their alpha.
In classical compute (with cloud computing) there are interesting ideas such as
homomorphic computation (using homomorphic encryption). The goal of the project
is to create a pipeline that can classify the problem a given QUBO matrices is
based on and then infer the parameters that the QUBO matrix has been generated
with. Below a schematic overview is given.
I used neural networks for both steps. We do neural network
multi-classification (step a) and neural network regression (step b). Now, the
results are quite promising.
The first step, i.e. classifying, is pretty straightforward. I set up 11
problems and generated 100000 problem instances each. The chart below shows how
after just 40 epochs we are already near 0 in terms of error rate:
The second step is not as straightforward. 5 of the problems turned out to be
trivially reversible. The $R^2$ coefficient comes very close to 1.0:
Maximum Cut needs additional thoughts (the blog post about that can be found here) - specifically, the output needs to be an adjacency matrix and not an edge list.
This is actually very important:
As for the rest, I am actively looking into it. A table with all 11 problems
and their reversibility can be found here.
Some problems such as Quadratic Assignment (QA) I assumed to be non-reversible,
but I was wrong. The QUBO of a QA problem defines an over-determined linear
system of equations, which can be solved by a neural network. However, the best
I got so far is a $R^2$ coefficient of 99%. Not bad, but could be better. The
problem I really struggled with so far is Maximum 2-SAT. I am not 100% sure
yet, but my hunch is that it is simply too complex for a standard fully
connected neural network.
For anyone interested, the Github repository can be found
here.
There are a multitude of approaches in
post-quantum cryptography
that promise security in a world where Quantum Computers are powerful enough to
break RSA using Shor's algorithm. Some interesting ones include isogeny based
ones or lattices, to which Learning with Errors (LWE) belongs.
The contribution of this post is a simple and easily modifiable example on how
asymmetric encryption based on LWE works.
Initially, B creates an over-defined and solvable system of equations such as:
5a + 4b + c = 17 mod 20
3a + 10b + c = 19 mod 20
10a + 5b + 3c = 14 mod 20
with a=2, b=1 and c=3. Those three numbers are the private key of B.
Next, B adds errors to the equation system:
5a + 4b + c = 17+1 mod 20
3a + 10b + c = 19-2 mod 20
10a + 5b + 3c = 14+1 mod 20
and denotes the coefficients and the wrong values the public key:
5 4 1 18
3 10 1 17
10 5 3 15
with modulo 20.
A now wants to encrypt a message using B's public key and send it to B.
She goes bit by bit. We start with the very first bit of the message, which is
for example 1.
For each row of the public key she flips a coin. She sums up the rows come up
with tails. Let's for example assume row 1 and 3 came up. She sums up the
coefficients and values and ends up with:
15a + 9b + 4c = 18+15 mod 20 = 13 mod 20
Depending on the message bit, which in our example is 1, she adds an error:
a large one if the bit is 1
a small one if the bit is 0
In our case we change 13 mod 20 to for example 2 mod 20.
A sends to Bob the coefficients and the wrong value:
15 9 4 2
B now inserts the private key and calculates:
15a + 9b + 4c = 15*2 + 9*1 + 4*3 = 51 mod 20 = 11 mod 20
Since the value transmitted by A is 2 and this is a large error, B assumes the
bit was 1.
Likewise, if the bit would have been 0, A would have for example sent 12 mod 20
instead of 2 mod 20.
The reason this scheme works is that it is very easy to add errors to the
system of equations, but very tough to find the errors when a, b, c are
unknown.
And where is the lattice in this? The public key of B can be seen as a set of
basis vectors in a lattice and the attacker needs to solve the
Closest Vector Problem
to find the good (real) vectors, which is very tough (polynomial runtime
algorithms approximate the solution only within exponential quality).
What a rookie mistake, but a funny one indeed.
So it all started with me seeing what I'd mistakenly called avoidable bias.
The train error rate bottoms out at roughly 28, which is NOT ~0 and definitely
not satisfying. This looks like underfitting and a big chunk of avoidable
bias, at first. As per theory, one should increase the number of parameters to
deal with underfitting, so that's exactly what I did. One thing to keep in
mind - statistical error (or variance), i.e. the difference between train and
test errors increase proportionally with the number of parameters. So if we
overshoot, we overfit badly. In that case, increasing the number of data points
might help.
With all this in mind, I just went ahead. Now, what I kind of forgot is how
quickly the number of parameters grow in fully connected networks. The issue
with the data is that it is very high dimensional - the input size is 4096 and
the output size was 180 at this stage. With literally zero hidden layers and
just that input and output, we already have 4096 * 180 = 737280 parameters.
So anyways, I started with a hidden layer of 2000, 5000, 10000 and at some
point ended up with two massive 50000 node layers. Also tested 10 $\times{}$
10000 layers at some point too. Let's do some quick maths:
4096 * 50000 ** 2 * 180 = 1.8432e+15
That's ~1.8 Quadrillion parameters.
I was astonished as to why the so called 'avoidable' bias was not going away.
And of course, training the models became very slow too. Further, the bias
stayed at an even higher level with a higher number of parameters!
Two main takeaways here:
This was structural bias (unavoidable)
Training 1.8 Quadrillion parameters lead to the bias staying elevated since training is too inhibited. After the same amount of time (when compared to the simple 700k parameter model) we simple haven't learnt anything and thus the train error stays high.
After changing my data I ended up with no structural bias and a bit of avoidable bias.
I upped the number of training epochs and got to close to 0 error loss.
Observe below the relationships between different types of errors and model complexity (number of parameters) or number of training epochs.
The data was related to
this post
and this project.
Specifically, I tried to predict the parameters (i.e. the exact graph) that
lead to a 64 $\times{}$ 64 QUBO matrix of a Maximum Cut problem. At first, I
encoded the output as the list of edge tuples (since we had 90 edges
pre-defined, that's 180 nodes). I am not 100% sure where the structural bias is
coming from, but my hunch is that it is the ordering of edges. Now, it's not as
simple as saying that the edges are stored in a dict and thus they're ordered
in a random way - that's not the case since Python 3.7. The edges (=dict) keep
insertion order. But what if the insertion itself was random? That is exactly
the case here - I am using
gnm_random_graph
which generates edges randomly in random order and inserts them randomly. And
randomness is structural bias that cannot be learnt by a neural network. So
there you have it.
I ended up predicting a full adjacency matrix, which gets rid of the structural
bias. Works now, nice!
Physically Unclonable Functions (PUFs) are an extremely interesting and hot
topic in cryptography. The basic gist is, you have a certain physical object
that cannot be cloned due to the extremely intricate properties of it (e.g. the
atomic structure of a DVD) that depend on the manufacturing process. The
assumption is that each DVD is unique when looking at its atomic structure.
Now, when measuring the reflection of a DVD with a high precision device of
some sort, these differences show up. Imagine you put a chip in a capsule which
inner walls consist of such reflective material. The chip has a device that
measures the reflection and based on that generates a key pair (a public and a
private one). This public key pair can be used to communicate with the device
and identify it, and if someone were to try to tamper with the device by
drilling into it - the reflective material would break, changing the
measurement and thus the public key. Lastly, an additional digital signature on
the device ensures that anyone can check for tampering (the public key is
contained in the digital signature). The most important part is that the device
is secret-free. The 'secret' is the material and there is no way to clone that
material to re-use the secret and there is no way to find out the private key
without breaking the whole thing. Pretty cool concept.
PUFs can be used for many cryptographic primitives, one of which is Oblivious
Transfer (OT). The point of OT is: A wants to get a certain information from B
based on A's choice. B should stay oblivious of the choice of A in the process
however, and A should not be able to get all of the options from B. Pretty
specific, I know. But being a cryptographic primitive has lots of uses. See the
paper
Founding Cryptography on Oblivious Transfer - Efficiently,
which builds
secure two-party computation
based on OT.
Anyways, enough of introduction. The main contribution of this small post is to
show an example calculation of how Oblivious Transfer would work with PUFs.
In the figure below you can see how the scheme works. A uses the PUF to create
a list of challenge-response pairs (CRPs) and sends over the PUF to B. B
chooses two values x0 and x1 and sends those to A. A now chooses one of those
and sends back the challenge XOR'd with the chosen value. B measures the
response of the PUF of the two challenges v XOR x0 and v XOR x1. The
interesting thing here is that depending on the choice of A, one of those
challenges ends up being the real challenge C of A (e.g. if A chooses x0, v XOR
x0 = C XOR x0 XOR x0 = C). B sends both choices (s0, s1) back, but XOR'd with
both responses. A chooses the correct one based on its choice and since it has
the response for C, can compute y0 XOR R = s0 (if continuing with the example
of choosing x0).
Now, an example calculation.
s0 = 5
s1 = 6
# A chooses random C and a choice b (can be 0 or 1).
C = 5412
R = 7777
b = 1
# B chooses random x0, x1.
x0 = 44
x1 = 66
# A calculates (remember, A chose 1):
v = C ^ x1 = 5412 ^ 66 = 5478
# B measures the two challenges.
R0 = v ^ x0 = 5478 ^ 44 = 5450 = 9999
R1 = v ^ x1 = 5478 ^ 66 = 5412 = 7777
# B sends back both choices XOR'd with both responses.
y0 = s0 ^ R0
y1 = s1 ^ R1
# A XORs the second reply from B (first one is garbage) and gets the choice value they wanted.
s1 = y1 ^ R = 6
In summary, what we just observed:
A lets B compute both options so B doesn't know the chosen bit b. And B computes
both options in a way that the option not chosen is garbled, while the option
chosen can be deduced by A.
Generating QUBOs for Maximum 3-SAT and the Traveling Salesman Problem (TSP)
was a bit tougher than the other 9 problems that were outlined in https://arxiv.org/pdf/1811.11538.pdf.
In this post I would like to give a quick example on how one would do so.
For reference, the context of this is this project.
To understand this post, it is required to read the arxiv paper above. Else I believe the code won't make any sense.
For Maximum 3-SAT, the following clauses are given:
For Maximum 2-SAT, the QUBO formulation is really space-efficient (which is
important for current architectures) and only scales with the number of
variables (see arxiv link above). Awesome! For Maximum 3-SAT unfortunately we
scale with both the number of variables and the number of clauses.
The code to generate a QUBO for Maximum 3-SAT is seen below. The main loop is
split in two parts, one that generates the blue part of the formula below and
one that generates the green part. To be specific, the formula is just for 3
variables, so it's not general. The green part is simply the combinations of
all variables (e.g. if you have 5 variables, there's 10 combinations).
def gen_qubo_matrix(self):
n = self.n_vars + len(self.clauses)
Q = np.zeros((n, n))
# Clauses is a list of 3-tuples of 2-tuples.
# Eg: [((0, True), (2, True), (3, False)), ((1, True), (3, False), (4, True))]
# So we loop through all tuples and find the type of it.
# There are 4 types: (True, True), (True, False), etc.
for i, clause in enumerate(self.clauses):
clause_idx = self.n_vars + i
Q[clause_idx][clause_idx] += 2
# (1 + w1) * (x1 + x2 + x3) = x1 + x2 + x3 + w1x1 + w1x2 + w1x3
# (1 + w1) * ((1 - x1) + x2 + x3) = -x1 + x2 + x3 + w1 -w1x1 +w1x2 +w1x3
# (1 + w1) * ((1 - x1) + x2 + (1 - x3)) = -x1 + x2 -x3 + 2w1 -w1x1 +w1x2 -w1x3
for item in clause:
item_idx = item[0]
val = item[1]
if val:
Q[item_idx][item_idx] -= 1
Q[clause_idx][item_idx] -= .5
Q[item_idx][clause_idx] -= .5
if not val:
Q[item_idx][item_idx] += 1
Q[clause_idx][clause_idx] -= 1
Q[clause_idx][item_idx] += .5
Q[item_idx][clause_idx] += .5
# -x1x2 -x1x3 -x2x3
# -(1-x1)x2 -x1x3 -x2x3 = -1 +x1x2
# -(1-x1)(1-x2) -x1x3 -x2x3 = -1 -2x1x2 +x1 +x2
for (item1, item2) in itertools.combinations(clause, 2):
idx1 = item1[0]
idx2 = item2[0]
val1 = item1[1]
val2 = item2[1]
if val1 and val2:
Q[idx1][idx2] += .5
Q[idx2][idx1] += .5
if not val1 and val2:
Q[idx2][idx2] += 1.
Q[idx1][idx2] -= .5
Q[idx2][idx1] -= .5
if val1 and not val2:
Q[idx1][idx1] += 1.
Q[idx1][idx2] -= .5
Q[idx2][idx1] -= .5
if not val1 and not val2:
Q[idx1][idx2] += 1.
Q[idx2][idx1] += 1.
Q[idx1][idx1] -= 1.
Q[idx2][idx2] -= 1.
return Q
As you can see, it's longer than most other QUBO generators in the project. The
matrix that is generated also looks very intricate. A small note for newcomers
to QUBOs: Whenever there is a variable that is not on the diagonal, e.g.
$x_1x_2$, one splits the constant in two. This is why you see these .5
additions and subtractions. A lone variable $x_3$ would be on the diagonal at
Q[3][3] and leads to no such split. What about $2x_1x_2$? Well, we end up with
$\pm1$.
As for TSP, the problem is defined as a distance matrix between the nodes of the graph and a constraint c.
For instance, below we see three nodes with distances normalized to the range 0 to 1.
TSP is unfortunately rather inefficient to encode. For $n$ nodes, we have a $n * n$
distance matrix and a $n^2 * n^2$ QUBO matrix! Below is the code for generating
a TSP QUBO (c.f. here for more context):
def gen_qubo_matrix(self):
n = len(self.dist_matrix)
Q = np.zeros((n ** 2, n ** 2))
quadrants_y = list(range(0, n ** 2, n))
quadrants_x = quadrants_y[1:] + [quadrants_y[0]]
# The diagonal positive constraints
for start_x in quadrants_y:
for start_y in quadrants_y:
for i in range(n):
Q[start_x + i][start_y + i] = 2 * self.constraint
# The distance matrices
for (start_x, start_y) in zip(quadrants_x, quadrants_y):
for i in range(n):
for j in range(n):
if i == j:
continue
Q[start_x + i][start_y + j] = self.P * self.dist_matrix[j][i]
Q[start_x + i][start_y + i] = 2 * self.constraint
# The middle diagonal negative constraints
for start_x in quadrants_x:
for i in range(n):
Q[start_x + i][start_x + i] = -2 * self.constraint
for j in range(n):
if i != j:
Q[start_x + i][start_x + j] += 2 * self.constraint
Note: P is set to 10.
This wraps up the post. I've shown an example problem instance for Maximum
3-SAT and one for TSP and code on how to generate QUBO matrices for each.
I've always wondered what policy entropy really means in the context of PPO.
From other posts (e.g.
here
by AurelianTactics - btw very cool guy saw him on the official reddit RL
Discord) I know that it should continuously go down. It also corresponds to the
exploration the PPO agent is taking --- high entropy means high exploration.
In this case I will be looking at continuous action spaces. Here, each action
in the action space is represented by a gaussian distribution. The policy is
then called a gaussian policy. In my case, I have two continuous actions (for
more information, check out the post
Emergent Behaviour in Multi Agent Reinforcement Learning - Independent PPO).
Now, entropy for a gaussian distribution is defined as follows: $\frac{1}{2} \log(2\pi{}e\sigma{}^2)$.
What baselines PPO does
now is to simply sum the entropies across the actions axis (e.g. given a batch
of 700 actions s.t. the shape is (700,2), this becomes (700,)) and then take
the mean over the batch. Their inner code is equal to the definition above
since $\log(a*b) = \log(a) + \log(b)$. For reference, this is using the
natural log.
The right side (.5 * np.log(2.0 * np.pi * np.e)) ends up being roughly 1.41.
You can now deduce, given an entropy reading, how uncertain (and thus
exploratory) each action is. High entropy equals high exploration and vice
versa. To quantify the above, check out the figure below:
You can see that at roughly variance = 0.242 entropy is 0.
This is how policy entropy looks for me when training two predator PPOs to
catch prey:
As you can see, exploration decreases continuously. It starts with variance = 1
for both actions and ends up at variance = 0.3 after training. (starts at 2.82
= 1.41 * 2, which is incidentally the entropy for 2 actions summed up, given
variance of both is 1).
Nice!
Update: Some notes regarding why we take the mean.
One could argue that the logstd does not change over the 700 batch, since the
policy did not change. And that is true, we actually get 700 times the same
entropy number. The reason why every PPO implementation takes the mean here is
two-fold.
First of all, check out the loss PPO optimizes (see paper):
The blue part is the entropy, and observe how it is in the expectation
$\mathbb{E}$, for which you usually take the mean.
Secondly, autograd. One optimization idea would be to just take the first
entropy number and throw away the others (so an indexing operation instead of
the mean), but that would not play well with autograd, which uses the gradients
to update all parameters (including the logstd parameter and thus your policy
itself) in the backwards pass. Calculating the gradient of a mean is easy, but
the gradient of an indexing operation? The key word here is 'differentiable
indexing' and it is possible, but you only update part of your parameters (the
part you indexed). In this case my hunch is that you end up only updating your
policy once instead of 700 times, which defeats the purpose of a 700 batch. At
that point you should just do singular updates and no batching.
This is so hilarious I just gotta save this for the books.
Michael Burry, the famous main character from The Big Short who predicted the
housing market crash and made a nice 100M for himself (total net worth back
then 300M), is short on TSLA. And he has started sharing heavy metal on his
Twitter, which is interesting - this is portrayed in the movie as something he
uses to get rid of the stress of BLEEDING CASH EVERY MONTH lol. The position
must be pretty hard on his nerves. On the other hand, he just made some very
good cash on his GME play.
For the record, I totally agree with him. There are quite a few indicators that
the market is devolving into a bubble. Wallstreetbets for instance has gained
over 6M subscribers over the GME fiasco (which I missed, but I made some money
shorting it). It had 2M subscribers before, and has 8.8M now. I have personally
heard of some laymen in my circle of friends that they are becoming interested
in the stock market. So yea, interesting times. And then there is Bitcoin and
TSLA..
So let's see how 2021 plays out. I now know that my post There is no
recession coming was technically right. We basically did not have a
recession in 2020, that was like a small shake-out! So maybe I will look back
at this post some time in the future and agree with my past self.
While writing this post I was listening to the Heavy Metal Michael Burry shared
- my god that is some good shit. I might just take a dive into that genre off
of his recommendation.
Update: Just found a satirical comment I posted literally at the bottom of the
crash in March 2020. Everyone on WSB was balls deep in puts. Fucking hilarious.
There are three major issues with the commercial register (Handelsregister)
data. First of all, every company has their own rules, e.g. in terms of
appointments. Who is allowed to be appointed, when and why and with whose
permission.. Thousands of combinations. Secondly, our clerks do not seem to
follow the format 100%. And they make spelling mistakes sometimes.
But the most glaring one is: Each register entry is ONE line of TEXT. And there
is no official delimiter it seems. The clerks all kind of go with a few loose
rules but yea - it is definitely not defined by a RFC or something lol.
So what does this leave us with. LOTS of Regex and special rules, and it is
pretty painful. But this should be known to any serious ML practitioner, that
most of the work lies in data preparation.
For instance, I want you to look at this regex (which is just a small part of
the data preparation pipeline):
(Die |Ist).*(ein|zwei)?.*(Gesellschaft|Vereinigung|Genossenschaft|vertritt|Vorstand|Geschäftsführer).*(ein|zwei)?.*
(vertreten|Prokuristen|vertreten.*gemeinsam|vertreten.*gemeinschaftlich|einzelvertretungsberechtigt|Vertretungsbefugnis von der Generalversammlung bestimmt)
It might not be perfect, but it gets the job done.
So anyways, enough of the data preparation pipeline, how about a first result.
While doing some descriptive statistics (data exploration), the first thing
that interested me was the number of registrations per year over time. Check it
out:
2015 we had a near recession
with oil dropping and the Chinese market crashing, so it makes sense that in
bad times less businesses were registered.
But what is extremely surprising - in 2020 we had slightly more registrations
than in 2019! What the hell? Seems like the 2020 global recession was kind of a
fake recession for a certain chunk of the population.
Another thing that is interesting, but not plotted here (since it would look
boring) - month over month the number of registrations are roughly the same. So
there is no seasonality here. Quite interesting!
Just a note regarding the 2007 data (which is roughly 130 registrations) - the
available data starts at the very end of 2007. Thus it looks like there are
nearly no registrations in 2007, but in reality it's not a full year.
I will be working with this data some more and I will update this post (or
create a new one). Specifically I am looking at doing Latent Semantic Analysis
(LSA) and using the
gensim
library. Stay tuned!
Update: I have done some more analysis, specifically LSA.
Note: The next few paragraphs will contain a lot of German words.
An important task is getting rid of small and common words such as 'der'
(engl.: the), else one would throw off (dis)similarity measures. So for me
this currently looks like that:
und oder der die das von zur nur so einer einem ein allein darum dazu für dafür
zum zu dort da grundkapital stammkapital eur dem usd gbp im mit an dem bei art
rahmen vor allem aller sowie aus in den als
These are often called 'stopwords', and we simply filter them out, which
improves the LSA model considerably - we get more confident topic assignments.
Notice words like 'Stammkapital' (base capital) - they may look like they are
important to the text, but if literally every text has Stammkapital included,
it does not really give us any more information.
A seemingly important hyperparameter (similar to the number of clusters in
clustering) is the number of topics. I am currently experimenting a bit here.
Having too small of a number of topics seems to make most topics include stuff
like 'Halten' (holding) which is a very important word since there are a lot of
holding companies. Too many topics has duplicates as follows:
While trying to understand Grover's Algorithm, I started wondering how exactly
the Hadamard gate transformation looks like for not just one QuBit or even two,
but for three or more.
Ekert et al. in their paper Basic concepts in quantum computation
show with Equation 12 a very nice general formula for the Hadamard gate
transformation:
where $\cdot$ is the element-wise product. The sum over $z$ indicates all possible states of $x$ (we're in a superposition!). For a n-length state there are $2^n$ possible states, so for 3 there are 8 states.
Assuming I'd like to apply Grover's algorithm to 3 QuBits as shown
on the Qiskit page on Grover's algorithm,
there is a point at which Hadamard is applied to 3 QuBits on a non-trivial state (not just $|000\rangle{}$).
This is where the formula above comes in.
Let's first calculate it through for $|000\rangle{}$. Below I calculate the element-wise products $x \cdot z$.
In the last step all the sum was evaluated to its 8 possible states. This is probably very familiar looking now!
Next, let's try this for a non-trivial state, such as $|101\rangle{}$.
Again, at first I calculate the element-wise products $x \cdot z$.
I've again only shown the first two. But there is an important difference here
already, the result of the element-wise product for states $x=|101\rangle{}$ and $z=|001\rangle{}$ is $1$!
And this is where the negative signs come from when evaluating the whole sum, since $(-1)^{1} = -1$.
If you want to calculate this by hand and compare, go right ahead. To make sure
your calculations are correct, here is a short code snippet using Qiskit, that
does basically all the calculations for you.
To customize it, change the index in circuit.initialize(). E.g. Just imagine
the state from before as a binary number ($101_{2} ~= 5_{10}$), which means we
were looking at 5. Now set the number at index 5 to 1 in the list [0, 0, 0, 0, 0, 0, 0, 0].
import qiskit as qk
q = qk.QuantumRegister(3)
c = qk.ClassicalRegister(3)
circuit = qk.QuantumCircuit(q, c)
# We want 101, so the fifth. This is 0-indexed. So the sixth place:
circuit.initialize([0, 0, 0, 0, 0, 1, 0, 0], [q[0], q[1], q[2]])
circuit.h(q[0])
circuit.h(q[1])
circuit.h(q[2])
print(circuit)
simulator = qk.BasicAer.get_backend('statevector_simulator')
job = qk.execute(circuit, simulator)
ket = job.result().get_statevector()
for amplitude in ket:
print(amplitude)
This will print the amplitudes multiplied with $\frac{1}{\sqrt{8}} ~= 0.3535$.
Incidentally, negative signs are in the exact same places as in the result from
our manual calculation above.
This is the pointe of this blog post: To calculate the resulting state (after Hadamard gate transformation), you would have to calculate $H(|z\rangle{})$ for each of the sub-states above (so for $|000\rangle{}$, $|001\rangle{}$, etc.) and then sum them up^{1}. Some of them would then cancel each other out. This is why in their example, after step 3.1 the state looks so short:
Anyways, this is it for this post. I hope I could give some more intuition into
the Hadamard gate transformation with more than just 1 or 2 QuBits. Maybe the
script above will also help you in your endeavours.
Update: I've created a script to prove that the states indeed cancel each other out.
Find it here.
Observe in the output at the very and that only 4 states have a non-zero
amplitude. Those are exactly the four states that we end up with!
Also, for the interested reader here are a few related papers that may help in understanding:
I got that idea from here. Notice how the Hadamard gate transformation evolves on a complex state: $H(\frac{1}{\sqrt{2}}|0\rangle{} + \frac{1}{\sqrt{2}}|1\rangle{}) = \frac{1}{\sqrt{2}}(\frac{1}{\sqrt{2}}|0\rangle{} + \frac{1}{\sqrt{2}}|1\rangle{}) + \frac{1}{\sqrt{2}}(\frac{1}{\sqrt{2}}|0\rangle{} - \frac{1}{\sqrt{2}}|1\rangle{}) = \frac{1}{2}(|0\rangle{} + |1\rangle{}) + \frac{1}{2}(|0\rangle{} - |1\rangle{}) = |0\rangle{}$↩
Let's parse the search path grammar! The USP (User Services Platform) committee
has given us a BNF in their pdf that describes the object path model (including
search path) using a context-free language.
Note: PEG
languages are not equivalent to context-free languages. In fact they are less
powerful, since they are more strict in terms of ambiguity. E.g. the following
rule cannot be parsed by a PEG:
S ::= 'x' S 'x' | 'x'
So why not simply do a Regex? BNFs describe context-free languages and often
cannot be implemented using a regular language. Specifically, context-free
languages should strictly contain context-free rules only (i.e. the left side
of each rule in the BNF further above has only one variable) and they can have
recursion on both sides. Regular languages do not allow that and are even more
strict - they can only contain a rule that ends in a terminal (e.g. a
character) or a terminal combined with a variable (given terminal a and
variable B: aB or Ba. In above example it could be '*' posnum or posnum '*').
For instance, the rule for exprcomp as it stands is not allowed in regular
languages, since it contains three variables. Now, these rules can still
possibly be converted into a regular language by enumerating all possible
combinations of a variable and a terminal. E.g. oper consists of 6 possible
values, so why not add 6 rules a → '==' value, a → '>=' value, and so on.
However, this is a lot of work. Let's just stay context-free and use pyparsing!
There are lemmas to prove that a language is not regular such as the Pumping
Lemma. However, I haven't been able to find a contradiction (i.e. any word in
the above language can be pumped). This does not mean it is regular, so it is
not a proof, but I have a gut feeling the above language is in fact regular.
One way to actually prove this would be to create a finite automaton, i.e. a
DFA or NFA from the BNF. Another way could be to try to convert every single
rule into a left-/right-regular rule. If this is successful, it is indeed a
regular language.
This post is a summary of the flap list feature.
It is not necessarily just a Cisco-only feature, even though I will be focusing on Cisco here. There are other vendors such as Casa, Arris, Huawei.. The list goes on.
A flap list keeps track of CMs with connectivity problems (they're flapping back and forth).
The flap list can help in deciding between whether there is a problem with the given CM, or whether there is a problem with the upstream or downstream cable.
There are three classes of problems:
Reinsertions: A CM re-registers more frequently than a specified insertion time. Too many reinsertions may indicate problems in the downstream cable or that the CM is provisioned wrongly.
Hits and misses: A hit occurs when a CM successfully responds to MAC-layer keepalive messages that the CMTS sends out. A miss occurs when the CM does not respond in after a timeout. Too many misses followed by hits may indicate problems in the up/downstream cable.
Power adjustments: A CM can adjust their upstream transmission power up to a maximum power level. Too many adjustments may indicate a problem with an amplifier in the upstream direction.
Use cases:
If a customer reports a problem, but the flap list does not contain the customer's modem, problems with the cable can be ruled out. The issue is then most likely local.
CMs with more than 50 power adjustments a day have a potential issue in their upstream path.
CMs with roughly same number of hits and misses and with a lot of insertions have a potential issue in their downstream path.
All CMs incrementing their insertion numbers at the same time indicates a problem with the provisioning servers.
CMs with high CRC errors have bad upstream paths.
Correlating CMs on the same upstream port with similar flap list statistics (same number of hits/misses/insertions) may show cable or node-wide issues.
If a CM registered more than once in a certain period (default: 90s), the first registration is considered failed, which increments the insertion fail KPI.
ccsFlapHitsNum
CMTS sends request every 10 secs, a successful response within 25ms from CM increases flap hits by one.
ccsFlapMissesNum
If the response is completely missing or takes more than 25ms, flap misses increases by one.
ccsFlapCrcErrorNum
ccsFlapPowerAdjustmentNum
If upstream power is adjusted more than X dB (default: 1 dB, but they say it often should be more like 6 dB), this KPI is increased.
ccsFlapCreateTime
Time when this modem was added to the flap list. After max age (default: 7 days) they get removed again.
If any of the main KPIs (insertion fail, hit/miss, power adjustment) is significantly higher than the others, this is a very important signal.
If hit/miss is highest → the modem keeps going up and down.
If power adjustment highest → "improper transmit power level setting at the modem end" (personal story: I used to have that with M-Net. Internet kept going down completely every few days. They reduced power level (which capped our downstream MBit at a lower level..), and everything was fine)
The point is that these KPIs tell a lot.
Below is an illustration of misses increasing live on some cable modem. The snmpget was executed in the course of an hour.
snmpget -v2c -c 'REDACTED' IP iso.3.6.1.4.1.9.9.114.1.1.5.1.14.0.38.151.18.124.193
iso.3.6.1.4.1.9.9.114.1.1.5.1.14.0.38.151.18.124.193 = Gauge32: 1580
snmpget -v2c -c 'REDACTED' IP iso.3.6.1.4.1.9.9.114.1.1.5.1.14.0.38.151.18.124.193
iso.3.6.1.4.1.9.9.114.1.1.5.1.14.0.38.151.18.124.193 = Gauge32: 1634
snmpget -v2c -c 'REDACTED' IP iso.3.6.1.4.1.9.9.114.1.1.5.1.14.0.38.151.18.124.193
iso.3.6.1.4.1.9.9.114.1.1.5.1.14.0.38.151.18.124.193 = Gauge32: 1650
It should be noted that there are some differences between vendors in terms of
their Flap List support.
Vendor
reinsertions
hits
misses
power adjustments
last flap time
creation time
Cisco
Yes
Yes
Yes
Yes
Yes
Yes
Harmonic
Yes
Yes
Yes
Yes
Yes
No
Casa
Yes
Yes
Yes
Yes
Yes
No
Arris
Yes
No
No
No
Yes
No
The latest Arris Flap List MIB can be found here.
The OID to look for is called cadIfCmtsCmStatusInsertionFlaps
(1.3.6.1.4.1.4998.1.1.20.2.2.1.51). Unfortunately, hits, misses and power
adjustments do not seem to be supported.
The Cisco Flap List MIB can be found here.
The Casa Flap List MIB can be found here.
And the Harmonic (CableOS) Flap List MIB can be found here.
One OID to look for here is hrmFlapInsertionFails (1.3.6.1.4.1.1563.10.1.4.2.1.1.1.1.2).
I've been trying to get a bunch of sharks to cooperate in hunting using independent learning. And it has been quite a blast.
Specifically, I've adapted the baselines repo to add support for multiple agents (MARL) and some other things to PPO: https://github.com/Instance-contrib/baselines/tree/tf2-instance.
Now, the environment consists of fishes (green cycles) and sharks (orange cycles):
Fishes use a pretty good static algorithm to evade sharks, while sharks are trained independently using PPO. The environment is a POMDP - the observation space only includes the nearest few fishes and sharks. The reward per eaten fish is 10. If there are no sharks in sight and the time is right, fishes will also procreate. Thus, with smart ecosystem management the sharks could eat infinitely.
A few new additions were made: Not only is there two or three sharks instead of just one, there is also starvation and thus pressure to eat. There is also a stun move that enables a shark to stun the other for a few seconds and a killzone: If two sharks are in that zone when one of them kills a fish, the rewards are split (so each shark gets 5). The question here is, what do the sharks learn to do? Will they completely greedily swim around, stun each other and steal fishes until there is none left. Or will they learn ecosystem management together, i.e. leave the fish population intact such that they can eat for a long time? Will they learn to hunt together in some settings?
The chart below shows that indeed ecosystem management, or what I call herding, is possible.
This is the left-over fish population at the end of an episode, over all training steps. One can see that while at the beginning the sharks aren't able to hunt well yet (section a), in section b they learn to greedily eat all fishes (and become quite proficient at that). What is interesting, is that in section c they learn to keep 2 fishes around. And that is with two independent neural networks (PPOs)! Quite awesome.
The second awesome emergent behaviour is cooperation. One way of creating that is reward shaping, but I did not go down this path. Instead, I forced cooperation by making it much tougher to hunt. The sharks are now slower by half, which makes it pretty much impossible to get a fish on its own. The next chart shows how two sharks learn to hunt from two sides to catch fish. This is completely learned behaviour in spite of the split killzone reward and the stun move!
There is a lot more analysis to be done here. For instance, the exact parameters that induce cooperation can be quantified in much more detail. See the chart below - there are 4 dimensions explored here, 3 of which have a significant influence on the average cooperation rate (over 20 models with each model evaluated 20 times). First, a trivial influence is the radius in which cooperation is registered. Increasing it increases the cooperation rate. More interesting is the number of initial fishes in the population and the speed of the shark. Especially setting the speed of sharks to 0.03, which effectively halves the speed, increases the rate to a point where they only cooperate. Having a lower number of fishes initially makes the aquarium less crowded and thus catching a fish tougher; This increases the cooperation rate.
There is a lot more to write here, but for now I will have to leave it at that.
The repository can be found here.
It also contains a summary of the results and a few nice animations.
Since I could not find an example out there, here's a comprehensive and complete example on using the Zoho Mail API with Python Flask.
Some pointers: It was very important to make sure that the top level domains of all API URLs match. In my example, they're all dot eu.
How it works is as follows: The application prints a register link that needs to be clicked only once per run. It gets an initial access token and a refresh token. A second thread uses the refresh token every hour to get a new valid access token (since they're only valid for one hour).
There is also a Flask route to actually send a HTML generated E-Mail for testing.
Now, a possible improvement could be saving the refresh token to the filesystem to persist restarts. But that is an exercise for the developer.
I've been porting LunarLander from Openai/gym (Python) to C++.
Since I'm using MCTS and related planners, the environment needs to be
copyable. Well, box2d does not have copy-constructors, so at first glance
we're screwed. Fortunately the environment is rather simple - it has three
dynamic bodies (lander and the two legs), two joints and a static ground, the
moon. It turns out, with a minor modification to box2d itself and a small hack
that fixes sleeping it is possible to copy the whole LunarLander world.
Now, to clone the world and its bodies, I use the reset() method of the
environment, i.e. we start with a clean initial world where the lander is still
at its initial position. The shape of the moon can be copied easily, since it
is generated from a random float vector. Simply copy the vector and generate
the exact same moon.
Finally, each body is cloned with the following function:
void LunarLanderEnv::clone_body(b2Body* current, b2Body* other) {
auto transform = other->GetTransform();
current->SetAwake(other->IsAwake());
current->SetAngularVelocity(other->GetAngularVelocity());
current->SetLinearDamping(other->GetLinearDamping());
current->SetAngularDamping(other->GetAngularDamping());
current->SetGravityScale(other->GetGravityScale());
current->SetEnabled(other->IsEnabled());
auto vel = other->GetLinearVelocity();
current->SetLinearVelocity(vel);
current->SetTransform(transform.p, transform.q.GetAngle());
current->SetSleepTime(other->GetSleepTime());
}
Basically, all properties like position (transform), angle and velocity are
copied.
These are all public functions of b2Body except for GetSleepTime and
SetSleepTime, which I had to add myself.
The change to box2d itself can be found here.
The gist is - when setting the velocity, sleep time is discarded (set to 0.0f).
That is very bad, since the environment ends when the lander is asleep
(IsAwake() returns false, i.e. sleep time exceeds some threshold). Thus, sleep
time needs to be copied too.
It's unfortunate that the sleep time is not public, but it makes sense - we are
now going rather deep into box2d. They do not support cloning/copying bodies,
it is what it is. But for my purposes this works.
Lastly, the hack I unfortunately had to introduce (and I hope at some point I can get rid of it), is as follows.
So, of course you can do the 5 world->Step calls in a for loop. But the gist
is, we need to give the bodies a chance to settle down and fall asleep. It
seems when forcefully copying all the public properties of bodies (velocity
etc.) the bodies still jiggle around. Maybe it's due to the joints, I don't
know. Apparently there is still some state I missed copying.
The hack makes it all work for now, but maybe at some point I will be able to
solve this in a cleaner way.
The project this is relevant to is GRAB0.
The full implementation of the LunarLander environment can be found here. This includes the above code snippets in a wider context.
I saw this great
list of productivity tips from a year ago by guzey. And it's awesome.
Now, you may indulge in that list if you like. There's lots of these little
productivity posts from a lot of different, very productive people on
Hackernews. This will simply be another one of those. The reason is two-fold:
It's for me to reflect on later. And it's just another combination. You can
combine productivity tricks in many ways until you find your combination
that just works. So here's mine. Maybe it works for you.
It consists of two rules.
Prepare very small tasks (split up big tasks into small ones) the day
before. I usually have a few (like 3?) well defined big tasks and a few smaller
administrative tasks. Well defined means: They are broken into small tasks. It's
like prepping food the night before work. Some productive people do that. (the
food thing - I don't do that. But you get the gist.)
If you notice you are starting to procrastinate, get in the habit of
questioning what the root cause may be. Spoiler: It's emotions. Your task feels
unsurmountably huge. Your task feels useless. Your task feels too difficult.
Your task feels like it's not getting you anywhere. Your task feels too easy
(yes. that's possible too. Too easy == useless. For example when preparing for
an exam. You have a few tasks to redo some exercises you deem easy -> that may
result in procrastination, since you feel that redoing them again is not really
getting you anywhere.)
These are the two rules driving me since more than a year. guzey says
procrastination tricks stop working from time to time. Let's see when these two
rules stop working for me.
Note: They fail for me too, sometimes. Some days are just simply garbage. Keep
in mind though - it's fine, it's a human thing. It is what it is.
I may update this post at some point with further experiences and tricks.
Update: Now that I think of it, I actually use more tricks from time to time.
But the two rules above are the main ones. Mostly even just rule 1. Other
tricks I sometimes use include: Win the morning (Aurelius) and some kind of lax
Pomodoro technique based on tasks - finish a task (however long it takes, be it
15 min or 1 hour), take a small break. The Pomodoro one I only use for
studying, not for work. For work or development I usually do not need any
productivity tricks except for the two rules I mentioned. I don't actually like
Pomodoro because sometimes the break leads to a 2 hour odyssey. After that,
productivity is very impaired due to guilt.
Sometimes cross validation may fail. In the following, a bit more intuition
into an exercise that shows a pitfall of leave-one-out cross validation is
presented.
For the reference of the task at hand, refer to this book, exercise 11.5.1.
Failure of k-fold cross validation:
Consider a case in that the label is
chosen at random according to P[y = 1] = P[y = 0] = 1/2. Consider a
learning algorithm that outputs the constant predictor h(x) = 1 if the parity
of the labels on the training set is 1 and otherwise the algorithm outputs the
constant predictor h(x) = 0. Prove that the difference between the leave-oneout
estimate and the true error in such a case is always 1/2.
The gist of the solution is as follows. Be advised, this is just for intuition. The rigorous proof can be found here.
Firstly, parity is defined like on binary numbers.
A list of labels $\{x_0, .., x_n\}$ has the parity $(\sum_i^n x_i) \bmod 2$.
Imagine the sample consists of 3 items with the following labels: $\{1, 0, 0\}$.
This set has a parity of 1.
Now, when taking one of the items out (leave-one-out) for cross validation, there are two cases:
Labels are now $\{1, 0\}$. The parity of this three-tuple is still 1. The predictor is trained on this and will always return 1. However, the validation set consists of $\{0\}$, which has a parity of 0. Thus, the estimated error by CV is 1.
Labels are now $\{0, 0\}$. The parity of this three-tuple is 0. The predictor will always return 0. The validation set consists of $\{1\}$. Again, estimated error of 1.
In essence, the estimated error will always be 1 (keep in mind, we take the average of all estimated errors).
Now, imagine the same for a sample of a much higher size, such as 1001, i.e. with 501 1's and 500 0's.
The true error is then roughly 0.5: The predictor always predicts 1 due to the parity 1, but close to half of the samples are 0.
You may keep going ad infinitum.
Finally, as the estimated error is 1.0 (as shown above), we thus get the difference of 0.5, as required in the exercise.
If you're here, you probably checked out a few blog posts or Wikipedia to
understand Kernels. So this is just a few additional thoughts. Firstly, why
even do this 'kernel trick'?
The application we will be looking at in this post is clustering a set of
points. For this, it is of course important to have some kind of distance
measure between the points, in fact that is the main function needed.
Consider a dataset of 4-dimensional points. Pick a point (x1,x2,x3,x4). For
the sake of this example you find that to separate the points in multiple
clusters you will have to move them into the 10-dimensional space using the
following formula:
$(\sum_i^4 x_iy_i)^2$
Fully expanded, each of the 4-dimensional points is now the following
10-dimensional point:
You could just convert all points into this and then apply a clustering
algorithm, which will use some kind of distance measure on those points such as
euclidean distance, but observe all these calculations.. For each point you
have to do $10$ calculations. Now imagine if you did not have to convert
all points. You use the polynomial kernel:
$(\sum_i^4 x_iy_i)^2 = (x^Ty)^2$
The crucial point here is that the kernel gives us a kind of similarity measure
(due to the dot product), which is what we want. Since this is basically the
only thing we need for a successful clustering, using a kernel works here. Of
course, if you needed the higher dimension for something more complicated, a
kernel would not suffice any longer. The second crucial point here is that
calculating the kernel is much faster. You do not have to first convert all the
points to 10 dimensions and then apply some kind of distance measure, no, you
do that in just one dot operation, i.e. one sum loop. To be precise, you iterate $N = 4$ times with a
kernel, but you do it $N*2.5=10$ times with the naive way of converting points to
higher dimension.
This is a collection of notes regarding the C++ frontend of PyTorch. I'm a proponent of reading code to understand concepts in computer science, so this 'tutorial' is most likely not for everyone. Still, I recommend to just go ahead and read the goddamn code. In this post I'll just highlight a few specific topics to get one started.
Documentation is in my opinion lacking, so anyone developing a project using the C++ frontend should basically download PyTorch from Github to read the code as needed. Let's get started.
Modules always need to be named XXImpl. For instance, your module could look like this in your header:
It is possible to train multiple different objectives at once. For instance, let's assume you want to both learn to predict the policy and the value of a given state, you can do just that by simply returning a std::pair in the forward method of your custom module:
std::pair<torch::Tensor, torch::Tensor>
A2CNetImpl::forward(torch::Tensor input) {
auto x = input.view({input.size(0), -1});
x = seq->forward(x);
auto policy = F::softmax(action_head(x), F::SoftmaxFuncOptions(-1));
auto value = value_head(x);
return std::make_pair(policy, value);
}
It is possible to create a network architecture dynamically from a configuration file. Here I pass a std::vector net_architecture (e.g. {64, 64}) and then iteratively create a linear layer of size 64 (as passed) and a ReLU activation layer. At the end I create two heads, a policy and a value head.
seq = register_module("seq", torch::nn::Sequential());
int n_features_before = n_input_features;
int i = 0;
for (int layer_features : net_architecture) {
auto linear = register_module(
"l" + std::to_string(i),
torch::nn::Linear(n_features_before, layer_features)
);
linear->reset_parameters();
auto relu = register_module("r" + std::to_string(i + 1), torch::nn::ReLU());
seq->push_back(linear);
seq->push_back(relu);
n_features_before = layer_features;
i += 2;
}
action_head = register_module("a", torch::nn::Linear(n_features_before, 3));
value_head = register_module("v", torch::nn::Linear(n_features_before, 1));
action_head->reset_parameters();
The policy optimizers of course support polymorphism, so creating different ones based on configuration is possible too:
You should know when you see a local minimum.
See the following example that I've been battling with for a while. In my opinion it's a symptom of not enough exploration.
This is a separated list of 3-item tuples that represent the likelihoods of 3 actions, rotating left, rotating right and going forward. The fat tuple is interesting, as it's not confident at all between right and forward. This is the cause of the sample we're learning with: 2, 2, 2, 2, 2, 1, 0, 2, 2, ..
That sample rotates to the right, then to the left and then goes forward. Basically the two rotations are useless. Even more, we do two different actions in the same state, rotating right (1) and going forward (2). See where I'm going with this? Those are exactly the two actions that the fat tuple is not confident with.
The result of a greedy evaluation of the above policy is: 22222101010101010101010101010101010101010...
So you end up rotating until the end, not reaching the goal.
Below is an every so slightly different policy (check out the unconfident tuple - this time we go forward instead of rotating right when greedily evaluating).
So, this training sample sucks and I need more exploration based on it.
Let's see if my hypothesis is correct. I might update this post at some point.
Update: My theory was right, but the reason was wrong. See the issue here and the resolving commit here. What I saw above was this off by one mistake, as mentioned in the issue. Simply a logical mistake by myself. What finally fixed it however, was 'fuzzing' the gradient bandit in GRAB0 (bear in mind, the above topic came from that project) and noticing that when given a policy, the gradient bandit needs at least 2000 simulations to find a better policy (and even then, only most of the time, but that's good enough).
PreDeCon^{1} is a projected subspace clustering method based on DBSCAN.
This will be a short post, mostly to illustrate the concept with an example, since I believe code is worth a thousand words.
But the gist is to weigh the distances based on the variance of their neighbors.
I.e., if your neighbors are all distributed in a line, you assign a very small weight to distances along the line and very huge weight to the perpendicular dimension. So, if a point is slightly outside of the line, it's already not part of your cluster (since the distance is weighted heavily). Often, those weights can be up to 1 to 100! So going outside of the line penalizes (=weighs more) the distance by a factor of 100.
See for yourself in the following picture of an example data set. PreDeCon will find two clusters, both the lines in the picture.
The way this works, is that the red point is NOT a core point due to our weighted clustering, even though it would be with naive DBSCAN. Thus, DBSCAN stops here, instead of continueing and adding all points to one single cluster. And why is that not a core point? Well, the high variance of the right neighbor of our red point lies actually on the vertical line, and not the horizontal line as with the other neighbor! Thus, the weight given to the horizontal distance between the red point and the right neighbor will be multiplied by 100. That's because the red point looks to the neighbor like an outlier, since it's obviously not on the vertical line. The weight given to the vertical distance would've been simply 1. But the red point and the right neighbor have a vertical distance of 0 and a horizontal distance of 1.. Thus, the weighted distance here will be 0 * 1 + 1 * 100. There's more to it, but that gets the method across.
Finally, here's Python code explaining the above concept, including all details (such as correct euclidean distance calculation and not the simplification we used).
# PreDeCon - Projected Clustering
# Paper: Density Connected Clustering with Local Subspace Preferences, Böhm et al., 2004
# This code is not optimized.
import numpy as np
def get_neighbor(X, candidate):
"""Return the eps-neighbors of candidate.
"""
neighbors = []
for pt in X:
if ((pt - candidate) ** 2).sum() ** .5 <= eps:
neighbors.append(pt)
return neighbors
def get_weights(X, candidate):
dists_x = []
dists_y = []
for neighbor in get_neighbor(X, candidate):
dists_x.append((neighbor[0] - candidate[0]) ** 2)
dists_y.append((neighbor[1] - candidate[1]) ** 2)
var_x = sum(dists_x) / len(dists_x)
var_y = sum(dists_y) / len(dists_y)
weight_x = 1 if var_x > delta else K
weight_y = 1 if var_y > delta else K
return weight_x, weight_y
def pref_weighted_dist(X, neighbor, candidate):
weights = get_weights(X, neighbor)
dist = 0
for i in range(2):
dist += weights[i] * (neighbor[i] - candidate[i]) ** 2
return dist ** .5
def is_core(X, candidate):
good_ones = []
for neighbor in get_neighbor(X, candidate):
dist = max(
pref_weighted_dist(X, neighbor, candidate),
pref_weighted_dist(X, candidate, neighbor)
)
if dist <= eps:
good_ones.append(dist)
return len(good_ones) >= minpts
X = np.array([
[1, 5],
[2, 5],
[3, 5], # p3
[4, 5],
[5, 5],
[6, 5], # p6, red point
[7, 5],
[7, 6],
[7, 7],
[7, 4],
[7, 3],
[7, 2]
])
minpts = 3
eps = 1
delta = 0.25
K = 100
print('p3', is_core(X, X[2]))
print('p6', is_core(X, X[5]))
SUBCLU^{1} is a subspace clustering algorithm based on DBSCAN^{2} and A-priori^{3}.
Assume the following small exemplary data set with three dimensions.
DBSCAN is parameterized with eps=3 and minpts=2 (including the core point).
2 10 2
1 20 1
9 30 9
8 40 9
The algorithm is bottom-up, so let's start with singular dimensions 1 to 3.
D(x,y) is the euclidean distance between two points.
Dimension 1 (points: 2, 1, 9, 8): D(2,1) = 1 < eps. Since we have two points, 2 is a core point. Same can be said about 1. So, the first cluster consists of 2 and 1. Same can be applied to 9 and 8. So dimension 1 contains two clusters.
Dimension 2 (points: 10, 20, 30, 40): All points are too far away from each other (distance is at least sqrt(10) between each pair), so no cluster can be found here.
Dimension 3 (points: 2, 1, 9, 9): Analogous to dimension 1. Two clusters.
Next, we generate the k+1 subspace candidates (where k=1, since we just did singular dimensions).
Keep A-priori in mind: The k+1 candidates need to have k-1 attributes in common. Not relevant for 2 dimensions, but will be relevant for 3 dimensions later.
So here we have the following combinations of dimensions: 1,2; 1,3; 2,3. We may now prune some of those candidates using the A-priori principle, i.e. if k dimensions of a given candidate do not contain a cluster, that candidate will also not contain a cluster in k+1 dimensions. For proofs, refer to the original paper^{1}. In this case, with k=1, dimension 2 has no clusters and thus both the candidates 1,2 and 2,3 can be pruned.
Next, we apply DBSCAN on subspace 1,3 (points: (2,2), (1,1), (9,9), (8,9)). D((2,2), (1,1)) = D((1,1), (2,2)) = sqrt(2) < eps. That's a cluster with points (2,2) and (1,1) (since we have 2 points and minpts=2). D((9,9), (8,9)) = D((8,9), (9,9)) = 1 < eps. Again, another cluster.
To generate three dimensions, we would need more candidates. For example, with 1,2 and 1,3 one could generate 1,2,3 (since both share the first dimension: 1). However, currently the only 2-dimensional candidate with clusters is 1,3. Thus, we stop here.
Result:
The two dimensions 1 and 3 contain two subspace clusters
2 2
1 1
and
9 9
8 9
References:
Density-Connected Subspace Clustering for High-Dimensional Data, Kailing et al., 2004 ↩↩
A Density-Based Algorithm for Discovering Clusters, Ester et al., 1996 ↩
Fast Algorithms for Mining Association Rules, Agrawal et al., 1994 ↩
I've compiled a list of ELKI invocations for subspace/correlation clustering with nicely fit parameters (except for PROCLUS).
This post is actually more of a reminder for myself.
A note on PROCLUS: Due to the randomness of k-medoids it seems I only get good clusters half of the time. It's something to consider.
Real-time dynamic programming^{1} (RTDP) uses Markov Decision Processes (MDPs) as a model of the environment. It samples paths through the state space based on the current greedy policy and updates the values along its way. It's an efficient way of real-time planning, since not necessarily the whole state space is visited, and works well for stochastic environments.
As such, with some additional work, it is possible to apply RTDP on stochastic gym environments such as frozen lake. In my case, I tried just that. Given a 20x20 map and full observability, RTDP successfully gets a decent reward after a few minutes of training on one core.
The source code can be found at the end of the post.
For the MDP we need four components: States, actions, transition matrix and costs.
Fortunately, since we're using a gym environment, most of this is already given. There are four actions, the states are known and the gym environment exports the transition matrix (including the probabilities). That is, only the costs are missing. Out of the box, frozen lake only gives one positive reward at the very goal. In all other cases, e.g. falling into a hole, zero.
One idea for a custom cost mapping could be: 0.1 for each action, 1 for a hole, -1 for the goal (negative cost).
A keyword related to negative cost is self-absorbing goals, i.e. goals that have just a single action going back to the goal itself with 0 or even negative cost. A self-absorbing goal is a requirement of stochastic shortest path MDPs (SSP MDPs)^{4}, which is pretty much what we have here. We want the path with the lowest cost (which is the same as a shortest path with lowest cost=time). By the way, gamma needs to be set to 1.
As a result of the above mapping, RTDP achieves an average reward of 0.48 (over 10000 evaluations).
This is great, since this is not an easy stochastic environment - you only have a chance of 33% of actually ending up where you intend to be!
But this is also where RTDP excels. A faster alternative to RTDP would be discretizing the transition probabilities and applying D* Lite.
There are also improvements to RTDP, such as BRTDP^{2}.
For more information, refer to Mausam and Kobolov^{3}.
Implementations (including RTDP on frozen lake) are to be found here:
You see that visits is getting normalized with base. If base is the default 19625 from the AlphaZero pseudo code, but your MCTS only does 50 simulations (like in my case), you at most get 50 visits and thus this never becomes very significant:
Now, pb_c_init on the other hand handles mean_q. The default here is 1.25., i.e. with above results we have a factor of 2.25 before multiplying with visits and the action probability.
Basically, one can adjust the significance of the prior score in relation to the mean Q value. If pb_c_init is set lower, the MCTS will base its exploration more on the mean Q value. If it is set higher, the Q value is less significant.
This is for all the people who work most of the day on their computer.
If you use your Desktop as a temporary dumping ground or for ideas or your current projects, consider screenshotting it from time to time.
If it's in some other dedicated folder, screenshot that one.
The Desktop or the folder is a current snapshot of most things you're up to.
Well, maybe your open tabs in your favorite browser could also be added.
Anyways, doing this every few weeks will give you some excellent material to review at the end of the year.
It is very low effort and doesn't even have to happen at some given interval. Whenever you feel like it - do it.
It's like taking pictures while being in Italy on vacation.
At some point later in life you might take that album full of pictures and reminisce on it.
I assure you, later you will definitely appreciate those screenshots.
You will remember some parts of the year and possibly devulge in some nostalgia.
You might laugh about the weird things you considered.
Failed projects or maybe that one success - you will see its beginnings.
Makes you think though. Life has become insanely digital, at least for me.
The post surveys the origins of resolution and a selected set of variants:
Binary resolution with theory unification, hyper-resolution and parallel unit
resulting resolution.
Performance improvements enabled by indexing data structures such as
substitution trees or abstraction trees are explored.
Finally, related work and further topics are listed.
Introduction
As early as in the 18th century Leibniz formulated the idea of automatic
resolution with his `calculus ratiocinator', in which he described a vision of
machines being able to automatically resolve any logical problem^{51},
i.e. for example the satisfiability of a propositional formula or a theorem of
logic. Now, centuries later, resolution, an essential part of solving these
kind of problems, enjoys a broad range of applications, though Leibniz' vision
has not come into reality so far: Theorem provers do not have the ability to
solve any such given problem. An excellent case suggesting that this may
never become reality is Hilbert's 10th problem, in which an algorithm is to be
found that can decide whether a diophantine equation is solvable: Matiyasevich
showed in 1970 that it is indeed impossible to devise such an
algorithm^{31}^{30}.
Still, since its first formal definition by Robinson in 1965, many areas have
been found to profit off of the resolution principle^{41}. Examples
include automatic theorem provers^{36} such as Vampi.e.cite{RV99}, SAT
solvers such as BerkMin^{17} or logic programming. For the latter,
PROLOG is a prominent example: A PROLOG program is defined as a logic formula
and refutation using SLD resolution^{15}^{21} is the `computation' of the
program^{25}.
As for SAT solvers, resolution helps in averting combinatorial explosion and is
used as a preprocessing step^{35}. By creating new clauses that may show
the unsatisfiability earlier the total state space that is visited is lessened.
Specifically hyper-resolution has been shown to improve SAT solvers
successfully in terms of efficiency^{5}^{20}. Notable examples
include HyPre^{5}, HyperBinFast^{16}, PrecoSAT^{7} or
CryptoMiniSAT^{46}.
The remainder of this post is structured as follows. First, binary resolution
is presented. On top of that, three variants are explored: Equational-theory
resolution, hyper-resolution and unit resulting resolution. In each chapter a
note on possible performance improvements through indexing is made. Lastly,
concluding remarks and references to related topics are listed.
Binary Resolution
To derive a proof for a hypothesis given certain axioms, resolution can be of
particular use. The hypothesis can be encoded as a proof of contradiction and
resolution repeatedly applied. If the result is an empty clause, the theorem
cannot be fulfilled if the hypothesis is negated. This contradiction is the
proof that the hypothesis is actually correct. The opposite is true if the
result is not an empty clause.
Another classic application is the satisfiability of a certain problem that
consists of multiple premises (consisting of multiple boolean variables) that
have to be fulfilled. The resolution principle defines an elementary rule to
confirm the (un-) satisfiability of that problem, i.e. whether there is a
certain configuration of variables that can fulfill the problem. In the
following, the rule is presented in detail.
Binary resolution^{41} is especially of interest for its support of
predicate logic. With predicate logic it is possible to define relations
between variables, which makes it more expressive than (finite) propositional
logic^{6}.
While propositional resolution works analogously to binary resolution, using a
similar propositional resolution principle^{11}, there is a major
difference to binary resolution — it makes use of unification.
Now, in this context propositional resolution is not covered. Thus, in what
follows it is reasoned in predicate logic exclusively.
To dissect binary resolution, a few definitions are needed.
A term is a constant or a variable or an n-ary function consisting of n terms.
An n-ary predicate symbol on n terms is a literal. The negation of a literal is also a literal.
A finite set of literals is a clause. The set can also be seen as a disjunction of the literals.
Binary resolution introduces the following rule (see
Example 1): If, given two clauses, in this case $\{A(x), C(x)\}$
and $\{\neg A(x), D(x)\}$
, one contains a positive and the other a
corresponding negative literal, all else being equal, the two clauses resolve
into a new clause, also called the resolvent. In this case the resolvent is
$\{C(x), D(x)\}$
. So far this is similar to propositional resolution.
A positive literal and its negative version is also called a complement. Now,
binary resolution additionally relaxes the notion of a complement by allowing a
negative and its positive literal to be resolved if they can merely be unified.
See Example 2 below.
It can be seen that $A(x)$
and $\neg A(y)$
are unifiable by the substitution
$\{y \mapsto x\}$
. This substitution is then applied to the resolvent, yielding
$\{C(y), D(y)\}$
. It is important to note that this is always the substitution
of a most general unifier^{19}.
There is still a limitation. For example, the two clauses $\{A(x, a)\}$
and
$\{\neg A(b, x)\}$
cannot be satisfied, since this is akin to $\forall x A(x, a)$
and $\forall x \neg A(b, x)$
(where the two $x$
variables are different)
and assuming $x$
in the first clause becomes $b$
and $x$
in the second clause
becomes $a$
, $\{A(b, a)\}$
and $\{\neg A(b, a)\}$
become unsatisfiable. Yet,
binary resolution in its current form is not able to resolve to an empty
clause.
To fix this, a renaming step in the first clause is added^{32}. Then, $x$
may become $y$
(resulting in $\{A(y, a)\}$
) and thus unification and resolution
is successful.
There is still a case that cannot be resolved correctly: The two clauses
$\{A(x), A(y)\}$
and $\{\neg A(x), \neg A(y)\}$
are unsatisfiable,
yet an empty clause cannot be derived, as all literals of the first clause are
positive and all literals of the second clause are negative.
For this purpose, factoring is introduced. The main point of factoring is
deduplication. If some of the literals of a clause are unifiable, the clause
after substitution of the most general unifier is a factor. A resolution can
then commence using factors. In the above case, the substitution $\{x \mapsto y\}$
is a most general unifier. Thus, after applying it, the result is
the factors $\{A(y)\}$
and $\{\neg A(y)\}$
, since a unification removes
duplicates. Now, in a simple resolution step an empty clause is derived.
It is possible to use indexing to speed up resolution by a small factor. With
indexing, redundant terms are shared, decreasing the total amount of memory
used, and lookup of unifiable terms is faster. To realize indexing, structures
such as discrimination trees^{19} or substitution trees^{18} are
used.
As Graf mentions^{19}, an index could be built in advance, for instance,
by adding all literals into a discrimination tree and thus enabling fast
retrieval of complementary literals for unification for a given literal in a
resolution step. An exemplary tree of a set of clauses can be seen in
Figure 1. Traversing the tree, to for example look
for a complementary literal for $B(x)$
, is faster than testing all possible
literals each time a complementary candidate is required.
Binary Resolution with Theory Unification
Standard binary resolution does not have the power to resolve clauses for which
certain axioms or theorems are available, such as associativity or
commutativity in an equation. Consider the two clauses $\{A(a, b)\}$
and
$\{\neg A(b, a)\}$
. If commutativity holds, i.e. $A(a, b) = A(b, a)$
, binary
resolution cannot resolve the clauses, since syntactic unification^{3},
the unification without theories that binary resolution is based on, cannot
change constants and does not know about the rules of commutativity.
Now, a naive approach would be to generate all possible clauses that arise from
the theory and then try them all until one succeeds. However, for the clause
$\{A(x, y, z)\}$
that would already result in six total clauses, considering
that all possible combinations that could arise from commutativity need to be
covered. It is immediately clear that this does not scale.
To this end, binary resolution can be coupled with equational theory
unification^{37}, also called E-unification, i.e. if a clause including
an equation is given together with a certain theory that holds, a resolution
can be found efficiently.
Next, E-unification is explored in more detail.
The notion of most general unifier is similar to normal unification, though to
a limited extent. This means that there is an ordering of substitutions —
some are more general than others: For two given substitutions $\sigma$
and
$\tau$
, $\sigma$
is more general if there is a $\upsilon$
where $x\tau$
=
$x\sigma\upsilon$
and the given theory still holds. The substitution is the
most general E-unifier, if it is also the most general substitution.
Syntactic unification can have multiple most general unifiers, however this is
taking into account variants or renaming variables. Ignoring this, a
syntactically unifiable set of clauses has just one most general
unifier^{13}. This is comparable to unitary problems that also only have
one most general E-unifier. However, and this is the main difference between
syntactic unification and E-unification, there are cases in which E-unifiable
clauses do not have a single most general E-unifier^{3}. For example,
the equational clause $A(x, y) = A(a, b)$
(with commutativity) has two most
general E-unifiers $\{x \mapsto a, y \mapsto b\}$
and $\{x \mapsto b, y \mapsto a\}$
. Strictly speaking, none of them is then the most general E-unifier, since
both are not related (e.g. if one was the instance of the other), both have
the same number of substitutions and for both a more general substitution does
not exist.
Thus, E-unification reasons with sets of E-unifiers.
A set of ordered E-unifiers is complete, if it contains all E-unifiers that are
applicable to the problem or at least all most general ones, i.e. for each
applicable E-unifier there is at least one more general unifier in the set.
Now, a minimal set of E-unifiers only contains unifiers that cannot compared,
i.e. for a given E-unifier there is no more or less general one in the set.
For a given minimum set there are three types of problems^{3}:
Unitary, if the theory is empty (so no additional theory holds, just
equality) and there is just one most general E-unifier (which is a
syntactic one, as no theory holds).
Finitary, if the set is finite and has more than one E-unifier.
For example, commutativity is finite: There are only so many combinations
for a given problem^{43}.
Infinitary, if the set is infinite.
Zero, if the set is empty.
The type of problem is interesting since an algorithm might have to deal with
having infinitely possible E-unifiers.
Now, a list of theories is given in Table 1. It is interesting to
note that there are some theories that are not decidable. Also, not every
theory has published algorithms for E-unification. These are potentially open
problems.
Theory
Type
Decidability
Algorithm
Associativity (A)
Infinitary
Decidable
^{37}
Commutativity (C)
Finitary
Decidable
^{43}
Distributivity (D)
Infinitary$*$
(Un) Decidable
^{42}
(A) and (D)
Unitary/Finitary
Decidable
^{47}
(A), (D) and idempotency (I)
Unitary/Finitary
Decidable
^{2}
Abelian groups
Unitary/Finitary
Decidable
^{22}
Commutative rings
Zero
Undecidable
None
Boolean rings
Unitary/Finitary
Decidable
^{8}
Endomorphisms
Unitary/Finitary
(Un) Decidable
None
Table 1: List of theories with respective facts^{3}
Types denoted by $*$
only apply to some problems such as ones containing
constants or arbitrary function symbols.
Now, equational theory is restricted in that clauses need to contain some kind
of equation. However, it has been found to be possible to create specific
unification algorithms, similar to the ones referenced in Table 1:
Z-resolution^{12} compiles an axiom by generating a Lisp program that can
then automatically deduce that axiom. An axiom that can be Z-resolved has to be
in the form of a two literal clause (also called Z-clause), which together with
another clause resolves to some other clause. It can be seen immediately that
this gives considerable freedom in what the clauses contain and thus the axiom
they represent.
In fact, they can represent theories from E-unification, such as commutativity.
Compiling an axiom can be compared to a human remembering that if $A=B$
, then
$B=A$
, and thus only one rule needs to be saved in memory. The other one is
deduced when needed.
The idea of such specific unification algorithms is not applicable to
recursive clauses. For instance, one such clause is $\{A(x), \neg A(f(x))\}$
.
It resolves with itself to the new clause $\{A(x), \neg A(f(f(x)))\}$
, also
called a self-resolvent. Now, it is immediately clear that there are infinite
such resolvents. Generating a unification algorithm for them all is infeasable,
however Ohlbach showed that abstraction trees can be used to encode those
resolvents efficiently^{34}. Furthermore, standard operations such as
merge do not have to be adapted for infinite clauses.
Self-resolvents can be compared to instances of the original clause. With this,
a deterministic procedure for generating new ones is possible. The clauses can
be added into an abstraction tree with continuations, i.e. when a node with a
continuation is reached while traversing, the next level can be generated with
the procedure and by appending a new tree of same structure. Thus, it is
possible to represent infinite sets of unifiers.
Refer to Figure 2 for an example. The clauses $\{A(x, a)\}$
and $\{A(a, x)\}$
with associativity over A can be resolved with an
infinite set of E-unifiers, e.g. $\{\{x \mapsto a\}$
, $\{x \mapsto A(a, a)\}$
,
$\{x \mapsto A(a, A(a, a))\}$
, ...$\}$
. Now, when traversing the tree the
continuation is generated whenever a continuation identifier, in this case
denoted by a $C$
, is encountered.
Hyper-resolution
In the context of theorem proving, especially for huge problems, the number of
clauses generated by resolution may become substantially huge. In fact, the
number grows exponentially in terms of when unsatisfiability\footnotemark{} can
be confirmed^{4}.
\footnotetext{\
The attentive reader might question why it only grows in terms of
unsatisfiability. Well, predicate logic is undecidable^{10}^{49} and thus
it is not clear which complexity satisfiability to denote with^{4}.
}
Additionally, part of the clauses will mostly be tangential to the
proof^{9}. Storing them wastes memory and considering them as proof
candidates wastes CPU cycles, causing the theorem prover to slow down. In the
following, variants of binary resolution that deal with this are explored.
A prominent one is hyper-resolution. It offers improvements in terms of speed:
There are less resolvents generated and thus less steps needed to get to a
result^{23}. Published in 1965 by Robinson^{40}, it has since then
found wide use as part of SAT solvers, going as far as to make problems
solvable that were previously unsolvable^{5}. Next, a short overview is
given.
First, a few definitions are needed. A positive clause is a clause containing
only positive literals (i.e. no literal with a negation sign). A negative
clause then only contains negative ones. Since the resolvent is positive, this
is also called positive hyper-resolution. The corresponding negative
hyper-resolution would result in a negative resolvent.
The hyper-resolution rule is now as follows. Given a list of strictly positive
clauses, also called electrons, and a clause that is not positive (i.e. either
completely or partly negative), also called the nucleus, the clauses (all of
them) can be resolved in one step: The nucleus and the electrons are unified
simultaneously. Figure 3 illustrates this.
Given are the four clauses $\{\neg A(x), B(x)\}$
, $\{C(y)\}$
, $\{A(x), B(x)\}$
and $\{\neg B(z), \neg C(z)\}$
. Bold clauses are the respective nuclei and
there are two steps until the empty clause is reached.
Multiple electrons (or clauses) are resolved at each step. After each step the
new clause is added to all clauses. It can immediately be seen that binary
resolution would require three steps in this case.
After unification, the negative literals of the nucleus are required to have
counterparts in all of the electrons, causing all these complements to be
removed. Since all negative literals are gone, a positive clause (or an empty
clause) remains.
At this point, however, finding the unifier is the main computational issue,
as the possible combinations grow exponentially with the amount of electrons
for each negative literal in the nucleus^{19}.
For this, indexing comes into play, specifically substitution trees. In fact,
many theorem provers such as LEO-III^{50}^{24} rely on substitution
trees among others for speed-up.
A substitution tree allows for efficient retrieval of terms unifiable with a
given term. Given two substitution trees, the merge operation then returns the
substitutions that are compatible, i.e. where the codomain of the same
variables can be unified. This can be extended to multiple substitution trees
and is called a multi-merge, which is especially relevant for hyper-resolution,
since a set of electrons and a nucleus need a simultaneous unifier. Now,
Graf^{19} proposes to keep a substitution tree for each literal of the
nucleus, i.e. all required substitutions for that literal can be looked up.
When the simultaneous unifier is needed, a multi-merge on the substitutions
trees of all negative literals of the nucleus together with the respective
electrons is executed.
Before diving deep into Unit Resulting Resolution, a tangential improvement
should be noted. Often, there are multiple clauses that can be used as nucleus
at the current step. However, some result in a faster resolution than others.
An efficient selection can be achieved using an ordering of the literals (after
unification). The idea of an ordering can be traced back to Slagle and Reynolds
in 1967^{44}: For instance, clauses containing the conclusion (if looking
for a certain theorem to be proven) can be prioritized that way, increasing the
speed of resolution^{27}.
Unit Resulting Resolution
Unit resulting resolution (UR resolution) works according to the following
rule. Given is a set of $n$
unit clauses (i.e. clauses containing only one
literal) and another clause with $n+1$
literals, also called the nucleus, in
which each unit clause has a complementary literal (with unification). After
resolution with a simultaneous unifier, only one unit clause remains. For
example, given the clauses $\{\neg A(z), B(x), C(y)\}$
, $\{A(x)\}$
and $\{\neg C(z)\}$
, UR resolution resolves to $\{B(x)\}$
.
In contrast to hyper-resolution, the resulting unit clause may be positive or
negative, since the multi-literal clause may have positive or negative
literals.
Now, UR resolution is a good candidate for parallelization, as Aßmann described
in his PhD thesis in 1993^{1}.
As a matter of fact, using multiple processes is very relevant to current
developments: Latest CPUs are not improving in terms of performance due to
their clock speed, but due to their core count^{48}. Thus, focusing on
this area yields favorable runtimes. Additionally, it is clear that redundancy
of calculations is reduced, if for example there are multiple big clauses
overlapping in terms of their literals.
Aßmann's idea consists of 3 steps. First, a clause graph is generated. It
connects the literals of the nucleus to the complementary unit clauses, while
also keeping track of the unifier between both. More specifically, each edge in
the graph is enriched by the unifier split into two components: The positive
component contains the substitutions that are applied to the positive literal
of a complementary pair, the negative component the ones that are applied to
the negative literal. Splitting the unifier up is useful for the actual search
algorithm. A simple example can be seen in Figure 4. The
dashed box represents the nucleus.
Now, for each clause (or node in the graph) a process is started that executes
Algorithm 1. It should be noted that the function `smgu' returns the
simultaneous most general unifier. Additionally, the part of the unifier
between the nucleus and a unit clause that belongs to the nucleus is called
test substitution, while the one belonging to the unit clause is the send
substitution.
Finally, the core of the algorithm is a two-step update from the nucleus
towards the unit clauses. After the nucleus receives the unifier from a given
unit clause, all other unit clauses are sent their send substitution modified
with the smgu of the currently received substitutions and the test substitution
(see line 11). The intuition of this is as follows. The clauses
need to be kept updated with currently known substitutions. To do so, the
substitution that operates on their respective literal is updated.
Lastly, this loop is repeated until a simultaneous most general unifier between
all substitutions is found (see line 5).
Parallel UR Resolution (PURR) by Graf and Meyer^{33} improves upon
Aßmann's clause process by increasing the degree of parallelization even
further. Now, each edge between a nucleus literal and a unit clause (instead
of just a clause) in the clause graph is assigned a process — the resolution
process. Its task can be compared to the inner most loop in Aßmann's clause
process. Additionally, substitution trees instead of single substitutions are
shared across processes. This enables the use of efficient operations such as
the multi-merge operation. Lastly, the terminator process, which runs on a
single clause, confirms whether a simultaneous most general unifier has been
found.
In detail, the resolution process now keeps substitution trees for each literal
of the nucleus (except for the one its edge is connected to) cached. Whenever a
substitution tree is received, it subsumes^{19} the cached version of
it, i.e. all instances of the substitutions in the cached substitution tree
are removed in the new version. The same is done the other way around: The
cached version is updated in that instances of a given new substitution are
removed. The union of both (so no substitutions are lost) is then merged with
all other cached substitution trees, resulting in a substitution tree that
contains most general unifiers between the cached substitutions and the new
substitution. The unit clause literal is finally updated by applying the
lightest substitutions (i.e. substitutions with the smallest amount of
symbols) from the new substitution tree to its part of the substitution (recall
that substitutions are kept split in two on each edge of the graph).
Finally, terminator processes check for the desired simultaneous most general
unifier of all clauses (similar to line 5 in
Algorithm 1): If all collected substitutions for a clause can be
merged, such a unifier is found. Subsumption is also applied here.
It is clear at this point that PURR heavily depends on substitutions trees.
However, this also enables efficient operations such as subsumption that aid
the merge operation in that unnecessary instances of substitutions are filtered
beforehand and improves the performance of PURR.
Conclusion
The introduction of the resolution principle has had a major impact — it is
now a widely used technique in many areas of computer science and related
fields. Performance is of particular concern, and work towards improving it has
been plentiful in the past: Much attention to it was devoted by Graf in his
book Term Indexing. This includes data structure such as discrimination trees,
abstraction trees or substitution trees. Additionally, many algorithmic
adaptions or extensions based on binary resolution have since then been
proposed, such as hyper-resolution, resolution with equational theory, or unit
resulting resolution.
This post covered much of that. First, the resolution principle was presented.
An extension to it, binary resolution with equational theory, was explored:
Now, knowledge about axioms in equational clauses can be used to create
specific fast algorithms for fast resolution. Hyper-resolution, which reduces
the amount of clauses generated, was demonstrated. And finally, PURR was
reviewed: Unit resulting resolution combined with parallelization leads to good
resource utilization and thus efficient resolution of unit clauses.
Still, more areas can be surveyed.
General improvements to resolution are possible, such as faster unification
algorithms^{29}^{14}. As for further adaptions to classic binary
resolution, there are multiple: Examples include paramodulation^{39}, in
which a clause containing an equation can be used as a substitution in another
clause, linear resolution^{26}^{28} or demodulation^{52}. Lastly,
the reader interested in more related work on theorem provers could explore
alternative proof procedures for (un)satisfiability, like analytic
tableaux^{45} and its variants.
It cannot be denied that something weird is going on.
A whole lot of things currently don't add up in the financial system.
The FED having to maintain a standing repo facility would be one thing. Why are banks not injecting liquidity? Why are they holding off?
Covid-19, it should honestly be mostly a non-event. Sure, a correction is warranted. But not at this volatility.
Retail (which includes me) is of course as always dumbfounded.
I think next week will show if this is the real deal.
See the picture below (10 day variance of QQQ). Will we reach 2008 levels?
LOLA is the shorthand for Learning with Opponent-Learning Awareness.
Let's get some intuition.
First of all, policy gradient, what is that.
Well, you repeatedly update the policy in the direction the total expected reward is going when following the policy in a given state s (resulting in some action a).
Wew what a sentence !!!
So, let's assume our policy is represented by something that can be parameterized, such as a neural net with softmax output (so you have a probability for each action).
Well, obviously by watching the total reward go up and down when taking different actions you can adjust the weights of that neural net.
In LOLA when you update your policy gradient you don't just take into account the world as is (statically).
So in a static scenario your value function returns the expected value given two parameters (one for each agent).
And you simply update your own parameter. But the other parameter is static.. The other agent is inactive.
But in a dynamic scenario you take into account how you would change the update step of the other agent. You suddenly notice what the other agent is learning based on what you're doing.
So the total reward now changes based on your own actions, but also on what effect that has on the other agent's learning.
Check out the official video at roughly 12:05.
LOLA learning rule is defined as:
(our agents parameter update) + (How does our agent change the learning step of the other agent) * (How much our agents return depends on the other agents learning step)
This post is a WIP. As I get more intuition I will update this post.
If you are in a bad state, the agent should still try and do the best action given the circumstances.
If you don't use baseline, you will get a pretty bad reward just because you're in a bad state.
However, we want to reward good actions in bad states, still.
Example:
Lets take $V(s) = \mathbb{E}_{\pi}[G_t | s]$
for state $s$
as our baseline $b.$
Basically you take the mean returns of all possible actions at $s.$
You would expect the return of your action to be slightly better or worse than $b.$
So if $V(s)$
= 5 and reward of our action = 4: 4-5=-1.
If $V(s)$
= -5 and reward of our action = -6: -6-(-5)=-1.
So it's two actions that give wildly different returns as is (4 vs -6) but in the context of their situation they are only a bit bad (-1).
Without baseline the second action would be extremely bad (-6) even though given the context it is only slightly bad (-1).
Essentially this is like centering our data.
By the way, $V(s)$
can come from a neural network that has learnt it.
MiFID II gets my pulse going.
Still, this does not mean that you can only trade UCITS ETFs from now on.
Recently I've successfully enabled a non-UCITS ETF for trading for myself, specifically QQQ3, by contacting my broker and telling them that the KID for QQQ3 is available online (since September 2019 apparently).
After a few emails back and forth it seems they are now in the process of enabling access to it.
A note on MiFID II. Why the fuck are ETFs not allowed, but derivates and futures are all good? Political bullshit. The people benefitting from this are most likely a minority.
A note on QQQ3. I am starting an experiment with this. I know this could be FOMO, but still. I will start building a position in it with a minor percentage of my account.
The thing I always need, and this might be something psychological, is a position in something. Else I won't track it. If money is on the line, it gets my attention.
And even with a drawdown of like 90% (on QQQ3), I will still have the motivation to fix the position. And this is important for the next recession.
The fast Fourier transform of a vertical line is a horizontal line. WTF?
One interpretation of the FFT is summing up many sine curves, i.e. for a given function there exists some infinite amount of sine curves that, when added up, can represent an approximation of the function. The function that we want to represent in the picture below^{1} is a popular example - the square wave.
Another point to understand: The power of the frequency tells us the weighting of one sine curve.
High power? That sine curve is important.
Now, back to the vertical line.
The FFT is telling me: Sum up the sinues curves of ALL frequences there exist. And that shall result in a vertical line.
It needs ALL frequences to represent that line, and at the same power.
But why?!
And wait, those are sine curves.. Should we not get some kind of residuals around the vertical line, resulting in something similar to the shape of Mt. Fuji?
I can immediately tell you - this is completely wrong. You need to think in an infinite sense.
Assume we want to start creating such a vertical line.
The brown curve on top is the sum of all sine curves. As you can see (on the right side I simply shifted the curves a bit on the $y$
axis for easier comprehension), a line appears! The trick here is that different sine curves cancel each other out. But coincidentally, there is a point where all curves do the same thing: They are at their local maximum. This is where our vertical line forms.
By the way, don't mind the $y$
axis, the wave can be normalized to $0$
back from $20$
easily.
This is the basic idea and can be followed into infinity.
Now, to actually get a straight vertical line, apparently all frequencies have exactly the same power, going as far as to even depending on the height of the line. If its height is $1$
, the line in frequency domain lies at $y = 1$
. For $2$
, it is $2$
. And so on.
I have not been able to understand intuitively yet why this is the case. However, in the following, the mathematical proof shows that this is indeed always the case for any given vertical line.
Assume the time domain data is as follows: $[1, 0, ..., 0]$
.
It is immediately clear that only the first summand is non-zero, since the rest is multiplied with $x_n = 0$
.
Additionally, for the first summand, since $n = 0$
and $x_0 = 1$
, it holds that $1 * e^0 = 1$
and thus $X_k$
is $1$
. This holds for all $k$
.
This works analogously for all multiples, such as $[9, 0, ..., 0]: 9 * e^0 = 9$
.
I find that the product rule is always forgotten in popular blog posts (see ^{1} and ^{2}) discussing RNNs and backpropagation through time (BPTT). It is clear what is happening in those posts, but WHY exactly, in a mathematical sense, does the last output depend on all previous states? For this, let us look at the product rule^{3}.
Consider the following unrolled RNN.
Assume the following:
$h_t = \sigma(W * h_{t-1} + Ux_t)$
$y_t = \mathrm{softmax}(V * h_t)$
Using a mix of Leibniz' and Langrange's notation, I now derive:
$\sigma' * (h_2 + W * \sigma' * (h_1 + W * \sigma' * (h_0 + W * h_0'))) =$
$\sigma_{h_3}' * h_2 \mathbf{+}$$\sigma_{h_3}' * W * \sigma_{h_2}' * h_1 \mathbf{+}$$\sigma_{h_3}' * W * \sigma_{h_2}' * W * \sigma_{h_1}' * h_0 \mathbf{+}$$\sigma_{h_3}' * W * \sigma_{h_2}' * W * \sigma_{h_1}' * W * h_{0}'$
Chain rule happens in line 1 to 2, product rule in line 4 to 5. Line 3 is simply explained by Ux not containing W (which we're deriving for). Now, it can be immediately seen that each summand of the last result keeps referencing further and further into the past.
Lastly, since this assumes the reader is familiar with the topic, a really nice further explanation of BPTT for the interested reader can be found here.
A more recent and editorialized version of this post can be found here.
The wave impairment is one of many impairments that can be detected on a FBC downstream spectrum. In the following, we take a cursory look at an implementation of the wave impairment detection algorithm.
To start trying to fit a wave onto a spectrum, the data has to be stitched together. Only channels actually send data (these are the long, upside down U-shapes in the spectrum). The black line in the following picture shows how the algorithm stitches those channels together.
Based on a smoothed version of the resulting data (using rolling averages) an inverse FFT (IFFT) is calculated. This is the meat of the algorithm.
Now, those huge impulses (denoted by a)) in the chart in 160 steps are the U-shapes. The algorithm guesses that harmonic based on the channel width and the spectrum width. They are supposed to decrease the higher the frequency. If they don't, if one of the impulses is bigger than the previous one, a wave impulse is detected. This is for waves that by coincidence fall exactly in the same frequency as the U shapes. If this was the case (unfortunately I do not have an example spectrum), the U shape would continue where there are no channels (just noise), even though with less power (height of the shape would be much smaller).
These are called harmonic impulses. Non-harmonic impulses, see b), are the ones that are on a different frequency than the guessed one. Most of the time, from what I've seen, they are before 160. A significant impulse is one that has at least 45% the height of the last harmonic impulse. Now, all non-harmonic impulses that are higher than the last significant harmonic impulse are considered wave impulses.
A few additional constraints were added. First, a definition: The main harmonic is the first (and highest/most powerful) harmonic impulse.
Now, the following three conditionals deal with false positives:
If more than 75% of all channels are flat, the canditate wave is ignored
Low frequency waves are ignored (below index 5, so a few hundred kHz)
Low power waves are ignored (if the power is less than 70% of the main harmonic)
Future work:
Describe how to calculate the location (in metres) of the issue in the cable.
Yes, that's possible!
You have some data, you need to find a pattern in it.
The pattern is easy, your trained eye immediatly finds it - so it seems.
Getting to that 0% error rate in reality however is very tough, since there are always edge cases.
So when you see the following pattern, what kind of conditionals do you think of first?
You're using Python with its great scientific eco system. You have many tools at your disposal; numpy and scipy give you a simple derivative in one line.
What do?
For reference, this is a Resonant Peak impairment in a FBC downstream spectrum. A normal scan would not have that peak in the middle-right, but would simply be flat.
Taking a look at the official DOCSIS PNM best practices guide, causes for a Resonant Peak on a cable line could be: 'defective components, cold solder joints, loose modules (or module covers), loose or missing screws.'
Anyways, here's my methodology.
Before we dig deep, what is meant by a conditional? In above example, one such conditional could be the flatness of the channel in which the resonant peak has its peak. Basically the range around the peak.
There are cases of very flat 'resonant peaks' which are not resonant peaks. When the algorithm was still much too general, the following was getting classified as a resonant peak.
Let's fit a small linear regression in that range and check that the resulting line is flat - the slope should be something very low such as -0.1 or 0.2. The fitting error of the linear regression should be small too - we ought to be sure that the line represents the range. If it does not and the error is very huge, maybe the channel is not flat at all. Maybe the linear regression got distorted by outliers that cannot be fixed by L2 regularization.
Lastly, whenever I talk about separating, I mean separating a correct detection of a pattern (true positive) from a wrong detection in any way (false positive, false negative). A good separation should correctly return whether a sample contains a pattern (or some part of it) or not.
Now, this out of the way, onto the methods.
Create unittests. Find 'case studies' of interesting samples in your data that show off normal cases and edge cases. Do not forget to include important true negatives - scans that do not contain a pattern, but might (if your conditionals are not specific enough). Those true negatives come up while testing, when you find that your algorithm is too general and returns false positives. After fixing the false positives they turn into true negatives.
Your conditionals need to be as specific as possible, but also as general as possible. This is a contradiction, but bear with me. There are many possibilities, you can use derivatives, averages, percentiles, fitting to curves, anything your heart desires. But which one is the best. Use the one that separates the best. Often I see that an idea, like making sure the range around the resonant peak in the above example is not flat, just does not separate good enough. There are true positives with flatness 1 and error 1, and then there are false positives with flatness 1.1 and error 1 up to flatness 8. See what I mean? It is too close. Since I only see a small subset of the data, I deem this too risky.
The conditionals do not need to be perfect. It is good if you can simply reduce false positives. For example, the flatness above can be made very strict. This gets rid of half of the false positives, which is good enough. Maybe this particular conditional simply does not separate perfectly between true positive and true negative? It is fine. Of course, if you come up with a better conditional, you can get rid of the earlier, worse one. Do not forget that. With that, on to the last point.
Keep it as simple as possible. Get rid of conditionals that once worked but got replaced by better ones. There is more than just the error rate, such as efficiency - an algorithm that is too slow can only see limited use.
Here's a few additional notes on each point.
Regarding point 1. This can also be compared to labelling data and machine learning, but on an extremely small scale. Each unittest is a label - I looked at the sample manually and detected whether there is my pattern. Using initial unittests I compose a few conditionals with hardcoded constants. Afterwards, when I see a wrong detection, I adjust my conditionals.
Regarding point 2. A conditional that separates very well was the flatness one with wave. On the left side we see a correct wave. On the right side - a false positive. The algorithm sees the black line (since channel blocks are stitched together) and in the FFT this is apparently good enough to count as a wave. But, the majority of channels are just flat. Thus, checking for flatness across the spectrum gives us another good conditional.
Another good separator was the new filter algorithm. When fitting a tanh, the parameters of the tanh that represented a filer were wildly different from the ones without a filter. Perfect candidate for a conditional that separates very well.
Lastly, using percentiles rather than averages with noisy data is probably a better idea. It separates better!
Update: Checking the channels at the edges of the resonant peak too and making the flatness check much more strict (flatness < .6 and err < .3) makes this a much better separator on the given unittests. There is a lot of difference between true negatives (usually at least one channel at flatness < .4) and true positives (usually all channels above 1, often above 2). This is what we want: A lot of room for error.
A more recent and editorialized version of this post can be found here.
In the course of my work I had to detect filters in FBC (Full Band Capture) downstream spectrum scans. These spectrum scans may contain impairments which need to be detected. Usually, the operator implements a filter on purpose, to simply disable part of the spectrum. Example: Maybe there is a known impairment in the neighborhood that occurs whenever the cable is used close to its full capacity. Filtering part of the spectrum decreases the maximum possible bandwidth, but fixes the issue. Now, a filter is not exactly an impairment. Still, detecting it is useful. Since the shape is very pronounced, it is also not too difficult to achieve. Thus, let's take a look.
In the following the algorithm for detecting filters is presented. Then, it is shown that using a tanh is appropriate for representing a filter. Finally, an idea for future work is noted.
The filter algorithm works as follows.
First, a definition of tanh is needed. Since tanh is one of the trigonometric functions, the same transformation rules apply.
def tanh(self, x, a, b, c, d):
"""A tanh. Parameters:
a: Amplitude. Height.
b: Phase/horizontal shift.
c: Period. How flat/straight up is the tanh?
d: Vertical shift.
"""
return a * (1 + np.tanh((x - b) / c)) + d
Now, using scipy's curve_fit and supplying a few carefully crafted initial parameters for a, b, c and d (simply done by fitting a curve on a few scans with known filter impairments and using the average as the initial guess), a tanh can be fitted on a given sample. Then, the resulting tanh (its parameters) and the fitting error can be compared to a hard coded constant. In this case, this would be the initial guess extended by a range; e.g. the parameter c could then be in the range 80 to 150 to be a tanh that represents a filter impairment.
In this example, the following parameters and ranges make sense:
# The parameters a, b, c, d for changing the shape of a tanh function.
# See the tanh function above for detailed information on each parameter.
# These magic numbers were found empirically. Refer to the unit tests.
GUESSED_PARAMS = [23., 367., 128., -61.]
# Allowed deviation in the parameters of the tanh function and the guessed
# initial one, in both directions. For example, parameter a is in the range
# [17,29].
MAX_DEVIATION_PER_PARAM = np.array([6, 80, 70, 20])
To validate this algorithm, a clustering on 200000 scans was done using sklearn's MiniBatchKmeans and partially fitting 10000 samples at once. This results in roughly 2GB RAM usage compared to over 64GB (unknown how much was needed as it crashed) when using vanilla Kmeans.
Assuming 12 clusters (more ideas were tried – 12 seems roughly the best) the following cluster center for filters can be observed:
A small dataset of 208 samples was drawn from scans that the given MiniBatchKmeans model predicts to be in the above cluster. This dataset was used for adapting the algorithm (to support steeper filters) and added as a unittest.
Drilling down, it can be seen that most filters have this form, which can be represented by a tanh.
Unfortunately it seems the above cluster still contains a few false positives – a few samples are not filters at all. To make this exact, the 12 clusters would need adjusting. This could be future work. However, for the purpose of creating a small dataset of filters, this is irrelevant.
This is a short note on the parameters of DBSCAN for 2-dimensional data, e.g. FBC spectrum scans.
The data is a vector of amplitudes (in dB) of length 8704. DBSCAN needs to find neighbors for each point and thus needs a distance metric between two vectors.
The default metric used by scipy's DBSCAN is euclidean, i.e. each point qi of vector1 is compared to point pi of vector2 in the following manner:
$\sqrt{\sum_{i = 1}^{n}(q_i - p_i)^2}$
The eps parameter of DBSCAN is exactly that value above (when using the euclidean metric). Now, to find a sensible eps, a simple rough calculation shows, given that 5 dB is the maximum distance allowed between two points to be neighbors: (8704 * 5**2) ** .5 ~= 466. This is a first candidate for a good eps value.
Empirically it can be seen that any value between 100 and 2000 works fine for vectors of length 8704.
As for the second important parameter, min_samples, the following can be said. First, a core point needs to be defined. This is a point which has more than min_samples neighbors in its range (restricted by eps). The important thing to remember here is that only the neighbors of core points are considered in each iteration as candidates for expansion of a cluster. So if min_samples is huge, the clusters will be smaller and more numerous and vice versa. In conclusion, this parameter is the one to tune in terms of the resulting number of clusters. It can be compared to the parameter k of kmeans.
I just want to be contrarian.
Fearmongering all across media^{1}, yield curve and foreign investment, trade war, germany close to a recession^{2}, corporate debt^{3}^{7}.
Both sides have valid points. But mostly I see caution and restraint all around. Where is the irrational exuberance^{4}?
The good thing is, I am not that heavily invested right now.
So it is really exciting to be part of history and follow the markets, but with not that much risk.
I am extremely interested in what the black swan event will be this time.
Since yields are down everywhere^{5}, which is unprecedented, pension funds need to get their yield from somewhere. Maybe this is where irrational exuberance may come from? But this would probably be years away..
It's funny, I've started doing some kind of 'meta' tasks:
If I recognize that I'm procrastinating on a task, I write down:
'Find subtask for $bigtask'
And then some time later I come up with a subtask and write it down:
'Do $subtask'
And that's it, after a while I do the subtask (since it's usually small) and effectively get started on the bigger task.
And when you've started, you've already done the grunt work of defeating procrastination.
It's important to put in thought while creating todo points. If you invest 5 minutes in thinking about what subtasks you write down, it'll potentially save you hours in the long run.
This is a small summary of part 2 of the book "The Elephant in the Brain".
Body language: We are blind to our body language because it betrays our selfish, competitive nature
Laughter: Signal "I feel safe" while playing games (also comedians play: they violate norms and talk about taboos but do this in playful manner)
Conversation: Signal how knowledgeable one is, for mating and alliance purposes
Consumption: Signalling to others, mating purposes
Art: Signalling that resources can be wasted on art, mating and alliance purposes
Charity: Signalling, alliance purposes (as friend one is more likely to help if very charitable)
Education: Signalling, just get the certificate
Medicine: Often just to show loyalty to the patient (eg neighbor brings selfmade cake, not store-bought one to show care), signals others how much political support one has (in evolutionary context, eg 1 mio years ago), supporters hope for return support if they ever get sick
Religion: Community, costly rituals and giving up resources to show that one's committed so one becomes trustworthy and part of community, in community best chance of survival
Politics: Voters vote for their group, not for oneself, they signal loyalty to their group (to various degree)
The topic is PCA - principal component analysis.
We'll look at an example and have a few exercies + solutions at the end of the post.
Intuition
If we want to reduce the dimension of data (e.g. from 2D to 1D) due to space constraints, easier calculations or visualization, we want to lose as little information about our data as possible.
Take a look at the charts below. We will project (the cyan lines) our data (the cyan points) onto the blue line, i.e. we will reduce the 2D coordinate system to the 1-dimensional line.
Now which projection explains our data better?
After the projection, on chart A we retain the information that the two pairs are far away from each other. On chart B we do not.
Another way of reasoning about this is to say: The variance of the distances (black arrows) is higher on chart A. So the higher the variance, the better explained our data by our projection.
Let's define a few things.
A principal component (PC) is an axis. In our example: The blue line is our axis.
The blue line in chart A is PC1, the one in chart B is PC2. PC1 is the best principal component in terms of data explanation.
PCA maximizes the sum of squared distances of the projected points to the origin to find the PC with highest variance.
That sum is the eigenvalue of the PC. Variance of a PC: Eigenvalue / (n-1). Thus the highest eigenvalue equals the best PC.
The normalized vector on the PC line is the eigenvector, representing the direction our PC is going.
So in conclusion:
The eigenvector with the largest eigenvalue is the direction along which our data has maximum variance, i.e. the data is maximally explained.
If we have two eigenvalues 6 and 2, then the PC for eigenvalue 6 explains 6/(6+2) = 75%. (because eigenvalues are connected to variance)
In our example: PC2 is perpendicular to PC1.
In a 3D example: PC3 would be perpendicular to the plane of PC1 and PC2.
Steps of PCA
Mean normalization
Compute Covariance matrix
Compute eigenvectors/values with said matrix
Select top k eigenvalues and their eigenvectors
Create orthogonal base with the eigenvectors
Transform data by multiplying with said base
Recipe
1. Calculate covariance.
Formula:
a = A - 11*A / n
Cov(A) = a'a / n
2. Solve det(M-Iλ) = 0.
3. For both λ₁ and λ₂ solve:
M * [ x = λ₁ * [ x
y ] y ]
4. Put x=1 and convert to unit vector: (x**2 + y**2)**.5 = 1.
5. Orthogonal base consists of both eigenvectors side by side:
[ x₁ x₂
y₁ y₂ ]
6. Take first eigenvector and apply to a (from step 1).
a * [ x₁
y₁ ]
This is the transformed data.
1.
a)
Consider we conduct a PCA on a two-dimensional data set and we get the eigenvalue 6 and 2. Draw a distribution of sample points that may give rise to this result. Also, draw the two eigenvectors.
b)
Consider 3 data points in a 2-dimensional space R**2: (-1, 1), (0, 0), (1, 1).
What's the first principal component of the given dataset?
If we project the original data points onto the 1-dimensional subspace spanned by the principal component you choose, what are their coordinates in this subspace? What is the variance of the projected data?
For the projected data you just obtained above, now if we represent them in the original 2-dimensional space and consider them as the reconstruction of the original data points, what is the reconstruction error (squared)? Compute the reconstruction of the points.
2.
a)
Name four steps for performing a PCA.
b)
Suppose we perform PCA on a two-dimensional dataset and it yields two eigenvalues which are equal.
What does it mean regarding the importance of the dimension? Would pursuing a dimensionality reduction be a good choice? Please explain.
Sketch a dataset where its two eigenvalues would have the same size.
c)
Given the data points below, would PCA be capable of identifying the two lines? Sketch the principle axes. (please view with monospace font)
i)
How? I divided the eigenvalue 6 into 6 points (because 4 points would've been weird to calculate).
And set their distance to 1 from the proposed principal component line, ie they're at x = -1 and x = 1.
This way their SS is 1+1+1+1+1+1 = 6.
Now to get the eigenvalue 2, we need to set y values to sqrt(.5).
Why? Because two points are ON the second PC, we only care for four points.
So to get eigenvalue 2: sqrt(.5)**2 = 0.5. And 0.5 * 4 = 2.
b)
It's a line f(x) = .5
Their coordinates: [-1, 0], [0, 0], [1, 0]
Variance: 1
Reconstruction error: 2
2.
a)
Normalize input
Calculate covariance matrix
Calculate eigenvectors of said matrix
Eigenvector with greatest eigenvalue is our first principal component.
b)
Both dimensions are equally important. Pursuing dimensionality reduction would be bad because we lose a lot of data/the data becomes very skewed.
We lose 50% explanation of the data.
Dataset: [0, 0], [1, 0], [0, 1], [1, 1]
c)
If we fit the PCs to the apparent 'lines', one line will contribute ~0 variance to the total variance of the PC.
Now if we set the PCs as horizontal/vertical axes, both lines will contribute, maximizing the total variance.
So PCA will not find the apparent 'lines'.
Also, both principal components will have the same importance.
Following scenario:
We have 4 workers and they want to generate unique ids.
Usually you'd need a central server that they get the ids from.
That central server would, for example, just use the current time in ms plus a counter if we get multiple messages in the same ms.
With snowflake format, we additionally add a worker id.
So let's say each worker has same time in ms. Due to the additional worker id, we already have 4 unique ids!
If time in ms is different, we already have unique ids no matter what.
And if we get multiple messages at the same time, we have our additional counter.
If that counter runs out (it's usually rather small, like 4096), we simply wait 1 ms.
But think about it, 4000 messages per millisecond?
So there you have it, distributed unique ids!
For a technical breakdown on which bits are used for what in the snowflake format, see the sources below.