
# Algorithmic Design for pattern matching in 2-dimensional data

You have some data and you need to find a pattern in it. The pattern is easy; your trained eye finds it immediately - or so it seems. Getting to that 0% error rate in practice, however, is very tough, since there are always edge cases. So when you see the following pattern, what kind of conditionals do you think of first?

You're using Python with its great scientific ecosystem. You have many tools at your disposal; numpy and scipy give you a simple derivative in one line. What do?

For reference, this is a Resonant Peak impairment in an FBC (full band capture) downstream spectrum. A normal scan would not have that peak in the middle-right, but would simply be flat. According to the official DOCSIS PNM best practices guide, causes for a Resonant Peak on a cable line include 'defective components, cold solder joints, loose modules (or module covers), loose or missing screws.'

Anyway, here's my methodology. Before we dig deep, what is meant by a conditional? In the above example, one such conditional could be the flatness of the channel in which the resonant peak has its maximum - basically the range around the peak. There are cases of very flat 'resonant peaks' which are not resonant peaks at all. When the algorithm was still much too general, the following was being classified as a resonant peak.

Let's fit a small linear regression in that range and check that the resulting line is flat - the slope should be something very low, such as -0.1 or 0.2. The fitting error of the linear regression should be small too - we want to be sure that the line actually represents the range. If it does not and the error is huge, maybe the channel is not flat at all, or maybe the regression got distorted by outliers that cannot be fixed by L2 regularization. Lastly, whenever I talk about separating, I mean separating a correct detection of a pattern (true positive) from a wrong one (false positive, false negative). A good separator correctly returns whether a sample contains the pattern (or some part of it) or not. With that out of the way, on to the methods.
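
As a rough sketch, such a flatness conditional could look like the following in numpy; the function name, the mean-squared-error choice and the thresholds are illustrative, not the exact constants used here.

```python
import numpy as np

def range_is_flat(values_db, slope_tol=0.2, err_tol=0.3):
    """Fit a line to the amplitude values around the peak and report whether
    the range is flat: small absolute slope and small fitting error.
    slope_tol / err_tol are illustrative constants, not the tuned ones."""
    x = np.arange(len(values_db))
    slope, intercept = np.polyfit(x, values_db, deg=1)
    fitted = slope * x + intercept
    err = float(np.mean((values_db - fitted) ** 2))  # mean squared fitting error
    return abs(slope) <= slope_tol and err <= err_tol
```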

  1. Create unit tests. Find 'case studies' of interesting samples in your data that show off normal cases and edge cases. Do not forget to include important true negatives - scans that do not contain the pattern, but might get flagged if your conditionals are not specific enough. Those true negatives come up while testing, when you find that your algorithm is too general and returns false positives; after fixing them, they turn into true negatives. A minimal test sketch follows right after this list.
  2. Your conditionals need to be as specific as possible, but also as general as possible. This is a contradiction, but bear with me. There are many possibilities: you can use derivatives, averages, percentiles, curve fits, anything your heart desires. But which one is best? Use the one that separates best. Often I find that an idea - like making sure the range around the resonant peak in the above example is not flat - just does not separate well enough. There are true positives with flatness 1 and error 1, and then there are false positives ranging from flatness 1.1 and error 1 up to flatness 8. See what I mean? It is too close. Since I only see a small subset of the data, I deem this too risky.
  3. The conditionals do not need to be perfect. It is already good if you can simply reduce false positives. For example, the flatness check above can be made very strict. That gets rid of half of the false positives, which is good enough. Maybe this particular conditional simply does not separate perfectly between true positives and true negatives? That is fine. Of course, if you come up with a better conditional, you can drop the earlier, worse one. Do not forget that. With that, on to the last point.
  4. Keep it as simple as possible. Get rid of conditionals that once worked but have been replaced by better ones. There is more to optimize than just the error rate, such as efficiency - an algorithm that is too slow can only see limited use.
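
To make point 1 concrete, here is a minimal sketch of what such labelled case studies can look like as unit tests. `detect_resonant_peak` is a placeholder for whatever entry point your detector exposes, and the synthetic arrays stand in for real captured scans.

```python
import unittest
import numpy as np

from detector import detect_resonant_peak  # placeholder import for the detector under test

class TestResonantPeak(unittest.TestCase):
    def test_clear_peak_is_detected(self):
        # True positive: flat spectrum with a narrow bump in the middle-right.
        scan = np.zeros(1000)
        scan[650:700] += 10.0 * np.hanning(50)
        self.assertTrue(detect_resonant_peak(scan))

    def test_flat_noisy_scan_is_rejected(self):
        # True negative: a scan that an overly general rule once flagged as a peak.
        scan = np.random.normal(0.0, 0.1, size=1000)
        self.assertFalse(detect_resonant_peak(scan))

if __name__ == "__main__":
    unittest.main()
```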

Here are a few additional notes on each point. Regarding point 1: this can be compared to labelling data for machine learning, but on an extremely small scale. Each unit test is a label - I looked at the sample manually and decided whether it contains my pattern. Using the initial unit tests I compose a few conditionals with hardcoded constants; afterwards, whenever I see a wrong detection, I adjust the conditionals. Regarding point 2: a conditional that separated very well was the flatness check for the wave pattern. On the left side we see a correct wave; on the right side, a false positive. The algorithm sees the black line (since channel blocks are stitched together), and in the FFT this is apparently good enough to count as a wave. But the majority of channels are just flat, so checking for flatness across the spectrum gives us another good conditional.
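
A sketch of that 'flatness across the spectrum' conditional, assuming the capture is split into channel-sized blocks; the block width, thresholds and the commented usage are illustrative assumptions.

```python
import numpy as np

def fraction_of_flat_channels(spectrum_db, channel_width=60, slope_tol=0.2, err_tol=0.3):
    """Split the capture into channel-sized blocks and count how many are flat.
    A genuine downstream capture is mostly flat; a stitched-together artefact
    that merely looks like a wave is not."""
    blocks = [spectrum_db[i:i + channel_width]
              for i in range(0, len(spectrum_db) - channel_width + 1, channel_width)]
    flat = 0
    for block in blocks:
        x = np.arange(len(block))
        slope, intercept = np.polyfit(x, block, deg=1)
        err = np.mean((block - (slope * x + intercept)) ** 2)
        if abs(slope) <= slope_tol and err <= err_tol:
            flat += 1
    return flat / max(len(blocks), 1)

# Example conditional: only trust a wave detection if the spectrum is *not* mostly flat,
# where fft_says_wave() stands in for whatever FFT-based check produced the detection.
# is_wave = fft_says_wave(spectrum) and fraction_of_flat_channels(spectrum) < 0.5
```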

Another good separator came from the new filter algorithm. When fitting a tanh, the parameters for a spectrum that contained a filter were wildly different from those of one without a filter - a perfect candidate for a conditional that separates very well. Lastly, with noisy data, using percentiles rather than averages is probably the better idea. It separates better!
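
For the tanh-based filter check, a sketch along these lines is plausible; the model parametrization, the initial guess and the decision thresholds are assumptions, not the actual implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def tanh_model(f, a, b, c, d):
    # a: step height, b: steepness, c: transition frequency, d: baseline offset
    return a * np.tanh(b * (f - c)) + d

def looks_like_filter(freqs, amplitudes_db):
    """Fit a tanh roll-off to the spectrum; a filter shows up as a large, steep
    step, i.e. fitted parameters that differ wildly from those of a flat scan."""
    p0 = [10.0, 0.01, float(np.median(freqs)), float(np.median(amplitudes_db))]
    try:
        (a, b, c, d), _ = curve_fit(tanh_model, freqs, amplitudes_db, p0=p0, maxfev=5000)
    except RuntimeError:
        return False  # no convergence: treat as "no filter"
    return abs(a) > 5.0 and abs(b) > 0.005  # illustrative thresholds
```

For the percentile point, swapping something like `np.mean(channel)` for `np.percentile(channel, 95)` is the kind of change meant here: a high percentile is far less sensitive to a few noisy bins than the average.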

Update: Also checking the channels at the edges of the resonant peak and making the flatness check much stricter (flatness < 0.6 and err < 0.3) makes this a much better separator on the given unit tests. There is a large gap between true negatives (usually at least one channel at flatness < 0.4) and true positives (usually all channels above 1, often above 2). This is what we want: a lot of room for error.
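
A rough sketch of that stricter check, assuming per-channel flatness is measured with the same linear fit as before and using the 0.6 / 0.3 thresholds mentioned above:

```python
import numpy as np

def channel_flatness(channel_db):
    """Return (|slope|, mean squared fitting error) of a linear fit to one channel."""
    x = np.arange(len(channel_db))
    slope, intercept = np.polyfit(x, channel_db, deg=1)
    err = float(np.mean((channel_db - (slope * x + intercept)) ** 2))
    return abs(slope), err

def edge_channels_rule_out_peak(edge_channels, flatness_tol=0.6, err_tol=0.3):
    """Reject the resonant-peak candidate if any channel bordering it is genuinely flat."""
    return any(flatness < flatness_tol and err < err_tol
               for flatness, err in map(channel_flatness, edge_channels))
```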
