You have some data, you need to find a pattern in it. The pattern is easy, your trained eye immediatly finds it - so it seems. Getting to that 0% error rate in reality however is very tough, since there are always edge cases. So when you see the following pattern, what kind of conditionals do you think of first?
For reference, this is a Resonant Peak impairment in a FBC downstream spectrum. A normal scan would not have that peak in the middle-right, but would simply be flat. Taking a look at the official DOCSIS PNM best practices guide, causes for a Resonant Peak on a cable line could be: 'defective components, cold solder joints, loose modules (or module covers), loose or missing screws.'
Anyways, here's my methodology. Before we dig deep, what is meant by a conditional? In above example, one such conditional could be the flatness of the channel in which the resonant peak has its peak. Basically the range around the peak. There are cases of very flat 'resonant peaks' which are not resonant peaks. When the algorithm was still much too general, the following was getting classified as a resonant peak.
- Create unittests. Find 'case studies' of interesting samples in your data that show off normal cases and edge cases. Do not forget to include important true negatives - scans that do not contain a pattern, but might (if your conditionals are not specific enough). Those true negatives come up while testing, when you find that your algorithm is too general and returns false positives. After fixing the false positives they turn into true negatives.
- Your conditionals need to be as specific as possible, but also as general as possible. This is a contradiction, but bear with me. There are many possibilities, you can use derivatives, averages, percentiles, fitting to curves, anything your heart desires. But which one is the best. Use the one that separates the best. Often I see that an idea, like making sure the range around the resonant peak in the above example is not flat, just does not separate good enough. There are true positives with flatness 1 and error 1, and then there are false positives with flatness 1.1 and error 1 up to flatness 8. See what I mean? It is too close. Since I only see a small subset of the data, I deem this too risky.
- The conditionals do not need to be perfect. It is good if you can simply reduce false positives. For example, the flatness above can be made very strict. This gets rid of half of the false positives, which is good enough. Maybe this particular conditional simply does not separate perfectly between true positive and true negative? It is fine. Of course, if you come up with a better conditional, you can get rid of the earlier, worse one. Do not forget that. With that, on to the last point.
- Keep it as simple as possible. Get rid of conditionals that once worked but got replaced by better ones. There is more than just the error rate, such as efficiency - an algorithm that is too slow can only see limited use.
Here's a few additional notes on each point. Regarding point 1. This can also be compared to labelling data and machine learning, but on an extremely small scale. Each unittest is a label - I looked at the sample manually and detected whether there is my pattern. Using initial unittests I compose a few conditionals with hardcoded constants. Afterwards, when I see a wrong detection, I adjust my conditionals. Regarding point 2. A conditional that separates very well was the flatness one with wave. On the left side we see a correct wave. On the right side - a false positive. The algorithm sees the black line (since channel blocks are stitched together) and in the FFT this is apparently good enough to count as a wave. But, the majority of channels are just flat. Thus, checking for flatness across the spectrum gives us another good conditional.
Update: Checking the channels at the edges of the resonant peak too and making the flatness check much more strict (flatness < .6 and err < .3) makes this a much better separator on the given unittests. There is a lot of difference between true negatives (usually at least one channel at flatness < .4) and true positives (usually all channels above 1, often above 2). This is what we want: A lot of room for error.