Data Preparation is Everything - Parsing The German Commercial Register (Handelsregister)

# Data Preparation is Everything - Parsing The German Commercial Register (Handelsregister)

There are three major issues with the commercial register (Handelsregister) data. First of all, every company has their own rules, e.g. in terms of appointments. Who is allowed to be appointed, when and why and with whose permission.. Thousands of combinations. Secondly, our clerks do not seem to follow the format 100%. And they make spelling mistakes sometimes. But the most glaring one is: Each register entry is ONE line of TEXT. And there is no official delimiter it seems. The clerks all kind of go with a few loose rules but yea - it is definitely not defined by a RFC or something lol. So what does this leave us with. LOTS of Regex and special rules, and it is pretty painful. But this should be known to any serious ML practitioner, that most of the work lies in data preparation.

For instance, I want you to look at this regex (which is just a small part of the data preparation pipeline):

(Die |Ist).*(ein|zwei)?.*(Gesellschaft|Vereinigung|Genossenschaft|vertritt|Vorstand|Geschäftsführer).*(ein|zwei)?.*
(vertreten|Prokuristen|vertreten.*gemeinsam|vertreten.*gemeinschaftlich|einzelvertretungsberechtigt|Vertretungsbefugnis von der Generalversammlung bestimmt)

It might not be perfect, but it gets the job done. So anyways, enough of the data preparation pipeline, how about a first result. While doing some descriptive statistics (data exploration), the first thing that interested me was the number of registrations per year over time. Check it out:

2015 we had a near recession with oil dropping and the Chinese market crashing, so it makes sense that in bad times less businesses were registered. But what is extremely surprising - in 2020 we had slightly more registrations than in 2019! What the hell? Seems like the 2020 global recession was kind of a fake recession for a certain chunk of the population.

Another thing that is interesting, but not plotted here (since it would look boring) - month over month the number of registrations are roughly the same. So there is no seasonality here. Quite interesting!

Just a note regarding the 2007 data (which is roughly 130 registrations) - the available data starts at the very end of 2007. Thus it looks like there are nearly no registrations in 2007, but in reality it's not a full year.

I will be working with this data some more and I will update this post (or create a new one). Specifically I am looking at doing Latent Semantic Analysis (LSA) and using the gensim library. Stay tuned!

Update: I have done some more analysis, specifically LSA. Note: The next few paragraphs will contain a lot of German words.

An important task is getting rid of small and common words such as 'der' (engl.: the), else one would throw off (dis)similarity measures. So for me this currently looks like that:

und oder der die das von zur nur so einer einem ein allein darum dazu für dafür
zum zu dort da grundkapital stammkapital eur dem usd gbp im mit an dem bei art
rahmen vor allem aller sowie aus in den als

These are often called 'stopwords', and we simply filter them out, which improves the LSA model considerably - we get more confident topic assignments.

Notice words like 'Stammkapital' (base capital) - they may look like they are important to the text, but if literally every text has Stammkapital included, it does not really give us any more information.

A seemingly important hyperparameter (similar to the number of clusters in clustering) is the number of topics. I am currently experimenting a bit here. Having too small of a number of topics seems to make most topics include stuff like 'Halten' (holding) which is a very important word since there are a lot of holding companies. Too many topics has duplicates as follows:

66 ['it', 'consulting', 'management', 'marketing', 'insbesondere', 'bereichen', 'it-dienstleistungen', 'gegenstand', 'vermarktung', 'zubehör']
67 ['marketing', 'insbesondere', 'dienstleistungen', 'it', 'in-', 'gegenstand', 'ausland', 'consulting', 'it-beratung', 'deren']

And then there are weird combinations such as the following. IT services and construction?!

81 ['an-', 'it-dienstleistungen', 'verbundenen', 'allen', 'trockenbau', 'alle', 'geschäften', 'nebst', 'damit', 'kauf']

Anyways, I will keep updating this post with new findings.

Published on 2021-02-07