At Data8 we occasionally get some very complex requests which present a challenge to our developers requiring them to conjure up new technology from the ether. One recent example was the ‘ApplyRules’ code, which we’ve used on a few data cleansing jobs where a cleanup using complex matching routines just wouldn’t cut it. And cutting is exactly what ApplyRules does.
Let’s say you haven’t used one of our data validation tools, and have a text field where your users can throw anything into the pot and end up with a CompanyName field holding records like:
‘ABC Company Ltd. UK – 0151111222 (talk to Dave)’.
Somewhere else in the same dataset, you’ve got ‘ABC COMPANY’ and despite the fact you as a human can tell they’re the same company, software which could export or display the data as a pivot table (Excel, we’re looking at you here!) would have no idea they’re the same company. This will result in the companies being split and false information being reported – not great for a board report!
How would you handle this situation? Our first step was to look into the data and build up some rules which we could apply across the data set in a bid to make them a little more consistent and help undo those years of data neglect.
In the case of ‘ABC Company Ltd. UK – 0151111222 (talk to Dave)’ the good news is your users employed brackets consistently for note-taking. We decided that anything in brackets can be ditched, creating a RemoveBrackets rule that strips anything in brackets (including the brackets themselves). We’re down to:
‘ABC Company Ltd. UK – 0151111222’
Your dataset doesn’t have many company names that are just numbers, so we create a RemoveNumbers rule that strips out anything, say, with five or more digits. This would leave ‘2001 A Cake Odyssey’ alone, but would delete the phone number from our example.
‘ABC Company Ltd. UK – ‘
The RemovePunctuation rule removes any punctuation with spaces either side within the string, as we decide that within overall dataset there is no viable reason for this in company names. This would affect our example by removing the hyphen at the end of the string but would not affect ‘Data-8’ as there are no spaces between the hyphen and the preceding or following text.
‘ABC Company Ltd UK’
Brevity is the soul of good data, so we again agree that any ‘UK’ at the end of a company name can be trimmed, so TrailingCountry removes any country name or abbreviation at the end of a string. ‘UK United Exports’ would be left alone, but ‘United Exports UK’ becomes ‘United Exports’.
‘ABC Company Ltd’
And finally we decide that trailing business suffixes – the Ltd., the Co. and the Bulgarian O.O.D. – just don’t make a difference to that board report. Our BusinessIdentifiers rule drops anything at the end of a string that matches a very long global list, from AB (Sweden) to ZOO (Bulgaria).
Great! ‘ABC Company’ matches any other ABC Company, whether that string started as ‘ABC Company Ltd. UK – 0151111222 (talk to Dave)’ or something much, much worse. And your board can have faith in that report.