How to clean your data before the algorithms catch a cold

Two views with a common message for trust in good data

The blue data elephant in your database room

Here we take a look into the world of clean data, with its benefits and risks. On one side we have the tips and advice, and on the other we have an eyewitness account from Sandra, our virtual voice and a byte of data sitting in your database.

I sometimes feel as though business leaders guard their data like Gollum in The Lord of the Rings, because they love their “precious” data so much. This blind, protective love can be the start of a decline for a business. Just like Gollum, if you love it too much you will suffer personally, all the way to your ultimate doom. On a more optimistic note, we have a cure. It requires a level of data vigilance and discipline not seen before. In return, we can look forward to clear metrics, better decisions and a path to successful algorithm deployment.

Imagine you are 30 kilos overweight and now need to go on a life-changing health regime. Most will not make it, yet for those that do, the results will be stunning. So, how do you know if you need to go through such a data purge and cleaning exercise to facilitate the optimal machine learning and advanced algorithms capable of predicting the best time to grab a coffee and meet your boss in a good mood for a promotion, pay rise, new holiday entitlement and a free Danish pastry? As some people say, one person’s mess is another person’s gold. How can you tell whether you have dirty data to spruce up?

Meet Sandra - The virtual voice of our data byte.

pssst, shhh Hey you, yes you over there. I work in the data cellar and I have some terrible news for you. All your graphs that you use, it’s all fake. Yes, like fake news, we have fake data, fake projections, and fake analytics. It’s awful down there, we see half records walking around with no friends, and there are many clones (you call them dupes or duplicates I think). And all this is starting to have an impact on the good guys. Yes, those clean records are now just getting a bad name, a bad reputation. Everyone is calling it fake news for everything. But hey, I have a solution, an idea that we can turn the bad guys around. Sure I know it will mean eliminating one or two, but the whole community will benefit. Are you in? Will you join me? I’m not saying this is going to be easy, I’m not saying once this is done that all your problems are solved forever. I’m just saying we have a way to put things right and head down a better path.

If you don’t know what your data will be used for, your business will have a much harder time knowing where to even start a robust clean-up process.
— James Doyle - JAMSO
Discover the usual suspects

Dirty data and the usual suspects

In the lineup of guilty culprits for dirty data, we have the usual suspects. We know who they are, we have seen them, and maybe we have even helped them grow and prosper a few times, but their impact on the business can become costly.

  • Partially filled in records (missing house number, postcode)

  • Duplicate entry of the same record or event

  • Variations of the same status code (sale=sold=won=customer or Glos, GL, Gloucestershire, South West)

  • Formatting issues (4foot, 1.2m, 122cm, 1219.2mm or 1.9.2017 vs 9.1.2017 depending on whether you are USA or Europe based)
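The suspects above can be caught with very little code. What follows is a minimal sketch, not a definitive implementation: the `STATUS_ALIASES` mapping, the list of units and the function names are all assumptions made for illustration, and which labels really mean the same status is a decision for your business.

```python
import re

# Hypothetical mapping: which raw labels count as one status is a business decision.
STATUS_ALIASES = {
    "sale": "won",
    "sold": "won",
    "won": "won",
    "customer": "won",
}

# Conversion factors to millimetres for the units seen in the examples above.
MM_PER_UNIT = {"mm": 1.0, "cm": 10.0, "m": 1000.0, "foot": 304.8, "ft": 304.8}

LENGTH_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z]+)\s*$")


def normalize_status(raw):
    """Return the canonical status, or None when the label is unknown."""
    return STATUS_ALIASES.get(raw.strip().lower())


def length_to_mm(raw):
    """Parse strings such as '4foot' or '1.2m' into millimetres."""
    match = LENGTH_PATTERN.match(raw)
    if match is None:
        raise ValueError(f"unrecognised length: {raw!r}")
    value, unit = match.groups()
    unit = unit.lower()
    if unit not in MM_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return float(value) * MM_PER_UNIT[unit]


def is_ambiguous_date(raw):
    """True when 'a.b.yyyy' could be read as day-first or month-first."""
    first, second, _year = (int(part) for part in raw.split("."))
    return first <= 12 and second <= 12 and first != second
```

Unknown labels return `None` rather than a guess, so they can be queued for a human to review. A date such as 1.9.2017 is flagged as ambiguous instead of being silently read one way or the other.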

These are just the start, yet already you can see the potential fun and laughter that your server microprocessors will have when they attain consciousness and review such a mess. It will be like a parent not going into a teenager’s bedroom for a month, thinking everything is fine and in order, only to discover a mess beyond normal comprehension.

Understanding the completeness of full data records.

Metrics, KPIs and results are generated from the data and information made available. If you do not record the needed information fully for each record, then the quality of the insights, and of the decisions based on them, is at risk. So, how will your business define completeness? Here we need to take a serious periodic review of the wishes, wants and needs for each record. Some needs are driven by regulation, process or the basic ease of management. The wants and wishes become plentiful when you ask different departments and stakeholders. A data list can swiftly get out of control, and this will impact the design structure and potential integrity of your data.

Completed data sets

A clear scope of data completeness is best managed with time limits. I like to suggest that a growing data set has specific tolerance bands that render its status as “complete”. So, at the first entry point there will be some specific minimal fields to be filled in, and then, as the record migrates through its use within the business, other data becomes required. This generates staggered levels of data quality, as each source may offer different levels of insight at the time of recording or entry. Simply set the rules and be aware of your classification.
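One way to picture these staggered rules is a table of required fields per stage. This is a sketch under stated assumptions: the stage names (`lead`, `quote`, `order`) and the field names are hypothetical, and your own stages and minimal fields will differ.

```python
# Hypothetical stages and field names; each business sets its own rules.
REQUIRED_BY_STAGE = {
    "lead": ["name", "email"],
    "quote": ["name", "email", "postcode"],
    "order": ["name", "email", "postcode", "house_number", "payment_terms"],
}


def missing_fields(record, stage):
    """List the fields required at this stage that are empty or absent."""
    return [
        field
        for field in REQUIRED_BY_STAGE[stage]
        if record.get(field) is None or str(record.get(field)).strip() == ""
    ]


def is_complete(record, stage):
    """A record is 'complete' only relative to the stage it has reached."""
    return not missing_fields(record, stage)


record = {"name": "A. Customer", "email": "a@example.com", "postcode": " "}
print(is_complete(record, "lead"))      # True
print(missing_fields(record, "order"))  # ['postcode', 'house_number', 'payment_terms']
```

The same record is complete as a lead and incomplete as an order, which is the classification to stay aware of.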

Wow, did you get that? That means all those empty fields across all the data we record are important. Deep down in the data dungeon is a pool of data zeros wanting to become more positive and turn into ones. Do you think you can help? Great, so let’s walk further into this exploration of data.

Be consistent - always do the same things and keep on doing them

OK, you get the message. Consistency counts and matters, but why? To make judgments and decisions correctly, the data behind them needs to be produced in a planned and repeatable way. The actions of gathering data, calculating data, rounding data, the timing of data collection, the size of data sets and so on should all be as consistent as possible. That is not to say that these cannot or should not vary at all. The message is to ensure tolerance bands and methods are in place to understand the potential levers, triggers, and causes of data variation.

For example: Taking images from a door entry feed at peak access times in the mornings does not reflect the fact that there are also low access hours for most of the day (think opening hours, school, work and even pub closing times can impact these data). So, being aware of these variations means the structure and consistency of method are critical to providing the correct information for informed decisions.

OMG, is this guy serious? I mean, on one hand he says to be consistent, but on the other hand he says variation is also OK? Let me think a little…. ahh hah, OK, I get it. Consistency on what matters the most to get the right picture. Being aware of areas of inconsistency is as important as being consistent in the first place. Yes OK, you are right, fake news and false assumptions can be created otherwise.

Accuracy and repeatability of input sources

There is much spoken about the accuracy of a measurement, yet little about the repeatability of that accurate measure. So what is the difference? Well, imagine a sniper. They “dial in” their weapon to account for the distance, wind resistance and other factors that can impact the shot. They take their shot and strike a mark with a ± value from the bull’s eye. That is accuracy. Repeatability is how much that accuracy varies over 50 shots in identical circumstances. This is how we discover the tolerance level of the actual measurement equipment and any unplanned variations. In the case of the sniper, the firing itself heats the barrel, so does the barrel temperature impact the accuracy after 10 shots? The same applies to your data sources. Are your sensors accurate and repeatable with their measures?
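The difference between the two can be put into numbers. As a minimal sketch, assume a sensor is checked repeatedly against a known reference value: the bias (mean reading minus the true value) then stands in for accuracy, and the sample standard deviation stands in for repeatability. The function name and the readings below are invented for illustration.

```python
from statistics import mean, stdev


def accuracy_and_repeatability(readings, true_value):
    """Return (bias, spread) for repeated readings of one known reference.

    bias   = mean reading minus the true value (accuracy)
    spread = sample standard deviation of the readings (repeatability)
    """
    if len(readings) < 2:
        raise ValueError("need at least two readings to estimate spread")
    return mean(readings) - true_value, stdev(readings)


# Invented readings of a 100.0 unit reference from two hypothetical sensors.
sensor_a = [100.9, 101.1, 101.0, 100.9, 101.1]  # tight cluster, off target
sensor_b = [98.0, 102.5, 99.0, 101.5, 99.0]     # centred, widely scattered

bias_a, spread_a = accuracy_and_repeatability(sensor_a, 100.0)
bias_b, spread_b = accuracy_and_repeatability(sensor_b, 100.0)
```

Sensor A is repeatable but not accurate, so a fixed correction may be enough. Sensor B is accurate on average but not repeatable, so any single reading from it deserves a wide tolerance band.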

Reports that say that something hasn’t happened are always interesting to me because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult ones
— Donald Rumsfeld


Yes, we see the confusion between noise and signal all the time down here amongst the data. The challenge is to understand which information sets are right and which are not. It’s almost like going to a party about classic cars: we all have a common theme, but our opinions will vary considerably based on preferences, experiences, and learning.

If you don’t use it then lose it also applies to data collection.

This is the one part of data science that can start a bun fight, cause hot tempers and send normally mild introverts into a storming rage. On one side of the argument is the old position and belief that if you do not use your data, then you should not spend time, resources and administrative effort on keeping it. JAMSO falls on the other side of this argument. We only agree to such a purge if, after review, no significant medium or long term benefit is likely to be generated from the data. The reason for our shift in position over the past few years is simple and can be summarized in 5 key points.

1)    Data storage costs are becoming lower.

2)    Administration costs have reached a point where some redundant information may add no new burden.

3)    Artificial intelligence and machine learning continue to advance at a rapid pace.

4)    Big data and analytics software is becoming ever cheaper for companies to deploy

5)    The direction of the world continues to be towards more data records and more interconnectivity (think IIoT, IoT) across many data points. The software will evolve to cope and help reap more value from it.

No matter how much care and the right intent, times change with new opportunities

Traditionalists will cite many positive merits for getting rid of old data that is no longer used. I prefer to take a reflective position and ask why the data was initially thought to be worthy of collection. Such a data architectural review is worth doing on a periodic basis. There may be better or more relevant data and metrics to prioritize instead of old ones.


Yes, I understand what you are saying here. We have in the depths of the data depot a growing crew of ever more insightful and influential data sets. When they first arrived, we ignored them, but then the market changed and we have realized their voice is actually vital to the success of our core business process now.

Generating the wrong business process workflow

Few things are more embarrassing for a manager than when someone points out that a process workflow is, in reality, very different from the workflow as understood. Such false assumptions surface faster within multisite operations, across different cultural regions, and after a departmental change, an upgrade of equipment or software and, of course, the natural churn of staff.

A great tool in today’s modern technological world is the ability to record by video the movements, clicks, and inputs on a device. The use of this helps speed up the pace of change to workflows and thus the business understanding of “how stuff gets done”.

That’s right, just the other nanosecond (a long time for me as a byte of information), we had two pieces of data highlighting a risk in the accounting department. Now, we all had heart attacks until three nanoseconds later one new piece of data stopped the finance dashboard from flashing by updating the delivery date field.

Access rights and security making decisions

Part of the reason for dirty data is the lack of control over who can input data, when, and why. Allowing a new person to input unvalidated and unqualified data might be done with the best of intent, yet the impact on a business decision tree can become catastrophic. This can be controlled with solid governance of the data through access rights and a strict security policy.

People forget that data theft is not like robbing a bank. It can be simply like photocopying all the money in that bank and running away with it never to be detected
— James Doyle

The dilemma management has regarding access is, in our opinion, a slightly overblown issue. Needs rise to the top of the agenda when required. The initial access controls are important, and equally important is the removal of access when roles and information requirements change. Sales teams have been familiar for many years with the risk of client and market information being taken from one business to another, but the same applies today to many other roles across the business.

Security of data should be given a high priority, not least because of the ever increasing surge of hackers, ransomware attacks and other methods used to penetrate your data systems. Setting up silo systems offers more security, but risks a lack of data flow between systems and out of date (out of sync) information. These disparities need to be considered when timing data transfers for maximum benefit. Personal level security is also vital in an age of public Wi-Fi systems and ever more urgent online demands from business to business. Beware of weak password processes and of capture devices that can take your information in part or in whole, leaving it to be corrupted, sold or lost later.

I am so glad you guys raised this. You know the other day I was seeing double. That’s right; I actually saw a copy and paste of our sales records. I was not bothered at first as it was just George from sales, but now I am becoming concerned, as he has also just handed in his notice of employment.

Building personal accountability into data design.

Data can become misinformation through several key triggers. Not least amongst these is the weighting given to specific data points or data summaries. This alone, though it is not the only cause, can create a bias in the readings and output of the information provided. To avoid these scenarios, we suggest that strict competency roles are defined and that personal level accountability and ownership are given for each data field. One of the reasons for this personal level of accountability is the threat and reality of committee level decisions, which often arrive at the least contentious data as opposed to the most effective yet controversial data sources.

Now don’t make me laugh. I see you guys and gals all the time feeding us all sorts of data from sensors to external benchmark data and even the occasional personal data finds its way hidden here from the wife’s birthday to prices on eBay. But yes I do agree, if we had more accountability then the quality of this data would improve. Indeed I can see a future data world inside here with a clean mean data machine attitude. Sign me up, I’m all in dude.

Keeping eyes on the eyes of those that manage the data

Retaining a solid monitoring process, and a system for consistency that covers the development of the data behind your metrics, is as vital as the data itself. Without careful monitoring, errors and risks become overlooked. Examples of best practice and process control can be forgotten, and eventually new, less robust standards rapidly become the norm. If monitoring is not done correctly, your data becomes misleading, untrustworthy and a cost burden to the business, as opposed to a valuable decision-making tool.

Oh, man you got to hear this story. The other week I was a pixel on a PowerPoint graph. We were all laughing our heads off that the manager presenting us had clean overlooked the fact that the process of gathering sales leads had changed. I swear when he presented me as a lead, he was not aware I had already in fact become a quality sales prospect and should have shown up on his sales forecast. That alone would have saved him the roasting he then got from the sales director.

Check that what you are doing is clear and sufficient for a clean data success

OK, so you have the right design of monitoring programs and audits, but something still is not working. Here we discover that the rules set out may not be simple, clear, concise and fully understood. This leads to frustration among users of a data system, which in turn creates poor data and a lack of desire to enforce the system’s integrity.

The truth is more important than the facts

Ha, I see what you did there, clear and concise and then only write a short paragraph. Yes even data has a sense of humor.

The 3 key elements for cleaner data control

Keeping data clean is critical, it’s not just maintenance

You have a sales team for sales, a marketing team for marketing, a safety team, a quality team, a project team, a development team, but where is your business data team? Data is the most prevalent commodity across most organizations today, yet is managed all too often within silo structures. This impacts the overall business potential and misses out on opportunities of best practice and higher value returns.

1)    Create a data quality team. The purpose of the team is to address data quality issues and help provide long term solutions.

2)    Define the rules and requirements of your business data. At this point, I would expect many people to stop reading and click onto a movie about cats. Well, this more boring level of detail just might enable you to save the time and money in your business to allow you to invest in a whole new cat movie production genre, so please take heed!

3)    Keep it simple, clear, concise. Give each person that interacts with your data a clear method, strategy, rules and expectations but, most of all, check their level of understanding.

Hey, dude I like this stuff. If you do all that then I’m on a diet and can start my sprint runs. I am feeling energized by prospects. Bring it on.

Tips for success when migrating to cleaner data

Let us be reminded that your data can and should reflect your business culture. If you are straight down the line transactional as a business, then your data is best kept that way too. If, however, you have a fun and open culture (think Disney meets TGIF), then maybe you want to include 5 “random facts” about your suppliers, customers etc.

Make the most within the wild west of data management

Just because your data is dirty, it does not mean it is useless. Understanding the quality of your data gives better forecasts and tolerance band information to systems and leaders, so they can reach the best business decisions. In addition, there may be no need to tackle all the errors at once. Here the law of Pareto applies as a rule of thumb: often 20% of dirty data, once cleaned, will enhance your business decisions by 80%. If, however, you have a very good system already, you might find it will take you 80% of further data improvement to boost business decisions by 20%. So understand the costs, time, and priorities of your organization.
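To see where your own 20% lies, it helps to count the issues found in an audit and rank them. The sketch below assumes you already have one label per dirty record; the labels and counts are invented, and `pareto_ranking` is a hypothetical helper rather than part of any library.

```python
from collections import Counter


def pareto_ranking(issue_labels):
    """Rank issue types by frequency with their cumulative share of all issues."""
    counts = Counter(issue_labels)
    total = sum(counts.values())
    ranking, running = [], 0
    for issue, count in counts.most_common():
        running += count
        ranking.append((issue, count, running / total))
    return ranking


# Invented audit log: one label per dirty record found.
audit = (
    ["duplicate"] * 50
    + ["missing_postcode"] * 30
    + ["bad_date"] * 15
    + ["stray_symbol"] * 5
)
ranking = pareto_ranking(audit)
for issue, count, cumulative in ranking:
    print(f"{issue:<18}{count:>4}{cumulative:>8.0%}")
```

In this invented audit the top two issue types account for 80% of the dirty records, so they are the natural place to start. Note that frequency is only a proxy: a rare error in a critical field may still deserve priority.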

  1. Clearly, the top need is often to remove duplicates once they have been identified. I suggest spending a little more time at this point to ensure the maximum data is retained, rather than carrying out a straight, harsh purge.

  2. Make weights, measures, dates, and scales consistent. There will often be significant differences to be managed when dealing with international companies that trade over various measurement systems, such as imperial and metric.

  3. Consider the implications of different spelling (colour - UK and color – USA) for different markets. Where no market brand impact is lost, seek to standardize; otherwise set clear rules and manage around such realities. The same applies to multi-language databases where Spanish, English and German are used, and thus their own grammar rules will apply for case sensitivity.

  4. Remove text string inputs from standardized fields. So, you have a time, a date or a sales value. Using the correct format and standard input fields is important to capture information not as text but as the actual value it is supposed to be. Some standard software changes at administration level can eliminate this risk, but check it first!

  5. Create unique data fields. This sounds obvious, yet double check the language and use of each data field. Have you got the right language for consistency and understanding? If the accounts department uses the words “days’ payable” for “payment terms”, then change the field accordingly. This becomes ever more vital when gathering data from various sources. Ensure these unique identifiers are used so you can link between tables correctly.

  6. Deal with dodgy characters and TRIM those spaces. Dodgy characters are the data entries that often occur from an error or typo. Search and hunt out these #``!; signs. In addition, where applicable, remove unnecessary spaces and standardize the spaces required (think telephone numbers as a prime example).

  7. Gamify the process of good data performance. This can be from league tables to storyline specific data cleansing campaigns. It is a great start to draw attention to the needs of this business improvement and helps reinforce the necessary positive behaviors sought.
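Tips 1 and 6 above lend themselves to a short illustration. This is a sketch under stated assumptions, not a finished cleansing tool: the field names and sample rows are invented, and matching duplicates on a cleaned, lower-cased email address is only one possible rule.

```python
import re


def clean_text(value):
    """Trim the ends and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", value).strip()


def normalize_phone(value):
    """Keep digits and a leading '+', dropping spaces and stray symbols."""
    value = value.strip()
    prefix = "+" if value.startswith("+") else ""
    return prefix + re.sub(r"\D", "", value)


def merge_duplicates(records, key):
    """Merge records sharing a cleaned, case-insensitive key.

    Instead of discarding the later copy, empty fields on the first record
    are filled from later duplicates, so as much data as possible survives.
    """
    merged = {}
    for record in records:
        identity = clean_text(record[key]).lower()
        if identity not in merged:
            merged[identity] = dict(record)
            continue
        kept = merged[identity]
        for field, value in record.items():
            if kept.get(field) in (None, "") and value not in (None, ""):
                kept[field] = value
    return list(merged.values())


# Invented rows: the same customer entered twice with different gaps.
rows = [
    {"email": "jo@example.com ", "phone": "", "town": "Gloucester"},
    {"email": "JO@example.com", "phone": "01452 000 000", "town": ""},
]
result = merge_duplicates(rows, "email")
```

The two rows collapse into one record that keeps both the town and the phone number, which is the “retain the maximum data” idea from tip 1. A real clean-up would also need a rule for conflicts, where both copies hold different non-empty values.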

Wow, if that all gets done around here then we are ready for blast off. I’m putting on my data Nike running shoes and pumping data iron for this challenge. Bring it on, bring it on and let’s get down.

Fine-tuned algorithms ready for gold medals

Data banks

By now we understand that dirty data and the process of cleaning data are important for the decisions and measures of the whole business. It is always worth raising questions about the data quality and tolerance bands of information presented at work. This starts the discussion about cleaner data and whets the appetite within leadership for competitive advantage through information quality.

6 key advantages for a business after embracing a cleaner data policy

  1. Improved trust in metrics

  2. Key performance indicators become true indicators of actual performance

  3. The power of data can become unleashed through accurate insights

  4. Data mining, analytics, machine learning, and algorithms become highly testable for reliability and effectiveness.

  5. Big data and entry into the internet of things become a natural and manageable process.

  6. Data touches all the organization and can be used as a great introduction for change management and gamification.

I wish you well in your journey and adventure to stop your algorithms from catching a cold in future. You see, with poor data over time come poor decisions and the slow, painful spiral of decay and death of a business. And on that not so happy note, I will leave you here to take the key points raised and boost the quality of your data, metrics and business future. Drop me a comment below with any questions or new considerations based on your own experiences or situation.

Hey- I appreciate the ride we just went on, please help us out here. We mean to do well and help your business so throw us some tender loving care. I look forward to seeing you face to face on a transformative graph some time. Until then, keep polishing things up and we will help you as much as we can.