Keep Your Data Clean, Practice Some Data Hygiene

Lasana Murray

Nov 22, 2019 • 4 min read

What's the difference between data and information?

Information is a collection of useful facts that we can get value from, directly or indirectly. Data, on the other hand (for the purposes of this post), is just raw machine-readable stuff. It may or may not prove to be useful or factual, or possess any value to us.

We often use the terms interchangeably, but we can make a distinction by saying
information comes from clean data.

No one wants to be given wrong or invalid data to make a decision. The results can be catastrophic!

Imagine the consequences of being told that traffic around the Queen's Park Savannah is bi-directional, or that you need at least 5 liters of water to board a plane, or that attending J'ouvert is mandatory for all non-nationals of Trinidad and Tobago.

These are obvious, easily proven lies, noise even, but imagine if this was the only "information" you received each time you googled "Trinidad and Tobago".

Annoying!

To make good decisions we need good information. To get good information, we
need to hold the data we capture and retain to a high standard.

Most of the time, the opportunity to do that is at the point of capture.
This is your first line of defense against amassing a treasure trove of errors and mistakes. Once a database has accumulated invalid or inaccurate data for long enough, it can become too expensive and too risky to fix.

It then becomes a question of whether the cost of making decisions based on lies is higher than the cost of fixing them. Not a desirable decision to have to make.

By practicing good hygiene with our data we can avoid situations like this.
Here are some pointers to keep in mind to put you on that path:

Only Collect What You Can Explain

In the age of social media and surveillance capitalism, it can be tempting to suck up every piece of information you can. After all, it seems to have worked
wonders for tech giants like Amazon, YouTube and, of course, Facebook.

However, do keep in mind, you are not Amazon, Google or Facebook.

Also, keep in mind that each piece of data you collect has subtle costs attached to it that I'll explain later. Those costs add up as your database grows and are going to contribute to the time it takes to re-design your applications or implement new features.

As a rule of thumb, if you can't explain what a piece of data is and why you want it, it's probably superfluous and you don't need it.

Establish The Correct Format For Your Data

Data such as phone numbers or postal addresses tend to have multiple ways of being recorded. For example, a valid phone number in Trinidad and Tobago may be 888-8888, where the area code 868 (dialed as +1 868 from abroad) is implied.

Outside of Trinidad and Tobago this may not be enough digits to be a valid phone number.

Knowing the format in which you want your data collected and stored makes it easier to identify what's information and what's garbage. Note that data does not have to be stored in the same format in which it is collected. In fact, data can be collected in multiple appropriate formats for the sake of user convenience, but the actual storage format must be consistent.
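As a rough illustration of accepting several input formats while storing only one, here is a minimal Python sketch. The function name `normalize_tt_phone` and the choice of `+1868` followed by the seven local digits as the storage format are assumptions made for this example, not a prescribed standard.

```python
import re

# Assumed storage format for this sketch: "+1868" followed by the 7-digit local number.
AREA_CODE = "868"


def normalize_tt_phone(raw: str) -> str:
    """Return the number in the assumed storage format, or raise ValueError."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, brackets and plus signs
    if len(digits) == 7:
        local = digits  # area code implied, e.g. "888-8888"
    elif len(digits) == 10 and digits.startswith(AREA_CODE):
        local = digits[3:]  # e.g. "(868) 888-8888"
    elif len(digits) == 11 and digits.startswith("1" + AREA_CODE):
        local = digits[4:]  # e.g. "+1 868 888 8888"
    else:
        raise ValueError(f"not a recognisable Trinidad and Tobago number: {raw!r}")
    return "+1" + AREA_CODE + local
```

With this in place, "888-8888", "(868) 888-8888" and "+1 868 888 8888" all end up as the same stored value, and anything else is rejected at the point of capture.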

Always Validate

This may seem obvious, but I have seen firsthand applications in production that fail to validate the data they collect. Easily validated fields such as date of birth are simply saved without first ensuring they are indeed valid dates.
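A date of birth check can be only a few lines long. The sketch below assumes the value arrives as a `YYYY-MM-DD` string; the function name `parse_date_of_birth` and the 130-year upper bound are illustrative choices, not rules from any standard.

```python
from datetime import date
from typing import Optional

MAX_AGE_YEARS = 130  # assumed sanity limit for this sketch


def parse_date_of_birth(raw: str, today: Optional[date] = None) -> date:
    """Parse a YYYY-MM-DD string and reject impossible birth dates."""
    today = today or date.today()
    # fromisoformat raises ValueError for non-dates such as "2019-02-30".
    dob = date.fromisoformat(raw.strip())
    if dob > today:
        raise ValueError("date of birth is in the future")
    if today.year - dob.year > MAX_AGE_YEARS:
        raise ValueError("date of birth is implausibly old")
    return dob
```

Calling this before saving means a value like "2019-02-30" never reaches the database.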

Validation refers to the process of making sure an item of data is acceptable for its required format.

It's the first step in preventing invalid data from being treated as useful information.
Validation can be tricky because some data is harder to validate than other data, so it is important to know what degree of accuracy is needed.

For example, date of birth is easy, as we have strict formats for dates, but a
person's name can be a bit more complex, especially when dealing with multiple nationalities.
For the latter, we know that a person's name should not be a single letter, but allowing only the English alphabet may alienate people with non-English names.
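One possible way to express that trade-off in Python is to check Unicode character categories instead of an A to Z whitelist. The function name `is_plausible_name`, the set of allowed punctuation and the two-letter minimum are assumptions for this sketch, and a real application may need looser or stricter rules.

```python
import unicodedata

# Assumed set of non-letter characters that may appear in a name.
ALLOWED_PUNCTUATION = {" ", "-", "'", "."}


def is_plausible_name(raw: str) -> bool:
    """Accept letters from any script; reject single letters, digits and symbols."""
    name = unicodedata.normalize("NFC", raw.strip())
    # Unicode categories starting with "L" are letters in any alphabet.
    letters = [ch for ch in name if unicodedata.category(ch).startswith("L")]
    if len(letters) < 2:
        return False
    # "M" categories are combining marks, which some scripts need.
    return all(
        unicodedata.category(ch)[0] in ("L", "M") or ch in ALLOWED_PUNCTUATION
        for ch in name
    )
```

This rejects "J" and "R2-D2" but accepts "José", "O'Brien" and names written in non-Latin scripts.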

Different applications will require different levels of accuracy, which is why it is
important to always know why you want each piece of information you collect.

Understand How Your Information Ages

Would you purchase a horse today because of how important they were in the 1600s?

You probably would not. That information is outdated and has little relevance to your transportation needs today. Yet this kind of decision making happens when organisations fail to recognize their data has passed its prime.

Understanding how the information you gather loses usefulness over time is important, and steps should be taken to remove data that is no longer useful or relevant.
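In practice this often means recording when each item was last confirmed and flagging anything older than an agreed limit. The sketch below is one hedged way to do that in Python; the `Record` fields and the two-year `MAX_AGE` are hypothetical and would depend on what the data is used for.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple


@dataclass
class Record:
    customer_id: int
    delivery_address: str
    last_confirmed: date  # when the customer last confirmed this address


# Assumed limit for this sketch: an address is stale after roughly two years.
MAX_AGE = timedelta(days=2 * 365)


def split_by_age(records: List[Record], today: date) -> Tuple[List[Record], List[Record]]:
    """Return (fresh, stale) so stale records can be re-confirmed or removed."""
    fresh = [r for r in records if today - r.last_confirmed <= MAX_AGE]
    stale = [r for r in records if today - r.last_confirmed > MAX_AGE]
    return fresh, stale
```

The stale list can then be reviewed, re-confirmed with the customer or deleted, depending on what the application needs.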

Understand The Costs Attached To Keeping Data

Statements such as "storage is cheap" can mislead us into thinking that the cost of data retention is just what we pay for hard disk space.

Provided you are interested in validating your data, the more types of data you want to collect, the more validation rules you are going to have to come up with and test. Keeping the data up to date is another cost to consider which can become quite expensive.

The sheer size of a database can also make moving it and implementing new features a nightmare. Picture having to wait 14 hours for a transfer of a 20+ year transaction log to complete so that a new feature could be tested.

Each time a change is made to the codebase!

It sounds ridiculous, but it happens.

By understanding these less obvious but still very important costs of retaining information, one may feel less compelled to suck up everything under the sun and instead take only what is needed.

Bonus: Don't Overload Data Fields!

Let's say you have a form field in your shopping app that asks users whether they would like their luxury Gucci eggs delivered as soon as the chicken lays them, or whether they would rather pick them up in person.

The options are no, yes, or an address entered by the user.

This is an example of overloading a field. One field of a form should always
record one item of data. In the example above, you are trying to capture two things:
the user's preference and an address.

What happens if you want a report on how many users said yes?
A naive query will probably omit users who specified an address!
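To make the problem concrete, here is a small Python sketch comparing the overloaded field with a version that splits it in two. The sample values, the `DeliveryChoice` class and its field names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Overloaded: one field holds "no", "yes" or a free-text address.
overloaded_rows = ["yes", "no", "12 Example Street, Port of Spain", "yes"]

# The naive report only counts the literal "yes" and misses the address row.
naive_yes_count = sum(1 for value in overloaded_rows if value == "yes")  # 2


@dataclass
class DeliveryChoice:
    wants_delivery: bool  # the preference
    delivery_address: Optional[str] = None  # the address, if one was given


# The same four answers, with the preference and the address kept apart.
choices = [
    DeliveryChoice(True),
    DeliveryChoice(False),
    DeliveryChoice(True, "12 Example Street, Port of Spain"),
    DeliveryChoice(True),
]

yes_count = sum(1 for choice in choices if choice.wants_delivery)  # 3
```

With one item of data per field, the report becomes a simple count and no longer depends on guessing what a free-text value means.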

These are just a few things to keep in mind when thinking about the hygiene of your data.

Remember, data that can't be turned into information is noise, and sadly we already have an overwhelming amount of that these days.