
Data Inconsistency in Real Life.

Motivation...

Many data analysts write functions or scripts to automate the cleaning of files that supposedly arrive in the same format. However, human mistakes are common in data entry. In my daily work I deal with "supposedly" identical datasets, yet the script will run for two or three files before throwing an error because a file is not in the same shape as the others.

Who should read it

Any data practitioner, really, or anyone looking to enter the field sooner or later.

Data Inconsistency

In general

The biggest frustration for data analysts and scientists is data inconsistency. It adds an extra layer of suffering to our daily work. Why? Because with inconsistent data, not only do we have to transform the data into the shape we actually need, but we also have to clean up all the mistakes made by others. In short, a clean dataset that would usually take 5 to 10 minutes to put into good shape for modelling or visualization (pivot, stack, unstack...) now requires an extra 10 minutes (up to hours) to correct minor issues (e.g. mapping USA, US, usa, us, America... to a unique key). Now one would say this shouldn't be a big deal; after all, this is our job... Indeed, in many cases it isn't! Until you receive a column with supposedly 3 unique values which are, in reality, written in 30 different ways. And now you're contemplating your screen, thinking of the smartest way to map those 30 variants... But at the end of the day, after wasting precious hours at work, you do it the naïve way.
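To make the country example concrete, here is a minimal sketch (the mapping and column name are hypothetical) of collapsing those spelling variants into one key with pandas:

```python
import pandas as pd

# Hypothetical mapping of the many spellings seen in the wild to one key.
COUNTRY_MAP = {
    "usa": "US", "us": "US", "u.s.": "US",
    "united states": "US", "america": "US",
}

def normalize_country(value: str) -> str:
    """Lowercase and strip before lookup so 'USA ' and 'usa' collapse together."""
    key = str(value).strip().lower()
    return COUNTRY_MAP.get(key, value)  # leave unknown values untouched

df = pd.DataFrame({"country": ["USA", "us", " America ", "France"]})
df["country"] = df["country"].map(normalize_country)
```

Normalizing case and whitespace before the lookup shrinks the dictionary considerably, but any genuinely new spelling still has to be added by hand.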

But why?

Personally, I always ask myself why people make these mistakes, even after we notify them about it. I came to the conclusion that data literacy (or should I say illiteracy) is the main reason the mistakes happen so frequently.
We (DA/DS) care about data consistency from one dataset to another because we are trying to automate our tasks. The people filling in the data, on the other hand, are just trying to make it readable for a human. They do not really care about the Python script or function that is supposed to consume the dataset. Therefore, sorry to say, but enjoy your never-ending suffering. UNLESS you start working with datasets coming from Google Analytics, clean and tidy! (A DREAM).

Can we improve at least?

In my opinion, yes! Preparing a template with built-in values that can autofill is one option, but it is far from perfect and too limited. However, a general template ensuring the columns are always in the same place and with the same names is a HUGE milestone. At least then I can automate the reading of the file and write a function that detects all the unmapped values. A small correction to the mapping dictionary from time to time won't be hectic.
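That detection function can be as small as a set difference. A sketch, with an illustrative mapping and column name:

```python
import pandas as pd

# Illustrative mapping; in practice this would live next to the cleaning script.
REGION_MAP = {"emea": "EMEA", "apac": "APAC", "americas": "AMER"}

def find_unmapped(series: pd.Series, mapping: dict) -> set:
    """Return the normalized values that have no entry in the mapping yet."""
    normalized = series.dropna().astype(str).str.strip().str.lower()
    return set(normalized) - set(mapping)

df = pd.DataFrame({"region": ["EMEA", "apac", "LATAM", None]})
missing = find_unmapped(df["region"], REGION_MAP)
```

Running this before the main pipeline turns "the script crashed on file 3" into "add these two keys to the dictionary", which is exactly the small periodic correction described above.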
So in general, there is hope. But this hope comes with some negotiation and an understanding of the problems on both sides. A small product owner cannot force a huge retailer to fix its datasets; however, if the product owner convinces the retailer that a better dataset means better and faster analysis to improve sales, that is a WIN-WIN for both sides.

My Stories... So far

For F**** Sake!!!

Spaces, apostrophes, commas...

Spaces when writing a number (97000 vs 97 000), or even worse, commas (97,000,000) and apostrophes (97'000'000).
WHY DO PEOPLE DO THAT!
One of my biggest frustrations was the spacing between the digits. For some reason it was not a normal space " ", which is easy to deal with in Python. It was the non-breaking space character ("\xa0")... I had no idea what the hell was happening; even Stack Overflow was confused. I ended up fixing it after a few hours of going crazy. Dealing with one problem in one column is one thing. Dealing with 3-4 problems is another.
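A sketch of stripping the usual separator zoo in one pass: plain spaces, non-breaking spaces (U+00A0, the "\xa0" culprit), narrow no-break spaces, commas and apostrophes:

```python
import re

# One character class for every separator seen so far; \s already covers most
# Unicode whitespace in Python 3, \u00a0 and \u202f are listed explicitly anyway.
SEPARATORS = re.compile(r"[\s\u00a0\u202f,']")

def parse_number(raw: str) -> int:
    """Drop every known separator, then convert what remains."""
    return int(SEPARATORS.sub("", raw))

parse_number("97 000")       # regular space
parse_number("97\xa0000")    # non-breaking space
parse_number("97,000,000")   # commas
parse_number("97'000'000")   # apostrophes
```

Keeping the separators in a single compiled pattern means the next weird character only costs one edit instead of another few hours of going crazy.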

What about mixing stuff altogether!

The best prize goes to the dataset that should contain only numbers but, for some reason, also contains letters, characters and who knows what else... Now I am not gonna lie, this isn't a big deal; it is fixable. However, it leaves you skeptical. Are my values correct now? Did my regex remove a comma by mistake? Did I convert some values to NaN?...
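One way to keep that skeptical feeling in check is to coerce and then audit, rather than trusting the conversion blindly. A sketch with made-up values:

```python
import pandas as pd

# A "numbers only" column that also contains stray text (illustrative values).
raw = pd.Series(["120", "37 ", "n/a", "45kg", "88"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
clean = pd.to_numeric(raw.str.strip(), errors="coerce")

# The audit answers "did I convert some values to NaN?" explicitly:
coerced = raw[clean.isna()]
```

Printing or logging `coerced` shows exactly which raw values were lost, so a regex mistake surfaces immediately instead of silently corrupting the analysis.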

We ain't done yet...

If you think I shared with you my worst experiences... Not yet reader, not yet...
"Unnamed_level_0", "Unnamed_level_1", "Unnamed_level_1", "Unnamed_level_2"... Those are the nightmares. Multicolumn indexes are the worst by far. Especially when they are mostly empty or they are unnecessary. Now thankfully, StackOverFlow has an easy solution to reduce all these 'Unnamed' into empty cells or even combining them into one title. 
An additional weirdness is column names containing invisible spaces. Those can cause mind F*s, especially if you're trying to select a column by name with 'loc' in pandas.
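Both problems can be handled in one sweep over the columns. A sketch, using labels like the ones pandas produces for blank header cells, plus a sneaky trailing space:

```python
import pandas as pd

# Column labels as pandas would name them for blank multi-row header cells,
# with a trailing space hidden in "sales " (all names are illustrative).
df = pd.DataFrame(
    [[10, 2, "A"], [20, 4, "B"]],
    columns=pd.MultiIndex.from_tuples([
        ("metrics", "sales "),
        ("metrics", "units"),
        ("Unnamed: 2_level_0", "Unnamed: 2_level_1"),
    ]),
)

def flatten(col) -> str:
    """Join the meaningful parts of a MultiIndex label, dropping 'Unnamed' fillers."""
    parts = [p.strip() for p in col if not str(p).startswith("Unnamed")]
    return " ".join(parts) if parts else "unnamed"

df.columns = [flatten(c) for c in df.columns]
```

After this, `df.loc[:, "metrics sales"]` works as expected, with no invisible whitespace left to sabotage the selection.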


End of the rant!

I am still at the beginning of my career. What I have seen so far is frustrating but manageable; most of these things were fixed in less than 6 hours (LOL). I keep telling myself that the best is yet to come. I would love to hear readers' own experiences and how they dealt with them.

