What are the idea to clean a messy data?

BittenApple
BronzeLounger
Posts: 1498
Joined: 01 Mar 2015, 02:03

What are the idea to clean a messy data?

Post by BittenApple »

Hello team,

I have a data set and I need to clean it up. What are the approaches other than removing duplicated rows?

Thanks,
BittenApple

User avatar
HansV
Administrator
Posts: 78236
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: What are the idea to clean a messy data?

Post by HansV »

It depends on the contents of the data and it can be a very labor-intensive job - much of it has to be done by hand.

Examples:
  • Addresses, if not already split: 123 Walnut Street vs 123 Walnut St vs 123 Walnut
  • Names, if not already split: James N. Brown vs James Brown vs Jim Brown vs J. Brown vs Brown, J.
  • Birth dates: switched month and day: is 2/5/1979 the 5th of February or the 2nd of May?
  • Gender, if stored as text: female vs woman vs f.
  • Spelling errors.
  • Replace multiple spaces with single spaces.
  • Etc.
After correcting errors and removing duplicates:
  • Split data into multiple fields. For example, instead of full name, create fields FirstName, MiddleName, LastName, Honorific. And instead of full address, create fields for street address, city, zip code, state (plus country if necessary).
  • Encode values if relevant, e.g. gender: 1=male, 2=female, 3=other (or more options if desired).
Best wishes,
Hans

BittenApple
BronzeLounger
Posts: 1498
Joined: 01 Mar 2015, 02:03

Re: What are the idea to clean a messy data?

Post by BittenApple »

Great response!
BittenApple