Hello team,
I have a data set and I need to clean it up. What are the approaches other than removing duplicated rows?
Thanks,
BittenApple
What are the idea to clean a messy data?
-
- BronzeLounger
- Posts: 1498
- Joined: 01 Mar 2015, 02:03
-
- Administrator
- Posts: 78236
- Joined: 16 Jan 2010, 00:14
- Status: Microsoft MVP
- Location: Wageningen, The Netherlands
Re: What are the idea to clean a messy data?
It depends on the contents of the data and it can be a very labor-intensive job - much of it has to be done by hand.
Examples:
Examples:
- Addresses, if not already split: 123 Walnut Street vs 123 Walnut St vs 123 Walnut
- Names, if not already split: James N. Brown vs James Brown vs Jim Brown vs J. Brown vs Brown, J.
- Birth dates: switched month and day: is 2/5/1979 the 5th of February or the 2nd of May?
- Gender, if stored as text: female vs woman vs f.
- Spelling errors.
- Replace multiple spaces with single spaces.
- Etc.
- Split data into multiple fields. For example, instead of full name, create fields FirstName, MiddleName, LastName, Honorific. And instead of full address, create fields for street address, city, zip code, state (plus country if necessary).
- Encode values if relevant, e.g. gender: 1=male, 2=female, 3=other (or more options if desired).
Best wishes,
Hans
Hans
-
- BronzeLounger
- Posts: 1498
- Joined: 01 Mar 2015, 02:03
Re: What are the idea to clean a messy data?
Great response!
BittenApple
BittenApple