Whether a data analyst uses SQL, Python, or Excel (among many other tools) to analyze a dataset, an important question to ask is “Does the data belong?” Answering it is the job of the data cleaning step, which ensures that the attributes, columns, and values have been entered correctly. Human error is inevitable, so it is imperative to check and confirm that the information is accurate.
NOTE: Before you start cleaning, please create a backup copy of the original data in a separate file.
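For analysts working in Python, one way to make that backup is sketched below. The function name `backup_file` and the `.bak` suffix are illustrative assumptions, and the sketch refuses to overwrite an existing backup so the original snapshot is never lost by accident.

```python
import shutil
from pathlib import Path


def backup_file(source, suffix=".bak"):
    """Copy `source` to a sibling file (e.g. data.csv -> data.csv.bak).

    Returns the path of the backup. Raises FileExistsError rather than
    overwriting a backup that is already there.
    """
    src = Path(source)
    dest = src.with_name(src.name + suffix)
    if dest.exists():
        raise FileExistsError(f"Backup already exists: {dest}")
    # copy2 copies the file contents and, where possible, its metadata
    shutil.copy2(src, dest)
    return dest


# Hypothetical usage; "survey.csv" is a placeholder file name:
# backup_file("survey.csv")
```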
Criteria to consider when cleaning data:
1. Editing data
- Do you notice any inconsistencies in capitalization, extra spaces, formulas, non-printing characters, or spelling in the values?
2. Removing data
- What specific criteria are to be reviewed (e.g., a certain age, ethnicity, or location)? Remove rows that are irrelevant to those criteria.
3. Formatting data
- Are the cells aligned correctly according to their number format?
- Do cells need to be split into separate columns (e.g., names or locations)?
- Are the times and dates displayed properly? Should they be split into separate columns?
4. Duplicated data
- Is there more than one entry showing the same information?
5. Mislabeled columns
- Have you examined the column labels to ensure they are correctly titled?
6. Missing values
- Are there values showing as NaN or NaT? If the dataset is small, replace missing numeric values with the column's median or mean. Do not apply this method to categorical columns.
- Alternatively, for categorical data types, replace missing values with the most frequent value (the mode) in each column.
7. Outliers
- Outliers should only be removed if the value was entered or measured incorrectly. A data analyst must consider the validity of the outlier and how removing it would affect the outcome.
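Several of the checks above can be sketched in Python with pandas. This is a minimal illustration under stated assumptions, not a complete cleaning routine: the toy data, the column names, and the helpers `clean` and `flag_outliers` are all hypothetical. The 1.5 × IQR rule for spotting outliers is a common convention swapped in for illustration; the checklist itself does not prescribe a detection method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: messy headers, inconsistent case and spacing,
# a duplicate entry, missing values, and one implausible age.
raw = pd.DataFrame({
    " Name ": ["  alice ", "BOB", "bob", "Cara", "Dan", "Eve"],
    "City": ["Austin", "austin ", "austin ", None, "Boston", "Austin"],
    "AGE": [34, 29, 29, np.nan, 41, 230],
})


def clean(df, text_cols, numeric_cols):
    """Apply a few of the checklist steps and return a cleaned copy."""
    out = df.copy()

    # Column labels: trim stray spaces and use one consistent case
    out.columns = out.columns.str.strip().str.lower()

    # Editing data: trim whitespace and normalise case in text columns
    for col in text_cols:
        out[col] = out[col].str.strip().str.title()

    # Duplicated data: drop rows that repeat an earlier row exactly
    out = out.drop_duplicates().reset_index(drop=True)

    # Missing values: median for numeric columns, mode for categorical ones
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in text_cols:
        mode = out[col].mode()
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])

    return out


def flag_outliers(series, k=1.5):
    """Return True where a value lies outside k * IQR of the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


cleaned = clean(raw, text_cols=["name", "city"], numeric_cols=["age"])

# Outliers are flagged for review, not deleted automatically
cleaned["age_outlier"] = flag_outliers(cleaned["age"])
print(cleaned)
```

Flagging the suspicious age instead of dropping it keeps the decision with the analyst, who can check whether the value was a data entry error before removing it.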
These are some methods for cleaning a dataset. What else would you recommend for cleaning data in preparation for the next step, exploratory data analysis?