Examining Datasets: Assessing the Data Quality

2 min read · Mar 18, 2022

Legos sorted into color-specific piles. Photo by Mourizal Zativa on Unsplash

Whether a data analyst is using SQL, Python, or Excel (among many other tools) to analyze a dataset, an important question to ask is “Does the data belong?” Answering it is the data cleaning step, which checks that the attributes, columns, and values were entered correctly. Human errors happen, so it is imperative to check and confirm that the information is accurate.

NOTE: Before cleaning, please create a backup copy of the original data in a separate file.
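For anyone working in Python, here is a minimal sketch of one way to make that backup. It assumes the data lives in a single file on disk; the `backup_file` helper, the `.bak` suffix, and the example file name are hypothetical choices for illustration.

```python
import shutil
from pathlib import Path


def backup_file(source: str, suffix: str = ".bak") -> Path:
    """Copy `source` to a sibling file (e.g. data.csv -> data.csv.bak)."""
    src = Path(source)
    dest = src.with_name(src.name + suffix)
    # Refuse to overwrite an earlier backup, so the untouched original survives
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; refusing to overwrite it")
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest


# Example usage (hypothetical file name):
# backup_file("patients.csv")
```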

Criteria to consider when cleaning data:

1. Editing data

Do you notice any inconsistencies in capitalization, extra spacing, formulas, non-printing characters, or spelling in the values?
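As a rough illustration in Python with pandas, the sketch below handles three of these checks: extra spacing, non-printing characters, and inconsistent case. The `city` column, the sample values, and the `clean_text` helper are made up for the example, and title case is only one possible convention. Spelling fixes usually need a manual review or a lookup table, so they are left out here.

```python
import re

import pandas as pd

# Hypothetical sample: mixed case, stray spaces, and a zero-width space
df = pd.DataFrame({"city": ["  new york", "New  York ", "NEW YORK\u200b", "boston"]})

WHITESPACE = re.compile(r"\s+")
# Zero-width space, byte-order mark, and ASCII control characters
NON_PRINTING = re.compile("[\u200b\ufeff\x00-\x1f\x7f]")


def clean_text(series: pd.Series) -> pd.Series:
    """Collapse whitespace, drop non-printing characters, and unify case."""
    return (
        series.str.replace(WHITESPACE, " ", regex=True)   # tabs/newlines/runs -> one space
        .str.replace(NON_PRINTING, "", regex=True)        # remove invisible characters
        .str.strip()                                      # trim leading/trailing spaces
        .str.title()                                      # one case convention
    )


df["city_clean"] = clean_text(df["city"])
print(df["city_clean"].unique())
```

Writing the result to a new column rather than overwriting `city` makes it easy to compare before and after.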

2. Removing data

Which specific criteria define the rows under review (e.g., a certain age, ethnicity, or location)? Remove the rows that are irrelevant to those criteria.
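In pandas, this kind of filtering is usually done with a boolean mask. The sketch below assumes made-up `age` and `state` columns and a made-up rule (adults located in Texas); the real criteria depend on the question being analyzed.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "age": [17, 34, 52, 29],
    "state": ["TX", "CA", "TX", "NY"],
})

# Hypothetical criteria: keep adults located in Texas
in_scope = (df["age"] >= 18) & (df["state"] == "TX")

removed = df[~in_scope]                      # set aside for review, not discarded blindly
kept = df[in_scope].reset_index(drop=True)   # rows that match the criteria

print(f"kept {len(kept)} rows, removed {len(removed)} rows")
```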

3. Formatting data

Are the cells aligned correctly according to their number format?
Do any cells need to be split into separate columns (e.g., names, locations)?
Are the times and dates displaying properly? Should they be separated?
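A small pandas sketch of these formatting checks follows. The column names and sample values are invented for illustration, and it assumes names are written as "First Last" and timestamps arrive as text.

```python
import pandas as pd

# Hypothetical sample: everything arrives as text
df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Grace Hopper"],
    "visit": ["2022-03-18 14:05:00", "2022-03-19 09:30:00"],
    "charge": ["1,200.50", "980"],
})

# Split one cell into two columns at the first space
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Parse text into real datetimes; values that cannot be parsed become NaT
df["visit"] = pd.to_datetime(df["visit"], errors="coerce")

# Separate the date and the time if the analysis needs them apart
df["visit_date"] = df["visit"].dt.date
df["visit_time"] = df["visit"].dt.time

# Numbers stored as text: drop thousands separators, then convert
df["charge"] = pd.to_numeric(
    df["charge"].str.replace(",", "", regex=False), errors="coerce"
)

print(df.dtypes)
```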

4. Duplicated data

Is there more than one entry showing the same information?
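With pandas, duplicates can be inspected before they are dropped. The sketch below uses made-up `patient_id` and `visit_date` columns and treats rows as duplicates only when every column matches, which is an assumption; some datasets need a `subset=` of key columns instead.

```python
import pandas as pd

# Hypothetical sample with one repeated entry
df = pd.DataFrame({
    "patient_id": [101, 102, 102, 103],
    "visit_date": ["2022-03-01", "2022-03-02", "2022-03-02", "2022-03-05"],
})

# Show every row that has an exact twin, so it can be reviewed first
dupes = df[df.duplicated(keep=False)]
print(dupes)

# Keep the first occurrence of each duplicated row
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
```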

5. Mislabeled columns

Have you examined the column labels to ensure they are correctly titled?
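If the labels do need fixing, a pandas sketch might look like the following. The original labels, the snake_case convention, and the new names are all hypothetical.

```python
import pandas as pd

# Hypothetical sample with untidy and cryptic labels
df = pd.DataFrame({" Patient ID": [1], "DOB ": ["1990-01-01"], "st": ["TX"]})

print(list(df.columns))  # inspect the labels first

# Tidy every label: trim spaces, lowercase, use underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

# Rename labels that are unclear or wrong
df = df.rename(columns={"dob": "date_of_birth", "st": "state"})
```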

6. Missing values

Are there values showing as NaN or NaT? If the dataset is small, replace the missing values in numeric columns with the median or mean. Do not apply this method to categorical columns.
Alternatively, for categorical data types, replace the missing values with the most frequent value within each column.
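Here is a minimal pandas sketch of both approaches, using a made-up numeric `age` column and a made-up categorical `plan` column. It uses the median for the numeric column; swapping in `.mean()` would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value per column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "plan": ["basic", "premium", None, "basic"],
})

# Count the missing values in each column before changing anything
print(df.isna().sum())

# Numeric column: fill with the median
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the most frequent value (the mode)
df["plan"] = df["plan"].fillna(df["plan"].mode().iloc[0])
```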

7. Outliers

Outliers should only be removed if the data was entered or measured incorrectly. A data analyst must consider the validity of each outlier and how removing it would affect the outcome.
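Before deciding, it helps to flag candidate outliers for review rather than delete them. The sketch below uses the 1.5 × IQR rule, which is one common convention and an assumption here, not the only possible test; the `height_cm` column and its values are made up.

```python
import pandas as pd

# Hypothetical sample: 1700 looks like a data-entry slip for 170
df = pd.DataFrame({"height_cm": [160, 165, 170, 172, 168, 1700]})

q1 = df["height_cm"].quantile(0.25)
q3 = df["height_cm"].quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged, not removed
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["height_cm"].between(lower, upper)

# Review the flagged rows before deciding whether they are errors
print(df[df["is_outlier"]])
```

Keeping the flag as a column leaves the original values in place, so the analysis can be run with and without the flagged rows to see how much they change the result.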

These are some methods for cleaning a dataset. What else would you recommend for cleaning the data to prepare for the next step, exploratory data analysis?


Written by Julissa Marin, MHA