In this code snippet, we use the pandas library to perform some basic data cleaning tasks:
Remove Invalid Rows: We filter out rows where the 'Age' is 'Unknown' or where the 'Gender' is missing (None).
Standardize Country Names: We convert all country names to uppercase for consistency.
PYTHON
import pandas as pd

# Original DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Cindy', 'David'],
    'Age': [25, 30, 35, 'Unknown'],
    'Country': ['US', 'UK', 'Canada', 'US'],
    'Gender': ['F', 'M', 'F', None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Data Cleaning Steps
# Step 1: Remove rows where 'Age' is 'Unknown' or 'Gender' is missing
df_cleaned = df[df['Age'] != 'Unknown']
# .copy() gives an independent DataFrame, so the column assignment
# in Step 2 does not trigger a SettingWithCopyWarning
df_cleaned = df_cleaned.dropna(subset=['Gender']).copy()

# Step 2: Standardize the Country names to uppercase
df_cleaned['Country'] = df_cleaned['Country'].str.upper()

# Cleaned DataFrame
print("\nCleaned DataFrame:")
print(df_cleaned)
The result is a cleaned DataFrame that keeps only the rows for Alice, Bob, and Cindy, with 'Canada' converted to 'CANADA', ready for further analysis.
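One detail worth noting: because the original 'Age' column mixes numbers with the string 'Unknown', it keeps the generic object dtype even after the invalid row is removed. The sketch below shows one possible way to fold the same two steps into a reusable function that also converts 'Age' to integers. The function name clean_people is hypothetical, and treating every non-numeric age (not just 'Unknown') as invalid is an assumption that goes slightly beyond the original snippet.

```python
import pandas as pd


def clean_people(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a cleaned copy of df, leaving df unchanged."""
    return (
        # Non-numeric ages such as 'Unknown' become NaN (assumption: all such values are invalid)
        df.assign(Age=pd.to_numeric(df["Age"], errors="coerce"))
        # Drop rows where Age could not be parsed or Gender is missing
        .dropna(subset=["Age", "Gender"])
        # Store ages as integers and standardize country names to uppercase
        .assign(
            Age=lambda d: d["Age"].astype(int),
            Country=lambda d: d["Country"].str.upper(),
        )
        .reset_index(drop=True)
    )


data = {
    "Name": ["Alice", "Bob", "Cindy", "David"],
    "Age": [25, 30, 35, "Unknown"],
    "Country": ["US", "UK", "Canada", "US"],
    "Gender": ["F", "M", "F", None],
}
result = clean_people(pd.DataFrame(data))
print(result)
```

Because each step returns a new DataFrame, the input is never modified in place, which makes the function safe to call repeatedly on the same raw data.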