Data Cleaning and Exploration: Pandas in Practice

Why Data Cleaning Matters

There is a famous saying in the industry:

"In a machine learning project, 80% of the time is spent on data processing, and only 20% on training models."

The exact percentages are folklore rather than measurements, but the point holds: a model's accuracy usually depends far more on data quality than on how fancy the algorithm is. Feed garbage into a model, and you get garbage predictions out. This is called Garbage In, Garbage Out (GIGO).

Why this matters for your career:

Data cleaning is one of the most underrated skills in ML, and one that hiring managers actively look for
Real-world data is always messy: missing values, wrong types, outliers, duplicates
The ability to clean data efficiently is what separates a useful ML engineer from someone who only runs Jupyter notebooks
80% of your time as an ML engineer will be spent on data, not models — embrace it

Loading a Dataset

We will use the California Housing dataset, a classic ML benchmark containing 20,640 records of housing districts in California:

import pandas as pd
import numpy as np

# Load from Scikit-Learn's built-in datasets
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

# Display first 5 rows
df.head()

Exploratory Data Analysis (EDA)

After loading the data, do not rush into training a model. First, you must understand your data. This is the most important step in any ML project.

1. Basic Dataset Information

# Dataset shape (rows, columns)
print(f"Dataset size: {df.shape}")
print(f"Total: {df.shape[0]} rows, {df.shape[1]} columns")

# Data types for each column
df.info()

# Statistical summary
df.describe()

df.describe() is your best friend. It shows:

| Statistic | What It Tells You |
|:----------|:-----------------|
| count | How many non-null values exist |
| mean | The average value |
| std | Standard deviation: how spread out the data is |
| min / max | The range of values |
| 25% / 50% / 75% | Quartiles: where the data clusters |

2. Check for Missing Values

# Count null values per column
df.isnull().sum()

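# Fill missing values with the median ('column_name' is a placeholder).
# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify df.
df['column_name'] = df['column_name'].fillna(df['column_name'].median())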
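
The version of California Housing bundled with Scikit-Learn has no missing values, so the following is a minimal sketch on a made-up table instead. The DataFrame `toy` and its column names are hypothetical, and the 50% drop threshold is only a common rule of thumb, not a fixed standard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are made up for illustration
toy = pd.DataFrame({
    "income": [3.2, np.nan, 5.1, 4.4, np.nan, 6.0],
    "age": [12.0, 30.0, np.nan, 25.0, 41.0, 8.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
})

# Fraction of missing values per column (0.0 = complete, 1.0 = all missing)
missing_ratio = toy.isnull().mean()

# Keep only columns where at most half of the values are missing
cleaned = toy.loc[:, missing_ratio <= 0.5].copy()

# Fill the remaining gaps in numeric columns with each column's median
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

print(missing_ratio.round(2))
print(cleaned)
```

The median is used rather than the mean because it is less affected by extreme values, which matters when the column also contains outliers.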

3. Visual Analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Price distribution histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['MedHouseVal'], bins=50, kde=True)
plt.title('House Price Distribution')
plt.xlabel('Median House Value ($100,000s)')
plt.ylabel('Frequency')
plt.show()

What to look for in the distribution:

Is the data normally distributed, or skewed?
Are there extreme values (outliers) that don't fit the pattern?
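
Both questions can also be checked numerically. The sketch below uses a synthetic, made-up sample rather than the housing data, and applies the common 1.5 × IQR rule for flagging outliers; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (made up), plus one planted extreme value
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.lognormal(mean=0.0, sigma=0.5, size=500), 50.0))

# Skewness near 0 suggests symmetry; clearly positive means a long right tail
print(f"Skewness: {values.skew():.2f}")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
print(f"Flagged {is_outlier.sum()} of {len(values)} values")
```

Flagging is deliberately separate from removing here: whether a flagged value is an error or a genuine extreme depends on the domain.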

# Correlation heatmap
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

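# Find the top 3 features most correlated with price
# (drop the target itself, which always has correlation 1.0 with itself)
corr_with_price = correlation_matrix['MedHouseVal'].drop('MedHouseVal')
corr_with_price = corr_with_price.sort_values(ascending=False)
print("Top features correlated with price:")
print(corr_with_price.head(3))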

Feature Engineering

Feature engineering transforms raw data into formats that ML models can understand more effectively. This is where domain expertise meets data science — and where the biggest performance gains come from:

1. Create Composite Features

Sometimes, combining two features has more predictive power than using them separately:

# Example: rooms per bedroom = how much of a home is non-bedroom space
df['rooms_per_bedroom'] = df['AveRooms'] / df['AveBedrms']

# Example: a squared term lets a linear model capture a non-linear age effect
df['house_age_squared'] = df['HouseAge'] ** 2

2. Encoding Categorical Features

ML models only process numbers. If you have categorical data (city names, colors), you must convert it:

# One-Hot Encoding: convert categories into 0/1 columns
df_encoded = pd.get_dummies(df, columns=['categorical_column'], drop_first=True)
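
California Housing has no categorical columns, so `'categorical_column'` above is a placeholder. As a small illustration of what one-hot encoding produces, here is a sketch on a hypothetical table with a made-up `city` column:

```python
import pandas as pd

# Hypothetical data: the city names and room counts are made up
homes = pd.DataFrame({
    "city": ["Fresno", "Oakland", "Fresno", "San Diego"],
    "rooms": [4, 3, 5, 6],
})

# drop_first=True drops the first category in sorted order ("Fresno"),
# which becomes the baseline encoded as zeros in the remaining columns
encoded = pd.get_dummies(homes, columns=["city"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one category avoids a redundant column: if a row is neither Oakland nor San Diego, it must be Fresno.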

3. Feature Scaling (Standardization)

Different features come in different units and ranges (for example, area from 10 to 100 and age from 1 to 50). Standardization rescales each feature to mean 0 and standard deviation 1, which helps many models converge faster:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['MedInc', 'HouseAge', 'AveRooms']])
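
To see what the scaler computes, the same transformation can be written by hand as z = (x - mean) / std, applied per column. This is a sketch on made-up numbers, not the housing data:

```python
import numpy as np

# Made-up values on very different scales (e.g. income and house age)
X = np.array([
    [2.5, 10.0],
    [4.0, 25.0],
    [8.5, 52.0],
    [3.0, 33.0],
])

# z = (x - mean) / std, computed per column.
# ddof=0 (population standard deviation) matches StandardScaler's default.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=0)
X_scaled = (X - mean) / std

print(X_scaled.round(3))
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1 (up to floating-point rounding), regardless of its original unit.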

Train/Test Split

Before training any model, split your data into training and testing sets:

from sklearn.model_selection import train_test_split

# Features (X) and target (y)
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

random_state=42 ensures the split is identical every time — essential for reproducible debugging and fair model comparison.
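
One practical consequence of splitting: any statistics used for preprocessing, such as the scaler's mean and standard deviation, are best learned from the training rows only, so that nothing about the test set leaks into training. The sketch below shows this ordering on synthetic, made-up data; the array shapes and values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and target (values are made up)
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.shape, X_test_scaled.shape)
print("train means:", X_train_scaled.mean(axis=0).round(6))
print("test means: ", X_test_scaled.mean(axis=0).round(3))
```

The test-set means will generally not be exactly zero, and that is expected: the test rows are scaled with the training statistics, just as unseen data would be in production.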

How to Use Vibe Coding for Faster Data Cleaning

Don't want to write all this code manually? Let AI handle it:

🔥 Vibe Coding Prompt for Data Cleaning

"I have a CSV file called house_data.csv. Please help me:

1. Load it with Pandas and show the dataset size and column info.
2. Check missing value ratio in each column; drop columns with >50% missing.
3. Fill missing numeric values with the median.
4. Use the IQR method to detect and remove outliers.
5. Plot histograms for all numeric columns (a single figure with multiple subplots).
6. Save the cleaned dataset as clean_house_data.csv."

Today's Summary

In this chapter, you learned:

✅ Exploratory Data Analysis (EDA): Using .info(), .describe() to quickly understand data.
✅ Visualization: Using Matplotlib and Seaborn to plot distributions and heatmaps.
✅ Missing value handling: Checking for and filling null values.
✅ Feature engineering: Creating composite features, encoding categories, standardizing numeric values.
✅ Data splitting: Dividing data into training and test sets.

Key EDA Techniques

| Technique | Code | Purpose |
|-----------|------|---------|
| Overview | df.info(), df.describe() | Structure and statistics |
| Missing values | df.isnull().sum() | Find gaps in data |
| Distributions | df.hist(), sns.kdeplot() | Understand value ranges |
| Correlations | df.corr(), sns.heatmap() | Find relationships |
| Outliers | IQR method, Z-score | Detect anomalies |
| Group stats | df.groupby().mean() | Compare categories |

Common Data Issues

| Issue | Detection | Fix |
|-------|-----------|-----|
| Missing values | df.isnull().sum() | Fill with mean/median/mode or drop |
| Outliers | Box plots, values beyond 1.5 × IQR from the quartiles | Cap, transform, or remove |
| Skewed distributions | Histogram | Log transform, Box-Cox |
| High cardinality | df.nunique() | Group rare categories |
| Multicollinearity | Absolute correlation > 0.9 in the correlation matrix | Drop one of the correlated features |
| Incorrect data types | df.dtypes | Cast with astype() |
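
As one example of a fix from the table, a log transform can pull in a long right tail. The sketch below uses a synthetic, made-up sample and `np.log1p` (the logarithm of 1 + x, which also handles zeros); whether a log transform suits a real column depends on its values being non-negative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (made up for illustration)
rng = np.random.default_rng(1)
skewed = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=1000))

# log1p compresses large values much more than small ones
transformed = np.log1p(skewed)

print(f"Skewness before: {skewed.skew():.2f}")
print(f"Skewness after:  {transformed.skew():.2f}")
```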

Summary

Exploratory Data Analysis is the critical first step in any ML project. Use .info(), .describe(), and visualizations to understand your data. Handle missing values, outliers, and skewed distributions before training models. Always split into train and test sets to evaluate performance honestly.

Key takeaways:

EDA: understand structure, statistics, distributions, relationships
Visualization: histograms, box plots, heatmaps, scatter plots
Missing values: mean/median fill or drop rows/columns
Outliers: IQR or Z-score detection, cap or remove
Skewed: log transform or Box-Cox for normalization
Feature engineering: create new features from existing ones
Encoding: one-hot for categories, label encoding for ordinal
Standardization: scale numeric features to mean=0, std=1
Split: 80/20 train/test, keep test data untouched

What's Next: Linear Regression

The next chapter trains your first machine learning model — linear regression for predicting house prices.