Data Cleaning and Exploration: Pandas in Practice

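There is a classic saying in the industry:

"In a machine learning project, 80% of the time goes to handling data, and only 20% goes to training models."

There is a lot of truth in this. A model's accuracy usually depends far more on the quality of its data than on how sophisticated the algorithm is. Feed a model garbage data and all you get back is garbage predictions (Garbage In, Garbage Out).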

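Loading the Dataset

We will use a classic introductory machine learning dataset: the California Housing dataset, available through Scikit-Learn's dataset loaders. It contains 20,640 records, each describing one census block group in California, with 8 numeric features plus the median house value as the target.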

import pandas as pd
import numpy as np

# Load from Scikit-Learn's dataset loaders (downloaded and cached on first call)
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

# Show the first 5 rows
df.head()

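Exploratory Data Analysis (EDA)

After loading the data, the first step is not to train a model right away but to get to know your data:

1. Basic dataset information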

# Shape of the dataset (rows, columns)
print(f"Dataset shape: {df.shape}")
print(f"{df.shape[0]} rows, {df.shape[1]} columns")

# Data type of each column
df.info()

# Basic summary statistics
df.describe()

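df.describe() is your best friend! For every numeric column it shows:

count: the number of non-missing values
mean: the average
std: the standard deviation (how spread out the data is)
min / max: the smallest and largest values
25% / 50% / 75%: the quartiles (50% is the median)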
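
To make the rows of the summary concrete, here is a minimal sketch on a tiny made-up frame (the `toy` data and its column names are illustrative assumptions, not part of the housing dataset). It shows that `count` skips missing values and that the `50%` row is the median.

```python
import pandas as pd

# Tiny illustrative frame with made-up values (not the housing data)
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 6.0, None],  # one missing value
    "age": [10, 20, 30, 40, 50],
})

summary = toy.describe()

# count ignores missing values, so "rooms" reports 4 rather than 5
rooms_count = summary.loc["count", "rooms"]

# the 50% row is the median of the column
age_median = summary.loc["50%", "age"]

print(summary)
print(rooms_count, age_median)
```

A count that is lower than the number of rows is often the first hint that a column has missing values.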

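2. Checking for missing values

# Count the missing values in each column
df.isnull().sum()

# If a column has missing values, you can fill them with the median.
# 'column_name' is a placeholder; replace it with a real column name.
# Assign the result back instead of using inplace=True on a selected column.
df['column_name'] = df['column_name'].fillna(df['column_name'].median())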
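
The snippet above uses a placeholder column, so here is a small runnable sketch on made-up data (the `toy` frame and its values are assumptions for illustration). It computes the share of missing values per column and fills every numeric column with its own median in one step.

```python
import numpy as np
import pandas as pd

# Made-up data: "income" has one gap and one extreme value, "age" has two gaps
toy = pd.DataFrame({
    "income": [3.0, np.nan, 5.0, 100.0],
    "age": [10.0, 20.0, np.nan, np.nan],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_ratio = toy.isnull().mean()

# Fill each numeric column with its own median
filled = toy.fillna(toy.median(numeric_only=True))

print(missing_ratio)
print(filled)
```

The median is used here because it is less sensitive to extreme values than the mean: the 100.0 in `income` would pull a mean-based fill far above the typical value.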

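3. Visual analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Optional: choose a CJK-capable font (if installed) so Chinese labels render correctly
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']

# Plot a histogram of the house value distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['MedHouseVal'], bins=50, kde=True)
plt.title('Distribution of House Values')
plt.xlabel('Median house value (in $100,000s)')
plt.ylabel('Count')
plt.show()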

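From the distribution plot you can check:

Are house values roughly normally distributed, or are they concentrated in a certain range?
Are there extreme values (for example, unusually expensive or unusually cheap houses)?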
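
Both questions can also be checked numerically. The sketch below uses a short made-up series (`prices` is an assumption, standing in for a column such as `MedHouseVal`) and applies two common rules of thumb: skewness for asymmetry, and the 1.5 × IQR fence for extreme values.

```python
import pandas as pd

# Made-up values standing in for a price column
prices = pd.Series([1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 5.0])

# Skewness: near 0 for a symmetric distribution, positive for a long right tail
skewness = prices.skew()

# IQR rule of thumb: flag values beyond 1.5 * IQR outside the quartiles
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(f"skewness: {skewness:.2f}")
print(outliers)
```

The 1.5 multiplier is a convention rather than a law; whether a flagged value is an error or a genuine expensive house is still a judgment call.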

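# Plot a correlation heatmap
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features and House Value')
plt.show()

# Find the 3 features with the highest positive correlation with house value
corr_with_price = correlation_matrix['MedHouseVal'].sort_values(ascending=False)
print("Features most correlated with house value:")
print(corr_with_price.head(4))  # the first row is MedHouseVal itself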
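
Sorting by the signed value puts strong negative correlations at the bottom of the list. As a possible refinement, the sketch below drops the target's correlation with itself and ranks by absolute value. The `toy` frame and its columns are made up for illustration.

```python
import pandas as pd

# Made-up data: "income" rises with price, "distance" falls, "noise" is unrelated
toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "distance": [5.0, 4.0, 3.5, 2.0, 1.0],
    "noise": [2.0, 1.0, 2.0, 1.0, 2.0],
    "price": [1.1, 2.0, 2.9, 4.2, 5.0],
})

# Correlation of every feature with the target, without the target itself
corr_with_price = toy.corr()["price"].drop("price")

# Rank by strength, regardless of sign
ranked = corr_with_price.abs().sort_values(ascending=False)

print(corr_with_price)
print(ranked)
```

Keep in mind that this measures linear association only; a feature with a low coefficient can still matter through a non-linear relationship.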

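Feature Engineering

Feature engineering transforms raw data into a form that a model can use more easily. Here are a few common techniques:

1. Creating composite features

Sometimes a combination of two features is more predictive than either one alone:

# Example: average rooms / average bedrooms = rooms per bedroom
df['rooms_per_bedroom'] = df['AveRooms'] / df['AveBedrms']

# Example: a squared term lets a linear model capture a non-linear effect of house age
# (for instance, older houses may need disproportionately more maintenance)
df['house_age_squared'] = df['HouseAge'] ** 2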

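2. Encoding categorical features

Most machine learning models only work with numbers, so categorical data (for example city names or colors) has to be converted first:

# One-Hot Encoding: turn each category into its own 0/1 column
# 'categorical_column' is a placeholder; the California Housing data has no categorical columns
df_encoded = pd.get_dummies(df, columns=['categorical_column'], drop_first=True)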
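
Since the housing data has no categorical column to try this on, here is a small sketch with a made-up `city` column (the frame and its values are assumptions for illustration). It shows which columns `pd.get_dummies` produces and what `drop_first=True` removes.

```python
import pandas as pd

# Made-up frame with one categorical column
toy = pd.DataFrame({
    "city": ["Taipei", "Tainan", "Taipei", "Kaohsiung"],
    "price": [5.0, 2.0, 6.0, 3.0],
})

# drop_first=True drops the first category in sorted order ("Kaohsiung"),
# which is then represented by all remaining city columns being 0
encoded = pd.get_dummies(toy, columns=["city"], drop_first=True, dtype=int)

print(encoded)
```

Dropping one category avoids perfectly redundant columns, which matters mostly for linear models; tree-based models usually work fine either way.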

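3. Standardization

Different features often come in different units and ranges (for example, floor area might run from 10 to 100 while house age runs from 1 to 50). Standardization rescales each feature to zero mean and unit variance so that all features are on a comparable scale, which helps many models converge faster:

from sklearn.preprocessing import StandardScaler

# For real modeling, fit the scaler on the training split only and reuse it
# on the test split, so that test-set statistics do not leak into training
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['MedInc', 'HouseAge', 'AveRooms']])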
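
What the scaler computes per column is simply z = (x - mean) / std. The NumPy sketch below works through that formula by hand on made-up values (the array `x` is an assumption), using the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up feature values
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mean = x.mean()
std = x.std()  # NumPy's default ddof=0 is the population standard deviation

# Standardize: subtract the mean, divide by the standard deviation
z = (x - mean) / std

print(z)
print(z.mean(), z.std())
```

After the transformation the values have mean 0 and standard deviation 1 (up to floating-point error), whatever the original unit was.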

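Splitting into Training and Test Sets

Before training a model, you must split the data into a "training" part and a "testing" part:

from sklearn.model_selection import train_test_split

# Features (X) and target (y)
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

random_state=42 fixes the random seed so the split is identical on every run, which makes debugging and comparison easier.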
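
The sketch below ties the split together with standardization, using synthetic data (the columns `a` and `b` and all values are assumptions for illustration). It checks that the same `random_state` reproduces the same split, and fits the scaler on the training set only, so that no information from the test set leaks into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 rows, two features on very different scales
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(50, 10, 100), "b": rng.normal(5, 1, 100)})
y = pd.Series(rng.normal(size=100))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Splitting again with the same random_state selects exactly the same rows
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = X_train.index.equals(X_train2.index)

# Learn mean and std from the training set only, then reuse them on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape, same_split)
```

The scaled test set will not have exactly zero mean, and that is expected: it is transformed with the training statistics, just as genuinely new data would be.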

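Speeding Up Data Cleaning with Vibe Coding

Don't feel like writing all this code by hand? Let AI handle it:

🔥 [Data-cleaning prompt example]
"I have a CSV file, house_data.csv. Please help me:
1. Read it with Pandas and show the dataset size and column information.
2. Check the share of missing values in each column, and drop any column where more than 50% is missing.
3. Fill missing values in numeric columns with the median.
4. Use the IQR method to find outliers and remove them.
5. Plot distribution histograms for all numeric columns (one large figure with multiple subplots).
6. Write the cleaned dataset to clean_house_data.csv."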
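
For reference, steps 2 to 4 of such a prompt could come back as something like the sketch below. The function name `clean_frame`, its parameters, and the sample data are all hypothetical; file reading, plotting, and CSV export (steps 1, 5, and 6) are left out so the example runs without any files.

```python
import numpy as np
import pandas as pd

def clean_frame(df, max_missing=0.5, iqr_factor=1.5):
    """Hypothetical helper: drop sparse columns, fill medians, remove IQR outliers."""
    # Step 2: keep only columns whose missing ratio is at most max_missing
    keep = df.columns[df.isnull().mean() <= max_missing]
    out = df[keep].copy()

    # Step 3: fill missing values in numeric columns with each column's median
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # Step 4: keep rows where every numeric value lies within the IQR fences
    q1 = out[num_cols].quantile(0.25)
    q3 = out[num_cols].quantile(0.75)
    iqr = q3 - q1
    in_range = (
        (out[num_cols] >= q1 - iqr_factor * iqr)
        & (out[num_cols] <= q3 + iqr_factor * iqr)
    ).all(axis=1)
    return out[in_range]

# Made-up sample: one gap, one extreme price, one mostly empty column
raw = pd.DataFrame({
    "price": [1.0, 1.2, np.nan, 1.1, 1.3, 50.0],
    "rooms": [3.0, 4.0, 3.0, 4.0, 3.0, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
})

cleaned = clean_frame(raw)
print(cleaned)
```

Whatever the AI produces, it is worth reading the result the same way: check which columns and rows were dropped, and whether removing the outliers is really what you want for your data.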

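Today's Summary

In this chapter, you learned:

✅ Exploratory Data Analysis (EDA): using .info() and .describe() to understand the data quickly
✅ Visual analysis: plotting distributions and heatmaps with Matplotlib and Seaborn
✅ Handling missing values: checking for and filling in missing values
✅ Feature engineering: creating composite features, encoding categorical data, and standardizing numeric values
✅ Data splitting: dividing the data into a training set and a test set

In the next chapter, we will finally train our first machine learning model!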