Feature Engineering with Pandas

Chawala Pancharoen

5 min read · Aug 19, 2020

Feature engineering is part of the data preparation stage of building a model. This overview draws on the Medium article "[ML-by-newbies] Feature Engineering คือ อะไร ?" (What is feature engineering?).

Feature engineering has two main goals:

1. To prepare an input dataset that suits the ML algorithm you intend to use

2. To improve the performance of the machine learning model

Feature engineering improves a machine learning model by working on the features, or columns, of the data so that the data is of higher quality. A dataset may have missing values in some columns or rows, some columns may hold much larger values than others, and some rows may contain isolated extreme values in a column, all of which can cause problems. We therefore examine the quality of the data from several angles and then choose a suitable feature engineering technique.

This article covers the following seven techniques:

Imputation
Handling Outliers
Drop Outlier with Standard Deviation
Drop with Percentiles
Binning
Log Transform
One-hot encoding

Before starting the workshop in Jupyter Notebook, install the Pandas Profiling library with the following command:

pip install pandas-profiling[notebook]

The library helps you understand the quality of a dataset quickly: its ProfileReport computes a range of summary statistics for you. Newer releases of the library are published as ydata-profiling.

Import the required libraries:

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

Download the titanic.csv data file from the GitLab project. The columns have the following meanings:

PassengerId: passenger sequence number
Survived: whether the passenger survived
Pclass: ticket class, where 1 = Upper, 2 = Middle, 3 = Lower
Name: passenger name
Sex: sex
Age: age of the passenger
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation, where C = Cherbourg, Q = Queenstown, and S = Southampton

Then read the file:

# Adjust the path and separator to match where and how your copy of titanic.csv is stored
df = pd.read_csv('titanic.csv', sep=';')

Get an overview of the whole dataset:

profile = ProfileReport(df, title="Pandas Profiling Report")
profile

Workshop

1. Imputation: handling missing data in the table

Imputation replaces each missing value with a substitute. We can fill the gaps with the mean computed from all the observed values in the column, or simply fill everything with 0.

The command below counts the cells that are null in every column:

print(df.isnull().sum())

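The output shows missing values in the Age, Cabin, and Embarked columns.

There are several ways to replace missing values:

1. The mean of all values in the Age column. Copy the data into a new variable named new_df and fill the gaps:

new_df = df.copy()
# Assign the filled column back; fillna(inplace=True) on a selected column is unreliable in recent pandas
new_df['Age'] = new_df['Age'].fillna(df['Age'].mean())
# Check new_df, not df, to confirm the gaps were filled
print(new_df.isnull().sum())

The output shows that Age no longer has missing values. The filled-in ages are only estimates, though, so the data may not match the passengers' real ages.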
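To see mean imputation on something small enough to check by hand, here is a minimal sketch. The toy DataFrame and the Age_was_missing indicator column are assumptions added for illustration; they are not part of the original workshop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Age column
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0, np.nan, 60.0]})

# Mean of the observed ages; NaN values are skipped by default
mean_age = toy["Age"].mean()

# Fill the gaps on a copy and assign the result back to the column
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(mean_age)

# Optional: remember which rows were imputed (an illustrative extra)
filled["Age_was_missing"] = toy["Age"].isnull()

print(filled)
```

Keeping an indicator column is one possible way to let a model distinguish real ages from estimated ones.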

2. Dropping rows or columns by threshold. The mean of the null mask for each column gives the fraction of its values that are missing, and any column whose fraction is at or above the threshold is dropped.

df.isnull().mean()

threshold = 0.5
new_df = df[df.columns[df.isnull().mean() < threshold]]

Setting threshold = 0.5 (that is, 50%) means a column is kept only if less than half of its values are missing; any column at or above the threshold is dropped.
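The same idea can be applied to rows as well as columns. The following is a small sketch on made-up data (the column names a, b, c are hypothetical), assuming the row-wise version simply takes the mean of the null mask along axis=1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with different amounts of missing data
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],        # 25% missing
    "b": [np.nan, np.nan, 6.0, 7.0],     # 50% missing
    "c": [np.nan, np.nan, np.nan, 10.0]  # 75% missing
})
threshold = 0.5

# Column-wise: keep columns whose missing fraction is below the threshold
kept_columns = toy.loc[:, toy.isnull().mean() < threshold]

# Row-wise: keep rows whose missing fraction is below the threshold
kept_rows = toy.loc[toy.isnull().mean(axis=1) < threshold]

print(list(kept_columns.columns))
print(kept_rows.index.tolist())
```

Because the comparison is strict, column b, which is exactly 50% missing, is dropped along with column c.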

3. The median of each column

The mean of a column is sensitive to outliers, which pull the average toward them. The median is more robust to outliers.

# numeric_only=True skips text columns such as Name, which have no median
print(df.median(numeric_only=True))
new_df = df.fillna(df.median(numeric_only=True))
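A quick way to see why the median is the safer fill value is to add one extreme value to a small series and compare. The numbers below are invented for illustration.

```python
import pandas as pd

# Hypothetical ages without any extreme value
ages = pd.Series([22, 25, 27, 30, 31])

# The same ages plus one implausible entry
with_outlier = pd.concat([ages, pd.Series([250])], ignore_index=True)

# The mean jumps when the outlier is added; the median barely moves
print("mean:  ", ages.mean(), "->", with_outlier.mean())
print("median:", ages.median(), "->", with_outlier.median())
```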

4. The value zero (0)

new_df = df.fillna(0)

2. Handling Outliers: dealing with abnormal values

These techniques deal with values that sit far away from the rest of the data, which are called outliers.

Import the required libraries:

import seaborn as sns
from matplotlib import pyplot as plt

Draw a box plot to look at the outliers in the Age column:

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=df['Age'], color='lime')
plt.xlabel('Age', fontsize=14)  # label the axis with the column being plotted
plt.show()

Then look at the summary statistics of the data.
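The summary statistics and the points a box plot would mark as outliers can be computed directly. This is a sketch on invented ages, assuming the conventional rule of 1.5 times the interquartile range for the whiskers, which is also seaborn's default.

```python
import pandas as pd

# Hypothetical ages; replace with df['Age'] in the workshop
ages = pd.Series([2, 18, 22, 25, 28, 30, 35, 40, 80], name="Age")

# count, mean, std, min, quartiles, and max in one call
stats = ages.describe()
print(stats)

# Whisker limits under the 1.5 * IQR convention
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones drawn as individual points
flagged = ages[(ages < lower_fence) | (ages > upper_fence)]
print(flagged.tolist())
```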

3. Drop Outlier with Standard Deviation: removing outliers from a column

print(df.shape)

factor = 3
upper_lim = df['Age'].mean() + df['Age'].std() * factor
lower_lim = df['Age'].mean() - df['Age'].std() * factor

drop_outlier1 = df[(df['Age'] < upper_lim) & (df['Age'] > lower_lim)]
print(drop_outlier1.shape)

The output shows that the number of rows has gone down. Rows with a missing Age are removed too, because a comparison with NaN is always False. Plot the box plot again:

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=drop_outlier1['Age'], color='lime')
plt.xlabel('Age', fontsize=14)
plt.show()

Compared with the statistics from before the outliers were removed, the maximum, for example, falls from 80 to 71.
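The standard deviation filter can be tried on a small made-up series. The sketch below also shows one possible way to keep rows whose age is missing, which is an assumption about what you may want rather than something the workshop does.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 19 ordinary ages, one extreme value, one missing value
toy = pd.DataFrame({"Age": list(range(20, 39)) + [300, np.nan]})

factor = 3
mean, std = toy["Age"].mean(), toy["Age"].std()  # NaN is ignored here
upper_lim = mean + factor * std
lower_lim = mean - factor * std

# Same filter as in the workshop: the extreme value and the NaN row both go
in_range = (toy["Age"] < upper_lim) & (toy["Age"] > lower_lim)
dropped = toy[in_range]

# Variant that keeps rows with a missing Age for later imputation
kept_missing = toy[in_range | toy["Age"].isnull()]

print(toy.shape, dropped.shape, kept_missing.shape)
```

A single extreme value inflates both the mean and the standard deviation, so this filter only catches values that are very far out.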

4. Drop with Percentiles: trimming both tails of the data so that it is not skewed toward one side

This removes the rows whose Age is less than or equal to the 0.05 quantile or greater than or equal to the 0.95 quantile.

print(df.shape)

upper_lim = df['Age'].quantile(.95)
lower_lim = df['Age'].quantile(.05)

drop_outlier2 = df[(df['Age'] < upper_lim) & (df['Age'] > lower_lim)]
print(drop_outlier2.shape)

fig = plt.figure(figsize=(12,8))
sns.boxplot(x=drop_outlier2['Age'], color='skyblue')
plt.xlabel('Age', fontsize=14)
plt.show()
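One thing worth noticing about percentile trimming is that it removes roughly the same share of rows no matter what the data looks like. A small sketch on the invented ages 1 to 100 makes this visible.

```python
import pandas as pd

# Hypothetical ages 1..100, with no real outliers at all
ages = pd.Series(range(1, 101), name="Age")

lower_lim = ages.quantile(0.05)
upper_lim = ages.quantile(0.95)

# Keep only values strictly between the two quantiles
trimmed = ages[(ages > lower_lim) & (ages < upper_lim)]

print(lower_lim, upper_lim)
print(len(ages), "->", len(trimmed))
```

About 10% of the rows are discarded here even though none of them is unusual, so the cut-off quantiles are best chosen after looking at the distribution.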

5. Binning: dividing data into ranges

Binning can help protect against overfitting to some degree when training a model. Here Age is divided into low, mid, and high.

labels = ['low', 'mid', 'high']
bins = [0., 20., 40., 100.]
# Work on an explicit copy so pandas does not warn about assigning to a slice of df
drop_outlier2 = drop_outlier2.copy()
drop_outlier2['Age_cat'] = pd.cut(drop_outlier2['Age'], labels=labels, bins=bins, include_lowest=False)
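It helps to know exactly where the bin edges fall. With the default settings, pd.cut builds intervals that are open on the left and closed on the right, so the bins above are (0, 20], (20, 40], and (40, 100]. The sketch below checks this on a few invented ages.

```python
import pandas as pd

labels = ['low', 'mid', 'high']
bins = [0., 20., 40., 100.]

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0, 5, 20, 21, 40, 41, 80])

age_cat = pd.cut(ages, bins=bins, labels=labels, include_lowest=False)

# 0 is outside (0, 20] and becomes NaN; 20 is 'low', 40 is 'mid'
print(pd.DataFrame({"Age": ages, "Age_cat": age_cat}))
```

Setting include_lowest=True would place an age of exactly 0 in the first bin instead of leaving it unassigned.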

6. Log Transform: transforming data with the logarithm

The log transform helps with non-normally distributed data that is right-skewed (positively skewed), bringing it closer to a normal distribution.
It also reduces the influence of outliers and tends to make the model more stable: when new data is run through the model, its accuracy or error stays more consistent.

drop_outlier2['log'] = (drop_outlier2['Age']).transform(np.log)
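The effect on skewness can be measured with Series.skew. The sketch below uses invented values that grow by a factor of ten, and it also shows np.log1p, which is a common substitute (an assumption here, not part of the workshop) when a column contains zeros, since np.log(0) is negative infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, each ten times the previous one
values = pd.Series([1.0, 10.0, 100.0, 1000.0, 10000.0])

raw_skew = values.skew()          # strongly positive
log_skew = np.log(values).skew()  # close to zero: the logs are evenly spaced

print(raw_skew, log_skew)

# log1p computes log(1 + x), so zero maps to zero instead of -inf
with_zero = pd.Series([0.0, 9.0, 99.0])
print(np.log1p(with_zero).tolist())
```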

7. One-hot encoding: encoding data as binary 0 and 1

One-hot encoding takes categorical data, normally stored as nominal or ordinal values, and splits it into separate binary 0/1 columns, one per value. Here the values are the groups created in the Binning step.

# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
encoded_columns = pd.get_dummies(drop_outlier2['Age_cat'], dtype=int)
drop_outlier2 = drop_outlier2.join(encoded_columns)
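Here is a minimal, self-contained sketch of the encoding on a hand-made categorical column. The prefix argument is an optional extra used for illustration so that the new column names show which feature they came from.

```python
import pandas as pd

# Hypothetical Age_cat values with the same categories as the Binning step
age_cat = pd.Series(
    pd.Categorical(["low", "high", "mid", "low"],
                   categories=["low", "mid", "high"], ordered=True),
    name="Age_cat",
)

# One column per category, filled with 0/1
dummies = pd.get_dummies(age_cat, prefix="Age", dtype=int)
print(dummies)

# Each row has exactly one 1, which is what "one-hot" refers to
print(dummies.sum(axis=1).tolist())
```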

Reference

Feature Engineering for AI and Machine Learning (การทำ Feature Engineering ด้วย Pandas), blog.pjjop.org
