Machine Learning - Data Processing

I’m Xin, a passionate AI enthusiast and tech-savvy explorer on a mission to demystify the world of artificial intelligence. My journey into AI began with a fascination for how machines can learn and adapt, and it has since grown into a deep dive into the cutting-edge technologies that are shaping our future. On this blog, I aim to share my discoveries, insights, and experiences with fellow AI aficionados and curious minds. Whether you’re a seasoned developer, a tech student, or just someone intrigued by the possibilities of AI, I hope you’ll find something valuable here. From the latest breakthroughs in machine learning to practical applications in everyday life, I strive to make complex concepts accessible and engaging. Join me as we explore the fascinating intersection of AI, technology, and human ingenuity. Feel free to reach out if you have any questions or just want to chat about all things AI. Let’s embark on this exciting journey together!
We’ve covered the basic concepts of AI and the maths behind it. Now let’s continue our journey with Machine Learning.
What is Machine Learning?
Machine Learning is a subset of AI that enables systems to learn patterns from data and make predictions or decisions without being explicitly programmed. Its key strength is that, instead of relying on hand-written rules, ML lets the algorithm learn from examples.
Types of Machine Learning
Supervised Learning:
The algorithm learns from labelled data (input-output pairs).
Examples:
Regression: Predicting house prices.
Classification: Classifying emails as spam or not spam.
Unsupervised Learning:
The algorithm learns patterns from unlabelled data.
Examples:
Clustering: Grouping customers based on purchasing behaviour.
Dimensionality Reduction: Reducing the number of features in a dataset.
Reinforcement Learning:
The algorithm learns by interacting with an environment and receiving rewards or penalties.
Examples:
Training a robot to walk.
Teaching an AI to play games like Chess or Go.
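To make the difference between the first two types concrete, here is a minimal sketch using Scikit-learn's built-in Iris dataset. The choice of LogisticRegression and KMeans is my own assumption for illustration; any classifier or clustering algorithm would show the same contrast. The supervised model is given the labels, while the unsupervised one only sees the features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Supervised: the model is trained on both the features X and the labels y.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
supervised_pred = clf.predict(X[:5])

# Unsupervised: the model sees only X and groups similar rows together.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

print("Predicted classes:", supervised_pred)
print("Cluster ids:", cluster_ids[:5])
```

Note that the cluster ids are arbitrary group numbers; they are not guaranteed to line up with the species labels.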
Data Processing
As we know, data is one of the three major elements of AI. Real-world data is often messy, so it usually needs to be cleaned and transformed before it is fed into an ML model.
Steps of Data Processing
Handling Missing Data: Filling or removing missing values.
Normalization/Scaling: Rescaling features to a comparable range, typically [0, 1] or [-1, 1], or standardizing them to zero mean and unit variance.
Feature Engineering: Creating new features or transforming existing ones.
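The three steps above can be sketched on a tiny table. The data and the column names (height_cm, weight_kg, bmi) are hypothetical, made up only for this illustration, and MinMaxScaler is just one way to reach the [0, 1] range. Here the new feature is created before scaling so that it is computed from the original units.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, np.nan, 180.0],
    "weight_kg": [50.0, np.nan, 70.0, 80.0],
})

# 1. Handling missing data: fill each gap with its column mean.
df = df.fillna(df.mean())

# 2. Feature engineering: derive a new column (BMI) from existing ones.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# 3. Scaling: map every column to the [0, 1] range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled)
```

Filling with the mean is only one option. Dropping the affected rows with dropna() is another, and which one is better depends on how much data is missing.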
Example: Preprocessing the Iris Dataset with Pandas and Scikit-learn
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target
# Display the first few rows
print("Original Data:\n", data.head())
# Handle missing data (if any)
# Example: Fill missing values with the mean
data.fillna(data.mean(), inplace=True)
# Normalize/Scale the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop('species', axis=1))
scaled_data = pd.DataFrame(scaled_features, columns=iris.feature_names)
scaled_data['species'] = data['species']
# Display the scaled data
print("\nScaled Data:\n", scaled_data.head())
# Split data into training and testing sets
X = scaled_data.drop('species', axis=1)
y = scaled_data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
In the above code example, I load the Iris dataset and handle missing values (if any) by filling them with the column mean:
data.fillna(data.mean(), inplace=True)
The features are then standardized with StandardScaler, which rescales each one to zero mean and unit variance so that they share a similar scale.
Finally, the dataset is split into training and testing sets for model evaluation (80% training, 20% testing), and the printed output shows the shape of each set.
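One thing worth knowing: the example above scales the whole dataset before splitting it, so the scaler also sees the test rows. A common alternative, sketched below as a variation rather than a correction of the code above, is to split first and fit the scaler on the training set only. It uses the same Scikit-learn functions as before.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("Train mean per feature:", X_train_scaled.mean(axis=0).round(3))
print("Test mean per feature:", X_test_scaled.mean(axis=0).round(3))
```

The training means come out at zero, while the test means are only close to zero, because the test set is scaled with statistics it did not contribute to.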
Key takeaways:
Machine Learning enables systems to learn from data and make predictions or decisions.
The three main types of ML are Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
Data preprocessing is a crucial step in ML to ensure the data is clean, normalized, and ready for modelling.
In this blog, I introduced the fundamentals of Machine Learning, explored its types, and walked through essential data preprocessing steps to prepare data for ML. In the next blog, I will dive deeper into Supervised Learning and build some cool models. It will only get more fun :) Let’s keep exploring…



