Machine Learning - Data Processing

We’ve covered the basic concepts of AI and the maths behind it. Now let’s continue our journey with Machine Learning.

What is Machine Learning?

Machine Learning (ML) is a subset of AI that enables systems to learn patterns from data and make predictions or decisions without being explicitly programmed. Its key strength is that, instead of a programmer writing the rules by hand, the algorithm learns them from examples.
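To make the "rules vs. examples" idea concrete, here is a minimal sketch. The data is a made-up toy set (number of exclamation marks in a message, and whether it was spam), and the function name rule_based is hypothetical; a one-level decision tree stands in for "the algorithm".

```python
from sklearn.tree import DecisionTreeClassifier

# Toy, made-up data: exclamation marks per message -> spam (1) or not spam (0)
X = [[0], [1], [2], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1]


def rule_based(n_exclamations):
    """Traditional programming: a human picks the threshold."""
    return 1 if n_exclamations > 5 else 0


# Machine learning: the model finds a threshold from the labelled examples
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(X, y)

hand_written = [rule_based(1), rule_based(8)]
learned = model.predict([[1], [8]]).tolist()

print("Hand-written rule:", hand_written)
print("Learned from data:", learned)
```

Both approaches agree on this toy data, but only the second one would adapt by itself if the examples changed.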

Types of Machine Learning
  1. Supervised Learning:

    • The algorithm learns from labelled data (input-output pairs).

    • Examples:

      • Regression: Predicting house prices.

      • Classification: Classifying emails as spam or not spam.

  2. Unsupervised Learning:

    • The algorithm learns patterns from unlabelled data.

    • Examples:

      • Clustering: Grouping customers based on purchasing behaviour.

      • Dimensionality Reduction: Reducing the number of features in a dataset.

  3. Reinforcement Learning:

    • The algorithm learns by interacting with an environment and receiving rewards or penalties.

    • Examples:

      • Training a robot to walk.

      • Teaching an AI to play games like Chess or Go.
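As a rough illustration of the first two types, the sketch below runs one supervised and two unsupervised algorithms on the Iris dataset that ships with Scikit-learn. The choice of LogisticRegression, KMeans and PCA is my own assumption for the demo, not the only option. Reinforcement Learning is left out because it needs an environment to interact with.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised (classification): learns from inputs X AND labels y
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
supervised_pred = clf.predict(X[:5])

# Unsupervised (clustering): sees only X and groups similar rows together
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Unsupervised (dimensionality reduction): 4 features -> 2 features
X_2d = PCA(n_components=2).fit_transform(X)

print("Predicted classes:", supervised_pred)
print("Cluster ids of first rows:", cluster_ids[:5])
print("Reduced shape:", X_2d.shape)
```

Note that the cluster ids are arbitrary group numbers; they are not guaranteed to match the species labels.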

Data Processing

As we know, data is one of the three pillars of AI. Real-world data is often messy, so it usually needs to be cleaned and transformed before it is fed into an ML model.

Steps of Data Processing

  • Handling Missing Data: Filling or removing missing values.

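  • Normalization/Scaling: Rescaling features to a common scale, e.g. min-max normalization to a fixed range such as [0, 1] or [-1, 1], or standardization to zero mean and unit variance.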

  • Feature Engineering: Creating new features or transforming existing ones.
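The three steps above can be sketched on a tiny, made-up table. The column names (height_cm, weight_kg, bmi) and values are hypothetical and only serve the illustration; MinMaxScaler is used here to show scaling to the [0, 1] range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up data with two missing values
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, np.nan, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, np.nan],
})

# 1. Handling missing data: fill each gap with its column mean
filled = df.fillna(df.mean())

# 2. Normalization/Scaling: squeeze every column into [0, 1]
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(filled),
    columns=filled.columns,
)

# 3. Feature engineering: derive a new feature from existing ones
filled["bmi"] = filled["weight_kg"] / (filled["height_cm"] / 100) ** 2

print(filled)
print(scaled)
```

Filling with the mean is only one strategy; dropping the rows with df.dropna() is the "removing" alternative mentioned above.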

Example: Preprocessing the Iris Dataset with Pandas and Scikit-learn
# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target

# Display the first few rows
print("Original Data:\n", data.head())

# Handle missing data (Iris has none, so this is a no-op here)
# Example: fill missing values with each column's mean
data.fillna(data.mean(), inplace=True)

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop('species', axis=1))
scaled_data = pd.DataFrame(scaled_features, columns=iris.feature_names)
scaled_data['species'] = data['species']

# Display the scaled data
print("\nScaled Data:\n", scaled_data.head())

# Split data into training and testing sets
X = scaled_data.drop('species', axis=1)
y = scaled_data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

In the above code example, I load the Iris dataset into a Pandas DataFrame and then apply three steps:

  • Any missing values are filled with the column mean, e.g. data.fillna(data.mean(), inplace=True). Iris has no missing values, so this step changes nothing here.

  • The features are standardized with StandardScaler, which rescales each one to zero mean and unit variance so that they are on a similar scale.

  • The dataset is split into training and testing sets for model evaluation (80% training, 20% testing).

The printed output shows the shapes of the training and testing data: (120, 4) and (30, 4) for the 150-sample Iris dataset.
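One variation worth knowing: the example above fits the scaler on the whole dataset before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training set only. The sketch below is my own reordering of the same Scikit-learn calls, not a different library feature.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Split first, so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn mean and standard deviation from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ...and reuse those same statistics to transform the test data
X_test_scaled = scaler.transform(X_test)

print("Training Data Shape:", X_train_scaled.shape)
print("Testing Data Shape:", X_test_scaled.shape)
```

For a small, clean dataset like Iris the difference is minor, but the habit helps keep evaluation results honest on real-world data.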

Key takeaways:

  • Machine Learning enables systems to learn from data and make predictions or decisions.

  • The three main types of ML are Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

  • Data preprocessing is a crucial step in ML to ensure the data is clean, normalized, and ready for modelling.

In this blog, I introduced the fundamentals of Machine Learning, explored its types, and walked through essential data preprocessing steps with Pandas and Scikit-learn to prepare data for ML. In the next blog, I will dive deeper into Supervised Learning and build some cool models. It will only get more fun :) Let’s keep exploring…