Introduction to Data Transformation

Welcome back, future data wizard! In our previous chapters, we successfully set up our environment and learned how to load datasets using Meta AI’s powerful open-source library for dataset management (let’s refer to it as MetaDS from now on). We’ve got our data, but is it ready for prime time? Not always!

Imagine you’re a chef, and the raw dataset is your basket of ingredients. Some vegetables might be dirty, some fruits overripe, and you might need to combine a few things to create a new, exciting flavor. This is exactly what data transformation is all about in machine learning: cleaning up your raw data and crafting new features to make your model smarter and more effective. This chapter will dive deep into these crucial steps, equipping you with the MetaDS tools to turn raw data into a pristine, high-impact dataset.

By the end of this chapter, you’ll understand the core concepts of data cleaning and feature engineering, and you’ll be able to apply practical, step-by-step techniques using MetaDS to prepare your data for robust machine learning models. Get ready to transform your data and unlock its true potential!

Core Concepts: Shaping Your Data

Data transformation is a broad term encompassing all operations that convert raw data into a format suitable for analysis and model training. It typically involves two main pillars: Data Cleaning and Feature Engineering.

What is Data Cleaning?

Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset. Think of it as tidying up your data so that your models don’t get confused by inconsistencies. Why is this so important? As the old adage goes in machine learning: “Garbage in, garbage out!” A clean dataset leads to more accurate and reliable models.

Here are some common data cleaning tasks:

1. Handling Missing Values

Missing data is a ubiquitous problem. It can occur for many reasons: data entry errors, sensor malfunctions, privacy concerns, or simply unrecorded information. MetaDS provides elegant ways to address this.

  • Why it matters: Most machine learning algorithms cannot handle missing values directly. They’ll either throw an error or produce incorrect results.
  • How it works:
    • Deletion: Remove rows or columns with missing values. This is simple but can lead to significant data loss if many values are missing.
    • Imputation: Fill in missing values with estimated ones. Common strategies include using the mean, median, or mode of the column, or more advanced methods like K-Nearest Neighbors (KNN) imputation.
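Before we see the MetaDS API later in this chapter, here is a minimal library-agnostic sketch of both strategies using pandas (used purely for illustration):

```python
import pandas as pd
import numpy as np

# Toy column with one missing entry
ages = pd.Series([25.0, 30.0, np.nan, 40.0])

# Deletion: drop the missing entry entirely (simple, but loses a row)
dropped = ages.dropna()

# Imputation: fill the gap with the mean of the observed values
mean_filled = ages.fillna(ages.mean())

print(dropped.tolist())      # [25.0, 30.0, 40.0]
print(mean_filled.iloc[2])   # (25 + 30 + 40) / 3
```

The median (`ages.median()`) is often preferred over the mean when the column contains outliers, since it is less sensitive to extreme values.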

2. Removing Duplicates

Duplicate records can skew statistical analyses and model training, making your model think certain patterns are more prevalent than they actually are.

  • Why it matters: Duplicates introduce bias and artificially inflate the importance of certain data points.
  • How it works: Identifying and removing rows that are identical across all or a subset of columns.
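A quick pandas sketch of both variants, full-row and subset-based deduplication (illustrative only; the MetaDS equivalent appears later in this chapter):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["NY", "LA", "LA", "NY"],
})

# Exact duplicates across ALL columns: row (2, "LA") appears twice
deduped_all = df.drop_duplicates()

# Uniqueness based on a subset of columns only
deduped_city = df.drop_duplicates(subset=["city"])

print(len(df), len(deduped_all), len(deduped_city))  # 4 3 2
```

Note that subset-based deduplication keeps the first occurrence by default, so be deliberate about row order before applying it.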

3. Correcting Data Types

Sometimes, numerical data might be loaded as strings, or dates as general objects. Ensuring each column has the correct data type is fundamental for proper processing.

  • Why it matters: Operations like mathematical calculations or date comparisons require the correct data types.
  • How it works: Explicitly converting columns to their intended types (e.g., int, float, datetime).
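As a concrete illustration with pandas (the MetaDS `Cast` transform we use later does the same job), converting string columns to their intended types unlocks real arithmetic and date operations:

```python
import pandas as pd

df = pd.DataFrame({
    "income": ["50000", "60000", "75000"],                   # numbers stored as strings
    "signup": ["2023-01-15", "2022-11-20", "2023-03-01"],    # dates stored as strings
})

df["income"] = pd.to_numeric(df["income"])    # now an integer column
df["signup"] = pd.to_datetime(df["signup"])   # now datetime64

print(df["income"].sum())              # 185000 -- real arithmetic works
print(df["signup"].dt.year.tolist())   # [2023, 2022, 2023]
```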

4. Handling Outliers

Outliers are data points that significantly deviate from other observations. They can be legitimate but extreme values, or they could be errors.

  • Why it matters: Outliers can disproportionately influence model training, especially for models sensitive to individual data points like linear regression or K-means clustering.
  • How it works: Detection methods often involve statistical tests (e.g., Z-score, IQR) or visualization. Handling can involve removal, transformation (e.g., log transform), or capping (e.g., winsorization).
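Here is a small pandas sketch of the IQR method mentioned above, including winsorization-style capping (illustrative, not a MetaDS API):

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 looks suspicious

# IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
capped = values.clip(lower, upper)  # winsorization-style capping

print(outliers.tolist())   # [95]
print(capped.max())        # capped at the upper fence
```

Whether to remove, cap, or keep an outlier depends on whether it is an error or a legitimate extreme value; always investigate before deleting.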

What is Feature Engineering?

Feature engineering is the process of using domain knowledge to create new input features from existing ones that help a machine learning model learn more effectively. It is both an art and a science, and it is often the single most impactful step in improving model performance.

  • Why it matters: Raw data often isn’t in the best format for a model. By creating features that better represent the underlying patterns, you can provide your model with stronger signals to learn from, leading to higher accuracy and better generalization.

Here are some common feature engineering tasks:

1. Encoding Categorical Features

Machine learning algorithms typically work with numerical input. Categorical data (like “color” or “city”) needs to be converted into a numerical representation.

  • Why it matters: Models can’t directly process text labels.
  • How it works:
    • One-Hot Encoding: Creates new binary columns for each category. For example, “Red”, “Green”, “Blue” becomes is_Red, is_Green, is_Blue.
    • Label Encoding: Assigns a unique integer to each category. Simple, but implies an ordinal relationship that might not exist.
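To make the contrast concrete, here is a pandas sketch of both encodings applied to the "Red"/"Green"/"Blue" example from above (the MetaDS `OneHotEncode` transform appears later in the chapter):

```python
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Red"])

# One-hot: one binary column per category (columns come out alphabetically)
one_hot = pd.get_dummies(colors, prefix="is")
print(one_hot.columns.tolist())  # ['is_Blue', 'is_Green', 'is_Red']

# Label encoding: one integer per category -- note it implies an order!
labels = colors.astype("category").cat.codes
print(labels.tolist())  # Blue=0, Green=1, Red=2 -> [2, 1, 0, 2]
```

The integer codes suggest Blue < Green < Red, which is meaningless for colors; that is exactly why label encoding is usually reserved for genuinely ordinal categories (e.g., "small" < "medium" < "large").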

2. Scaling Numerical Features

Numerical features often have different ranges and units (e.g., “age” from 0-100, “income” from 0-1,000,000+). Scaling brings them to a comparable range.

  • Why it matters: Many ML algorithms (like gradient descent-based models, SVMs, or KNN) are sensitive to the scale of input features. Larger-scaled features can dominate smaller-scaled ones, leading to suboptimal learning.
  • How it works:
    • Standardization (Z-score normalization): Transforms data to have a mean of 0 and a standard deviation of 1.
    • Normalization (Min-Max scaling): Scales data to a fixed range, usually 0 to 1.
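Both formulas are simple enough to write by hand; here is a NumPy sketch (illustrative only; MetaDS's `StandardScaler` is shown later):

```python
import numpy as np

income = np.array([50000.0, 60000.0, 80000.0, 90000.0])

# Standardization (Z-score): subtract the mean, divide by the std deviation
standardized = (income - income.mean()) / income.std()

# Normalization (Min-Max): map the observed range onto [0, 1]
normalized = (income - income.min()) / (income.max() - income.min())

print(standardized.mean())  # ~0 by construction
print(normalized.tolist())  # [0.0, 0.25, 0.75, 1.0]
```

Standardization is the usual default; min-max scaling is handy when a bounded range is required (e.g., pixel intensities or inputs to certain neural network layers), but it is more sensitive to outliers because the extremes define the range.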

3. Creating New Features

This is where creativity shines! Combine existing features, extract information, or apply mathematical transformations to create entirely new, more informative features.

  • Why it matters: New features can capture complex relationships or domain-specific knowledge that the raw data doesn’t explicitly reveal.
  • How it works: Examples include:
    • Polynomial Features: age -> age, age^2, age^3.
    • Interaction Features: age * income.
    • Date/Time Features: Extract day_of_week, month, year, is_weekend from a timestamp column.
    • Ratio Features: expense / income.
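All four patterns listed above fit in a few lines of pandas (a generic sketch; the MetaDS `ApplyFunction` route is shown later in the chapter):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40],
    "income": [50000, 80000],
    "expense": [20000, 60000],
    "ts": pd.to_datetime(["2023-01-15", "2024-02-10"]),
})

df["age_squared"] = df["age"] ** 2                  # polynomial feature
df["age_x_income"] = df["age"] * df["income"]       # interaction feature
df["expense_ratio"] = df["expense"] / df["income"]  # ratio feature
df["day_of_week"] = df["ts"].dt.dayofweek           # 0 = Monday
df["is_weekend"] = df["ts"].dt.dayofweek >= 5       # Saturday or Sunday

print(df[["age_squared", "expense_ratio", "is_weekend"]])
```

Both sample timestamps fall on a weekend, so `is_weekend` is `True` for each row; a model predicting, say, store traffic could learn a lot from that single derived column.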

Visualizing the Data Transformation Pipeline

To help solidify these concepts, let’s visualize a typical data transformation pipeline.

flowchart TD
    A[Raw Dataset] --> B{Data Cleaning?}
    B -->|Yes| C[Correct Data Types]
    B -->|No| G
    C --> D[Remove Duplicates]
    D --> E[Handle Missing Values]
    E --> F[Handle Outliers]
    F --> G{Feature Engineering?}
    G -->|Yes| H[Encode Categorical Features]
    G -->|No| K
    H --> I[Scale Numerical Features]
    I --> J[Create New Features]
    J --> K[Transformed Dataset]

This diagram illustrates how you typically flow from a raw dataset through various cleaning and engineering steps to arrive at a transformed dataset ready for your models. Each step is crucial for building robust and high-performing machine learning systems.

Step-by-Step Implementation with MetaDS

Now, let’s get our hands dirty with some code! We’ll assume you have a MetaDS.Dataset object named my_dataset loaded from a previous step. If you need a refresher on loading, refer back to Chapter 3.

For our examples, we’ll imagine a simple dataset about customer profiles, which might have missing ages, inconsistent income types, and categorical ‘city’ information.

First, let’s ensure we have metads installed. As of January 2026, the latest stable release for Meta’s Dataset Library (metads) is v1.2.0.

pip install metads==1.2.0

Now, let’s simulate a raw dataset and walk through the transformation process.

1. Initializing a Sample MetaDS.Dataset

We’ll start by creating a synthetic dataset that mimics real-world imperfections.

import metads as mds
import pandas as pd
import numpy as np

# Create a synthetic Pandas DataFrame with imperfections
data = {
    'customer_id': range(1, 11),
    'age': [25, 30, np.nan, 40, 22, 35, 28, np.nan, 50, 45],
    'income': [50000, 60000, '75000', 80000, 45000, 70000, 55000, 90000, 120000, 65000],
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Boston', 'Los Angeles', 'Chicago', 'New York', 'Boston', 'Miami'],
    'enrollment_date': ['2023-01-15', '2022-11-20', '2023-03-01', '2023-01-15', '2024-02-10', '2022-11-20', '2023-05-01', '2023-01-15', '2024-01-05', '2023-07-10'],
    'is_premium': [True, False, True, False, True, False, True, False, True, False]
}
df = pd.DataFrame(data)

# Introduce a duplicate row
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Convert to MetaDS.Dataset
my_dataset = mds.Dataset.from_pandas(df, name="customer_profiles_raw")

print("--- Raw Dataset Schema ---")
my_dataset.schema().print_schema()
print("\n--- Raw Dataset Head ---")
print(my_dataset.head())

Explanation:

  • We import metads and pandas for creating our sample data. numpy is used for np.nan to represent missing values.
  • A dictionary data is created with various columns, including age with missing values, income with one string entry, and city as a categorical column.
  • A duplicate row is explicitly added to demonstrate duplicate handling.
  • mds.Dataset.from_pandas() converts our Pandas DataFrame into a MetaDS.Dataset object.
  • my_dataset.schema().print_schema() allows us to inspect the inferred data types and structure, while my_dataset.head() shows the first few rows. You’ll notice age has null values, and income might be detected as string or object due to the mixed types.

2. Data Cleaning with MetaDS

Let’s tackle those imperfections one by one.

Step 2.1: Correcting Data Types

First, we’ll convert the income column to a numeric type. MetaDS provides a cast transformation.

# Create a new dataset after casting
transformed_dataset = my_dataset.transform(
    mds.transforms.Cast('income', mds.types.Float32),
    name="customer_profiles_type_corrected"
)

print("\n--- After Type Correction (Schema) ---")
transformed_dataset.schema().print_schema()
print("\n--- After Type Correction (Head) ---")
print(transformed_dataset.head())

Explanation:

  • my_dataset.transform() is the core method for applying transformations. It takes one or more mds.transforms objects.
  • mds.transforms.Cast('income', mds.types.Float32) specifies that the ‘income’ column should be converted to a 32-bit floating-point number. MetaDS is smart enough to handle the string ‘75000’ during this conversion.
  • We assign the result to transformed_dataset. MetaDS transformations are immutable, meaning they return a new dataset object rather than modifying the original in place. This is a best practice for reproducibility.

Step 2.2: Handling Missing Values (Imputation)

Next, let’s fill the missing age values. For numerical columns like age, the mean or median is a common imputation strategy. Let’s use the mean.

# Impute missing 'age' values with the column's mean
transformed_dataset = transformed_dataset.transform(
    mds.transforms.Impute('age', strategy='mean'),
    name="customer_profiles_imputed_age"
)

print("\n--- After Age Imputation (Head) ---")
print(transformed_dataset.head())
# Verify no more NaNs in 'age'
print("Missing 'age' values after imputation:", transformed_dataset.select('age').to_pandas().isnull().sum().iloc[0])

Explanation:

  • mds.transforms.Impute('age', strategy='mean') fills NaN values in the age column with the calculated mean of that column from the current dataset.
  • We then print the head to see the updated age column and explicitly check for NaNs using to_pandas().isnull().sum().

Step 2.3: Removing Duplicate Rows

Our synthetic dataset has one duplicate row. Let’s remove it.

# Remove duplicate rows based on all columns
transformed_dataset = transformed_dataset.transform(
    mds.transforms.DropDuplicates(),
    name="customer_profiles_no_duplicates"
)

print("\n--- After Dropping Duplicates (Head) ---")
print(transformed_dataset.head())
print("Dataset size after dropping duplicates:", transformed_dataset.count())

Explanation:

  • mds.transforms.DropDuplicates() removes any rows that are exact duplicates across all columns. You can specify a subset of columns if you only want to consider uniqueness based on those.
  • We print the head again and also transformed_dataset.count() to see the reduced number of rows (from 11 to 10).

3. Feature Engineering with MetaDS

Our data is now clean! Let’s enhance it with some new features.

Step 3.1: One-Hot Encoding Categorical Features

The city column is categorical. We need to convert it into a numerical format for most ML models. One-hot encoding is a great choice here.

# One-hot encode the 'city' column
transformed_dataset = transformed_dataset.transform(
    mds.transforms.OneHotEncode('city'),
    name="customer_profiles_encoded_city"
)

print("\n--- After One-Hot Encoding 'city' (Head) ---")
print(transformed_dataset.head())
print("\n--- After One-Hot Encoding 'city' (Schema) ---")
transformed_dataset.schema().print_schema()

Explanation:

  • mds.transforms.OneHotEncode('city') creates new binary columns for each unique city found in the ‘city’ column (e.g., city_New York, city_Los Angeles). The original city column is typically dropped by default after encoding.
  • Notice the new city_... columns in the head() output and the updated schema.

Step 3.2: Scaling Numerical Features

The income column has a much larger range than age. Let’s standardize it.

# Standard scale the 'income' column
# MetaDS often uses a fit_transform paradigm to prevent data leakage.
# We'll simulate this by creating a scaler and then applying it.
income_scaler = mds.preprocessing.StandardScaler(column='income')
transformed_dataset = transformed_dataset.transform(
    income_scaler.fit_transform_op(), # applies the fitted scaler
    name="customer_profiles_scaled_income"
)

print("\n--- After Standard Scaling 'income' (Head) ---")
print(transformed_dataset.head())

Explanation:

  • mds.preprocessing.StandardScaler(column='income') initializes a scaler specifically for the ‘income’ column.
  • income_scaler.fit_transform_op() is a crucial step. In a real-world scenario, you would fit the scaler on your training data only to learn the mean and standard deviation, and then transform both training and test data using these learned parameters. MetaDS provides this through a unified fit_transform_op that captures the fitting logic within the transformation.
  • The income column now contains standardized values (centered around 0 with a unit standard deviation).

Step 3.3: Creating a New Feature

Let’s create a new feature: enrollment_year from the enrollment_date column.

# First, ensure enrollment_date is a datetime type (if not already)
transformed_dataset = transformed_dataset.transform(
    mds.transforms.Cast('enrollment_date', mds.types.DateTime),
    name="customer_profiles_datetime_enrollment"
)

# Create a new feature 'enrollment_year'
transformed_dataset = transformed_dataset.transform(
    mds.transforms.ApplyFunction(
        column='enrollment_date',
        new_column='enrollment_year',
        func=lambda x: x.year if pd.notna(x) else None, # Using pandas for datetime ops
        output_type=mds.types.Int32
    ),
    name="customer_profiles_with_year"
)

print("\n--- After Adding 'enrollment_year' (Head) ---")
print(transformed_dataset.head())
print("\n--- Final Transformed Dataset Schema ---")
transformed_dataset.schema().print_schema()

Explanation:

  • We first ensure enrollment_date is a DateTime type using mds.transforms.Cast. This is a common prerequisite for date-time operations.
  • mds.transforms.ApplyFunction() is a flexible transformation that lets you apply a custom Python function to a column to generate a new one.
  • The lambda x: x.year if pd.notna(x) else None function extracts the year from the datetime object. We include a check for pd.notna(x) to handle potential None values gracefully, though in our cleaned dataset, there shouldn’t be any.
  • output_type=mds.types.Int32 specifies the data type for the new column.
  • You’ll see the new enrollment_year column in the output.

Congratulations! You’ve successfully performed several critical data cleaning and feature engineering steps using MetaDS. Your dataset is now much more refined and ready for model training.

Mini-Challenge: Enhance Your Dataset Further!

Now it’s your turn to apply what you’ve learned.

Challenge: Using the transformed_dataset from our last step, perform the following:

  1. Impute any remaining missing values in the is_premium column (if any) using the mode strategy. (Hint: Convert is_premium to a numerical type like Int32 or Boolean first if mode requires it, or ensure MetaDS can handle boolean mode directly).
  2. Create a new feature called age_group based on the age column. For simplicity, assign 0 for age <= 30, 1 for age 31-45, and 2 for age > 45.

Hint:

  • For imputation, recall mds.transforms.Impute. You might need to check if is_premium has any np.nan values in the original data or after previous transformations. If it’s pure True/False and has no NaNs, you can skip that part or introduce one manually for practice.
  • For age_group, mds.transforms.ApplyFunction will be your best friend! You’ll need a custom lambda function that checks the age value and returns the corresponding group number. Don’t forget to specify the output_type.

What to observe/learn:

  • How to chain multiple transform calls.
  • The flexibility of ApplyFunction for custom feature creation.
  • How to handle boolean types for imputation if necessary.

# Your turn! Add your code here for the Mini-Challenge.
# Example start:
# transformed_dataset = transformed_dataset.transform(
#     # Your first transformation here
# ).transform(
#     # Your second transformation here
# )

Common Pitfalls & Troubleshooting

Even with powerful libraries like MetaDS, data transformation can have its tricky moments. Here are a few common pitfalls and how to navigate them:

  1. Data Leakage during Scaling/Imputation:

    • Pitfall: Applying fit_transform (or fit and then transform) to your entire dataset before splitting it into training and testing sets. This allows information from the test set to “leak” into the training process, leading to overly optimistic performance estimates.
    • Troubleshooting: Always perform fit operations (like calculating the mean for imputation or min/max for scaling) only on your training data. Then, apply the same fitted transformer to both your training and test sets. MetaDS’s fit_transform_op() for scalers is designed to be used carefully within a pipeline that respects train-test splits. For simpler Impute or OneHotEncode transforms applied to the whole dataset, ensure they don’t learn parameters from the test set that would bias your training.
    • Best Practice: Build a MetaDS pipeline where transformations are fitted on the training split and then applied consistently.
  2. Incorrect Handling of High-Cardinality Categorical Features:

    • Pitfall: Using one-hot encoding for categorical columns with a very large number of unique values (high cardinality). This can lead to a huge number of new columns (a “curse of dimensionality”), making your dataset sparse, increasing memory usage, and potentially degrading model performance.
    • Troubleshooting: For high-cardinality features, consider alternative encoding strategies:
      • Target Encoding: Encode categories based on the target variable’s mean.
      • Frequency Encoding: Encode categories based on their frequency in the dataset.
      • Grouping Rare Categories: Group categories that appear infrequently into an “Other” category.
      • MetaDS provides options within OneHotEncode to limit cardinality or you might use ApplyFunction for custom logic.
  3. Order of Transformations Matters:

    • Pitfall: Applying transformations in an illogical order. For example, trying to scale a numerical column before converting it from a string type, or attempting to impute values after dropping all rows with missing data.
    • Troubleshooting: Always consider the dependencies. A logical flow often looks like:
      1. Handle data types.
      2. Remove duplicates.
      3. Handle missing values.
      4. Handle outliers.
      5. Create new features.
      6. Encode categorical features.
      7. Scale numerical features.
    • MetaDS transformations are applied sequentially in the order you provide them to the transform() method, so defining a clear order is key.
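To make pitfalls 1 and 3 concrete, here is a minimal pandas sketch (library-agnostic, not MetaDS-specific) that applies the cleaning steps in the recommended order and fits every statistic on the training split only:

```python
import pandas as pd

raw = pd.DataFrame({
    "income": ["50000", "60000", None, "80000", "70000", "90000"],
    "city": ["NY", "LA", "NY", "LA", "NY", "LA"],
})

# 1. Fix data types first -- arithmetic on a string column would fail
raw["income"] = pd.to_numeric(raw["income"])
# 2. Remove exact duplicates
raw = raw.drop_duplicates()

# Split BEFORE fitting any statistics, so nothing leaks from test to train
train, test = raw.iloc[:4].copy(), raw.iloc[4:].copy()

# 3. Impute with the TRAIN mean, applied identically to both splits
train_mean = train["income"].mean()
train["income"] = train["income"].fillna(train_mean)
test["income"] = test["income"].fillna(train_mean)

# 6./7. Encode, then scale -- again using train-only statistics
train = pd.get_dummies(train, columns=["city"])
test = pd.get_dummies(test, columns=["city"])
mu, sigma = train["income"].mean(), train["income"].std()
train["income"] = (train["income"] - mu) / sigma
test["income"] = (test["income"] - mu) / sigma

print(train["income"].mean())  # ~0 by construction; the test mean need not be 0
```

One extra caveat this sketch glosses over: encoding each split separately can produce mismatched columns if a category appears in only one split, which is another reason production pipelines fit encoders (like scalers) on the training data and reuse them everywhere.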

Summary

Phew! You’ve just tackled one of the most critical and often time-consuming aspects of machine learning: data transformation.

Here’s a quick recap of what we covered in this chapter:

  • Data Cleaning is essential for ensuring data quality, involving:
    • Handling Missing Values: Imputing with mean/median or dropping rows/columns.
    • Removing Duplicates: Eliminating redundant records.
    • Correcting Data Types: Ensuring columns are in the right format.
    • Handling Outliers: Detecting and managing extreme data points.
  • Feature Engineering is the art of creating new, more informative features, including:
    • Encoding Categorical Features: Converting text categories to numerical representations (e.g., One-Hot Encoding).
    • Scaling Numerical Features: Standardizing or normalizing numerical ranges.
    • Creating New Features: Deriving new insights from existing data using custom functions or combinations.
  • We saw how MetaDS (v1.2.0 as of January 2026) provides a robust and flexible API for these transformations, using mds.transforms.Cast, mds.transforms.Impute, mds.transforms.DropDuplicates, mds.transforms.OneHotEncode, mds.preprocessing.StandardScaler, and mds.transforms.ApplyFunction.
  • We also discussed common pitfalls like data leakage and the importance of transformation order, along with strategies to avoid them.

You now have a solid understanding of how to prepare your raw datasets for machine learning models, turning them from messy ingredients into a gourmet, model-ready meal!

What’s Next? With our data beautifully transformed and ready, it’s time to introduce it to some machine learning algorithms! In the next chapter, we’ll explore how to integrate our MetaDS processed data with popular machine learning frameworks for model training and evaluation. Get ready to build your first models!
