How to Prepare Data Before Training ML Models Step by Step

Jake McCluskey

Data preparation is commonly estimated to take roughly 70% of the time spent on successful machine learning projects, yet it's the phase most beginners rush through or skip entirely. Before you train any ML model, you need to clean your data, engineer features that reveal patterns, and scale numerical values so your algorithm can learn efficiently. Get these steps right and even simple models can outperform complex ones trained on messy data. Skip them? You'll waste weeks chasing accuracy gains that were never possible with your raw input.

What Is Data Preparation for Machine Learning

Data preparation is the systematic process of transforming raw data into a format that machine learning algorithms can actually use. It includes identifying and fixing errors, creating new features from existing ones, and standardizing values so different variables play nicely together during training.

Think of it like meal prep. You don't throw raw ingredients directly into a pan and expect a great dish. You wash vegetables, cut them to consistent sizes, and measure seasonings. ML models need the same treatment with their input data.

The process typically happens in three distinct phases: data cleaning (fixing what's broken), feature engineering (creating what's useful), and scaling (making everything consistent). Each phase builds on the previous one, and honestly, rushing through any of them creates problems that compound during training.

Why Data Quality Matters More Than Model Accuracy

Here's an uncomfortable truth: a linear regression model trained on well-prepared data will beat a neural network trained on garbage. Data quality determines your model's performance ceiling, while algorithm choice only helps you approach that ceiling.

Research from Google's machine learning team found that data quality issues cause approximately 60% of model failures in production environments. Missing values, inconsistent formats, and duplicate records create noise that drowns out the actual patterns you're trying to learn. Your model can't distinguish between signal and garbage, so it learns both equally.

Poor data preparation also leads to data leakage, where information from your test set accidentally influences training. This makes your model look great during development but fail spectacularly on real-world data. I've seen teams celebrate 95% accuracy only to discover their model was useless because of a single preprocessing mistake.

Data Cleaning Steps for Machine Learning Projects

Data cleaning fixes the fundamental problems that prevent models from learning anything useful. Start here before you think about fancy algorithms or feature engineering tricks.

Handle Missing Values Strategically

Missing data appears in almost every real-world dataset. You have three main options: delete rows with missing values, impute (fill in) the missing values, or create an indicator variable that flags missingness as its own signal.

Deletion works when fewer than about 5% of your rows have missing values and the gaps occur at random. Beyond that threshold, you're throwing away too much information. Use pandas to identify missing values quickly:

import pandas as pd

df = pd.read_csv('your_data.csv')
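missing_summary = df.isnull().sum()
missing_percent = df.isnull().mean() * 100

# Show only the columns that actually have gaps, as counts and percentages
print(missing_summary[missing_summary > 0])
print(missing_percent[missing_percent > 0].round(1))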

For numerical columns, impute with the median rather than the mean. Medians resist outlier influence. For categorical columns, create a new "Unknown" category rather than guessing. Scikit-learn's SimpleImputer handles both cases:

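from sklearn.impute import SimpleImputer

# Pick columns by dtype (adjust these selections to match your data)
numerical_cols = df.select_dtypes(include='number').columns
categorical_cols = df.select_dtypes(include='object').columns

# For numerical columns: fill gaps with the column median
num_imputer = SimpleImputer(strategy='median')
df[numerical_cols] = num_imputer.fit_transform(df[numerical_cols])

# For categorical columns: fill gaps with an explicit "Unknown" category
cat_imputer = SimpleImputer(strategy='constant', fill_value='Unknown')
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])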
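
The third option mentioned earlier, flagging missingness as its own signal, can be sketched with SimpleImputer's add_indicator parameter. This is a minimal illustration on made-up data; the toy DataFrame and the "_was_missing" column suffix are assumptions for the example, not part of any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50_000.0, 62_000.0, np.nan, 48_000.0],
})

# add_indicator=True appends one 0/1 column for each feature
# that contained missing values when the imputer was fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(toy)

# indicator_.features_ holds the positions of the flagged input columns
flagged = toy.columns[imputer.indicator_.features_]
columns = list(toy.columns) + [f"{name}_was_missing" for name in flagged]

toy_filled = pd.DataFrame(filled, columns=columns)
print(toy_filled)
```

The model then sees both the imputed value and a flag saying the value was filled in, which helps when the fact that a value is missing carries information of its own.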

Remove Duplicates and Fix Inconsistencies

Duplicate rows artificially inflate certain patterns, making your model overweight specific examples. Check for exact duplicates first, then look for semantic duplicates where values differ slightly but represent the same thing.

# Remove exact duplicates
df = df.drop_duplicates()

# Find near-duplicates in text columns
df['email_clean'] = df['email'].str.lower().str.strip()
df = df.drop_duplicates(subset=['email_clean'])

Inconsistent formats plague categorical data. "New York", "new york", "NY", and "N.Y." should all map to the same category. Standardize these before training:

# Standardize text formatting
df['state'] = df['state'].str.upper().str.strip()

# Map variations to standard values
state_mapping = {
    'NY': 'NEW_YORK',
    'N.Y.': 'NEW_YORK',
    'NEW YORK': 'NEW_YORK'
}
df['state'] = df['state'].replace(state_mapping)

Identify and Address Outliers

Outliers aren't always errors, but they can dominate your model's learning process. Use the interquartile range (IQR) method to flag potential outliers, then investigate whether they're legitimate extreme values or data entry mistakes.

Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
print(f"Found {len(outliers)} potential outliers")

Don't automatically delete outliers. Cap them at reasonable thresholds or transform them using log scaling instead. Deletion should be your last resort after confirming they're actually errors.
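
Here is a rough sketch of both alternatives, capping and log scaling, on a small made-up price series. The values are hypothetical, and using the IQR bounds as the caps is one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([120.0, 135.0, 150.0, 160.0, 175.0, 5000.0], name="price")

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: cap (winsorize) values at the IQR bounds instead of deleting rows
capped = prices.clip(lower=lower, upper=upper)

# Option 2: compress the long tail with log(1 + x), which also handles zeros
logged = np.log1p(prices)

print(pd.DataFrame({"original": prices, "capped": capped, "log1p": logged}))
```

Capping keeps the row but limits its pull on the model. The log transform keeps the ordering of all values while shrinking the gap between typical and extreme ones.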

Feature Engineering Techniques Before Model Training

Feature engineering creates new variables that make patterns obvious to your model. It's where domain knowledge meets data science, and it often delivers bigger performance gains than any algorithm swap ever could.

Extract Information from Existing Features

Raw features often hide useful information in their structure. Datetime columns contain day of week, month, quarter, time since epoch. Text columns contain length, word counts, specific keyword presence. Pull these signals out explicitly:

import pandas as pd

# Extract datetime components
df['date'] = pd.to_datetime(df['date'])
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Extract text features
df['text_length'] = df['description'].str.len()
df['word_count'] = df['description'].str.split().str.len()
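# na=False treats missing descriptions as "no keyword" so astype(int) doesn't fail on NaN
df['has_keyword'] = df['description'].str.contains('urgent', case=False, na=False).astype(int)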

Create Interaction Features

Sometimes the relationship between two features matters more than either feature alone. Price per square foot matters more for real estate than price and square footage separately. Create these interactions manually:

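# Mathematical interactions (zero denominators become NaN instead of producing inf)
df['price_per_sqft'] = df['price'] / df['square_feet'].replace(0, float('nan'))
df['bedroom_bathroom_ratio'] = df['bedrooms'] / df['bathrooms'].replace(0, float('nan'))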

# Categorical interactions
df['city_property_type'] = df['city'] + '_' + df['property_type']

Encode Categorical Variables

Machine learning algorithms need numbers, not categories. One-hot encoding creates a binary column for each category value, while ordinal encoding assigns integers. Use one-hot encoding for nominal categories (no inherent order) and ordinal encoding with an explicitly stated order for ordinal ones (clear ranking). Avoid scikit-learn's LabelEncoder for features: it is designed for target labels and assigns integers alphabetically, so Small, Medium, Large would become Large=0, Medium=1, Small=2.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# One-hot encoding for nominal categories
df_encoded = pd.get_dummies(df, columns=['color', 'city'], drop_first=True)

# Ordinal encoding with an explicit order: Small=0, Medium=1, Large=2
size_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['size_encoded'] = size_encoder.fit_transform(df[['size']]).ravel()

The drop_first=True parameter prevents multicollinearity by dropping one category column. Your model can infer that dropped category when all others are zero.
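
To see what drop_first actually does, here is a small sketch on a made-up color column. The data is hypothetical. pandas orders the categories alphabetically, so "blue" is the one that gets dropped.

```python
import pandas as pd

# Hypothetical nominal column with three categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy, columns=["color"], dtype=int)

# drop_first=True removes the first category in sorted order ("blue")
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# The "blue" row (index 2) is all zeros in the reduced encoding
print(reduced.iloc[2].tolist())
```

The redundancy mostly matters for linear models. Tree-based models are generally unaffected by the extra column.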

Feature Scaling and Normalization Explained for ML

Feature scaling ensures that variables with different ranges don't dominate your model's learning process. Age ranges from 0 to 100, while income ranges from 0 to 500,000. Without scaling, income's range is 5,000 times wider than age's, so in any distance calculation a $10,000 income gap swamps even a 40-year age gap.

Distance-based algorithms (k-nearest neighbors, support vector machines) and gradient-based ones (neural networks, regularized linear models) need scaling to work well. Tree-based algorithms (random forests, gradient boosting) don't, since they split each feature on thresholds rather than comparing magnitudes across features.
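
A quick numeric sketch makes the problem concrete. The four "people" below are invented for illustration, and the dist helper is just a name chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: columns are [age in years, income in dollars]
people = np.array([
    [25.0, 50_000.0],   # person 0
    [65.0, 51_000.0],   # person 1: 40 years older, almost the same income
    [26.0, 60_000.0],   # person 2: almost the same age, $10,000 more income
    [45.0, 120_000.0],  # person 3
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw features: income differences decide everything
raw_01 = dist(people[0], people[1])
raw_02 = dist(people[0], people[2])

# Standardized features: both columns contribute on a comparable scale
scaled = StandardScaler().fit_transform(people)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(f"raw:    0-1 = {raw_01:.1f}, 0-2 = {raw_02:.1f}")
print(f"scaled: 0-1 = {scaled_01:.2f}, 0-2 = {scaled_02:.2f}")
```

On the raw data, person 1 looks like the nearest neighbor of person 0 despite the 40-year age gap, because the age column barely registers. After scaling, the ranking flips and person 2 is the closer match.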

Standardization vs Normalization

Standardization (also called z-score normalization) transforms features to have mean=0 and standard deviation=1. It preserves outliers and works well when your data follows a normal distribution. Use StandardScaler from scikit-learn:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

Min-max normalization squashes all values into a fixed range, typically 0 to 1. It's sensitive to outliers but works better when you need bounded outputs. Use MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

Apply standardization to most datasets by default. Switch to min-max normalization only when you specifically need bounded ranges or when working with image data that's already bounded.
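
The difference in outlier sensitivity is easy to see on a tiny made-up column with one extreme value. This is a sketch for intuition, not a benchmark.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature: four typical values and one outlier
values = np.array([[10.0], [12.0], [11.0], [13.0], [1000.0]])

standardized = StandardScaler().fit_transform(values)
normalized = MinMaxScaler().fit_transform(values)

print("standardized:", standardized.ravel().round(3))
print("normalized:  ", normalized.ravel().round(4))
```

With min-max scaling, the outlier takes the value 1 and the four typical values are squeezed into a sliver near 0. Standardization is also affected by the outlier, but it leaves the result unbounded rather than forcing everything into a fixed range.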

Avoid Data Leakage During Scaling

Here's a critical mistake that undermines countless ML projects: fitting your scaler on the entire dataset before splitting into train and test sets. This leaks information from your test set into training, so your evaluation metrics look better than the model will actually perform on unseen data.

Always split first, then fit your scaler only on training data. Apply that fitted scaler to both train and test sets:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit scaler only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply fitted scaler to test data
X_test_scaled = scaler.transform(X_test)  # Note: transform, not fit_transform

This same principle applies to imputation, encoding, and any other transformation that learns parameters from data. Fit on training, apply to everything.
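
One way to enforce "fit on training, apply to everything" without tracking each transformer by hand is scikit-learn's Pipeline combined with ColumnTransformer. The sketch below uses synthetic data, and the column names, the target definition, and the choice of LogisticRegression are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, hypothetical data
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "income": rng.normal(60_000, 15_000, size=n),
    "city": rng.choice(["Austin", "Boston", "Denver"], size=n),
})
# Knock out some income values to give the imputer something to do
data.loc[rng.choice(n, size=20, replace=False), "income"] = np.nan
y = (data["age"] > 45).astype(int)  # toy target for illustration

# Numeric columns: median imputation, then standardization
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    # handle_unknown="ignore" encodes unseen categories as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Split first; fit() then learns medians, means, and categories from train only
X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Because every preprocessing step lives inside the pipeline, calling model.predict or model.score on new data reuses the parameters learned from the training set and never refits them.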

Data Preprocessing Checklist for ML Beginners

Follow this checklist in order for every ML project. Skipping steps or changing the sequence creates problems that are hard to debug later.

Before splitting your data:

  • Load data and perform initial exploration with df.info(), df.describe(), and df.head()
  • Check for and remove exact duplicate rows
  • Identify missing values and decide on handling strategy per column
  • Standardize categorical value formats (case, spacing, abbreviations)
  • Investigate outliers but don't remove them yet
  • Create engineered features from existing columns
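  • Encode categorical variables with fixed rules (one-hot or an explicit ordinal mapping); leave any encoder that learns statistics from the data until after the split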

After splitting into train and test sets:

  • Fit imputers on training data only, then apply to both sets
  • Fit scalers on training data only, then apply to both sets
  • Verify no data leakage by checking that the test set never influenced any fitted parameters
  • Confirm train and test sets have the same number of columns and the same data types

This sequence prevents data leakage while ensuring your test set accurately represents how your model will perform on new data. Most beginners get the order wrong and wonder why their deployed models fail.

Common Data Preparation Mistakes That Sabotage ML Projects

Certain mistakes appear in roughly 40% of failed ML projects, according to surveys of data science teams. Knowing these patterns helps you avoid them.

Fitting transformations on the full dataset: We covered this above, but it's worth repeating because it's so common. Any operation that learns from data (scaling, imputation, encoding with target statistics) must fit only on training data.

Deleting rows with any missing value: This throws away too much information. A row missing one column out of 50 still provides 49 useful values. Impute strategically instead of deleting reflexively.

Using mean imputation for skewed distributions: Mean imputation pulls toward outliers in skewed data. Use median imputation instead, which resists extreme values.

One-hot encoding high-cardinality categories: Encoding a zip code column with 10,000 unique values creates 10,000 new columns. This explodes your feature space and slows training. Use target encoding or frequency encoding for high-cardinality categories instead.

Ignoring class imbalance: If 95% of your data is class A and 5% is class B, your model can reach 95% accuracy by predicting class A every time while being completely useless. Address imbalance through stratified sampling, SMOTE oversampling, or class weights, and evaluate with metrics such as precision and recall rather than accuracy alone.

Forgetting to save preprocessing objects: You need the exact same scaler, imputer, and encoder objects for production predictions that you used during training. Save them with joblib or pickle:

import joblib

# Save the fitted preprocessing objects (scaler and num_imputer from the earlier steps)
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(num_imputer, 'imputer.pkl')

# Load for production use
scaler = joblib.load('scaler.pkl')
new_data_scaled = scaler.transform(new_data)
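
Going back to the high-cardinality point in the list above, frequency encoding is simple enough to sketch in plain pandas. The zip codes below are made up, and mapping unseen categories to 0.0 is one possible convention rather than the only correct one.

```python
import pandas as pd

# Hypothetical train and test sets with a high-cardinality column
train = pd.DataFrame({"zip_code": ["10001", "10001", "94105", "10001", "60601", "94105"]})
test = pd.DataFrame({"zip_code": ["10001", "73301"]})  # "73301" never appears in train

# Learn category frequencies from the training data only
freq = train["zip_code"].value_counts(normalize=True)

# Replace each category with its training frequency: one column, not thousands
train["zip_freq"] = train["zip_code"].map(freq)

# Categories unseen during training have no entry in freq, so fill them with 0.0
test["zip_freq"] = test["zip_code"].map(freq).fillna(0.0)

print(train)
print(test)
```

The frequency table is a learned parameter like a scaler's mean, so it follows the same rule: compute it on training data, then apply it to everything else.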

Practical Tools and Python Libraries for Data Preparation

You don't need to build everything from scratch. These libraries handle the heavy lifting for most data preparation tasks.

Pandas: Your foundation for data manipulation. Load CSVs, handle missing values, create new columns, perform basic cleaning. Every data scientist uses pandas daily.

Scikit-learn: Provides StandardScaler, MinMaxScaler, SimpleImputer, OneHotEncoder, and train_test_split. Its consistent API (fit, transform, fit_transform) makes pipelines easy to build. When you're ready to test AI models before deploying to production, scikit-learn's preprocessing tools ensure consistency between development and deployment.

NumPy: Powers pandas and scikit-learn under the hood. Use it directly for mathematical transformations and array operations that pandas doesn't handle elegantly.

Feature-engine: Extends scikit-learn with more sophisticated feature engineering transformers. Handles rare category encoding, outlier capping, datetime feature extraction, all in a pipeline-friendly format.

Great Expectations: Validates data quality with explicit assertions about your data's properties. Write tests that confirm your data matches expectations, then run them automatically as part of your pipeline. This catches data drift before it ruins your model.
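
The idea behind explicit data assertions can be illustrated without any extra library. The sketch below uses plain pandas rather than Great Expectations' own API, and the validate function, column names, and allowed categories are hypothetical.

```python
import pandas as pd

def validate(df: pd.DataFrame, allowed_states: set[str]) -> list[str]:
    """Return a list of human-readable data quality problems (empty if none)."""
    problems = []
    if df["price"].isnull().any():
        problems.append("price has missing values")
    if (df["price"] < 0).any():
        problems.append("price has negative values")
    if df.duplicated().any():
        problems.append("exact duplicate rows found")
    if not df["state"].isin(allowed_states).all():
        problems.append("state contains unexpected categories")
    return problems

# Hypothetical incoming batch with three deliberate problems
batch = pd.DataFrame({
    "price": [100.0, -5.0, None],
    "state": ["NEW_YORK", "ny", "NEW_YORK"],
})

problems = validate(batch, allowed_states={"NEW_YORK", "CALIFORNIA"})
print(problems)
```

Running checks like these on every new batch, and failing loudly when the list is not empty, gives you an early warning before bad data reaches training or production predictions.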
