
Feature Engineering For Tabular Machine Learning

Feature engineering is a cornerstone of successful machine learning, especially when working with tabular data. In many real-world machine learning projects, raw data is rarely ready for modeling. Instead, analysts and engineers must transform, clean, and enrich the data through a variety of feature engineering techniques. This article provides a guide to feature engineering for tabular machine learning, covering essential topics such as handling missing values, encoding categorical variables, scaling, creating interaction and lag features, selecting relevant features, and implementing these processes efficiently with Python.


What is Feature Engineering?

Feature engineering is the process of transforming raw data into meaningful features that better represent the underlying problem to predictive models. For tabular data, this often involves data cleaning, transformation, and the creation of new variables, all with the goal of improving model performance. Effective feature engineering can significantly elevate the predictive power of machine learning algorithms, making it an essential skill for data scientists and machine learning engineers.


Why Is Feature Engineering Crucial For Tabular Data?

Tabular data—data stored in rows and columns, like spreadsheets or database tables—is ubiquitous in business and research. Typical tabular datasets contain a mix of numeric and categorical variables, missing values, outliers, and other complexities. Machine learning models, especially those like Linear Regression, Logistic Regression, and Neural Networks, are highly sensitive to the quality and representation of features. Good feature engineering can:

  • Improve model accuracy and generalization
  • Reduce overfitting
  • Shorten model training times
  • Help models interpret and learn from data more effectively

Handling Missing Values

Missing data is a common challenge in tabular datasets. Before building models, it is important to handle missing values properly to avoid introducing bias or reducing model performance. There are several strategies for handling missing values:

1. Removing Missing Values

If the amount of missing data is very small, or if missing values are concentrated in non-essential columns, you can simply remove those rows or columns.


import pandas as pd

# Remove rows with missing values
df_cleaned = df.dropna()

# Remove columns with missing values
df_cleaned = df.dropna(axis=1)

2. Imputation

Imputation replaces missing values with estimated ones. Common imputation methods include:

  • Mean/Median Imputation: Replace missing values with the mean or median of the column (best for numeric features).
  • Mode Imputation: Replace missing values with the most frequent value (best for categorical features).
  • Constant Imputation: Set missing values to a constant (e.g., 0 or 'Unknown').
  • Advanced Imputation: Use model-based imputation methods such as KNN, MICE, or scikit-learn's IterativeImputer (a KNN-based sketch follows the code below).

from sklearn.impute import SimpleImputer

# Mean imputation for numeric features
imputer = SimpleImputer(strategy='mean')
df['age'] = imputer.fit_transform(df[['age']])

# Mode imputation for categorical features
imputer = SimpleImputer(strategy='most_frequent')
df['gender'] = imputer.fit_transform(df[['gender']])
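
For the model-based approaches listed above, a minimal sketch using scikit-learn's KNNImputer, assuming 'age' and 'salary' are numeric columns of df:

from sklearn.impute import KNNImputer

# Each missing value is estimated from the 5 most similar rows
knn_imputer = KNNImputer(n_neighbors=5)
df[['age', 'salary']] = knn_imputer.fit_transform(df[['age', 'salary']])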

3. Indicator Variables

Sometimes, the fact that a value is missing is itself informative. You can create a new binary column indicating whether a value was missing:


df['is_age_missing'] = df['age'].isnull().astype(int)

Best Practices

  • Always analyze the pattern of missingness: are values missing at random or systematically? (A quick check is sketched below.)
  • Document your imputation choices for reproducibility and interpretability.
  • Consider using advanced imputation for large or complex datasets.
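
A minimal sketch of such a check, using the same df:

# Fraction of missing values per column, sorted from most to least missing
missing_fraction = df.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Correlations between missingness indicators can reveal systematic gaps
# (values that tend to be missing together across columns)
missing_pattern = df.isnull().astype(int).corr()
print(missing_pattern)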

Encoding Categorical Variables

Many machine learning algorithms require input features to be numerical. Thus, categorical variables (e.g., 'male', 'female', 'yes', 'no') must be encoded numerically. Here are the most common encoding strategies:

1. Label Encoding

Assigns each category a unique integer, making it most appropriate for ordinal variables, where the categories have an inherent order. Note, however, that LabelEncoder assigns integers in sorted (alphabetical) order, which may not match the intended ranking; an order-aware alternative is sketched after the code below.


from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['education_encoded'] = le.fit_transform(df['education'])
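
When the ranking matters, OrdinalEncoder with an explicit category order is often the safer choice. A minimal sketch, with hypothetical education levels:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order from lowest to highest level (hypothetical categories)
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
ordinal_encoder = OrdinalEncoder(categories=education_order)
df['education_encoded'] = ordinal_encoder.fit_transform(df[['education']]).ravel()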

2. One-Hot Encoding

Creates a new binary column for each category. Best for nominal (unordered) variables. This can expand your feature space, so use with caution for high-cardinality features.


df_encoded = pd.get_dummies(df, columns=['color'])

3. Target Encoding (Mean Encoding)

Replaces each category with the mean of the target variable for that category. Good for high-cardinality categorical features, but can lead to overfitting.


import category_encoders as ce

encoder = ce.TargetEncoder(cols=['city'])
df['city_encoded'] = encoder.fit_transform(df['city'], df['target'])

4. Frequency Encoding

Replaces each category with its frequency in the data.


# Map each category to its relative frequency
# (omit normalize=True if raw counts are preferred)
freq = df['zipcode'].value_counts(normalize=True)
df['zipcode_freq'] = df['zipcode'].map(freq)

Best Practices

  • Use one-hot encoding for low-cardinality nominal features.
  • Use label encoding for ordinal features.
  • Experiment with target or frequency encoding for high-cardinality features, but beware of leakage.
  • Always keep track of encoding mappings so the same transformation can be applied to new data (see the sketch below).
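
As a minimal sketch of reusing a fitted encoding, assuming train_df and new_df are hypothetical DataFrames that share a 'color' column:

from sklearn.preprocessing import OneHotEncoder

# Fit on training data only; categories unseen at training time become all-zero rows
# (sparse_output requires scikit-learn >= 1.2; use sparse=False on older versions)
onehot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
train_encoded = onehot.fit_transform(train_df[['color']])
new_encoded = onehot.transform(new_df[['color']])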

Feature Scaling

Feature scaling standardizes the range and distribution of features. Many machine learning algorithms, such as k-nearest neighbors and those using gradient descent, perform better on scaled data. The two most common scaling methods are:

1. Min-Max Scaling

Rescales features to a specific range, usually [0, 1]:

\( x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} \)


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

2. Standardization (Z-score Scaling)

Centers features around the mean with unit variance:

\( x_{scaled} = \frac{x - \mu}{\sigma} \)


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

3. Robust Scaling

Uses the median and interquartile range, making it less sensitive to outliers:

\( x_{scaled} = \frac{x - Q_2}{Q_3 - Q_1} \)


from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])

When to Scale?

  • Always scale features for algorithms sensitive to feature magnitude (e.g., SVM, KNN, Neural Networks).
  • Tree-based models (Random Forest, XGBoost) are generally insensitive to scaling.

Creating Interaction Features

Interaction features combine two or more existing features to capture relationships that may be non-linear or not immediately obvious. Common types of interaction features include:

  • Product/Ratio Features: Multiply or divide one feature by another (e.g., income_per_person = income / household_size).
  • Polynomial Features: Add powers or cross-products of features to capture non-linear effects.

1. Manual Interaction Creation


# Ratio feature (add 1 to the denominator to avoid division by zero)
df['income_per_person'] = df['income'] / (df['household_size'] + 1)

# Polynomial feature
df['age_squared'] = df['age'] ** 2

2. Automatic Interaction Creation

Scikit-learn's PolynomialFeatures can generate interaction and polynomial features automatically.


from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_features = poly.fit_transform(df[['age', 'salary']])

Best Practices

  • Use domain knowledge to guide which interactions make sense.
  • Beware of adding too many interaction features, as they can lead to overfitting and increased computational cost.

Lag Features for Time Series Data

When working with tabular time series data, lag features are crucial for capturing temporal dependencies. Lag features use values from previous time steps as predictors for the current value.

1. Creating Lag Features


# Create lag-1 and lag-2 features for a time series column
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_lag_2'] = df['sales'].shift(2)

2. Rolling Statistics

Rolling features capture trends and seasonality by aggregating values over past windows.


# 7-day rolling mean
df['sales_rolling_mean_7'] = df['sales'].rolling(window=7).mean()

# 30-day rolling std
df['sales_rolling_std_30'] = df['sales'].rolling(window=30).std()
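
Note that these rolling windows include the current row's value; if the feature is used to predict that same row, this can leak the target into the feature. A common remedy, shown as a minimal sketch, is to shift before rolling so the window covers only past observations:

# Shift by one step so the 7-day window ends at the previous observation
df['sales_rolling_mean_7_past'] = df['sales'].shift(1).rolling(window=7).mean()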

Best Practices

  • Carefully manage the order of data to prevent data leakage.
  • Use lag and rolling features to capture seasonality, trends, and autocorrelation in time series.

Feature Selection

Feature selection involves identifying the most relevant features for your model. This helps to reduce dimensionality, improve interpretability, and prevent overfitting. Feature selection methods can be divided into three main categories:

1. Filter Methods

Filter methods assess the relevance of features by their statistical properties, independently of the model.

  • Correlation Matrix: Drop features that are highly correlated with each other (sketched after the code below).
  • Statistical Tests: Use ANOVA, chi-square, or mutual information to rank features.

from sklearn.feature_selection import SelectKBest, chi2

# Select top 5 features based on the chi2 test (chi2 requires non-negative feature values)
selector = SelectKBest(chi2, k=5)
X_new = selector.fit_transform(X, y)
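
The correlation-matrix filter mentioned above can be sketched as follows; the 0.9 threshold is an arbitrary choice and X is assumed to contain only numeric features:

import numpy as np

# Absolute pairwise correlations between features
corr = X.corr().abs()

# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature correlated above 0.9 with an earlier feature
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)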

2. Wrapper Methods

Wrapper methods use a predictive model to score feature subsets by model performance.

  • Recursive Feature Elimination (RFE): Iteratively removes the least important features based on model coefficients or feature importances.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

3. Embedded Methods

Embedded methods perform feature selection as part of the model training process.

  • Lasso (L1 regularization): Shrinks less important feature coefficients to zero (see the sketch after the code below).
  • Tree-based Feature Importance: Decision trees and ensembles provide feature importance scores.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X, y)
importances = model.feature_importances_
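
For the Lasso approach listed above, a minimal sketch using SelectFromModel with an L1-penalized logistic regression (assuming scaled numeric features in X and a classification target y):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# L1 regularization drives the coefficients of less useful features to zero;
# SelectFromModel keeps only the features with non-zero coefficients
lasso_selector = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
)
X_selected = lasso_selector.fit_transform(X, y)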

Best Practices

  • Start with filter methods for quick dimensionality reduction.
  • Use wrapper or embedded methods for fine-tuning the feature set.
  • Validate selected features with cross-validation to avoid overfitting.

Practical Python Workflows for Feature Engineering

A robust machine learning workflow should include reproducible and modular feature engineering steps. Below is a practical workflow using pandas and scikit-learn:

1. Data Loading and Initial Exploration


import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())
print(df.info())

2. Handling Missing Values


from sklearn.impute import SimpleImputer

num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(include='object').columns

# Impute numeric columns with median
num_imputer = SimpleImputer(strategy='median')
df[num_cols] = num_imputer.fit_transform(df[num_cols])

# Impute categorical columns with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

3. Encoding Categorical Variables


df = pd.get_dummies(df, columns=cat_cols)

4. Feature Scaling


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

5. Creating New Features


# Example: interaction feature
df['income_per_person'] = df['income'] / (df['household_size'] + 1)

# Example: lag feature (for time series)
df['sales_lag_1'] = df['sales'].shift(1)

6. Feature Selection


from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop('target', axis=1)
y = df['target']

selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)

7. Model Training


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))

8. Pipeline Automation

For production-ready machine learning workflows, it is best to automate feature engineering steps using scikit-learn's Pipeline and ColumnTransformer. This ensures reproducibility and helps prevent data leakage.


from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Define feature types
num_cols = ['age', 'salary']
cat_cols = ['gender', 'education']

# Numeric pipeline
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine pipelines
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, num_cols),
    ('cat', categorical_pipeline, cat_cols)
])

# Create full pipeline
full_pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Fit the pipeline (X_train/X_test here should be splits of the raw feature
# DataFrame, since the pipeline performs its own imputation, encoding, and scaling)
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
print("Pipeline Accuracy:", accuracy_score(y_test, y_pred))

Feature Engineering: Real-World Examples

Example 1: Predicting House Prices

Suppose you have a dataset with the following columns: ['LotArea', 'YearBuilt', 'Neighborhood', 'GarageCars', 'SalePrice']. Let's apply several feature engineering steps:


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

# Load dataset
df = pd.read_csv('houses.csv')

# Handle missing values
df['GarageCars'] = df['GarageCars'].fillna(0)

# Encode categorical features
df = pd.get_dummies(df, columns=['Neighborhood'])

# Create age of house feature
df['HouseAge'] = 2024 - df['YearBuilt']

# Feature scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['LotArea', 'HouseAge']] = scaler.fit_transform(df[['LotArea', 'HouseAge']])

# Prepare data for model
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# Evaluate
print("Test R^2:", model.score(X_test, y_test))

Example 2: Time Series Forecasting with Lag Features

For a sales dataset with ['date', 'store', 'sales'], create lag features and rolling averages:


df = pd.read_csv('sales.csv')
df = df.sort_values(['store', 'date'])

# Lag features
df['sales_lag_1'] = df.groupby('store')['sales'].shift(1)
df['sales_lag_7'] = df.groupby('store')['sales'].shift(7)

# Rolling mean
df['sales_rolling_7'] = df.groupby('store')['sales'].rolling(window=7).mean().reset_index(level=0, drop=True)

Evaluating the Impact of Feature Engineering

After applying feature engineering, it's crucial to assess its effect on model performance. Typical evaluation strategies include:

  • Hold-out validation: Split data into training and test sets to evaluate generalization.
  • Cross-validation: Use k-fold cross-validation for robust performance estimates (a cross-validation sketch follows the code below).
  • Feature importance analysis: Inspect model-derived feature importances to see which features are most useful.
  • Permutation importance: Randomly shuffle each feature to assess its impact on the model's performance.

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = result.importances_mean
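
For the cross-validation strategy listed above, a minimal sketch with 5-fold cross-validation on the same model and selected features:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation gives a more stable estimate than a single train/test split
scores = cross_val_score(model, X_new, y, cv=5, scoring='accuracy')
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))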

Common Pitfalls in Feature Engineering

  • Data Leakage: Accidentally using future information or test data in feature creation or preprocessing can inflate metrics and harm real-world performance (a leakage-safe preprocessing sketch follows this list).
  • Overfitting: Creating too many features, especially interaction or polynomial features, can cause the model to fit noise rather than signal.
  • Ignoring Domain Knowledge: Neglecting insights from the business or application domain can result in suboptimal feature choices.
  • Improper Handling of Categorical Variables: Using label encoding on nominal variables can introduce unintended ordinal relationships.
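
To make the leakage point concrete, a minimal sketch of fitting preprocessing on the training split only and then applying it unchanged to the test split (assuming numeric features in X and a target y):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply the same transformation
# to the test data; fitting on the full dataset would leak test-set statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)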

Summary Table: Feature Engineering Techniques

| Technique | Purpose | Common Methods |
| --- | --- | --- |
| Missing Value Handling | Deal with incomplete data | Imputation (mean, median, mode), removal, indicator variables |
| Encoding Categorical Variables | Convert categories to numeric | One-hot, label, target, frequency encoding |
| Scaling | Normalize feature ranges | StandardScaler, MinMaxScaler, RobustScaler |
| Interaction Features | Capture feature relationships | Product, ratio, polynomial features |
| Lag Features | Capture temporal patterns | Lag, rolling mean/median/std |
| Feature Selection | Reduce dimensionality | Filter, wrapper, embedded methods |

Conclusion: Best Practices for Tabular Feature Engineering

Feature engineering is both an art and a science. The choices you make can have a dramatic impact on the performance of your machine learning models. Here are some final best practices:

  • Understand your data through thorough exploration and visualization.
  • Handle missing values thoughtfully—don’t just drop data unless necessary.
  • Choose encoding and scaling methods appropriate for your model and data types.
  • Leverage domain expertise to create meaningful new features and interactions.
  • Automate feature engineering with pipelines for reproducibility and to avoid data leakage.
  • Iteratively evaluate and refine features using cross-validation and feature importance analysis.
  • Document your transformations and decisions for future reference.

By mastering feature engineering for tabular machine learning, you will unlock the full predictive potential of your data and ensure your models are robust, interpretable, and ready for deployment.
