The Art & Science of Feature Engineering for Advanced Analytics

A Brief Introduction to Feature Engineering for Machine Learning, AI and Predictive Analytics

What is Feature Engineering?

In predictive analytics, a “feature” is a predictor: a variable whose values affect the outcome variable. For example, say you want to forecast the value of a used car. The value is the outcome variable, and the features are anything that affects it, such as model, vehicle type and age.

Feature engineering is the general term for creating and manipulating predictors so that a good predictive model can be created. In other words, it refers to the process of using domain knowledge and data mining techniques to create and extract new features from a given dataset. 

Feature engineering typically involves organizing available data (which may come from multiple sources or be messy) into an orderly matrix of rows (the records to be predicted) and columns (the predictor variables, or features). This gives machine learning (ML) algorithms a consistent input in which to detect patterns and trends.
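As a minimal sketch of that organizing step, the snippet below joins two sources on a shared key to produce one row per car and one column per feature. The listings table, the service records, the `car_id` key and the `service_visits` column are all invented for illustration. It uses only the Python standard library; in practice a dataframe library would usually handle this.

```python
# Hypothetical source 1: listing data keyed by car_id.
listings = {
    "c1": {"model": "Civic", "vehicle_type": "sedan", "age_years": 5},
    "c2": {"model": "F-150", "vehicle_type": "truck", "age_years": 8},
}

# Hypothetical source 2: service records, possibly several rows per car.
service_records = [
    {"car_id": "c1", "cost": 120.0},
    {"car_id": "c1", "cost": 80.0},
    {"car_id": "c2", "cost": 300.0},
]

FEATURE_COLUMNS = ["model", "vehicle_type", "age_years", "service_visits"]


def build_feature_matrix(listings, service_records):
    """Return (row_ids, matrix): one row per car, one column per feature."""
    # Count service visits per car so the second source becomes one column.
    visits = {}
    for record in service_records:
        visits[record["car_id"]] = visits.get(record["car_id"], 0) + 1

    row_ids = sorted(listings)
    matrix = []
    for car_id in row_ids:
        row = dict(listings[car_id], service_visits=visits.get(car_id, 0))
        matrix.append([row[col] for col in FEATURE_COLUMNS])
    return row_ids, matrix


row_ids, matrix = build_feature_matrix(listings, service_records)
```

Each entry of `matrix` is one record, and each position within it corresponds to one name in `FEATURE_COLUMNS`.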

Feature engineering is an important first step in preparing data for modeling and analysis.



Feature Engineering Tasks 

 

Feature selection  

Refers to the decision about which predictor variables should be included in a model. This helps reduce the dimensionality of the training problem.
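One simple way to make that decision is a filter: score each candidate feature against the outcome variable and keep only the ones that score well. The sketch below uses the absolute Pearson correlation as the score. The data, the column names and the 0.5 threshold are assumptions for illustration, and this is only one of many selection approaches.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.5):
    """Keep features whose |correlation| with the target meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical used-car data: price is the outcome variable.
columns = {
    "age_years": [1, 3, 5, 8, 10],
    "mileage_k": [10, 35, 60, 90, 120],
    "listing_day_of_month": [3, 27, 15, 27, 3],
}
price = [30000, 25000, 20000, 14000, 10000]

selected = select_features(columns, price)
```

Here the day of the month on which the car was listed has almost no linear relationship with price, so it is dropped and the model trains on two columns instead of three.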

 

Feature generation

Refers to the process of creating new features from one or multiple existing features. 
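For instance, a used-car record might store a model year and an odometer reading, while the quantities that matter for price are the car's age and how heavily it has been driven. A minimal sketch, with field names that are hypothetical:

```python
def generate_features(car, current_year):
    """Return a copy of `car` with two derived features added."""
    # Derived from one existing feature.
    age_years = current_year - car["model_year"]
    # Derived from two features; max() avoids dividing by zero for new cars.
    km_per_year = car["odometer_km"] / max(age_years, 1)
    return {**car, "age_years": age_years, "km_per_year": km_per_year}


car = {"model_year": 2018, "odometer_km": 90000}
enriched = generate_features(car, current_year=2024)
```

The original fields are kept alongside the new ones, so a later feature selection step can decide which to use.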

 

Feature transformation

The real meat of feature engineering, feature transformation involves mapping a feature's values to a new set of values so that the data is represented in a way that is more suitable, or easier to process, for downstream analysis. In other words, it is the process of modifying data into a form that ML algorithms can accept.

Common feature transformation strategies include:

  • Data scaling (Standardization and Normalization)
  • Data imputation, binning and encoding
  • Outlier detection and elimination
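The strategies listed above can each be sketched in a few lines. The versions below are deliberately simple and rest on assumptions that are not from the article: median imputation, population standard deviation for standardization, fixed bin edges, and a 1.5 × IQR rule for outliers. Data science libraries offer more robust equivalents.

```python
import statistics


def impute_median(values):
    """Imputation: replace None with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def standardize(values):
    """Standardization: zero mean, unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Normalization: rescale linearly onto the [0, 1] interval."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def bin_value(value, edges):
    """Binning: index of the first bin whose upper edge exceeds value."""
    for index, edge in enumerate(edges):
        if value < edge:
            return index
    return len(edges)


def one_hot(values):
    """Encoding: 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def drop_outliers_iqr(values, k=1.5):
    """Outlier elimination: keep values within k * IQR of the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = k * (q3 - q1)
    return [v for v in values if q1 - spread <= v <= q3 + spread]


# Hypothetical inputs.
ages = impute_median([2, 4, None, 6, 8])
ages_standardized = standardize(ages)
ages_normalized = min_max_normalize(ages)
age_bins = [bin_value(a, edges=[3, 7]) for a in ages]
categories, encoded = one_hot(["sedan", "truck", "sedan"])
prices_k = drop_outliers_iqr([10, 11, 12, 12, 13, 13, 14, 95])
```

Order matters in a real pipeline: imputing before scaling, as here, keeps the missing value from breaking the mean and standard deviation calculations.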

 

Feature analysis and evaluation

Refers to the process of evaluating the effectiveness of different feature selection algorithms for a specific dataset.
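One hedged way to run such an evaluation is to take the feature subset each selection method proposes, score a simple model on each subset with held-out data, and compare the errors. In the sketch below the two candidate subsets stand in for the outputs of two hypothetical selection methods. The 1-nearest-neighbor predictor and leave-one-out scoring are choices made for brevity, not a method prescribed by the article.

```python
def loo_error(rows, target, feature_idx):
    """Mean absolute leave-one-out error of a 1-nearest-neighbor predictor
    that only looks at the columns listed in feature_idx."""
    total = 0.0
    for i, row in enumerate(rows):
        best_dist, prediction = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # hold out the record being predicted
            dist = sum((row[k] - other[k]) ** 2 for k in feature_idx)
            if best_dist is None or dist < best_dist:
                best_dist, prediction = dist, target[j]
        total += abs(prediction - target[i])
    return total / len(rows)


# Hypothetical data: column 0 is age in years, column 1 is unrelated noise.
rows = [[1, 7], [2, 1], [3, 9], [4, 2], [5, 8], [6, 3]]
price_k = [30, 28, 26, 24, 22, 20]

# Feature subsets proposed by two hypothetical selection methods.
candidates = {"age_only": [0], "noise_only": [1]}

errors = {
    name: loo_error(rows, price_k, idx) for name, idx in candidates.items()
}
best = min(errors, key=errors.get)
```

The subset with the lowest held-out error is the one whose selection method worked best for this particular dataset. A different dataset could rank the methods differently.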

 

Feature Engineering Automation

Though an essential component of ML and statistical modeling, feature engineering can be incredibly monotonous, demanding painstaking precision, creativity and time. Automated feature engineering, by contrast, can generate hundreds or thousands of candidate features from a dataset in a fraction of the time a human would need.

Automated feature engineering works by automatically extracting useful and meaningful features from a set of related data tables, using a framework that can be applied to many kinds of problems. Not only is this process significantly more efficient than manual methods, it is also repeatable, explainable and more secure.
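The core idea of extracting features from related tables can be illustrated with a toy sketch: apply a fixed set of aggregation functions to every numeric column of a child table and roll the results up to the parent. This is an illustration of the concept only. It is not the API or implementation of any particular tool, and the table layout, column names and set of aggregations are assumptions.

```python
import statistics

# Aggregation primitives applied to every numeric child column.
PRIMITIVES = {
    "sum": sum,
    "mean": statistics.fmean,
    "max": max,
    "count": len,
}


def synthesize_features(parent_ids, child_rows, key, numeric_columns):
    """Aggregate child rows up to their parent, producing one feature per
    (primitive, column) pair."""
    features = {}
    for parent_id in parent_ids:
        children = [row for row in child_rows if row[key] == parent_id]
        row_features = {}
        for column in numeric_columns:
            values = [child[column] for child in children]
            for name, func in PRIMITIVES.items():
                # Parents with no child rows get None for every feature.
                row_features[f"{name}({column})"] = (
                    func(values) if values else None
                )
        features[parent_id] = row_features
    return features


# Hypothetical parent table (cars) and child table (service records).
car_ids = ["c1", "c2", "c3"]
service_records = [
    {"car_id": "c1", "cost": 120.0},
    {"car_id": "c1", "cost": 80.0},
    {"car_id": "c2", "cost": 300.0},
]

features = synthesize_features(car_ids, service_records, "car_id", ["cost"])
```

Adding a primitive to `PRIMITIVES` or a column to `numeric_columns` multiplies the number of generated features, which is how candidate counts can grow into the hundreds without any extra manual work.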

Some examples of feature engineering automation solutions include:

  • H2O Driverless AI - A tool that employs a library of algorithms and feature transformations to automatically engineer new, high-value features for a given dataset.
  • Featuretools - An open source Python framework for automated feature engineering. Using a method known as “deep feature synthesis,” it transforms temporal and relational datasets into feature matrices for machine learning.
  • DataRobot Automated Feature Discovery - Accelerates feature engineering by generating hundreds of new features from the relationships between a primary dataset and multiple secondary datasets. For more advanced users, Feature Discovery and relationship definitions are available via the DataRobot API to support further automation and complex workflows.
  • Dataiku EventsAggregator - A plugin that generates aggregated features on a SQL dataset that contains events (that is, rows with a date column and some additional features). The generated features can then be used to train machine learning algorithms.
  • AutoFeat - A Python library that provides automated feature engineering and feature selection, along with models such as AutoFeatRegressor and AutoFeatClassifier. It is not meant for relational data; it was built with scientific use cases in mind.
