project involving building propensity models to predict customers’ prospective purchases, I encountered feature engineering issues that I had seen numerous times before.
These challenges can be broadly classified into two categories:
In this article, I clearly explain the concepts and implementation of feature stores (Feast) and distributed compute frameworks (Ray) for feature engineering in production machine learning (ML) pipelines.
(1) Example Use Case
(2) Understanding Feast and Ray
(3) Roles of Feast and Ray in Feature Engineering
(4) Code Walkthrough
You can find the accompanying GitHub repo here.
To illustrate the capabilities of Feast and Ray, our example scenario involves building an ML pipeline to train and serve a 30-day customer purchase propensity model.
We will use the UCI Online Retail dataset (CC BY 4.0), which comprises purchase transactions for a UK online retailer between December 2010 and December 2011.

Fig. 1 — Sample rows of UCI Online Retail dataset | Image by author
We shall keep the feature engineering scope simple by limiting it to the following features (based on a 90-day lookback window unless otherwise stated):
Recency, Frequency, Monetary Value (RFM) features:
- recency_days: Days since last purchase
- frequency: Number of distinct orders
- monetary: Total monetary spend
- tenure_days: Days since first-ever purchase (all-time)

Customer behavioral features:
- avg_order_value: Mean spend per order
- avg_basket_size: Mean number of items per order
- n_unique_products: Product diversity
- return_rate: Share of cancelled orders
- avg_days_between_purchases: Mean days between purchases

The features are computed from a 90-day window before each cutoff date, and purchase labels (1 = at least one purchase, 0 = no purchase) are computed from a 30-day window after each cutoff.
Given that the cutoff dates are spaced 30 days apart, this produces nine snapshots from the dataset:
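As a minimal sketch (not the repo's exact code), the RFM features for a single cutoff can be computed with pandas, assuming a cleaned transactions DataFrame with the UCI dataset's CustomerID, InvoiceNo, InvoiceDate, Quantity, and UnitPrice columns:

```python
import pandas as pd

def rfm_features(df: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Compute RFM features over the 90-day window ending at `cutoff`."""
    start = cutoff - pd.Timedelta(days=90)
    window = df[(df["InvoiceDate"] >= start) & (df["InvoiceDate"] < cutoff)].copy()
    window["revenue"] = window["Quantity"] * window["UnitPrice"]
    out = (
        window.groupby("CustomerID")
        .agg(
            recency_days=("InvoiceDate", lambda s: (cutoff - s.max()).days),
            frequency=("InvoiceNo", "nunique"),
            monetary=("revenue", "sum"),
        )
        .reset_index()
    )
    out["event_timestamp"] = cutoff  # tag the snapshot with its cutoff date
    return out
```

The same function is simply re-applied at each of the nine cutoff dates to produce the rolling snapshots.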

Fig. 2 — Rolling window timeline for features generation and prediction labels | Image by author
Firstly, let’s understand what a feature store is.
A feature store is a centralized data repository that manages, stores, and serves machine learning features, acting as a single source of truth for both training and serving.
Feature stores offer key benefits in managing feature pipelines.
Feast (short for Feature Store) is an open-source feature store that delivers feature data at scale during training and inference.
It integrates with multiple database backends and ML frameworks, and can run on or off cloud platforms.

Fig 3. — Feast architecture. Note that data transformation for feature engineering typically sits outside of the Feast framework | Image used under Apache License 2.0
Feast supports both online features (for real-time inference) and offline features (for batch predictions), though our focus is on offline features, as batch prediction is more relevant to our purchase propensity use case.
Ray is an open-source general-purpose distributed computing framework designed to scale ML applications from a single machine to large clusters. It can run on any machine, cluster, cloud provider, or Kubernetes.
Ray offers a range of capabilities, and the one we will use is the core distributed runtime called Ray Core.

Fig. 4 — Overview of the Ray framework | Image used under Apache License 2.0
Ray Core provides low-level primitives for the parallel execution of Python functions as distributed tasks and for managing tasks across available compute resources.
Let’s look at the areas where Feast and Ray help address feature engineering challenges.
For our case, we will set up an offline feature store using Feast. Our RFM and customer behavior features will be registered in the feature store for centralized access.
In Feast terminology, offline features are also termed ‘historical’ features.
With our Feast feature store ready, we shall enable the retrieval of relevant features from it during both model training and inference.
We must first be clear about these three concepts: Entity, Feature, and Feature View.
- Entity: the primary key identifying the subject of the features (e.g., user_id, account_id, etc.)
- Feature: an individual measurable property (e.g., avg_basket_size)
- Feature View: a group of features defined by an entity key (e.g., user_id) being coupled with relevant feature columns.
Fig. 5 — Example illustration of entity, feature, and feature view | Image by author
Event timestamps are an essential component of feature views, as they enable us to generate point-in-time correct feature data for training and inference.
Say we now want to obtain these offline features for training or inference. Here’s how it is done:
The output is a combined dataset containing all the requested features for the specified set of entities and timestamps.
So where does Ray come in here?
The Ray Offline Store is a distributed compute engine that enables faster, more scalable feature retrieval, especially for large datasets. It does so by parallelizing data access and join operations.
The feature engineering function for generating RFM and customer behavior features must be applied to each 90-day window (i.e., nine independent cutoff dates, each requiring the same computation).
Ray Core turns each function call into a remote task, enabling the feature engineering to run in parallel across available cores (or machines in a cluster).
We install the following Python dependencies:
feast[ray]==0.60.0
openpyxl==3.1.5
psycopg2-binary==2.9.11
ray==2.54.0
scikit-learn==1.8.0
xgboost==3.2.0
As we will use PostgreSQL for the feature registry, make sure that Docker is installed and running before running docker compose up -d to start the PostgreSQL container.
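A minimal docker-compose.yml for the PostgreSQL container might look like this (the image tag and credentials are illustrative assumptions, not the repo's exact file):

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: feast
      POSTGRES_PASSWORD: feast
      POSTGRES_DB: feast
    ports:
      - "5432:5432"          # expose PostgreSQL to the Feast registry client
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```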
Besides data ingestion and cleaning, there are two preparation steps to execute.
After defining the code to generate RFM and customer behavior features, let’s parallelize the execution using Ray for each rolling window.
We start by creating a function (compute_features_for_cutoff) to wrap all the relevant feature engineering steps for every cutoff:
The @ray.remote decorator registers the function as a remote task to be run asynchronously in separate workers.
The data preparation and feature engineering pipeline is then run as follows:
Here’s how Ray is involved in the pipeline:
- ray.init() initiates a Ray cluster and enables distributed execution across all local cores by default.
- ray.put(df) stores the cleaned DataFrame in Ray’s shared memory (aka the distributed object store) and returns a reference (ObjectRef) so that all parallel tasks can access the DataFrame without copying it. This improves memory efficiency and task launch performance.
- compute_features_for_cutoff.remote(...) sends our feature computation tasks to Ray’s scheduler, which assigns each task to a worker for parallel execution and returns a reference to each task’s output.
- futures = [...] stores all references returned by each .remote() call. They represent all the in-flight parallel tasks that have been launched.
- ray.get(futures) retrieves all the actual return values from the parallel task executions in one go.
- ray.shutdown() releases the allocated resources by stopping the Ray runtime.

While our features are stored locally in this case, note that offline feature data is typically stored in data warehouses or data lakes (e.g., S3, BigQuery, etc.) in production settings.
So far, we have covered the transformation and storage aspects of feature engineering. Let us move on to the Feast feature registry.
A feature registry is the centralized catalog of feature definitions and metadata that serves as a single source of truth for feature information.
There are two key components in the registry setup: Definitions and Configuration.
We first define the Python objects to represent the features engineered so far. For example, one of the first objects to determine is the Entity (i.e., the primary key that links the feature rows):
Next, we define the data sources in which our feature data are stored:
Note that the timestamp_field is critical as it enables correct point-in-time data views and joins when features are retrieved for training or inference.
After defining entities and data sources, we can define the feature views. Given that we have two sets of features (RFM and customer behavior), we expect to have two feature views:
The schema (field names, dtypes) is important for ensuring that feature data is properly validated and registered.
The feature registry configuration is defined in a YAML file called feature_store.yaml:
The configuration tells Feast what infrastructure to use and where its metadata and feature data live; it generally specifies the project name, provider, registry, offline store, and online store.
In our case, we use PostgreSQL (running in a Docker container) for the feature registry and the Ray offline store for optimized feature retrieval.
We use PostgreSQL instead of local SQLite to simulate production-grade infrastructure for the feature registry setup, where multiple services can access the registry concurrently
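A sketch of what feature_store.yaml could look like under this setup (the project name and connection string are illustrative assumptions, not the repo's exact file):

```yaml
project: purchase_propensity            # assumed project name
provider: local
registry:
  registry_type: sql                    # PostgreSQL-backed feature registry
  path: postgresql+psycopg2://feast:feast@localhost:5432/feast
offline_store:
  type: ray                             # Ray offline store for distributed retrieval
```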
Once definitions and configuration are set up, we run feast apply to register and synchronize the definitions with the registry and provision the required infrastructure.
The command can be found in the Makefile:
# Step 2: Register Feast feature definitions in PostgreSQL registry
apply:
cd feature_store && feast apply
Once our feature store is ready, we proceed with training the ML model.
We start by creating the entity spine for retrieval (i.e., the two columns of customer_id and event_timestamp), which Feast uses to retrieve the correct feature snapshot.
We then execute the retrieval of features for model training at runtime:
- FeatureStore is the Feast object used to define, create, and retrieve features at runtime.
- get_historical_features() is designed for offline feature retrieval (as opposed to get_online_features()), and it expects the entity DataFrame and the list of features to retrieve. The distributed reads and point-in-time joins of feature data take place here.

We end by generating predictions from our trained model.
The feature retrieval code for inference is largely similar to that for training, since we are reaping the benefits of a consistent feature store.
The main difference comes from the different cutoff dates used.
Feature engineering is a vital component of building ML models, but it also introduces data management challenges if not properly handled.
In this article, we clearly demonstrated how to use Feast and Ray to improve the management, reusability, and efficiency of feature engineering.
Understanding and applying these concepts will enable teams to build efficient ML pipelines with scalable feature engineering capabilities.