project involving building propensity models to predict customers’ prospective purchases, I encountered feature engineering issues that I had seen numerous times before.
These challenges can be broadly classified into two categories:
In this article, I clearly explain the concepts and implementation of feature stores (Feast) and distributed compute frameworks (Ray) for feature engineering in production machine learning (ML) pipelines.
(1) Example Use Case
(2) Understanding Feast and Ray
(3) Roles of Feast and Ray in Feature Engineering
(4) Code Walkthrough
You can find the accompanying GitHub repo here.
To illustrate the capabilities of Feast and Ray, our example scenario involves building an ML pipeline to train and serve a 30-day customer purchase propensity model.
We will use the UCI Online Retail dataset (CC BY 4.0), which comprises purchase transactions for a UK online retailer between December 2010 and December 2011.

Fig. 1 — Sample rows of UCI Online Retail dataset | Image by author
We shall keep the feature engineering scope simple by limiting it to the following features (based on a 90-day lookback window unless otherwise stated):
Recency, Frequency, Monetary Value (RFM) features:
- recency_days: Days since last purchase
- frequency: Number of distinct orders
- monetary: Total monetary spend
- tenure_days: Days since first-ever purchase (all-time)

Customer behavioral features:
- avg_order_value: Mean spend per order
- avg_basket_size: Mean number of items per order
- n_unique_products: Product diversity
- return_rate: Share of cancelled orders
- avg_days_between_purchases: Mean days between purchases

The features are computed from a 90-day window before each cutoff date, and purchase labels (1 = at least one purchase, 0 = no purchase) are computed from a 30-day window after each cutoff.
Given that the cutoff dates are spaced 30 days apart, this produces nine snapshots from the dataset:
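As a minimal sketch (not the repo's exact code), the RFM features for a single cutoff can be computed with pandas, assuming a cleaned transactions DataFrame with the UCI dataset's CustomerID, InvoiceNo, InvoiceDate, Quantity, and UnitPrice columns:

```python
import pandas as pd

def rfm_features(df: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Compute RFM features over the 90-day window ending at `cutoff`."""
    start = cutoff - pd.Timedelta(days=90)
    window = df[(df["InvoiceDate"] >= start) & (df["InvoiceDate"] < cutoff)].copy()
    window["revenue"] = window["Quantity"] * window["UnitPrice"]
    out = (
        window.groupby("CustomerID")
        .agg(
            recency_days=("InvoiceDate", lambda s: (cutoff - s.max()).days),
            frequency=("InvoiceNo", "nunique"),
            monetary=("revenue", "sum"),
        )
        .reset_index()
    )
    out["event_timestamp"] = cutoff  # tag the snapshot with its cutoff date
    return out
```

The same function is simply re-applied at each of the nine cutoff dates to produce the rolling snapshots.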

Fig. 2 — Rolling window timeline for features generation and prediction labels | Image by author
Firstly, let’s understand what a feature store is.
A feature store is a centralized data repository that manages, stores, and serves machine learning features, acting as a single source of truth for both training and serving.
Feature stores offer key benefits in managing feature pipelines.
Feast (short for Feature Store) is an open-source feature store that delivers feature data at scale during training and inference.
It integrates with multiple database backends and ML frameworks, and can run on or off cloud platforms.

Fig 3. — Feast architecture. Note that data transformation for feature engineering typically sits outside of the Feast framework | Image used under Apache License 2.0
Feast supports both online features (for real-time inference) and offline features (for batch predictions), though our focus is on offline features, as batch prediction is more relevant to our purchase propensity use case.
Ray is an open-source general-purpose distributed computing framework designed to scale ML applications from a single machine to large clusters. It can run on any machine, cluster, cloud provider, or Kubernetes.
Ray offers a range of capabilities, and the one we will use is the core distributed runtime called Ray Core.

Fig. 4 — Overview of the Ray framework | Image used under Apache License 2.0
Ray Core provides low-level primitives for the parallel execution of Python functions as distributed tasks and for managing tasks across available compute resources.
Let’s look at the areas where Feast and Ray help address feature engineering challenges.
For our case, we will set up an offline feature store using Feast. Our RFM and customer behavior features will be registered in the feature store for centralized access.
In Feast terminology, offline features are also termed ‘historical’ features.
With our Feast feature store ready, we shall enable the retrieval of relevant features from it during both model training and inference.
We must first be clear about these three concepts: Entity, Feature, and Feature View.
- Entity: the primary key identifying the subject of the features (e.g., user_id, account_id, etc.)
- Feature: an individual measurable property (e.g., avg_basket_size)
- Feature View: a group of features defined by an entity key (e.g., user_id) being coupled with relevant feature columns.
Fig. 5 — Example illustration of entity, feature, and feature view | Image by author
Event timestamps are an essential component of feature views, as they enable us to generate point-in-time correct feature data for training and inference.
Say we now want to obtain these offline features for training or inference. Here’s how it is done:
The output is a combined dataset containing all the requested features for the specified set of entities and timestamps.
So where does Ray come in here?
The Ray Offline Store is a distributed compute engine that enables faster, more scalable feature retrieval, especially for large datasets. It does so by parallelizing data access and join operations.
The feature engineering function for generating RFM and customer behavior features must be applied to each 90-day window (i.e., nine independent cutoff dates, each requiring the same computation).
Ray Core turns each function call into a remote task, enabling the feature engineering to run in parallel across available cores (or machines in a cluster).
We install the following Python dependencies:
feast[ray]==0.60.0
openpyxl==3.1.5
psycopg2-binary==2.9.11
ray==2.54.0
scikit-learn==1.8.0
xgboost==3.2.0
As we will use PostgreSQL for the feature registry, make sure that Docker is installed and running before running docker compose up -d to start the PostgreSQL container.
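A minimal docker-compose.yml for the PostgreSQL container might look like this (the image tag and credentials are illustrative assumptions, not the repo's exact file):

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: feast
      POSTGRES_PASSWORD: feast
      POSTGRES_DB: feast
    ports:
      - "5432:5432"          # expose PostgreSQL to the Feast registry client
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```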
Besides data ingestion and cleaning, there are two preparation steps to execute.
After defining the code to generate RFM and customer behavior features, let’s parallelize the execution using Ray for each rolling window.
We start by creating a function (compute_features_for_cutoff) to wrap all the relevant feature engineering steps for every cutoff:
The @ray.remote decorator registers the function as a remote task to be run asynchronously in separate workers.
The data preparation and feature engineering pipeline is then run as follows:
Here’s how Ray is involved in the pipeline:
- ray.init() initiates a Ray cluster and enables distributed execution across all local cores by default.
- ray.put(df) stores the cleaned DataFrame in Ray’s shared memory (aka the distributed object store) and returns a reference (ObjectRef) so that all parallel tasks can access the DataFrame without copying it. This improves memory efficiency and task launch performance.
- compute_features_for_cutoff.remote(...) sends our feature computation tasks to Ray’s scheduler, which assigns each task to a worker for parallel execution and returns a reference to each task’s output.
- futures = [...] stores all references returned by each .remote() call. They represent all the in-flight parallel tasks that have been launched.
- ray.get(futures) retrieves all the actual return values from the parallel task executions in one go.
- ray.shutdown() releases the allocated resources by stopping the Ray runtime.

While our features are stored locally in this case, note that offline feature data is typically stored in data warehouses or data lakes (e.g., S3, BigQuery, etc.) in production settings.
So far, we have covered the transformation and storage aspects of feature engineering. Let us move on to the Feast feature registry.
A feature registry is the centralized catalog of feature definitions and metadata that serves as a single source of truth for feature information.
There are two key components in the registry setup: Definitions and Configuration.
We first define the Python objects to represent the features engineered so far. For example, one of the first objects to determine is the Entity (i.e., the primary key that links the feature rows):
Next, we define the data sources in which our feature data are stored:
Note that the timestamp_field is critical as it enables correct point-in-time data views and joins when features are retrieved for training or inference.
After defining entities and data sources, we can define the feature views. Given that we have two sets of features (RFM and customer behavior), we expect to have two feature views:
The schema (field names, dtypes) is important for ensuring that feature data is properly validated and registered.
The feature registry configuration is defined in a YAML file called feature_store.yaml:
The configuration tells Feast what infrastructure to use and where its metadata and feature data live; it generally specifies the project name, provider, registry, offline store, and online store.
In our case, we use PostgreSQL (running in a Docker container) for the feature registry and the Ray offline store for optimized feature retrieval.
We use PostgreSQL instead of local SQLite to simulate production-grade infrastructure for the feature registry setup, where multiple services can access the registry concurrently
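A sketch of what feature_store.yaml could look like under this setup (the project name and connection string are illustrative assumptions, not the repo's exact file):

```yaml
project: purchase_propensity            # assumed project name
provider: local
registry:
  registry_type: sql                    # PostgreSQL-backed feature registry
  path: postgresql+psycopg2://feast:feast@localhost:5432/feast
offline_store:
  type: ray                             # Ray offline store for distributed retrieval
```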
Once definitions and configuration are set up, we run feast apply to register and synchronize the definitions with the registry and provision the required infrastructure.
The command can be found in the Makefile:
# Step 2: Register Feast feature definitions in PostgreSQL registry
apply:
cd feature_store && feast apply
Once our feature store is ready, we proceed with training the ML model.
We start by creating the entity spine for retrieval (i.e., the two columns of customer_id and event_timestamp), which Feast uses to retrieve the correct feature snapshot.
We then execute the retrieval of features for model training at runtime:
- FeatureStore is the Feast object used to define, create, and retrieve features at runtime.
- get_historical_features() is designed for offline feature retrieval (as opposed to get_online_features()), and it expects the entity DataFrame and the list of features to retrieve. The distributed reads and point-in-time joins of feature data take place here.

We end by generating predictions from our trained model.
The feature retrieval code for inference is largely similar to that for training, since we are reaping the benefits of a consistent feature store.
The main difference comes from the different cutoff dates used.
Feature engineering is a vital component of building ML models, but it also introduces data management challenges if not properly handled.
In this article, we clearly demonstrated how to use Feast and Ray to improve the management, reusability, and efficiency of feature engineering.
Understanding and applying these concepts will enable teams to build efficient ML pipelines with scalable feature engineering capabilities.