structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor performance. Modern embedding models are typically based on the BERT architecture, which is essentially the encoder part of a Transformer, and they are trained on huge corpora of unstructured text with the main goal of capturing semantic meaning. That training regime delivers excellent retrieval performance on natural language, but nothing in it prepares the model for structural syntax. As a result, even though embedding JSON may look like a simple and elegant solution, feeding JSON objects to a generic embedding model produces results far from peak performance.
The first step is tokenization, which splits text into tokens, typically sub-word units. Modern embedding models use Byte-Pair Encoding (BPE) or WordPiece tokenization algorithms. These algorithms are optimized for natural language, breaking words into common sub-components. When a tokenizer encounters raw JSON, it struggles with the high frequency of non-alphanumeric characters. For example, "usd": 10, is not viewed as a key-value pair; instead, it is fragmented: the quotes ("), colon (:), and comma (,) become separate tokens surrounding the content tokens usd and 10.
This creates a low signal-to-noise ratio. In natural language, almost all words contribute to the semantic “signal”; in JSON (and other structured formats), a significant share of tokens is “wasted” on structural syntax that carries zero semantic value.
The core power of Transformers lies in the attention mechanism, which allows the model to weigh the importance of tokens relative to each other.
In the sentence The price is 10 US dollars or 9 euros, attention can easily link the value 10 to the concept price, because the model has seen this linguistic pattern millions of times in its pre-training data. On the other hand, in the raw JSON:
"price": {
"usd": 10,
"eur": 9,
}
the model encounters structural syntax it was never primarily optimized to “read”. Without linguistic connectors, the resulting vector fails to capture the true intent of the data, because the relationships between keys and values are obscured by the format itself.
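To make the mechanism concrete, here is a toy scaled dot-product self-attention computation in NumPy. It is a sketch of the mechanism itself, not of any specific model, and the token labels are illustrative:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Four toy token embeddings (d=8), standing in for "price", "is", "10", "dollars"
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))

# Simplified self-attention with identity projections (Q = K = V = X)
scores = X @ X.T / np.sqrt(d)       # pairwise relevance between tokens
weights = softmax(scores, axis=-1)  # each row sums to 1
output = weights @ X                # each token absorbs context from the others

print(weights.round(2))  # strong off-diagonal weights = strong token-to-token links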
The final step in generating a single embedding representation of the document is Mean Pooling. Mathematically, the final embedding E is the centroid of all n token vectors e1, e2, …, en in the document:

E = (e1 + e2 + … + en) / n
Mean Pooling calculation: Converting a sequence of n token embeddings into a single vector representation by averaging their values. Image by author.
This is where the JSON tokens become a mathematical liability. If 25% of the tokens in the document are structural markers (braces, quotes, colons), the final vector is heavily influenced by the “meaning” of punctuation. As a result, the vector is effectively “pulled” away from its true semantic center in the vector space by these noise tokens. When a user submits a natural language query, the distance between the “clean” query vector and “noisy” JSON vector increases, directly hurting the retrieval metrics.
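A toy NumPy illustration of this drift, with random vectors standing in for real token embeddings (the 25% syntax share mirrors the figure above):

import numpy as np

# Mean pooling averages all token vectors, so structural tokens shift the centroid
rng = np.random.default_rng(42)
content_vecs = rng.normal(size=(6, 384))  # vectors for meaningful word tokens
syntax_vecs = rng.normal(size=(2, 384))   # vectors for punctuation/syntax tokens

clean = content_vecs.mean(axis=0)
noisy = np.vstack([content_vecs, syntax_vecs]).mean(axis=0)

cos = np.dot(clean, noisy) / (np.linalg.norm(clean) * np.linalg.norm(noisy))
print(f"cosine(clean, noisy) = {cos:.3f}")  # < 1.0: syntax tokens pull the vector away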
Now that we know about these limitations of JSON, we need to figure out how to work around them. The most common and straightforward approach is to flatten the JSON and convert it into natural language.
Let’s consider the typical product object:
{
    "skuId": "123",
    "description": "This is a test product used for demonstration purposes",
    "quantity": 5,
    "price": {
        "usd": 10,
        "eur": 9
    },
    "availableDiscounts": ["1", "2", "3"],
    "giftCardAvailable": "true",
    "category": "demo product"
    ...
}
This is a simple object with a handful of attributes such as description and price. Let’s tokenize it and see how it looks:

Tokenization of raw JSON. Notice the high density of distinct tokens for syntax (braces, quotes, colons) that contribute to noise rather than meaning. Screenshot by author using OpenAI Tokenizer
Now, let’s convert it into text to make the embedding model’s work easier. To do that, we can define a template and substitute the JSON values into it. For example, this template could be used to describe the product:
Product with SKU {skuId} belongs to the category "{category}"
Description: {description}
It has a quantity of {quantity} available
The price is {price.usd} US dollars or {price.eur} euros
Available discount ids include {availableDiscounts as comma-separated list}
Gift cards are {giftCardAvailable ? "available" : "not available"} for this product
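A minimal Python sketch of this templating, assuming the field names from the example object above (the list and boolean formatting helpers are my own):

def product_to_text(product):
    ids = product["availableDiscounts"]
    # "1, 2, and 3"-style list formatting
    discounts = ", ".join(ids[:-1]) + f", and {ids[-1]}" if len(ids) > 2 else " and ".join(ids)
    gift = "available" if product["giftCardAvailable"] in (True, "true") else "not available"
    return (
        f'Product with SKU {product["skuId"]} belongs to the category "{product["category"]}"\n'
        f'Description: {product["description"]}\n'
        f'It has a quantity of {product["quantity"]} available\n'
        f'The price is {product["price"]["usd"]} US dollars or {product["price"]["eur"]} euros\n'
        f'Available discount ids include {discounts}\n'
        f'Gift cards are {gift} for this product'
    )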
So the final result will look like:
Product with SKU 123 belongs to the category "demo product"
Description: This is a test product used for demonstration purposes
It has a quantity of 5 available
The price is 10 US dollars or 9 euros
Available discount ids include 1, 2, and 3
Gift cards are available for this product
And applying the tokenizer to it:

Tokenization of the flattened text. The resulting sequence is shorter (14% fewer tokens) and composed primarily of semantically meaningful words. Screenshot by author using OpenAI Tokenizer
Not only does it have 14% fewer tokens, it is also a much clearer form that preserves the semantic meaning and the required context.
Note: Complete, reproducible code for this experiment is available in the Google Colab notebook [1].
Now let’s measure retrieval performance for both options. To keep things simple, we will focus on the standard retrieval metrics Recall@k, Precision@k, and MRR, using a generic embedding model (all-MiniLM-L6-v2) and the Amazon ESCI dataset with a random sample of 5,000 queries and 3,809 associated products.
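For reference, per-query implementations of these metrics are short. A sketch, where relevant is the set of relevant product ids for the query and retrieved is the ranked id list returned by the index:

def recall_at_k(relevant, retrieved, k):
    # Fraction of relevant items that appear in the top-k results
    return len(relevant & set(retrieved[:k])) / len(relevant) if relevant else 0.0

def precision_at_k(relevant, retrieved, k):
    # Fraction of the top-k results that are relevant
    return len(relevant & set(retrieved[:k])) / k

def mrr(relevant, retrieved):
    # Reciprocal rank of the first relevant item (0 if none is retrieved)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0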
all-MiniLM-L6-v2 is a popular model: small (22.7M parameters) yet fast and accurate, which makes it a good fit for this experiment.
For the dataset, a prepared version of Amazon ESCI is used, specifically milistu/amazon-esci-data [3], which is available on Hugging Face and contains a collection of Amazon products and search-query data.
The flattening function used for text conversion is:
def flatten_product(product):
    return (
        f"Product {product['product_title']} from brand {product['product_brand']}"
        f" and product id {product['product_id']}"
        f" and description {product['product_description']}"
    )
A sample of the raw JSON data is:
{
    "product_id": "B07NKPWJMG",
    "product_title": "RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+",
    "product_description": "<p> <strong>Specifications</strong><br /> Model Number: Rowood Treasure box LK502<br /> Average build time: 5 hours<br /> Total Pieces: 123<br /> Model weight: 0.69 kg<br /> Box weight: 0.74 KG<br /> Assembled size: 100*124*85 mm<br /> Box size: 320*235*39 mm<br /> Certificates: EN71,-1,-2,-3,ASTMF963<br /> Recommended Age Range: 14+<br /> <br /> <strong>Contents</strong><br /> Plywood sheets<br /> Metal Spring<br /> Illustrated instructions<br /> Accessories<br /> <br /> <strong>MADE FOR ASSEMBLY</strong><br /> -Follow the instructions provided in the booklet and assembly 3d puzzle with some exciting and engaging fun. Fell the pride of self creation getting this exquisite wooden work like a pro.<br /> <strong>GLORIFY YOUR LIVING SPACE</strong><br /> -Revive the enigmatic charm and cheer your parties and get-togethers with an experience that is unique and interesting .<br /> <br />",
    "product_brand": "RoWood",
    "product_color": "Treasure Box"
}
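Running flatten_product on this sample (description truncated here for brevity) produces a single natural-language string:

sample = {
    "product_id": "B07NKPWJMG",
    "product_title": "RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+",
    "product_description": "<p> <strong>Specifications</strong><br /> Model Number: Rowood Treasure box LK502 ...",  # truncated
    "product_brand": "RoWood",
    "product_color": "Treasure Box",
}
print(flatten_product(sample))
# Product RoWood 3D Puzzles for Adults, Wooden Mechanical Gear Kits for Teens Kids Age 14+
# from brand RoWood and product id B07NKPWJMG and description <p> <strong>Specifications</strong> ...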
For the vector search, two FAISS indexes are created: one for the flattened text and one for the JSON-formatted text. Both indexes are flat, meaning they compute exact distances against every stored entry instead of using an Approximate Nearest Neighbour (ANN) index. This ensures the retrieval metrics are not skewed by ANN approximation error.
D = 384  # embedding dimension of all-MiniLM-L6-v2
index_json = faiss.IndexFlatIP(D)     # exact inner-product index for JSON-formatted texts
index_flatten = faiss.IndexFlatIP(D)  # exact inner-product index for flattened texts
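The indexes are then populated with normalized embeddings so that inner product equals cosine similarity. A sketch continuing from the snippet above (json_texts, flattened_texts, and the query string are illustrative placeholders):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Placeholder corpora; in the experiment these hold one string per product
json_texts = ['{"skuId": "123", "category": "demo product"}']
flattened_texts = ['Product with SKU 123 belongs to the category "demo product"']

# normalize_embeddings=True makes inner product equivalent to cosine similarity
emb_json = model.encode(json_texts, normalize_embeddings=True)
emb_flat = model.encode(flattened_texts, normalize_embeddings=True)

index_json.add(np.asarray(emb_json, dtype="float32"))
index_flatten.add(np.asarray(emb_flat, dtype="float32"))

# Retrieve the top-10 nearest products for a query
q = model.encode(["demo product"], normalize_embeddings=True)
scores, ids = index_flatten.search(np.asarray(q, dtype="float32"), 10)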
To reduce the dataset size, a random sample of 5,000 queries was selected, and all corresponding products were embedded and added to both indexes. The collected metrics are as follows:

Comparing the two indexing methods using the all-MiniLM-L6-v2 embedding model on the Amazon ESCI dataset. The flattened approach consistently yields higher scores across all key retrieval metrics (Precision@10, Recall@10, and MRR). Image by author
And the performance change of the flattened version is:

Converting the structured JSON to natural language text resulted in significant gains, including a 19.1% boost in Recall@10 and a 27.2% boost in MRR (Mean Reciprocal Rank), confirming the superior semantic representation of the flattened data. Image by author.
The analysis confirms that embedding raw structured data into a generic vector space is a suboptimal approach: the simple preprocessing step of flattening structured data into natural language consistently delivers a significant improvement in retrieval metrics (boosting Recall@k and Precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is essential to achieving peak performance in a semantic retrieval/RAG system.
[1] Full experiment code: https://colab.research.google.com/drive/1dTgt6xwmA6CeIKE38lf2cZVahaJNbQB1?usp=sharing
[2] Model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
[3] Amazon ESCI dataset. Specific version used: https://huggingface.co/datasets/milistu/amazon-esci-data
The original dataset is available at https://www.amazon.science/code-and-datasets/shopping-queries-dataset-a-large-scale-esci-benchmark-for-improving-product-search
[4] FAISS: https://ai.meta.com/tools/faiss/