RAG and Vector Engine: The Definitive Guide for OpenSearch

RAG, Vector Search, Embeddings: Not Trends Anymore

These aren't emerging technologies anymore. They're foundations. Every enterprise application that truly wants to leverage Generative AI starts here.

The point isn't whether to implement them. It's how to do it without blowing your budget or turning your infrastructure into an operational nightmare.

And if your stack is on AWS, there's another question to ask: how much does it cost to leave the ecosystem for a feature you already have?

Amazon OpenSearch Service doesn't start as a pure vector database. It doesn't need to. It's the orchestrator that merges textual search, vector search, and native integration with Bedrock, SageMaker, and Lambda. Qdrant flies. Pinecone simplifies. But adding an external service comes with a cost—operational, financial, or complexity—that's rarely fully accounted for.

This article isn't theory. It's an architectural deep-dive for those who want production-ready RAG systems. Provisioned vs Serverless. k-Nearest Neighbors index configuration. Chunking strategy. The decisions that separate a prototype from a system that scales.

OpenSearch as a Vector Database: Much More Than Text Search

Let's start with the basics. OpenSearch isn't a pure vector database. It's an open-source fork of Elasticsearch (2021), built for search and analytics. Vector support came later.

So why use it?

Concrete Reasons to Choose OpenSearch

1. Native hybrid search.

Real use cases rarely need just vector similarity search. You need to combine semantic search (vectors), keyword search (BM25), and metadata filters. OpenSearch does it all in a single query. Zero external orchestration.

2. Mature ecosystem.

Already using OpenSearch or Elasticsearch for logging, monitoring, search? Adding vector search means extending what you have, not building something new. Less operational complexity. Fewer hidden costs.

3. Native AWS integration.

Bedrock Knowledge Bases, SageMaker, Lambda, Kinesis. AWS-centric stack? Integration overhead is minimal.

4. Storage tiering.

With UltraWarm and Cold Storage you keep historical vectors at reduced cost. Hot tier only for the most accessed data. Try doing that with Pinecone or Weaviate without architectural gymnastics.
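To make the first point concrete, here is a rough sketch of what "a single query" can look like. It only builds the request bodies as plain Python dictionaries. The field names (text, embedding, department), the weights, and the helper names are assumptions for illustration; the hybrid query type and the normalization-processor search pipeline are the OpenSearch features it relies on.

```python
def build_hybrid_query(query_text, query_vector, department, k=5):
    """Combine BM25, k-NN, and a metadata filter in one search body.

    Field names are hypothetical; adapt them to your own mapping.
    """
    metadata_filter = {"term": {"department": department}}
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    # Lexical leg: BM25 match restricted by the metadata filter.
                    {
                        "bool": {
                            "must": [{"match": {"text": {"query": query_text}}}],
                            "filter": [metadata_filter],
                        }
                    },
                    # Semantic leg: k-NN with the same filter applied
                    # during the vector search.
                    {
                        "knn": {
                            "embedding": {
                                "vector": query_vector,
                                "k": k,
                                "filter": metadata_filter,
                            }
                        }
                    },
                ]
            }
        },
    }


def build_normalization_pipeline(lexical_weight=0.3, vector_weight=0.7):
    """Search pipeline body that normalizes and merges the two score sets.

    The weights follow the order of the sub-queries above and are
    placeholder values, not tuned recommendations.
    """
    return {
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {"technique": "min_max"},
                    "combination": {
                        "technique": "arithmetic_mean",
                        "parameters": {"weights": [lexical_weight, vector_weight]},
                    },
                }
            }
        ]
    }
```

The pipeline body would be registered once under a name of your choice, and the query body sent to the index's _search endpoint with that name in the search_pipeline parameter. Whether the filter inside the knn clause is available depends on the engine configured for the field, so treat that part as something to verify against your cluster version.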

When OpenSearch Isn't the Right Answer

Let's be honest. OpenSearch isn't always the right choice.

Here are scenarios where looking elsewhere makes sense:

You want pure vector search only?

No hybrid search, no integration with an existing stack? Purpose-built databases like Pinecone or Qdrant offer lower latencies, simpler setup, and a developer experience optimized for that specific case. If you're starting from zero with only similarity search as your goal, they're worth considering first.

You want to stay AWS-native but OpenSearch is too much?

Amazon S3 Vectors (GA since late 2025) is the AWS answer for simple cases. Save vectors directly to S3, query with ANN, pay for what you consume. Zero infrastructure.

It's useful when:

  • You have small to medium datasets without needing hybrid search
  • Your RAG is simple: semantic queries on a static or infrequently updated corpus
  • Your team wants to prototype fast, without cluster provisioning
  • Budget is tight: cost per vector is significantly lower than OpenSearch

It's not useful when:

  • You need BM25 or hybrid search — S3 Vectors only supports similarity search
  • You have strict latency requirements or high-throughput workloads
  • You want control over indexing algorithm, shards, or replicas

Multi-cloud portability requirement?

If the architectural constraint is avoiding AWS lock-in at all costs, solutions like Weaviate or Milvus offer more deployment flexibility. But if you're already working in AWS—and most enterprise teams are—this scenario rarely justifies the added complexity.

Team without OpenSearch experience?

Managing shard allocation, replicas, heap memory, and index configuration requires operational skills you can't improvise. If your team is small and nobody has operated an OpenSearch cluster, the time-to-value of a managed solution like Pinecone can be much lower, at least initially.

Provisioned vs Serverless: Which Deployment Model Fits You?

Amazon OpenSearch Service offers two deployment models. The choice impacts costs, performance, and how much you'll manage manually.

Provisioned Domains: Total Control, Total Responsibility

Traditional clusters with EC2 nodes. You choose instance types, storage, shard count, replicas. Maximum flexibility. Maximum responsibility.

Use it when:

  • Traffic is predictable and you need fine-tuning on performance (thread pool, cache, circuit breakers)
  • Volumes are high and Reserved Instances justify savings (30-50% vs on-demand)
  • You have strict latency requirements (<50ms p99)

Instance sizing: don't get it wrong.

For vector-intensive workloads, choose memory-optimized instances (r6g, r7g). k-NN indices consume RAM like there's no tomorrow. An r6g.xlarge.search with 32 GB of RAM handles vector queries better than a c6g.2xlarge with 16 GB, even with fewer vCPUs.
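How much RAM is "enough"? A back-of-the-envelope estimate helps before picking an instance. The sketch below uses the rule of thumb from the OpenSearch k-NN documentation for HNSW graphs, roughly 1.1 * (4 * dimension + 8 * m) bytes per vector. The corpus size and dimension in the example are assumptions, and the function names are illustrative.

```python
def hnsw_memory_bytes(num_vectors, dimension, m=16):
    """Approximate native memory needed by an HNSW graph.

    Rule of thumb: 1.1 * (4 * dimension + 8 * m) bytes per vector.
    4 bytes per float32 component, 8 bytes per graph link, 10% overhead.
    """
    return 1.1 * (4 * dimension + 8 * m) * num_vectors


def to_gib(num_bytes):
    """Convert bytes to GiB."""
    return num_bytes / 2**30


# Assumed workload: 1 million chunks embedded at 1024 dimensions, m=16.
estimate = hnsw_memory_bytes(num_vectors=1_000_000, dimension=1024, m=16)
print(f"Estimated HNSW memory: {to_gib(estimate):.2f} GiB per copy of the data")
```

For this assumed workload the estimate lands a little above 4 GiB per copy of the data. Each replica holds its own copy of the graph, so multiply accordingly, and leave headroom for the JVM heap and the operating system.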

Serverless: Simple, But Not Free

OpenSearch Serverless eliminates infrastructure management. You create collections, index data, AWS scales automatically. You pay for OCU (OpenSearch Compute Units) consumed.

Use it when:

  • Traffic is unpredictable or you're still experimenting
  • Your team is small and lacks advanced OpenSearch expertise
  • You want to go live fast, without tuning

Attention: Serverless ≠ economical.

OCU-based pricing can get expensive for high and constant volumes. A Provisioned cluster with Reserved Instances costs 40-60% less. Do the math before deciding.
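"Doing the math" can start as small as the sketch below. The hourly rates and unit counts are hypothetical placeholders, not real AWS prices: replace them with the current figures for your region and your measured usage. The function names are illustrative.

```python
HOURS_PER_MONTH = 730  # common approximation for an always-on month


def monthly_cost(hourly_rate, units, hours=HOURS_PER_MONTH):
    """Cost of running `units` billable units at `hourly_rate` for `hours`."""
    return hourly_rate * units * hours


def savings_pct(baseline, alternative):
    """Percentage saved by choosing `alternative` over `baseline`."""
    return 100.0 * (baseline - alternative) / baseline


# Hypothetical placeholder inputs, NOT real AWS prices.
serverless = monthly_cost(hourly_rate=0.25, units=4)   # 4 OCUs, always on
provisioned = monthly_cost(hourly_rate=0.10, units=3)  # 3 reserved nodes
print(f"Serverless:  {serverless:.2f} per month")
print(f"Provisioned: {provisioned:.2f} per month")
print(f"Savings:     {savings_pct(serverless, provisioned):.0f}%")
```

The comparison only holds for steady load. If traffic is bursty and Serverless scales down between peaks, plug in the average OCU count rather than the peak.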

Configuring k-Nearest Neighbors (k-NN) Indices: Choices That Matter

The heart of vector search on OpenSearch is the k-NN plugin. Configuring it means making architectural decisions that affect performance, recall, and cost.

A few technical choices need to be settled before moving forward with the setup.

HNSW or IVF? It Depends on Your Workload

OpenSearch supports two ANN (Approximate Nearest Neighbors) algorithms. They're not equivalent.

HNSW (Hierarchical Navigable Small World)

  • Graph-based. Fast queries, slow indexing
  • High recall accuracy: 95-99%
  • Consumes more memory — the graph structure lives in RAM
  • Ideal for query-intensive workloads with low-latency requirements

IVF (Inverted File Index)

  • Clustering-based. Faster indexing
  • Slightly lower recall: 90-95% with tuning
  • Uses less memory
  • Ideal for write-intensive workloads and very large datasets

NOTE: IVF requires a mandatory training step. Before indexing, you must train a model using the Train API with the IVF method definition. Training needs at least nlist data points (more is better). More complexity than HNSW, which doesn't require training.
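The difference in setup effort is easiest to see side by side. The sketch below builds the request bodies for both paths as Python dictionaries: an HNSW index that can be created directly, and the IVF path, which first needs a training request and then an index that references the trained model. Index names, field names, the dimension, and the nlist/nprobes values are assumptions for illustration; the faiss engine is one possible choice.

```python
def build_hnsw_index_body(dimension, m=16, ef_construction=100, space_type="l2"):
    """Index body for an HNSW field: no training step required."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {  # hypothetical field name
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": space_type,
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                }
            }
        },
    }


def build_ivf_train_body(training_index, training_field, dimension,
                         nlist=128, nprobes=8, space_type="l2"):
    """Body for the Train API call that must run before IVF indexing.

    The training index has to contain at least `nlist` vectors.
    """
    return {
        "training_index": training_index,
        "training_field": training_field,
        "dimension": dimension,
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "space_type": space_type,
            "parameters": {"nlist": nlist, "nprobes": nprobes},
        },
    }


def build_ivf_index_body(model_id):
    """Index body for an IVF field: it points at the trained model."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "model_id": model_id}
            }
        },
    }
```

With HNSW, the first body is all you need. With IVF, the order matters: load sample vectors into a training index, send the training body to the k-NN plugin's model training endpoint, wait for the model to finish training, and only then create the index that references it.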

Space Type: Choose the Right Metric

The distance metric (space_type) depends on your embedding model:

cosinesimil: Measures the angle between vectors. Useful if vectors aren't normalized and you only care about orientation, not magnitude.

innerproduct: Dot product. The ideal choice for performance if your vectors are already normalized (like those from OpenAI or Cohere). On unit-length vectors the dot product is mathematically identical to cosine similarity, but cheaper to compute because OpenSearch skips the magnitude calculation at runtime.

l2 (Euclidean Distance): Measures the straight-line distance between points. Use when magnitude (vector length) has specific meaning in your domain—for example, in certain recommendation systems where vector length reflects frequency, intensity, or confidence of data, not just thematic similarity.

Practical rule: Using OpenAI or similar models? Normalize your vectors (or let OpenSearch handle it from version 2.18+) and use innerproduct to push query performance to the max.
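The equivalence behind this rule is easy to check by hand. The sketch below, using only the standard library, normalizes two arbitrary vectors and compares their dot product with the cosine similarity of the originals. The function names and sample vectors are illustrative.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, computing both magnitudes."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 2.0]

# After normalization, the plain dot product gives the same ranking signal
# without computing any magnitudes at query time.
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

Both lines print the same value up to floating-point rounding. The same does not hold for unnormalized vectors, where innerproduct also rewards longer vectors.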

Tuning Parameters: Recall vs Latency

k-NN tuning is a continuous tradeoff. More recall = more latency. Know the current default values before touching anything.

ef_construction (HNSW, index-time)

  • Range: 100-512
  • Higher = better recall, slower indexing
  • Current default: 100 (note: was 512 in versions ≤ 2.11)

ef_search (HNSW, query-time)

  • Range: 50-500
  • Higher = better recall, slower queries
  • Current default: 100

m (HNSW)

  • Range: 8-64
  • Higher = better recall, more memory
  • Current default: 16
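Tuning these parameters only makes sense if you can measure recall. A common approach is to compare the IDs returned by the approximate index with the exact nearest neighbors computed by brute force on a sample of queries. The sketch below shows that measurement with the standard library only; the function names, the tiny dataset, and the choice of Euclidean distance are assumptions for illustration.

```python
import math


def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbors by Euclidean distance.

    `vectors` maps a document ID to its vector. Returns the k closest IDs,
    nearest first. Feasible only on a sample, which is all you need here.
    """
    ranked = sorted(vectors, key=lambda doc_id: math.dist(query, vectors[doc_id]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbors that the approximate search found."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
truth = exact_top_k([0.1, 0.0], vectors, k=2)

# `approx` stands in for the IDs returned by a k-NN query on the cluster.
approx = ["a", "c"]
print(recall_at_k(approx, truth))
```

Run the same sample queries after every change to ef_search, ef_construction, or m, and record recall next to p99 latency. That gives you the actual tradeoff curve for your data instead of a generic range.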

End-to-End RAG Pipeline: From Theory to Production

Infrastructure ready? Let's build the pipeline.

Step 1: Document Ingestion and Chunking

Chunking is the most underestimated variable. Chunks that are too small lose context. Chunks that are too large add noise and cost. There's no universal answer: it depends on your data.

Here are the three fundamental blocks.

1. Chunking with semantic awareness

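from langchain_text_splitters import RecursiveCharacterTextSplitter

# Inside the ingestion class: split on paragraphs first, then lines,
# sentences, and words, falling back to single characters.
self.splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", ". ", " ", ""]
)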
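To see what chunk_size and chunk_overlap actually control, here is a deliberately simplified fixed-window splitter using only the standard library. It is not the recursive, separator-aware algorithm used by RecursiveCharacterTextSplitter; the function name chunk_text and the sizes in the example are illustrative.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=40):
    """Split text into fixed-size windows that share `chunk_overlap` characters.

    Simplified stand-in for a real splitter: it cuts at character offsets
    and ignores sentence or paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Each chunk repeats the last chunk_overlap characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side. The price is more chunks, hence more embeddings and more vectors to store.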

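2. Embedding generation with Amazon Bedrock Titan

import json

# self.bedrock is a boto3 "bedrock-runtime" client.
response = self.bedrock.invoke_model(
    modelId=self.embedding_model,
    body=json.dumps({"inputText": text})
)
# The response body is a stream: read it, then parse the JSON payload.
embedding = json.loads(response['body'].read())['embedding']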
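The call above can be wrapped in a small function that takes the client as a parameter, which also makes it testable without AWS credentials. In this sketch the model ID and the dimensions and normalize fields are assumptions that apply to Titan Text Embeddings V2; other embedding models expect a different request body. The function name embed_text is illustrative.

```python
import json


def embed_text(bedrock_client, text,
               model_id="amazon.titan-embed-text-v2:0",
               dimensions=1024, normalize=True):
    """Return the embedding for `text` as a list of floats.

    `bedrock_client` is expected to behave like
    boto3.client("bedrock-runtime"): it only needs an invoke_model method.
    """
    body = json.dumps({
        "inputText": text,
        "dimensions": dimensions,  # must match the index's knn_vector dimension
        "normalize": normalize,    # unit-length output suits innerproduct
    })
    response = bedrock_client.invoke_model(modelId=model_id, body=body)
    return json.loads(response["body"].read())["embedding"]
```

Whatever value you pass as dimensions has to match the dimension declared in the index mapping, otherwise indexing fails. Requesting normalized output is what makes the innerproduct space type a valid substitute for cosine similarity.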

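3. Bulk indexing on OpenSearch

from opensearchpy import helpers

# With raise_on_error=False, bulk() returns the number of successful
# actions and a list describing the ones that failed.
success, failed = helpers.bulk(
    client,
    actions,
    chunk_size=batch_size,
    raise_on_error=False
)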
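The actions passed to the bulk helper are plain dictionaries, one per chunk. A minimal sketch of how they could be generated follows, assuming each chunk is stored with its text, its vector, and enough metadata to trace it back to the source document. The field names, the ID scheme, and the function name are illustrative, not a required schema.

```python
def build_bulk_actions(index_name, doc_id, chunks, embeddings):
    """Yield one bulk action per (chunk, embedding) pair.

    IDs are derived from the document ID and the chunk position, so
    re-ingesting the same document overwrites its chunks instead of
    creating duplicates.
    """
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")

    for position, (chunk, vector) in enumerate(zip(chunks, embeddings)):
        yield {
            "_index": index_name,
            "_id": f"{doc_id}-{position}",
            "_source": {
                "text": chunk,           # used by BM25
                "embedding": vector,     # used by k-NN
                "doc_id": doc_id,        # metadata for filtering and tracing
                "chunk_index": position,
            },
        }


actions = build_bulk_actions(
    "rag-chunks", "manual-01",
    chunks=["first chunk", "second chunk"],
    embeddings=[[0.1, 0.2], [0.3, 0.4]],
)
for action in actions:
    print(action["_id"])
```

Because the function is a generator, the helper can stream actions in batches without holding the whole corpus in memory. With raise_on_error=False, it is worth logging the returned list of failures instead of discarding it, since a silently dropped chunk is a gap in retrieval later.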

From Prototype to Production: The Choices Are Yours

RAG and Vector Search aren't experiments anymore. They're production.

OpenSearch gives you the tools. But tools aren't enough: you need to know how to choose.

Provisioned or Serverless? HNSW or IVF? Chunking strategy? There's no universal answer. There's the right one for your use case. Hopefully this article gave you the resources to ask the right questions.

Want to talk about implementing all this in your AWS stack? You know where to find us.

Sources and References

This article is based on official documentation, AWS best practices, and real implementations in production environments.

Fabio Gabas
DevOps at beSharp. I love designing ML and GenAI solutions in the Cloud. After some years as a theoretical chemist, I decided to switch to AI, aiming to make computers do the work for me! In my free time I like listening to lesser-known music and playing collectible card games, which means Magic (...are there really other collectible card games?)
