Enhanced Semantic Cache
An improved semantic cache implementation with persistent storage, adjustable thresholds, and flexible embedders.
As Large Language Models (LLMs) become more deeply integrated into real-world applications, from customer support bots to intelligent retrieval pipelines, efficiency, cost control, and response quality have become critical concerns. One of the most effective ways to address these challenges is through semantic caching.
Following the initial release of my basic semantic cache, which served as an educational foundation and prototyping tool, I decided to introduce a more robust and production-aware version: the Enhanced Semantic Cache.
Why Enhance the Cache?
The basic version of the semantic cache was intentionally lightweight and minimalist—ideal for getting started with the concept of using embeddings to short-circuit repeated LLM calls. However, real-world systems demand more:
- Persistent storage across sessions
- Dynamic similarity thresholds
- Multiple embedding providers
- Rich metadata for insight and debugging
- Compatibility with vector databases
This enhanced version directly addresses those needs, offering greater scalability, flexibility, and observability while maintaining an intuitive and modular architecture.
Key Features
Persistent Storage with LanceDB or FAISS
Move beyond memory-only caches. Store embeddings and responses in a fast, queryable format that persists across sessions, ideal for long-lived apps or shared team environments.
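To make the idea concrete, here is a minimal sketch of session-spanning persistence using FAISS. The file names, the 384-dimension assumption, and the row-to-response mapping are illustrative choices, not the project's actual layout:

```python
import json
import os

import faiss
import numpy as np

INDEX_PATH = "cache.index"              # illustrative file names
RESPONSES_PATH = "cache_responses.json"
DIM = 384  # e.g., the output size of a small sentence-embedding model

# Load the previous session's cache if it exists, otherwise start fresh.
if os.path.exists(INDEX_PATH):
    index = faiss.read_index(INDEX_PATH)
    with open(RESPONSES_PATH) as f:
        responses = json.load(f)
else:
    index = faiss.IndexFlatIP(DIM)  # inner product = cosine on normalized vectors
    responses = []

def add_entry(embedding: np.ndarray, response: str) -> None:
    """Append one (embedding, response) pair; row i in the index maps to responses[i]."""
    index.add(embedding.reshape(1, -1).astype(np.float32))
    responses.append(response)

def save() -> None:
    """Flush the cache to disk so the next session can reuse it."""
    faiss.write_index(index, INDEX_PATH)
    with open(RESPONSES_PATH, "w") as f:
        json.dump(responses, f)
```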
Configurable Similarity Thresholds
Set a default threshold or tune it dynamically to control how precise cache matches must be: looser for creative tasks, stricter for factual Q&A.
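A sketch of how a threshold gates lookups, building on a FAISS-style index; the `lookup` helper and the 0.85 default are illustrative:

```python
import faiss
import numpy as np

def lookup(index: faiss.Index, responses: list[str],
           query_vec: np.ndarray, threshold: float = 0.85) -> str | None:
    """Return a cached response only if the best match clears the threshold."""
    scores, ids = index.search(query_vec.reshape(1, -1).astype(np.float32), 1)
    best_score, best_id = float(scores[0][0]), int(ids[0][0])
    if best_id != -1 and best_score >= threshold:
        return responses[best_id]  # cache hit
    return None  # cache miss -> fall through to the LLM

# Looser for creative tasks, stricter for factual Q&A:
#   lookup(index, responses, vec, threshold=0.75)
#   lookup(index, responses, vec, threshold=0.95)
```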
Pluggable Embedder System
Supports FastEmbed out of the box, with a simple interface for plugging in OpenAI, HuggingFace, or even local sentence transformers.
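Here is a sketch of what such an interface can look like. `TextEmbedding` is FastEmbed's real entry point; the `Embedder` protocol and the wrapper class name are illustrative:

```python
from typing import Protocol

import numpy as np
from fastembed import TextEmbedding

class Embedder(Protocol):
    def embed(self, text: str) -> np.ndarray: ...

class FastEmbedEmbedder:
    """Default backend: local ONNX models via FastEmbed, no API key needed."""
    def __init__(self, model_name: str = "BAAI/bge-small-en-v1.5"):
        self._model = TextEmbedding(model_name=model_name)

    def embed(self, text: str) -> np.ndarray:
        # FastEmbed yields one vector per input document.
        return next(iter(self._model.embed([text])))

# Swapping in another provider is just another class with the same .embed():
# an OpenAI- or HuggingFace-backed embedder only needs to return a vector.
```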
Metadata Tracking
Each cache entry stores creation time, usage count, similarity score, and more—paving the way for intelligent cache management and analysis.
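Illustratively, an entry's metadata might be shaped like this; the field names are assumptions based on the list above, not the project's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CacheEntryMeta:
    """Illustrative metadata attached to each cache entry."""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    hit_count: int = 0                    # how often this entry answered a query
    last_similarity: float | None = None  # score of the most recent match
    model_name: str = ""                  # which LLM produced the cached response

    def record_hit(self, similarity: float) -> None:
        self.hit_count += 1
        self.last_similarity = similarity
```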
Multiple Storage Backends
Run entirely in memory for development, or choose a persistent backend (e.g., LanceDB, FAISS). Ideal for flexibility between local prototyping and production deployment.
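A sketch of the backend split, with illustrative class names; only the LanceDB calls (`connect`, `table_names`, `create_table`, `open_table`, `add`) are the library's real API:

```python
import lancedb

class MemoryBackend:
    """Dev-friendly: entries live in a Python list and vanish on exit."""
    def __init__(self):
        self.rows = []

    def add(self, row: dict) -> None:
        self.rows.append(row)

class LanceDBBackend:
    """Production-friendly: rows survive restarts and are vector-queryable."""
    def __init__(self, uri: str = "./semantic_cache"):
        self.db = lancedb.connect(uri)

    def add(self, row: dict) -> None:
        # Rows would typically carry a "vector" field plus the cached response.
        if "cache" not in self.db.table_names():
            self.db.create_table("cache", data=[row])
        else:
            self.db.open_table("cache").add([row])
```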
Async and Sync Operation Modes
Support for both synchronous and asynchronous workflows—adapt to your app’s architecture and performance model.
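As a sketch of the dual API (the class and method names are assumptions), the async path can simply offload the synchronous lookup to a worker thread so the event loop stays responsive:

```python
import asyncio

class SemanticCache:
    """Toy stand-in showing the sync/async pairing; real lookup logic omitted."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def get(self, query: str) -> str | None:
        # Blocking lookup for classic synchronous code paths.
        return self._store.get(query)

    async def aget(self, query: str) -> str | None:
        # Non-blocking variant: run the (potentially CPU-heavy) sync lookup
        # in a worker thread instead of blocking the event loop.
        return await asyncio.to_thread(self.get, query)
```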
Logging & Metrics
Get insight into cache performance with logs and stats, including hit rates, fallback frequency, and LLM invocation counts.
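A minimal sketch of the kind of counters such a cache can expose; the `CacheStats` shape is an assumption based on the metrics named above:

```python
import logging

logger = logging.getLogger("semantic_cache")

class CacheStats:
    def __init__(self):
        self.hits = 0
        self.misses = 0  # each miss implies one fallback LLM invocation

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        logger.info("cache %s (hit rate %.1f%%)",
                    "hit" if hit else "miss", 100 * self.hit_rate())

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```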
What Makes This Cache “Enhanced”?
The modular design allows developers to use only the parts they need while leaving room for growth. For example:
- Use it as a plug-in module in your FastAPI app (see the sketch after this list)
- Deploy it behind an API gateway as a microservice
- Integrate it directly into agent pipelines and RAG systems
It’s designed for developers, ML engineers, and architects looking for semantic recall without unnecessary compute cost.
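For instance, the FastAPI pattern might look like the sketch below, where the dict-based cache and `call_llm` are toy stand-ins for the real semantic lookup and model call:

```python
from fastapi import FastAPI

app = FastAPI()
_cache: dict[str, str] = {}  # stand-in for the semantic cache

async def call_llm(prompt: str) -> str:
    """Placeholder for the real (expensive) LLM call."""
    return f"LLM answer for: {prompt}"

@app.post("/ask")
async def ask(prompt: str) -> dict:
    # Serve from cache when a (semantically) matching prompt was seen before.
    if prompt in _cache:
        return {"answer": _cache[prompt], "source": "cache"}
    answer = await call_llm(prompt)
    _cache[prompt] = answer
    return {"answer": answer, "source": "llm"}
```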
Designed for Experimentation & Evolution
This enhanced cache is not a closed, black-box tool—it’s an evolving open-source project with extensibility at its core. It’s deliberately transparent, allowing others to inspect, modify, and build upon the caching logic.
I invite developers and AI practitioners to:
- Clone it
- Modify it
- Extend it
- Contribute ideas and improvements