Enhanced Data Stores
What Enhanced Data Stores are, how they differ from online data sources, and when to use them to unlock vector search, OCR, audio transcription, and other advanced capabilities.
An Enhanced Data Store is a private, indexed document store that you build and own. Unlike online data sources, which are live connections to external services, an Enhanced Data Store pulls content in, processes it, and keeps it in an internal index purpose-built for advanced search and analysis.
Navigate to News → Data Sources → Enhanced Data Stores to manage your organization's Enhanced Data Stores.
What Is an Enhanced Data Store?
When you create an Enhanced Data Store, you define:
- A content source: where documents come from (uploaded files, or an existing data source)
- A transform: how each document is processed before indexing
- A splitter: how each document is divided into searchable chunks
- An index technology: the search engine that powers queries
The result is a fully searchable document collection with capabilities that far exceed what a live online connection can offer — including semantic (vector) search, OCR, and audio transcription.
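The four choices above can be pictured as a small configuration record. The sketch below is illustrative only: `StoreDefinition`, `ContentSource`, and every field name are hypothetical and do not correspond to a documented platform API. The defaults mirror the options this page describes as default or recommended.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ContentSource(Enum):
    """The three content source types described on this page."""
    USER_FILES = "user_files"
    LATEST_NEWS = "latest_news"
    SEARCH_RESULTS = "search_results"


@dataclass(frozen=True)
class StoreDefinition:
    """Hypothetical model of the choices made when creating a store."""
    content_source: ContentSource
    transform: str = "Default"                # assumed default, see the Transforms table
    splitter: str = "Paragraph stacking"      # marked as the default splitter
    index_technology: str = "Flash"           # the recommended index technology
    search_term: Optional[str] = None         # only used by SEARCH_RESULTS stores

    def validate(self) -> None:
        # A search-results store is configured with a search term at creation.
        if self.content_source is ContentSource.SEARCH_RESULTS and not self.search_term:
            raise ValueError("a search-results store needs a search term")


store = StoreDefinition(ContentSource.USER_FILES)
store.validate()
print(store.splitter)
```

Keeping the definition immutable (`frozen=True`) reflects the idea that these choices are made once, when the store is created; whether the platform allows changing them later is not stated here.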
Content Sources
User Files
You upload documents directly from your computer. Any file type that the platform can process — PDFs, Word documents, plain text files, audio files, video files — can be added. The store grows as you upload more files.
Best for: Private research libraries, internal reports, archived documents, and recordings you want to make searchable.
Latest News from a Data Source
The store automatically fetches the most recent articles from an existing online data source (one that supports the Latest capability) and indexes them. You can set a cutoff date to control how far back to go.
Best for: Creating a deep, searchable archive of recent coverage from a specific news outlet or social channel.
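As a rough illustration of what a cutoff date does, the sketch below keeps only articles published on or after the cutoff. The `Article` type and the `newer_than_cutoff` helper are hypothetical, and treating the cutoff as inclusive is an assumption; the platform's exact boundary behavior is not documented here.

```python
from datetime import date
from typing import List, NamedTuple


class Article(NamedTuple):
    """Hypothetical stand-in for an article returned by a data source."""
    title: str
    published: date


def newer_than_cutoff(articles: List[Article], cutoff: date) -> List[Article]:
    """Keep articles published on or after the cutoff, newest first."""
    kept = [a for a in articles if a.published >= cutoff]
    return sorted(kept, key=lambda a: a.published, reverse=True)


articles = [
    Article("Old story", date(2023, 1, 5)),
    Article("Recent story", date(2024, 3, 1)),
    Article("Boundary story", date(2024, 1, 1)),
]
recent = newer_than_cutoff(articles, cutoff=date(2024, 1, 1))
print([a.title for a in recent])
```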
Search Results from a Data Source
The store runs a specific keyword query against an existing data source and indexes everything it finds. You configure the search term when creating the store.
Best for: Building a focused collection of documents on a specific topic drawn from a larger data source.
Processing Pipeline
Every document that enters an Enhanced Data Store passes through a configurable processing pipeline.
Transforms
A Transform pre-processes each document before it is split and indexed. Choose the transform that matches your content type:
| Transform | What it does |
|---|---|
| Default | Automatic format detection and text extraction — the right choice for most situations |
| None | Uses the document in its original form without any processing |
| HTML Open Graph metadata | Extracts Open Graph metadata (title, description, image) from HTML pages |
| HTML meta metadata | Extracts standard <meta> tag data from HTML pages |
| HTML header metadata | Extracts headings and structure from HTML pages |
| Markdown header metadata | Extracts headings and structure from Markdown documents |
| VTT metadata | Extracts metadata from WebVTT subtitle files |
| Extract audio from video | Uses FFmpeg to extract the audio track from a video file (use before audio transcription) |
| Audio transcription | Converts spoken audio into searchable text |
| Azure Document Intelligence | Uses Microsoft Azure AI to extract text and structure from complex PDFs and scanned documents (OCR) |
| Remove whitespace | Strips excess spaces and blank lines to reduce token usage |
| HTML minification | Compresses HTML content |
| Data source transform | Applies the transform defined by the underlying data source |
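To make the idea of a transform concrete, here is a minimal sketch of what a whitespace-removal step might do. It is not the platform's implementation: the function name `remove_whitespace` and the exact rules (collapse runs of spaces and tabs, drop blank lines) are assumptions chosen to match the table's description.

```python
import re


def remove_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "Title   line\n\n\n   Body   text  \n"
cleaned = remove_whitespace(sample)
print(repr(cleaned))
```

Because the cleaned text is shorter, fewer tokens reach the splitter and the embedding model, which is the saving the table refers to.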
Splitters
A Splitter divides each processed document into chunks. Smaller, well-defined chunks improve search precision because the search engine can pinpoint the exact section of a long document that is relevant to a query.
| Splitter | How it works |
|---|---|
| Paragraphs | Splits on paragraph boundaries |
| Paragraph stacking (default) | Groups paragraphs together with overlap so context is preserved across chunk boundaries |
| Hierarchical | Splits by the document's heading structure, keeping each section together |
| Halves | Recursively splits the document in half until chunks are small enough |
| Whole document | Treats the entire document as a single chunk — best for short documents |
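The sketch below illustrates two of the strategies in the table: paragraph stacking with overlap, and recursive halving. The function names, the `size`, `overlap`, and `max_chars` parameters, and the choice to measure chunk size in characters are all assumptions made for illustration; the platform's actual limits and units are not documented on this page.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def stack_paragraphs(paragraphs: List[str], size: int = 3, overlap: int = 1) -> List[str]:
    """Group `size` paragraphs per chunk; neighbors share `overlap` paragraphs."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append("\n\n".join(paragraphs[start:start + size]))
        if start + size >= len(paragraphs):
            break  # the last paragraph is already covered
    return chunks


def split_halves(text: str, max_chars: int) -> List[str]:
    """Recursively halve the text until every piece fits in `max_chars`."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    return split_halves(text[:mid], max_chars) + split_halves(text[mid:], max_chars)


stacked = stack_paragraphs(["a", "b", "c", "d", "e"], size=3, overlap=1)
halved = split_halves("abcdefgh", max_chars=3)
print(stacked)
print(halved)
```

In the stacked output, paragraph `c` appears in both chunks. That shared paragraph is what preserves context across the chunk boundary, at the cost of indexing some text twice.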
Index Technology
The indexed chunks are stored in a vector database that enables semantic search.
| Technology | Description |
|---|---|
| Flash | TopHack's proprietary vector search engine — fast, cost-effective, and the recommended choice |
| Azure AI Search | Microsoft Azure AI Search (coming soon) |
| OpenAI Vector Store | OpenAI's native vector store (coming soon) |
You also choose an embedding model during setup. The embedding model converts text into vectors; the choice of model affects the quality and language coverage of semantic search results.
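The ranking step behind vector search can be sketched in a few lines. Note the substitution: `toy_embed` below is a plain word-count vector, not a real embedding model, so it only matches shared words rather than meaning. A learned embedding model would replace it, while the cosine-similarity ranking would stay essentially the same. All names here are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in for an embedding model: a sparse word-count vector."""
    counts = Counter(text.lower().split())
    return {word: float(n) for word, n in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query: str, chunks: List[str], top_k: int = 1) -> List[Tuple[float, str]]:
    """Score every chunk against the query and return the best `top_k`."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = ["quarterly revenue grew", "the podcast discusses climate policy"]
best = rank_chunks("climate policy podcast", chunks)
print(best[0][1])
```

This also shows why the embedding model matters: the ranking can only be as good as the vectors it compares, so a model with weak coverage of a language will rank chunks in that language poorly.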
Why Use an Enhanced Data Store?
| Need | How an Enhanced Data Store helps |
|---|---|
| Search your own files semantically | Upload files, apply transcription or OCR, and query by meaning |
| Make podcasts and videos searchable | Transcribe audio content and index it for keyword and semantic search |
| Extract text from scanned documents | Apply Azure Document Intelligence to index PDFs and images |
| Build a curated research archive | Pull content from a live data source into a private, reusable index |
| Search with higher precision | Fine-tune chunking and transforms to match your content type |
Relationship to Online Data Sources
An Enhanced Data Store appears as a regular data source in the search module: you can select it and query it just like any built-in or custom online source. The difference is that results come from your private index rather than a live external query.
When you set up an Enhanced Data Store that pulls from a live source, that source needs to support the Latest or Search capability. The Enhanced Data Store adds the Vector Search and Full-Text Search capabilities on top.
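One way to read the capability requirement is as a simple lookup from a source's capabilities to the store types it can feed. The sketch below assumes that the Latest capability enables the "Latest News" store type and the Search capability enables the "Search Results" store type; that mapping is inferred from the descriptions on this page, and the function name is hypothetical.

```python
from typing import List, Set

# Capabilities the store itself contributes, as stated on this page.
STORE_ADDED_CAPABILITIES = {"Vector Search", "Full-Text Search"}


def available_store_types(source_capabilities: Set[str]) -> List[str]:
    """List the store types a live source can feed (assumed mapping)."""
    types = []
    if "Latest" in source_capabilities:
        types.append("Latest News from a Data Source")
    if "Search" in source_capabilities:
        types.append("Search Results from a Data Source")
    return types


print(available_store_types({"Latest"}))
print(available_store_types({"Latest", "Search"}))
```

A source with neither capability yields an empty list, meaning it cannot back an Enhanced Data Store; uploaded user files remain an option in that case.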
See About Data Sources for an overview of how online sources and Enhanced Data Stores fit together.