Enhanced Data Stores

What Enhanced Data Stores are, how they differ from online data sources, and when to use them to unlock vector search, OCR, audio transcription, and other advanced capabilities.

An Enhanced Data Store is a private, indexed document store that you build and own. Unlike online data sources, which are live connections to external services, an Enhanced Data Store pulls content in, processes it, and keeps it in an internal index purpose-built for advanced search and analysis.

Navigate to News → Data Sources → Enhanced Data Stores to manage your organization's Enhanced Data Stores.

What Is an Enhanced Data Store?

When you create an Enhanced Data Store, you define:

A content source: where documents come from (uploaded files or an existing data source)
A transform: how each document is processed before indexing
A splitter: how each document is divided into searchable chunks
An index technology: the search engine that powers queries

The result is a fully searchable document collection with capabilities that far exceed what a live online connection can offer — including semantic (vector) search, OCR, and audio transcription.
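The four choices above can be pictured as a small configuration record. This is only an illustrative sketch: the class name, field names, and string values are hypothetical and do not reflect the platform's actual API. The defaults mirror the options this page describes as default or recommended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnhancedDataStoreConfig:
    """Hypothetical record of the four choices made when creating a store."""

    content_source: str                   # "user_files", "latest_news", or "search_results"
    transform: str = "default"            # automatic format detection and text extraction
    splitter: str = "paragraph_stacking"  # listed as the default splitter
    index_technology: str = "flash"       # listed as the recommended engine


# A store built from uploaded files, keeping every other choice at its default.
config = EnhancedDataStoreConfig(content_source="user_files")
print(config)
```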

Content Sources

User Files

You upload documents directly from your computer. Any file type that the platform can process — PDFs, Word documents, plain text files, audio files, video files — can be added. The store grows as you upload more files.

Best for: Private research libraries, internal reports, archived documents, and recordings you want to make searchable.

Latest News from a Data Source

The store automatically fetches the most recent articles from an existing online data source (one that supports the Latest capability) and indexes them. You can set a cutoff date to control how far back to go.

Best for: Creating a deep, searchable archive of recent coverage from a specific news outlet or social channel.
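To make the cutoff date concrete, here is a minimal sketch of the filtering idea. The function name and the article dictionaries are hypothetical, and treating the cutoff as inclusive is an assumption; the platform's actual behavior at the boundary is not documented here.

```python
from datetime import date


def filter_by_cutoff(articles, cutoff):
    """Keep only articles published on or after the cutoff (inclusive by assumption)."""
    return [a for a in articles if a["published"] >= cutoff]


# Hypothetical articles as they might arrive from a source with the Latest capability.
articles = [
    {"title": "Older story", "published": date(2024, 1, 5)},
    {"title": "Recent story", "published": date(2024, 3, 20)},
]

recent = filter_by_cutoff(articles, date(2024, 3, 1))
print([a["title"] for a in recent])
```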

Search Results from a Data Source

The store runs a specific keyword query against an existing data source and indexes everything it finds. You configure the search term when creating the store.

Best for: Building a focused collection of documents on a specific topic drawn from a larger data source.

Processing Pipeline

Every document that enters an Enhanced Data Store passes through a configurable processing pipeline.

Transforms

A Transform pre-processes each document before it is split and indexed. Choose the transform that matches your content type:

Transform	What it does
Default	Automatic format detection and text extraction — the right choice for most situations
None	Uses the document in its original form without any processing
HTML Open Graph metadata	Extracts Open Graph metadata (title, description, image) from HTML pages
HTML meta metadata	Extracts standard `<meta>` tag data from HTML pages
HTML header metadata	Extracts headings and structure from HTML pages
Markdown header metadata	Extracts headings and structure from Markdown documents
VTT metadata	Extracts metadata from WebVTT subtitle files
Extract audio from video	Uses FFmpeg to extract the audio track from a video file (use before audio transcription)
Audio transcription	Converts spoken audio into searchable text
Azure Document Intelligence	Uses Microsoft Azure AI to extract text and structure from complex PDFs and scanned documents (OCR)
Remove whitespace	Strips excess spaces and blank lines to reduce token usage
HTML minification	Compresses HTML content
Data source transform	Applies the transform defined by the underlying data source
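As an illustration of what a transform does to a document, the following sketch approximates the Remove whitespace transform. It is not the platform's implementation: the function name is hypothetical, and keeping one blank line between paragraphs (so that a paragraph-based splitter can still find boundaries) is an assumption.

```python
import re


def remove_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs, trim each line, and squeeze extra blank lines."""
    text = re.sub(r"[ \t]+", " ", text)                 # runs of spaces/tabs become one space
    text = "\n".join(line.strip() for line in text.splitlines())
    text = re.sub(r"\n{3,}", "\n\n", text)              # keep at most one blank line
    return text.strip()


raw = "Title   line\n\n\n\n  Body\ttext  \n"
print(repr(remove_whitespace(raw)))
```

Fewer characters generally mean fewer tokens to embed and store, which is the saving the table entry refers to.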

Splitters

A Splitter divides each processed document into chunks. Smaller, well-defined chunks improve search precision because the search engine can pinpoint the exact section of a long document that is relevant to a query.

Splitter	How it works
Paragraphs	Splits on paragraph boundaries
Paragraph stacking (default)	Groups paragraphs together with overlap so context is preserved across chunk boundaries
Hierarchical	Splits by the document's heading structure, keeping each section together
Halves	Recursively splits the document in half until chunks are small enough
Whole document	Treats the entire document as a single chunk — best for short documents
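Two of the splitters in the table can be sketched in a few lines. These are simplified illustrations, not the platform's algorithms: the function names, the character-based size limit, the `overlap` parameter, and splitting at the exact character midpoint are all assumptions.

```python
def stack_paragraphs(paragraphs, max_chars=200, overlap=1):
    """Group consecutive paragraphs into chunks of roughly max_chars.

    The last `overlap` paragraphs of a finished chunk are repeated at the
    start of the next one, so context carries across the boundary.
    """
    chunks, current = [], []
    for paragraph in paragraphs:
        candidate = current + [paragraph]
        if current and sum(len(p) for p in candidate) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap > 0 else []
            candidate = current + [paragraph]
        current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def split_halves(text, max_chars):
    """Recursively halve the text until every piece fits within max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    return split_halves(text[:mid], max_chars) + split_halves(text[mid:], max_chars)


stacked = stack_paragraphs(["a" * 10, "b" * 10, "c" * 10], max_chars=20, overlap=1)
halved = split_halves("abcdefgh", max_chars=3)
print(stacked)
print(halved)
```

In the stacked output the middle paragraph appears in both chunks, which is the overlap the table describes. A production splitter would more likely measure size in tokens and avoid cutting words in half.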

Index Technology

The indexed chunks are stored in a vector database that enables semantic search.

Technology	Description
Flash	TopHack's proprietary vector search engine — fast, cost-effective, and the recommended choice
Azure AI Search	Microsoft Azure AI Search (coming soon)
OpenAI Vector Store	OpenAI's native vector store (coming soon)

You also choose an embedding model during setup. The embedding model converts text into vectors; the choice of model affects the quality and language coverage of semantic search results.
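The idea behind semantic search over vectors can be shown with a toy example. This sketch ranks chunks by cosine similarity, a common scoring choice; this page does not state how Flash scores results, so treat the metric as an assumption. The two-dimensional vectors are hand-written stand-ins for real embedding model output, and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors standing in for embeddings of three indexed chunks.
chunk_vecs = {
    "budget": [1.0, 0.0],
    "weather": [0.0, 1.0],
    "finance": [0.9, 0.1],
}
best = top_k([1.0, 0.0], chunk_vecs, k=2)
print(best)
```

Chunks whose vectors point in nearly the same direction as the query rank highest even when they share no keywords with it, which is why the embedding model matters for result quality.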

Why Use an Enhanced Data Store?

Need	How an Enhanced Data Store helps
Search your own files semantically	Upload files, apply transcription or OCR, and query by meaning
Make podcasts and videos searchable	Transcribe audio content and index it for keyword and semantic search
Extract text from scanned documents	Apply Azure Document Intelligence to index PDFs and images
Build a curated research archive	Pull content from a live data source into a private, reusable index
Search with higher precision	Fine-tune chunking and transforms to match your content type

Relationship to Online Data Sources

An Enhanced Data Store appears as a regular data source in the search module: you can select it and query it just like any built-in or custom online source. The difference is that results come from your private index rather than a live external query.

When you set up an Enhanced Data Store that pulls from a live source, that source needs to support the Latest or Search capability. The Enhanced Data Store adds the Vector Search and Full-Text Search capabilities on top.
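The capability requirement can be expressed as a simple check. This is a hypothetical sketch: the function, the dictionary keys, and the pairing of the search-results content source with the Search capability are assumptions inferred from this page, not a documented API.

```python
# Capability a live source must offer for each content source type (assumed mapping).
REQUIRED_CAPABILITY = {
    "latest_news": "Latest",
    "search_results": "Search",
}

# Capabilities the Enhanced Data Store itself contributes, as named on this page.
ADDED_CAPABILITIES = frozenset({"Vector Search", "Full-Text Search"})


def can_back_store(source_capabilities, content_source):
    """Return True if a live source offers the capability the content source needs."""
    return REQUIRED_CAPABILITY[content_source] in source_capabilities


# A hypothetical news feed that only supports Latest.
feed_capabilities = {"Latest"}
print(can_back_store(feed_capabilities, "latest_news"))     # usable for a Latest-based store
print(can_back_store(feed_capabilities, "search_results"))  # lacks the Search capability
```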

See About Data Sources for an overview of how online sources and Enhanced Data Stores fit together.