---
title: Enhanced Data Stores
description: What Enhanced Data Stores are, how they differ from online data sources, and when to use them to unlock vector search, OCR, audio transcription, and other advanced capabilities.
---

# Enhanced Data Stores

An **Enhanced Data Store** is a private, indexed document store that you build and own. Unlike online data sources — which are live connections to external services — an Enhanced Data Store pulls content in, processes it, and keeps it in an internal index purpose-built for advanced search and analysis.

Navigate to [News → Data Sources → Enhanced Data Stores]({APP_HOST}/documents/pipelines) to manage your organization's Enhanced Data Stores.

## What Is an Enhanced Data Store?

When you create an Enhanced Data Store, you define:

1. **A content source** — where documents come from (uploaded files, or an existing data source)
2. **A transform** — how each document is processed before indexing
3. **A splitter** — how each document is divided into searchable chunks
4. **An index technology** — the search engine that powers queries

The result is a searchable document collection with capabilities that a live online connection does not provide, including semantic (vector) search, OCR, and audio transcription.
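
The four choices above can be pictured as one small configuration record. The sketch below is illustrative only: the class name, field names, and string values are hypothetical and do not reflect the platform's actual configuration format or API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical identifiers for the three content sources described in this page.
SOURCE_KINDS = {"user_files", "latest", "search"}


@dataclass(frozen=True)
class EnhancedDataStoreConfig:
    """Illustrative record of the choices made when creating a store."""

    name: str
    source_kind: str                      # one of SOURCE_KINDS
    transform: str = "default"            # how each document is pre-processed
    splitter: str = "paragraph_stacking"  # how documents become chunks
    index_technology: str = "flash"       # the engine that answers queries
    search_term: Optional[str] = None     # only used when source_kind == "search"

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if self.source_kind not in SOURCE_KINDS:
            problems.append(f"unknown source kind: {self.source_kind}")
        if self.source_kind == "search" and not self.search_term:
            problems.append("a search source needs a search term")
        return problems


config = EnhancedDataStoreConfig(
    name="podcast-archive",
    source_kind="user_files",
    transform="audio_transcription",
)
print(config.validate())
```

Keeping the record immutable mirrors the idea that these choices are made once, at creation time, and then applied to every document that enters the store.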

## Content Sources

### User Files

You upload documents directly from your computer. Any file type that the platform can process — PDFs, Word documents, plain text files, audio files, video files — can be added. The store grows as you upload more files.

**Best for:** Private research libraries, internal reports, archived documents, and recordings you want to make searchable.

### Latest News from a Data Source

The store automatically fetches the most recent articles from an existing online data source (one that supports the **Latest** capability) and indexes them. You can set a cutoff date to limit how far back the store fetches articles.

**Best for:** Creating a deep, searchable archive of recent coverage from a specific news outlet or social channel.

### Search Results from a Data Source

The store runs a specific keyword query against an existing data source and indexes everything it finds. You configure the search term when creating the store.

**Best for:** Building a focused collection of documents on a specific topic drawn from a larger data source.

## Processing Pipeline

Every document that enters an Enhanced Data Store passes through a configurable processing pipeline.

### Transforms

A **Transform** pre-processes each document before it is split and indexed. Choose the transform that matches your content type:

| Transform | What it does |
|---|---|
| **Default** | Automatic format detection and text extraction — the right choice for most situations |
| **None** | Uses the document in its original form without any processing |
| **HTML Open Graph metadata** | Extracts Open Graph metadata (title, description, image) from HTML pages |
| **HTML meta metadata** | Extracts standard `<meta>` tag data from HTML pages |
| **HTML header metadata** | Extracts headings and structure from HTML pages |
| **Markdown header metadata** | Extracts headings and structure from Markdown documents |
| **VTT metadata** | Extracts metadata from WebVTT subtitle files |
| **Extract audio from video** | Uses FFmpeg to extract the audio track from a video file (use before audio transcription) |
| **Audio transcription** | Converts spoken audio into searchable text |
| **Azure Document Intelligence** | Uses Microsoft Azure AI to extract text and structure from complex PDFs and scanned documents (OCR) |
| **Remove whitespace** | Strips excess spaces and blank lines to reduce token usage |
| **HTML minification** | Compresses HTML content |
| **Data source transform** | Applies the transform defined by the underlying data source |
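
To give a feel for what two of the simpler transforms in the table might do, here is a minimal sketch using only the Python standard library. It is not the platform's implementation: the function names and the exact whitespace rules are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from typing import Dict


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Store "og:title" under the key "title", and so on.
            self.metadata[prop[3:]] = content


def extract_open_graph(html: str) -> Dict[str, str]:
    """Hypothetical stand-in for the 'HTML Open Graph metadata' transform."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


def remove_whitespace(text: str) -> str:
    """Hypothetical stand-in for the 'Remove whitespace' transform."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim every line and drop the ones that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


page = '<head><meta property="og:title" content="Budget vote"></head>'
print(extract_open_graph(page))
print(remove_whitespace("Hello   world  \n\n\n  Second\tline\n"))
```

Both functions take a document and return a cleaned or enriched version of it, which is the general shape of a transform: it runs before the splitter ever sees the content.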

### Splitters

A **Splitter** divides each processed document into chunks. Smaller, well-defined chunks improve search precision because the search engine can pinpoint the exact section of a long document that is relevant to a query.

| Splitter | How it works |
|---|---|
| **Paragraphs** | Splits on paragraph boundaries |
| **Paragraph stacking** *(default)* | Groups paragraphs together with overlap so context is preserved across chunk boundaries |
| **Hierarchical** | Splits by the document's heading structure, keeping each section together |
| **Halves** | Recursively splits the document in half until chunks are small enough |
| **Whole document** | Treats the entire document as a single chunk — best for short documents |
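
As a rough illustration of two splitters from the table, the sketch below implements paragraph stacking with overlap and recursive halving. The function names, the `per_chunk`, `overlap`, and `max_chars` parameters, and the blank-line paragraph rule are assumptions made for this example; the platform's own splitters may size chunks differently (for example, by tokens rather than characters).

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def stack_paragraphs(paragraphs: List[str], per_chunk: int = 3, overlap: int = 1) -> List[str]:
    """Group paragraphs into chunks; consecutive chunks share `overlap` paragraphs."""
    if per_chunk <= 0 or not 0 <= overlap < per_chunk:
        raise ValueError("need per_chunk > 0 and 0 <= overlap < per_chunk")
    step = per_chunk - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append("\n\n".join(paragraphs[start:start + per_chunk]))
        # Stop once the current chunk already reaches the last paragraph.
        if start + per_chunk >= len(paragraphs):
            break
    return chunks


def split_halves(text: str, max_chars: int) -> List[str]:
    """Recursively cut the text in half until every piece fits in max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return [text]
    middle = len(text) // 2
    return split_halves(text[:middle], max_chars) + split_halves(text[middle:], max_chars)


paragraphs = split_paragraphs("A\n\nB\n\nC\n\nD\n\nE")
for chunk in stack_paragraphs(paragraphs, per_chunk=3, overlap=1):
    print(repr(chunk))
```

With five paragraphs, three per chunk, and an overlap of one, paragraph C appears at the end of the first chunk and the start of the second. That shared paragraph is what carries context across the chunk boundary.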

### Index Technology

The indexed chunks are stored in a vector database that enables semantic search.

| Technology | Description |
|---|---|
| **Flash** | TopHack's proprietary vector search engine — fast, cost-effective, and the recommended choice |
| **Azure AI Search** | Microsoft Azure AI Search *(coming soon)* |
| **OpenAI Vector Store** | OpenAI's native vector store *(coming soon)* |

You also choose an **embedding model** during setup. The embedding model converts text into vectors; the choice of model affects the quality and language coverage of semantic search results.
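
The role of the embedding model can be sketched with a toy example. Here a simple word-count function stands in for a real embedding model, and chunks are ranked by cosine similarity to the query vector. Everything in the block is an assumption for illustration: real embedding models produce dense vectors that capture meaning rather than word counts, and nothing here describes how Flash works internally.

```python
import math
from typing import List, Tuple


def toy_embed(text: str, vocabulary: List[str]) -> List[float]:
    """Toy stand-in for an embedding model: one word count per vocabulary term."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query: str, chunks: List[str], vocabulary: List[str]) -> List[Tuple[float, str]]:
    """Score every chunk against the query and sort from most to least similar."""
    query_vector = toy_embed(query, vocabulary)
    scored = [(cosine(query_vector, toy_embed(chunk, vocabulary)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


vocabulary = ["budget", "election", "rain"]
chunks = ["the budget vote passed", "heavy rain expected", "election budget debate"]
for score, chunk in rank_chunks("budget", chunks, vocabulary):
    print(f"{score:.2f}  {chunk}")
```

Swapping `toy_embed` for a different function changes every score, which is why the choice of embedding model affects result quality and language coverage.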

## Why Use an Enhanced Data Store?

| Need | How an Enhanced Data Store helps |
|---|---|
| Search your own files semantically | Upload files, apply transcription or OCR, and query by meaning |
| Make podcasts and videos searchable | Transcribe audio content and index it for keyword and semantic search |
| Extract text from scanned documents | Apply the Azure Document Intelligence transform to run OCR on scanned PDFs and images before indexing |
| Build a curated research archive | Pull content from a live data source into a private, reusable index |
| Search with higher precision | Fine-tune chunking and transforms to match your content type |

## Relationship to Online Data Sources

An Enhanced Data Store appears as a regular data source in the [search module]({APP_HOST}/documents/search) — you can select it and query it just like any built-in or custom online source. The difference is that results come from your private index rather than a live external query.

When you set up an Enhanced Data Store that pulls from a live source, that source needs to support the **Latest** or **Search** capability. The Enhanced Data Store adds the **Vector Search** and **Full-Text Search** capabilities on top.
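
The capability rule above can be expressed as a small check. This is a sketch under assumptions: the `latest` and `search` source-kind names, the function name, and the pairing of each store type with exactly one required capability are inferred from this page and are not a documented API.

```python
from typing import Iterable, Set

# Assumed pairing: a "latest news" store needs Latest, a "search results" store needs Search.
REQUIRED_CAPABILITY = {"latest": "Latest", "search": "Search"}

# Capabilities the page says an Enhanced Data Store adds on top of the source.
ADDED_CAPABILITIES = frozenset({"Vector Search", "Full-Text Search"})


def capabilities_added_by_store(source_kind: str, source_capabilities: Iterable[str]) -> Set[str]:
    """Check the live source is usable, then return what the store adds."""
    required = REQUIRED_CAPABILITY.get(source_kind)
    if required is None:
        raise ValueError(f"unknown source kind: {source_kind!r}")
    if required not in set(source_capabilities):
        raise ValueError(f"source must support the {required} capability")
    return set(ADDED_CAPABILITIES)


print(sorted(capabilities_added_by_store("latest", ["Latest"])))
```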

See [About Data Sources](about-data-sources) for an overview of how online sources and Enhanced Data Stores fit together.
