IngestionPipeline orchestrates the complete flow of data from raw documents to indexed, searchable nodes. It handles parsing, transformation, embedding, and storage with built-in caching.
Overview
An ingestion pipeline:
- Transforms documents through a series of steps
- Caches intermediate results for efficiency
- Handles embedding generation
- Stores nodes in vector stores
- Manages document updates and deduplication
Basic Usage
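A minimal pipeline chains a splitter and an embedding model, then runs over documents. This is a sketch assuming the classic `llamaindex` npm package exports; newer releases move some classes into scoped packages such as `@llamaindex/openai`.

```typescript
import {
  Document,
  IngestionPipeline,
  OpenAIEmbedding,
  SentenceSplitter,
} from "llamaindex";

// Transformations run in order: split into chunks, then embed each chunk.
const pipeline = new IngestionPipeline({
  transformations: [
    new SentenceSplitter({ chunkSize: 1024, chunkOverlap: 20 }),
    new OpenAIEmbedding(),
  ],
});

const nodes = await pipeline.run({
  documents: [new Document({ text: "Some example text to ingest." })],
});
console.log(`Produced ${nodes.length} nodes`);
```

`run` returns the processed nodes, which you can then index or store.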
Pipeline Components
Transformations
Transformations are the processing steps applied to your data:
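For example, a transformation list might combine splitting, metadata extraction, and embedding. A sketch, assuming the classic `llamaindex` package (where `TitleExtractor` is one of the bundled metadata extractors):

```typescript
import { OpenAIEmbedding, SentenceSplitter, TitleExtractor } from "llamaindex";

// Each stage consumes the nodes produced by the previous one.
const transformations = [
  new SentenceSplitter({ chunkSize: 512 }), // chunk long documents
  new TitleExtractor(),                     // add a title to node metadata
  new OpenAIEmbedding(),                    // attach an embedding vector
];
```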
Vector Store Integration
Automatically store nodes in a vector store:
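For instance, passing a vector store to the pipeline writes embedded nodes into it during the run. A sketch using a hypothetical local Qdrant instance; any supported vector store works, and the import path may differ in newer package layouts (e.g. `@llamaindex/qdrant`):

```typescript
import {
  Document,
  IngestionPipeline,
  OpenAIEmbedding,
  QdrantVectorStore,
  SentenceSplitter,
} from "llamaindex";

// Hypothetical local Qdrant endpoint.
const vectorStore = new QdrantVectorStore({ url: "http://localhost:6333" });

const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter(), new OpenAIEmbedding()],
  vectorStore,
});

// Embedded nodes are written to the store as part of the run.
await pipeline.run({
  documents: [new Document({ text: "Stored straight into the vector store." })],
});
```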
Multiple Vector Stores
Use different vector stores for different modalities:
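A sketch of per-modality stores. The `vectorStores` map keyed by `ModalityType` is an assumption based on LlamaIndex.TS's multimodal storage conventions; verify the option name against your version's `IngestionPipeline` reference. `SimpleVectorStore` stands in for real backends here:

```typescript
import {
  IngestionPipeline,
  ModalityType,
  OpenAIEmbedding,
  SentenceSplitter,
  SimpleVectorStore,
} from "llamaindex";

// One store per modality (assumed `vectorStores` option).
const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter(), new OpenAIEmbedding()],
  vectorStores: {
    [ModalityType.TEXT]: new SimpleVectorStore(),
    [ModalityType.IMAGE]: new SimpleVectorStore(),
  },
});
```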
Caching
Pipelines cache transformation results to avoid reprocessing:
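For example, attaching an `IngestionCache` means a second run over identical input skips the expensive steps. A sketch, assuming `IngestionCache` is exported by the classic `llamaindex` package:

```typescript
import {
  Document,
  IngestionCache,
  IngestionPipeline,
  OpenAIEmbedding,
  SentenceSplitter,
} from "llamaindex";

const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter(), new OpenAIEmbedding()],
  cache: new IngestionCache(),
});

const documents = [new Document({ text: "Cache me once." })];

await pipeline.run({ documents }); // computes splits and embeddings
await pipeline.run({ documents }); // identical input: served from the cache
```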
Custom Cache
The cache backend is pluggable: instead of the in-memory default, cached results can be persisted or backed by an external key-value store (for example, Redis).
How Caching Works
- A hash is computed from the input nodes and transformation configuration
- If cached results exist for this hash, they’re returned immediately
- Otherwise, the transformation runs and results are cached
- The cache is stored in memory by default
Document Store Strategies
Control how documents are managed and deduplicated:
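For example, attaching a document store and a strategy lets reruns distinguish changed from unchanged documents. A sketch; the `docStore` and `docStoreStrategy` option names are assumptions to verify against your version:

```typescript
import {
  DocStoreStrategy,
  IngestionPipeline,
  OpenAIEmbedding,
  SentenceSplitter,
  SimpleDocumentStore,
} from "llamaindex";

// With a doc store attached, reruns detect changed vs. unchanged documents.
const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter(), new OpenAIEmbedding()],
  docStore: new SimpleDocumentStore(),
  docStoreStrategy: DocStoreStrategy.UPSERTS,
});
```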
Available Strategies
- DocStoreStrategy.UPSERTS (default): update existing docs, insert new ones
- DocStoreStrategy.DUPLICATES_ONLY: skip duplicate documents
- DocStoreStrategy.UPSERTS_AND_DELETE: handle updates and deletions
- DocStoreStrategy.NONE: no document store management
Input Sources
Pipelines accept multiple input sources:
Direct Documents
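The simplest input is a list of `Document` objects passed straight to `run` (a sketch, assuming classic `llamaindex` exports):

```typescript
import { Document, IngestionPipeline, SentenceSplitter } from "llamaindex";

const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter()],
});

// Pass documents directly to run().
const nodes = await pipeline.run({
  documents: [
    new Document({ text: "First document." }),
    new Document({ text: "Second document." }),
  ],
});
```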
Existing Nodes
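Nodes produced elsewhere (for example, by a node parser run earlier) can be fed in directly instead of documents. A sketch:

```typescript
import { IngestionPipeline, OpenAIEmbedding, TextNode } from "llamaindex";

// Skip parsing: only the remaining transformations are applied to the nodes.
const pipeline = new IngestionPipeline({
  transformations: [new OpenAIEmbedding()],
});

const nodes = await pipeline.run({
  nodes: [new TextNode({ text: "An already-parsed node." })],
});
```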
Reader Integration
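A reader can load documents that are then handed to the pipeline. A sketch; in newer package layouts `SimpleDirectoryReader` lives in `@llamaindex/readers/directory`:

```typescript
import {
  IngestionPipeline,
  SentenceSplitter,
  SimpleDirectoryReader,
} from "llamaindex";

// Load documents from disk, then ingest them.
const reader = new SimpleDirectoryReader();
const documents = await reader.loadData("./data");

const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter()],
});
const nodes = await pipeline.run({ documents });
```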
Pipeline Documents
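If your version's `IngestionPipeline` accepts documents at construction time (the `documents` option here is an assumption; check the API reference for your release), `run()` can then be called with no arguments:

```typescript
import { Document, IngestionPipeline, SentenceSplitter } from "llamaindex";

// Assumed `documents` constructor option; verify against your installed version.
const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter()],
  documents: [new Document({ text: "Attached at construction time." })],
});

const nodes = await pipeline.run();
```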
Complete Example
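An end-to-end sketch tying the pieces together: load, transform, cache, deduplicate, store, and query. Class and option names assume the classic `llamaindex` package; the Qdrant URL is a placeholder:

```typescript
import {
  DocStoreStrategy,
  IngestionCache,
  IngestionPipeline,
  OpenAIEmbedding,
  QdrantVectorStore,
  SentenceSplitter,
  SimpleDirectoryReader,
  SimpleDocumentStore,
  TitleExtractor,
  VectorStoreIndex,
} from "llamaindex";

// 1. Load raw documents.
const documents = await new SimpleDirectoryReader().loadData("./data");

// 2. Configure the pipeline: parse, extract metadata, embed, store.
const vectorStore = new QdrantVectorStore({ url: "http://localhost:6333" });
const pipeline = new IngestionPipeline({
  transformations: [
    new SentenceSplitter({ chunkSize: 1024, chunkOverlap: 20 }),
    new TitleExtractor(),
    new OpenAIEmbedding(),
  ],
  vectorStore,
  docStore: new SimpleDocumentStore(),
  docStoreStrategy: DocStoreStrategy.UPSERTS,
  cache: new IngestionCache(),
});

// 3. Run ingestion; reruns skip unchanged documents and cached steps.
const nodes = await pipeline.run({ documents });
console.log(`Ingested ${nodes.length} nodes`);

// 4. Query over the populated vector store.
const index = await VectorStoreIndex.fromVectorStore(vectorStore);
```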
Advanced Usage
Custom Transformations
Create custom transformation components:
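A sketch of a custom component that strips special characters from node text. The pattern of extending `TransformComponent` with an async `transform` method follows the LlamaIndex.TS custom-transformation convention; verify the exact signature for your version:

```typescript
import { TextNode, TransformComponent } from "llamaindex";

// A custom step that strips special characters from node text.
class RemoveSpecialCharacters extends TransformComponent {
  async transform(nodes: TextNode[]): Promise<TextNode[]> {
    for (const node of nodes) {
      node.text = node.text.replace(/[^0-9a-z\s]/gi, "");
    }
    return nodes;
  }
}

// Use it in a pipeline like any built-in transformation, or call it directly:
const cleaned = await new RemoveSpecialCharacters().transform([
  new TextNode({ text: "Hello, world! (v2.0)" }),
]);
```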
Running Transformations Independently
Transformations can also be applied directly to a list of nodes, outside of a pipeline.
Batch Processing
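One way to bound memory usage on large corpora is to split the input into fixed-size batches and run the pipeline once per batch. The helper below is plain TypeScript; the pipeline and documents in the usage comment are assumed to be constructed as elsewhere in this page:

```typescript
// Split an array into fixed-size batches.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Usage (pipeline and documents assumed already constructed):
// for (const batch of toBatches(documents, 100)) {
//   await pipeline.run({ documents: batch });
// }
```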
Performance Tips
- Enable caching for repeated runs with same documents
- Use appropriate chunk sizes to balance quality and quantity
- Batch documents when processing multiple files
- Monitor token usage with embedding models
- Use document stores to avoid reprocessing unchanged documents
Error Handling
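Since `run` is async, failures (bad input, network errors from the embedding API, an unreachable vector store) surface as rejected promises. A minimal sketch:

```typescript
import { Document, IngestionPipeline, SentenceSplitter } from "llamaindex";

const pipeline = new IngestionPipeline({
  transformations: [new SentenceSplitter()],
});

try {
  const nodes = await pipeline.run({
    documents: [new Document({ text: "May fail on bad input or network." })],
  });
  console.log(`Ingested ${nodes.length} nodes`);
} catch (err) {
  // Log and decide: retry the batch, skip it, or abort the run.
  console.error("Ingestion failed:", err);
}
```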
Best Practices
- Choose appropriate transformations
  - Text splitting for long documents
  - Embeddings for semantic search
  - Custom extractors for metadata
- Configure caching wisely
  - Enable for development and repeated runs
  - Disable for production one-time ingestion
- Use document stores
  - Track which documents have been processed
  - Avoid reprocessing unchanged content
  - Enable incremental updates
- Monitor pipeline performance
  - Log processing times
  - Track node counts
  - Watch for errors in transformations
- Handle large datasets
  - Process in batches
  - Use streaming when possible
  - Monitor memory usage
Next Steps
- Node Parsers: configure text splitting strategies
- Embeddings: choose and configure embedding models
- Vector Stores: set up vector storage backends
- Storage: manage document and index stores