What is a Document?
A Document is a specialized TextNode that serves as the primary input for building indices and processing data. Each document contains:
- Text content: The actual text data
- Metadata: Key-value pairs for additional information
- Unique ID: Automatically generated or custom identifier
- Embeddings: Optional vector representations
Creating Documents
From Text
The simplest way to create a document is from a plain text string.
With Metadata
Add metadata to provide context.
From Files
Use readers to create documents from files.
Document Properties
Core Properties
- text: The document’s text content
- id_: Unique identifier (UUID by default)
- metadata: Object containing key-value metadata
- embedding: Optional vector embedding array
- hash: Auto-generated content hash
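The core properties can be pictured with a small stand-in class. This is a sketch only: `SketchDocument`, `DocumentInit`, and `contentHash` are hypothetical names, not the library's API. A counter stands in for the default UUID so the output is predictable, and an FNV-1a-style hash of the text stands in for the auto-generated content hash, whose real algorithm and inputs are not described here.

```typescript
// Stand-in model of the core properties; NOT the library's Document class.
interface DocumentInit {
  text: string;
  id_?: string;
  metadata?: Record<string, unknown>;
  embedding?: number[];
}

// Hypothetical content hash: 32-bit FNV-1a-style hash over the UTF-16 code
// units of the text, rendered as 8 hex digits.
function contentHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

let nextId = 0; // assumption: a counter replaces the default UUID

class SketchDocument {
  readonly id_: string;
  text: string;
  metadata: Record<string, unknown>;
  embedding?: number[];

  constructor(init: DocumentInit) {
    this.text = init.text;
    this.id_ = init.id_ ?? `doc-${nextId++}`;
    this.metadata = init.metadata ?? {};
    this.embedding = init.embedding;
  }

  // Recomputed on access, so it always reflects the current text.
  get hash(): string {
    return contentHash(this.text);
  }
}

const plain = new SketchDocument({ text: "Hello, world." });
const tagged = new SketchDocument({
  text: "Hello, world.",
  id_: "greeting-1",
  metadata: { source: "example.txt" },
});
console.log(plain.id_, tagged.id_, plain.hash === tagged.hash);
```

In this sketch the hash depends only on the text, so two documents with identical text and different metadata share a hash. Whether the library also folds metadata into its hash is not stated in this section.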
Metadata Control
Control which metadata is included in embeddings or LLM context with the excludedEmbedMetadataKeys and excludedLlmMetadataKeys properties.
Working with Documents
Getting Content
Retrieve content with different metadata modes.
Metadata Modes
- MetadataMode.ALL: Include all metadata
- MetadataMode.LLM: Include metadata for LLM context (respects excludedLlmMetadataKeys)
- MetadataMode.EMBED: Include metadata for embeddings (respects excludedEmbedMetadataKeys)
- MetadataMode.NONE: No metadata, text only
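The four modes amount to a filter over the metadata keys before the text is rendered. The sketch below models that filter in isolation. `SketchMode`, `ModeDoc`, and `renderContent` are hypothetical names, and the `key: value` lines followed by a blank line are an assumed output layout, not a statement of the library's exact template.

```typescript
// Stand-in model of metadata modes; NOT the library's implementation.
enum SketchMode { ALL, LLM, EMBED, NONE }

interface ModeDoc {
  text: string;
  metadata: Record<string, string | number>;
  excludedLlmMetadataKeys: string[];
  excludedEmbedMetadataKeys: string[];
}

function renderContent(doc: ModeDoc, mode: SketchMode): string {
  // NONE: text only, metadata is ignored entirely.
  if (mode === SketchMode.NONE) return doc.text;

  // ALL excludes nothing; LLM and EMBED each consult their own exclusion list.
  let excluded: string[] = [];
  if (mode === SketchMode.LLM) excluded = doc.excludedLlmMetadataKeys;
  if (mode === SketchMode.EMBED) excluded = doc.excludedEmbedMetadataKeys;

  const lines = Object.entries(doc.metadata)
    .filter(([key]) => !excluded.includes(key))
    .map(([key, value]) => `${key}: ${value}`);

  // Assumed layout: metadata lines, a blank line, then the text.
  return lines.length > 0 ? `${lines.join("\n")}\n\n${doc.text}` : doc.text;
}

const report: ModeDoc = {
  text: "Quarterly revenue grew 12%.",
  metadata: { title: "Q3 report", internalId: 4711, fileSize: 2048 },
  excludedLlmMetadataKeys: ["internalId"],
  excludedEmbedMetadataKeys: ["internalId", "fileSize"],
};

console.log(renderContent(report, SketchMode.LLM));
console.log(renderContent(report, SketchMode.EMBED));
```

Keeping two separate exclusion lists lets a field such as a file size stay visible to the LLM while being left out of the text that is embedded.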
Updating Content
Modify the text of an existing document.
Document Transformations
Converting to Nodes
Documents are typically split into smaller chunks (nodes) for processing.
Complete Example
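As a rough illustration of the document-to-nodes flow, the sketch below splits one document into overlapping fixed-size character chunks, each carrying a copy of the parent metadata and a pointer back to its source. It is a deliberately simplified stand-in: `SourceDoc`, `ChunkNode`, `splitIntoNodes`, and the `sourceId` field are hypothetical, and real node parsers typically split on sentences or tokens rather than raw characters.

```typescript
// Stand-in chunker; NOT one of the library's node parsers.
interface SourceDoc {
  id_: string;
  text: string;
  metadata: Record<string, string>;
}

interface ChunkNode {
  id_: string;
  text: string;
  metadata: Record<string, string>;
  sourceId: string; // hypothetical back-reference to the parent document
}

function splitIntoNodes(
  doc: SourceDoc,
  chunkSize: number,
  chunkOverlap: number,
): ChunkNode[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const nodes: ChunkNode[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < doc.text.length; start += step) {
    nodes.push({
      id_: `${doc.id_}-${nodes.length}`,
      text: doc.text.slice(start, start + chunkSize),
      metadata: { ...doc.metadata }, // each node gets its own copy
      sourceId: doc.id_,
    });
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= doc.text.length) break;
  }
  return nodes;
}

const guide: SourceDoc = {
  id_: "guide-1",
  text: "Documents hold text. Nodes hold chunks of that text.",
  metadata: { source: "guide.md" },
};

for (const node of splitIntoNodes(guide, 24, 4)) {
  console.log(node.id_, JSON.stringify(node.text));
}
```

Copying the metadata onto every node means a retrieved chunk can still be traced to its file or section without looking up the parent document.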
Document Relationships
Documents can maintain relationships with other nodes through the relationships property.
Best Practices
- Use meaningful IDs: Provide custom IDs when you need to track or update specific documents
- Add rich metadata: Include relevant context that helps with retrieval and filtering
- Exclude sensitive metadata: Use excludedLlmMetadataKeys to keep internal IDs or system fields out of LLM context
- Keep documents focused: One topic or section per document for better retrieval
- Include source information: Track where documents came from for debugging and citations
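To make the first and last practices concrete, the sketch below shows why a stable custom ID matters: re-ingesting a document under the same ID replaces the earlier version instead of creating a duplicate. `TrackedDoc` and `SketchDocStore` are hypothetical names for an in-memory stand-in, not the library's document store.

```typescript
// Stand-in in-memory store keyed by document ID; NOT the library's storage API.
interface TrackedDoc {
  id_: string;
  text: string;
  metadata: Record<string, string>;
}

type UpsertResult = "added" | "updated" | "unchanged";

class SketchDocStore {
  private docs = new Map<string, TrackedDoc>();

  // Insert or replace by ID and report what happened.
  upsert(doc: TrackedDoc): UpsertResult {
    const existing = this.docs.get(doc.id_);
    this.docs.set(doc.id_, doc);
    if (existing === undefined) return "added";
    return existing.text === doc.text ? "unchanged" : "updated";
  }

  get(id: string): TrackedDoc | undefined {
    return this.docs.get(id);
  }

  get size(): number {
    return this.docs.size;
  }
}

const store = new SketchDocStore();

// A meaningful ID plus source metadata for later debugging and citations.
const v1: TrackedDoc = {
  id_: "handbook/onboarding",
  text: "Laptops are issued on day one.",
  metadata: { source: "handbook/onboarding.md" },
};
const v2: TrackedDoc = { ...v1, text: "Laptops are issued before day one." };

console.log(store.upsert(v1)); // first ingestion of this ID
console.log(store.upsert(v2)); // same ID, new text
console.log(store.size);
```

With auto-generated IDs, the second ingestion would have no way to find the first, and both versions would remain in the store.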
Related Types
- TextNode: Base class for text-based nodes (Document extends this)
- ImageDocument: For documents with image content
- BaseNode: Abstract base class for all node types
Next Steps
- Node Parsers: Learn how to split documents into chunks
- Readers: Load documents from various file formats
- Ingestion: Build complete data processing pipelines
- Storage: Persist and manage document stores