Documents are the fundamental data containers in LlamaIndex.TS. They represent individual pieces of text content along with associated metadata.

What is a Document?

A Document is a specialized TextNode that serves as the primary input for building indices and processing data. Each document contains:
  • Text content: The actual text data
  • Metadata: Key-value pairs for additional information
  • Unique ID: Automatically generated or custom identifier
  • Embeddings: Optional vector representations

Creating Documents

From Text

The simplest way to create a document:
import { Document } from "llamaindex";

const document = new Document({
  text: "This is my document text",
  id_: "doc_1"
});

With Metadata

Add metadata to provide context:
const document = new Document({
  text: "LlamaIndex is a data framework for LLM applications.",
  metadata: {
    source: "documentation",
    author: "LlamaIndex Team",
    date: "2024-01-01",
    category: "introduction"
  }
});

From Files

Use readers to create documents from files:
import { SimpleDirectoryReader } from "@llamaindex/readers/directory";

const reader = new SimpleDirectoryReader();
const documents = await reader.loadData("./data");

Document Properties

Core Properties

  • text: The document’s text content
  • id_: Unique identifier (UUID by default)
  • metadata: Object containing key-value metadata
  • embedding: Optional vector embedding array
  • hash: Auto-generated content hash
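Taken together, these properties can be sketched as a TypeScript shape. This is illustrative only — the real Document class defines more than this:

```typescript
// Illustrative shape mirroring the property list above
// (not the library's actual class definition).
interface DocumentShape {
  text: string;                      // the document's text content
  id_: string;                       // UUID by default, or a custom identifier
  metadata: Record<string, unknown>; // key-value metadata
  embedding?: number[];              // optional vector representation
  hash?: string;                     // auto-generated content hash
}

const example: DocumentShape = {
  text: "This is my document text",
  id_: "doc_1",
  metadata: { source: "documentation" },
};
```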

Metadata Control

Control which metadata is included in embeddings or LLM context:
const document = new Document({
  text: "My content",
  metadata: {
    title: "Important Doc",
    internal_id: "12345",
    description: "A sample document"
  },
  excludedLlmMetadataKeys: ["internal_id"],
  excludedEmbedMetadataKeys: ["internal_id"]
});

Working with Documents

Getting Content

Retrieve content with different metadata modes:
import { MetadataMode } from "llamaindex";

// Get all content including all metadata
const fullContent = document.getContent(MetadataMode.ALL);

// Get content with LLM-specific metadata
const llmContent = document.getContent(MetadataMode.LLM);

// Get content with embedding metadata
const embedContent = document.getContent(MetadataMode.EMBED);

// Get just the text without metadata
const textOnly = document.getContent(MetadataMode.NONE);

Metadata Modes

  • MetadataMode.ALL: Include all metadata
  • MetadataMode.LLM: Include metadata for LLM context (respects excludedLlmMetadataKeys)
  • MetadataMode.EMBED: Include metadata for embeddings (respects excludedEmbedMetadataKeys)
  • MetadataMode.NONE: No metadata, text only
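As a rough illustration of how the modes interact with the exclusion lists, here is a simplified sketch. It is not the library's actual getContent implementation, and the exact metadata header formatting differs in practice:

```typescript
// Simplified model of mode-dependent content assembly.
type Mode = "ALL" | "LLM" | "EMBED" | "NONE";

interface DocLike {
  text: string;
  metadata: Record<string, string>;
  excludedLlmMetadataKeys: string[];
  excludedEmbedMetadataKeys: string[];
}

function getContentSketch(doc: DocLike, mode: Mode): string {
  if (mode === "NONE") return doc.text; // text only, no metadata

  // Pick the exclusion list that applies to this mode; ALL excludes nothing.
  const excluded =
    mode === "LLM" ? doc.excludedLlmMetadataKeys :
    mode === "EMBED" ? doc.excludedEmbedMetadataKeys : [];

  const header = Object.entries(doc.metadata)
    .filter(([key]) => !excluded.includes(key))
    .map(([key, value]) => `${key}: ${value}`)
    .join("\n");

  return header ? `${header}\n\n${doc.text}` : doc.text;
}
```

The key point is that the same metadata object yields different rendered content depending on the mode and the per-mode exclusion lists.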

Updating Content

Modify document text:
document.setContent("Updated text content");

Document Transformations

Converting to Nodes

Documents are typically split into smaller chunks (nodes) for processing:
import { SentenceSplitter } from "llamaindex";

const splitter = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 20
});

const nodes = await splitter.transform([document]);
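To build intuition for what chunkSize and chunkOverlap mean, here is a toy character-based chunker. The real SentenceSplitter is sentence- and token-aware; only the overlap mechanics are illustrated:

```typescript
// Toy chunker: fixed-size windows that each start (chunkSize - chunkOverlap)
// characters after the previous one, so consecutive chunks share
// chunkOverlap characters of context.
function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const step = chunkSize - chunkOverlap;
  if (step <= 0) throw new Error("chunkOverlap must be smaller than chunkSize");

  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final chunk reached the end
  }
  return chunks;
}
```

The overlap keeps a little shared context at each boundary, so a sentence split across two chunks is less likely to lose its meaning during retrieval.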

Complete Example

import { 
  Document, 
  VectorStoreIndex,
  SentenceSplitter,
  MetadataMode 
} from "llamaindex";

// Create a document
const document = new Document({
  text: `
    LlamaIndex is a data framework for LLM applications.
    It provides tools for ingesting, structuring, and accessing data.
    You can build powerful RAG applications with LlamaIndex.
  `,
  metadata: {
    title: "LlamaIndex Introduction",
    category: "documentation",
    version: "1.0"
  }
});

console.log("Document ID:", document.id_);
console.log("Full content:", document.getContent(MetadataMode.ALL));
console.log("Text only:", document.getContent(MetadataMode.NONE));

// Process the document
const index = await VectorStoreIndex.fromDocuments([document], {
  nodeParser: new SentenceSplitter({ chunkSize: 512 })
});

// Query the index
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({
  query: "What is LlamaIndex?"
});

console.log(response.toString());

Document Relationships

Documents can maintain relationships with other nodes through the relationships property:
// Documents track their derived nodes
const nodes = await splitter.transform([document]);

// Each node maintains a SOURCE relationship to the original document
for (const node of nodes) {
  console.log("Source document ID:", node.sourceNode?.nodeId);
}
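Conceptually, the SOURCE relationship can be modeled like this. This is a hypothetical sketch — the library stores relationships in a richer structure on each node:

```typescript
// Minimal model of the SOURCE relationship: every node derived from a
// document carries a back-reference to that document's id.
interface SourceRef { nodeId: string; }
interface ChunkNode { text: string; sourceNode?: SourceRef; }

function splitWithSource(docId: string, text: string, size: number): ChunkNode[] {
  const chunks: ChunkNode[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push({
      text: text.slice(i, i + size),
      sourceNode: { nodeId: docId }, // every chunk points back to its document
    });
  }
  return chunks;
}
```

This back-reference is what makes citations possible: after retrieval, each chunk can be traced to the document it came from.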

Best Practices

  1. Use meaningful IDs: Provide custom IDs when you need to track or update specific documents
  2. Add rich metadata: Include relevant context that helps with retrieval and filtering
  3. Exclude sensitive metadata: Use excludedLlmMetadataKeys to keep internal IDs or system fields out of LLM context
  4. Keep documents focused: One topic or section per document for better retrieval
  5. Include source information: Track where documents came from for debugging and citations

Related Classes

  • TextNode: Base class for text-based nodes (Document extends this)
  • ImageDocument: For documents with image content
  • BaseNode: Abstract base class for all node types

Next Steps

Node Parsers

Learn how to split documents into chunks

Readers

Load documents from various file formats

Ingestion

Build complete data processing pipelines

Storage

Persist and manage document stores