Overview

Node parsers split documents into smaller chunks (nodes) for processing. They handle text segmentation, maintain relationships between chunks, and preserve metadata.

NodeParser

Abstract base class for all node parsers.
import { NodeParser } from "@llamaindex/core/node-parser";

Properties

includeMetadata (boolean, default: true)
Whether to include document metadata in parsed nodes.
includePrevNextRel (boolean, default: true)
Whether to include previous/next relationships between consecutive chunks.

Methods

getNodesFromDocuments
Parse documents into nodes.
getNodesFromDocuments(documents: TextNode[]): TextNode[] | Promise<TextNode[]>

TextSplitter

Abstract base class for text splitting strategies.
import { TextSplitter } from "@llamaindex/core/node-parser";

Methods

splitText
Split a single text into chunks.
abstract splitText(text: string): string[]
splitTexts
Split multiple texts into chunks.
splitTexts(texts: string[]): string[]
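To make the contract concrete, here is a standalone sketch that mirrors the TextSplitter shape. The `SimpleTextSplitter` and `ParagraphSplitter` classes below are hand-rolled for illustration, not imported from the library: a subclass implements splitText, and splitTexts applies it across many inputs.

```typescript
// Minimal stand-in for the TextSplitter contract, defined locally for illustration.
abstract class SimpleTextSplitter {
  abstract splitText(text: string): string[];

  // Default splitTexts: flatten the per-text chunk lists.
  splitTexts(texts: string[]): string[] {
    return texts.flatMap((t) => this.splitText(t));
  }
}

// Example subclass: split on blank lines.
class ParagraphSplitter extends SimpleTextSplitter {
  splitText(text: string): string[] {
    return text
      .split(/\n\n+/)
      .map((s) => s.trim())
      .filter((s) => s.length > 0);
  }
}

const splitter = new ParagraphSplitter();
console.log(splitter.splitTexts(["One.\n\nTwo.", "Three."]));
// [ "One.", "Two.", "Three." ]
```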

SentenceSplitter

Splits text by sentences with configurable chunk size and overlap.
import { SentenceSplitter } from "@llamaindex/core/node-parser";

Constructor Options

chunkSize (number, default: 1024)
Maximum number of characters per chunk.
chunkOverlap (number, default: 200)
Number of characters to overlap between consecutive chunks.
separator (string, default: " ")
Separator to use when splitting.
paragraphSeparator (string, default: "\n\n\n")
Separator for paragraph boundaries.
secondarySeparator (string, default: "\n\n")
Secondary separator (e.g., line breaks).

Example

import { SentenceSplitter } from "@llamaindex/core/node-parser";
import { Document } from "@llamaindex/core/schema";

const parser = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 50
});

const document = new Document({
  text: "Long document text..."
});

const nodes = parser.getNodesFromDocuments([document]);
console.log(nodes.length); // Number of chunks created

MarkdownNodeParser

Splits markdown documents while preserving structure.
import { MarkdownNodeParser } from "@llamaindex/core/node-parser";

Constructor Options

chunkSize (number, default: 1024)
Maximum characters per chunk.
chunkOverlap (number, default: 200)
Overlap between chunks.

Example

const parser = new MarkdownNodeParser({
  chunkSize: 1024,
  chunkOverlap: 100
});

const document = new Document({
  text: "# Heading\n\nParagraph text...",
  metadata: { format: "markdown" }
});

const nodes = parser.getNodesFromDocuments([document]);
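Structure-preserving splitting keeps each heading together with the text under it. The function below is an illustrative sketch of that idea, not the library's implementation: it breaks markdown into sections at ATX heading boundaries.

```typescript
// Illustrative sketch (not the library's implementation): split markdown
// into sections at heading boundaries, keeping each heading with its body.
function splitByHeadings(markdown: string): string[] {
  const lines = markdown.split("\n");
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of lines) {
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      sections.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join("\n").trim());
  return sections.filter((s) => s.length > 0);
}

console.log(splitByHeadings("# A\n\ntext one\n\n## B\n\ntext two"));
// [ "# A\n\ntext one", "## B\n\ntext two" ]
```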

MetadataAwareTextSplitter

Abstract base for splitters that consider metadata when chunking.
abstract class MetadataAwareTextSplitter extends TextSplitter {
  abstract splitTextMetadataAware(
    text: string,
    metadata: string
  ): string[];
}
Useful when metadata should be included in chunk size calculations.
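The core idea is that metadata text counts against the chunk budget. The standalone function below sketches one way splitTextMetadataAware could behave (it is not the library's implementation, and the whitespace-based splitting is a simplification): the metadata's length is subtracted from the chunk size so that "metadata + chunk" stays within the target.

```typescript
// Illustrative only: shrink the chunk budget by the metadata's length so that
// "metadata + chunk" stays within the target size, then split on whitespace.
function splitTextMetadataAware(
  text: string,
  metadata: string,
  chunkSize: number,
): string[] {
  const effectiveSize = chunkSize - metadata.length;
  if (effectiveSize <= 0) {
    throw new Error("Metadata is longer than the chunk size");
  }
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current === "" ? word : current + " " + word;
    if (candidate.length > effectiveSize && current !== "") {
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```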

Node Relationships

Parsed nodes automatically include relationships:
const nodes = parser.getNodesFromDocuments([document]);

// First node
console.log(nodes[0].relationships);
// {
//   [NodeRelationship.SOURCE]: { nodeId: "doc-id", ... },
//   [NodeRelationship.NEXT]: { nodeId: "node-1-id", ... }
// }

// Middle node
console.log(nodes[1].relationships);
// {
//   [NodeRelationship.SOURCE]: { nodeId: "doc-id", ... },
//   [NodeRelationship.PREVIOUS]: { nodeId: "node-0-id", ... },
//   [NodeRelationship.NEXT]: { nodeId: "node-2-id", ... }
// }
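The wiring above can be sketched in isolation. The `SimpleNode` and `linkNodes` names below are invented for illustration and simplify the library's actual node and relationship types: every node points back to its source document, and consecutive nodes link to each other.

```typescript
// Hypothetical, simplified shape of a node's relationship map, wired up the
// way consecutive chunks from a single document would be.
interface Rel {
  nodeId: string;
}
interface SimpleNode {
  id: string;
  relationships: { source?: Rel; previous?: Rel; next?: Rel };
}

function linkNodes(sourceId: string, ids: string[]): SimpleNode[] {
  const nodes: SimpleNode[] = ids.map((id) => ({
    id,
    relationships: { source: { nodeId: sourceId } },
  }));
  for (let i = 0; i < nodes.length; i++) {
    if (i > 0) nodes[i].relationships.previous = { nodeId: nodes[i - 1].id };
    if (i < nodes.length - 1) nodes[i].relationships.next = { nodeId: nodes[i + 1].id };
  }
  return nodes;
}
```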

Metadata Inheritance

Nodes inherit metadata from parent documents:
const document = new Document({
  text: "Document text...",
  metadata: {
    title: "My Document",
    author: "John Doe"
  }
});

const nodes = parser.getNodesFromDocuments([document]);

// All nodes inherit parent metadata
console.log(nodes[0].metadata);
// { title: "My Document", author: "John Doe" }

Character Positions

Parsers track character positions in the original document:
const nodes = parser.getNodesFromDocuments([document]);

console.log(nodes[0].startCharIdx); // 0
console.log(nodes[0].endCharIdx);   // 512
console.log(nodes[1].startCharIdx); // 462 (with overlap)
console.log(nodes[1].endCharIdx);   // 1024
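The arithmetic behind those indices can be shown with a standalone helper. This is illustrative only: real splitters snap boundaries to separators, so actual indices will differ, but the overlap relationship (each chunk starts `overlap` characters before the previous one ended) is the same.

```typescript
// Illustrative arithmetic: character ranges for chunks of a given size and
// overlap. Real splitters snap boundaries to separators; this shows only
// how overlap shifts each start index.
function chunkRanges(
  textLength: number,
  chunkSize: number,
  overlap: number,
): Array<{ start: number; end: number }> {
  const ranges: Array<{ start: number; end: number }> = [];
  let start = 0;
  while (start < textLength) {
    const end = Math.min(start + chunkSize, textLength);
    ranges.push({ start, end });
    if (end === textLength) break;
    start = end - overlap; // next chunk re-reads `overlap` characters
  }
  return ranges;
}

console.log(chunkRanges(1200, 512, 50));
// [ { start: 0, end: 512 }, { start: 462, end: 974 }, { start: 924, end: 1200 } ]
```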

Custom Node Parser

Create custom parsers by extending NodeParser:
import { NodeParser } from "@llamaindex/core/node-parser";
import { TextNode } from "@llamaindex/core/schema";

class CustomParser extends NodeParser {
  protected parseNodes(documents: TextNode[]): TextNode[] {
    return documents.flatMap(doc => {
      // Custom splitting logic
      const chunks = this.customSplit(doc.text);
      
      return chunks.map(chunk => new TextNode({
        text: chunk,
        metadata: { ...doc.metadata }
      }));
    });
  }
  
  private customSplit(text: string): string[] {
    // Your custom splitting logic
    return text.split(/\n---\n/);
  }
}
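To exercise the delimiter rule from the class above in isolation, the same regex split can be run standalone:

```typescript
// The custom splitting rule above, isolated: split on "---" horizontal-rule
// delimiters that sit on their own line.
function customSplit(text: string): string[] {
  return text.split(/\n---\n/);
}

const parts = customSplit("section one\n---\nsection two\n---\nsection three");
console.log(parts.length); // 3
```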

Best Practices

  1. Choose appropriate chunk size: Smaller chunks (256-512) for precise retrieval, larger chunks (1024-2048) for more context
  2. Use overlap: 10-20% overlap helps maintain context across chunk boundaries
  3. Preserve structure: Use MarkdownNodeParser for markdown to maintain headings and formatting
  4. Consider token limits: Account for model context windows when setting chunk sizes
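The overlap guideline above (10-20% of chunk size) can be captured in a small helper. `suggestedOverlap` is a hypothetical name introduced here, not a library function:

```typescript
// Hypothetical helper applying the rule of thumb above: pick a chunk overlap
// of roughly 15% of the chunk size.
function suggestedOverlap(chunkSize: number, ratio = 0.15): number {
  return Math.round(chunkSize * ratio);
}

console.log(suggestedOverlap(512)); // 77
console.log(suggestedOverlap(1024)); // 154
```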