## Overview

Groq provides ultra-fast inference for open-source LLMs such as Llama, Mixtral, and Gemma, at speeds of 500+ tokens/second.
## Installation

```bash
npm install @llamaindex/groq
```
## Basic Usage

```typescript
import { Groq } from "@llamaindex/groq";

const llm = new Groq({
  model: "llama-3.1-70b-versatile",
  apiKey: process.env.GROQ_API_KEY,
});

const response = await llm.chat({
  messages: [
    { role: "user", content: "Explain quantum computing" },
  ],
});

console.log(response.message.content);
```
## Constructor Options

- `apiKey`: Groq API key (defaults to the `GROQ_API_KEY` environment variable)
- `maxTokens`: Maximum number of tokens in the response
- `topP`: Nucleus sampling parameter
## Supported Models

### Llama 3.1

- `llama-3.1-405b-reasoning`: Most capable
- `llama-3.1-70b-versatile`: Balanced performance
- `llama-3.1-8b-instant`: Fastest

### Llama 3

- `llama3-70b-8192`: 70B parameter model
- `llama3-8b-8192`: 8B parameter model

### Mixtral

- `mixtral-8x7b-32768`: Mixtral MoE model

### Gemma

- `gemma-7b-it`: Google Gemma 7B
- `gemma2-9b-it`: Gemma 2 9B
## Streaming

```typescript
const stream = await llm.chat({
  messages: [{ role: "user", content: "Write a story" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta);
}
```
## Function Calling

```typescript
import { tool } from "@llamaindex/core/tools";
import { z } from "zod";

const weatherTool = tool({
  name: "get_weather",
  description: "Get weather for a location",
  parameters: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => {
    return `Weather in ${location}: 72°F`;
  },
});

const response = await llm.chat({
  messages: [{ role: "user", content: "Weather in NYC?" }],
  tools: [weatherTool],
});
```
## Structured Output

```typescript
import { z } from "zod";

const schema = z.object({
  summary: z.string(),
  sentiment: z.enum(["positive", "negative", "neutral"]),
  topics: z.array(z.string()),
});

const result = await llm.exec({
  messages: [{ role: "user", content: "Analyze: Great product, fast shipping!" }],
  responseFormat: schema,
});
```
## Configuration

### Environment Variables
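The constructor reads the API key from the environment when one is not passed explicitly; a typical setup (shown for a POSIX shell) is:

```shell
# Make the Groq API key available to the process
export GROQ_API_KEY="your-api-key"
```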
### Global Settings

```typescript
import { Settings } from "llamaindex";
import { Groq } from "@llamaindex/groq";

Settings.llm = new Groq({
  model: "llama-3.1-70b-versatile",
});
```
## Performance

Groq's LPU (Language Processing Unit) delivers exceptional speed:

```typescript
const startTime = Date.now();

const response = await llm.chat({
  messages: [{ role: "user", content: "Explain AI" }],
});

const duration = Date.now() - startTime;
console.log(`Response time: ${duration}ms`);
console.log(`Tokens/sec: ${response.raw.usage.completion_tokens / (duration / 1000)}`);
```

Typical speeds: 300-500 tokens/second.
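The throughput arithmetic in the snippet above can be factored into a small helper (a hypothetical utility, not part of the package):

```typescript
// Compute tokens per second from a completion token count
// and a wall-clock duration in milliseconds.
function tokensPerSecond(completionTokens: number, durationMs: number): number {
  if (durationMs <= 0) {
    throw new Error("durationMs must be positive");
  }
  return completionTokens / (durationMs / 1000);
}

// e.g. 450 tokens generated in 1.5 seconds → 300 tokens/second
console.log(tokensPerSecond(450, 1500));
```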
## With LlamaIndex

```typescript
import { Settings, VectorStoreIndex } from "llamaindex";
import { Groq } from "@llamaindex/groq";

Settings.llm = new Groq({ model: "llama-3.1-70b-versatile" });

// `documents` is an array of Document objects loaded elsewhere
const index = await VectorStoreIndex.fromDocuments(documents);
const queryEngine = index.asQueryEngine();

const response = await queryEngine.query({
  query: "What is the main topic?",
});
```
## Model Selection Guide

| Use Case | Recommended Model | Why |
|---|---|---|
| Complex reasoning | `llama-3.1-405b-reasoning` | Best quality |
| General purpose | `llama-3.1-70b-versatile` | Balanced |
| Speed critical | `llama-3.1-8b-instant` | Fastest |
| Long context | `mixtral-8x7b-32768` | 32K context |
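The table can be encoded as a tiny lookup helper; this is a sketch, and `pickModel` with its use-case keys is a hypothetical name, not part of the package:

```typescript
type UseCase = "reasoning" | "general" | "speed" | "long-context";

// Map each use case from the table above to the recommended model id.
const MODEL_BY_USE_CASE: Record<UseCase, string> = {
  reasoning: "llama-3.1-405b-reasoning",
  general: "llama-3.1-70b-versatile",
  speed: "llama-3.1-8b-instant",
  "long-context": "mixtral-8x7b-32768",
};

function pickModel(useCase: UseCase): string {
  return MODEL_BY_USE_CASE[useCase];
}
```

For example, `pickModel("speed")` returns `"llama-3.1-8b-instant"`, which can be passed straight to the `model` constructor option.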
## Rate Limits

Groq has generous free-tier limits:

- Free: 30 requests/minute
- Paid: Higher limits based on plan
Handle rate limits by catching HTTP 429 errors:

```typescript
try {
  const response = await llm.chat({ messages });
} catch (error: any) {
  if (error.status === 429) {
    console.log("Rate limit hit, waiting...");
    await new Promise((resolve) => setTimeout(resolve, 2000));
    // Retry the request here
  }
}
```
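For production use, a generic retry wrapper with exponential backoff is more robust than a one-off wait. A minimal sketch (the `withRetry` helper and its options are illustrative, not part of the package):

```typescript
// Retry an async operation on HTTP 429, doubling the delay between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error?.status !== 429 || attempt >= maxRetries) {
        throw error; // not a rate limit, or retries exhausted
      }
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `const response = await withRetry(() => llm.chat({ messages }));`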
## Best Practices

- **Use for real-time applications**: Groq's speed is excellent for latency-sensitive production workloads
- **Choose the right model**: Balance speed against capability
- **Monitor usage**: Track API calls and costs
- **Stream responses**: Streaming makes Groq's speed even more noticeable in the UX
- **Handle rate limits**: Implement retry logic
## See Also