## Overview

Groq provides ultra-fast inference for open-source LLMs such as Llama, Mixtral, and Gemma, at speeds of 500+ tokens/second.
## Installation

```bash
npm install @llamaindex/groq
```
## Basic Usage

```typescript
import { Groq } from "@llamaindex/groq";

const llm = new Groq({
  model: "llama-3.1-70b-versatile",
  apiKey: process.env.GROQ_API_KEY,
});

const response = await llm.chat({
  messages: [
    { role: "user", content: "Explain quantum computing" },
  ],
});

console.log(response.message.content);
```
## Constructor Options

- `apiKey`: Groq API key (defaults to the `GROQ_API_KEY` environment variable)
- `maxTokens`: Maximum number of tokens in the response
- `topP`: Nucleus sampling parameter
## Supported Models

### Llama 3.1

- `llama-3.1-405b-reasoning`: Most capable
- `llama-3.1-70b-versatile`: Balanced performance
- `llama-3.1-8b-instant`: Fastest

### Llama 3

- `llama3-70b-8192`: 70B parameter model
- `llama3-8b-8192`: 8B parameter model

### Mixtral

- `mixtral-8x7b-32768`: Mixtral MoE model

### Gemma

- `gemma-7b-it`: Google Gemma 7B
- `gemma2-9b-it`: Gemma 2 9B
## Streaming

```typescript
const stream = await llm.chat({
  messages: [{ role: "user", content: "Write a story" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta);
}
```
## Function Calling

```typescript
import { tool } from "@llamaindex/core/tools";
import { z } from "zod";

const weatherTool = tool({
  name: "get_weather",
  description: "Get weather for a location",
  parameters: z.object({
    location: z.string(),
  }),
  execute: async ({ location }) => {
    return `Weather in ${location}: 72°F`;
  },
});

const response = await llm.chat({
  messages: [{ role: "user", content: "Weather in NYC?" }],
  tools: [weatherTool],
});
```
## Structured Output

```typescript
import { z } from "zod";

const schema = z.object({
  summary: z.string(),
  sentiment: z.enum(["positive", "negative", "neutral"]),
  topics: z.array(z.string()),
});

const result = await llm.exec({
  messages: [{ role: "user", content: "Analyze: Great product, fast shipping!" }],
  responseFormat: schema,
});
```
## Configuration

### Environment Variables
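The constructor reads the API key from the environment when one is not passed explicitly; a typical setup (shown for a POSIX shell) is:

```shell
# Make the Groq API key available to the process
export GROQ_API_KEY="your-api-key"
```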
### Global Settings

```typescript
import { Settings } from "llamaindex";
import { Groq } from "@llamaindex/groq";

Settings.llm = new Groq({
  model: "llama-3.1-70b-versatile",
});
```
## Performance

Groq's LPU (Language Processing Unit) delivers exceptional speed:

```typescript
const startTime = Date.now();

const response = await llm.chat({
  messages: [{ role: "user", content: "Explain AI" }],
});

const duration = Date.now() - startTime;
console.log(`Response time: ${duration}ms`);
console.log(`Tokens/sec: ${response.raw.usage.completion_tokens / (duration / 1000)}`);
```

Typical speeds: 300-500 tokens/second.
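The throughput arithmetic in the snippet above can be factored into a small helper (a hypothetical utility, not part of the package):

```typescript
// Compute tokens per second from a completion token count
// and a wall-clock duration in milliseconds.
function tokensPerSecond(completionTokens: number, durationMs: number): number {
  if (durationMs <= 0) {
    throw new Error("durationMs must be positive");
  }
  return completionTokens / (durationMs / 1000);
}

// e.g. 450 tokens generated in 1.5 seconds → 300 tokens/second
console.log(tokensPerSecond(450, 1500));
```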
## With LlamaIndex

```typescript
import { Settings, VectorStoreIndex } from "llamaindex";
import { Groq } from "@llamaindex/groq";

Settings.llm = new Groq({ model: "llama-3.1-70b-versatile" });

// `documents` is an array of Document objects loaded elsewhere
const index = await VectorStoreIndex.fromDocuments(documents);
const queryEngine = index.asQueryEngine();

const response = await queryEngine.query({
  query: "What is the main topic?",
});
```
## Model Selection Guide

| Use Case | Recommended Model | Why |
|---|---|---|
| Complex reasoning | `llama-3.1-405b-reasoning` | Best quality |
| General purpose | `llama-3.1-70b-versatile` | Balanced |
| Speed critical | `llama-3.1-8b-instant` | Fastest |
| Long context | `mixtral-8x7b-32768` | 32K context |
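The table can be encoded as a tiny lookup helper; this is a sketch, and `pickModel` with its use-case keys is a hypothetical name, not part of the package:

```typescript
type UseCase = "reasoning" | "general" | "speed" | "long-context";

// Map each use case from the table above to the recommended model id.
const MODEL_BY_USE_CASE: Record<UseCase, string> = {
  reasoning: "llama-3.1-405b-reasoning",
  general: "llama-3.1-70b-versatile",
  speed: "llama-3.1-8b-instant",
  "long-context": "mixtral-8x7b-32768",
};

function pickModel(useCase: UseCase): string {
  return MODEL_BY_USE_CASE[useCase];
}
```

For example, `pickModel("speed")` returns `"llama-3.1-8b-instant"`, which can be passed straight to the `model` constructor option.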
## Rate Limits

Groq has generous free-tier limits:

- Free: 30 requests/minute
- Paid: Higher limits based on plan
Handle rate limits by catching HTTP 429 errors:

```typescript
try {
  const response = await llm.chat({ messages });
} catch (error: any) {
  if (error.status === 429) {
    console.log("Rate limit hit, waiting...");
    await new Promise((resolve) => setTimeout(resolve, 2000));
    // Retry the request here
  }
}
```
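For production use, a generic retry wrapper with exponential backoff is more robust than a one-off wait. A minimal sketch (the `withRetry` helper and its options are illustrative, not part of the package):

```typescript
// Retry an async operation on HTTP 429, doubling the delay between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error?.status !== 429 || attempt >= maxRetries) {
        throw error; // not a rate limit, or retries exhausted
      }
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `const response = await withRetry(() => llm.chat({ messages }));`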
## Best Practices

- **Use for real-time applications**: Groq's speed is excellent for latency-sensitive production workloads
- **Choose the right model**: Balance speed against capability
- **Monitor usage**: Track API calls and costs
- **Stream responses**: Streaming makes Groq's speed even more noticeable in the UX
- **Handle rate limits**: Implement retry logic
## See Also