LLM Streaming with FastAPI

A minimal FastAPI endpoint that streams tokens from a LangChain chat model (Groq's llama-3.1-8b-instant here) to the browser over Server-Sent Events (SSE):
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fastapi.staticfiles import StaticFiles
from langchain.callbacks.streaming_aiter import AsyncIteratorCallbackHandler
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.chat_models import init_chat_model
import asyncio
import os

GROQ_API_KEY = "key"  # placeholder: load the real key from your environment, not from source
os.environ["GROQ_API_KEY"] = GROQ_API_KEY

app = FastAPI()
# Serve static files (e.g. a small HTML page with an EventSource client) from the current directory.
app.mount("/static", StaticFiles(directory=".", html=True), name="static")

@app.get("/stream")
async def stream_response(query: str):
    # The async iterator callback collects tokens as the model streams them.
    callback = AsyncIteratorCallbackHandler()
    manager = AsyncCallbackManager([callback])

    llm = init_chat_model(
        model="groq:llama-3.1-8b-instant",
        streaming=True,             # emit tokens incrementally instead of one final string
        callback_manager=manager,   # route each new token into `callback`
        temperature=0,
    )

    async def sse_generator():
        # Kick off generation in the background; tokens arrive through the callback.
        task = asyncio.create_task(llm.apredict(query))
        async for token in callback.aiter():
            yield f"data: {token}\n\n"    # one SSE event per token
        yield "data: [DONE]\n\n"          # sentinel so the client knows to stop reading
        await task                        # surface any error raised during generation

    return StreamingResponse(sse_generator(), media_type="text/event-stream")
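To try it, run the app with uvicorn (for example: uvicorn main:app --reload, where "main" is whatever module holds this code) and open the stream from any SSE-capable client. Below is a minimal sketch of a Python consumer, assuming the httpx package is installed and the server listens on the default local address http://localhost:8000:

import asyncio

import httpx

async def consume_stream(query: str) -> None:
    # Connect to the /stream endpoint and print tokens as they arrive.
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "GET", "http://localhost:8000/stream", params={"query": query}
        ) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data: "):
                    continue                    # skip blank separator lines
                payload = line[len("data: "):]
                if payload == "[DONE]":         # sentinel emitted by the server above
                    break
                print(payload, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(consume_stream("Explain SSE in one sentence."))

A browser page served from the /static mount could do the same with new EventSource("/stream?query=..."), which is the client the SSE format below is designed for.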
SSE Format: What the Browser Expects
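An EventSource connection parses a text/event-stream body as a sequence of UTF-8 events: optional event: and id: fields, one or more data: lines, and a blank line that terminates the event. A payload containing newlines must be split across several data: lines, otherwise the extra newline ends the event early. The hypothetical helper below (not used by the endpoint above) makes that framing explicit:

def sse_event(data: str, event: str | None = None, event_id: str | None = None) -> str:
    """Format one Server-Sent Events frame the way EventSource expects it."""
    lines = []
    if event:
        lines.append(f"event: {event}")    # optional event name; "message" if omitted
    if event_id:
        lines.append(f"id: {event_id}")    # optional id, sent back via Last-Event-ID on reconnect
    for chunk in data.splitlines() or [""]:
        lines.append(f"data: {chunk}")     # one data: line per line of payload
    return "\n".join(lines) + "\n\n"       # the blank line marks the end of the event

# sse_event("hello\nworld") produces:
# data: hello
# data: world
# (blank line)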
Streaming Gotchas in GenAI Apps
1. Text-only protocols (SSE)
2. Partial message boundaries (see the sketch after this list)
3. Buffering delays
4. Browser limits and behavior
5. Error handling during generation
6. Token join glitches
7. LLM context truncation or restart
8. Streaming across agent steps / tool calls
9. Async / concurrency pitfalls
10. Frontend rendering latency
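Several of these show up directly in the endpoint above. The sketch below addresses items 2, 3, 5 and 6: each token is JSON-encoded so an embedded newline cannot terminate an SSE event early or glue tokens together on the client, failures are reported as an explicit error event instead of a silently dropped connection, and the response sets Cache-Control: no-cache plus the nginx-specific X-Accel-Buffering: no header to discourage intermediary buffering. The route name /stream-hardened is illustrative; treat this as a sketch under those assumptions, not a drop-in replacement.

import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.callbacks.streaming_aiter import AsyncIteratorCallbackHandler
from langchain.chat_models import init_chat_model

app = FastAPI()  # or reuse the app defined earlier

@app.get("/stream-hardened")  # illustrative route name
async def stream_response_hardened(query: str):
    callback = AsyncIteratorCallbackHandler()
    llm = init_chat_model(
        model="groq:llama-3.1-8b-instant",
        streaming=True,
        callback_manager=AsyncCallbackManager([callback]),
        temperature=0,
    )

    async def sse_generator():
        task = asyncio.create_task(llm.apredict(query))
        try:
            async for token in callback.aiter():
                # JSON-encode each token: a raw newline inside a token would end the
                # SSE event early and cause token-join glitches on the client (items 2 and 6).
                yield f"data: {json.dumps({'token': token})}\n\n"
            await task  # re-raise anything the generation task threw (item 5)
        except Exception as exc:
            # Report the failure over the stream instead of closing it silently (item 5).
            yield f"event: error\ndata: {json.dumps({'message': str(exc)})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        sse_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",   # keep intermediaries from caching the stream (item 3)
            "X-Accel-Buffering": "no",     # ask nginx-style proxies not to buffer (item 3)
        },
    )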
Bonus: Security + Compliance
Risk → Fix
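Two risks are visible in the code at the top of this post: the provider key is hardcoded in source, and the /stream route is open to anyone who can reach it. Below is a minimal sketch of one possible fix for the second, assuming an APP_API_KEY environment variable and an x-api-key request header (both names are illustrative, not part of the original code):

import os
import secrets

from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()  # or reuse the app defined earlier

async def require_api_key(x_api_key: str = Header(default="")):
    # APP_API_KEY is an illustrative name; keep the value out of source control.
    expected = os.environ.get("APP_API_KEY", "")
    if not expected or not secrets.compare_digest(x_api_key, expected):
        raise HTTPException(status_code=401, detail="invalid or missing API key")

@app.get("/stream", dependencies=[Depends(require_api_key)])
async def stream_response(query: str):
    ...  # same body as the streaming endpoint at the top of the post

The same dependency can be applied to every route at once via FastAPI(dependencies=[Depends(require_api_key)]) if the whole app should be gated.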
TL;DR Best Practices
Area → Practice