## Overview

The CTGT API offers advanced features to customize and optimize your AI interactions:

- **Streaming Responses**: real-time, token-by-token output
- **Token Limits**: control response length and costs
## Streaming Responses

Get responses in real time as they’re generated, similar to ChatGPT’s typing effect.

### Enable Streaming

Set `"stream": true` in your request:
```bash
curl -X POST https://api.ctgt.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-ctgt-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {"role": "user", "content": "Write a short story"}
    ],
    "stream": true
  }'
```
The response arrives as Server-Sent Events (SSE):

```text
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702847123,"model":"gemini-2.5-flash","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702847123,"model":"gemini-2.5-flash","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702847123,"model":"gemini-2.5-flash","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}
...
data: [DONE]
```
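To consume the stream, read the body line by line, strip the `data: ` prefix, and stop at the `[DONE]` sentinel. A minimal sketch using `requests` (the `stream_chat` helper name is ours; the fuller example under Combining Features below adds conversation history):

```python
import json
import requests

def stream_chat(api_key: str, prompt: str) -> str:
    """Minimal SSE consumer: prints tokens as they arrive."""
    response = requests.post(
        "https://api.ctgt.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gemini-2.5-flash",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,  # let requests yield the body incrementally
        timeout=30,
    )
    text = ""
    for line in response.iter_lines():
        line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        payload = line[6:]
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        token = chunk["choices"][0]["delta"].get("content", "")
        print(token, end="", flush=True)
        text += token
    return text
```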
### When to Use Streaming

- **Interactive applications**: chat interfaces, chatbots, conversational UIs
- **Long-form content**: articles, reports, stories, documentation
- **Better UX**: show progress and reduce perceived latency
- **Real-time feedback**: users see responses immediately

Streaming is ideal for user-facing applications where showing incremental progress improves the experience.
## Token Limits

Control response length and manage costs with `max_tokens`.

### Setting Token Limits
```bash
curl -X POST https://api.ctgt.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-ctgt-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {"role": "user", "content": "Summarize quantum physics"}
    ],
    "max_tokens": 100
  }'
```
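If the model hits the cap mid-answer, OpenAI-compatible APIs mark the choice with `"finish_reason": "length"`; checking for it lets you detect truncated output. A short sketch, assuming the CTGT API follows that convention:

```python
import requests

response = requests.post(
    "https://api.ctgt.ai/v1/chat/completions",
    headers={"Authorization": "Bearer sk-ctgt-YOUR_API_KEY"},
    json={
        "model": "gemini-2.5-flash",
        "messages": [{"role": "user", "content": "Summarize quantum physics"}],
        "max_tokens": 100,
    },
    timeout=30,
)
choice = response.json()["choices"][0]

# "length" means the response was cut off at max_tokens; consider
# raising the limit or asking for a shorter answer.
if choice["finish_reason"] == "length":
    print("Warning: response was truncated at max_tokens")
print(choice["message"]["content"])
```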
### Token Guidelines

| Tokens | Approximate Length | Best For |
|---|---|---|
| 50-100 | 1-2 short paragraphs | Quick answers, summaries |
| 200-500 | 1-2 medium paragraphs | Explanations, descriptions |
| 500-1000 | 1-2 pages | Detailed responses, articles |
| 1000-2000 | 2-4 pages | Long-form content, essays |
| 2000+ | Multiple pages | Reports, documentation |
1 token ≈ 4 characters or ¾ of a word in English.
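That rule of thumb is easy to encode for a quick budget check before sending a request. A rough sketch (real tokenizer counts vary by model, language, and content, so treat this as a heuristic, not an exact count):

```python
# Rough token estimate from the ~4 characters/token rule of thumb above.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

print(estimate_tokens("Summarize quantum physics"))  # ~6 tokens
```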
### Cost Impact

```python
# Example: cost of a 1000-token response
# Gemini 2.5 Flash:  $0.0027
# Claude Sonnet 4.5: $0.017
# Claude Opus 4.5:   $0.030

# Setting max_tokens=200 reduces output costs by 80%
```
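To compare models or budget ahead of time, you can fold those figures into a small helper. A sketch using the example prices above (the model keys here are illustrative, and per-token prices change, so verify against current pricing):

```python
# USD per 1,000 output tokens, taken from the example above.
PRICE_PER_1K_OUTPUT = {
    "gemini-2.5-flash": 0.0027,
    "claude-sonnet-4.5": 0.017,
    "claude-opus-4.5": 0.030,
}

def output_cost(model: str, output_tokens: int) -> float:
    """Estimate the output-token cost of a single response."""
    return PRICE_PER_1K_OUTPUT[model] * output_tokens / 1000

# Capping max_tokens at 200 instead of 1000 cuts output cost by 80%:
print(output_cost("gemini-2.5-flash", 1000))  # 0.0027
print(output_cost("gemini-2.5-flash", 200))   # 0.00054
```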
## Combining Features

### Example: Production Chat Application
```python
import requests
import json


def chat_completion(
    api_key: str,
    user_message: str,
    conversation_history: list = None,
    model: str = "gemini-2.5-flash",
    streaming: bool = True,
    temperature: float = 0.7,
    max_tokens: int = 500
):
    """
    Complete chat request with all advanced features.
    """
    url = "https://api.ctgt.ai/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # Build messages with history
    messages = []
    if conversation_history:
        messages.extend(conversation_history)
    messages.append({
        "role": "user",
        "content": user_message
    })

    # Request payload
    data = {
        "model": model,
        "messages": messages,
        "stream": streaming,
        "temperature": temperature,
        "max_tokens": max_tokens
    }

    response = requests.post(url, headers=headers, json=data, stream=streaming)

    if streaming:
        # Handle streaming response: accumulate tokens as they arrive
        full_response = ""
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    data_str = line[6:]
                    if data_str == '[DONE]':
                        break
                    try:
                        chunk = json.loads(data_str)
                        content = chunk['choices'][0]['delta'].get('content', '')
                        full_response += content
                        print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        pass
        print()
        return full_response
    else:
        # Handle non-streaming response
        result = response.json()
        return result['choices'][0]['message']['content']


# Usage
conversation = []

response1 = chat_completion(
    api_key="sk-ctgt-YOUR_API_KEY",
    user_message="What is machine learning?",
    conversation_history=conversation,
    streaming=True
)
conversation.append({"role": "user", "content": "What is machine learning?"})
conversation.append({"role": "assistant", "content": response1})

response2 = chat_completion(
    api_key="sk-ctgt-YOUR_API_KEY",
    user_message="Give me a code example",
    conversation_history=conversation,
    streaming=True
)
```
## Best Practices

**Set Token Limits**

- Prevent excessive costs
- Control response length
- Match your UI constraints

**Enable Streaming**

- Better user experience
- Show progress in real time
- Ideal for chat interfaces

**Monitor Usage**

- Track token consumption
- Set budget alerts
- Optimize costs regularly

**Handle Errors**

- Implement retry logic
- Use exponential backoff
- Log all failures
## Error Handling

```python
import time
import requests


def safe_api_call(api_key, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.post(
                "https://api.ctgt.ai/v1/chat/completions",
                headers={"Authorization": f"Bearer {api_key}"},
                json={
                    "model": "gemini-2.5-flash",
                    "messages": messages,
                    "max_tokens": 500
                },
                timeout=30
            )

            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                # Rate limited: back off exponentially, then retry
                time.sleep(2 ** attempt)
                continue
            else:
                print(f"Error: {response.status_code}")
                return None

        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt < max_retries - 1:
                continue
        except Exception as e:
            print(f"Error: {e}")
            return None

    return None
```
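The `2 ** attempt` backoff doubles the wait after each rate-limit response (1s, 2s, 4s); adding random jitter on top is a common refinement so that concurrent clients don't retry in lockstep. Example usage:

```python
result = safe_api_call(
    api_key="sk-ctgt-YOUR_API_KEY",
    messages=[{"role": "user", "content": "Hello!"}]
)
if result:
    print(result["choices"][0]["message"]["content"])
else:
    print("Request failed after retries")
```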
## Next Steps