
Overview

The CTGT API offers advanced features to customize and optimize your AI interactions:

  • Streaming Responses: real-time, token-by-token output
  • Token Limits: control response length and costs

Streaming Responses

Get responses in real-time as they’re generated, similar to ChatGPT’s typing effect.

Enable Streaming

Set stream: true in your request:
curl -X POST https://api.ctgt.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-ctgt-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {"role": "user", "content": "Write a short story"}
    ],
    "stream": true
  }'

Streaming Response Format

Server-Sent Events (SSE):
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702847123,"model":"gemini-2.5-flash","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702847123,"model":"gemini-2.5-flash","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1702847123,"model":"gemini-2.5-flash","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}

...

data: [DONE]
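
The chunk format above matches the OpenAI chat completions schema, so one convenient way to consume the stream is an OpenAI-compatible client pointed at the CTGT base URL. The sketch below assumes that compatibility (it is not a documented guarantee; the requests-based approach later on this page works regardless):

from openai import OpenAI

# Assumption: the CTGT endpoint accepts OpenAI-compatible clients.
client = OpenAI(api_key="sk-ctgt-YOUR_API_KEY", base_url="https://api.ctgt.ai/v1")

stream = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True,
)

# Each chunk carries a delta with the next piece of content.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()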

When to Use Streaming

  • Interactive Applications: chat interfaces, chatbots, conversational UIs
  • Long-Form Content: articles, reports, stories, documentation
  • Better UX: show progress, reduce perceived latency
  • Real-Time Feedback: users see responses immediately

Streaming is ideal for user-facing applications where showing incremental progress improves the experience.

Token Limits

Control response length and manage costs with max_tokens.

Setting Token Limits

curl -X POST https://api.ctgt.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-ctgt-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {"role": "user", "content": "Summarize quantum physics"}
    ],
    "max_tokens": 100
  }'

Token Guidelines

Tokens    | Approximate Length     | Best For
50-100    | 1-2 short paragraphs   | Quick answers, summaries
200-500   | 1-2 medium paragraphs  | Explanations, descriptions
500-1000  | 1-2 pages              | Detailed responses, articles
1000-2000 | 2-4 pages              | Long-form content, essays
2000+     | Multiple pages         | Reports, documentation
1 token ≈ 4 characters or ¾ of a word in English.
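
If you need a quick way to size max_tokens, the heuristic above translates into a rough estimator. This is an approximation only; actual tokenization varies by model and language:

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token in English."""
    return max(1, len(text) // 4)

def max_tokens_for_words(word_count: int) -> int:
    """Approximate max_tokens for a target word count (~0.75 words per token)."""
    return round(word_count / 0.75)

print(estimate_tokens("Summarize quantum physics in one paragraph."))  # rough count
print(max_tokens_for_words(150))  # ~200 tokens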

Cost Impact

# Example: 1000-token response
# Gemini 2.5 Flash: $0.0027
# Claude Sonnet 4.5: $0.017
# Claude Opus 4.5: $0.030

# Setting max_tokens=200 reduces costs by 80%
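
Because cost scales linearly with output tokens, the figures above can be turned into a simple estimator. A minimal sketch, treating those figures as illustrative per-1,000-output-token rates (the Claude model keys are hypothetical and actual pricing may differ):

# Illustrative rates from the example above (USD per 1,000 output tokens).
RATES_PER_1K = {
    "gemini-2.5-flash": 0.0027,
    "claude-sonnet-4.5": 0.017,   # hypothetical key
    "claude-opus-4.5": 0.030,     # hypothetical key
}

def estimate_cost(model: str, output_tokens: int) -> float:
    """Linear estimate: tokens / 1000 * per-1K rate."""
    return output_tokens / 1000 * RATES_PER_1K[model]

print(estimate_cost("gemini-2.5-flash", 1000))  # 0.0027
print(estimate_cost("gemini-2.5-flash", 200))   # 0.00054, i.e. 80% less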

Combining Features

Example: Production Chat Application

import requests
import json

def chat_completion(
    api_key: str,
    user_message: str,
    conversation_history: list | None = None,
    model: str = "gemini-2.5-flash",
    streaming: bool = True,
    temperature: float = 0.7,
    max_tokens: int = 500
):
    """
    Complete chat request with all advanced features
    """
    url = "https://api.ctgt.ai/v1/chat/completions"
    
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    # Build messages with history
    messages = []
    
    if conversation_history:
        messages.extend(conversation_history)
    
    messages.append({
        "role": "user",
        "content": user_message
    })
    
    # Request payload
    data = {
        "model": model,
        "messages": messages,
        "stream": streaming,
        "temperature": temperature,
        "max_tokens": max_tokens
    }
    
    response = requests.post(url, headers=headers, json=data, stream=streaming)
    
    if streaming:
        # Handle streaming response
        full_response = ""
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    data_str = line[6:]
                    if data_str == '[DONE]':
                        break
                    try:
                        chunk = json.loads(data_str)
                        content = chunk['choices'][0]['delta'].get('content', '')
                        full_response += content
                        print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        pass
        print()
        return full_response
    else:
        # Handle non-streaming response
        result = response.json()
        return result['choices'][0]['message']['content']

# Usage
conversation = []

response1 = chat_completion(
    api_key="sk-ctgt-YOUR_API_KEY",
    user_message="What is machine learning?",
    conversation_history=conversation,
    streaming=True
)

conversation.append({"role": "user", "content": "What is machine learning?"})
conversation.append({"role": "assistant", "content": response1})

response2 = chat_completion(
    api_key="sk-ctgt-YOUR_API_KEY",
    user_message="Give me a code example",
    conversation_history=conversation,
    streaming=True
)

Best Practices

Set Token Limits

  • Prevent excessive costs
  • Control response length
  • Match your UI constraints

Enable Streaming

  • Better user experience
  • Show progress in real-time
  • Ideal for chat interfaces

Monitor Usage

  • Track token consumption (see the sketch after this list)
  • Set budget alerts
  • Optimize costs regularly

Handle Errors

  • Implement retry logic
  • Use exponential backoff
  • Log all failures
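
For usage monitoring, non-streaming responses in the OpenAI-compatible format include a usage object alongside the choices. A minimal sketch assuming that format (verify the field names against CTGT's actual responses):

import requests

def log_usage(api_key: str, messages: list) -> dict:
    """Make a non-streaming call and report token usage from the response."""
    response = requests.post(
        "https://api.ctgt.ai/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": "gemini-2.5-flash", "messages": messages, "max_tokens": 500},
        timeout=30,
    )
    result = response.json()
    usage = result.get("usage", {})  # assumed OpenAI-style usage fields
    print(f"prompt={usage.get('prompt_tokens')} "
          f"completion={usage.get('completion_tokens')} "
          f"total={usage.get('total_tokens')}")
    return usage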

Error Handling

import time
import requests

def safe_api_call(api_key, messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.post(
                "https://api.ctgt.ai/v1/chat/completions",
                headers={"Authorization": f"Bearer {api_key}"},
                json={
                    "model": "gemini-2.5-flash",
                    "messages": messages,
                    "max_tokens": 500
                },
                timeout=30
            )
            
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                # Rate limit - wait and retry
                time.sleep(2 ** attempt)
                continue
            else:
                print(f"Error: {response.status_code}")
                return None
                
        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            if attempt < max_retries - 1:
                continue
        except Exception as e:
            print(f"Error: {e}")
            return None
    
    return None
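
Usage, with retries and exponential backoff handled inside the helper:

result = safe_api_call(
    api_key="sk-ctgt-YOUR_API_KEY",
    messages=[{"role": "user", "content": "Hello"}]
)
if result:
    print(result["choices"][0]["message"]["content"])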

Next Steps