Design an AI Chat App (ChatGPT-style)

🎯 Problem Statement

Design an AI Chat Application for iOS similar to ChatGPT that supports:

  • Text streaming responses (typewriter effect)
  • Multiple conversations with context switching
  • Multimodal inputs (text, images, videos)
  • AI-generated images (DALL-E style)
  • Conversation history and offline access
  • Real-time streaming while handling interruptions

Constraints:

  • OpenAI API rate limits (3,500 RPM for GPT-4)
  • Streaming must feel instant (<200ms first token)
  • Handle 10+ concurrent conversations
  • Support images up to 20MB
  • Work on slow 3G networks

๐Ÿ“ STEP 1: High-Level Architecture Diagram

In the interview, start by drawing this:

graph TB
    subgraph "iOS Client Layer"
        UI[ChatViewController<br/>- MessageList<br/>- InputBar<br/>- TypewriterEffect]
        VM[ChatViewModel<br/>- @Published messages<br/>- isStreaming state]
        CM[ConversationManager<br/>- Active chat<br/>- Queue requests<br/>- Context window]
        SS[StreamingService<br/>- SSE/WebSocket<br/>- Parsing<br/>- Buffering]
        MS[MediaUploadService<br/>- Image processing<br/>- Compression<br/>- Upload queue]
        DB[DatabaseManager<br/>- CoreData<br/>- History<br/>- Offline]
        
        UI --> VM
        VM --> CM
        CM --> SS
        CM --> MS
        CM --> DB
    end
    
    subgraph "Backend / Proxy Server"
        PROXY[API Gateway<br/>- Rate limiting<br/>- API key mgmt<br/>- Request queue<br/>- Error handling]
    end
    
    subgraph "AI Service Providers"
        OPENAI[OpenAI API<br/>- GPT-4<br/>- DALL-E<br/>- Vision]
        CLAUDE[Anthropic<br/>- Claude API]
        GEMINI[Google AI<br/>- Gemini API]
    end
    
    SS -->|HTTPS/SSE| PROXY
    MS -->|Upload| PROXY
    PROXY --> OPENAI
    PROXY --> CLAUDE
    PROXY --> GEMINI

Key Components:

  • ChatView: UIKit/SwiftUI interface with message bubbles
  • ChatViewModel: Manages UI state, streaming buffers
  • ConversationManager: Handles multiple chats, context switching
  • StreamingService: SSE/WebSocket connection management
  • MediaUploadService: Compresses and uploads images/videos
  • Backend Proxy: Rate limiting, authentication, cost control

💬 What to say while drawing:

"I'll use MVVM with a conversation manager to handle multiple chats. The streaming service manages real-time connections. A backend proxy protects API keys and handles rate limiting. For multimodal, we have a separate upload service that processes media before sending."


👤 STEP 2: User Flow Diagram

Draw the main interaction flows:

┌─────────────────┐
│ USER SENDS MSG  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ Check Type?     │───────▶│ TEXT ONLY    │
│ Text/Image/Both │        │ Go to Flow A │
└────────┬────────┘        └──────────────┘
         │ Image/Video
         ▼
┌─────────────────┐
│ FLOW B:         │
│ Multimodal      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 1. Show upload  │
│    progress     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ 2. Compress     │───────▶│ WebP/HEVC    │
│    media        │        │ reduction    │
└────────┬────────┘        └──────────────┘
         │
         ▼
┌─────────────────┐
│ 3. Upload to    │
│    backend      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 4. Get media URL│
└────────┬────────┘
         │
         ▼

FLOW A: TEXT STREAMING
┌─────────────────┐
│ 1. Open SSE     │
│    connection   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ 2. Send request │───────▶│ POST /chat   │
│    with context │        │ + media URL  │
└────────┬────────┘        └──────────────┘
         │
         ▼
┌─────────────────┐
│ 3. Show typing  │
│    indicator    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ 4. Receive SSE  │───────▶│ data: chunk  │
│    chunks       │        │ data: chunk  │
└────────┬────────┘        └──────────────┘
         │
         ▼
┌─────────────────┐
│ 5. Append to    │
│    message with │
│    animation    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 6. Stream done  │
│    (data:[DONE])│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 7. Save to DB   │
│    Close SSE    │
└─────────────────┘

EDGE CASE: USER SWITCHES CHAT WHILE STREAMING
┌─────────────────┐
│ Streaming       │
│ message 50%     │
└────────┬────────┘
         │ User taps different conversation
         ▼
┌─────────────────┐
│ Cancel current  │
│ SSE connection  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Save partial    │
│ response to DB  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Load new chat   │
│ context         │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Ready for new   │
│ message         │
└─────────────────┘

💬 What to say while drawing:

"For text-only messages, we open an SSE connection and stream chunks. For images, we compress first, upload to the backend, get a URL, then include it in the API request. If the user switches conversations mid-stream, we cancel the connection, save the partial response, and load the new context. This prevents mixing responses from different chats."


🔄 STEP 3: Streaming Architecture - SSE vs WebSocket

Critical decision: How to handle real-time streaming?

Option 1: Server-Sent Events (SSE)

sequenceDiagram
    participant App as iOS App
    participant Proxy as Backend Proxy
    participant OpenAI as OpenAI API
    
    App->>Proxy: POST /stream<br/>Accept: text/event-stream
    Proxy->>OpenAI: POST /chat/completions<br/>stream: true
    
    loop Streaming chunks
        OpenAI-->>Proxy: data: {"choices":[{"delta":{"content":"Hello"}}]}
        Proxy-->>App: Forward chunk
        
        OpenAI-->>Proxy: data: {"choices":[{"delta":{"content":" world"}}]}
        Proxy-->>App: Forward chunk
    end
    
    OpenAI-->>Proxy: data: [DONE]
    Proxy-->>App: data: [DONE]
    
    Note over App,Proxy: Connection closes

Implementation:

class StreamingService {
    private var eventSource: EventSource?
    
    @discardableResult
    func streamMessage(
        message: String,
        conversationId: String,
        onChunk: @escaping (String) -> Void,
        onComplete: @escaping () -> Void,
        onError: @escaping (Error) -> Void
    ) -> EventSource {
        let url = URL(string: "\(baseURL)/stream")!
        var request = URLRequest(url: url)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        
        let body: [String: Any] = [
            "message": message,
            "conversation_id": conversationId
        ]
        request.httpBody = try? JSONSerialization.data(withJSONObject: body)
        
        // Using an EventSource library for SSE (no native iOS support)
        let eventSource = EventSource(request: request)
        
        eventSource.onMessage { event in
            // OpenAI-style streams end with a literal [DONE] sentinel
            if event.data == "[DONE]" {
                onComplete()
                return
            }
            
            if let data = event.data?.data(using: .utf8),
               let json = try? JSONDecoder().decode(StreamChunk.self, from: data) {
                onChunk(json.choices.first?.delta.content ?? "")
            }
        }
        
        eventSource.onError { error in
            onError(error)
        }
        
        eventSource.connect()
        self.eventSource = eventSource
        return eventSource  // returned so callers can cancel per-conversation
    }
    
    func cancelStream() {
        eventSource?.disconnect()
        eventSource = nil
    }
}

struct StreamChunk: Codable {
    struct Choice: Codable {
        struct Delta: Codable {
            let content: String?
        }
        let delta: Delta
    }
    let choices: [Choice]
}

Pros:

  • โœ… Simpler than WebSocket (HTTP-based)
  • โœ… Automatic reconnection
  • โœ… Unidirectional (perfect for AI responses)
  • โœ… Works with existing HTTP infrastructure

Cons:

  • โŒ One-way communication only
  • โŒ No built-in iOS support (need library)

Option 2: WebSocket

iOS App                     Backend                   OpenAI API
   │                           │                         │
   │ WS Connect                │                         │
   │──────────────────────────▶│                         │
   │                           │                         │
   │ {"type":"message","text":"..."}                     │
   │──────────────────────────▶│                         │
   │                           │ POST /chat/completions  │
   │                           │────────────────────────▶│
   │                           │                         │
   │◀──────────────────────────│◀────────────────────────│
   │ {"type":"chunk","content":"Hello"}                  │
   │                           │                         │
   │◀──────────────────────────│◀────────────────────────│
   │ {"type":"chunk","content":" world"}                 │
   │                           │                         │
   │◀──────────────────────────│◀────────────────────────│
   │ {"type":"done"}           │                         │
   │                           │                         │
   │ WS stays open for next msg│                         │

Implementation:

class WebSocketService {
    private var webSocket: URLSessionWebSocketTask?
    
    func connect() {
        let session = URLSession(configuration: .default)
        webSocket = session.webSocketTask(with: URL(string: "\(wsURL)/chat")!)
        webSocket?.resume()
        receiveMessage()
    }
    
    func sendMessage(_ text: String, conversationId: String) {
        let message: [String: Any] = [
            "type": "message",
            "text": text,
            "conversation_id": conversationId
        ]
        
        if let data = try? JSONSerialization.data(withJSONObject: message),
           let string = String(data: data, encoding: .utf8) {
            webSocket?.send(.string(string)) { error in
                if let error = error {
                    print("Send error: \(error)")
                }
            }
        }
    }
    
    private func receiveMessage() {
        webSocket?.receive { [weak self] result in
            switch result {
            case .success(let message):
                switch message {
                case .string(let text):
                    self?.handleMessage(text)
                case .data:
                    break  // binary frames unused in this protocol
                @unknown default:
                    break
                }
                
                // Re-register: receive() only delivers a single message
                self?.receiveMessage()
                
            case .failure(let error):
                print("Receive error: \(error)")
                // No automatic reconnection: the caller must call connect() again
            }
        }
    }
    
    private func handleMessage(_ text: String) {
        guard let data = text.data(using: .utf8),
              let json = try? JSONDecoder().decode(WSMessage.self, from: data) else {
            return
        }
        
        switch json.type {
        case "chunk":
            onChunkReceived?(json.content ?? "")
        case "done":
            onStreamComplete?()
        case "error":
            onError?(NSError(domain: "WS", code: -1, userInfo: [NSLocalizedDescriptionKey: json.content ?? ""]))
        default:
            break
        }
    }
    
    var onChunkReceived: ((String) -> Void)?
    var onStreamComplete: (() -> Void)?
    var onError: ((Error) -> Void)?
}

struct WSMessage: Codable {
    let type: String
    let content: String?
}

Pros:

  • โœ… Bi-directional communication
  • โœ… Native iOS support (URLSessionWebSocketTask)
  • โœ… Persistent connection (better for multiple messages)
  • โœ… Can send interruptions/cancellations

Cons:

  • โŒ More complex to implement
  • โŒ Requires backend WebSocket support
  • โŒ Connection management overhead

📊 SSE vs WebSocket Comparison

Feature        | SSE                  | WebSocket
---------------|----------------------|------------------------------
Complexity     | Simple (HTTP)        | Complex (custom protocol)
iOS Support    | Library needed       | Native URLSession
Direction      | Server→Client only   | Bi-directional
Reconnection   | Automatic            | Manual
Use Case       | Streaming responses  | Real-time chat, interruptions
OpenAI API     | Directly supported   | Needs proxy conversion
Best For       | ChatGPT-style apps   | Collaborative editing

💡 Recommendation: Use SSE for ChatGPT-style apps. It's simpler, matches OpenAI's API, and handles 99% of use cases. Only use WebSocket if you need:

  • User to interrupt AI mid-response
  • AI to ask follow-up questions
  • Real-time collaboration features

🎨 STEP 4: Handling Multiple Concurrent Conversations

Problem: User has 5 conversations open, switches between them rapidly. How do you prevent:

  • Response mixing (conversation A getting conversation Bโ€™s answer)
  • Memory leaks from abandoned streams
  • Race conditions

Solution: Conversation Queue Manager

class ConversationManager: ObservableObject {
    // Current active conversation
    @Published private(set) var activeConversationId: String?
    
    // Streaming state per conversation
    private var streamingStates: [String: StreamingState] = [:]
    
    // Serial queue: all mutations funnel through it to prevent races
    private let requestQueue = DispatchQueue(label: "com.app.conversation", qos: .userInitiated)
    
    // SSE layer from Step 3
    private let streamingService = StreamingService()
    
    struct StreamingState {
        var eventSource: EventSource?
        var partialMessage: String = ""
        var isActive: Bool = false
    }
    
    func sendMessage(
        _ text: String,
        in conversationId: String,
        onChunk: @escaping (String) -> Void,
        onComplete: @escaping (String) -> Void
    ) {
        requestQueue.async { [weak self] in
            guard let self = self else { return }
            
            // Cancel any existing stream for this conversation
            self.cancelStream(for: conversationId)
            
            // Create new streaming state
            var state = StreamingState()
            state.isActive = true
            self.streamingStates[conversationId] = state
            
            // Start streaming
            let eventSource = self.streamingService.streamMessage(
                message: text,
                conversationId: conversationId,
                onChunk: { [weak self] chunk in
                    guard let self = self else { return }
                    
                    // Only process if this conversation is still active
                    guard var state = self.streamingStates[conversationId],
                          state.isActive else {
                        return
                    }
                    
                    // Append chunk
                    state.partialMessage += chunk
                    self.streamingStates[conversationId] = state
                    
                    // Update UI
                    DispatchQueue.main.async {
                        onChunk(chunk)
                    }
                },
                onComplete: { [weak self] in
                    guard let self = self else { return }
                    
                    if let state = self.streamingStates[conversationId] {
                        // Save complete message
                        self.saveMessage(state.partialMessage, in: conversationId)
                        
                        DispatchQueue.main.async {
                            onComplete(state.partialMessage)
                        }
                    }
                    
                    // Clean up
                    self.streamingStates[conversationId] = nil
                },
                onError: { [weak self] error in
                    // Handle error, clean up
                    self?.streamingStates[conversationId] = nil
                }
            )
            
            // Store event source
            self.streamingStates[conversationId]?.eventSource = eventSource
        }
    }
    
    func switchToConversation(_ conversationId: String) {
        requestQueue.async { [weak self] in
            guard let self = self else { return }
            
            // Cancel stream for previous conversation
            if let previousId = self.activeConversationId,
               previousId != conversationId {
                self.cancelStream(for: previousId, savingPartial: true)
            }
            
            // Set new active conversation (@Published, so publish on main)
            DispatchQueue.main.async {
                self.activeConversationId = conversationId
            }
        }
    }
    
    private func cancelStream(for conversationId: String, savingPartial: Bool = false) {
        guard var state = streamingStates[conversationId] else { return }
        
        // Close SSE connection
        state.eventSource?.disconnect()
        state.isActive = false
        
        // Optionally save partial response
        if savingPartial && !state.partialMessage.isEmpty {
            saveMessage(state.partialMessage + " [interrupted]", in: conversationId)
        }
        
        // Clean up
        streamingStates[conversationId] = nil
    }
    
    private func saveMessage(_ text: String, in conversationId: String) {
        // Save to CoreData/Realm
    }
}

Key techniques:

  • โœ… Conversation-scoped state: Each conversation has its own StreamingState
  • โœ… Active flag: Only process chunks if conversation is still active
  • โœ… Queue serialization: All operations on serial queue prevent race conditions
  • โœ… Graceful cancellation: Save partial responses when user switches away
  • โœ… Memory cleanup: Nil out references when done

📸 STEP 5: Multimodal Support (Images/Videos)

Challenge: User uploads 10MB image. How do you handle it efficiently?

Image Upload Flow:

class MediaUploadService {
    func uploadImage(
        _ image: UIImage,
        quality: CompressionQuality = .medium,
        onProgress: @escaping (Double) -> Void,
        onComplete: @escaping (String) -> Void  // Returns URL
    ) async throws {
        // 1. Compress image
        let compressedData = try await compressImage(image, quality: quality)
        
        // 2. Generate presigned URL
        let presignedURL = try await getPresignedUploadURL()
        
        // 3. Upload to S3/Cloud Storage
        try await uploadData(compressedData, to: presignedURL, onProgress: onProgress)
        
        // 4. Return permanent URL
        let mediaURL = try await confirmUpload(presignedURL)
        onComplete(mediaURL)
    }
    
    private func compressImage(_ image: UIImage, quality: CompressionQuality) async throws -> Data {
        return try await withCheckedThrowingContinuation { continuation in
            DispatchQueue.global(qos: .userInitiated).async {
                // Resize to max dimension (resized(maxDimension:) is a custom
                // UIImage extension, sketched after this block)
                let maxDimension: CGFloat = quality == .high ? 2048 : 1024
                let resized = image.resized(maxDimension: maxDimension)
                
                // Convert to HEIC (smaller than JPEG); heicData(compressionQuality:)
                // is an assumed helper over ImageIO/AVFoundation
                guard let data = resized.heicData(compressionQuality: quality.value) else {
                    continuation.resume(throwing: MediaError.compressionFailed)
                    return
                }
                
                continuation.resume(returning: data)
            }
        }
    }
    
    private func uploadData(_ data: Data, to url: URL, onProgress: @escaping (Double) -> Void) async throws {
        var request = URLRequest(url: url)
        request.httpMethod = "PUT"
        request.setValue("image/heic", forHTTPHeaderField: "Content-Type")
        
        // Upload with progress tracking (ProgressDelegate: a custom
        // URLSessionTaskDelegate that forwards didSendBodyData progress)
        let (_, response) = try await URLSession.shared.upload(for: request, from: data, delegate: ProgressDelegate(onProgress))
        
        guard let httpResponse = response as? HTTPURLResponse,
              (200...299).contains(httpResponse.statusCode) else {
            throw MediaError.uploadFailed
        }
    }
}

enum CompressionQuality {
    case low, medium, high
    
    var value: CGFloat {
        switch self {
        case .low: return 0.3
        case .medium: return 0.6
        case .high: return 0.8
        }
    }
}
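
The resized(maxDimension:) call above is not part of UIKit; here is one way to write it (the heicData(compressionQuality:) helper would similarly wrap ImageIO):

import UIKit

extension UIImage {
    // Scale down so the longest side is at most maxDimension, preserving
    // aspect ratio; returns self if already small enough.
    func resized(maxDimension: CGFloat) -> UIImage {
        let longestSide = max(size.width, size.height)
        guard longestSide > maxDimension else { return self }
        
        let scale = maxDimension / longestSide
        let newSize = CGSize(width: size.width * scale, height: size.height * scale)
        
        return UIGraphicsImageRenderer(size: newSize).image { _ in
            draw(in: CGRect(origin: .zero, size: newSize))
        }
    }
}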

Sending Multimodal Request to OpenAI:

// OpenAI Vision API Request
struct MultimodalRequest: Encodable {
    let model = "gpt-4-vision-preview"
    let messages: [Message]
    let maxTokens: Int = 4096
    let stream: Bool = true
    
    enum CodingKeys: String, CodingKey {
        case model, messages
        case maxTokens = "max_tokens"
        case stream
    }
}

struct Message: Encodable {
    let role: String
    let content: [Content]
}

enum Content: Encodable {
    case text(String)
    case imageURL(String)
    
    private enum CodingKeys: String, CodingKey {
        case type, text
        case imageURL = "image_url"
    }
    
    private struct ImageURL: Encodable {
        let url: String
    }
    
    func encode(to encoder: Encoder) throws {
        // A keyed container: a heterogeneous [String: Any] dictionary
        // would not conform to Encodable
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .text(let text):
            try container.encode("text", forKey: .type)
            try container.encode(text, forKey: .text)
        case .imageURL(let url):
            try container.encode("image_url", forKey: .type)
            try container.encode(ImageURL(url: url), forKey: .imageURL)
        }
    }
}

// Usage
let request = MultimodalRequest(
    messages: [
        Message(role: "user", content: [
            .text("What's in this image?"),
            .imageURL("https://cdn.example.com/image.jpg")
        ])
    ]
)
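
Encoding that request should produce the shape OpenAI's Vision API expects (output below is approximate; key order varies):

let encoded = try! JSONEncoder().encode(request)  // sketch: force-try for brevity
print(String(data: encoded, encoding: .utf8)!)
// {"model":"gpt-4-vision-preview","max_tokens":4096,"stream":true,
//  "messages":[{"role":"user","content":[
//    {"type":"text","text":"What's in this image?"},
//    {"type":"image_url","image_url":{"url":"https://cdn.example.com/image.jpg"}}]}]}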

🚀 STEP 6: Real-World APIs & Integration

Option 1: OpenAI API

struct OpenAIConfig {
    static let baseURL = "https://api.openai.com/v1"
    static let models = [
        "gpt-4-turbo": (inputCost: 0.01, outputCost: 0.03),  // per 1K tokens
        "gpt-4": (inputCost: 0.03, outputCost: 0.06),
        "gpt-3.5-turbo": (inputCost: 0.0005, outputCost: 0.0015)
    ]
    
    // Rate Limits (as of Oct 2025)
    static let rateLimit = 3500  // requests per minute (GPT-4)
    static let tokenLimit = 150_000  // tokens per minute
}

class OpenAIService {
    func chat(
        messages: [Message],
        model: String = "gpt-4-turbo",
        stream: Bool = true
    ) async throws -> AsyncThrowingStream<String, Error> {
        var request = URLRequest(url: URL(string: "\(OpenAIConfig.baseURL)/chat/completions")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        
        let body: [String: Any] = [
            "model": model,
            "messages": messages.map { $0.toDictionary() },
            "stream": stream,
            "max_tokens": 4096
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)
        
        return AsyncThrowingStream { continuation in
            let eventSource = EventSource(request: request)
            
            eventSource.onMessage { event in
                if event.data == "[DONE]" {
                    continuation.finish()
                    return
                }
                
                // Parse and yield chunk (parseChunk sketched below)
                if let chunk = self.parseChunk(event.data) {
                    continuation.yield(chunk)
                }
            }
            
            eventSource.onError { error in
                continuation.finish(throwing: error)
            }
            
            // Tear down the connection if the consumer cancels; the closure
            // also keeps eventSource alive for the stream's lifetime
            continuation.onTermination = { @Sendable _ in
                eventSource.disconnect()
            }
            
            eventSource.connect()
        }
    }
}
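
parseChunk is not defined above; a sketch reusing the StreamChunk model from Step 3:

extension OpenAIService {
    // Decode one SSE payload into its delta text; nil for malformed frames.
    func parseChunk(_ raw: String?) -> String? {
        guard let data = raw?.data(using: .utf8),
              let chunk = try? JSONDecoder().decode(StreamChunk.self, from: data) else {
            return nil
        }
        return chunk.choices.first?.delta.content
    }
}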

Pros:

  • Most comprehensive API
  • Best model quality
  • Vision, DALL-E, TTS all in one

Cons:

  • Expensive ($0.03/1K tokens for GPT-4)
  • Rate limits can be restrictive
  • No guaranteed uptime SLA

Option 2: Anthropic Claude API

struct ClaudeConfig {
    static let baseURL = "https://api.anthropic.com/v1"
    static let models = [
        "claude-3-opus": (cost: 0.015),  // per 1K tokens
        "claude-3-sonnet": (cost: 0.003),
        "claude-3-haiku": (cost: 0.00025)
    ]
}

Pros:

  • Longer context window (200K tokens!)
  • Often more factual than GPT
  • Better at following instructions

Cons:

  • No image generation
  • Smaller ecosystem
  • Fewer integrations

Option 3: Google Gemini API

struct GeminiConfig {
    static let baseURL = "https://generativelanguage.googleapis.com/v1"
    static let models = [
        "gemini-1.5-pro": (cost: 0.0035),  // Multimodal included!
    ]
}

Pros:

  • Free tier (60 requests/minute)
  • Native multimodal (text, images, video, audio)
  • Extremely long context (1M tokens)

Cons:

  • New API, less stable
  • Limited to Google Cloud
  • Censorship more aggressive

🔑 Key Implementation Patterns

1. Token Counting (Cost Management)

class TokenCounter {
    // Approximate token count (1 token ≈ 4 chars of English text)
    func estimateTokens(_ text: String) -> Int {
        return text.count / 4
    }
    
    // More accurate via a tiktoken port (the exact API depends on the library)
    func countTokens(_ text: String, model: String = "gpt-4") -> Int {
        let encoder = Tiktoken.encoder(for: model)
        return encoder.encode(text).count
    }
    
    func estimateCost(inputTokens: Int, outputTokens: Int, model: String) -> Decimal {
        guard let pricing = OpenAIConfig.models[model] else { return 0 }
        
        let inputCost = Decimal(inputTokens) / 1000 * Decimal(pricing.inputCost)
        let outputCost = Decimal(outputTokens) / 1000 * Decimal(pricing.outputCost)
        
        return inputCost + outputCost
    }
}

2. Context Window Management

class ContextManager {
    let maxTokens = 8000           // deliberate budget, well below the model's hard limit
    let reservedForResponse = 2000
    private let tokenCounter = TokenCounter()
    
    // Assumes a flat Message model with `role` and `text` fields
    func trimMessages(_ messages: [Message]) -> [Message] {
        var trimmed: [Message] = []
        var totalTokens = 0
        
        // Always keep system message
        if let systemMsg = messages.first(where: { $0.role == "system" }) {
            trimmed.append(systemMsg)
            totalTokens += tokenCounter.countTokens(systemMsg.text)
        }
        
        // Add messages from most recent, working backwards
        for message in messages.reversed() where message.role != "system" {
            let tokens = tokenCounter.countTokens(message.text)
            
            if totalTokens + tokens > maxTokens - reservedForResponse {
                break  // Stop adding messages
            }
            
            trimmed.insert(message, at: 0)  // Add to beginning
            totalTokens += tokens
        }
        
        return trimmed
    }
}

3. Error Handling & Retry Logic

class AIService {
    private let contextManager = ContextManager()
    
    func chat(messages: [Message]) async throws -> String {
        var retryCount = 0
        let maxRetries = 3
        
        while retryCount < maxRetries {
            do {
                return try await performChat(messages)
            } catch let error as AIError {
                switch error {
                case .rateLimitExceeded:
                    // Exponential backoff
                    let delay = pow(2.0, Double(retryCount))
                    try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
                    retryCount += 1
                    
                case .modelOverloaded:
                    // Try with different model
                    return try await performChat(messages, model: "gpt-3.5-turbo")
                    
                case .contextLengthExceeded:
                    // Trim context and retry
                    let trimmed = contextManager.trimMessages(messages)
                    return try await performChat(trimmed)
                    
                default:
                    throw error
                }
            }
        }
        
        throw AIError.maxRetriesExceeded
    }
}

enum AIError: Error {
    case rateLimitExceeded
    case modelOverloaded
    case contextLengthExceeded
    case maxRetriesExceeded
    case invalidAPIKey
    case networkError
}

📊 Performance Optimization

1. Streaming Buffer Management

class StreamingTextView: UIView {
    private let label = UILabel()   // NOTE: addSubview/constraints omitted for brevity
    private var displayBuffer = ""
    private var incomingBuffer = ""
    private var displayTimer: Timer?
    
    func appendChunk(_ text: String) {
        incomingBuffer += text
        
        // Start display timer if not running
        if displayTimer == nil {
            displayTimer = Timer.scheduledTimer(withTimeInterval: 0.05, repeats: true) { [weak self] _ in
                self?.flushBuffer()
            }
        }
    }
    
    private func flushBuffer() {
        guard !incomingBuffer.isEmpty else {
            return
        }
        
        // Take chunk from buffer
        let chunkSize = 5  // characters per frame
        let endIndex = incomingBuffer.index(
            incomingBuffer.startIndex,
            offsetBy: min(chunkSize, incomingBuffer.count)
        )
        
        let chunk = String(incomingBuffer[..<endIndex])
        incomingBuffer.removeSubrange(..<endIndex)
        
        displayBuffer += chunk
        label.text = displayBuffer
        
        // Stop timer if buffer empty
        if incomingBuffer.isEmpty {
            displayTimer?.invalidate()
            displayTimer = nil
        }
    }
}
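
Wiring it up: network chunks arrive in irregular bursts, and the timer drains them at a steady ~100 characters/second (5 chars every 0.05s) for the typewriter feel. A sketch assuming a StreamingService instance from Step 3:

let textView = StreamingTextView()

streamingService.streamMessage(
    message: "Explain quantum computing",
    conversationId: "chat-A",
    onChunk: { chunk in
        // Chunks may arrive off-main; UIKit updates must hop to main
        DispatchQueue.main.async { textView.appendChunk(chunk) }
    },
    onComplete: { },
    onError: { _ in }
)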

2. Memory Management for Long Conversations

class ConversationStorage {
    // diskStorage / memoryCache are assumed collaborators (e.g. a file-based
    // archive plus an NSCache wrapper), not framework types
    func saveConversation(_ messages: [Message], id: String) {
        // Only keep last 50 messages in memory
        let recent = Array(messages.suffix(50))
        
        // Save older messages to disk
        if messages.count > 50 {
            let archived = Array(messages.prefix(messages.count - 50))
            diskStorage.archive(archived, for: id)
        }
        
        memoryCache.store(recent, for: id)
    }
    
    func loadConversation(id: String) -> [Message] {
        // Load recent from memory
        let recent = memoryCache.load(id) ?? []
        
        // Load archived from disk if needed
        let archived = diskStorage.loadArchived(for: id)
        
        return archived + recent
    }
}

🎯 Interview Discussion Points

Q: Why SSE over WebSocket?

"SSE is simpler for one-way streaming. OpenAI's API natively supports SSE. WebSocket adds complexity without benefit for this use case. We'd only use WebSocket if we needed bidirectional communication, like letting users interrupt mid-response."

Q: How do you handle rate limits?

"Implement exponential backoff, queue requests, and show user-friendly errors. For production, use a backend proxy that tracks rate limits across all users and queues requests intelligently. Consider caching common responses."

Q: What if user sends message while previous is streaming?

"Queue the new request. Cancel the current stream, save the partial response, then start the new stream. Alternatively, show 'AI is responding…' and disable input until complete."

Q: How do you prevent response mixing?

"Each conversation has a unique ID. StreamingService checks the ID before processing chunks. If the user switches conversations, we cancel the EventSource, nil out references, and start fresh for the new conversation."

Q: Cost optimization strategies?

"Use GPT-3.5 for simple queries, GPT-4 for complex ones. Cache responses. Trim context aggressively. Show token count before sending. Offer monthly limits. Use cheaper models (Claude Haiku, Gemini) for non-critical features."


🚀 Scaling Considerations

For 1M Users:

  • Backend proxy with rate limiting and caching
  • Conversation history in distributed database (Cassandra, DynamoDB)
  • Media uploads to S3/CloudFront
  • Redis for active conversation state
  • Message queue (SQS, RabbitMQ) for request buffering

Cost Estimates:

  • GPT-4: $0.03 input + $0.06 output per 1K tokens
  • Average conversation: ~5K tokens = $0.45
  • 1M conversations/day = $450K/day ๐Ÿ˜ฑ
  • Solution: Use cheaper models, caching, context trimming

Remember: In system design interviews, start with diagrams, explain trade-offs, and show you understand real-world constraints like cost and rate limits! 🚀