Design an AI Chat App (ChatGPT-style)
🎯 Problem Statement
Design an AI Chat Application for iOS similar to ChatGPT that supports:
- Text streaming responses (typewriter effect)
- Multiple conversations with context switching
- Multimodal inputs (text, images, videos)
- AI-generated images (DALL-E style)
- Conversation history and offline access
- Real-time streaming while handling interruptions
Constraints:
- OpenAI API rate limits (3,500 RPM for GPT-4)
- Streaming must feel instant (<200ms first token)
- Handle 10+ concurrent conversations
- Support images up to 20MB
- Work on slow 3G networks
📐 STEP 1: High-Level Architecture Diagram
In the interview, start by drawing this:
graph TB
subgraph "iOS Client Layer"
UI[ChatViewController<br/>- MessageList<br/>- InputBar<br/>- TypewriterEffect]
VM[ChatViewModel<br/>- @Published messages<br/>- isStreaming state]
CM[ConversationManager<br/>- Active chat<br/>- Queue requests<br/>- Context window]
SS[StreamingService<br/>- SSE/WebSocket<br/>- Parsing<br/>- Buffering]
MS[MediaUploadService<br/>- Image processing<br/>- Compression<br/>- Upload queue]
DB[DatabaseManager<br/>- CoreData<br/>- History<br/>- Offline]
UI --> VM
VM --> CM
CM --> SS
CM --> MS
CM --> DB
end
subgraph "Backend / Proxy Server"
PROXY[API Gateway<br/>- Rate limiting<br/>- API key mgmt<br/>- Request queue<br/>- Error handling]
end
subgraph "AI Service Providers"
OPENAI[OpenAI API<br/>- GPT-4<br/>- DALL-E<br/>- Vision]
CLAUDE[Anthropic<br/>- Claude API]
GEMINI[Google AI<br/>- Gemini API]
end
SS -->|HTTPS/SSE| PROXY
MS -->|Upload| PROXY
PROXY --> OPENAI
PROXY --> CLAUDE
PROXY --> GEMINI
Key Components:
- ChatView: UIKit/SwiftUI interface with message bubbles
- ChatViewModel: Manages UI state, streaming buffers
- ConversationManager: Handles multiple chats, context switching
- StreamingService: SSE/WebSocket connection management
- MediaUploadService: Compresses and uploads images/videos
- Backend Proxy: Rate limiting, authentication, cost control
💬 What to say while drawing:
"I'll use MVVM with a conversation manager to handle multiple chats. The streaming service manages real-time connections. A backend proxy protects API keys and handles rate limiting. For multimodal, we have a separate upload service that processes media before sending."
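To make the MVVM piece concrete, here is a minimal sketch of the ViewModel layer (names and the ChatMessage type are illustrative, not from a real codebase; ConversationManager.sendMessage is the API sketched in STEP 4, which delivers its callbacks on the main queue):
import Foundation
import Combine
final class ChatViewModel: ObservableObject {
    @Published var messages: [ChatMessage] = []
    @Published var isStreaming = false
    private let conversationManager: ConversationManager
    private let conversationId: String
    init(conversationManager: ConversationManager, conversationId: String) {
        self.conversationManager = conversationManager
        self.conversationId = conversationId
    }
    func send(_ text: String) {
        messages.append(ChatMessage(role: .user, text: text))
        messages.append(ChatMessage(role: .assistant, text: "")) // placeholder to stream into
        isStreaming = true
        conversationManager.sendMessage(
            text,
            in: conversationId,
            onChunk: { [weak self] chunk in
                guard let self = self else { return }
                self.messages[self.messages.count - 1].text += chunk
            },
            onComplete: { [weak self] fullText in
                guard let self = self else { return }
                self.messages[self.messages.count - 1].text = fullText
                self.isStreaming = false
            }
        )
    }
}
struct ChatMessage: Identifiable {
    enum Role { case user, assistant }
    let id = UUID()
    let role: Role
    var text: String
}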
👤 STEP 2: User Flow Diagram
Draw the main interaction flows:
┌─────────────────┐
│ USER SENDS MSG  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ Check Type?     │───────▶│ TEXT ONLY    │
│ Text/Image/Both │        │ Go to Flow A │
└────────┬────────┘        └──────────────┘
         │ Image/Video
         ▼
┌─────────────────┐
│ FLOW B:         │
│ Multimodal      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 1. Show upload  │
│    progress     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ 2. Compress     │───────▶│ WebP/HEVC    │
│    media        │        │ reduction    │
└────────┬────────┘        └──────────────┘
         │
         ▼
┌─────────────────┐
│ 3. Upload to    │
│    backend      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 4. Get media URL│
└────────┬────────┘
         │
         ▼
FLOW A: TEXT STREAMING
┌─────────────────┐
│ 1. Open SSE     │
│    connection   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ 2. Send request │───────▶│ POST /chat   │
│    with context │        │ + media URL  │
└────────┬────────┘        └──────────────┘
         │
         ▼
┌─────────────────┐
│ 3. Show typing  │
│    indicator    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐        ┌──────────────┐
│ 4. Receive SSE  │───────▶│ data: chunk  │
│    chunks       │        │ data: chunk  │
└────────┬────────┘        └──────────────┘
         │
         ▼
┌─────────────────┐
│ 5. Append to    │
│    message with │
│    animation    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 6. Stream done  │
│    (data:[DONE])│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 7. Save to DB   │
│    Close SSE    │
└─────────────────┘
EDGE CASE: USER SWITCHES CHAT WHILE STREAMING
┌─────────────────┐
│ Streaming       │
│ message 50%     │
└────────┬────────┘
         │ User taps different conversation
         ▼
┌─────────────────┐
│ Cancel current  │
│ SSE connection  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Save partial    │
│ response to DB  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Load new chat   │
│ context         │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Ready for new   │
│ message         │
└─────────────────┘
💬 What to say while drawing:
"For text-only messages, we open an SSE connection and stream chunks. For images, we compress first, upload to the backend, get a URL, then include it in the API request. If the user switches conversations mid-stream, we cancel the connection, save the partial response, and load the new context. This prevents mixing responses from different chats."
🌊 STEP 3: Streaming Architecture - SSE vs WebSocket
Critical decision: How to handle real-time streaming?
Option 1: Server-Sent Events (SSE) ✅ RECOMMENDED
sequenceDiagram
participant App as iOS App
participant Proxy as Backend Proxy
participant OpenAI as OpenAI API
App->>Proxy: POST /stream (JSON body)<br/>Accept: text/event-stream
Proxy->>OpenAI: POST /chat/completions<br/>stream: true
loop Streaming chunks
OpenAI-->>Proxy: data: {"choices":[{"delta":{"content":"Hello"}}]}
Proxy-->>App: Forward chunk
OpenAI-->>Proxy: data: {"choices":[{"delta":{"content":" world"}}]}
Proxy-->>App: Forward chunk
end
OpenAI-->>Proxy: data: [DONE]
Proxy-->>App: data: [DONE]
Note over App,Proxy: Connection closes
Implementation:
class StreamingService {
private var eventSource: EventSource?
func streamMessage(
message: String,
conversationId: String,
onChunk: @escaping (String) -> Void,
onComplete: @escaping () -> Void,
onError: @escaping (Error) -> Void
) {
let url = URL(string: "\(baseURL)/stream")!
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
let body: [String: Any] = [
"message": message,
"conversation_id": conversationId
]
request.httpBody = try? JSONSerialization.data(withJSONObject: body)
// Using a third-party EventSource library for SSE (no built-in iOS client;
// the API shown is representative, exact signatures vary by library)
eventSource = EventSource(request: request)
eventSource?.onMessage { event in
if event.data == "[DONE]" {
onComplete()
return
}
if let data = event.data?.data(using: .utf8),
let json = try? JSONDecoder().decode(StreamChunk.self, from: data) {
onChunk(json.choices.first?.delta.content ?? "")
}
}
eventSource?.onError { error in
onError(error)
}
eventSource?.connect()
}
func cancelStream() {
eventSource?.disconnect()
eventSource = nil
}
}
struct StreamChunk: Codable {
struct Choice: Codable {
struct Delta: Codable {
let content: String?
}
let delta: Delta
}
let choices: [Choice]
}
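A quick usage sketch (the accumulated string is what eventually gets persisted; the per-chunk callback is where the typewriter effect hooks in):
let streamingService = StreamingService()
var assistantReply = ""
streamingService.streamMessage(
    message: "Explain SSE in one paragraph",
    conversationId: "conv-123",
    onChunk: { chunk in
        assistantReply += chunk // feed the chunk to the typewriter view here
    },
    onComplete: {
        print("Done: \(assistantReply)")
    },
    onError: { error in
        print("Stream failed: \(error)")
    }
)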
Pros:
- ✅ Simpler than WebSocket (HTTP-based)
- ✅ Automatic reconnection
- ✅ Unidirectional (perfect for AI responses)
- ✅ Works with existing HTTP infrastructure
Cons:
- ❌ One-way communication only
- ❌ No built-in iOS support (need library)
Option 2: WebSocket
iOS App                          Backend                     OpenAI API
   │                                │                            │
   │ WS Connect                     │                            │
   │───────────────────────────────▶│                            │
   │                                │                            │
   │ {"type":"message","text":"..."}│                            │
   │───────────────────────────────▶│                            │
   │                                │ POST /chat/completions     │
   │                                │───────────────────────────▶│
   │                                │                            │
   │◀───────────────────────────────│◀───────────────────────────│
   │ {"type":"chunk","content":"Hello"}                          │
   │                                │                            │
   │◀───────────────────────────────│◀───────────────────────────│
   │ {"type":"chunk","content":" world"}                         │
   │                                │                            │
   │◀───────────────────────────────│◀───────────────────────────│
   │ {"type":"done"}                │                            │
   │                                │                            │
   │ WS stays open for next msg     │                            │
Implementation:
class WebSocketService {
private var webSocket: URLSessionWebSocketTask?
func connect() {
let session = URLSession(configuration: .default)
webSocket = session.webSocketTask(with: URL(string: "\(wsURL)/chat")!)
webSocket?.resume()
receiveMessage()
}
func sendMessage(_ text: String, conversationId: String) {
let message: [String: Any] = [
"type": "message",
"text": text,
"conversation_id": conversationId
]
if let data = try? JSONSerialization.data(withJSONObject: message),
let string = String(data: data, encoding: .utf8) {
webSocket?.send(.string(string)) { error in
if let error = error {
print("Send error: \(error)")
}
}
}
}
private func receiveMessage() {
webSocket?.receive { [weak self] result in
switch result {
case .success(let message):
switch message {
case .string(let text):
self?.handleMessage(text)
case .data:
    // Binary frames are not used by this protocol
    break
@unknown default:
break
}
// Continue receiving
self?.receiveMessage()
case .failure(let error):
print("Receive error: \(error)")
}
}
}
private func handleMessage(_ text: String) {
guard let data = text.data(using: .utf8),
let json = try? JSONDecoder().decode(WSMessage.self, from: data) else {
return
}
switch json.type {
case "chunk":
onChunkReceived?(json.content ?? "")
case "done":
onStreamComplete?()
case "error":
onError?(NSError(domain: "WS", code: -1, userInfo: [NSLocalizedDescriptionKey: json.content ?? ""]))
default:
break
}
}
var onChunkReceived: ((String) -> Void)?
var onStreamComplete: (() -> Void)?
var onError: ((Error) -> Void)?
}
struct WSMessage: Codable {
let type: String
let content: String?
}
Pros:
- ✅ Bi-directional communication
- ✅ Native iOS support (URLSessionWebSocketTask)
- ✅ Persistent connection (better for multiple messages)
- ✅ Can send interruptions/cancellations
Cons:
- ❌ More complex to implement
- ❌ Requires backend WebSocket support
- ❌ Connection management overhead
📊 SSE vs WebSocket Comparison
| Feature | SSE | WebSocket |
|---|---|---|
| Complexity | Simple (HTTP) | Complex (custom protocol) |
| iOS Support | Library needed | Native URLSession |
| Direction | Server→Client only | Bi-directional |
| Reconnection | Automatic | Manual |
| Use Case | Streaming responses | Real-time chat, interruptions |
| OpenAI API | Directly supported | Needs proxy conversion |
| Best For | ChatGPT-style apps | Collaborative editing |
💡 Recommendation: Use SSE for ChatGPT-style apps. It's simpler, matches OpenAI's API, and handles 99% of use cases. Only use WebSocket if you need one of the following (see the cancel-frame sketch after this list):
- User to interrupt AI mid-response
- AI to ask follow-up questions
- Real-time collaboration features
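For the interruption case, a cancel frame could ride on the WebSocketService from Option 2. A sketch, assuming the backend understands a "cancel" message type (an application-level convention, not part of the WebSocket standard) and that this extension sits in the same file as WebSocketService so it can reach the private socket:
extension WebSocketService {
    // Hypothetical interrupt: the backend is assumed to abort the upstream
    // OpenAI request when it receives a "cancel" frame for this conversation
    func interrupt(conversationId: String) {
        let message: [String: Any] = [
            "type": "cancel",
            "conversation_id": conversationId
        ]
        if let data = try? JSONSerialization.data(withJSONObject: message),
           let string = String(data: data, encoding: .utf8) {
            webSocket?.send(.string(string)) { error in
                if let error = error {
                    print("Cancel failed: \(error)")
                }
            }
        }
    }
}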
🎨 STEP 4: Handling Multiple Concurrent Conversations
Problem: User has 5 conversations open, switches between them rapidly. How do you prevent:
- Response mixing (conversation A getting conversation B's answer)
- Memory leaks from abandoned streams
- Race conditions
Solution: Conversation Queue Manager
class ConversationManager {
// Current active conversation
@Published private(set) var activeConversationId: String?
// Streaming state per conversation
private var streamingStates: [String: StreamingState] = [:]
// Underlying SSE client (from STEP 3)
private let streamingService = StreamingService()
// Queue for pending requests
private let requestQueue = DispatchQueue(label: "com.app.conversation", qos: .userInitiated)
struct StreamingState {
var eventSource: EventSource?
var partialMessage: String = ""
var isActive: Bool = false
}
func sendMessage(
_ text: String,
in conversationId: String,
onChunk: @escaping (String) -> Void,
onComplete: @escaping (String) -> Void
) {
requestQueue.async { [weak self] in
guard let self = self else { return }
// Cancel any existing stream for this conversation
self.cancelStream(for: conversationId)
// Create new streaming state
var state = StreamingState()
state.isActive = true
self.streamingStates[conversationId] = state
// Start streaming (streamMessage is assumed here to return its
// EventSource handle so the manager can cancel it later, a small
// variation on the STEP 3 signature)
let eventSource = self.streamingService.streamMessage(
message: text,
conversationId: conversationId,
onChunk: { [weak self] chunk in
guard let self = self else { return }
// Only process if this conversation is still active
guard var state = self.streamingStates[conversationId],
state.isActive else {
return
}
// Append chunk
state.partialMessage += chunk
self.streamingStates[conversationId] = state
// Update UI
DispatchQueue.main.async {
onChunk(chunk)
}
},
onComplete: { [weak self] in
guard let self = self else { return }
if let state = self.streamingStates[conversationId] {
// Save complete message
self.saveMessage(state.partialMessage, in: conversationId)
DispatchQueue.main.async {
onComplete(state.partialMessage)
}
}
// Clean up
self.streamingStates[conversationId] = nil
},
onError: { [weak self] error in
// Handle error, clean up
self?.streamingStates[conversationId] = nil
}
)
// Store event source
self.streamingStates[conversationId]?.eventSource = eventSource
}
}
func switchToConversation(_ conversationId: String) {
requestQueue.async { [weak self] in
guard let self = self else { return }
// Cancel stream for previous conversation
if let previousId = self.activeConversationId,
previousId != conversationId {
self.cancelStream(for: previousId, savingPartial: true)
}
// Set new active conversation (@Published, so publish on the main queue)
DispatchQueue.main.async {
    self.activeConversationId = conversationId
}
}
}
private func cancelStream(for conversationId: String, savingPartial: Bool = false) {
guard var state = streamingStates[conversationId] else { return }
// Close SSE connection
state.eventSource?.disconnect()
state.isActive = false
// Optionally save partial response
if savingPartial && !state.partialMessage.isEmpty {
saveMessage(state.partialMessage + " [interrupted]", in: conversationId)
}
// Clean up
streamingStates[conversationId] = nil
}
private func saveMessage(_ text: String, in conversationId: String) {
// Save to CoreData/Realm
}
}
Key techniques:
- ✅ Conversation-scoped state: each conversation has its own StreamingState
- ✅ Active flag: only process chunks if the conversation is still active
- ✅ Queue serialization: all operations run on a serial queue to prevent race conditions
- ✅ Graceful cancellation: save partial responses when the user switches away
- ✅ Memory cleanup: nil out references when done
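A usage sketch of the manager in the switch-while-streaming scenario (conversation IDs are illustrative):
let manager = ConversationManager()
// User sends a message in conversation A...
manager.sendMessage(
    "Summarize this article",
    in: "conv-A",
    onChunk: { chunk in /* append to conv-A's visible message */ },
    onComplete: { fullText in /* mark conv-A's message complete */ }
)
// ...then taps conversation B mid-stream:
manager.switchToConversation("conv-B")
// conv-A's SSE connection is cancelled, its partial text is saved with an
// "[interrupted]" suffix, and conv-B becomes the active conversation.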
📸 STEP 5: Multimodal Support (Images/Videos)
Challenge: User uploads 10MB image. How do you handle it efficiently?
Image Upload Flow:
class MediaUploadService {
func uploadImage(
_ image: UIImage,
quality: CompressionQuality = .medium,
onProgress: @escaping (Double) -> Void,
onComplete: @escaping (String) -> Void // Returns URL
) async throws {
// 1. Compress image
let compressedData = try await compressImage(image, quality: quality)
// 2. Generate presigned URL
let presignedURL = try await getPresignedUploadURL()
// 3. Upload to S3/Cloud Storage
try await uploadData(compressedData, to: presignedURL, onProgress: onProgress)
// 4. Return permanent URL
let mediaURL = try await confirmUpload(presignedURL)
onComplete(mediaURL)
}
private func compressImage(_ image: UIImage, quality: CompressionQuality) async throws -> Data {
return try await withCheckedThrowingContinuation { continuation in
DispatchQueue.global(qos: .userInitiated).async {
// Resize to max dimension
let maxDimension: CGFloat = quality == .high ? 2048 : 1024
let resized = image.resized(maxDimension: maxDimension)
// Convert to HEIC (smaller than JPEG); the resized/heicData helpers are sketched below
guard let data = resized.heicData(compressionQuality: quality.value) else {
continuation.resume(throwing: MediaError.compressionFailed)
return
}
continuation.resume(returning: data)
}
}
}
private func uploadData(_ data: Data, to url: URL, onProgress: @escaping (Double) -> Void) async throws {
var request = URLRequest(url: url)
request.httpMethod = "PUT"
request.setValue("image/heic", forHTTPHeaderField: "Content-Type")
// Use upload task with progress tracking (ProgressDelegate is an assumed
// URLSessionTaskDelegate wrapper that forwards progress callbacks)
let (_, response) = try await URLSession.shared.upload(for: request, from: data, delegate: ProgressDelegate(onProgress))
guard let httpResponse = response as? HTTPURLResponse,
(200...299).contains(httpResponse.statusCode) else {
throw MediaError.uploadFailed
}
}
}
enum CompressionQuality {
case low, medium, high
var value: CGFloat {
switch self {
case .low: return 0.3
case .medium: return 0.6
case .high: return 0.8
}
}
}
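The compression step above leans on two UIImage helpers that aren't in the snippet. A minimal sketch (HEIC encoding via ImageIO, with a JPEG fallback; treat this as one reasonable implementation, not the only one):
import UIKit
import ImageIO
import UniformTypeIdentifiers
extension UIImage {
    // Scale so the longest side is at most maxDimension, preserving aspect ratio
    func resized(maxDimension: CGFloat) -> UIImage {
        let longestSide = max(size.width, size.height)
        guard longestSide > maxDimension else { return self }
        let scale = maxDimension / longestSide
        let newSize = CGSize(width: size.width * scale, height: size.height * scale)
        return UIGraphicsImageRenderer(size: newSize).image { _ in
            draw(in: CGRect(origin: .zero, size: newSize))
        }
    }
    // Encode as HEIC via ImageIO; fall back to JPEG if a destination can't be created
    func heicData(compressionQuality: CGFloat) -> Data? {
        guard let cgImage = cgImage else { return nil }
        let data = NSMutableData()
        guard let destination = CGImageDestinationCreateWithData(
            data as CFMutableData, UTType.heic.identifier as CFString, 1, nil
        ) else {
            return jpegData(compressionQuality: compressionQuality)
        }
        let options = [kCGImageDestinationLossyCompressionQuality: compressionQuality] as CFDictionary
        CGImageDestinationAddImage(destination, cgImage, options)
        return CGImageDestinationFinalize(destination) ? (data as Data) : nil
    }
}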
Sending Multimodal Request to OpenAI:
// OpenAI Vision API Request (Encodable only; we build these, we never decode them)
struct MultimodalRequest: Encodable {
    let model = "gpt-4-vision-preview"
    let messages: [Message]
    let maxTokens: Int = 4096
    let stream: Bool = true
    enum CodingKeys: String, CodingKey {
        case model, messages
        case maxTokens = "max_tokens"
        case stream
    }
}
struct Message: Encodable {
    let role: String
    let content: [Content]
}
enum Content: Encodable {
    case text(String)
    case imageURL(String)
    private enum CodingKeys: String, CodingKey {
        case type, text
        case imageURL = "image_url"
    }
    private struct ImagePayload: Encodable {
        let url: String
    }
    // A keyed container is used because a heterogeneous dictionary like
    // ["type": "image_url", "image_url": ["url": url]] cannot be encoded directly
    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .text(let text):
            try container.encode("text", forKey: .type)
            try container.encode(text, forKey: .text)
        case .imageURL(let url):
            try container.encode("image_url", forKey: .type)
            try container.encode(ImagePayload(url: url), forKey: .imageURL)
        }
    }
}
// Usage
let request = MultimodalRequest(
messages: [
Message(role: "user", content: [
.text("What's in this image?"),
.imageURL("https://cdn.example.com/image.jpg")
])
]
)
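Encoding that request with JSONEncoder yields the nested payload shape the Vision API expects (output reformatted here for readability; key order may vary):
let payload = try JSONEncoder().encode(request)
print(String(data: payload, encoding: .utf8)!)
// {"model":"gpt-4-vision-preview",
//  "messages":[{"role":"user","content":[
//    {"type":"text","text":"What's in this image?"},
//    {"type":"image_url","image_url":{"url":"https://cdn.example.com/image.jpg"}}]}],
//  "max_tokens":4096,"stream":true}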
🌐 STEP 6: Real-World APIs & Integration
Option 1: OpenAI API (Most Popular)
struct OpenAIConfig {
static let baseURL = "https://api.openai.com/v1"
static let models = [
"gpt-4-turbo": (inputCost: 0.01, outputCost: 0.03), // per 1K tokens
"gpt-4": (inputCost: 0.03, outputCost: 0.06),
"gpt-3.5-turbo": (inputCost: 0.0005, outputCost: 0.0015)
]
// Rate Limits (as of Oct 2025)
static let rateLimit = 3500 // requests per minute (GPT-4)
static let tokenLimit = 150_000 // tokens per minute
}
class OpenAIService {
func chat(
messages: [Message],
model: String = "gpt-4-turbo",
stream: Bool = true
) async throws -> AsyncThrowingStream<String, Error> {
var request = URLRequest(url: URL(string: "\(OpenAIConfig.baseURL)/chat/completions")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
let body: [String: Any] = [
"model": model,
"messages": messages.map { $0.toDictionary() },
"stream": stream,
"max_tokens": 4096
]
request.httpBody = try JSONSerialization.data(withJSONObject: body)
return AsyncThrowingStream { continuation in
let eventSource = EventSource(request: request)
eventSource.onMessage { event in
if event.data == "[DONE]" {
continuation.finish()
return
}
// Parse and yield chunk (parseChunk is assumed to decode a StreamChunk,
// as in STEP 3, and return its delta content)
if let chunk = self.parseChunk(event.data) {
continuation.yield(chunk)
}
}
eventSource.onError { error in
continuation.finish(throwing: error)
}
eventSource.connect()
}
}
}
Pros:
- Most comprehensive API
- Best model quality
- Vision, DALL-E, TTS all in one
Cons:
- Expensive ($0.03/1K tokens for GPT-4)
- Rate limits can be restrictive
- No guaranteed uptime SLA
Option 2: Anthropic Claude API
struct ClaudeConfig {
static let baseURL = "https://api.anthropic.com/v1"
// Input cost per 1K tokens (plain Doubles; single-element labeled tuples don't compile)
static let models: [String: Double] = [
    "claude-3-opus": 0.015,
    "claude-3-sonnet": 0.003,
    "claude-3-haiku": 0.00025
]
}
Pros:
- Longer context window (200K tokens!)
- Often more factual than GPT
- Better at following instructions
Cons:
- No image generation
- Smaller ecosystem
- Fewer integrations
Option 3: Google Gemini API
struct GeminiConfig {
static let baseURL = "https://generativelanguage.googleapis.com/v1"
static let models: [String: Double] = [
    "gemini-1.5-pro": 0.0035 // per 1K tokens, multimodal included!
]
}
Pros:
- Free tier (60 requests/minute)
- Native multimodal (text, images, video, audio)
- Extremely long context (1M tokens)
Cons:
- New API, less stable
- Limited to Google Cloud
- More aggressive content filtering
🔑 Key Implementation Patterns
1. Token Counting (Cost Management)
class TokenCounter {
// Approximate token count (1 token ≈ 4 chars)
func estimateTokens(_ text: String) -> Int {
return text.count / 4
}
// More accurate counting using a tiktoken port (the Tiktoken API shown is
// illustrative; Swift ports vary)
func countTokens(_ text: String, model: String = "gpt-4") -> Int {
    let encoder = Tiktoken.encoder(for: model)
    return encoder.encode(text).count
}
func estimateCost(inputTokens: Int, outputTokens: Int, model: String) -> Decimal {
guard let pricing = OpenAIConfig.models[model] else { return 0 }
let inputCost = Decimal(inputTokens) / 1000 * Decimal(pricing.inputCost)
let outputCost = Decimal(outputTokens) / 1000 * Decimal(pricing.outputCost)
return inputCost + outputCost
}
}
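A worked example against the OpenAIConfig pricing above:
// 1,000 input tokens:  1,000 / 1,000 * $0.03 = $0.03
//   500 output tokens:   500 / 1,000 * $0.06 = $0.03
let counter = TokenCounter()
let cost = counter.estimateCost(inputTokens: 1_000, outputTokens: 500, model: "gpt-4")
print(cost) // ≈ 0.06 (dollars)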
2. Context Window Management
// Uses the simplified role/text Message shape (not the multimodal one above)
class ContextManager {
    private let tokenCounter = TokenCounter()
    // Conservative prompt budget, well below GPT-4 Turbo's actual context
    // window; smaller prompts are cheaper and faster
    let maxTokens = 8000
    let reservedForResponse = 2000
func trimMessages(_ messages: [Message]) -> [Message] {
var trimmed: [Message] = []
var totalTokens = 0
// Always keep system message
if let systemMsg = messages.first(where: { $0.role == "system" }) {
trimmed.append(systemMsg)
totalTokens += tokenCounter.countTokens(systemMsg.text)
}
// Add messages from most recent, working backwards
for message in messages.reversed() where message.role != "system" {
let tokens = tokenCounter.countTokens(message.text)
if totalTokens + tokens > maxTokens - reservedForResponse {
break // Stop adding messages
}
trimmed.insert(message, at: 0) // Add to beginning
totalTokens += tokens
}
return trimmed
}
}
3. Error Handling & Retry Logic
class AIService {
    private let contextManager = ContextManager()
    func chat(messages: [Message]) async throws -> String {
var retryCount = 0
let maxRetries = 3
while retryCount < maxRetries {
do {
return try await performChat(messages)
} catch let error as AIError {
switch error {
case .rateLimitExceeded:
// Exponential backoff
let delay = pow(2.0, Double(retryCount))
try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
retryCount += 1
case .modelOverloaded:
// Try with different model
return try await performChat(messages, model: "gpt-3.5-turbo")
case .contextLengthExceeded:
// Trim context and retry
let trimmed = contextManager.trimMessages(messages)
return try await performChat(trimmed)
default:
throw error
}
}
}
throw AIError.maxRetriesExceeded
}
}
enum AIError: Error {
case rateLimitExceeded
case modelOverloaded
case contextLengthExceeded
case maxRetriesExceeded
case invalidAPIKey
case networkError
}
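These cases have to be produced somewhere. A sketch of mapping HTTP responses onto AIError, assuming OpenAI-style status semantics (429 for rate limits, 401 for a bad key, a context_length_exceeded code in 400 bodies):
func mapError(statusCode: Int, body: Data?) -> AIError {
    switch statusCode {
    case 401:
        return .invalidAPIKey
    case 429:
        return .rateLimitExceeded
    case 500...599:
        return .modelOverloaded // treat server-side failures as overload and retry/fall back
    case 400:
        if let body = body,
           let text = String(data: body, encoding: .utf8),
           text.contains("context_length_exceeded") {
            return .contextLengthExceeded
        }
        return .networkError
    default:
        return .networkError
    }
}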
🚀 Performance Optimization
1. Streaming Buffer Management
class StreamingTextView: UIView {
    private let label = UILabel()
    private var displayBuffer = ""
    private var incomingBuffer = ""
    private var displayTimer: Timer?
    override init(frame: CGRect) {
        super.init(frame: frame)
        label.numberOfLines = 0
        label.frame = bounds
        label.autoresizingMask = [.flexibleWidth, .flexibleHeight]
        addSubview(label)
    }
    required init?(coder: NSCoder) { fatalError("init(coder:) has not been implemented") }
func appendChunk(_ text: String) {
incomingBuffer += text
// Start display timer if not running
if displayTimer == nil {
displayTimer = Timer.scheduledTimer(withTimeInterval: 0.05, repeats: true) { [weak self] _ in
self?.flushBuffer()
}
}
}
private func flushBuffer() {
    guard !incomingBuffer.isEmpty else {
        // Nothing pending; pause the timer until the next chunk arrives
        displayTimer?.invalidate()
        displayTimer = nil
        return
    }
// Take chunk from buffer
let chunkSize = 5 // characters per frame
let endIndex = incomingBuffer.index(
incomingBuffer.startIndex,
offsetBy: min(chunkSize, incomingBuffer.count)
)
let chunk = String(incomingBuffer[..<endIndex])
incomingBuffer.removeSubrange(..<endIndex)
displayBuffer += chunk
label.text = displayBuffer
// Stop timer if buffer empty
if incomingBuffer.isEmpty {
displayTimer?.invalidate()
displayTimer = nil
}
}
}
2. Memory Management for Long Conversations
// diskStorage and memoryCache are assumed helpers (e.g. a file-backed archive and an NSCache wrapper)
class ConversationStorage {
func saveConversation(_ messages: [Message], id: String) {
// Only keep last 50 messages in memory
let recent = Array(messages.suffix(50))
// Save older messages to disk
if messages.count > 50 {
let archived = Array(messages.prefix(messages.count - 50))
diskStorage.archive(archived, for: id)
}
memoryCache.store(recent, for: id)
}
func loadConversation(id: String) -> [Message] {
// Load recent from memory
let recent = memoryCache.load(id) ?? []
// Load archived from disk if needed
let archived = diskStorage.loadArchived(for: id)
return archived + recent
}
}
🎯 Interview Discussion Points
Q: Why SSE over WebSocket?
"SSE is simpler for one-way streaming. OpenAI's API natively supports SSE. WebSocket adds complexity without benefit for this use case. We'd only use WebSocket if we needed bidirectional communication, like letting users interrupt mid-response."
Q: How do you handle rate limits?
"Implement exponential backoff, queue requests, and show user-friendly errors. For production, use a backend proxy that tracks rate limits across all users and queues requests intelligently. Consider caching common responses."
Q: What if user sends message while previous is streaming?
"Queue the new request. Cancel current stream, save partial response, then start new stream. Alternatively, show 'AI is responding…' and disable input until complete."
Q: How do you prevent response mixing?
"Each conversation has a unique ID. StreamingService checks the ID before processing chunks. If the user switches conversations, we cancel the EventSource, nil out references, and start fresh for the new conversation."
Q: Cost optimization strategies?
"Use GPT-3.5 for simple queries, GPT-4 for complex. Cache responses. Trim context aggressively. Show token count before sending. Offer monthly limits. Use cheaper models (Claude Haiku, Gemini) for non-critical features."
📈 Scaling Considerations
For 1M Users:
- Backend proxy with rate limiting and caching
- Conversation history in distributed database (Cassandra, DynamoDB)
- Media uploads to S3/CloudFront
- Redis for active conversation state
- Message queue (SQS, RabbitMQ) for request buffering
Cost Estimates:
- GPT-4: $0.03 input + $0.06 output per 1K tokens
- Average conversation: ~5K input + ~5K output tokens ≈ 5 × ($0.03 + $0.06) = $0.45
- 1M conversations/day = $450K/day 😱
- Solution: use cheaper models, caching, and context trimming
Remember: In system design interviews, start with diagrams, explain trade-offs, and show you understand real-world constraints like cost and rate limits! 🚀