Introduction: Your Personal AI Sidekick
Welcome to Project 3! In this exciting chapter, we’re going to dive deep into building a modern, interactive AI-powered assistant app for iOS. Think of it like creating your own personalized Siri or ChatGPT experience, right on your iPhone. This isn’t just about making a simple app; it’s about integrating cutting-edge artificial intelligence capabilities directly into your user experience.
We’ll explore how to enable your app to “listen” to user commands using speech recognition, “think” by interacting with an AI model (both conceptually and with a mock service, laying the groundwork for real API integration), and “speak” back to the user with synthesized voice. A key focus will be on creating a dynamic, streaming user interface that updates in real-time as the AI generates its response, providing a fluid and engaging interaction.
This project will solidify your understanding of advanced SwiftUI, modern Swift concurrency with async/await, integrating system frameworks like Speech and AVFoundation, and designing a responsive application capable of handling complex asynchronous operations. Before we begin, make sure you’re comfortable with SwiftUI fundamentals, networking concepts, and Swift’s concurrency model covered in previous chapters. Let’s make some AI magic happen!
Core Concepts: Bringing AI to Life on iOS
Building an AI assistant involves orchestrating several powerful technologies. Let’s break down the core concepts we’ll be working with.
1. AI Integration Strategies: On-Device vs. Cloud
When bringing AI into your app, you generally have two main approaches:
On-Device AI (e.g., Core ML, Natural Language Framework):
- What it is: Running pre-trained machine learning models directly on the user’s device. Apple provides powerful frameworks like Core ML for integrating custom models and the Natural Language framework for tasks like text classification, sentiment analysis, and named entity recognition.
- Why it’s important: Offers privacy (data never leaves the device), speed (no network latency), and offline functionality.
- How it functions: You convert a trained model (from TensorFlow, PyTorch, etc.) into the Core ML model format (.mlmodel) and bundle it with your app. Your app then uses the Core ML framework to run inference on this model.
- Modern Best Practice (2026): For tasks such as basic natural language processing, sentiment analysis, or image recognition, on-device AI is often preferred for its privacy and responsiveness. For this project, we’ll focus on the interaction flow, but keep Core ML in mind for future enhancements.
Cloud-Based AI (API Calls):
- What it is: Sending user input to a powerful AI model hosted on a remote server (e.g., OpenAI’s GPT models, Google Gemini, Anthropic Claude) via a network API call.
- Why it’s important: Provides access to the most advanced and general-purpose AI models, capable of complex reasoning, content generation, and multi-turn conversations without bundling large models with your app.
- How it functions: Your app makes an HTTP request to the AI provider’s API endpoint, sending the user’s query. The server processes it and sends back the AI’s response.
- Modern Best Practice (2026): For conversational AI, sophisticated content generation, or tasks requiring up-to-date world knowledge, cloud-based AI is the go-to. We’ll simulate this interaction initially, setting the stage for real API integration.
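To make the cloud path concrete, here is a sketch of how an app might build the HTTP request for a chat-style API. The endpoint URL, JSON shape, and model name are placeholders, not any provider’s real contract — check your provider’s documentation before wiring this up:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLRequest lives here on non-Apple platforms
#endif

// Hypothetical request body for a chat-style AI API.
struct ChatRequest: Encodable {
    let model: String
    let prompt: String
}

// Build a POST request for a placeholder endpoint.
// Real providers differ in URL, auth header, and JSON schema.
func makeChatRequest(prompt: String, apiKey: String) throws -> URLRequest {
    let url = URL(string: "https://api.example.com/v1/chat")! // placeholder endpoint
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.httpBody = try JSONEncoder().encode(ChatRequest(model: "example-model", prompt: prompt))
    return request
}
```

You would then send it with `URLSession.shared.data(for:)` inside an async function and decode the provider’s response.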
For our project, we’ll primarily focus on the interaction flow that would work with either a local or cloud AI, using a mock service to simulate the AI’s responses. This allows us to build the UI and core logic without needing API keys or complex backend setup initially.
2. Speech Recognition: From Voice to Text
The Speech framework allows your iOS app to convert spoken audio into text. It’s an incredibly powerful tool for creating hands-free interfaces.
- What it is: Apple’s framework for transcribing speech.
- Why it’s important: Enables voice commands, dictation, and natural language interaction without typing.
- How it functions: You request microphone and speech recognition permissions from the user. Then, you create an SFSpeechRecognizer and an SFSpeechAudioBufferRecognitionRequest to process audio from the microphone. The recognizer continuously provides partial and final transcriptions as the user speaks.
3. Text-to-Speech: Giving Your App a Voice
The AVFoundation framework provides AVSpeechSynthesizer for converting text into spoken audio. This is how our AI assistant will “talk” back to the user.
- What it is: Apple’s framework for synthesizing speech from text.
- Why it’s important: Enhances user experience by providing auditory feedback, making the assistant feel more alive and accessible.
- How it functions: You create an AVSpeechSynthesizer instance and an AVSpeechUtterance containing the text to be spoken and desired voice settings (language, pitch, rate). The synthesizer then speaks the utterance.
4. Streaming UI for Dynamic Responses
Traditional API calls often return a complete response all at once. However, modern AI experiences (like ChatGPT) often stream responses, showing text word-by-word as it’s generated. This makes the interaction feel much faster and more engaging.
- What it is: Updating the user interface incrementally as data arrives, rather than waiting for a full response.
- Why it’s important: Improves perceived performance and user engagement, especially for AI responses that can take several seconds to generate.
- How it functions: We’ll leverage Swift’s AsyncSequence (or Combine if preferred, though AsyncSequence is the modern choice for async/await flows) to process chunks of text as they arrive from our simulated AI service. The UI will then append these chunks to the displayed message.
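The idea can be sketched with a plain AsyncStream before any UI is involved. The word-by-word splitting and the delay below are our stand-in for a real streaming API, not part of any framework:

```swift
import Foundation

// Simulated streaming source: yields a response word by word,
// roughly the way a streaming AI API delivers tokens.
func wordStream(for response: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        Task {
            for (index, word) in response.split(separator: " ").enumerated() {
                // Re-insert the separating space so the chunks re-join cleanly.
                continuation.yield(index == 0 ? String(word) : " " + word)
                try? await Task.sleep(nanoseconds: 50_000_000) // 50 ms per word
            }
            continuation.finish() // No more chunks
        }
    }
}
```

A consumer appends each chunk as it arrives — exactly what our chat view will do later: `for await chunk in wordStream(for: reply) { displayedText += chunk }`.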
5. Modern Concurrency with async/await
All these operations – speech recognition, AI service calls, text-to-speech – are asynchronous and can be long-running. Swift’s async/await syntax is perfect for managing these tasks cleanly and efficiently.
- What it is: Swift’s structured concurrency model introduced in Swift 5.5 and refined in Swift 6.
- Why it’s important: Prevents UI freezes, makes asynchronous code easier to read and write, and helps avoid common concurrency bugs like race conditions.
- How it functions: We’ll use Task to initiate concurrent operations and await to pause execution until an asynchronous result is available. This ensures our app remains responsive while performing intensive background work.
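As a minimal sketch — `fetchAnswer` here is a stand-in for any slow operation, such as a network call:

```swift
import Foundation

// A pretend long-running operation: an async function that suspends
// instead of blocking the calling thread.
func fetchAnswer(to question: String) async -> String {
    try? await Task.sleep(nanoseconds: 100_000_000) // simulate 0.1 s of latency
    return "Echo: \(question)"
}

// Callers launch it in a Task and await the result; the calling
// thread (e.g. the UI) stays free while the work is in flight.
func askInBackground() {
    Task {
        let answer = await fetchAnswer(to: "What's the weather?")
        print(answer) // in a real app, update UI state here
    }
}
```

Note that `await` marks a suspension point, not a blocking wait — the thread is free to do other work until the result arrives.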
6. Permissions
For speech recognition, your app needs explicit user permission to access the microphone and recognize speech. You’ll declare these in your Info.plist file.
- NSSpeechRecognitionUsageDescription: Explains why your app needs speech recognition.
- NSMicrophoneUsageDescription: Explains why your app needs microphone access.
Data Flow Diagram
Let’s visualize how these components interact within our AI assistant app:

User speaks → Speech framework transcribes audio to text → text is sent to the AI service (mock now, real API later) → the AI streams its response chunk by chunk → SwiftUI appends each chunk to the chat → AVSpeechSynthesizer speaks the finished response.
Step-by-Step Implementation: Building Your Assistant
Let’s start building our AI assistant app. We’ll use Xcode 17.x (or later, supporting Swift 6.1.3+) and target iOS 17.0+.
Step 1: Project Setup and Permissions
Create a New Xcode Project:
- Open Xcode (version 17.x or later, which supports Swift 6.1.3 as of 2026-02-26).
- Go to File > New > Project...
- Select iOS > App and click Next.
- Product Name: AIAssistant
- Interface: SwiftUI
- Language: Swift
- Storage: None
- Click Next and choose a location to save your project.
Configure Permissions in Info.plist: Your app needs to declare its intent to use the microphone and speech recognition.
- In the Project Navigator, select your project, then your target (AIAssistant).
- Go to the Info tab.
- Add two new rows (by clicking the + button next to any existing row):
  - Privacy - Speech Recognition Usage Description: We use speech recognition to transcribe your voice commands for the AI assistant.
  - Privacy - Microphone Usage Description: We need microphone access to record your voice for speech recognition.
These descriptions are crucial; without them, your app will crash when trying to access these features.
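If you prefer editing the raw source of Info.plist (right-click the file and choose Open As > Source Code), the two entries look like this. The usage strings are the same examples from above and can be tailored to your app:

```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>We use speech recognition to transcribe your voice commands for the AI assistant.</string>
<key>NSMicrophoneUsageDescription</key>
<string>We need microphone access to record your voice for speech recognition.</string>
```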
Step 2: Basic Chat UI
Let’s create a simple chat interface to display messages and an input area.
Open ContentView.swift. We’ll start by defining a Message struct and a simple list to display them.
// ContentView.swift
import SwiftUI
// 1. Define a simple Message struct
struct Message: Identifiable, Equatable { // Equatable helps SwiftUI diff messages efficiently
    let id: UUID
    let text: String
    let isUser: Bool // True for user, false for AI

    // id can be supplied explicitly when we need a stable identity
    // (we'll rely on this later when streaming AI responses).
    init(id: UUID = UUID(), text: String, isUser: Bool) {
        self.id = id
        self.text = text
        self.isUser = isUser
    }
}
struct ContentView: View {
// 2. State variable to hold our chat messages
@State private var messages: [Message] = [
Message(text: "Hello! How can I help you today?", isUser: false)
]
// 3. State variable for the user's current input
@State private var userInput: String = ""
var body: some View {
NavigationView { // 4. Embed in NavigationView for title
VStack {
// 5. Scrollable list of messages
ScrollView {
VStack(alignment: .leading, spacing: 10) {
ForEach(messages) { message in
HStack {
if message.isUser {
Spacer()
}
Text(message.text)
.padding(10)
.background(message.isUser ? Color.blue.opacity(0.8) : Color.gray.opacity(0.2))
.foregroundColor(message.isUser ? .white : .primary)
.cornerRadius(10)
if !message.isUser {
Spacer()
}
}
}
}
.padding()
}
// 6. Input field and send button
HStack {
TextField("Type your message...", text: $userInput)
.textFieldStyle(RoundedBorderTextFieldStyle())
.padding(.horizontal)
Button("Send") {
sendMessage()
}
.padding(.trailing)
.disabled(userInput.isEmpty) // Disable if input is empty
}
.padding(.bottom)
}
.navigationTitle("AI Assistant")
}
}
// 7. Function to handle sending a message
private func sendMessage() {
guard !userInput.isEmpty else { return }
messages.append(Message(text: userInput, isUser: true))
// Here, we would usually send userInput to our AI service
// For now, let's just simulate an AI response
simulateAIResponse(for: userInput)
userInput = "" // Clear input field
}
// 8. Placeholder for simulating AI response
private func simulateAIResponse(for input: String) {
let aiResponse = "I received your message: \"\(input)\". I am still learning, but I'm here to help!"
messages.append(Message(text: aiResponse, isUser: false))
}
}
struct ContentView_Previews: PreviewProvider {
static var previews: some View {
ContentView()
}
}
Explanation:
- We define a Message struct to represent each chat bubble, with an id for ForEach, text content, and an isUser flag to differentiate the sender. We also added Equatable for potential performance benefits with SwiftUI’s diffing algorithm.
- @State private var messages holds our chat history, initialized with a welcome message from the AI.
- @State private var userInput stores the text the user is currently typing.
- The NavigationView provides a title for our app.
- A ScrollView contains a VStack that lays out our Message views. We use HStack and Spacer to align user messages to the right and AI messages to the left.
- The HStack at the bottom contains a TextField for user input and a Button to send the message. The button is disabled if userInput is empty.
- sendMessage() adds the user’s input to the messages array and clears the input field.
- simulateAIResponse() is a placeholder that currently just echoes the user’s message. We’ll replace this with actual AI interaction soon.
Run the app in the simulator. You should see a basic chat interface where you can type and send messages, and the “AI” will echo them back.
Step 3: Integrating Speech Recognition
Now, let’s add the ability to speak to our assistant. We’ll create a dedicated class to manage speech recognition.
Create SpeechRecognizer.swift: Create a new Swift file named SpeechRecognizer.swift.

// SpeechRecognizer.swift
import Speech
import AVFoundation // For AVAudioEngine and AVAudioSession
import Foundation
import Combine // For error handling and publishing results

// 1. Define an error type for speech recognition
enum SpeechRecognizerError: Error, Identifiable {
    var id: String { localizedDescription }

    case authorizationDenied
    case restricted
    case notDetermined
    case unknown
    case recognitionFailed(Error)
    case audioEngineFailed(Error)

    var localizedDescription: String {
        switch self {
        case .authorizationDenied: return "Speech recognition authorization denied."
        case .restricted: return "Speech recognition restricted on this device."
        case .notDetermined: return "Speech recognition authorization not determined."
        case .unknown: return "An unknown speech recognition error occurred."
        case .recognitionFailed(let error): return "Recognition failed: \(error.localizedDescription)"
        case .audioEngineFailed(let error): return "Audio engine failed: \(error.localizedDescription)"
        }
    }
}

// 2. SpeechRecognizer class
class SpeechRecognizer: ObservableObject {
    // Publishers for recognized text and potential errors
    @Published var recognizedText: String = ""
    @Published var isRecording: Bool = false
    @Published var error: SpeechRecognizerError?

    // 3. Initialize with locale (e.g., "en-US" for US English)
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine() // 4. Audio engine for recording

    // 5. Request authorization using modern async/await
    func requestAuthorization() async {
        return await withCheckedContinuation { continuation in
            SFSpeechRecognizer.requestAuthorization { authStatus in
                DispatchQueue.main.async { // Ensure UI updates are on the main thread
                    switch authStatus {
                    case .authorized:
                        print("Speech recognition authorized.")
                        self.error = nil // Clear any previous error
                    case .denied:
                        self.error = .authorizationDenied
                        print("Speech recognition authorization denied.")
                    case .restricted:
                        self.error = .restricted
                        print("Speech recognition restricted on this device.")
                    case .notDetermined:
                        self.error = .notDetermined
                        print("Speech recognition authorization not determined.")
                    @unknown default:
                        self.error = .unknown
                        print("Unknown speech recognition authorization status.")
                    }
                    continuation.resume()
                }
            }
        }
    }

    // 6. Start recording
    func startRecording() {
        guard speechRecognizer?.isAvailable ?? false else {
            self.error = .recognitionFailed(NSError(domain: "Speech", code: 0, userInfo: [NSLocalizedDescriptionKey: "Speech recognizer not available."]))
            return
        }

        // Cancel the previous task if it's running
        recognitionTask?.cancel()
        self.recognitionTask = nil
        self.recognizedText = ""
        self.error = nil

        // Configure the audio session for recording
        let audioSession = AVAudioSession.sharedInstance()
        do {
            try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
            try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        } catch {
            self.error = .audioEngineFailed(error)
            return
        }

        // Create a new recognition request
        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        guard let recognitionRequest = recognitionRequest else {
            fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object")
        }
        recognitionRequest.shouldReportPartialResults = true // Get partial results as user speaks

        // Start the recognition task. The callback may fire on a background
        // thread, so published state is updated on the main queue.
        recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
            var isFinal = false
            if let result = result {
                DispatchQueue.main.async {
                    self.recognizedText = result.bestTranscription.formattedString // Update recognized text
                }
                isFinal = result.isFinal
            }
            if error != nil || isFinal {
                DispatchQueue.main.async {
                    self.stopRecording() // Stop recording on error or final result
                    if let error = error {
                        self.error = .recognitionFailed(error)
                    }
                }
            }
        }

        // Install the audio tap on the input node
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
            self.recognitionRequest?.append(buffer) // Append audio buffer to recognition request
        }

        // Prepare and start the audio engine
        audioEngine.prepare()
        do {
            try audioEngine.start()
            self.isRecording = true
            print("Audio engine started. Recording...")
        } catch {
            self.error = .audioEngineFailed(error)
            print("Audio engine failed to start: \(error.localizedDescription)")
        }
    }

    // 7. Stop recording
    func stopRecording() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0) // Remove the tap
        recognitionRequest?.endAudio()
        recognitionTask?.cancel() // Cancel the task
        recognitionTask = nil
        recognitionRequest = nil
        self.isRecording = false
        print("Audio engine stopped. Recording ended.")

        // Reset audio session
        do {
            try AVAudioSession.sharedInstance().setActive(false)
        } catch {
            print("Error deactivating audio session: \(error.localizedDescription)")
        }
    }
}
Explanation:
- SpeechRecognizerError is a custom enum to handle various error states, making our error handling more robust and user-friendly.
- SpeechRecognizer is an ObservableObject so our SwiftUI views can react to changes in recognizedText, isRecording, and error.
- SFSpeechRecognizer is initialized with the en-US locale. You can change this for other languages.
- AVAudioEngine is used to capture audio from the microphone.
- requestAuthorization() uses SFSpeechRecognizer.requestAuthorization to ask the user for permission. It’s an async function to fit into modern Swift concurrency.
- startRecording() configures the audio session, sets up an SFSpeechAudioBufferRecognitionRequest to process live audio, and starts the AVAudioEngine. It also installs a tap to feed microphone audio buffers to the recognition request.
- stopRecording() stops the audio engine, removes the tap, ends the recognition request, and cleans up resources.
Step 4: Integrating Speech Recognition into ContentView
Now, let’s update ContentView to use our SpeechRecognizer.
// ContentView.swift (Updated)
import SwiftUI
import Speech // Don't forget to import Speech
// ... (Message struct and other ContentView code remains the same) ...
struct ContentView: View {
@State private var messages: [Message] = [
Message(text: "Hello! How can I help you today?", isUser: false)
]
@State private var userInput: String = ""
@State private var isListening: Bool = false // New state for listening status
@State private var currentTranscription: String = "" // New state for live transcription
// 1. Instantiate our SpeechRecognizer as a StateObject
@StateObject private var speechRecognizer = SpeechRecognizer()
var body: some View {
NavigationView {
VStack {
ScrollView {
VStack(alignment: .leading, spacing: 10) {
ForEach(messages) { message in
HStack {
if message.isUser {
Spacer()
}
Text(message.text)
.padding(10)
.background(message.isUser ? Color.blue.opacity(0.8) : Color.gray.opacity(0.2))
.foregroundColor(message.isUser ? .white : .primary)
.cornerRadius(10)
if !message.isUser {
Spacer()
}
}
}
// 2. Display live transcription while listening
if isListening && !currentTranscription.isEmpty {
HStack {
Text(currentTranscription)
.padding(10)
.background(Color.blue.opacity(0.1))
.foregroundColor(.blue)
.cornerRadius(10)
Spacer()
}
.padding(.horizontal)
}
}
.padding()
}
// 3. Observe changes from SpeechRecognizer
.onChange(of: speechRecognizer.recognizedText) { _, newText in // iOS 17 two-parameter onChange
currentTranscription = newText
}
.onChange(of: speechRecognizer.isRecording) { _, isRecording in
self.isListening = isRecording
if !isRecording && !currentTranscription.isEmpty {
// When recording stops, if there's transcribed text, send it
userInput = currentTranscription
sendMessage()
currentTranscription = ""
}
}
// 4. Show error if any
.alert(item: $speechRecognizer.error) { error in
Alert(title: Text("Speech Error"), message: Text(error.localizedDescription), dismissButton: .default(Text("OK")))
}
HStack {
// 5. Text input field
TextField("Type or speak your message...", text: $userInput)
.textFieldStyle(RoundedBorderTextFieldStyle())
.padding(.horizontal)
.disabled(isListening) // Disable text input while listening
// 6. Send button
Button("Send") {
sendMessage()
}
.padding(.trailing, 5)
.disabled(userInput.isEmpty || isListening)
// 7. Microphone button
Button {
toggleRecording()
} label: {
Image(systemName: isListening ? "mic.fill" : "mic.circle")
.font(.title)
.foregroundColor(isListening ? .red : .accentColor)
}
.padding(.trailing)
}
.padding(.bottom)
}
.navigationTitle("AI Assistant")
// 8. Request authorization on appear
.task { // Use .task for async operations on view appear
await speechRecognizer.requestAuthorization()
}
}
}
// ... (sendMessage() and simulateAIResponse() remain the same for now) ...
// 9. Toggle recording function
private func toggleRecording() {
if speechRecognizer.isRecording {
speechRecognizer.stopRecording()
} else {
speechRecognizer.startRecording()
currentTranscription = "" // Clear previous transcription
}
}
}
Explanation:
- @StateObject private var speechRecognizer = SpeechRecognizer() creates an instance of our speech recognizer. @StateObject ensures it lives as long as the view and its updates are observed.
- We added a conditional Text view to display currentTranscription live while the user is speaking.
- onChange modifiers observe changes from speechRecognizer. When recognizedText changes, currentTranscription is updated. When isRecording changes to false (meaning recording stopped), the transcribed text is put into userInput and sendMessage() is called automatically.
- An .alert modifier is added to display any errors reported by the SpeechRecognizer.
- The TextField is disabled while listening to prevent conflicts.
- The “Send” button is also disabled while listening.
- A Button with a microphone icon is added. Its icon changes based on isListening status. Tapping it calls toggleRecording().
- The .task modifier is used to call speechRecognizer.requestAuthorization() when the view appears. This is the modern way to perform asynchronous setup for a view.
- toggleRecording() simply starts or stops the speechRecognizer.
Run the app. When you tap the microphone button for the first time, it will ask for permissions. Grant them. Then, tap the mic button again, speak, and you should see your words transcribed live, and then sent as a message when you stop speaking.
Step 5: Text-to-Speech (AI Voice)
Let’s make our AI assistant speak its responses.
Create TextToSpeechSynthesizer.swift: Create a new Swift file named TextToSpeechSynthesizer.swift.

// TextToSpeechSynthesizer.swift
import AVFoundation
import Foundation

class TextToSpeechSynthesizer: NSObject, ObservableObject, AVSpeechSynthesizerDelegate {
    @Published var isSpeaking: Bool = false // Publish speaking status

    private let synthesizer = AVSpeechSynthesizer()

    override init() {
        super.init()
        synthesizer.delegate = self
        // Optional: Configure audio session for playback
        do {
            try AVAudioSession.sharedInstance().setCategory(.playback, mode: .default, options: .duckOthers)
            try AVAudioSession.sharedInstance().setActive(true)
        } catch {
            print("Error setting up audio session for playback: \(error.localizedDescription)")
        }
    }

    func speak(_ text: String) {
        // Stop any ongoing speech before starting a new one
        if synthesizer.isSpeaking {
            synthesizer.stopSpeaking(at: .word)
        }

        let utterance = AVSpeechUtterance(string: text)
        // Modern best practice: Specify language. "en-US" for US English.
        // You can explore other voices like AVSpeechSynthesisVoice(identifier: "com.apple.voice.premium.en-US.Zoe")
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate // Default speaking rate
        utterance.pitchMultiplier = 1.0 // Default pitch
        utterance.volume = 1.0 // Full volume

        synthesizer.speak(utterance)
    }

    func stopSpeaking() {
        if synthesizer.isSpeaking {
            synthesizer.stopSpeaking(at: .immediate)
        }
    }

    // MARK: - AVSpeechSynthesizerDelegate

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didStart utterance: AVSpeechUtterance) {
        DispatchQueue.main.async { self.isSpeaking = true }
        print("Started speaking: \(utterance.speechString)")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didFinish utterance: AVSpeechUtterance) {
        DispatchQueue.main.async { self.isSpeaking = false }
        print("Finished speaking: \(utterance.speechString)")
    }

    func speechSynthesizer(_ synthesizer: AVSpeechSynthesizer, didCancel utterance: AVSpeechUtterance) {
        DispatchQueue.main.async { self.isSpeaking = false }
        print("Canceled speaking: \(utterance.speechString)")
    }
}
Explanation:
- TextToSpeechSynthesizer uses AVSpeechSynthesizer to convert text to speech.
- It conforms to AVSpeechSynthesizerDelegate to optionally log speech events and, importantly, to update its @Published var isSpeaking property.
- The init() method configures the audio session for playback.
- speak(_ text:) takes a string, creates an AVSpeechUtterance, configures it with a voice and other properties, and tells the synthesizer to speak it. It also stops any currently speaking utterance to prevent overlapping.
- stopSpeaking() provides a way to immediately halt speech.
Step 6: Integrating Text-to-Speech into ContentView
Now, let’s make our AI’s simulated response speak aloud.
// ContentView.swift (Updated)
import SwiftUI
import Speech
import AVFoundation // Don't forget to import AVFoundation
// ... (Message struct, ContentView, SpeechRecognizer definitions) ...
struct ContentView: View {
@State private var messages: [Message] = [
Message(text: "Hello! How can I help you today?", isUser: false)
]
@State private var userInput: String = ""
@State private var isListening: Bool = false
@State private var currentTranscription: String = ""
@StateObject private var speechRecognizer = SpeechRecognizer()
// 1. Instantiate our TextToSpeechSynthesizer
@StateObject private var textToSpeech = TextToSpeechSynthesizer()
var body: some View {
NavigationView {
VStack {
ScrollView {
// ... (Message display code remains the same) ...
}
// ... (onChange modifiers for speechRecognizer remain the same) ...
// 2. Observe textToSpeech speaking status
.onChange(of: textToSpeech.isSpeaking) { _, newValue in
// You might use this for a UI indicator, or to disable other actions
print("AI Speaking Status: \(newValue)")
}
HStack {
// ... (TextField, Send Button, Mic Button code remains the same) ...
}
.padding(.bottom)
}
.navigationTitle("AI Assistant")
.task {
await speechRecognizer.requestAuthorization()
}
// 3. Stop speaking if the view disappears
.onDisappear {
textToSpeech.stopSpeaking()
}
}
}
private func sendMessage() {
guard !userInput.isEmpty else { return }
messages.append(Message(text: userInput, isUser: true))
// 4. Stop any ongoing speech when user sends a new message
textToSpeech.stopSpeaking()
simulateAIResponse(for: userInput)
userInput = ""
}
private func simulateAIResponse(for input: String) {
let aiResponse = "I received your message: \"\(input)\". I am still learning, but I'm here to help!"
messages.append(Message(text: aiResponse, isUser: false))
// 5. Speak the AI's response
textToSpeech.speak(aiResponse)
}
// ... (toggleRecording() remains the same) ...
}
Explanation:
- @StateObject private var textToSpeech = TextToSpeechSynthesizer() creates an instance of our text-to-speech synthesizer.
- An onChange modifier for textToSpeech.isSpeaking is added. This is a good practice for debugging and could be used for UI elements (like a “stop speaking” button) in the future.
- .onDisappear ensures that if the user navigates away from this view, any ongoing speech is stopped.
- sendMessage() now calls textToSpeech.stopSpeaking() before simulating a new AI response. This prevents the previous AI message from continuing to speak.
- simulateAIResponse() now calls textToSpeech.speak(aiResponse) to make the AI’s response audible.
Run the app. Now, when the AI provides its simulated response, you should hear it speak!
Step 7: Streaming AI Responses (Mock Service)
To simulate a real-world AI API that streams its response, we’ll create a mock AI service that delivers text character by character with a delay.
Create MockAIService.swift: Create a new Swift file named MockAIService.swift.

// MockAIService.swift
import Foundation

// 1. Define a protocol for our AI service
protocol AIService {
    func getStreamingResponse(for query: String) async throws -> AsyncThrowingStream<String, Error>
}

// 2. Mock implementation of AIService
class MockAIService: AIService {
    func getStreamingResponse(for query: String) async throws -> AsyncThrowingStream<String, Error> {
        return AsyncThrowingStream { continuation in
            Task {
                let fullResponse = "That's a very interesting question about \"\(query)\". As an AI assistant, I can provide information, generate creative content, and help you with a wide range of tasks. What else would you like to know or do?"

                // Simulate a delay for each character
                for character in fullResponse {
                    // Using Task.sleep with nanoseconds for precise, short delays
                    try await Task.sleep(nanoseconds: 30_000_000) // 30ms delay per char
                    continuation.yield(String(character))
                }
                continuation.finish() // Indicate completion
            }
        }
    }
}
Explanation:
- The AIService protocol defines the contract for any AI service, making it easy to swap out the mock with a real API later. It returns an AsyncThrowingStream, which is perfect for streaming data.
- MockAIService implements this protocol. Its getStreamingResponse method takes a query and constructs a simulated fullResponse.
- It then iterates through each character of the response, yielding it to the AsyncThrowingStream after a small delay (Task.sleep). This simulates the “typing out” effect of a streaming AI.
- continuation.finish() is called when all characters have been yielded.
Step 8: Integrating Streaming AI into ContentView
Now, let’s update ContentView to use our MockAIService and handle streaming responses.
// ContentView.swift (Updated)
import SwiftUI
import Speech
import AVFoundation
// ... (Message struct, ContentView, SpeechRecognizer, TextToSpeechSynthesizer definitions) ...
struct ContentView: View {
@State private var messages: [Message] = [
Message(text: "Hello! How can I help you today?", isUser: false)
]
@State private var userInput: String = ""
@State private var isListening: Bool = false
@State private var currentTranscription: String = ""
@State private var isAITyping: Bool = false // New state for AI typing indicator
@StateObject private var speechRecognizer = SpeechRecognizer()
@StateObject private var textToSpeech = TextToSpeechSynthesizer()
// 1. Instantiate our MockAIService (using the protocol for flexibility)
private let aiService: AIService = MockAIService()
var body: some View {
NavigationView {
VStack {
ScrollViewReader { proxy in // 2. Use ScrollViewReader for auto-scrolling
ScrollView {
VStack(alignment: .leading, spacing: 10) {
ForEach(messages) { message in
HStack {
if message.isUser {
Spacer()
}
Text(message.text)
.padding(10)
.background(message.isUser ? Color.blue.opacity(0.8) : Color.gray.opacity(0.2))
.foregroundColor(message.isUser ? .white : .primary)
.cornerRadius(10)
if !message.isUser {
Spacer()
}
}
.id(message.id) // Assign ID for ScrollViewReader
}
// 3. AI Typing Indicator
if isAITyping {
HStack {
Text("AI is typing...")
.padding(10)
.background(Color.gray.opacity(0.1))
.foregroundColor(.gray)
.cornerRadius(10)
Spacer()
}
.id("aiTypingIndicator") // ID for auto-scrolling
}
}
.padding()
.onChange(of: messages.count) { _, _ in // 4. Auto-scroll when messages count changes
scrollToBottom(proxy: proxy)
}
.onChange(of: isAITyping) { _, _ in // 4. Auto-scroll when AI typing status changes
scrollToBottom(proxy: proxy)
}
}
}
// ... (onChange modifiers for speechRecognizer and textToSpeech remain the same) ...
HStack {
TextField("Type or speak your message...", text: $userInput)
.textFieldStyle(RoundedBorderTextFieldStyle())
.padding(.horizontal)
.disabled(isListening || isAITyping) // Disable if AI is typing
Button("Send") {
sendMessage()
}
.padding(.trailing, 5)
.disabled(userInput.isEmpty || isListening || isAITyping)
Button {
toggleRecording()
} label: {
Image(systemName: isListening ? "mic.fill" : "mic.circle")
.font(.title)
.foregroundColor(isListening ? .red : .accentColor)
}
.padding(.trailing)
.disabled(isAITyping) // Disable mic button if AI is typing
}
.padding(.bottom)
}
.navigationTitle("AI Assistant")
.task {
await speechRecognizer.requestAuthorization()
}
.onDisappear {
textToSpeech.stopSpeaking()
}
}
}
    private func sendMessage() {
        guard !userInput.isEmpty else { return }
        let userMessage = Message(text: userInput, isUser: true)
        messages.append(userMessage)
        textToSpeech.stopSpeaking() // Stop any ongoing AI speech

        let query = userInput
        userInput = "" // Clear the input field immediately

        // 5. Start AI response streaming
        Task { // Use a Task to run the async operation
            await handleStreamingAIResponse(for: query)
        }
    }

    // 6. New function to handle streaming AI responses
    private func handleStreamingAIResponse(for query: String) async {
        isAITyping = true // Show the typing indicator
        var aiResponseAccumulator = ""
        let aiMessageID = UUID() // Stable ID for the streaming message

        // Append an empty AI message to stream into.
        // Must be on the main thread, as it modifies @State.
        DispatchQueue.main.async {
            self.messages.append(Message(id: aiMessageID, text: "", isUser: false))
        }

        do {
            let stream = try await aiService.getStreamingResponse(for: query)
            for try await chunk in stream {
                aiResponseAccumulator += chunk
                // Update the AI message in place with the accumulated text.
                // Looking the index up by ID here (rather than caching it earlier)
                // avoids racing the asynchronous append above. Must be on the main thread.
                DispatchQueue.main.async {
                    if let index = self.messages.firstIndex(where: { $0.id == aiMessageID }) {
                        self.messages[index] = Message(id: aiMessageID, text: aiResponseAccumulator, isUser: false)
                    }
                }
                // Optional: for a more real-time voice, you could speak partial chunks.
                // For simplicity and better pronunciation, we speak the full response at the end.
            }
            // 7. Speak the full AI response once streaming is complete
            if !aiResponseAccumulator.isEmpty {
                textToSpeech.speak(aiResponseAccumulator)
            }
        } catch {
            print("Error streaming AI response: \(error.localizedDescription)")
            // Display an error message in the chat
            DispatchQueue.main.async {
                self.messages.append(Message(text: "Error: \(error.localizedDescription)", isUser: false))
                self.textToSpeech.speak("I encountered an error while processing your request.")
            }
        }
        isAITyping = false // Hide the typing indicator
    }

    private func toggleRecording() {
        if speechRecognizer.isRecording {
            speechRecognizer.stopRecording()
        } else {
            // Stop AI speech if the user starts recording
            textToSpeech.stopSpeaking()
            speechRecognizer.startRecording()
            currentTranscription = ""
        }
    }

    // Helper function for auto-scrolling
    private func scrollToBottom(proxy: ScrollViewProxy) {
        // A short delay lets SwiftUI render the new message before scrolling
        DispatchQueue.main.asyncAfter(deadline: .now() + 0.05) {
            if let lastMessage = messages.last {
                proxy.scrollTo(lastMessage.id, anchor: .bottom)
            } else if isAITyping { // No messages yet, but the AI is typing: scroll to the indicator
                proxy.scrollTo("aiTypingIndicator", anchor: .bottom)
            }
        }
    }
}
Explanation:
- `private let aiService: AIService = MockAIService()` instantiates our mock AI service, conforming to the `AIService` protocol.
- `ScrollViewReader` is introduced to allow programmatic scrolling to the bottom of the chat, which is essential for a good chat experience.
- `@State private var isAITyping` is a new state variable that shows and hides the "AI is typing..." indicator.
- `onChange` modifiers on `messages.count` and `isAITyping` trigger `scrollToBottom` to keep the latest messages visible. A small delay in `scrollToBottom` gives SwiftUI time to render the new content before attempting to scroll.
- `sendMessage()` now kicks off an async `Task` to handle the streaming AI response. We clear `userInput` immediately for a responsive UI.
- `handleStreamingAIResponse()` is the core of our streaming logic:
  - It sets `isAITyping` to `true`.
  - It immediately appends an empty AI message to the `messages` array. This message is updated in place as chunks arrive.
  - It calls `aiService.getStreamingResponse()` to get the `AsyncThrowingStream`.
  - It then iterates with `for try await chunk in stream`, appending each `chunk` to `aiResponseAccumulator` and updating the text of that AI message in the `messages` array on the main thread.
  - Once the stream finishes, the complete `aiResponseAccumulator` is spoken by `textToSpeech`.
  - Errors are caught and surfaced in the chat.
  - Finally, `isAITyping` is set back to `false`.
- The `TextField`, "Send" button, and microphone button are disabled while `isAITyping` to prevent user input conflicts during AI processing.
- `toggleRecording()` now also stops any ongoing AI speech when the user starts recording, to prioritize user input.
- The `scrollToBottom` helper ensures the view scrolls to the latest message or to the typing indicator.
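For reference, a minimal `MockAIService` conforming to such a protocol might look like the sketch below. This is an illustrative reconstruction, not the exact definition from earlier in the chapter; match the protocol signature to your own `AIService`.

```swift
import Foundation

// Sketch of the service protocol the view depends on (names are illustrative).
protocol AIService {
    func getStreamingResponse(for query: String) async throws -> AsyncThrowingStream<String, Error>
}

// A mock that streams a canned reply word by word, simulating network latency.
struct MockAIService: AIService {
    func getStreamingResponse(for query: String) async throws -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            Task {
                let reply = "This is a simulated streaming response to: \(query)"
                for word in reply.split(separator: " ") {
                    try? await Task.sleep(nanoseconds: 100_000_000) // ~0.1 s per chunk
                    continuation.yield(String(word) + " ")
                }
                continuation.finish()
            }
        }
    }
}
```

Because the view holds the service as `let aiService: AIService`, swapping in a real network-backed implementation later requires no UI changes.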
Run the app again. Now, when you send a message (either by typing or speaking), you’ll see “AI is typing…” and the AI’s response will appear character by character, followed by it speaking the full message! This creates a much more engaging and realistic AI interaction.
Mini-Challenge: Enhance AI Interaction
You’ve built a solid foundation for an AI assistant! Now, let’s add a small but impactful enhancement.
Challenge: Implement a “Stop Speaking” button or gesture. When the AI is speaking its response, allow the user to interrupt it immediately.
Hint:
- You already have a `textToSpeech.stopSpeaking()` method.
- The `TextToSpeechSynthesizer` now publishes its `isSpeaking` status. You can use this to conditionally show a UI element.
- Consider adding a simple button next to the AI's message, or a floating button that only appears while `textToSpeech.isSpeaking` is `true`.
- Alternatively, you could add a `TapGesture` or `LongPressGesture` to the chat view's background that calls `textToSpeech.stopSpeaking()`.
What to observe/learn: This challenge reinforces event handling, UI state management, and user control over ongoing asynchronous operations. It’s a critical aspect of making AI interactions feel natural and responsive.
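One possible solution is sketched below. It assumes `isSpeaking` is a `@Published` property on `TextToSpeechSynthesizer`, as the hint describes; the placement and styling are up to you.

```swift
// A "Stop Speaking" control shown only while the synthesizer is active.
// Place this inside the VStack, e.g. just above the input HStack.
if textToSpeech.isSpeaking {
    Button {
        textToSpeech.stopSpeaking() // Interrupt the AI's voice immediately
    } label: {
        Label("Stop Speaking", systemImage: "stop.circle.fill")
            .font(.headline)
            .foregroundColor(.red)
    }
    .padding(.vertical, 4)
}
```

Because the button's visibility is driven by published state, it appears and disappears automatically as speech starts and stops.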
Common Pitfalls & Troubleshooting
Building an AI assistant involves many moving parts. Here are some common issues you might encounter:
Permission Denied / App Crashes on Mic Access:
- Symptom: Your app crashes or doesn't recognize speech, and you see console errors related to `NSSpeechRecognitionUsageDescription` or `NSMicrophoneUsageDescription`.
- Fix: Double-check that you've added both the "Privacy - Speech Recognition Usage Description" and "Privacy - Microphone Usage Description" keys with appropriate strings to your project's `Info.plist` (under the target's Info tab in Xcode). Reinstall the app after adding these, as permission prompts are only shown once.
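In source form, the two `Info.plist` entries look like this (the description strings are examples; write your own user-facing explanations):

```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>This app uses speech recognition to transcribe your voice commands.</string>
<key>NSMicrophoneUsageDescription</key>
<string>This app needs microphone access to hear your voice commands.</string>
```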
UI Freezing During AI Response:
- Symptom: When you send a message, the app becomes unresponsive until the AI's full response appears (or the simulated response finishes).
- Fix: This almost always means you're performing a long-running operation (like waiting for the AI response) directly on the main thread. Wrap network calls and heavy computations in `Task { ... }` blocks, and make sure any UI updates from background tasks happen on the main actor, either via `await MainActor.run { ... }` or through `@Published` properties on `ObservableObject`s, which publish on the main thread. Our `handleStreamingAIResponse` uses `Task` and `DispatchQueue.main.async` for safe UI updates.
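As a sketch, the same pattern written with `MainActor.run` looks like this. `fetchAIReply` is a hypothetical long-running async function, stood in here purely for illustration:

```swift
Task {
    // Long-running work runs off the main thread...
    let reply = try await fetchAIReply(query) // hypothetical async network call

    // ...while UI state changes hop back to the main actor.
    await MainActor.run {
        messages.append(Message(text: reply, isUser: false))
    }
}
```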
Speech Recognition Not Working / Incorrectly Transcribing:
- Symptom: The app doesn't detect speech, or the transcription is consistently wrong.
- Fix:
  - Verify microphone access is granted in iOS Settings (Settings > Privacy & Security > Microphone > Your App).
  - Check `SpeechRecognizer`'s `locale` property and make sure it matches the language you're speaking.
  - Test in a quiet environment.
  - Check the Xcode console for any `AVAudioEngine` errors.
Text-to-Speech Not Playing Audio:
- Symptom: The AI's response appears on screen, but you don't hear any voice.
- Fix:
  - Check the device volume.
  - Make sure the device is not in silent mode.
  - Verify that the `AVAudioSession` category is set to `.playback` and the session is activated, as done in `TextToSpeechSynthesizer`.
  - Check the Xcode console for any `AVFoundation` errors.
  - Make sure `utterance.voice` is set to a valid language (e.g., `en-US`).
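The checklist above corresponds to configuration like the following sketch of what `TextToSpeechSynthesizer` does internally (your implementation from earlier in the chapter may differ in detail):

```swift
import AVFoundation

// Activate a playback audio session so speech is audible even in some
// silent-switch configurations, then speak with an explicit voice.
let session = AVAudioSession.sharedInstance()
try? session.setCategory(.playback, mode: .default)
try? session.setActive(true)

let synthesizer = AVSpeechSynthesizer()
let utterance = AVSpeechUtterance(string: "Hello from your AI assistant.")
utterance.voice = AVSpeechSynthesisVoice(language: "en-US") // must be a valid language code
synthesizer.speak(utterance)
```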
Summary: Your AI Assistant is Alive!
Congratulations! You’ve successfully built an AI-powered assistant app, integrating several advanced iOS features. Let’s recap what you’ve learned and accomplished:
- AI Integration Strategies: Understood the trade-offs between on-device and cloud-based AI, and laid the groundwork for future integration with real AI APIs.
- Speech Recognition: Implemented voice input using Apple's `Speech` framework, including requesting permissions and handling live transcription.
- Text-to-Speech: Gave your app a voice using `AVFoundation` to synthesize spoken responses.
- Streaming UI: Created a dynamic, engaging user experience by displaying AI responses character by character as they are generated, improving perceived performance.
- Modern Concurrency: Leveraged Swift's `async`/`await` for efficient, readable handling of asynchronous tasks like speech recognition and AI communication, keeping your UI responsive.
- Robust Error Handling: Incorporated custom error types and alerts to provide clear feedback to the user when issues arise.
This project pushed your skills beyond basic UI, demonstrating how to weave together complex system frameworks and advanced concurrency patterns to create a truly interactive and intelligent application.
What’s Next?
In the real world, this project would evolve significantly. Here are some immediate next steps you could consider:
- Integrate a Real AI API: Replace `MockAIService` with an actual API client for services like OpenAI (e.g., GPT-4), Google Gemini, or Anthropic Claude. This involves handling API keys, network requests, and potentially more complex streaming protocols (like Server-Sent Events).
- Conversation History: Implement persistent storage (e.g., SwiftData or Core Data) to save chat conversations across app launches.
- On-Device AI for Specific Tasks: Explore using Core ML for local, specialized tasks like sentiment analysis of user input before sending it to a cloud AI.
- Customization: Allow users to choose different AI voices, adjust speaking rates, or personalize the assistant’s personality.
- Advanced UI: Add features like copy-to-clipboard for AI responses, markdown rendering for rich text, or visual indicators for AI thinking states.
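For the first of these next steps, a rough sketch of a network-backed replacement for `MockAIService` is shown below. The endpoint URL, header, and request body are placeholders, not any provider's real API; real services (OpenAI, Gemini, Claude) each have their own request formats and SSE framing you would need to parse.

```swift
import Foundation

// Sketch: adapt a line-delimited HTTP byte stream into the same
// AsyncThrowingStream our view already consumes. URL and headers are placeholders.
struct StreamingAIService: AIService {
    func getStreamingResponse(for query: String) async throws -> AsyncThrowingStream<String, Error> {
        var request = URLRequest(url: URL(string: "https://api.example.com/v1/chat/stream")!)
        request.httpMethod = "POST"
        request.setValue("Bearer YOUR_API_KEY", forHTTPHeaderField: "Authorization")
        request.httpBody = try JSONEncoder().encode(["prompt": query])

        // URLSession.bytes (iOS 15+) delivers the response incrementally.
        let (bytes, _) = try await URLSession.shared.bytes(for: request)

        return AsyncThrowingStream { continuation in
            Task {
                do {
                    for try await line in bytes.lines { // one chunk per line of the response
                        continuation.yield(line)
                    }
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }
}
```

Because this conforms to the same `AIService` protocol, swapping it in is a one-line change in the view: `private let aiService: AIService = StreamingAIService()`.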
You’ve taken a massive leap in your iOS development journey. Keep building, keep experimenting, and keep pushing the boundaries of what your apps can do!
References
- Apple Developer Documentation: Speech Framework
- Apple Developer Documentation: AVFoundation
- Apple Developer Documentation: Core ML
- Apple Developer Documentation: Concurrency (async/await)
- Apple Developer Documentation: SwiftUI