Rise of the ChatBots (2) - They can hear and speak

- Introduction
- Integrate Speech-to-Text with ChatGPT Library
- Integrate Speech-to-Text with direct API call (Rust)
- Integrate Text-to-Speech with ChatGPT Library
- Integrate Text-to-Speech with direct API call (Rust)
Introduction
Integrating Speech-to-Text and Text-to-Speech with OpenAI's models lets an application convert spoken language into written text and vice versa. With the recent updates to OpenAI's library, developers can easily incorporate these features into their software, whether through the Python library or direct API calls from Rust.
Get ready to unlock new avenues of interaction within your projects by harnessing the power of speech recognition and synthesis with OpenAI's cutting-edge technology.
Integrate Speech-to-Text with ChatGPT Library
OpenAI has just updated its library, and in Python, it is now possible to transcribe an audio file with just a few lines of code.
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# Create the OpenAI client with your API key
# (loaded from an environment variable or secret management service)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Declare the path for your file
file = "question.m4a"

# Open the file in binary mode
audio_file = open(file, "rb")

# Transcription
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    # You can choose your output language!
    language="en"
)

# Print the result
print(transcript.text)
You will notice that it is possible to influence the output language... For example, you can speak in French with language="en" and the transcription will come out in English. Strictly speaking, OpenAI documents the language parameter as a hint giving the language of the input audio, which improves accuracy and latency.
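If you want guaranteed English output regardless of the spoken language, OpenAI also exposes a dedicated translations endpoint. Here is a minimal sketch of calling it directly from Rust, in the same style as the next section; the endpoint URL and response shape come from OpenAI's documentation, while the function name translate_audio is mine.
use std::{error::Error, path::Path};
use reqwest::{multipart, Client};

// The translations endpoint returns the same JSON shape as transcriptions.
#[derive(serde::Deserialize)]
struct TranslationResponse {
    text: String,
}

// Sketch: translate any spoken language to English text.
pub async fn translate_audio(file_path: &Path) -> Result<String, Box<dyn Error>> {
    let api_key = std::env::var("OPENAI_API_KEY")?;

    // Read the audio file and wrap it in a multipart form.
    let file_content = tokio::fs::read(file_path).await?;
    let part = multipart::Part::bytes(file_content)
        .file_name("audio.m4a")
        .mime_str("audio/m4a")?;
    let form = multipart::Form::new()
        .part("file", part)
        .text("model", "whisper-1");

    // Send the request and parse the JSON response.
    let response = Client::new()
        .post("https://api.openai.com/v1/audio/translations")
        .bearer_auth(api_key)
        .multipart(form)
        .send()
        .await?
        .error_for_status()?;

    let parsed = response.json::<TranslationResponse>().await?;
    Ok(parsed.text)
}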
Integrate Speech-to-Text with direct API call (Rust)
The code below sends an audio file to OpenAI's servers for transcription and handles the response, with proper error handling in place. You can check out the full source code from my GitHub repo: https://github.com/claziosi/RustAI
use std::path::Path;
use reqwest::{multipart, Client};

// Define a structure for deserializing the response JSON.
// Note: The actual structure may vary depending on OpenAI's API response format.
#[derive(serde::Deserialize)]
struct TranscriptionResponse {
    text: String,
}

const API_URL: &str = "https://api.openai.com/v1/audio/transcriptions";

pub async fn transcription(file_path: &Path) -> Result<String, Box<dyn std::error::Error>> {
    let api_key = std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set");
    let model_name = "whisper-1";

    // Read the file content into a byte vector
    let file_content = tokio::fs::read(file_path).await?;

    // Create a multipart form
    let part = multipart::Part::bytes(file_content)
        .file_name("audio.m4a")
        .mime_str("audio/m4a")?; // Make sure to set the correct MIME type for your audio file
    let form = multipart::Form::new()
        .part("file", part)
        .text("model", model_name.to_string());

    // Build the client and make the request
    let client = Client::new();
    let response = client
        .post(API_URL)
        .bearer_auth(api_key)
        .multipart(form)
        .send()
        .await?;

    if response.status().is_success() {
        // Deserialize the JSON body into our response structure
        match response.json::<TranscriptionResponse>().await {
            Ok(transcription_response) => Ok(transcription_response.text),
            Err(_) => Err("Failed to parse JSON response".into()),
        }
    } else {
        Err(format!("Error making request: {:?}", response.status()).into())
    }
}
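As a quick check, here is a minimal usage sketch, assuming the transcription function above is in scope and tokio's macros feature is enabled (question.m4a is just an example file name):
use std::path::Path;

#[tokio::main]
async fn main() {
    // Transcribe a local audio file and print the result.
    match transcription(Path::new("question.m4a")).await {
        Ok(text) => println!("Transcript: {}", text),
        Err(e) => eprintln!("Transcription failed: {}", e),
    }
}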
Integrate Text-to-Speech with ChatGPT Library
The Audio API provides a speech endpoint based on our TTS (text-to-speech) model. It comes with 6 built-in voices and can be used to:
- Narrate a written blog post
- Produce spoken audio in multiple languages
- Give real time audio output using streaming (see the streaming sketch after the Python example below)
source: https://platform.openai.com/docs/guides/text-to-speech
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Save the generated audio next to this script
speech_file_path = Path(__file__).parent / "speech.mp3"

# Generate speech with the built-in "alloy" voice
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!"
)

# Write the audio stream to the MP3 file
response.stream_to_file(speech_file_path)
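The snippet above buffers the whole file before writing it. For the real time streaming use case mentioned in the list, here is a hedged sketch in Rust, in the same direct-API style as the next section; it assumes reqwest is built with the stream feature, that the futures-util and serde_json crates are available, and speech_streaming is a name chosen for illustration.
use std::{error::Error, fs::File, io::Write};
use futures_util::StreamExt;
use reqwest::Client;

// Sketch: write the TTS response to disk chunk by chunk instead of
// buffering the whole audio file in memory.
pub async fn speech_streaming(input_text: &str) -> Result<(), Box<dyn Error>> {
    let api_key = std::env::var("OPENAI_API_KEY")?;

    // Same endpoint and body as the Rust function in the next section.
    let body = serde_json::json!({
        "model": "tts-1",
        "voice": "alloy",
        "input": input_text,
    });

    let response = Client::new()
        .post("https://api.openai.com/v1/audio/speech")
        .bearer_auth(api_key)
        .json(&body)
        .send()
        .await?
        .error_for_status()?;

    // Consume the body as a stream and write each chunk as it arrives.
    let mut file = File::create("speech.mp3")?;
    let mut stream = response.bytes_stream();
    while let Some(chunk) = stream.next().await {
        file.write_all(&chunk?)?;
    }
    Ok(())
}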
Integrate Text-to-Speech with direct API call (Rust)
This code defines an asynchronous function called text_to_speech, which sends a JSON payload to OpenAI's text-to-speech endpoint and saves the resulting audio stream as an .mp3 file.
use std::{path::PathBuf, error::Error, fs::File, io::Write};
use reqwest::Client;

// Define a structure for the request body.
#[derive(serde::Serialize)]
struct TextToSpeechRequest {
    model: String,
    input: String,
    voice: String,
}

pub async fn text_to_speech(input_text: &str, voice: &str) -> Result<PathBuf, Box<dyn Error>> {
    const API_URL: &str = "https://api.openai.com/v1/audio/speech";
    let api_key = std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set");

    // Prepare the request body.
    let body = TextToSpeechRequest {
        model: "tts-1".to_string(),
        input: input_text.to_string(),
        voice: voice.to_string(),
    };

    // Create an HTTP client instance.
    let client = Client::new();

    // Perform the POST request.
    let response_bytes = client
        .post(API_URL)
        .bearer_auth(api_key)
        .json(&body)
        .send()
        .await?
        .error_for_status()? // Ensure we have a successful response code (e.g., 2xx).
        .bytes()
        .await?;

    // Write the received bytes into an MP3 file.
    let output_path = PathBuf::from("speech.mp3");
    let mut file = File::create(&output_path)?;
    file.write_all(&response_bytes)?;

    Ok(output_path)
}
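And a minimal usage sketch, again assuming the function above is in scope (the input text and voice are just examples):
#[tokio::main]
async fn main() {
    // Generate speech with the "alloy" voice and report where the MP3 landed.
    match text_to_speech("Today is a wonderful day!", "alloy").await {
        Ok(path) => println!("Audio saved to {}", path.display()),
        Err(e) => eprintln!("Text-to-speech failed: {}", e),
    }
}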
We are now able to talk to bots and hear their responses.