LlamaIndex – Design pattern utilizing the astream_chat method of the OpenAI Class (Part 4 in series)

Engaging with Large Language Models (LLMs) presents various challenges and opportunities, especially when implementing asynchronous streaming operations. In this part of our series, we delve into handling asynchronous streaming API calls within the "chat" functionality of models like ChatGPT. Asynchronous streaming allows the app developer to handle generated text in real time as it becomes available, rather than waiting for the entire output to be ready.

The code in this article leverages the LlamaIndex package to interact asynchronously with OpenAI's GPT-3.5-turbo and GPT-4 models through an async stream chat design pattern. In this pattern, chat messages and responses are processed in an event-driven manner, enhancing real-time interaction and responsiveness. It is crucial for applications that require continuous updates, or for managing large responses that are handled more efficiently in segments.

The provided asynchronous code includes tasks such as checking for an API key in the environment, initializing the OpenAI clients, sending chat messages to both models with non-blocking calls, and asynchronously measuring the time to the first response chunk. This approach optimizes resource usage and elevates the user experience by displaying responses progressively as they arrive, rather than after the complete response has been computed.
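Before diving into the LlamaIndex demo, the core idea of streaming deltas can be shown with plain asyncio alone. The sketch below is illustrative only: fake_stream is a hypothetical stand-in for a model's streaming response, an async generator that yields text fragments ("deltas") one at a time.

```python
import asyncio

# Hypothetical stand-in for a model's streaming response: an async
# generator that yields text fragments ("deltas") one at a time.
async def fake_stream(deltas):
    for delta in deltas:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield delta

async def consume():
    pieces = []
    # Each delta can be processed the moment it arrives, instead of
    # waiting for the entire response to finish generating.
    async for delta in fake_stream(["The ", "best ", "day..."]):
        print(delta, end="")
        pieces.append(delta)
    return "".join(pieces)

text = asyncio.run(consume())
```

The `async for` loop is the same construct the demo code later uses to drain the real OpenAI stream; only the source of the deltas changes.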

LLM programming intro series:

LlamaIndex is a popular open-source library that makes it easy to integrate LLMs through standardized design patterns and access models for a wide range of providers.

Series

Focus: In this series of articles, we demonstrate the programming models supported by LlamaIndex. It is meant for technical software architects, developers, and LLMOps engineers, as well as technical enthusiasts. We provide actual working code that you can copy and use as you please.

Links

Link                                       Purpose
https://www.llamaindex.ai/                 Main website
https://docs.llamaindex.ai/en/stable/      Documentation website

Objective of this code sample

This Python sample demonstrates the use of asynchronous programming to interact with OpenAI, utilizing the llama_index package for GPT-3.5 and GPT-4 models. The script checks for the presence of an OpenAI API key in the environment, initializes the API clients, sends chat messages to both GPT models asynchronously, and prints the responses concurrently. The main goal is to illustrate how to handle asynchronous API calls and process responses in a non-blocking manner, providing insights into the performance benefits of asynchronous programming.
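The performance benefit of firing both requests concurrently can be illustrated without any API calls at all. In the sketch below, fake_request is a hypothetical stand-in for a network-bound LLM call (it just sleeps); it compares awaiting two calls one after another against launching them together with asyncio.gather.

```python
import asyncio
import time

# Hypothetical stand-in for a network-bound LLM call: it just sleeps.
async def fake_request(delay):
    await asyncio.sleep(delay)

async def sequential():
    t0 = time.perf_counter()
    await fake_request(0.1)  # second call starts only after the first ends
    await fake_request(0.1)
    return time.perf_counter() - t0

async def concurrent():
    t0 = time.perf_counter()
    # Both calls are in flight on the event loop at the same time
    await asyncio.gather(fake_request(0.1), fake_request(0.1))
    return time.perf_counter() - t0

seq = asyncio.run(sequential())
con = asyncio.run(concurrent())
print(f"sequential: {seq:.2f}s  concurrent: {con:.2f}s")
```

The concurrent version finishes in roughly the time of the single slowest call, which is exactly the pattern the demo exploits when it creates tasks for both GPT models.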

Learning objectives

1. Get introduced to LlamaIndex as a programming model for interacting with LLMs.
2. Try out a simple design pattern using the async streaming API, and learn the concepts behind streaming, especially its asynchronous variant.
3. Set you up for more advanced concepts in future articles.

Demo Code

This section imports necessary standard and LlamaIndex packages. asyncio is used for asynchronous programming, time for measuring elapsed time, and os for accessing environment variables. The LlamaIndex subpackages are used to interact with the OpenAI API and structure chat messages.

# Note 1: make sure to pip install llama-index before proceeding
# Note 2: make sure your OpenAI API key is set as an env variable as well.
# Import required standard packages
import asyncio
import time
import os
# Import the LlamaIndex OpenAI client and the chat message structure
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

The check_key function checks whether the OpenAI API key is set in the environment. If found, it prints a confirmation message and returns True; otherwise, it returns False.

print_chunks is an asynchronous function that prints the response chunks as they are received. It iterates over the response asynchronously, ensuring non-blocking behavior.

def check_key() -> bool:
    if "OPENAI_API_KEY" in os.environ:
        print("\nOPENAI_API_KEY detected in env")
        return True
    else:
        return False

async def print_chunks(response, label):
    # Print each delta (newly generated fragment) as soon as it arrives
    async for chunk in response:
        print(f"{label}: {chunk.delta}", end='')

This function performs the core operations:

  • API Key Check and Client Initialization: Verifies the API key and initializes the OpenAI clients for GPT-3.5 and GPT-4.
  • Defining Messages: Prepares the chat messages to be sent to both models.
  • Timing and Async Tasks: Measures the start time, creates asynchronous tasks for sending messages, and waits for their responses.
  • Printing Responses: Creates tasks to print the response chunks from both models concurrently.
  • Elapsed Time Calculation: Calculates and prints the time taken to receive the first chunk of responses.
async def main():

    if check_key():
        openai_client_gpt_3_5_turbo = OpenAI(model="gpt-3.5-turbo")
        openai_client_gpt_3_5_turbo.api_key = os.environ["OPENAI_API_KEY"]
        openai_client_gpt_4 = OpenAI(model="gpt-4")
        openai_client_gpt_4.api_key = os.environ["OPENAI_API_KEY"]
    else:
        print("OPENAI_API_KEY not in env")
        exit(1)  # Exit if no API key is found

    # Define the GPT-3.5 chat messages to initiate the chat
    messages_3_5 = [
        ChatMessage(role="system", content="You are a helpful AI assistant."),
        ChatMessage(role="user", content="Tell me the best day to visit Paris. Then, elaborate.")
    ]

    # Define the GPT-4 chat messages to initiate the chat
    messages_4 = [
        ChatMessage(role="system", content="You are a helpful AI assistant."),
        ChatMessage(role="user", content="Tell me the best day to visit Paris. Then, elaborate.")
    ]

    # Get the current time
    start_time = time.time()

    # Kick off both streaming chat calls concurrently
    response_3_5_task = asyncio.create_task(openai_client_gpt_3_5_turbo.astream_chat(messages=messages_3_5))
    response_4_task = asyncio.create_task(openai_client_gpt_4.astream_chat(messages=messages_4))

    response_3_5 = await response_3_5_task
    response_4 = await response_4_task

    # Get the end time. This measures time to the first chunk only, so it is fast!
    end_time = time.time()

    # An alternate way of making async calls is to use create_task and then gather.
    # This interleaves the printing of GPT-3.5 chunks and GPT-4 chunks.
    # While not a practical example, it illustrates to the software architect
    # the capability available even for printing asynchronously!
    task_3_5 = asyncio.create_task(print_chunks(response_3_5, "GPT-3.5"))
    task_4 = asyncio.create_task(print_chunks(response_4, "GPT-4"))

    await asyncio.gather(task_3_5, task_4)

    # Calculate the elapsed time in seconds
    elapsed_time = end_time - start_time

    # Format the elapsed time to two decimal places
    formatted_time = "{:.2f}".format(elapsed_time)

    # Print the formatted time
    print(f"\nElapsed time (to first chunk): {formatted_time} seconds")

if __name__ == "__main__":
    asyncio.run(main())

Sample Output

The sample output demonstrates the interleaved printing of response chunks from GPT-3.5 and GPT-4, showcasing the asynchronous handling of responses and the quick retrieval of the first chunks of data.

PARTIAL OUTPUT SHOWN HERE FOR DEMO PURPOSES...

# Here is a sample output. It has been truncated for brevity.
# GPT-4:  theGPT-4:  mostGPT-4:  pleasantGPT-3.5: InGPT-3.5:  
# theGPT-3.5:  fallGPT-3.5: ,GPT-3.5:  ParisGPT-3.5:  isGPT-3.5:  
# adornedGPT-3.5:  withGPT-3.5:  beautifulGPT-3.5:  autumnGPT-4:  
# weatherGPT-4: ,GPT-4:  andGPT-4:  crowdGPT-4:  sizesGPT-3.5:  
# colorsGPT-3.5: ,GPT-3.5:  creatingGPT-3.5:  aGPT-3.5:  
# picturesqueGPT-3.5:  backdropGPT-3.5:  forGPT-3.5:  exploringGPT-3.5:  
# theGPT-4:  areGPT-4:  smallGPT-4:  toGPT-3.5:  cityGPT-3.5: 
# 'sGPT-3.5:  landmarksGPT-4:  mediumGPT-4: .
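The interleaving above falls out naturally from the event loop: whichever stream has a delta ready is printed next. A minimal sketch reproduces the effect without an API key; here fake_stream is a hypothetical stand-in for the two model responses, yielding deltas at different simulated network latencies.

```python
import asyncio

# Hypothetical stand-ins for the two model streams: async generators
# that yield deltas at different simulated network latencies.
async def fake_stream(deltas, delay):
    for delta in deltas:
        await asyncio.sleep(delay)
        yield delta

out = []  # collected "label:delta" entries, in arrival order

async def collect_chunks(stream, label):
    async for delta in stream:
        out.append(f"{label}:{delta}")

async def main():
    fast = fake_stream(["a", "b", "c"], 0.01)
    slow = fake_stream(["x", "y", "z"], 0.03)
    # gather drives both consumers on one event loop, so entries from
    # the two streams end up interleaved by arrival time
    await asyncio.gather(collect_chunks(fast, "GPT-3.5"),
                         collect_chunks(slow, "GPT-4"))

asyncio.run(main())
print(out)
```

Because arrival order depends on timing, the exact interleaving varies from run to run, just as it does with the real GPT-3.5 and GPT-4 streams.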

Here is the class hierarchy behind the two key classes used.

Class hierarchy of ChatMessage Class

[Class hierarchy diagram]

Class hierarchy of OpenAI Class

[Class hierarchy diagram]
