
LlamaIndex – Design pattern using the stream_chat method of the OpenAI class (Part 3 in a series)

Engaging with Large Language Models (LLMs) presents various challenges and opportunities, particularly when dealing with streaming operations. In this part of our series, we explore how to handle streaming API calls within the "chat" functionality of models like ChatGPT. Streaming allows the app developer to start using generated text as it is produced, rather than waiting for the entire completion to be ready.

The provided code demonstrates how to use the LlamaIndex package to interact with OpenAI's GPT-3.5-turbo and GPT-4 models through a stream chat design pattern. Stream chat is a design pattern in which chat messages and responses are processed incrementally, allowing for real-time interaction and responsiveness. It is particularly useful in applications requiring continuous, real-time updates, or when handling large responses that benefit from being processed in chunks. The code performs the following tasks: checking for an API key in the environment, initializing the OpenAI clients, sending chat messages to both models, and measuring the response time. Because responses are streamed and displayed incrementally, this approach makes efficient use of resources and improves the user experience.
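As a minimal sketch of the difference (assuming the llama-index package is installed and OPENAI_API_KEY is set in your environment), compare a blocking chat() call, which returns only after the full completion is ready, with a streaming stream_chat() call, which yields chunks as they arrive:

from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

llm = OpenAI(model="gpt-3.5-turbo")
messages = [ChatMessage(role="user", content="Say hello.")]

# Blocking: nothing is usable until the whole response is ready
print(llm.chat(messages))

# Streaming: each chunk is usable the moment it arrives
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end='')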

LLM programming intro series:

LlamaIndex is a popular open-source library that makes it easy to integrate a wide range of LLMs through standardized design patterns and access models.

Series

Focus: In this series of articles, we demonstrate the programming models supported by LlamaIndex. The series is meant for technical software architects, developers, and LLMOps engineers, as well as technical enthusiasts. We provide actual working code that you can copy and use as you please.

Links

Link | Purpose
https://www.llamaindex.ai/ | Main website
https://docs.llamaindex.ai/en/stable/ | Documentation website

 

Objective of this code sample

In this code sample, we will create a Python script that interacts with LlamaIndex's OpenAI module using a stream-based approach. The script begins by checking whether an OpenAI API key is present in the environment. If found, it initializes two instances of the OpenAI client, one for the GPT-3.5 Turbo model and another for the GPT-4 model. Each client is used to send a series of predefined chat messages, simulating a conversation asking for travel advice about Paris. These messages include initial system instructions and user queries. After sending these messages, the script captures and processes the responses from both GPT models synchronously, leveraging the stream chat design pattern, so that responses are received and displayed incrementally. Finally, the script measures and outputs the elapsed time for these operations alongside the models' responses, demonstrating the performance and interaction capabilities of different versions of the GPT models within the LlamaIndex framework.

Learning objectives

 

1. Get introduced to LlamaIndex as a programming model for interacting with LLMs.
2. Try out a simple design pattern using the stream API, and get introduced to the concepts behind streaming.
3. Set you up for more advanced concepts in future articles.

Demo Code

This section imports the necessary standard Python packages (time and os) and specific subpackages from LlamaIndex. The OpenAI class is used to interact with OpenAI's language models, and ChatMessage is used to structure the chat messages.

# Note 1: make sure to pip install llama-index (which includes llama-index-core
#         and the OpenAI integration) before proceeding
# Note 2: make sure your OpenAI API key is set as an env variable as well
# Import required standard packages
import time
import os

# Import required LlamaIndex subpackages
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

 

A helper function checks whether the key is present as an environment variable. The check_key function looks for OPENAI_API_KEY in the environment variables. If the key is found, it prints a message and returns True; otherwise, it returns False.

# helper function
def check_key() -> bool:
    # Check for the OpenAI API key in the environment
    # Setting the key in the env is the best way to keep llama_index from throwing an exception
    if "OPENAI_API_KEY" in os.environ:
        print("\nOPENAI_API_KEY detected in env")
        return True
    else:
        return False

 

Use the helper function check_key() to verify that an API key is set. If you do not have one, you can get it here: https://platform.openai.com/api-keys
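For reference, here is one way to make the key visible to the script; the "sk-..." value below is a placeholder for your own key:

# In a Linux/macOS shell, before running the script:
#   export OPENAI_API_KEY="sk-..."
# Or from Python, before any LlamaIndex calls are made:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; substitute your own key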

Two lists of ChatMessage objects are created, one for each model (GPT-3.5-turbo and GPT-4). These messages initiate the chat, with the system role setting the context and the user role asking a specific question.

def main():

    if check_key():
        openai_client_gpt_3_5_turbo = OpenAI(model="gpt-3.5-turbo")
        openai_client_gpt_3_5_turbo.api_key = os.environ["OPENAI_API_KEY"]

        openai_client_gpt_4 = OpenAI(model="gpt-4")
        openai_client_gpt_4.api_key = os.environ["OPENAI_API_KEY"]

    else:
        print("OPENAI_API_KEY not in env")
        exit(1)  # Exit if no API key is found

    # Define the GPT-3.5 chat messages to initiate the chat
    messages_3_5 = [
        ChatMessage(role="system", content="You are a helpful AI assistant."),
        ChatMessage(role="user", content="Tell me the best day to visit Paris. Then, elaborate.")
    ]

    # Define the GPT-4 chat messages to initiate the chat
    messages_4 = [
        ChatMessage(role="system", content="You are a helpful AI assistant."),
        ChatMessage(role="user", content="Tell me the best day to visit Paris. Then, elaborate.")
    ]

The current time is recorded before and after the stream_chat calls to both models. stream_chat sends the messages and returns as soon as the response stream is open, so the elapsed time here approximates the time to the first chunk rather than to the full completion. The responses from both models are then printed as they are received: iterating over each stream yields chunks of the response, which are printed incrementally.

    # Get the current time
    start_time = time.time()

    # Synchronously (blocking calls made serially) call both GPT-3.5-turbo and GPT-4
    response_3_5 = openai_client_gpt_3_5_turbo.stream_chat(messages_3_5)
    response_4 = openai_client_gpt_4.stream_chat(messages_4)

    # Get the end time. Essentially, the response begins as soon as the first
    # chunk of text starts to stream back to us
    end_time = time.time()

In this final step, we print the responses from the LLMs as the chunks stream in. We also print out the time it took for the streams to start returning data.

    # Print the incremental GPT-3.5-turbo response chunks as they arrive
    for chunk in response_3_5:
        print(chunk.delta, end='')

    print("\n")  # separate the two model responses

    # Print the incremental GPT-4 response chunks as they arrive
    for chunk in response_4:
        print(chunk.delta, end='')

    # Calculate the elapsed time in seconds
    elapsed_time = end_time - start_time

    # Format the elapsed time to two decimal places
    formatted_time = "{:.2f}".format(elapsed_time)

    # Print the formatted time
    print(f"\nElapsed time (first chunk return): {formatted_time} seconds")

if __name__ == "__main__":
    main()
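Note that elapsed_time above measures how long it took for both streams to open, not how long the full completions took to generate. If you want both numbers, a small variation on the same pattern (a sketch using the same classes as above) can time the first chunk and the full stream separately:

import time
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

llm = OpenAI(model="gpt-3.5-turbo")
messages = [ChatMessage(role="user", content="Tell me the best day to visit Paris.")]

start = time.time()
first_chunk_time = None
for chunk in llm.stream_chat(messages):
    if first_chunk_time is None:
        first_chunk_time = time.time() - start  # latency to the first chunk
    print(chunk.delta, end='')
total_time = time.time() - start  # latency for the complete response

print(f"\nFirst chunk: {first_chunk_time:.2f}s, full response: {total_time:.2f}s")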

We are working on a Jupyter notebook version of this code, as well as an easy way for you to get it from GitHub. We will also be publishing a YouTube series of videos that covers this material and more. Please stay tuned.

PARTIAL OUTPUT SHOWN HERE FOR DEMO PURPOSES...

OPENAI_API_KEY detected in env
The best day to visit Paris can vary depending on your preferences and interests. However, many people find that visiting Paris during the spring (April to June) or fall (September to November) offers pleasant weather, fewer crowds, and beautiful scenery with blooming flowers or colorful autumn foliage.

During these seasons, you can enjoy outdoor activities like picnicking along the Seine River, exploring the city's parks and gardens, and strolling through charming neighborhoods without feeling overwhelmed by tourists.

Additionally, visiting Paris on weekdays rather than weekends can help you avoid long lines at popular attractions and experience a more relaxed atmosphere in the city.

Here are the class hierarchies behind the two key classes used.

[Diagram: class hierarchy of the ChatMessage class]

[Diagram: class hierarchy of the OpenAI class]
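If you want to inspect these hierarchies yourself, Python's method resolution order exposes them directly; here is a quick sketch (the exact base classes you see will depend on your installed LlamaIndex version):

from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage

for cls in (ChatMessage, OpenAI):
    # Print the full inheritance chain, from the class itself up to object
    print(" -> ".join(base.__name__ for base in cls.__mro__))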
