How to build an OpenAI-compatible API



Create a server that replicates OpenAI's Chat Completions API, enabling any LLM to integrate with tools written for the OpenAI API

Image generated by the author using OpenAI DALL-E

It's early 2024, and the Gen AI market is dominated by OpenAI. For good reasons, too: they have the first-mover advantage, being the first to offer an easy-to-use API for an LLM, and they also offer arguably the most capable LLM to date, GPT-4. Given that this is the case, developers of all kinds of tools (agents, personal assistants, coding extensions) have turned to OpenAI for their LLM needs.

While there are many reasons to fuel your Gen AI creations with OpenAI's GPT, there are plenty of reasons to opt for an alternative. Sometimes it can be less cost-efficient, at other times your data privacy policy may prohibit you from using OpenAI, or maybe you're hosting an open-source LLM (or your own).

OpenAI's market dominance means that many of the tools you might want to use only support the OpenAI API. Gen AI and LLM providers like OpenAI, Anthropic, and Google all seem to be creating different API schemas (perhaps intentionally), which adds a lot of extra work for devs who want to support all of them.

So, as a quick weekend project, I decided to implement a Python FastAPI server that is compatible with the OpenAI API specs, so that you can wrap virtually any LLM you like (either managed, like Anthropic's Claude, or self-hosted) to mimic the OpenAI API. Thankfully, the OpenAI client exposes a base_url parameter that you can set to effectively point the client at your server instead of OpenAI's servers, and most of the developers of the aforementioned tools let you set this parameter to your liking.

To do this, I followed OpenAI's Chat API reference, openly available here, with some help from the code of vLLM, an Apache-2.0 licensed inference server for LLMs that also offers OpenAI API compatibility.

Game Plan

We will be building a mock API that mimics the way OpenAI's Chat Completions API (/v1/chat/completions) works. While this implementation is in Python and uses FastAPI, I kept it fairly simple so that it is easily transferable to another modern coding language like TypeScript or Go. We will be using Python's official OpenAI client library to test it: the idea is that if we can get the library to think our server is OpenAI, we can get any program that uses it to think the same.

First step: chat completions API, no streaming

We'll start by implementing the non-streaming bit. First, let's model our request:

from typing import List, Optional

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    stream: Optional[bool] = False

The Pydantic model represents the request from the client, aiming to replicate the API reference. For the sake of brevity, this model doesn't implement the entire spec, just the bare bones needed for it to work. If you're missing a parameter that is part of the API spec (like top_p), you can simply add it to the model.

The ChatCompletionRequest models the parameters OpenAI uses in its requests. The chat API spec requires specifying a list of ChatMessage objects (like a chat history; the client is usually in charge of keeping it and feeding it back in with every request). Each chat message has a role attribute (usually system, assistant, or user) and a content attribute containing the actual message text.
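
As a sketch, adding top_p and stop (both optional parameters in the official spec) is just a matter of declaring more fields on the model, reusing the imports and ChatMessage class from the block above:

class ChatCompletionRequest(BaseModel):
    model: str = "mock-gpt-model"
    messages: List[ChatMessage]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.1
    top_p: Optional[float] = 1.0       # nucleus sampling, part of the OpenAI spec
    stop: Optional[List[str]] = None   # stop sequences, part of the OpenAI spec
    stream: Optional[bool] = False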

Next, we'll write our FastAPI chat completions endpoint:

import time

from fastapi import FastAPI

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):

    if request.messages and request.messages[0].role == "user":
        resp_content = "As a mock AI Assistant, I can only echo your last message: " + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there were no messages!"

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }

That simple.

Testing our implementation

Assuming both code blocks are in a file called main.py, we'll install the needed Python libraries in our environment of choice (it's always best to create a new one): pip install fastapi uvicorn openai (uvicorn is what we'll use to run the app), and then launch the server from a terminal:

uvicorn main:app

Using another terminal (or by launching the server in the background), we will open a Python console and copy-paste the following code, taken straight from OpenAI's Python Client Reference:

from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # change the default port if needed
)

# call API
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-1337-turbo-pro-max",
)

# print the top "choice"
print(chat_completion.choices[0].message.content)

If you've done everything correctly, the response from the server should be printed. It's also worth inspecting the chat_completion object to verify that all relevant attributes are as sent from our server. You should see something like this:

Code by the author, formatted using Carbon
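
In plain text, the printed line should read:

As a mock AI Assistant, I can only echo your last message: Say this is a test

and the chat_completion object itself should be a ChatCompletion instance whose id, model, and object attributes match what our server returned ("1337", "gpt-1337-turbo-pro-max", and "chat.completion", respectively).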

Leveling up: supporting streaming

As LLM generation tends to be slow (computationally expensive), it's worth streaming your generated content back to the client, so that the user can see the response as it's being generated without having to wait for it to finish. If you recall, we gave ChatCompletionRequest a boolean stream property, which lets the client request that the data be streamed back to it rather than sent all at once.

This makes things just a bit more complex. We'll create a generator function to wrap our mock response (in a real-world scenario, we would want a generator hooked up to our LLM's output; see the sketch after this block):

import asyncio
import json

async def _resp_async_generator(text_resp: str):
    # let's pretend every word is a token and return it over time
    tokens = text_resp.split(" ")

    for i, token in enumerate(tokens):
        chunk = {
            "id": i,
            "object": "chat.completion.chunk",
            "created": time.time(),
            "model": "blah",
            "choices": [{"delta": {"content": token + " "}}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
        await asyncio.sleep(1)
    yield "data: [DONE]\n\n"

And now we can modify our original endpoint to return a StreamingResponse when stream==True:

import time

from starlette.responses import StreamingResponse

app = FastAPI(title="OpenAI-compatible API")

@app.post("/chat/completions")
async def chat_completions(request: ChatCompletionRequest):

    if request.messages:
        resp_content = "As a mock AI Assistant, I can only echo your last message: " + request.messages[-1].content
    else:
        resp_content = "As a mock AI Assistant, I can only echo your last message, but there wasn't one!"
    if request.stream:
        return StreamingResponse(_resp_async_generator(resp_content), media_type="application/x-ndjson")

    return {
        "id": "1337",
        "object": "chat.completion",
        "created": time.time(),
        "model": request.model,
        "choices": [{
            "message": ChatMessage(role="assistant", content=resp_content)
        }]
    }

Testing the streaming implementation

After restarting the uvicorn server, we'll open up a Python console and paste in this code (again, taken from OpenAI's library docs):

from openai import OpenAI

# init client and connect to localhost server
client = OpenAI(
    api_key="fake-api-key",
    base_url="http://localhost:8000"  # change the default port if needed
)

stream = client.chat.completions.create(
    model="mock-gpt-model",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "")

You should see each word in the server's response being printed one by one, mimicking token generation. We can inspect the last chunk object to see something like this:

Code by the author, formatted using Carbon
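
For reference, the final content chunk our server emits before the [DONE] sentinel (which the client consumes silently) looks roughly like this (the created value is illustrative, and the id equals the index of the last word):

data: {"id": 16, "object": "chat.completion.chunk", "created": 1700000000.0, "model": "blah", "choices": [{"delta": {"content": "test "}}]}

The parsed chunk object in the console should show matching attributes, e.g. choices[0].delta.content == "test " and object == "chat.completion.chunk".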

Putting it all together

Finally, in the gist below, you can see the entire code for the server.

https://medium.com/media/91eb67425eb23e8948b39b1d44164f03/href

Final Notes

  • There are many other interesting things we can do here, like supporting other request parameters, and other OpenAI abstractions such as Function Calling and the Assistants API.
  • The lack of standardization in LLM APIs makes it difficult to switch providers, both for companies and for developers of LLM-wrapping packages. In the absence of any standard, the approach I've taken here is to abstract the LLM behind the specs of the biggest and most mature API.



How to build an OpenAI-compatible API was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


