Beyond RAG: Network Analysis via LLMs for Knowledge Extraction



End-to-end data science project using Streamlit, Upstash, and OpenAI to build better knowledge navigation and comprehension using network analysis

Photo by USGS on Unsplash

This article will guide you through an end-to-end data science project using several state-of-the-art tools in the AI space. The tool is called Mind Mapper because it allows you to create conceptual maps by injecting information into a knowledge base and retrieving it in a smart way.

The motivation was to go beyond the "simple" RAG framework, where a user queries a vector database and its response is then fed to an LLM like GPT-4 for an enriched answer.

Mind Mapper leverages RAG to create intermediate result representations useful for performing a kind of knowledge intelligence, which in turn allows us to better understand the output of RAG over long and unstructured documents.

Simply put, I want to use RAG as a foundational step to build alternative responses, not just textual ones. A mind map is one such response.

Here are some of the tool's features:

  • Manages text in basically all forms: copy-pasted, textual, and originating from audio sources (video is considered too, if the project is well received)
  • Uses an in-project SQLite database for data persistence
  • Leverages the state-of-the-art Upstash vector database to store vectors efficiently
  • Chunks from the vector database are then used to create a knowledge graph of the information
  • A final LLM is called to comment on the knowledge graph and extract insights

We'll use Streamlit as the library for frontend rendering of our logic. All of the code will be written in Python.

If you'd like to try the app you'll be building, check it out here


I've uploaded a series of text documents copy-pasted from Wikipedia about prominent individuals in the AI world, like Sam Altman, Andrej Karpathy, and more. We'll query this knowledge base to demonstrate how the project works.

A mind map looks like this, when using a prompt like

"Who is Andrej Karpathy?"

Example of a mind map. Image by author.

Feel free to navigate the linked application, provide your OpenAI API key and Upstash REST URL + token, and prompt the existing knowledge base for some demo insights.

The deployed Streamlit app has the inputs section disabled to avoid exposing the database publicly. If you build the app from the ground up or clone it from GitHub, you'll have the database available under the main branch of the project.

If this introduction has piqued your interest, then join me and let's dive deeper into the explanations and the code!

Here's the GitHub of the project if you want to follow along.

GitHub – andrea-dagostino/mind_mapper

How Does It Work?

The software works following this algorithm (a short code sketch of the whole flow follows the list):

  1. the user uploads or pastes text into the software and saves the data into a database. The user can also upload an audio track, which gets transcribed thanks to OpenAI's Whisper model
Input section of the software. Image by author.

2. when the data is saved, it's split into textual chunks, and these chunks are then embedded using OpenAI's ada-002 model

3. vectors are saved into the Upstash vector database, with metadata attached

4. when the user asks a question to the assistant, the query is embedded using the same model and that vector is used to retrieve the top n most similar chunks using the dot product similarity metric

5. these similar chunks of text, which are related to the input query, are fed into an AI agent responsible for extracting entities and relationships from all the chunks

6. these entities and relationships make up a Python dictionary which is then used to build the mind map

7. another agent reads the content of the same dictionary and creates a comment to describe the mind map and highlight relevant information

END.
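Here is that sketch: a compressed, illustrative view of the whole flow using the modules we'll build in the sections below. Everything in it is a placeholder (text, credentials, question); the function names match the scripts described later in the article.

# High-level sketch of the pipeline (illustrative only)
from openai import OpenAI
from upstash_vector import Index

from src import db, vector_db, mind_map
from src.llm import llm
from src.schema import FileType
from src.utils import hash_text

client = OpenAI()                                  # assumes OPENAI_API_KEY is set
index = Index(url="<REST_URL>", token="<TOKEN>")   # Upstash credentials (placeholders)

# 1-3. save the text locally, chunk it, embed it, push the vectors to Upstash
text = "Andrej Karpathy is a Slovak-Canadian computer scientist..."  # placeholder document
hash_id = hash_text(text)
db.add_one({"filename": "*manual_input*", "title": "Karpathy", "file_type": FileType.TEXT,
            "hash_id": hash_id, "text": text})
chunks = vector_db.create_chunks(text)
vector_db.add_chunks_to_vector_db(index, client, chunks, metadata={"source_hash_id": hash_id})

# 4. embed the question and retrieve the most similar chunks as context
context = vector_db.query_vector_db(index, client, question="Who is Andrej Karpathy?", top_n=5)

# 5-7. extract relationships, comment on them, and draw the mind map
relationships = eval(llm.extract_mind_map_data(client, context))  # JSON string -> dict
comment = llm.extract_information_from_mind_map_data(client, relationships)
fig = mind_map.create_plotly_mind_map(relationships)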

The Tools

Let's briefly go through the project's dependencies to get a better understanding of the blocks that make up the logic.

Poetry

I use Poetry for basically all of my projects. It's a convenient and simple Python environment and package manager. You can download Poetry from this link.

If you cloned the repository, all you have to do is run poetry install inside the project's folder in your terminal. Poetry will install and take care of everything.

Upstash Vector Database

Upstash was a fairly recent discovery, and I wanted to test it out with a real project. While Upstash has been releasing state-of-the-art products for a while, it was missing a vector database. Less than a month ago, the company released its vector database, which is fully on the cloud and free for experimentation and more. I found myself enjoying its API, and the online service had zero lag.

Upstash: Serverless Data for Redis and Kafka

OpenAI

As mentioned, this project leverages Whisper for audio file transcription and GPT-4 to power the agents that extract and comment on the mind map. We could also use open-source models if we wanted to.

If you haven't already, you can set up an OpenAI API key at this link.

https://platform.openai.com

NetworkX

NetworkX powers the mind map component of the software. It takes care of creating nodes for entities and edges among them. With Plotly, the interactive visualization library, you can really visualize complex networks. You can read more about the library at this link.

NetworkX – NetworkX documentation

Streamlit

There are a bunch of core libraries like Pandas and NumPy, but I won't even list them here. On the other hand, Streamlit has to be mentioned because it makes the frontend possible. A real boon for data scientists who have little knowledge of frontend frameworks and JavaScript.

Streamlit • A faster way to build and share data apps

Now that we have a better idea of the main components of our software, let's start building it from scratch. Sit tight, because it's going to be quite a long read.

The Project's Structure

This is how the entire project looks:

The logic is contained in the src folder, which holds the bulk of the code, with a dedicated subfolder for the LLM parts. We'll go step by step and build all the scripts, starting with the one dedicated to the data structure, i.e. schema.py.

Schema, Database and Helpers

Let's start by defining the information schema. It's usually the first thing I do when working with data. We'll use SQLModel and Pydantic to define an Information object that will store the information and allow table creation in SQLite.

# schema.py

from sqlmodel import SQLModel, Field
from typing import Optional

import datetime
from enum import Enum


class FileType(Enum):
    AUDIO = "audio"
    TEXT = "text"
    VIDEO = "video"


class Information(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    filename: str = Field()
    title: Optional[str] = Field(default="NA", unique=False)
    hash_id: str = Field(unique=True)
    created_at: float = Field(default=datetime.datetime.now().timestamp())
    file_type: FileType
    text: str = Field(default="")
    embedded: bool = Field(default=False)

    __table_args__ = {"extend_existing": True}

Each text we enter in the database will be an Information. It will have

  • an ID, which will act as a primary key and thus be auto-incremented
  • a filename that indicates the name of the uploaded file, in string format
  • a title that the user can optionally specify, in string format
  • hash_id: created by MD5-hashing the text. We'll use the hash ID to perform database operations like read, delete and update.
  • created_at, which is automatically generated using the current time as a default value, indicating when the item was saved to the database
  • file_type, which indicates whether the input data was textual, audio or video (video is not implemented, but it could be)
  • text, which contains the source data used by the entire logic
  • embedded, a boolean value that helps us point to the items that have already been embedded and are thus present in the cloud vector database

Note: the piece of code __table_args__ = {"extend_existing": True} is necessary to be able to access and manipulate data in the database from Streamlit.

Now that we've got the data schema down, let's write our first utility function: the logger. It's an incredibly useful thing to have, and thanks to the Rich library we'll also enjoy some nice colors in the terminal.

# logger.py

import logging
from rich.logging import RichHandler
from typing import Optional


def get_console_logger(name: Optional[str] = "default") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.setLevel(logging.DEBUG)
        console_handler = RichHandler()
        console_handler.setLevel(logging.DEBUG)
        formatter = logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        )
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)

    return logger

We’ll simply import it in all of our core scripts.
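Using it is a one-liner per module; the module name passed in is arbitrary:

from src.logger import get_console_logger

logger = get_console_logger("ingestion")  # any descriptive name works
logger.info("Starting ingestion...")      # rendered with Rich formatting in the terminal
logger.warning("Something looks off")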

Since we're at it, let's also write our utils.py script with some helper functions.

# utils.py

import wave
import contextlib
from pydub import AudioSegment

import hashlib
import datetime

from src import logger

logger = logger.get_console_logger("utils")


def compute_cost_of_audio_track(audio_track_file_path: str):
    file_extension = audio_track_file_path.split(".")[-1].lower()
    duration_seconds = 0
    if file_extension == "wav":
        with contextlib.closing(wave.open(audio_track_file_path, "rb")) as f:
            frames = f.getnframes()
            rate = f.getframerate()
            duration_seconds = frames / float(rate)
    elif file_extension == "mp3":
        audio = AudioSegment.from_mp3(audio_track_file_path)
        duration_seconds = len(audio) / 1000.0  # pydub returns duration in milliseconds
    else:
        logger.error(f"Unsupported file format: {file_extension}")
        return

    audio_duration_in_minutes = duration_seconds / 60
    cost = round(audio_duration_in_minutes, 2) * 0.006  # default price of the Whisper model
    logger.info(f"Cost to convert {audio_track_file_path} is ${cost:.2f}")
    return cost


def hash_text(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()


def convert_timestamp_to_datetime(timestamp: str) -> str:
    return datetime.datetime.fromtimestamp(int(timestamp)).strftime("%Y-%m-%d %H:%M:%S")
We won't end up using the compute_cost_of_audio_track function in this version of the tool, but I've included it anyway in case you want to use it.

hash_text is going to be used a lot to create the hash IDs we insert into the database, while convert_timestamp_to_datetime is useful for reading the default timestamp placed in the database upon item creation.
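Here is what the two helpers do in practice (the inputs are just examples):

from src.utils import hash_text, convert_timestamp_to_datetime

print(hash_text("Andrej Karpathy is a computer scientist..."))  # deterministic MD5 hex digest
print(convert_timestamp_to_datetime("1709000000"))              # epoch seconds -> 'YYYY-MM-DD HH:MM:SS'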

Now let's look at the database setup. We'll build the usual CRUD interface:

# db.py

from sqlmodel import SQLModel, create_engine, Session, select
from src.schema import Information
from src.logger import get_console_logger

sqlite_file_name = "database.db"
sqlite_url = f"sqlite:///{sqlite_file_name}"
engine = create_engine(sqlite_url, echo=False)

logger = get_console_logger("db")

SQLModel.metadata.create_all(engine)


def read_one(hash_id: str):
    with Session(engine) as session:
        statement = select(Information).where(Information.hash_id == hash_id)
        information = session.exec(statement).first()
        return information


def add_one(data: dict):
    with Session(engine) as session:
        if session.exec(
            select(Information).where(Information.hash_id == data.get("hash_id"))
        ).first():
            logger.warning(f"Item with hash_id {data.get('hash_id')} already exists")
            return None  # or raise an exception, or handle as needed
        information = Information(**data)
        session.add(information)
        session.commit()
        session.refresh(information)
        logger.info(f"Item with hash_id {data.get('hash_id')} added to the database")
        return information


def update_one(hash_id: str, data: dict):
    with Session(engine) as session:
        # Check if the item with the given hash_id exists
        information = session.exec(
            select(Information).where(Information.hash_id == hash_id)
        ).first()
        if not information:
            logger.warning(f"No item with hash_id {hash_id} found for update")
            return None  # or raise an exception, or handle as needed
        for key, value in data.items():
            setattr(information, key, value)
        session.commit()
        logger.info(f"Item with hash_id {hash_id} updated in the database")
        return information


def delete_one(id: str):
    with Session(engine) as session:
        # Check if the item with the given hash_id exists
        information = session.exec(
            select(Information).where(Information.hash_id == id)
        ).first()
        if not information:
            logger.warning(f"No item with hash_id {id} found for deletion")
            return None  # or raise an exception, or handle as needed
        session.delete(information)
        session.commit()
        logger.info(f"Item with hash_id {id} deleted from the database")
        return information


def add_many(data: list):
    with Session(engine) as session:
        for info in data:
            # Reuse the add_one function for each item
            result = add_one(info)
            if result is None:
                logger.warning(
                    f"Item with hash_id {info.get('hash_id')} could not be added"
                )
            else:
                logger.info(
                    f"Item with hash_id {info.get('hash_id')} added to the database"
                )
        session.commit()  # Commit at the end of the loop


def delete_many(ids: list):
    with Session(engine) as session:
        for id in ids:
            # Reuse the delete_one function for each item
            result = delete_one(id)
            if result is None:
                logger.warning(f"No item with hash_id {id} found for deletion")
            else:
                logger.info(f"Item with hash_id {id} deleted from the database")
        session.commit()  # Commit at the end of the loop


def read_all(query: dict = None):
    with Session(engine) as session:
        statement = select(Information)
        if query:
            statement = statement.where(
                *[getattr(Information, key) == value for key, value in query.items()]
            )
        information = session.exec(statement).all()
        return information


def delete_all():
    with Session(engine) as session:
        session.exec(Information).delete()
        session.commit()
        logger.info("All items deleted from the database")

With this script, we'll be able to create the database and easily read, create, delete and update items, one by one or in bulk.
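As a quick sanity check, this is how the interface might be used from a Python shell; the content and title below are placeholders:

from src import db
from src.schema import FileType
from src.utils import hash_text

text = "Some notes about LLM agents..."        # placeholder content
hash_id = hash_text(text)

db.add_one({"filename": "*manual_input*", "title": "Agent notes",
            "file_type": FileType.TEXT, "hash_id": hash_id, "text": text})
print(db.read_one(hash_id).embedded)           # False right after insertion
db.update_one(hash_id, {"embedded": True})     # flip the flag once vectors are pushed
print(len(db.read_all()))                      # total number of stored items
db.delete_one(hash_id)                         # clean up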

Now that we have our information structure and an interface to the database, we'll move on to the management of audio files.

Whisper Model to Create Transcriptions

This was a completely optional step, but I wanted to spice things up. Our code will allow users to upload any .mp3 or .wav file and transcribe its contents via OpenAI's Whisper model. The persona I had in mind was a university student collecting notes through voice recordings.

Keep in mind that Whisper is a paid model. At the time of writing this article, the price was $0.006 / minute. You can learn more at this link.

Let's create whisper.py with a single function called create_transcript.

from src.logger import get_console_logger

logger = get_console_logger("whisper")


def create_transcript(openai_client, file_path: str) -> str:
    audio_file = open(file_path, "rb")
    logger.info(f"Creating transcript for {file_path}")
    transcript = openai_client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )
    logger.info(f"Transcript created for {file_path}")
    return transcript.text

This function is very simple: it's just a thin wrapper around OpenAI's audio module.

The attentive eye will notice that openai_client is an argument of the function. That's not a mistake, and we'll see why in just a second.

Now we can handle text in all (of the supported) forms, that is, plain text and audio. It's time to vectorize these texts and push them to our Upstash vector database.
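Passing the client in also makes the function trivial to use outside Streamlit. A minimal sketch, with a hypothetical audio file and the API key read from the environment:

from openai import OpenAI

from src.whisper import create_transcript

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
transcript = create_transcript(client, "lecture_notes.mp3")  # hypothetical recording
print(transcript[:200])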

Upstash Vector Database Setup

We'll be using a couple more tools here to properly embed our documents for vector search and RAG.

  • Tiktoken: the well-known library by OpenAI that allows for simple and efficient tokenization based on the LLM (in our case, GPT-3.5)
  • LangChain: I love this library and find it very flexible, despite what a portion of the community says about it. In this project, I borrow the RecursiveCharacterTextSplitter object from it.

Again, if you cloned the repo, Poetry will import the required dependencies automatically. If not, just run the command poetry add langchain tiktoken.

Of course, we'll also need to install Upstash Vector: the command is poetry add upstash-vector. Once installed, go to https://console.upstash.com/ to set up your cloud environment.

Make sure you choose 1536 as the vector dimensionality to match the size of OpenAI's ADA model.

As I mentioned before, Upstash is a paid tool, but it has a very generous free tier that I used extensively for this project.

Free: The free plan is suitable for small projects. It has a limit of 10,000 queries and 10,000 updates per day.

This is great for getting started with projects like these. Scalability, in addition, is not an issue, since you can easily tune your requirements.

Once done, grab your REST URL and token.

The endpoint and the token are needed to establish a connection via Python. Image by author.
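With those two values you can already open a connection from Python; the resulting index object is what the functions below receive as their index argument. The URL and token here are placeholders for the ones shown in your console:

from upstash_vector import Index

index = Index(
    url="https://<your-index>-vector.upstash.io",  # placeholder REST URL
    token="<YOUR_UPSTASH_VECTOR_TOKEN>",           # placeholder token
)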

Now we're ready to write our script.

# vector_db.py

from src.logger import get_console_logger

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from upstash_vector import Vector
from tqdm import tqdm
import random

logger = get_console_logger("vector_db")

MODEL = "text-embedding-ada-002"
ENCODER = tiktoken.encoding_for_model("gpt-3.5-turbo")


def token_len(text):
    """Calculate the token length of a given text.

    Args:
        text (str): The text to calculate the token length for.

    Returns:
        int: The number of tokens in the text.
    """
    return len(ENCODER.encode(text))


def get_embeddings(openai_client, chunks, model=MODEL):
    """Get embeddings for a list of text chunks using the specified model.

    Args:
        openai_client: The OpenAI client instance to use for generating embeddings.
        chunks (list of str): The text chunks to embed.
        model (str): The model identifier to use for embedding.

    Returns:
        list of list of float: A list of embeddings, each corresponding to a chunk.
    """
    chunks = [c.replace("\n", " ") for c in chunks]
    res = openai_client.embeddings.create(input=chunks, model=model).data
    return [r.embedding for r in res]


def get_embedding(openai_client, text, model=MODEL):
    """Get the embedding for a single text using the specified model.

    Args:
        openai_client: The OpenAI client instance to use for generating the embedding.
        text (str): The text to embed.
        model (str): The model identifier to use for embedding.

    Returns:
        list of float: The embedding of the given text.
    """
    # text = text.replace("\n", " ")
    return get_embeddings(openai_client, [text], model)[0]


def query_vector_db(index, openai_client, question, top_n=1):
    """Query the vector database for vectors similar to the given question.

    Args:
        index: The vector database index to query.
        openai_client: The OpenAI client instance to use for generating the question embedding.
        question (str): The question to query the vector database with.
        top_n (int, optional): The number of top similar vectors to return. Defaults to 1.

    Returns:
        str: A string containing the concatenated texts of the top similar vectors.
    """
    logger.info("Creating vector for question...")
    question_embedding = get_embedding(openai_client, question)
    logger.info("Querying vector database...")
    res = index.query(vector=question_embedding, top_k=top_n, include_metadata=True)
    context = "\n-".join([r.metadata["text"] for r in res])
    logger.info(f"Context returned. Length: {len(context)} characters.")
    return context


def create_chunks(text, chunk_size=150, chunk_overlap=20):
    """Create text chunks based on the specified size and overlap.

    Args:
        text (str): The text to split into chunks.
        chunk_size (int, optional): The desired size of each chunk. Defaults to 150.
        chunk_overlap (int, optional): The number of overlapping tokens between chunks. Defaults to 20.

    Returns:
        list of str: A list of text chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=token_len,
        separators=["\n\n", "\n", " ", ""],
    )
    return text_splitter.split_text(text)


def add_chunks_to_vector_db(index, openai_client, chunks, metadata):
    """Embed text chunks and add them to the vector database.

    Args:
        index: The vector database index to add chunks to.
        openai_client: The OpenAI client instance used to embed each chunk.
        chunks (list of str): The text chunks to embed and add.
        metadata (dict): The metadata to associate with each chunk.

    Returns:
        None
    """
    for chunk in chunks:
        random_id = random.randint(0, 1000000)  # workaround while waiting for metadata search to be implemented
        metadata["text"] = chunk
        vec = Vector(
            id=f"chunk-{random_id}", vector=get_embedding(openai_client, chunk), metadata=metadata
        )
        index.upsert(vectors=[vec])
        logger.info(f"Added chunk to vector db: {chunk}")


def fetch_by_source_hash_id(index, source_hash_id: str, max_results=10000):
    """Fetch vector IDs from the database by source hash ID.

    Args:
        index: The vector database index to search.
        source_hash_id (str): The source hash ID to filter the vectors by.
        max_results (int, optional): The maximum number of results to return. Defaults to 10000.

    Returns:
        list of str: A list of vector IDs that match the source hash ID.
    """
    ids = []
    for i in tqdm(range(0, max_results, 1000)):
        search = index.range(
            cursor=str(i), limit=1000, include_vectors=False, include_metadata=True
        ).vectors
        for result in search:
            if result.metadata["source_hash_id"] == source_hash_id:
                ids.append(result.id)
    return ids


def fetch_all(index):
    """Fetch all vectors from the database.

    Args:
        index: The vector database index to fetch vectors from.

    Returns:
        list: A list of vectors from the database.
    """
    return index.range(
        cursor="0", limit=1000, include_vectors=False, include_metadata=True
    ).vectors

There's more going on in this script, so let me dive a little deeper.

get_embedding and get_embeddings are used to encode one or multiple texts. They are conveniently placed here for better control.

query_vector_db allows us to query Upstash for items similar to our query vector. In this function, we embed the question and perform the lookup through the index's .query method. The index, together with OpenAI's client, is passed in as an argument later in the Streamlit app. The returned object is a string called context, which is a concatenation of the top N items most similar to the input query.

Continuing, we leverage LangChain's RecursiveCharacterTextSplitter to efficiently create textual chunks from the documents.

Then there's a bit of a CRUD interface for the vector DB as well: adding and fetching data (updating and deleting are easily done too, and we'll do that in the frontend).

Note: at the time of writing this article, Upstash doesn't yet support search on metadata. This means that since we're using hash_id to identify our documents, they aren't directly queryable. I've added a simple workaround in the code that browses through a batch (100k) of documents and looks up the hash ID manually. I've read online that they'll be implementing this functionality soon.
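That workaround is also what makes per-document deletion possible: we first collect the vector IDs whose metadata carries a given source_hash_id, then delete them through the index, exactly as the frontend will do later. A minimal sketch (credentials and hash are placeholders):

from upstash_vector import Index

from src import vector_db

index = Index(url="https://<your-index>-vector.upstash.io", token="<TOKEN>")  # placeholders

ids_to_delete = vector_db.fetch_by_source_hash_id(index, source_hash_id="d1c2aa...")
if ids_to_delete:
    index.delete(ids_to_delete)  # removes every chunk belonging to that document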

LLM Agents to Build the Network Graph

We'll start coding our LLM behaviors by working on the prompts first.

There are going to be two agents. The first is responsible for extracting network data from the text, while the second is responsible for analyzing that network data.

The prompt for the first agent is the following:

You are an expert in creating network graphs from textual data.
You are also a note-taking expert and you are able to create mind maps from text.
You are tasked with creating a mind map from a given text data by extracting the concepts and relationships from the text.\n
The relationships should be among objects, people, or places mentioned in the text.\n

TYPES should only be one of the following:
- is a
- is related to
- is part of
- is similar to
- is different from
- is a type of

Your output should be a JSON containing the following:
{ "relationships": [{"source": ..., "target": ..., "type": ..., "origin": _source_or_target_}, {...}] } \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n


NEVER change this output format. ENGLISH is the output language. NEVER change the output language.
Your response will be used as a Python dictionary, so be always aware of the syntax and the data types to return a JSON object.\n

INPUT TEXT:\n

The analyzer agent instead uses this prompt:

You are a senior business intelligence analyst, who is able to extract valuable insights from data.
You are tasked with extracting information from a given mind map data.\n
The mind map data is a JSON containing the following:
{{ "relationships": [{{"source": ..., "target": ..., "type": ..., "origin": _source_or_target_}}, {{...}}] }} \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
- origin: The origin node from which the relationship originates\n

You are to extract insights from the mind map data and provide a summary of the relationships.\n

Your output should be a brief comment on the mind map data, highlighting relevant insights and relationships using centrality and other graph analysis techniques.\n

NEVER change this output format. ENGLISH is the output language. NEVER change the output language.\n
Keep your output very brief. Just a comment to highlight the top most relevant information.

MIND MAP DATA:\n
{mind_map_data}

These two prompts will be imported in the Pythonic way: that is, as scripts.

Let's create a script in the llm folder called prompts.py and create a dictionary of intents in which we place the prompts as values.

# llm/prompts.py

PROMPTS = {
    "mind_map_of_one": """You are an expert in creating network graphs from textual data.
You are also a note-taking expert and you are able to create mind maps from text.
You are tasked with creating a mind map from a given text data by extracting the concepts and relationships from the text.\n
The relationships should be among objects, people, or places mentioned in the text.\n

TYPES should only be one of the following:
- is a
- is related to
- is part of
- is similar to
- is different from
- is a type of

Your output should be a JSON containing the following:
{ "relationships": [{"source": ..., "target": ..., "type": ...}, {...}] } \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n


NEVER change this output format. ENGLISH is the output language. NEVER change the output language.
Your response will be used as a Python dictionary, so be always aware of the syntax and the data types to return a JSON object.\n

INPUT TEXT:\n
""",
    "inspector_of_mind_map": """
You are a senior business intelligence analyst, who is able to extract valuable insights from data.
You are tasked with extracting information from a given mind map data.\n
The mind map data is a JSON containing the following:
{{ "relationships": [{{"source": ..., "target": ..., "type": ...}}, {{...}}] }} \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
- origin: The origin node from which the relationship originates\n

You are to extract insights from the mind map data and provide a summary of the relationships.\n

Your output should be a brief comment on the mind map data, highlighting relevant insights and relationships using centrality and other graph analysis techniques.\n

NEVER change this output format. ENGLISH is the output language. NEVER change the output language.\n
Keep your output very brief. Just a comment to highlight the top most relevant information.

MIND MAP DATA:\n
{mind_map_data}
""",
}

This way we can easily import and use the prompts simply by pointing at the agent's intent (mind_map_of_one, inspector_of_mind_map). We'll import the prompts in the llm.py script.

# llm/llm.py

from src.logger import get_console_logger
from src.llm.prompts import PROMPTS


logger = get_console_logger("llm")
MIND_MAP_EXTRACTION_MODEL = "gpt-4-turbo-preview"
MIND_MAP_INSPECTION_MODEL = "gpt-4"


def extract_mind_map_data(openai_client: object, text: str) -> str:
    logger.info("Extracting mind map data from text...")
    response = openai_client.chat.completions.create(
        model=MIND_MAP_EXTRACTION_MODEL,
        response_format={"type": "json_object"},
        temperature=0,
        messages=[
            {"role": "system", "content": PROMPTS["mind_map_of_one"]},
            {"role": "user", "content": f"{text}"},
        ],
    )
    return response.choices[0].message.content


def extract_mind_map_data_of_two(
    openai_client: object, source_text: str, target_text: str
) -> str:
    logger.info("Extracting mind map data from two texts...")
    # note: relies on a "mind_map_of_many" prompt that is not part of the PROMPTS dict shown above
    user_prompt = PROMPTS["mind_map_of_many"].format(
        source_text=source_text, target_text=target_text
    )
    response = openai_client.chat.completions.create(
        model=MIND_MAP_INSPECTION_MODEL,
        response_format={"type": "json_object"},  # this is important!
        messages=[
            {"role": "system", "content": PROMPTS["mind_map_of_many"]},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


def extract_information_from_mind_map_data(openai_client: object, data: dict) -> str:
    logger.info("Extracting information from mind map data...")
    user_prompt = PROMPTS["inspector_of_mind_map"].format(mind_map_data=data)
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PROMPTS["inspector_of_mind_map"]},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

All the heavy lifting is done by these simple functions, which merely connect a GPT agent to the appropriate prompt. Note response_format={"type": "json_object"} in the first function. This ensures that GPT-4 builds a JSON representation of the text's network data. Without this line, the entire application becomes highly unstable.
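The model returns a JSON string, so it still has to be turned into a Python dictionary before NetworkX can use it. The frontend below does this with eval; a stricter alternative (my suggestion, not part of the original code) is json.loads with a fallback:

import json

from openai import OpenAI

from src.llm import llm

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set
context = "Andrej Karpathy is a Slovak-Canadian computer scientist..."  # retrieved chunks (placeholder)

raw = llm.extract_mind_map_data(openai_client, context)  # JSON string returned by GPT-4
try:
    mind_map_data = json.loads(raw)        # strict parsing instead of eval()
except json.JSONDecodeError:
    mind_map_data = {"relationships": []}  # fall back to an empty graph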

Let's put the logic to the test. When passed the prompt "Who is Andrej Karpathy?", the first agent creates this network representation:

{
"relationships":[
{
"source":"Andrej Karpathy",
"target":"Slovak-Canadian",
"type":"is a"
},
{
"source":"Andrej Karpathy",
"target":"computer scientist",
"type":"is a"
},
{
"source":"Andrej Karpathy",
"target":"director of artificial intelligence and Autopilot Vision at Tesla",
"type":"served as"
},
{
"source":"Andrej Karpathy",
"target":"OpenAI",
"type":"worked at"
},
{
"source":"Andrej Karpathy",
"target":"deep learning",
"type":"specialized in"
},
{
"source":"Andrej Karpathy",
"target":"computer vision",
"type":"specialized in"
},
{
"source":"Andrej Karpathy",
"target":"Bratislava, Czechoslovakia",
"type":"was born in"
},
{
"source":"Andrej Karpathy",
"target":"Toronto",
"type":"moved to"
},
{
"source":"Andrej Karpathy",
"target":"University of Toronto",
"type":"completed degrees at"
},
{
"source":"Andrej Karpathy",
"target":"University of British Columbia",
"type":"completed master's degree at"
},
{
"source":"Andrej Karpathy",
"target":"OpenAI",
"type":"is a founding member of"
},
{
"source":"Andrej Karpathy",
"target":"Tesla",
"type":"became director of artificial intelligence at"
},
{
"source":"Andrej Karpathy",
"target":"Elon Musk",
"type":"reported to"
},
{
"source":"Andrej Karpathy",
"target":"MIT Technology Review's Innovators Under 35 for 2020",
"type":"was named one of"
},
{
"source":"Andrej Karpathy",
"target":"YouTube videos on how to create artificial neural networks",
"type":"makes"
},
{
"source":"Andrej Karpathy",
"target":"Stanford University",
"type":"received a PhD from"
},
{
"source":"Fei-Fei Li",
"target":"Stanford University",
"type":"is part of"
},
{
"source":"Andrej Karpathy",
"target":"natural language processing",
"type":"focused on"
},
{
"source":"Andrej Karpathy",
"target":"CS 231n: Convolutional Neural Networks for Visual Recognition",
"type":"authored and was the primary instructor of"
},
{
"source":"CS 231n: Convolutional Neural Networks for Visual Recognition",
"target":"Stanford",
"type":"is part of"
}
]
}

This data comes from the unstructured Wikipedia text uploaded into the tool for testing purposes. The representation looks just fine! Feel free to edit the prompts to extract even more information.

All that remains now is to use this Python dictionary of relationships to create our interactive mind map with NetworkX and Plotly.

Building the Mind Map with NetworkX and Plotly

There's going to be just one function, but it's going to be quite intense if you've never worked with NetworkX before. It isn't the simplest framework to work with, but the outputs you can get once you become proficient with it are valuable.

What we'll do is initialize a graph object with G = nx.DiGraph(), which creates a new directed graph. The function iterates over the list of relationships provided in the data dictionary. For each relationship, it adds an edge to the graph G from the source node to the target node, with a type attribute that describes the relationship.

for relationship in data["relationships"]:
    G.add_edge(
        relationship["source"], relationship["target"], type=relationship["type"]
    )

Once that's done, the graph's layout is computed using the spring layout algorithm, which positions the nodes in a way that tries to minimize the overlap between edges and keep edge lengths uniform. The seed parameter ensures that the layout is reproducible.

Finally, Plotly's Graph Objects (go) module takes care of creating a scatter trace for each data point, representing a node on the chart.

Here's how the mind_map.py script looks.

# mind_map.py

import networkx as nx
from graphviz import Digraph

import plotly.express as px
import plotly.graph_objects as go


def create_plotly_mind_map(data: dict) -> go.Figure:
    """
    data is a dictionary containing the following
    { "relationships": [{"source": ..., "target": ..., "type": ...}, {...}] }
    source: The source node
    target: The target node
    type: The type of the relationship between the source and target nodes
    """

    ### START - NETWORKX LOGIC ###
    # Create a directed graph
    G = nx.DiGraph()

    # Add edges to the graph
    for relationship in data["relationships"]:
        G.add_edge(
            relationship["source"], relationship["target"], type=relationship["type"]
        )

    # Create a layout for our nodes
    layout = nx.spring_layout(G, seed=42)

    traces = []
    for relationship in data["relationships"]:
        x0, y0 = layout[relationship["source"]]
        x1, y1 = layout[relationship["target"]]
        edge_trace = go.Scatter(
            x=[x0, x1, None],
            y=[y0, y1, None],
            line=dict(width=0.5, color="#888"),  # Set a single color for all edges
            hoverinfo="none",
            mode="lines",
        )
        traces.append(edge_trace)

    # Adjust node trace to color based on source node
    node_x = []
    node_y = []
    for node in G.nodes():
        x, y = layout[node]
        node_x.append(x)
        node_y.append(y)

    ### END - NETWORKX LOGIC ###

    node_trace = go.Scatter(
        x=node_x,
        y=node_y,
        mode="markers+text",
        # add text to the nodes and origin
        text=[node for node in G.nodes()],
        hoverinfo="text",
        marker=dict(
            showscale=False,
            colorscale="Greys",  # Change colorscale to grayscale
            reversescale=True,
            size=20,
            color="#505050",  # Set node color to gray
            line_width=2,
        ),
    )

    # Add node and edge labels
    edge_annotations = []
    for edge in G.edges(data=True):
        x0, y0 = layout[edge[0]]
        x1, y1 = layout[edge[1]]
        edge_annotations.append(
            dict(
                x=(x0 + x1) / 2,
                y=(y0 + y1) / 2,
                xref="x",
                yref="y",
                text=edge[2]["type"],
                showarrow=False,
                font=dict(size=10),
            )
        )

    node_annotations = []
    for node in G.nodes():
        x, y = layout[node]
        node_annotations.append(
            dict(
                x=x,
                y=y,
                xref="x",
                yref="y",
                text=node,
                showarrow=False,
                font=dict(size=12),
            )
        )

    node_trace.text = [node for node in G.nodes()]

    # Create the figure
    fig = go.Figure(
        data=traces + [node_trace],
        layout=go.Layout(
            showlegend=False,
            hovermode="closest",
            margin=dict(b=20, l=5, r=5, t=40),
            annotations=edge_annotations,
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        ),
    )

    # Adjust the layout to include the legend
    fig.update_layout(
        legend=dict(
            title="Origins",
            traceorder="normal",
            font=dict(size=12),
        )
    )

    # Adjust the node text color for better visibility on a dark background
    node_trace.textfont = dict(color="white")

    # Adjust the layout to include the legend and set the plot background to dark
    fig.update_layout(
        paper_bgcolor="rgba(0,0,0,1)",  # Set the background color to black
        plot_bgcolor="rgba(0,0,0,1)",  # Set the plot area background color to black
        legend=dict(
            title="Origins",
            traceorder="normal",
            font=dict(size=12, color="white"),  # Set legend text color to white
        ),
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    )

    for annotation in edge_annotations:
        annotation["font"]["color"] = "white"  # Set edge annotation text color to white

    # Update the color of the node annotations for better visibility
    for annotation in node_annotations:
        annotation["font"]["color"] = "white"  # Set node annotation text color to white

    # Update the edge trace color to be more visible on a dark background
    for trace in traces:
        if "line" in trace:
            trace["line"]["color"] = "#888"  # Set edge color to a single color for all edges

    # Update the node trace marker border color for better visibility
    node_trace.marker.line.color = "white"

    return fig

Feel free to simply copy-paste this function into your own logic and change it as you please.
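To try it in isolation, you can feed it a tiny hand-written relationships dictionary (the entries below are made up for illustration):

from src import mind_map

sample = {
    "relationships": [
        {"source": "Andrej Karpathy", "target": "OpenAI", "type": "is part of"},
        {"source": "Andrej Karpathy", "target": "Tesla", "type": "is related to"},
        {"source": "Tesla", "target": "Elon Musk", "type": "is related to"},
    ]
}

fig = mind_map.create_plotly_mind_map(sample)
fig.show()  # opens the interactive Plotly figure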

And this is how the mind map looks for the prompt "Who is Sam Altman?"

How a mind map looks. Image by author.

Great work! We're done with the backend logic! Our last step is to implement the Streamlit app.

The Final Step: The Frontend App with Streamlit

We're almost there. Thanks for reading this far. I hope you've enjoyed the journey so far.

We'll use a functional approach to building the Streamlit app, meaning all logical blocks will be built by calling functions. This is the structure of the app:

  • Set up the page
  • Set up the hero / intro of the page
  • Set up the sidebar
  • Code the file ingestion logic
  • Set up the inputs section
  • Visualize the database
  • Render the mind map
  • Start the engines!

We'll import the database module to add, remove and update elements. We'll import the utils and schema files to ensure validation through Pydantic, and also import the vector db logic, the mind map and the llm module. Basically, everything we've built!

NamedTemporaryFile helps us momentarily save the uploaded files so we can grab the useful data for storage.
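In isolation, the pattern looks like this (the bytes stand in for what Streamlit's uploader returns from .getvalue()):

from tempfile import NamedTemporaryFile

uploaded_bytes = b"example file content"  # stand-in for uploaded_file.getvalue()

with NamedTemporaryFile(suffix=".txt") as temp_file:
    temp_file.write(uploaded_bytes)
    temp_file.seek(0)                        # rewind before reading back
    text = temp_file.read().decode("utf-8")  # the temp file is deleted on exit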

This is how the frontend is coded:

# frontend.py

import streamlit as st

from src.logger import get_console_logger
from src.utils import hash_text, convert_timestamp_to_datetime
from src.schema import FileType
from src import db
from src.whisper import create_transcript
from src import vector_db
from src import mind_map
from src.llm import llm

from tempfile import NamedTemporaryFile

import pandas as pd

from openai import OpenAI
from upstash_vector import Index

logger = get_console_logger("frontend")

# CONSTANTS
AUDIO_FILE_TYPES = ["mp3", "wav"]
PAGE_TITLE = "Mind Mapper | Create mind maps from your knowledge"
PAGE_ICON = "🧠"
LAYOUT = "wide"
SIDEBAR_STATE = "expanded"

if "OPENAI_API_KEY" not in st.session_state:
    st.session_state["OPENAI_API_KEY"] = ""
if "UPSTASH_VECTOR_DB_REST_URL" not in st.session_state:
    st.session_state["UPSTASH_VECTOR_DB_REST_URL"] = ""
if "UPSTASH_VECTOR_DB_TOKEN" not in st.session_state:
    st.session_state["UPSTASH_VECTOR_DB_TOKEN"] = ""

openai_client = OpenAI(api_key=st.session_state["OPENAI_API_KEY"])
vector_db_index = Index(
    url=st.session_state["UPSTASH_VECTOR_DB_REST_URL"],
    token=st.session_state["UPSTASH_VECTOR_DB_TOKEN"],
)


def setup_page():
    st.set_page_config(
        page_title=PAGE_TITLE,
        page_icon=PAGE_ICON,
        layout=LAYOUT,
        initial_sidebar_state=SIDEBAR_STATE,
    )


def setup_hero():
    st.markdown(
        """
        # Mind Mapper 🧠
        _A simple tool for knowledge intelligence and visualization_ powered by <b>OpenAI</b>, <b>Upstash Vector DB</b> and a bit of magic ✨
        """,
        unsafe_allow_html=True,
    )


def setup_sidebar():
    with st.sidebar:
        st.markdown("## 🔑 API Keys")
        # API key input for OpenAI
        st.markdown(
            "### OpenAI"
            "\nGet your API key [here](https://platform.openai.com/docs/quickstart?context=python)"
        )
        openai_api_key = st.text_input(label="OpenAI API Key", type="password")
        # Credentials input for Upstash Vector DB
        st.markdown(
            "### Upstash Vector DB"
            "\nSet up your Vector DB [here](https://console.upstash.com/)"
        )
        upstash_vector_db_rest_url = st.text_input(
            label="Upstash Vector DB REST URL", type="default"
        )
        upstash_vector_db_token = st.text_input(
            label="Upstash Vector DB Token", type="password"
        )

        # Add a button to confirm the API keys setup
        if st.button("Set API Keys"):
            st.session_state["OPENAI_API_KEY"] = openai_api_key
            st.session_state["UPSTASH_VECTOR_DB_REST_URL"] = upstash_vector_db_rest_url
            st.session_state["UPSTASH_VECTOR_DB_TOKEN"] = upstash_vector_db_token
            st.success("API keys set successfully")


def ingest(hash_id: str):
    with st.spinner("Ingesting file..."):
        q = db.read_one(hash_id)
        if not q.embedded:
            chunks = vector_db.create_chunks(q.text)
            vector_db.add_chunks_to_vector_db(
                vector_db_index, openai_client, chunks, metadata={"source_hash_id": q.hash_id}
            )
            db.update_one(q.hash_id, {"embedded": True})
            st.success(f"Item {hash_id} ingested")
        else:
            st.warning(f"Item {hash_id} already ingested")


def text_input_area():
    st.markdown("### 🔡 Inputs")
    st.markdown(
        "_Specify the knowledge source to process. Inputs will be saved in a local database and ingested using Upstash Vector DB for RAG purposes_"
    )
    st.markdown("#### 📝 Copy-Paste Content")
    text = st.text_area(
        "Paste in the knowledge you want to process",
        height=50,
        key="text_area",
        disabled=True,
    )
    title = st.text_input("Provide a title", key="title_text_area", disabled=True)
    # save to db
    if st.button("Save to database", key="text_area_save", disabled=True):
        if text and title:
            hash_id = hash_text(text)
            db.add_one(
                {
                    "filename": "*manual_input*",
                    "title": title,
                    "file_type": FileType.TEXT,
                    "hash_id": hash_id,
                    "text": text,
                }
            )
            ingest(hash_id)
            st.success("Text saved to database")
        else:
            st.warning("Please enter text and title to proceed.")


def upload_text_file():
    st.markdown("#### 📄 Upload a Text File")
    uploaded_text_file = st.file_uploader(
        "Upload a text file",
        type=["txt"],  # Use the constant for file types
        accept_multiple_files=True,
        disabled=True,
    )
    # save to db
    if st.button("Save to database", key="upload_text_save", disabled=True):
        progress_text = "Saving text files to database..."
        progress_bar = st.progress(0, text=progress_text)
        if uploaded_text_file is not None:
            if len(uploaded_text_file) == 1:
                with NamedTemporaryFile(suffix=".txt") as temp_text_file:
                    temp_text_file.write(uploaded_text_file[0].getvalue())
                    temp_text_file.seek(0)
                    progress_bar.progress(int((1 / len(uploaded_text_file)) * 100))
                    hash_id = hash_text(temp_text_file.name)
                    db.add_one(
                        {
                            "filename": uploaded_text_file[0].name,
                            "title": uploaded_text_file[0].name,
                            "file_type": FileType.TEXT,
                            "hash_id": hash_id,
                            "text": temp_text_file.read().decode("utf-8"),
                        }
                    )
                    ingest(hash_id)
                    st.success("Text file saved to database")
            else:
                for file in uploaded_text_file:
                    with NamedTemporaryFile(suffix=".txt") as temp_text_file:
                        temp_text_file.write(file.getvalue())
                        temp_text_file.seek(0)

                        progress_bar.progress(
                            int(
                                (uploaded_text_file.index(file) + 1)
                                / len(uploaded_text_file)
                                * 100
                            )
                        )
                        hash_id = hash_text(temp_text_file.name)
                        db.add_one(
                            {
                                "filename": file.name,
                                "title": file.name,
                                "file_type": FileType.TEXT,
                                "hash_id": hash_id,
                                "text": temp_text_file.read().decode("utf-8"),
                            }
                        )
                        ingest(hash_id)
                        st.success("Text file saved to database")
        else:
            st.warning("Please upload a text file to proceed.")


def upload_audio_file():
    st.markdown("#### 🔊 Upload an Audio File")
    uploaded_audio_file = st.file_uploader(
        "Upload an audio file",
        type=AUDIO_FILE_TYPES,  # Use the constant for file types
        disabled=True,
    )
    if st.button("Transcribe & Save to database", key="transcribe", disabled=True):
        if uploaded_audio_file is not None:
            extension = "." + uploaded_audio_file.name.split(".")[-1]
            with NamedTemporaryFile(suffix=extension) as temp_audio_file:
                temp_audio_file.write(uploaded_audio_file.getvalue())
                temp_audio_file.seek(0)
                with st.spinner("Transcribing audio track..."):
                    transcript = create_transcript(openai_client, temp_audio_file.name)
                    # Check if the transcript already exists in the database
                    existing_item = db.read_one(hash_text(transcript))
                    if existing_item is None:
                        hash_id = hash_text(transcript)
                        db.add_one(
                            {
                                "filename": uploaded_audio_file.name,
                                "title": uploaded_audio_file.name,
                                "file_type": FileType.AUDIO,
                                "hash_id": hash_id,
                                "text": transcript,
                            }
                        )
                        ingest(hash_id)
                        st.success("Transcription complete - item saved in database")
                    else:
                        st.warning("Transcription already exists in the database.")
        else:
            st.warning("Please upload an audio file to proceed.")


def visualize_db():
    st.markdown("### 📊 Database")
    all_files = db.read_all()
    db_data = []
    if len(all_files) > 0:
        for file in all_files:
            struct = file.model_dump()
            db_data.append(
                {
                    "id": struct["hash_id"],
                    "title": struct["title"],
                    "filename": struct["filename"],
                    "file_type": struct["file_type"].value,
                    "created_at": convert_timestamp_to_datetime(struct["created_at"]),
                    "text": struct["text"][0:50] + "...",
                }
            )
        df = pd.DataFrame(db_data).rename(
            columns={
                "id": "ID",
                "title": "Title",
                "file_type": "Type",
                "text": "Text",
                "created_at": "Date",
            }
        )
        st.dataframe(df, use_container_width=True)
        # check which items are in the db

        items_selected = st.multiselect(
            "Perform actions on:",
            # [str(i) + " - " + str(j) for i, j in zip(df["title"], df["filename"])],
            df["Title"].to_list(),
            max_selections=10,
        )
        # delete selections from db
        if st.button("Delete selected items", key="delete"):
            for item in items_selected:
                item_id = df[df["Title"] == item]["ID"].values[0]
                db.delete_one(item_id)
                ids_to_delete = vector_db.fetch_by_source_hash_id(
                    vector_db_index, item_id
                )
                st.success(f"Item {item_id} deleted from database")
                try:
                    vector_db_index.delete(ids_to_delete)
                    st.success(f"Item {item_id} deleted from vector database")
                except Exception as e:
                    st.error(f"Vector database deletion failed - {e}")

    else:
        st.info("No items in database")


def create_mind_map():
    st.markdown("### 🧠 Interrogate Knowledge Base")
    # get all document titles from db
    all_files = db.read_all()
    db_data = []
    data = None
    if len(all_files) > 0:
        for file in all_files:
            struct = file.model_dump()
            db_data.append(
                {
                    "hash_id": struct["hash_id"],
                    "title": struct["title"],
                    "created_at": convert_timestamp_to_datetime(struct["created_at"]),
                }
            )
        df = pd.DataFrame(db_data).rename(
            columns={
                "hash_id": "hash_id",
                "title": "title",
                "created_at": "Date",
            }
        )

        prompt = st.chat_input("Ask something about your knowledge base")
        comment = "No data found."
        llm_data = None
        if prompt:
            with st.chat_message("assistant"):
                with st.status("Processing request...", expanded=True):
                    st.write("- Querying vector database...")
                    data = vector_db.query_vector_db(
                        index=vector_db_index,
                        openai_client=openai_client,
                        question=prompt,
                        top_n=5,
                    )
                    if data:
                        st.write("- Extracting mind map...")
                        llm_data = llm.extract_mind_map_data(openai_client, data)
                        llm_data = eval(llm_data)
                        st.write("- Evaluating results...")
                        comment = llm.extract_information_from_mind_map_data(
                            openai_client, llm_data
                        )
            with st.chat_message("assistant"):
                st.write(comment)
                st.plotly_chart(
                    mind_map.create_plotly_mind_map(llm_data),
                    use_container_width=True,
                )
    else:
        st.info("No items in database")


def start_frontend():
    setup_page()
    setup_hero()
    setup_sidebar()
    with st.container(border=True):
        create_mind_map()
    with st.expander("**🔡 Inputs**", expanded=True):
        text_input_area()
        col1, col2 = st.columns(2)
        with col1:
            upload_text_file()
        with col2:
            upload_audio_file()
    with st.expander("**📊 Database**", expanded=False):
        visualize_db()


if __name__ == "__main__":
    start_frontend()

We can start the application by running the command streamlit run src/frontend.py.

This is the end result.

Final look of the application. Image by author.

Conclusions

This article showed you how to build a simple yet effective AI application using Streamlit, Upstash and OpenAI that goes beyond the simple RAG framework.

In its simplicity, this application can really help you connect the dots when it is fed data coming from different sources and prompted appropriately.

If you find a useful use case, share your story with me and the community by adding a comment to this article.

Best regards,

Andrea



Beyond RAG: Network Analysis via LLMs for Knowledge Extraction was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.



