Understanding and Implementing Medprompt

Digging into the details behind the prompting framework

Illustration of the various components of the Medprompt technique (Image from Fig. 6 of the Medprompt paper [1] (https://arxiv.org/abs/2311.16452))

In my first blog post, I explored prompting and its significance in the context of Large Language Models (LLMs). Prompting is crucial for obtaining high-quality outputs from LLMs, as it guides the model's responses and ensures they are relevant to the task at hand. Building on that foundation, two important questions often arise when trying to solve a use case with LLMs: how far can you push performance with prompting alone, and when do you bite the bullet and decide it might be more effective to fine-tune a model instead?

When making design decisions about leveraging prompting, several considerations come into play. Techniques like few-shot prompting and Chain-of-Thought (CoT) [2] prompting can help boost the performance of LLMs on most tasks. Retrieval-Augmented Generation (RAG) pipelines can further enhance LLM performance by adapting to new domains without fine-tuning, while providing controllability over grounding the generated outputs and reducing hallucinations. Overall, we have a set of tools to push the needle on LLM performance without explicitly resorting to fine-tuning.

Fine-tuning comes with its own set of challenges and complications, in terms of labelled data requirements and the costs associated with training and deploying LLMs. Fine-tuning may even increase an LLM's hallucinations in certain cases [3]. Putting this all together, there is significant value in trying to optimize LLM performance for our task through prompting before resorting to fine-tuning.

So, how can we go about this? In this article, we explore Medprompt [1], a sophisticated prompting strategy introduced by Microsoft. Medprompt ties together principles from few-shot prompting, CoT prompting, and RAG to enhance the performance of GPT-4 in the healthcare domain without any domain-specific fine-tuning.

Table of Contents:

  1. Medprompt Explained
  2. Components of Medprompt
  3. Implementing Medprompt
  4. Evaluating Performance
  5. Conclusion
  6. References

Medprompt Explained

LLMs have demonstrated impressive capabilities across various sectors, particularly in healthcare. Last year, Google introduced MedPaLM [4] and MedPaLM-2 [5], LLMs that not only excel on medical Multiple-Choice Question Answering (MCQA) datasets but also perform competitively with, and even outperform, clinicians in open-ended medical question answering. These models were tailored specifically to the healthcare domain through instruction fine-tuning and the use of clinician-written Chain-of-Thought templates, significantly enhancing their performance.

In this context, the paper “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine” [1] from Microsoft raises a compelling question:

Can the performance of a generalist model like GPT-4 be improved for a specific domain without relying on domain-specific fine-tuning or expert-crafted resources?

As part of this study, the paper introduces Medprompt, an innovative prompting strategy that not only improves the model's performance but also surpasses specialized models such as MedPaLM-2.

Comparison of various LLMs on medical knowledge benchmarks. GPT-4 with Medprompt outperforms Med-PaLM 2 across all these datasets. (Image of Table 1 from the Medprompt paper [1] (https://arxiv.org/abs/2311.16452))

GPT-4 with Medprompt outperforms Med-PaLM 2 across all medical MCQA benchmarks without any domain-specific fine-tuning. Let's explore the components of Medprompt.

Components of Medprompt

Medprompt ties together principles from few-shot prompting, CoT prompting, and RAG. Specifically, there are three components in this pipeline:

Dynamic Few-Shot Selection

Few-shot prompting refers to using example input-output pairs as context when prompting the LLM. If these few-shot samples are static, the downside is that they may not be the most relevant examples for a new input. Dynamic Few-Shot Selection, the first component in Medprompt, helps overcome this by selecting the few-shot examples based on each new task input. This method involves training a k-Nearest Neighbors (k-NN) algorithm on the training set, which then retrieves the training set examples most similar to the test input based on cosine similarity in an embedding space. This strategy efficiently leverages the existing training dataset to retrieve relevant few-shot examples for prompting the LLM.

Self-Generated CoT

As noted in the paper [1], CoT traditionally relies on manually crafted few-shot exemplars that include detailed reasoning steps, as used with MedPaLM-2, where such prompts were written by medical professionals. Medprompt introduces Self-Generated CoT as the second module, where the LLM is used to produce detailed, step-by-step explanations of its reasoning process, culminating in a final answer choice. By automatically generating CoT reasoning steps for each training datapoint, the need for manually crafted exemplars is bypassed. To ensure that only correct predictions with reasoning steps are retained and incorrect responses are filtered out, the answer generated by GPT-4 is cross-verified against the ground truth.

Choice Shuffling Ensemble

The Choice Shuffling Ensemble is the third technique introduced by Medprompt. It is designed to combat inherent biases that can affect the model's decision-making process, particularly position bias in multiple-choice settings. The ordering of the answer choices is shuffled, and this process is repeated k times to create k variants of the same question with shuffled answer choices. During inference, each variant is used to generate an answer, and a majority vote is performed over all variants to pick the final predicted option. A minimal sketch of this idea is shown below.
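The sketch below is a minimal, self-contained illustration of choice shuffling with a majority vote. The ask_llm callable is a hypothetical stand-in for any function that takes a question and an option dictionary and returns the chosen letter; the full pipeline is implemented later in this article.

import random
from collections import Counter

def shuffle_choices(options):
    # Reassign the option texts to the letters A, B, C, ... in a random order.
    texts = list(options.values())
    random.shuffle(texts)
    return {chr(ord('A') + i): text for i, text in enumerate(texts)}

def choice_shuffling_ensemble(question, options, ask_llm, k=5):
    # Query the LLM k times, each time with a freshly shuffled option order,
    # and majority-vote over the predicted option *texts* (the letters differ
    # between variants, so they cannot be compared directly).
    predictions = []
    for _ in range(k):
        shuffled = shuffle_choices(options)
        letter = ask_llm(question, shuffled)  # hypothetical LLM call, returns e.g. "B"
        predictions.append(shuffled.get(letter, ""))
    # Ties are broken arbitrarily here; the full pipeline handles them explicitly.
    return Counter(predictions).most_common(1)[0][0]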

How are these components used in the preprocessing and inference stages?

Let's now take a look at the preprocessing and inference stages in Medprompt.

Preprocessing Stage

In the preprocessing pipeline, we begin by taking each question from the training dataset and incorporating detailed instructions into the prompt to guide the generation of both an answer and its associated reasoning steps. The LLM is prompted to generate the answer and reasoning steps. After obtaining the generated response, we verify its accuracy by comparing the predicted answer to the ground truth for that particular question.

Medprompt Preprocessing Pipeline (Image by Author)

If the prediction is incorrect, we exclude this instance from our database of relevant questions. If the prediction is correct, we proceed by embedding the question using a text embedding model. We then store the question, question embedding, answer, and Chain-of-Thought (CoT) reasoning in a buffer. Once all questions have been processed, we use the embeddings to train a KNN model. This trained KNN model acts as our retriever in a RAG pipeline, enabling us to efficiently query and retrieve the top-k most similar data points based on cosine similarity in the embedding space.

Inference Pipeline

During the inference stage, each question from our test set is first embedded using the text embedding model. We then utilize the KNN model to identify the top-k most similar questions. For each retrieved data point, we have access to the self-generated Chain-of-Thought (CoT) reasoning and the predicted answer. We format these elements (question, CoT reasoning, and answer) into few-shot examples for our eventual prompt.

Medprompt Inference Pipeline (Image by Author)

We now perform choice shuffling ensembling by shuffling the order of answer choices for each test question, creating multiple variants of the same question. The LLM is then prompted with these variants, along with the corresponding few-shot exemplars, to generate reasoning steps and an answer for each variant. Finally, we perform a majority vote over the predictions from all variants and select the final prediction.

Implementing Medprompt

The code related to this implementation can be found in the accompanying GitHub repo.

We use the MedQA [6] dataset for implementing and evaluating Medprompt. We first define helper functions for parsing the JSONL files.

import json
import os
import re
from collections import Counter

from tqdm import tqdm
from openai import OpenAI

# Assumes the OPENAI_API_KEY environment variable is set
client = OpenAI()

def write_jsonl_file(file_path, dict_list):
    """
    Write a list of dictionaries to a JSON Lines file.

    Args:
        file_path (str): The path to the file where the data will be written.
        dict_list (list): A list of dictionaries to write to the file.
    """
    with open(file_path, 'w') as file:
        for dictionary in dict_list:
            json_line = json.dumps(dictionary)
            file.write(json_line + '\n')

def read_jsonl_file(file_path):
    """
    Parses a JSONL (JSON Lines) file and returns a list of dictionaries.

    Args:
        file_path (str): The path to the JSONL file to be read.

    Returns:
        list of dict: A list where each element is a dictionary representing
        a JSON object from the file.
    """
    jsonl_lines = []
    with open(file_path, 'r', encoding="utf-8") as file:
        for line in file:
            json_object = json.loads(line)
            jsonl_lines.append(json_object)

    return jsonl_lines

Implementing Self-Generated CoT

For our implementation, we utilize the training set of MedQA. We implement a zero-shot CoT prompt and process all the training questions. We use GPT-4o in our implementation. For each question, we generate the CoT and the corresponding answer. We define a prompt based on the template provided in the Medprompt paper.

system_prompt = """You are an expert medical professional. You are provided with a medical question with multiple answer choices.
Your goal is to think through the question carefully and explain your reasoning step by step before selecting the final answer.
Respond only with the reasoning steps and answer as specified below.
Below is the format for each question and answer:

## Question: {{question}}

## Answer
(model generated chain of thought explanation)
Therefore, the answer is [final model answer (e.g. A,B,C,D)]"""
def build_few_shot_prompt(system_prompt, question, examples, include_cot=True):
    """
    Builds the few-shot prompt.

    Args:
        system_prompt (str): Task instruction for the LLM.
        question (dict): The test question for which to create a query,
            formatted as required by `create_query`.
        examples (list of dict): Few-shot examples, each containing a question,
            CoT reasoning, and answer index.
        include_cot (bool): Whether to include the CoT reasoning in the few-shot answers.

    Returns:
        list of dict: A list of messages, including a system message defining
        the task, the few-shot exemplars, and a user message with the input question.
    """
    messages = [{"role": "system", "content": system_prompt}]

    for elem in examples:
        messages.append({"role": "user", "content": create_query(elem)})
        if include_cot:
            messages.append({"role": "assistant", "content": format_answer(elem["cot"], elem["answer_idx"])})
        else:
            answer_string = f"""## Answer\nTherefore, the answer is {elem["answer_idx"]}"""
            messages.append({"role": "assistant", "content": answer_string})

    messages.append({"role": "user", "content": create_query(question)})
    return messages

def get_response(messages, model_name, temperature=0.0, max_tokens=10):
    """
    Obtains the response from the model through the chat completions API.

    Args:
        messages (list of dict): The built messages provided to the API.
        model_name (str): Name of the model to access through the API.
        temperature (float): A value between 0 and 1 that controls the randomness of the output.
            A temperature value of 0 ideally makes the model pick the most likely token, making the outputs deterministic.
        max_tokens (int): Maximum number of tokens the model should generate.

    Returns:
        str: The response message content from the model.
    """
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content
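Note that build_few_shot_prompt relies on two helpers, create_query and format_answer, and the training loop below calls build_zero_shot_prompt. The exact versions live in the linked repo; the sketches below are minimal reconstructions that match the answer template defined in the system prompt above.

def create_query(item):
    # Sketch: format a question and its lettered options into the
    # "## Question" block expected by the system prompt.
    options_block = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
    return f"""## Question: {item["question"]}\n{options_block}"""

def format_answer(cot, answer_idx):
    # Sketch: format the CoT reasoning and the final choice into the
    # "## Answer" block expected by the system prompt.
    return f"""## Answer\n{cot}\nTherefore, the answer is {answer_idx}"""

def build_zero_shot_prompt(system_prompt, question):
    # Sketch: a zero-shot prompt is simply the few-shot prompt with no examples.
    return build_few_shot_prompt(system_prompt, question, examples=[])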

We also define helper functions for parsing the reasoning and the final answer option from the LLM response.

def matches_ans_option(s):
    """
    Checks if the string starts with the specific pattern 'Therefore, the answer is [A-Z]'.

    Args:
        s (str): The string to be checked.

    Returns:
        bool: True if the string matches the pattern, False otherwise.
    """
    return bool(re.match(r'^Therefore, the answer is [A-Z]', s))

def extract_ans_option(s):
    """
    Extracts the answer option (a single capital letter) from the start of the string.

    Args:
        s (str): The string containing the answer pattern.

    Returns:
        str or None: The captured answer option if the pattern is found, otherwise None.
    """
    match = re.search(r'^Therefore, the answer is ([A-Z])', s)
    if match:
        return match.group(1)  # Returns the captured letter
    return None

def matches_answer_start(s):
    """
    Checks if the string starts with the markdown header '## Answer'.

    Args:
        s (str): The string to be checked.

    Returns:
        bool: True if the string starts with '## Answer', False otherwise.
    """
    return s.startswith("## Answer")

def validate_response(s):
    """
    Validates that a multi-line string response starts with '## Answer' and ends with the answer pattern.

    Args:
        s (str): The multi-line string response to be validated.

    Returns:
        bool: True if the response is valid, False otherwise.
    """
    file_content = s.split("\n")

    return matches_ans_option(file_content[-1]) and matches_answer_start(s)

def parse_answer(response):
    """
    Parses a response that starts with '## Answer', extracting the reasoning and the answer choice.

    Args:
        response (str): The multi-line string response containing the answer and reasoning.

    Returns:
        tuple: A tuple containing the extracted CoT reasoning and the answer choice.
    """
    split_response = response.split("\n")
    assert split_response[0] == "## Answer"
    cot_reasoning = "\n".join(split_response[1:-1]).strip()
    ans_choice = extract_ans_option(split_response[-1])
    return cot_reasoning, ans_choice

We now process the questions in the training set of MedQA. We obtain CoT responses and answers for all questions and store them in a folder.

train_data = read_jsonl_file("data/phrases_no_exclude_train.jsonl")

cot_responses = []
# os.mkdir("cot_responses")
existing_files = os.listdir("cot_responses/")

for idx, item in enumerate(tqdm(train_data)):
    # Skip questions that have already been processed
    if str(idx) + ".txt" in existing_files:
        continue
    try:
        prompt = build_zero_shot_prompt(system_prompt, item)
        response = get_response(prompt, model_name="gpt-4o", max_tokens=500)
        cot_responses.append(response)
        with open(os.path.join("cot_responses", str(idx) + ".txt"), "w", encoding="utf-8") as f:
            f.write(response)
    except Exception as e:
        print(str(e))
We now iterate over all the generated responses to check whether they are valid and adhere to the prediction format defined in the prompt. We discard responses that do not conform to the required format. After that, we check the predicted answers against the ground truth for each question and only retain questions for which the predicted answers match the ground truth.

questions_dict = []
ctr = 0
for idx, question in enumerate(tqdm(train_data)):
    file = open(os.path.join("cot_responses/", str(idx) + ".txt"), encoding="utf-8").read()
    # Discard responses that do not conform to the required format
    if not validate_response(file):
        continue

    cot, pred_ans = parse_answer(file)

    dict_elem = {}
    dict_elem["idx"] = idx
    dict_elem["question"] = question["question"]
    dict_elem["answer"] = question["answer"]
    dict_elem["options"] = question["options"]
    dict_elem["cot"] = cot
    dict_elem["pred_ans"] = pred_ans
    questions_dict.append(dict_elem)

# Retain only questions where the predicted answer matches the ground truth
filtered_questions_dict = []
for item in tqdm(questions_dict):
    pred_ans = item["options"][item["pred_ans"]]
    if pred_ans == item["answer"]:
        filtered_questions_dict.append(item)

Implementing the KNN Model

Having processed the training set and obtained CoT responses for all these questions, we now embed all questions using text-embedding-ada-002 from OpenAI.

def get_embedding(text, model="text-embedding-ada-002"):
    return client.embeddings.create(input=[text], model=model).data[0].embedding

for item in tqdm(filtered_questions_dict):
    item["embedding"] = get_embedding(item["question"])
    inv_options_map = {v: k for k, v in item["options"].items()}
    item["answer_idx"] = inv_options_map[item["answer"]]

We now train a KNN model using these question embeddings. This acts as a retriever at inference time, as it helps us retrieve the datapoints from the training set that are most similar to a question from the test set.

import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.array([d["embedding"] for d in filtered_questions_dict])
indices = list(range(len(filtered_questions_dict)))

knn = NearestNeighbors(n_neighbors=5, algorithm='auto', metric='cosine').fit(embeddings)
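As a quick sanity check, we can query the fitted retriever directly. kneighbors returns the cosine distances and the row indices of the nearest training questions, which index into filtered_questions_dict:

# Retrieve the 5 nearest training questions for the first embedded question.
# Its own nearest neighbor is itself, at a distance of approximately 0.
distances, top_k_indices = knn.kneighbors([filtered_questions_dict[0]["embedding"]], n_neighbors=5)
print(distances[0])  # cosine distances, smallest (most similar) first
print([filtered_questions_dict[i]["question"] for i in top_k_indices[0]])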

Implementing the Dynamic Few-Shot and Choice Shuffling Ensemble Logic

We can now run inference. We subsample 500 questions from the MedQA test set for our evaluation. For each question, we retrieve the 5 most similar questions from the train set using the KNN module, along with their respective CoT reasoning steps and predicted answers. We construct a few-shot prompt using these examples.

For each question, we also shuffle the order of the options 5 times to create different variants. We then utilize the constructed few-shot prompt to get the predicted answer for each of the variants with shuffled options.

import random

def shuffle_option_labels(answer_options):
    """
    Shuffles the options of the question.

    Args:
        answer_options (dict): A dictionary with the options.

    Returns:
        dict: A new dictionary with the shuffled options.
    """
    options = list(answer_options.values())
    random.shuffle(options)  # shuffle the option texts before relabeling
    labels = [chr(i) for i in range(ord('A'), ord('A') + len(options))]
    shuffled_options_dict = {label: option for label, option in zip(labels, options)}

    return shuffled_options_dict

test_samples = read_jsonl_file("final_processed_test_set_responses_medprompt.jsonl")

for question in tqdm(test_samples, colour="green"):
    question_variants = []
    prompt_variants = []
    cot_responses = []
    question_embedding = get_embedding(question["question"])
    distances, top_k_indices = knn.kneighbors([question_embedding], n_neighbors=5)
    top_k_dicts = [filtered_questions_dict[i] for i in top_k_indices[0]]
    question["outputs"] = []

    # Create 5 variants of the question with shuffled answer choices
    for idx in range(5):
        question_copy = question.copy()
        shuffled_options = shuffle_option_labels(question["options"])
        inv_map = {v: k for k, v in shuffled_options.items()}

        question_copy["options"] = shuffled_options
        question_copy["answer_idx"] = inv_map[question_copy["answer"]]
        question_variants.append(question_copy)
        prompt = build_few_shot_prompt(system_prompt, question_copy, top_k_dicts)
        prompt_variants.append(prompt)

    # Obtain a CoT response for each variant
    for prompt in tqdm(prompt_variants):
        response = get_response(prompt, model_name="gpt-4o", max_tokens=500)
        cot_responses.append(response)

    # Parse each response; invalid responses yield empty predictions
    for question_sample, response in zip(question_variants, cot_responses):
        if validate_response(response):
            cot, pred_ans = parse_answer(response)
        else:
            cot = ""
            pred_ans = ""

        # Store the predicted option text, since letters differ across shuffled variants
        question["outputs"].append({"question": question_sample["question"], "options": question_sample["options"], "cot": cot, "pred_ans": question_sample["options"].get(pred_ans, "")})

We now evaluate the results of Medprompt on the test set. For each question, we have five predictions generated by the ensemble logic. We take the mode, or most frequently occurring prediction, for each question as the final prediction and evaluate the performance. Two edge cases are possible here:

  1. Two different answer options are each predicted twice, with no clear winner.
  2. There is an error in the generated response, meaning we don't have a predicted answer option.

In both of these edge cases, we consider the question to be wrongly answered by the LLM.

def find_mode_string_list(string_list):
    """
    Finds the most frequently occurring strings.

    Args:
        string_list (list of str): A list of strings.

    Returns:
        list of str or None: A list containing the most frequent string(s) from the input list.
        Returns None if the input list is empty.
    """
    if not string_list:
        return None

    string_counts = Counter(string_list)
    max_freq = max(string_counts.values())
    mode_strings = [string for string, count in string_counts.items() if count == max_freq]
    return mode_strings
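A couple of hypothetical calls illustrate the tie edge case described above:

print(find_mode_string_list(["opt A", "opt A", "opt B", "opt C", "opt C"]))  # ['opt A', 'opt C'] -> tie, counted as wrong
print(find_mode_string_list(["opt B", "opt B", "opt B", "opt A", "opt C"]))  # ['opt B'] -> clear majority

We now compute the final accuracy: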

ctr = 0
for item in test_samples:
    pred_ans = [x["pred_ans"] for x in item["outputs"]]
    freq_ans = find_mode_string_list(pred_ans)

    # A tie between two answer options counts as a wrong prediction
    if len(freq_ans) > 1:
        final_prediction = ""
    else:
        final_prediction = freq_ans[0]

    if final_prediction == item["answer"]:
        ctr += 1

print(ctr / len(test_samples))

Evaluating Performance

We evaluate the performance of Medprompt with GPT-4o in terms of accuracy on the MedQA test subset. Additionally, we benchmark the performance of zero-shot prompting, random few-shot prompting, and random few-shot with CoT prompting.

Results of our evaluation (Image by Author)

We observe that Medprompt and Random Few-Shot CoT prompting outperform the Zero-Shot and Few-Shot prompting baselines. However, surprisingly, we find that Random Few-Shot CoT outperforms our Medprompt performance. This could be due to a couple of reasons:

  1. The original Medprompt paper benchmarked the performance of GPT-4. We observe that GPT-4o significantly outperforms GPT-4T and GPT-4 on various text benchmarks (https://openai.com/index/hello-gpt-4o/), indicating that Medprompt may have a smaller effect on a stronger model like GPT-4o.
  2. We restrict our evaluation to 500 questions subsampled from MedQA. The Medprompt paper evaluates other medical MCQA datasets and the full version of MedQA. Evaluating GPT-4o on the complete versions of the datasets could give a better picture of overall performance.


Conclusion

Medprompt is an interesting framework for creating sophisticated prompting pipelines, particularly for adapting a generalist LLM to a specific domain without the need for fine-tuning. It also highlights the considerations involved in deciding between prompting and fine-tuning for various use cases. Exploring how far prompting can be pushed to enhance LLM performance is essential, as it offers a resource- and cost-efficient alternative to fine-tuning.


References

[1] Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., … & Horvitz, E. (2023). Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452. (https://arxiv.org/abs/2311.16452)

[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. (https://openreview.net/pdf?id=_VjQlMeSB_J)

[3] Gekhman, Z., Yona, G., Aharoni, R., Eyal, M., Feder, A., Reichart, R., & Herzig, J. (2024). Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? arXiv preprint arXiv:2405.05904. (https://arxiv.org/abs/2405.05904)

[4] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., … & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180. (https://www.nature.com/articles/s41586-023-06291-2)

[5] Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., … & Natarajan, V. (2023). Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617. (https://arxiv.org/abs/2305.09617)

[6] Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421. (https://arxiv.org/abs/2009.13081) (The original dataset is released under an MIT License)

