Import a fine-tuned Meta Llama 3 model for SQL query generation on Amazon Bedrock



Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Amazon Bedrock also provides a broad set of capabilities needed to build generative AI applications with security, privacy, and responsible AI practices.

Some FMs are publicly available, which allows for customization tailored to specific use cases and domains. However, deploying customized FMs to support generative AI applications in a secure and scalable manner isn't a trivial task. Hosting large models involves complexity around the selection of instance type and deployment parameters. To address this challenge, AWS recently announced the preview of Amazon Bedrock Custom Model Import, a feature that you can use to import customized models created in other environments, such as Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2) instances, and on premises, into Amazon Bedrock. This feature abstracts the complexity of the deployment process through simple APIs for model deployment and invocation. Currently, Custom Model Import supports importing custom weights for selected model architectures (Meta Llama 2 and Llama 3, Flan, and Mistral) and precisions (FP32, FP16, and BF16), and serving the models on demand and with provisioned throughput.

Customizing FMs can unlock significant value by tailoring their capabilities to specific domains or tasks. This is the first in a series of posts about model customization scenarios that can be imported into Amazon Bedrock to simplify the process of building scalable and secure generative AI applications. By demonstrating the process of deploying fine-tuned models, we aim to empower data scientists, ML engineers, and application developers to harness the full potential of FMs while addressing unique application requirements.

In this post, we demonstrate the process of fine-tuning Meta Llama 3 8B on SageMaker to specialize it in the generation of SQL queries (text-to-SQL). Meta Llama 3 8B is a relatively small model that offers a balance between performance and resource efficiency. AWS customers have explored fine-tuning Meta Llama 3 8B for the generation of SQL queries, especially when using non-standard SQL dialects, and have requested methods to import their customized models into Amazon Bedrock to benefit from the managed infrastructure and security that Amazon Bedrock provides when serving those models.

Solution overview

We walk through the steps of fine-tuning an FM using SageMaker, then importing and evaluating the fine-tuned FM for SQL query generation using Amazon Bedrock. The complete flow is shown in the following figure and covers the following steps:

  1. The user invokes a SageMaker training job to fine-tune the model using QLoRA and store the weights in an Amazon Simple Storage Service (Amazon S3) bucket in the user's account.
  2. When the fine-tuning job is complete, the user runs the model import job using the Amazon Bedrock console. This step runs Steps 3–5 automatically.
  3. Amazon Bedrock starts an import job in an AWS operated deployment account.
  4. Model artifacts are copied from the user's account into an AWS managed S3 bucket.
  5. When the import job is complete, the fine-tuned model is made available to be invoked.

Bedrock custom model import architecture

All data stays within the selected AWS Region, the model artifacts are imported into the AWS operated deployment account using a VPC endpoint, and you can encrypt your model data with your own AWS Key Management Service (AWS KMS) keys. The scripts for fine-tuning and evaluation are available in the GitHub repository.

A copy of your model artifacts is stored in an AWS operated deployment account. This copy remains until the custom model is deleted. Deleting artifacts in the user's account won't delete the model or the artifacts in the AWS operated account. If different versions of a model are imported into Amazon Bedrock, each version is managed as an independent project with its own set of artifacts. You can apply tags to models and import jobs to keep track of different projects and versions.

Meta Llama 3 8B is a gated model on Hugging Face, which means that users must be granted access before they're allowed to download and customize the model. Sign in to your Hugging Face account, read the Meta Llama 3 Acceptable Use Policy, and submit your contact information to be granted access. This process may take a couple of hours.

We use the sql-create-context dataset available on Hugging Face for fine-tuning. The dataset contains 78,577 tuples of context (table schema), question (query expressed in natural language), and answer (SQL query). Refer to the licensing information regarding this dataset before proceeding further.

We use Amazon SageMaker Studio to create a remote fine-tuning job, which runs as a SageMaker training job. SageMaker Studio is a single web-based interface for end-to-end machine learning (ML) development. If you need help configuring your SageMaker Studio domain and your JupyterLab environment, see Launch Amazon SageMaker Studio. The training job uses QLoRA and the PyTorch FullyShardedDataParallel (FSDP) API to fine-tune the Meta Llama 3 model. QLoRA quantizes a pretrained language model to 4 bits and attaches smaller low-rank adapters (LoRA), which are fine-tuned with our training data. PyTorch FSDP is a parallelism technique that shards the model across GPUs for efficient training. See the following notebook for the complete code sample.

Data preparation

In the data preparation stage, we use the following prompt template to insert specific instructions for interpreting the context and fulfilling the request, and store the modified training dataset as JSON files that are uploaded to Amazon S3:

system_message = """You're a highly effective text-to-SQL mannequin. Your job is to reply questions on a database."""

def create_conversation(report):
    pattern = {"messages": [
        {"role": "system", "content": system_message + f"""You can use the following table schema for context: {record["context"]}"""},
        {"function": "consumer", "content material": f"""Return the SQL question that solutions the next query: {report["question"]}"""},
        {"function" : "assistant", "content material": f"""{report["answer"]}"""}
    ]}
    return pattern
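
For completeness, the following is a minimal sketch of the full preparation step. It assumes the dataset ID on Hugging Face is b-mc2/sql-create-context, and the bucket name and prefix are placeholder values:

import boto3
from datasets import load_dataset

# Load the text-to-SQL dataset from Hugging Face (dataset ID assumed)
dataset = load_dataset("b-mc2/sql-create-context", split="train")

# Apply the prompt template and keep a small holdout split for evaluation
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)
dataset = dataset.train_test_split(test_size=0.05)

# Store the splits as JSON files and upload them to S3 (bucket and prefix are placeholders)
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")
s3 = boto3.client("s3")
s3.upload_file("train_dataset.json", "my-training-bucket", "llama3-sql/train_dataset.json")
s3.upload_file("test_dataset.json", "my-training-bucket", "llama3-sql/test_dataset.json")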

Fine-tune the Meta Llama 3 8B model

Refer to the run_fsdp_qlora.py file defined in the notebook for a full description of the fine-tuning script. The following snippets describe the configuration of the QLoRA job:

if script_args.use_qlora:
    print(f"Utilizing QLoRA - {torch_dtype}")
    quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_quant_storage=quant_storage_dtype,
        )
else:
    quantization_config = None

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

The trainer class is based on the Supervised Fine-tuning Trainer (SFT Trainer) from Hugging Face, which is an API to create your SFT models and train them with a few lines of code:

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    eval_dataset=test_dataset,
    peft_config=peft_config,
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False,  # No need to add an additional separator token
    },
)

Once the adapter is trained, it's merged with the original model before persisting the weights. Custom Model Import doesn't support LoRA adapters at the moment.

model = model.merge_and_unload()
model.save_pretrained(
    sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
)

For this use case, we use an ml.g5.12xlarge instance, which has four NVIDIA A10G accelerators. The key configurations are as follows:

huggingface_estimator = HuggingFace(
    entry_point="run_fsdp_qlora.py",          # training script
    source_dir="scripts/trl/",                # directory that contains all the files needed for training
    instance_type="ml.g5.12xlarge",           # instance type used for the training job
    instance_count=1,                         # the number of instances used for training
    max_run=2*24*60*60,                       # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name=job_name,                   # the name of the training job
    role=role,                                # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size=300,                          # the size of the EBS volume in GB
    transformers_version='4.36.0',            # the transformers version used in the training job
    pytorch_version='2.1.0',                  # the pytorch version used in the training job
    py_version='py310',                       # the python version used in the training job
    hyperparameters=hyperparameters,          # the hyperparameters passed to the training job
    disable_output_compression=True,          # don't compress output to save training time and cost
    distribution={"torch_distributed": {"enabled": True}},
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # set env variable to cache models in /tmp
        "HF_TOKEN": HfFolder.get_token(),        # Hugging Face token used to download the gated base model
        "ACCELERATE_USE_FSDP": "1",
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "1"
    },
)
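
With the estimator defined, the training job is launched by pointing it at the prepared dataset. A minimal sketch, assuming training_input_path is the S3 prefix that holds the JSON files from the data preparation step:

# Start the SageMaker training job; the "training" key becomes the input data channel
huggingface_estimator.fit({"training": training_input_path}, wait=True)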

In our testing, the training job completed two epochs in roughly 2.5 hours on a single ml.g5.12xlarge instance, which incurred roughly $18 in training costs. After training is complete, the model weights in the Hugging Face safetensors format, the tokenizer, and the configuration file are uploaded to the S3 bucket defined in the training script. Save this path to use as the base directory for the import job in the next section.

s3_files_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

The configuration file config.json tells Amazon Bedrock how to load the weights from the safetensors files. Some parameters to keep in mind are the model_type, which must be one of the types currently supported by Amazon Bedrock; max_position_embeddings, which sets the maximum length of input sequence that the model can handle; the model dimensions (hidden_size, intermediate_size, num_hidden_layers, and num_attention_heads); and rotary position embedding (RoPE) parameters, which describe the encoding of position information. See the following configuration:

{
  "_name_or_path": "meta-llama/Meta-Llama-3-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.40.2",
  "use_cache": true,
  "vocab_size": 128256
}
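
Before starting the import job, it can be worth sanity-checking this file. The following is a small sketch under stated assumptions: s3_files_path comes from the next snippet's training output, and the set of accepted model_type values is inferred from the architectures listed earlier (Llama, Mistral, and Flan/T5), not from an official list:

import json

import boto3

# Download config.json from the training output location in S3
bucket, _, key_prefix = s3_files_path.replace("s3://", "").partition("/")
obj = boto3.client("s3").get_object(Bucket=bucket, Key=key_prefix.rstrip("/") + "/config.json")
config = json.loads(obj["Body"].read())

# model_type should be one of the architectures Custom Model Import supports (assumed set)
assert config["model_type"] in {"llama", "mistral", "t5"}, config["model_type"]
print(config["max_position_embeddings"], config["torch_dtype"])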

Import the fine-tuned model into Amazon Bedrock

To import the fine-tuned Meta Llama 3 model into Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, choose Imported models in the navigation pane.
  2. Choose Import model.
  3. For Model name, enter llama-3-8b-text-to-sql.
  4. For Model import settings, enter the Amazon S3 location from the previous steps.
  5. Choose Import model. The model import job should take 15–18 minutes to complete.
  6. When it's done, choose Models to see your model.
  7. Copy the model Amazon Resource Name (ARN) so you can invoke the model with the AWS SDK in the next section.
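
If you prefer to script the import instead of using the console, the same job can be started with the AWS SDK. A minimal sketch; the IAM role ARN is a placeholder and must grant Amazon Bedrock access to the S3 location:

import boto3

bedrock = boto3.client("bedrock")

# Start the import job pointing at the S3 path saved after training
job = bedrock.create_model_import_job(
    jobName="llama-3-8b-text-to-sql-import",
    importedModelName="llama-3-8b-text-to-sql",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",  # placeholder
    modelDataSource={"s3DataSource": {"s3Uri": s3_files_path}},
)

# Check the job status until it reports completion
status = bedrock.get_model_import_job(jobIdentifier=job["jobArn"])["status"]
print(status)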

Evaluate SQL queries generated by the fine-tuned model

In this section, we provide two examples to evaluate the SQL queries generated by the fine-tuned model: one using the Amazon Bedrock Text Playground and one using a large language model (LLM) as a judge.

Using the Amazon Bedrock Text Playground

You can test the model using the Amazon Bedrock Text Playground. For optimal results, use the same prompt template used to preprocess your training data:

[INST] <<SYS>>You are a powerful text-to-SQL model. Your job is to answer questions about a database. You can use the following table schema for context: CREATE TABLE table_name_11 (tournament VARCHAR)<</SYS>>

[INST]Human: Return the SQL query that answers the following question: Which Tournament has A in 1987?[/INST]

Assistant:

The following animation shows the results.

[Animation: testing the imported model in the Amazon Bedrock Text Playground]

Using an LLM as a judge

In the same example notebook, we used the Amazon Bedrock InvokeModel API to call our imported model on demand to generate SQL queries for records in our test dataset. We use the same prompt template used with the training data in the fine-tuning step. The imported model only supports parameters that were supported by the base model (max_tokens, top_p, and temperature). Imported models don't support penalty terms (repetition_penalty or length_penalty) or the use of token sampling instead of greedy decoding (do_sample). See the following code:

def get_sql_query(system_prompt, user_question):
    """
    Generate a SQL query using Llama 3 8B.
    Remember to use the same template used in fine-tuning.
    """
    formatted_prompt = f"[INST] <<SYS>>{system_prompt}<</SYS>>\n\n[INST]Human: {user_question}[/INST]\n\nAssistant:"
    native_request = {
        "prompt": formatted_prompt,
        "max_tokens": 100,
        "top_p": 0.9,
        "temperature": 0.1
    }
    response = client.invoke_model(modelId=model_id,
                                   body=json.dumps(native_request))
    response_text = json.loads(response.get('body').read())["outputs"][0]["text"]

    return response_text
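
A hypothetical invocation looks like the following. The schema and question are illustrative values in the style of the dataset, and client and model_id are assumed to be a bedrock-runtime client and the imported model ARN copied earlier:

import json

import boto3

client = boto3.client("bedrock-runtime")
model_id = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123"  # placeholder ARN

schema = "CREATE TABLE table_name_11 (tournament VARCHAR)"
question = "Which Tournament has A in 1987?"

system_prompt = system_message + f"You can use the following table schema for context: {schema}"
user_question = f"Return the SQL query that answers the following question: {question}"

# Prints a SQL query such as: SELECT tournament FROM table_name_11 WHERE ...
print(get_sql_query(system_prompt, user_question))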

After we generate model predictions, we use a different (more powerful) model to act as a judge and evaluate our fine-tuned model's responses. For this example, we use the Anthropic Claude 3 Sonnet LLM on Amazon Bedrock to measure the similarity between the desired answer and the predicted answer using the following prompt:

formatted_prompt = f"""You're a knowledge science instructor that's introducing college students to SQL. Take into account the next query and schema:
{query}
{db_schema}
    
Right here is the right reply:
{correct_answer}
    
Right here is the scholar's reply:
{test_answer}

Please present a numeric rating from 0 to 100 on how effectively the scholar's reply matches the right reply for this query.
The rating ought to be excessive if the solutions say primarily the identical factor.
The rating ought to be decrease if some elements are lacking, or if further pointless elements have been included.
The rating ought to be 0 for a wholly unsuitable reply. Put the rating in  XML tags.
Don't contemplate your individual reply to the query, however as an alternative rating based mostly solely on the right reply above.
"""

The predicted score based on our holdout split of the dataset was 96.65%, which is excellent for a small model tuned to a specific task.

Clean up

The model will scale down to zero after a period of no activity, and your costs will stop accruing. However, we recommend deleting the imported model using the Amazon Bedrock console. Remember to also delete the model artifacts from your S3 bucket when the fine-tuned model is no longer needed, to prevent incurring costs.
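
Both steps can also be scripted. A minimal cleanup sketch, reusing the model name from the import step; the bucket and prefix are placeholder values:

import boto3

# Delete the imported model from Amazon Bedrock
boto3.client("bedrock").delete_imported_model(modelIdentifier="llama-3-8b-text-to-sql")

# Delete the model artifacts from the S3 bucket (bucket and prefix are placeholders)
boto3.resource("s3").Bucket("my-training-bucket").objects.filter(Prefix="llama3-sql/").delete()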

Conclusion

This post presented an overview of the process of fine-tuning a small model using SageMaker to help generate more accurate SQL queries based on questions asked in natural language, and then importing the fine-tuned model into Amazon Bedrock using the Custom Model Import feature. After we imported the model, it was made available on demand through the Amazon Bedrock Playground and the InvokeModel API, which we used to evaluate the performance of the fine-tuned model against a holdout dataset using an LLM as a judge.

The following are recommended best practices that may be helpful when using fine-tuned FMs for code generation tasks:

  • Select a dataset that's relevant and diverse enough for your code generation task
  • Monitor the training job and PEFT parameters to prevent overfitting and catastrophic forgetting
  • Preprocess training data with a consistent instruction template
  • Store model weights using safetensors for fast loading
  • Invoke the model using the same instruction template used in fine-tuning, using only inference parameters that are supported by the base model and the Custom Model Import feature in Amazon Bedrock

Explore the Amazon Bedrock Custom Model Import feature as a way to deploy FMs fine-tuned for code generation tasks in a secure and scalable manner. Visit our GitHub repository to explore samples prepared for fine-tuning and importing models from various families.


About the Authors

Evandro Franco is a Sr. AI/ML Specialist Solutions Architect at Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless to machine learning.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Jay Pillai is a Principal Solutions Architect at Amazon Web Services. In this role, he serves as the Global Generative AI Lead Architect and also the Lead Architect for Supply Chain Solutions with AABG. As an Information Technology Leader, Jay specializes in artificial intelligence, data integration, business intelligence, and user interface domains. He has 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, artificial intelligence, machine learning, and system design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Ragha Prasad is a Principal Engineer and a founding member of Amazon Bedrock, where he has had the privilege to listen to customer needs first-hand, and understands what it takes to build and launch scalable and secure generative AI products. Prior to Bedrock, he worked on numerous products at Amazon, ranging from devices to ads to robotics.


