Deploy Mistral/Llama 7b on AWS in 10 mins

Adithya S K
10 min read · Oct 8, 2023

A Step-by-Step Guide to Deploying in Just 3 Simple Stages

Have you ever wondered if you could harness the incredible capabilities of Large Language Models (LLMs) for your own projects or ideas? Deploying them in production environments might sound like a daunting task, but what if I told you that you can now do it in just 10 minutes?

In this guide, we’ll unlock the secrets to deploying LLMs like Mistral and Llama 2 as API endpoints on AWS. Whether you’re a seasoned developer or someone curious about the power of these language models, this step-by-step walkthrough will make the process not only achievable but surprisingly straightforward.

Llama 2

Llama 2 is a collection of pre-trained and fine-tuned generative text models developed by Meta. These models range in scale from 7 billion to 70 billion parameters and are designed for various text-generation tasks. The fine-tuned Llama-2-Chat variants are optimized for dialogue use cases; they outperform open-source chat models on most benchmarks and are on par with some popular closed-source models like ChatGPT and PaLM in terms of helpfulness and safety.


Mistral 7B

The team at MistralAI has created an exceptional language model called Mistral 7B Instruct. It has consistently delivered outstanding results in a range of benchmarks, positioning it as an ideal option for natural language generation and understanding. This guide focuses on deploying the model as an API endpoint, and the same steps apply to both Mistral and Llama 2.

Why Deploy Instead of Using APIs

You might wonder why you should go through the trouble of deploying open-source LLMs like Mistral and Llama 2 when you could simply use existing APIs like ChatGPT. Here are some compelling advantages to consider:

  1. Customization: Deploying your own LLMs gives you complete control over customization. You can fine-tune the models to better suit your specific needs, whether it’s generating code, answering domain-specific questions, or engaging in natural language conversations.
  2. Data Privacy: By hosting your LLMs on your infrastructure, you have full control over data privacy and security. You can ensure that sensitive information remains within your organization and isn’t shared with external services.
  3. Cost Efficiency: While API services are convenient, they can become expensive as your usage scales. Deploying your own LLMs on cloud providers like AWS allows you to manage costs more effectively, especially for high-volume applications.
  4. Low Latency: Deploying LLMs as API endpoints on cloud platforms can reduce latency, ensuring quicker responses to user queries, which is crucial for real-time applications.
  5. Tailored Solutions: With custom deployment, you can integrate LLMs seamlessly into your existing applications, workflows, or products, creating tailored solutions that align perfectly with your goals.

These advantages demonstrate that deploying your own LLMs can offer flexibility, control, and cost-efficiency that may not be achievable with third-party API services.

Unit Economics

Before we dive into the deployment process, let’s take a moment to consider the economics behind deploying Llama on AWS. Understanding the costs involved is crucial for effective project planning.

To get a sense of the unit economics, let’s break it down:

  • 1 endpoint x 1 instance per endpoint x 24 hours per day x 30 days per month = 720.00 SageMaker Real-Time Inference hours per month
  • 720.00 hours per month x $2.03 per hour instance cost = $1,461.60 (monthly On-Demand cost)

This calculation provides an estimate of the monthly cost for running Llama on AWS. Keep these unit economics in mind as we proceed with the deployment process to ensure a clear understanding of the financial aspects.
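If you want to try other prices or schedules, the arithmetic above is easy to script. This is a minimal sketch: the function name and the 8-hours-a-day scenario are my own illustrations, and the $2.03 hourly rate is the figure quoted above, which may differ by region and change over time.

```python
def monthly_endpoint_cost(hourly_rate, instances=1, hours_per_day=24, days_per_month=30):
    """Return the estimated monthly On-Demand cost in dollars."""
    hours = instances * hours_per_day * days_per_month
    return round(hours * hourly_rate, 2)


# The always-on scenario from the breakdown above.
always_on = monthly_endpoint_cost(2.03)
# Hypothetical scenario: the endpoint only runs 8 hours a day.
business_hours = monthly_endpoint_cost(2.03, hours_per_day=8)

print(f"24/7:    ${always_on:,.2f}")
print(f"8 h/day: ${business_hours:,.2f}")
```

Running the endpoint only when you need it is the biggest lever here, since the cost scales linearly with instance hours.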

Sections Overview

This blog will guide you through the process of deploying Llama on AWS for production use cases. We’ll break down the journey into three easy-to-follow sections:

1. Setting up AWS

In this initial section, we’ll get you started with the foundational steps needed to prepare your AWS environment for deploying Llama. You’ll learn how to set up the essential infrastructure to ensure a smooth deployment.

2. Deploying Llama Using SageMaker

Once your AWS environment is primed, we’ll dive into the heart of the matter: deploying Llama itself using Amazon SageMaker. You’ll walk through the process from model selection to testing the endpoint, and see how to make the most of this powerful tool.

3. Setting Up an API Endpoint Using AWS Lambda and API Gateway

Finally, we’ll round off our journey by showing you how to create a user-friendly API endpoint for your deployed Llama model. We’ll leverage AWS Lambda and API Gateway to make your Llama-powered application accessible to the world.

By the end of this blog, you’ll have the knowledge and confidence to deploy Llama on AWS successfully and embark on your own exciting projects with ease.

1. Setting Up AWS

Before we dive into deploying Llama 2 7B using Amazon SageMaker, we need to ensure that our AWS environment is properly configured. Follow these steps:

  1. Request Quota Increase:
  • Navigate to “Service Quotas” in your AWS account.
  • Search for “Amazon SageMaker.”
  • For Llama 2, especially the 7B-parameter model, ml.g5.4xlarge is recommended. The quantized versions can also run on ml.g5.2xlarge. For this tutorial, we'll choose ml.g5.4xlarge.
  • You can find pricing details here.
  • After selecting `ml.g5.4xlarge` for endpoint usage, request a quota increase of 48, which should be sufficient.
  • Note: It may take 2 to 3 days for your quota allocation to be approved.
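The quota steps can also be scripted with the Service Quotas API. The sketch below is an assumption-heavy illustration rather than part of the console flow: `find_quota` and `request_increase` are hypothetical helpers, and the quota name "ml.g5.4xlarge for endpoint usage" is what I expect the console to show, so check it against your account. The client is passed in as a parameter so the helpers can be tried with a stub before touching AWS.

```python
def find_quota(client, quota_name, service_code="sagemaker"):
    """Page through Service Quotas and return the quota whose name matches, or None."""
    kwargs = {"ServiceCode": service_code}
    while True:
        page = client.list_service_quotas(**kwargs)
        for quota in page.get("Quotas", []):
            if quota["QuotaName"] == quota_name:
                return quota
        token = page.get("NextToken")
        if not token:
            return None
        kwargs["NextToken"] = token


def request_increase(client, quota, desired_value, service_code="sagemaker"):
    """Ask for a higher limit and return the API response describing the request."""
    return client.request_service_quota_increase(
        ServiceCode=service_code,
        QuotaCode=quota["QuotaCode"],
        DesiredValue=float(desired_value),
    )


# Example usage (needs boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("service-quotas")
#   quota = find_quota(client, "ml.g5.4xlarge for endpoint usage")  # assumed quota name
#   if quota is not None:
#       request_increase(client, quota, desired_value)  # the value you chose above
```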

Here’s a list of which model can be hosted on which instance type:

  2. Create a SageMaker Domain:
  • If you don’t already have one, the first task is to create a SageMaker domain. You can achieve this by following these steps:
  • Select “Quick Setup.”
  • Choose a domain name.
  • You can keep the user profile name as default or change it if needed.
  • You’ll need to create a role if you don’t have one.
  • Choose “Any S3 bucket” and click “create.”
  • Once configured, your domain setup should look like this. Click “submit” to create the domain.

If you encounter any errors during domain creation, it may be related to user permissions or VPC configuration issues. Ensure these are properly configured to avoid any hiccups.

2. Deploying Llama / Mistral 7b Using SageMaker

After successfully creating your domain and user profile, it’s time to launch SageMaker Studio. Follow these steps:

  1. Launch SageMaker Studio:

When you launch SageMaker Studio, your dashboard should resemble the image below:

2. Select Your Model for Deployment:

Depending on whether you want to deploy Mistral-7b or Llama 2, follow these steps:

  • For Mistral-7b deployment: click on “Mistral 7B Instruct” and then click on “Deploy.”
  • For Llama 2 deployment: click on “Llama2-7b-Chat jumpstart” and then click on “Deploy.”
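If you prefer code over clicking through Studio, the same deployment can probably be scripted with the SageMaker Python SDK. Treat the sketch below as an alternative under stated assumptions, not the method used in this guide: the two model IDs are what I believe JumpStart uses for these models and should be verified in the JumpStart catalog, `deploy_jumpstart` is a hypothetical helper, and `accept_eula` is only available in recent SDK versions.

```python
# Assumed JumpStart model IDs; verify them in the JumpStart catalog before use.
JUMPSTART_MODEL_IDS = {
    "llama-2-7b-chat": "meta-textgeneration-llama-2-7b-f",
    "mistral-7b-instruct": "huggingface-llm-mistral-7b-instruct",
}


def resolve_model_id(name):
    """Map a short, human-friendly name to the assumed JumpStart model ID."""
    try:
        return JUMPSTART_MODEL_IDS[name]
    except KeyError:
        choices = sorted(JUMPSTART_MODEL_IDS)
        raise ValueError(f"Unknown model {name!r}; choose from {choices}") from None


def deploy_jumpstart(name, instance_type="ml.g5.4xlarge"):
    """Deploy the chosen model to a real-time endpoint and return the predictor."""
    # Imported lazily: needs `pip install sagemaker` and AWS credentials.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(model_id=resolve_model_id(name), instance_type=instance_type)
    if name.startswith("llama"):
        # Llama 2 is gated behind Meta's license, so the EULA must be accepted.
        return model.deploy(accept_eula=True)
    return model.deploy()


# Example usage (creates billable resources):
#   predictor = deploy_jumpstart("llama-2-7b-chat")
#   print(predictor.endpoint_name)
```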

3. AWS SageMaker Setup:

After clicking on “Deploy,” AWS SageMaker will initiate the setup process. Please be patient as it may take 2 to 3 minutes for the entire setup to complete.
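Rather than refreshing the console, you can poll the endpoint status from a script. Here is a minimal sketch, assuming you know the endpoint name; `wait_until_in_service` is a hypothetical helper built on the SageMaker `describe_endpoint` call, and the polling interval and timeout are arbitrary defaults. boto3 also ships an `endpoint_in_service` waiter that does much the same thing.

```python
import time


def wait_until_in_service(client, endpoint_name, poll_seconds=30,
                          timeout_seconds=1800, sleep=time.sleep):
    """Poll the endpoint until it is InService; return the seconds spent waiting."""
    waited = 0
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return waited
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (needs boto3 and AWS credentials):
#   import boto3
#   wait_until_in_service(boto3.client("sagemaker"), "your-endpoint-name")
```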

4. Testing the Endpoint:

Once the setup is complete, you can test the endpoint using the test notebook provided by SageMaker. This step helps you verify that your model is working as expected.

With these steps, you’ve successfully deployed Llama using SageMaker. Next, we’ll explore how to obtain inferences by invoking the endpoint.
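Before wiring up Lambda, it can help to call the endpoint directly from your own machine or notebook. The sketch below assumes the Llama 2 chat endpoint, whose request format (a list of dialogs, each a list of role/content messages) and `accept_eula` attribute are my understanding of the JumpStart contract; Mistral endpoints expect a different payload. `build_chat_payload` and `query_endpoint` are hypothetical helper names.

```python
import json


def build_chat_payload(system_prompt, user_prompt,
                       max_new_tokens=256, top_p=0.9, temperature=0.6):
    """Build a request body in the (assumed) Llama 2 chat format."""
    dialog = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    return {
        "inputs": [dialog],  # one dialog per batch entry
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


def query_endpoint(runtime, endpoint_name, payload):
    """Send the payload to the endpoint and return the decoded JSON response."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
        CustomAttributes="accept_eula=true",  # assumed requirement for Llama 2
    )
    return json.loads(response["Body"].read().decode("utf-8"))


# Example usage (needs boto3 and AWS credentials):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   payload = build_chat_payload("You are an expert in copywriting",
#                                "Write me a tweet about super conductors")
#   print(query_endpoint(runtime, "your-endpoint-name", payload))
```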

3. Setting Up the API

In this section, we’ll guide you through setting up an API endpoint using AWS Lambda and API Gateway. Follow these steps:

Creating an AWS Lambda Role

  1. Go to IAM > Roles > Create Role.
  2. Select “AWS Service” and choose “Lambda” as the service, then click Next.
  3. Search for and select these two policies, then click Next:
  • `AmazonSageMakerFullAccess` (this allows Lambda to trigger events in Amazon SageMaker)
  • `CloudWatchFullAccess`
     These permissions might be overkill for the task at hand, but they simplify the process.
  4. Add a name and optional description for your role. Verify that the selected policies are added as permissions to the role.
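For repeatable setups, the role can also be created from code. This is a sketch of the console steps above using the IAM API; `create_lambda_role` is a hypothetical helper, the role name is up to you, and the description text is my own. The two managed policy ARNs correspond to the policies named above.

```python
import json

# Trust policy that lets the Lambda service assume the role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS managed policies matching the two selected in the console.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
]


def create_lambda_role(iam, role_name):
    """Create the execution role, attach both policies, and return the role ARN."""
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(LAMBDA_TRUST_POLICY),
        Description="Lets Lambda invoke a SageMaker endpoint",
    )
    for arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    return role["Role"]["Arn"]


# Example usage (needs boto3 and AWS credentials):
#   import boto3
#   print(create_lambda_role(boto3.client("iam"), "llama-lambda-role"))
```

As the text notes, the full-access policies are broader than strictly necessary; a tighter policy would only allow invoking your specific endpoint.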

Creating a Lambda Function

  1. Go to Lambda > Create Function.
  • Choose “Author from scratch.”
  • Give your function a name.
  • Select runtime as “Python 3.11.”
  • Change the default execution role to “Choose an existing role,” then select the role you just created.
  2. Click on “Create Function.”

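Now paste the following code into the function's code editor:

import json
import os

import boto3

# Grab the endpoint name from the environment, falling back to the JumpStart default name
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "jumpstart-dft-meta-textgeneration-llama-2-7b-rs")
runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # Forward the JSON body received from API Gateway to the SageMaker endpoint
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=event["body"],
        CustomAttributes="accept_eula=true",  # Llama 2 endpoints require accepting the EULA
    )

    response_content = response["Body"].read().decode()
    result = json.loads(response_content)

    return {
        "statusCode": 200,
        "body": json.dumps(result)
    }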

Additionally, go to the “Configuration” tab and increase the timeout to 1 minute, which is reasonable for LLMs, as they can take some time to respond.

Adding Environment Variables

Edit the environment variables and add your ENDPOINT_NAME (found on the deployment page in SageMaker). This step is crucial for connecting the Lambda function to your SageMaker endpoint.

You can find it again under SageMaker > Inference > Endpoints (or on the Studio deployment page, if you still have it open).

With your Lambda function configured, you can now deploy it.
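The timeout and the environment variable can also be set in one API call if you would rather not use the console. A minimal sketch, assuming the function already exists; `configure_function` is a hypothetical helper around Lambda's `update_function_configuration`. As far as I know, the `Environment` argument replaces the function's whole set of variables, so include any others you rely on.

```python
def configure_function(lambda_client, function_name, endpoint_name, timeout_seconds=60):
    """Set the timeout (in seconds) and the ENDPOINT_NAME variable on a Lambda function."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
        Environment={"Variables": {"ENDPOINT_NAME": endpoint_name}},
    )


# Example usage (needs boto3 and AWS credentials):
#   import boto3
#   configure_function(boto3.client("lambda"), "your-function-name", "your-endpoint-name")
```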

Setting Up an API Using AWS API Gateway

  1. Go to API Gateway > Create API.
  2. Choose “HTTP API” and click “Build” (we’re creating a simple API for demonstration).
  3. Add routes and methods to your API, as needed.
  4. Deploy the API to make it accessible.

Now you have an API connected to your Lambda function. You can test it by sending requests to the API Gateway URL.

Example request body:

{
  "inputs": [
    [
      {"role": "system", "content": "You are an expert in copywriting"},
      {"role": "user", "content": "Write me a tweet about super conductors"}
    ]
  ],
  "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
}

The same request sent from Python:

import requests

def llama_chain():
    api_url = ''  # Replace this with your API Gateway URL

    # Request body in the Llama 2 chat format: a list of dialogs, each a list of messages
    payload = {
        "inputs": [
            [
                {"role": "system", "content": "You are an expert in copywriting"},
                {"role": "user", "content": "Write me a tweet about super conductors"}
            ]
        ],
        "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
    }

    r = requests.post(api_url, json=payload)
    r.raise_for_status()  # Surface HTTP errors instead of failing on the JSON parsing below
    answer = r.json()[0]["generation"]
    return answer



In this guide, I’ve shown you how to deploy and use powerful language models like Llama 2 7B and Mistral 7B on AWS SageMaker. These models offer incredible potential for text generation and understanding.

You’ve learned how to set up AWS, deploy models, and create API endpoints. With this knowledge, you can build applications that generate text and assist with content creation.

Remember to manage costs by deleting unnecessary resources when you’re done.
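Since the endpoint bills by the hour whether or not it receives traffic, here is a sketch of how the cleanup could be scripted. `delete_endpoint_resources` is a hypothetical helper; it looks up the endpoint configuration and model behind an endpoint and deletes all three through the SageMaker API. It does not touch the Lambda function, API Gateway, or the SageMaker domain, which you would remove separately.

```python
def delete_endpoint_resources(sm, endpoint_name):
    """Delete an endpoint plus its endpoint config and model(s); return what was removed."""
    endpoint = sm.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    # Delete the endpoint first: it is the resource that accrues instance charges.
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm.delete_model(ModelName=name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}


# Example usage (needs boto3 and AWS credentials):
#   import boto3
#   print(delete_endpoint_resources(boto3.client("sagemaker"), "your-endpoint-name"))
```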

Now you’re ready to explore and create with these impressive language models. Enjoy your natural language adventures!




