How to Perform Batch Inferencing with DigitalOcean’s 1-Click Models

DigitalOcean’s 1-Click Models, powered by Hugging Face, make it easy to deploy and interact with popular large language models such as Mistral, Llama, Gemma, Qwen, and more, all on the most powerful GPUs available in the cloud. Utilizing NVIDIA H100 GPU Droplets, this solution provides accelerated computing performance for deep learning tasks. It eliminates overwhelming infrastructure complexities, allowing developers of all skill levels, from beginners to advanced, to concentrate on building applications without the hassle of complicated software configurations.
In this article, we will demonstrate batch processing using the 1-Click Model. Our tutorial will utilize the Llama 3.1 8B Instruct model on a single GPU. Although we will use a smaller batch for this example, it can easily be scaled to accommodate larger batches, depending on your workload and the computational resources available. The flexibility of DigitalOcean’s 1-Click Model deployment allows users to easily manage varying data sizes, making it suitable for scenarios ranging from small-scale tasks to large-scale enterprise applications.
Before diving into batch inferencing with DigitalOcean’s 1-Click Models, ensure the following:
- DigitalOcean Account: Sign up for a DigitalOcean account and set up billing.
- 1-Click Model Deployment: Read the blog to understand how to start with the 1-Click Model on GPU Droplets.
- Bearer Token: Obtain the Bearer Token from the web console of your GPU Droplet (a quick check is sketched right after this list).
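The client code later in this tutorial reads the token from an environment variable named BEARER_TOKEN (a naming convention for this walkthrough, not something the platform enforces). A minimal sketch to confirm the token is available before you start:

```
import os

# Assumes you have exported the token first, for example:
#   export BEARER_TOKEN="<token copied from your GPU Droplet's web console>"
if not os.getenv("BEARER_TOKEN"):
    raise RuntimeError("BEARER_TOKEN is not set; copy it from your GPU Droplet's web console.")
print("Bearer Token loaded.")
```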
Batch inference is the practice of processing and analyzing multiple data inputs together in a single operation rather than one at a time. Instead of sending each request to the model individually, a batch, or group of requests, is sent at once. This approach is especially useful when working with large datasets or handling large volumes of tasks.
This approach is beneficial for several reasons, a few of which are noted below.
- Faster Processing: By processing multiple inputs together, batch inferencing reduces the time it takes to analyze large amounts of data.
- Efficient Resource Use: Sending requests in bulk reduces the overhead of handling multiple individual requests, optimizing the usage of computational resources like GPUs.
- Cost-Effective: Batch inferencing can lower operational costs by minimizing the number of requests sent to the inference endpoint, especially when billed based on the number of API calls.
- Scalable for Big Data: When dealing with large datasets, batch inferencing enables processing at scale without overwhelming the system.
- Consistent Results: Processing inputs in batches ensures uniform model performance and reduces variability in outcomes.
We have created a detailed article on how to get started with the 1-Click Model and DigitalOcean’s platform. Feel free to check out the link to learn more.
Analyzing customer comments has become a critical tool for businesses to monitor brand perception, understand customer satisfaction with their products, and predict trends. Using DigitalOcean’s 1-Click Models, you can perform this kind of sentiment analysis efficiently and at scale.
Let’s walk through a batch inferencing example that analyzes a batch of five customer comments.
Step 1: Install Dependencies

```
pip install --upgrade --quiet huggingface_hub
```

Step 2: Initialize the Inference Client

```
import os

from huggingface_hub import InferenceClient

# The endpoint runs locally on the GPU Droplet; authenticate with your Bearer Token.
client = InferenceClient(
    base_url="http://localhost:8080",
    api_key=os.getenv("BEARER_TOKEN"),
)
```

Step 3: Prepare Batch Inputs

```
batch_inputs = [
    {"role": "user", "content": "I love using this product. It's amazing!"},
    {"role": "user", "content": "The service was terrible and I'm very disappointed."},
    {"role": "user", "content": "It's okay, not great but not bad either."},
    {"role": "user", "content": "Absolutely fantastic experience, I highly recommend it!"},
    {"role": "user", "content": "I'm not sure if I like it or not."},
]
```

Step 4: Perform Batch Inferencing

```
batch_responses = []
for input_message in batch_inputs:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[input_message],
        temperature=0.7,
        top_p=0.95,
        max_tokens=128,
    )
    batch_responses.append(response["choices"][0]["message"]["content"])
```

Step 5: Print the results

```
for idx, (input_text, sentiment) in enumerate(zip(batch_inputs, batch_responses), start=1):
    print(f"Input {idx}: {input_text['content']}")
    print(f"Sentiment: {sentiment}")
    print("-" * 50)
```
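Note that the prompts above leave the model free to answer conversationally, so the generated text may contain more than a one-word label. A possible refinement, sketched here under the assumption that client and batch_inputs are defined as in the steps above, is to add a system message that constrains the output and to lower the temperature:

```
SYSTEM_PROMPT = {
    "role": "system",
    "content": "Classify the sentiment of the user's message as Positive, Negative, or Neutral. Reply with the label only.",
}

batch_responses = []
for input_message in batch_inputs:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        # The system message steers the model toward a single-word label.
        messages=[SYSTEM_PROMPT, input_message],
        temperature=0.1,  # lower temperature gives more consistent labels
        top_p=0.95,
        max_tokens=8,
    )
    batch_responses.append(response["choices"][0]["message"]["content"])
```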
How It Works:
- Batch Inputs: Define a list of inputs, each containing a sentence to analyze for sentiment.
- Iterate Through Inputs: Send each input as a request to the deployed model using the InferenceClient.
- Temperature and Top-p:
- temperature=0.7 controls sampling randomness; lower values produce more deterministic results, higher values more varied ones.
- top_p=0.95 applies nucleus sampling, restricting generation to the most probable tokens while still allowing some diversity.
- Extract Results: Collect the sentiment predictions from the responses and store them.
- Display Results: Print the original text alongside the sentiment label for clarity.
- Set the BEARER_TOKEN environment variable to the actual token obtained from your DigitalOcean GPU Droplet before running the code.
- Adjust the batch size and generation parameters such as temperature, top_p, and max_tokens to suit your workload; a chunking sketch follows this list.
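For larger workloads, one way to manage batch size, sketched here with a hypothetical chunk helper rather than anything built into the platform, is to split the inputs into fixed-size batches and process them one batch at a time:

```
def chunk(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

batch_responses = []
# Process the comments five at a time; tune the chunk size to your workload.
for batch in chunk(batch_inputs, 5):
    for input_message in batch:
        response = client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-8B-Instruct",
            messages=[input_message],
            temperature=0.7,
            top_p=0.95,
            max_tokens=128,
        )
        batch_responses.append(response["choices"][0]["message"]["content"])
```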
The same pattern works for other tasks, such as answering a batch of questions: prepare the questions as a list of messages and loop through them, collecting each response. Here’s another example:
```
batch_inputs = [
    {"role": "user", "content": "What is Deep Learning?"},
    {"role": "user", "content": "Explain the difference between AI and Machine Learning."},
    {"role": "user", "content": "What are neural networks used for?"},
]

batch_responses = []  # reset the results list for this new batch
for input_message in batch_inputs:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[input_message],
        temperature=0.7,
        top_p=0.95,
        max_tokens=128,
    )
    batch_responses.append(response["choices"][0]["message"]["content"])

for idx, output in enumerate(batch_responses, start=1):
    print(f"Response {idx}: {output}")
```
Explanation:
- batch_inputs is a list of chat messages, letting you queue up multiple questions and send them to the endpoint one after another.
- The loop collects every response into batch_responses, so you get all the results back as a single list.
- You can customize parameters like max_tokens, temperature, and top_p based on your model’s requirements; a concurrency sketch follows this list.
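Because each request in the loop is independent, you can often shorten wall-clock time by sending a few requests in parallel. This sketch uses Python’s standard concurrent.futures module and assumes your endpoint comfortably handles a handful of concurrent connections; it is an optional optimization, not part of the 1-Click setup itself:

```
from concurrent.futures import ThreadPoolExecutor

def ask(input_message):
    # Send one chat completion request and return the generated text.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[input_message],
        temperature=0.7,
        top_p=0.95,
        max_tokens=128,
    )
    return response["choices"][0]["message"]["content"]

# executor.map preserves input order, so responses line up with batch_inputs.
with ThreadPoolExecutor(max_workers=3) as executor:
    batch_responses = list(executor.map(ask, batch_inputs))
```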
DigitalOcean’s infrastructure is designed for scalability:
- High-Performance GPU Droplets: Leverage NVIDIA H100 GPUs for fast and efficient inferencing.
- Autoscaling with Kubernetes: Automatically scale your Droplet cluster to handle micro-bursts and traffic spikes.
- Load Balancers: Distribute traffic across multiple Droplets for consistent performance.
Beyond sentiment analysis and recommendation systems, batch inference is a crucial capability for business applications that handle high data volumes, making processing faster, more efficient, and more cost-effective.
- Marketing Campaigns: Monitor user sentiment during product launches. Businesses often need to analyze customer sentiment across thousands of social media posts, tweets, and reviews. Batch processing lets them work through this data all at once, surfacing trends such as whether reactions to the launch are positive or negative or whether customers are reporting a specific service issue.
- Customer Support: Companies receive large volumes of feedback via surveys or reviews. Batch inferencing can classify this feedback into predefined categories (e.g., “positive,” “negative,” and “neutral”), reducing the manual effort of going through each piece of feedback.
- Content Generation: Generating answers to multiple questions at a time is a common use case in many education and research institutes. For example, a business may want to automate responses to FAQs, or a teacher may need answers to questions from multiple students.
- Content Moderation on Platforms: Online platforms with user-generated content need to filter and moderate large amounts of text, images, or videos for inappropriate material. Batch inferencing allows for automated flagging of content violations.
Batch inferencing with DigitalOcean’s 1-Click Models is a powerful way to process multiple inputs efficiently. With a few lines of code, you can implement batch inferencing for tasks like sentiment analysis and gain timely insights into social media trends. This solution not only simplifies deployment but also delivers optimized performance and scalability.