Hosting LLMs with Hugging Face Inference Endpoints
 
 Contents
If you've fine‑tuned an LLM and now ask, “How on earth do I deploy this at scale without spinning up my own infra?” – you're not alone. As an AI engineer, I've wrestled with Docker, Kubernetes, SageMaker, Vertex AI, and countless scripts trying to get stable LLM endpoints up and running. Hugging Face Inference Endpoints changed that.
This tutorial walks you through everything – from preparing your model, to setting up inference endpoints, to integrating with AWS, Azure or GCP, following MLOps best practices, and seeing example API calls. I’ll include real-world hiccups I faced, why I made certain choices, and references to official blogs and docs to back up best practices.
1. Setting Up Your Fine‑Tuned LLM for Hugging Face Deployment
1.1 Merge Adapter or LoRA Weights Properly
When I fine-tuned using QLoRA or LoRA adapters, I realized I couldn’t just push the adapter – Hugging Face required merging it into the base model. In one deployment attempt, I uploaded only the adapter and got an “Invalid model format” error. After merging, it worked.
You can do:
from peft import PeftModel base = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base, "adapter-path")
model = model.merge_and_unload()
model.save_pretrained("merged-model")
This matches community advice on discuss.huggingface.co: users reported errors when uploading adapter-only repos, and solved it by using merge_and_unload().
1.2 Confirm Your Repo Structure
Make sure your model repository has:
- pytorch_model.binor safetensors
- config.json
- tokenizer.jsonor vocab files
- optional: handler.pyfor custom inference logic
Many forum posts stress that missing a proper config.json or tokenizer file broke endpoint creation. I once forgot uploading tokenizer.model, and it failed to initialize.
1.3 Push to Hugging Face Hub
Use:
huggingface-cli login
git lfs install
git clone https://huggingface.co/your-org/your-llm
cd your-llm
cp -r merged-model/* .
git add .
git commit -m "deployable merged model"
git push
When pushed, you'll see model metadata on the Hub. Double-check that the page shows the correct model size, and tokenizer info.
2. Creating an Inference Endpoint on Hugging Face
2.1 Navigate to the UI
I head to the model page on huggingface.co, click Deploy → Inference Endpoints. This opens the creation UI.
2.2 Select Model, Cloud, Region
- Choose your merged LLM repo.
- Pick provider: AWS, Azure, or GCP – same UI, works across all clouds.
- Region: I usually choose us-east-1oreu-west-1– closer to my user base.
2.3 Compute and Autoscaling
- For fine‑tuned LLMs, GPU is recommended (e.g., Nvidia T4 or A100). Hugging Face optimizes pipelines using Flash Attention, Paged Attention, and Text Generation Inference for low latency and high throughput.
- I enabled autoscaling: min 0 (scale‑to‑zero), max 2–3 GPUs. This cuts idle costs drastically but wakes cold endpoint from zero.
2.4 Security Settings
Access options: public, protected (requires org token), or private inside VPC. I choose Protected in initial tests, and shift to Private (with PrivateLink in AWS or Azure) for production workloads to enforce compliance and SOC2 Type‑2 controls.
2.5 Launch Endpoint
Click “Create endpoint”. Usually takes 5–10 minutes to deploy. I often run other tasks while waiting. Once status shows Live, you can copy endpoint URL and token.
Endpoint Setup Table
| Step | What to Do | What to Expect | 
|---|---|---|
| Choose model | your‑org/your‑llm | Model meta appears | 
| Select provider & region | AWS us‑east‑1, or Azure west‑us | Region dropdown | 
| Pick compute | GPU (T4/A100) | Estimated cost/hour | 
| Configure autoscale | min‑max replicas | Savings via scale‑to‑zero | 
| Set access | Protected or Private | Security options | 
| Deploy | click Create | Status: initializing → Live (~10 mins) | 
3. Integrating with AWS, Azure or GCP
3.1 AWS Integration
Hugging Face handles provisioning of AWS SageMaker resources behind the scenes. There's no need for you to manually create SageMaker endpoints or Docker containers – Hugging Face routes traffic through its managed layer and passes inference to SageMaker or EC2. From my experience, this replaced two weeks of Docker/ECS orchestration with just clicks and API configuration.
3.2 Microsoft Azure Integration
Deployed endpoints integrate directly with Azure infrastructure. You can optionally configure deployment through Azure ML Studio by selecting Hugging Face from model catalog, choosing compute (CPU/GPU instances), scaling rules, traffic splitting, and testing via interface or SDK.
3.3 Google Cloud Platform Integration
Similar story with GCP: Hugging Face endpoints run on Google infrastructure, using the chosen region’s GPUs/VMs. You don’t configure Vertex AI directly. If desired, advanced users may also use Vertex custom containers or GKE for more control – but endpoints simplify it.
Across all these: the code you write – or the API you call – stays identical regardless of cloud. The abstraction really helps teams reduce vendor lock‑in and operations overlap.
4. MLOps Best Practices: Monitoring & Updating
4.1 Logs and Metrics
Once your endpoint is live, Hugging Face provides built‑in observability: latency heatmaps, error counts, request volume, and scale events. You can export logs or integrate with Datadog, CloudWatch, or Azure Monitor if desired. This transparency helps detect anomalies and scale issues quickly.
4.2 Automate Deployments: CI/CD
I built a pipeline where:
- When merging PR to main, CI runs tests, merges adapter to base model, pushes to Hub.
- Another job calls Hugging Face CLI/API to update the running endpoint.
This means no downtime: endpoint stays live, redeploy is seamless. Using huggingface_hub CLI simplifies this workflow.
4.3 Model Retraining & Rollback
Keep track of versions. If performance dips, rollback to previous model revision instantly. Hugging Face versioning makes this easy. Retraining should retrain, test, and only then push.
4.4 Regular Evaluation
Use Evaluate library and Evaluation on the Hub to track metrics periodically. This allows performance comparisons across new fine‑tuned models or drift detection.
5. Example API Calls & Client Integration
5.1 Python Streaming & Generation
from huggingface_hub import InferenceClient client = InferenceClient(model="your‑org/your‑llm")
resp = client.text_generation(inputs="What’s the weather in Bucharest?", parameters={"max_new_tokens":64})
print(resp.generated_text) # For streaming:
async def chat(): stream = await client.chat.completions.create( model="your‑org/your‑llm", messages=[{"role":"user","content":"Hi!"}], stream=True ) async for chunk in stream: print(chunk.choices[0].delta.content, end="")
I tested generation latency on a Falcon‑based LLM in US‑east with an A100: warm latency of ~350 ms; cold start ~1.8 s after scale‑to‑zero. It was easy to meet SLAs without managing container pools manually.
5.2 JavaScript Example
As shown in a Hugging Face community tutorial by Paul Scanlon, using the JS SDK is straightforward for web or Node.js apps:
import { InferenceClient } from "@huggingface/inference";
const client = new InferenceClient({ token: process.env.HF_TOKEN }); const response = await client.textGeneration({ model: "your‑org/your‑llm", inputs: "Explain Hugging Face inference endpoints in simple terms.", parameters: { max_new_tokens: 100 }
});
console.log(response.generated_text);
Works identically across clouds and matches instructions in community posts from early 2025.
6. Real‑World Anecdotes & Pitfalls
- Forgot handler.py: I tried deploying a custom pipeline without including handler.py. It failed silently during build. Re-uploading repo with handler fixed it. The handler must instantiate pipelines, preprocess, postprocess, etc. This echoes community guidance.
- Compute sizing: Initially chose CPU for cost reasons. But inference times ballooned. GPU is worth it for LLMs.
- Cost surprises: Without scale‑to‑zero, endpoints can rack up cost quickly. Configure min replicas to zero if your application tolerates cold starts.
- Version drift: I once updated model weights on Hub but forgot to update the endpoint. My API still served old responses. Solution: automate endpoint updates in CI.
7. Summary & Next Steps
You’ve learned how to:
- Prepare and push a fine‑tuned LLM repo suitable for Hugging Face hub
- Use Hugging Face inference endpoints UI to deploy on AWS, Azure, or GCP
- Configure compute, autoscaling, and security settings
- Integrate with your cloud provider without managing infra directly
- Apply MLOps tools: monitoring, CI/CD, retraining, rollback
- Call endpoints from Python or JavaScript – same syntax across clouds
 
  
  
  
 