Skip to main content

Deploy LLMs in Minutes with the New LLM Job Type

· 3 min read

proxiML now offers a dedicated LLM job type that lets you deploy pre-configured large language models as managed inference endpoints in minutes. Select a model family and size from the platform and get an OpenAI-compatible endpoint with no custom serving commands, Docker images, or manual checkpoint setup required. Currently supported families include Gemma 4, Qwen 3.5, and Qwen 3.6.

How It Works

The llm job type is a specialized endpoint that abstracts the complexity of serving large language models. Instead of configuring Docker images, start commands, and checkpoint paths manually, you simply choose a model family and size. The platform handles the rest:

  • Managed Serving Image -- The platform automatically provisions the correct inference server container, pre-built for serving LLM workloads.
  • Auto-Resolved Checkpoints -- Based on your selected family and size, the matching model checkpoint is automatically attached to the job. No need to upload or reference model weights yourself.
  • Generated Start Command -- The inference server start command is built from your configuration, including settings for Mixture of Experts (MoE), quantization, reasoning parsers, and tool call parsers.
  • OpenAI-Compatible API -- Your deployed endpoint exposes a standard OpenAI-compatible interface, so you can connect with existing tooling or a simple curl to GET /v1/models.

Supported models include:

FamilySizes
Gemma 42B (MoE), 4B (MoE), 26B (MoE), 31B
Qwen 3.50.8B
Qwen 3.627B (FP8), 35B (MoE+FP8)

You can further fine-tune serving behavior with optional advanced settings like quantization method, reasoning parser, tool parser, and additional command-line arguments.

Using the Web Platform

From the job creation form, select LLM as the job type. The endpoint section presents:

  • Model Family -- Choose from the available families (Gemma 4, Qwen 3.5, Qwen 3.6).
  • Model Size -- Select a size from the chosen family.
  • Advanced Settings -- Optionally override the reasoning parser, tool parser, quantization method, or pass additional command-line arguments to the inference server.

The platform handles GPU selection, disk allocation, and checkpoint attachment automatically. Once the job reaches the running state, your LLM endpoint is live and accessible via the endpoint URL.

Using the SDK

job = await proximl.jobs.create( name="Test LLM Job", type="llm", gpu_type="rtx2070s", gpu_count=1, endpoint=dict( llm=dict( family="qwen_3.5", size="0.8B", additional_kwargs="--max-model-len 180064", ), ), )

You can also combine an LLM job with an Endpoint Authorizer to secure your deployment:

job = await proximl.jobs.create( name="Secure LLM Job", type="llm", gpu_type="rtx2070s", gpu_count=1, endpoint=dict( llm=dict( family="gemma_4", size="31B", ), authorizer=dict( type="api_key", keys=[dict(client_id="my-app", key="my-secret-key")], ), ), )

The family and size fields are required; all other llm options are optional and will use sensible defaults based on the selected model configuration.