Think Meta’s Llama 4 is fully released? Not yet.
Scout and Maverick are already available via SageMaker, Cloudflare, and model hubs, but the big Behemoth — the 288B active-parameter teacher model — is still training with no public release date.
That gap matters if you’re planning around the flagship or waiting for distilled improvements: for now, you should build with Scout and Maverick.
This piece explains the current availability, why Behemoth’s timing is unclear, and what to do next.
Latest Availability Update on the Llama 4 Model Release

Meta’s already shipped two Llama 4 models: Scout and Maverick. You can grab them right now through Amazon SageMaker JumpStart, Cloudflare Workers AI, or straight from the model hub. The big one, though? Llama 4 Behemoth is still training. No release date yet.
So we’re in this weird middle ground where the smaller models are out there working, but the flagship teacher model is still cooking. Behemoth packs 288 billion active parameters and close to 2 trillion total parameters. It’s built to teach the smaller models through distillation, not for you to run directly. Meta hasn’t said when it’ll be ready or even hinted at a timeline, which means if you’re evaluating Llama 4 for actual work, you’re planning around Scout and Maverick for now.
The announcement came with real builds you can deploy. AWS has them in US East for fine-tuning, plus several regions for inference. Edge silicon partners got deployment-ready versions too. Scout runs a 10-million-token context window and fits on a single NVIDIA H100 if you quantize to Int4. Maverick comes in FP8 (quantized) and BF16 (full precision), handles 128,000 tokens, and can run on one H100 DGX host or split across multiple GPUs depending on which precision you pick.
Technical Overview of Llama 4 Model Variants

Scout’s built with 17 billion active parameters, 16 experts, 109 billion total. The standout feature? A 10-million-token context window. That’s roughly 78 times what Llama 3 could handle. It uses early-fusion multimodality, meaning text and vision tokens get processed together in one unified backbone instead of being handled separately. Pre-training and post-training both used 256K context to help it generalize across things like multi-document summaries, deep codebase analysis, and long session personalization. And yeah, it fits on a single H100 when you quantize to Int4, which makes it accessible even if you’re not swimming in accelerators.
Maverick keeps the same 17 billion active parameters but scales up to 128 experts and 400 billion total through alternating dense and mixture-of-experts layers. Each token routes through one routed expert plus one shared expert, keeping per-token costs manageable while distributing the compute. You get two versions: FP8 for faster inference with less memory, or BF16 when you need maximum accuracy for heavy reasoning tasks. The context window sits at 128K, and it’s designed for multimodal reasoning, multilingual translation across 200 languages, agentic tool-calling, and serious coding work.
Behemoth is the teacher. It’s got 288 billion active parameters, 16 experts, nearly 2 trillion total. Its whole job is to guide knowledge distillation into Scout and Maverick. Pre-training ran on more than 30 trillion tokens at FP8 precision across 32,000 GPUs, hitting around 390 teraFLOPs per GPU during training. It beat GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks, including MATH-500 and GPQA Diamond. But you can’t use it yet.
Scout optimizes for extreme context length with minimal hardware. Maverick prioritizes efficiency through aggressive MoE routing while keeping strong performance on reasoning and coding benchmarks, matching or beating DeepSeek v3 with fewer active parameters. Behemoth maxes out raw capability and serves as a distillation source, not a deployment target.
All three use iRoPE (interleaved attention without positional embeddings) to handle their context windows. It replaces traditional positional encoding with something built for scaling across millions of tokens.
Technical Advancements That Influenced the Llama 4 Release Window

Meta built a new hyper-parameter optimization method called MetaP for Llama 4. It needed iterative testing across billions of tokens to lock in settings for learning rate, batch size, and MoE routing thresholds before full-scale training could even start. Pre-training ran at FP8 precision on 32,000 GPUs, pushing compute utilization to around 390 teraFLOPs per GPU during Behemoth’s training phase. That scale stretched timeline estimates because coordinating hardware and managing checkpoints across distributed clusters takes serious time.
The training data mixture topped 30 trillion tokens. More than double what Llama 3 used. It included 200 languages, with over 100 languages getting at least 1 billion tokens each. All of that required extensive curation, filtering, and deduplication before training could begin.
Post-training used a three-stage pipeline: lightweight supervised fine-tuning (SFT), continuous online reinforcement learning (RL), and lightweight direct preference optimization (DPO). Meta’s filtering strategy dumped more than 50 percent of “easy” SFT examples to improve RL effectiveness. For Behemoth, they pruned the SFT dataset by roughly 95 percent to focus the model on high-complexity prompts. Online RL used adaptive filtering to select medium-to-hard reasoning and coding challenges dynamically. That process took weeks of iterative tuning and delayed Behemoth’s release while Scout and Maverick wrapped their shorter post-training cycles.
Benchmark claims show Scout outperforming Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 in its class. Maverick beats GPT-4o and Gemini 2.0 Flash across broad evaluations and matches DeepSeek v3 on reasoning and coding despite using fewer active parameters. Those results justified staggered releases so smaller models could ship while the flagship kept training.
Safety and alignment improvements added extra post-training overhead through multi-layer mitigation systems: pre-training data filters, post-training safety tuning, and system-level safeguards including an input/output safety LLM, a prompt classifier for jailbreaks and prompt injections, plus cybersecurity evaluation tooling. Meta cut the refusal rate on debated political and social topics from about 7 percent in Llama 3.3 to below 2 percent in Llama 4. Unequal response refusal proportions dropped to less than 1 percent. Hitting those metrics required iterative red-teaming and preference-model retraining before clearing internal release criteria, which pushed the overall timeline past initial projections.
Llama 4 Availability Across Cloud Providers and Deployment Platforms

Amazon SageMaker JumpStart hosts three Llama 4 builds: Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct, and Llama-4-Maverick-17B-128E-Instruct-FP8. All deployable in US East (N. Virginia). Fine-tuning support lives in US East (IAD). Default instance type is p5.48xlarge with an initial instance count of 1, optimized for real-time inference with sustained traffic and low latency. You can pick alternative instance types to handle increased context length or batch processing. Deployment needs accept_eula=True in the SageMaker SDK to acknowledge Meta’s community license agreement. Endpoints typically hit InService status within a few minutes, then you’re ready to make inference calls via the SageMaker runtime client or predictor interface.
Cloudflare Workers AI lists Llama 4 as available through their serverless inference runtime. You can run Scout and Maverick without managing infrastructure or provisioning GPUs. Meta’s announcement mentions distribution across additional cloud and edge partners, with integration into major messaging apps, social platforms, and web-based assistant interfaces, though specific partner names and regional rollout schedules weren’t detailed in the release materials.
| Platform | Model Availability | Notes |
|---|---|---|
| Amazon SageMaker JumpStart | Scout, Maverick (BF16 & FP8) | Deployable in US East (N. Virginia); fine-tuning in US East (IAD); requires p5.48xlarge or alternative GPU instances; accept_eula=True required |
| Cloudflare Workers AI | Scout, Maverick | Serverless inference runtime; no GPU provisioning needed; global edge distribution |
| Major model hubs | Scout, Maverick | Direct weight downloads available; cloud, edge silicon, and partner integrators noted in announcement |
How Llama 4 Compares to Earlier Generations and What That Means for Release Patterns

Llama 4 introduces mixture-of-experts routing across Scout and Maverick for the first time in the Llama family. That’s a departure from Llama 3’s dense architecture, and it required additional validation and testing before release. Scout’s 10-million-token context window is a 78x jump over Llama 3’s 128K maximum. That opens up new use cases like full-repository code analysis and multi-year document synthesis that just weren’t practical before.
Multilingual capabilities expanded roughly 10x over Llama 3. Pre-training covered 200 languages with more than 100 languages getting at least 1 billion tokens each, compared to Llama 3’s narrower language distribution. That change extended data-preparation timelines and delayed unified release dates for all model variants.
Llama 4’s performance claims put Scout and Maverick up against GPT-4o, Gemini 2.0 Flash, and DeepSeek v3. Behemoth targets GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM benchmarks like MATH-500 and GPQA Diamond. Llama 3.3 competed mostly against earlier GPT-3.5 and GPT-4 non-turbo variants, so Llama 4 is aiming at a higher tier with stronger reasoning, coding, and multimodal performance.
The shift to early-fusion multimodality (unified text and vision tokens) replaces Llama 3’s separate vision adapters. That required new post-training recipes and vision-encoder training, which added months to the development cycle compared to Llama 3’s text-only training path.
MoE routing got tested across billions of inference tokens to validate expert-selection accuracy and computational efficiency before Scout and Maverick could ship. The 10M token context window required new positional-encoding methods (iRoPE) and memory-efficient attention mechanisms, both of which needed large-scale verification before deployment. Multilingual data curation for 200 languages introduced additional filtering, deduplication, and quality-control steps that delayed pre-training starts compared to Llama 3’s faster data pipeline. Early-fusion multimodality required joint training and alignment of vision encoders with the language backbone, extending post-training timelines relative to Llama 3’s simpler text-only tuning. Safety and alignment improvements (reduced refusal rates, prompt-injection defenses, cybersecurity tooling) added iterative red-teaming cycles that pushed release dates beyond initial internal targets.
Expected Release Window for Llama 4 Behemoth and Future Updates

Meta confirmed that Llama 4 Behemoth is still in training. It’s serving as the teacher model for distillation into Scout and Maverick, but there’s no formal release date, estimated completion quarter, or beta-access timeline. The model’s scale (288 billion active parameters, 16 experts, nearly 2 trillion total parameters) requires extended training runs on 32,000 GPUs. Meta’s announcement didn’t say whether Behemoth will be released as downloadable weights, API-only access, or restricted to research partners.
Meta mentioned an upcoming event called LlamaCon scheduled for April 29. Additional details about the Llama 4 vision and future roadmap are expected there. That event date doesn’t guarantee a Behemoth release announcement, but it’s the next known public milestone where Meta might clarify availability, access tiers, and deployment options for the flagship model. Training-scale indicators suggest Behemoth’s development timeline extends months beyond Scout and Maverick because of the computational cost of pre-training and the iterative nature of the 95-percent SFT-data-pruning strategy used in post-training.
Enterprise features likely to come with Behemoth’s release include tiered API access with rate limits and quotas, managed inference endpoints optimized for high-throughput batch processing, and possibly restricted fine-tuning licenses for commercial deployments. Meta hasn’t announced pricing structures, subscription models, or free-tier details for Llama 4. Historical patterns suggest weights for smaller models (Scout, Maverick) will stay open for download under the community license while larger models like Behemoth may require paid access or partnership agreements. Teams planning enterprise rollouts should monitor Meta’s developer channels for API documentation updates, enterprise licensing terms, and service-level agreements that typically accompany flagship-model launches.
Developer Integration Guidance for Newly Released Llama 4 Models

Integrating Llama 4 Scout or Maverick via Amazon SageMaker JumpStart needs an AWS account with access to GPU-accelerated instances (p5.48xlarge recommended), an IAM role with SageMaker permissions, and acceptance of Meta’s end-user license agreement by setting accept_eula=True in the deployment SDK call. You can deploy models through SageMaker Studio, SageMaker AI notebooks, or compatible IDEs like PyCharm and VS Code. Endpoint provisioning typically wraps within a few minutes, status transitions to InService when ready for inference.
Cloudflare Workers AI supports Llama 4 through their serverless runtime, which cuts out infrastructure management and GPU provisioning. But you should check rate limits and regional availability before committing to production workloads.
Inference performance and resource requirements shift significantly by model and quantization choice. Scout at Int4 quantization fits on a single NVIDIA H100 and supports 10-million-token contexts. Maverick in BF16 precision requires either a single H100 DGX host for standard workloads or distributed inference across multiple GPUs for full 128K context at peak throughput. The FP8 quantized version of Maverick cuts memory overhead and increases tokens per second compared to BF16, but may produce slightly different outputs on edge cases needing maximum numerical precision. Teams running extended context lengths (multi-million tokens for Scout or near-128K for Maverick) should watch GPU memory utilization and consider instance upgrades or batch-size reductions to avoid out-of-memory errors during inference.
| Model | Quantization Options | Deployment Notes |
|---|---|---|
| Llama-4-Scout-17B-16E-Instruct | Int4 (single H100 fit) | 10M token context; early-fusion multimodal; tested post-training on up to 8 images; recommended for document-heavy and session-based apps |
| Llama-4-Maverick-17B-128E-Instruct | BF16 (full precision) | 128K token context; maximum accuracy for reasoning and coding; requires H100 DGX or distributed inference for peak loads |
| Llama-4-Maverick-17B-128E-Instruct-FP8 | FP8 (quantized) | 128K token context; faster inference and lower memory footprint; slight numerical differences vs. BF16 on edge cases |
Community and Ecosystem Response to the Llama 4 Rollout

Early community reports highlight strong performance on multimodal tasks: image captioning, visual question answering, document VQA, and multi-image synthesis. Developers are sharing examples of Scout processing entire codebases to extract build specifications and Maverick analyzing Amazon 10-K filings from 2017 through 2024 in a single prompt. Comparisons with GPT-4o and Gemini 2.0 Flash pop up frequently in developer forums and social threads. Anecdotal claims suggest Maverick matches or exceeds GPT-4o on reasoning benchmarks while using fewer active parameters and offering open weights for on-premise deployment. That combination is attracting teams with data-residency requirements or cost-optimization goals.
Demos posted by developers show practical use cases that were difficult or impossible with Llama 3’s 128K context limit. One example had Scout analyzing a 3-million-token research paper corpus to generate a structured literature review. Another used Maverick to perform multi-image reasoning across a sequence of medical scans and generate a preliminary diagnostic report. YouTube demos and community tutorials focus on SageMaker JumpStart deployment workflows, quantization tradeoffs between Int4, FP8, and BF16, and prompt-engineering techniques for maxing out context utilization without hitting rate limits or memory ceilings. Strong early adoption among developers familiar with AWS tooling and large-scale inference optimization.
Migration Notes for Teams Moving from Llama 3 to Llama 4

Llama 4’s shift to mixture-of-experts routing on Scout and Maverick changes the computational profile compared to Llama 3’s dense layers. Fine-tuning workflows that assume uniform per-layer costs will need adjustment for memory allocation and batch-size tuning. Scout’s 10-million-token context window and Maverick’s 128K window both exceed Llama 3’s 128K maximum. That enables new use cases but also requires prompt redesigns to take advantage of expanded context without triggering out-of-memory errors or excessive inference latency.
Multilingual capabilities increased roughly 10x. 200 languages supported, over 100 languages represented by at least 1 billion tokens each. Teams working with low-resource languages should revalidate prompt templates and few-shot examples to account for improved tokenization and vocabulary coverage.
Llama 4’s early-fusion multimodality (unified text and vision tokens) replaces Llama 3’s separate vision adapters. Prompts that previously passed images as external inputs may need restructuring to interleave vision tokens with text tokens in a single sequence.
Safety filters and refusal behaviors changed significantly. Llama 4 cut refusal rates on debated topics from about 7 percent to below 2 percent. Unequal response refusal dropped to less than 1 percent. Teams that built prompt guardrails or retry logic around Llama 3’s higher refusal rates should test whether those mechanisms remain necessary or introduce unnecessary latency.
Review and update memory allocation settings in fine-tuning scripts to account for MoE routing overhead. Make sure batch sizes don’t exceed GPU capacity during gradient accumulation. Redesign prompts to use extended context windows (10M for Scout, 128K for Maverick) by consolidating multi-turn conversations or document batches into single prompts where they previously required splitting. Revalidate multilingual prompt templates and few-shot examples, particularly for languages outside the top 20 by web presence, to confirm improved tokenization and remove workarounds implemented for Llama 3’s narrower language support. Update vision-input workflows to use early-fusion interleaving (text and vision tokens in a unified sequence) instead of separate vision-adapter calls. Test with up to 8 images per prompt based on Llama 4’s post-training validation limits. Test safety and refusal behavior on sensitive or edge-case prompts to see whether existing guardrails (retry logic, fallback prompts, manual moderation triggers) remain necessary given Llama 4’s reduced refusal rates and improved alignment.
Final Words
Scout and Maverick are available now on model hubs, SageMaker JumpStart (US East), and Cloudflare Workers, so you can start testing and fine-tuning immediately.
Behemoth is still in training and Meta hasn’t shared a public timeline; LlamaCon and other signals hint at more details, but the meta llama 4 release date remains unannounced for now.
Start migrating workflows to Scout/Maverick where helpful, monitor official feeds for Behemoth updates, and expect a careful, capability-driven rollout — a solid next step for teams and developers.
FAQ
Q: Is Llama 4 released yet?
A: Llama 4 Scout and Maverick are released and available now on model hubs and cloud platforms (SageMaker JumpStart, Cloudflare Workers AI); the largest Behemoth model remains in training with no public release date.
Q: Is Meta AI Llama 4 free?
A: Meta AI Llama 4’s pricing varies by platform: Meta hasn’t announced a single free tier; some demos or limited access may be free, but most cloud deployments are billed under each provider’s terms.
Q: Where is Llama 4 available?
A: Llama 4 Scout and Maverick are available on model hubs and cloud partners, including AWS SageMaker JumpStart (US East/N. Virginia), Cloudflare Workers AI, and other distributor platforms—check provider docs for model IDs and permissions.
Q: What GPU is needed for Llama 4?
A: GPU needs depend on the variant: Scout can run on a single NVIDIA H100; Maverick and Behemoth usually require multi‑H100‑class capacity or cloud p5 instances; FP8/BF16 quantized builds reduce memory requirements.

