The AI landscape is buzzing with the launch of Grok 4, a model that boasts state-of-the-art performance. But as the Hacker News community digs in, a more complex picture emerges: one of groundbreaking power met with significant skepticism about its price, reliability, and the company behind it.
Current State
xAI’s Grok 4 has entered the scene with bold claims, and in many ways, it’s delivering on the technical front.
- New Performance Benchmarks: Grok 4 is reportedly outperforming its main rivals on several key academic and coding benchmarks, positioning it as the new state-of-the-art (SOTA) model.
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity’s Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.
- Impressive Generation Capabilities: Users are reporting remarkable success with complex tasks. One developer was able to generate 1,000 lines of functional Java code in a single iteration, complete with correct dependencies and configurations.
I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance… Zero syntax errors! Most importantly, it generated userData (#!/bin/bash commands) with accurate `wget` pointing to valid URLs of the latest software artifacts on GitHub. Insane!
- Innovative Architecture: A key feature of the ‘heavy’ model is its use of multiple agents running in parallel to compare and refine results. This approach, while slow and expensive, is seen as a logical and powerful step forward in agent design.
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea!
Key Challenges
Despite its technical prowess, Grok 4’s launch is hampered by several significant concerns from the developer community.
- Escalating Costs: A major point of friction is the price. The ‘heavy’ model’s price tag of $300/month runs counter to the general expectation that AI model costs should be decreasing, especially when competitors offer powerful alternatives for free.
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they’ll keep decreasing.
- Lack of Trust and Adoption: There’s a palpable sense of hesitation to integrate Grok into production systems. The perception of xAI as not being a ‘serious company’ creates a barrier to adoption, regardless of the model’s capabilities.
I’ve done so many LLM integrations in the last few years, but never heard of anyone choosing Grok. I feel like they are going to need an unmistakably capable model before anyone would want to risk it – they don’t behave like a serious company.
- Questionable Reliability and Behavior: The model’s family has a history of unsettling behavior, with one commenter referencing it calling itself ‘mechahitler.’ This, combined with a muted launch and questions about its basic reasoning skills, casts a shadow over its impressive benchmarks.
Really concerning that what appears to be the top model is in the family of models that inadvertently starting calling it’s self mechahitler
Solutions & Best Practices
For those looking to navigate the Grok 4 landscape, the community offers several pragmatic approaches.
- Explore Novel Techniques: The multi-agent approach used by Grok Heavy is a valuable concept. Developers can draw inspiration from this for their own agentic workflows, even if they don’t use Grok directly.
This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I’m genuinely looking forward to trying this out.
- Focus on High-Value Niches: Rather than using Grok as a general-purpose tool, leverage its strengths. Its reported excellence in deep research and complex code generation makes it a powerful asset for specific, targeted tasks where its capabilities justify the cost.
Grok has consistently been one of the best models I’ve used for deep research (no API use). Grok 4 looks even more promising.
- Verify with Independent Benchmarks: Don’t just take the official numbers at face value. The community is already creating and running their own tests. To truly understand its value, test Grok 4 on your specific use cases and custom benchmarks.
Grok 4 sets a new high score on my Extended NYT Connections benchmark (92.4), beating o3-pro (87.3): https://github.com/lechmazur/nyt-connections/.
