
    Why AI Infrastructure Software Is Harder Than Hardware — Lessons from Building Aurora and FlexAI

    April 7, 2026

    Small AI startups are dying — not from lack of innovation, but from infrastructure exhaustion. While the industry focuses on model architecture and training data, a quieter crisis unfolds in the trenches: talented teams spending weeks configuring Kubernetes clusters, burning through runway on idle GPU capacity, and becoming DevOps experts when they set out to build AI applications.

    Brijesh Tripathi has seen this problem from every angle. After 25 years at NVIDIA, Apple, Tesla, and Intel — where he led the delivery of Aurora, one of the world's most powerful supercomputers — he founded FlexAI to solve what he calls "the missing layer" in AI infrastructure.

    In a recent conversation on the AI Engineering Podcast with host Tobias Macy, Tripathi shared his perspective on what's actually broken in AI infrastructure and why the answer isn't more GPUs.

    What Delivering the Aurora Supercomputer Taught Brijesh About AI Infrastructure

    Tripathi's path to founding FlexAI started with an unexpected realization while delivering Aurora to Argonne National Laboratory. The supercomputer — one of the most powerful machines ever built — was a hardware marvel destined for research in weather modeling, drug discovery, and nuclear science.

    But getting it running wasn't the hard part.

    "As I was finishing it, I realized that hardware is okay, it's hard, but actually the word software is a misnomer because software was the hardest thing," Tripathi explains. "The challenge was not just getting it up and running, but really getting customers and users to take advantage of this massive capacity. And it came from not having the right infrastructure management tools."

    This runs counter to the conventional wisdom in AI infrastructure. While the industry obsesses over GPU availability and chip specs, the real bottleneck is often the software layer between hardware and applications. Tripathi sees this play out in several recurring patterns:

    • 2–3 months of setup time to get GPU clusters operational

    • Specialized knowledge requirements that differ across every cloud provider

    • 20–30% average GPU utilization across most organizations

    • Cost structures that force teams to rent GPUs by the month when they need them by the hour

    Why "GPU as a Service" Failed AI Teams

    When FlexAI launched two years ago, "GPU as a Service" was the industry buzzword. Tripathi saw through it early.

    "What it was was plain and simple renting GPUs that only you have access to," he says. "You start renting a GPU that you have to now go build a stack on top of it to make sure that it can run whether you're trying to run training on it, pre-train, fine-tuning, or you are trying to deploy an inference server on it — and everything will require a very different setup."

    The result: teams spending days to weeks configuring infrastructure for each new workload, then tearing it down and starting over for the next one. Tripathi's thesis with FlexAI is that the abstraction layer should sit at the workload level, not the hardware level — what the company calls "workload as a service."

    The 90/10 Prediction: Why Inference Will Dominate AI Compute

    One of Tripathi's most consequential claims concerns the long-term split between training and inference compute.

    "As this industry settles down, I think the ratio is going to be 90 to 10," he predicts. "90% compute is going to be spent on inference, 10% on training. It's going to be a handful of consolidated players who do big models, but the rest of them are going to be either an optimized version or a fine-tuned version or a reduced-size version of those similar models for their specific use cases."

    The infrastructure implications are significant. Training workloads — the domain of OpenAI, Meta, xAI, Google, and Anthropic — run for months at near-100% utilization on massive NVIDIA clusters, where dedicated infrastructure and scheduler micro-optimization make sense.

    Inference workloads are a fundamentally different problem. They're spiky (high demand peaks with long idle periods), cost-sensitive (margins depend on efficient serving), increasingly architecture-agnostic (OpenAI API standardization means workloads can move between hardware), and multi-tenant (many customers can share the same infrastructure).

    Tripathi illustrates the cost pressure with a striking example: one company with $100 million in ARR was spending $87 million on infrastructure to serve it.

    Why Kubernetes Fell Short for AI Workloads

    Kubernetes promised to abstract away infrastructure complexity. Tripathi argues it hasn't delivered for AI.

    "Theoretically that was the promise made, but unfortunately the dependencies, the libraries, the overall complexity of actually starting from N number of Kubernetes implementations — not everybody offers the exact same abstraction," he explains.

    FlexAI's approach borrows from classic computer architecture instead. "If you have studied computer architecture, there is a whole concept of scheduling where the entire purpose is to make sure every cycle of the CPU is busy," Tripathi says. "Your entire job in scheduling is to make sure every cycle is being used and used for the right purposes."

    Applied to GPUs, this means priority-based orchestration across workloads:

    • Real-time (no interruptions): user-facing inference endpoints

    • High priority: business-critical training that impacts revenue

    • Best effort: find the cheapest resources, run when capacity is available

    The system uses checkpoints as natural preemption points — if a higher-priority inference workload needs capacity, a long-running training job can pause at its next checkpoint and resume later without losing progress.
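
    As an illustration of how checkpoint-based preemption could work across those three tiers, here is a minimal Python sketch. The class names and eviction policy are hypothetical, not FlexAI's actual scheduler:

        from dataclasses import dataclass, field
        from enum import IntEnum


        class Priority(IntEnum):
            REAL_TIME = 0    # user-facing inference: never preempted
            HIGH = 1         # business-critical training
            BEST_EFFORT = 2  # runs only when spare capacity exists


        @dataclass(order=True)
        class Workload:
            priority: Priority
            name: str = field(compare=False)
            gpus_needed: int = field(compare=False)


        class CheckpointingScheduler:
            """Toy scheduler that preempts lower-priority jobs at checkpoints."""

            def __init__(self, total_gpus: int):
                self.free_gpus = total_gpus
                self.running: list[Workload] = []

            def submit(self, job: Workload) -> None:
                # Free capacity by pausing the least urgent jobs first, but
                # only those strictly lower-priority than the incoming job.
                for victim in sorted(self.running, key=lambda j: j.priority, reverse=True):
                    if self.free_gpus >= job.gpus_needed:
                        break
                    if victim.priority > job.priority:
                        self._pause_at_next_checkpoint(victim)
                if self.free_gpus >= job.gpus_needed:
                    self.running.append(job)
                    self.free_gpus -= job.gpus_needed

            def _pause_at_next_checkpoint(self, job: Workload) -> None:
                # In a real system this would signal the trainer to stop after
                # its next checkpoint write and requeue the job to resume later.
                self.running.remove(job)
                self.free_gpus += job.gpus_needed

    A job paused this way loses no progress: it resumes from its latest checkpoint once capacity frees up, which is exactly the behavior described above.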

    Technical Approaches to GPU Utilization, Multi-Tenancy, and Multi-Cloud

    Several of the technical details Tripathi shared on the podcast are worth highlighting:

    Self-healing infrastructure. When training across thousands of GPUs, node failures are inevitable. Traditional approaches require restarting from the last checkpoint, potentially losing hours of work. FlexAI's approach combines continuous background checkpointing (with near-zero performance cost) with automatic node replacement, so training continues without lost cycles.
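
    The general pattern behind low-overhead background checkpointing is to snapshot weights to host memory synchronously (fast) and persist them asynchronously (slow), so the GPU never waits on storage. A minimal PyTorch sketch of that pattern, not FlexAI's implementation:

        import threading

        import torch


        def checkpoint_async(model: torch.nn.Module, step: int, path: str) -> threading.Thread:
            """Snapshot weights to CPU synchronously, persist them in the background.

            The training loop only blocks for the device-to-host copy; the slow
            disk or object-store write overlaps with the next training steps.
            """
            cpu_state = {k: v.detach().to("cpu", copy=True)
                         for k, v in model.state_dict().items()}

            def _write() -> None:
                torch.save({"step": step, "model": cpu_state}, path)

            t = threading.Thread(target=_write, daemon=True)
            t.start()
            return t


        # Inside a training loop (illustrative):
        #   if step % 100 == 0:
        #       checkpoint_async(model, step, f"/local/ckpt-{step}.pt")

    Pair this with automatic node replacement and a failed node costs at most a few steps of recomputation rather than hours.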

    GPU multi-tenancy. CPUs have supported virtual machines for decades — multiple workloads sharing hardware. GPUs traditionally haven't. FlexAI enables training and inference to coexist on the same GPUs. If you're running fine-tuning over several days, slowing down slightly during peak inference hours doesn't impact your business but dramatically improves overall utilization.
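
    Real GPU sharing relies on mechanisms like NVIDIA MPS/MIG or scheduler-level time-slicing; the sketch below only illustrates the duty-cycle idea with a hypothetical cooperative throttle, where a background fine-tuning job deliberately yields part of each step interval during peak hours:

        import time
        from datetime import datetime


        def throttle_factor(now: datetime) -> float:
            """Hypothetical policy: yield most of the GPU during peak inference hours."""
            return 0.25 if 9 <= now.hour < 18 else 1.0


        def maybe_yield(step_time_s: float) -> None:
            # Sleeping step_time * (1/f - 1) after each step gives the training
            # job a duty cycle of f, leaving the rest of each interval to
            # co-located inference traffic.
            factor = throttle_factor(datetime.now())
            if factor < 1.0:
                time.sleep(step_time_s * (1.0 / factor - 1.0))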

    Intelligent caching for multi-cloud. Tripathi described a customer with data in a hyperscaler who wanted to use cheaper compute in a neocloud, but egress fees (tens of thousands of dollars) made it uneconomical. FlexAI's architecture caches data between cloud storage and on-node storage so egress fees are incurred once, not on every epoch.
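
    A minimal sketch of that read-through pattern, assuming a node-local cache directory and a caller-supplied download function (boto3, gcsfs, or similar); none of this is FlexAI's actual implementation:

        import shutil
        from pathlib import Path
        from typing import Callable


        def cached_fetch(remote_uri: str, cache_dir: Path,
                         download: Callable[[str, Path], None]) -> Path:
            """Read-through cache: pay the egress fee once, not once per epoch."""
            local = cache_dir / remote_uri.rsplit("/", 1)[-1]
            if not local.exists():
                cache_dir.mkdir(parents=True, exist_ok=True)
                tmp = local.with_suffix(local.suffix + ".partial")
                download(remote_uri, tmp)  # cross-cloud transfer: egress billed here
                shutil.move(tmp, local)    # publish into the cache only when complete
            return local                   # subsequent epochs hit node-local disk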

    Heterogeneous compute orchestration. Different workloads benefit from different architectures — NVIDIA for training (most mature ecosystem), AMD for high-throughput inference (better throughput per dollar), Tenstorrent and emerging architectures for cost-optimized inference (potentially 6–10× more cost-effective), and ASICs for edge deployment. Because most inference endpoints now use the standard OpenAI API, the underlying hardware becomes an implementation detail that the orchestration layer can optimize transparently.
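
    Because the chat-completions API has become a de facto standard, the same client code can target any compliant backend. A short sketch using the official openai Python client; the URLs, keys, and model name are placeholders, not real services:

        from openai import OpenAI

        # Both pools speak the OpenAI-compatible API, so client code is
        # identical regardless of the accelerator behind each endpoint.
        backends = {
            "nvidia": OpenAI(base_url="https://nvidia-pool.example.com/v1", api_key="..."),
            "amd": OpenAI(base_url="https://amd-pool.example.com/v1", api_key="..."),
        }


        def complete(backend: str, prompt: str) -> str:
            resp = backends[backend].chat.completions.create(
                model="my-finetuned-model",
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

    Swapping the backend key is the only change needed to move a workload between hardware pools, which is what lets an orchestration layer make that choice on the user's behalf.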

    Where This Approach Doesn't Apply

    Tripathi is candid about FlexAI's boundaries. Beyond 10,000 GPUs, organizations are likely running single massive training workloads and should build dedicated infrastructure teams to extract maximum performance. Similarly, teams whose core competency is infrastructure management and who need micro-level control probably don't need another abstraction layer.

    The target is the 90% of AI teams who want to build applications, not manage infrastructure.

    The Metric That Matters: Infrastructure Problems That Don't Happen

    When asked about success metrics, Tripathi's answer was notable for what it didn't include:

    "In my mind, the success metric for us is going to be when our users can claim that we haven't dealt with infrastructure issues in the last so many months that we have been working with Flex AI. That's going to be a success for us."

    Not revenue milestones or GPU counts. Success is measured by the weeks of DevOps work that AI teams never have to do, the runway that doesn't get burned on idle capacity, and the iterations that happen faster because infrastructure just works.

    As AI moves from research to production and from training to inference, infrastructure complexity only increases. The teams best positioned to succeed won't necessarily be those with the most GPUs or the biggest models — they'll be the ones who can iterate fastest and focus their talent on solving actual problems rather than managing infrastructure.

    Listen to the full conversation between Brijesh Tripathi and Tobias Macy on the AI Engineering Podcast.

    Get Started Today

    Start building with €100 in free credits for first-time users.