What happens after AI is trained? Microsoft-backed dMatrix CEO gives blunt reality check (EXCLUSIVE)
As the global artificial intelligence race intensifies, the industry's attention has been fixated on building ever larger and more powerful models. But according to Sid Sheth, co-founder and CEO of semiconductor start-up dMatrix, the next defining phase of AI may have less to do with model size and more to do with how efficiently those models can actually run in the real world.
The Silicon Valley-based company, backed by Microsoft and other investors, is building chips designed specifically for AI inference: running trained models to generate responses and carry out tasks. In contrast to conventional GPU designs that separate memory from processing, dMatrix's architecture, known as Digital In-Memory Computing, is geared towards minimizing data movement inside the chip, improving both energy efficiency and latency.
To date, the startup has raised over $450 million, and Bengaluru is becoming one of its major research and development centers. With AI no longer a matter of experimentation but of widespread deployment in products, services, and software systems, Sheth believes the economics of inference will become the determining factor in how extensively AI can be scaled.
In a conversation with International Business Times, Sheth discussed the structural constraints of GPU-based AI infrastructure, the growing diversification of data center hardware, and why India is playing an increasingly important role in foundational semiconductor design.

Below are edited excerpts from the conversation.
IBT: You’ve said inference efficiency, not model size, will decide the AI race. Are you effectively arguing that today’s GPT-style scaling strategy from players like OpenAI and Anthropic is economically unsustainable?
Sid Sheth: We are engaged with virtually every leading AI model builder and service provider and can assure you that all are focused on delivering the fastest, most efficient AI possible. AI industry leaders recognize that a heterogeneous compute framework is necessary.
GPUs are exceptional for training and for building the larger foundation models that unlocked reasoning, multimodal capabilities, and entirely new classes of applications. That phase was critical to getting us here.
But as the industry shifts its focus from training to large-scale inference, where billions of users interact with AI systems every second, the cost structure changes. As AI becomes embedded into products, workflows, and autonomous agents, inference volume grows exponentially.
The next phase of AI won’t be defined only by who builds the biggest model, but by who can operationalize that intelligence predictably, efficiently and sustainably at scale.
IBT: NVIDIA’s GPUs still dominate AI infrastructure. What’s the biggest structural weakness in GPU-led inference that the industry isn’t openly acknowledging yet?
Sid: GPUs are very good at large-scale parallel computing, which is why they became the default engine for building AI models. They were optimized for the compute-intensive task of training massive models and they continue to perform well for that.
Where things become more nuanced is in real-time production inference. Many real-world inference workloads aren't constrained by raw compute; they're constrained by memory bandwidth, data movement and latency consistency.
Moving data back and forth between separate memory and compute resources adds cost and variability, especially as AI systems become more interactive and multi-stage.
That architectural mismatch is something the industry is only beginning to fully appreciate. At dMatrix, we’ve tightly integrated compute with memory to minimize data movement and deliver predictable, low-latency inference.
It’s not about replacing GPUs; it’s about recognizing that modern AI workloads require more than one architectural approach.
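To put Sheth's memory-bandwidth point in concrete terms, the rough back-of-envelope sketch below shows why generating one token at a time with a large model tends to be limited by how fast weights can be streamed from memory rather than by arithmetic throughput. The figures are illustrative assumptions chosen for the example, not dMatrix or GPU vendor specifications.

```python
# Back-of-envelope: why batch-1 LLM token generation is often memory-bandwidth-bound.
# All figures below are illustrative assumptions, not vendor or dMatrix specs.

params = 70e9            # model parameters (e.g. a 70B-class model, assumed)
bytes_per_param = 2      # FP16 weights
mem_bandwidth = 3.0e12   # bytes/s of off-chip memory bandwidth (assumed)
peak_flops = 1.0e15      # FLOP/s of dense FP16 compute (assumed)

weight_bytes = params * bytes_per_param   # weights streamed from memory per token
flops_per_token = 2 * params              # roughly 2 FLOPs per parameter per token

t_memory = weight_bytes / mem_bandwidth   # time just to read the weights once
t_compute = flops_per_token / peak_flops  # time to do the arithmetic

print(f"memory-bound floor per token : {t_memory * 1e3:.1f} ms")
print(f"compute time per token       : {t_compute * 1e3:.2f} ms")
# Under these assumptions the memory floor is tens of milliseconds per token
# while the arithmetic takes well under a millisecond: the chip spends most of
# its time waiting on data movement, which is the constraint Sheth describes.
```

Under these illustrative numbers the gap is more than two orders of magnitude, which is why architectures that keep weights close to the compute, rather than faster arithmetic alone, are pitched as the lever for inference cost and latency.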
IBT: If inference costs don’t fall quickly, does AI risk becoming controlled by a handful of hyperscalers who alone can afford to run large models at scale?
Sid: No. I believe cost-efficient solutions such as ours are on the brink of major adoption and that AI will run a wide range of model sizes, not just the largest ones.
Large models and hyperscalers will remain important, but innovation doesn’t stop at the frontier. Thanks to the proliferation of open models that began with the introduction of DeepSeek, we’re seeing growth in smaller, task-specific models that can be deployed across industries and geographies.
What ultimately determines how broad AI becomes is accessibility. If the cost of serving intelligence remains high, adoption narrows. If serving AI becomes more affordable and predictable, it expands: across enterprises, regional clouds, sovereign environments and edge systems.
That broader participation is what keeps the ecosystem open and competitive.
IBT: Microsoft backs you, yet it is also deeply invested in GPU-heavy AI stacks through OpenAI. Do you see dMatrix as eventually competing with the very infrastructure its investors rely on today?
Sid: No. AI data center infrastructure is diversifying at a fast pace.
We predict GPUs will continue to serve the training needs of Microsoft and the industry at large for quite some time. However, the industry now understands that specialized infrastructure is required for fast and efficient inference.
Our focus from day one has been on inference workloads, not training. Microsoft invested in us because they recognized early that inference, not training, would become a much larger share of AI workloads as models moved into real-world deployment.
As AI scales in production, the infrastructure naturally becomes more layered.
IBT: India is becoming a major R&D base for dMatrix. Do you believe India can move from being a software talent hub to influencing global semiconductor architecture decisions?
Sid: Yes, I believe that shift is already happening.
India has long been known for software talent, but over the last decade the depth of expertise in hardware architecture, system design, verification and advanced silicon development has grown significantly.
At dMatrix, our Bengaluru team isn’t an extension office — they’re fully integrated into our core R&D efforts and contribute directly to our architecture decisions and intellectual property.
Semiconductor innovation today is globally distributed. As more companies build advanced AI infrastructure, India is increasingly influencing not just implementation, but foundational design choices that shape next-generation compute platforms.
IBT: If inference becomes dramatically cheaper, what new categories of AI applications become viable?
Sid: Once inference becomes dramatically cheaper and more predictable, you can unlock persistent, always-on AI.
Today many applications are still constrained by cost and latency. They work in demos, but behind the scenes they’re expensive to run continuously at scale.
Lower inference costs make it viable to embed AI into everyday systems: real-time coding copilots across entire engineering teams, autonomous agents that continuously monitor workflows, interactive simulation and video generation that respond instantly, and AI deeply integrated into enterprise software rather than offered as a premium feature.
When serving intelligence becomes affordable, AI shifts from something you occasionally query to something that operates alongside you all the time.