While llms.txt helps AI read the web and APIs help them connect, neither solves the infinite customization found in the economically important tasks in enterprise software. The real solution lies in computer-use agents that operate at the pixel level, learning from human demonstrations to navigate screens directly. This approach bypasses brittle connectors, allowing AI to handle complex workflows while humans remain in the loop for critical verification.

The Screen Is the API

2025/12/10 15:36

"Why not just use llms.txt to understand the page?"

My friend was watching an AI agent work through a complex enterprise workflow: clicking through menus, filling in forms, and handling the kind of nested configuration screens that are the definition of scope creep.

It was a reasonable question. Everyone is excited about llms.txt right now. A simple text file that tells AI systems what your website contains. Finally, the thinking goes, we have a standardized way for machines, or LLMs, to understand the web.

But my friend was confusing two very different problems. Reading is not doing.

The web did not become useful when machines learned to read it. It became useful when machines learned to act on it. And right now, reading alone only gets us so far. The focus has to shift to doing.

Reading Isn’t the Hard Part

Let me be clear about what llms.txt actually does. It is a curated map for LLM inference. A structured way for language models to understand what exists on a website and where to find it. 

This is useful for bringing information to an LLM. But it is not a control mechanism. It does not let AI systems actually do anything. The gap between reading and acting is where the real work begins.

The Action Space

When people talk about AI automation, they usually mean APIs. Expose endpoints, let the AI call them, and you have automation. Simple.

Except it is not simple at all.

APIs expose only what developers choose to expose. They represent a curated subset of functionality that someone decided was worth the engineering effort to formalize. And in enterprise software, that subset is usually tiny compared to what users actually need to do.

Then came MCP, the Model Context Protocol. MCP tries to solve the connector problem. Instead of every AI system needing custom integrations with every application, you build one MCP connector and any MCP-compatible AI can use it.

This is an improvement. It solves the M×N problem where M AI systems need to integrate with N applications. But it assumes someone builds the connector in the first place.

Building these connectors is still hard. It requires understanding both the application and MCP itself. Most enterprise software will never get proper MCP support because the economics, I believe, are hard to justify.

Attempts to automate API-to-MCP conversion have become popular, but they mostly produce brittle, low-level tools. As Han Lee and others point out, REST APIs are designed around nouns (resources with GET/PUT/POST/DELETE), while MCP works best when tools are verbs (deleteRow, createTask). Auto-wrapping one into the other hides that mismatch instead of solving it.
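To make the noun/verb mismatch concrete, here is a minimal sketch in plain Python. It does not use any real MCP SDK. The in-memory FakeRestApi, its paths, and the complete_task tool are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRestApi:
    """Hypothetical in-memory stand-in for a noun-oriented REST service."""
    tasks: dict = field(default_factory=dict)

    def get(self, path: str) -> dict:
        return self.tasks[path]

    def put(self, path: str, body: dict) -> dict:
        self.tasks[path] = body
        return body


# Auto-wrapped tool: a thin pass-through. The model still has to know the
# resource path, the full body schema, and the read-modify-write sequence.
def put_resource(api: FakeRestApi, path: str, body: dict) -> dict:
    return api.put(path, body)


# Verb-shaped tool: one intent, with the REST choreography hidden inside.
def complete_task(api: FakeRestApi, task_id: int) -> dict:
    path = f"/tasks/{task_id}"
    task = dict(api.get(path))   # read the current resource
    task["status"] = "done"      # apply the intent
    return api.put(path, task)   # write it back


api = FakeRestApi()
put_resource(api, "/tasks/7", {"title": "Approve PO", "status": "open"})
result = complete_task(api, 7)
```

The pass-through tool is what auto-conversion tends to give you. The verb tool is what someone has to design by hand, which is exactly the engineering cost most enterprise software will not get.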

The M×N×P Problem

There is a deeper issue that neither APIs nor MCP can address. Call it the P variable: interface diversity.

P represents the number of unique ways the same software can be configured. And in enterprise software, P grows to enormous scale.

Consider SAP. A single SAP S/4HANA server contains tens of thousands of customizing tables. Every implementation is different. Every organization has its own approval chains, its own business rules, its own custom ABAP developments.

Here is a concrete example. Take something supposedly simple: a purchase order approval workflow. In a real SAP implementation, this involves parallel approval processes with all-or-nothing requirements. Custom rules like auto-approve if a contract covers the full purchase order amount. Multi-level approval chains where limits are maintained in custom tables. Dynamic role assignment based on cost center responsibility.

None of this is standard.

The approval chain requires both the Department Manager and Finance Department to approve simultaneously. Either rejection kills the whole workflow. 

Then come the rules. If the purchase order references a contract and the totals match, auto-approve. Otherwise, check approval limits in a custom table. If the first approver lacks sufficient authority, cascade to the next level.

And the approvers themselves? Assigned dynamically. Sometimes it is the Manager of Workflow Initiator. Sometimes the Cost Center Responsible. Sometimes specific users pulled from yet another custom table.
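The rules above can be sketched in a few lines of Python. This is not SAP's workflow API or ABAP, just an illustration of the logic as I described it. The role names, the limits, and the APPROVAL_LIMITS table are hypothetical, and dynamic approver assignment is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PurchaseOrder:
    amount: int                            # whole currency units
    contract_amount: Optional[int] = None  # None when no contract is referenced


# Hypothetical stand-in for a custom approval-limit table, ordered by level.
APPROVAL_LIMITS = [
    ("department_manager", 10_000),
    ("division_head", 100_000),
    ("cfo", 1_000_000),
]


def required_approver(po: PurchaseOrder) -> Optional[str]:
    """Return the first role with enough authority, or None if auto-approved."""
    # Rule 1: a referenced contract whose total matches the order auto-approves it.
    if po.contract_amount is not None and po.contract_amount == po.amount:
        return None
    # Rule 2: otherwise cascade up the chain until a limit is sufficient.
    for role, limit in APPROVAL_LIMITS:
        if po.amount <= limit:
            return role
    raise ValueError("no approver has sufficient authority")


def parallel_decision(manager_ok: bool, finance_ok: bool) -> str:
    """All-or-nothing parallel step: either rejection kills the workflow."""
    return "approved" if manager_ok and finance_ok else "rejected"
```

Even this toy version has three places where an organization will want something different: the matching rule, the limits table, and who counts as an approver.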

This is one workflow in one module. 

It takes consultants with specialized domain knowledge to implement, because the out-of-the-box logic is too simple for how real organizations actually work.

This is the M×N×P problem. Even if you solved M×N with perfect connector protocols (like MCP), you would still face the reality that every enterprise implementation is effectively a unique interface.
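A back-of-the-envelope count shows why P dominates. The function below is my own simplification, not an established formula: it assumes a shared protocol turns M×N custom integrations into M clients plus one connector per application configuration. The numbers in the usage lines are invented.

```python
def integrations_needed(m: int, n: int, p: int = 1, shared_protocol: bool = False) -> int:
    """Count integration points for m AI systems and n applications,
    each deployed in p distinct configurations (simplifying assumption)."""
    if shared_protocol:
        # Each AI system implements the protocol once, but every
        # application configuration still needs its own connector.
        return m + n * p
    # Without a shared protocol, every pairing is a custom integration.
    return m * n * p


without_protocol = integrations_needed(5, 20)                        # 5 * 20
with_protocol = integrations_needed(5, 20, shared_protocol=True)     # 5 + 20
with_customization = integrations_needed(5, 20, p=50, shared_protocol=True)
```

Under these assumptions the protocol cuts 100 integrations to 25, and then 50 configurations per application push the count back up to 1,005.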

Computer-Use as the Universal Layer

There is one interface that is universal: the screen.

Computer-use agents operate at the pixel level. They see what humans see. They click where humans click. They navigate the same menus and fill out the same forms.

This sounds crude compared to elegant API calls. But it has one massive advantage: it works with everything. No connector required. No API exposure decisions. No MCP adoption. If a human can do it, a computer-use agent can learn to do it.
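Stripped down, a computer-use agent is an observe-decide-act loop over screenshots. The sketch below shows only that shape. The Screen interface, the Action type, FakeScreen, and scripted_policy are hypothetical names for illustration. A real agent would replace the scripted policy with a vision model and the fake screen with an actual display.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Action:
    kind: str       # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class Screen(Protocol):
    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


def run_agent(screen: Screen, policy: Callable[[bytes], Action], max_steps: int = 20) -> bool:
    """Observe pixels, pick an action, execute it; stop on "done" or the step limit."""
    for _ in range(max_steps):
        action = policy(screen.screenshot())
        if action.kind == "done":
            return True
        if action.kind == "click":
            screen.click(action.x, action.y)
        elif action.kind == "type":
            screen.type_text(action.text)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return False


class FakeScreen:
    """In-memory stand-in for a real display, used only for this sketch."""
    def __init__(self) -> None:
        self.log = []

    def screenshot(self) -> bytes:
        return str(len(self.log)).encode()  # fake "pixels": how many actions ran

    def click(self, x: int, y: int) -> None:
        self.log.append(f"click({x},{y})")

    def type_text(self, text: str) -> None:
        self.log.append(f"type({text})")


def scripted_policy(pixels: bytes) -> Action:
    """Hypothetical policy: a fixed script indexed by the fake screenshot."""
    script = [Action("click", 100, 200), Action("type", text="PO-42"), Action("done")]
    return script[int(pixels.decode())]


screen = FakeScreen()
finished = run_agent(screen, scripted_policy)
```

Nothing in run_agent knows which application is on the other side of the screen. That is the whole point.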

The question is whether computer-use works well enough for production use. And here the research is early but encouraging.

The Demonstration Effect

The SCUBA benchmark tests AI agents on real Salesforce CRM workflows. In zero-shot settings, meaning no task-specific training, open-source models achieved less than 5% success rates. Even strong models that perform well on generic desktop benchmarks failed catastrophically when confronted with actual enterprise software.

But with demonstrations, meaning examples of humans completing the workflows, success rates jumped to 50%. At the same time, the agents' time and cost dropped by 13% and 16%, respectively.

General capability is not enough. You need specific training on specific workflows.

Data Efficiency

In my experience, collecting computer-use trajectories is painful. Domain experts rarely understand what actually challenges a model. The infrastructure stacks on top of brittle web environments. Building those environments is pure tedium. When every example costs this much, data efficiency stops being nice-to-have.

Which is why the PC Agent-E research matters. Trained on just 312 trajectories, the model achieved a 141% improvement over the base model.

312 examples. Not millions. Not even thousands. A few hundred carefully chosen demonstrations of the exact workflows.

The model outperformed Claude 3.7 Sonnet with extended thinking on the WindowsAgentArena benchmark. And it generalized well to different operating systems, suggesting the learned behaviors were not brittle.

The economics of enterprise AI automation are simple: you do not need massive datasets. You need the right datasets from the right workflows.

The Honest Trade-Off

Now for the uncomfortable part. Generalization is necessary but not sufficient for high-stakes operations.

The same research that shows promising results also reveals gaps. Some agents that perform well on generic benchmarks like OSWorld achieve less than 5% success on specialized enterprise environments. Despite advances, today's RL systems struggle to generalize beyond narrow training contexts.

The sim-to-real gap persists. An agent that performs flawlessly in simulation may fail in production due to unmodeled variables. 

For high-volume, repetitive workflows like expense approvals, CRM updates, and standard procurement, trained computer-use agents are approaching production readiness. The error rate is acceptable because any single mistake is recoverable.

For one-off, high-stakes operations like schema migrations, financial reconciliations, and compliance configurations, the calculus is different. A database configuration error can cost millions. A compliance failure can trigger regulatory action.

The honest answer is that computer-use can handle navigation and execution for these tasks, but humans must remain at verification checkpoints. The agent does the clicking. The human confirms the consequences.
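One way to picture a verification checkpoint is a plan runner that pauses before any step flagged as high-stakes. This is a minimal sketch under my own assumptions. The Step type, the high_stakes flag, and the approve callback are hypothetical, and in practice the approval would come from a person rather than a lambda.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Step:
    description: str
    high_stakes: bool = False  # True for irreversible or costly actions


def run_with_checkpoints(
    steps: Iterable[Step],
    execute: Callable[[Step], None],
    approve: Callable[[Step], bool],
) -> List[str]:
    """Execute steps in order, pausing for human approval before high-stakes ones."""
    executed = []
    for step in steps:
        if step.high_stakes and not approve(step):
            break  # the human declined: stop before the consequential action
        execute(step)
        executed.append(step.description)
    return executed


plan = [
    Step("open migration screen"),
    Step("fill in target schema"),
    Step("apply schema migration", high_stakes=True),
]
# Simulate a reviewer who declines the final step.
done = run_with_checkpoints(plan, execute=lambda s: None, approve=lambda s: False)
```

Here the agent still does the navigation and form filling, and the run stops just short of the step whose consequences need a human signature.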

This is not a failure of the technology. It is appropriate risk management. And it still represents an enormous productivity gain. Navigating to the right screen, filling in the right fields, and preparing the right configurations is most of the work. Human verification at critical decision points is the remaining essential piece. At least for now.

Down The Middle: Agents and Humans

The path forward is neither pure automation nor pure human control. It is hybrid workflows in which computer-use agents handle the interface complexity while humans handle the judgment calls. Human-in-the-loop is already the norm for production AI agents.

This requires new infrastructure. You need training pipelines for enterprise-specific demonstrations. You need simulation environments that match production configurations. You need checkpoint mechanisms that pause for human review at appropriate moments. Companies like Applied Compute, Theta, Osmosis, and Scale AI are starting to build this infrastructure.

But the hard technical problem, making computers reliably operate arbitrary interfaces, is being solved. The remaining problems are organizational and economic. Those problems have a tendency to get solved when the benefits are large enough.

The best agents still fail on most real enterprise tasks. But a few years ago they could barely hit a single submit button. The screen is the only universal interface. That's where the work should go.

