---
title: "We tried to teach agents to be employees. Here's what happened."
description: "A look inside our V1 operating system, and what we're doing next."
lead: "A look inside our V1 operating system, and what we're doing next."
date: 2026-06-22
category: agentic-operations
categoryLabel: agentic operations
tags:
  - agentic operations
  - skills
  - context
  - automation
readingTime: 15 min read
layout: layouts/post.njk
permalink: /posts/we-tried-to-teach-agents-to-be-employees/
---

Nine months ago, we became obsessed with an idea. What if we could give AI agents - rapidly increasing in both capabilities and context windows - access to as many of our internal tools as possible, plus all of the information in our organization's filesystems, plus the leadership's overarching goals, and then challenge them to help us meet specific targets?

AI systems are increasingly capable of devising their own paths to solutions. So a system imbued with all of our small organization's context and abilities would, we reasoned, be able to help us hit our goals far more effectively. We would still need to provide direction and oversight - this was never intended to be a truly autonomous agent - but we figured that an AI armed with virtually everything our human colleagues had access to should do a better job of whatever we asked it to do.

This may seem a little abstract, but we had concrete use cases in mind which we thought might work - a finance analysis agent which would be able to pull down data from our banking and accounting systems and flag or forecast deviations from our quarterly targets without asking, for instance. Or an advertising operations agent which, alongside pulling data on demand, would automatically tag ad creative reports by which objective our marketing team were trying to hit when it compiled them. Or - and this was the stretch goal - a brainstorming agent which would be able to challenge us with new angles which would hit our marketing goals faster, based on threads it could draw together from its bird's-eye view of our businesses, perhaps by performing deep research across multiple company sources before it began.

This post is intended to examine some of the results we've had - what went as expected, what didn't (hint: a lot), and what surprised us. At the end, I'll talk about some of the changes we're making as a result of the experiment.

Editorial note: We're a relatively small organization, with a higher degree of technical ability than average. So take from the writeup below what's useful for your organization, and ping us with any questions.

## Our experimental agentic setup

Our team runs on both Claude Code and Codex. If you're familiar with these tools you know that they're directory-based. If you're not, it's worth stopping here and watching some introductory videos. While we use Max, Pro and Enterprise subscriptions and not API-based billing, it's important to understand here that we were not using the consumer versions of these apps, Claude web and ChatGPT respectively.

### File management

Since Claude Code and Codex make extensive use of directory and file management, we used a shared drive with organizational subfolders to begin with. Sales and marketing had a folder, with sales and marketing subfolders, operations had a folder, finance had a folder, and so on and so forth. Our AIs could traverse limited folders to begin with (the one they were instantiated in, and an additional CEO folder which contained strategy context, see below), although we later relaxed these rules and allowed them to use and search for files and skills anywhere in the tree structure.
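As an illustration of the "limited folders" rule, here is a minimal sketch of an allow-list check. It is not the mechanism we used; the `/shared` root, the folder names other than the CEO folder, and the `is_allowed` helper are all hypothetical.

```python
from pathlib import Path

# Hypothetical layout: the folder the agent was started in, plus the CEO folder.
ALLOWED_ROOTS = ["/shared/sales-and-marketing/marketing", "/shared/ceo-office"]


def is_allowed(target: str, allowed_roots: list[str]) -> bool:
    """Return True if target resolves to a location inside one of allowed_roots."""
    # resolve() normalizes ".." segments, so "../" tricks cannot escape a root.
    resolved = Path(target).resolve()
    for root in allowed_roots:
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False
```

Relaxing the rule later, as we did, would amount to adding the top of the tree to the list of allowed roots.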

### Skills

Skills, which provide the information the agents need to conduct activities - pulling data, creating images, sending emails, writing spreadsheets, basically any task which is repeatable - were also owned by teams, and nested inside the organizational units' shared drives.

We ended up with a total of 28 skills across sales and marketing, operations and finance. The majority of these were wrappers to connect to various external services. We're a retail business, so it included logistics providers, Amazon, Meta and various analytics and finance platforms. One connected to a custom internal tool we use. The others were workflow-based, consolidating processes and SOPs into one repeatable workflow which could call other skills as required.

### Setting context and rules

Context in LLMs is, on the face of it, a pretty simple thing: the information you need the LLM to know, added before or at the same time as your request. Anyone who's been a manager knows that context is vital for human workers too, and so we approached our V1 operating system with the idea that our AI systems could benefit from some basic data about what we were trying to achieve and why.

Lots of frameworks exist for capturing organizational objectives. My personal favourite is the OKR framework (once a Googler, always a Googler, I guess). But while OKRs are superb for teams and operations folks, I've always felt they don't inspire me in the way a visionary leader laying out their big dream does. For that, I much prefer the V/TO framework, which is part of the Entrepreneurial Operating System.

So, we started by giving our internal AI systems shared context notes, written into every AGENT.md we used (with CLAUDE.md files pointing at them). Our agents now knew why they were being asked to do something, and we also instructed them to ask questions if it wasn't obvious how the instructions they were getting related to our organizational goals.

EOS provides for individual teams with individual rocks which ladder up to the V/TO. So we provided the agents with that too, sitting in a dedicated /ceo-office/ folder. Every time we asked for a task, we told them to examine carefully where it sat within the organization's strategy, consider competing priorities, and log their tasks and actions in a report which was loosely structured according to the organizational objectives.

## What's worked well

Since then, we have spent millions of tokens executing different skills, tasks and projects within this file structure. It has helped us in some ways, but we'll be moving away from it. First, the positives:

### AIs are very good at finding relevant information and skills.

Giving AIs a full structure to grep every time they're asked to do something turned out to be incredibly powerful. Most particularly, it significantly improved reporting tasks by allowing the AIs to include information which would ordinarily have been in a second level report, without being asked. As an example, a marketing report which we assumed would contain weekly revenue as reported by Facebook was returned with payouts reconciled with our actual bank account thanks to a skill in the finance folder.

It also produced some surprising observations. Asked for strategy reports, our agents would regularly pull in suggestions related to the work of another "team", sometimes even offering to proactively update the tasks of that team. When analysing and designing a new landing page offer, one agent suggested creating an SEO team task to improve the visibility, correctly surmising that this task wasn't within its own scope but would be a value-add later on.

**Conclusion: AIs work cross-functionally better than most humans. If you can do it safely and securely, providing AI with broader access to key information inside your organization is likely to improve reporting-based activities.**

### AIs can check their objectives, but are liberal about aligning tasks with them

Providing the high-level overview of our organizational objectives through the V/TO didn't lead to a universal or obvious improvement in agentic outcomes. Agents are still task-driven - probably thanks to their training - and so they're willing to faithfully do what you ask but won't always take the initiative to do much more, even if they know where you're trying to get to.

The exception to this was when we wrapped tasks. We had asked the agents to log their activities following task completion in a centralized folder, including lines about potential next steps. We found that they normally did include reasonable suggestions of what to do next in these notes, which were generally aligned with our goals. But they wouldn't do them autonomously, and the outputs were quantity over quality. A dedicated brainstorming session with fewer task outcomes almost certainly would have produced better results than "here's a vast list of potential things to do now", which is what we ended up with.

But the agents *were* rigorous about making sure they understood where the work fitted within our grand plans. Every task we observed started with the agent considering which team or organizational objective they were working towards. In most cases, it was a simple observation, and sometimes, agents bent reality to make an idea fit. This could be useful in a larger team, but as we run pretty lean, a reconfirmation of "yes you're working on the right thing, broadly" didn't help that much.

**Conclusion: Broad goal-based context didn't add much to our agents' abilities or direction, and was pretty ineffective as a guardrail against inefficient or irrelevant work.**

### Skills: great to have, a pain to maintain

Skills were a fantastic addition to our agentic stack and the folder structure was ideal for organizing them. But we have done a poor job of keeping them up to date, which meant that we burned a lot of tokens forcing the AI to rewrite key scripts because some API somewhere had changed.

This points to an issue with skills: they essentially become a software dependency, which means they need to be maintained, updated and rigorously tested. We were bad at that, and looking back, I think we'd accept fewer skills in exchange for more maintainability and cleaner scripts. One of our most-used marketing skills was forever getting amended by different agents, which wasn't disastrous in our case, but was inefficient.

**Conclusion: Use skills, but review and test them regularly. Consider versioning them and forcing agents to go through a pull-request process for major changes.**
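One way to make "review and test them regularly" concrete is a scheduled smoke test that calls a cheap probe for every skill and reports which ones have broken. The sketch below is an assumption about how that could look, not something we ran; `SkillCheck`, `run_smoke_tests` and the idea of a per-skill `probe` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillCheck:
    name: str                   # hypothetical skill name
    version: str                # version string stored alongside the skill
    probe: Callable[[], bool]   # cheap call that returns True while the skill works


def run_smoke_tests(checks: list[SkillCheck]) -> dict[str, str]:
    """Run every probe and return a mapping of failing skill name to reason."""
    failures: dict[str, str] = {}
    for check in checks:
        try:
            if not check.probe():
                failures[check.name] = "probe returned False"
        except Exception as exc:  # a changed API usually surfaces as an exception
            failures[check.name] = f"{type(exc).__name__}: {exc}"
    return failures
```

An empty result means every probe passed. Anything else is a list of skills to fix before an agent burns tokens rediscovering the same breakage.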

### You're not going to replace most of your SaaS

The agentic operating system we were dreaming of long-term, capable of discerning goals and self-directing, was perhaps a fantasy from day one. I did think that some areas, for instance customer service, might end this experiment being virtually self-sufficient, but this wasn't the case.

It turns out that our broad "give agents access to everything" approach came at the expense of going deep on certain workflows which could have benefitted from that. Customer service is a great example.

In our customer service setup, a ticket touches between 1 and 6 systems. In case you're curious: Shopify, two 3PLs, Stripe, Recharge and a custom app. We dutifully built out connections and skills for all of them, and tried letting our agents loose on tickets as they came in.

Broadly, this worked for the majority of tickets, but by no means all. The robustness level was nowhere near what we'd accept on a long-term basis. Agents were still prone to forgetting to check every source to solve complex cases. They would get confused when multiple tickets from the same customer arrived, especially if statuses had not been set properly. In one case, our test AI replied to a ticket promising a refund before checking whether one was appropriate.
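The refund case is the kind of failure a hard gate outside the model can catch. Here is a minimal sketch, assuming the agent records which systems it has checked for a ticket; the choice of required systems and both function names are hypothetical.

```python
from typing import Optional

# Assumption: the systems that must be consulted before a refund decision.
REQUIRED_FOR_REFUND = {"shopify", "stripe", "recharge"}


def missing_checks(checked: set[str], required: set[str]) -> set[str]:
    """Return the required systems the agent has not consulted yet."""
    return required - checked


def can_promise_refund(checked: set[str], eligible: Optional[bool]) -> bool:
    """Allow a refund promise only when every required system was checked
    and eligibility was explicitly confirmed (None means 'not determined')."""
    return not missing_checks(checked, REQUIRED_FOR_REFUND) and eligible is True
```

The point of the design is that the reply is blocked by code, not by an instruction the agent may skip.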

All of these problems are technically solvable, and SaaS solutions have solved them pretty well, which is why we still pay for a SaaS tool instead of just handling things with agents. But we still hope to double down on our customer service agentic processes one day (see below).

While the technical build for this is very clear, we've realized that this will require an extended period of SOP documentation and testing, to ensure all edge cases are captured and agents aren't taking shortcuts when dealing with cases. It will likely require ongoing smoke tests as APIs evolve and our product suite and SOPs change. And probably the addition of new features as our channel mix for customer contact changes. All of this will probably sound familiar because I'm basically describing running a software company.

**Conclusion: Replacing specialized SaaS solutions is possible, but the cost/value equation needs careful consideration, especially at the lower end. For SaaS which costs single-digit thousands annually, the tradeoff won't be there for most businesses. For more expensive solutions, an internal tool managed by a single FTE may be cheaper.**

### Agents create mess

Developers will be aware that AI models create mess. Agentic organizations are no different. Despite giving the agents strict instructions to keep things tidy, our directory folders rapidly filled with temporary MD files, scripts, reports, CSV files and images.

While our agents were extremely good at not editing human files unless specifically asked to, the file system became far less usable because it was filled with so many working files the AIs had created as part of their tasks.

Faced with this over the long term, we would have used a weekly or daily agent to sweep the directory and start quarantining files. For some folders, we ended up using git to keep track of important files vs fluff, but this isn't a scalable approach for a whole organization.
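A sweep like that does not need to be clever. The sketch below is one possible shape for it, under assumptions of our own choosing: it only looks at files directly inside one folder, treats a fixed set of suffixes as "working files", and moves anything older than a cutoff into a quarantine folder. The suffix list, the `keep` set and the `sweep` function are hypothetical.

```python
import shutil
import time
from pathlib import Path

# Assumption: file types agents tend to leave behind as working files.
WORKING_SUFFIXES = {".md", ".csv", ".png", ".py"}


def sweep(root: Path, quarantine: Path, max_age_days: float, keep: set[str]) -> list[Path]:
    """Move stale working files directly under root into quarantine.

    Files named in keep are never touched. Returns the new paths of moved files.
    Note: a file with the same name already in quarantine may be overwritten.
    """
    cutoff = time.time() - max_age_days * 86400
    quarantine.mkdir(parents=True, exist_ok=True)
    moved: list[Path] = []
    for path in sorted(root.iterdir()):
        if not path.is_file() or path.name in keep:
            continue  # skip subfolders (including quarantine) and protected files
        if path.suffix.lower() in WORKING_SUFFIXES and path.stat().st_mtime < cutoff:
            dest = quarantine / path.name
            shutil.move(str(path), str(dest))
            moved.append(dest)
    return moved
```

Quarantining instead of deleting keeps the sweep reversible, which matters when a human file happens to look like agent output.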

**Conclusion: When working alongside humans, AI agents will either need to learn good housekeeping, have their own directory for working files, or have another agent clean up after them (or possibly all three). Without this, they'll produce so many artifacts that humans will struggle to make sense of the noise.**

## What we learned and what's next

### Skills are here to stay

Significantly increasing the number of usable skills our AIs have is probably the best outcome from this project. It's only when you try to use AI across the entire organization that you realize how many external endpoints you need to communicate with, and can begin to think about the best way to enable your agents to do the same thing. Curiously, making skills is actually the easy bit.

Many of the connections and skills we've built out in this project will stay, so individuals can use them in their own custom workflows. We're going to keep working on a way to keep them pristine and up-to-date, as well as a more scalable solution for secure credentials.

### Focus on the processes

Agentic operations, more than I expected, require you to really have a handle on what moves the needle for your business. When everything is possible, the art of choosing what to do becomes the most important differentiating factor.

I want to write more on this in a follow-up post, but for now I think that being precise about where you integrate agents into existing workflows is likely going to yield significantly higher ROI than trying to embed them everywhere for everyone, like we did. Our experiment yielded fun results, but nothing that has fundamentally transformed our business in the way that AI is expected to transform businesses.

I should be clear that I don't think this is a reflection on agentic capabilities. In my view, modern AI models (Opus 4.8/ChatGPT 5.5 as of June 2026) are capable of completing 60-70% of day-to-day white-collar knowledge *output* at a quality level that's on par with humans. This will vary by specific sectors, specialisms and tasks, but the trend is clear. The challenge is no longer "are the models intelligent enough?", but "are we deploying these systems in the right way?"

### Event-driven automation is easier than self-direction

You might remember the dream at the top of this post about self-directing agents who would add value based on the additional context they have. At the moment, I don't think this is achievable for most businesses, though it's certainly not far off.

What seems eminently achievable, and what we'll be exploring next, is targeted agents running automatically and silently in a value-adding way, ideally in situations where we could or would never use a human.

We already have some areas to explore here, around marketing and logistics. One idea is to scale marketing outputs across geographies automatically. For example, when our ads team launches a new US campaign, could a specialized agent automatically localize creative, set up campaign structures and deploy mirror campaigns in other regions? Another example might be in investigating delivery issues. As a retail business we get inbound data from carriers on logistics issues, but we currently rely on customers to tell us if their package hasn't arrived. This seems like an example of something relatively simple and boundaried which we could clearly improve.
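To show how small the delivery idea is as a first step, here is a sketch of the triggering rule only: flag a shipment when the carrier reports a problem, or when tracking has gone quiet for too long. The status strings, the five-day threshold and the `Shipment` shape are all assumptions, not the format of any real carrier feed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Shipment:
    order_id: str
    carrier_status: str       # hypothetical status strings
    last_event_at: datetime   # time of the most recent carrier scan


# Assumption: statuses that indicate a problem without customer input.
EXCEPTION_STATUSES = {"exception", "returned_to_sender", "delivery_failed"}


def needs_outreach(s: Shipment, now: datetime,
                   stale_after: timedelta = timedelta(days=5)) -> bool:
    """True when the carrier reports a problem or tracking has stalled."""
    if s.carrier_status == "delivered":
        return False
    return s.carrier_status in EXCEPTION_STATUSES or (now - s.last_event_at) > stale_after


def flag_shipments(shipments: list[Shipment], now: datetime) -> list[str]:
    """Return order IDs to hand to an agent for proactive follow-up."""
    return [s.order_id for s in shipments if needs_outreach(s, now)]
```

The agent's job would then start from the flagged list, which keeps its scope narrow and its trigger deterministic.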

### Looping to objectives

Loops are all the rage in AI software development these days. Modern agents are sophisticated enough that if you can state your objective clearly enough, the AI can iterate in a loop until it reaches it. This is coming to knowledge work too, and we want to experiment with it next.
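Stripped of the model, a loop like this is just "take a step, check the objective, repeat, and stop after a budget". Here is a minimal sketch of that control flow; `loop_until`, `step` and `done` are illustrative names, and in practice `step` would be an agent call and `done` an objective check.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def loop_until(initial: T, step: Callable[[T], T], done: Callable[[T], bool],
               max_iterations: int = 10) -> tuple[T, int, bool]:
    """Apply step until done(state) is True or the iteration budget runs out.

    Returns (final_state, steps_taken, objective_met).
    """
    state = initial
    for steps_taken in range(max_iterations):
        if done(state):
            return state, steps_taken, True
        state = step(state)
    # Budget exhausted: report whether the last step happened to reach the goal.
    return state, max_iterations, done(state)
```

The hard part is the `done` check: a vague objective gives the loop nothing to converge on, and the iteration budget is what stops it from spending tokens indefinitely.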

Obviously, this is also dependent on setting great goals. Problem specification is an art in itself. Writing a great PRD that an agentic loop can follow is a skill in its own right, and I think some of its lessons carry over to setting broader business objectives.

For example, could an AI agent loop until it had devised a new financing plan for a large asset purchase, including researching and contacting the potential providers, building the financial models and assessing the risks? Could an AI agent loop until it had finished designing an award-winning new product? I'm reminded of much of the advice from the ecosystem when Claude Fable was first released, which was essentially "you'll be surprised at what this model can do."

My sense is that we are not far off from powerful loops which will allow models, armed with the right skills and information, to work for long periods of time to create non-software entities of immense value. This is an area which we'll be exploring next.

## Follow the journey

We'll be investing more time, money and tokens in this journey. We've seen that agents aren't quite employees in the way we imagined, but this first experiment has clearly shown that they can add value when thoughtfully directed and given the tools, context, boundaries and feedback that they need. Our challenge now is to develop a repeatable framework for identifying the processes where they can add value that compounds to drive the organization forward.
