Self-improving AI skills: what to automate, and what to keep human

[Figure: Self-improving AI skills loop: scorer, validation gate, and human taste gate]

If I were setting up an AI agent with a writing skill today, the one that drafts my blog posts and content, I’d start with one thing: whatever it writes has to sound like me. My words, my voice, the way I think, not generic AI prose with my name on top.

That goal has two parts. One is the grind any draft needs: em-dashes, the SEO nags, the dead links. A machine can own all of it. The other is whether it sounds like me, and no rule decides that. Self-improving AI skills are how I automate the first and teach the system my taste on the second.

skill-self-improvement is the tool I built for it. Today I’m starting to document the outcomes as I build it from the ground up.

A skill is just a markdown file: the standing instructions that steer an AI agent. The model is not the focus; it is swappable and changes with every upgrade. The instructions are what I tune, and they carry across whatever model comes next.

What self-improving AI skills are

Self-improving AI skills do not touch model weights. The skill is a markdown document, usually a few hundred to a couple of thousand tokens, and the system edits that document from real usage. Microsoft’s SkillOpt calls this text-space optimization: treat the skill as a trainable artifact, propose small edits, keep the edits that work.

So “self-improving” here doesn’t mean autonomous. The system watches the corrections I make, sorts each into a handful of categories, and holds a slice back to check itself. Then it proposes an edit, and I decide. The whole design leans on that last part.
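The hold-back step can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the category names, the `split_holdout` helper, and the 20% fraction are all assumptions made for the example.

```python
import random

# Hypothetical category names; the real categories are not listed in this post.
CATEGORIES = ("mechanical", "factual", "seo", "voice")


def split_holdout(corrections, holdout_fraction=0.2, seed=0):
    """Shuffle corrections deterministically and hold a slice back.

    Returns (working_set, holdout_set). The held-out slice is never used
    to propose edits, only to check them afterwards.
    """
    items = list(corrections)
    if not items:
        return [], []
    random.Random(seed).shuffle(items)
    n_holdout = max(1, round(len(items) * holdout_fraction))
    return items[n_holdout:], items[:n_holdout]


# Ten toy corrections, each tagged with one category.
corrections = [
    {"id": i, "category": CATEGORIES[i % len(CATEGORIES)]} for i in range(10)
]
working, holdout = split_holdout(corrections)
```

A fixed seed keeps the split stable between runs, so a correction never moves from the held-out slice into the working set by accident.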

Two halves: rules and taste

The review is where I spent most of my time. And it turns out to be two tasks, not one.

One task is rule-shaped. Em-dashes, banned phrases, list formatting: a rule decides them, so a script can. Whether every claim traces to a source and every link resolves is checkable. The Yoast SEO pass is already green, orange, and red. Anything with a right answer lives here.

The other task is judgment. Does this sentence sound like me? Is this the right opening? No rule settles that. You decide by reading.

These two halves do not need the same owner. The rule-shaped half is the part I used to dread, and it is the part a machine handles well. The judgment half is the part I want to keep.

Why I started from SkillOpt

This rabbit hole started with curiosity about DSPy, Stanford’s framework for optimizing an LLM pipeline’s instructions against a metric. The idea that you train the text, not the model, stuck with me. I went looking for the same thing aimed at one skill document and found Microsoft’s SkillOpt, which became my starting point.

The optimizer is the part that proposes edits and keeps the ones that work. I did not want to build that from scratch, and SkillOpt does it well. It works on text instead of model weights, which is the shape of my problem.

Its loop is simple. The agent runs the skill many times (the rollouts). A step reads how each run went and proposes small edits to the skill text: add a line, delete one, replace one. Then the part that matters. A validation gate accepts an edit only if it improves performance on a held-out evaluation set the edit was not tuned on. That held-out set is what separates real generalization from overfitting to your own examples.
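The gate itself is small enough to sketch. The code below is a toy illustration of the accept-only-if-held-out-improves rule described above, not SkillOpt's API: `gated_update`, the edit helpers, and the toy scoring function are hypothetical names. In a real run the score would come from rollouts on held-out tasks rather than from a set lookup.

```python
from typing import Callable, Iterable, List, Tuple

Skill = List[str]                    # the skill text, one instruction per line
Edit = Callable[[Skill], Skill]      # a small edit: add, delete, or replace a line


def gated_update(
    skill: Skill,
    candidate_edits: Iterable[Edit],
    score_on_holdout: Callable[[Skill], float],
) -> Tuple[Skill, float]:
    """Apply each edit in turn, keeping it only if the held-out score strictly rises."""
    best_score = score_on_holdout(skill)
    for edit in candidate_edits:
        candidate = edit(skill)
        candidate_score = score_on_holdout(candidate)
        if candidate_score > best_score:  # ties are rejected: no gain, no change
            skill, best_score = candidate, candidate_score
    return skill, best_score


def add_line(line: str) -> Edit:
    return lambda s: s + [line]


def delete_line(index: int) -> Edit:
    return lambda s: s[:index] + s[index + 1:]


# Toy stand-in for a held-out evaluation: the share of needed rules present.
NEEDED = {"Use a colon before a list.", "Cite every claim."}


def toy_holdout_score(skill: Skill) -> float:
    return len(NEEDED & set(skill)) / len(NEEDED)


start = ["Cite every claim."]
edits = [
    add_line("Use a colon before a list."),  # raises the score: kept
    add_line("Be awesome."),                 # no gain: rejected
    delete_line(0),                          # lowers the score: rejected
]
final_skill, final_score = gated_update(start, edits, toy_holdout_score)
```

Rejecting ties is a design choice in this sketch: an edit that does not measurably help only makes the skill longer.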

The published numbers back the approach. SkillOpt v0.1.0 (June 2026) reports average accuracy lifts of +23.5 to +24.8 percentage points across 6 benchmarks, 7 models, and 3 execution harnesses, with optimized skills transferring across model scales without retraining. A companion, SkillOpt-Sleep, harvests session transcripts, replays recurring tasks offline, and consolidates improvements under the same gate.

To me, the validation gate is the single best idea in the project. SkillOpt is the clearest approach I found. There may well be others that solve this as well or better, from prompt optimizers like TextGrad to the wider DSPy ecosystem; SkillOpt is the one I built on. That is why it sits in my design as the reference architecture for self-improving AI skills.

The hard half was the data, not the optimizer

When I started, I expected the optimizer to be the hard part. It was the other way round. The data and the score were where the real work sat.

Give a system a set of tasks and a trustworthy score, and “propose bounded edits, keep the ones that raise a held-out score” is largely solved. SkillOpt is that solved piece. The hard half is the data and the score: where do labeled examples come from, and can you trust the number? That data-and-score half is, I think, why most self-improving systems stall, not the optimizer.

So I inverted the build order. I wrote the scorer first, deterministic and independent of SkillOpt, covering three kinds of check: mechanical, factual, and SEO. That delivers value with zero optimizer wired in, and it keeps me off a v0.1.0 dependency whose API will churn. SkillOpt stays swappable.

What the scorer is allowed to judge

Building the scorer first forced an early decision: what it may judge, and what it must leave alone.

The line lives in one small move. For anything mechanical, the scorer gives a verdict: this passed, this did not. For voice and taste, it is allowed to say something else: I can’t judge this one. That answer is the whole boundary. It separates the scorer disagreeing with me from the scorer admitting a call isn’t its to make. Voice and taste always get it, on purpose.
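One way to picture that boundary is a verdict type with three values instead of two. This is a sketch under assumptions: the `Verdict` enum, the `CHECKABLE` set, and the banned-phrase list are hypothetical and only illustrate the idea of an explicit abstention.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    CANT_JUDGE = "can't judge"  # an abstention, not a disagreement


# Assumption: the categories a rule is allowed to decide.
CHECKABLE = {"mechanical", "factual", "seo"}


def judge(category: str, check: Optional[Callable[[str], bool]], text: str) -> Verdict:
    """Give a verdict only where a rule exists; abstain on everything else."""
    if category not in CHECKABLE or check is None:
        return Verdict.CANT_JUDGE
    return Verdict.PASS if check(text) else Verdict.FAIL


def no_banned_phrases(text: str) -> bool:
    # Hypothetical banned list, for illustration only.
    banned = ("delve into", "in today's fast-paced world")
    lowered = text.lower()
    return not any(phrase in lowered for phrase in banned)


mechanical_fail = judge("mechanical", no_banned_phrases, "Let us delve into the data.")
mechanical_pass = judge("mechanical", no_banned_phrases, "I wrote the scorer first.")
voice_call = judge("voice", no_banned_phrases, "I wrote the scorer first.")
```

Voice is routed to `CANT_JUDGE` even when a check function is supplied, which mirrors the rule that taste always gets the abstention.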

Why not hand taste to an LLM judge? Because an LLM judge you optimize against gets gamed. Push edits toward a model’s idea of good prose and the writing drifts toward the judge, not toward me. That is reward hacking, and it is the failure mode that turns a personal voice into generic competence.

Take a concrete one. A draft introduces three changes and joins them to the list with an em-dash. The fix is a colon. A mechanical scorer catches that and ranks the corrected version higher every time, no taste required. Now a different one. A draft opens with a knowing tease about what readers have not noticed. The fix cuts the tease and states a concrete fact in the first person instead. No rule catches that. It is a taste call, and it stays mine.
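The first case can be approximated mechanically. The sketch below assumes a rough definition of "em-dash introducing a list" (an em-dash followed by at least two comma-separated items); the regex and the function names are illustrative, and the example sentences are invented.

```python
import re
from typing import Optional

# Assumption: an em-dash (U+2014) followed by two or more comma-separated
# items is treated as "introducing a list". This is a rough heuristic.
EM_DASH_LIST = re.compile(r"\u2014\s*[^,.\u2014]+,\s*[^,.]+")


def em_dash_list_violations(text: str) -> int:
    """Count places where an em-dash introduces a comma-separated list."""
    return len(EM_DASH_LIST.findall(text))


def prefers_approved(rejected: str, approved: str) -> Optional[bool]:
    """True if the approved text has fewer violations, False if more.

    Returns None when the rule sees no difference, which is the
    "can't judge" answer for this check.
    """
    before = em_dash_list_violations(rejected)
    after = em_dash_list_violations(approved)
    if before == after:
        return None
    return after < before


# Case one: a rule-shaped fix (em-dash replaced by a colon).
rule_case = prefers_approved(
    "The release brings three changes\u2014faster builds, smaller images, and clearer logs.",
    "The release brings three changes: faster builds, smaller images, and clearer logs.",
)

# Case two: a taste fix (a tease replaced by a first-person fact).
taste_case = prefers_approved(
    "Here is what most readers have not noticed yet.",
    "I shipped three changes last week.",
)
```

The rule ranks the corrected sentence higher in the first pair and has nothing to say about the second, which is the split the paragraph above describes.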

The payoff

Nothing I fix gets thrown away. Every correction I make becomes a feedback pair: the rejected version, the approved version, the why. The mechanical and factual fixes fold back into rules, so the next draft arrives already clean on those. The pile I fix by hand shrinks.
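A feedback pair like that maps naturally onto a small record. This sketch assumes a JSON Lines file as storage and adds a `category` field; both are illustrative choices, not a description of the tool's actual format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FeedbackPair:
    rejected: str   # the version that was corrected
    approved: str   # the version that was accepted
    why: str        # the reason for the correction
    category: str   # hypothetical field, e.g. "mechanical" or "voice"


def to_jsonl(pairs: Iterable[FeedbackPair]) -> str:
    """Serialize pairs as one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


def from_jsonl(text: str) -> List[FeedbackPair]:
    """Parse the format written by to_jsonl, skipping blank lines."""
    return [FeedbackPair(**json.loads(line)) for line in text.splitlines() if line.strip()]


pairs = [
    FeedbackPair(
        rejected="Three changes\u2014speed, size, and logs.",
        approved="Three changes: speed, size, and logs.",
        why="Use a colon, not an em-dash, to introduce a list.",
        category="mechanical",
    )
]
restored = from_jsonl(to_jsonl(pairs))
```

Keeping the "why" next to each pair is what would let a mechanical fix be folded back into a rule later.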

What it shrinks toward is the taste verdict, and only that. Keeping the human as the taste gate is, I think, what lets the rule-shaped review shrink toward zero without the writing drifting away from me. The part I keep is the part worth keeping.

Why it doesn’t run itself yet

There are three parts to self-improving AI skills, and it helps to keep them apart. The dataset of corrections is the answer key. The scorer is the examiner. The writer, the skill that drafts in the first place, is the student. So far I have built the answer key and trained the examiner. The student’s homework is still uncorrected.

That order is on purpose. A student left to grade their own work against an examiner they can fool learns to fool the examiner, not to write better. So before I let anything edit the writer on its own, four things have to hold:

  • Enough labeled data. A handful of corrections is not a test. The held-out set only means something past roughly thirty pairs, spanning every format.
  • A score I trust. The examiner has to agree with me where it can see the answer, and stay quiet when it cannot. On the corrections so far, it does.
  • A guard against gaming. The moment an optimizer can win by pleasing a judge, it will. Taste stays mine for that reason.
  • One document at a time. Point an optimizer at a whole tangled agent and you cannot tell what helped. It edits one skill file, gated, or nothing.
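The four conditions above can be written down as an explicit pre-flight check. This is a sketch, not the tool's code: the `LoopState` fields, the format names, and the exact threshold of 30 (taken from "roughly thirty pairs") are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

MIN_PAIRS = 30  # "roughly thirty pairs", treated here as a hard floor


@dataclass
class LoopState:
    pairs_by_format: Dict[str, int]  # labeled pairs per content format (hypothetical names)
    wrong_verdicts: int              # pairs where the scorer contradicted the human
    taste_judged_by_model: bool      # True if a model, not the human, scores taste
    documents_targeted: int          # number of skill files the optimizer may edit


def blockers(state: LoopState) -> List[str]:
    """List every reason the loop must stay manual; an empty list means ready."""
    reasons = []
    counts = state.pairs_by_format.values()
    if sum(counts) < MIN_PAIRS or any(n == 0 for n in counts):
        reasons.append("not enough labeled data across every format")
    if state.wrong_verdicts > 0:
        reasons.append("scorer disagrees with the human on a decidable pair")
    if state.taste_judged_by_model:
        reasons.append("taste is scored by a judge the optimizer could game")
    if state.documents_targeted != 1:
        reasons.append("optimizer must target exactly one skill file")
    return reasons


seed_state = LoopState({"blog": 17}, wrong_verdicts=0, taste_judged_by_model=False, documents_targeted=1)
ready_state = LoopState({"blog": 20, "newsletter": 15}, 0, False, 1)
```

Returning the reasons rather than a single boolean keeps it visible which condition is still holding the loop back.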

Until those hold, the loop stays manual. I draft, I correct, the correction goes into the answer key, and the examiner learns to catch it next time. Only when the answer key is deep enough and the examiner is trustworthy enough does it make sense to hand the student a red pen. That is the line between a tool that records my taste and one that slowly drifts away from it.

What is built now, and what is not

“Self-improving” sounds bigger than what it is.

Phase 0 of skill-self-improvement is built and tested: a core that knows nothing about content, the first scorer for prose, the command-line tools, and 26 tests. Against a seed dataset of 17 feedback pairs, the content adapter scores 0 wrong, 3 correct, 12 blind, and 2 not-applicable deletions. The 12 blind pairs are the voice and taste cases the scorer marks “can’t judge,” by design. That leaves 3 pairs the scorer actually decided, and it decided all 3 correctly.
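The arithmetic behind those numbers is worth making explicit, because "0 wrong" reads differently once the abstentions are counted. The tally below uses the figures from the paragraph above; the `summarize` helper and the term "coverage" (decided pairs over all pairs) are labels chosen for this sketch.

```python
from collections import Counter


def summarize(tally: Counter) -> dict:
    """Separate how often the scorer decided from how often it was right."""
    total = sum(tally.values())
    decided = tally["correct"] + tally["wrong"]
    return {
        "total": total,
        "decided": decided,
        # Accuracy only over pairs where the scorer gave a verdict.
        "accuracy_on_decided": tally["correct"] / decided if decided else None,
        # Share of all pairs where the scorer gave a verdict at all.
        "coverage": decided / total if total else None,
    }


seed_tally = Counter(wrong=0, correct=3, blind=12, not_applicable=2)
summary = summarize(seed_tally)
```

On the seed data this gives 17 pairs in total, 3 decided, an accuracy of 1.0 on those 3, and a coverage of 3/17, or about 18%. A perfect record on three pairs is a sanity check, not yet a validation set.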

The optimizer (SkillOpt) is not wired in yet. It waits on a larger dataset, because a handful of pairs is not a validation set. The risks stay on the table. There is a size mismatch (SkillOpt tunes one small doc; real skills are bigger and linked), reward hacking on any judge, thin data below the threshold, and a v0.1.0 dependency that will keep changing. I keep the scorer and the dataset independent so none of that can sink the rest.

This isn’t only about writing

Nothing about self-improving AI skills is special to blog posts, or to agent skills for content creation in general. A skill is any document that steers an agent: your own expert playbook, a process your company wrote down, one you pulled from GitHub. The same approach works on all of them.

Think back to the two halves, rules and taste. Every skill has both, not just writing. One half has a right answer a machine can check. The other is judgment, where only a person can say whether it landed. What changes from one skill to the next is how big each half is. The more of a skill a machine can check, the more of it improves on its own.

A coding skill sits near one end: did it run, did the tests pass, almost all right-or-wrong, so a machine scores most of it. A strategy skill sits at the other end, where “was this the right call” is judgment and stays with you. A writing skill, like mine, sits between: the mechanics and the SEO are checkable, whether it sounds like me is not.

That in-between case is the one SkillOpt never had to solve. It was proven on tasks that are checkable all the way through, so it never met the taste problem. The moment the valuable part of a skill is taste, you cannot hand the whole score to a machine. You split it, and you keep the judgment.

So when you point this at your own skill, start with one question, not the model: how much of this skill has a right answer a machine can check? That is how much improves on its own. The rest stays yours, and only the scorer changes. That is what lets self-improving AI skills work on any skill, not just mine.

Where to go next

If you are building self-improving AI skills, or anything that learns from its own runs, start with the gate, not the optimizer. Decide what your held-out set is and what a trustworthy score looks like before you let anything propose edits.

SkillOpt is the reference behind all of this. The paper lays out the method, the project page has the results and ablations, and the code is where the implementation lives. If you want to see where self-improving AI skills come from, start there.

The inspiration for this blog series came from the DSPy project. Have a look at its GitHub repo, stanfordnlp/dspy: “DSPy: The framework for programming—not prompting—language models”.

For a separate take on evaluation gates, Microsoft’s guide on evaluating AI agents has a practical walk-through.
