Drawing the Line Between Judgment and Recall

A field study under our Human and AI Collaboration research. We deployed a drafting agent inside a specialty contractor's estimating team to study a specific question: which parts of skilled work can be handed to a system, and which have to stay with the person?

The interesting question about an AI system is rarely whether it can do a task. It is which tasks should be handed over at all, where a person needs to stay in the loop, and how a system earns enough trust to be relied on. That is our third research direction, Human and AI Collaboration, and a specialty contractor's estimating team was a clean place to study it. Their work mixed pure recall with real judgment, and the two were tangled together in a way that let us watch what happens when you try to separate them.

The Work We Were Studying

The firm produced 15 to 20 proposals a month, each taking an estimator 3 to 4 hours. The work was not technically hard, but it was manual end to end: search past projects for relevant scope language, pull current labor and material rates, write the executive summary from scratch, format to the company template, and route for review. A large share of every proposal was structurally identical to past ones: the same service descriptions, the same safety and qualification language, the same pricing format with project-specific adjustments.

Every estimator knew this, and every estimator still wrote it fresh each time, because reusing language from the wrong precedent (a different scope tier, a different safety classification, a client with different contract terms) created risk they did not intend to accept. Starting from scratch felt safer than trusting that you had found the right thing to copy. So the recall work and the judgment work were fused, and the judgment was being spent guarding against bad recall rather than on anything that needed expertise.

The Hypothesis

Our hypothesis was that recall and assembly could be handed to a system if, and only if, the system made its sources legible enough that the estimator could keep exercising judgment over them. The collaboration we wanted was not the system doing the job and the human approving it. It was the system carrying tireless retrieval and correlation while the person kept every decision that carried consequences.

What We Built

We gave a drafting agent structured, read-only access to three internal resources: a four-year proposal archive (about 340 documents), a pricing database updated weekly from the ERP, and an approved content library of scope descriptions, qualification statements, and legal language. The estimator provides the RFP scope summary, project type, and project-specific variables. The agent retrieves the most relevant precedents, pulls current pricing for the applicable service lines, assembles the narrative from approved content adapted to the context, and produces a complete first draft on the firm's template.

The draft is then reviewed and edited by the estimator, typically in 20 to 30 minutes, before submission. The agent submits nothing and makes no pricing decisions. The human review step is a deliberate part of the design, not a limitation. The architecture enforces the boundary directly:

Retrieval is chunked at the section level, not the document level, so the agent pulls the right safety language from the right project without dragging the rest of an unrelated proposal with it.
Pricing is fetched live from the ERP at draft time and never indexed with the documents. Finance and legal made that non-negotiable: a draft that locks in last year's rates is a liability, so the architecture leaves no path by which historical pricing can reach a draft.
The content library is a tagged template store. The agent can adapt phrasing to fit the surrounding narrative, but it cannot introduce pricing, scope, or legal commitments outside the approved set.
The review interface shows source citations inline. For every section, the estimator can see which past proposal and which library block it came from, and inspect the source directly.

The constraint to approved language and live pricing came from the client's legal and finance teams and was built into the architecture from the first design review, not retrofitted afterward.
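
The boundary described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the deployed system: the SectionChunk and DraftSection schemas, the function names, and the toy data are all hypothetical, and the exact match on section and project type stands in for whatever relevance ranking the real index uses. It shows three of the properties: retrieval returns one section rather than a whole proposal, every drafted section carries the ID of the proposal it came from, and rates come only from a lookup called at draft time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SectionChunk:
    """One section of one archived proposal (hypothetical schema)."""
    proposal_id: str
    section: str        # e.g. "safety" or "scope"
    project_type: str
    text: str


@dataclass(frozen=True)
class DraftSection:
    section: str
    text: str
    source_proposal_id: str  # citation shown next to the section in review


def retrieve_section(archive: list[SectionChunk], section: str,
                     project_type: str) -> Optional[SectionChunk]:
    """Return the first chunk matching both section and project type."""
    for chunk in archive:
        if chunk.section == section and chunk.project_type == project_type:
            return chunk
    return None


def assemble_draft(archive, sections, project_type, service_lines, fetch_rate):
    """Build a cited draft; rates come only from fetch_rate, called now."""
    cited, missing = [], []
    for name in sections:
        chunk = retrieve_section(archive, name, project_type)
        if chunk is None:
            missing.append(name)  # left for the estimator to write
        else:
            cited.append(DraftSection(name, chunk.text, chunk.proposal_id))
    # Archived text is never consulted for prices.
    pricing = {line: fetch_rate(line) for line in service_lines}
    return cited, pricing, missing


# Toy data, invented for illustration only.
archive = [
    SectionChunk("P-001", "safety", "mechanical", "Mechanical safety language."),
    SectionChunk("P-002", "safety", "electrical", "Electrical safety language."),
    SectionChunk("P-002", "scope", "electrical", "Electrical scope language."),
]
current_rates = {"install": 95.0}  # stand-in for the live pricing source

sections, pricing, missing = assemble_draft(
    archive, ["safety", "scope"], "mechanical", ["install"],
    current_rates.__getitem__,
)
```

With the toy data, the mechanical draft cites P-001 for its safety section and reports scope as missing rather than borrowing electrical scope language from P-002. Leaving a gap for the estimator is the behavior the design favors over a confident draft built on the wrong precedent.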

What We Watched Happen

The team was skeptical at the start, which was the right reaction. Their concerns were concrete: that the agent would pull scope from the wrong project type, that pricing would lag reality, that the review step would degrade into a rubber stamp. The section-level citations answered the first, the live ERP connection answered the second, and the third we could not engineer away. We could only make the review fast and transparent enough that skipping it would feel negligent.

The behavior that emerged was the result we cared about most. By the end of the first month, the people who had trusted the system least were its heaviest users, and they were also the ones using the source-citation view most often. They were verifying, not accepting. That is well-calibrated trust: relying on the system for recall while keeping judgment active over what it retrieved. The collaboration worked because the division of labor was legible.

What the Study Showed

Average proposal time: 3.5 hrs to 28 min
Monthly volume: 18 baseline to 61 in the first quarter, no added headcount
Win rate on expedited proposals: held at 33% against a 34% baseline
First-draft estimator approval on mechanical scope: rose from 71% to 84% as the index was refined
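
The first two figures can be combined into a rough check on total drafting load. The arithmetic below is a sketch under two assumptions that the figures above do not state outright: that 61 is a per-month volume, and that 28 minutes covers the estimator's full time on an assisted proposal. The variable names are ours.

```python
baseline_hours_per_proposal = 3.5
assisted_minutes_per_proposal = 28
baseline_volume = 18   # proposals per month
assisted_volume = 61   # proposals per month (assumed monthly, see lead-in)

assisted_hours_per_proposal = assisted_minutes_per_proposal / 60

# Per-proposal time saved, as a fraction of the baseline.
time_reduction = 1 - assisted_hours_per_proposal / baseline_hours_per_proposal

# Growth in monthly output.
volume_multiple = assisted_volume / baseline_volume

# Total estimator hours spent on proposals each month.
baseline_monthly_hours = baseline_volume * baseline_hours_per_proposal
assisted_monthly_hours = assisted_volume * assisted_hours_per_proposal

print(f"time reduction per proposal: {time_reduction:.0%}")      # 87%
print(f"volume multiple: {volume_multiple:.1f}x")                # 3.4x
print(f"monthly hours, baseline: {baseline_monthly_hours:.1f}")  # 63.0
print(f"monthly hours, assisted: {assisted_monthly_hours:.1f}")  # 28.5
```

Under those assumptions, the team spent fewer than half the hours on proposals while producing more than three times as many, which is consistent with the volume increase arriving without added headcount.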

The held win rate matters more than the speed. It indicates the assisted drafts were not lower quality: handing recall to the system did not cost judgment; it freed it. The estimators' own summary after 90 days was consistent: the agent gets structure and scope right most of the time, pricing is always current, and the work that remains (calibrating tone for a relationship, making scope-inclusion calls, reviewing risk language) is the part that actually needs their experience. The work they lost was the part they did not want.

The Boundary Is the Finding

The system also reported its own limits. For roughly 20 niche scope categories where the archive was thin, it produced weak first drafts, and the right answer was still to write from scratch. That is not a failure. A collaborator that knows where its competence ends is exactly what the research is trying to produce. The design principle throughout was minimum viable automation: automate the retrieval and assembly, preserve the judgment, and make the seam between them visible enough that people can trust it. That seam, where work passes between person and system, is the thing we are studying.