Most agencies treat every inbound reply as a chance to engage. We run AI outbound for 50+ B2B companies, have handled over 95,000 positive replies this year, and the data says the opposite. The teams that book more meetings reply to fewer messages, not more. Below, the 10 reply categories that actually matter, the 2 that earn an auto-response, and the sarcasm-detection rubric that protects the brand from the back-and-forth that lowers trust.
What Cold Email Reply Classification Actually Is
Every reply that lands in a cold email inbox carries one of 10 intents. A buyer says "yes send it over." A buyer asks "how much does this cost." A buyer fires back "we already have an agency, not interested." A buyer's autoresponder bounces an OOO notice. A buyer challenges with "really, you think you can do better than what we have?" Each one needs a different next move. The classifier is the layer that sorts them so the response engine sends the right reply, or no reply at all.
At small volume (under 500 emails a month), a human can sort replies by eye. At 15,000 emails a month per client across 50+ campaigns, manual sorting is impossible and even a senior SDR misclassifies under fatigue. The classifier has to be programmatic. The accuracy bar is higher than people assume because the cost of a wrong call is asymmetric.
- Reply Classification. The function in a cold email pipeline that reads an inbound reply, identifies the buyer's intent, and labels the message with one of the 10 standard categories. The classifier sits between the inbox webhook and the response composer. Modern classifiers use an LLM (Claude Haiku or Sonnet, GPT-4 Mini) that receives the cold email body, the reply body, and a system prompt that defines the categories. Accuracy targets sit at 92 percent or higher on positive and question detection, and 97 percent or higher on hard-no and unsubscribe detection.
According to HubSpot research on B2B sales response patterns, the median sales team treats every inbound reply as an opportunity and responds to over 80 percent of them. The teams that actually book more meetings respond to roughly 30 percent of inbound replies. The 50-percentage-point gap between those two numbers is replies that should be logged but not engaged. That gap is what reply classification owns.
The 10 Standard Reply Categories
Every reply gets one label. The 10 categories cover roughly 99 percent of real-world inbound messages on a B2B campaign.
- Positive. Clear interest, including a vague yes: "sounds good," "tell me more," "send it over." The buyer wants engagement on the offer.
- Question. A real info-seeking question about the offer, pricing, scope, fit, mechanism, or timeline. The buyer wants information they would actually use to decide.
- Call-request. The buyer explicitly asks for a phone call or meeting. Different from positive because it skips the "tell me more" step.
- Objection. Pushback without a sincere info-seeking question. "We already have an agency," "we tried this and it didn't work," "we don't have budget."
- Banter. Sarcasm, snark, jokes, rhetorical challenges, mockery. Often contains a question mark, but the tone is not sincere.
- Not-now. Polite deferral. "Circle back in 6 months," "after our fundraise," "Q3 maybe."
- Hard-no. Explicit disinterest with clear language. "Not interested," "stop emailing me," "do not contact us again."
- Out-of-office. Autoresponder bounce. Vacation, parental leave, role change.
- Unsubscribe. Explicit request to be removed from the list. "Remove me," "unsubscribe," "take me off."
- Other. Catch-all for replies that do not fit cleanly into the above 9. Wrong-person bounces, internal forwards, replies from the buyer's assistant, gibberish.
Some teams add an 11th category for "wrong-person" forwards (where the original recipient routes the email to the actual buyer). Some merge positive and question into one "warm" bucket. The 10-category taxonomy above is the version we run across every client because it maps cleanly to one action per category, and one action per category is what makes the system run at scale.
Which Categories Should Earn an Auto-Response
The default assumption is that more replies build more pipeline. In practice the opposite holds. Only 2 of the 10 categories should earn an automated response, plus the scripted call-request path.
- Positive. Reply with the next-step asset (lead magnet link, Calendly, deck) inside 15 minutes. Speed is the demo here. A 15-minute response on a positive reply closes at roughly 3 times the rate of a 24-hour response in our data.
- Question. Reply with a direct answer to the specific question, then append the next-step asset. Never deflect a sincere question to a sales call. The buyer asked it because they need the answer to decide.
- Call-request. Scripted path. Send the booking link with a single sentence acknowledging the ask. No LLM in this path because the answer is always the same.
Everything else gets logged to the CRM but does not earn a reply. That includes objections, banter, not-now, hard-no, OOO, unsubscribe, and other. The lift from staying silent on those 7 categories is larger than the lift from any copy tweak we have run in 2026.
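One action per category reduces to a lookup table. The sketch below is a minimal illustration, not our production code: the `Category` and `Action` names and the `route` function are hypothetical, chosen only to show the 2 composed replies, the 1 scripted path, and the silent default.

```python
from enum import Enum


class Category(Enum):
    POSITIVE = "positive"
    QUESTION = "question"
    CALL_REQUEST = "call-request"
    OBJECTION = "objection"
    BANTER = "banter"
    NOT_NOW = "not-now"
    HARD_NO = "hard-no"
    OUT_OF_OFFICE = "out-of-office"
    UNSUBSCRIBE = "unsubscribe"
    OTHER = "other"


class Action(Enum):
    # Hypothetical action names, for illustration only.
    COMPOSED_REPLY = "composed_reply"  # response composer writes the reply
    SCRIPTED_REPLY = "scripted_reply"  # fixed booking-link template, no LLM
    LOG_ONLY = "log_only"              # write to the CRM, send nothing


# Only the categories that earn a response are listed.
# Anything missing from this table falls through to the silent path.
_REPLY_ACTIONS = {
    Category.POSITIVE: Action.COMPOSED_REPLY,
    Category.QUESTION: Action.COMPOSED_REPLY,
    Category.CALL_REQUEST: Action.SCRIPTED_REPLY,
}


def route(category: Category) -> Action:
    """Return the single action for a category; silence is the default."""
    return _REPLY_ACTIONS.get(category, Action.LOG_ONLY)


if __name__ == "__main__":
    for category in Category:
        print(f"{category.value:14s} -> {route(category).value}")
```

Making `LOG_ONLY` the fallback rather than listing the 7 silent categories by hand means a category added later stays silent until someone deliberately gives it a reply action.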
The intuition that drives the wrong default: "if a buyer replied at all, that's a signal worth engaging." The data we see across 95,000 positive replies says signal quality is bimodal. Either the buyer is sincere (positive, question, call-request) or the buyer is dismissive or distracted (everything else). Replying to the second group converts at under 1 percent and produces brand damage that the first group's replies cannot offset.
The Sarcasm Detection Rubric That Protects the Brand
The hardest category to classify is banter. Roughly 4 percent of all inbound replies in B2B outbound carry a sarcastic or rhetorical tone, and many of them contain a question mark. A naive classifier reads the question mark, labels the reply as a question, and the response engine fires off an earnest answer. The buyer screenshots it, posts it to LinkedIn, and the brand takes the hit.
A reply is banter if any of these signals fire, regardless of surface category:
- Mocking tone or rhetorical question structure. "What do you say?" at the end of a challenge. "How about that?" "Really?"
- Sarcastic agreement or feigned engagement. "Oh sure, send me everything you've got."
- Jokes, taunts, or ribbing. "Haha," "lol," emoji-heavy responses.
- Challenges that re-pose the sender's premise back at them without a real follow-up question.
- Dismissive single-clause comebacks aimed at the sender, not the offer.
A question is sincere (and earns a reply) only when the buyer is asking for information they would actually use to decide: pricing, scope, timeline, mechanism, fit, integration, references, next steps. Anything else is banter or objection, both of which go to the silent path.
The cost asymmetry that locks this rule in: missing one borderline positive costs one lost reply. Replying to one piece of sarcasm costs a screenshot, a public callout, and reputational damage that takes months to recover from. The bias is silence when in doubt.
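The rubric override and the silence bias can be applied as a small post-processing step on the classifier's output. This is a sketch under assumptions: the signal names, the `decide` function, and the 0.85 confidence threshold are all hypothetical, and it assumes the classifier reports which rubric signals it saw alongside the label.

```python
from dataclasses import dataclass
from typing import Iterable

SINCERE = frozenset({"positive", "question", "call-request"})

# The 5 rubric signals as flag names (hypothetical identifiers).
BANTER_SIGNALS = frozenset({
    "mocking_or_rhetorical",
    "sarcastic_agreement",
    "jokes_or_taunts",
    "premise_thrown_back",
    "dismissive_comeback",
})

# Assumed threshold, not a number from our data. Tune it against reviewed logs.
MIN_CONFIDENCE = 0.85


@dataclass(frozen=True)
class Decision:
    label: str
    reply: bool


def decide(label: str, confidence: float, signals: Iterable[str]) -> Decision:
    """Apply the banter override first, then the silence-when-in-doubt bias."""
    fired = set(signals) & BANTER_SIGNALS
    if fired:
        # Any rubric signal wins, regardless of the surface category.
        return Decision("banter", False)
    # A sincere label earns a reply only when the classifier is confident.
    reply = label in SINCERE and confidence >= MIN_CONFIDENCE
    return Decision(label, reply)


if __name__ == "__main__":
    print(decide("question", 0.97, ["sarcastic_agreement"]))
    print(decide("question", 0.97, []))
    print(decide("positive", 0.60, []))
```

A low-confidence positive keeps its label in the log, so the weekly review can still find it, but it does not trigger a send.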
How to Build the Classifier
A reply classifier is an LLM with a system prompt, the cold email body, and the reply body. Three configuration decisions matter.
- Model selection. Use a small fast model (Claude Haiku, GPT-4 Mini) for the classification call itself. Reserve Sonnet or larger for the response composition. Classification is a labeling problem, not a generation problem, and the smaller model is 5 to 10 times faster at lower cost with negligible accuracy loss.
- System prompt structure. Define each of the 10 categories with 2 to 3 example replies and 1 to 2 sentences on what the category means. Include the sarcasm rubric verbatim. Include the cost asymmetry rule ("when in doubt, classify as silent path"). Cache the system prompt because it does not change between leads.
- Context window. Pass the full cold email body, the full reply body, and the prior 2 messages in the thread if they exist. Truncate older history. The classifier needs the conversational context but does not need the full 6-month history.
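The prompt structure and context window rules above can be sketched as two plain string builders. Everything named here is an assumption for illustration (`CategorySpec`, `build_system_prompt`, `build_user_message`, and the tag layout); the sketch deliberately stops short of any provider API call, since caching and request formats differ by vendor.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CategorySpec:
    name: str
    meaning: str             # 1 to 2 sentences on what the category means
    examples: Sequence[str]  # 2 to 3 real example replies


def build_system_prompt(specs: Sequence[CategorySpec], rubric: str, doubt_rule: str) -> str:
    """Assemble the static prompt. It is identical for every lead, so it can be cached."""
    parts = ["Label the reply with exactly one category."]
    for spec in specs:
        examples = "\n".join(f'  - "{e}"' for e in spec.examples)
        parts.append(f"{spec.name}: {spec.meaning}\nExamples:\n{examples}")
    parts.append("Banter rubric:\n" + rubric)
    parts.append(doubt_rule)
    return "\n\n".join(parts)


def build_user_message(cold_email: str, reply: str, thread: Sequence[str], max_prior: int = 2) -> str:
    """Assemble the per-lead input.

    `thread` is assumed to be ordered oldest to newest and to exclude the
    reply being classified. Only the last `max_prior` messages are kept.
    """
    prior = list(thread[-max_prior:]) if max_prior > 0 else []
    blocks = [f"<cold_email>\n{cold_email}\n</cold_email>"]
    for i, msg in enumerate(prior, start=1):
        blocks.append(f'<prior_message index="{i}">\n{msg}\n</prior_message>')
    blocks.append(f"<reply>\n{reply}\n</reply>")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(build_user_message("Our pitch...", "yeah, send it", ["m1", "m2", "m3"]))
```

Keeping the static part and the per-lead part in separate functions is what makes the caching advice practical: only `build_user_message` changes between calls.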
Output format: a single JSON object with the category label, a confidence score, and a one-sentence reasoning string. The reasoning string is for human review. If accuracy slips, the reasoning column in the log is where you debug.
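A strict parser on the model output keeps a malformed or off-taxonomy label from reaching the response engine. The field names (`category`, `confidence`, `reasoning`) and the 0-to-1 confidence scale below are assumptions, since the output format only fixes what the three fields contain, not what they are called.

```python
import json
from dataclasses import dataclass

CATEGORIES = frozenset({
    "positive", "question", "call-request", "objection", "banter",
    "not-now", "hard-no", "out-of-office", "unsubscribe", "other",
})


@dataclass(frozen=True)
class Classification:
    category: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale
    reasoning: str     # one sentence, kept for human review


def parse_classification(raw: str) -> Classification:
    """Validate the classifier's JSON output; raise ValueError on anything off-spec."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a single JSON object")

    category = data.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")

    return Classification(category, float(confidence), reasoning.strip())


if __name__ == "__main__":
    raw = '{"category": "question", "confidence": 0.93, "reasoning": "Asks about pricing."}'
    print(parse_classification(raw))
```

A reasonable way to handle a `ValueError` here is to log the raw output and send nothing, which keeps parse failures on the silent path.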
Mickey Hardy used this exact classification approach and went from referrals-only to a 200K month, with every positive reply getting a personalized lead magnet auto-delivered within 15 minutes.
The 3 Failure Modes That Kill Reply Classifiers
Most classifiers we audit on existing client campaigns fail in one of 3 ways. The fixes are simple. The diagnosis is what takes time.
Failure mode 1: banter misread as a question. The classifier reads a question mark, labels the reply as a question, and the response engine fires. The fix is the sarcasm rubric above plus a temperature of 0 on the classification call. A higher temperature on classification produces creative interpretations, which are exactly what you do not want in a labeling step.
Failure mode 2: false negatives on positive replies. The buyer writes "yeah, send it" or "sure" and the classifier reads it as too vague to be positive. The fix is example-driven prompting. Include 5 to 10 real "vague yes" examples in the system prompt. The model learns the surface pattern fast once it sees real ones.
Failure mode 3: misclassification of multi-intent replies. The buyer writes "what's the pricing, also can we meet next week." That's a question and a call-request in one message. The classifier picks one and drops the other. The fix is treating multi-intent replies as their own class and routing them to a composer that addresses both intents in the response. That change cut misrouted replies by roughly two-thirds in our testing.
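One way to sketch the multi-intent fix is to let the classifier return a list of intents and decide the routing afterwards. The names below (`Plan`, `route_intents`) are hypothetical. The handling of mixed replies, where a sincere intent appears next to a silent-path one, is our assumption here and simply follows the silence-when-in-doubt bias.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

REPLY_INTENTS = frozenset({"positive", "question", "call-request"})


@dataclass(frozen=True)
class Plan:
    label: str                    # the label written to the log
    compose_for: Tuple[str, ...]  # intents the composer must address


def route_intents(intents: Sequence[str]) -> Plan:
    """Turn the classifier's intent list into a routing plan."""
    if not intents:
        raise ValueError("at least one intent is required")
    # Drop duplicates while preserving the order the intents appeared in.
    unique = tuple(dict.fromkeys(intents))
    sincere = tuple(i for i in unique if i in REPLY_INTENTS)

    if len(unique) == 1:
        return Plan(unique[0], sincere)
    if len(sincere) == len(unique):
        # Every intent earns a reply: the composer addresses all of them.
        return Plan("multi-intent", sincere)
    # Mixed sincere and silent intents: stay silent (assumed policy).
    return Plan("multi-intent", ())


if __name__ == "__main__":
    print(route_intents(["question", "call-request"]))
    print(route_intents(["objection", "question"]))
```

An empty `compose_for` means log only, so a reply that pairs a pricing question with a brush-off is recorded as multi-intent but never answered automatically.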
According to Salesforce State of Sales research, response-quality issues are the single most common cause of churn in outbound programs, ahead of list quality and copy quality. The classifier is the layer most teams underinvest in because it sits behind the inbox and produces no visible artifact for the CEO to review.
The Practitioner Frame on Reply Classification
Reply classification is not glamorous work. It sits behind the inbox, produces no visual artifact, and the only people who notice it are the ones whose campaigns it quietly ruins. Most agencies skip it or build it once and never tune it. Both are mistakes.
The teams that get this right run a weekly review on a sample of 50 to 100 classified replies. They check the category labels by hand, surface the misses, and update the system prompt with the new examples. The classifier improves every week. The campaigns that feed it produce more booked meetings every month.
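The weekly review is easy to script. The sketch below draws a random sample from the log and, once a human has labeled it, scores each category against a target. It assumes the 92 and 97 percent targets are read as per-category recall, and the function names are hypothetical. With 50 to 100 samples, the per-category counts are small, so treat one week's numbers as a prompt to look closer, not a verdict.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Targets interpreted as per-category recall (an assumption).
TARGETS = {"positive": 0.92, "question": 0.92, "hard-no": 0.97, "unsubscribe": 0.97}


def sample_for_review(log: Iterable, n: int = 100, seed: Optional[int] = None) -> List:
    """Draw up to n classified replies at random for hand review."""
    items = list(log)
    rng = random.Random(seed)
    return rng.sample(items, min(n, len(items)))


def recall_by_category(reviewed: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """reviewed holds (predicted_label, human_label) pairs."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for predicted, truth in reviewed:
        totals[truth] += 1
        if predicted == truth:
            hits[truth] += 1
    return {category: hits[category] / totals[category] for category in totals}


def below_target(recall: Dict[str, float]) -> List[str]:
    """Categories that appeared in the sample and missed their target."""
    return sorted(c for c, target in TARGETS.items() if c in recall and recall[c] < target)


if __name__ == "__main__":
    pairs = [("positive", "positive")] * 9 + [("other", "positive")]
    recall = recall_by_category(pairs)
    print(recall, below_target(recall))
```

The misses surfaced by `below_target` are the replies worth turning into new examples in the system prompt.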
The taxonomy is 10 categories. The auto-response set is 2 categories, plus the scripted call-request path. The sarcasm rubric is 5 signals. The classifier is one small fast LLM call with a cached system prompt and a 15-second SLA. Build it that way, review it weekly, and the back end of your cold email program stops being the bottleneck on revenue.