Most buyers evaluate cold email agencies on the wrong inputs. They read case studies, check Clutch reviews, and ask for a sample campaign. We run AI outbound for 50+ B2B companies and have onboarded clients who left every major agency on the market. The pattern is always the same: the agency looked strong on the surface and fell apart on the details nobody checked. Below is the 8 point evaluation framework we use internally, along with the contract terms that actually predict long term results and the red flags that should end the conversation immediately.
Why Most Agency Evaluations Fail Before They Start
The problem with the standard evaluation process is that it measures marketing quality, not operational quality. A polished pitch deck, a strong Clutch profile, and a well-produced case study video tell you the agency is good at selling themselves. They tell you nothing about whether they will be good at selling you.
Cold Email Agency Evaluation: The process of assessing a cold email service provider across operational, technical, and contractual dimensions before committing to an engagement. A thorough evaluation covers data sourcing quality, sending infrastructure ownership, copy methodology, post-reply systems, reporting transparency, contract terms, and reference checks with current (not former) clients.
We have had clients switch to us from agencies with 200+ Clutch reviews and 6+ years in business. The complaints are always the same: bounce rates above 5%, domains registered under the agency's account, reporting dashboards that lead with open rates, and zero system for converting positive replies into booked meetings. The agency was technically competent at sending emails. They were not competent at producing revenue.
The evaluation framework below is designed to surface the operational gaps that sales calls and case studies hide.
The 8 Point Evaluation Framework
When we evaluate a vendor ourselves, or when a client asks us what to look for, we use these 8 criteria. They are ordered by how much damage they cause if you get them wrong.
1. Infrastructure ownership. Who registers the sending domains? Who owns the Google Workspace or Microsoft 365 accounts? If the agency registers everything under their account, they control your sending reputation. When you leave, you start from zero. This is the single most important question in the evaluation. Get it wrong and every other criterion is irrelevant.
2. Bounce rate across active campaigns. Not the best campaign. The average across all active clients. A well run operation sits between 1% and 2%. Above 3% means they are skipping email verification or using a low quality data source. Above 5%, your domains are taking damage every day they send. Ask for the number. If they dodge, assume the worst.
3. Data sourcing methodology. Single source (Apollo only, ZoomInfo only) or waterfall enrichment across multiple providers? Single source agencies deliver the same leads your competitors already have in their sequences. Waterfall enrichment pulls from 3 to 5 sources and deduplicates, which produces cleaner, less saturated lists. Ask which providers they use and how they verify email addresses before sending.
4. Copy methodology. Templated fill in the blank ("Hi {first_name}, I noticed {company} is...") or research driven personalization? The difference shows up in reply rates. Salesforce data shows personalized cold emails produce 26% higher reply rates than templated alternatives. Ask to see 5 real emails they sent for a client in the past 30 days. Not sample campaigns. Real sent emails. The gap between their sample and their actual output tells you everything.
5. Post-reply system. What happens in the 15 minutes after a prospect replies positively? If the answer is "we forward the reply to your sales team," you are paying $3,000 to $8,000 per month for a notification service. The agencies that produce the best cost per meeting have a system between the reply and the booked meeting: a personalized asset, a booking flow, a follow up sequence. This is where most agencies stop and where the strongest ones start.
6. Reporting transparency. What metrics does the agency report? Open rates are noise. Image blocking makes them unreliable. Total reply rate includes negative replies and out of office messages. The only metrics that correlate with revenue are positive reply rate, meetings booked, and cost per booked meeting. If the agency's reporting dashboard leads with open rates, their incentive structure does not match yours.
7. Warmup and ramp timeline. How long does it take from signing to the first emails going out? Reputable agencies need 2 to 3 weeks for domain warmup before going live. Agencies that promise emails going out in week 1 are either skipping warmup (which damages deliverability) or using pre-warmed shared infrastructure (which means your campaigns share reputation with other clients). Ask what their warmup protocol looks like and how many emails per domain per day they send during ramp.
8. Client retention data. How long does the average client stay? Agencies that deliver results keep clients for 6 to 12+ months. Agencies that churn clients every 3 months are not delivering what they promised. Ask for the number. Then ask for references from clients who have been with them for 6+ months, not the cherry-picked success story from 2 years ago.
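To make criteria 2 and 6 concrete, here is a minimal Python sketch of a campaign scorecard. It is an illustration under stated assumptions, not a standard tool: the `CampaignStats` fields and band names are hypothetical, the "borderline" label for the 2% to 3% range is our own filler (the criteria above do not name that range), and positive reply rate is computed against delivered emails, which is one reasonable convention among several.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CampaignStats:
    """Monthly counts for one campaign. Field names are hypothetical."""
    sent: int
    bounced: int
    positive_replies: int
    meetings_booked: int
    monthly_fee: float  # agency fee in dollars


def bounce_band(stats: CampaignStats) -> str:
    """Map the bounce rate onto the thresholds from criterion 2."""
    rate = stats.bounced / stats.sent
    if rate > 0.05:
        return "damaging"    # above 5%: domains take damage while sending continues
    if rate > 0.03:
        return "warning"     # above 3%: verification or data source problem
    if rate <= 0.02:
        return "healthy"     # at or below the 1% to 2% range of a well run operation
    return "borderline"      # 2% to 3%: not labeled in the text, assumed here


def positive_reply_rate(stats: CampaignStats) -> float:
    """Positive replies divided by delivered emails (assumed denominator)."""
    delivered = stats.sent - stats.bounced
    return stats.positive_replies / delivered


def cost_per_booked_meeting(stats: CampaignStats) -> Optional[float]:
    """Monthly fee divided by meetings booked; None when nothing was booked."""
    if stats.meetings_booked == 0:
        return None
    return stats.monthly_fee / stats.meetings_booked


# Illustrative numbers only, not client data.
example = CampaignStats(sent=4000, bounced=60, positive_replies=40,
                        meetings_booked=12, monthly_fee=4500.0)
print(bounce_band(example))
print(f"{positive_reply_rate(example):.2%}")
print(cost_per_booked_meeting(example))
```

Open rate and total reply rate are left out on purpose, since criterion 6 treats them as noise.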
Infrastructure Questions That Predict Long Term Results
Infrastructure is not the exciting part of evaluating an agency. It is the part that determines whether you have leverage 6 months from now. Here are the specific questions to ask and what the answers should sound like.
"How many sending domains do you set up per client?" The right answer is 3 to 5 dedicated domains per client, each with 2 to 3 inboxes. Shared infrastructure (multiple clients sending from the same domains) means one client's bad campaign can tank your deliverability. If the agency uses shared sending, your results are partially determined by their worst performing client, not your campaign quality.
"Do you use secondary domains or my primary domain?" The answer must be secondary domains. No legitimate cold email agency sends from your primary business domain. If they do, a spam complaint can damage the domain your employees use for day to day business email. Secondary domains (like yourbrand-mail.com or tryyourbrand.com) isolate cold email reputation from your primary domain. We covered this in detail in our cold email infrastructure guide.
"What email warmup tool do you use, and for how long before going live?" The answer should name a specific tool (Instantly, Warmbox, Mailreach, or equivalent) and a specific timeline (14 to 21 days minimum). Agencies that skip warmup or do it for less than 2 weeks are trading your long term domain health for short term speed.
"What is your inbox rotation strategy?" Sending all volume from a single inbox triggers spam filters. The agency should rotate across multiple inboxes per domain, limit each inbox to 30 to 50 sends per day, and spread sends across time windows. If they cannot explain their rotation strategy in specific numbers, they do not have one.
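The numbers in these answers combine into a daily sending ceiling, and the rotation rule itself is simple to state in code. The sketch below is an assumption-laden illustration: `daily_capacity` and `rotate` are hypothetical helper names, the inbox addresses are placeholders, and real sending platforms also spread sends across time windows, which this sketch does not model.

```python
def daily_capacity(domains: int, inboxes_per_domain: int, sends_per_inbox: int) -> int:
    """Maximum cold emails per day for one client's dedicated setup."""
    return domains * inboxes_per_domain * sends_per_inbox


def rotate(inboxes: list[str], recipients: list[str],
           cap_per_inbox: int) -> tuple[dict[str, list[str]], list[str]]:
    """Assign recipients round-robin across inboxes without exceeding the cap.

    Returns the per-inbox plan plus the recipients deferred to a later day.
    """
    capacity = len(inboxes) * cap_per_inbox
    plan: dict[str, list[str]] = {inbox: [] for inbox in inboxes}
    for i, recipient in enumerate(recipients[:capacity]):
        plan[inboxes[i % len(inboxes)]].append(recipient)
    return plan, recipients[capacity:]


# Low and high ends of the ranges quoted above:
# 3 to 5 domains, 2 to 3 inboxes per domain, 30 to 50 sends per inbox per day.
low = daily_capacity(domains=3, inboxes_per_domain=2, sends_per_inbox=30)
high = daily_capacity(domains=5, inboxes_per_domain=3, sends_per_inbox=50)
print(low, high)

# Placeholder inboxes and recipients, purely for illustration.
plan, deferred = rotate(["a@tryyourbrand.com", "b@tryyourbrand.com"],
                        [f"lead{i}" for i in range(7)], cap_per_inbox=3)
print(plan, deferred)
```

An agency quoting daily volume well above the ceiling its own domain and inbox counts allow is either exceeding the per-inbox limit or sending from infrastructure it has not told you about.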
How to Evaluate Post-Reply Systems
The email gets the reply. What happens next determines whether the reply becomes a booked meeting or a dead thread. Most agencies treat the reply as the finish line. The agencies that produce the best results treat it as the starting line.
Ask these 3 questions during your evaluation.
"Walk me through what happens in the first 60 minutes after a prospect replies positively." A strong answer includes specific steps: automated detection, immediate follow up email, personalized asset delivery, and booking link within the first reply. A weak answer is "we flag it in your CRM and your sales team follows up." The speed of the follow up directly correlates with booking rate. The widely cited Lead Response Management study found that the odds of reaching a lead are 100x higher when you respond within 5 minutes than when you wait 30 minutes.
"Do you send any assets or conversion materials on positive reply?" The best agencies ship a personalized deliverable (a walkthrough, a teardown, a competitive analysis) alongside the booking link. This is the step that separates a $300 cost per meeting from a $600 cost per meeting. The asset gives the prospect a reason to book beyond curiosity. It pre-sells the conversation before it happens. If the agency does not have a post-reply asset system, you are leaving meetings on the table.
"What is your positive reply to booked meeting conversion rate?" This is the metric that exposes whether the agency has a real post-reply system or just forwards emails. Agencies with strong post-reply systems convert 25% to 35% of positive replies into booked meetings. Agencies without one typically convert 10% to 15%. The gap between those numbers, at scale, is the difference between 5 meetings per month and 15.
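The arithmetic behind the 5 versus 15 comparison can be sketched in a few lines of Python. The inputs are assumptions chosen for illustration: 50 positive replies per month, a $4,500 monthly fee (inside the $3,000 to $8,000 range mentioned earlier), and conversion rates of 10% and 30% taken from the two ranges above. `projected_meetings` is a hypothetical helper name.

```python
def projected_meetings(positive_replies: int, conversion_rate: float) -> float:
    """Booked meetings expected from a month of positive replies."""
    return positive_replies * conversion_rate


POSITIVE_REPLIES = 50   # assumed monthly positive replies
MONTHLY_FEE = 4500.0    # assumed agency fee in dollars

for label, rate in [("forwarding only", 0.10), ("post-reply system", 0.30)]:
    meetings = projected_meetings(POSITIVE_REPLIES, rate)
    print(f"{label}: {meetings:.0f} meetings, "
          f"${MONTHLY_FEE / meetings:,.0f} per meeting")
```

With these inputs the same reply volume yields 5 meetings at $900 each or 15 meetings at $300 each, so the conversion rate, not the send volume, drives the cost per meeting.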
The Contract Terms That Matter More Than Price
Most buyers negotiate price. The terms that actually determine whether the engagement works are buried in the contract and rarely discussed in the sales conversation.
Minimum commitment length. Some agencies require 6 to 12 month commitments with no out clause. A 30 to 60 day pilot with a month to month option after that is the standard you should hold. If the agency will not offer a pilot, ask why. "Our results take time" is a valid answer for a 60 day pilot. It is not a valid answer for a 12 month lock-in with no performance benchmarks.
Domain and data ownership. The contract should explicitly state that all sending domains, email accounts, lead lists, and campaign data belong to you. Not "will be transferred on request." Belong to you from day 1. If the contract is silent on ownership, assume the agency keeps everything. Get it in writing before you sign.
Data portability. If you leave, can you export your full lead list with engagement history (who opened, who replied, who bounced)? Some agencies hold this data as leverage to prevent churn. Your lead data should be exportable in a standard format (CSV) at any time, without a fee, without a waiting period.
Performance benchmarks. The contract should include measurable targets with a defined review window. Not "we will book you 20 meetings per month." Something like: "By day 60, campaigns will achieve a positive reply rate above 2% and a bounce rate below 3%." If the agency will not commit to any measurable benchmark, they do not have confidence in their own system.
Red Flags That Should End the Conversation
Not every mismatch is a dealbreaker. Some agencies are strong in one area and developing in another. But the following red flags are structural. They indicate a business model problem, not a capability gap.
- They promise meetings in week 1. Domain warmup alone takes 2 to 3 weeks. Any agency promising results in week 1 is either skipping warmup or redefining what "meeting" means.
- They will not share their bounce rate. Every agency tracks this number. If they will not share it, the number is bad.
- They register domains under their own account and call it "standard practice." It is standard practice for agencies that want to lock you in. It is not standard practice for agencies that earn retention through results.
- They require a 12 month commitment with no pilot option. Confidence in your own product means offering a trial. A 12 month lock-in without a pilot means the agency's retention model is contractual, not performance based.
- They cannot name their data sources. "We have a proprietary database" without naming the underlying providers (Apollo, ZoomInfo, Clearbit, Hunter, Findymail) usually means a single low quality source rebranded. Ask specifically.
- They report open rates as the primary success metric. Open rate tracking relies on invisible pixel loading. Most enterprise email clients block these pixels by default. An agency that leads with open rates is either unaware of this limitation or hoping you are.
- They send from your primary domain. This is the fastest way to damage your business email deliverability. No exceptions. If they suggest it, the conversation is over.
How to Run a 30 Day Pilot Before Committing
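The pilot is not a formality. It is the most important evaluation tool you have. Here is how to structure it so the data actually tells you something useful. Note that the 30 days count live sending: with 2 weeks of setup and warmup in front, the full schedule below runs about 6 weeks.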
Week 1 to 2: Setup and warmup. The agency sets up sending domains, configures DNS (SPF, DKIM, DMARC), and begins warmup. You should receive confirmation of domain registration (under your account), DNS records, and warmup tool access. If you do not receive these by the end of week 2, the agency is behind.
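When the agency hands over the DNS records, you can sanity check them yourself. The Python sketch below takes TXT record strings you have already looked up (for example with `dig +short TXT`) and checks only the basic shape of each record. SPF lives on the sending domain, DMARC at `_dmarc.<domain>`, and DKIM at `<selector>._domainkey.<domain>`. The function names are hypothetical and the example records are placeholders. This is a format check under those assumptions, not proof that mail will authenticate.

```python
def parse_tags(record: str) -> dict[str, str]:
    """Split a 'tag=value; tag=value' record (DKIM or DMARC) into a dict."""
    tags: dict[str, str] = {}
    for part in record.split(";"):
        key, sep, value = part.partition("=")  # split on the first '=' only
        if sep:
            tags[key.strip().lower()] = value.strip()
    return tags


def spf_ok(record: str) -> bool:
    """An SPF record must begin with the version term v=spf1."""
    text = record.strip().lower()
    return text == "v=spf1" or text.startswith("v=spf1 ")


def dkim_ok(record: str) -> bool:
    """A DKIM key record needs a non-empty p= public key; v= is optional."""
    tags = parse_tags(record)
    return tags.get("v", "DKIM1") == "DKIM1" and bool(tags.get("p"))


def dmarc_ok(record: str) -> bool:
    """A DMARC record needs v=DMARC1 and a recognized p= policy."""
    tags = parse_tags(record)
    return (tags.get("v") == "DMARC1"
            and tags.get("p", "").lower() in {"none", "quarantine", "reject"})


# Placeholder records for illustration; substitute the ones the agency sends you.
records = {
    "spf": "v=spf1 include:_spf.google.com ~all",
    "dkim": "v=DKIM1; k=rsa; p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQC",
    "dmarc": "v=DMARC1; p=none; rua=mailto:dmarc@example.com",
}
results = {
    "spf": spf_ok(records["spf"]),
    "dkim": dkim_ok(records["dkim"]),
    "dmarc": dmarc_ok(records["dmarc"]),
}
print(results)
```

If any of the three checks fails by the end of week 2, raise it before the first send rather than after.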
Week 3: First sends. The agency begins sending at low volume (50 to 100 emails per day per domain) and ramps over the next 2 weeks. You should receive access to the sending platform dashboard so you can monitor deliverability, bounce rates, and reply rates in real time. Do not rely on the agency's summary reports during the pilot. Look at the raw data yourself.
Week 4 to 5: Data accumulation. By the end of week 5, you should have 1,000 to 2,000 emails sent with enough data to evaluate. Track these 4 numbers: bounce rate (should be below 3%), positive reply rate (should be above 1% even in early sends), reply quality (are the positive replies from your ICP, or random responders), and speed to follow up (how fast does the agency respond to positive replies).
Week 6: Decision. If the pilot hit the benchmarks you agreed on, convert to an ongoing engagement with month to month terms. If it did not, you have real data to explain why. You also have your domains, your data, and your reputation intact because you confirmed ownership on day 1.
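The week 6 decision can be reduced to a short checklist. The Python sketch below encodes the pilot thresholds from weeks 4 to 5 (at least 1,000 emails sent, bounce rate below 3%, positive reply rate above 1%). The `PilotResults` fields are hypothetical names, positive reply rate is measured against delivered emails as an assumption, and reply quality and follow up speed are reported as raw values because no pass threshold is given for them.

```python
from dataclasses import dataclass


@dataclass
class PilotResults:
    """Counts pulled from the sending platform. Field names are hypothetical."""
    sent: int
    bounced: int
    positive_replies: int
    icp_positive_replies: int       # positive replies that match your ICP
    median_followup_minutes: float  # agency response time to positive replies


def evaluate_pilot(r: PilotResults,
                   max_bounce_rate: float = 0.03,
                   min_positive_reply_rate: float = 0.01,
                   min_sent: int = 1000) -> dict:
    """Compare pilot data with the agreed benchmarks."""
    delivered = r.sent - r.bounced
    bounce_rate = r.bounced / r.sent
    positive_rate = r.positive_replies / delivered  # assumed denominator
    icp_share = (r.icp_positive_replies / r.positive_replies
                 if r.positive_replies else 0.0)
    return {
        "enough_volume": r.sent >= min_sent,
        "bounce_rate_ok": bounce_rate < max_bounce_rate,
        "positive_reply_rate_ok": positive_rate > min_positive_reply_rate,
        "icp_share": icp_share,                                # judge by eye
        "median_followup_minutes": r.median_followup_minutes,  # judge by eye
    }


# Illustrative numbers only.
report = evaluate_pilot(PilotResults(sent=1500, bounced=30, positive_replies=20,
                                     icp_positive_replies=14,
                                     median_followup_minutes=12.0))
print(report)
```

Swap in whatever benchmarks you wrote into the pilot agreement. The defaults here only mirror the numbers in this section.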
The best way to evaluate a cold email agency is not reading their case studies. It is running a paid pilot with clear benchmarks and watching what they actually do when nobody is pitching you.
What Separates the Top Tier From Everyone Else
After onboarding clients from dozens of agencies, the pattern is clear. The agencies that produce the best results share 3 traits that the rest do not.
First, they own the full lifecycle. They do not just send emails. They handle what happens after the reply: asset delivery, booking flow, pre-meeting nurture. The email is a wedge. The post-reply system is the revenue engine. We wrote about this in detail in our ranking of the best cold email agencies.
Second, they invest in infrastructure depth over campaign volume. More domains, cleaner data, slower ramp schedules, lower sends per inbox. These agencies produce fewer total emails and more total meetings. The agencies that lead with "we send 50,000 emails per month" are optimizing for the wrong number.
Third, they report on the metrics their clients care about, not the ones that make the agency look good. Positive reply rate. Booked meetings. Cost per booked meeting. Show rate. These agencies do not mention open rates because they know open rates are noise. When you find an agency that reports this way without being asked, you have found one worth working with.
The evaluation process takes effort. It requires asking questions that agencies do not expect and looking at data that agencies do not volunteer. But the difference between a strong agency and a weak one is not visible on the surface. It lives in the infrastructure, the contracts, and the post-reply systems. The buyers who check those details before signing spend 6 to 12 months with a productive partner. The ones who skip the evaluation spend 3 months with the wrong agency, lose their domain reputation, and start over. The evaluation is not overhead. It is the most productive week you will spend in the entire engagement.