How we test AI advertising tools
The protocol behind every review on ad-stack — three reference briefs, twelve metrics, no agency discounts, no preview builds. The only variable is the tool.
Every tool that lands in the journal goes through the same three-stage gauntlet: a creative brief, a production run, and a published-cost accounting. The brief is fixed across all tools. The brands are fixed across all tools. The metrics are fixed across all tools. The only thing that changes is the software in the middle.
We wrote this page so every review and comparison the journal publishes points back to the same protocol. If a tool’s score doesn’t match your own experience with it, this is where to start a productive disagreement.
The three reference briefs
We carry three reference briefs across every tool we test. They cover the three buyer segments that show up most often in the questions we get from readers.
Brief 1 — DTC supplement (consumer e-commerce). A real product, real positioning, a real offer, and a brand kit that includes logos, fonts, hero imagery, and a packaging photo. Target placements: Meta feed, Meta Reels, TikTok in-feed.
Brief 2 — B2B SaaS (developer tools play). A real product with a 14-day trial, a calmer brand tone, and screenshots of the actual UI. Target placements: Meta feed, LinkedIn feed.
Brief 3 — Consumer mobile app. An App Store URL the tool can import from, a public landing page, and a download-to-install funnel. Target placements: Meta feed, Meta Reels, TikTok, Google UAC where the tool supports it.
Same brief, same brand kit, same offer, same placements. The brief never gets re-tuned to flatter a tool. If a tool can’t handle one of the three, that’s part of the review.
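One way to picture "fixed across all tools" is to treat the three briefs as immutable data that no tool run is allowed to edit. The sketch below is illustrative only: the class, field, and function names are assumptions made for this example, not the journal's actual tooling. The placements are copied from the brief descriptions above.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)  # frozen: a brief cannot be re-tuned per tool
class ReferenceBrief:
    name: str
    segment: str
    placements: Tuple[str, ...]
    # Placements tested only where the tool supports them.
    optional_placements: Tuple[str, ...] = ()


REFERENCE_BRIEFS = (
    ReferenceBrief("DTC supplement", "consumer e-commerce",
                   ("Meta feed", "Meta Reels", "TikTok in-feed")),
    ReferenceBrief("B2B SaaS", "developer tools",
                   ("Meta feed", "LinkedIn feed")),
    ReferenceBrief("Consumer mobile app", "mobile app",
                   ("Meta feed", "Meta Reels", "TikTok"),
                   optional_placements=("Google UAC",)),
)


def uncovered_placements(brief: ReferenceBrief, supported: Set[str]) -> List[str]:
    """Required placements the tool cannot produce.

    The brief is not trimmed to fit the tool; gaps are reported in the review.
    """
    return [p for p in brief.placements if p not in supported]
```

A tool that only exports for Meta feed would come back with `["LinkedIn feed"]` for Brief 2, and that gap goes into the review rather than out of the brief.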
The twelve metrics
We grade every tool on the same twelve dimensions:
- Time to first usable draft — how long from prompt to something we’d actually publish.
- Number of regenerations to get to that first usable draft.
- Hours of human editing required after the tool’s output.
- Final spend on the public plan tier we actually used.
- Brand-safety incidents — flagged content, broken brand-kit application, hallucinated product names.
- Output variance across reruns — same prompt, different days, how consistent is the result.
- Asset format coverage — 9:16, 1:1, 16:9, static, video, multi-scene.
- Export friction — file format, watermark, captions, supported destinations.
- Onboarding friction — how long from sign-up to the first piece of usable output.
- Failure modes — what happens when the tool can’t do what we asked, and how well it tells us.
- Customer support response time on a real ticket we file.
- Cost per published ad — not cost per generated clip. The number you actually pay to get one ad live, including any post-production hand-off.
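The last metric is plain arithmetic, and the difference from cost per generated clip is easy to see in a small example. This is a minimal sketch that assumes post-production hand-off is billed by the hour. The function names, parameters, and figures are hypothetical and are not taken from any published review.

```python
def cost_per_published_ad(plan_spend: float,
                          editing_hours: float,
                          hourly_rate: float,
                          ads_published: int) -> float:
    """Total cost to get one ad live: plan spend plus post-production, per published ad."""
    if ads_published <= 0:
        raise ValueError("at least one ad must be published to compute this metric")
    return (plan_spend + editing_hours * hourly_rate) / ads_published


def cost_per_generated_clip(plan_spend: float, clips_generated: int) -> float:
    """The flattering number: plan spend spread over every clip, usable or not."""
    if clips_generated <= 0:
        raise ValueError("no clips were generated")
    return plan_spend / clips_generated


# Hypothetical run: a 99.00 plan, 3 hours of editing at 50.00/hour,
# 40 clips generated, 5 ads actually published.
per_ad = cost_per_published_ad(99.0, 3.0, 50.0, 5)   # (99 + 3 * 50) / 5
per_clip = cost_per_generated_clip(99.0, 40)          # 99 / 40
```

In this made-up run the per-clip figure is about 2.48 while the per-published-ad figure is 49.8, which is why the protocol reports the second one.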
Twelve metrics: six “speed and volume” measures and six “quality and economics” measures. Each gets a score from 0 to 5. The overall verdict is a weighted average, with the weights leaning toward metrics 4, 5, 10, and 12 (final spend, brand-safety incidents, failure modes, and cost per published ad), the ones that decide whether the tool actually works on a real campaign rather than a demo reel.
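The weighted average can be sketched as follows. The exact weights are not stated on this page, so the 2:1 ratio below is purely an assumption for illustration. The only detail taken from the protocol is that metrics 4, 5, 10, and 12 (counted in list order) weigh more than the rest.

```python
from typing import Sequence

# Metric names in the order they are listed above (positions are 1-based).
METRICS = [
    "time to first usable draft", "regenerations", "editing hours",
    "final spend", "brand-safety incidents", "output variance",
    "asset format coverage", "export friction", "onboarding friction",
    "failure modes", "support response time", "cost per published ad",
]
HEAVY_POSITIONS = {4, 5, 10, 12}


def overall_score(scores: Sequence[float],
                  heavy_weight: float = 2.0,   # hypothetical weight
                  base_weight: float = 1.0) -> float:
    """Weighted average of twelve 0-5 scores, leaning on the heavy metrics."""
    if len(scores) != len(METRICS):
        raise ValueError("expected one score per metric (twelve in total)")
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("each score must be between 0 and 5")
    weights = [heavy_weight if pos in HEAVY_POSITIONS else base_weight
               for pos in range(1, len(METRICS) + 1)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# A tool that scores 3 everywhere except 5 on the four heavy metrics.
example = [3, 3, 3, 5, 5, 3, 3, 3, 3, 5, 3, 5]
verdict = overall_score(example)  # (8 * 3 + 4 * 2 * 5) / (8 + 4 * 2) = 4.0
```

With equal weights the same tool would average about 3.67, so the weighting is what pulls the verdict toward the metrics that reflect real-campaign performance.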
What we publish
Every review ends with a score, a verdict sentence, a price-tier breakdown, and a “buy it if / don’t buy it if” cut. Every comparison ends with a table of the twelve metrics and a verdict in the same shape. Every alternatives page ranks the five most-tested alternatives and closes with a “when the original still wins” section.
When we re-test a tool that has shipped a major update, we publish a field note with the delta. The main score on the review only moves at the next full re-test, which we do once a quarter.
What we don’t do
We don’t accept free seats. We don’t run sponsored reviews. We don’t take “exclusive demo builds” the public can’t sign up for. We pay for our own seats at the public plan tier and use the tools the way a real buyer would.
If a vendor sponsors a section, we say so in the byline before the headline. As of this writing, none have, and no section on the site is sponsored.
When you disagree
This protocol is good but not perfect. If you’ve run the same brief through a tool we’ve reviewed and gotten a meaningfully different result, we want to know. Send us your brief, your output, and your score. If your result would shift the verdict materially, we’ll re-run the test. The address is on our about page.
Letters from readers
Q·01 How is ad-stack funded?
We pay for every tool seat ourselves at the public plan tier, and the journal is reader-supported via the newsletter. No vendor pays for placement, and no review is sponsored.
Q·02 Why benchmark on the same brief instead of letting each tool play to its strengths?
Because the only fair variable in a head-to-head test is the tool. Letting each vendor pick their best demo brief is how the AI ad category got into its current marketing-led mess — every tool wins on its own showcase. Same brief means you can actually compare cost-to-published across the field.
Q·03 How often do you re-test tools that have shipped major updates?
Every quarter. Reviews carry a “last tested” date in the byline. If a tool ships a meaningful capability change between quarterly cycles, we publish a field note rather than waiting, but the score on the main review only moves at the next full re-test.
Q·04 Can I send in a tool to be reviewed?
Yes. Send a note via the contact link in the footer. We can’t promise coverage of every submission, and being suggested has no bearing on the eventual verdict. We pay for our own seat rather than accepting free credits, so a submitted tool is evaluated identically to every other tool.