How to Evaluate an MCP Development Partner: Buyer's Checklist for 2026

How to evaluate partners in a category too young for typical portfolio signals.

The signals product teams normally use to evaluate a build partner – portfolio depth, named-client logos, polished case studies, years of category experience – are unreliable in MCP because the category is too young for any partner to have a deep portfolio. Replace them with four signal categories: production references (not case studies), specific embedding-level experience (not generic AI experience), actual artifacts from previous engagements (auth designs, tool definitions, audit logs), and a defended point of view (not just execution capacity). Run a structured evaluation: written brief, screening calls, technical deep-dive, reference calls, written proposals, decision review. Disqualify any partner who cannot produce a current production reference, has no documented auth/audit artifacts, or offers execution without an opinion.

This is the structural problem of evaluating in a young category: the marketing surface and the actual capability are unusually decoupled. Two partners can have similar websites, similar logos, and similar pitch decks – and one of them has shipped three production MCP apps to enterprise customers while the other has built an internal demo. Distinguishing them takes a different kind of evaluation than the one most procurement processes are built for.

This guide is for product leaders running that evaluation. It assumes you have already made the build-vs-buy decision in favor of a partner (MCP build vs buy) and uses the working vocabulary in our MCP terminology guide. For broader context on technology partner evaluation methodology, see how to evaluate a technology partner and our framework for reference checks for technology partners. The patterns for AI partner evaluation specifically are in our guide to selecting an AI development partner.

The Cost of the Wrong Choice

The cost of the wrong partner choice is not a contract write-off; it is a year of compounded delay during the window when MCP presence matters most. A poorly built MCP app passes initial review and breaks in production over the following quarter – auth tokens silently expire, mutations are not idempotent, the audit story falls apart the first time a customer asks for a log. Six months in, the buyer is rebuilding while still paying for the original.

Why MCP Partner Evaluation Is Different

Three things make MCP partner evaluation different from evaluating, say, a generalist agency for a marketing site or a typical mobile-app build:

  1. Experience is concentrated in a small number of teams, obscured by a much larger number of agencies that have updated their websites to claim MCP capability. The honest count of teams that have shipped multiple production MCP apps to paying customers is probably under fifty globally as of mid-2026. The count of agencies whose websites mention MCP is in the thousands. The signal-to-noise ratio of the marketing surface is unusually bad.
  2. The work is adjacent to but not the same as previous specialties. Teams with strong API-design backgrounds can ship adequate MCP apps without prior MCP experience; teams with strong full-stack agency backgrounds, no API experience, and a couple of weeks of MCP demo work can produce something that looks shippable in a pitch and breaks in production.
  3. The failure mode is delayed. As the previous section described, the defects a demo hides surface in production over the following quarter – well after the contract is signed and paid – so the wrong choice is often not discovered until the buyer is rebuilding.

A more rigorous evaluation up front pays for itself many times over. The rigor takes a different shape than mature-category evaluation. You are not looking for the partner with the most MCP-shaped marketing. You are looking for the partner with the most MCP-shaped artifacts – production references, real auth designs from previous engagements, actual tool definitions, working audit-log examples – and you are looking past the partners who can talk about MCP and toward the partners who can show you what they have shipped.

The Four Signals That Actually Work

In place of the usual portfolio-and-logos checklist, four signal categories matter when the category is young.

Signal 1: Production references, not case studies

A case study is a document the partner controls. A production reference is a customer the partner introduces you to, who is currently using an MCP app the partner shipped, and who will talk to you for thirty minutes about what worked and what did not.

Ask for the latter. The single most valuable thirty minutes of an MCP partner evaluation is a reference call with a current customer. The single most predictive failure signal is a partner who hedges on whether such a call can happen.

Signal 2: Specific embedding-level experience

A partner who has shipped three level-1 read-only MCP apps and never shipped a level-2 actions app is not the right partner for your level-2 build. The discontinuity between read-only and actions is large – idempotency, reversibility, audit, intent preview – and previous experience at the higher level matters more than total count of apps.

Ask specifically: How many MCP apps have you shipped to production, on which clients, at what embedding level? The honest answer for most partners in 2026 is one to three. A partner who answers “many” without a list, or “we have AI experience” without naming MCP-specific projects, has not shipped what they are claiming.

If you are shipping at level 2, the partner should have shipped at level 2. If you are shipping to multiple clients, cross-client experience matters – Claude experience does not transfer to Microsoft Copilot’s enterprise model without real cost.

Signal 3: Artifacts, not pitches

The most diagnostic step in a partner evaluation is asking to see actual artifacts from a previous engagement. Three artifacts are particularly informative:

  • The auth design from a previous engagement. Not the marketing pitch – the actual design. Scope taxonomy, token lifetime decisions, refresh behavior, revocation flow, the SOC 2 considerations baked in. A partner who can talk fluently about why they chose per-resource per-verb scopes for a previous client has done the work. (See MCP auth and security for the bar.)
  • The tool surface from a previous build – actual tool definitions, names, descriptions, parameter shapes, error responses for a real production MCP app. Read them. Look at description quality, parameter naming, error legibility. (A sketch of the shape to look for follows this list.)
  • The audit and observability defaults. What does the partner instrument out of the box? Per-invocation logs? Customer-admin-facing log surface? Session reconstruction?

These three artifacts are diagnostic because they cannot be faked in a pitch deck.
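
For calibration, here is the shape a legible tool definition takes. This is an illustrative sketch, not a partner artifact – the tool name, parameters, and descriptions are all hypothetical – but it shows what to read for: a description that states what the tool does and does not do, parameters documented individually, and a required idempotency key on a write.

// Hypothetical tool definition, in the shape MCP tool listings use
// (name, description, JSON Schema input). Every name here is illustrative.
const createInvoiceTool = {
  name: "create_invoice",
  description:
    "Create a draft invoice for an existing customer. The draft is not " +
    "sent to the customer; sending is a separate tool. Fails with a " +
    "legible error if customer_id does not exist.",
  inputSchema: {
    type: "object",
    properties: {
      customer_id: { type: "string", description: "ID of an existing customer" },
      amount_cents: { type: "integer", description: "Invoice total, in cents" },
      currency: { type: "string", description: "ISO 4217 currency code, such as USD" },
      idempotency_key: {
        type: "string",
        description: "Caller-supplied key; retries with the same key return the original invoice",
      },
    },
    required: ["customer_id", "amount_cents", "currency", "idempotency_key"],
  },
} as const;

Weak definitions fail in predictable ways: descriptions that restate the name, undocumented parameters, and error responses only an engineer can parse.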

Signal 4: A view, not just execution capacity

The strongest partners have an opinionated view on the work; the weakest have execution capacity but no point of view.

A partner with a view will tell you which embedding level you should ship at, based on the diagnostics in our strategy decision framework, and will defend the recommendation. A partner without a view will offer to build whatever you specify and will charge you to discover during the build that what you specified was not the right thing.

Ask: “What would you do differently from the brief we sent?” A partner who answers “nothing, that brief is great” is offering execution. A partner who pushes back on something specific – embedding level, client choice, auth approach – is offering partnership. The latter is rarer and worth more.

Sample Brief and Screening Questions

Sample MCP partner evaluation brief

A working template for the written brief shared with three to five partners at the start of the evaluation. Keep it under 4 pages; partners who cannot scope from this should disqualify themselves.

SUBJECT: MCP Partner Evaluation – [Your Company Name]

ABOUT US
- Company: [name, ARR, customer base, one-paragraph product description]
- Audience: [primary buyer persona; segments]
- Existing tech stack: [key infrastructure relevant to MCP work]

THE OPPORTUNITY
- Strategic posture: [Posture 1/2/3 from MCP strategy framework]
- Why we're shipping: [the specific buyer signal driving urgency]
- What success looks like: [first-year goals in measurable terms]

SCOPE
- Target client(s): [Claude / ChatGPT / Cursor / Copilot / Gemini / Perplexity, in priority order]
- Embedding level: [read-only / actions / agent-resident]
- Tool surface (rough): [estimated tool count; key resources/operations]
- Auth model: [OAuth 2.1 + PKCE expected; SSO required for enterprise]
- Audit / observability requirements: [SOC 2, HIPAA, customer-facing audit log, etc.]

TIMELINE
- Strategy & scope: [target weeks]
- Build: [target weeks]
- Beta + GA: [target weeks]
- Hard constraints: [any non-negotiable dates]

ENGAGEMENT MODEL
- Build-only / build + maintenance / build + transition to in-house
- Knowledge transfer expectations: [pairing? runbooks? architecture review?]
- Maintenance retainer expected: [yes/no; range]

EVALUATION CRITERIA
- We will evaluate proposals against the criteria in the attached scorecard
- We expect a written proposal with scope, sequencing, knowledge transfer model,
  maintenance commitment, and pricing structure
- We will conduct reference calls with one production customer per finalist
- Final decision: [target date]

QUESTIONS
- Please direct questions to [contact, email]
- Proposal due: [date]

This brief is a screening tool by itself. Partners who respond with a generic deck rather than a scoped proposal disqualify themselves. Partners who respond with thoughtful clarifying questions move forward.

Screening call questions

A complete first-round screening call works through these in roughly 60 minutes.

Track record (10 minutes):

  1. How many MCP apps have you shipped to production, on which clients, at what embedding level? Look for: a list with names, dates, and URLs. Vague answers disqualify.
  2. Of those, which is the most recent? When did it ship? When was its last meaningful update? Look for: shipped within last 12 months; ongoing maintenance.
  3. Can I talk to a current customer using an MCP app you shipped, ideally at the embedding level we need? Look for: yes, with a name and timeframe.
  4. What’s the longest-running MCP app you’ve shipped, and what does maintenance look like for it? Look for: a real story with concrete details about spec changes, auth changes, scope reviews.

Technical depth (20 minutes):

  1. Can you walk me through the auth design from a previous engagement, in detail? Look for: per-resource per-verb per-sensitivity scope structure; OAuth 2.1 + PKCE + dynamic client registration (DCR); thoughtful token lifetime decisions. (A sketch of the scope structure follows this list.)
  2. Can I see actual tool definitions from a previous build? Look for: well-named tools, clear descriptions, well-shaped parameter schemas.
  3. What does your audit and observability default look like? Look for: built-in, not afterthought.
  4. How do you handle host-client spec changes mid-engagement? Look for: built into retainer, not surprise scope changes.
  5. What’s your idempotency pattern for write tools? Look for: idempotency keys, server-side caching, Stripe-style discipline.
  6. What’s your reversibility pattern for high-stakes operations? Look for: soft-delete with restore tokens, two-phase commit for irreversibles, revision history.
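
To make the scope structure in question 1 concrete: per-resource per-verb per-sensitivity means scope strings that encode all three dimensions, plus token-lifetime decisions written down rather than defaulted. A minimal sketch – the resource names and the numbers are illustrative placeholders, not recommendations:

// Illustrative scope taxonomy: resource:verb:sensitivity.
// Resource names are hypothetical; the shape is what matters.
const scopes = [
  "invoices:read:standard",
  "invoices:write:standard",
  "invoices:write:high",    // e.g. voiding or refunding
  "customers:read:standard",
  "customers:write:high",   // PII mutation gated separately
] as const;

// Token-lifetime decisions a reviewed auth design should state explicitly
// (placeholder values, not recommendations):
const accessTokenTtlSeconds = 60 * 60;       // short-lived access tokens
const rotateRefreshTokens = true;            // rotate on every refresh
const revocationPropagationTargetMs = 5_000; // how fast revoke must take effect

A partner who has done the work can show a document that answers these questions; a partner who has not will reconstruct answers on the call.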

Process and engagement (15 minutes):

  1. What does discovery and design look like before any code is written? Look for: 2–3 weeks of strategy + scope work; documented deliverables before build phase.
  2. How do you structure knowledge transfer for in-house takeover? Look for: pairing engagements, runbook deliverables, architecture review milestones.
  3. What’s your maintenance commitment after launch? Look for: written terms, clear coverage.
  4. What’s your typical timeline for level-1 / level-2 / multi-client engagements? Look for: realistic numbers (1 quarter / 2 quarters / 2.5–3 quarters).

Commercial (10 minutes):

  1. What’s your pricing structure (fixed-fee, T&M, retainer, milestone-based)?
  2. What’s typically out of scope in your fixed-fee engagements? Look for: clear answer; “very little” is a flag.
  3. How do you handle scope changes? Look for: written change-order process.
  4. What’s your IP and source-code ownership stance? Look for: client owns code; explicit license; source escrow if applicable.

Strategic (5 minutes):

  1. What would you do differently from the brief we sent? Look for: specific pushback, not “great brief”.
  2. Which embedding level would you recommend for our situation, and why? Look for: defended recommendation, not deference.
  3. What’s the most common reason engagements like ours go wrong? Look for: lessons from real engagements.

Reference call question script

When you reach reference calls, conduct them yourself, not via the partner. Thirty minutes per reference.

1. How did you find [partner]?
2. What was the scope of the engagement?
3. How did the partner handle the strategy/scope phase?
   Did they push back on your initial brief? Where?
4. What changed between the proposal and the actual delivery?
5. What was the first incident in production? How did the partner handle it?
6. How was the auth design? Has it held up?
7. How was the audit story? Has it been sufficient when issues came up?
8. What's maintenance been like since launch?
9. If you were doing it again, would you hire [partner]?
   What would you do differently in the engagement?
10. What's something the partner did well that you didn't expect?
11. What's something the partner did poorly that surprised you?
12. Anything we should know that we wouldn't think to ask?

The signals to listen for: specifics (vs generalities), proactive disclosure of issues (vs glossing), comfort answering question 11 honestly (vs deflection).

The 100-Point Scorecard

A scorecard you can apply to candidate proposals. Score each finalist; the highest score is not necessarily the best partner (judgment matters), but a finalist scoring below 65 should not be selected.

Scoring categories, weights, and sub-criteria:

  • Track record (25 points): production MCP apps shipped (10), specific embedding-level match (10), production reference call available (5)
  • Technical depth (30): auth design artifact reviewed (10), tool definition artifact reviewed (10), audit/observability default (5), idempotency/reversibility patterns demonstrated (5)
  • Process & engagement (20): discovery/design phase before build (5), knowledge transfer model (5), maintenance terms clear (5), realistic timeline (5)
  • Commercial (10): clear pricing structure (3), explicit IP ownership (3), change-order process (2), maintenance retainer terms (2)
  • Strategic view (15): pushed back on the brief substantively (5), defended an embedding-level recommendation (5), articulated common failure modes from experience (5)
  • Total: 100 points

Score interpretation:

  • 85–100: strong partner; proceed with confidence
  • 70–84: acceptable partner; document specific concerns and address in contract
  • 65–69: marginal; pursue only if no stronger alternatives and you can mitigate weak areas
  • <65: do not select

If multiple partners score 80+, the tiebreakers worth weighing are: reference customer signal (a great reference adds weight), strategic-view depth (the partner who pushed back hardest on your brief usually delivers the best engagement), and commercial alignment (the partner whose pricing structure matches your cost-management preferences).

Red Flags, Process, and Contract Terms

Five disqualifiers

Five signals strong enough to drop a partner regardless of the rest of the evaluation:

  1. No production reference call available. If the partner cannot put you in front of a paying customer, the experience claim is unsupported.
  2. No documented auth or audit artifacts. Partners who treat auth and audit as engineering details to be figured out during the build are starting too late.
  3. No view on which embedding level you should ship at. A partner who says “we’ll build whatever you specify” is offering execution, not partnership.
  4. Single-client experience masquerading as MCP expertise. A team that has built three Claude connectors and never touched another client’s MCP surface is a Claude partner, not an MCP partner. Fine if Claude is the only client you care about, ever. Not fine if you intend to expand.
  5. Aggressive on timeline, vague on safety. Partners who promise four-week ships at level 2 without articulating idempotency, reversibility, and audit work are either underestimating or planning to skip. Both are disqualifying.
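
The safety work named in disqualifier 5 is not exotic, which is what makes skipping it telling. A minimal sketch of the two patterns from the screening questions – idempotency keys on writes, soft delete with a restore token – assuming an in-memory store for brevity (a production build would persist both, with TTLs and retention windows):

import { randomUUID } from "node:crypto";

type Invoice = { id: string; amountCents: number; deleted?: boolean };

// Replay cache: the same idempotency key returns the same result,
// so a retried write never creates a duplicate.
const replayCache = new Map<string, Invoice>();
const invoices = new Map<string, Invoice>();

function createInvoice(amountCents: number, idempotencyKey: string): Invoice {
  const replayed = replayCache.get(idempotencyKey);
  if (replayed) return replayed; // retry-safe: original result, no new write
  const invoice: Invoice = { id: randomUUID(), amountCents };
  invoices.set(invoice.id, invoice);
  replayCache.set(idempotencyKey, invoice);
  return invoice;
}

// Soft delete: nothing is destroyed immediately; the caller receives a
// restore token that reverses the operation within a retention window.
const restoreTokens = new Map<string, string>(); // token -> invoice id

function deleteInvoice(id: string): { restoreToken: string } {
  const invoice = invoices.get(id);
  if (!invoice) throw new Error(`unknown invoice: ${id}`);
  invoice.deleted = true;
  const restoreToken = randomUUID();
  restoreTokens.set(restoreToken, id);
  return { restoreToken };
}

function restoreInvoice(restoreToken: string): Invoice {
  const id = restoreTokens.get(restoreToken);
  const invoice = id ? invoices.get(id) : undefined;
  if (!invoice) throw new Error("invalid or expired restore token");
  invoice.deleted = false;
  restoreTokens.delete(restoreToken);
  return invoice;
}

A partner with production experience has hardened versions of both patterns and can show them as artifacts; a partner promising four-week ships usually cannot.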

How to run the process

A defensible evaluation takes four to six weeks and follows six steps:

  1. Written brief (Week 1) – shared with three to five partners
  2. First-round screening calls (Weeks 1–2) – 60 minutes each, anchored on the question list above
  3. Technical deep-dive (Weeks 2–3) – with the surviving two or three; review actual artifacts; engineering lead in the room
  4. Reference calls (Weeks 3–4) – conducted directly, not via the partner; one production customer per partner; thirty minutes each
  5. Written proposals (Weeks 4–5) – from each finalist, with scope, sequencing, knowledge transfer, maintenance, and pricing all explicit
  6. Decision review (Weeks 5–6) – with internal stakeholders including engineering, security, and the eventual product owner

Questions That Reveal True Capability

Beyond the screening list, three questions consistently surface real capability gaps. "What's the most recent MCP-spec change you absorbed mid-engagement, and how did you handle it?" "What's a scope decision you made on a previous project that you'd reverse with hindsight?" "Walk me through how you'd handle a customer's security team asking for proof of revocation latency in production." Partners with real experience answer these with specifics. Partners with thin experience deflect or generalize.

Required contract terms

The minimum bar for a defensible MCP partner contract:

  • Statement of Work with line-itemed scope matching the cost breakdown in MCP build vs buy
  • Acceptance criteria per phase – what does “auth design complete” mean? What does “tool surface implemented” mean?
  • IP ownership – code is yours, license explicit
  • Source code escrow for partners who hold ongoing operational responsibility
  • Audit log deliverable – partner builds and surfaces it to customer admins
  • Penetration test deliverable – third-party pentest before launch, with remediation in scope
  • Knowledge transfer deliverables – runbooks, pairing schedule, architecture review milestones with internal sign-off
  • Maintenance terms – what’s included; what counts as scope change; rate card
  • Disengagement clauses – termination notice, transition assistance, source escrow
  • Key-person commitments – named engineers, rotation notice
  • Confidentiality – including AI training: whether the partner can use any artifacts from your engagement to train models or improve their own tooling

The most common contract gaps are knowledge transfer (often soft-pedaled) and audit log surfacing (often left as engineering detail). Both deserve explicit line items.
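
What surfacing the audit log to customer admins implies, at minimum, is a queryable per-invocation record. A sketch of the fields such a record plausibly carries – the field names are illustrative, not a spec:

// Illustrative per-invocation audit record; field names are hypothetical.
type ToolInvocationAuditRecord = {
  invocationId: string;    // unique per tool call
  sessionId: string;       // groups calls for session reconstruction
  occurredAt: string;      // ISO 8601 timestamp
  actor: { userId: string; client: string }; // who, via which MCP client
  tool: string;            // e.g. "create_invoice"
  scopesUsed: string[];    // which grants the call exercised
  outcome: "success" | "denied" | "error";
  resourceIds: string[];   // what was touched, for admin queries
};

If the contract’s audit-log deliverable names fields at this level of specificity, the “engineering detail” gap closes itself.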

When to bring in an outside advisor

For most product teams, this evaluation is doable internally with the framework above. Bring in an outside advisor if:

  • Your company has not run a build-partner evaluation in this category before
  • The strategic stakes are high enough that an additional set of experienced eyes is worth the cost
  • The internal team is too close to existing partner relationships to evaluate them on their merits
  • The buyer wants the diligence documented for procurement, board, or audit purposes

We do this work regularly, both as the primary evaluator and as a second-opinion review on a partner the team has already chosen. Typical engagement: 4–6 weeks, $40K–$100K depending on scope and number of finalists.

Why This Evaluation Matters More Now Than Later

The MCP build-partner market will mature. In two years, evaluating partners will look more like evaluating mobile-app shops in 2014 – a settled craft, a known set of credible firms, an unambiguous portfolio standard. The asymmetry between marketing surface and actual capability will close. Production reference checks will be a formality rather than a diagnostic.

We are not there yet. The next eighteen months are the period in which the signal-to-noise ratio of partner marketing is at its worst, the cost of getting the choice wrong is at its highest, and the rigor of a serious evaluation has the most leverage. The teams that get this right are the ones that treat partner evaluation as a real procurement exercise rather than a vendor-shopping exercise.

The MCP app you ship is the product surface your customers will see for the next five years. It is shaped, more than anyone wants to admit, by the partner you picked at the beginning. Pick deliberately. Document the reasoning. Hold the partner to the artifacts they showed you in the proposal.

Frequently Asked Questions

How long does an MCP partner evaluation take?

A defensible evaluation takes four to six weeks: one week for the written brief, one to two weeks for screening calls, one to two weeks for technical deep-dive and reference calls, and one week for proposals and decision review.

How many MCP partners should I evaluate?

Three to five for the written brief; two or three through the technical deep-dive; finalists submit written proposals. Fewer than three risks no real comparison; more than five becomes process overhead without proportional decision-quality benefit.

What's the most important question to ask an MCP development partner?

“Can I talk to a paying customer currently using an MCP app you shipped?” The answer (and how the partner handles the answer) tells you more than any other single question. Production reference availability is the strongest single signal in the evaluation.

How do I know if an MCP partner has real experience?

Three tests: production reference calls, specific embedding-level experience (not generic AI experience), and actual artifacts from previous engagements (auth designs, tool definitions, audit logs). Marketing claims, case studies, and pitch decks are unreliable in this category.

What should be in an MCP partner proposal?

Six elements: scope (specific tools, embedding level, target client), sequencing (week-by-week or sprint-by-sprint plan), knowledge transfer model (if you intend in-house takeover), maintenance commitment (post-launch), pricing structure (fixed-fee vs T&M vs retainer), and an articulated point of view on what you should do differently from the brief.

How much should I pay an MCP development partner?

Partner-built level-1 read-only MCP apps typically run $100K–$300K for a single client; level-2 actions apps run $300K–$700K. The cost varies with auth complexity, tool surface size, and underlying product complexity.

Should I hire an MCP agency or a freelancer?

For most production-quality MCP apps, an established MCP agency or development partner is the right choice. Freelancers can work for narrow scopes (one tool surface, no auth complexity) but rarely have the breadth across server, auth, audit, safety, and distribution that a production MCP app requires.

What if all my partner candidates fail the disqualifiers?

Two paths: extend the search to less obvious candidates (boutique product engineering firms with strong API-design backgrounds, sometimes marketed under a label other than “MCP partner”), or reconsider the build-vs-buy decision. Failing the disqualifier list across multiple candidates is a signal that the category is not yet mature enough for your specific need; in-house may be the better answer.

How do I score MCP partner proposals?

Use the 100-point scorecard: track record (25), technical depth (30), process (20), commercial (10), strategic view (15). Finalists below 65 should not be selected; above 85 is strong; 70–84 is workable with documented mitigations.

What contract terms most often cause regret post-engagement?

Two: vague knowledge-transfer deliverables (the partner did the work, your team can't take it over) and audit-log-not-surfaced (the partner built the log internally, your customers' admins can't query it). Both should be explicit, scoped, and tied to acceptance criteria in the contract.
