Most SMB operators I talk to have between 6 and 14 active AI subscriptions. They signed up during a wave of enthusiasm, or a salesperson made a compelling demo, or someone on the team said "we should try this." Now renewals hit every month and nobody is sure which tools are actually earning their fee. Cutting everything feels rash. Keeping everything feels wasteful. The problem is there is no system in place to answer the question with a number.
This guide gives you that system. By the end, you will be able to run a 30-day keep-or-cancel evaluation on any AI tool, calculate a defensible threshold number before the renewal date, and have a short script for the renewal conversation with your team or your business partner. The evaluation takes about 15 minutes of setup and 90 seconds per session to maintain.
Before you start: if you want to see how the numbers look across your whole AI stack before diving into a per-tool evaluation, the AI ROI Projection tool at /roi runs the math in about three minutes. It is a useful sanity check against the full subscription spend before you run the deeper evaluation here. We also cover the boardroom-level version of this question (the metrics a CFO or board wants to see) in the companion white paper AI ROI Defense: 6 Numbers Your Board Wants to See.
Why this matters for small business operators specifically
Large organizations have IT departments and vendor management systems that flag unused software licenses and track utilization rates quarterly. Small businesses do not. The renewal email arrives, the credit card gets charged, and the subscription continues for another year because nobody had time to review it. Over 24 months of this pattern, a 10-person company can accumulate $3,000 to $8,000 in annual AI subscription spend with only a fraction of the tools producing measurable output.
The challenge is that AI tool evaluation for small businesses has a second problem that enterprise does not: the operator is also the evaluator. The person deciding whether to keep the tool is often the same person who championed buying it, which means the emotional attachment to the purchase is in the room when the decision gets made. A structured evaluation with a threshold number removes the emotion from the conversation. The number either clears the bar or it does not.
What a keep-or-cancel evaluation actually does
An AI tool keep-or-cancel evaluation is a 30-day structured observation period followed by a threshold calculation. It is not a feature audit. It is not a comparison against competitor tools. It is a focused answer to one question: is this specific tool producing a return in excess of what it costs, measured in hours saved?
Three things make this different from the informal "do we like this tool" gut-check most teams run:
- It is logged, not remembered. Memory of tool performance is biased toward recent impressions. Logging captures the full 30 days.
- It produces a ratio, not a feeling. The ratio is defensible in a budget conversation. A feeling is not.
- It separates real signals from vanity signals. Most AI tools surface metrics that look impressive but do not translate to business value. This evaluation ignores those.
Think of it as a hiring decision with a 30-day trial period. You are not asking "is this tool impressive?" You are asking "is this tool doing the job?"
Before you start
You need:
- A spreadsheet, Notion table, or any simple log where you can record 5 fields per entry (date, tool, task type, minutes saved, output usable yes/no). Do not overthink the format. Five columns is enough.
- A specific tool you are evaluating and a specific use case you are testing it on. Pick the task where the tool was supposed to save the most time, because that is the strongest test.
- Thirty days of real work ahead, not a sprint month or a holiday week where your normal patterns are disrupted.
- 90 seconds at the end of each AI session to log the entry before you close the tool.
One thing to settle before you run client data, employee data, or financial details through any AI tool: the data protection question. Consumer-tier AI accounts typically have different data handling terms than Business-tier accounts. We have a short explanation in the compliance section below. It is not complicated, but it is non-negotiable.
The 4 signals that actually matter
The keep-or-cancel decision should rest on four signals. Everything else is noise.
Signal 1: Hours saved per month. This is the primary metric. Track time spent on the task before the tool existed (either from memory or from time-tracking if you have it) versus time spent with the tool in the loop. Log this every session. At 30 days, you will have a real average. The question is not whether the tool is fast. It is how much faster it makes a task you do regularly.
For a 30-day hours-saved log: at the end of each AI session, answer three questions in your log. What task did I just do? How long did it take with the tool? How long would it have taken without the tool? Record the difference in minutes. Do not estimate retroactively at the end of the month. Log in the moment while the task is fresh.
The daily habit is 90 seconds. By day 30, you have 20 to 40 entries and a real number to work with. The three questions above are also useful for training yourself to notice when AI is saving time versus when you are fighting the tool to get usable output.
For businesses where multiple people use the same tool, have each user keep their own log and aggregate. A 5-person team with separate logs gives you a team-wide hours-saved figure that is much more defensible than one person's impression.
Signal 2: Output usability rate. Not every AI output is usable. Some require so much editing that you would have been faster starting from scratch. Track the yes/no on whether the output was usable (meaning you could use it with minor editing, not that it was perfect). A tool with an 80 percent usability rate across 30 sessions is a keeper at the right price. A tool with a 40 percent usability rate is costing you time even if the individual outputs are impressive when they work.
Signal 3: Task fit. Does the tool solve a task you actually do repeatedly, or does it solve a task you do rarely or not at all? The most common AI tool mistake I see in small businesses is buying a tool for a use case that sounded compelling in the demo but does not match the actual workflow. A video editing AI is not valuable to a firm that produces one video per year. A legal document drafting AI is not valuable to a team that does not draft legal documents. Track whether the task fit is real across the 30 days. If you are not using the tool for the task it was sold to solve, that is the finding.
Signal 4: Workflow friction. Does the tool fit into how you already work, or does it require a context switch that breaks your flow? A tool that requires you to log into a separate browser tab, copy and paste output, reformat it, and paste it back into your actual work tool has workflow friction that eats into the hours-saved number. Log friction events in your session notes. A tool with high hours-saved but high friction will have lower-than-expected net savings once friction is accounted for.
The 3 vanity signals to ignore
AI tool marketing is good at surfacing metrics that look meaningful but do not translate to keep-or-cancel decisions.
Ignore "AI-generated outputs" counts. The number of outputs a tool produced tells you nothing about whether those outputs saved time or were usable. A tool that produced 200 drafts and required heavy editing on 180 of them is not a productive tool. Do not let a usage dashboard substitute for the hours-saved log.
Ignore "features used" metrics. Most AI tools surface a dashboard showing which features you accessed. Using five features out of twenty is not a bad sign. It means you found the ones that fit your workflow. Tools that push you to use more features to justify their cost are shifting the evaluation away from whether the tool is valuable toward whether you are using the tool fully. Those are different questions.
Ignore time-on-platform metrics. More time inside the tool is not a positive signal. It means the tool is taking up more of your attention, not less. The goal of an AI productivity tool is to free time, not occupy it. If the platform's success metric is daily active usage, and your goal is to spend less time on a task, those are in direct conflict. Evaluate the output, not the engagement.
The hours-saved-per-dollar threshold
After 30 days of logging, calculate this ratio:
Total hours saved in 30 days divided by monthly subscription cost equals hours-saved-per-dollar-per-month.
For a $50/month tool, if the tool saved 6 hours in 30 days, the ratio is 6 divided by 50 equals 0.12 hours saved per dollar. Multiply by the fully-loaded hourly cost of the person using the tool (salary plus benefits plus overhead) and you get the dollar return per dollar spent.
A 25-dollar-per-hour employee saving 6 hours per month on a $50 tool produces $150 in labor-time return on a $50 investment. That is a 3x return. Any ratio above 2x is a keeper. A ratio below 1x means the tool is costing more than it saves in labor time.
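If you want to script the threshold check rather than run it by hand, here is a minimal sketch of the same arithmetic in Python. The function name and inputs are illustrative, not part of any tool mentioned in this guide; the 1x and 2x cut points mirror the thresholds above.

```python
def keep_or_cancel(hours_saved_30d: float, monthly_cost: float,
                   loaded_hourly_rate: float) -> dict:
    """Apply the hours-saved-per-dollar threshold to a 30-day log total."""
    hours_per_dollar = hours_saved_30d / monthly_cost      # e.g. 6 / 50 = 0.12
    dollar_return = hours_saved_30d * loaded_hourly_rate   # labor-time value recovered
    return_multiple = dollar_return / monthly_cost         # the 2x threshold applies here
    if return_multiple >= 2.0:
        verdict = "keep"
    elif return_multiple >= 1.0:
        verdict = "borderline: diagnose usage vs. fit, rerun 30 days"
    else:
        verdict = "cancel"
    return {"hours_per_dollar": round(hours_per_dollar, 2),
            "return_multiple": round(return_multiple, 2),
            "verdict": verdict}

# Worked example from the text: $50/month tool, 6 hours saved, $25/hour labor
print(keep_or_cancel(6, 50, 25))   # return_multiple = 3.0 -> keep
```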
The AI ROI Projection tool runs this calculation across multiple tools at once, with fully-loaded labor cost inputs. It takes about 3 minutes and produces a stack-ranked view of your AI subscription spend by return. Use it at the 30-day mark, not the beginning, so the inputs are based on real logged data rather than estimates.
One important note: the threshold calculation assumes the time saved translates to productive work elsewhere, not to recovered idle time. If your team is already fully utilized and a tool saves 6 hours per month per person, those 6 hours go to higher-value work. If your team has available capacity, the 6 hours may not produce additional revenue. The threshold calculation works differently in each case. Be honest about which situation applies.
The 30-day instrumentation
Setting up the log correctly at the start of the 30 days makes the end-of-month calculation fast. Here is the structure that works:
Set up a 5-column log named after the tool you are evaluating:
- Column 1: date.
- Column 2: task type. Use a consistent vocabulary ("email drafting", "meeting summary", "report writing"), not a different label every time.
- Column 3: minutes spent with the tool on this task.
- Column 4: estimated minutes without the tool. Set this baseline once at the beginning, based on your real pre-AI time for each task type, then do not change it.
- Column 5: output usable, yes or no.
After 30 days, subtract Column 3 from Column 4 for each entry and sum the differences by task type to get total minutes saved. Divide by 60 for hours, then run the threshold calculation.
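If the log lives in a spreadsheet, the end-of-month summary is a one-export job. Here is a minimal sketch in Python, assuming the log is exported as a CSV; the file name and header names are hypothetical stand-ins for your own columns.

```python
import csv
from collections import defaultdict

# Minutes saved (Column 4 minus Column 3) and usability flags, by task type
minutes_saved = defaultdict(float)
usable_flags = defaultdict(list)

with open("tool_log.csv", newline="") as f:   # hypothetical export of the 5-column log
    for row in csv.DictReader(f):
        task = row["task_type"]
        minutes_saved[task] += float(row["baseline_minutes"]) - float(row["minutes_with_tool"])
        usable_flags[task].append(row["usable"].strip().lower() == "yes")

for task, minutes in minutes_saved.items():
    rate = 100 * sum(usable_flags[task]) / len(usable_flags[task])
    print(f"{task}: {minutes / 60:.1f} hours saved, {rate:.0f}% usable")

print(f"Total: {sum(minutes_saved.values()) / 60:.1f} hours saved in 30 days")
```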
The consistent vocabulary in Column 2 matters because it lets you see whether the tool is outperforming on some task types and underperforming on others. A writing tool that saves 45 minutes per email drafting session but saves 3 minutes per report might be worth keeping for email-heavy roles and canceling for report-heavy roles. Task-level data makes that visible. Aggregate data hides it.
For teams of more than one user, assign one person to consolidate the logs at day 30. That person runs the threshold calculation and presents the summary. The 15 minutes of consolidation at the end of the month is the only additional overhead the system requires.
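The day-30 consolidation can reuse the same export. A sketch, assuming each person drops their CSV into a shared logs/ folder with the same hypothetical headers as the single-user example above:

```python
import csv
from collections import defaultdict
from pathlib import Path

hours_by_user = defaultdict(float)   # hours saved per user, from logs/<name>.csv

for path in Path("logs").glob("*.csv"):
    with path.open(newline="") as f:
        for row in csv.DictReader(f):
            hours_by_user[path.stem] += (float(row["baseline_minutes"])
                                         - float(row["minutes_with_tool"])) / 60

for user, hours in sorted(hours_by_user.items(), key=lambda kv: -kv[1]):
    print(f"{user}: {hours:.1f} hours saved")   # feeds the per-seat renewal decision
print(f"Team total: {sum(hours_by_user.values()):.1f} hours saved in 30 days")
```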
The renewal conversation script
When the renewal date arrives, this is the conversation structure that works, whether you are talking to a business partner, a department head, or yourself:
"We ran [tool] for 30 days. It saved [X] hours at a cost of [Y] per month. The usability rate was [Z] percent. Based on our fully-loaded labor cost of [W] per hour, the tool returned [dollar amount] in labor-time savings on a [Y] investment. [Keep: That is above our 2x threshold, I am renewing.] [Cancel: That is below our threshold, I am canceling before the renewal date.]"
That is the whole script. It is short because the number does the talking. The most common renewal conversation mistake is bringing the feature list into the room. Features do not justify subscription costs. Returns do.
If the tool is borderline (ratio between 1x and 2x), the conversation has one additional component: is there a usage problem (the tool was underused, and a different workflow or a different team member would push it over the threshold) or a fit problem (the tool was used correctly, it just does not match the business well enough). Usage problems get one more 30-day run with a specific change. Fit problems get canceled.
For tools where multiple people share the subscription, bring the individual logs to the renewal conversation. The conversation is different when one person is at 4x return and another is at 0.5x return. Right-sizing the seat count based on individual performance data is a more precise outcome than keeping or canceling the whole subscription.
The AI-evaluation prompts that actually work
The evaluation framework itself benefits from AI assistance, specifically for summarizing the log at the end of 30 days and drafting the renewal recommendation.
Specify what you logged, not what you felt. Paste your 30-day log into the AI tool and ask for a summary by task type. "Summarize this log. Show me total hours saved by task type, average usability rate by task type, and flag any task types where the usability rate was below 60 percent." The AI reads the log; you read the summary. That is a faster path to the threshold calculation than doing it by hand.
Specify the threshold before you ask for the recommendation. When asking AI to draft a renewal recommendation memo, include the threshold in the prompt. "Our keep threshold is 2x return on monthly cost. Based on the attached log summary, draft a one-paragraph renewal recommendation for [tool name]. Include the actual ratio, the hours-saved total, the monthly cost, and a clear keep or cancel recommendation." AI with a clear threshold produces a clear recommendation. AI without a threshold produces a balanced pros-and-cons memo that leaves the decision where you started.
Specify the audience of the memo. If the recommendation goes to a business partner who is skeptical of AI spending, say so. "The audience is my business partner who is skeptical of AI investment. Lead with the number, not the features." That framing shifts the output from a tool evaluation to a budget argument, which is what the conversation actually needs.
Specify what you want to do next. If the tool passes the threshold and you want to expand usage, say so in the prompt. "The tool passed our threshold. Draft a one-page onboarding note for the two team members who have not used it yet. Include the three tasks where it performed best and the one task where it underperformed." AI is useful on the follow-through, not just the decision.
The small business compliance non-negotiables
This section is short because the rule is simple, but it is the most important section in this guide.
Do not put any of the following into the consumer tier of any AI tool during your evaluation:
- Client names, contact information, or any personally identifiable information about customers or prospects
- Employee records, performance reviews, pay data, or anything that touches employment law
- Financial statements, bank account details, or proprietary business financials
- Any contract, agreement, or document with a confidentiality clause that restricts third-party disclosure
- Any data covered by a nondisclosure agreement the business has signed
- Vendor agreements, pricing terms, or other competitively sensitive business information
The practical workflow that respects these rules: run the 30-day evaluation using anonymized tasks wherever possible. "Draft a follow-up email to a client who missed a deadline" does not require the client's name or any identifying detail. "Summarize this sales pipeline" does not require real prospect names. Build the evaluation on the type of task, not the real underlying data. Once you decide to keep a tool, move to the Business or Teams tier before running real client or employee data through it, and confirm that the vendor offers a Data Processing Agreement before proceeding.
For employment-related AI use specifically, general compliance hygiene for small businesses includes being careful about any AI tool that makes or informs hiring, compensation, or performance decisions. Employment law varies by state, and several states have passed or are passing AI-in-hiring regulations that require disclosure. If you are evaluating an AI tool for HR tasks, check whether your state has specific requirements before the tool goes live on real employment decisions.
If your business has signed a Business or Enterprise agreement with a major AI vendor that includes a Data Processing Addendum, the rules on data handling are different. Ask your operations lead or legal advisor what is covered under that agreement. Do not assume the consumer-tier terms apply or that the Business tier automatically covers all your use cases.
When NOT to use the AI tool evaluation framework
The 30-day framework is the right answer for most AI subscription decisions. There are four scenarios where it is not.
- Any tool where the task is too infrequent to generate 30-day data. If you use the tool once a month, the 30-day log will have one data point. That is not enough to calculate a reliable average. Extend the evaluation to 90 days, or wait until you have 10 or more sessions before running the threshold calculation.
- Any tool where the value is primarily in risk reduction, not time savings. Compliance tools, security tools, and backup systems do not save hours on a regular basis. They save the business from a catastrophic event. The hours-saved calculation is the wrong frame. Evaluate these on incident-prevention value and replacement-cost logic instead.
- Any tool where the team is in active onboarding and has not reached baseline competency yet. A 30-day trial during the learning curve will understate the tool's long-term value. If the tool requires significant training, evaluate it after the team has completed onboarding, not during.
- Any tool where the decision has already been made at an organizational or procurement level. If the business is committed to a vendor through a multi-year contract or a bundle agreement, the keep-or-cancel framework applies to the renewal, not the current term. Use the 30 days to document the performance case for the renewal negotiation instead.
A simple rule: the 30-day evaluation is the right tool for the 80% of AI subscriptions where the primary value claim is time savings on a recurring task. Use a different frame for the 20% where the value is insurance, compliance, or organizational mandate.
The quick-start template
Here is the prompt scaffold for the end-of-30-day-log summary. Paste your log into the AI tool, then use this prompt:
Review the attached 30-day log for [tool name]. The log includes date, task type, minutes with the tool, baseline minutes without the tool, and usability rating (yes/no). Calculate: total hours saved across all sessions, average hours saved per session, usability rate overall and by task type, and any task types where the usability rate was below 70 percent. Then apply this threshold: the tool costs [X] per month and our fully-loaded labor rate is [Y] per hour. Is the return above or below 2x monthly cost? Output a one-paragraph summary I can use in a renewal conversation.
For recurring monthly reviews after you decide to keep a tool, adjust the prompt to a rolling-90-day log. The 30-day evaluation is for the keep-or-cancel decision. The 90-day rolling review is for catching performance decay before the next annual renewal.
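If you keep the CSV export going, the rolling review is the same calculation over a trailing window. A sketch, assuming ISO-formatted dates (YYYY-MM-DD) in the same hypothetical log file used in the earlier examples:

```python
import csv
from datetime import date, timedelta

cutoff = date.today() - timedelta(days=90)
hours_saved = 0.0

with open("tool_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        if date.fromisoformat(row["date"]) >= cutoff:
            hours_saved += (float(row["baseline_minutes"])
                            - float(row["minutes_with_tool"])) / 60

# Normalize to a monthly figure so the same 2x threshold applies
print(f"Trailing 90 days: {hours_saved:.1f} hours saved "
      f"({hours_saved / 3:.1f} hours/month)")
```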
Bigger wins beyond the individual tool decision
Build a portfolio view of the full AI stack. Once you run the 30-day evaluation on two or three tools, you have the inputs for a stack-level analysis. Which tools have the highest return? Which are borderline? Which are actively costing more than they save? The AI ROI Projection tool takes individual tool evaluations and produces a ranked stack view in minutes. The stack view reveals something the per-tool evaluation does not: whether the business is over-indexed on a category (four writing tools, zero operations tools) or under-spending in an area with high potential return.
Turn the evaluation into an onboarding protocol for future tools. Every new AI tool the business evaluates from this point forward goes through the same 30-day framework. That means you accumulate a library of evaluation logs over time, which becomes the institutional memory for AI tool performance. When someone proposes a new tool, you can check whether you evaluated a similar one already and what the result was. That library prevents repeat mistakes.
Use the renewal conversation script to negotiate pricing. AI vendors want to keep business customers. A renewal conversation backed by a performance log and a clear threshold number is a stronger negotiating position than a gut-feel conversation. If the tool is borderline (1.5x return, not 2x), the vendor has a reason to offer a lower price to keep the account. The log gives you the negotiating position in that conversation. A 20 percent price reduction on a borderline tool often clears the threshold.
Read AI ROI Defense: 6 Numbers Your Board Wants to See before the next board or partner meeting. The 30-day evaluation produces the per-tool data. The white paper explains how to roll that data up into the board-level metrics that investment-minded stakeholders actually want: total AI spend as a percentage of payroll, aggregate labor-hours returned, fully-loaded ROI across the stack. If you are presenting AI spending to a board or investor, the per-tool evaluation is the input, not the presentation.
The small business AI consulting connection
The keep-or-cancel framework is one tool in one category. The bigger question for small businesses is not which AI subscriptions to keep. It is whether the business has an AI adoption strategy that is building toward something or just accumulating tools. A pile of individual AI subscriptions with no coherent workflow is expensive and produces fragmented results. A small number of well-integrated tools, evaluated against real performance data, and connected to the work the business most needs to improve is a different proposition entirely.
If your business is at the point where individual tool decisions are adding up to a broader question about AI direction, the AI Consulting for Small Business page covers the full picture: what an AI strategy for a small business looks like, where most SMB operators get stuck, and how an engagement with a consulting partner is structured.
Closing
The goal of this evaluation is not to cut tools for the sake of cutting. It is to get honest about which parts of the AI stack are producing returns and which are collecting renewal fees. Done once, the framework produces one decision. Done quarterly, it becomes the system that keeps the AI stack purposeful and the budget defensible. Start with the tool you are least sure about. Log 30 days. Run the calculation. Make the call.
If you want to talk about how AI fits into your business at the strategy level, not just the subscription level, the AI Consulting for Small Business page lays out the full picture and how an engagement works.
Let's talk about your AI + SEO stack
If you'd rather skip the how-to and have it shipped for you, that's what I do. Start a conversation and we'll figure out the fastest path to results.