Categorization AI

2023-2024

Overview

Bench was the largest bookkeeping service for small businesses in America. I led the redesign of Bench’s transaction categorization, transforming a high-touch service into a scalable, AI-assisted system. By introducing conversational AI and smart bulk-categorization, we delivered books 22% faster and maintained 95%+ accuracy, while giving customers more control, trust, and time savings.

Company

Bench Accounting

Role

Principal Product Designer
End-to-end UX, UI, conversational design, service design, research

tl;dr

Problem

Bench's categorization relied in large part on bookkeeping specialists manually reviewing transactions. This created bottlenecks, delays, and errors. Customers felt frustrated with slow books and limited control.

Solution

  • Categorization Assistant: Conversational AI guiding complex categorizations and verifications in real time
  • Similar Transactions: Smart grouping to reduce repetitive categorizations
  • Service redesign: Redesigned system for more customer agency and redefined bookkeeper/specialist roles for smooth AI-human handoffs

Results

230,000
transactions categorized via AI conversations in 3 months (equivalent to 30,000 hours of human bookkeeping)

67%
time savings. Customers cut time spent categorizing from ~30 min to ~10 (per customer feedback)

22%
increase in monthly book completions

95%+
categorization accuracy

0.1%
dropoff rate indicating strong customer trust and adoption

Problem

Manual categorization wasn't scalable

By 2023, categorization at Bench had become a bottleneck. A small group of specialists was handling tens of thousands of ambiguous transactions each month, often without full customer context. Customers could only pick from limited pre-approved categories or leave notes for review, then wait days for follow-up. This created delays, errors, customer frustration, and workarounds. Some customers even built parallel spreadsheets, signaling an erosion of product value.

At the same time, trust was non-negotiable. Customers chose Bench for tax accuracy, and operations worried that opening up categorization risked compliance errors.

Customer painpoints along the categorization experience
Solution

AI-assisted, human-verified

As Principal Product Designer, I redesigned Bench's transaction categorization system, balancing automation with trust. This was a key initiative to scale the service and enable lower-cost tiers.

Categorization Assistant

  • AI assistant helps customers categorize complex transactions in real time
  • Clarifying questions ensure tax compliance, with bookkeeper review in background
  • Immediate results reduce customer wait times from days to minutes

Similar Transactions

  • Surface groups of similar transactions for customers to action at once
  • Trains algorithm to auto-categorize future transactions
  • Checkbox UI allows easy review and adjustment
Screenshot of Bench Categorization Assistant escalation path and part of service blueprint

Service redesign

  • Redefined workflows: AI handles first pass; specialists review exceptions; bookkeepers ensure final accuracy
  • Clear in-flow messaging reassures bookkeeper oversight and escalation paths
Process

Building the AI-integrated categorization system

My role spanned end-to-end execution across customer research, prototyping, conversation design, service design, and collaboration with PMs, engineers, bookkeepers, and AI specialists. I redesigned both customer and internal workflows.

Discovery

  • Conducted customer and bookkeeper interviews to uncover pain points

  • Analyzed uncategorized transaction data to identify automation opportunities

  • Reviewed customer chat transcripts to surface trust gaps

Categorization-related customer problems discovered through interviews and analysis of customer chats and in-app feedback

Conversation design

Partnered with product, operations, and engineering to build an AI-powered agent using licensed GPT models within Bench's secure infrastructure:

  • Ledger mapping: Worked with ops to map transaction types to ledgers (categories) with example transactions, exception cases, and follow-up questions
  • Prompt design: Worked with AI/ML engineers to test and refine how the assistant would generate responses
  • Escalation rules: Identified the ambiguous cases and error conditions that should escalate to human review
  • Voice & tone: Worked with brand/marketing to give the assistant a clear, supportive tone and to design in-flow messaging that reassured customers their bookkeeper would still review categorizations.
Supporting prompt design, including role, tone, routing logic
Mapping dialog flow that AI follows to categorize, verify, or escalate a transaction
Refining ledger reference consumed by AI to determine appropriate category
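
The ledger mapping and escalation rules described above can be pictured as a small reference table plus a three-way decision: categorize, verify, or escalate. The sketch below is illustrative only. The production assistant used GPT models, whereas this stand-in uses plain keyword matching, and `LedgerEntry`, `decide`, and the "Software Subscriptions" ledger are hypothetical names that do not come from Bench's system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    CATEGORIZE = "categorize"  # assign the ledger directly
    VERIFY = "verify"          # ask a clarifying question first
    ESCALATE = "escalate"      # hand off to a human specialist


@dataclass
class LedgerEntry:
    """One row of a hypothetical ledger reference."""
    name: str
    examples: List[str]              # example vendor/description keywords
    follow_up: Optional[str] = None  # clarifying question for sensitive ledgers


def decide(description: str,
           ledgers: List[LedgerEntry]) -> Tuple[Action, Optional[LedgerEntry]]:
    """Keyword stand-in for the model: exactly one match or escalate."""
    text = description.lower()
    matches = [l for l in ledgers if any(e in text for e in l.examples)]
    if len(matches) != 1:
        # No match or an ambiguous match goes to human review.
        return Action.ESCALATE, None
    ledger = matches[0]
    if ledger.follow_up:
        return Action.VERIFY, ledger
    return Action.CATEGORIZE, ledger


# Hypothetical reference data for illustration.
LEDGERS = [
    LedgerEntry("Software Subscriptions", ["github", "figma"]),
    LedgerEntry("Business Meals Expense", ["restaurant", "cafe"],
                follow_up="Was this meal with employees or clients?"),
]
```

Keeping the follow-up question on the ledger entry itself means operations can adjust compliance questions by editing reference data rather than the routing logic.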

Service blueprinting

Collaborated with operations teams to rethink how AI, specialists, and bookkeepers worked together behind the scenes. This resulted in redefined roles, where AI handled first-pass categorization; specialists reviewed edge cases/escalations; bookkeepers oversaw overall categorization consistency and quality reviews. I also mapped escalation paths, which ensured unclear or sensitive transactions were routed back to human specialists.

Service blueprinting bookkeeper, specialist ('Cat team'), AI, and customer interactions, to redefine how the service will work with AI integration

Design & prototyping

  • Scoping MVP: Worked with Product & Eng to define MVP vs. future vision; prioritizing launch-critical features while mapping long-term improvements
  • Prototyping flows: Tested variations to validate usability
  • Technical collaboration: Worked with eng to scope what the models could support, simplifying designs to ship faster
Given tight delivery pressures, we needed to ship an MVP quickly. I worked with my team to visualize what the smallest MVP could be and how the flow would evolve from MVP to future iterations.

Testing

  • Internal testing: Ran early prototypes with bookkeepers, using real transactions to evaluate the AI’s suggestions. Confusing or inaccurate outputs were logged to identify gaps in training data and refine prompts
  • Alpha testing: Conducted limited tests with a small group of customers to observe initial interactions and surface any usability or trust issues
  • Beta testing & segmented rollout: Incrementally rolled out the feature to larger segments of customers in beta, allowing us to monitor behaviour, track adoption, and collect qualitative feedback. This helped us catch edge cases, tune responses and flows, and adjust before broader releases
Prototypes used for alpha testing the conversational AI experience, simulated with customers' real transactions
To gauge customers' acceptance of AI early on (before we had a working GPT prototype), I used Figma prototypes to mimic a live chat and seeded them with each tester's real transactions.
We logged AI categorizations and regularly monitored them to ensure accuracy and improve the model

Constraints

  • Tech: Accuracy of AI/ML models required iterative refinement

  • Organizational: Balancing executive enthusiasm for ChatGPT-style AI interactions with practical user needs

  • Resources: Company restructuring and financial pressures accelerated AI delivery timelines and required shipping MVPs before polish

  • Trust: System, design, and messaging needed to uphold accuracy and emphasize bookkeeper oversight

Feature Deep Dive

Categorization Assistant

Before

For unclear/complex categorizations, customers had to leave notes to bookkeepers, leading to slow turnaround and forgotten context.

After

Conversational AI guided customers in real time with vendor-based prompts, clarifying questions, and tax-related verifications. Bookkeepers still retained oversight.

Categorization flow before and after the Categorization Assistant

Not every problem needs a chatbot

The initial redesign moved all categorization into a chat flow. Testing and user feedback quickly revealed issues:

  • Chat can be cognitively heavy when customers want quick selections
  • Because the AI agent had to load inside Bench’s microfrontend architecture, the latency was a noticeable issue
  • AI’s responses were variable in length and clarity, creating uneven interactions
Early iteration
Early design for Categorization Assistant. With selectable category bubbles, the UX was meant to make transitioning between selecting categories and free-form comments seamless, but in reality, the latency and variability made this feel slow and confusing.
Example feedback from early testers. This and other feedback prompted us to refine AI's 'chattiness' and add selection-first flow
Visualization of the latency in the chat-first flow (red shows seconds wait time between each step)

These insights reframed how we thought about the utility of chat. Instead of making chat the primary interface, I redesigned it as a supporting layer, best used when customers needed extra guidance or compliance checks. Everyday categorizations moved back into a selection-first flow, where speed and predictability mattered most.

Final flow
Showing the categorization flow where category selections are moved out of the chat. Free-form text triggers the Categorization Assistant to identify the category; and some sensitive categories also trigger the Categorization Assistant to verify the selection
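
The routing in the final flow can be summarized in a few lines. This is a minimal sketch under stated assumptions, not the shipped logic: the function name `route`, the returned step labels, and the contents of `SENSITIVE_CATEGORIES` are hypothetical, and the choice to let free-form text take precedence over a selection is an assumption made for illustration.

```python
from typing import Optional

# Hypothetical set of categories that trigger a verification step.
SENSITIVE_CATEGORIES = {"Business Meals Expense"}


def route(selected_category: Optional[str], free_text: Optional[str]) -> str:
    """Return the next step for one transaction in a selection-first flow."""
    if free_text and free_text.strip():
        # Free-form text: the assistant identifies the category.
        return "assistant_identify"
    if selected_category in SENSITIVE_CATEGORIES:
        # Sensitive selection: the assistant verifies it with a question.
        return "assistant_verify"
    if selected_category:
        # Everyday selection: apply immediately, no chat involved.
        return "apply_selection"
    return "await_input"
```

Only the first two branches involve the assistant, which is the point of the redesign: the common case never waits on a chat round trip.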

With industry hype around conversational AI, it's easy to think of chatbots as the answer. Here are some alternate explorations on how AI might support categorization beyond the chatbot interface:

Moving verifications out of chat: What if AI can generate differentiating options directly in the flow? Example: asking "Was this meal with employees or clients?" as a set of buttons, not a chat thread.
Pattern recognition from past comments: What if AI can use customers' past inputs to infer categories?
AI as an agent inside bookkeeper threads: Rather than separating the experience, AI could be suggested/added as a participant in bookkeeper comment threads
Feature Deep Dive

Similar Transactions

Before

New and catch-up customers manually categorized repeat vendors until enough data was available to auto-categorize them.

After

AI surfaced contextual groups right after a single categorization, enabling bulk action with one click. Bookkeepers saw audit trails showing “source transaction” to ensure accuracy.

Categorization flow before and after the Similar Transactions feature

Progressive disclosure builds trust

One early idea for Similar Transactions was a design that grouped similar transactions by vendor directly on the transactions screen. While theoretically valuable for vendor-based analyses (e.g., calculating expenses by vendor), testing showed that when the AI grouping made mistakes, they were too visible and persistent, leaving customers feeling responsible for fixing them.

Lo-fi exploration/design of pre-grouped approach, which was tested and did not make it into the final product

I shifted to a contextual design, where groups appeared only after a customer categorized one transaction. In this model, errors could be skipped or unchecked and then disappeared without consequence. By revealing automation only when relevant, the feature felt like a lightweight assist instead of extra work.

This progressive disclosure not only reduced cognitive load and hid AI mistakes gracefully, but also helped customers build trust in the system (and was faster to ship!).
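
The contextual model can be sketched as two steps: find uncategorized transactions that resemble the one just categorized, then apply the category to whichever ones the customer leaves checked, recording the source transaction for the bookkeeper's audit trail. This is an illustrative sketch only. Matching on a normalized vendor name is an assumption standing in for whatever similarity logic the real feature used, and `Transaction`, `similar_uncategorized`, and `apply_to_group` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Transaction:
    id: str
    vendor: str
    category: Optional[str] = None
    source_id: Optional[str] = None  # set when categorized via a group


def similar_uncategorized(source: Transaction,
                          transactions: List[Transaction]) -> List[Transaction]:
    """Surface a group only after `source` has been categorized."""
    key = source.vendor.strip().lower()
    return [
        t for t in transactions
        if t.id != source.id
        and t.category is None
        and t.vendor.strip().lower() == key
    ]


def apply_to_group(source: Transaction,
                   group: List[Transaction],
                   unchecked_ids: FrozenSet[str] = frozenset()) -> List[Transaction]:
    """Copy the category to checked items; unchecked ones are left untouched."""
    applied = []
    for t in group:
        if t.id in unchecked_ids:
            continue  # a wrong suggestion is skipped without consequence
        t.category = source.category
        t.source_id = source.id  # audit trail back to the source transaction
        applied.append(t)
    return applied
```

Because the group is computed on demand and never stored, an incorrect suggestion simply disappears once the customer unchecks it.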

Making automation audit-friendly

Designing internal workflows was just as important as the customer experience. For example, on the bookkeeper platform, transactions categorized through Similar Transactions were clearly flagged with their source. Bookkeepers could see which transaction provided the reasoning, ensuring tax justification and compliance.

Design specs for internal bookkeeper tools
Impact

Results

Customers noticed the difference. Here's how one customer, Patrick McKenzie, described the value Bench delivered:

Screenshots 1-3 of Twitter post by Patrick McKenzie describing the value of Bench categorization AI

By the numbers

Customer impact

  • 22% increase in monthly book completions, reducing delays/waiting for bookkeeper action
  • 0.1% dropoff rate in Categorization flow indicating customer trust and adoption
  • 67% less time categorizing, or ~20 minutes saved per session, based on customer feedback on time spent categorizing
  • 95%+ accuracy maintained, preserving trust in compliance-critical process. Accuracy is tracked based on changes or lack thereof between AI categorization and bookkeeper review
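
For readers who want to see how two of these figures are derived, here is a minimal sketch. The time-savings figure is plain arithmetic on the ~30 and ~10 minute estimates. The accuracy function assumes accuracy means the share of AI categorizations left unchanged after bookkeeper review, which matches the description above, but the function names and the sample data shape are hypothetical.

```python
from typing import Sequence


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


def agreement_rate(ai_categories: Sequence[str],
                   reviewed_categories: Sequence[str]) -> float:
    """Share of AI categorizations unchanged by bookkeeper review, in percent."""
    if len(ai_categories) != len(reviewed_categories):
        raise ValueError("inputs must be the same length")
    if not ai_categories:
        raise ValueError("no categorizations to compare")
    unchanged = sum(a == r for a, r in zip(ai_categories, reviewed_categories))
    return unchanged / len(ai_categories) * 100


# ~30 minutes down to ~10 minutes is roughly a 67% reduction.
time_saved = round(percent_reduction(30, 10))
```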

Business impact

  • 230,000 transactions auto-categorized in 3 months with Categorization Assistant, equivalent to ~30,000 hours of human bookkeeping or 37% of work done by specialist human teams in same period
  • 55% increase in the rate of customers' self-serve categorizations upon deployment of Similar Transactions, saving ~110 workdays per month
  • At full adoption, customers categorized ~130,000 transactions in a month using both features with 75% requiring no internal action
Next steps

Future opportunities

Although Bench's closure paused further development, I'd been keen on some next steps for improving the experience based on research of chat logs, in-app surveys, and customer interviews:

Smarter AI with industry context

Adding business context would make categorizations feel smarter and more intuitive: e.g., for a coffee shop, "coffee" would be categorized as Cost of Goods Sold rather than Office Kitchen Expense or Business Meals Expense

Training on past conversations

Using past customer comments and bookkeeper decisions to inform a customer-specific model. If a customer had commented “Sarah” and the bookkeeper categorized it as “Professional Services Expense”, AI could establish this pattern and auto-categorize whenever “Sarah” is input
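
One simple way to picture this idea is a per-customer lookup built from past comment and category pairs. The sketch below is hypothetical and was never part of the product: `build_comment_patterns` and the `min_count` threshold are illustrative names, and the rule of only learning a pattern when the bookkeeper's decisions were consistent is an assumption chosen to keep the example conservative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_comment_patterns(history: Iterable[Tuple[str, str]],
                           min_count: int = 2) -> Dict[str, str]:
    """Map a normalized customer comment to a category.

    A pattern is kept only if the comment was seen at least `min_count`
    times and always resolved to the same category.
    """
    counts = defaultdict(Counter)
    for comment, category in history:
        counts[comment.strip().lower()][category] += 1

    patterns = {}
    for comment, categories in counts.items():
        top_category, seen = categories.most_common(1)[0]
        if seen >= min_count and len(categories) == 1:
            patterns[comment] = top_category
    return patterns
```

Comments that resolved to different categories in the past are deliberately left out, so they would still go through the assistant or a bookkeeper.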

One place for every transaction

Unifying all categorization-related actions (notes, document uploads, bookkeeper follow-ups, and tax verifications) into a single thread for clarity, ease of resolution, and tracking of context for auditability