
Tool2Vec: a smarter way to find the right tool for your LLM agent

If you've been building LLM agents, you've probably run into this problem before. Your agent needs to call the right tool out of potentially thousands of available ones. You can't just dump all the tool descriptions into the context window, because that gets expensive fast and starts hurting performance. So you need some kind of retrieval system that picks the relevant tools before passing them to the model.

The standard approach is to embed the tool descriptions and do a similarity search against the user query. Sounds reasonable. But there's a catch: tool descriptions are written like documentation, and user queries are written like... how people actually talk. Those two things live in very different parts of the vector space, and that gap causes a lot of retrieval misses.

A paper from UC Berkeley tries to fix this with two things: Tool2Vec and ToolRefiner.

The core idea behind Tool2Vec

The insight here is simple but effective. Instead of embedding the tool's description text, you embed example user queries that were historically used with that tool.

So say you have a tool called find_email_address. Instead of embedding "This tool searches for and returns the email address of a given person", you collect all the user queries that led to this tool being called, embed each of those, and average them together. That average becomes your tool's embedding.

The result is that your tool embeddings now sit right in the middle of where real user queries land in vector space. When a new query comes in, it's naturally much closer to the Tool2Vec embeddings than to description-based ones. They visualized this with t-SNE and you can literally see the Tool2Vec embeddings sitting at the centroids of the query clusters.
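To make the idea concrete, here's a minimal sketch of the whole loop. The `embed` function is just a toy stand-in for a real sentence encoder (hashing words into a bag-of-words vector), and the usage logs are made up, but the Tool2Vec part is exactly the averaging trick described above:

```python
import hashlib

import numpy as np

def _slot(word, dim):
    # Deterministic hash -> bucket index (toy feature hashing).
    return int.from_bytes(hashlib.md5(word.encode()).digest()[:4], "big") % dim

def embed(text, dim=64):
    # Stand-in for a real sentence encoder: hash words into a
    # bag-of-words vector and L2-normalize. Purely illustrative.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[_slot(word, dim)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical usage logs: tool name -> queries that historically called it.
usage_logs = {
    "find_email_address": [
        "what's the email for Jane Doe",
        "get me John's email address",
        "look up the contact email of our CFO",
    ],
    "schedule_meeting": [
        "set up a call with the design team tomorrow",
        "book a 30 minute meeting with Jane",
    ],
}

# Tool2Vec: a tool's embedding is the mean of its historical query embeddings.
tool_vecs = {
    tool: np.mean([embed(q) for q in queries], axis=0)
    for tool, queries in usage_logs.items()
}

# Retrieval: cosine similarity between a new query and each tool embedding.
def retrieve(query, k=1):
    q = embed(query)
    scores = {
        tool: float(q @ v / (np.linalg.norm(v) + 1e-9))
        for tool, v in tool_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Because a new query like "find the email address of Jane Doe" shares vocabulary with the logged queries rather than with documentation prose, it lands near the averaged embedding and retrieval just works.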

It's one of those "why didn't anyone do this earlier" ideas.

ToolRefiner: the second stage

Even with better embeddings, retrieval can get messy when you're dealing with a huge number of tools. A lot of tools do similar things, and simple vector search struggles to differentiate between them when the query is ambiguous or complex.

This is where ToolRefiner comes in. The system works in two stages:

  1. A fast retriever (Tool2Vec or a multi-label classifier) narrows the pool down from thousands of tools to a small candidate set of around 32 to 64 tools.
  2. ToolRefiner then takes the user query and the Tool2Vec embeddings of those candidates, and runs them all through a fine-tuned DeBERTa-V3-xsmall model in a single forward pass to produce the final ranked list.

The key thing here is the "single forward pass" bit. Typical rerankers score each (query, tool) pair independently, so the model never sees the candidates side by side and can't reason about how they relate to each other. ToolRefiner processes all candidates together, so it can pick up on interactions between similar tools and figure out which combination actually makes sense for the query.
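This is obviously not DeBERTa, but a tiny numpy self-attention layer with random (untrained) weights shows the structural difference: the query and all candidates go in as one sequence, so every candidate's score is conditioned on the query and on the other candidates. The shapes, not the numbers, are the point here:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # embedding dimension (arbitrary for this sketch)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One toy self-attention layer standing in for the transformer: every token
# attends to every other, so each candidate's representation sees the query
# AND its competitors -- the "single forward pass" idea.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w_score = rng.normal(size=d) * 0.1  # scoring head (untrained, illustrative)

def refine(query_vec, candidate_vecs):
    # Sequence = [query token, candidate tokens]; one pass scores them all.
    x = np.vstack([query_vec, candidate_vecs])          # (1 + n, d)
    attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d))  # (1 + n, 1 + n)
    h = attn @ (x @ Wv)                                 # contextualized reps
    return h[1:] @ w_score                              # one score per candidate

query_vec = rng.normal(size=d)
candidates = rng.normal(size=(8, d))   # e.g. the candidate set from stage one
scores = refine(query_vec, candidates)
ranking = np.argsort(-scores)          # final ranked candidate list
```

A pointwise reranker would instead call something like `refine(query_vec, candidate)` once per candidate, and the attention matrix would never contain two tools at the same time.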

The model itself is only 22M parameters, so it's fast and cheap to run.

Results

They tested on ToolBench and a new dataset they built called ToolBank. ToolBank is more interesting to me because they used Llama-3-70B to generate queries that follow natural tool co-occurrence patterns, so the queries feel more like what a real user would ask rather than the more robotic queries in ToolBench. In a head-to-head evaluation, ToolBank queries had a 60% win rate on naturalness and fluency over ToolBench.

On the hardest split of ToolBench, the framework hit a Recall@3 of 75.23, compared to the ToolBench Retriever's 54.07. That's a pretty big jump. On ToolBank's domain-specific tasks (Numpy, Pandas, AWS), they saw up to a 30.5-point improvement in Recall@K over description-based retrieval.
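For reference, Recall@K here measures the fraction of a query's ground-truth tools that show up in the top K retrieved. The paper averages it over queries; the per-query version is a one-liner:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant tools that appear in the top-k retrieved."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Made-up example: 2 of the 3 ground-truth tools appear in the top 3.
retrieved = ["tool_a", "tool_b", "tool_c", "tool_d"]
relevant = ["tool_a", "tool_c", "tool_x"]
```

So a Recall@3 of 75.23 means that, on average, about three quarters of the tools a query actually needs make it into the top three results.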

What I think

The core lesson from this paper is that the data you use to represent a tool matters way more than the documentation text. If you have historical usage data, use it. The gap between "how tools are documented" and "how people actually ask for them" is real, and Tool2Vec directly addresses that by learning from actual usage rather than from static text.

ToolRefiner is a nice addition on top, but Tool2Vec alone already produces a significant improvement. The two-stage setup is practical too, since you're not replacing your existing retriever entirely, just adding a lightweight refinement step on top.

If you're building anything that involves tool selection at scale, this paper is worth a read.