Reducing Tool Output by 95%: JSON Compression, Response Projection, and Caching for LLM Cost Optimization

AI agents executing tools face a critical challenge: large tool outputs consume massive token budgets, driving up costs and reducing accuracy through context rot. A single API response can contain thousands of unnecessary fields, consuming tens of thousands of tokens before the LLM even processes the user's request.
At Apigene, we've implemented a three-layer optimization strategy combining JSON compression, response projection, and intelligent caching to reduce tool output by up to 95%. This article explores how these techniques work together to dramatically reduce LLM costs, improve accuracy, and accelerate agent performance.
The Problem: Tool Output Bloat
Token Consumption Explosion
Consider a typical API response from Google Search Console:
Original Response (10,000 tokens):
{
"rows": [
{
"keys": ["2024-01-01", "https://example.com/page1"],
"clicks": 1250,
"impressions": 45000,
"ctr": 0.0278,
"position": 3.2,
"device": "DESKTOP",
"country": "US",
"query": "example search query",
"page": "https://example.com/page1",
"date": "2024-01-01",
"searchAppearance": "NONE",
"dataState": "FINAL",
"isDataGolden": true,
"dimensionFilterGroups": [],
"dimensionGroup": null,
"dimensionValues": {},
"metricValues": {},
"metadata": {
"schema": "https://schema.org/WebPage",
"lastModified": "2024-01-01T00:00:00Z",
"crawlStatus": "OK",
"indexStatus": "INDEXED"
}
},
// ... 99 more similar objects
],
"responseAggregationType": "BY_PROPERTY",
"rowCount": 100,
"kind": "webmasters#searchAnalyticsSearchResponse",
"timeZone": "America/New_York",
"dataState": "FINAL"
}
This response contains:
- 100 rows of search analytics data
- 20+ fields per row (most unused)
- Metadata and schema information (rarely needed)
- Total: ~10,000 tokens
The Hidden Costs
Large tool outputs create three critical problems:
- Token Cost Explosion: At $0.01 per 1K input tokens, 10,000 tokens = $0.10 per request
- Context Rot: Irrelevant data dilutes important information, reducing LLM accuracy
- Speed Degradation: Processing large responses increases latency and reasoning time
Real-World Impact
Scenario: Agent analyzing search performance
1. Fetch Google Search Console data: 10,000 tokens
2. Process user query: 100 tokens
3. LLM reasoning with full context: 2,000 tokens
4. Generate response: 500 tokens
Total: 12,600 tokens = $0.126 per request
If the agent only needs clicks, impressions, and ctr, we're wasting 9,500 tokens (75% waste) on irrelevant data.
The Solution: Three-Layer Optimization
Apigene implements three complementary techniques that work together to reduce tool output:
- Response Projection: Extract only needed fields using JMESPath
- JSON Compression: Compress remaining data using tabular format
- Intelligent Caching: Cache responses to avoid redundant API calls
Layer 1: Response Projection with JMESPath
Response projection uses JMESPath expressions to extract only the fields you need from API responses, dramatically reducing token consumption.
How It Works
Apigene applies projection before compression and caching, ensuring minimal data flows through the system. The projection engine uses JMESPath to parse expressions and extract only the specified fields from JSON responses. This happens server-side before any data is sent to the LLM, dramatically reducing token consumption.
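To make this concrete, here is a minimal sketch of server-side projection using the open-source jmespath Python library. The function name, sample data, and fallback policy are illustrative, not Apigene's actual implementation:

import jmespath
from jmespath.exceptions import JMESPathError

def apply_projection(response, expression):
    """Extract only the requested fields from an API response.

    Falls back to the full response when no expression is given or the
    expression is invalid, so a bad projection never drops data silently.
    """
    if not expression:
        return response
    try:
        result = jmespath.compile(expression).search(response)
        # A projection that matches nothing returns None; keep the original.
        return result if result is not None else response
    except JMESPathError:
        return response

# Trim a Search Console-style response down to six fields per row.
full_response = {
    "rows": [
        {"keys": ["2024-01-01", "https://example.com/page1"],
         "clicks": 1250, "impressions": 45000, "ctr": 0.0278,
         "position": 3.2, "dataState": "FINAL", "country": "US"},
    ],
    "rowCount": 1,
}
slim = apply_projection(
    full_response,
    "rows[*].{date: keys[0], page: keys[1], clicks: clicks, "
    "impressions: impressions, ctr: ctr, position: position}",
)

Because the projection runs where the response lands, only slim ever reaches the model.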
Example 1: Google Search Console - Field Selection
Original Response: 10,000 tokens
With Projection:
{
"tool": "run_action",
"arguments": {
"app_name": "Google Search Console",
"user_input": "Get search analytics data",
"context": {
"operationId": "searchAnalytics.query",
"startDate": "2024-01-01",
"endDate": "2024-01-31",
"dimensions": ["date", "page"]
},
"response_projection": "rows[*].{date: keys[0], page: keys[1], clicks: clicks, impressions: impressions, ctr: ctr, position: position}"
}
}
Projected Response: 2,500 tokens (75% reduction)
Result:
{
"rows": [
{
"date": "2024-01-01",
"page": "https://example.com/page1",
"clicks": 1250,
"impressions": 45000,
"ctr": 0.0278,
"position": 3.2
}
// ... 99 more rows (only 6 fields each)
]
}
Example 2: Google Search Console - Filtering and Aggregation
Filter High-Performing Pages:
{
"response_projection": "rows[?clicks > `1000`].{page: keys[1], clicks: clicks, ctr: ctr}"
}
Aggregate by Date (group_by is a JMESPath Community extension, not part of core JMESPath):
{
"response_projection": "{dates: rows[*].{date: keys[0], total_clicks: clicks, total_impressions: impressions}} | group_by(@, &date)"
}Calculate Average Position:
{
"response_projection": "{avg_position: avg(rows[*].position), total_clicks: sum(rows[*].clicks)}"
}
Example 3: Complex Nested Projection
Extract Nested Data:
{
"response_projection": "rows[*].{url: keys[1], metrics: {clicks: clicks, impressions: impressions}, date: keys[0]}"
}
Filter and Transform:
{
"response_projection": "rows[?ctr > `0.03`].{page: keys[1], performance: {clicks: clicks, ctr: ctr, position: position}}"
}
JMESPath Projection Patterns
1. Field Selection:
"{name: name, email: email, id: id}"2. Array Extraction:
"items[*].name"3. Filtering:
"items[?price > `100`]"4. Aggregation:
"{total: sum(items[*].price), avg: avg(items[*].price), max: max(items[*].price)}"5. Nested Extraction:
"users[*].{id: id, profile: {name: profile.name, email: profile.email}}"6. Conditional Filtering:
"items[?status == 'active' && price > `50`]"7. Sorting:
"sort_by(items, &price)[-10:]"8. Multi-Level Projection:
"{summary: {total: length(items), active: length(items[?status == 'active'])}, items: items[*].{id: id, name: name}}"Layer 2: JSON Compression
After projection, Apigene applies JSON compression to further reduce size by converting uniform arrays into tabular format.
How It Works
Apigene supports two compression modes optimized for different use cases:
1. Column/Row Format (json-compact): Converts uniform arrays of objects into a tabular columnar format. When all objects in an array share identical keys, the system extracts column names once and stores values as rows, eliminating repetitive key names.
2. TOON Format: A specialized format optimized for LLM token efficiency. TOON uses a compact syntax that reduces token count while maintaining high LLM comprehension accuracy.
Both modes automatically detect when compression is beneficial and only apply when a minimum compression ratio threshold is met (default 5%).
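As a sketch of the column/row mode under these rules, assuming the 5% default threshold (the function and its return shape are illustrative, not Apigene's internal code):

import json

def compress_columnar(rows, min_ratio=0.05):
    """Convert a uniform list of dicts to {columns, rows} form.

    Returns the input unchanged unless every object has identical keys
    and the compact form is at least min_ratio smaller when serialized.
    """
    if not rows or not all(isinstance(r, dict) for r in rows):
        return rows
    columns = list(rows[0].keys())
    if any(list(r.keys()) != columns for r in rows[1:]):
        return rows  # non-uniform structure: skip compression
    compact = {"columns": columns,
               "rows": [[r[c] for c in columns] for r in rows]}
    before = len(json.dumps(rows))
    after = len(json.dumps(compact))
    # Threshold check: only keep the compact form if it pays for itself.
    return compact if (before - after) / before >= min_ratio else rows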
Compression Examples
Column/Row Format Example:
Before Compression (2,500 tokens):
[
{"date": "2024-01-01", "clicks": 1250, "impressions": 45000},
{"date": "2024-01-02", "clicks": 1320, "impressions": 47000},
{"date": "2024-01-03", "clicks": 1180, "impressions": 43000}
]
After Column/Row Compression (800 tokens - 68% reduction):
{
"columns": ["date", "clicks", "impressions"],
"rows": [
["2024-01-01", 1250, 45000],
["2024-01-02", 1320, 47000],
["2024-01-03", 1180, 43000]
]
}
TOON Format Example:
Before Compression (15,144 tokens - GitHub repositories example): Standard JSON with repetitive field names and structure.
After TOON Compression (8,744 tokens - 42.3% reduction): Compact TOON syntax that eliminates repetitive keys while maintaining full data fidelity. According to TOON benchmarks, this format achieves higher LLM accuracy (73.9% vs 69.7%) with significantly fewer tokens.
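The TOON syntax itself is not shown above; for a feel of it, a small uniform array might render roughly like this, based on the examples in the public TOON specification (data values invented for illustration):

repos[3]{name,stars,language}:
  widget-api,1250,TypeScript
  data-pipeline,890,Python
  auth-service,2100,Go

The field names are declared once in the header, so each additional row costs only its values, which is where the token savings come from.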
Compression Mode Comparison
Column/Row Format (json-compact):
- Best for: Uniform tabular data with identical object structures
- Compression: Converts arrays to column/row format
- Token Reduction: 35-60% reduction vs standard JSON
- Example: Employee records, time-series data, uniform API responses
TOON Format:
- Best for: LLM consumption with maximum token efficiency
- Compression: Specialized compact syntax optimized for LLMs
- Token Reduction: 40-65% reduction vs standard JSON
- LLM Accuracy: According to TOON benchmarks, TOON achieves 73.9% accuracy (vs JSON's 69.7%) while using 39.6% fewer tokens
- Efficiency Score: 26.9 accuracy%/1K tokens (highest among formats tested)
- Per-Model Performance:
- Claude Haiku: 59.8% accuracy (vs JSON's 57.4%)
- Gemini Flash: 87.6% accuracy (vs JSON's 77.0%)
- GPT-5 Nano: 90.9% accuracy (tied with JSON compact)
- Grok Fast: 57.4% accuracy (vs JSON's 55.5%)
- Key Advantage: Not only reduces tokens but actually improves LLM comprehension accuracy across multiple models
Compression Ratio Threshold
Compression only applies when it achieves a minimum compression ratio (configurable, default 5%). This ensures compression overhead is justified by meaningful size reduction. If compression doesn't meet the threshold, the original response is returned to avoid unnecessary processing.
Layer 3: Intelligent Caching
Apigene implements intelligent response caching to avoid redundant API calls, further reducing token consumption and improving speed.
How It Works
Apigene's caching system operates transparently at the interaction layer. When a tool execution request is made, the system first checks for a cached response using a composite cache key that includes:
- API name and operation ID
- Request context parameters (excluding temporal fields)
- User ID (for user-scoped caching)
- Response projection (if applied)
If a cache hit occurs, the response is returned immediately without executing the API call. On cache miss, the API executes normally and the result is stored in cache with a configurable TTL (time-to-live).
Cache Key Generation
Cache keys are generated using a composite approach that includes:
- API name and operation ID
- Context parameters (excluding temporal fields like timestamps)
- User ID (for user-scoped caching) or tenant ID (for tenant-scoped caching)
- Response projection expression (if applied)
This ensures that identical requests with the same projection return cached results, while different projections or parameters generate separate cache entries.
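A minimal sketch of that composite key generation (the hash choice, field names, and temporal-field list are assumptions for illustration, not Apigene's actual scheme):

import hashlib
import json

TEMPORAL_FIELDS = {"timestamp", "requestTime", "nonce"}  # illustrative

def cache_key(api_name, operation_id, context, user_id, projection=None):
    """Build a deterministic cache key from the request's identity."""
    # Drop temporal fields so otherwise-identical requests share a key.
    stable_ctx = {k: v for k, v in sorted(context.items())
                  if k not in TEMPORAL_FIELDS}
    payload = json.dumps(
        [api_name, operation_id, stable_ctx, user_id, projection or ""],
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

Two calls that differ only in a timestamp hash to the same key, while a changed projection expression produces a distinct cache entry.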
Cache Configuration
Caching can be configured at the organization level with settings for:
- Enabled/Disabled: Toggle caching on or off
- TTL (Time-To-Live): How long cached responses remain valid (default: 1 hour)
- Operation Type Filter: Which operation types to cache (e.g., GET operations only)
- Cache Scope: User-level or tenant-level caching
This allows fine-grained control over what gets cached and for how long, balancing freshness with performance.
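Putting those settings together, an organization-level configuration might look like the following (the first three field names match the best-practices example later in this article; cache_scope is an assumed name):

{
  "cache_enabled": true,
  "cache_ttl": 3600,
  "operation_type": "get",
  "cache_scope": "user"
}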
Cache Benefits
- Zero Token Cost: Cached responses don't consume API tokens
- Instant Response: Cached data returns immediately
- Reduced API Load: Fewer calls to external APIs
- Cost Savings: No API rate limit consumption
Complete Optimization Pipeline
The three layers work together in sequence:
1. API Response (10,000 tokens)
↓
2. Response Projection (JMESPath) → 2,500 tokens (75% reduction)
↓
3. JSON Compression (Tabular) → 800 tokens (68% reduction)
↓
4. Caching (if applicable) → 0 tokens (100% reduction on cache hit)
Total Reduction: 92% (from 10,000 to 800 tokens)
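End to end, the pipeline amounts to a cache-first wrapper around projection and compression. A condensed sketch, reusing the illustrative apply_projection, compress_columnar, and cache_key helpers from earlier sections, with execute_api standing in for the real HTTP call and a plain dict standing in for real cache infrastructure:

import time

_cache = {}  # cache_key -> (expires_at, value); stand-in for a real store

def run_tool(api_name, operation_id, context, user_id,
             projection=None, ttl=3600):
    key = cache_key(api_name, operation_id, context, user_id, projection)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                                    # layer 3: cache hit
    raw = execute_api(api_name, operation_id, context)   # upstream API call
    slim = apply_projection(raw, projection)             # layer 1: projection
    if isinstance(slim, list):
        slim = compress_columnar(slim)                   # layer 2: compression
    _cache[key] = (time.time() + ttl, slim)              # populate cache
    return slim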
Real-World Examples
Example 1: Google Search Console Analytics
Use Case: Analyze top-performing pages
Without Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Google Search Console",
"context": {
"operationId": "searchAnalytics.query"
}
}
}
Result: 10,000 tokens
With Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Google Search Console",
"context": {
"operationId": "searchAnalytics.query",
"startDate": "2024-01-01",
"endDate": "2024-01-31"
},
"response_projection": "rows[?clicks > `1000`].{page: keys[1], clicks: clicks, ctr: ctr, position: position}"
}
}
Result: 500 tokens (95% reduction)
Example 2: Salesforce Opportunity Analysis
Use Case: Get high-value opportunities
Without Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Salesforce",
"context": {
"operationId": "listOpportunities"
}
}
}
Result: 8,000 tokens
With Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Salesforce",
"context": {
"operationId": "listOpportunities"
},
"response_projection": "opportunities[?amount > `10000`].{name: name, amount: amount, stage: stageName, closeDate: closeDate}"
}
}
Result: 1,200 tokens (85% reduction)
Example 3: GitHub Pull Request Review
Use Case: List open pull requests with basic info
Without Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "GitHub",
"context": {
"operationId": "listPullRequests"
}
}
}
Result: 6,000 tokens
With Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "GitHub",
"context": {
"operationId": "listPullRequests",
"state": "open"
},
"response_projection": "pull_requests[*].{number: number, title: title, author: user.login, status: state, url: html_url}"
}
}
Result: 800 tokens (87% reduction)
Example 4: Jira Issue Tracking
Use Case: Get open issues with summaries
Without Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Jira",
"context": {
"operationId": "listIssues"
}
}
}
Result: 7,500 tokens
With Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Jira",
"context": {
"operationId": "listIssues",
"status": "open"
},
"response_projection": "{issues: issues[*].{id: id, key: key, summary: fields.summary, status: fields.status.name, assignee: fields.assignee.displayName}}"
}
}
Result: 1,000 tokens (87% reduction)
Example 5: Gmail Email Analysis
Use Case: Get unread email summaries
Without Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Gmail",
"context": {
"operationId": "listMessages"
}
}
}
Result: 5,000 tokens
With Optimization:
{
"tool": "run_action",
"arguments": {
"app_name": "Gmail",
"context": {
"operationId": "listMessages",
"q": "is:unread"
},
"response_projection": "messages[*].{id: id, subject: payload.headers[?name == 'Subject'].value | [0], from: payload.headers[?name == 'From'].value | [0], date: internalDate}"
}
}
Result: 600 tokens (88% reduction)
Performance Impact
Token Reduction Metrics
| API | Original Size (tokens) | Projected Size (tokens) | Compressed Size (tokens) | Total Reduction |
|---|---|---|---|---|
| Google Search Console | 10,000 | 2,500 | 800 | 92% |
| Salesforce | 8,000 | 1,500 | 600 | 92.5% |
| GitHub | 6,000 | 1,200 | 500 | 91.7% |
| Jira | 7,500 | 1,000 | 400 | 94.7% |
| Gmail | 5,000 | 800 | 300 | 94% |
Cost Savings
Scenario: Agent processing 1,000 requests/day
Without Optimization:
- Average response: 7,000 tokens
- Daily tokens: 7,000,000
- Cost: $70/day ($2,100/month)
With Optimization:
- Average response: 500 tokens (93% reduction)
- Daily tokens: 500,000
- Cost: $5/day ($150/month)
Savings: $1,950/month (93% cost reduction)
Accuracy Improvement
Context Rot Reduction:
- Without Optimization: 7,000 tokens of mixed relevant/irrelevant data
- With Optimization: 500 tokens of focused, relevant data
- Improvement: 93% reduction in noise, leading to:
- Better LLM focus on relevant information
- Reduced hallucination risk
- Improved reasoning accuracy
- Faster response generation
Speed Improvements
Latency Reduction:
- Projection: <10ms overhead
- Compression: <5ms overhead
- Cache Hit: 0ms (instant)
- Total Speedup: 10-100x faster on cache hits
How It Works: High-Level Architecture
Response Processing Pipeline
Apigene processes tool responses through a three-stage pipeline:
Stage 1: Response Projection
- JMESPath expressions are parsed and evaluated against the response
- Only specified fields are extracted, reducing data size by 70-90%
- Projection happens server-side before any data transmission
Stage 2: JSON Compression
- Two modes available: Column/Row format or TOON format
- Compression algorithm analyzes data structure to determine optimal format
- Only applies when minimum compression ratio threshold is met
- Reduces remaining data by an additional 35-65%
Stage 3: Intelligent Caching
- Cache key generated from request parameters and projection
- Cache checked before API execution
- Successful responses stored with configurable TTL
- Subsequent identical requests return instantly from cache
Compression Mode Selection
The system automatically selects the best compression mode based on the following criteria (a sketch of the heuristic follows the list):
Column/Row Format Selection:
- Detects uniform arrays of objects with identical keys
- Converts to columnar format when all objects share the same structure
- Ideal for tabular data like employee records, time-series, or uniform API responses
TOON Format Selection:
- Optimized for LLM consumption with maximum token efficiency
- Uses compact syntax that reduces tokens while maintaining accuracy
- According to TOON benchmarks, achieves higher accuracy (73.9% vs 69.7%) with 39.6% fewer tokens than JSON
- Best efficiency score: 26.9 accuracy%/1K tokens
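In essence, the selection reduces to a uniformity check on the data; a hedged sketch (Apigene's actual heuristics are internal and may weigh more signals):

def select_mode(data):
    """Pick 'json-compact' for uniform arrays of objects, else 'toon'."""
    if (isinstance(data, list) and data
            and all(isinstance(r, dict) for r in data)):
        first_keys = list(data[0].keys())
        if all(list(r.keys()) == first_keys for r in data):
            return "json-compact"  # identical keys: tabular layout wins
    return "toon"  # mixed or nested shapes: token-oriented syntax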
Compression Effectiveness
Based on benchmark data from TOON format benchmarks:
Token Efficiency Comparison:
- TOON: 26.9 accuracy%/1K tokens (highest efficiency)
- JSON compact: 22.9 accuracy%/1K tokens
- YAML: 18.6 accuracy%/1K tokens
- JSON: 15.3 accuracy%/1K tokens
- XML: 13.0 accuracy%/1K tokens
Key Insight: TOON achieves 73.9% accuracy (vs JSON's 69.7%) while using 39.6% fewer tokens, making it the optimal choice for LLM consumption when maximum efficiency is required.
Best Practices
1. Always Use Projection for Large Responses
{
"response_projection": "items[*].{id: id, name: name, status: status}"
}
2. Combine Projection with Filtering
{
"response_projection": "items[?status == 'active' && price > `100`].{id: id, name: name}"
}
3. Use Aggregation for Summary Data
{
"response_projection": "{total: length(items), active: length(items[?status == 'active']), avg_price: avg(items[*].price)}"
}
4. Enable Compression for Tabular Data
Column/Row Format (for uniform tabular data):
{
"json_compression": {
"enabled": true,
"min_compression_ratio": 5.0,
"mode": "json-compact"
}
}
TOON Format (for maximum LLM efficiency):
{
"json_compression": {
"enabled": true,
"min_compression_ratio": 5.0,
"mode": "toon"
}
}
When to Use Each Mode:
- json-compact: Best for uniform tabular data where you need human-readable format
- toon: Best for LLM consumption where maximum token efficiency and accuracy are priorities. According to TOON benchmarks, TOON achieves the highest efficiency score (26.9 accuracy%/1K tokens) and better accuracy than JSON while using fewer tokens.
5. Configure Caching for Read Operations
{
"cache_enabled": true,
"cache_ttl": 3600,
"operation_type": "get"
}
6. Use Projection in Batch Operations
{
"tool": "run_action_batch",
"arguments": {
"base_context": {
"operationId": "readEmail",
"response_projection": "{id: id, subject: subject, from: from}"
},
"batch_context": [
{"email_id": "123"},
{"email_id": "456"}
]
}
}
Advanced JMESPath Patterns
Pattern 1: Multi-Level Filtering
{
"response_projection": "rows[?clicks > `1000` && ctr > `0.03`].{page: keys[1], metrics: {clicks: clicks, ctr: ctr}}"
}
Pattern 2: Grouping and Aggregation (group_by is a JMESPath Community extension)
{
"response_projection": "map(&{date: @[0].keys[0], total_clicks: sum(@[*].clicks), avg_ctr: avg(@[*].ctr)}, values(group_by(rows, &keys[0])))"
}
Pattern 3: Nested Field Extraction
{
"response_projection": "items[*].{id: id, user: {name: user.name, email: user.email}, metadata: {created: metadata.created_at, updated: metadata.updated_at}}"
}
Pattern 4: Conditional Transformation (core JMESPath has no ternary operator, so express the condition as a boolean field)
{
"response_projection": "items[*].{id: id, status: status, is_urgent: status == 'urgent'}"
}
Pattern 5: Array Flattening
{
"response_projection": "items[*].tags[]"
}
Monitoring and Optimization
Compression Metrics
Apigene provides detailed compression metrics for monitoring:
- Compression ratio: Percentage reduction achieved
- Mode used: Column/Row or TOON format
- Before/after sizes: Token counts for comparison
- Threshold check: Whether compression met minimum ratio
Example metrics:
JSON compression: 10000 bytes -> 800 bytes (92.0% reduction)
TOON compression: 15000 bytes -> 8744 bytes (41.7% reduction)
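A sketch of how such a metric line can be produced, mirroring the format above (not Apigene's actual logging code):

def log_compression(label, before_bytes, after_bytes, min_ratio=0.05):
    reduction = (before_bytes - after_bytes) / before_bytes
    applied = reduction >= min_ratio  # same threshold check as compression
    return (f"{label} compression: {before_bytes} bytes -> "
            f"{after_bytes} bytes ({reduction:.1%} reduction, "
            f"{'applied' if applied else 'skipped'})")

print(log_compression("JSON", 10000, 800))   # 92.0% reduction, applied
print(log_compression("TOON", 15000, 8744))  # 41.7% reduction, applied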
Cache Performance Monitoring
Track cache effectiveness:
- Cache hit rate: Percentage of requests served from cache
- Cache miss rate: Percentage requiring API execution
- Average response time: Comparison between cache hits and misses
- TTL effectiveness: How often cached data expires vs is used
Token Usage Tracking
Monitor end-to-end token consumption:
- Original response size: Before any optimization
- After projection: Size reduction from field extraction
- After compression: Additional reduction from compression
- Cache impact: Zero tokens when cache hit occurs
- Total reduction: Overall percentage improvement
This tracking helps identify optimization opportunities and measure ROI of compression and caching strategies.
Comparison: Before vs After
| Metric | Without Optimization | With Optimization | Improvement |
|---|---|---|---|
| Average Response Size | 7,000 tokens | 500 tokens | 93% reduction |
| Cost per Request | $0.07 | $0.005 | 93% reduction |
| Context Accuracy | Low (noise) | High (focused) | Significant |
| Response Time | 500ms | 50ms (cache) | 10x faster |
| API Rate Limits | High usage | Low usage | 93% reduction |
Conclusion
Tool output optimization is critical for building cost-effective, accurate AI agents. Apigene's three-layer approach—response projection, JSON compression (with both Column/Row and TOON formats), and intelligent caching—reduces tool output by up to 95%, delivering:
- 93% cost reduction through token optimization
- Improved accuracy by reducing context rot (TOON format achieves 73.9% accuracy vs JSON's 69.7% with 39.6% fewer tokens)
- 10x faster responses with intelligent caching
- Better scalability through reduced API load
- Optimal format selection with automatic detection of best compression mode
By combining these techniques, agents can process more requests with lower costs, faster responses, and better accuracy. The key is applying projection first to extract only needed data, then compressing what remains using the optimal format (Column/Row for tabular data, TOON for maximum LLM efficiency), and finally caching to avoid redundant calls.
According to TOON format benchmarks, the TOON compression mode achieves the highest efficiency score (26.9 accuracy%/1K tokens) among all formats tested, making it ideal for LLM consumption where both token efficiency and accuracy are critical.
Ready to optimize your agent's token usage? Get started with Apigene's MCP Gateway and experience the power of intelligent tool output reduction.