
Reducing Tool Output by 95%: JSON Compression, Response Projection, and Caching for LLM Cost Optimization

Apigene Team
15 min read

AI agents executing tools face a critical challenge: large tool outputs consume massive token budgets, driving up costs and reducing accuracy through context rot. A single API response can contain thousands of unnecessary fields, consuming tens of thousands of tokens before the LLM even processes the user's request.

At Apigene, we've implemented a three-layer optimization strategy combining JSON compression, response projection, and intelligent caching to reduce tool output by up to 95%. This article explores how these techniques work together to dramatically reduce LLM costs, improve accuracy, and accelerate agent performance.

The Problem: Tool Output Bloat

Token Consumption Explosion

Consider a typical API response from Google Search Console:

Original Response (10,000 tokens):

{
  "rows": [
    {
      "keys": ["2024-01-01", "https://example.com/page1"],
      "clicks": 1250,
      "impressions": 45000,
      "ctr": 0.0278,
      "position": 3.2,
      "device": "DESKTOP",
      "country": "US",
      "query": "example search query",
      "page": "https://example.com/page1",
      "date": "2024-01-01",
      "searchAppearance": "NONE",
      "dataState": "FINAL",
      "isDataGolden": true,
      "dimensionFilterGroups": [],
      "dimensionGroup": null,
      "dimensionValues": {},
      "metricValues": {},
      "metadata": {
        "schema": "https://schema.org/WebPage",
        "lastModified": "2024-01-01T00:00:00Z",
        "crawlStatus": "OK",
        "indexStatus": "INDEXED"
      }
    },
    // ... 99 more similar objects
  ],
  "responseAggregationType": "BY_PROPERTY",
  "rowCount": 100,
  "kind": "webmasters#searchAnalyticsSearchResponse",
  "timeZone": "America/New_York",
  "dataState": "FINAL"
}

This response contains:

  • 100 rows of search analytics data
  • 20+ fields per row (most unused)
  • Metadata and schema information (rarely needed)
  • Total: ~10,000 tokens

The Hidden Costs

Large tool outputs create three critical problems:

  1. Token Cost Explosion: At $0.01 per 1K input tokens, 10,000 tokens = $0.10 per request
  2. Context Rot: Irrelevant data dilutes important information, reducing LLM accuracy
  3. Speed Degradation: Processing large responses increases latency and reasoning time

Real-World Impact

Scenario: Agent analyzing search performance

1. Fetch Google Search Console data: 10,000 tokens
2. Process user query: 100 tokens
3. LLM reasoning with full context: 2,000 tokens
4. Generate response: 500 tokens
Total: 12,600 tokens = $0.126 per request

If the agent only needs clicks, impressions, and ctr, roughly 9,500 of those 12,600 tokens (75%) are wasted on irrelevant data.

The Solution: Three-Layer Optimization

Apigene implements three complementary techniques that work together to reduce tool output:

  1. Response Projection: Extract only needed fields using JMESPath
  2. JSON Compression: Compress remaining data using tabular format
  3. Intelligent Caching: Cache responses to avoid redundant API calls

Layer 1: Response Projection with JMESPath

Response projection uses JMESPath expressions to extract only the fields you need from API responses, dramatically reducing token consumption.

How It Works

Apigene applies projection before compression and caching, so only minimal data flows through the rest of the pipeline. The projection engine parses the JMESPath expression and extracts only the specified fields from the JSON response. This happens server-side, before anything is sent to the LLM.
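
As a concrete illustration, here is a minimal sketch of this step using the open-source jmespath Python package (pip install jmespath). The function name project_response and its fallback behavior are illustrative assumptions, not Apigene's internal API:

import jmespath

def project_response(response, projection):
    """Apply a JMESPath projection to an API response before the LLM sees it."""
    if not projection:
        return response  # no projection requested; pass through unchanged
    result = jmespath.search(projection, response)
    # Fall back to the full response if the expression matches nothing.
    return result if result is not None else response

projected = project_response(
    {"rows": [{"keys": ["2024-01-01", "https://example.com/page1"],
               "clicks": 1250, "impressions": 45000, "ctr": 0.0278,
               "position": 3.2, "device": "DESKTOP", "country": "US"}]},
    "rows[*].{date: keys[0], page: keys[1], clicks: clicks, "
    "impressions: impressions, ctr: ctr, position: position}",
)
# projected now contains only the six requested fields per row.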

Example 1: Google Search Console - Field Selection

Original Response: 10,000 tokens

With Projection:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Google Search Console",
    "user_input": "Get search analytics data",
    "context": {
      "operationId": "searchAnalytics.query",
      "startDate": "2024-01-01",
      "endDate": "2024-01-31",
      "dimensions": ["date", "page"]
    },
    "response_projection": "rows[*].{date: keys[0], page: keys[1], clicks: clicks, impressions: impressions, ctr: ctr, position: position}"
  }
}

Projected Response: 2,500 tokens (75% reduction)

Result:

{
  "rows": [
    {
      "date": "2024-01-01",
      "page": "https://example.com/page1",
      "clicks": 1250,
      "impressions": 45000,
      "ctr": 0.0278,
      "position": 3.2
    }
    // ... 99 more rows (only 6 fields each)
  ]
}

Example 2: Google Search Console - Filtering and Aggregation

Filter High-Performing Pages:

{
  "response_projection": "rows[?clicks > `1000`].{page: keys[1], clicks: clicks, ctr: ctr}"
}

Aggregate by Date:

{
  "response_projection": "{dates: rows[*].{date: keys[0], total_clicks: clicks, total_impressions: impressions}} | group_by(@, &date)"
}

Calculate Average Position:

{
  "response_projection": "{avg_position: avg(rows[*].position), total_clicks: sum(rows[*].clicks)}"
}

Example 3: Complex Nested Projection

Extract Nested Data:

{
  "response_projection": "rows[*].{url: keys[1], metrics: {clicks: clicks, impressions: impressions}, date: keys[0]}"
}

Filter and Transform:

{
  "response_projection": "rows[?ctr > `0.03`].{page: keys[1], performance: {clicks: clicks, ctr: ctr, position: position}}"
}

JMESPath Projection Patterns

1. Field Selection:

"{name: name, email: email, id: id}"

2. Array Extraction:

"items[*].name"

3. Filtering:

"items[?price > `100`]"

4. Aggregation:

"{total: sum(items[*].price), avg: avg(items[*].price), max: max(items[*].price)}"

5. Nested Extraction:

"users[*].{id: id, profile: {name: profile.name, email: profile.email}}"

6. Conditional Filtering:

"items[?status == 'active' && price > `50`]"

7. Sorting:

"sort_by(items, &price)[-10:]"

8. Multi-Level Projection:

"{summary: {total: length(items), active: length(items[?status == 'active'])}, items: items[*].{id: id, name: name}}"

Layer 2: JSON Compression

After projection, Apigene applies JSON compression to further reduce size by converting uniform arrays into tabular format.

How It Works

Apigene supports two compression modes optimized for different use cases:

1. Column/Row Format (json-compact): Converts uniform arrays of objects into a tabular columnar format. When all objects in an array share identical keys, the system extracts column names once and stores values as rows, eliminating repetitive key names.

2. TOON Format: A specialized format optimized for LLM token efficiency. TOON uses a compact syntax that reduces token count while maintaining high LLM comprehension accuracy.

Both modes automatically detect when compression is beneficial and only apply when a minimum compression ratio threshold is met (default 5%).
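
A minimal sketch of the column/row idea, including the ratio guard, might look like the following. The function name and the byte-length ratio check are illustrative assumptions; Apigene's actual implementation (and the TOON encoder) are not shown:

import json

def compact_uniform_array(records, min_ratio=0.05):
    """Convert a uniform array of objects into columns + rows, when it pays off."""
    if not records or not all(isinstance(r, dict) for r in records):
        return records
    columns = list(records[0].keys())
    if any(list(r.keys()) != columns for r in records):
        return records  # objects are not uniform; leave the array untouched
    compact = {"columns": columns,
               "rows": [[r[c] for c in columns] for r in records]}
    before = len(json.dumps(records))
    after = len(json.dumps(compact))
    # Apply only if the saving meets the minimum ratio (default 5%).
    return compact if (before - after) / before >= min_ratio else records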

Compression Examples

Column/Row Format Example:

Before Compression (2,500 tokens):

[
  {"date": "2024-01-01", "clicks": 1250, "impressions": 45000},
  {"date": "2024-01-02", "clicks": 1320, "impressions": 47000},
  {"date": "2024-01-03", "clicks": 1180, "impressions": 43000}
]

After Column/Row Compression (800 tokens - 68% reduction):

{
  "columns": ["date", "clicks", "impressions"],
  "rows": [
    ["2024-01-01", 1250, 45000],
    ["2024-01-02", 1320, 47000],
    ["2024-01-03", 1180, 43000]
  ]
}

TOON Format Example:

Before Compression (15,144 tokens - GitHub repositories example): Standard JSON with repetitive field names and structure.

After TOON Compression (8,744 tokens - 42.3% reduction): Compact TOON syntax that eliminates repetitive keys while maintaining full data fidelity. According to TOON benchmarks, this format achieves higher LLM accuracy (73.9% vs 69.7%) with significantly fewer tokens.

Compression Mode Comparison

Column/Row Format (json-compact):

  • Best for: Uniform tabular data with identical object structures
  • Compression: Converts arrays to column/row format
  • Token Reduction: 35-60% reduction vs standard JSON
  • Example: Employee records, time-series data, uniform API responses

TOON Format:

  • Best for: LLM consumption with maximum token efficiency
  • Compression: Specialized compact syntax optimized for LLMs
  • Token Reduction: 40-65% reduction vs standard JSON
  • LLM Accuracy: According to TOON benchmarks, TOON achieves 73.9% accuracy (vs JSON's 69.7%) while using 39.6% fewer tokens
  • Efficiency Score: 26.9 accuracy%/1K tokens (highest among formats tested)
  • Per-Model Performance:
    • Claude Haiku: 59.8% accuracy (vs JSON's 57.4%)
    • Gemini Flash: 87.6% accuracy (vs JSON's 77.0%)
    • GPT-5 Nano: 90.9% accuracy (tied with JSON compact)
    • Grok Fast: 57.4% accuracy (vs JSON's 55.5%)
  • Key Advantage: Not only reduces tokens but actually improves LLM comprehension accuracy across multiple models

Compression Ratio Threshold

Compression only applies when it achieves a minimum compression ratio (configurable, default 5%). This ensures compression overhead is justified by meaningful size reduction. If compression doesn't meet the threshold, the original response is returned to avoid unnecessary processing.

Layer 3: Intelligent Caching

Apigene implements intelligent response caching to avoid redundant API calls, further reducing token consumption and improving speed.

How It Works

Apigene's caching system operates transparently at the interaction layer. When a tool execution request is made, the system first checks for a cached response using a composite cache key (see "Cache Key Generation" below).

If a cache hit occurs, the response is returned immediately without executing the API call. On cache miss, the API executes normally and the result is stored in cache with a configurable TTL (time-to-live).

Cache Key Generation

Cache keys are generated using a composite approach that includes:

  • API name and operation ID
  • Context parameters (excluding temporal fields like timestamps)
  • User ID (for user-scoped caching) or tenant ID (for tenant-scoped caching)
  • Response projection expression (if applied)

This ensures that identical requests with the same projection return cached results, while different projections or parameters generate separate cache entries.
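
A sketch of this key derivation, assuming a SHA-256 hash over the stable request fields (the specific field names treated as temporal here are illustrative):

import hashlib, json

TEMPORAL_FIELDS = {"timestamp", "requestTime"}  # assumed examples of temporal fields

def cache_key(api_name, operation_id, context, scope_id=None, projection=None):
    # Drop temporal fields and sort keys so equivalent requests hash identically.
    stable_ctx = {k: v for k, v in context.items() if k not in TEMPORAL_FIELDS}
    payload = json.dumps(
        [api_name, operation_id, stable_ctx, scope_id, projection],
        sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

Here scope_id stands in for either the user ID or the tenant ID, depending on the configured cache scope.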

Cache Configuration

Caching can be configured at the organization level with settings for:

  • Enabled/Disabled: Toggle caching on or off
  • TTL (Time-To-Live): How long cached responses remain valid (default: 1 hour)
  • Operation Type Filter: Which operation types to cache (e.g., GET operations only)
  • Cache Scope: User-level or tenant-level caching

This allows fine-grained control over what gets cached and for how long, balancing freshness with performance.

Cache Benefits

  1. Zero Token Cost: Cached responses don't consume API tokens
  2. Instant Response: Cached data returns immediately
  3. Reduced API Load: Fewer calls to external APIs
  4. Cost Savings: No API rate limit consumption

Complete Optimization Pipeline

The three layers work together in sequence:

1. API Response (10,000 tokens)
   ↓
2. Response Projection (JMESPath) → 2,500 tokens (75% reduction)
   ↓
3. JSON Compression (Tabular) → 800 tokens (68% reduction)
   ↓
4. Caching (if applicable) → 0 tokens (100% reduction on cache hit)

Total Reduction: 92% (from 10,000 to 800 tokens)
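
Putting the sketches from earlier sections together, the whole pipeline can be expressed in a few lines. This remains illustrative: the in-memory CACHE dict and the call_api callable stand in for Apigene's real cache store and executor, and it reuses the project_response, compact_uniform_array, and cache_key helpers sketched above:

import time

CACHE = {}            # cache_key -> (expires_at, payload); in-memory stand-in
TTL_SECONDS = 3600    # default 1-hour TTL

def execute_tool(call_api, api_name, operation_id, context,
                 projection=None, scope_id=None):
    key = cache_key(api_name, operation_id, context, scope_id, projection)
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():
        return hit[1]  # Layer 3: cache hit, zero additional tokens
    raw = call_api(api_name, operation_id, context)   # the actual API call
    result = project_response(raw, projection)        # Layer 1: projection
    if isinstance(result, list):
        result = compact_uniform_array(result)        # Layer 2: compression
    CACHE[key] = (time.time() + TTL_SECONDS, result)
    return result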

Real-World Examples

Example 1: Google Search Console Analytics

Use Case: Analyze top-performing pages

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Google Search Console",
    "context": {
      "operationId": "searchAnalytics.query"
    }
  }
}

Result: 10,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Google Search Console",
    "context": {
      "operationId": "searchAnalytics.query",
      "startDate": "2024-01-01",
      "endDate": "2024-01-31"
    },
    "response_projection": "rows[?clicks > `1000`].{page: keys[1], clicks: clicks, ctr: ctr, position: position}"
  }
}

Result: 500 tokens (95% reduction)

Example 2: Salesforce Opportunity Analysis

Use Case: Get high-value opportunities

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Salesforce",
    "context": {
      "operationId": "listOpportunities"
    }
  }
}

Result: 8,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Salesforce",
    "context": {
      "operationId": "listOpportunities"
    },
    "response_projection": "opportunities[?amount > `10000`].{name: name, amount: amount, stage: stageName, closeDate: closeDate}"
  }
}

Result: 1,200 tokens (85% reduction)

Example 3: GitHub Pull Request Review

Use Case: List open pull requests with basic info

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "GitHub",
    "context": {
      "operationId": "listPullRequests"
    }
  }
}

Result: 6,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "GitHub",
    "context": {
      "operationId": "listPullRequests",
      "state": "open"
    },
    "response_projection": "pull_requests[*].{number: number, title: title, author: user.login, status: state, url: html_url}"
  }
}

Result: 800 tokens (87% reduction)

Example 4: Jira Issue Tracking

Use Case: Get open issues with summaries

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Jira",
    "context": {
      "operationId": "listIssues"
    }
  }
}

Result: 7,500 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Jira",
    "context": {
      "operationId": "listIssues",
      "status": "open"
    },
    "response_projection": "{issues: issues[*].{id: id, key: key, summary: fields.summary, status: fields.status.name, assignee: fields.assignee.displayName}}"
  }
}

Result: 1,000 tokens (87% reduction)

Example 5: Gmail Email Analysis

Use Case: Get unread email summaries

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Gmail",
    "context": {
      "operationId": "listMessages"
    }
  }
}

Result: 5,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Gmail",
    "context": {
      "operationId": "listMessages",
      "q": "is:unread"
    },
    "response_projection": "messages[*].{id: id, subject: payload.headers[?name == 'Subject'].value | [0], from: payload.headers[?name == 'From'].value | [0], date: internalDate}"
  }
}

Result: 600 tokens (88% reduction)

Performance Impact

Token Reduction Metrics

| API | Original Size (tokens) | Projected Size (tokens) | Compressed Size (tokens) | Total Reduction |
| --- | --- | --- | --- | --- |
| Google Search Console | 10,000 | 2,500 | 800 | 92% |
| Salesforce | 8,000 | 1,500 | 600 | 92.5% |
| GitHub | 6,000 | 1,200 | 500 | 91.7% |
| Jira | 7,500 | 1,000 | 400 | 94.7% |
| Gmail | 5,000 | 800 | 300 | 94% |

Cost Savings

Scenario: Agent processing 1,000 requests/day

Without Optimization:

  • Average response: 7,000 tokens
  • Daily tokens: 7,000,000
  • Cost: $70/day ($2,100/month)

With Optimization:

  • Average response: 500 tokens (93% reduction)
  • Daily tokens: 500,000
  • Cost: $5/day ($150/month)

Savings: $1,950/month (93% cost reduction)

Accuracy Improvement

Context Rot Reduction:

  • Without Optimization: 7,000 tokens of mixed relevant/irrelevant data
  • With Optimization: 500 tokens of focused, relevant data
  • Improvement: 93% reduction in noise, leading to:
    • Better LLM focus on relevant information
    • Reduced hallucination risk
    • Improved reasoning accuracy
    • Faster response generation

Speed Improvements

Latency Reduction:

  • Projection: <10ms overhead
  • Compression: <5ms overhead
  • Cache Hit: 0ms (instant)
  • Total Speedup: 10-100x faster on cache hits

How It Works: High-Level Architecture

Response Processing Pipeline

Apigene processes tool responses through a three-stage pipeline:

Stage 1: Response Projection

  • JMESPath expressions are parsed and evaluated against the response
  • Only specified fields are extracted, reducing data size by 70-90%
  • Projection happens server-side before any data transmission

Stage 2: JSON Compression

  • Two modes available: Column/Row format or TOON format
  • Compression algorithm analyzes data structure to determine optimal format
  • Only applies when minimum compression ratio threshold is met
  • Reduces remaining data by an additional 35-65%

Stage 3: Intelligent Caching

  • Cache key generated from request parameters and projection
  • Cache checked before API execution
  • Successful responses stored with configurable TTL
  • Subsequent identical requests return instantly from cache

Compression Mode Selection

The system automatically selects the best compression mode based on:

Column/Row Format Selection:

  • Detects uniform arrays of objects with identical keys
  • Converts to columnar format when all objects share the same structure
  • Ideal for tabular data like employee records, time-series, or uniform API responses

TOON Format Selection:

  • Optimized for LLM consumption with maximum token efficiency
  • Uses compact syntax that reduces tokens while maintaining accuracy
  • According to TOON benchmarks, achieves higher accuracy (73.9% vs 69.7%) with 39.6% fewer tokens than JSON
  • Best efficiency score: 26.9 accuracy%/1K tokens
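
The selection rule can be pictured as a simple heuristic. This is a hypothetical sketch consistent with the rules above, not Apigene's exact logic:

def select_compression_mode(data):
    # Uniform arrays of identically-keyed objects compress best as columns/rows.
    if isinstance(data, list) and data and all(isinstance(r, dict) for r in data):
        first_keys = list(data[0].keys())
        if all(list(r.keys()) == first_keys for r in data):
            return "json-compact"
    return "toon"  # general-purpose, LLM-optimized default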

Compression Effectiveness

Based on benchmark data from TOON format benchmarks:

Token Efficiency Comparison:

  • TOON: 26.9 accuracy%/1K tokens (highest efficiency)
  • JSON compact: 22.9 accuracy%/1K tokens
  • YAML: 18.6 accuracy%/1K tokens
  • JSON: 15.3 accuracy%/1K tokens
  • XML: 13.0 accuracy%/1K tokens

Key Insight: TOON achieves 73.9% accuracy (vs JSON's 69.7%) while using 39.6% fewer tokens, making it the optimal choice for LLM consumption when maximum efficiency is required.

Best Practices

1. Always Use Projection for Large Responses

{
  "response_projection": "items[*].{id: id, name: name, status: status}"
}

2. Combine Projection with Filtering

{
  "response_projection": "items[?status == 'active' && price > `100`].{id: id, name: name}"
}

3. Use Aggregation for Summary Data

{
  "response_projection": "{total: length(items), active: length(items[?status == 'active']), avg_price: avg(items[*].price)}"
}

4. Enable Compression for Tabular Data

Column/Row Format (for uniform tabular data):

{
  "json_compression": {
    "enabled": true,
    "min_compression_ratio": 5.0,
    "mode": "json-compact"
  }
}

TOON Format (for maximum LLM efficiency):

{
  "json_compression": {
    "enabled": true,
    "min_compression_ratio": 5.0,
    "mode": "toon"
  }
}

When to Use Each Mode:

  • json-compact: Best for uniform tabular data where you need human-readable format
  • toon: Best for LLM consumption where maximum token efficiency and accuracy are priorities. According to TOON benchmarks, TOON achieves the highest efficiency score (26.9 accuracy%/1K tokens) and better accuracy than JSON while using fewer tokens.

5. Configure Caching for Read Operations

{
  "cache_enabled": true,
  "cache_ttl": 3600,
  "operation_type": "get"
}

6. Use Projection in Batch Operations

{
  "tool": "run_action_batch",
  "arguments": {
    "base_context": {
      "operationId": "readEmail",
      "response_projection": "{id: id, subject: subject, from: from}"
    },
    "batch_context": [
      {"email_id": "123"},
      {"email_id": "456"}
    ]
  }
}

Advanced JMESPath Patterns

Pattern 1: Multi-Level Filtering

{
  "response_projection": "rows[?clicks > `1000` && ctr > `0.03`].{page: keys[1], metrics: {clicks: clicks, ctr: ctr}}"
}

Pattern 2: Grouping and Aggregation

{
  "response_projection": "rows | group_by(@, &keys[0]) | map({date: @[0].keys[0], total_clicks: sum(@[*].clicks), avg_ctr: avg(@[*].ctr)})"
}

Pattern 3: Nested Field Extraction

{
  "response_projection": "items[*].{id: id, user: {name: user.name, email: user.email}, metadata: {created: metadata.created_at, updated: metadata.updated_at}}"
}

Pattern 4: Conditional Transformation

{
  "response_projection": "items[*].{id: id, status: status, priority: status == 'urgent' ? 'high' : 'normal'}"
}
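
JMESPath has no ternary operator, so Pattern 4 relies on the cond && value || fallback idiom (it works because a non-empty string is truthy). A quick check with the jmespath package, using made-up issue data:

import jmespath

issues = [{"id": 1, "status": "urgent"}, {"id": 2, "status": "open"}]
expr = "[*].{id: id, priority: status == 'urgent' && 'high' || 'normal'}"
print(jmespath.search(expr, issues))
# -> [{'id': 1, 'priority': 'high'}, {'id': 2, 'priority': 'normal'}]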

Pattern 5: Array Flattening

{
  "response_projection": "items[*].tags[]"
}

Monitoring and Optimization

Compression Metrics

Apigene provides detailed compression metrics for monitoring:

  • Compression ratio: Percentage reduction achieved
  • Mode used: Column/Row or TOON format
  • Before/after sizes: Token counts for comparison
  • Threshold check: Whether compression met minimum ratio

Example metrics:

JSON compression: 10000 bytes -> 800 bytes (92.0% reduction)
TOON compression: 15000 bytes -> 8744 bytes (41.7% reduction)

Cache Performance Monitoring

Track cache effectiveness:

  • Cache hit rate: Percentage of requests served from cache
  • Cache miss rate: Percentage requiring API execution
  • Average response time: Comparison between cache hits and misses
  • TTL effectiveness: How often cached data expires vs is used

Token Usage Tracking

Monitor end-to-end token consumption:

  • Original response size: Before any optimization
  • After projection: Size reduction from field extraction
  • After compression: Additional reduction from compression
  • Cache impact: Zero tokens when cache hit occurs
  • Total reduction: Overall percentage improvement

This tracking helps identify optimization opportunities and measure ROI of compression and caching strategies.

Comparison: Before vs After

| Metric | Without Optimization | With Optimization | Improvement |
| --- | --- | --- | --- |
| Average Response Size | 7,000 tokens | 500 tokens | 93% reduction |
| Cost per Request | $0.07 | $0.005 | 93% reduction |
| Context Accuracy | Low (noise) | High (focused) | Significant |
| Response Time | 500 ms | 50 ms (cache) | 10x faster |
| API Rate Limits | High usage | Low usage | 93% reduction |

Conclusion

Tool output optimization is critical for building cost-effective, accurate AI agents. Apigene's three-layer approach—response projection, JSON compression (with both Column/Row and TOON formats), and intelligent caching—reduces tool output by up to 95%, delivering:

  • 93% cost reduction through token optimization
  • Improved accuracy by reducing context rot (TOON format achieves 73.9% accuracy vs JSON's 69.7% with 39.6% fewer tokens)
  • 10x faster responses with intelligent caching
  • Better scalability through reduced API load
  • Optimal format selection with automatic detection of best compression mode

By combining these techniques, agents can process more requests with lower costs, faster responses, and better accuracy. The key is applying projection first to extract only needed data, then compressing what remains using the optimal format (Column/Row for tabular data, TOON for maximum LLM efficiency), and finally caching to avoid redundant calls.

According to TOON format benchmarks, the TOON compression mode achieves the highest efficiency score (26.9 accuracy%/1K tokens) among all formats tested, making it ideal for LLM consumption where both token efficiency and accuracy are critical.

Ready to optimize your agent's token usage? Get started with Apigene's MCP Gateway and experience the power of intelligent tool output reduction.


#token-optimization #json-compression #response-projection #jmespath #llm-cost-reduction #context-window #response-caching #tool-output-reduction #token-efficiency #mcp-gateway #ai-agents #cost-optimization