
Reducing Tool Output by 95%: JSON Compression, Response Projection, and Caching for LLM Cost Optimization

Apigene Team
15 min read

AI agents executing tools face a critical challenge: large tool outputs consume massive token budgets, driving up costs and reducing accuracy through context rot. A single API response can contain thousands of unnecessary fields, consuming tens of thousands of tokens before the LLM even processes the user's request.

At Apigene, we've implemented a three-layer optimization strategy combining JSON compression, response projection, and intelligent caching to reduce tool output by up to 95%. This article explores how these techniques work together to dramatically reduce LLM costs, improve accuracy, and accelerate agent performance.

The Problem: Tool Output Bloat

Token Consumption Explosion

Consider a typical API response from Google Search Console:

Original Response (10,000 tokens):

{
  "rows": [
    {
      "keys": ["2024-01-01", "https://example.com/page1"],
      "clicks": 1250,
      "impressions": 45000,
      "ctr": 0.0278,
      "position": 3.2,
      "device": "DESKTOP",
      "country": "US",
      "query": "example search query",
      "page": "https://example.com/page1",
      "date": "2024-01-01",
      "searchAppearance": "NONE",
      "dataState": "FINAL",
      "isDataGolden": true,
      "dimensionFilterGroups": [],
      "dimensionGroup": null,
      "dimensionValues": {},
      "metricValues": {},
      "metadata": {
        "schema": "https://schema.org/WebPage",
        "lastModified": "2024-01-01T00:00:00Z",
        "crawlStatus": "OK",
        "indexStatus": "INDEXED"
      }
    },
    // ... 99 more similar objects
  ],
  "responseAggregationType": "BY_PROPERTY",
  "rowCount": 100,
  "kind": "webmasters#searchAnalyticsSearchResponse",
  "timeZone": "America/New_York",
  "dataState": "FINAL"
}

This response contains:

  • 100 rows of search analytics data
  • 20+ fields per row (most unused)
  • Metadata and schema information (rarely needed)
  • Total: ~10,000 tokens

The Hidden Costs

Large tool outputs create three critical problems:

  1. Token Cost Explosion: At $0.01 per 1K input tokens, 10,000 tokens = $0.10 per request
  2. Context Rot: Irrelevant data dilutes important information, reducing LLM accuracy
  3. Speed Degradation: Processing large responses increases latency and reasoning time

Real-World Impact

Scenario: Agent analyzing search performance

1. Fetch Google Search Console data: 10,000 tokens
2. Process user query: 100 tokens
3. LLM reasoning with full context: 2,000 tokens
4. Generate response: 500 tokens
Total: 12,600 tokens = $0.126 per request

If the agent only needs clicks, impressions, and ctr, roughly 9,500 of those 12,600 tokens (75%) are wasted on irrelevant data.

The Solution: Three-Layer Optimization

Apigene implements three complementary techniques that work together to reduce tool output:

  1. Response Projection: Extract only needed fields using JMESPath
  2. JSON Compression: Compress remaining data using tabular format
  3. Intelligent Caching: Cache responses to avoid redundant API calls

Layer 1: Response Projection with JMESPath

Response projection uses JMESPath expressions to extract only the fields you need from API responses, dramatically reducing token consumption.

How It Works

Apigene applies projection before compression and caching, so only minimal data flows through the rest of the pipeline. The projection engine parses the JMESPath expression and extracts only the specified fields from the JSON response. This happens server-side, before anything is sent to the LLM.
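
As a concrete illustration, here is a minimal sketch of this step using the open-source jmespath Python package (pip install jmespath). The function name project_response and its fallback behavior are illustrative assumptions, not Apigene's internal API:

import jmespath

def project_response(response, projection):
    """Apply a JMESPath projection to an API response before the LLM sees it."""
    if not projection:
        return response  # no projection requested; pass through unchanged
    result = jmespath.search(projection, response)
    # Fall back to the full response if the expression matches nothing.
    return result if result is not None else response

projected = project_response(
    {"rows": [{"keys": ["2024-01-01", "https://example.com/page1"],
               "clicks": 1250, "impressions": 45000, "ctr": 0.0278,
               "position": 3.2, "device": "DESKTOP", "country": "US"}]},
    "rows[*].{date: keys[0], page: keys[1], clicks: clicks, "
    "impressions: impressions, ctr: ctr, position: position}",
)
# projected now contains only the six requested fields per row.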

Example 1: Google Search Console - Field Selection

Original Response: 10,000 tokens

With Projection:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Google Search Console",
    "user_input": "Get search analytics data",
    "context": {
      "operationId": "searchAnalytics.query",
      "startDate": "2024-01-01",
      "endDate": "2024-01-31",
      "dimensions": ["date", "page"]
    },
    "response_projection": "rows[*].{date: keys[0], page: keys[1], clicks: clicks, impressions: impressions, ctr: ctr, position: position}"
  }
}

Projected Response: 2,500 tokens (75% reduction)

Result:

{
  "rows": [
    {
      "date": "2024-01-01",
      "page": "https://example.com/page1",
      "clicks": 1250,
      "impressions": 45000,
      "ctr": 0.0278,
      "position": 3.2
    }
    // ... 99 more rows (only 6 fields each)
  ]
}

Example 2: Google Search Console - Filtering and Aggregation

Filter High-Performing Pages:

{
  "response_projection": "rows[?clicks > `1000`].{page: keys[1], clicks: clicks, ctr: ctr}"
}

Aggregate by Date:

{
  "response_projection": "{dates: rows[*].{date: keys[0], total_clicks: clicks, total_impressions: impressions}} | group_by(@, &date)"
}

Calculate Average Position:

{
  "response_projection": "{avg_position: avg(rows[*].position), total_clicks: sum(rows[*].clicks)}"
}

Example 3: Complex Nested Projection

Extract Nested Data:

{
  "response_projection": "rows[*].{url: keys[1], metrics: {clicks: clicks, impressions: impressions}, date: keys[0]}"
}

Filter and Transform:

{
  "response_projection": "rows[?ctr > `0.03`].{page: keys[1], performance: {clicks: clicks, ctr: ctr, position: position}}"
}

JMESPath Projection Patterns

1. Field Selection:

"{name: name, email: email, id: id}"

2. Array Extraction:

"items[*].name"

3. Filtering:

"items[?price > `100`]"

4. Aggregation:

"{total: sum(items[*].price), avg: avg(items[*].price), max: max(items[*].price)}"

5. Nested Extraction:

"users[*].{id: id, profile: {name: profile.name, email: profile.email}}"

6. Conditional Filtering:

"items[?status == 'active' && price > `50`]"

7. Sorting:

"sort_by(items, &price)[-10:]"

8. Multi-Level Projection:

"{summary: {total: length(items), active: length(items[?status == 'active'])}, items: items[*].{id: id, name: name}}"

Layer 2: JSON Compression

After projection, Apigene applies JSON compression to further reduce size by converting uniform arrays into tabular format.

How It Works

Apigene supports two compression modes optimized for different use cases:

1. Column/Row Format (json-compact): Converts uniform arrays of objects into a tabular columnar format. When all objects in an array share identical keys, the system extracts column names once and stores values as rows, eliminating repetitive key names.

2. TOON Format: A specialized format optimized for LLM token efficiency. TOON uses a compact syntax that reduces token count while maintaining high LLM comprehension accuracy.

Both modes automatically detect when compression is beneficial and only apply when a minimum compression ratio threshold is met (default 5%).
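
A minimal sketch of the column/row idea, including the ratio guard, might look like the following. The function name and the byte-length ratio check are illustrative assumptions; Apigene's actual implementation (and the TOON encoder) are not shown:

import json

def compact_uniform_array(records, min_ratio=0.05):
    """Convert a uniform array of objects into columns + rows, when it pays off."""
    if not records or not all(isinstance(r, dict) for r in records):
        return records
    columns = list(records[0].keys())
    if any(list(r.keys()) != columns for r in records):
        return records  # objects are not uniform; leave the array untouched
    compact = {"columns": columns,
               "rows": [[r[c] for c in columns] for r in records]}
    before = len(json.dumps(records))
    after = len(json.dumps(compact))
    # Apply only if the saving meets the minimum ratio (default 5%).
    return compact if (before - after) / before >= min_ratio else records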

Compression Examples

Column/Row Format Example:

Before Compression (2,500 tokens):

[
  {"date": "2024-01-01", "clicks": 1250, "impressions": 45000},
  {"date": "2024-01-02", "clicks": 1320, "impressions": 47000},
  {"date": "2024-01-03", "clicks": 1180, "impressions": 43000}
]

After Column/Row Compression (800 tokens - 68% reduction):

{
  "columns": ["date", "clicks", "impressions"],
  "rows": [
    ["2024-01-01", 1250, 45000],
    ["2024-01-02", 1320, 47000],
    ["2024-01-03", 1180, 43000]
  ]
}

TOON Format Example:

Before Compression (15,144 tokens - GitHub repositories example): Standard JSON with repetitive field names and structure.

After TOON Compression (8,744 tokens - 42.3% reduction): Compact TOON syntax that eliminates repetitive keys while maintaining full data fidelity. According to TOON benchmarks, this format achieves higher LLM accuracy (73.9% vs 69.7%) with significantly fewer tokens.

Compression Mode Comparison

Column/Row Format (json-compact):

  • Best for: Uniform tabular data with identical object structures
  • Compression: Converts arrays to column/row format
  • Token Reduction: 35-60% reduction vs standard JSON
  • Example: Employee records, time-series data, uniform API responses

TOON Format:

  • Best for: LLM consumption with maximum token efficiency
  • Compression: Specialized compact syntax optimized for LLMs
  • Token Reduction: 40-65% reduction vs standard JSON
  • LLM Accuracy: According to TOON benchmarks, TOON achieves 73.9% accuracy (vs JSON's 69.7%) while using 39.6% fewer tokens
  • Efficiency Score: 26.9 accuracy%/1K tokens (highest among formats tested)
  • Per-Model Performance:
    • Claude Haiku: 59.8% accuracy (vs JSON's 57.4%)
    • Gemini Flash: 87.6% accuracy (vs JSON's 77.0%)
    • GPT-5 Nano: 90.9% accuracy (tied with JSON compact)
    • Grok Fast: 57.4% accuracy (vs JSON's 55.5%)
  • Key Advantage: Not only reduces tokens but actually improves LLM comprehension accuracy across multiple models

Compression Ratio Threshold

Compression only applies when it achieves a minimum compression ratio (configurable, default 5%). This ensures compression overhead is justified by meaningful size reduction. If compression doesn't meet the threshold, the original response is returned to avoid unnecessary processing.

Layer 3: Intelligent Caching

Apigene implements intelligent response caching to avoid redundant API calls, further reducing token consumption and improving speed.

How It Works

Apigene's caching system operates transparently at the interaction layer. When a tool execution request is made, the system first checks for a cached response using a composite cache key (see "Cache Key Generation" below).

If a cache hit occurs, the response is returned immediately without executing the API call. On cache miss, the API executes normally and the result is stored in cache with a configurable TTL (time-to-live).

Cache Key Generation

Cache keys are generated using a composite approach that includes:

  • API name and operation ID
  • Context parameters (excluding temporal fields like timestamps)
  • User ID (for user-scoped caching) or tenant ID (for tenant-scoped caching)
  • Response projection expression (if applied)

This ensures that identical requests with the same projection return cached results, while different projections or parameters generate separate cache entries.
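
A sketch of this key derivation, assuming a SHA-256 hash over the stable request fields (the specific field names treated as temporal here are illustrative):

import hashlib, json

TEMPORAL_FIELDS = {"timestamp", "requestTime"}  # assumed examples of temporal fields

def cache_key(api_name, operation_id, context, scope_id=None, projection=None):
    # Drop temporal fields and sort keys so equivalent requests hash identically.
    stable_ctx = {k: v for k, v in context.items() if k not in TEMPORAL_FIELDS}
    payload = json.dumps(
        [api_name, operation_id, stable_ctx, scope_id, projection],
        sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

Here scope_id stands in for either the user ID or the tenant ID, depending on the configured cache scope.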

Cache Configuration

Caching can be configured at the organization level with settings for:

  • Enabled/Disabled: Toggle caching on or off
  • TTL (Time-To-Live): How long cached responses remain valid (default: 1 hour)
  • Operation Type Filter: Which operation types to cache (e.g., GET operations only)
  • Cache Scope: User-level or tenant-level caching

This allows fine-grained control over what gets cached and for how long, balancing freshness with performance.

Cache Benefits

  1. Zero Token Cost: Cached responses don't consume API tokens
  2. Instant Response: Cached data returns immediately
  3. Reduced API Load: Fewer calls to external APIs
  4. Cost Savings: No API rate limit consumption

Complete Optimization Pipeline

The three layers work together in sequence:

1. API Response (10,000 tokens)
   ↓
2. Response Projection (JMESPath) → 2,500 tokens (75% reduction)
   ↓
3. JSON Compression (Tabular) → 800 tokens (68% reduction)
   ↓
4. Caching (if applicable) → 0 tokens (100% reduction on cache hit)

Total Reduction: 92% (from 10,000 to 800 tokens)
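
Putting the sketches from earlier sections together, the whole pipeline can be expressed in a few lines. This remains illustrative: the in-memory CACHE dict and the call_api callable stand in for Apigene's real cache store and executor, and it reuses the project_response, compact_uniform_array, and cache_key helpers sketched above:

import time

CACHE = {}            # cache_key -> (expires_at, payload); in-memory stand-in
TTL_SECONDS = 3600    # default 1-hour TTL

def execute_tool(call_api, api_name, operation_id, context,
                 projection=None, scope_id=None):
    key = cache_key(api_name, operation_id, context, scope_id, projection)
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():
        return hit[1]  # Layer 3: cache hit, zero additional tokens
    raw = call_api(api_name, operation_id, context)   # the actual API call
    result = project_response(raw, projection)        # Layer 1: projection
    if isinstance(result, list):
        result = compact_uniform_array(result)        # Layer 2: compression
    CACHE[key] = (time.time() + TTL_SECONDS, result)
    return result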

Real-World Examples

Example 1: Google Search Console Analytics

Use Case: Analyze top-performing pages

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Google Search Console",
    "context": {
      "operationId": "searchAnalytics.query"
    }
  }
}

Result: 10,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Google Search Console",
    "context": {
      "operationId": "searchAnalytics.query",
      "startDate": "2024-01-01",
      "endDate": "2024-01-31"
    },
    "response_projection": "rows[?clicks > `1000`].{page: keys[1], clicks: clicks, ctr: ctr, position: position}"
  }
}

Result: 500 tokens (95% reduction)

Example 2: Salesforce Opportunity Analysis

Use Case: Get high-value opportunities

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Salesforce",
    "context": {
      "operationId": "listOpportunities"
    }
  }
}

Result: 8,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Salesforce",
    "context": {
      "operationId": "listOpportunities"
    },
    "response_projection": "opportunities[?amount > `10000`].{name: name, amount: amount, stage: stageName, closeDate: closeDate}"
  }
}

Result: 1,200 tokens (85% reduction)

Example 3: GitHub Pull Request Review

Use Case: List open pull requests with basic info

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "GitHub",
    "context": {
      "operationId": "listPullRequests"
    }
  }
}

Result: 6,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "GitHub",
    "context": {
      "operationId": "listPullRequests",
      "state": "open"
    },
    "response_projection": "pull_requests[*].{number: number, title: title, author: user.login, status: state, url: html_url}"
  }
}

Result: 800 tokens (87% reduction)

Example 4: Jira Issue Tracking

Use Case: Get open issues with summaries

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Jira",
    "context": {
      "operationId": "listIssues"
    }
  }
}

Result: 7,500 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Jira",
    "context": {
      "operationId": "listIssues",
      "status": "open"
    },
    "response_projection": "{issues: issues[*].{id: id, key: key, summary: fields.summary, status: fields.status.name, assignee: fields.assignee.displayName}}"
  }
}

Result: 1,000 tokens (87% reduction)

Example 5: Gmail Email Analysis

Use Case: Get unread email summaries

Without Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Gmail",
    "context": {
      "operationId": "listMessages"
    }
  }
}

Result: 5,000 tokens

With Optimization:

{
  "tool": "run_action",
  "arguments": {
    "app_name": "Gmail",
    "context": {
      "operationId": "listMessages",
      "q": "is:unread"
    },
    "response_projection": "messages[*].{id: id, subject: payload.headers[?name == 'Subject'].value | [0], from: payload.headers[?name == 'From'].value | [0], date: internalDate}"
  }
}

Result: 600 tokens (88% reduction)

Performance Impact

Token Reduction Metrics

| API | Original Size (tokens) | Projected Size (tokens) | Compressed Size (tokens) | Total Reduction |
| --- | --- | --- | --- | --- |
| Google Search Console | 10,000 | 2,500 | 800 | 92% |
| Salesforce | 8,000 | 1,500 | 600 | 92.5% |
| GitHub | 6,000 | 1,200 | 500 | 91.7% |
| Jira | 7,500 | 1,000 | 400 | 94.7% |
| Gmail | 5,000 | 800 | 300 | 94% |

Cost Savings

Scenario: Agent processing 1,000 requests/day

Without Optimization:

  • Average response: 7,000 tokens
  • Daily tokens: 7,000,000
  • Cost: $70/day ($2,100/month)

With Optimization:

  • Average response: 500 tokens (93% reduction)
  • Daily tokens: 500,000
  • Cost: $5/day ($150/month)

Savings: $1,950/month (93% cost reduction)

Accuracy Improvement

Context Rot Reduction:

  • Without Optimization: 7,000 tokens of mixed relevant/irrelevant data
  • With Optimization: 500 tokens of focused, relevant data
  • Improvement: 93% reduction in noise, leading to:
    • Better LLM focus on relevant information
    • Reduced hallucination risk
    • Improved reasoning accuracy
    • Faster response generation

Speed Improvements

Latency Reduction:

  • Projection: <10ms overhead
  • Compression: <5ms overhead
  • Cache Hit: 0ms (instant)
  • Total Speedup: 10-100x faster on cache hits

How It Works: High-Level Architecture

Response Processing Pipeline

Apigene processes tool responses through a three-stage pipeline:

Stage 1: Response Projection

  • JMESPath expressions are parsed and evaluated against the response
  • Only specified fields are extracted, reducing data size by 70-90%
  • Projection happens server-side before any data transmission

Stage 2: JSON Compression

  • Two modes available: Column/Row format or TOON format
  • Compression algorithm analyzes data structure to determine optimal format
  • Only applies when minimum compression ratio threshold is met
  • Reduces remaining data by an additional 35-65%

Stage 3: Intelligent Caching

  • Cache key generated from request parameters and projection
  • Cache checked before API execution
  • Successful responses stored with configurable TTL
  • Subsequent identical requests return instantly from cache

Compression Mode Selection

The system automatically selects the best compression mode based on:

Column/Row Format Selection:

  • Detects uniform arrays of objects with identical keys
  • Converts to columnar format when all objects share the same structure
  • Ideal for tabular data like employee records, time-series, or uniform API responses

TOON Format Selection:

  • Optimized for LLM consumption with maximum token efficiency
  • Uses compact syntax that reduces tokens while maintaining accuracy
  • According to TOON benchmarks, achieves higher accuracy (73.9% vs 69.7%) with 39.6% fewer tokens than JSON
  • Best efficiency score: 26.9 accuracy%/1K tokens
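
The selection rule can be pictured as a simple heuristic. This is a hypothetical sketch consistent with the rules above, not Apigene's exact logic:

def select_compression_mode(data):
    # Uniform arrays of identically-keyed objects compress best as columns/rows.
    if isinstance(data, list) and data and all(isinstance(r, dict) for r in data):
        first_keys = list(data[0].keys())
        if all(list(r.keys()) == first_keys for r in data):
            return "json-compact"
    return "toon"  # general-purpose, LLM-optimized default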

Compression Effectiveness

Based on benchmark data from TOON format benchmarks:

Token Efficiency Comparison:

  • TOON: 26.9 accuracy%/1K tokens (highest efficiency)
  • JSON compact: 22.9 accuracy%/1K tokens
  • YAML: 18.6 accuracy%/1K tokens
  • JSON: 15.3 accuracy%/1K tokens
  • XML: 13.0 accuracy%/1K tokens

Key Insight: TOON achieves 73.9% accuracy (vs JSON's 69.7%) while using 39.6% fewer tokens, making it the optimal choice for LLM consumption when maximum efficiency is required.

Best Practices

1. Always Use Projection for Large Responses

{
  "response_projection": "items[*].{id: id, name: name, status: status}"
}

2. Combine Projection with Filtering

{
  "response_projection": "items[?status == 'active' && price > `100`].{id: id, name: name}"
}

3. Use Aggregation for Summary Data

{
  "response_projection": "{total: length(items), active: length(items[?status == 'active']), avg_price: avg(items[*].price)}"
}

4. Enable Compression for Tabular Data

Column/Row Format (for uniform tabular data):

{
  "json_compression": {
    "enabled": true,
    "min_compression_ratio": 5.0,
    "mode": "json-compact"
  }
}

TOON Format (for maximum LLM efficiency):

{
  "json_compression": {
    "enabled": true,
    "min_compression_ratio": 5.0,
    "mode": "toon"
  }
}

When to Use Each Mode:

  • json-compact: Best for uniform tabular data where you need human-readable format
  • toon: Best for LLM consumption where maximum token efficiency and accuracy are priorities. According to TOON benchmarks, TOON achieves the highest efficiency score (26.9 accuracy%/1K tokens) and better accuracy than JSON while using fewer tokens.

5. Configure Caching for Read Operations

{
  "cache_enabled": true,
  "cache_ttl": 3600,
  "operation_type": "get"
}

6. Use Projection in Batch Operations

{
  "tool": "run_action_batch",
  "arguments": {
    "base_context": {
      "operationId": "readEmail",
      "response_projection": "{id: id, subject: subject, from: from}"
    },
    "batch_context": [
      {"email_id": "123"},
      {"email_id": "456"}
    ]
  }
}

Advanced JMESPath Patterns

Pattern 1: Multi-Level Filtering

{
  "response_projection": "rows[?clicks > `1000` && ctr > `0.03`].{page: keys[1], metrics: {clicks: clicks, ctr: ctr}}"
}

Pattern 2: Grouping and Aggregation

{
  "response_projection": "rows | group_by(@, &keys[0]) | map({date: @[0].keys[0], total_clicks: sum(@[*].clicks), avg_ctr: avg(@[*].ctr)})"
}

Pattern 3: Nested Field Extraction

{
  "response_projection": "items[*].{id: id, user: {name: user.name, email: user.email}, metadata: {created: metadata.created_at, updated: metadata.updated_at}}"
}

Pattern 4: Conditional Transformation

{
  "response_projection": "items[*].{id: id, status: status, priority: status == 'urgent' ? 'high' : 'normal'}"
}
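
JMESPath has no ternary operator, so Pattern 4 relies on the cond && value || fallback idiom (it works because a non-empty string is truthy). A quick check with the jmespath package, using made-up issue data:

import jmespath

issues = [{"id": 1, "status": "urgent"}, {"id": 2, "status": "open"}]
expr = "[*].{id: id, priority: status == 'urgent' && 'high' || 'normal'}"
print(jmespath.search(expr, issues))
# -> [{'id': 1, 'priority': 'high'}, {'id': 2, 'priority': 'normal'}]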

Pattern 5: Array Flattening

{
  "response_projection": "items[*].tags[]"
}

Monitoring and Optimization

Compression Metrics

Apigene provides detailed compression metrics for monitoring:

  • Compression ratio: Percentage reduction achieved
  • Mode used: Column/Row or TOON format
  • Before/after sizes: Token counts for comparison
  • Threshold check: Whether compression met minimum ratio

Example metrics:

JSON compression: 10000 bytes -> 800 bytes (92.0% reduction)
TOON compression: 15000 bytes -> 8744 bytes (41.7% reduction)

Cache Performance Monitoring

Track cache effectiveness:

  • Cache hit rate: Percentage of requests served from cache
  • Cache miss rate: Percentage requiring API execution
  • Average response time: Comparison between cache hits and misses
  • TTL effectiveness: How often cached data expires vs is used

Token Usage Tracking

Monitor end-to-end token consumption:

  • Original response size: Before any optimization
  • After projection: Size reduction from field extraction
  • After compression: Additional reduction from compression
  • Cache impact: Zero tokens when cache hit occurs
  • Total reduction: Overall percentage improvement

This tracking helps identify optimization opportunities and measure ROI of compression and caching strategies.

Comparison: Before vs After

| Metric | Without Optimization | With Optimization | Improvement |
| --- | --- | --- | --- |
| Average Response Size | 7,000 tokens | 500 tokens | 93% reduction |
| Cost per Request | $0.07 | $0.005 | 93% reduction |
| Context Accuracy | Low (noise) | High (focused) | Significant |
| Response Time | 500 ms | 50 ms (cache) | 10x faster |
| API Rate Limits | High usage | Low usage | 93% reduction |

Conclusion

Tool output optimization is critical for building cost-effective, accurate AI agents. Apigene's three-layer approach—response projection, JSON compression (with both Column/Row and TOON formats), and intelligent caching—reduces tool output by up to 95%, delivering:

  • 93% cost reduction through token optimization
  • Improved accuracy by reducing context rot (TOON format achieves 73.9% accuracy vs JSON's 69.7% with 39.6% fewer tokens)
  • 10x faster responses with intelligent caching
  • Better scalability through reduced API load
  • Optimal format selection with automatic detection of best compression mode

By combining these techniques, agents can process more requests with lower costs, faster responses, and better accuracy. The key is applying projection first to extract only needed data, then compressing what remains using the optimal format (Column/Row for tabular data, TOON for maximum LLM efficiency), and finally caching to avoid redundant calls.

According to TOON format benchmarks, the TOON compression mode achieves the highest efficiency score (26.9 accuracy%/1K tokens) among all formats tested, making it ideal for LLM consumption where both token efficiency and accuracy are critical.

Ready to optimize your agent's token usage? Get started with Apigene's MCP Gateway and experience the power of intelligent tool output reduction.


#token-optimization #json-compression #response-projection #jmespath #llm-cost-reduction #context-window #response-caching #tool-output-reduction #token-efficiency #mcp-gateway #ai-agents #cost-optimization