TL;DR: I tested TOON's token optimisation claims on real data. CSV beats TOON by 10% on token efficiency and has 30 years of mature tooling. TOON and CSV share the same type ambiguity problem that JSON solves.
The Setup
TOON (Token-Oriented Object Notation) promises 30-60% token savings vs JSON for LLM applications. As someone building LLM systems for financial data, this claim caught my attention. Token costs add up quickly when you're processing market data at scale.
I ran comprehensive tests comparing JSON, TOON, and CSV across three dataset sizes:
- Small: 3 users
- Medium: 50 users
- Large: 200 users
Test methodology:
- Used `tiktoken` with `cl100k_base` encoding (GPT-4/GPT-3.5-turbo)
- Tested on uniform tabular data (user records)
- All formats contained identical information
- Measured raw token counts
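For reference, here's a minimal sketch of the counting step. The one-record payloads below are illustrative, not the actual test data, and the TOON line follows the tabular layout discussed later in this post:

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-4 / GPT-3.5-turbo
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Illustrative single-record payloads; the real tests used 3/50/200 user records
json_payload = '[{"id": 1, "name": "Alice", "email": "a@x.com", "role": "admin", "active": true}]'
csv_payload = "id,name,email,role,active\n1,Alice,a@x.com,admin,true"
toon_payload = "users[1]{id,name,email,role,active}:\n  1,Alice,a@x.com,admin,true"

for label, payload in [("JSON", json_payload), ("CSV", csv_payload), ("TOON", toon_payload)]:
    print(f"{label}: {count_tokens(payload)} tokens")
```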
Test Results
Small Dataset (3 users)
- JSON: 66 tokens
- CSV: 20 tokens (69.7% savings) ✅
- TOON: 28 tokens (57.6% savings)
- Result: TOON uses 40% MORE tokens than CSV
Medium Dataset (50 users)
- JSON: 1,709 tokens
- CSV: 562 tokens (67.1% savings) ✅
- TOON: 617 tokens (63.9% savings)
- Result: TOON uses 9.8% MORE tokens than CSV
Large Dataset (200 users)
- JSON: 6,809 tokens
- CSV: 2,217 tokens (67.4% savings) ✅
- TOON: 2,422 tokens (64.4% savings)
- Result: TOON uses 9.2% MORE tokens than CSV
CSV consistently beats TOON on token efficiency.
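(The savings figures above are relative to the JSON baseline: savings = (JSON tokens - other tokens) / JSON tokens. For the large dataset, CSV gives (6,809 - 2,217) / 6,809 ≈ 67.4% and TOON gives (6,809 - 2,422) / 6,809 ≈ 64.4%. The "TOON uses X% more" figures compare TOON directly against CSV, e.g. 2,422 / 2,217 - 1 ≈ 9.2%.)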
Why TOON Uses More Tokens
TOON adds overhead that CSV doesn't need:
TOON: users[50]{id,name,email,role,active}:
CSV: id,name,email,role,active
TOON's extras:
- Array length declarations `[50]`
- Curly braces `{}`
- Colon delimiter `:`
- Leading spaces on rows
Pure overhead.
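To make that concrete, here's a rough sketch that serialises the same two records both ways. The TOON output is assembled by hand from the structure described above (length declaration, braced field list, colon, indented rows) rather than via an official library:

```python
import csv
import io

records = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
]
fields = ["id", "name", "role"]

# CSV: one header row plus the data rows -- nothing else
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue().strip()

# TOON-style tabular array: length declaration, braced field list,
# colon delimiter, and indented rows
toon_rows = "\n".join("  " + ",".join(str(rec[f]) for f in fields) for rec in records)
toon_text = f"users[{len(records)}]{{{','.join(fields)}}}:\n{toon_rows}"

print(csv_text)
# id,name,role
# 1,Alice,admin
# 2,Bob,user
print(toon_text)
# users[2]{id,name,role}:
#   1,Alice,admin
#   2,Bob,user
```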
The Type Ambiguity Problem
Here's where it gets interesting. TOON and CSV share the SAME type ambiguity problem:
CSV: 1,Alice,150.25,true
TOON: 1,Alice,150.25,true
Questions for the parser/LLM:
- Is `150.25` a string or a float?
- Is `true` a boolean or the string "true"?
- Is `1` an integer or a string?
Neither format specifies types. Both leave it to the parser/LLM to guess.
Only JSON has explicit types:
{"id": 1, "name": "Alice", "price": 150.25, "active": true}
No ambiguity. No guessing.
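A quick sketch of what that difference looks like on the consuming side: CSV fields arrive as untyped strings, while JSON values keep their types:

```python
import csv
import io
import json

row = "1,Alice,150.25,true"

# CSV/TOON-style row: every field comes back as a string,
# so the consumer has to guess or hard-code the types
parsed_csv = next(csv.reader(io.StringIO(row)))
print(parsed_csv)  # ['1', 'Alice', '150.25', 'true'] -- all str

# JSON: the types survive the round trip
parsed_json = json.loads('{"id": 1, "name": "Alice", "price": 150.25, "active": true}')
print({k: type(v).__name__ for k, v in parsed_json.items()})
# {'id': 'int', 'name': 'str', 'price': 'float', 'active': 'bool'}
```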
Real-World Implications
Integration Hell
Every data boundary requires conversion:
Bloomberg/Reuters → JSON → CSV/TOON → LLM → JSON → Database
The entire financial infrastructure speaks JSON:
- APIs
- Databases
- Logging systems
- Monitoring tools
- Message queues
Using TOON (or even CSV) for LLM prompts means:
- Conversion at every boundary
- Risk of bugs in conversion logic
- Custom tooling overhead
- A team learning curve
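As a rough illustration of those boundary conversions (the helper names and the feed record below are hypothetical), note that the types are already gone by the time the data comes back:

```python
import csv
import io

def json_records_to_csv(records: list[dict]) -> str:
    """Boundary 1: flatten uniform JSON records into CSV for the LLM prompt."""
    fields = list(records[0].keys())
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def csv_to_json_records(text: str) -> list[dict]:
    """Boundary 2: convert the LLM-facing CSV back for the database layer.
    Every value is now a string -- the type information has been lost."""
    return list(csv.DictReader(io.StringIO(text)))

feed = [{"symbol": "AAPL", "price": 189.5, "volume": 1200}]  # hypothetical feed record
prompt_block = json_records_to_csv(feed)
restored = csv_to_json_records(prompt_block)
print(restored)  # [{'symbol': 'AAPL', 'price': '189.5', 'volume': '1200'}]
```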
Training Data Mismatch
LLMs are trained on:
- Trillions of JSON tokens (every API since 2000)
- Billions of CSV tokens (every data export, spreadsheet)
- ~Zero TOON tokens (launched late 2024)
TOON is a foreign language to the model.
Fewer tokens do not equal better reasoning when the model hasn't seen the format.
Poor Fit for Real Market Data
Market data is messy:
- Variable fields (tick data ≠ trade data ≠ order book updates)
- Nested structures (quotes, greeks, metadata)
CSV and TOON, by contrast, require uniform, flat tables. The real world doesn't fit those constraints.
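For example, a hypothetical options quote nests naturally in JSON but has to be flattened, with invented dotted column names, before it fits a CSV or TOON row:

```python
# Hypothetical options quote: nested and non-uniform, which is typical of real feeds
quote = {
    "symbol": "AAPL 2024-01-19 C190",
    "bid": 4.35,
    "ask": 4.50,
    "greeks": {"delta": 0.52, "gamma": 0.03, "vega": 0.11},
    "metadata": {"source": "example_feed", "stale": False},
}

# Forcing this into a flat row means inventing column names and hoping every
# record type (ticks, trades, order book updates) shares the same shape
flat = {
    "symbol": quote["symbol"],
    "bid": quote["bid"],
    "ask": quote["ask"],
    "greeks.delta": quote["greeks"]["delta"],
    "greeks.gamma": quote["greeks"]["gamma"],
    "greeks.vega": quote["greeks"]["vega"],
    "metadata.source": quote["metadata"]["source"],
    "metadata.stale": quote["metadata"]["stale"],
}
print(",".join(str(v) for v in flat.values()))
```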
When Each Format Makes Sense
Use CSV when:
- Pure tabular data
- Need maximum token efficiency
- Can tolerate type ambiguity
- Have simple, uniform schemas
Use JSON when:
- Need type safety (critical for financial systems)
- Nested/complex data structures
- Integration with existing systems
- Production-ready, battle-tested tooling matters
Use TOON when:
- You have 10M+ uniform records daily
- Token cost is your #1 expense
- Closed system with full pipeline control
- Array length validation is valuable
For 95% of use cases: CSV for tabular, JSON for structured. TOON solves a problem CSV already solved 30 years ago.
The Code
Full test code and methodology available on GitHub.
Conclusion
Token optimisation is the right direction. But it needs to happen at the RIGHT layer:
❌ Application-layer format changes (TOON, custom schemas)
❌ Developer-facing complexity
❌ "Reinventing" CSV with 10% overhead

✅ Native LLM provider support (like browser gzip - invisible)
✅ Model-layer compression (transparent to developers)
✅ Industry standards (not format fragmentation)
CSV has been solving this problem for 30 years.
Sometimes the best optimisation is using the tool that's been there all along.
Want more analysis like this? I test AI claims on real data and share what I learn. Subscribe to my newsletter for weekly insights on AI engineering and quant development.