The SQA Breakthrough: How We Made Attention 3x Faster
Sometimes the biggest breakthroughs come from questioning fundamental assumptions. Here's how we accidentally discovered that reducing query heads, not key/value heads, provides superior attention efficiency.
Sometimes the most significant breakthroughs happen by accident. While working on reactive architectures, our team made a discovery that challenges fundamental assumptions about attention mechanism optimization. This is the story of Sparse Query Attention (SQA) and how it changes everything we thought we knew about efficient attention.
The Conventional Wisdom
For years, the AI community has focused on reducing key/value heads in attention mechanisms:
- Multi-Head Attention (MHA): N query heads, N key/value heads
- Grouped Query Attention (GQA): N query heads, M key/value heads (M < N)
- Multi-Query Attention (MQA): N query heads, 1 key/value head
The reasoning seemed logical: key/value heads affect two matrix multiplications in attention computation, while query heads only affect one. This reasoning is completely wrong.
The Accidental Discovery
While optimizing attention mechanisms for Reactive Transformers, I was experimenting with different head configurations. Instead of further reducing key/value heads (from GQA to MQA), I accidentally configured a test where I reduced query heads instead.
The results were shocking:
Performance Comparison (1024 sequence length)
| Mechanism | Training Time | Loss | Accuracy |
|---|---|---|---|
| GQA | 258 min | 1.2177 | 77.12% |
| MQA | 261 min | 1.2497 | 76.64% |
| SQA | 241 min | 1.2272 | 76.97% |
SQA trained about 7% faster than GQA while achieving a better loss than MQA!
The real breakthrough, however, came in benchmarks with longer sequences: for 32k-128k tokens, xSQA was up to 3x faster. More details below.
The Fundamental Insight
Why Query Reduction Works Better
The key insight is about operation count vs. operation size:
Traditional Approach (GQA/MQA):
- Reduces matrix multiplication dimensions
- Still requires the full number of operations
- Optimization: making existing operations smaller
SQA Approach:
- Reduces the number of operations
- Fewer query heads = fewer matrix multiplications
- Optimization: doing fewer operations total
The Mathematics
Consider a 16-head attention layer:
GQA Configuration (16Q, 4KV):
- First multiplication: 16 query heads × 4 key heads = 16 ops
- Second multiplication: 16 attention weight tensors × 4 value heads = 16 ops
- Total: 32 operations
SQA Configuration (8Q, 8KV):
- Query operations: 8 query heads × 8 key heads = 8 ops
- Key/Value operations: 8 attention weight tensors × 8 value heads = 8 ops
- Total: 16 operations
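The head-count arithmetic above can be checked with a few lines of code. This is a minimal sketch (the head dimension and sequence length are illustrative), counting FLOPs for the two attention matmuls, which scale with the number of query heads regardless of how many key/value heads those queries share:

```python
def attention_matmul_flops(num_query_heads: int, head_dim: int, seq_len: int) -> int:
    """FLOPs for the two attention matmuls (QK^T and scores @ V).

    Each query head does one (L x d) @ (d x L) multiply and one
    (L x L) @ (L x d) multiply; shared key/value heads add no matmuls.
    """
    qk = num_query_heads * seq_len * seq_len * head_dim   # score computation
    av = num_query_heads * seq_len * seq_len * head_dim   # weighted values
    return 2 * (qk + av)  # 2 FLOPs per multiply-accumulate

head_dim, seq_len = 64, 1024
gqa = attention_matmul_flops(16, head_dim, seq_len)  # GQA: 16 query heads
sqa = attention_matmul_flops(8, head_dim, seq_len)   # SQA: 8 query heads
print(gqa // sqa)  # → 2: halving the query heads halves the matmul work
```

Note that the key/value head counts do not appear in this cost at all: reducing them (GQA/MQA) shrinks the KV cache and projections, but leaves the attention matmul count untouched.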
Experimental Validation
Test 1: Dense Models (10-12M parameters)
Architecture:
- 8 layers, 256 dimensions
- 1024 context length
- Wikipedia training data
Results:
MHA: 269 min training, 1.1976 loss, 77.35% accuracy
GQA: 258 min training, 1.2177 loss, 77.12% accuracy
MQA: 261 min training, 1.2497 loss, 76.64% accuracy
SQA: 241 min training, 1.2272 loss, 76.97% accuracy ⭐
sSQA: 243 min training, 1.2201 loss, 77.05% accuracy ⭐
xSQA: 235 min training, 1.2428 loss, 76.74% accuracy ⭐
Test 2: Long Context Benchmarks
The real breakthrough became apparent with longer sequences:
32k Sequence Length (A100 GPU):
- xSQA: 6.74s (fastest!)
- Flex Attention: 9.36s
- SQA: 9.96s
- GQA: 18.19s (2.7x slower!)
Longer Sequence Length (A100 GPU):
- xSQA: 18.80s (3x faster than GQA!)
- SQA: 31.54s
- Flex Attention: 37.66s
- GQA: 62.79s
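Why does the advantage grow with sequence length? At short contexts the linear projections and feed-forward layers dominate the cost; at long contexts the quadratic attention matmuls, which scale with the query-head count, take over. A back-of-the-envelope sketch makes this visible (the layer sizes here are illustrative assumptions, not our benchmark configuration):

```python
def layer_flops(seq_len, embed_dim, num_heads, num_q_heads, num_kv_heads):
    """Rough FLOPs for one transformer layer (2 FLOPs per multiply-add)."""
    head_dim = embed_dim // num_heads
    # Q/K/V/output projections: linear in sequence length
    proj = 2 * seq_len * embed_dim * head_dim * (2 * num_q_heads + 2 * num_kv_heads)
    # The two attention matmuls: quadratic in length, linear in *query* heads
    attn = 2 * num_q_heads * 2 * seq_len * seq_len * head_dim
    # Feed-forward block with 4x expansion: linear in sequence length
    ffn = 2 * seq_len * 8 * embed_dim ** 2
    return proj + attn + ffn

for L in (1024, 32_768, 131_072):
    gqa = layer_flops(L, 512, 16, 16, 4)  # GQA baseline
    sqa = layer_flops(L, 512, 16, 8, 4)   # SQA: half the query heads
    print(f"{L:>7} tokens: theoretical speedup {gqa / sqa:.2f}x")
```

In this toy model the theoretical speedup climbs from roughly 1.3x at 1k tokens toward the 2x ceiling at 128k; xSQA, with a quarter of the query heads, has a 4x ceiling, which is consistent with the roughly 3x wall-clock gains above.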
SQA Variants Discovered
1. Standard SQA
- 50% of query heads (e.g., 8 of 16)
- Standard key/value configuration
- Best balance of speed and quality
2. Symmetric SQA (sSQA)
- Equal query and key/value heads
- Uses optimized MHA kernels
- Minimal performance degradation
3. Extreme SQA (xSQA)
- Maximum query head reduction (25% of total)
- Highest computational savings
- Still competitive with MQA quality
4. Light SQA (lSQA)
- The opposite direction from xSQA: more than 50% of query/KV heads, e.g., 75%
- Still a work in progress; results remain to be confirmed
- Expected to reach better accuracy than GQA while still being about 25% faster
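For reference, the variants can be summarized as head-count configurations. The exact counts below are illustrative assumptions for a 16-head baseline, following the ratios described above, not published configurations:

```python
# Illustrative head layouts for a 16-head baseline (ratios from the text;
# the exact counts and KV choices are assumptions for illustration only)
VARIANTS = {
    "GQA":  {"query_heads": 16, "kv_heads": 4},   # conventional baseline
    "SQA":  {"query_heads": 8,  "kv_heads": 4},   # 50% queries, GQA-style KV
    "sSQA": {"query_heads": 8,  "kv_heads": 8},   # symmetric: Q == KV
    "xSQA": {"query_heads": 4,  "kv_heads": 4},   # extreme: 25% query heads
    "lSQA": {"query_heads": 12, "kv_heads": 12},  # light: ~75%, work in progress
}

for name, cfg in VARIANTS.items():
    print(f"{name:>4}: {cfg['query_heads']}Q / {cfg['kv_heads']}KV")
```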
Why This Changes Everything
For Training
- 2-3x faster training for long context models
- Reduced infrastructure costs for research
- Faster experimentation cycles for development
For Inference
- Lower latency for user interactions
- Better throughput for production systems
- Cost reduction for cloud deployments
For Research
- Enables longer context experiments on limited budgets
- New optimization directions beyond spatial sparsity
- Foundation for reactive architectures requiring efficient attention
Implementation Insights
Architecture Changes Required
```python
from torch import nn

class SparseQueryAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, num_kv_heads, num_query_heads):
        super().__init__()
        self.head_dim = embed_dim // num_heads
        self.num_query_heads = num_query_heads
        self.num_kv_heads = num_kv_heads
        # Key insight: reduce the query projection dimensions
        self.q_proj = nn.Linear(embed_dim, self.head_dim * num_query_heads)
        # Standard key/value projections
        self.k_proj = nn.Linear(embed_dim, self.head_dim * num_kv_heads)
        self.v_proj = nn.Linear(embed_dim, self.head_dim * num_kv_heads)
        # Compensate for the reduced query dimensions on the way out
        self.out_proj = nn.Linear(self.head_dim * num_query_heads, embed_dim)
```
Parameter Reduction Bonus
SQA not only runs faster but also has fewer parameters:
- Query projection: Smaller output dimensions
- Output projection: Smaller input dimensions
- Total reduction: 10-15% fewer parameters
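A quick way to sanity-check that claim is to count the projection weights directly. A minimal sketch (dimensions are illustrative; biases ignored):

```python
def attn_projection_params(embed_dim, num_heads, num_query_heads, num_kv_heads):
    """Weight count for the Q, K, V, and output projections (no biases)."""
    head_dim = embed_dim // num_heads
    q = embed_dim * head_dim * num_query_heads    # smaller with fewer Q heads
    k = embed_dim * head_dim * num_kv_heads
    v = embed_dim * head_dim * num_kv_heads
    out = head_dim * num_query_heads * embed_dim  # smaller input side
    return q + k + v + out

gqa = attn_projection_params(256, 16, 16, 4)  # GQA: 16Q / 4KV
sqa = attn_projection_params(256, 16, 8, 4)   # SQA: 8Q / 4KV
print(f"attention-projection weights saved: {1 - sqa / gqa:.0%}")  # → 40%
```

The saving within the attention projections themselves is large; spread over a full model that also contains embeddings and feed-forward layers, it lands in the 10-15% range quoted above.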
Implications for Different Use Cases
Reactive Transformers
SQA is particularly beneficial for reactive architectures:
- Memory Cross-Attention: Accesses static STM (benefits from query reduction)
- Short Sequences: Most reactive interactions are brief
- Real-time Processing: Every millisecond matters
Long Context Models
- 0-128k tokens: SQA is the fastest attention mechanism in our benchmarks
- 128k+ tokens: Can combine with Flex Attention (Flex-SQA)
- Training: Enables longer context on limited hardware
Edge Deployment
- Mobile devices: Reduced computation requirements
- IoT systems: Lower power consumption
- Real-time applications: Consistent low latency
The Broader Lesson
This discovery highlights a crucial point about AI research: question fundamental assumptions.
What We Thought We Knew
- Key/Value reduction is optimal (because it affects two operations)
- Query heads are secondary considerations
- Spatial sparsity is the only path to efficiency
What We Actually Learned
- Operation count matters more than operation size
- Query reduction provides superior benefits
- Structural sparsity can outperform spatial sparsity
Future Research Directions
Flex-SQA Combination
Combining SQA with Flex Attention could enable:
- 4-8x longer sliding windows
- Optimal efficiency for ultra-long sequences
- New possibilities for context-intensive applications
Dynamic Query Selection
- Adaptive head selection based on input complexity
- Token-importance routing for optimal efficiency
- Learned sparsity patterns beyond fixed configurations
Hardware Optimization
- SQA-specific kernels for maximum performance
- Memory access pattern optimization
- Edge device acceleration for mobile deployment
Lessons for the Industry
Rethink Optimization Strategies
The SQA discovery suggests we should:
- Question established wisdom about efficiency bottlenecks
- Experiment with unconventional approaches
- Measure real-world impact, not just theoretical complexity
Focus on Total System Efficiency
Rather than optimizing individual components:
- Consider operation count alongside operation complexity
- Optimize for actual hardware performance characteristics
- Balance speed, quality, and resource usage
Open Source Impact
We've made SQA freely available because breakthrough efficiency improvements benefit everyone:
- RxNN Framework: Full SQA implementation
- Pre-trained Models: SQA-optimized models on Hugging Face
- Research Papers: Complete experimental data
- Benchmarking Tools: Reproducible performance comparisons
Conclusion: Accidental Breakthroughs Change Everything
The SQA discovery reminds us that in rapidly evolving fields like AI, our assumptions need constant testing. Sometimes the biggest breakthroughs come from questioning what "everyone knows" to be true.
Key Takeaways
- Question Conventional Wisdom: The most accepted approaches aren't always optimal
- Experiment Fearlessly: Accidental discoveries often lead to breakthroughs
- Measure Real Performance: Theoretical complexity doesn't always predict real-world results
- Share Discoveries: Open research accelerates progress for everyone
What's Next
SQA is now integrated into our Reactive Transformer architectures and available for the broader AI community. But this is just the beginning—we're exploring even more radical approaches to attention efficiency.
The next breakthrough might be just one "accidental" experiment away.
---
Try SQA yourself: Install RxNN and experiment with our SQA implementations. Who knows what you might discover? Sometimes the best way forward is to question which direction everyone else is heading.

Adam Filipek
CEO, Founder, and Lead Researcher at Reactive AI. Creator of the Event-Driven AI paradigm and author of the reactive models research.