

Research
Jul 25, 2025
10 min
Adam Filipek

The SQA Breakthrough: How We Made Attention 3x Faster

Sometimes the biggest breakthroughs come from questioning fundamental assumptions. Here's how we accidentally discovered that reducing query heads, not key/value heads, provides superior attention efficiency.

Tags: SQA, Research, Attention, Optimization, Breakthrough


Sometimes the most significant breakthroughs happen by accident. While working on reactive architectures, our team made a discovery that challenges fundamental assumptions about attention mechanism optimization. This is the story of Sparse Query Attention (SQA) and how it changes everything we thought we knew about efficient attention.

The Conventional Wisdom

For years, the AI community has focused on reducing key/value heads in attention mechanisms:

  • Multi-Head Attention (MHA): N query heads, N key/value heads
  • Grouped Query Attention (GQA): N query heads, M key/value heads (M < N)
  • Multi-Query Attention (MQA): N query heads, 1 key/value head
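To see why this focus seemed sound, note that key/value heads also set the size of the inference-time KV cache, which is exactly what GQA and MQA were designed to shrink. A rough sketch of cache memory per sequence (model dimensions here are illustrative, not from our experiments):

```python
def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, dtype_bytes=2):
    """Bytes of cached keys and values per sequence (the factor 2 is K + V)."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * dtype_bytes

# Hypothetical 32-layer model, head_dim 64, 8k context, fp16
for name, kv_heads in [("MHA", 16), ("GQA", 4), ("MQA", 1)]:
    mib = kv_cache_bytes(kv_heads, 64, 8192, 32) / 2**20
    print(f"{name}: {mib:.0f} MiB per sequence")  # MHA: 1024, GQA: 256, MQA: 64
```

Cutting KV heads clearly helps memory, but it says nothing yet about compute.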

The reasoning seemed logical: key/value heads affect two matrix multiplications in attention computation, while query heads only affect one. This reasoning is completely wrong.

The Accidental Discovery

While optimizing attention mechanisms for Reactive Transformers, I was experimenting with different head configurations. Instead of further reducing key/value heads (from GQA to MQA), I accidentally configured a test where I reduced query heads instead.

The results were shocking:

Performance Comparison (1024 sequence length)

Mechanism   Training Time   Loss     Accuracy
GQA         258 min         1.2177   77.12%
MQA         261 min         1.2497   76.64%
SQA         241 min         1.2272   76.97%

SQA trained roughly 7% faster than GQA (258 → 241 min) while achieving better loss and accuracy than MQA!

The real breakthrough, however, came from benchmarks on longer sequences: for 32k-128k tokens, xSQA was up to 3x faster. More details below.

The Fundamental Insight

Why Query Reduction Works Better

The key insight is about operation count vs. operation size:

Traditional Approach (GQA/MQA):
  • Reduces matrix multiplication dimensions
  • Still requires the full number of head-level operations
  • Optimization: Making existing operations smaller
SQA Approach:
  • Reduces number of operations
  • Fewer query heads = fewer matrix multiplications
  • Optimization: Doing fewer operations total

The Mathematics

Consider a 16-head attention layer:

GQA Configuration (16Q, 4KV):
  • First multiplication (QKᵀ): one per query head = 16 ops
  • Second multiplication (attention weights × V): one per query head = 16 ops
  • Total: 32 head-level operations
sSQA Configuration (8Q, 8KV):
  • First multiplication (QKᵀ): one per query head = 8 ops
  • Second multiplication (attention weights × V): one per query head = 8 ops
  • Total: 16 head-level operations
Result: 50% fewer total operations!
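The count above can be sketched in a few lines. Treating an "operation" as one head-level matrix multiplication, both attention matmuls run once per query head, so the FLOPs scale with query heads alone (sequence length and head size below are illustrative):

```python
def attention_flops(num_query_heads, seq_len, head_dim):
    """Approximate FLOPs for the two attention matmuls in one layer.

    Both scores = Q @ K^T and out = attn @ V run once per *query* head,
    so reducing query heads cuts these FLOPs directly, while reducing
    KV heads (GQA/MQA) leaves the count unchanged.
    """
    per_matmul = 2 * seq_len * seq_len * head_dim  # one (L,d) @ (d,L) product
    return num_query_heads * 2 * per_matmul        # QK^T plus attn @ V

gqa = attention_flops(16, 1024, 64)   # GQA: 16Q, 4KV
ssqa = attention_flops(8, 1024, 64)   # sSQA: 8Q, 8KV
print(ssqa / gqa)  # -> 0.5, i.e. 50% fewer attention FLOPs
```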

Experimental Validation

Test 1: Dense Models (10-12M parameters)

Architecture:

  • 8 layers, 256 dimensions
  • 1024 context length
  • Wikipedia training data

Results:

MHA:  269 min training, 1.1976 loss, 77.35% accuracy
GQA:  258 min training, 1.2177 loss, 77.12% accuracy  
MQA:  261 min training, 1.2497 loss, 76.64% accuracy
SQA:  241 min training, 1.2272 loss, 76.97% accuracy ⭐
sSQA: 243 min training, 1.2201 loss, 77.05% accuracy ⭐
xSQA: 235 min training, 1.2428 loss, 76.74% accuracy ⭐

Test 2: Long Context Benchmarks

The real breakthrough became apparent with longer sequences:

32k Sequence Length (A100 GPU):
  1. xSQA: 6.74s (fastest!)
  2. Flex Attention: 9.36s
  3. SQA: 9.96s
  4. GQA: 18.19s (2.7x slower!)
131k Sequence Length:
  1. xSQA: 18.80s (3x faster than GQA!)
  2. SQA: 31.54s
  3. Flex Attention: 37.66s
  4. GQA: 62.79s

SQA Variants Discovered

1. Standard SQA

  • 50% query heads (e.g. 16 → 8)
  • Standard key/value configuration
  • Best balance of speed and quality

2. Symmetric SQA (sSQA)

  • Equal query and key/value heads
  • Uses optimized MHA kernels
  • Minimal performance degradation

3. Extreme SQA (xSQA)

  • Maximum query head reduction (25% of total)
  • Highest computational savings
  • Still competitive with MQA quality

4. Light SQA (lSQA)

  • The opposite direction to xSQA: retains more than 50% of the query/KV heads (e.g. 75%)
  • Still a work in progress; results pending confirmation
  • Expected to exceed GQA accuracy while remaining about 25% faster
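For a 16-head layer, the variants above might be configured roughly as follows (head counts inferred from the percentages given; the exact RxNN configurations may differ):

```python
# Illustrative (query_heads, kv_heads) per variant for a 16-head layer
VARIANTS = {
    "GQA":  (16, 4),  # baseline: full queries, reduced KV
    "SQA":  (8, 4),   # 50% query heads
    "sSQA": (8, 8),   # symmetric: equal query and KV heads
    "xSQA": (4, 4),   # extreme: 25% query heads
    "lSQA": (12, 8),  # light: ~75% of heads (work in progress)
}

for name, (q, kv) in VARIANTS.items():
    print(f"{name}: {q} query heads, {kv} KV heads")
```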

Why This Changes Everything

For Training

  • 2-3x faster training for long context models
  • Reduced infrastructure costs for research
  • Faster experimentation cycles for development

For Inference

  • Lower latency for user interactions
  • Better throughput for production systems
  • Cost reduction for cloud deployments

For Research

  • Enables longer context experiments on limited budgets
  • New optimization directions beyond spatial sparsity
  • Foundation for reactive architectures requiring efficient attention

Implementation Insights

Architecture Changes Required

import torch
import torch.nn as nn

class SparseQueryAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, num_kv_heads, num_query_heads):
        super().__init__()
        self.head_dim = embed_dim // num_heads
        self.num_query_heads = num_query_heads
        self.num_kv_heads = num_kv_heads

        # Key insight: reduce the query projection dimensions
        self.q_proj = nn.Linear(embed_dim, self.head_dim * num_query_heads)

        # Standard (GQA-style) key/value projections
        self.k_proj = nn.Linear(embed_dim, self.head_dim * num_kv_heads)
        self.v_proj = nn.Linear(embed_dim, self.head_dim * num_kv_heads)

        # Compensate for the reduced query dimensions in the output
        self.out_proj = nn.Linear(self.head_dim * num_query_heads, embed_dim)
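Since the class above only covers the projections, here is a minimal self-contained sketch of how the forward pass might look, sharing each key/value head across a group of query heads (a GQA-style repeat). Names and shapes are illustrative, not the RxNN implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SQASelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, num_kv_heads, num_query_heads):
        super().__init__()
        assert num_query_heads % num_kv_heads == 0
        self.head_dim = embed_dim // num_heads
        self.num_q, self.num_kv = num_query_heads, num_kv_heads
        self.q_proj = nn.Linear(embed_dim, self.head_dim * num_query_heads)
        self.k_proj = nn.Linear(embed_dim, self.head_dim * num_kv_heads)
        self.v_proj = nn.Linear(embed_dim, self.head_dim * num_kv_heads)
        self.out_proj = nn.Linear(self.head_dim * num_query_heads, embed_dim)

    def forward(self, x):
        B, L, _ = x.shape
        # (B, L, H*d) -> (B, H, L, d)
        q = self.q_proj(x).view(B, L, self.num_q, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, L, self.num_kv, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, L, self.num_kv, self.head_dim).transpose(1, 2)
        # Share each KV head across a group of query heads
        groups = self.num_q // self.num_kv
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v)  # (B, num_q, L, d)
        out = attn.transpose(1, 2).reshape(B, L, self.num_q * self.head_dim)
        return self.out_proj(out)

layer = SQASelfAttention(embed_dim=256, num_heads=16, num_kv_heads=4, num_query_heads=8)
x = torch.randn(2, 16, 256)
print(layer(x).shape)  # torch.Size([2, 16, 256])
```

The output projection maps the reduced query dimension (8 × 16 = 128 here) back to the model dimension, so the layer remains a drop-in replacement shape-wise.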

Parameter Reduction Bonus

SQA not only runs faster but also has fewer parameters:

  • Query projection: Smaller output dimensions
  • Output projection: Smaller input dimensions
  • Total reduction: 10-15% fewer parameters
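A back-of-the-envelope count of the projection weights shows where the savings come from (the 10-15% figure above presumably refers to the whole model; at the level of the attention projections alone the relative saving is larger). Dimensions below are illustrative:

```python
def attn_proj_params(embed_dim, num_heads, num_query_heads, num_kv_heads):
    """Weights in the four attention projections (biases ignored)."""
    head_dim = embed_dim // num_heads
    q = embed_dim * head_dim * num_query_heads
    kv = 2 * embed_dim * head_dim * num_kv_heads
    out = head_dim * num_query_heads * embed_dim
    return q + kv + out

gqa = attn_proj_params(256, 16, 16, 4)  # 16Q, 4KV
sqa = attn_proj_params(256, 16, 8, 4)   # 8Q, 4KV
print(f"saved: {1 - sqa / gqa:.0%} of attention-projection weights")
```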

Implications for Different Use Cases

Reactive Transformers

SQA is particularly beneficial for reactive architectures:

  • Memory Cross-Attention: Accesses static STM (benefits from query reduction)
  • Short Sequences: Most reactive interactions are brief
  • Real-time Processing: Every millisecond matters

Long Context Models

  • 0-128k tokens: SQA is the fastest attention mechanism tested
  • 128k+ tokens: Can combine with Flex Attention (Flex-SQA)
  • Training: Enables longer context on limited hardware

Edge Deployment

  • Mobile devices: Reduced computation requirements
  • IoT systems: Lower power consumption
  • Real-time applications: Consistent low latency

The Broader Lesson

This discovery highlights a crucial point about AI research: question fundamental assumptions.

What We Thought We Knew

  • Key/Value reduction is optimal (because it affects two operations)
  • Query heads are secondary considerations
  • Spatial sparsity is the only path to efficiency

What We Actually Learned

  • Operation count matters more than operation size
  • Query reduction provides superior benefits
  • Structural sparsity can outperform spatial sparsity

Future Research Directions

Flex-SQA Combination

Combining SQA with Flex Attention could enable:

  • 4-8x longer sliding windows
  • Optimal efficiency for ultra-long sequences
  • New possibilities for context-intensive applications

Dynamic Query Selection

  • Adaptive head selection based on input complexity
  • Token-importance routing for optimal efficiency
  • Learned sparsity patterns beyond fixed configurations

Hardware Optimization

  • SQA-specific kernels for maximum performance
  • Memory access pattern optimization
  • Edge device acceleration for mobile deployment

Lessons for the Industry

Rethink Optimization Strategies

The SQA discovery suggests we should:

  1. Question established wisdom about efficiency bottlenecks
  2. Experiment with unconventional approaches
  3. Measure real-world impact, not just theoretical complexity

Focus on Total System Efficiency

Rather than optimizing individual components:

  1. Consider operation count alongside operation complexity
  2. Optimize for actual hardware performance characteristics
  3. Balance speed, quality, and resource usage

Open Source Impact

We've made SQA freely available because breakthrough efficiency improvements benefit everyone:

  • RxNN Framework: Full SQA implementation
  • Pre-trained Models: SQA-optimized models on Hugging Face
  • Research Papers: Complete experimental data
  • Benchmarking Tools: Reproducible performance comparisons

Conclusion: Accidental Breakthroughs Change Everything

The SQA discovery reminds us that in rapidly evolving fields like AI, our assumptions need constant testing. Sometimes the biggest breakthroughs come from questioning what "everyone knows" to be true.

Key Takeaways

  1. Question Conventional Wisdom: The most accepted approaches aren't always optimal
  2. Experiment Fearlessly: Accidental discoveries often lead to breakthroughs
  3. Measure Real Performance: Theoretical complexity doesn't always predict real-world results
  4. Share Discoveries: Open research accelerates progress for everyone

What's Next

SQA is now integrated into our Reactive Transformer architectures and available for the broader AI community. But this is just the beginning—we're exploring even more radical approaches to attention efficiency.

The next breakthrough might be just one "accidental" experiment away.

---

Try SQA yourself: Install RxNN and experiment with our SQA implementations. Who knows what you might discover? Sometimes the best way forward is to question which direction everyone else is heading.

Adam Filipek

CEO, Founder and Lead Researcher at Reactive AI. Creator of the Event-Driven AI paradigm and author of research on reactive models.
