Right Usage of Kafka: Patterns and Anti-Patterns
How to use Apache Kafka effectively — from partitioning strategies to consumer group design, and common mistakes to avoid.
🚀 Introduction
Kafka is a powerful distributed event streaming platform, but using it correctly requires understanding its core concepts deeply.
📋 Key Topics Covered
- Topic design and partitioning strategies
- Consumer groups and rebalancing
- Exactly-once semantics
- Schema Registry and Avro
- Retention policies and compaction
- Monitoring: lag, throughput, and errors
- Common anti-patterns (using Kafka as a database, over-partitioning)
- Kafka vs RabbitMQ vs Redis Streams
🏗️ Topic Design and Partitioning Strategies
🎯 Proper Topic Design
Topics should be designed around business domains or event types, not technical implementation details. Each topic represents a category of related events.
Best Practices:
- Use descriptive, domain-specific names (`user-events`, `order-updates`, `payment-transactions`)
- Separate concerns: different event types should go to different topics
- Consider using naming conventions with prefixes/suffixes for environments (`prod-user-events`, `staging-user-events`)
- Align topic boundaries with bounded contexts in DDD
⚖️ Partitioning Strategy
Partitions enable parallelism and determine throughput capacity.
Guidelines:
- Start with enough partitions for peak throughput: Calculate based on expected producer/consumer throughput per partition
- Align with consumer group size: Max parallel consumers = number of partitions
- Consider key distribution: Choose partition keys that evenly distribute load
- Plan for growth: Partitions can be added later, but doing so changes the key-to-partition mapping for keyed topics, so provision with some headroom up front
- Monitor partition skew: Uneven distribution creates hot partitions
Anti-Pattern: Creating too many partitions (e.g., 100+ for low-throughput topics) increases overhead and recovery time.
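The sizing guideline above can be sketched as a simple heuristic: take the larger of (target throughput ÷ measured per-partition producer throughput) and (target throughput ÷ measured per-partition consumer throughput). The per-partition rates below are illustrative assumptions; you must measure them in your own cluster.

```python
import math

def estimate_partitions(target_mb_s: float,
                        producer_mb_s_per_partition: float,
                        consumer_mb_s_per_partition: float) -> int:
    """Common sizing heuristic: enough partitions to satisfy both the
    producer side and the consumer side of the target throughput."""
    need_for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    need_for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(need_for_producers, need_for_consumers)

# Example: 100 MB/s target; each partition sustains ~10 MB/s in, ~20 MB/s out
print(estimate_partitions(100, 10, 20))  # -> 10
```

Treat the result as a starting point, then validate against consumer lag under real load.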
👥 Consumer Groups and Rebalancing
🔄 How Consumer Groups Work
Consumer groups enable scalable consumption where multiple consumers share the workload of processing topic partitions.
Key Points:
- Each partition is consumed by exactly one consumer in a group
- Adding consumers up to partition count increases processing parallelism
- Rebalancing occurs when group membership changes (consumers join/leave)
- During rebalance, consumers temporarily stop processing
⚙️ Minimizing Rebalance Impact
- Use static membership (Kafka 2.3+): Assign a persistent `group.instance.id` to each consumer so restarts don't trigger partition shuffling
- Optimize session.timeout.ms: Balance between failure detection and unnecessary rebalances
- Avoid frequent restarts: Graceful shutdowns trigger rebalances; rolling updates are better
- Use proper heartbeat settings: Ensure consumers can send heartbeats within session timeout
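The settings above can be collected into a consumer configuration. This is a hedged sketch: the keys are standard Kafka consumer config names, but the group name, instance ID, and values are illustrative examples, not defaults or recommendations for your workload.

```python
# Illustrative consumer settings for reducing rebalance churn.
consumer_config = {
    "group.id": "order-processors",            # hypothetical group name
    "group.instance.id": "order-processor-1",  # static membership (Kafka 2.3+)
    "session.timeout.ms": 45000,     # how long before the broker declares the consumer dead
    "heartbeat.interval.ms": 15000,  # typically ~1/3 of the session timeout
    "max.poll.interval.ms": 300000,  # max gap between poll() calls before a rebalance
}

# Common sanity check: at least ~3 heartbeats should fit in one session timeout
assert consumer_config["heartbeat.interval.ms"] * 3 <= consumer_config["session.timeout.ms"]
```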
🎯 Exactly-Once Semantics (EOS)
🔄 Achieving Exactly-Once Processing
Kafka provides exactly-once semantics through idempotent producers and transactional APIs.
Implementation Steps:
- Enable idempotent producers: Set `enable.idempotence=true` (default in newer versions)
- Use transactions for multi-topic writes: Producer sends data to multiple topics atomically
- Consume with read_committed isolation level: Consumers only see committed transactions
- Design idempotent consumers: Handle duplicate messages gracefully as fallback
Note: EOS has performance implications due to additional coordination overhead.
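The "idempotent consumer" fallback in step 4 can be sketched as deduplication by message ID. This is an illustrative in-memory version, not a Kafka API; a production system would bound the seen-ID set (or persist it) and derive IDs from message headers or keys.

```python
class IdempotentHandler:
    """Minimal dedup sketch: remember processed message IDs so that
    redelivered messages (at-least-once delivery) apply side effects once."""
    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, msg_id: str, payload: str) -> bool:
        if msg_id in self.seen:
            return False           # duplicate: skip side effects
        self.seen.add(msg_id)
        self.applied.append(payload)
        return True

h = IdempotentHandler()
h.handle("m1", "debit $10")
h.handle("m1", "debit $10")   # redelivery of the same message
print(len(h.applied))          # -> 1
```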
📜 Schema Registry and Avro
📊 Why Use Schema Registry?
Schema Registry provides a centralized schema store and enforces compatibility rules.
Benefits:
- Data governance: Prevent incompatible schema changes
- Evolution safety: Backward/forward compatibility checks
- Serialization efficiency: Avro is compact and fast
- Documentation: Schemas serve as API contracts
🔄 Schema Compatibility Types
- BACKWARD: New schema can read old data (consumers can upgrade first)
- FORWARD: Old schema can read new data (producers can upgrade first)
- FULL: Both backward and forward compatible
- NONE: No compatibility checks
Best Practice: Use BACKWARD or FULL compatibility for most use cases.
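A grossly simplified version of a BACKWARD check can make the rule concrete: the new (reader) schema may add fields only if they carry defaults, and may not change an existing field's type. Schema Registry's real Avro resolution rules cover many more cases (promotions, aliases, unions); this sketch only illustrates the idea.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy BACKWARD check: can a reader using new_fields decode data
    written with old_fields?"""
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False       # new required field breaks old data
        elif spec["type"] != old_fields[name]["type"]:
            return False           # type change breaks old data
    return True                    # removed fields are fine: reader ignores them

old = {"user_id": {"type": "string"}}
ok  = {"user_id": {"type": "string"}, "email": {"type": "string", "default": ""}}
bad = {"user_id": {"type": "string"}, "email": {"type": "string"}}
print(is_backward_compatible(old, ok), is_backward_compatible(old, bad))  # -> True False
```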
⏳ Retention Policies and Compaction
📦 Retention Policies
Control how long Kafka keeps data.
Types:
- Time-based: `log.retention.hours` (default 168 hours = 7 days)
- Size-based: `log.retention.bytes` per topic
- Delete vs Compact: `cleanup.policy` (`delete` or `compact`)
🗜️ Log Compaction
Keeps only the latest value for each key, useful for:
- Event sourcing: Rebuild state from events
- Configuration snapshots: Latest config per service
- Entity state: Current user profile, inventory counts
Compaction Trigger: When the ratio of uncleaned ("dirty") log to total log exceeds the `min.cleanable.dirty.ratio` threshold (default 0.5)
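The effect of compaction can be sketched as "keep the latest value per key". The example below also shows tombstones: a record with a null value deletes its key entirely. The log contents are illustrative.

```python
def compact(log):
    """Sketch of what log compaction retains: the latest value per key.
    A key whose latest record is a tombstone (value None) is removed."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user:1", "alice@old.com"),
    ("user:2", "bob@example.com"),
    ("user:1", "alice@new.com"),   # supersedes the first record
    ("user:2", None),              # tombstone: delete user:2
]
print(compact(log))  # -> {'user:1': 'alice@new.com'}
```

Note that real compaction runs asynchronously per segment, so readers may still see superseded records for a while.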
📊 Monitoring: Lag, Throughput, and Errors
📈 Key Metrics to Monitor
Consumer Lag:
- Difference between current offset and end offset
- Indicates if consumers can keep up with producers
- Alert when lag grows steadily
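Lag is arithmetic over offsets: per partition it is the log-end offset minus the group's committed offset, and total lag is the sum. This is essentially the LAG column that `kafka-consumer-groups.sh` reports; the offsets below are made up for illustration.

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    """Sum of (end offset - committed offset) across partitions.
    A partition with no committed offset counts from 0."""
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

end_offsets = {0: 1000, 1: 1200, 2: 950}   # latest offset per partition
committed   = {0: 990,  1: 1200, 2: 800}   # consumer group's committed offsets
print(total_lag(end_offsets, committed))    # -> 160
```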
Throughput:
- Messages/sec in/out
- Bytes/sec in/out
- Request rates
Error Rates:
- Failed produce requests
- Consumer exceptions
- Connection failures
🛠️ Monitoring Tools
- Built-in: JMX metrics, `kafka-consumer-groups.sh` tool
- Open Source: Prometheus + Grafana; Commercial: Confluent Control Center
- Managed Services: Cloud provider monitoring integrations
⚠️ Common Anti-Patterns
❌ Using Kafka as a Database
Problem: Storing data indefinitely and expecting query capabilities.
Solution: Use Kafka for event streaming; move data to appropriate databases for querying.
❌ Over-Partitioning
Problem: Too many partitions increase overhead (metadata, file handles, recovery time).
Solution: Start with a reasonable partition count (e.g., 3-6 per broker) and scale based on throughput needs.
❌ Ignoring Message Ordering Guarantees
Problem: Assuming global ordering across partitions.
Solution: Use partitioning keys for ordering within key groups, and design consumers to handle out-of-order messages.
❌ Not Handling Poison Pills
Problem: Bad messages causing consumer crashes and infinite restart loops.
Solution: Implement dead letter queues, poison pill handling, or skip mechanisms.
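The dead-letter approach can be sketched as a retry-then-park loop: try each message a few times, and if it still fails, route it aside instead of crashing the consumer. In a real deployment the "dead letters" would go to a separate Kafka topic; here a list stands in for it, and `process` is a hypothetical handler.

```python
def consume_with_dlq(messages, process, max_retries=3):
    """Sketch of poison-pill handling: retry each message up to
    max_retries times, then park it in a dead-letter list and move on."""
    dead_letters = []
    for msg in messages:
        for attempt in range(max_retries):
            try:
                process(msg)
                break                         # processed successfully
            except Exception:
                if attempt == max_retries - 1:
                    dead_letters.append(msg)  # give up: park for inspection
    return dead_letters

def process(msg):
    if msg == "corrupt":                      # simulated poison pill
        raise ValueError("cannot parse")

print(consume_with_dlq(["ok", "corrupt", "ok"], process))  # -> ['corrupt']
```

The key property is that one bad message never blocks the partition: healthy messages after it are still processed.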
⚔️ Kafka vs RabbitMQ vs Redis Streams
🆚 Feature Comparison
| Feature | Kafka | RabbitMQ | Redis Streams |
|---|---|---|---|
| Model | Log-based streaming | Traditional message queue | Log-based streaming |
| Throughput | Very high | Medium | High |
| Persistence | Disk-based (configurable) | Disk/memory | Memory (with AOF) |
| Ordering | Per-partition | Per-queue (FIFO) | Per-stream |
| Consumer Model | Consumer groups | Competing consumers | Consumer groups |
| Retry/DLQ | Manual implementation | Built-in | Manual implementation |
| Best For | Event sourcing, high-throughput pipelines | Complex routing, task queues | Simple streaming, low-latency |
🎯 When to Choose Each
Choose Kafka when:
- Building event-driven architectures
- Needing high throughput and durability
- Implementing event sourcing or CQRS
- Long-term data retention is needed
Choose RabbitMQ when:
- Complex routing is required (topics, headers)
- Need sophisticated queueing patterns
- Lower latency is critical
- Polyglot protocol support is important
Choose Redis Streams when:
- Already using Redis in infrastructure
- Need simple streaming with consumer groups
- Can accept memory-limited durability
- Low-latency processing is priority
🏁 Conclusion
Using Kafka effectively requires understanding its distributed nature and embracing event streaming principles. By following the patterns outlined—proper topic design, thoughtful partitioning, consumer group management, and avoiding common anti-patterns—you can build robust, scalable event-driven systems.
Remember that Kafka excels as a high-throughput, durable event log, not as a general-purpose database or task queue. Align your usage with its strengths, and you'll unlock powerful capabilities for real-time data processing and microservices communication.
❓ Frequently Asked Questions
Q: How many partitions should I start with for a new topic?
A: Start with 3-6 partitions per broker as a baseline, then scale based on your throughput requirements. Monitor consumer lag and adjust as needed. Remember that you can always increase partitions later (but not decrease easily).
Q: Should I use keys in my Kafka messages?
A: Use keys when you need ordering guarantees for related events (e.g., all events for a specific user_id should be processed in order). Without keys, messages are distributed randomly across partitions, which provides better throughput but no ordering guarantees.
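The per-key ordering guarantee follows from deterministic partition assignment, which can be illustrated with a toy partitioner. Note this is NOT Kafka's actual algorithm (the real default hashes the serialized key with murmur2, and unkeyed messages use sticky batching); the point is only that the same key always maps to the same partition.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Toy partitioner: hash the key, mod the partition count.
    Determinism is what gives per-key ordering in Kafka."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_partitions

# All events for user-42 land on the same partition, so they are consumed in order
print(partition_for("user-42", 6) == partition_for("user-42", 6))  # -> True
```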
Q: How do I handle schema evolution safely?
A: Use Schema Registry with appropriate compatibility settings (BACKWARD or FULL). Always test consumer/producer compatibility before deploying schema changes. Consider using a canary release strategy for schema updates.
Q: What's the difference between delete and compact cleanup policies?
A: delete removes old messages based on time or size thresholds. compact retains only the latest value for each message key, effectively creating a snapshot of the latest state for each key.
Q: How can I reduce consumer rebalances?
A: Use static membership (Kafka 2.3+), optimize session.timeout.ms, avoid frequent consumer restarts, and ensure your consumers can process heartbeats within the session timeout period.
Q: Is exactly-once semantics worth the performance cost?
A: It depends on your use case. For financial transactions or other critical operations where duplicates could cause incorrect state, yes. For many event streaming use cases (metrics, logging, etc.), at-least-once with idempotent consumers is sufficient and performs better.
Q: When should I consider alternatives to Kafka?
A: Consider alternatives when you need: complex message routing (RabbitMQ), ultra-low latency with in-memory storage (Redis Streams), or simple task queues where Kafka's overhead isn't justified.