Right Usage of Kafka: Patterns and Anti-Patterns
How to use Apache Kafka effectively — from partitioning strategies to consumer group design, and common mistakes to avoid.
🚀 Introduction
Kafka is a powerful distributed event streaming platform, but using it correctly requires understanding its core concepts deeply.
📋 Key Topics Covered
- Topic design and partitioning strategies
- Consumer groups and rebalancing
- Exactly-once semantics
- Schema Registry and Avro
- Retention policies and compaction
- Monitoring: lag, throughput, and errors
- Common anti-patterns (using Kafka as a database, over-partitioning)
- Kafka vs RabbitMQ vs Redis Streams
🏗️ Topic Design and Partitioning Strategies
🎯 Proper Topic Design
Topics should be designed around business domains or event types, not technical implementation details. Each topic represents a category of related events.
Best Practices:
- Use descriptive, domain-specific names (`user-events`, `order-updates`, `payment-transactions`)
- Separate concerns: different event types should go to different topics
- Consider using naming conventions with prefixes/suffixes for environments (`prod-user-events`, `staging-user-events`)
- Align topic boundaries with bounded contexts in DDD
⚖️ Partitioning Strategy
Partitions enable parallelism and determine throughput capacity.
Guidelines:
- Start with enough partitions for peak throughput: Calculate based on expected producer/consumer throughput per partition
- Align with consumer group size: Max parallel consumers = number of partitions
- Consider key distribution: Choose partition keys that evenly distribute load
- Plan for growth: Partitions can be added later, but doing so changes the key-to-partition mapping for keyed topics, so provision with some headroom up front
- Monitor partition skew: Uneven distribution creates hot partitions
Anti-Pattern: Creating too many partitions (e.g., 100+ for low-throughput topics) increases overhead and recovery time.
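The sizing guideline above can be sketched as a simple heuristic: take the larger of (target throughput ÷ measured per-partition producer throughput) and (target throughput ÷ measured per-partition consumer throughput). The per-partition rates below are illustrative assumptions; you must measure them in your own cluster.

```python
import math

def estimate_partitions(target_mb_s: float,
                        producer_mb_s_per_partition: float,
                        consumer_mb_s_per_partition: float) -> int:
    """Common sizing heuristic: enough partitions to satisfy both the
    producer side and the consumer side of the target throughput."""
    need_for_producers = math.ceil(target_mb_s / producer_mb_s_per_partition)
    need_for_consumers = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(need_for_producers, need_for_consumers)

# Example: 100 MB/s target; each partition sustains ~10 MB/s in, ~20 MB/s out
print(estimate_partitions(100, 10, 20))  # -> 10
```

Treat the result as a starting point, then validate against consumer lag under real load.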
👥 Consumer Groups and Rebalancing
🔄 How Consumer Groups Work
Consumer groups enable scalable consumption where multiple consumers share the workload of processing topic partitions.
Key Points:
- Each partition is consumed by exactly one consumer in a group
- Adding consumers up to partition count increases processing parallelism
- Rebalancing occurs when group membership changes (consumers join/leave)
- During rebalance, consumers temporarily stop processing
⚙️ Minimizing Rebalance Impact
- Use static membership (Kafka 2.3+): Assign a persistent `group.instance.id` to each consumer so restarts don't trigger partition shuffling
- Optimize session.timeout.ms: Balance between failure detection and unnecessary rebalances
- Avoid frequent restarts: Graceful shutdowns trigger rebalances; rolling updates are better
- Use proper heartbeat settings: Ensure consumers can send heartbeats within session timeout
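The settings above can be collected into a consumer configuration. This is a hedged sketch: the keys are standard Kafka consumer config names, but the group name, instance ID, and values are illustrative examples, not defaults or recommendations for your workload.

```python
# Illustrative consumer settings for reducing rebalance churn.
consumer_config = {
    "group.id": "order-processors",            # hypothetical group name
    "group.instance.id": "order-processor-1",  # static membership (Kafka 2.3+)
    "session.timeout.ms": 45000,     # how long before the broker declares the consumer dead
    "heartbeat.interval.ms": 15000,  # typically ~1/3 of the session timeout
    "max.poll.interval.ms": 300000,  # max gap between poll() calls before a rebalance
}

# Common sanity check: at least ~3 heartbeats should fit in one session timeout
assert consumer_config["heartbeat.interval.ms"] * 3 <= consumer_config["session.timeout.ms"]
```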
🎯 Exactly-Once Semantics (EOS)
🔄 Achieving Exactly-Once Processing
Kafka provides exactly-once semantics through idempotent producers and transactional APIs.
Implementation Steps:
- Enable idempotent producers: Set `enable.idempotence=true` (default in newer versions)
- Use transactions for multi-topic writes: Producer sends data to multiple topics atomically
- Consume with read_committed isolation level: Consumers only see committed transactions
- Design idempotent consumers: Handle duplicate messages gracefully as fallback
Note: EOS has performance implications due to additional coordination overhead.
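The "idempotent consumer" fallback in step 4 can be sketched as deduplication by message ID. This is an illustrative in-memory version, not a Kafka API; a production system would bound the seen-ID set (or persist it) and derive IDs from message headers or keys.

```python
class IdempotentHandler:
    """Minimal dedup sketch: remember processed message IDs so that
    redelivered messages (at-least-once delivery) apply side effects once."""
    def __init__(self):
        self.seen = set()
        self.applied = []

    def handle(self, msg_id: str, payload: str) -> bool:
        if msg_id in self.seen:
            return False           # duplicate: skip side effects
        self.seen.add(msg_id)
        self.applied.append(payload)
        return True

h = IdempotentHandler()
h.handle("m1", "debit $10")
h.handle("m1", "debit $10")   # redelivery of the same message
print(len(h.applied))          # -> 1
```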
📜 Schema Registry and Avro
📊 Why Use Schema Registry?
Schema Registry provides a centralized schema store and enforces compatibility rules.
Benefits:
- Data governance: Prevent incompatible schema changes
- Evolution safety: Backward/forward compatibility checks
- Serialization efficiency: Avro is compact and fast
- Documentation: Schemas serve as API contracts
🔄 Schema Compatibility Types
- BACKWARD: New schema can read old data (consumers can upgrade first)
- FORWARD: Old schema can read new data (producers can upgrade first)
- FULL: Both backward and forward compatible
- NONE: No compatibility checks
Best Practice: Use BACKWARD or FULL compatibility for most use cases.
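A grossly simplified version of a BACKWARD check can make the rule concrete: the new (reader) schema may add fields only if they carry defaults, and may not change an existing field's type. Schema Registry's real Avro resolution rules cover many more cases (promotions, aliases, unions); this sketch only illustrates the idea.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Toy BACKWARD check: can a reader using new_fields decode data
    written with old_fields?"""
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False       # new required field breaks old data
        elif spec["type"] != old_fields[name]["type"]:
            return False           # type change breaks old data
    return True                    # removed fields are fine: reader ignores them

old = {"user_id": {"type": "string"}}
ok  = {"user_id": {"type": "string"}, "email": {"type": "string", "default": ""}}
bad = {"user_id": {"type": "string"}, "email": {"type": "string"}}
print(is_backward_compatible(old, ok), is_backward_compatible(old, bad))  # -> True False
```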
⏳ Retention Policies and Compaction
📦 Retention Policies
Control how long Kafka keeps data.
Types:
- Time-based: `log.retention.hours` (default 168 hours = 7 days)
- Size-based: `log.retention.bytes` per topic
- Delete vs Compact: `cleanup.policy` (`delete` or `compact`)
🗜️ Log Compaction
Keeps only the latest value for each key, useful for:
- Event sourcing: Rebuild state from events
- Configuration snapshots: Latest config per service
- Entity state: Current user profile, inventory counts
Compaction Trigger: When the ratio of uncleaned ("dirty") log to total log exceeds the `min.cleanable.dirty.ratio` threshold (default 0.5)
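The effect of compaction can be sketched as "keep the latest value per key". The example below also shows tombstones: a record with a null value deletes its key entirely. The log contents are illustrative.

```python
def compact(log):
    """Sketch of what log compaction retains: the latest value per key.
    A key whose latest record is a tombstone (value None) is removed."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [
    ("user:1", "alice@old.com"),
    ("user:2", "bob@example.com"),
    ("user:1", "alice@new.com"),   # supersedes the first record
    ("user:2", None),              # tombstone: delete user:2
]
print(compact(log))  # -> {'user:1': 'alice@new.com'}
```

Note that real compaction runs asynchronously per segment, so readers may still see superseded records for a while.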
📊 Monitoring: Lag, Throughput, and Errors
📈 Key Metrics to Monitor
Consumer Lag:
- Difference between current offset and end offset
- Indicates if consumers can keep up with producers
- Alert when lag grows steadily
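Lag is arithmetic over offsets: per partition it is the log-end offset minus the group's committed offset, and total lag is the sum. This is essentially the LAG column that `kafka-consumer-groups.sh` reports; the offsets below are made up for illustration.

```python
def total_lag(end_offsets: dict, committed: dict) -> int:
    """Sum of (end offset - committed offset) across partitions.
    A partition with no committed offset counts from 0."""
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

end_offsets = {0: 1000, 1: 1200, 2: 950}   # latest offset per partition
committed   = {0: 990,  1: 1200, 2: 800}   # consumer group's committed offsets
print(total_lag(end_offsets, committed))    # -> 160
```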
Throughput:
- Messages/sec in/out
- Bytes/sec in/out
- Request rates
Error Rates:
- Failed produce requests
- Consumer exceptions
- Connection failures
🛠️ Monitoring Tools
- Built-in: JMX metrics, `kafka-consumer-groups.sh` tool
- Open Source: Prometheus + Grafana; Commercial: Confluent Control Center
- Managed Services: Cloud provider monitoring integrations
⚠️ Common Anti-Patterns
❌ Using Kafka as a Database
Problem: Storing data indefinitely and expecting query capabilities.
Solution: Use Kafka for event streaming; move data to appropriate databases for querying.
❌ Over-Partitioning
Problem: Too many partitions increase overhead (metadata, file handles, recovery time).
Solution: Start with a reasonable partition count (e.g., 3-6 per broker) and scale based on throughput needs.
❌ Ignoring Message Ordering Guarantees
Problem: Assuming global ordering across partitions.
Solution: Use partitioning keys for ordering within key groups, and design consumers to handle out-of-order messages.
❌ Not Handling Poison Pills
Problem: Bad messages causing consumer crashes and infinite restart loops.
Solution: Implement dead letter queues, poison pill handling, or skip mechanisms.
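The dead-letter approach can be sketched as a retry-then-park loop: try each message a few times, and if it still fails, route it aside instead of crashing the consumer. In a real deployment the "dead letters" would go to a separate Kafka topic; here a list stands in for it, and `process` is a hypothetical handler.

```python
def consume_with_dlq(messages, process, max_retries=3):
    """Sketch of poison-pill handling: retry each message up to
    max_retries times, then park it in a dead-letter list and move on."""
    dead_letters = []
    for msg in messages:
        for attempt in range(max_retries):
            try:
                process(msg)
                break                         # processed successfully
            except Exception:
                if attempt == max_retries - 1:
                    dead_letters.append(msg)  # give up: park for inspection
    return dead_letters

def process(msg):
    if msg == "corrupt":                      # simulated poison pill
        raise ValueError("cannot parse")

print(consume_with_dlq(["ok", "corrupt", "ok"], process))  # -> ['corrupt']
```

The key property is that one bad message never blocks the partition: healthy messages after it are still processed.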
⚔️ Kafka vs RabbitMQ vs Redis Streams
🆚 Feature Comparison
| Feature | Kafka | RabbitMQ | Redis Streams |
|---|---|---|---|
| Model | Log-based streaming | Traditional message queue | Log-based streaming |
| Throughput | Very high | Medium | High |
| Persistence | Disk-based (configurable) | Disk/memory | Memory (with AOF) |
| Ordering | Per-partition | Per-queue (FIFO) | Per-stream |
| Consumer Model | Consumer groups | Competing consumers | Consumer groups |
| Retry/DLQ | Manual implementation | Built-in | Manual implementation |
| Best For | Event sourcing, high-throughput pipelines | Complex routing, task queues | Simple streaming, low-latency |
🎯 When to Choose Each
Choose Kafka when:
- Building event-driven architectures
- Needing high throughput and durability
- Implementing event sourcing or CQRS
- Long-term data retention is needed
Choose RabbitMQ when:
- Complex routing is required (topics, headers)
- Need sophisticated queueing patterns
- Lower latency is critical
- Polyglot protocol support is important
Choose Redis Streams when:
- Already using Redis in infrastructure
- Need simple streaming with consumer groups
- Can accept memory-limited durability
- Low-latency processing is priority
🏁 Conclusion
Using Kafka effectively requires understanding its distributed nature and embracing event streaming principles. By following the patterns outlined—proper topic design, thoughtful partitioning, consumer group management, and avoiding common anti-patterns—you can build robust, scalable event-driven systems.
Remember that Kafka excels as a high-throughput, durable event log, not as a general-purpose database or task queue. Align your usage with its strengths, and you'll unlock powerful capabilities for real-time data processing and microservices communication.
❓ Frequently Asked Questions
Q: How many partitions should I start with for a new topic?
A: Start with 3-6 partitions per broker as a baseline, then scale based on your throughput requirements. Monitor consumer lag and adjust as needed. Remember that you can always increase partitions later (but not decrease easily).
Q: Should I use keys in my Kafka messages?
A: Use keys when you need ordering guarantees for related events (e.g., all events for a specific user_id should be processed in order). Without keys, messages are distributed randomly across partitions, which provides better throughput but no ordering guarantees.
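The per-key ordering guarantee follows from deterministic partition assignment, which can be illustrated with a toy partitioner. Note this is NOT Kafka's actual algorithm (the real default hashes the serialized key with murmur2, and unkeyed messages use sticky batching); the point is only that the same key always maps to the same partition.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Toy partitioner: hash the key, mod the partition count.
    Determinism is what gives per-key ordering in Kafka."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_partitions

# All events for user-42 land on the same partition, so they are consumed in order
print(partition_for("user-42", 6) == partition_for("user-42", 6))  # -> True
```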
Q: How do I handle schema evolution safely?
A: Use Schema Registry with appropriate compatibility settings (BACKWARD or FULL). Always test consumer/producer compatibility before deploying schema changes. Consider using a canary release strategy for schema updates.
Q: What's the difference between delete and compact cleanup policies?
A: delete removes old messages based on time or size thresholds. compact retains only the latest value for each message key, effectively creating a snapshot of the latest state for each key.
Q: How can I reduce consumer rebalances?
A: Use static membership (Kafka 2.3+), optimize session.timeout.ms, avoid frequent consumer restarts, and ensure your consumers can process heartbeats within the session timeout period.
Q: Is exactly-once semantics worth the performance cost?
A: It depends on your use case. For financial transactions or other critical operations where duplicates could cause incorrect state, yes. For many event streaming use cases (metrics, logging, etc.), at-least-once with idempotent consumers is sufficient and performs better.
Q: When should I consider alternatives to Kafka?
A: Consider alternatives when you need: complex message routing (RabbitMQ), ultra-low latency with in-memory storage (Redis Streams), or simple task queues where Kafka's overhead isn't justified.