Build Mini-Kafka
Build a production-grade message broker from scratch. Implement Kafka's append-only partition log, Raft-based leader election, ISR tracking and high-watermark advancement, idempotent and transactional producers, and consumer groups with rebalancing.
Apache Kafka
Kafka is the backbone of modern data infrastructure. LinkedIn open-sourced it in 2011 to solve a problem that every large company faces: how do you reliably move billions of events per day between hundreds of services? By 2024, more than 80% of Fortune 100 companies rely on it for everything from fraud detection to real-time analytics. A single Kafka cluster can sustain millions of messages per second with single-digit millisecond end-to-end latency.
The Core Architecture
Kafka's architecture separates concerns sharply. Producers write to topics. Brokers store and serve partitions. Consumers read from topics at their own pace. No component is tightly coupled to another — producers do not care which consumers exist, and consumers can be added, removed, or restarted without affecting producers or brokers.
A topic is a named stream of records. It is divided into partitions — ordered, append-only logs distributed across the broker cluster. Each partition has one leader broker that handles all reads and writes for that partition, and one or more follower brokers that replicate the partition log to provide fault tolerance.
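The routing of a record to a partition can be sketched in a few lines. This is a hypothetical illustration, not Kafka's actual partitioner (which uses murmur2); the broker names and the use of `hashlib` here are assumptions for the sketch:

```python
# Hypothetical sketch: route a keyed record to a partition by hashing
# the key modulo the partition count. A stable hash keeps every record
# with the same key in one partition, preserving per-key ordering.
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Each partition has exactly one leader broker that serves its traffic.
leaders = {0: "broker-1", 1: "broker-2", 2: "broker-3"}
p = choose_partition(b"user-42", 3)
print(f"partition {p}, leader {leaders[p]}")
```

Because the hash is deterministic, retries and restarts route a key to the same partition every time.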
What You Will Build
You will implement the core subsystems of Kafka from first principles, working through three tracks:
- Partition Log (Track 1) — the append-only log that is Kafka's central storage primitive. Build offset-addressed storage where every message gets an immutable offset, and a sparse index that enables O(log n) seeks into arbitrary positions of the log without scanning from the beginning.
- Leader Election and ISR (Track 2) — implement Raft-based partition leader election. One leader per partition handles all writes; followers replicate from it. Understand the In-Sync Replica (ISR) set — the set of replicas sufficiently caught up with the leader — and the high watermark that defines the highest offset consumers are allowed to read.
- Producers and Consumers (Track 3) — build idempotent producers that deduplicate retried messages using sequence numbers, eliminating duplicates caused by network retries without requiring coordination or distributed locks. Then implement a consumer group coordinator that assigns partitions round-robin across consumers and triggers rebalancing when the group membership changes.
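The sparse index from Track 1 can be sketched as follows. This is a minimal illustration under assumed layout choices (index every Nth offset, fixed interval); the class and method names are hypothetical:

```python
# Sparse offset index sketch: every Nth message records a pair
# (offset, byte_position). A seek binary-searches for the greatest
# indexed offset at or below the target in O(log n), then the caller
# scans forward from that byte position instead of from offset 0.
import bisect

class SparseIndex:
    def __init__(self, interval: int = 4):
        self.interval = interval
        self.offsets = []    # indexed message offsets, kept sorted
        self.positions = []  # byte position of each indexed offset

    def maybe_add(self, offset: int, position: int) -> None:
        if offset % self.interval == 0:
            self.offsets.append(offset)
            self.positions.append(position)

    def seek(self, target: int) -> int:
        i = bisect.bisect_right(self.offsets, target) - 1
        return self.positions[i] if i >= 0 else 0

idx = SparseIndex(interval=4)
for off in range(10):
    idx.maybe_add(off, off * 100)  # pretend every message is 100 bytes
print(idx.seek(6))  # nearest indexed offset is 4 -> byte position 400
```

Keeping the index sparse rather than indexing every message is the trade that keeps it small enough to hold in memory.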
The Central Idea: The Log as the Database
Kafka's key insight is treating the log as the primary data structure. Rather than a traditional message queue that deletes messages on delivery, Kafka retains them. Consumers track their own position in the log independently. This decoupling of write speed from read speed — and the ability to replay any offset window — is what makes Kafka the foundation of event-sourced architectures, stream processing pipelines, and change data capture systems.
The log is also what makes Kafka's replication straightforward: followers replicate the leader's log sequentially (a new follower starts from the beginning; an existing one resumes from where it left off), and the high watermark tells everyone which prefix is safe to serve. There is no complex conflict resolution, no vector clocks, no anti-entropy — just sequential log replication.
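High-watermark advancement reduces to a one-line computation under the bookkeeping assumed here (the leader tracking each ISR member's log end offset); the function and replica names are illustrative:

```python
# Sketch: the high watermark is the minimum log end offset across the
# In-Sync Replica set -- the longest prefix replicated by every member,
# and therefore the highest point consumers may safely read.
def high_watermark(log_end_offsets: dict[str, int], isr: set[str]) -> int:
    return min(log_end_offsets[r] for r in isr)

ends = {"leader": 10, "follower-1": 9, "follower-2": 7}
print(high_watermark(ends, {"leader", "follower-1", "follower-2"}))  # 7
# If follower-2 falls behind and is dropped from the ISR, the
# watermark can advance without waiting for it:
print(high_watermark(ends, {"leader", "follower-1"}))  # 9
```

This is why shrinking the ISR lets writes keep committing when a replica stalls, at the cost of a smaller replication factor for those messages.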
Prerequisites
Comfort with Python or Go, and basic understanding of lists and dictionaries. Familiarity with the concept of sequential file I/O is helpful but not required. The tracks build on each other — complete them in order for the best learning progression.
Tracks
Partition Log
Build the core storage engine: an append-only log with offset addressing and a sparse index. This is the data structure that gives Kafka its sequential write performance and replay capability.
Replication
Implement the replication layer that makes Kafka durable. Raft-based leader election picks a leader per partition. ISR tracking and the high watermark define which messages are safe to serve to consumers.
Producers and Consumers
Build the client-side guarantees. Idempotent producers eliminate duplicates on retry. Consumer groups coordinate partition ownership and rebalance when members join or leave.
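The round-robin assignment that the Track 3 coordinator performs can be sketched briefly. The function name and the rebalance-by-recomputation approach are assumptions of this sketch:

```python
# Hypothetical sketch of round-robin partition assignment: sorted
# partition i goes to member i mod len(members). A membership change
# (join or leave) triggers a rebalance, which simply recomputes the
# whole plan for the new member list.
def assign(partitions: list[int], members: list[str]) -> dict[str, list[int]]:
    plan = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        plan[members[i % len(members)]].append(p)
    return plan

print(assign([0, 1, 2, 3, 4, 5], ["c1", "c2"]))
# A third consumer joining redistributes ownership:
print(assign([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"]))
```

Recomputing from scratch keeps the coordinator simple, at the cost of moving partitions that did not strictly need to move; stickier strategies exist to reduce that churn.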