Design WhatsApp


WhatsApp is a secure and popular messaging app with over 2 billion users worldwide. On average, users spend 19.4 hours monthly, sending more than 100 billion messages daily—a 54% rise since 2018.

As system designers, we must think about how fast users are growing.

WhatsApp is a good example to study.

Some key questions are:

  • How is WhatsApp designed?
  • How does it actually work?
  • What components make up the system?
  • How can it support billions of users at the same time?
  • How does it keep user data safe and secure?

Functional Requirements

Conversations:

  • Support one-to-one chats.
  • Support group chats.

Acknowledgment:

  • Show message status: sent, delivered, and read.

Sharing:

  • Allow sharing of images, videos, and audio files.

Chat Storage:

  • Store chat messages even if the user is offline.
  • Deliver messages once the user comes online.

Push Notifications:

  • Notify users about new messages when they are offline.
  • Deliver notifications as soon as they are online.

Non-Functional Requirements

Low Latency

  • Messages should be delivered quickly with minimal delay.

Consistency

  • Messages must appear in the same order they were sent.
  • Chat history should stay the same across all devices.

Availability

  • The system should always be accessible.
  • In some cases, availability may be sacrificed to maintain consistency.

Security

  • All messages must use end-to-end encryption.
  • Only the sender and receiver can read the content, not even WhatsApp.

Scalability

  • The system must handle a growing number of users.
  • It should support billions of messages per day.

Resource Estimation

WhatsApp is the world’s most widely used messaging app.

It has over 2 billion users globally.

Users exchange more than 100 billion messages per day.

Storage Estimation

  • Of the 100 billion daily messages, assume 90 billion are text, 5 billion are images, and 5 billion are videos.
  • Assume an average text message is 50 characters at 2 bytes each: 50 * 2 = 100 bytes.
  • Text: 100 bytes * 90 billion = 9 TB (remember: 1 billion * 1 KB = 1 TB).
  • Images: average size 2 MB, so 2 MB * 5 billion = 10 PB (1 MB * 1 billion = 1 PB).
  • Videos: average size 100 MB, so 100 MB * 5 billion = 500 PB.
  • Total storage ≈ 510 PB per day.
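The storage arithmetic can be checked with a few lines of Python. This is only a sketch of the estimate: it uses the 90/5/5 billion split and the average sizes assumed in the list, decimal units (1 PB = 10^15 bytes), and constant and function names chosen for illustration.

```python
# Back-of-the-envelope daily storage, in decimal units (1 PB = 10**15 bytes).
TEXT_MESSAGES = 90 * 10**9     # text messages per day (assumed split)
IMAGES = 5 * 10**9             # images per day (assumed split)
VIDEOS = 5 * 10**9             # videos per day (assumed split)

TEXT_BYTES = 50 * 2            # 50 characters at 2 bytes each
IMAGE_BYTES = 2 * 10**6        # 2 MB average image
VIDEO_BYTES = 100 * 10**6      # 100 MB average video


def daily_storage_bytes() -> dict[str, int]:
    """Return the per-category and total daily storage in bytes."""
    text = TEXT_MESSAGES * TEXT_BYTES
    images = IMAGES * IMAGE_BYTES
    videos = VIDEOS * VIDEO_BYTES
    return {"text": text, "images": images, "videos": videos,
            "total": text + images + videos}


if __name__ == "__main__":
    for name, size in daily_storage_bytes().items():
        print(f"{name:>6}: {size / 10**15:.3f} PB")
```

Video dominates the total, so the text figure barely moves the result: the sum stays at roughly 510 PB per day.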

Bandwidth Estimation

  • Of the 100 billion daily messages, assume each one is delivered to at least 2 people.
  • That gives 200 billion requests per day.
  • 200 billion / 86,400 s ≈ 2,300K QPS (1 billion requests/day ≈ 11.5K QPS).
  • Assume a standard server handles 64K QPS: typically a 32-core CPU, 128 GB RAM, NVMe SSDs, and a 10 Gbps network, plus caching and load balancing.
  • Number of servers = 2,300K / 64K ≈ 36 servers.
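The same estimate in Python, as a rough sketch. The fan-out of 2 and the 64K requests-per-second server capacity are the assumptions from the list; the names are illustrative. With the unrounded request rate the ratio is about 36.2, so rounding up gives 37 servers, slightly more than the back-of-the-envelope figure.

```python
import math

MESSAGES_PER_DAY = 100 * 10**9   # from the resource estimate
FANOUT = 2                       # assumed: each message reaches at least 2 people
SECONDS_PER_DAY = 86_400
SERVER_CAPACITY_QPS = 64_000     # assumed capacity of one standard server


def requests_per_second() -> float:
    """Average request rate implied by the daily message volume."""
    return MESSAGES_PER_DAY * FANOUT / SECONDS_PER_DAY


def servers_needed() -> int:
    """Servers required at the average rate, rounded up to a whole server."""
    return math.ceil(requests_per_second() / SERVER_CAPACITY_QPS)


if __name__ == "__main__":
    print(f"{requests_per_second():,.0f} requests/s")
    print(f"{servers_needed()} servers")
```

This covers the average load only; peak traffic and redundancy would push the real number higher.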

API Design

sendMessage(message_ID, sender_ID, receiver_ID, type, text=none, media_object=none, document=none)

getMessage(user_Id)

uploadFile(file_type, file)

downloadFile(user_id, file_id)
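To make the `sendMessage` signature more concrete, here is a minimal Python sketch of its request shape. The field names follow the signature above; the `MessageType` values and the rule that exactly one payload must match `type` are assumptions made for illustration, not part of the original API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MessageType(Enum):
    # Assumed set of values for the `type` parameter.
    TEXT = "text"
    MEDIA = "media"
    DOCUMENT = "document"


@dataclass(frozen=True)
class SendMessageRequest:
    message_id: str
    sender_id: str
    receiver_id: str
    type: MessageType
    text: Optional[str] = None
    media_object: Optional[str] = None   # hypothetical: a file ID from uploadFile
    document: Optional[str] = None       # hypothetical: a file ID from uploadFile

    def validate(self) -> None:
        """Require that the only payload present is the one named by `type`."""
        payloads = {
            MessageType.TEXT: self.text,
            MessageType.MEDIA: self.media_object,
            MessageType.DOCUMENT: self.document,
        }
        present = [kind for kind, value in payloads.items() if value is not None]
        if present != [self.type]:
            raise ValueError(f"payload does not match type {self.type.value!r}")
```

A text message would be built as `SendMessageRequest("m1", "alice", "bob", MessageType.TEXT, text="hi")` and checked with `validate()` before it is sent.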

High-level Design

[Figure: high-level design diagram]

Detailed Design

What’s missing from the high-level design?

  • How do clients and servers create a communication channel?
  • How can the design scale to billions of users?
  • Where and how is user data stored?
  • How do we identify the correct receiver for a message?

Let’s go beyond the high-level design and examine each component in detail.

When you open WhatsApp on your phone, something interesting happens behind the scenes. Your device doesn’t just send messages blindly—it first establishes a persistent connection with a WebSocket server using the WebSocket protocol. Unlike traditional HTTP, this connection stays open, allowing instant, two-way communication.

But here’s the catch: one server cannot handle billions of people chatting at the same time. That’s why WhatsApp has many WebSocket servers, each responsible for keeping connections alive. Every online user gets a port, and the information about which user is connected to which server and port is carefully maintained in a central place called the WebSocket Manager, which sits on top of a Redis cluster. Think of Redis as a super-fast phone directory that helps find where each user is currently connected.
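The directory role of the WebSocket Manager can be sketched as a simple mapping from user to server and port. In this sketch an in-memory dict stands in for the Redis cluster, and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    server: str   # which WebSocket server holds the connection
    port: int     # the port assigned to this user on that server


class WebSocketManager:
    """Tracks where each online user is connected (dict stands in for Redis)."""

    def __init__(self) -> None:
        self._connections: dict[str, Endpoint] = {}

    def register(self, user_id: str, server: str, port: int) -> None:
        # Called when a user opens a connection; overwrites any stale entry.
        self._connections[user_id] = Endpoint(server, port)

    def unregister(self, user_id: str) -> None:
        # Called when the connection closes; a missing entry is not an error.
        self._connections.pop(user_id, None)

    def lookup(self, user_id: str) -> Optional[Endpoint]:
        # Returns None when the user is offline.
        return self._connections.get(user_id)
```

A `None` result from `lookup` is what tells the rest of the system that a user is offline.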

Sending and Receiving Messages

Now, imagine User A wants to send a message to User B. Here’s how the story unfolds:

  1. User A sends the message to their WebSocket server.
  2. That server checks with the WebSocket Manager to find where User B is connected.
    • If User B is online, the manager immediately points to their server.
    • If User B is offline, the message takes a different route.
  3. At the same time, the message is also stored in the Message Service, which sits on top of a special distributed database called Mnesia.
    • Mnesia is designed for fast lookups, high fault tolerance, and quick deletion of old messages.
    • Messages are stored temporarily (FIFO order) and deleted once delivered—or after 30 days if undelivered.
  4. If B is online, their WebSocket server picks up the message and delivers it instantly. If B is offline, they’ll receive it when they come back online, often via a push notification.
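The store-then-deliver flow above can be sketched in Python. This is a simplified model, not how Mnesia works internally: a per-receiver FIFO queue stands in for the Message Service, the 30-day expiry is applied when the queue is drained, and all names are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional

THIRTY_DAYS = 30 * 24 * 3600  # retention for undelivered messages, in seconds


class MessageStore:
    """Per-receiver FIFO queues; messages are removed once delivered."""

    def __init__(self, ttl_seconds: float = THIRTY_DAYS) -> None:
        self._queues: dict[str, deque] = defaultdict(deque)
        self._ttl = ttl_seconds

    def enqueue(self, receiver_id: str, text: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._queues[receiver_id].append((now, text))

    def drain(self, receiver_id: str, now: Optional[float] = None) -> list[str]:
        """Return pending messages in FIFO order, dropping expired ones."""
        now = time.time() if now is None else now
        queue = self._queues.pop(receiver_id, deque())
        return [text for stored_at, text in queue if now - stored_at <= self._ttl]


def send(store: MessageStore, online: set[str], receiver_id: str,
         text: str, now: Optional[float] = None) -> list[str]:
    """Store the message, then deliver at once if the receiver is online."""
    store.enqueue(receiver_id, text, now)
    if receiver_id in online:
        return store.drain(receiver_id, now)  # delivered, so removed from the store
    return []                                 # stays queued until the receiver reconnects
```

When an offline user reconnects, calling `drain` for that user returns everything that was queued, oldest first.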

To make this smoother, each WebSocket server keeps a small cache of recent connections so it doesn’t always need to ask the WebSocket Manager. For example, if A and B are chatting continuously, the servers already know each other’s location.
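A per-server cache like this could look roughly as follows. The sketch assumes entries expire after a fixed time-to-live so that a stale location is eventually refreshed; the 60-second default and the names are illustrative choices, not details from WhatsApp.

```python
import time
from typing import Callable, Optional


class LocationCache:
    """Small per-server cache in front of the WebSocket Manager lookup."""

    def __init__(self, lookup: Callable[[str], Optional[str]],
                 ttl_seconds: float = 60.0) -> None:
        self._lookup = lookup          # the (slower) call to the WebSocket Manager
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, str]] = {}

    def locate(self, user_id: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        cached = self._entries.get(user_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]           # fresh hit: no call to the manager
        server = self._lookup(user_id)
        if server is None:
            self._entries.pop(user_id, None)   # user went offline: forget them
        else:
            self._entries[user_id] = (now, server)
        return server
```

The trade-off is freshness: a short time-to-live means more lookups, while a long one risks routing to a server the user has already left.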

Sharing Media Files

Text is light, but media—like photos, videos, and documents—are heavy. To handle this, WhatsApp uses a dedicated Asset Service. Here’s the process:

  1. The media file is compressed and encrypted on the sender’s phone.
  2. It is uploaded to blob storage via the Asset Service.
    • To avoid duplication, a hash is generated. If the file already exists, WhatsApp just reuses the existing copy.
  3. The Asset Service generates a unique file ID and passes this ID to the receiver via the Message Service.
  4. The receiver then downloads the media directly from storage using the ID.
  5. If a particular file is requested too often, the Asset Service caches it in a CDN for faster delivery.
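The deduplication step can be illustrated with a content hash. In this sketch a dict stands in for blob storage and the SHA-256 digest doubles as the file ID; both are simplifying assumptions, and the names are hypothetical. Deduplication only triggers when the uploaded bytes are identical.

```python
import hashlib


class AssetService:
    """Stores each distinct blob once, keyed by the hash of its bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}   # stand-in for blob storage

    def upload(self, encrypted_bytes: bytes) -> str:
        # Assumption: the content hash is used directly as the file ID.
        file_id = hashlib.sha256(encrypted_bytes).hexdigest()
        # setdefault keeps the existing copy if this content was seen before.
        self._blobs.setdefault(file_id, encrypted_bytes)
        return file_id

    def download(self, file_id: str) -> bytes:
        return self._blobs[file_id]

    def stored_count(self) -> int:
        return len(self._blobs)
```

Uploading the same bytes twice returns the same ID and leaves a single stored copy, which is the saving the hash check is meant to provide.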

Group Messages

Groups are trickier. Not everyone in a group is online at the same time. Here’s how WhatsApp handles it:

  1. When User A sends a group message, it goes first to the Message Service.
  2. The message is then pushed into Kafka, which acts like a message bus. In Kafka terms:
    • The group = a topic.
    • Senders = producers.
    • Group members = consumers.
  3. The Group Message Handler queries the Group Service, gets the list of members, and then delivers the message to each user, just like a WebSocket server would.

The Group Service (on top of a MySQL cluster with Redis caching) maintains full group details: IDs, members, icons, status, etc.
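The group fan-out can be sketched without a real Kafka cluster. Here an in-memory queue per group stands in for a Kafka topic, a dict of sets stands in for the Group Service's database, and all class and method names are hypothetical.

```python
from collections import defaultdict, deque


class GroupService:
    """Holds group membership (stand-in for the MySQL-backed service)."""

    def __init__(self) -> None:
        self._members: dict[str, set[str]] = defaultdict(set)

    def add_member(self, group_id: str, user_id: str) -> None:
        self._members[group_id].add(user_id)

    def members(self, group_id: str) -> set[str]:
        return set(self._members.get(group_id, set()))


class GroupMessageHandler:
    """Consumes each group's queue and delivers to every other member."""

    def __init__(self, group_service: GroupService) -> None:
        self._groups = group_service
        self._topics: dict[str, deque] = defaultdict(deque)  # one queue per group
        self.inboxes: dict[str, list] = defaultdict(list)    # delivered messages

    def publish(self, group_id: str, sender_id: str, text: str) -> None:
        # The sender acts as a producer on the group's topic.
        self._topics[group_id].append((sender_id, text))

    def fan_out(self, group_id: str) -> int:
        """Deliver all queued messages; return the number of deliveries."""
        delivered = 0
        topic = self._topics[group_id]
        while topic:
            sender_id, text = topic.popleft()
            for member in sorted(self._groups.members(group_id)):
                if member != sender_id:          # do not echo to the sender
                    self.inboxes[member].append((group_id, sender_id, text))
                    delivered += 1
        return delivered
```

In a real deployment each delivery would go through the same online/offline routing as a one-to-one message; the sketch only shows the one-to-many expansion.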


Validate Non-Functional Requirements

The system is designed to meet key non-functional requirements: low latency, consistency, availability, security, and scalability.

  • Low Latency
    • Use geographically distributed WebSocket servers with caching.
    • Add Redis cache clusters on top of MySQL clusters.
    • Use CDNs for fast delivery of media and documents.
  • Consistency
    • Ensure message order with a FIFO messaging queue.
    • Use a Sequencer to assign IDs and maintain causality.
    • Store offline messages in the Mnesia database queue and deliver in order when users reconnect.
  • Availability
    • Deploy enough WebSocket servers with data replication.
    • Re-create sessions via load balancer if a WebSocket server fails.
    • Use Mnesia cluster with primary-secondary replication for durability and availability.
  • Security
    • Apply end-to-end encryption so only sender and receiver can read messages.
  • Scalability
    • One server can handle ~10 million connections.
    • Add or remove servers dynamically as load changes.
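The Sequencer idea from the consistency point can be sketched as a per-chat counter paired with a receiver-side buffer that releases messages only in sequence order. This is one simple way to get ordered delivery, offered as an assumption rather than a description of WhatsApp's implementation; the names are hypothetical.

```python
from collections import defaultdict


class Sequencer:
    """Hands out increasing IDs per chat, starting at 1."""

    def __init__(self) -> None:
        self._last: dict[str, int] = defaultdict(int)

    def next_id(self, chat_id: str) -> int:
        self._last[chat_id] += 1
        return self._last[chat_id]


class OrderedInbox:
    """Buffers out-of-order arrivals for one chat and releases them in order."""

    def __init__(self) -> None:
        self._expected = 1
        self._pending: dict[int, str] = {}

    def receive(self, seq: int, text: str) -> list[str]:
        """Accept one message; return every message that is now deliverable."""
        self._pending[seq] = text
        ready = []
        while self._expected in self._pending:
            ready.append(self._pending.pop(self._expected))
            self._expected += 1
        return ready
```

If message 2 arrives before message 1, `receive` holds it back and returns both, in order, once message 1 shows up.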

Trade-offs in WhatsApp’s Design

Even though the system meets functional and non-functional requirements, there are two key trade-offs:

  1. Consistency vs. Availability
    • CAP Theorem: During network failures, a system can guarantee either consistency or availability, but not both.
    • WhatsApp’s Choice:
      • Message order is very important.
      • Prioritize consistency (messages must stay in order).
      • Accept reduced availability in rare failure cases.
  2. Latency vs. Security
    • Low Latency: Users expect real-time message delivery.
    • Security Requirement: End-to-end encryption ensures messages are private.
    • Trade-off:
      • Encryption/decryption of text, images, videos, and audio adds processing time.
      • This may increase latency, especially for large multimedia files.
    • WhatsApp’s Choice:
      • Prioritize security over ultra-low latency.
      • Accept slight delays for safe message transmission.

Summary

  • We designed a WhatsApp messenger system.
  • Steps we covered:
    • Identified functional and non-functional requirements.
    • Estimated key resources (storage, bandwidth, servers).
    • Designed both high-level and detailed architecture.
    • Explained components and their roles in the system.
    • Evaluated how the system meets non-functional requirements.
    • Discussed important trade-offs (consistency vs. availability, latency vs. security).
  • Key takeaway:
    • General-purpose servers can be optimized for large-scale systems.
