TCP's Three-Way Handshake Explained: Connection Setup, Teardown, and State Machine
A protocol-level deep dive into how TCP establishes and tears down connections — covering SYN/ACK mechanics, Initial Sequence Number security, the complete TCP state machine, TIME_WAIT behavior, SYN flood defenses, and TCP Fast Open for latency-critical applications.
Every time your browser loads a webpage, your database driver connects to Postgres, or your mobile app fetches data from an API, a silent ceremony takes place before a single byte of real data crosses the wire. This ceremony is the TCP three-way handshake — a precise exchange of three packets that transforms two unrelated network endpoints into a coordinated, reliable communication channel.
The handshake is so fundamental that it runs hundreds of millions of times per second across the internet without most developers ever thinking about it. But when things go wrong — connection timeouts, SYN flood attacks, port exhaustion from TIME_WAIT accumulation, or latency spikes from cold TCP connections — understanding the handshake at the protocol level is the only way to diagnose and fix the problem.
In this masterclass, Professor Pixel will walk you through every step: why the handshake exists, what each packet contains, how TCP tracks connection state through its finite state machine, how connections are gracefully terminated, and where the landmines are for production systems dealing with high connection rates.
1. Why TCP Needs a Handshake: The Reliability Problem
1.1 IP Is Unreliable by Design
The Internet Protocol (IP) is a best-effort delivery system. IP packets can be dropped by routers under load, reordered by different routing paths, duplicated by link-layer retransmissions, or corrupted in transit. IP makes no promises. It simply forwards packets toward their destination and discards them without complaint if a router's queue overflows.
TCP (Transmission Control Protocol) sits on top of IP and adds the guarantees that IP lacks: ordered delivery, loss detection and retransmission, flow control to prevent a fast sender from overwhelming a slow receiver, and congestion control to prevent the entire network from collapsing. Delivering all these guarantees requires that both sides of a connection maintain shared state — knowing each other's sequence numbers, window sizes, and connection status. The handshake is the protocol for establishing that shared state before data transfer begins.
1.2 The Mental Model: A Phone Call Protocol
Think of the TCP handshake as the equivalent of starting a phone call. You dial (SYN). The other person picks up and says "Hello?" (SYN-ACK — they acknowledge your call and announce their presence). You say "Hi, I can hear you" (ACK — you confirm the connection is two-way and working). Only then does the actual conversation (data transfer) begin. Neither party would start talking immediately after dialing without confirming the other person is actually on the line — the handshake is that confirmation protocol.
Common Misconception: Many developers think the handshake completes in one round-trip. Measured end to end it takes 1.5 round-trips: the client sends SYN (0.5 RTT), the server responds with SYN-ACK (0.5 RTT), and the client's ACK travels to the server (0.5 RTT). The client does not have to wait for that last leg, though: it may send data together with, or immediately after, its ACK, so the handshake delays the client's first data by exactly 1 RTT. On a connection with 100 ms RTT, every new TCP connection spends 100 ms before the client can send any data.
2. TCP in the Network Stack: Where It Lives
2.1 The OSI Layer Position
TCP operates at Layer 4 (Transport Layer) of the OSI model. It sits between the application layer (your HTTP, database protocol, or custom protocol) and the network layer (IP). When your application calls socket.send(data), the data passes down through TCP, which segments it, adds headers, and hands segments to IP for routing. When data arrives at the receiver, IP delivers packets to TCP, which reassembles them in order and presents a clean byte stream to the application.
Mermaid Diagram: TCP's position in the network stack between application and IP layers.
2.2 TCP vs UDP: When Not to Use TCP
TCP's reliability guarantees come with overhead: the handshake latency, per-segment acknowledgments, and retransmission on loss. For applications where occasional packet loss is acceptable and low latency is critical — video streaming, online gaming, DNS queries, VoIP — UDP is preferred because it sends data immediately with no setup ceremony. The application layer handles any needed reliability itself (or tolerates loss). Understanding this tradeoff prevents the common mistake of defaulting to TCP for every use case without considering the connection overhead.
3. The Three-Way Handshake: Step-by-Step Mechanics
3.1 Step 1 — SYN: Client Initiates
The client (the side initiating the connection — e.g., your browser) creates a TCP segment with the SYN flag set. This segment carries the client's Initial Sequence Number (ISN) — a 32-bit number that labels the starting position in the client's byte stream. The client sends this segment to the server's IP address and destination port (e.g., port 443 for HTTPS) and transitions to the SYN_SENT state.
The SYN segment contains no application data — its payload is empty. Its purpose is purely to announce the client's existence and share its starting sequence number. The segment header includes the client's Maximum Segment Size (MSS), window scale factor, and SACK (Selective Acknowledgment) capability — options that negotiate capabilities for the entire connection lifetime.
3.2 Step 2 — SYN-ACK: Server Responds
The server receives the SYN. If the server has a listening socket on that port (i.e., a server process called listen()), it replies with a segment that has both the SYN and ACK flags set. This segment carries two critical values:
- Acknowledgment Number: The client's ISN + 1. This confirms the server received the client's SYN and tells the client which byte the server expects next from the client's stream.
- Server's own ISN: A separate sequence number that labels the starting position of the server's byte stream toward the client. TCP is a bidirectional protocol — each direction has its own independent sequence number space.
The server transitions to the SYN_RECEIVED state after sending SYN-ACK. At this point, the kernel allocates a minimal connection entry in the SYN backlog — a half-open connection waiting to be confirmed.
3.3 Step 3 — ACK: Client Confirms
The client receives the SYN-ACK. It sends a final ACK segment, acknowledging the server's ISN + 1. Both sides now have confirmed each other's sequence numbers. The client transitions to ESTABLISHED, and when the server receives this ACK, it also transitions to ESTABLISHED. The connection is ready and application data can flow in both directions.
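The sequence number arithmetic of the three steps can be sketched in a few lines of Python. This is a toy model rather than a real TCP stack: the `Segment` class and the `three_way_handshake` function are hypothetical names introduced here for illustration, and the ISNs are simply passed in as arguments.

```python
from dataclasses import dataclass
from typing import List, Optional
import secrets

MOD = 2 ** 32  # sequence numbers are 32-bit values and wrap around


@dataclass
class Segment:
    flags: str                  # "SYN", "SYN+ACK" or "ACK"
    seq: int                    # sequence number carried by the segment
    ack: Optional[int] = None   # acknowledgment number (None if ACK flag unset)


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the three handshake segments for the given ISNs."""
    syn = Segment("SYN", seq=client_isn)
    # The SYN consumes one sequence number, so the server expects client_isn + 1.
    syn_ack = Segment("SYN+ACK", seq=server_isn, ack=(syn.seq + 1) % MOD)
    # The client continues at the number the server acknowledged and
    # acknowledges the server's SYN in the same way.
    ack = Segment("ACK", seq=syn_ack.ack, ack=(syn_ack.seq + 1) % MOD)
    return [syn, syn_ack, ack]


if __name__ == "__main__":
    for segment in three_way_handshake(secrets.randbits(32), secrets.randbits(32)):
        print(segment)
```

Running the module prints one handshake with random ISNs; the modulo keeps the arithmetic correct even when an ISN sits at the top of the 32-bit range.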
Pitfall — The Server Never Sees the ACK: In a high-traffic SYN flood attack, attackers send enormous numbers of SYN packets with spoofed source IPs. The server sends SYN-ACK to the spoofed addresses (which never respond), filling the SYN backlog with half-open connections. When the backlog is full, the server drops all new SYN packets — effectively a Denial of Service. The defense is SYN Cookies, covered in Section 9.
4. TCP Segment Header: The Key Fields
4.1 Header Field Map
The TCP header is 20 bytes minimum (up to 60 bytes with options). Understanding the key fields explains every behavior described in this post:
| Field | Size | Purpose |
|---|---|---|
| Source Port | 16 bits | Sender's port number (clients normally use an ephemeral port, e.g. 32768–60999 by default on Linux) |
| Dest Port | 16 bits | Target port (80 for HTTP, 443 for HTTPS, 5432 for Postgres) |
| Sequence Number | 32 bits | Byte offset of the first data byte in this segment's payload |
| Acknowledgment # | 32 bits | Next byte expected from the other side (cumulative ACK) |
| Flags | 9 bits | SYN, ACK, FIN, RST, PSH, URG, ECE, CWR, NS |
| Window Size | 16 bits | Receiver's advertised buffer space — how many bytes can be sent without ACK |
| Checksum | 16 bits | Error detection over header + data |
| Options | 0–40 bytes | MSS, window scale, SACK, timestamps (negotiated during handshake) |
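To make the field layout concrete, here is a minimal sketch that decodes the fixed 20-byte part of a TCP header with Python's `struct` module. The function name `parse_tcp_header` and the dictionary keys are illustrative choices, not part of any library. The sketch also unpacks the 16-bit urgent pointer, which follows the checksum in the header but is omitted from the table above, and it ignores options entirely.

```python
import struct
from typing import Dict

# Flag bits in the low byte of the 16-bit "data offset + flags" word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08,
             "ACK": 0x10, "URG": 0x20, "ECE": 0x40, "CWR": 0x80}


def parse_tcp_header(raw: bytes) -> Dict[str, object]:
    """Decode the fixed 20-byte TCP header (all fields in network byte order)."""
    if len(raw) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src_port, dst_port, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    data_offset = offset_flags >> 12  # header length in 32-bit words
    flags = [name for name, bit in FLAG_BITS.items() if offset_flags & bit]
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": data_offset * 4,  # bytes, including any options
        "flags": flags, "window": window,
        "checksum": checksum, "urgent": urgent,
    }


# A hand-built SYN header: ports 54321 -> 443, seq 1000, no options.
example = struct.pack("!HHIIHHHH", 54321, 443, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(example))
```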
4.2 The Window Size and Flow Control
The Window Size field implements TCP's flow control mechanism. It tells the sender how many bytes the receiver's buffer can currently accept. If the receiver's buffer is full (e.g., the application isn't reading fast enough), it advertises Window = 0, and the sender must stop transmitting until the window opens again. The maximum raw window size is $2^{16} - 1 = 65{,}535$ bytes — far too small for modern high-bandwidth links. The window scale option (negotiated during the handshake) multiplies this by up to $2^{14}$, allowing effective window sizes of up to 1 GB.
For a 100 Mbps link with 100 ms RTT and a 65 KB window: $\frac{65{,}535 \times 8 \text{ bits}}{0.1 \text{ s}} = 5.2 \text{ Mbps}$ — the window size caps throughput at just 5% of link capacity. Window scaling is essential for long-distance high-speed connections.
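The same arithmetic generalizes to any link. The two helpers below are a small sketch (the function names are made up for this example): one computes the throughput ceiling imposed by a window, the other the window needed to fill a link, which is the bandwidth-delay product.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 / rtt_seconds


def required_window_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: the window needed to keep the link full."""
    return link_bps * rtt_seconds / 8


if __name__ == "__main__":
    print(max_throughput_bps(65_535, 0.1) / 1e6, "Mbps with an unscaled window")
    print(required_window_bytes(100e6, 0.1), "bytes needed to fill 100 Mbps at 100 ms")
```

For the 100 Mbps, 100 ms example, the required window works out to 1.25 MB, roughly 19 times the unscaled maximum.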
5. Sequence Numbers: Why They Start Random
5.1 The ISN Must Be Unpredictable
A naïve TCP implementation might start every connection at sequence number 0 or advance a simple global counter. Early TCP stacks did derive ISNs from predictable counters, and that enabled a devastating attack: an attacker who can predict the ISN can inject forged TCP segments into an existing connection or hijack a connection entirely. RFC 6528 therefore recommends computing the Initial Sequence Number as a timer value plus a keyed pseudorandom function of the connection's four-tuple.
Linux's ISN generation uses a hash of the four-tuple (source IP, source port, destination IP, destination port) combined with a secret key and a time-based counter. The result is a sequence number that is unique per connection and unpredictable to an off-path attacker who cannot observe the actual packets.
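A rough sketch of this construction, in the spirit of RFC 6528's ISN = M + F(four-tuple, secret key), is shown below. It is not the kernel's actual code: HMAC-SHA-256 stands in for whatever keyed function a real stack uses, `generate_isn` and `SECRET_KEY` are hypothetical names, and the 4-microsecond timer tick is taken from the classic TCP specification.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(16)  # assumption: one random secret per boot


def generate_isn(src_ip, src_port, dst_ip, dst_port, key=SECRET_KEY, now=None):
    """Return a 32-bit ISN computed as M + F(four-tuple, key) mod 2**32."""
    if now is None:
        now = time.monotonic()
    m = int(now * 250_000)  # M: a timer that ticks every 4 microseconds
    msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    f = int.from_bytes(digest[:4], "big")  # F: keyed hash truncated to 32 bits
    return (m + f) % 2 ** 32


if __name__ == "__main__":
    print(generate_isn("192.0.2.10", 40000, "198.51.100.5", 443))
```

The per-tuple offset F makes the ISN unpredictable to anyone without the key, while the timer M keeps successive ISNs for the same four-tuple moving forward.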
5.2 Sequence Number Wraparound
Because sequence numbers are 32-bit values, they wrap around from $2^{32} - 1$ back to 0 after 4 GB of data. On a 10 Gbps link, 4 GB can be transmitted in about 3.4 seconds, fast enough that an old packet from early in the connection, delayed in the network, could arrive after wraparound and be mistakenly accepted as new data. The TCP Timestamps option (RFC 7323) adds a monotonically non-decreasing timestamp to each segment, allowing the receiver to reject old wrapped-around segments. RFC 7323 calls this mechanism PAWS (Protection Against Wrapped Sequences).
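Wraparound also means that sequence numbers cannot be compared with a plain `<`. A common approach, sketched below with hypothetical helper names, is to compare them modulo $2^{32}$ and treat a number as "earlier" when it lies less than half the sequence space behind the other.

```python
MOD = 2 ** 32
HALF = 2 ** 31


def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n bytes, wrapping at 2**32."""
    return (seq + n) % MOD


def seq_lt(a: int, b: int) -> bool:
    """True if a comes before b in circular 32-bit sequence space."""
    return a != b and (b - a) % MOD < HALF


if __name__ == "__main__":
    near_end = MOD - 10
    after_wrap = seq_add(near_end, 15)  # 15 bytes later, past the wrap point
    print(after_wrap, seq_lt(near_end, after_wrap))
```

The comparison is only meaningful when the two numbers are less than $2^{31}$ apart, which is one reason fast links need the timestamp check as well.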
Pitfall — ISN Prediction on Predictable Systems: Virtual machines that are cloned or restored from snapshots sometimes reset their random number generator state, causing TCP to regenerate the same ISNs as previous sessions. An attacker on the same network who observed previous sessions could predict the next ISN and inject data. Always ensure cryptographic RNG reseeding after VM clone or snapshot restore operations.
6. The TCP State Machine: Every State Explained
6.1 The Full State Diagram
TCP is a finite state machine. Every socket is always in exactly one of the states below. Transitions between states are triggered by sending or receiving specific segments. Understanding this diagram explains every netstat output, every connection error, and every timeout behavior you will encounter in production.
Mermaid Diagram: Complete TCP finite state machine — both client (active open) and server (passive open) sides.
6.2 Key States Every Developer Should Know
When you run ss -tan or netstat -an on a production server, you see a list of sockets in various TCP states (ss -s prints only the per-state totals). Knowing what each state means immediately tells you what the system is doing:
- LISTEN: Server is waiting for incoming connections. A server process has called bind() and listen() and is ready to accept clients.
- SYN_RECEIVED: Server received a SYN, sent SYN-ACK, and is waiting for the client's final ACK. If you see thousands of sockets in this state, you are almost certainly under a SYN flood attack.
- ESTABLISHED: The connection is fully open and data is flowing (or the connection is idle but still open). Normal operating state.
- CLOSE_WAIT: The remote side closed the connection (sent FIN), the local side acknowledged it, but the local application has NOT yet closed its socket. If you see a large number of CLOSE_WAIT sockets, your application has a connection leak — it is not calling close() after receiving the remote FIN.
- TIME_WAIT: The local side sent the final ACK and is waiting to ensure it was received. Discussed in depth in Section 8.
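The state machine can be written down as a lookup table from (state, event) pairs to the next state. The sketch below covers only the common paths: it leaves out simultaneous open, RST handling and timeouts other than TIME_WAIT, and the event names are invented for this example.

```python
from typing import Dict, Iterable, Tuple

# (current state, event) -> next state. A simplified subset of the full machine.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("CLOSED", "passive_open"): "LISTEN",         # bind() + listen()
    ("CLOSED", "active_open"): "SYN_SENT",        # connect(): send SYN
    ("LISTEN", "recv_syn"): "SYN_RECEIVED",       # reply with SYN-ACK
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",  # reply with ACK
    ("SYN_RECEIVED", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",       # local close(): send FIN
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",    # peer closed first: send ACK
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_1", "recv_fin"): "CLOSING",        # both sides closed at once
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",      # send final ACK
    ("CLOSING", "recv_ack"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",          # local close(): send FIN
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout"): "CLOSED",           # 2 x MSL has elapsed
}


def run(events: Iterable[str], state: str = "CLOSED") -> str:
    """Apply events in order and return the final state."""
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"{event!r} is not valid in state {state}") from None
    return state


if __name__ == "__main__":
    client = ["active_open", "recv_syn_ack", "close", "recv_ack", "recv_fin"]
    print(run(client))  # where the side that closes first ends up
```

Feeding the table a server-side sequence that stops after `recv_fin` leaves the socket in CLOSE_WAIT, which is exactly the leak signature described in the list above.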
7. Connection Teardown: The Four-Way FIN Handshake
7.1 Why Four Steps Instead of Three
TCP connections are full-duplex: each direction is an independent byte stream. Closing one direction does not automatically close the other. When side A sends a FIN, it signals "I have no more data to send," but side B may still have data to send. TCP therefore requires each direction to be closed independently, resulting in a four-step sequence instead of three.
The sequence is: A sends FIN → B sends ACK (A's sending direction is now closed) → B sends FIN → A sends ACK (B's sending direction is closed too, and the connection is fully closed). Between step 2 and step 3, B can still send data to A. This is the half-close state, where A has finished sending but B may continue. HTTP/1.0 servers use the FIN in a related way to signal end-of-response: they send all data, then close, and the client reads until EOF.
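Half-close is visible from application code through `shutdown()`. The sketch below uses `socket.socketpair()` so that it runs without a network; on most platforms that gives a pair of connected Unix-domain stream sockets rather than TCP, which is an assumption of this example, but the `shutdown(SHUT_WR)` and read-until-EOF semantics are the same ones a TCP socket exposes. The helper names are hypothetical.

```python
import socket


def read_until_eof(sock: socket.socket) -> bytes:
    """Read until the peer has shut down its sending direction."""
    data = b""
    while True:
        chunk = sock.recv(1024)
        if not chunk:  # an empty read means EOF (the peer's FIN on TCP)
            return data
        data += chunk


def half_close_demo():
    a, b = socket.socketpair()
    with a, b:
        a.sendall(b"request")
        a.shutdown(socket.SHUT_WR)       # A: no more data from me
        request = read_until_eof(b)      # B sees EOF after the request
        b.sendall(b"response to " + request)  # B's direction is still open
        b.shutdown(socket.SHUT_WR)       # B: now I am done as well
        reply = read_until_eof(a)
    return request, reply


if __name__ == "__main__":
    print(half_close_demo())
```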
7.2 RST: The Abrupt Close
The RST (Reset) flag provides an alternative to the graceful FIN sequence. When a host receives a segment for a connection it doesn't recognize (e.g., the process crashed and the socket was destroyed), it sends an RST to immediately abort the connection. The receiver must discard all buffered unread data and report an error to the application. Unlike FIN (graceful shutdown acknowledging all data was delivered), RST discards everything immediately.
From a debugging perspective, seeing Connection reset by peer in your application logs means the remote side sent RST: either the server process crashed, the server closed the socket without graceful teardown (for example with unread data still in its receive buffer), or a firewall or load balancer discarded its state for the connection (an idle timeout is a common cause) and answered later packets with an RST on the server's behalf.
Pitfall — CLOSE_WAIT Accumulation: A common application-level bug: a server reads a request, begins processing, the client closes its side (sends FIN), but the server keeps the socket open while processing. If processing is slow or the server ignores the client's FIN, sockets accumulate in CLOSE_WAIT indefinitely. After hundreds of such leaked connections, the process runs out of file descriptors and crashes. Always handle EOF on read (empty read with no error) by closing the socket promptly.
8. TIME_WAIT State: Why It Exists and When It Hurts (Advanced)
8.1 The Purpose of TIME_WAIT
After the initiating side sends the final ACK of the four-way FIN handshake, it enters TIME_WAIT instead of immediately going to CLOSED. It stays in TIME_WAIT for 2 × MSL (Maximum Segment Lifetime). The original TCP specification suggests an MSL of 2 minutes, but Linux hard-codes the TIME_WAIT period to 60 seconds. This serves two purposes:
- Late packet protection: If the final ACK is lost, the remote side retransmits its FIN. TIME_WAIT ensures the local side is still around to re-send the ACK instead of sending a confusing RST to a FIN it no longer understands.
- Sequence number safety: Any delayed packets from the old connection that arrive after the socket is closed could be mistakenly accepted by a new connection reusing the same four-tuple (source IP, source port, dest IP, dest port). TIME_WAIT's 2×MSL duration ensures all old packets from the previous session have expired before a new session can reuse the same port combination.
8.2 TIME_WAIT Exhaustion in High-Throughput Services
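Each TIME_WAIT socket holds a small kernel entry (not a file descriptor, since the application has already closed the socket) and keeps its four-tuple reserved, so the same source port cannot be reused toward the same destination IP and port. The ephemeral port range is typically 32,768–60,999 on Linux, about 28,000 ports. A service that opens 1,000 short-lived TCP connections per second to a single destination address and port will use up all 28,000 ports in about 28 seconds, well inside the TIME_WAIT period, at which point new connections fail with EADDRNOTAVAIL ("Cannot assign requested address").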
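The capacity arithmetic is easy to reproduce. The helpers below are a back-of-the-envelope sketch with made-up function names; they assume all connections go to one destination IP and port, that the local side closes first, and that TIME_WAIT lasts 60 seconds.

```python
def ephemeral_ports(low: int = 32768, high: int = 60999) -> int:
    """Number of ports in an inclusive ephemeral range."""
    return high - low + 1


def seconds_until_exhaustion(conn_per_s: float, ports: int) -> float:
    """Time to consume every port if none is released in the meantime."""
    return ports / conn_per_s


def max_sustained_rate(ports: int, time_wait_s: float = 60.0) -> float:
    """Highest connection rate at which ports are freed as fast as they are used."""
    return ports / time_wait_s


def steady_state_time_wait(conn_per_s: float, time_wait_s: float = 60.0) -> float:
    """Sockets sitting in TIME_WAIT at a constant connection rate."""
    return conn_per_s * time_wait_s


if __name__ == "__main__":
    ports = ephemeral_ports()
    print(ports, seconds_until_exhaustion(1000, ports), max_sustained_rate(ports))
```

With the default range this gives 28,232 ports, exhaustion after roughly 28 seconds at 1,000 connections per second, and a sustainable ceiling of about 470 new connections per second to a single destination.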
The solutions, in order of preference: (1) Use connection pooling: reuse long-lived connections instead of opening a new TCP connection per request. (2) Enable SO_REUSEADDR on server sockets so the server can rebind immediately after restart. (3) Enable net.ipv4.tcp_tw_reuse = 1 on Linux to allow reuse of TIME_WAIT sockets for outgoing connections when it is safe to do so (timestamps must be enabled). (4) As a last resort, widen net.ipv4.ip_local_port_range or spread connections across more source or destination addresses to enlarge the pool of four-tuples. Note that net.ipv4.tcp_fin_timeout, which is often suggested for this problem, controls how long orphaned sockets stay in FIN_WAIT_2 and does not shorten TIME_WAIT on Linux.
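Of these options, SO_REUSEADDR is the one set directly in application code. A minimal sketch in Python, where `make_listener` is a hypothetical helper name and port 0 asks the kernel for any free port:

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0, backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket that can rebind right after a restart."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(); without it, bind() can fail with EADDRINUSE
    # while old connections on this port are still in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock


if __name__ == "__main__":
    listener = make_listener()
    print("listening on", listener.getsockname())
    listener.close()
```

The sysctl-based options are system configuration and have no per-socket equivalent in this sketch.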
Advanced Pitfall — tcp_tw_recycle Is Dangerous: The net.ipv4.tcp_tw_recycle sysctl (removed in Linux 4.12) was widely recommended for high-performance servers but caused mysterious connection failures for clients behind NAT. Multiple clients sharing a single NAT IP could appear to have non-monotonically increasing timestamps from the server's perspective, causing the server to silently drop their SYN packets. Never use tcp_tw_recycle. Use tcp_tw_reuse combined with connection pooling instead.
9. SYN Flood Attacks and SYN Cookies
9.1 How a SYN Flood Works
A SYN flood exploits the server's need to maintain half-open connection state during the handshake. An attacker sends thousands of SYN packets per second, each with a spoofed (randomly forged) source IP address. The server dutifully sends SYN-ACK to each spoofed IP (which ignores the response, or doesn't exist), and allocates a connection entry in its SYN backlog for each. The backlog has a finite size: on Linux it depends on net.ipv4.tcp_max_syn_backlog as well as the backlog argument to listen(sockfd, backlog). Once it is full, the server drops all new SYN packets and legitimate client connections fail.
The attack is asymmetric: the attacker sends tiny 40-byte SYN packets while the server allocates connection state (potentially hundreds of bytes per entry) and sends 60-byte SYN-ACK responses. A few Mbps of attack traffic can exhaust a server's backlog in seconds.
9.2 SYN Cookies: Stateless Defense
SYN Cookies (RFC 4987) make the server stateless during the SYN phase. Instead of allocating backlog state when it receives a SYN, the server encodes the connection parameters into the SYN-ACK's sequence number, using a cryptographic hash of the connection's four-tuple, a server-side secret, and a coarse time counter.
No state is stored on the server at this point. When the client sends the final ACK (with acknowledgment number = SYN-ACK.seq + 1), the server recomputes the hash and verifies the ACK number matches. If it does, the server knows this is a legitimate client that actually received the SYN-ACK (the spoofed IPs in a SYN flood never do), and only then allocates connection state. A SYN flood with spoofed IPs never generates a valid ACK, so no state is ever allocated for those packets.
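The encode-then-verify cycle can be sketched as follows. This is a simplified illustration, not the Linux implementation: it follows the classic layout of 5 bits of a 64-second time counter, 3 bits of MSS index and 24 bits of keyed hash, uses HMAC-SHA-256 as the hash, omits the client's ISN from the computation, and the `MSS_TABLE` values and function names are assumptions made for this example.

```python
import hashlib
import hmac

MSS_TABLE = [536, 1300, 1440, 1460]  # illustrative values; the index fits in 3 bits


def _mac24(key: bytes, four_tuple: tuple, t: int) -> int:
    """24-bit keyed hash over the four-tuple and the time slot."""
    msg = repr((four_tuple, t)).encode()
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest()[:3], "big")


def make_cookie(key: bytes, four_tuple: tuple, mss_index: int, now: float) -> int:
    """Build the 32-bit sequence number to place in the SYN-ACK."""
    t = int(now // 64)  # counter that advances every 64 seconds
    return ((t % 32) << 27) | (mss_index << 24) | _mac24(key, four_tuple, t)


def check_cookie(key: bytes, four_tuple: tuple, ack_number: int, now: float):
    """Return the negotiated MSS if the final ACK carries a valid cookie, else None."""
    cookie = (ack_number - 1) % 2 ** 32  # the client acknowledges cookie + 1
    t_now = int(now // 64)
    for t in (t_now, t_now - 1):  # accept the current and the previous time slot
        if cookie >> 27 == t % 32 and cookie & 0xFFFFFF == _mac24(key, four_tuple, t):
            index = (cookie >> 24) & 0x7
            return MSS_TABLE[index] if index < len(MSS_TABLE) else None
    return None
```

A spoofed source never sees the SYN-ACK, so it cannot produce an ACK number that passes `check_cookie`; a real client simply echoes the cookie plus one and the server recovers the MSS from it without having stored anything.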
The trade-off: when SYN cookies are active, the server cannot store the TCP options negotiated during the SYN phase (because no state is kept), slightly degrading performance for legitimate connections. Linux enables SYN cookies automatically when the SYN backlog is full (net.ipv4.tcp_syncookies = 1 by default).
10. TCP Fast Open: Eliminating Handshake Latency (Advanced)
10.1 The Latency Problem TCP Fast Open Solves
Every new TCP connection incurs a minimum of 1 RTT before any application data can be sent. For short-lived connections — a REST API call, a database query, an HTTPS request — this handshake latency can dominate total response time. On a mobile network with 80 ms RTT, every new connection wastes 80 ms before the server even sees the first byte of the HTTP request.
TCP Fast Open (TFO, RFC 7413) allows application data to be sent in the SYN packet itself on repeat connections to the same server. It works via a cryptographic TFO cookie that the server issues on the first connection. On subsequent connections, the client includes the cookie in the SYN packet along with the initial request data. The server validates the cookie and begins processing the request immediately, without waiting for the final ACK. This reduces effective connection latency from 1 RTT to 0 RTT for repeated connections.
10.2 Enabling TCP Fast Open
On Linux, enable TFO for both client and server: sysctl -w net.ipv4.tcp_fastopen=3. In application code, pass the MSG_FASTOPEN flag to sendto() on the client side. Nginx supports TFO with listen 443 fastopen=256. The primary limitation: TFO data can be replayed if the network delivers the SYN packet twice. For idempotent requests (GET, HEAD) this is safe; for non-idempotent operations (POST with side effects), care must be taken.
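In application code the two sides look roughly like the sketch below. It assumes a Linux-like platform: `socket.TCP_FASTOPEN` and `socket.MSG_FASTOPEN` are not defined everywhere, so both helpers (hypothetical names) fall back to the ordinary handshake when the constant is missing or the kernel rejects the request.

```python
import socket


def listen_with_fast_open(host="127.0.0.1", port=0, tfo_queue=16):
    """Return (listening socket, whether the TFO option was accepted)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    enabled = False
    option = getattr(socket, "TCP_FASTOPEN", None)  # absent on some platforms
    if option is not None:
        try:
            # The value is the maximum number of pending TFO requests.
            sock.setsockopt(socket.IPPROTO_TCP, option, tfo_queue)
            enabled = True
        except OSError:
            pass  # kernel refused the option; use the normal handshake
    sock.listen(128)
    return sock, enabled


def send_with_fast_open(address, payload: bytes) -> socket.socket:
    """Connect and send the first payload, using TFO when available."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    flag = getattr(socket, "MSG_FASTOPEN", None)
    if flag is not None:
        try:
            sock.sendto(payload, flag, address)  # connect and send in one call
            return sock
        except OSError:
            pass  # TFO unavailable; fall through to connect() + sendall()
    sock.connect(address)
    sock.sendall(payload)
    return sock
```

On the very first connection to a server the client has no cookie yet, so the kernel performs a regular handshake and sends the payload afterwards; only later connections can carry data in the SYN.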
Advanced Pitfall — Middlebox Interference: Many corporate firewalls and network middleboxes strip or reject TCP options they don't recognize, including TFO cookies. Applications relying on TFO must gracefully fall back to the standard handshake when TFO fails. Most modern TFO implementations handle this transparently, but debugging "why is TFO not working" often leads to discovering an upstream firewall stripping the option.
11. Frequently Asked Questions
Q1: Why is it called a "three-way" handshake — couldn't two packets work?
With only two packets (SYN → SYN-ACK), the client learns that the server received its SYN, but the server never learns whether its SYN-ACK, and with it the server's ISN, reached the client. The server would have to enter ESTABLISHED without knowing that the client is reachable or that the SYN was not a delayed duplicate from an old connection attempt. The third packet gives the server proof that the path works in both directions and that both sides agree on sequence numbers.
Q2: What happens if a SYN packet is lost?
The client retransmits the SYN after a timeout. Linux's initial SYN retransmission timeout is 1 second, doubling with each retry (exponential backoff), so the waits before the retransmissions are 1s, 2s, 4s, 8s, 16s, 32s. The number of retries is controlled by net.ipv4.tcp_syn_retries (default 6). After the sixth retransmission the client waits a final 64 seconds, so the connection attempt fails with ETIMEDOUT about 127 seconds after the first SYN.
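The backoff schedule can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, using a hypothetical function name; it assumes a 1-second initial timeout that doubles without any upper cap, which holds for the default of 6 retries.

```python
from typing import List, Tuple


def syn_retry_schedule(initial_rto: float = 1.0, retries: int = 6) -> Tuple[List[float], float]:
    """Return (times of each retransmission, time of giving up), in seconds
    measured from the first SYN."""
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto      # wait one timeout, then retransmit
        times.append(elapsed)
        rto *= 2            # exponential backoff
    return times, elapsed + rto  # one last wait after the final retransmission


if __name__ == "__main__":
    retransmits, give_up = syn_retry_schedule()
    print(retransmits, give_up)
```

With the defaults, the retransmissions go out at 1, 3, 7, 15, 31 and 63 seconds, and the attempt is abandoned at 127 seconds.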
Q3: What does "connection refused" mean at the TCP level?
Connection refused means the server's kernel received the SYN but no process is listening on that port. The kernel immediately sends a RST segment back to the client. The client's connect() call returns immediately with ECONNREFUSED — no timeout wait. This is distinct from ETIMEDOUT (no response at all) and EHOSTUNREACH (the host is not reachable at the IP level).
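In Python these outcomes surface as distinct exceptions, which makes them easy to tell apart in code. The `probe` function below is a hypothetical helper written for this example; it classifies a single connection attempt.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify the outcome of one TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"                # handshake completed
    except ConnectionRefusedError:
        return "refused"                 # RST in reply to our SYN (ECONNREFUSED)
    except (socket.timeout, TimeoutError):
        return "timed out"               # no reply at all within the timeout
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"         # routing or ICMP-level failure
        raise


if __name__ == "__main__":
    print(probe("127.0.0.1", 9))  # result depends on what runs on this machine
```

A "refused" result comes back almost instantly, whereas "timed out" takes the full timeout, which is often the quickest way to tell a closed port from a filtered one.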
Q4: How does the OS decide which ephemeral port to assign to a new outgoing connection?
The kernel picks an unused port from the ephemeral port range (net.ipv4.ip_local_port_range, default 32768–60999) using a hash function that tries to spread connections across the range. It verifies the resulting four-tuple (local IP, local port, remote IP, remote port) is unique among all existing connections. If no port in the range yields a free four-tuple (TIME_WAIT exhaustion), connect() fails with EADDRNOTAVAIL.
Q5: What is the listen backlog and what happens when it's full?
The listen backlog (second argument to listen(fd, backlog)) limits the number of completed connections waiting to be accept()-ed by the application. On Linux the value is capped by net.core.somaxconn, and half-open connections in SYN_RECEIVED state are tracked separately, with net.ipv4.tcp_max_syn_backlog as an additional limit. When the accept queue is full, the kernel silently drops the handshake packets that would add to it (or sends RST if net.ipv4.tcp_abort_on_overflow is set). The client retransmits and eventually connects once the application drains the backlog with accept(). In high-traffic servers, set backlog to 1024 or higher, raise net.core.somaxconn to match, and ensure the accept loop is fast.
Q6: Why do I see thousands of TIME_WAIT sockets on my load balancer?
Load balancers that terminate TCP connections on behalf of backend servers open a new TCP connection to the backend for each client request (in non-persistent mode). Each of these short-lived backend connections leaves a TIME_WAIT socket for 60 seconds. On a high-traffic load balancer handling 10,000 requests/second, at any moment $10{,}000 \times 60 = 600{,}000$ TIME_WAIT sockets exist simultaneously. The fix: enable HTTP keep-alive and connection pooling to the backends, dramatically reducing new connection rate.
Q7: How can I observe the TCP handshake in real time?
Use tcpdump -i any 'tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0' -nn to capture only connection setup and teardown packets. Do not add tcp-ack to the mask: the ACK flag is set on every segment after the first SYN, so the filter would match almost all traffic. Use wireshark for GUI analysis: it color-codes connection establishment and teardown and displays sequence number arithmetic in a human-readable stream. The ss -s command gives aggregate counts by state. watch -n1 'ss -s' shows real-time state transitions during connection load tests.
Q8: What is Nagle's algorithm and when does it hurt latency?
Nagle's algorithm (RFC 896) buffers small outgoing TCP segments and delays sending them until either the buffer fills to MSS size or all previously sent data has been acknowledged. This reduces network overhead for chatty protocols (like remote terminals) but adds latency for request-response protocols like RPC or interactive database queries. Disable it with TCP_NODELAY socket option for latency-sensitive applications — this is why Redis, memcached, and most game servers set TCP_NODELAY by default.
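Setting the option is a one-line call on the socket. A minimal Python sketch, with `disable_nagle` as a hypothetical helper name:

```python
import socket


def disable_nagle(sock: socket.socket) -> bool:
    """Turn off Nagle's algorithm and report whether the option is now set."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(disable_nagle(s))
    s.close()
```

The option applies per socket, so it has to be set on every connection, including the ones returned by accept() on the server side.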