Docker Overlay Network Details

Docker Swarm uses an overlay network for communication between containers on different hosts, and for load balancing incoming traffic to a service. On Windows Server 2016 before Windows Update KB4015217 this overlay network is not supported. After KB4015217 the communication between containers works, but the routing mesh that load balances incoming traffic is still not supported. Now, with Windows Server 2016 version 1709, the routing mesh works as well. The purpose of this post is to take an in-depth look at how the overlay network and the routing mesh work in practice.

Testing environment

This is my environment for testing:

  1. Two hosts with Windows Server 2016 version 1709 on the same vnet in Azure
  2. Both hosts with the Hyper-V role and the Windows Containers feature
  3. Both hosts running experimental Docker 17.10
  4. A Docker Swarm service with three containers, running the image microsoft/iis:windowsservercore-1709, with port 80 published
  5. A third host running Portainer and the new Project Honolulu server management gateway.

I verified beforehand that I could reach any container on any host, on port 80, from an external client, and that I could ping and telnet between containers.
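
For reference, those checks look something like this (a sketch: the public IP is a placeholder for one of my nodes, and the container name and overlay address are the ones that appear in the traces later in this post):

# From an external client: every node should answer on the published port 80
# (substitute the public IP of one of the swarm nodes)
Invoke-WebRequest -UseBasicParsing "http://<node public IP>/" | Select-Object StatusCode

# From inside one container, ping the container on the other host over the overlay network
docker exec web.2.03uu9bab6n416jqi0reg59ohh ping 10.255.0.5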

Theory

The Docker documentation describes how this works on Linux: Designing Scalable, Portable Docker Container Networks. Containers are assigned to a Virtual Extensible LAN (VXLAN) and traffic between containers on different hosts is encapsulated in UDP packets on port 4789. The routing mesh is implemented by Linux IP Virtual Server (IPVS) layer 4 switching.

On Windows, it is a bit harder to piece the documentation together, because container networking is just one part of a swathe of Azure, Hyper-V and Windows Software Defined Networking (SDN) technologies.

SDN comes from implementing multi-tenant architectures in Azure, where VM’s on different hosts, in different datacentres, need to communicate securely and in isolation from other tenants. This is not very different from containers in different Swarm services communicating with each other but not with other services.

VXLAN is a generic standard documented in RFC 7348. There are many different diagrams of VXLAN, but essentially a Layer 2 switched frame between containers on different hosts is encapsulated in a UDP packet and sent across the host network.

Implementation

When we initialise the Docker Swarm, a default overlay network is created, called “ingress”. We can see this with docker network ls.

NETWORK ID     NAME      DRIVER    SCOPE
xio0654aj01a   ingress   overlay   swarm
5bcf2a6fe500   nat       nat       local
cef0ceb618b6   none      null      local

This is in addition to the default NAT network created when we add the Containers feature. With docker network inspect ingress we can see the details of this network:

  • It has an ID of xio0654aj01a6x60kfnoe4r12 and a subnet of 10.255.0.0/16
  • Each container on the network has an endpoint ID, an IP address on the subnet, and a unique MAC address
  • Each node has one ingress-endpoint, again with an endpoint ID, an IP address and a MAC address:
"ConfigOnly": false,
"Containers": {
"206fe3c22aa9682f6db7c0ff2d2665ea647d2d2825218a9a1a6ee6bda4c80de7": {
"Name": "web.2.03uu9bab6n416jqi0reg59ohh",
"EndpointID": "136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1",
"MacAddress": "00:15:5d:71:af:d8",
"IPv4Address": "10.255.0.6/16",
"IPv6Address": ""
},
"92d6b5d2c353d43dad6e072e25865bdf91003b069fd3a527d953b9a62384f0a0": {
"Name": "web.3.nzxp6uhcvxhejp2iodd29l3gu",
"EndpointID": "b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e",
"MacAddress": "00:15:5d:71:a4:c5",
"IPv4Address": "10.255.0.7/16",
"IPv6Address": ""
},
"ingress-sbox": {
"Name": "ingress-endpoint",
"EndpointID": "7037a8b3628c9d5d49730472c37a800e4d1882f0cb125ec75e75477c02104526",
"MacAddress": "00:15:5d:71:a7:dd",
"IPv4Address": "10.255.0.2/16",
"IPv6Address": ""
}
},

In this case there are two containers on the host. If we look on the other host, we see the third container (of the three replicas in the service) and a different ingress-endpoint.
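
If you only want particular fields, docker network inspect also accepts a Go template via --format. A quick sketch (the template paths follow the JSON excerpt above):

# Subnet of the ingress network
docker network inspect ingress --format "{{ (index .IPAM.Config 0).Subnet }}"

# Name, IP address and MAC address of each endpoint on this host
docker network inspect ingress --format "{{ range .Containers }}{{ .Name }} {{ .IPv4Address }} {{ .MacAddress }}{{ println }}{{ end }}"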

We can also see the ingress network, the web service and the containers in Portainer, a simple management GUI for containers:

Docker Network Ingress

If we look inside a container, with docker exec -it web.2.03uu9bab6n416jqi0reg59ohh powershell and ipconfig /all, we can see that the endpoint ID is the ID of the NIC, and the IP address and MAC address also belong to this NIC:

Ethernet adapter vEthernet (136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1):
Connection-specific DNS Suffix . : nehng5n4bb2ejkdqdqbqdv4dxe.zx.internal.cloudapp.net
Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter #5
Physical Address. . . . . . . . . : 00-15-5D-71-AF-D8
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
Link-local IPv6 Address . . . . . : fe80::7dfd:d3f7:6350:759d%32(Preferred)
IPv4 Address. . . . . . . . . . . : 10.255.0.6(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.0.0
Default Gateway . . . . . . . . . : 10.255.0.1
DNS Servers . . . . . . . . . . . : 10.255.0.1
                                    168.63.129.16
NetBIOS over Tcpip. . . . . . . . : Disabled
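
The same details can be cross-checked from the host side with docker inspect (a sketch, using the container name from above; the json template function just dumps the network settings so the endpoint ID, IP address and MAC address can be compared):

# Endpoint ID, IP address and MAC address as Docker records them for this task
docker inspect web.2.03uu9bab6n416jqi0reg59ohh --format "{{ json .NetworkSettings.Networks }}"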

To see how the ingress network is implemented, we need to look at the host networking configuration. With Get-VMSwitch we can see that there is a Hyper-V virtual switch with the same name as the Docker ingress network ID:

Name                        SwitchType   NetAdapterInterfaceDescription
----                        ----------   ------------------------------
nat                         Internal
xio0654aj01a6x60kfnoe4r12   External     Microsoft Hyper-V Network Adapter #5

With Get-VMSwitchExtension -VMSwitchName xio0654aj01a6x60kfnoe4r12 we can see that the switch has a Microsoft Azure VFP Switch Extension:

Id : E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017
Name : Microsoft Azure VFP Switch Extension

If we do ipconfig /all on the host we see two network adapters. The primary host network adapter:

Ethernet adapter vEthernet (Ethernet 5)

and an adapter attached to the Docker NAT network:

Ethernet adapter vEthernet (nat)

But if we run Get-NetAdapter we see three:

Name                     InterfaceDescription                   ifIndex Status MacAddress        LinkSpeed
----                     --------------------                   ------- ------ ----------        ---------
vEthernet (Ethernet 5)   Hyper-V Virtual Ethernet Adapter #2         16 Up     00-22-48-01-00-03 40 Gbps
vEthernet (nat)          Hyper-V Virtual Ethernet Adapter             3 Up     00-15-5D-6A-D6-E2 10 Gbps
Ethernet 5               Microsoft Hyper-V Network Adapter #5        11 Up     00-22-48-01-00-03 40 Gbps

The extra one, named “Ethernet 5” with Interface Description “Microsoft Hyper-V Network Adapter #5”, has the same MAC address as the primary host adapter but no IP address: it is the ingress endpoint on the overlay network.
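
A quick way to spot this duplication (a sketch, nothing Docker-specific) is to group the adapters by MAC address:

# Adapters that share a MAC address: the primary NIC and the phantom ingress adapter
Get-NetAdapter | Group-Object MacAddress | Where-Object Count -gt 1 |
    Select-Object -ExpandProperty Group | Format-Table Name, InterfaceDescription, MacAddress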

We can see this in the Project Honolulu browser-based server manager.

The adapters:

Honolulu Docker1 Adapters

The Hyper-V ingress network switch:

Honolulu Docker1 Ingress Switch

Trace: incoming

I traced this traffic earlier: first into a container from a remote client, and second between containers. With Microsoft Message Analyzer we can see what happens.

Here is the flow of an HTTP request on port 80 from a remote client to one of the swarm nodes, and load balanced to a container on the same host.

In the first message a TCP packet arrives at the IP address of the host adapter:

MessageNumber  Timestamp  Module  Summary
3526  2017-11-08T17:02:15.9863839  TCP  Flags: ......S., SrcPort: 53711, DstPort: HTTP(80), Length: 0, Seq Range: 1862583515 - 1862583516, Ack: 0, Win: 65535 (negotiating scale factor: 3)

In the second message, the packet is received by the Hyper-V switch for the overlay network:

MessageNumber  Timestamp  Module  Summary
3527  2017-11-08T17:02:15.9863920  Microsoft_Windows_Hyper_V_VmSwitch  NBL 0xFFFF880A4600B370 received from Nic /DEVICE/{DAB8937D-9AD5-460E-8652-C2E152CCE573} (Friendly Name: Microsoft Hyper-V Network Adapter #5) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

In the third message the packet is routed to the container adapter:

MessageNumber  Timestamp  Module  Summary
3591  2017-11-08T17:02:15.9865906  Microsoft_Windows_Hyper_V_VmSwitch  NBL 0xFFFF880A492B1030 routed from Nic 533EF66B-A5F3-4926-A1EE-79AF499F85C7 (Friendly Name: Ethernet 5) to Nic F3EA5A0C-2253-472F-8FFA-3467568C6D00 (Friendly Name: 136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1) on switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

In the fourth message, the packet is received by the container adapter:

MessageNumber  Timestamp  Module  Summary
3592  2017-11-08T17:02:15.9865932  Microsoft_Windows_Hyper_V_VmSwitch  NBL 0xFFFF880A492B1030 delivered to Nic F3EA5A0C-2253-472F-8FFA-3467568C6D00 (Friendly Name: 136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

And in the fifth message the first packet is delivered:

MessageNumber  Timestamp  Module  Summary
3593  2017-11-08T17:02:15.9866168  TCP  Flags: ......S., SrcPort: 65408, DstPort: HTTP(80), Length: 0, Seq Range: 1862583515 - 1862583516, Ack: 0, Win: 65535 (negotiating scale factor: 3)

You will notice that the sent packet is from port 53711 to port 80, but the arrived packet is from port 65408 to port 80. You can’t see it in this summary of the message, but the sent packet is from the client IP address 92.234.68.72 to the host IP address 10.0.0.4, while the arrived packet is from the ingress-endpoint IP address 10.255.0.2 to the container IP address 10.255.0.6. The virtual switch has rewritten the source port and address of the packet. The container sends a reply packet to the ingress-endpoint, where the switch again rewrites the source and destination addresses to send the reply back to the client.

From the point of view of the host, there is:

  • no route to the ingress network 10.255.0.0/16
  • no ARP cache addresses for endpoints on the ingress network
  • no host process listening on port 80
  • a virtual adapter (Friendly Name: Microsoft Hyper-V Network Adapter #5), with the same MAC address as the primary adapter (00-22-48-01-00-03), but with no IP address, attached to a virtual switch (Friendly Name: xio0654aj01a6x60kfnoe4r12), which is the switch for the ingress network.

The virtual switch intercepts the request on the published port 80 (presumably using the Azure Virtual Filtering Platform switch extension) and forwards it to one of the containers.
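
These observations are easy to reproduce from a PowerShell prompt on the host (a sketch; the subnet and port are the ones used in this post):

# No host process listening on the published port
Get-NetTCPConnection -LocalPort 80 -State Listen -ErrorAction SilentlyContinue

# No route and no ARP (neighbour) entries for the ingress subnet
Get-NetRoute -DestinationPrefix 10.255.0.0/16 -ErrorAction SilentlyContinue
Get-NetNeighbor -AddressFamily IPv4 | Where-Object IPAddress -like "10.255.*"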

From the point of view of the container, there is:

  • no route to the host network 10.0.0.0/24
  • no ARP cache address for endpoints on the host network
  • an ARP cache address for the ingress-endpoint 10.255.0.2, with the same MAC address as the primary host network adapter (00-22-48-01-00-03)
  • a process (web server) listening on port 80
  • a virtual adapter (Friendly Name: 136a5e8a952b7bc3da6b395e9ff3fb138cd93c97e3fafda1299f804f9cbe2bf1) attached to the same virtual switch (Friendly Name: xio0654aj01a6x60kfnoe4r12) as the phantom adapter on the host.

The virtual switch receives the reply from the container and forwards it to the MAC address of the ingress-endpoint, which is the same as the MAC address of the primary network adapter of the host. The host network adapter sends the reply to the remote client.
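
The container’s view can be checked the same way by running the cmdlets through docker exec (a sketch, using the container name from earlier):

# Routes and ARP (neighbour) cache as seen from inside the container
docker exec web.2.03uu9bab6n416jqi0reg59ohh powershell -Command "Get-NetRoute -AddressFamily IPv4; Get-NetNeighbor -AddressFamily IPv4"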

This trace has been for incoming traffic from an external client. The next trace is for inter-container traffic across hosts.

Trace: inter-container

Here is the flow of a ping from a container on one host to a container on the other. The trace is being performed on the receiving host. We need to dissect each packet to see what happens.

The first packet arrives, an echo (ping) request. This is the content of the packet:

MessageNumber  Timestamp  Module  Summary
8852  2017-11-08T18:34:34.8066887  ICMP  Echo Operation
8852  2017-11-08T18:34:34.8066887  ICMP  Echo Request
8852  2017-11-08T18:34:34.8066887  IPv4  Next Protocol: ICMP, Packet ID: 29796, Total Length: 60
8852  2017-11-08T18:34:34.8066887  Ethernet  Type: Internet IP (IPv4)
8852  2017-11-08T18:34:34.8066887  VXLAN  VXLAN Frame
8852  2017-11-08T18:34:34.8066887  UDP  SrcPort: 1085, DstPort: VXLAN(4789), Length: 90
8852  2017-11-08T18:34:34.8066887  IPv4  Next Protocol: UDP, Packet ID: 30052, Total Length: 110
8852  2017-11-08T18:34:34.8066887  Ethernet  Type: Internet IP (IPv4)

From inside to outside, the packet is structured as follows:

  • ICMP Echo Request
  • IPv4 protocol ICMP, from source address 10.255.0.5 (the remote container) to destination address 10.255.0.7 (the local container)
  • Ethernet from source MAC address 00-15-5D-BC-F9-AA (the remote container) to destination MAC address 00-15-5D-71-A4-C5 (the local container). These are Hyper-V MAC addresses on the ingress network. The host network does not know anything about these IP or MAC addresses.
  • (everything up to this point is the original packet sent by the remote container)
  • VXLAN header with network identifier 4096. This is the VXLAN ID shown by docker network inspect ingress (a quick way to check it is shown after this list)
  • Outer UDP header, from source port 1085 to destination port 4789 (the standard port for VXLAN traffic)
  • Outer IPv4 header, protocol UDP, from source address 10.0.0.5 (the remote host) to destination address 10.0.0.4 (the local host)
  • Outer Ethernet header, from source MAC address 00-22-48-01-9E-11 (the primary adapter of the remote host) to destination MAC address 00-22-48-01-00-03 (the primary adapter of the local host)
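
The VXLAN network identifier is recorded as an option on the network, which is why docker network inspect shows it. A quick way to pull out just the overlay driver options (a sketch):

# Show the overlay driver options for the ingress network, including the VXLAN ID
docker network inspect ingress --format "{{ json .Options }}"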

Following the flow of messages, the packet is received by the Hyper-V switch for the overlay network:

MessageNumber  Timestamp  Module  Summary
8853  2017-11-08T18:34:34.8066930  Microsoft_Windows_Hyper_V_VmSwitch  NBL 0xFFFF880A4626D6A0 received from Nic /DEVICE/{DAB8937D-9AD5-460E-8652-C2E152CCE573} (Friendly Name: Microsoft Hyper-V Network Adapter #5) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

The packet is routed to the container adapter:

MessageNumber  Timestamp  Module  Summary
8867  2017-11-08T18:34:34.8067246  Microsoft_Windows_Hyper_V_VmSwitch  NBL 0xFFFF880A4626D6A0 routed from Nic /DEVICE/{DAB8937D-9AD5-460E-8652-C2E152CCE573} (Friendly Name: Microsoft Hyper-V Network Adapter #5) to Nic 0330EF2B-74AB-4E06-A32D-86DA92145374 (Friendly Name: b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e) on switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

The packet is received by the container adapter:

MessageNumber  Timestamp  Module  Summary
8868  2017-11-08T18:34:34.8067269  Microsoft_Windows_Hyper_V_VmSwitch  NBL 0xFFFF880A4626D6A0 delivered to Nic 0330EF2B-74AB-4E06-A32D-86DA92145374 (Friendly Name: b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12)

The original packet is delivered, minus the VXLAN header and UDP wrapper:

MessageNumber  Timestamp  Module  Summary
8869  2017-11-08T18:34:34.8067296  ICMP  Echo Operation
8869  2017-11-08T18:34:34.8067296  ICMP  Echo Request
8869  2017-11-08T18:34:34.8067296  IPv4  Next Protocol: ICMP, Packet ID: 29796, Total Length: 60
8869  2017-11-08T18:34:34.8067296  Ethernet  Type: Internet IP (IPv4)

From the timestamps, it has taken about 0.04 milliseconds (roughly 40 microseconds) to process the packet in the switch.

Trace: incoming across hosts

With the routing mesh, incoming traffic from a remote client to any node in the swarm can be load balanced and routed to a container on a different node. The routing mesh handles the incoming and outgoing traffic, and the overlay network carries the traffic between the receiving node and the container on the other node.

In this example the incoming packet arrives at host Docker2. It is load balanced to a container running on host Docker1. The trace is running on Docker1, receiving the packet from Docker2.

This time the incoming TCP packet has the same VXLAN and UDP headers as inter-container traffic (when it is across hosts):

MessageNumber  Timestamp  Module  Summary
11165  2017-11-08T17:02:50.3348890  TCP  Flags: ......S., SrcPort: 65408, DstPort: HTTP(80), Length: 0, Seq Range: 4237068666 - 4237068667, Ack: 0, Win: 29200 (negotiating scale factor: 7)
11165  2017-11-08T17:02:50.3348890  IPv4  Next Protocol: TCP, Packet ID: 41609, Total Length: 60
11165  2017-11-08T17:02:50.3348890  Ethernet  Type: Internet IP (IPv4)
11165  2017-11-08T17:02:50.3348890  VXLAN  VXLAN Frame
11165  2017-11-08T17:02:50.3348890  UDP  SrcPort: 40558, DstPort: VXLAN(4789), Length: 90
11165  2017-11-08T17:02:50.3348890  IPv4  Next Protocol: UDP, Packet ID: 41865, Total Length: 110
11165  2017-11-08T17:02:50.3348890  Ethernet  Type: Internet IP (IPv4)

The UDP and VXLAN headers are stripped off by the switch, routed and presented to the container as standard TCP, coming from the ingress-endpoint on the other host with address 10.255.0.3:

MessageNumber  Timestamp  Module  Summary
11186  2017-11-08T17:02:50.3349520  TCP  Flags: ......S., SrcPort: 65408, DstPort: HTTP(80), Length: 0, Seq Range: 4237068666 - 4237068667, Ack: 0, Win: 29200 (negotiating scale factor: 7)
11186  2017-11-08T17:02:50.3349520  IPv4  Next Protocol: TCP, Packet ID: 41609, Total Length: 60
11186  2017-11-08T17:02:50.3349520  Ethernet  Type: Internet IP (IPv4)

This time the container makes an ARP request to find the MAC address of the ingress-endpoint on the other host that sent it the packet:

MessageNumber  Timestamp  Module  Summary
11187  2017-11-08T17:02:50.3350373  ARP  REQUEST, SenderIP: 10.255.0.7, TargetIP: 10.255.0.3
11187  2017-11-08T17:02:50.3350373  Ethernet  Type: ARP

The ARP request is intercepted by the VFP extension in the switch and dropped:

MessageNumber  Timestamp  Module  Summary
11192  2017-11-08T17:02:50.3350578  Microsoft_Windows_Hyper_V_VmSwitch  NBLs were dropped by extension {24C70E26-D4C4-42B9-854A-0A4B9BA2C286}-{E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017}-0000 (Friendly Name: Virtual Filtering Platform VMSwitch Extension) in switch A404BC57-741B-4C79-8BA5-1D7D3FDA92C1 (Friendly Name: xio0654aj01a6x60kfnoe4r12). Source Nic 0330EF2B-74AB-4E06-A32D-86DA92145374 (Friendly Name: b1937b9d22d2aa9881d0e45b16bc7031b2d4d07d4d0059531d64a6ade5a5242e), Reason Outgoing packet dropped by VFP

The switch fabricates an ARP reply:

MessageNumber  Timestamp  Module  Summary
11200  2017-11-08T17:02:50.3352219  ARP  REPLY, SenderIP: 10.255.0.3, TargetIP: 10.255.0.7
11200  2017-11-08T17:02:50.3352219  Ethernet  Type: ARP

The container replies to the SYN with a SYN-ACK:

MessageNumber  Timestamp  Module  Summary
11201  2017-11-08T17:02:50.3352290  TCP  Flags: ...A..S., SrcPort: HTTP(80), DstPort: 65408, Length: 0, Seq Range: 3626128581 - 3626128582, Ack: 4237068667, Win: 65535 (negotiating scale factor: 8)
11201  2017-11-08T17:02:50.3352290  IPv4  Next Protocol: TCP, Packet ID: 17960, Total Length: 52
11201  2017-11-08T17:02:50.3352290  Ethernet  Type: Internet IP (IPv4)

This is routed by the virtual switch and emerges at the host adapter as a reply, wrapped in the VXLAN and UDP headers:

MessageNumber  Timestamp  Module  Summary
11217  2017-11-08T17:02:50.3352851  TCP  Flags: ...A..S., SrcPort: HTTP(80), DstPort: 65408, Length: 0, Seq Range: 3626128581 - 3626128582, Ack: 4237068667, Win: 65535 (negotiating scale factor: 8)
11217  2017-11-08T17:02:50.3352851  IPv4  Next Protocol: TCP, Packet ID: 17960, Total Length: 52
11217  2017-11-08T17:02:50.3352851  Ethernet  Type: Internet IP (IPv4)
11217  2017-11-08T17:02:50.3352851  VXLAN  VXLAN Frame
11217  2017-11-08T17:02:50.3352851  UDP  SrcPort: 37734, DstPort: VXLAN(4789), Length: 82
11217  2017-11-08T17:02:50.3352851  IPv4  Next Protocol: UDP, Packet ID: 18216, Total Length: 102
11217  2017-11-08T17:02:50.3352851  Ethernet  Type: Internet IP (IPv4)

This reply is forwarded across the host network to the other host. There, the virtual switch rewrites the source and destination addresses and sends the reply back to the remote client. This is not shown here, but it is the same as the reply in the first trace above.

So there we have it: Windows Server 2016 version 1709 with the Docker overlay network and routing mesh, using Software Defined Networking, Hyper-V switches and the Azure Virtual Filtering Platform virtual switch extension.

Docker Swarm Networking

Docker Swarm enables containers to operate together to provide a service, across different nodes in a cluster. It uses an overlay network for communication between containers on different hosts. It also supports a routing mesh, which load-balances and routes incoming connections to the containers. On Windows Server 2016 before the latest version this routing mesh is not supported. Now it is, with the release of version 1709, so we can see how it all works.

Docker Swarm uses an overlay network for communication between containers providing the same service. You can read an excellent description in the Docker Reference Architecture: Designing Scalable, Portable Docker Container Networks. The overlay network is implemented as a Virtual Extensible LAN (VXLAN) stretched in software across the underlying network connecting the hosts.

The network has a built-in routing mesh that directs incoming traffic on a published port, on any node, to any container running the service on any node. This diagram illustrates the routing mesh on Linux, where it is implemented in the kernel by the IP Virtual Server (IPVS) component:

Docker Reference Architecture: routing mesh diagram

On Windows Server 2016 version 1607 the routing mesh does not work. Now, with the new Windows Server 2016 version 1709 it does.

Microsoft introduced support for Docker Swarm with overlay networks in April 2017, with KB4015217. The document Getting Started with Swarm Mode describes it, but notes at the bottom that the routing mesh is not supported. You can still publish a port, but that limits your options to publishing directly on each host (one container per published port per host), or using dynamic ports with a separate load balancer.
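
In practice the pre-1709 pattern looked something like this (a sketch, not a recommendation): a global service with host-mode publishing, so that exactly one task on each node binds the port directly on that host.

# One task per node, port 80 bound directly on each host (no routing mesh)
docker service create --name web --mode global --publish mode=host,target=80,published=80 microsoft/iis:windowsservercore-1709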

To get the terms straight:

  • Overlay network: a VXLAN shared by containers on different hosts, transported by the underlying host network
  • Routing mesh: load balanced routing of incoming traffic on published ports to the destination port on one of the containers in the service
  • Ingress mode: the port publishing mode that uses the routing mesh, instead of a direct connection to ports on the container host (host-mode publishing, often combined with a global service)
  • "Ingress": the name of the default overlay-type network created by Docker, just as "nat" is the name of the default NAT-type network; but you can create your own overlay network.

Support for the routing mesh and ingress mode has arrived in Windows Server 2016 version 1709 and is now available in Azure too. It is still at an early stage. It requires:

  • A new installation of Windows Server 2016 version 1709
  • Docker EE version 17.10, still in Preview.

To install Docker EE Preview, run:

Install-Module DockerProvider
Install-Package Docker -ProviderName DockerProvider -RequiredVersion Preview -Force
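
After the install you may need to restart the Docker service (or reboot the host); then check the reported engine version (just a sanity check):

# Restart the service and confirm the engine reports 17.10
Restart-Service docker
docker version --format "{{ .Server.Version }}"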

To test this, I created a Docker Swarm service with three replicas on two nodes. I am using the microsoft/iis:windowsservercore-1709 image to have something to connect to:

docker service create --name web --replicas 3 --publish mode=ingress,target=80,published=80 microsoft/iis:windowsservercore-1709

The service is created by default on the "ingress" overlay network, because it has a published port.
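
To see where the replicas landed and how the port was published, docker service has the usual inspection commands (a sketch):

# Task placement across the two nodes
docker service ps web

# The published port and its publish mode (should show the ingress mode)
docker service inspect web --format "{{ json .Endpoint.Ports }}"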

With three containers on two nodes, I should be able to see:

  • Both nodes responding to a connection on port 80
  • Two containers servicing the same published port, on one node
  • One container servicing port 80 on the other node
  • Traffic arriving at a node, and going to a container either on the same node, or crossing to a container on the other node
  • All containers able to communicate with each other, on the same Layer 2 switched network.

I am using Portainer as a simple GUI to view the Docker Swarm service. Here is the web service:

Portainer Service List

and the service details:

Portainer Service Details

with the service overlay network:

Portainer Service Network

Using Portainer or the Docker command line (docker service inspect web and docker network inspect ingress), I can see that the containers are on a subnet of 10.255.0.0/16. The network also has one "ingress-endpoint" for each node, with addresses of 10.255.0.2 and 10.255.0.3.

First let’s check that the routing mesh works. Here you can see four different connections:

Docker 1 to web.2 – container on same host;

Docker 1 Container 2 crop

Docker 1 to web.3 – different container on same host;

Docker 1 Container 3 crop

Docker 2 to web.1 – container on the other host;

Docker 2 Container 1 crop

Docker 2 to web.3 – container on different host;

Docker 2 Container 3 crop

If I run a network trace I can see how it works. Below is the conversation between client and container, where the incoming request is routed to a container on the same node:

Connection to Container on Same Host

It consists of matched pairs of packets. If we take a look at one pair:

Source IP      Source MAC          Destination IP   Destination MAC     Content (TCP)
92.234.68.72   12:34:56:78:9a:bc   10.0.0.4         00:22:48:01:00:03   53711 → 80 [SYN]
10.255.0.2     00:22:48:01:00:03   10.255.0.6       00:15:5d:71:af:d8   65408 → 80 [SYN]

00:22:48 is the Vendor ID of adapters in the Azure VMs. 00:15:5d is the Vendor ID of Hyper-V adapters created by the Host Network Service for containers.

The packet has come from the external client on 92.234.68.72. The host adapter has received the packet from the client on its external IP address of 10.0.0.4, on port 80; and sent it with the same MAC address, but with the IP address of the ingress-endpoint 10.255.0.2, to port 80 on one of the containers. The same process happens in reverse with the reply.

Below is the conversation between client and container when the incoming request is routed to a container on a different node:

Connection to Container on Different Host

In this case we don’t see the translation between node and ingress-endpoint, because it happens on the other node. Instead we see that the request comes from the ingress-endpoint of the sending node, using the MAC address of that node's host adapter. The reply is sent back to that ingress-endpoint, using the MAC address of an overlay network adapter.

Source IP    Source MAC          Destination IP   Destination MAC     Content (TCP)
10.255.0.3   00:22:48:01:9e:11   10.255.0.7       00:15:5d:71:a4:c5   65408 → 80 [SYN]
10.255.0.7   00:15:5d:71:a4:c5   10.255.0.3       00:15:5d:bc:f5:40   80 → 65408 [SYN, ACK]

In between the two packets, we see the container broadcast an ARP request to find the MAC address of the ingress-endpoint. All communication between entities in the overlay network is by Layer 2 switching.

Below is the conversation between two containers on different nodes:

Ping Container to Container on Different Host

The containers are on the same Layer 2 broadcast domain. There is no firewall between them, even though the two nodes both operate the Windows Firewall and do not communicate openly with each other. The containers can ping each other and connect on any listening port.
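
This is easy to test directly from one of the containers (a sketch, using the container name and the overlay address of the container on the other host that appear in the traces above):

# Ping and a TCP connection test to the container on the other node
docker exec web.2.03uu9bab6n416jqi0reg59ohh ping 10.255.0.5
docker exec web.2.03uu9bab6n416jqi0reg59ohh powershell -Command "Test-NetConnection 10.255.0.5 -Port 80"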

We will have to dig a bit deeper to find out what makes this work, but for the moment we can see that:

  • The overlay network is a switched LAN segment stretched across the hosts
  • The ingress-endpoints act as load-balancing and routing gateways between the nodes and the container network.