Docker Swarm Networking

Docker Swarm enables containers to operate together to provide a service, across different nodes in a cluster. It uses an overlay network for communication between containers on different hosts. It also supports a routing mesh, which load-balances and routes incoming connections to the containers. Before the latest release of Windows Server 2016, the routing mesh was not supported on Windows. With the release of version 1709 it is, so we can see how it all works.

Swarm uses an overlay network for communication between containers providing the same service. There is an excellent description of it in the Docker Reference Architecture: Designing Scalable, Portable Docker Container Networks. The overlay network is implemented as a Virtual Extensible LAN (VXLAN) stretched in software across the underlying network connecting the hosts.
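To make "stretched in software" a little more concrete, here is a minimal Python sketch of VXLAN encapsulation as defined in RFC 7348: each Layer 2 frame from a container is prefixed with an 8-byte header carrying a 24-bit network identifier (VNI), and the result travels between hosts as an ordinary UDP payload. The VNI value 4096 and the helper names are assumptions for illustration only, not values read from this swarm.

```python
import struct

# UDP destination port assigned to VXLAN by IANA (RFC 7348).
VXLAN_UDP_PORT = 4789


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # 'I' flag: the VNI field is valid
    # Layout: flags (8 bits), reserved (24 bits), VNI (24 bits), reserved (8 bits)
    return struct.pack("!II", flags << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prefix a container's Ethernet frame with a VXLAN header.

    The result is what the host would carry as a UDP payload across
    the underlying network to the other host.
    """
    return vxlan_header(vni) + inner_frame


# 4096 is a hypothetical VNI chosen for the example.
header = vxlan_header(4096)
print(header.hex())
```

The containers never see this header. The hosts add and remove it, which is why the containers appear to share one switched segment.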

The network has a built-in routing mesh that directs incoming traffic on a published port, on any node, to any container running the service on any node. This diagram illustrates the routing mesh on Linux, where it is implemented in the kernel by the IP Virtual Server (IPVS) component:

Routing mesh diagram (from Docker Reference Architecture: Designing Scalable, Portable Docker Container Networks)

On Windows Server 2016 version 1607 the routing mesh does not work. Now, with the new Windows Server 2016 version 1709 it does.

Microsoft introduced support for Docker Swarm with overlay networks in April 2017, with KB4015217. The Microsoft document Getting Started with Swarm Mode describes it, but at the bottom it says that the routing mesh is not supported. You can still publish a port without the routing mesh, but your options are limited to either one container per host on a fixed port, or a dynamic port, in both cases with a separate load balancer.

To get the terms straight:

  • Overlay network: a VXLAN shared by containers on different hosts, transported by the underlying host network
  • Routing mesh: load balanced routing of incoming traffic on published ports to the destination port on one of the containers in the service
  • Ingress mode: the port publishing mode that uses the routing mesh, instead of a direct connection to ports on the container host (host mode, for example with a service in global mode)
  • "ingress": the name of the default overlay-type network created by Docker, just as "nat" is the name of the default NAT-type network. You can also create your own overlay network.
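As a rough illustration of the routing mesh term, the toy model below accepts a connection on a published port on any node and hands it to any task of the service, on the same node or another one. Round-robin selection, the node names docker1 and docker2, and the third container address 10.255.0.8 are assumptions made for the sketch. It does not describe how Windows actually implements the mesh.

```python
from itertools import cycle


class RoutingMesh:
    """Toy model: any node accepts the published port, any task may serve it."""

    def __init__(self, tasks):
        # tasks: list of (node_name, container_ip) pairs for one service
        self._tasks = list(tasks)
        self._next_task = cycle(self._tasks)  # round-robin is an assumption

    def route(self, arriving_node, published_port):
        node, container_ip = next(self._next_task)
        return {
            "arrived_on": arriving_node,
            "published_port": published_port,
            "served_by_node": node,
            "container_ip": container_ip,
            # True when the request has to cross the overlay to the other host
            "crosses_overlay": node != arriving_node,
        }


# Hypothetical placement: two tasks on one node, one on the other.
TASKS = [
    ("docker1", "10.255.0.6"),
    ("docker1", "10.255.0.7"),
    ("docker2", "10.255.0.8"),
]

mesh = RoutingMesh(TASKS)
for _ in range(4):
    print(mesh.route("docker1", 80))
```

Every request in the loop arrives on the same node, yet one in three is served by a container on the other node. That is the behaviour that distinguishes ingress mode from host mode.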

Support for the routing mesh and ingress mode has arrived in Windows Server 2016 version 1709 and is now available in Azure too. It is still at an early stage. It requires:

  • A new installation of Windows Server 2016 version 1709
  • Docker EE version 17.10, still in Preview.

To install Docker EE Preview, run:

Install-Module DockerProvider
Install-Package Docker -ProviderName DockerProvider -RequiredVersion Preview -Force

To test this, I created a Docker Swarm service with three replicas on two nodes. I am using the microsoft/iis:windowsservercore-1709 image to have something to connect to:

docker service create --name web --replicas 3 --publish mode=ingress,target=80,published=80 microsoft/iis:windowsservercore-1709

The service is created by default on the "ingress" overlay network, because it has a published port. 

With three containers on two nodes, I should be able to see:

  • Both nodes responding to a connection on port 80
  • Two containers servicing the same published port, on one node
  • One container servicing port 80 on the other node
  • Traffic arriving at a node, and going to a container either on the same node, or crossing to a container on the other node
  • All containers able to communicate with each other, on the same Layer 2 switched network.
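One way to confirm the two-plus-one placement is to group the service tasks by node. The sketch below assumes the text was produced by docker service ps web --format "{{.Name}} {{.Node}}". The sample output and the node names docker1 and docker2 are hypothetical stand-ins, not captured from the test swarm.

```python
from collections import defaultdict


def tasks_by_node(ps_output: str) -> dict:
    """Group task names by node from lines of the form '<task> <node>'."""
    placement = defaultdict(list)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        task, node = parts
        placement[node].append(task)
    return dict(placement)


# Hypothetical sample, standing in for real command output.
SAMPLE = """web.1 docker2
web.2 docker1
web.3 docker1
"""

placement = tasks_by_node(SAMPLE)
for node, tasks in sorted(placement.items()):
    print(node, tasks)
```

With three replicas on two nodes, one node should list two tasks and the other should list one.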

I am using Portainer as a simple GUI to view the Docker Swarm service. Here is the web service:

Portainer Service List

and the service details:

Portainer Service Details

with the service overlay network:

Portainer Service Network

Using Portainer or the Docker command line (docker service inspect web and docker network inspect ingress), I can see that the containers are on a subnet of 10.255.0.0/16. The network also has one "ingress-endpoint" for each node, with addresses of 10.255.0.2 and .3.
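A quick sanity check on these addresses can be sketched with Python's ipaddress module. The ingress-endpoint addresses come from the inspect output described above; the container addresses 10.255.0.6 and 10.255.0.7 and the host address 10.0.0.4 are the ones that appear in the network traces later in this post. The labels are mine.

```python
import ipaddress

INGRESS_SUBNET = ipaddress.ip_network("10.255.0.0/16")

# Addresses taken from this post; the labels are only descriptive.
ADDRESSES = {
    "ingress-endpoint, first node": "10.255.0.2",
    "ingress-endpoint, second node": "10.255.0.3",
    "container": "10.255.0.6",
    "another container": "10.255.0.7",
    "host adapter": "10.0.0.4",
}


def on_ingress_network(address: str) -> bool:
    """Return True if the address belongs to the ingress overlay subnet."""
    return ipaddress.ip_address(address) in INGRESS_SUBNET


for label, address in ADDRESSES.items():
    print(f"{label:32} {address:12} {on_ingress_network(address)}")
```

Only the host adapter falls outside the subnet, which is why a translation step is needed when traffic enters the overlay.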

First let's check that the routing mesh works. Here you can see four different connections:

Docker 1 to web.2 - container on same host;

Docker 1 Container 2 crop

Docker 1 to web.3 - different container on same host;

Docker 1 Container 3 crop

Docker 2 to web.1 - container on the other host;

Docker 2 Container 1 crop

Docker 2 to web.3 - container on different host;

Docker 2 Container 3 crop

If I run a network trace I can see how it works. Below is the conversation between client and container, where the incoming request is routed to a container on the same node:

Connection to Container on Same Host

It consists of exact pairs of packets. If we take a look at one pair:

Source IP      Source MAC          Destination IP   Destination MAC     TCP
92.234.68.72   12:34:56:78:9a:bc   10.0.0.4         00:22:48:01:00:03   53711 → 80 [SYN]
10.255.0.2     00:22:48:01:00:03   10.255.0.6       00:15:5d:71:af:d8   65408 → 80 [SYN]

00:22:48 is the Vendor ID of adapters in the Azure VMs. 00:15:5d is the Vendor ID of Hyper-V adapters created by the Host Network Service for containers.
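When reading a trace like this it helps to label each MAC address by its vendor prefix. The small sketch below uses only the two prefixes named above. The function name and the "unknown" fallback are my own choices.

```python
# Vendor prefixes (first three octets) as described in this post.
VENDOR_ROLES = {
    "00:22:48": "Azure VM adapter (host)",
    "00:15:5d": "Hyper-V adapter created by the Host Network Service (container side)",
}


def classify_mac(mac: str) -> str:
    """Return the role implied by the vendor prefix of a MAC address."""
    prefix = mac.lower().replace("-", ":")[:8]
    return VENDOR_ROLES.get(prefix, "unknown")


for mac in ("00:22:48:01:00:03", "00:15:5d:71:af:d8", "12:34:56:78:9a:bc"):
    print(mac, "->", classify_mac(mac))
```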

The first packet has come from the external client on 92.234.68.72. The host adapter received it on its own IP address of 10.0.0.4, on port 80. It then sent the second packet to port 80 on one of the containers, using the same host MAC address as the source, but with the source IP address of the ingress-endpoint, 10.255.0.2, and a new source port. The same process happens in reverse with the reply.
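The rewrite seen in this pair of packets can be written down as a small model. This is only a sketch of what the trace shows, using the addresses and ports from the trace. The Packet class and the ingress_translate function are hypothetical names, and nothing here claims to reproduce how the Host Network Service performs the translation internally.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_mac: str
    src_port: int
    dst_ip: str
    dst_mac: str
    dst_port: int


def ingress_translate(pkt, endpoint_ip, container_ip, container_mac, new_src_port):
    """Model the observed rewrite from host adapter to container.

    The host adapter's MAC (the incoming destination MAC) becomes the new
    source MAC, the ingress-endpoint IP becomes the new source IP, and the
    destination port is left unchanged.
    """
    return replace(
        pkt,
        src_ip=endpoint_ip,
        src_mac=pkt.dst_mac,
        src_port=new_src_port,
        dst_ip=container_ip,
        dst_mac=container_mac,
    )


# First packet of the pair, as captured in the trace.
incoming = Packet("92.234.68.72", "12:34:56:78:9a:bc", 53711,
                  "10.0.0.4", "00:22:48:01:00:03", 80)

forwarded = ingress_translate(
    incoming,
    endpoint_ip="10.255.0.2",
    container_ip="10.255.0.6",
    container_mac="00:15:5d:71:af:d8",
    new_src_port=65408,
)
print(forwarded)
```

The forwarded packet matches the second row of the trace: the client's address no longer appears, so the container only ever talks to the ingress-endpoint.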

Below is the conversation between client and container when the incoming request is routed to a container on a different node:

Connection to Container on Different Host

In this case we don't see the translation between node and ingress-endpoint, because it happens on the other node. Instead we see the request arrive from the ingress-endpoint of the sending node, with the MAC address of that node's host adapter as its source. The reply is sent to the ingress-endpoint using the MAC address of the overlay network adapter.

Source IP    Source MAC          Destination IP   Destination MAC     TCP
10.255.0.3   00:22:48:01:9e:11   10.255.0.7       00:15:5d:71:a4:c5   65408 → 80 [SYN]
10.255.0.7   00:15:5d:71:a4:c5   10.255.0.3       00:15:5d:bc:f5:40   80 → 65408 [SYN, ACK]

In between the two packets, we see the container broadcast to find the MAC address of the ingress-endpoint. All communication between entities in the overlay network is by Layer 2 switching.

Below is the conversation between two containers on different nodes:

Ping Container to Container on Different Host

The containers are on the same Layer 2 broadcast domain. There is no firewall between them, even though the two nodes both operate the Windows Firewall and do not communicate openly with each other. The containers can ping each other and connect on any listening port. 

We will have to dig a bit deeper to find out what makes this work, but for the moment we can see that:

  • The overlay network is a switched LAN segment stretched across the hosts
  • The ingress-endpoints act as load-balancing and routing gateways between the nodes and the container network.
