Networking with Kubernetes

Alok Kumar Singh
Practo Engineering
Oct 23, 2017

Containers are changing how applications are developed and run. With containers in the picture, application design has become more distributed, making the network a critical component. Container networking is fairly involved and not easily understood. There are many types of and options for container networking, and each should be understood properly to make the right choice for building a secure, robust and adaptable production system.

Container networking across the industry has been implemented using core Linux networking technologies: IP routing, IPTables for firewalling, port forwarding and load balancing, IPAM for IP allocation, DNS, IP masquerading, network address translation, network interfaces, Linux namespaces, bridges and so on. I intend to cover these technologies over a series of posts; some of them will appear as part of other posts in the series. (Note: please read this on a desktop to see the tables formatted properly.)

Contents

  • Networking modes
  • Pod to pod communication within the same node
  • Pod to pod communication across nodes
  • Simple networking implementations

Networking Modes

As I said, there are various options for networking containers, and a lot of standardisation effort is going into how container network interfaces are configured. But before getting into that standardisation, and into Kubernetes networking in particular, let us first understand the different networking modes and a little of their history. This is useful because it is fundamental to how any container orchestration system works.

Link mode

It all started with Docker introducing the link feature to let containers discover each other and communicate within the same host. The main focus of this approach was simply to make containers talk to each other and be reachable from outside. When containers were linked, information about the source container was made available to the recipient container by setting environment variables and adding the source container's IP address to the recipient's /etc/hosts. This did not scale: the IP of the source container can change, and it then becomes impossible to keep updating the environment variables and /etc/hosts across all dependent containers. This mode is now deprecated.

Host network mode

Host network mode, also called native networking, was one of the early, simple networking modes. In this mode, a container starts in the host network namespace and uses the host IP address, which means it can see all the network interfaces of the host. To expose a container to the outside, you pick a port from the host port space and configure IPTables with the appropriate firewall rules to expose the container's service.

A few advantages of this approach: 1) better performance, as there are no NAT translations, 2) easier troubleshooting, 3) a simple design.

Mesos uses host network mode. Even Borg, the predecessor to Kubernetes, used the IP address of the host, shared the host port space and thus operated in host mode. The major drawback of this approach is the need to manage the ports that containers are attached to: it can lead to port conflicts, and applications are required to pre-declare the ports they use.
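For a rough illustration of host mode with Docker (nginx here is just a placeholder image):

// host mode: the container joins the host network namespace
$ docker run --rm --network host nginx
// nginx now listens directly on port 80 of the host; a second container
// trying to bind the same port on this host would hit a port conflict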

Bridged network mode

Bridge mode improves on host mode because there is no need to manage host ports, and it is the most common type of container network. A virtual bridge (docker0 in Docker's case) is created on the host. Each container that gets created gets its own network namespace and an IP address from the bridge subnet. Since every container has its own namespace, it has its own port space, so there are no port conflicts. But to communicate with the outside world, containers still use the host IP, so network address translation is needed to translate containerIP:port to hostIP:port. These NAT translations are done in Linux using IPTables.

The good thing about this approach is that it isolates the containers and gives them their own port space, which prevents port conflicts; the downside is that the NAT translation brings a performance overhead.
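As a hedged sketch of what that NAT looks like in Docker's bridge mode (the port numbers are arbitrary):

// bridge mode: publish container port 80 on host port 8080
$ docker run --rm -p 8080:80 nginx
// Docker programs a DNAT rule in the nat table that rewrites
// hostIP:8080 to containerIP:80; the rule can be inspected with
$ sudo iptables -t nat -L DOCKER -n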

The bridge method can be extended to multiple hosts, but it then requires maintaining or dynamically allocating ports. What was needed was a simple design that supported multi-host networking without the infrastructure complexity of managing ports and without any NAT translations.

Here comes the sun…

IP per pod mode

Kubernetes took a completely different approach to networking containers. We could call it an "Internet like" approach: the fundamental change is that every pod gets its own IP address and is reachable from other pods regardless of which host it is on, and there are no NAT translations in pod to pod, node to pod or pod to node communication.

Let us now dig into IP-per-pod mode and understand the various communications that happen in the kubernetes cluster.

Pods and hosts communicate by sending packets to each other. Packets are the most basic unit of network transmission. (Strictly speaking, the layer of communication determines whether it is a segment, a packet or a frame being transmitted, but let's just call it a packet for this discussion.) The basic contents of a packet are listed below, with a quick capture example after the list:

  • Sender and Receiver IP Address
  • Payload: Actual Data being transmitted i.e. content of the packet
  • MAC addresses, added when the packet is sent as a frame at Layer 2
  • Sender and destination port information
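Here is a quick way to see these fields on a live interface with tcpdump; eth0 is just the node's network interface, and the flags ask for the Layer 2 header and numeric addresses:

// -e prints the Layer 2 (MAC) header, -n disables name resolution
$ sudo tcpdump -e -n -i eth0
// each captured line shows the source/destination MACs, IPs and ports of a packet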

Pod to pod communication within the same node

Note: The red dot in all the diagrams is the packet getting transmitted.

Network namespaces

The above diagram shows how a packet flows for pod to pod communication within a single node (pod1 to pod2). It is a bridge network within the node. Let us go through it in detail. We have a machine, called a node in Kubernetes, with IP 172.31.102.105 belonging to a subnet with CIDR 172.31.102.0/24. The node has a network interface eth0 attached, which belongs to the node's root network namespace. To isolate the pods, each is created in its own network namespace: these are pod1 n/w ns and pod2 n/w ns. The pods are assigned the IP addresses 100.96.243.7 and 100.96.243.8 from the CIDR range 100.96.0.0/11. We will discuss how IP addresses are assigned to pods later, in the network implementation section, as it is implementation specific.

Network interfaces and veth pairs

Now we need to connect the root and pod network namespaces so they can talk to each other. In Linux this is done with veth pairs (think of a cable with two ends: whatever goes in one end comes out the other, and vice versa). Let us call them eth0 <--> vethpod1 and eth0 <--> vethpod2.
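Kubernetes and its network plugins wire this up automatically, but a hand-rolled sketch of the same mechanism looks roughly like this; the names demo, veth-host and veth-pod and the IP address are made up for illustration:

// create a namespace and a veth pair, then move one end into the namespace
$ sudo ip netns add demo
$ sudo ip link add veth-host type veth peer name veth-pod
$ sudo ip link set veth-pod netns demo
$ sudo ip netns exec demo ip addr add 100.96.243.100/24 dev veth-pod
$ sudo ip netns exec demo ip link set veth-pod up
$ sudo ip link set veth-host up
// whatever enters veth-host now comes out of veth-pod inside demo, and vice versa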

To view these namespaces in Linux, we need to create a symlink for each pod namespace, because the ip netns utility looks for namespaces in a particular directory, /var/run/netns. Let us call the pod container names pod1 and pod2.

// view namespaces
admin@ip-172-31-102-105:~$ pid="$(sudo docker inspect -f '{{.State.Pid}}' "pod1")"
admin@ip-172-31-102-105:~$ sudo ln -sf /proc/$pid/ns/net "/var/run/netns/pod1"
admin@ip-172-31-102-105:~$ ip netns list
pod1

Here, we list the interfaces attached to the namespace of pod1, i.e. eth0@if18. This is one end of the eth0 <--> vethpod1 pair.

// one end, eth0 of the pair eth0 <--> vethpod1
admin@ip-172-31-102-105:~ $ sudo ip netns exec pod1 ip a
3: eth0@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8973 qdisc noqueue state UP group default
link/ether 0a:58:64:60:1d:0e brd ff:ff:ff:ff:ff:ff
inet 100.96.243.7/24 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::f8c9:64ff:fe4e:aac9/64 scope link tentative dadfailed
valid_lft forever preferred_lft forever

Let us find the other end of the veth pair. This end is present in the root namespace, so we can use the ip a command to find it, grepping for the Docker container name, pod1. The interface name is veth<podname>.

// other end, vethpod1 of the pair eth0 <--> vethpod1
admin@ip-172-31-102-105:~ $ ip a | grep -A3 "vethpod1"
8: vethpod1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether ee:70:85:2a:3e:6c brd ff:ff:ff:ff:ff:ff
inet6 fe80::ec70:85ff:fe2a:3e6c/64 scope link
valid_lft forever preferred_lft forever
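Another way to pair up the two ends, instead of grepping by name, is the peer_ifindex statistic that the veth driver exposes through ethtool on most kernels; it reports the interface index of the peer, which should correspond to the "3:" index of eth0 inside pod1's namespace shown earlier.

// ask the veth driver for the ifindex of its peer inside the pod namespace
admin@ip-172-31-102-105:~$ sudo ethtool -S vethpod1 | grep peer_ifindex
// the reported peer_ifindex should match the index of eth0 in pod1's namespace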

Routing and bridges

As soon as pod1 emits a packet, it travels through the eth0 <--> vethpod1 pair to reach the root network namespace.

The interface the packet is sent out of is determined by checking the route table. Our packet has the destination IP of pod2, i.e. 100.96.243.8. The route table says to forward any traffic for the range 100.96.243.0/24 to cbr0, which is a bridge. Let us look at the route table on the node.

// route table of the node
admin@ip-172-31-102-105:~$ ip route
default via 172.31.103.1 dev eth0
100.64.0.0/10 dev flannel0 proto kernel scope link src 100.96.243.0
100.96.243.0/24 dev cbr0 proto kernel scope link src 100.96.243.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.31.102.0/24 dev eth0 proto kernel scope link src 172.31.102.105
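Rather than reading the table by eye, we can also ask the kernel which route it would actually pick for pod2's IP; ip route get prints the chosen route:

// ask the kernel which route it would use for pod2's IP
admin@ip-172-31-102-105:~$ ip route get 100.96.243.8
// prints the matching route, which in our setup is the 100.96.243.0/24 entry via cbr0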

vethpod1 then sends the packet to the bridge cbr0.

We can list the interfaces attached to the bridge using the command below, which shows our pod veth interfaces. Cool!

// interfaces attached to cbr0 in node
admin@ip-172-31-102-105:~ $ sudo brctl show
bridge name     bridge id           STP enabled     interfaces
cbr0            8000.0a5864601d01   no              vethpod1
                                                    vethpod2
docker0         8000.024202a3b77c   no

Next, the bridge reads the destination MAC address in the packet, which is pod2's MAC, 0a:28:64:60:2d:0e.

We can confirm this MAC address as below:

// destination mac address, mac address of pod2
admin@ip-172-31-102-105:~ $ sudo ip netns exec pod2 ip a
6: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8973 qdisc noqueue state UP group default
link/ether 0a:28:64:60:2d:0e brd ff:ff:ff:ff:ff:ff
inet 100.96.243.8/24 scope global eth0
valid_lft forever preferred_lft forever

The bridge maintains a forwarding table, in which previously learnt* MAC address to port mappings are kept, and does Layer 2 transmission using MAC addresses.

// forwarding table, FDB
admin@ip-172-31-102-105:~$ sudo brctl showmacs cbr0
port no    mac addr             is local?    ageing timer
48         0a:28:64:60:2d:0e    yes          0.00
96         0a:58:64:60:1d:0e    yes          0.00

*MAC learning: in our case, the bridge receives a packet from pod1's MAC address 0a:58:64:60:1d:0e. If it does not find this address in the forwarding table, it adds an entry for the source MAC address and the port it arrived on. The next time it has to send packets to this MAC address, it looks up the table and finds the port to forward the packet to. This is known as MAC learning.
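On systems with a recent iproute2, the same forwarding table can be read with the bridge tool instead of brctl:

// same FDB via iproute2; lists learnt MAC addresses against the bridge ports (veth interfaces)
admin@ip-172-31-102-105:~$ bridge fdb show br cbr0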

Our destination MAC address in the packet is 0a:28:64:60:2d:0e. The bridge does a forwarding table lookup to find the port for this address and sends the packet out of that port, i.e. 48. We can find out which interface port 48 belongs to as below:

// port - interface mapping in the bridge
admin@ip-172-31-102-105:~$ sudo brctl showstp cbr0 | grep -A2 vethpod2
vethpod2 (48)
port id 804e state forwarding

The packet then reaches the vethpod2 interface. From vethpod2, the packet gets piped to eth0 of the pod2 namespace. Congratulations! Our packet reached its destination!

We now know how a packet moves to reach another pod on the same node. The Linux utilities used here also help in debugging routing problems on the node.

It gets more interesting when the communication happens across the nodes.

Pod to pod communication across nodes

Pod to pod communication within the same node was simple and involved basic Linux networking concepts. Per the Kubernetes networking requirements, a pod IP must be reachable across nodes. How the pod IP is made reachable across nodes is covered when we discuss specific network implementations; for now, let us just follow the packet flow from a pod to another pod on a different node.

Routing

Once again pod1 emits a packet, but for pod4 this time. It travels through the eth0 <--> vethpod1 pair to reach the root network namespace and then the bridge cbr0 on node1.

// route table node1
admin@ip-172-31-102-105:~$ ip route
default via 172.31.103.1 dev eth0
100.64.0.0/10 dev flannel0 proto kernel scope link src 100.96.243.0
100.96.243.0/24 dev cbr0 proto kernel scope link src 100.96.243.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.31.102.0/24 dev eth0 proto kernel scope link src 172.31.102.105

This time our packet has the destination IP 100.96.33.8 and, say, the destination MAC address ab:e8:64:62:20:0e. In this case our bridge cbr0 won't find this MAC address in its forwarding table, so it performs an ARP lookup to resolve the MAC address from the IP address, which also fails. We can run the command below to look at the node's ARP table.

// arp table
admin@ip-172-31-102-105:~$ sudo arp
Address             HWtype    HWaddress            Iface
ip-172-31-103-43    ether     12:b8:2c:97:c4:98    eth0
100.96.243.8        ether     0a:28:64:60:2d:0e    cbr0
100.96.243.7        ether     0a:58:64:60:1d:0e    cbr0

As the IP 100.96.33.8 belongs to a different node that this node has never communicated with before, ARP fails (you can see that the ARP table above has no entry for this IP). The route tables are then consulted again and, finding nothing more specific, the packet is forwarded to the default interface eth0 of node1.

Now the time has come for the packet to leave node1 and go out onto the wire, into the cloud. How the packet gets from node1 to node2 is very much implementation specific and will be discussed in the section on specific network implementations.

Our packet reaches node2 at its eth0. Based on node2's route table, it is then IP-forwarded to cbr0. Check the table below.

// route table node2
admin@ip-172-31-103-67:~$ ip route
default via 172.31.103.1 dev eth0
100.64.0.0/10 dev flannel0 proto kernel scope link src 100.96.33.0
100.96.33.0/24 dev cbr0 proto kernel scope link src 100.96.33.1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
172.31.103.0/24 dev eth0 proto kernel scope link src 172.31.103.67

cbr0 does the ARP/FDB lookups and sends the packet to the vethpod4 pipe, and then our packet finally reaches eth0 of the pod4 network namespace. Awesome, our packet reached the other node's pod!
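One detail this last hop depends on: node2 will only pass a packet arriving on eth0 over to cbr0 if IP forwarding is enabled in the kernel, which Kubernetes nodes are expected to have turned on. A quick check:

// IP forwarding between interfaces must be enabled on the node
admin@ip-172-31-103-67:~$ sysctl net.ipv4.ip_forward
// the expected value here is 1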

Simple networking implementations

Kubernetes does not really care how the network is implemented, as long as its basic requirement of an IP per pod is met by the network provider. We will start with the simplest implementations first.

We discussed above how a packet is routed when its destination IP is not found within the node. How does it then get routed to the other node?

Node to node communication

a) Networking using cloud routing tables: In this implementation every node is allocated a pod CIDR range and has access to modify the cloud routing table. Whenever a node comes up, the cloud routing table is updated for it. This basically tells the cloud: any packet destined for the pod range 100.96.33.0/24 should be routed to 172.31.103.67.

Node communication using cloud routing tables
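To make this concrete, here is roughly what such a cloud route looks like via the AWS CLI; the route-table and instance IDs are placeholders, and in practice the Kubernetes cloud-provider integration creates these routes on the node's behalf:

// route node2's pod CIDR to node2's instance in the VPC route table
$ aws ec2 create-route \
    --route-table-id rtb-0example0 \
    --destination-cidr-block 100.96.33.0/24 \
    --instance-id i-0example0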

A couple of drawbacks with this approach:

  • Multiple route tables need to be maintained in the case of a multi-AZ cluster with single-AZ NATs. Most clusters are provisioned by kops, which only supports kubenet networking for using cloud routing tables and allows only one route table, so you cannot use kops for private networking with cloud routing tables. For details, look at this issue.
  • Some cloud providers limit the number of route table entries, 50 in the case of AWS. This means the cluster can have only 50 nodes, which is a bottleneck.

b) Networking using node route tables: Another approach to letting nodes communicate is to write a program that configures every node's internal route table for each node and pod-CIDR pair that comes up. We can make node1 and node2 talk to each other by adding the route entries below on the respective nodes. This setup is difficult to maintain, though, and no one really does this.

admin@172.31.102.105$ ip route add 100.96.33.0/24 via 172.31.103.67
admin@172.31.103.67$ ip route add 100.96.243.0/24 via 172.31.102.105

In the upcoming post(s) in the series, we will dig into other networking implementations like cloud native networking, SDNs, CNIs and overlay networks, and discuss pod to service communication and internal and external communication in the Kubernetes cluster.

Update: the future posts were not written, as there are already many such articles out there.

I wrote this blog series as a way to understand networking in Kubernetes; it came together after reading many articles and concepts on the internet. I would particularly like to thank Tim Hockin and Michael Rubin for building such a wonderful system and explaining it in style in this video. I also want to thank Bryan Boreham from the Kubernetes and Weave community for helping me understand some of this, and Hari, who suggested blogging as a way to learn things.

