Packet journey through Linux kernel

High Level Overview

The high-level path network data takes from a user program to a network device is as follows:

  • Data is written using a system call (such as sendto, sendmsg, et al.).
  • Data passes through the socket subsystem on to the socket’s protocol family’s system (in our case, AF_INET).
  • The protocol family passes data through the protocol layers which (in many cases) arrange the data into packets.
  • The data passes through the routing layer, populating the destination and neighbour caches along the way (if they are cold). This can generate ARP traffic if an Ethernet address needs to be looked up.
  • After passing through the protocol layers, packets reach the device agnostic layer.
  • The output queue is chosen using XPS (if enabled) or a hash function.
  • The device driver’s transmit function is called.
  • The data is then passed on to the queue discipline (qdisc) attached to the output device.
  • The qdisc will either transmit the data directly if it can, or queue it up to be sent during the NET_TX softirq.
  • Eventually the data is handed down to the driver from the qdisc.
  • The driver creates the needed DMA mappings so the device can read the data from RAM.
  • The driver signals the device that the data is ready to be transmitted.
  • The device fetches the data from RAM and transmits it.
  • Once transmission is complete, the device raises an interrupt to signal transmit completion.
  • The driver’s registered IRQ handler for transmit completion runs. For many devices, this handler simply triggers the NAPI poll loop to start running via the NET_RX softirq.
  • The poll function runs in softirq context and calls down into the driver to unmap DMA regions and free packet data.
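
The hash-based queue choice in the list above can be illustrated with a small user-space model. This is only a sketch: it assumes the flow hash has already been computed, ignores XPS entirely, and scales the hash into the queue range with the same multiply-and-shift arithmetic as the kernel's reciprocal_scale() helper. The names scale_to_range and pick_tx_queue are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a 32-bit value into the range [0, n) without a division.
 * Same arithmetic as the kernel's reciprocal_scale() helper. */
static uint32_t scale_to_range(uint32_t val, uint32_t n)
{
    return (uint32_t)(((uint64_t)val * n) >> 32);
}

/* Hypothetical helper: map a precomputed flow hash to a transmit queue.
 * Packets of one flow share a hash, so they land on the same queue. */
static uint16_t pick_tx_queue(uint32_t flow_hash, uint16_t num_queues)
{
    assert(num_queues > 0);
    return (uint16_t)scale_to_range(flow_hash, num_queues);
}
```

Because the result depends only on the hash, all packets of a flow stay on one queue, which preserves their ordering.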

Diagram

[Figure: Network data flow through kernel.png]

Application layer

From the application's point of view, the lowest layer of interaction with the networking stack is the socket layer. The socket layer is the interface between the application layer and the transport layer (TCP/UDP) of the OSI model. Connection-oriented (stream) sockets are implemented on top of TCP, and connectionless (datagram) sockets use UDP.

In Linux, the socket() syscall creates a new socket. To interact with other sockets, an application uses the following syscalls to write to and read from a socket:

  • send(), sendto(), and sendmsg() transmit a message to another socket.
  • recv(), recvfrom(), and recvmsg() receive messages from a socket.
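
As an illustration of these calls, the following minimal sketch sends one UDP datagram to a receiver on the loopback address and reads it back. The helper name udp_loopback_roundtrip, the use of 127.0.0.1 with a kernel-chosen port, and the two-second receive timeout are assumptions made for the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send `msg` over UDP to a receiver on 127.0.0.1 and
 * read it back. Returns the number of bytes received, or -1 on error. */
static ssize_t udp_loopback_roundtrip(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t addrlen = sizeof(addr);
    struct timeval timeout = { 2, 0 };  /* do not block forever in recvfrom() */
    ssize_t n = -1;
    int rx, tx;

    assert(msg != NULL && out != NULL);

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto done;

    /* Bind the receiver to 127.0.0.1 and let the kernel pick a free port. */
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &addrlen) < 0 ||
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout)) < 0)
        goto done;

    /* sendto() carries the destination address with the call itself. */
    if (sendto(tx, msg, strlen(msg), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    n = recvfrom(rx, out, outlen, 0, NULL, NULL);
done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```

Calling udp_loopback_roundtrip("hello", buf, sizeof(buf)) from a test program should return the length of the message and leave its bytes in buf.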

When a message-sending call such as send or write is made, control reaches the __sock_sendmsg kernel function in net/socket.c. The kernel checks that the user buffer is readable and, if so, obtains the socket struct from the socket descriptor supplied by the calling user-level program. It then creates the message header based on the message transmitted, along with a socket control message that carries the UID, PID and GID of the process. All of these operations are carried out in process context.

Control then passes to sock_sendmsg, which leads to the protocol-specific sendmsg function. The sendmsg field of the proto_ops structure is consulted and the protocol-specific function is invoked: tcp_sendmsg for a TCP socket, udp_sendmsg for a UDP socket. This decision is made once control has passed over the Transport Layer Interface. The tcp_sendmsg function, defined in net/ipv4/tcp.c, is ultimately invoked whenever a user-level send is issued on an open SOCK_STREAM socket.

Transport Layer

As mentioned above, the socket interface lets applications use the network while hiding the lower-layer protocols. It helps fill in headers and pack the application message into a transport-layer PDU. One of its tasks is to extract the socket structure and check that it is functional.

In effect, this layer invokes the appropriate protocol for the connection. This is done in inet_sendmsg, which is in net/ipv4/af_inet.c. As the name suggests, this is a generic function provided by the AF_INET protocol family. It looks up the sendmsg function on the socket's internal protocol operations structure and calls it. The function pointer set in the proto structure points to tcp_sendmsg or udp_sendmsg, as the case may be.
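
The two-step dispatch described here (the family-level sendmsg first, then the transport-level sendmsg) can be sketched as a user-space model built on function pointers. Every model_* type and function below is a hypothetical, heavily simplified stand-in; the real struct socket, struct sock, struct proto_ops and struct proto have many more fields and different signatures.

```c
#include <assert.h>
#include <stddef.h>

struct model_sock;
struct model_socket;

/* Transport-level operations, loosely modelled on struct proto. */
struct model_proto {
    const char *name;
    int (*sendmsg)(struct model_sock *sk, const void *buf, size_t len);
};

/* Family-level operations, loosely modelled on struct proto_ops. */
struct model_proto_ops {
    int (*sendmsg)(struct model_socket *sock, const void *buf, size_t len);
};

struct model_sock   { const struct model_proto *prot; };
struct model_socket { const struct model_proto_ops *ops; struct model_sock *sk; };

enum { SENT_BY_TCP = 1, SENT_BY_UDP = 2 };

/* Stand-ins for tcp_sendmsg and udp_sendmsg: they only report who ran. */
static int model_tcp_sendmsg(struct model_sock *sk, const void *buf, size_t len)
{
    (void)sk; (void)buf; (void)len;
    return SENT_BY_TCP;
}

static int model_udp_sendmsg(struct model_sock *sk, const void *buf, size_t len)
{
    (void)sk; (void)buf; (void)len;
    return SENT_BY_UDP;
}

/* Stand-in for inet_sendmsg: generic for the family, defers to the protocol. */
static int model_inet_sendmsg(struct model_socket *sock, const void *buf, size_t len)
{
    return sock->sk->prot->sendmsg(sock->sk, buf, len);
}

/* Stand-in for sock_sendmsg: calls whatever the socket's ops table holds. */
static int model_sock_sendmsg(struct model_socket *sock, const void *buf, size_t len)
{
    assert(sock != NULL && sock->ops != NULL && sock->ops->sendmsg != NULL);
    return sock->ops->sendmsg(sock, buf, len);
}

static const struct model_proto model_tcp_prot = { "TCP", model_tcp_sendmsg };
static const struct model_proto model_udp_prot = { "UDP", model_udp_sendmsg };
static const struct model_proto_ops model_inet_ops = { model_inet_sendmsg };
```

In this model both sockets share the same family-level entry point, and only the prot pointer decides whether the TCP or the UDP stand-in runs.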

Next, the destination address and port are determined from one of two possible sources:

  • The socket itself has the destination address stored because the socket was connected at some point.
  • The address is passed in via an auxiliary structure.

The kernel arranges a struct msghdr structure on behalf of the user when the user program calls sendto. If udp_sendmsg or tcp_sendmsg was reached from a kernel function that did not arrange a struct msghdr structure, the destination address and port are retrieved from the socket itself and the socket is marked as "connected". In either case, daddr and dport are set to the destination address and port.
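
Both sources of the destination can be observed from user space with a UDP socket. The sketch below assumes Linux behaviour, where send() on an unconnected UDP socket fails with EDESTADDRREQ because neither source is available. The helper name check_udp_destination_sources and the loopback receiver are assumptions made for the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the three cases behave as described,
 * -1 otherwise. */
static int check_udp_destination_sources(void)
{
    struct sockaddr_in addr;
    socklen_t addrlen = sizeof(addr);
    int ok = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &addrlen) < 0)
        goto done;
    assert(addr.sin_port != 0);  /* the kernel assigned a port to the receiver */

    /* No address passed in and no connect(): there is no destination. */
    if (send(tx, "a", 1, 0) != -1 || errno != EDESTADDRREQ)
        goto done;

    /* Address passed in with the call (the auxiliary structure case). */
    if (sendto(tx, "b", 1, 0, (struct sockaddr *)&addr, sizeof(addr)) != 1)
        goto done;

    /* connect() stores the destination on the socket; send() then works. */
    if (connect(tx, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;
    if (send(tx, "c", 1, 0) != 1)
        goto done;

    ok = 0;
done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return ok;
}
```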

In the case of TCP, tcp_sendmsg performs TCP-specific work on the packet and waits for the connection to be established, since TCP cannot send data until then. tcp_sendmsg also takes care of setting up the Maximum Segment Size (MSS).
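
The MSS the kernel is using for a connection can be read from user space with the TCP_MAXSEG socket option. The following is a minimal sketch that connects to a listener on the loopback address and queries the value; the helper name loopback_tcp_mss is hypothetical, and the number returned depends on the interface MTU and the TCP options in use.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: connect to a listener on 127.0.0.1 and return the
 * MSS reported by the kernel for that connection, or -1 on error. */
static int loopback_tcp_mss(void)
{
    struct sockaddr_in addr;
    socklen_t addrlen = sizeof(addr);
    int mss = 0;
    socklen_t optlen = sizeof(mss);
    int result = -1;
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int cli = socket(AF_INET, SOCK_STREAM, 0);

    if (srv < 0 || cli < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 1) < 0 ||
        getsockname(srv, (struct sockaddr *)&addr, &addrlen) < 0)
        goto done;

    /* A blocking connect() returns once the handshake has completed; the
     * kernel finishes it for the listener even before accept() is called. */
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    if (getsockopt(cli, IPPROTO_TCP, TCP_MAXSEG, &mss, &optlen) < 0)
        goto done;
    assert(optlen == sizeof(mss));
    result = mss;
done:
    if (srv >= 0)
        close(srv);
    if (cli >= 0)
        close(cli);
    return result;
}
```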

Once the connection is established and the other TCP-specific operations are done, the message is actually sent. This is done through the I/O vector structure, a mechanism for transferring data from user space into kernel space. This is where the struct sk_buff *skb is created and the user data is copied from user space into the socket buffers.
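
On the user-space side, the I/O vector is visible in the sendmsg() interface: struct msghdr points to an array of struct iovec entries, and the kernel gathers them into one message. The sketch below uses UDP on the loopback address so that the gathered result arrives as a single datagram; the helper name gather_send_loopback and the two-second receive timeout are assumptions made for the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send two separate user buffers as one datagram via
 * sendmsg() and read it back. Returns bytes received, or -1 on error. */
static ssize_t gather_send_loopback(const char *part1, const char *part2,
                                    char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t addrlen = sizeof(addr);
    struct timeval timeout = { 2, 0 };  /* do not block forever in recvfrom() */
    struct iovec iov[2];
    struct msghdr msg;
    ssize_t n = -1;
    int rx, tx;

    assert(part1 != NULL && part2 != NULL && out != NULL);

    rx = socket(AF_INET, SOCK_DGRAM, 0);
    tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0)
        goto done;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &addrlen) < 0 ||
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout)) < 0)
        goto done;

    /* Two user buffers described by one I/O vector. */
    iov[0].iov_base = (void *)part1;
    iov[0].iov_len = strlen(part1);
    iov[1].iov_base = (void *)part2;
    iov[1].iov_len = strlen(part2);

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &addr;            /* destination address */
    msg.msg_namelen = sizeof(addr);
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    if (sendmsg(tx, &msg, 0) < 0)
        goto done;

    n = recvfrom(rx, out, outlen, 0, NULL, NULL);
done:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```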

tcp_sendmsg checks whether there is space available in the previously allocated buffers. If so, it writes the user data there; otherwise it requests a new buffer for the write. In short, the function tries to copy user data into the available socket buffers and allocates a new one only when none has room.
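
The fill-then-allocate behaviour can be modelled in user space with a linked list of fixed-size buffers. This is only a sketch of the idea, not the kernel's code: struct model_buf, struct model_write_queue, the deliberately tiny MODEL_BUF_SIZE and both functions are hypothetical, and real sk_buff handling is far more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_BUF_SIZE 8  /* assumed tiny capacity so the behaviour is easy to see */

struct model_buf {
    struct model_buf *next;
    size_t used;
    char data[MODEL_BUF_SIZE];
};

struct model_write_queue {
    struct model_buf *head;
    struct model_buf *tail;
    size_t nbufs;
};

/* Copy `len` bytes into the queue: use free space in the tail buffer first
 * and allocate a new buffer only when the tail is full or missing.
 * Returns the number of bytes queued (short if an allocation fails). */
static size_t model_queue_write(struct model_write_queue *q,
                                const char *src, size_t len)
{
    size_t copied = 0;

    assert(q != NULL && (src != NULL || len == 0));
    while (copied < len) {
        struct model_buf *b = q->tail;
        size_t room, chunk;

        if (b == NULL || b->used == MODEL_BUF_SIZE) {
            b = calloc(1, sizeof(*b));
            if (b == NULL)
                break;
            if (q->tail != NULL)
                q->tail->next = b;
            else
                q->head = b;
            q->tail = b;
            q->nbufs++;
        }
        room = MODEL_BUF_SIZE - b->used;
        chunk = (len - copied < room) ? len - copied : room;
        memcpy(b->data + b->used, src + copied, chunk);
        b->used += chunk;
        copied += chunk;
    }
    return copied;
}

/* Release every buffer and reset the queue. */
static void model_queue_free(struct model_write_queue *q)
{
    struct model_buf *b = q->head;

    while (b != NULL) {
        struct model_buf *next = b->next;
        free(b);
        b = next;
    }
    q->head = q->tail = NULL;
    q->nbufs = 0;
}
```

Writing 5 bytes and then 6 bytes into an empty queue fills the first buffer to its 8-byte capacity and places the remaining 3 bytes in a second, newly allocated buffer.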

Used Materials

https://pdfs.semanticscholar.org/53bd/0df24f43f4edb76e53d968cc1e80c06917a7.pdf
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data
https://wiki.linuxfoundation.org/networking/kernel_flow
http://rbeyah.ece.gatech.edu/classes/spring2012/ece4110/handouts/Lab9_modified.pdf
http://web.engr.illinois.edu/~caesar/courses/CS598.S11/slides/raoul_kernel_slides.pdf
http://wiki.openwrt.org/doc/networking/praxis
http://www.coverfire.com/articles/queueing-in-the-linux-network-stack/