Tuesday 23 March 2010

Linux networking secrets

Various aspects of how Linux's networking functions operate are not well documented. Here are a few I've come across:
  • Kernel DST cache: The kernel caches congestion-control-related information (e.g. ssthresh, rtt, mtu, cwnd) about every TCP connection and stores it in the 'DST' (destination) cache after each connection closes. Entries are kept only for a limited time and are reused for new connections to the same destination. To list/show/display the contents of the DST cache:
    ip -s route show table cache
    You can flush the DST cache using:
    ip route flush table cache
    Or you can disable the DST caching using:
    sysctl -w net.ipv4.tcp_no_metrics_save=1
  • Network interface queue size/length monitoring: Everyone goes on about tweaking the transmit queue (txqueuelen) size to get the best result out of your box, but no-one tells you how to monitor/check/display/show how much of the queue is actually being used. To check actual queue utilisation/use/occupancy, see the 'backlog' (in bytes and packets) as shown by tc (e.g. for eth0):
    tc -s -d qdisc ls dev eth0
  • Default Linux queue: Amazingly enough, the default queue on Linux (pfifo_fast) inspects the TOS (Type of Service) bits in the IP header and does priority queuing based upon them. Not sure who, if anyone, is using that! Anyway, you can check the priority mappings (priomap) using the tc command above.
  • Detailed socket stats: For info (congestion control algo, cwnd, window scale, rto etc.) on open sockets (TCP, UDP, RAW, PACKET) on your machine you can use the socket stats app, ss, which uses an INET_DIAG netlink socket to obtain info directly from the running kernel, e.g.
    (args: t: TCP sockets, i: internal TCP info, m: memory, e: extended, dport: filters on destination port):
    ss -time dport = :5001
    Memory stats: from net/ipv4/inet_diag.c +140
    rmem = sk->sk_rmem_alloc; wmem = sk->sk_wmem_queued; fmem = sk->sk_forward_alloc; tmem = sk->sk_wmem_alloc;
  • Linux TCP stats: The tcp_probe module may be used to obtain cwnd and sequence numbers of packets in TCP flows.
  • TCP window size: The TCP receive (or send) window size is based on the sysctl net.ipv4.tcp_rmem (or tcp_wmem), whose 3 values are min, default, and max. For a positive tcp_adv_win_scale the actual window relates to these numbers by tcp win = tcp_rmem - tcp_rmem/2^tcp_adv_win_scale, e.g. the system default of 87380 with tcp_adv_win_scale=2 results in a TCP win of 65535. The max window is globally limited by net.core.rmem_max. For more details see linux-2.6/Documentation/networking/ip-sysctl.txt and man tcp.
  • TCP/Generic Segmentation Offload (TSO/GSO): To find out whether your machine's network interface has these enabled, use ethtool. TSO is quite common these days - it basically offloads the packetisation of TCP segments onto the network card (see my other post where it can lead to confusion). The ethtool command (which may need installing) also tells you about a whole host of other features that your NIC may or may not support:
    ethtool -k eth0
  • Traffic Control (TC): I've touched on its use above but I've another post on setting filters with tc.
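As a quick check of the window formula in the tcp_rmem item above, here is a minimal shell sketch. The function name tcp_win is made up for illustration, and it assumes a positive tcp_adv_win_scale (the formula differs for zero or negative values):

```shell
# tcp_win BYTES SCALE (hypothetical helper, not a kernel or iproute2 tool):
# window = BYTES - BYTES / 2^SCALE, assuming SCALE (tcp_adv_win_scale) > 0.
tcp_win() {
    bytes=$1
    scale=$2
    echo $(( bytes - bytes / (1 << scale) ))
}

# The example from the text: default buffer of 87380 with tcp_adv_win_scale=2
tcp_win 87380 2   # prints 65535
```

On a live box the two inputs could be read with sysctl -n net.ipv4.tcp_rmem (second field is the default) and sysctl -n net.ipv4.tcp_adv_win_scale; the values you see will depend on your kernel version.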
There are other places that have good info on Linux networking, such as: the Linux Foundation's Kernel networking pages and the Linux network stack walkthrough.

[Updated:9jul15, 15jan15:DST cache, 1dec11]