Raw IP sockets: a tale of two endians

Background

When it comes to networking, there are many ways to send and receive data programmatically: TCP/UDP sockets, TUN/TAP interfaces, raw IP sockets, and so on. Each socket or interface type operates at a different layer of the OSI model, and in Go each one involves using the net package, a third-party package, or a package of one’s own. For our inaugural blog post, we at PacketStories wanted to share a discovery we made while writing Go and using the supplementary golang.org/x/net/ipv4 package: a tale of raw IP sockets, endianness, and a sprinkling of kernel history.

A few weeks ago, we needed to craft our own Ethernet frames to write to a TAP interface. For the IP header part of the frame, we instinctively reached for the ipv4.Header type and its Marshal() method:

hdr := ipv4.Header{…}
hdrBuff, err := hdr.Marshal()
if err != nil {
	t.Fatalf("unable to marshal test header: %v", err)
}

When using the marshaled values in tests, we noticed that some fields of our packets did not contain the expected values. Digging into the golang.org/x/net/ipv4 package code, we realized that the ipv4.Header Marshal() function should only be used in the context of a raw IP socket. Additionally, the result is not guaranteed to match the wire format! It might seem peculiar that the raw IP socket format is not the same as the wire format, but for historical reasons this is the case.

Take a look at the ipv4.Header Marshal() function below:

Image of ipv4 Header code source

Note three key points:

The comment of the function clearly indicates that the returned byte slice is in the format of the raw IP socket and may differ from the wire format.
The switch statement masks the fact that the raw IP format might differ from one operating system to another. This has the interesting side effect that a unit test checking marshaled IP header values can pass on Linux and fail on OSX, or vice versa. (This has happened to us on more than one occasion.)
The switch statement chooses between native and big endian byte order for three very specific fields: total length, flags, and fragment offset.
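The third point can be approximated with the standard library alone. The following is a minimal sketch and not the package's actual code: the helper names fieldOrder and putLengthAndFrag are hypothetical, the host byte order is passed in as a flag rather than detected, and only "darwin" is treated as a host-order system, which is a simplification of the real switch.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// fieldOrder is a hypothetical helper. Given a GOOS-style name and whether
// the host CPU is little endian, it returns the byte order to use for the
// total length and flags/fragment offset fields. Only "darwin" is treated
// as host-order here; this is an assumption made to keep the sketch short.
func fieldOrder(goos string, hostLittleEndian bool) binary.ByteOrder {
	switch goos {
	case "darwin":
		if hostLittleEndian {
			return binary.LittleEndian
		}
		return binary.BigEndian
	default:
		return binary.BigEndian
	}
}

// putLengthAndFrag writes the three order-sensitive fields into an IPv4
// header buffer of at least 20 bytes. The flags occupy the top 3 bits and
// the fragment offset the lower 13 bits of the 16-bit word at bytes 6-7.
func putLengthAndFrag(b []byte, order binary.ByteOrder, totalLen int, flags, fragOff uint16) {
	order.PutUint16(b[2:4], uint16(totalLen))
	order.PutUint16(b[6:8], (flags<<13)|(fragOff&0x1fff))
}

func main() {
	b := make([]byte, 20)

	// Total length 40, "don't fragment" flag (0x2), fragment offset 0,
	// assuming a little endian host.
	putLengthAndFrag(b, fieldOrder("darwin", true), 40, 0x2, 0)
	fmt.Printf("darwin: % x % x\n", b[2:4], b[6:8])

	putLengthAndFrag(b, fieldOrder("linux", true), 40, 0x2, 0)
	fmt.Printf("linux:  % x % x\n", b[2:4], b[6:8])
}
```

With a little endian host, the "darwin" line prints 28 00 00 40 and the "linux" line prints 00 28 40 00: the same values, but with the two bytes of each field swapped. This is exactly the kind of difference that makes a byte-for-byte test assertion pass on one system and fail on another.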

To fully understand this, we need to review raw IP sockets and endianness.

Raw IP sockets

The comment for ipv4.Header’s Marshal function indicates that a byte slice is being generated for the format used by a raw IP socket, a special type of socket providing access to the IP layer. When reading from this type of socket, one will be able to receive both the payload and the IP header. When writing to the raw IP socket, although the IP header will be added by default by the kernel, the option IP_HDRINCL can be used to indicate that the application will include the IP header along with the payload. This type of socket provides control over the IP header fields and the protocol field. Note that this is in contrast to a layer 4 TCP or UDP socket, which only provides access to the packet payload.

Endianness

Now consider another important computer science concept - endianness or byte order.

Big endian stores the most significant byte first, so the bytes read left to right in the same order as the digits of a number written in English:

Image of big endian

Little endian stores the least significant byte first, so the bytes appear in the reverse order:

Image of little endian source
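To make the two orderings concrete, here is a small sketch using Go's encoding/binary package. The helper name encode16 and the sample value 0x1234 are ours, chosen purely for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encode16 returns the two-byte big endian and little endian encodings of v.
func encode16(v uint16) (big, little [2]byte) {
	binary.BigEndian.PutUint16(big[:], v)
	binary.LittleEndian.PutUint16(little[:], v)
	return big, little
}

func main() {
	big, little := encode16(0x1234)
	fmt.Printf("big endian:    % x\n", big[:])    // 12 34
	fmt.Printf("little endian: % x\n", little[:]) // 34 12
}
```

The value is the same in both cases; only the order in which its two bytes are laid out differs.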

Network byte order - how packets are transmitted on the wire - is always big endian. Host byte order, however, depends on the CPU architecture rather than on the operating system: x86, for example, is little endian. But how does this affect you as a programmer?

Consider the following cases:

When it comes to data read from or written to disk, a file of a known format generally has a known byte order (which isn’t necessarily the host byte order).
When writing a network application, one can generally assume the data received from the network to be in network byte order.

Returning to the raw IP socket example above, if data received from the network is usually in big endian, then why are some fields of data read from the raw IP socket possibly in host byte order? The ipv4.Header Marshal() function indicates that this depends on the operating system.

Time for some kernel history.

A Tale of Two Kernels

First consider an IPv4 packet header. (More details regarding the fields themselves are here.)

Image of IPv4 header

It turns out that while network byte order is big endian, the networking stacks of some operating systems - namely those in the BSD ecosystem - flipped the total length, flags and fragment offset fields of the IPv4 header from network byte order to host byte order. When using raw IP sockets, the data was then transferred from kernelspace to userspace without flipping those fields back to network byte order. Likewise, the same header fields in data sent from userspace to kernelspace (once again for raw IP sockets) were expected to be in host byte order.
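The flip described above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not kernel code: the function names hostOrder, wireToRaw and rawToWire are hypothetical, the sample header is made up, and its checksum is simply left at zero.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// hostOrder reports the byte order of the CPU this program runs on by
// checking which byte of a known 16-bit value is stored first in memory.
func hostOrder() binary.ByteOrder {
	v := uint16(0x0102)
	if *(*byte)(unsafe.Pointer(&v)) == 0x01 {
		return binary.BigEndian
	}
	return binary.LittleEndian
}

// convertFields re-encodes, in place, the total length (bytes 2-3) and the
// flags/fragment offset word (bytes 6-7) of an IPv4 header.
func convertFields(hdr []byte, from, to binary.ByteOrder) {
	for _, off := range []int{2, 6} {
		to.PutUint16(hdr[off:off+2], from.Uint16(hdr[off:off+2]))
	}
}

// wireToRaw converts a wire-format header to the historical BSD raw socket
// layout, where the affected fields are in host byte order.
func wireToRaw(hdr []byte, host binary.ByteOrder) {
	convertFields(hdr, binary.BigEndian, host)
}

// rawToWire is the inverse of wireToRaw.
func rawToWire(hdr []byte, host binary.ByteOrder) {
	convertFields(hdr, host, binary.BigEndian)
}

func main() {
	hdr := []byte{
		0x45, 0x00, 0x00, 0x28, // version/IHL, TOS, total length = 40
		0x00, 0x01, 0x40, 0x00, // ID, flags (don't fragment) + fragment offset
		0x40, 0x06, 0x00, 0x00, // TTL, protocol (TCP), checksum (left zero here)
		0x0a, 0x00, 0x00, 0x01, // source 10.0.0.1
		0x0a, 0x00, 0x00, 0x02, // destination 10.0.0.2
	}
	host := hostOrder()

	wireToRaw(hdr, host)
	fmt.Printf("host-order fields: % x % x\n", hdr[2:4], hdr[6:8])

	rawToWire(hdr, host)
	fmt.Printf("wire-order fields: % x % x\n", hdr[2:4], hdr[6:8])
}
```

On a big endian host both conversions leave the bytes unchanged, which is why this behavior is easy to miss until the code runs on a little endian machine. Note that only the two 16-bit words at offsets 2 and 6 are touched; every other field, including the ID, stays in network byte order.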

As explained on the FreeBSD wiki, “historically in BSD the raw sockets were not truly raw” (https://wiki.freebsd.org/SOCK_RAW). Digging further in the FreeBSD mailing list archives, we discovered the following explanation:

“This is an artefact from the fact that the kernel stack modified every packet at the very beginning of its processing, and worked with it in this form. It just didn't bother to convert it before passing to raw socket, so raw socket wasn't truly raw. This is actually a bug.”

This design decision was eventually addressed in both FreeBSD and OpenBSD. Here is a small snippet of the git diff in FreeBSD:

Image of FreeBSD git diff

However, this characteristic is still present in OSX, which is why the ipv4.Header Marshal() function in the golang.org/x/net/ipv4 package needs its switch statement: the kernel on “darwin” machines still flips the byte order of total length, flags, and fragment offset from network to host byte order when data is transferred via raw IP sockets from kernelspace to userspace. Conversely, data written from userspace programs to raw IP sockets should have the aforementioned fields in host byte order.

Note, however, that this applies only to reading and writing data through raw IP sockets! When writing layer 2 frames to a TAP interface, there is no such byte order reversal. When writing layer 4 TCP or UDP packets, we also aren’t concerned about the layer 3 IPv4 header.

The Conclusion

If there is a single conclusion to draw from this post, it is that the raw IP socket isn’t truly “raw”. For a very long time in the BSD ecosystem, three key fields of the IPv4 header were kept in host byte order rather than network byte order. And though FreeBSD and OpenBSD eventually addressed this, it continues to be a problem on OSX. Whereas Go hides the discrepancies of the raw IP socket between different operating systems, C does no such thing. If you happen to write raw IP socket code in C, you need to take the byte order of these key fields into consideration. And if you happen to use the golang.org/x/net/ipv4 package for testing, be forewarned: the output of Marshal() can differ depending on the operating system the tests run on.