Python Bits and Bytes

Posted at — May 19, 2018

In C, handling binary data such as network packets feels ~~almost~~ like a core part of the language. In Python, on the other hand, a number of supporting library functions are needed to do the same job.

As I only occasionally use Python for this purpose, I’ve written up the below as a reference for myself. All of these examples target Python 3.

Base Conversions

Python has three built-in functions that are handy for base conversions: int(), hex() and bin(). Note that hex() and bin() both return strings.

Considering the example where x = 42:

int(x) gives 42
hex(x) gives '0x2a'
bin(x) gives '0b101010'
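As a quick sketch of the same calls in one runnable snippet (oct() and the prefix slicing are my additions, not part of the list above):

```python
x = 42

as_hex = hex(x)   # '0x2a' (a str, not an int)
as_bin = bin(x)   # '0b101010'
as_oct = oct(x)   # '0o52', the octal sibling of hex() and bin()

# Integer literals can be written in any of these bases
all_equal = 0x2a == 0b101010 == 0o52 == 42

# The two-character prefix can be sliced off if it is not wanted
bare_hex = as_hex[2:]

print(as_hex, as_bin, as_oct, all_equal, bare_hex)
```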

Alternatively, we can get slightly more control over the output by using the str.format() method and its format syntax.

For example, the following outputs zero-padded binary numbers to a width of 8:

"{0:08b}".format(x) produces '00101010'

If the value you wish to convert is a string, the int() function can be used to convert it to an integer first. This requires passing both the string and its base as arguments to int().

In the case where x = "0x2a":

int(x,16) gives 42
bin(int(x,16)) gives '0b101010'
"{0:08b}".format(int(x,16)) gives '00101010'

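The same format specification also works in f-strings (Python 3.6+) and with the built-in format() function. The snippet below is only a sketch; passing a base of 0 to int() is an extra I find handy, since it infers the base from the string's prefix:

```python
x = 42

# f-strings accept the same format spec as str.format()
padded_bin = f"{x:08b}"     # zero-padded binary, width 8
padded_hex = f"{x:04x}"     # zero-padded hex, width 4
prefixed_hex = f"{x:#x}"    # '#' adds the 0x prefix

# format() takes the value and the spec as separate arguments
same_bin = format(x, "08b")

# With base 0, int() works out the base from the 0x / 0b / 0o prefix
from_hex = int("0x2a", 0)
from_bin = int("0b101010", 0)

print(padded_bin, padded_hex, prefixed_hex, same_bin, from_hex, from_bin)
```
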
Unicode Code Points

The ord() built-in function returns the integer value (the Unicode code point) of a given character. For example, examining the “straight” ASCII apostrophe and the “curly” opening version:

>>> ord("'")
39
>>> ord("‘")
8216

The chr() function performs the inverse of ord(). It returns the one-character string for an integer code point. If you wanted the rocket symbol you could issue:

>>> chr(0x1F680)
'🚀'
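To see the two functions working together, here is a small sketch that lists the code points of a string in the usual U+XXXX notation and then rebuilds the string. The sample text is arbitrary, and is written with escapes so the snippet does not depend on the source file's encoding:

```python
# "It’s 🚀", written with escapes for the curly apostrophe and the rocket
text = "It\u2019s \U0001F680"

# ord() on each character, formatted as upper-case hex with at least 4 digits
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

# chr() undoes ord(), so the round trip gives back the original string
round_trip = "".join(chr(ord(ch)) for ch in text)
print(round_trip == text)
```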

bytes and bytearray

Binary values can be stored in a bytes object. This object is immutable, and each of its elements is a raw value in the range 0 to 255. Its constructor is the aptly named bytes(). There are several different ways to initialise a bytes object:

>>> bytes((1,2,3))
b'\x01\x02\x03'

>>> bytes("hello", "ascii")
b'hello'
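A few more ways of building and poking at a bytes object, as a rough sketch (the variable names are just for illustration):

```python
zeros = bytes(4)                              # an int gives that many zero bytes
from_list = bytes([0xDE, 0xAD, 0xBE, 0xEF])   # any iterable of ints in 0..255
from_hex = bytes.fromhex("deadbeef")          # from a string of hex digits

# Indexing gives an int, slicing gives another bytes object
first = from_list[0]
head = from_list[:2]

# bytes is immutable, so item assignment raises TypeError
try:
    from_list[0] = 0
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Values outside 0..255 are rejected with ValueError
try:
    bytes([256])
    out_of_range_accepted = True
except ValueError:
    out_of_range_accepted = False

print(zeros, from_list, first, head, assignment_worked, out_of_range_accepted)
```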

The bytearray object serves the same purpose as bytes but is mutable, allowing elements in the array to be modified. It has the constructor bytearray().

>>> x = bytearray("hello.", "ascii")
>>> x
bytearray(b'hello.')

>>> x[5] = ord("!")
>>> x
bytearray(b'hello!')
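Beyond single-element assignment, a bytearray supports the usual mutable-sequence operations. A short sketch of the ones I reach for most often:

```python
buf = bytearray(b"hello")

buf.append(ord("!"))      # append one byte, given as an int
buf.extend(b" world")     # extend with any bytes-like object
buf[0:5] = b"HELLO"       # slice assignment replaces a run of bytes

# bytes() takes an immutable snapshot of the current contents
frozen = bytes(buf)

print(buf)
print(frozen)
```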

Byte literals and ASCII Conversions

A bytes literal can be specified using the b or B prefix, e.g. b"bytes literal".

Comparing this with a standard string:

type("string literal") gives <class 'str'>
type(b"bytes literal") gives <class 'bytes'>

Non-ASCII bytes can be inserted using the "\xHH" escape sequence. This places the single byte with hexadecimal value 0xHH into the literal, e.g. b"The NULL terminator is \x00".

The str object has an encode() method to return the bytes representation of the string. Similarly, the bytes object has a decode() method to return the str representation of the data:

"string to bytes".encode("ascii") gives b'string to bytes'
b"bytes to string".decode("ascii") gives 'bytes to string'
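The ASCII codec only covers code points 0 to 127, so the choice of encoding matters as soon as other characters turn up. The sketch below uses an arbitrary sample word and shows what happens with a few common codecs:

```python
text = "caf\u00e9"  # 'café', with the é written as an escape

utf8 = text.encode("utf-8")       # é becomes two bytes
latin1 = text.encode("latin-1")   # é becomes one byte

# ASCII cannot represent é, so a strict encode raises UnicodeEncodeError
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# errors="replace" substitutes '?' instead of raising
replaced = text.encode("ascii", errors="replace")

# Decoding with the matching codec restores the original string
decoded = utf8.decode("utf-8")

print(utf8, latin1, ascii_ok, replaced, decoded)
```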

Hex Stream

The hexadecimal string representation of a single byte requires two characters, so the hex representation of a bytes object is twice as long as the object itself.

To convert from bytes to a hex representation use binascii.hexlify() and from hex to bytes binascii.unhexlify().

For example, where x = b"hello"

binascii.hexlify(x) gives b'68656c6c6f'
binascii.hexlify(x).decode() gives '68656c6c6f'

The reverse process, if y = "68656c6c6f"

binascii.unhexlify(y.encode()) gives b'hello'
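As an alternative that avoids the import, bytes objects have their own hex() method (Python 3.5+) and a fromhex() class method. A brief sketch:

```python
x = b"hello"

hex_str = x.hex()               # returns a str directly, no decode() needed
back = bytes.fromhex(hex_str)   # str of hex digits -> bytes

# fromhex() also tolerates spaces between the byte pairs
spaced = bytes.fromhex("68 65 6c 6c 6f")

print(hex_str, back, spaced)
```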

Structures / Packets

The struct module provides a way to convert data to/from C structs (or network data).

The key functions in this module are struct.pack() and struct.unpack(). In addition to the data, these functions require a format string to be provided to specify the byte order and the intended binary layout of the data.
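Before moving to a full header, here is a minimal sketch using a made-up 4-byte record (a 16-bit type followed by a 16-bit length; the record layout is purely hypothetical):

```python
import struct

# "!" selects network byte order, "H" is an unsigned 16-bit integer
record = struct.pack("!HH", 1, 512)

# unpack() always returns a tuple, even for a single field
msg_type, length = struct.unpack("!HH", record)

# The buffer must be exactly the size the format describes
try:
    struct.unpack("!HH", b"\x00")
    short_buffer_ok = True
except struct.error:
    short_buffer_ok = False

print(record, msg_type, length, short_buffer_ok)
```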

Consider an IPv4 header. This structure contains some fields that are shorter than a byte (octet), e.g. the version field is 4 bits wide (aka a nibble). The smallest data unit struct can handle is a byte, so these fields must be packed and unpacked as part of a larger unit and then separated via bit shifting and masking.

IPv4 Field	Format Character
Version and IHL	B
Type of Service	B
Total Length	H
Identification	H
Flags and Fragmentation Offset	H
Time to Live	B
Protocol	B
Header Checksum	H
Source Address	L
Destination Address	L

As this data should be in network byte order, we need to specify this with an exclamation mark, !. The format string which represents an IPv4 header is therefore: !BBHHHBBHLL.
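A quick sanity check on a format string is struct.calcsize(), which reports how many bytes it describes. The sketch below also shows what the byte-order prefix does to a single 16-bit value (the value 0x1234 is arbitrary):

```python
import struct

ipv4_fmt = "!BBHHHBBHLL"

# With "!" the standard sizes apply and no padding is added,
# so this matches the 20-byte IPv4 header without options
header_size = struct.calcsize(ipv4_fmt)

# Same value, two byte orders
big = struct.pack("!H", 0x1234)      # network order, most significant byte first
little = struct.pack("<H", 0x1234)   # little-endian, least significant byte first

print(header_size, big.hex(), little.hex())
```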

Below is an example of packing IPv4 fields into a bytes object and hex stream:

import struct
import binascii

fmt_string = "!BBHHHBBHLL"

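# Version 4 in the high nibble, IHL 4 in the low nibble
# (a real header without options would carry an IHL of 5)
version_ihl = (4 << 4) | 4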
tos = 0
total_length = 100
identification = 42
flags = 0
ttl = 32
protocol = 6
checksum = 0xabcd
s_addr = 0x0a0b0c0d
d_addr = 0x01010101

ip_header = struct.pack(fmt_string,
                        version_ihl,
                        tos,
                        total_length,
                        identification,
                        flags,
                        ttl,
                        protocol,
                        checksum,
                        s_addr,
                        d_addr)

print(ip_header)
print(binascii.hexlify(ip_header).decode())

The output of this is:

b'D\x00\x00d\x00*\x00\x00 \x06\xab\xcd\n\x0b\x0c\r\x01\x01\x01\x01'

44000064002a00002006abcd0a0b0c0d01010101

The struct.unpack() function can reverse this process:

ip_header_fields = struct.unpack(fmt_string, ip_header)
print(ip_header_fields)

The unpacked data is a tuple of the individual fields:

(68, 0, 100, 42, 0, 32, 6, 43981, 168496141, 16843009)
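The sub-byte fields still need to be separated by hand. The following sketch starts from the hex stream printed earlier and pulls the version and IHL nibbles apart with a shift and a mask. Using the ipaddress module to print the addresses is my own addition, as are the variable names:

```python
import ipaddress
import struct

# The packed header from the example above, rebuilt from its hex stream
ip_header = bytes.fromhex("44000064002a00002006abcd0a0b0c0d01010101")
fields = struct.unpack("!BBHHHBBHLL", ip_header)

# First byte: version in the high nibble, IHL in the low nibble
version_ihl = fields[0]
version = version_ihl >> 4
ihl = version_ihl & 0x0F

# Fifth field: 3 bits of flags, then a 13-bit fragment offset
flags_frag = fields[4]
flags = flags_frag >> 13
frag_offset = flags_frag & 0x1FFF

# IPv4Address accepts the 32-bit integer form directly
src = str(ipaddress.IPv4Address(fields[8]))
dst = str(ipaddress.IPv4Address(fields[9]))

print(version, ihl, flags, frag_offset, src, dst)
```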
