25 May 2021
Even though C is a very low-level language, serializing and deserializing binary data in it is dead simple. That's because C is weakly typed. In fact, it started out as a simple non-optimizing frontend for assembly, and nearly all of its constructs have straightforward analogues in assembly.
So if you have a handle to a memory location in your C program, you can just cast it to a pointer to a struct (a construct defining the structure of a region of memory), and start using the fields of that struct:
#include <stdint.h>

typedef int32_t i32;

// struct definition
typedef struct {
    i32 x;
    i32 y;
    i32 z;
} Point;

// get the mean z coordinate of the first `n` points in the data blob pointed to by `ptr`
float avg_z(void* ptr, i32 n) {
    Point* ps = (Point*) ptr;
    i32 sum = 0;
    for (i32 i = 0; i < n; i++) {
        sum += ps[i].z;
    }
    return ((float) sum) / n;
}
parse_struct is a Clojure library that allows you to deserialize and serialize binary data using an API as straightforward as the one you would use in C. This page serves as a guide to the library.
I will start by translating the above C program to parse_struct.
Defining the format of your data is the first thing you should do when parsing data with parse_struct. The definition of the Point type from the C example looks like this:
(ns fctorial.demo
  (:require [parse_struct.common_types :as ct]
            [parse_struct.core :refer :all]))

(def Point_t {:type :struct
              :definition [[:x ct/i32]
                           [:y ct/i32]
                           [:z ct/i32]]})

(defn Point_Array_t [n]
  {:type :array
   :element Point_t
   :len n})
parse_struct.common_types contains all the fundamental data types: 1, 2, 4 and 8 byte integers (little and big endian, signed and unsigned), 4 and 8 byte floats (little and big endian), and padding. You can combine them using :struct and :array definitions to form more complex data types.
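To make the endianness distinction concrete, here is a minimal C sketch of what decoding a 4 byte integer in each byte order involves. The function names (read_u32le, read_u32be, read_i32le) are hypothetical and are not part of parse_struct; they only illustrate the byte layouts that the little and big endian types describe.

```c
#include <stdint.h>
#include <string.h>

/* Decode a 4-byte unsigned integer stored least-significant byte first. */
uint32_t read_u32le(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Decode a 4-byte unsigned integer stored most-significant byte first. */
uint32_t read_u32be(const uint8_t *p) {
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}

/* Signed variant: same bytes, bit pattern reinterpreted as int32_t. */
int32_t read_i32le(const uint8_t *p) {
    uint32_t u = read_u32le(p);
    int32_t s;
    memcpy(&s, &u, sizeof s);  /* copy the bits instead of converting the value */
    return s;
}
```

The same four bytes 78 56 34 12 decode to 0x12345678 when read as little endian and to 0x78563412 when read as big endian, which is why the type definition has to state the byte order explicitly.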
You perform the parsing operation using the deserialize function in parse_struct.core:
(defn avg_z [ptr n]
  (let [points (deserialize (Point_Array_t n)
                            ptr)]
    (/ (reduce (fn [res nxt]
                 (+ res (nxt :z)))
               0
               points)
       n)))
The first argument to deserialize is a type definition. The second argument is a sequence of bytes. The performance of deserialize depends on the kind of byte sequence it is given: byte arrays perform best and seqs perform worst.
parse_struct also comes with a class, ROVec, a Clojure-friendly sequence type that performs as fast as a byte array.
Let's now write a program that extracts the list of symbols from an ELF file. I will target only the ELF64 little endian format, but writing a program that targets all the formats is not too difficult.
The complete code can be found in the master branch of above linked repo (fctorial.demo namespace).
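This is the path we'll follow to find the symbols list: the ELF identification bytes, then the ELF header, then the section headers, then the symbol table section and the symbol names section it links to, and finally the symbol entries themselves.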
We will start by defining the aliases used by the ELF64 specification:
(ns fctorial.demo
  (:require [parse_struct.core :refer :all]
            [parse_struct.common_types :refer :all]
            [clojure.pprint :refer [pprint]]
            [fctorial.utils :refer :all]
            [fctorial.data :refer [obj]])
  (:import (clojure.lang ROVec MMap)))

(def ElfAddr u64)
(def ElfHalf u16)
(def ElfOff u64)
(def ElfWord u32)
(def ElfXword u64)
fctorial.data.obj is a ROVec containing a simple object file (compiled with gcc -c t.c -o data/t.o).
We will start by reading the ELF identification bytes and verifying that the file is an ELF64 little endian file:
(def magic_t {:type :struct
              :definition [[:ident {:type :string
                                    :bytes 4}]
                           [:class (assoc i8 :adapter {1 :32 2 :64})]
                           [:data (assoc i8 :adapter {1 :LE 2 :BE})]
                           [:version i8]]})

(def magic (deserialize magic_t obj))

(assert (= (magic :class) :64))
(assert (= (magic :data) :LE))
(assert (= (magic :ident) "\u007FELF"))
Here we see the :adapter feature of parse_struct in action. Each type is a Clojure map that can optionally have an entry named :adapter. Its value must be a function; it is applied to the parsed value, and its result is returned instead of the original value. Here we use plain maps as adapters (Clojure maps are functions of their keys) to turn integers into keywords, which are easier to work with.
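For comparison, the same identification check written by hand in C might look like the sketch below. The enum and function names are hypothetical. The switch statements play the role of the adapters: they translate the raw class byte (offset 4) and data byte (offset 5) into named values.

```c
#include <stdint.h>
#include <string.h>

typedef enum { ELF_CLASS_UNKNOWN, ELF_CLASS_32, ELF_CLASS_64 } ElfClass;
typedef enum { ELF_DATA_UNKNOWN, ELF_DATA_LE, ELF_DATA_BE } ElfData;

/* Return 1 if the buffer starts with the magic bytes 0x7F 'E' 'L' 'F'. */
int has_elf_magic(const uint8_t *ident) {
    static const uint8_t magic[4] = {0x7F, 'E', 'L', 'F'};
    return memcmp(ident, magic, sizeof magic) == 0;
}

/* Translate the raw class byte into a named value. */
ElfClass elf_class(const uint8_t *ident) {
    switch (ident[4]) {
        case 1:  return ELF_CLASS_32;
        case 2:  return ELF_CLASS_64;
        default: return ELF_CLASS_UNKNOWN;
    }
}

/* Translate the raw data-encoding byte into a named value. */
ElfData elf_data(const uint8_t *ident) {
    switch (ident[5]) {
        case 1:  return ELF_DATA_LE;
        case 2:  return ELF_DATA_BE;
        default: return ELF_DATA_UNKNOWN;
    }
}
```

Unlike a map used as an adapter, which yields nil for a value it does not contain, this sketch returns an explicit UNKNOWN value for unexpected bytes.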
Now we parse the rest of the ELF header:
(def elf_header_t {:type :struct
                   :definition [(padding 24)
                                [:shoff ElfOff]       ; section header offset
                                (padding 10)
                                [:shentsize ElfHalf]  ; section header entry size
                                [:shnum ElfHalf]      ; section headers count
                                [:shstrndx (assoc ElfHalf :adapter int)]]})

(def elf_header (deserialize elf_header_t
                             (ROVec. obj 16)))
We are only interested in the section info, so we skip the rest of the data using the parse_struct.common_types.padding function. We are also using the ROVec. constructor to slice the original blob at byte number 16. The ROVec class has constructor overloads that can be used like the subvec function from the Clojure standard library to slice and dice the blob.
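The padding sizes translate directly into absolute file offsets: 16 identification bytes plus 24 skipped bytes puts the section header offset at byte 40, and skipping 10 more bytes after that 8 byte field puts the entry size at byte 58. A minimal C sketch that reads the same fields by absolute offset follows; SectionInfo, le and read_section_info are hypothetical names used only for illustration.

```c
#include <stdint.h>

typedef struct {
    uint64_t shoff;      /* section header table offset */
    uint16_t shentsize;  /* size of one section header  */
    uint16_t shnum;      /* number of section headers   */
    uint16_t shstrndx;   /* index of the section names section */
} SectionInfo;

/* Read an nbytes-wide little endian unsigned integer starting at p. */
static uint64_t le(const uint8_t *p, int nbytes) {
    uint64_t v = 0;
    for (int i = nbytes - 1; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* file must point at the first byte of an ELF64 little endian file
   and hold at least the 64 byte ELF header. */
SectionInfo read_section_info(const uint8_t *file) {
    SectionInfo s;
    s.shoff     = le(file + 40, 8);            /* 16 + 24 */
    s.shentsize = (uint16_t)le(file + 58, 2);  /* 40 + 8 + 10 */
    s.shnum     = (uint16_t)le(file + 60, 2);
    s.shstrndx  = (uint16_t)le(file + 62, 2);
    return s;
}
```

Reading byte by byte avoids depending on the host's own byte order or on struct alignment, at the cost of spelling out every offset.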
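Let's do a sanity check on the data we've extracted. In this file the section headers sit at the very tail, so the section header offset plus the total size of the headers should equal the file size (the ELF specification does not require this layout, but object files produced by gcc typically follow it):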
(assert (= (+ (elf_header :shoff)
              (* (elf_header :shentsize)
                 (elf_header :shnum)))
           (count obj)))
Now we know where the section headers are. Let's parse them:
(def sec_header_t {:type :struct
                   :definition [(padding 4)
                                [:type (assoc ElfWord :adapter #(get [:SectionType/NULL
                                                                      :SectionType/PROGBITS
                                                                      :SectionType/SYMTAB
                                                                      :SectionType/STRTAB
                                                                      :SectionType/RELA
                                                                      :SectionType/HASH
                                                                      :SectionType/DYNAMIC
                                                                      :SectionType/NOTE
                                                                      :SectionType/NOBITS
                                                                      :SectionType/REL
                                                                      :SectionType/SHLIB
                                                                      :SectionType/DYNSYM]
                                                                     %))]
                                (padding 16)
                                [:offset ElfOff]
                                [:size ElfXword]
                                [:link ElfWord]
                                (padding 20)]})

(def secs (deserialize {:type :array
                        :len (elf_header :shnum)
                        :element sec_header_t
                        :adapter vec}
                       (ROVec. obj (elf_header :shoff))))

(def symtab_header (first (filter #(= (% :type) :SectionType/SYMTAB) secs)))

; The link field of a symbol table header is the index of the symbol names section
(def symnames_header (secs (symtab_header :link)))

(def symnames (deserialize {:type :string
                            :bytes (symnames_header :size)}
                           (ROVec. obj (symnames_header :offset))))
Deserializing an array gives back a lazy seq. Adding an :adapter vec entry turns it into an eager, indexable vector.
The symbol names section is a blob of ASCII strings, each terminated by a null byte. Each symbol table entry contains an index into this blob that points to the start of its name. So we use the :string type to parse the whole blob at once (Java strings can contain any character, including the null character).
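The lookup itself is simple: start at the given index and read up to the next null byte. A minimal C sketch of that lookup, with bounds checks, is shown here; strtab_name is a hypothetical helper name and not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the NUL-terminated name that starts at byte `idx`
   of the string table, or NULL if `idx` is out of range or the name is
   not terminated inside the table. */
const char *strtab_name(const char *strtab, size_t size, size_t idx) {
    if (idx >= size) {
        return NULL;
    }
    if (memchr(strtab + idx, '\0', size - idx) == NULL) {
        return NULL;
    }
    return strtab + idx;
}
```

In C the returned pointer can be used as a string directly because the terminator is already in place. In Clojure the name has to be cut out of the larger string, which is what a substring call between the index and the next null character does.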
We can now parse the symbol table:
(def sym_t {:type :struct
            :definition [[:name (assoc ElfWord :adapter (fn [idx]
                                                          (.substring symnames
                                                                      idx
                                                                      (.indexOf symnames 0 idx))))]
                         (padding 2)
                         [:shndx ElfHalf]
                         [:value ElfAddr]
                         [:size ElfXword]]})

(def symbols (deserialize {:type :array
                           :len (/ (symtab_header :size)
                                   (type-size sym_t))
                           :element sym_t}
                          (ROVec. obj (symtab_header :offset))))
Once again, we are using an adapter, this time to attach the symbols to their names. The function type-size is also introduced. It takes a definition and returns the total size of that definition in bytes.
The result (symbols) will look something like this:
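The entry count is plain arithmetic: the size of the symbol table section divided by the size of one entry. In C the counterpart of type-size is sizeof. The sketch below mirrors the fields of sym_t, naming the two bytes that the Clojure definition skips as padding; Sym64 and symbol_count are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* One ELF64 symbol table entry: 4 + 1 + 1 + 2 + 8 + 8 = 24 bytes. */
typedef struct {
    uint32_t name;   /* index into the symbol names blob */
    uint8_t  info;   /* skipped as padding in sym_t */
    uint8_t  other;  /* skipped as padding in sym_t */
    uint16_t shndx;  /* index of the section the symbol belongs to */
    uint64_t value;
    uint64_t size;
} Sym64;

/* Number of entries in a symbol table section of the given byte size. */
size_t symbol_count(uint64_t symtab_size) {
    return (size_t)(symtab_size / sizeof(Sym64));
}
```

With 24 byte entries, a symbol table section of 576 bytes holds 24 symbols. The sketch assumes the compiler lays out the struct without extra padding, which holds on common ABIs because every field is naturally aligned.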
({:name "", :shndx 0, :value 0, :size 0}
{:name "t.c", :shndx 65521N, :value 0, :size 0}
{:name "", :shndx 1, :value 0, :size 0}
{:name "", :shndx 3, :value 0, :size 0}
{:name "", :shndx 4, :value 0, :size 0}
{:name "", :shndx 5, :value 0, :size 0}
{:name "count.1913", :shndx 4, :value 0, :size 4}
{:name "f", :shndx 1, :value 65, :size 7}
{:name "", :shndx 6, :value 0, :size 0}
{:name "", :shndx 8, :value 0, :size 0}
{:name "", :shndx 9, :value 0, :size 0}
{:name "", :shndx 11, :value 0, :size 0}
{:name "", :shndx 13, :value 0, :size 0}
{:name "", :shndx 15, :value 0, :size 0}
{:name "", :shndx 16, :value 0, :size 0}
{:name "", :shndx 14, :value 0, :size 0}
{:name "x", :shndx 5, :value 0, :size 4}
{:name "y", :shndx 3, :value 0, :size 4}
{:name "z", :shndx 5, :value 8, :size 8}
{:name "eho", :shndx 1, :value 0, :size 26}
{:name "rot", :shndx 1, :value 26, :size 23}
{:name "_GLOBAL_OFFSET_TABLE_", :shndx 0, :value 0, :size 0}
{:name "main", :shndx 1, :value 49, :size 16}
{:name "missing", :shndx 0, :value 0, :size 0})
parse_struct can also be used for generating binary data. The API is quite similar to deserialization. The function is parse_struct.core.serialize, and it takes two arguments: a type definition and a Clojure data structure that conforms to that definition:
(def spec {:type :array
:len 20
:element i32be})
(def data1 (range 20))
(def bs (serialize spec data1))
(def data2 (deserialize spec bs))
(assert (= data1 data2))
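For reference, the same round trip done by hand in C is sketched below for an array of 4 byte big endian signed integers. The function names serialize_i32be and deserialize_i32be are hypothetical and only illustrate what the :array of i32be definition describes at the byte level.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write n signed 32-bit integers to `out` (4 * n bytes), most significant byte first. */
void serialize_i32be(const int32_t *in, size_t n, uint8_t *out) {
    for (size_t i = 0; i < n; i++) {
        uint32_t u = (uint32_t)in[i];
        out[4 * i + 0] = (uint8_t)(u >> 24);
        out[4 * i + 1] = (uint8_t)(u >> 16);
        out[4 * i + 2] = (uint8_t)(u >> 8);
        out[4 * i + 3] = (uint8_t)u;
    }
}

/* Read n big endian signed 32-bit integers from `in` (4 * n bytes). */
void deserialize_i32be(const uint8_t *in, size_t n, int32_t *out) {
    for (size_t i = 0; i < n; i++) {
        uint32_t u = ((uint32_t)in[4 * i + 0] << 24)
                   | ((uint32_t)in[4 * i + 1] << 16)
                   | ((uint32_t)in[4 * i + 2] << 8)
                   | (uint32_t)in[4 * i + 3];
        memcpy(&out[i], &u, sizeof u);  /* reinterpret the bits as signed */
    }
}
```

Serializing the numbers 0 to 19 this way produces 80 bytes, and deserializing those bytes gives back the original numbers, which is the property the assertion above checks.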