Wednesday, 3 April 2013

Reading Binary Data Files Written in C: Python - Numpy Case Study

One of the major components of  scientific computation is producing results into a file. Usually bunch of numbers, usually structured, is written into so called data files. Writing  binary files as outputs is practised commonly. It is the choice due to good speed and/or memory efficiency compare to plain ASCII files. However, a care on documenting  endianness must be observed for good portability.

There are many popular languages to achieve the above task. For speed and efficiency reasons usually C or Fortran is used in writing out a binary file. Let's give an example of writing one integer (42) and three doubles (0.01, 1.01, 2.01) into binary file in C:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
#include <stdio.h>
#include <stdlin.h>

int main() {
  FILE * myF;
  int i,j;
  double *numbers, kk;

  myF     = fopen("my.bindata", "wb") ;
  numbers = malloc(3*sizeof(double));
  i = 42;
  fwrite(&amp;i, sizeof(int), 1, myF);
  for(j=0; j<3; j++) {
    kk = (double)j+1e-2;
    numbers[j] = kk;
  }
  fwrite(numbers, sizeof(double), 3, myF);
  fclose(myF);
  return(0);
}

This code would produce a binary file called my.bindata. Our aim is to read this into Python so we can post-process the results i.e. visualisation or further data analysis. The core idea is to use higher language in processing the outputs directly instead of writing further C code; so to speak avoiding one more step in our work flow and avoiding cumbersome compilation of extra C code.

In order to read from files byte by byte, the standard library of Python provides a module called struct. Basically this module provides packing and unpacking of data into or from binary sources, in this case study our source is a file. However it is tedious and error prone to use this in a custom binary file where format would contain different types. Well at least needs an effort to read our custom binary file. At this point, our friend is Numpy facilities. Specially two functionality;
numpy.dtype and numpy.fromfile. The former provides an easy way of defining our file's format similar to Fortran syntax via creation of a data type object as its name stands. The later is a direct way of reading the binary file in one go that would return us a Python object that contains the all information present in the data file.
Here is the Numpy code that reads our binary file created by the above C code.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
import numpy as np
dt      = np.dtype("i4, (3)f8")
myArray = np.fromfile('my.bindata', dtype=dt)
myArray[0] 
#(42, [0.01, 1.01, 2.01])
myArray[0][1] 
#array([ 0.01,  1.01,  2.01])
myArray[0][0] 
#42
myArray[0][1][1] 
#1.01
 
I have tested this case study on GNU/Linux PC, so the binary file is little-endian hence the writing and reading patterns.  Ideally a generic wrapper around this Python code would help to simplify things.

No comments:

(c) Copyright 2008-2015 Mehmet Suzen (suzen at acm dot org)

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License.