Several popular languages can achieve the above task. For speed and efficiency, C or Fortran is usually used to write binary files. Here is an example in C that writes one integer (42) and three doubles (0.01, 1.01, 2.01) to a binary file:
#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE * myF;
    int i,j;
    double *numbers, kk;
    myF = fopen("my.bindata", "wb") ;
    numbers = malloc(3*sizeof(double));
    i = 42;
    fwrite(&i, sizeof(int), 1, myF);   /* write the integer first */
    for(j=0; j<3; j++) {
        kk = (double)j+1e-2;
        numbers[j] = kk;
    }
    fwrite(numbers, sizeof(double), 3, myF);   /* then the three doubles */
    fclose(myF);
    free(numbers);
    return(0);
}
This code produces a binary file called my.bindata. Our aim is to read it into Python so we can post-process the results, e.g. for visualisation or further data analysis. The core idea is to process the output directly in a higher-level language instead of writing more C code, which removes one step from our workflow and avoids the cumbersome compilation of extra C code.
To read binary data from files, Python's standard library provides a module called struct. This module packs and unpacks data to and from binary sources; in this case study our source is a file. However, using it on a custom binary file whose format mixes different types is tedious and error-prone, or at the very least takes some effort.
At this point NumPy is our friend, with two facilities in particular: numpy.dtype and numpy.fromfile. The former provides an easy way of defining our file's format, in a syntax similar to Fortran's, by creating a data type object, as its name suggests. The latter reads the binary file in one go and returns an array that contains all the information present in the data file.
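For comparison, the struct route mentioned above might look roughly like the sketch below. It assumes the layout written by the C program (one 4-byte int followed by three 8-byte doubles, little-endian, no padding). The helper name read_record and the stand-in file written with struct.pack are illustrative assumptions, not part of the original code.

```python
import os
import struct
import tempfile

# Assumed layout, mirroring the C program: one 4-byte int followed by
# three 8-byte doubles. "<" means little-endian with no alignment padding.
RECORD_FORMAT = "<i3d"


def read_record(path):
    """Read one (int, [double, double, double]) record using struct."""
    with open(path, "rb") as f:
        raw = f.read(struct.calcsize(RECORD_FORMAT))
    count, *values = struct.unpack(RECORD_FORMAT, raw)
    return count, values


# Stand-in for my.bindata so the sketch runs without compiling the C code.
path = os.path.join(tempfile.mkdtemp(), "my.bindata")
with open(path, "wb") as f:
    f.write(struct.pack(RECORD_FORMAT, 42, 0.01, 1.01, 2.01))

count, values = read_record(path)
print(count, values)
```

The format string has to be kept in sync with the C code by hand, and every change to the record layout means touching both the format and the unpacking logic, which is the tedium referred to above.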
Here is the NumPy code that reads the binary file created by the C code above.
import numpy as np

# One record: a 4-byte integer followed by three 8-byte floats
dt = np.dtype("i4, (3)f8")
myArray = np.fromfile('my.bindata', dtype=dt)
myArray[0]
#(42, [0.01, 1.01, 2.01])
myArray[0][1]
#array([ 0.01, 1.01, 2.01])
myArray[0][0]
#42
myArray[0][1][1]
#1.01
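A possible variation is to give the fields names, so that values are accessed by name rather than by position. The sketch below assumes the same record layout; the field names count and values are hypothetical, and the file is a stand-in written with tofile so the example runs on its own.

```python
import os
import tempfile

import numpy as np

# Hypothetical field names; "<" pins little-endian byte order explicitly.
dt = np.dtype([("count", "<i4"), ("values", "<f8", (3,))])

# Stand-in for my.bindata: write one record with the same layout
# as the C program (4-byte int, then three 8-byte doubles).
path = os.path.join(tempfile.mkdtemp(), "my.bindata")
record = np.zeros(1, dtype=dt)
record["count"] = 42
record["values"] = [0.01, 1.01, 2.01]
record.tofile(path)

data = np.fromfile(path, dtype=dt)
print(data["count"][0])   # the integer of the first record
print(data["values"][0])  # the three doubles of the first record
```

With the unnamed format string used above, NumPy assigns the default field names f0 and f1, so myArray['f0'] and myArray['f1'] work in the same way.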
I have tested this case study on a GNU/Linux PC, so the binary file is little-endian; the writing and reading patterns above both assume that native byte order. Ideally, a generic wrapper around this Python code would help to simplify things.
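One way such a wrapper might look is sketched below, with the byte order made an explicit parameter instead of relying on the machine's native order. The function name read_binary_records, its field-specification format, and the big-endian stand-in file are all assumptions made for illustration.

```python
import os
import struct
import tempfile

import numpy as np


def read_binary_records(path, fields, byteorder="<"):
    """Read fixed-layout records from a raw binary file (hypothetical helper).

    fields    -- list of (name, type_code, shape) tuples, where shape is
                 None for a scalar, e.g. ("values", "f8", (3,))
    byteorder -- "<" for little-endian, ">" for big-endian
    """
    spec = []
    for name, code, shape in fields:
        if shape is None:
            spec.append((name, byteorder + code))
        else:
            spec.append((name, byteorder + code, shape))
    return np.fromfile(path, dtype=np.dtype(spec))


# Stand-in file written in big-endian order, as if it came from another machine.
path = os.path.join(tempfile.mkdtemp(), "my_big_endian.bindata")
with open(path, "wb") as f:
    f.write(struct.pack(">i3d", 42, 0.01, 1.01, 2.01))

layout = [("count", "i4", None), ("values", "f8", (3,))]
data = read_binary_records(path, layout, byteorder=">")
print(data["count"][0], data["values"][0])
```

Passing byteorder="<" with the same layout would read the file produced by the C program on a little-endian machine.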
2 comments:
There are several typos in your C code. It should be stdlib, not stdlin. You have a weird "&" from a keystroke. More fundamentally, the code segfaults because you use i as a pointer, but it isn't one. I've rewritten it here, with some extra debugging print statements.
#include <stdio.h>
#include <stdlib.h> //fixed typo
int main() {
    FILE * myF;
    int *ip,i,j; //added ip pointer
    double *numbers, kk;
    myF = fopen("my.bindata", "wb") ;
    numbers = malloc(3*sizeof(double));
    i = 42;
    ip = &i;
    fwrite(ip, sizeof(int), 1, myF); //this line segfaulted with i
    printf("%d\n",i);
    fflush(stdout);
    for(j=0; j<3; j++) {
        kk = (double)j+1e-2;
        printf("%lf\n",kk);
        fflush(stdout);
        numbers[j] = kk;
    }
    fwrite(numbers, sizeof(double), 3, myF);
    fclose(myF);
    return(0);
}
Correction to my previous comment: I see that the ampersand typo and the pointer typo are related. Forget the extra *ip layer and just use &i in the proper spot.