
Friday, 9 February 2024

Exact reproducibility of stochastic simulations for parallel and serial algorithms simultaneously
Random Stream Chunking

Preamble 

Figure: Visual description of
random stream chunking, M.Suzen (2017)
The use of computational science approaches, i.e., data science and machine learning, has become common practice in almost all organisations, driven by the democratisation of data science tools and the availability of inexpensive cloud infrastructure. This brings the requirement, or even the compulsory practice, of code being so-called parallelised. Note that parallelisation is used here as an umbrella term for using multiple compute resources to accelerate an otherwise serial computation; it could mean distributed or multi-core computing, i.e., multiple CPUs/GPUs. Here we provide a simple yet very powerful approach to preserving reproducibility between parallel and serial implementations of the same algorithm that uses random numbers, i.e., stochastic simulations. We assume a Single Instruction Multiple Data (SIMD) setting.

Terminology on repeatability, reproducibility and  replicability 

Even though we use reproducibility as an umbrella term here, there are clear distinctions, recommended by the ACM. We summarise them from a computational science perspective:

repeatability: The owner re-runs the code and produces the same results in their own environment.
reproducibility: Others can re-run the code and produce the same results in their own environment.
replicability: Others can re-implement the same thing differently and produce the same results in their own environment.

In the context of this post, since parallel and serial settings constitute different environments, the practice of getting the same results in both falls under reproducibility.

Single Instruction Multiple Data (SIMD) setting. 

This is the most widely used approach in parallel processing, and probably the only one you will ever need. It implies applying the same instruction, i.e., algorithm or function/module, to distinct portions of the data. It originates from an applied mathematics technique, the so-called domain decomposition method.

Simultaneous Random Stream Chunking 

The approach to ensuring exact reproducibility of a stochastic algorithm in both serial and parallel implementations lies in chunking the random number stream by default, in both the serial and the parallel code. This approach is used in the Bristol Python package. Here is the mathematical definition.

Definition (Random Stream Chunking) A random number generator $\mathscr{R}(s_{k})$ with seed $s_{k}$ is used for each of the $k$ partitions. These partitions always correspond to the $k$ datasets, the MD portion of SIMD. In the case of parallel algorithms, each $k$ corresponds to a different compute resource $\mathscr{C}_{k}$.

By this definition we ensure that the parallel and the serial processing receive exactly the same random numbers, as summarised in the Figure.
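To make this concrete, here is a minimal sketch in plain NumPy, not the Bristol implementation itself; the kernel, the number of partitions and the seeds are illustrative choices. Each partition $k$ receives its own generator seeded with $s_{k}$, so the serial loop and the parallel pool consume exactly the same random streams:

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def stochastic_kernel(data_chunk, seed):
    # Same instruction (SI) applied to one data chunk (MD) with its own stream R(s_k)
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=data_chunk.shape)
    return data_chunk + noise

if __name__ == "__main__":
    data = np.arange(12.0)
    k = 4                                      # number of partitions / compute resources C_k
    chunks = np.array_split(data, k)           # domain decomposition of the data
    seeds = [1000 + i for i in range(k)]       # s_k: one fixed seed per partition

    # serial execution over the k chunks
    serial = [stochastic_kernel(c, s) for c, s in zip(chunks, seeds)]

    # parallel execution over the same chunks and seeds
    with ProcessPoolExecutor(max_workers=k) as pool:
        parallel = list(pool.map(stochastic_kernel, chunks, seeds))

    # identical chunked streams imply bitwise identical results
    assert all(np.array_equal(a, b) for a, b in zip(serial, parallel))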

Conclusion

We have outlined a simple yet effective way of ensuring exact reproducibility of serial and parallel simulations simultaneously. However, reproducibility of stochastic simulations is also highly hardware dependent, for example on GPUs and NPUs, whose internal streams may not be easy to partition and control; nevertheless, the generic idea presented here should still be applicable.


Please cite this article as: 

 @misc{suezen24rep, 
     title = {Exact reproducibility of stochastic simulations for parallel and serial algorithms simultaneously}, 
     howpublished = {\url{https://memosisland.blogspot.com/2024/02/exact-reproducibility-of-stochastic.html}}, 
     author = {Mehmet Süzen},
     year = {2024}
}
  



Tuesday, 12 May 2020

Collaborative data science: High level guidance for ethical scientific peer reviews

Preamble


Catalan Castellers are
collaborating (Wikipedia)
The availability of distributed code-tracking tools and associated collaborative tools makes life much easier when building collaborative scientific tools and products. This is now especially important in data science, as it is applied as a de facto standard in many different industries. Essentially, a computational science field from academia has now become industry-wide practice.

Peer-review is a pull request


Peer reviews usually appear as pull requests; a pull request usually entails a set of changes to a base work that achieves a certain goal. It is a nice coincidence that the acronym PR here stands for both peer review and pull request.


Technical excellence does come with decent behaviour


Aiming at technical excellence is all we need to practice. Requesting technical excellence in PRs is our duty as peers. However, it comes with decent behaviour. PRs are tools for collaborative work, even if it isn't your project or you sit in a different cross-functional team. Here we summarise some high-level points for PRs; a PR can manifest as software code, an algorithmic method, or a scientific or technical article:
  • Don’t be a jerk. We should not request things that we do not practice ourselves, or invent new standards on the fly. If we do, we should give a hand in implementing them.
  • Focus on the scope and be considerate. We should not request things that extend the task far beyond its originally stated scope.
  • Nitpicking is not attention to detail. Attention to detail is quite valuable; excessive nitpicking is not.
  • Be objective and don’t seek revenge. If one of your recommendations on a PR is not accepted by a colleague, do not retaliate by declining their suggestions on your PRs or by creating hostility towards that person.

Conclusion


We have provided some basic suggestions as high-level guidance on peer-review processes. Life is funny; there is a saying in Cyprus, and probably in Texas too: what you sow, you will reap.

Saturday, 23 November 2019

A simple pedagogical one-line python code for the Fibonacci sequence with recursion (and memoization)


Fibonacci, Leonardo da Pisa
(Wikipedia)
Updated with memoization on 26 Feb 2020

Preamble

The Fibonacci sequence $F_{n} = F_{n-1} + F_{n-2}$ is famous as an integer sequence, both as an artistic and a pure mathematical curiosity. A simple implementation of this sequence appears as a standard entry-level programming interview question, along with computing the factorial and the greatest common divisor. In this short post, we will show how to implement $F_{n}$ with recursion as a Python one-liner.

Fibonacci Recursion in Python

Here is a one-liner
f = lambda n : n if n < 2 else f(n-1)+f(n-2) ; [f(n) for n in range(0,13)]

Let's dissect the one line. $f$ is our function that computes the $n$th member of the sequence in a recursive fashion.

For example, the recursion of $f$ at $n=3$ proceeds through the following recursive calls:
  1. Top level, n=3: f(2) + f(1)
  2. n=2: f(1) + f(0), which returns 1 + 0 = 1
  3. n=1: f(1) returns 1
  4. Top level returns 1 + 1 = 2
Recall that every recursion needs a termination condition, which is $n < 2$ in our case. The second part of the one-liner, after the semicolon (which separates the two statements), is a simple list comprehension that calls $f$ for $n$ from 0 to 12, resulting in
 [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144] 

Computational efficiency

To compute the sequence for very large $n$, this implementation is obviously very slow. One would need to introduce caching within the recursion to avoid repeated computation.

Conclusion

Some of the most interesting programming tasks come up in implementing integer sequences. Note that Fibonacci Day is celebrated on the 23rd of November, the day this post is published.

Postscript: Time complexity, memoization version with timings

The complexity of the one-liner above is $O(2^{N})$; with caching it becomes $O(N)$. Thanks to Tim Scarfe for pointing out the Big-O complexities. The solution is to introduce caching, so-called memoization.

The following gives timings for $m=35$ without and with memoization. It is seen that memoization is almost 5 orders of magnitude faster, and it is memory efficient as well. For larger $m$ the memoized version still returns a value quickly, whereas without memoization the exponential number of recursive calls quickly becomes intractable.


import time

m = 35

# no memoization: plain recursive one-liner, O(2^n) calls
t0 = time.process_time()
f = lambda n : n if n < 2 else f(n-1)+f(n-2) ; sfib = [f(i) for i in range(1, m)]
print(time.process_time() - t0)

# memoization: cache already-computed values in a dictionary, O(n) calls
t0 = time.process_time()
mem = {0: 0, 1: 1} ; f = lambda n : mem[n] if n in mem else mem.setdefault(n, f(n-1) + f(n-2)) ; sfib_mem = [f(i) for i in range(1, m)]
print(time.process_time() - t0)

assert sfib == sfib_mem
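As a side note, the standard library's functools.lru_cache achieves the same memoization with a single decorator; a small sketch, separate from the timing experiment above:

import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    # each fib(n) is computed once and cached
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print([fib(n) for n in range(0, 13)])
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]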
 


Saturday, 5 August 2017

Post-statistics: Lies, damned lies and data science patents


US Patent (Wikipedia)

Statistics is such an important field in our daily lives nowadays. The 50-year-old emerging field of data science, now applied to almost every human activity, might be called post-statistics, a kind of post-rock, fusing operations research, data mining, software and performance engineering, and of course a multitude of fields from statistics to machine learning. Even though the reputation of statistics is a bit shaky due to quotes like "Lies, damned lies, and statistics", post-statistics is still emerging and drives innovation in almost all industries that run on data.
One of the most important characteristics of data science appears to be its shared ideals with the open source movement, i.e., free software. Note that "free" here means the freedom to use the source code and to share recipes, i.e., workflows or combinations of algorithms. The entire innovation in data science we have witnessed over the last 5 years or so is fundamentally driven by this attitude, embraced by giants like Microsoft, Google and IBM and supported by a huge number of enthusiastic individuals from industry and academia. These technology giants open source their workflows and tools, like TensorFlow, to the entire community, and support the community via events or by investing in research that partly goes into the public domain. On the other hand, patents are traditionally designed to encourage a culture of innovation and invention: a kind of gift and natural right given to the innovator, so that within a certain time frame he/she or the organisation can reap some benefits.


A recent patent on predicting data science project outcomes unfortunately does not serve this purpose: Data science project automated outcome prediction, US 9710767 B1. Even though the patent is very well written, its scope reads as very restricted at first glance; however, the core principle is identical to the standard workflow a data science professional applies in their daily routine: producing an automated outcome prediction. The interpretation of 'data science project' is open to any activity that predicts an outcome. I am of course no legal expert, but based on this patent, which claims to invent an outcome prediction pipeline for a 'data science project', scikit-learn's workflow manager, pipelines, could be taken to court, since it facilitates exactly the same outcome prediction pipeline this patent claims to invent. Regardless of how enforceable this is, it gives the patent holder an opportunity to sue anyone doing automated data science outcome prediction.

This patent, US 9710767 B1, is a tremendous disservice to the entire data science community and damaging to an industry and to professionals who are trying to use data for outcome prediction for the greater good of society and to solve problems. We definitely do not claim that data science is the solution to our problems in general, but it will help us tackle important problems in industry and society. So maybe in the post-statistics world we have to yell: lies, damned lies and data science patents. Holders of such patents look more like patent sharks or trolls than parties intending to innovate or invent.


Wednesday, 3 April 2013

Reading Binary Data Files Written in C: Python - Numpy Case Study

One of the major components of scientific computation is writing results to a file. Usually a bunch of numbers, typically structured, is written into so-called data files. Writing binary files as output is common practice; it is the preferred choice due to better speed and/or memory efficiency compared to plain ASCII files. However, care in documenting endianness must be taken for good portability.

There are many popular languages for achieving the above task. For speed and efficiency reasons, usually C or Fortran is used to write out a binary file. Let's give an example of writing one integer (42) and three doubles (0.01, 1.01, 2.01) into a binary file in C:

#include <stdio.h>
#include <stdlib.h>

int main() {
  FILE *myF;
  int i, j;
  double *numbers, kk;

  myF     = fopen("my.bindata", "wb");
  numbers = malloc(3*sizeof(double));
  i = 42;
  fwrite(&i, sizeof(int), 1, myF);         /* one 4-byte integer */
  for(j=0; j<3; j++) {
    kk = (double)j + 1e-2;
    numbers[j] = kk;
  }
  fwrite(numbers, sizeof(double), 3, myF); /* three 8-byte doubles */
  free(numbers);
  fclose(myF);
  return 0;
}

This code will produce a binary file called my.bindata. Our aim is to read it into Python so we can post-process the results, i.e., for visualisation or further data analysis. The core idea is to use a higher-level language to process the outputs directly instead of writing further C code; so to speak, avoiding one more step in our workflow and the cumbersome compilation of extra C code.

In order to read files byte by byte, the standard library of Python provides a module called struct. Basically, this module provides packing and unpacking of data into or from binary sources; in this case study our source is a file. However, it is tedious and error prone to use it on a custom binary file whose format contains different types; at the very least it takes some effort to read our custom binary file. At this point, our friends are the NumPy facilities, especially two of them: numpy.dtype and numpy.fromfile. The former provides an easy way of defining our file's format, similar to Fortran syntax, via the creation of a data type object, as its name suggests. The latter reads the binary file in one go, returning a Python object that contains all the information present in the data file.
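For comparison, here is a sketch of reading the same file with struct alone, assuming the little-endian layout written by the C code above; it works, but every offset and format character must be managed by hand:

import struct

with open("my.bindata", "rb") as fh:
    (i,) = struct.unpack("<i", fh.read(4))       # one little-endian 4-byte integer
    numbers = struct.unpack("<3d", fh.read(24))  # three little-endian 8-byte doubles

print(i, numbers)  # 42 (0.01, 1.01, 2.01)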
Here is the Numpy code that reads our binary file created by the above C code.

import numpy as np
dt      = np.dtype("i4, (3)f8")
myArray = np.fromfile('my.bindata', dtype=dt)
myArray[0] 
#(42, [0.01, 1.01, 2.01])
myArray[0][1] 
#array([ 0.01,  1.01,  2.01])
myArray[0][0] 
#42
myArray[0][1][1] 
#1.01
 
I have tested this case study on a GNU/Linux PC, so the binary file is little-endian, hence the writing and reading patterns above. Ideally, a generic wrapper around this Python code would help to simplify things.
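If the file needs to travel between machines with different byte orders, the byte order can also be made explicit in the dtype rather than relying on the native default; a small sketch of the same record layout with named fields (the field names are illustrative) and explicit little-endian markers:

import numpy as np

# '<' forces little-endian interpretation regardless of the host machine
dt  = np.dtype([("id", "<i4"), ("values", "<f8", (3,))])
rec = np.fromfile("my.bindata", dtype=dt)
print(rec["id"][0], rec["values"][0])  # 42 [0.01 1.01 2.01]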

Friday, 1 February 2013

Multicore Run in Matlab via Python: Generation of Henon Maps

Henon map for a=1.4 and b=0.3.
Multi-core processors are now a de facto standard. In scientific computing it is common to face the problem of generating data from a single functionality with many different parameters. This can be thought of as a "single instruction, multiple parameter sets" type of computation. It is quite natural to utilise all the cores available on your host machine. Many people use MATLAB for rapid prototyping; here I show how to generate many Henon maps with different initial conditions (ICs) using MATLAB and Python. We will use Python to drive the computations, the "single instruction" being a function in MATLAB.

Let's first briefly recall the definition of the Henon map

$ x_{n+1} =  y_{n} + 1 - \alpha x_{n}^2 $
$ y_{n+1} = \beta x_{n}  $

It is known that for the parameter ranges $\alpha \in [1.16, 1.41]$ and $\beta \in [0.2, 0.3]$ the map generates chaotic behaviour, i.e., sensitive dependence on initial conditions.

Here is a MATLAB function that generates a png file for given parameter values, initial condition values, a file name and the upper bound of the iteration.

function hennonMap(a, b, xn, yn, upper, filename)
% HENNONMAP generate hennon map at given parameters and initial conditions
%
% Mehmet Suzen
% msuzen on gmail
% (c) 2013
% GPLv3
%
% a \in [1.16 1.41]
% b \in [0.2, 0.3]
%
% Example of running this in the command line
%  > matlab  -nodesktop -nosplash -logfile my.log -r "hennonMap(1.4, 0.3, -1.0, -0.4, 1e4, 'hennonA1.4B0.3.png'); exit;"
%
  Xval = zeros(upper, 1);
  Yval = zeros(upper, 1);
  for i=1:upper
    Xval(i) = xn;
    Yval(i) = yn;
    x  = yn + 1 - a*xn^2;
    y  = b * xn;
    xn = x;
    yn = y;
  end
  h = figure
  plot(Xval, Yval,'o')
  title(['Henon Map at ', sprintf(' a=%0.2f', a), sprintf(' b=%0.2f', b)])
  xlabel('x')
  ylabel('y')
  print(h, '-dpng', filename)
end

Running this function from the command line, as described in the function help, will generate the figure shown above. Now imagine that we need to generate many plots with different ICs. It is easy to open many shells and run the command many times; however, that is a manual task, and we developers do not like that at all! Python helps us here, specifically its multiprocessing module, to spawn the processes from a single script. The concept is closely related to threading, and the multiprocessing module mimics the threading API.

For example, let's say we have 16 cores and we would like to generate Henon maps for 16 different initial conditions. Here is Python code running the above Henon map function by invoking 16 different MATLAB instances, each using different ICs:

#
# Mehmet Suzen
# msuzen on gmail
# (c) 2013
# GPLv3
#  
#  Run the Henon map on multiple processes (spawning MATLAB instances)
#

from multiprocessing import Pool
import commands
import numpy
import itertools

def f(argF):
        # argF holds one pair of initial conditions (xn, yn)
        xn           = '%0.1f' % argF[0]
        yn           = '%0.1f' % argF[1]
        filename     = "hennonX" + xn + "Y" + yn + ".png"
        commandLine  = "matlab  -nodesktop -nosplash -logfile my.log -r \"hennonMap(1.4, 0.3," + xn + "," + yn + ", 1e4, '" + filename + "'); exit;\""
        print commandLine
        return commands.getstatusoutput(commandLine)

if __name__ == '__main__':
        pool     = Pool(processes=16)              # start 16 worker processes, one per IC
        xns      = list(numpy.linspace(-1.0, 0.5, 4))
        yns      = list(numpy.linspace(-0.4, 1.1, 4))
        print pool.map(f, list(itertools.product(xns, yns)))

It would be interesting to do the same thing only using R. Maybe in the next post, I'll show that too.
(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License