Tuesday, 5 March 2013

Big-O notation: Care with asymptotic behaviour

Big-O notation tells us how a given algorithm's run time scales with the size of its input. However, one has to be careful when interpreting this scaling (limiting) behaviour. First of all, when we say an algorithm is $O(N^2)$, it does not mean that the quadratic scaling is visible at every input size. Let's illustrate how the asymptotic behaviour expressed by big-O notation can be "misleading" for "small" inputs. Take the operation count $T(N) = 4N^2 - 2N$. If we kept only the leading term, $T'(N) = N^2$, then $T'(4)$ would be exactly 4 times $T'(2)$. Using the full $T(N)$ instead, $T(4)$ is about 4.67 times $T(2)$.
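Working the numbers out makes the gap explicit:

$ T(2) = 4 \cdot 2^2 - 2 \cdot 2 = 12, \qquad T(4) = 4 \cdot 4^2 - 2 \cdot 4 = 56, \qquad T(4)/T(2) = 56/12 \approx 4.67, $

whereas the leading term alone gives $T'(4)/T'(2) = 16/4 = 4$. The lower-order term $-2N$ is still felt at such small $N$; only as $N$ grows does the ratio approach the asymptotic value of 4.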

A good example in this direction is N-body force calculation. An $O(N^2)$ direct-summation algorithm performs better than $O(N \log N)$ ones up to a break-even point, which can be as large as about 50K particles. A good paper explaining this is:
Pringle, Gavin J., "Comparison of an O(N) and an O(N log N) N-body solver".
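The reason is the constant prefactor that the notation hides. Suppose, purely for illustration (these constants are not taken from the paper), that the direct method costs about $c_1 N^2$ operations while the tree-based method costs about $c_2 N \log N$ with a much larger constant $c_2$ due to its bookkeeping overhead. The two costs meet when

$ c_1 N^2 = c_2 N \log N \quad \Rightarrow \quad \frac{N}{\log N} = \frac{c_2}{c_1}, $

so a large ratio $c_2/c_1$ pushes the break-even point out to large $N$; below it, the "worse" $O(N^2)$ method wins.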

Friday, 1 February 2013

Multicore Runs in MATLAB via Python: Generation of Hénon Maps

Hénon map for a = 1.4 and b = 0.3.
Multi-core processors are now the de facto standard. In scientific computing it is common to face the problem of generating data from a single piece of functionality with many different parameter sets; this can be thought of as a "single instruction, multiple parameter sets" type of computation. It is quite natural to utilise all the cores available on your host machine. Many people use MATLAB for rapid prototyping, so here I show how to generate many Hénon maps with different initial conditions (ICs) using MATLAB and Python. Python drives the computations, and the "single instruction" is a MATLAB function.

Let's first briefly recall the definition of the Hénon map:

$ x_{n+1} =  y_{n} + 1 - \alpha x_{n}^2 $
$ y_{n+1} = \beta x_{n}  $

It is known that for parameters $\alpha \in [1.16, 1.41]$ and $\beta \in [0.2, 0.3]$ the map generates chaotic behaviour, i.e. sensitive dependence on initial conditions.

Here is a MATLAB function that generates a png file for given parameter values, initial conditions, an upper bound on the number of iterations and a file name.

function hennonMap(a, b, xn, yn, upper, filename)
% HENNONMAP generate hennon map at given parameters and initial conditions
%
% Mehmet Suzen
% msuzen on gmail
% (c) 2013
% GPLv3
%
% a \in [1.16, 1.41]
% b \in [0.2, 0.3]
%
% Example of running this in the command line
%  > matlab  -nodesktop -nosplash -logfile my.log -r "hennonMap(1.4, 0.3, -1.0, -0.4, 1e4, 'hennonA1.4B0.3.png'); exit;"
%
  Xval = zeros(upper, 1);
  Yval = zeros(upper, 1);
  for i=1:upper
    Xval(i) = xn;
    Yval(i) = yn;
    x  = yn + 1 - a*xn^2;
    y  = b * xn;
    xn = x;
    yn = y;
  end
  h = figure;
  plot(Xval, Yval,'o')
  title(['Henon Map at ', sprintf(' a=%0.2f', a), sprintf(' b=%0.2f', b)])
  xlabel('x')
  ylabel('y')
  print(h, '-dpng', filename)
end

Running this function from the command line, as described in the function help, generates the figure shown above. Now imagine that we need to generate many plots with different ICs. One could open many shells and run the command line many times, but that is a manual task and we developers do not like manual tasks at all! Python helps here, specifically its multiprocessing module, which lets us spawn all the processes from a single script. The concept is closely related to threading, and the multiprocessing module deliberately mimics the threading API.

For example, let's say we have 16 cores and we would like to generate Hénon maps for 16 different initial conditions. Here is a Python script that runs the above hennonMap function by invoking 16 MATLAB instances, each with a different IC:

#
# Mehmet Suzen
# msuzen on gmail
# (c) 2013
# GPLv3
#  
#  Run Hennon Map on multi-process (spawning processes)
#

from multiprocessing import Pool
import commands
import numpy
import itertools

def f(argF):
        # format one (x0, y0) initial condition pair and build the MATLAB call
        xn           = '%0.1f' % argF[0]
        yn           = '%0.1f' % argF[1]
        filename     = "hennonX" + xn + "Y" + yn + ".png"
        commandLine  = "matlab  -nodesktop -nosplash -logfile my.log -r \"hennonMap(1.4, 0.3," + xn + "," + yn + ", 1e4, '" + filename + "'); exit;\""
        print commandLine
        return commands.getstatusoutput(commandLine)

if __name__ == '__main__':
        pool     = Pool(processes=16)              # start 16 worker processes, one per core
        xns      = list(numpy.linspace(-1.0, 0.5, 4))
        yns      = list(numpy.linspace(-0.4, 1.1, 4))
        # 4 x 4 = 16 initial condition pairs; each task spawns one MATLAB instance
        print pool.map(f, list(itertools.product(xns, yns)))

It would be interesting to do the same thing using only R. Maybe in the next post, I'll show that too.

Wednesday, 30 January 2013

R's range and loop behaviour: Zero, One, NULL

One of the most common patterns in programming languages is the ability to iterate over a given set (usually a vector) using 'for' loops. In most modern scripting languages a range operation is a built-in construct and is trivial to use with 'for' loops. In this post, I would like to discuss R's behaviour when the upper bound of the range is zero. For example:
> length(0:0)
[1] 1

> for(i in 0:0) { print(i) }
[1] 0
In Python, the behaviour is somewhat more intuitive (well, it depends on your interpretation of zero):
>>> len(range(0,0))
0
>>> for i in range(0,0): print i
...
>>>
So when you use 'for' loops in R (which you should avoid as much as you can; use 'apply'-type operations instead), do not assume that the loop body will not be entered when the upper bound is zero. This matters in particular when one uses the 'length' of a variable-length vector as the upper bound. A similar issue is discussed in section 8.1.60 of The R Inferno, where 'seq_along' is suggested as a replacement. However, this too fails to catch the case here, since the range '0:0' has length one:
> for(i in seq_along(0:0)) { print(i) }
[1] 1
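The bigger trap, mentioned above, is using 'length' of a possibly empty vector as the upper bound: with an empty vector, '1:length(x)' silently becomes the two-element sequence 1, 0 and the body runs twice:
> x <- c()
> for(i in 1:length(x)) { print(i) }
[1] 1
[1] 0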
Hence, using 'length' or 'seq_along' in this way to determine the upper bound is not recommended, and the same goes for generating the sequence with 'seq' or the range operator ':'. To avoid this kind of bug entirely, a wrapper for sequence generation can be used. This is quite a conservative approach, but it avoids any confusion when your vectors or sequences have variable length and there is a good chance that the upper bound will be zero at run time. Here is one simple wrapper:
> seqWrapper <- function(lb, ub, by=1) {
+   s <- c()
+   if(ub > lb) s <- seq(lb, ub, by=by)
+   return(s)
+ }
> seqWrapper(0,0)
NULL
Sometimes being conservative in coding helps to avoid a simple run-time bug.


Friday, 7 December 2012

UEFA Champions League Knockout Phase Draws: Monte Carlo Simulation with R

The draw for the knockout phase of the 2012–13 UEFA Champions League will be held in Nyon on 20 December 2012. The rules of the draw are simple and are as follows:
  • 8 Group winner teams will be seeded.
  • 8 Group runner-up teams will be unseeded.
  • Teams from the same group or from the same association cannot be drawn against each other.
One could compute the probabilities of the possible outcomes exactly, as an ensemble average over the whole sample space. Here we take a different approach and use a time average instead: we run a large number of random draws (20 million) and count the outcomes. I call this exercise a Monte Carlo simulation, though of a rather simple sort: there is no fancy sampling, only rejection of a pairing, or of the whole draw if the last remaining pair does not satisfy the rules. The statistical language GNU R is used to generate the outcomes, with the default Mersenne-Twister random number generator, which is known to have a very long period. We count the pairings in a matrix (a matrix of pairs) and report the frequency of each possible pairing. The simulation takes 3-4 hours on a single-CPU Intel Xeon machine with 16 GB RAM running Ubuntu GNU/Linux. Well, to be honest, I haven't optimised the code.

The qualified teams, their group and association definitions, the simulation driver and the reporting of results in R are as follows (note that xtable was used for the HTML tables; the other functions called here are listed below, after the tables):


winnerTeams               <- c('Paris-Saint-Germain', 'Schalke-04', 'Malaga', 'Borussia-Dortmund', 'Juventus', 'Bayern-Munich', 'Barcelona', 'Manchester-United');
winnerAssociation         <- c('FR', 'DE', 'ES', 'DE', 'IT', 'DE', 'ES', 'ENG');
runnersUpTeams            <- c('Porto', 'Arsenal', 'Milan', 'Real-Madrid', 'Shakhtar-Donetsk', 'Valencia', 'Celtic', 'Galatasaray');
runnersUpTeamsAssociation <- c('PT', 'ENG', 'IT', 'ES', 'UA', 'ES', 'SCO', 'TR');
countMatrix <- matrix(0, 8, 8);            # pair-up counts, filled in place by incMatrix
row.names(countMatrix) <- winnerTeams;
colnames(countMatrix)  <- runnersUpTeams;
many <- 20e6;
system.time(drawMany(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation, countMatrix, many))
countMatrix <- countMatrix/many;           # convert counts to frequencies
print(countMatrix);

The simulation results can be interpreted through the frequencies (probabilities) of the pairings, fairly intuitively, although the probabilities are not that far apart. For example, Barcelona's largest frequencies are against Milan and Arsenal, at 0.23 and 0.21. My guess is based on these frequencies: I select the maximum first, then the second maximum, and so on; when frequencies tie, I pick the higher raw count from the second table (a sketch of this greedy selection is given after the tables below). Here are the predicted pairs, ordered from highest probability:

Barcelona - Milan
Malaga - Arsenal
Bayern Munich - Real Madrid
Borussia Dortmund - Valencia
Manchester United - Celtic
Juventus - Galatasaray
Schalke 04 - Porto
Paris Saint-Germain - Shakhtar Donetsk

Note that the predicted pairs depend quite strongly on your selection strategy.
Table for the frequencies:


Porto Arsenal Milan Real-Madrid Shakhtar-Donetsk Valencia Celtic Galatasaray
Paris-Saint-Germain 0.00 0.13 0.14 0.18 0.12 0.18 0.12 0.12
Schalke-04 0.12 0.00 0.15 0.19 0.12 0.19 0.13 0.12
Malaga 0.19 0.23 0.00 0.00 0.19 0.00 0.20 0.19
Borussia-Dortmund 0.13 0.14 0.15 0.00 0.13 0.19 0.13 0.13
Juventus 0.13 0.15 0.00 0.22 0.00 0.22 0.14 0.13
Bayern-Munich 0.13 0.14 0.15 0.19 0.13 0.00 0.13 0.13
Barcelona 0.18 0.21 0.23 0.00 0.19 0.00 0.00 0.18
Manchester-United 0.13 0.00 0.16 0.22 0.13 0.22 0.14 0.00

Table for the counts (all values are integers; the trailing .00 comes from the table formatting):

Porto Arsenal Milan Real-Madrid Shakhtar-Donetsk Valencia Celtic Galatasaray
Paris-Saint-Germain 0.00 2589164.00 2897024.00 3658337.00 2356581.00 3658892.00 2494247.00 2345755.00
Schalke-04 2348458.00 0.00 2924099.00 3735314.00 2371245.00 3743246.00 2517610.00 2360028.00
Malaga 3781019.00 4506559.00 0.00 0.00 3845872.00 0.00 3997679.00 3868871.00
Borussia-Dortmund 2500221.00 2819591.00 3098797.00 0.00 2542013.00 3863888.00 2646626.00 2528864.00
Juventus 2659773.00 2983255.00 0.00 4393444.00 0.00 4392158.00 2892103.00 2679267.00
Bayern-Munich 2500835.00 2818331.00 3098181.00 3866890.00 2542742.00 0.00 2647340.00 2525681.00
Barcelona 3620075.00 4283100.00 4684023.00 0.00 3721268.00 0.00 0.00 3691534.00
Manchester-United 2589619.00 0.00 3297876.00 4346015.00 2620279.00 4341816.00 2804395.00 0.00
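
As an aside, the greedy selection described above can be sketched in a few lines of R. This is only an illustration of the idea (largest remaining frequency first; ties are not broken with the count table, as was done by hand above), not the exact procedure used for the list of predicted pairs:


greedyPairs <- function(freqMatrix) {
  # repeatedly pick the largest remaining frequency,
  # then drop that winner (row) and runner-up (column)
  pairs <- character(0);
  m     <- freqMatrix;
  while(nrow(m) > 0) {
    idx   <- which(m == max(m), arr.ind=TRUE)[1, ];
    pairs <- c(pairs, paste(rownames(m)[idx[1]], colnames(m)[idx[2]], sep=" - "));
    m     <- m[-idx[1], -idx[2], drop=FALSE];
  }
  pairs
}
# e.g. greedyPairs(countMatrix), with countMatrix holding the frequencies above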

This is the main simulation function:


drawMany <- function(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation, countMatrix, many) {
  for(i in 1:many) {
    repeat {
      # drawOneKnock returns -1 if it could not complete a valid draw; redraw in that case
      dr <- drawOneKnock(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation, 0);
      if(sum(dr) > 0) break;
    }
    # record each of the 8 pairings of this draw in the global countMatrix
    updateCount <- mapply(incMatrix, dr[,1], dr[,2])
  }
}

A single draw can be generated as follows:


drawOneKnock <- function(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation, names=1) {
  k <- 1;
  repeat {
    k <- k + 1;
    if(k > 1000) return(-1);
    blockWin <- 1:8;        # winners still to be drawn
    blockRun <- 1:8;        # runners-up still to be drawn
    winners  <- c();        # draw results
    runners  <- c();
    for(i in 1:7) {
      kk <- 1;
      repeat {
        kk <- kk + 1;
        if(kk > 1000) return(-1);
        winner <- sample(blockWin, 1);
        runner <- sample(blockRun, 1);
        # accept the pairing only if the teams are not from the same group
        # (same index) and not from the same association
        if(!(runner == winner) && !(winnerAssociation[winner] == runnersUpTeamsAssociation[runner])) {
          break;
        }
      }
      blockWin <- blockWin[-which(blockWin == winner)];
      blockRun <- blockRun[-which(blockRun == runner)];
      winners  <- c(winners, winner);
      runners  <- c(runners, runner);
    }
    winner <- blockWin;
    runner <- blockRun;
    # check that the last remaining pair is valid, otherwise invalidate the whole draw
    if(!(runner == winner) && !(winnerAssociation[winner] == runnersUpTeamsAssociation[runner])) {
      winners  <- c(winners, blockWin);
      runners  <- c(runners, blockRun);
      if(names)  dr <- cbind(winnerTeams[winners], runnersUpTeams[runners]);
      if(!names) dr <- cbind(winners, runners);
      break;
    }
  }
  dr
}

And counting the pair-ups is performed by a simple function, which updates countMatrix in place via the global assignment operator '<<-' (this is why drawMany does not need to return anything):


incMatrix <- function(i, j) {
   countMatrix[i,j] <<- countMatrix[i,j]+1;
   return(0);
 }