Memo's Island: 2012

Friday, 7 December 2012

UEFA Champions League Knockout Phase Draws: Monte Carlo Simulation with R

Draws for the knockout phase of the 2012–13 UEFA Champions League will be held in Nyon on the 20th December 2012. The rules of the draw are simple and are as follows:

8 Group winner teams will be seeded.
8 Group runner-up teams will be unseeded.
Teams coming from the same group and from same association can not be drawn against.

One can compute the probabilities of the likely outcomes by using ensemble average, a standard way of finding the frequencies where we identify whole sample space. Here we take a different approach and use time averages. What we do is to run large number of draws (20 million times) using randomly generated draws. I have called this exercise a Monte Carlo simulation, rather a simple sort while there is no fancy sampling of any sort, only rejection of pair up or the given draw if the last remaining pair does not satisfy the rules. A statistical language GNU R is used to generate the likely outcome. We use Mersenne Twister RNG, a default random number generator which is known to have very large cycle. In result we basically count the likely pair-ups in a matrix (matrix of pairs) and find the frequency of each likely pair up. Simulation takes 3-4 hours on a single CPU Intel Xeon with 16 GB RAM running Ubuntu GNU/Linux. Well, to be honest I haven't optimized the code.

Qualified teams, their groups and associations definitions, the simulation and results reporting R code are as follows (note that we use xtable for html outputs, for other functions we use, see below codes after tables):

winnerTeams               <- c('Paris-Saint-Germain', 'Schalke-04', 'Malaga', 'Borussia-Dortmund', 'Juventus', 'Bayern-Munich', 'Barcelona', 'Manchester-United');
winnerAssociation         <- c('FR', 'DE', 'ES', 'DE', 'IT', 'DE', 'ES', 'ENG');
runnersUpTeams            <- c('Porto', 'Arsenal', 'Milan', 'Real-Madrid', 'Shakhtar-Donetsk', 'Valencia', 'Celtic', 'Galatasaray');
runnersUpTeamsAssociation <- c('PT', 'ENG', 'IT', 'ES', 'UA', 'ES', 'SCO', 'TR');
countMatrix <- matrix(0, 8, 8)
row.names(countMatrix) <- winnerTeams;
colnames(countMatrix) <- runnersUpTeams;
many <- 20e6;
system.time(drawMany(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation, countMatrix, many))
countMatrix <- countMatrix/many;
print(countMatrix);

Simulations results can be interpreted based on frequencies (probabilities) of pairings, rather intuitively while probabilities are not that far off . For example if we consider Barcelona, Milan and Arsenal score the largest 0.23 and 0.21. So my guess based on these frequencies, which I select maximums first ,then the second maximum and so on. If all ties I select the highest count using the second table. Here are the predicted pairs, (ordered with highest probability):

Barcelona - Milan
Malaga - Arsenal
Bayern Munich - Real Madrid
Borissia Dordmund - Valencia
Manchester United - Celtic
Juventus - Galatasaray
Schalke 04 - Porto
PSG - Donetsk

Note that predicted pairs are quite depending on your selection strategy.
Table for the frequencies:

	Porto	Arsenal	Milan	Real-Madrid	Shakhtar-Donetsk	Valencia	Celtic	Galatasaray
Paris-Saint-Germain	0.00	0.13	0.14	0.18	0.12	0.18	0.12	0.12
Schalke-04	0.12	0.00	0.15	0.19	0.12	0.19	0.13	0.12
Malaga	0.19	0.23	0.00	0.00	0.19	0.00	0.20	0.19
Borussia-Dortmund	0.13	0.14	0.15	0.00	0.13	0.19	0.13	0.13
Juventus	0.13	0.15	0.00	0.22	0.00	0.22	0.14	0.13
Bayern-Munich	0.13	0.14	0.15	0.19	0.13	0.00	0.13	0.13
Barcelona	0.18	0.21	0.23	0.00	0.19	0.00	0.00	0.18
Manchester-United	0.13	0.00	0.16	0.22	0.13	0.22	0.14	0.00

Table for the counts (actually all are integers):

	Porto	Arsenal	Milan	Real-Madrid	Shakhtar-Donetsk	Valencia	Celtic	Galatasaray
Paris-Saint-Germain	0.00	2589164.00	2897024.00	3658337.00	2356581.00	3658892.00	2494247.00	2345755.00
Schalke-04	2348458.00	0.00	2924099.00	3735314.00	2371245.00	3743246.00	2517610.00	2360028.00
Malaga	3781019.00	4506559.00	0.00	0.00	3845872.00	0.00	3997679.00	3868871.00
Borussia-Dortmund	2500221.00	2819591.00	3098797.00	0.00	2542013.00	3863888.00	2646626.00	2528864.00
Juventus	2659773.00	2983255.00	0.00	4393444.00	0.00	4392158.00	2892103.00	2679267.00
Bayern-Munich	2500835.00	2818331.00	3098181.00	3866890.00	2542742.00	0.00	2647340.00	2525681.00
Barcelona	3620075.00	4283100.00	4684023.00	0.00	3721268.00	0.00	0.00	3691534.00
Manchester-United	2589619.00	0.00	3297876.00	4346015.00	2620279.00	4341816.00	2804395.00	0.00

This is the main simulation function:

drawMany <- function(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation, countMatrix, many) {
 for(i in 1:many) {
  repeat {
     dr          <- drawOneKnock(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation,0);
     if(sum(dr) > 0) break;
   }
  updateCount <- mapply(incMatrix, dr[,1], dr[,2])
 }
}

A single draw can be generated as follows:

drawOneKnock <- function(winnerTeams, winnerAssociation, runnersUpTeams, runnersUpTeamsAssociation, names=1) {
    k=1
  repeat {
    k=k+1;
    if(k > 1000) return(-1);
    blockWin = 1:8     ; # tracking for draw
    blockRun = blockWin;
    winners  = c(); # Draw results
    runners  = c();
    for(i in 1:7) {
         kk =1;
      repeat {
        kk=kk+1;
        if(kk > 1000) return(-1);
        winner <- sample(blockWin, 1);
        runner <- sample(blockRun, 1);
        if(!(runner == winner) && !(winnerAssociation[winner] == runnersUpTeamsAssociation[runner])) {
          break;
        }
      }
      blockWin <- blockWin[-which(blockWin == winner)];
      blockRun <- blockRun[-which(blockRun == runner)];
      winners  <- c(winners, winner);
      runners  <- c(runners, runner);
     }
      winner <- blockWin;
      runner <- blockRun;
      # check if last remaining is ok, otherwise invalidate draw
      if(!(runner == winner) && !(winnerAssociation[winner] == runnersUpTeamsAssociation[runner])) {
        winners  <- c(winners, blockWin);
        runners  <- c(runners, blockRun);
        if(names)  dr <- cbind(winnerTeams[winners], runnersUpTeams[runners]);
        if(!names) dr <- cbind(winners, runners);
        break;
      }
   }
  dr
}

And counting the pair-ups is performed by a simple function:

incMatrix <- function(i, j) {
   countMatrix[i,j] <<- countMatrix[i,j]+1;
   return(0);
 }

Monday, 3 December 2012

Function closures and abstraction in MATLAB/Octave

Recently a short idea on how function closures can be build in R is shown [link]. It is also discussed in detail by Darren Wilkinson's blog [link]. Here we will mimic the approach presented there. Recall that function closures are basically a reference to a function with the attachment of its lexical environment. Note that, it is not a function pointer while it also return the values of its arguments if they are defined in the given scope. This sort of language construction can be very handy when some of the inner details need to be parametrized, so to speak to prevent re-writing the function over and over again.

A trivial example of adding a number to a given number can be formulated. Let's say x is a number and we need to add 4 to x.

function add4(x)
x+4
end

So far so good, but do you really want to re-write an other function to add 7. Not really! So instead we can generate a function that generates addition function for 7 (Sounds like an onion.). This is quite standard in functional languages and noting new actually!

>> addx = @(x) @(y) x + y ;
>> add4=addx(4);
>> add7=addx(7);
>> add4(5)
9
>> add7(5)
12

So function add4 and add7 are generated by the function addx. We just utilized function handles. This might be quite trivial. Let's try the technique in a little more realistic setting. Imagine we need to take a median of each vector, which has different sizes stored in a structure. Let's generate one

>> rng(0.42,'v4'); % reproduce numbers
>> myStr.vec1 = rand(11,1);
>> myStr.vec2 = rand(5,1);
>> myStr.vec3 = rand(20,1);
>> myStr.vec4 = rand(8,1);

So a function that takes a structure like this and returns the median value of the named member

>> medianVec= @(str, memName) median(str.(memName));

For example for vec1

>> medianVec(myStr, 'vec1');
0.5194

If we need this procedure for vec1 repetitively, we don't need to re-write the second argument all the time. So

>> medianVecN = @(memName) @(str) median(str.(memName));
>> medianVec1 = medianVecN('vec1');
>> medianVec1(myStr); % Vala!
0.5194

This might still look trivial but hopefully it would give you an idea of the technique.

MATLAB/OCTAVE crash tutorial for programmers: Useful bits

In this post, I would like to give you a crash tutorial on the basics of MATLAB ™ - GNU Octave (MO). They are not fully compatible but quite close in grammar. For differences see here. We will concentrate on the basic data structures and object manipulation that would help you to write useful code quickly.

Vectors/matrices and Indexing
Vectors or vectorial operations are the core of MO
>> A = [ 3 6 7 8 2];>> A(1:3); % range indexing will print values from index 1 to index 3
>> A(4:end); % Last two values of this vector
>> A > 4 % logical indexing; this will return logical vector, components satisfying the condition
0 1 1 1 0
>> A(A > 4) % one can use this logical indexing directly to extract elements that satisfy the condition
6 7 8
A([2 3 4]) % positional indexing
logInx = A>4
A(logInx)
6 7 8
Let's define 3 by 3 matrix
>> M = [ 1 3 4 ; 5 6 9 ; 2 3 4]
>> M(:,1:2) % first two columns
Structures and Cells
A structure definition can be performs as follows
persons=struct();
Fields can be defined after .
persons.age=12; Further index can be assigned dynamically
persons(2).age=32;
persons(3).age=40;
Functions
Return vector with function definition must be in a seperate file. For example here pov.m (in the working directory)

function [ pov2, pov3] = pov(x)
^{pov2= x^2; pov3 = x^3;}
end

(pwd will tell you which directory you are working in. A return vector here is [pov2, pov3].
Function handles and apply functions
To define new function ff with function handle
it takes a structure xx and returns its field age value

ff = @(xx) xx.age

We can apply this function to each element of the structure,
returning an array (vector)

arrayfun(ff, persons)
Example Task:
Lets align two vectors, meaning that we found matching indices and delete non matching ones. Here is the solution:

p1 = [2 11 15 19 28 41 45 47 48 51];
p2 = [2 11 31 36 41 45 51 60 68 71];
[~,indexp1] = intersect(p1,p2);
% if you use ~, it means don't assign anything
[~,indexp2] = intersect(p2,p1);
p1 = p1(indexp1);
p2 = p2(indexp2);

Friday, 26 October 2012

Intersect many sets in MATLAB efficiently

Finding intersection of two sets in MATLAB is one of the most frequently used function if you are doing lots of data merging tasks. This is achieved by intersect function. However if you would like to do the same for many sets, there are not many alternatives.

Consider that you placed your arrays into a cell vector and would like to get all intersections of a given vector among other remaining arrays and its corresponding indices uniquely. The following function will do the job for you efficiently. It is not O($N^2$) algorithm, so computational cost only grows linearly. I called this function intersectall, code is maintained in the mathworks file exchange; intersectall.

Wednesday, 10 October 2012

Matlab/Octave: imagesc with variable colorbar

Plotting a discrete matrix with a colour map is a common task that one would need in many different fields of sciences. To plot a discrete matrix with MATLAB, a possibility is to use imagesc, while it handles discrete points better then other density utilities, where it can handle (0,0) placement better. However, if there are limited number of values in the matrix entries, you may want to keep only existing ones on the color labels. One way to achieve this is to re-write your 'colormap' via 'jet', that maps existing colours on the current matrix. However, this is not sufficient, you need to re-write your matrix as well. See example code below. So called a variable 'colormap' situation is handled with this approach. (This is tested with MATLAB, on Octave, I think one needs to be careful on 'colormap', values behaviour of 'jet' is little different.)

% bare minimum - example -
clear all
close all
allLevelNames   = {'one', 'two', 'three', 'four'};
levelValues     = [1 2 3 4];
myMatrix        = [3 3 3 3; 3 3 3 3; 4 4 1 1; 4 4 1 1];
% Now re-write colormap
aV              = length(levelValues);
jj              = jet(aV);
availableValues = unique(myMatrix);
getIndexes      = arrayfun(@(xx) find(levelValues == xx), availableValues);
colormap(jj(getIndexes,:));
% Re-write myMatrix in the availableValues range
myMatrix = arrayfun(@(x) find(availableValues == x), myMatrix);
% Now Plot
h    = imagesc(myMatrix);
% Handle if there is more then one color is in use
maxV = max(availableValues);
minV = min(availableValues);
if(length(availableValues) > 1) caxis([minV maxV]); end;
hcb=colorbar;
set(hcb, 'YTick', availableValues);
set(hcb, 'YTickLabel', allLevelNames(getIndexes));
set(gca,'YDir','normal') % Not to invert x-y but putting (0,0) on the bottom left
xlabel('x values');
ylabel('y values (care on (0,0) location from imagesc)');
title({['Variable colormap handling'];['by msuzen at gmail']});

Thursday, 22 March 2012

Check URL existance with R

RCurl package provides utilities for HTTP and FTP connection. One particular functionality is checking if a given URL is accessible, implying existence, using url.exists. However some unfortunate Windows users may have difficulty to get its latest build from CRAN. Here is a simple function to check existence of URL by only using R base functionality;

   urlExists <- function(address) {  
     tryCatch {  
      con <- url(address)  
      a  <- capture.output(suppressWarnings(readLines(con)))  
      close(con)  
      TRUE;  
     },  
       error = function(err) {  
       occur <- grep("cannot open the connection", capture.output(err));  
       if(length(occur) > 0) FALSE;  
      }  
    )  
   }

Memo's Island