Let your heart guide you......It whispers so listen closely

Thursday, August 30, 2012

parallel computing in R

I don't know what would have happened to me and my research if I didn't know R. It has always been the most co-operative tool. But RBM was never happy with the speed of my programs, especially when I am running simulations. So, one fine day I started googling for ways to speed up my programs. I invested one complete day and got a decent increase in speed. I was very happy and excited that I had actually managed something cool; but the next day his reaction made me realise that it was not sufficient.
This led to the following two consequences:
i) I was kind of disheartened, so I neither tried going into greater depth nor tried writing down the procedure, at least for my own record.
ii) I started looking into package making in R (which he always wanted me to do). In the last 2 years I had tried doing that at least 2-3 times, but always gave up. But to my surprise, this time I managed within a day (will share that too, soon).

So there I was. Within 2 days, I had successfully managed to implement 2 reasonably clumsy techniques. But my laziness never allowed me to go ahead with them.

It was only when MGK poked me that I realised I could not recall any of it. I didn't know which files to look into. But eventually I managed, thanks to ASK, who helped me revive it. Before I forget again, I made a point to re-search, re-view and document...

A few important notes:
> I have used Revolution R Community 6.0. One can also try other versions of R, like R 2.15.1, but then you need to make sure that all the required libraries/packages are downloaded and installed. Revolution R has got all of them.

> The version of parallel computing I have used employs the notion of foreach (a replacement for the standard "for" statement). It says "do this to everything in this set", rather than "do this x times".

> We have to register a parallel backend, otherwise foreach will execute tasks sequentially.

%dopar% executes the R expression using the currently registered backend.
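
A minimal sketch of the notes above, assuming only that the foreach package is installed. The variable names are illustrative. It shows that foreach returns the value of its body for every element of the set, instead of working through side effects like a plain for loop. It uses %do%, the sequential operator of foreach, so no backend is needed yet.

```r
library(foreach)

# A standard for loop works through side effects: we fill 'squares_for' ourselves.
squares_for <- numeric(3)
for (i in 1:3) {
  squares_for[i] <- i^2
}

# foreach returns the value of its body for every element of the set.
# By default the results come back as a list; %do% runs them sequentially.
squares_foreach <- foreach(i = 1:3) %do% {
  i^2
}

# unlist() turns the list into a plain numeric vector: 1 4 9
print(unlist(squares_foreach))

# Reports whether a parallel backend has been registered. Until one is,
# %dopar% falls back to sequential execution with a one-time warning.
print(getDoParRegistered())
```

Swapping %do% for %dopar% is the only change needed in the loop itself once a backend has been registered.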

So here is some sample code:

# packages that have to be loaded
library(foreach)
library(doParallel)

# to detect the number of CPU cores on the current host
detectCores()

# creates a set of copies (say 2) of R running in parallel
cl = makeCluster(2)

# registers the parallel backend with the ‘foreach’ package
registerDoParallel(cl)

# to check if the multiple cores are registered (any small loop will do, e.g. 3 iterations). If not, a warning will be issued that it is running sequentially. However, the warning is issued only once
foreach(i = 1:3) %dopar%
{  sample(c("H", "T"), 10000, replace=TRUE)
}

# returns the number of execution workers in the currently registered doPar backend; should be the same as the input to makeCluster
getDoParWorkers()

# sample code
x = foreach(i = 1:T, .combine = '+') %dopar%
{    # a function which has to be executed T times
    ...    # body of the loop, whose last line is the expression I
}
# x contains the sum of the expression on the last line of the loop (since the argument to .combine is '+'). Here x is the sum of the I’s.
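
As a concrete instance of the loop above, here is a minimal sketch that counts how often heads are in the majority over repeated coin-toss experiments. The names n_rep and n_toss and the particular indicator I are made up for illustration (they are not from the original simulations), and the sketch assumes the doParallel package is installed. It also calls stopCluster() at the end to shut the worker processes down.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # two worker copies of R
registerDoParallel(cl)      # make %dopar% use them

n_rep  <- 100               # hypothetical: number of repetitions (T in the text)
n_toss <- 1000              # hypothetical: tosses per repetition

# Each iteration returns an indicator I (1 if heads are in the majority, else 0).
# .combine = '+' adds the indicators up, so x is the sum of the I's.
x <- foreach(i = 1:n_rep, .combine = '+') %dopar% {
  tosses <- sample(c("H", "T"), n_toss, replace = TRUE)
  I <- as.numeric(sum(tosses == "H") > n_toss / 2)
  I
}

print(x / n_rep)            # proportion of repetitions with a majority of heads

stopCluster(cl)             # release the workers when done
```

Since the tosses are random, the printed proportion changes from run to run.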

Some more notes:
> I am not sure, but my experiments indicated that it is sufficient to register as many workers as there are cores available. There was no improvement when I tried registering more workers than the number of available cores.

> Other arguments to .combine could be 'cbind', 'c', or some user-defined function (I am not very sure about the usage).

> Since the execution is parallel, make sure that the tasks do not depend on each other.

> This is not the only option available. Though I had to stick to this one, as I was not able to understand exactly how to get the others to work, like doSNOW, which is a parallel adaptor for the SNOW (Simple Network of Workstations) package. There were a few which were not available for Windows, like doMC, which is a parallel adaptor for the multicore package.
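
The notes above on .combine and on independent tasks can be sketched as follows. This is only an illustration, assuming the foreach package is installed; the helper name keep_max is hypothetical. It uses %do% so that it runs without a registered backend; the same foreach calls can be run with %dopar%.

```r
library(foreach)

# .combine = 'c' concatenates the results into a vector: 1 4 9 16
v <- foreach(i = 1:4, .combine = 'c') %do% {
  i^2
}

# .combine = 'cbind' binds each result as one column of a matrix (2 rows, 4 columns)
m <- foreach(i = 1:4, .combine = 'cbind') %do% {
  c(i, i^2)
}

# A user-defined combine function takes two arguments (the result so far
# and the next result) and returns their combination; this one keeps the larger.
keep_max <- function(a, b) max(a, b)
largest <- foreach(i = 1:4, .combine = keep_max) %do% {
  (i - 2)^2
}

# Tasks must be independent: iteration i may not use the result of iteration
# i - 1. A running total like this one is dependent, so it stays a plain loop.
running <- numeric(4)
running[1] <- v[1]
for (i in 2:4) {
  running[i] <- running[i - 1] + v[i]
}

print(v)
print(dim(m))
print(largest)
print(running)
```

The three foreach loops are safe to parallelise because every iteration uses only its own i; the running total is not, because each step needs the previous one.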

PS-1. Thanks are due to the authors of all the R help files I looked into; and of course the R team.
PS-2. Please post your valuable comments, so that we can get better insight into the procedure.

Post PS: Hope to get back soon, with something light and random ;)


CYNOSURE said...

I'll study stats in my vacations and then comment... :(

Madhuri Kulkarni said...

thanks for the post deep. Very helpful to R users!

Madhuri Kulkarni said...

Arre! i had posted one comment y'day! Where is it?...........OK I write again. Nice and very useful post. I will definitely try this.

Richa said...

I have just glanced at your post above..
will understand it sometime later..:)

Good work indeed!!

It seems similar to parallel processing in SAS, where if say 1000 time series are to be forecasted together .. is it done in batches of 100 or so..

deep said...

@CYNOSURE - thats no stats.. its just your computer stuff.. to help our stats stuff ;)

@Madhuri Kulkarni - you are welcome Ma'am :) I will wait to hear about your experience with parallel computing.

@Richa - Sounds interesting. Might help me.. if only I start using SAS. Who likes to move their cheese :P
