
Make pleasingly parallel R code with rxExecBy


  • Using the foreach package (available on CRAN) is one simple way of speeding up pleasingly parallel problems using R.
  • A better idea would be to leave the data where it is, and run R within the data repository, in parallel.
  • When your data is sitting in SQL Server or Spark, you can specify a set of keys to partition the data by, and an R function (any R function, built-in or user-defined) to apply to the partitions.
  • You can also run it on local data in various formats.

    The rxExecBy function is included in Microsoft R Client (available free) and Microsoft R Server.

  • Microsoft R Blog: Running Pleasingly Parallel workloads using rxExecBy on Spark, SQL, Local and Localpar compute contexts

@craigbrownphd: Make pleasingly parallel R code with rxExecBy #BigData #Analytics

Some things are easy to convert from a long-running sequential process to a system where each part runs at the same time, thus reducing the required time overall. We often call these “embarrassingly parallel” problems, but given how easy it is to reduce the time it takes to execute them by converting them into a parallel process, “pleasingly parallel” may well be a more appropriate name.

Using the foreach package (available on CRAN) is one simple way of speeding up pleasingly parallel problems using R. A foreach loop is much like a regular for loop in R, and by default it runs its iterations one after another (just like a for loop). But by registering a parallel “backend” for foreach, you can run many (or maybe even all) iterations at the same time, using multiple processors on the same machine, or even multiple machines in the cloud.
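As a minimal sketch of that idea, the snippet below runs the same loop first sequentially and then on a registered backend. It assumes the doParallel package (one of several foreach backends on CRAN) is installed; the choice of two workers and the toy computation are arbitrary and only for illustration.

```r
library(foreach)
library(doParallel)  # assumed backend; other foreach backends work similarly

# Sequential: %do% runs the iterations one after another
seq_result <- foreach(i = 1:4, .combine = c) %do% {
  i^2
}

# Parallel: register a backend, then switch %do% to %dopar%
cl <- makeCluster(2)    # 2 local worker processes (arbitrary choice)
registerDoParallel(cl)

par_result <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

stopCluster(cl)         # always release the workers when done
```

Both loops should produce the same values; only the execution strategy changes, which is what makes the conversion so painless for independent iterations.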

For many applications, though, you need to provide a different chunk of data to each iteration to process. (For example, you may need to fit a statistical model within each country — each iteration will then only need the subset for one country.) You could just pass the entire data set into each iteration and subset it there, but that’s inefficient and may even be impractical when dealing with very large datasets sitting in a remote repository. A better idea would be to leave the data where it is, and run R within the data repository, in parallel.
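To make the per-country example concrete, here is a small base R sketch that partitions a data frame once and hands each iteration only its own subset. The `sales` data, its column names, and the country codes are made up for illustration; the same pattern carries over to a foreach loop iterating over the pieces.

```r
# Synthetic data: 3 hypothetical countries, 20 observations each
set.seed(42)
sales <- data.frame(
  country = rep(c("FR", "JP", "US"), each = 20),
  x = rnorm(60)
)
sales$y <- 2 * sales$x + rnorm(60)

# Partition once, so each iteration receives only its own subset
by_country <- split(sales, sales$country)

# Fit one model per country; every call sees 20 rows, not all 60
fits <- lapply(by_country, function(d) lm(y ~ x, data = d))

# One slope estimate per country
slopes <- vapply(fits, function(m) coef(m)[["x"]], numeric(1))
```

This works well while the data fits in local memory. Once the data lives in a remote repository, the splitting itself becomes the expensive step, which is the problem the next paragraph addresses.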

The rxExecBy function is designed for exactly this purpose. When your data is sitting in SQL Server or Spark, you can specify a set of keys to partition the data by, and an R function (any R function, built-in or user-defined) to apply to the partitions. The data doesn’t actually move: R runs directly on the data platform. You can also run it on local data in various formats.
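The "partition by keys, then apply a function" pattern can be illustrated locally in base R. The `exec_by` helper below is a hypothetical stand-in written for this sketch, not the rxExecBy function itself: it runs in a single local R session and moves nothing to SQL Server or Spark. The `claims` data is made up, and the convention that the user function receives the key values and the partition's rows is an assumption chosen to mirror the description above; consult the RevoScaleR documentation for the actual rxExecBy signature.

```r
# exec_by is a hypothetical local stand-in, NOT the RevoScaleR function
exec_by <- function(data, keys, func) {
  # One partition per distinct combination of the key columns
  parts <- split(data, data[keys], drop = TRUE)
  lapply(unname(parts), function(part) {
    key_values <- part[1, keys, drop = FALSE]
    rownames(key_values) <- NULL
    # Apply the user function to this partition only
    list(keys = key_values, result = func(key_values, part))
  })
}

# Made-up data for illustration
claims <- data.frame(
  age  = c("17-20", "17-20", "21-24", "21-24", "21-24"),
  type = c("A", "B", "A", "A", "B"),
  cost = c(289, 282, 372, 133, 160)
)

# Any R function can be applied; this one just counts rows
count_rows <- function(keys, data) nrow(data)

out <- exec_by(claims, keys = c("age", "type"), func = count_rows)
```

Each element of `out` pairs one key combination with the function's result for that partition. The real function applies the same idea where the data already lives, so the partitions never have to be pulled into the client session.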

For details, take a look at the Microsoft R Blog post linked below.

Microsoft R Blog: Running Pleasingly Parallel workloads using rxExecBy on Spark, SQL, Local and Localpar compute contexts
