Machine Learning in Apache Spark 2.0: Under the Hood and Over the Rainbow

Now that the dust has settled on Apache Spark™ 2.0, the community has a chance to catch its collective breath and reflect a little on what was achieved for the largest and most complex release in the project’s history.

One of the main goals of the machine learning team here at the Spark Technology Center is to continue to evolve Apache Spark as the foundation for end-to-end, continuous, intelligent enterprise applications. With that in mind, we’ll briefly mention some of the major new features in the 2.0 release in Spark’s machine-learning library, MLlib, as well as a few important changes beneath the surface. Finally, we’ll cast our minds forward to what may lie ahead for version 2.1 and beyond.

For MLlib, there were a few major highlights in Spark 2.0:

  • The older RDD-based API in the mllib package is now in maintenance mode, and the newer DataFrame-based API (in the ml package), with its support for DataFrames and machine learning pipelines, has become the focus of future development for machine learning in Spark.
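To give a flavour of what the newer API looks like in practice, here is a minimal pipeline sketch in Scala; the toy data, column names, and parameter values are ours, purely for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PipelineSketch").getOrCreate()

// A toy DataFrame of labelled documents (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "spark dataframes and pipelines", 1.0),
  (1L, "cooking with fresh herbs", 0.0),
  (2L, "mllib moves to maintenance mode", 1.0),
  (3L, "weekend gardening tips", 0.0)
)).toDF("id", "text", "label")

// Each stage consumes the DataFrame produced by the one before it.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Fitting the pipeline runs every stage in order and yields a single reusable model.
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
```

The fitted pipeline can then be applied to new DataFrames with model.transform, which is a large part of what makes the DataFrame-based API the natural home for end-to-end workflows.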

While these highlights have already been well covered elsewhere, the STC team has worked hard to help make them a reality; congratulations to everyone involved!

Another key focus of the team has been feature parity, both between mllib and ml, and between the Python and Scala APIs. In the 2.0 release, we’re proud to have contributed significantly to both areas, in particular reaching close to full parity for PySpark in ml.

Despite the understandable attention paid to major features in such a large release, what happens under the hood in terms of bug fixes and performance improvements can be equally important (if not more so!).

While the team has again been involved across the board in this area, here we’d like to highlight just one example of a small (but subtle) issue that has dramatic implications for performance.

Linear models, such as logistic regression, are the workhorses of machine learning. They’re especially useful for very large datasets, such as those found in online advertising and other web-scale predictive tasks, because they are less complex than, say, deep learning models, and so are easier to train and more scalable. As such, they are among the most widely used algorithms around, and were among the earliest added to Spark ml.

In distributed machine learning, the bottleneck for scaling large models (that is, where there are a large number of unique variables in the model) is often not computing power, as one might think, but communication across the network. This is because these algorithms are iterative in nature, and tend to send a lot of data back and forth between nodes in a cluster in each iteration. Therefore, it pays to be as communication-efficient as possible when constructing such an algorithm.
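To make the communication pattern concrete, here is a hedged sketch of one gradient-descent iteration in the style of Spark’s linear-model trainers. The Instance class and the squared-loss gradient are stand-ins of ours, not Spark’s actual code; the point is that treeAggregate merges the dimension-sized gradient buffers in stages, so per-iteration network traffic scales with the model size rather than the data size:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// One labelled example; a plain case class keeps the sketch self-contained.
case class Instance(label: Double, features: Array[Double])

// One iteration of gradient aggregation for squared loss (a stand-in loss, ours).
def gradientStep(data: RDD[Instance], bcWeights: Broadcast[Array[Double]]): Array[Double] = {
  val dim = bcWeights.value.length
  data.treeAggregate(new Array[Double](dim))(
    // seqOp: fold one instance into the partition-local gradient buffer.
    // This runs inside each partition, so it generates no network traffic.
    seqOp = (grad, inst) => {
      val w = bcWeights.value
      var dot = 0.0
      var i = 0
      while (i < dim) { dot += w(i) * inst.features(i); i += 1 }
      val err = dot - inst.label
      i = 0
      while (i < dim) { grad(i) += err * inst.features(i); i += 1 }
      grad
    },
    // combOp: merge partial gradients. Only these dim-sized arrays cross the
    // network, and the tree depth limits how many reach the driver at once.
    combOp = (g1, g2) => {
      var i = 0
      while (i < dim) { g1(i) += g2(i); i += 1 }
      g1
    },
    depth = 2
  )
}
```

Spark’s own trainers follow this shape: the current weights go out once per iteration as a broadcast variable, and only the aggregated gradient comes back.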

While working on adding multi-class logistic regression to Spark ML (part of the ongoing push towards parity between ml and mllib), STC team member Seth Hendrickson realized that, due to the way Spark automatically serializes data when inter-node communication is required (for example, during a reduce or aggregation operation), the aggregation step of the logistic regression training algorithm was communicating 3x more data than necessary.
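The pattern behind the fix looks roughly like the following sketch; the class and field names are ours and the real Spark aggregator is more involved, but the key idea is the same. Holding the coefficients in a @transient lazy val backed by a broadcast variable means each executor reconstructs them locally, so only the running sums are serialized when partial aggregators are merged:

```scala
import org.apache.spark.broadcast.Broadcast

// Simplified binary logistic loss aggregator (names and structure are ours).
class LogisticGradientAggregator(dim: Int, bcCoefficients: Broadcast[Array[Double]])
    extends Serializable {

  // Rebuilt from the broadcast on first use on each executor; being @transient,
  // it is never serialized along with the aggregator itself, which is what
  // removes the redundant per-aggregation traffic.
  @transient private lazy val coefficients: Array[Double] = bcCoefficients.value

  // Only this state travels over the network when partial aggregators are merged.
  private val gradientSum = new Array[Double](dim)
  private var lossSum = 0.0
  private var count = 0L

  def add(label: Double, features: Array[Double]): this.type = {
    var margin = 0.0
    var i = 0
    while (i < dim) { margin += coefficients(i) * features(i); i += 1 }
    // Gradient multiplier for binary logistic loss: sigmoid(margin) - label.
    val multiplier = 1.0 / (1.0 + math.exp(-margin)) - label
    i = 0
    while (i < dim) { gradientSum(i) += multiplier * features(i); i += 1 }
    lossSum += (if (label > 0) math.log1p(math.exp(-margin)) else math.log1p(math.exp(margin)))
    count += 1
    this
  }

  def merge(other: LogisticGradientAggregator): this.type = {
    var i = 0
    while (i < dim) { gradientSum(i) += other.gradientSum(i); i += 1 }
    lossSum += other.lossSum
    count += other.count
    this
  }
}
```

An aggregator like this plugs directly into the treeAggregate pattern sketched earlier, with add as the seqOp and merge as the combOp.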

This is illustrated in the chart below, where we compare the amount of shuffle data per iteration as the feature dimension increases.

Once fixed, this resulted in a decrease in per-iteration time of over 11% (shown in the chart below), as well as a decrease in overall execution time of over 20%, mostly due to lower shuffle read time and less data being broadcast at each iteration. We would expect the performance difference to be even larger as data and cluster sizes increase.[1]

Subsequently, various Spark community members rapidly addressed the same issue in linear regression and AFT survival regression (these patches will be released as part of version 2.1).

So there you have it – Spark 2.0 even improves your communication skills!

What does it mean when we refer to Apache Spark as the “foundation for end-to-end, continuous, intelligent enterprise applications”? In the context of Spark’s machine learning pipelines, we believe this means usability, scalability, streaming support, and closing the loop between data, training, and deployment to enable automated, intelligent workflows; in short, the “pot of gold” at the end of the rainbow!

In line with this vision, the focus areas for the team for Spark 2.1 and beyond include:

We’d love to hear your feedback on these areas of interest — email me at NickP@za.ibm.com, and we look forward to working with the Spark community to help drive these initiatives forward.

[1] Tests were run on a relatively small cluster with 4 worker nodes (each with 48 cores and 100GB memory). Input data ranged from 6GB to 200GB, with 48 partitions, and was sized to fit in cluster memory at the maximum feature size. The quoted performance improvement figures are for the maximum feature size.

The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided on this website, which is managed by IBM. Apache®, Apache Spark™, and Spark™ are trademarks of the Apache Software Foundation in the United States and/or other countries.
