Physical and economic limitations have forced computer
architecture towards parallelism and away from exponential
frequency scaling. Meanwhile, increased access to ubiquitous
sensing and the web has resulted in an explosion in the size of
machine learning tasks. To benefit from current and
future trends in processor technology, we must discover,
understand, and exploit the available parallelism in machine
learning. This workshop will achieve four key goals:
Prior NIPS workshops have focused on the topic of scaling machine
learning, which remains an important developing area. We
introduce a new perspective by focusing on how large-scale machine
learning algorithms should be informed by future parallel
architectures.
While we are interested in a wide range of topics associated with
large-scale, parallel learning, the following list provides a flavor
of some of the key topics:
The challenges and opportunities of large-scale parallel machine
learning are relevant to a wide range of backgrounds and interests.
A goal of this workshop is to bring these diverse perspectives
together and so we are broadly targeting:
Submissions are solicited for the workshop to be held on December
11th / 12th 2009 at this year's NIPS workshop session in Whistler,
Canada. Submissions covering early and ongoing work related to
parallelism and massive learning are strongly encouraged.

Parallel Exact Inference on Multicore
Processors
Viktor K. Prasanna
Exact inference in Bayesian networks is a fundamental AI
technique that has numerous applications including medical
diagnosis, consumer help desk, pattern recognition, credit
assessment, data mining, genetics, and others. Inference is
NP-hard, and in many applications real-time performance is
required. In this talk we show task and data parallel
techniques to achieve scalable performance on general purpose
multicore and heterogeneous multicore architectures. We
develop collaborative schedulers to dynamically map the junction
tree tasks leading to highly optimized implementations. We
design lock-free structures to reduce thread coordination
overheads in scheduling, while balancing the load across the
threads. For the Cell BE, we develop a lightweight centralized
scheduler that coordinates the activities of the synergistic
processing elements (SPEs). Our scheduler is further optimized
to run on throughput-oriented architectures such as Sun Niagara
processors. We demonstrate scalable and efficient
implementations using Pthreads for a wide class of Bayesian
networks with various topologies, clique widths, and numbers of
states of random variables. Our implementations show improved
performance compared with OpenMP and compiler-based
optimizations.
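The dependency-driven scheduling the talk describes can be illustrated with a small sketch: junction-tree clique updates form a DAG, and a task is dispatched to a worker thread as soon as all of its parents have finished. This is a generic dynamic DAG scheduler written for illustration, not the authors' actual collaborative or Cell BE schedulers; the function names and the thread-pool approach are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def run_dag(deps, run, workers=4):
    """Dynamically schedule a dependency DAG onto a thread pool.

    deps maps each task to the list of its parent tasks; a task becomes
    ready as soon as all parents have completed, much like mapping
    junction-tree clique updates onto cores.
    """
    remaining = {t: len(ps) for t, ps in deps.items()}  # unmet parents
    children = {t: [] for t in deps}
    for t, ps in deps.items():
        for p in ps:
            children[p].append(t)
    lock = threading.Lock()
    all_done = threading.Event()
    pending = len(deps)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        def execute(task):
            nonlocal pending
            run(task)
            ready = []
            with lock:
                pending -= 1
                for child in children[task]:
                    remaining[child] -= 1
                    if remaining[child] == 0:  # last parent just finished
                        ready.append(child)
                if pending == 0:
                    all_done.set()
            for child in ready:
                pool.submit(execute, child)

        for task, parents in deps.items():
            if not parents:                    # roots are ready at once
                pool.submit(execute, task)
        all_done.wait()
```

In a real implementation the interesting work is exactly what this sketch hides: the lock and the shared counters are the thread-coordination overheads that the talk's lock-free structures are designed to avoid.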

Parallel Online Learning
John Langford
A fundamental limit on the speed of training and prediction is
imposed by bandwidth: there is a finite amount of data that a
computer can access in a fixed amount of time. Somewhat
surprisingly, we can build an online learning algorithm fully
capable of hitting this limit. I will discuss approaches for
breaking the bandwidth limit, including empirical results.
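A bandwidth-limited learner of this kind can be sketched as a single-pass online gradient-descent loop with hashed features, in the style of systems like Vowpal Wabbit: each example is touched once, so throughput is bounded by how fast data streams in. The function names, the hashing scheme, and the learning rate below are illustrative assumptions, not the talk's actual code.

```python
import math

def hash_features(tokens, dim=2**18):
    """Hash sparse text features into a fixed-size index space."""
    return [hash(t) % dim for t in tokens]

def train_online(stream, dim=2**18, lr=0.1):
    """Single-pass online logistic regression over a stream of
    (tokens, label) pairs with label in {0, 1}."""
    w = [0.0] * dim
    for tokens, y in stream:
        idx = hash_features(tokens, dim)
        p = 1.0 / (1.0 + math.exp(-sum(w[i] for i in idx)))
        g = p - y                      # gradient of the log loss
        for i in idx:
            w[i] -= lr * g
    return w
```

Because the model never revisits an example, the dominant cost is reading the data itself, which is the bandwidth limit the abstract refers to.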

Probabilistic Machine Learning in
Computational Advertising
Joaquin Quiñonero Candela
In recent years, online advertising has grown at least an order
of magnitude faster than advertising on all other media. This
talk focuses on advertising on search engines, where accurate
predictions of the probability that a user clicks on an
advertisement crucially benefit all three parties involved: the
user, the advertiser, and the search engine. We present a
Bayesian probabilistic classification model that has the ability
to learn from terabytes of web usage data. The model explicitly
represents uncertainty allowing for fully probabilistic
predictions: 2 positives out of 10 instances or 200 out of 1000
both give an average of 20%, but in the first case the
uncertainty about the prediction should be larger. We also
present a scheme for approximate parallel inference that allows
efficient training of the algorithm on a distributed data
architecture.
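The 2-of-10 versus 200-of-1000 example can be made concrete with a conjugate Beta-Bernoulli model: both observations give the same posterior mean click probability, but the smaller sample leaves far more posterior uncertainty. This is a deliberately simplified stand-in for the talk's Bayesian classifier, with a uniform Beta(1, 1) prior chosen purely for illustration.

```python
import math

def beta_posterior(clicks, impressions, a0=1.0, b0=1.0):
    """Posterior mean and standard deviation of a click probability
    under a Beta(a0, b0) prior after observing Bernoulli outcomes."""
    a = a0 + clicks
    b = b0 + impressions - clicks
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# 2/10 and 200/1000 share roughly the same mean click rate,
# but the 10-impression estimate is far less certain.
m_small, s_small = beta_posterior(2, 10)
m_large, s_large = beta_posterior(200, 1000)
```

Representing that uncertainty explicitly is what lets a fully probabilistic model treat the two cases differently, for instance when deciding how much to explore a rarely shown ad.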

Parallel n-body Solvers: Lessons Learned in the
Multicore/Manycore Era
Aparna Chandramowlishwaran, Aashay
Shringarpure, Ryan Riegel, Sam Williams, Lenny Oliker, Alex Gray,
George Biros, and Richard Vuduc
Generalized n-body problems (GNPs; Gray & Moore, NIPS 2000)
constitute an important class of computations in both the
physical sciences and in massive-scale data analysis. This talk
describes some of the latest results and "lessons learned" in
parallelizing, implementing, and tuning an important example of
this class, the fast multipole method, on state-of-the-art
multicore and GPU-based systems. Our lessons include careful
data layouts, vectorization, mixed precision, and automated
algorithmic and code tuning, and a surprising finding in the
debate on the performance and energy efficiency of multicore vs. GPU
processors.

Scalable Learning in Computer Vision
Adam Coates, Honglak Lee, Andrew Y. Ng
Computer vision is a challenging application area of machine
learning. Recent work has shown that large training sets may
yield higher performance in vision tasks like object detection.
We give an overview of our work in object detection using a scalable,
distributed training system capable of training on more than 100
million examples in just a few hours. We also briefly describe
recent work with deep learning algorithms that may allow us to
apply these architectures to large datasets as well.

HadoopML: An Infrastructure for the Rapid
Implementation of Parallel Reusable Analytics
Amol Ghoting, Edwin Pednault
Hadoop is an open-source implementation of Google's MapReduce
programming model. Over the past few years, it has evolved into
a popular platform for parallelization in industry and
academia. Furthermore, trends suggest that Hadoop will likely be
the analytics platform of choice on forthcoming cloud-based
systems. Unfortunately, implementing parallel machine
learning/data mining (ML/DM) algorithms on Hadoop is complex and
time consuming. To address this challenge, we present HadoopML,
an infrastructure to facilitate the implementation of parallel
ML/DM algorithms on Hadoop. HadoopML has been designed to allow
for the specification of both task-parallel and data-parallel
ML/DM algorithms. Furthermore, it supports the composition of
parallel ML/DM algorithms using both serial and parallel
building blocks; this allows one to write reusable parallel
code. The proposed abstraction eases the implementation process
by requiring the user to only specify computations and their
dependencies, without worrying about scheduling, data
management, and communication. As a consequence, the code is
portable in that the user never needs to write Hadoop-specific
code. This potentially allows one to leverage future
parallelization platforms without rewriting one's code.
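The "specify only computations and their dependencies" idea can be illustrated with a minimal serial map/reduce skeleton: the user writes a mapper and a reducer as plain functions, and the framework owns grouping, scheduling, and data movement. This sketch is an assumption about the programming style, not HadoopML's actual API, and it runs serially where a real framework would parallelize.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal MapReduce skeleton: the user supplies only the
    computations (mapper, reducer); the framework handles grouping
    intermediate values by key."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {k: reducer(k, vs) for k, vs in groups.items()}

# A data-parallel per-feature mean, written as reusable building blocks.
def mapper(row):
    for feature, value in row.items():
        yield feature, value

def reducer(feature, values):
    return sum(values) / len(values)
```

Because the mapper and reducer make no reference to Hadoop, the same user code could in principle be re-targeted at a different parallel backend, which is the portability argument the abstract makes.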

LargeScale Machine Learning:
The Problems, Algorithms, and Challenges
Alex Gray
To seed discussion, I will attempt to organize research efforts
in large-scale machine learning by looking at common
computational problems across all of machine learning, and the
challenges of creating efficient parallel algorithms for them.
I'll begin by identifying four common types of computational
bottlenecks that occur across all of machine learning, or
prototype algorithmic problems: N-body problems, graph
operations, linear algebra, and optimization. Within each
category, I'll discuss what we can or cannot learn from the
existing body of work in scientific computing, highlight a few
of the most successful and recent specific serial algorithms
that have been developed for concreteness, and discuss what
makes them easy or hard to parallelize. I'll synthesize some of
these observations to obtain a list of desiderata for parallel
machine learning algorithms research and software toolkits.

1 Billion Instances, 1 Thousand Machines and
3.5 Hours
Gideon Mann, Ryan McDonald, Mehryar
Mohri, Nathan Silberman, Daniel Walker
Training conditional maximum entropy models on massive data sets
requires significant computational resources, but by distributing
the computation, training time can be significantly reduced. Recent
theoretical results have demonstrated that conditional maximum
entropy models trained by weight mixtures of independently
trained models converge at the same rate as traditional
distributed schemes, but train significantly faster. This efficiency
is achieved primarily by reducing network communication costs, a
cost not usually considered but actually quite crucial.
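The weight-mixture scheme reduces, at its core, to a convex combination of independently trained parameter vectors: each shard trains with no per-iteration synchronization, and the only network cost is one round of communication to mix the weights at the end. The sketch below illustrates that mixing step only; the function name and uniform default coefficients are assumptions for illustration.

```python
def mixture_weights(models, coeffs=None):
    """Combine independently trained weight vectors by a convex
    mixture, as in distributed training by parameter mixing.
    models: list of equal-length weight vectors, one per shard."""
    n = len(models)
    coeffs = coeffs or [1.0 / n] * n   # default: uniform mixture
    dim = len(models[0])
    return [sum(c * w[i] for c, w in zip(coeffs, models))
            for i in range(dim)]
```

Contrast this with a traditional distributed scheme, which would exchange gradients across the network on every iteration; here each machine communicates once, after training finishes.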

FPGA-based MapReduce Framework for Machine
Learning
Bo Wang, Yi Shan, Jing Yan, Yu Wang, Ningyi
Xu, Huangzhong Yang
Machine learning algorithms are becoming increasingly important
in our daily lives. However, training on very large-scale
datasets is usually very slow. The FPGA is a reconfigurable platform
that can achieve high parallelism and data throughput, and much
work has been done on accelerating machine learning algorithms
on FPGAs. In this paper, we adapt Google's MapReduce model to
the FPGA by realizing an on-chip MapReduce framework for machine
learning algorithms. A processor scheduler is implemented for
maximum computational resource utilization and load
balancing. In accordance with the characteristics of many
machine learning algorithms, a common data access scheme is
carefully designed to maximize data throughput for large-scale
datasets. This framework hides task control, synchronization,
and communication from designers to shorten development
cycles. In a case study of RankBoost acceleration, up to 31.8x
speedup is achieved versus a CPU-based design, which is comparable
with a fully manually designed version. We also discuss the
implementations of two other machine learning algorithms, SVM
and PageRank, to demonstrate the capability of the framework.

Large-Scale Graph-based Transductive
Inference
Amarnag Subramanya, Jeff Bilmes
We consider the issue of scalability of graph-based
semi-supervised learning (SSL) algorithms. In this context, we
propose a fast graph node ordering algorithm that improves
parallel spatial locality by being cache cognizant. This
approach makes a linear speedup on a shared-memory parallel
machine achievable, and thus means that graph-based SSL
can scale to very large data sets. We use the above algorithm
and a multithreaded implementation to solve an SSL problem on a
120 million node graph in a reasonable amount of time.
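The core computation being scaled here is the standard label-propagation iteration of graph-based SSL: each unlabeled node repeatedly averages its neighbors' label distributions while labeled (seed) nodes stay clamped. The sketch below shows that serial iteration only; the node traversal order in the inner loop is exactly where the talk's cache-cognizant reordering would apply. Function and parameter names are illustrative assumptions.

```python
def label_propagation(adj, seed_labels, num_classes, iters=30):
    """Graph-based SSL by iterative label propagation.

    adj: adjacency list, adj[v] = list of (neighbor, edge_weight).
    seed_labels: dict mapping labeled node -> class index (clamped).
    Returns a per-node distribution over the classes."""
    n = len(adj)
    dist = [[1.0 / num_classes] * num_classes for _ in range(n)]
    for v, y in seed_labels.items():
        dist[v] = [1.0 if c == y else 0.0 for c in range(num_classes)]
    for _ in range(iters):
        new = []
        for v in range(n):                 # traversal order matters for cache
            if v in seed_labels:
                new.append(dist[v])        # labeled nodes stay clamped
                continue
            acc = [0.0] * num_classes
            for u, w in adj[v]:
                for c in range(num_classes):
                    acc[c] += w * dist[u][c]
            z = sum(acc) or 1.0
            new.append([a / z for a in acc])
        dist = new
    return dist
```

On a 120-million-node graph the per-iteration cost is dominated by the neighbor reads, so ordering nodes such that neighbors sit close together in memory directly determines whether the multithreaded version scales.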

Splash Belief Propagation: Efficient
Parallelization Through Asynchronous Scheduling
Joseph Gonzalez, Yucheng Low, Carlos
Guestrin, David O'Hallaron
In this work we focus on approximate parallel inference in loopy
graphical models using loopy belief propagation. We demonstrate
that the natural, fully synchronous parallelization of belief
propagation is highly inefficient. By bounding the achievable
parallel performance of loopy belief propagation on chain
graphical models we develop a theoretical understanding of the
parallel limitations of belief propagation. We then introduce
Splash belief propagation, a parallel asynchronous approach
which achieves the optimal bounds and demonstrates linear to
superlinear scaling on large graphical models. Finally we
discuss how these ideas may be generalized to parallel iterative
graph algorithms in the context of our new GraphLab framework.
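The synchronous-versus-asynchronous contrast can be illustrated on a toy fixed point: instead of recomputing every variable in lockstep each round, a residual-prioritized schedule always updates the variable whose pending change is largest. The chain fixed point below stands in for BP message updates; this is a generic residual-scheduling sketch, not the Splash algorithm itself, and the function name and tolerance are assumptions.

```python
import heapq

def async_residual_solve(n, left=0.0, right=1.0, tol=1e-12):
    """Residual-prioritized asynchronous updates for the chain fixed
    point x[i] = (x[i-1] + x[i+1]) / 2 with clamped endpoints.
    Requires n >= 3."""
    x = [0.0] * n
    x[0], x[-1] = left, right

    def target(i):
        return (x[i - 1] + x[i + 1]) / 2.0

    # Priority queue keyed by residual magnitude (max first).
    heap = [(-abs(target(i) - x[i]), i) for i in range(1, n - 1)]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        res = abs(target(i) - x[i])    # recompute: queued value may be stale
        if res < tol:
            continue
        x[i] = target(i)
        for j in (i - 1, i + 1):       # only neighbors' residuals changed
            if 0 < j < n - 1:
                heapq.heappush(heap, (-abs(target(j) - x[j]), j))
    return x
```

The synchronous schedule would touch all n variables every round even when most have already converged; the asynchronous schedule spends its updates where the residual is largest, which is the intuition behind prioritizing BP message updates.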

Understanding Errors in Approximate Distributed Latent Dirichlet Allocation
Alexander Ihler, David Newman

Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors
Dominik Scherer, Sven Behnke

Very Large-scale Low-rank Approximation
Corinna Cortes, Sanjiv Kumar, Mehyrar Mohri, Ameet Talwalkar

FPGA-based MapReduce Framework for Machine Learning
Bo Wang, Yi Shan, Jing Yan, Yu Wang, Ningyi Xu, Huangzhong Yang

MapReduce for Massive Matrices
Vimal Mathew, Sridhar Mahadevan, Prashant Shenoy

1 Billion Instances, 1 Thousand Machines, and 3.5 Hours
Gideon Mann, Ryan McDonald, Mehryar Mohri

Regression for Large Datasets using an Ensemble of GPU-accelerated ELMs
Mark van Heeswijk, Yoan Miche, Erkki Oja, Amaury Lendasse

Scalable Learning in Computer Vision
Adam Coates, Honglak Lee, Andrew Y. Ng

HADOOPML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics
Amol Ghoting, Edwin Pednault

Distributed BackpropagationDecorrelation Learning
Oliver Obst

Large-Scale Graph-based Transductive Inference
Amarnag Subramanya, Jeff Bilmes

Locally-connected Hierarchical Neural Networks for GPU-accelerated Object Recognition
Rafael Uetz, Sven Behnke

Mining Very Large Databases of Time-Series: Speeding up Dynamic Time Warping using GPGPU
Marco Grimaldi, Davide Albanese, Giuseppe Jurman, Cesare Furlanello
If you have any suggestions for additional related publications,
please contact us.

K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis,
P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker,
J. Shalf, S. W. Williams and K. A. Yelick
"The Landscape of Parallel Computing Research: A view from Berkeley"
Berkeley Technical Report 2006

A. Asuncion, P. Smyth, and M. Welling
"Asynchronous distributed learning of topic models."
Neural Information Processing Systems (NIPS) 2008

A. Asuncion, P. Smyth, and M. Welling
"Distributed Inference for Latent Dirichlet Allocation"
Neural Information Processing Systems (NIPS) 2007

J. Gonzalez, Y. Low, C. Guestrin, and D. O'Hallaron
"Distributed Parallel Inference on Large Factor Graphs."
Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

J. Gonzalez, Y. Low, and C. Guestrin
"Residual Splash for Optimally Parallelizing Belief Propagation."
In Artificial Intelligence and Statistics (AISTATS), 2009.

N. Vasiloglou, A. G. Gray, and D. Anderson.
"Scalable Semidefinite Manifold Learning."
IEEE International Workshop on Machine Learning For Signal Processing
(MLSP), 2009.

M. Holmes, A. G. Gray, and C. Isbell.
"QUICSVD: Fast SVD Using Cosine Trees."
Advances in Neural Information Processing Systems (NIPS). 21. 2008.

H. Shen, S. Jegelka, and A. Gretton
"Fast KernelBased Independent Component Analysis."
IEEE Transactions on Signal Processing (2009)

F. Lozano and P. Rangel
"Algorithms for Parallel Boosting"
International Conference on Machine Learning and Applications (ICMLA), 2005

J. Reichler, H. Harris, and M. Savchenko.
"Online Parallel Boosting"
AAAI Conference On Artificial Intelligence 2004

D. Pennock.
"Logarithmic time parallel Bayesian inference"
Conference on Uncertainty in Artificial Intelligence (UAI),
1998.

S. Shalev-Shwartz and N. Srebro.
"SVM Optimization: Inverse Dependence on Training Set Size."
International Conference on Machine Learning (ICML), 2008

Y. Xia and V. K. Prasanna
"Parallel Exact Inference on the Cell Broadband Engine Processor"
Supercomputing (SC): International Conference for High Performance Computing,
Networking, Storage and Analysis, 2008

Y. Xia and V. K. Prasanna
"Junction Tree Decomposition for Parallel Exact Inference."
IEEE International Parallel & Distributed Processing Symposium (IPDPS'08)

Y. Xia and V. K. Prasanna
"Parallel Exact Inference"
Parallel Computing (ParCo), 2007

A. El Zein, E. McCreath, A. P. Rendell, and A. J. Smola
"Performance Evaluation of the NVIDIA GeForce 8800 GTX GPU for
Machine Learning"
International Conference on Computational Science, 2008.