Physical and economic limitations have forced computer
architecture towards parallelism and away from exponential
frequency scaling. Meanwhile, increased access to ubiquitous
sensing and the web has resulted in an explosion in the size of
machine learning tasks. In order to benefit from current and
future trends in processor technology, we must discover,
understand, and exploit the available parallelism in machine
learning. This workshop pursues four key goals:
Prior NIPS workshops have focused on the topic of scaling machine
learning, which remains an important developing area. We
introduce a new perspective by focusing on how large-scale machine
learning algorithms should be informed by future parallel architectures.
While we are interested in a wide range of topics associated with
large-scale, parallel learning, the following list provides a flavor
of some of the key topics:
The challenges and opportunities of large-scale parallel machine
learning are relevant to a wide range of backgrounds and interests.
A goal of this workshop is to bring these diverse perspectives
together, and so we are broadly targeting:
Submissions are solicited for the workshop to be held on December
11th / 12th 2009 at this year's NIPS workshop session in Whistler,
Canada. Submissions covering early and ongoing work related to
parallelism and massive learning are strongly encouraged.
Parallel Exact Inference on Multi-core
Exact inference in Bayesian networks is a fundamental AI
technique that has numerous applications including medical
diagnosis, consumer help desk, pattern recognition, credit
assessment, data mining, genetics, and others. Inference is
NP-hard and in many applications real time performance is
required. In this talk we present task- and data-parallel
techniques to achieve scalable performance on general-purpose
multi-core and heterogeneous multi-core architectures. We
develop collaborative schedulers to dynamically map the junction
tree tasks leading to highly optimized implementations. We
design lock-free structures to reduce thread coordination
overheads in scheduling, while balancing the load across the
threads. For the Cell BE, we develop a lightweight centralized
scheduler that coordinates the activities of the synergistic
processing elements (SPEs). Our scheduler is further optimized
to run on throughput-oriented architectures such as Sun Niagara
processors. We demonstrate scalable and efficient
implementations using Pthreads for a wide class of Bayesian
networks with various topologies, clique widths, and number of
states of random variables. Our implementations show improved
performance compared with OpenMP and compiler-based implementations.
Parallel Online Learning
A fundamental limit on the speed of training and prediction is
imposed by bandwidth: there is a finite amount of data that a
computer can access in a fixed amount of time. Somewhat
surprisingly, we can build an online learning algorithm fully
capable of hitting this limit. I will discuss approaches for
breaking the bandwidth limit, including empirical results.
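The bandwidth-limited setting the abstract describes is one where each example is read from the data stream exactly once. A minimal single-pass online learner can be sketched as follows; this is an illustrative logistic-regression SGD, not the speaker's actual system:

```python
import math

def online_logistic_sgd(stream, dim, lr=0.1):
    """Single-pass online logistic regression: each (features, label)
    pair is read exactly once, so throughput is bounded mainly by how
    fast examples can be streamed in. Features are sparse dicts."""
    w = [0.0] * dim
    for x, y in stream:
        z = sum(w[i] * v for i, v in x.items())
        p = 1.0 / (1.0 + math.exp(-z))   # predicted P(y = 1)
        g = p - y                        # gradient of log loss w.r.t. z
        for i, v in x.items():
            w[i] -= lr * g * v
    return w

# Toy stream: the label is 1 exactly when feature 0 fires.
data = [({0: 1.0}, 1), ({1: 1.0}, 0)] * 200
w = online_logistic_sgd(iter(data), dim=2)
```

Because the model is touched once per example, the dominant cost is reading the data, which is the bandwidth limit the talk refers to.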
Probabilistic Machine Learning in Computational Advertising
In recent years online advertising has grown at least an order
of magnitude faster than advertising on all other media. This
talk focuses on advertising on search engines, where accurate
predictions of the probability that a user clicks on an
advertisement crucially benefit all three parties involved: the
user, the advertiser, and the search engine. We present a
Bayesian probabilistic classification model that has the ability
to learn from terabytes of web usage data. The model explicitly
represents uncertainty allowing for fully probabilistic
predictions: 2 positives out of 10 instances or 200 out of 1000
both give an average of 20%, but in the first case the
uncertainty about the prediction should be larger. We also
present a scheme for approximate parallel inference that allows
efficient training of the algorithm on a distributed data set.
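The 2-out-of-10 versus 200-out-of-1000 example can be made concrete with a Beta posterior over the click probability. This is a standard construction, not necessarily the exact model used in the talk, and the uniform Beta(1, 1) prior is an assumption:

```python
import math

def beta_posterior(clicks, impressions, alpha0=1.0, beta0=1.0):
    """Mean and standard deviation of a Beta posterior over a click
    probability, starting from an assumed Beta(alpha0, beta0) prior."""
    a = alpha0 + clicks
    b = beta0 + (impressions - clicks)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

mean_small, sd_small = beta_posterior(2, 10)      # ~0.25 +/- 0.12
mean_large, sd_large = beta_posterior(200, 1000)  # ~0.20 +/- 0.013
# Both means are close to 20%, but the posterior spread is roughly an
# order of magnitude larger in the 10-impression case.
```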
Parallel n-body Solvers: Lessons Learned in the
Generalized n-body problems (GNPs; Gray & Moore NIPS 2000)
constitute an important class of computations in both the
physical sciences and in massive-scale data analysis. This talk
describes some of the latest results and "lessons-learned" in
parallelizing, implementing, and tuning an important example of
this class---the fast multipole method---on state-of-the-art
multicore- and GPU-based systems. Our lessons include careful
data layouts, vectorization, mixed precision, and automated
algorithmic and code tuning, and a surprising finding in the
debate on the performance and energy-efficiency of multicore versus GPU platforms.
Scalable Learning in Computer Vision
Computer vision is a challenging application area of machine
learning. Recent work has shown that large training sets may
yield higher performance in vision tasks like object detection.
We give an overview of our work on object detection using a scalable,
distributed training system capable of training on more than 100
million examples in just a few hours. We also briefly describe
recent work with deep learning algorithms that may allow us to
apply these architectures to large datasets as well.
Hadoop-ML: An Infrastructure for the Rapid
Implementation of Parallel Reusable Analytics
Hadoop is an open-source implementation of Google's Map-Reduce
programming model. Over the past few years, it has evolved into
a popular platform for parallelization in industry and
academia. Furthermore, trends suggest that Hadoop will likely be
the analytics platform of choice on forthcoming Cloud-based
systems. Unfortunately, implementing parallel machine
learning/data mining (ML/DM) algorithms on Hadoop is complex and
time consuming. To address this challenge, we present Hadoop-ML,
an infrastructure to facilitate the implementation of parallel
ML/DM algorithms on Hadoop. Hadoop-ML has been designed to allow
for the specification of both task-parallel and data-parallel
ML/DM algorithms. Furthermore, it supports the composition of
parallel ML/DM algorithms from both serial and parallel
building blocks -- this allows one to write reusable parallel
code. The proposed abstraction eases the implementation process
by requiring the user to only specify computations and their
dependencies, without worrying about scheduling, data
management, and communication. As a consequence, the code is
portable in that the user never needs to write Hadoop-specific
code. This potentially allows one to leverage future
parallelization platforms without rewriting one's code.
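The data-parallel pattern that MapReduce frameworks build on can be sketched in plain Python: mappers emit partial statistics for each data shard and a reducer combines them. The example below computes per-feature means and is purely illustrative; it is not Hadoop-ML's actual API:

```python
from functools import reduce

def map_shard(shard):
    """Mapper: emit the example count and per-feature sums for one shard."""
    n = len(shard)
    sums = [sum(row[j] for row in shard) for j in range(len(shard[0]))]
    return n, sums

def reduce_partials(a, b):
    """Reducer: combine two partial (count, sums) results."""
    return a[0] + b[0], [x + y for x, y in zip(a[1], b[1])]

shards = [[[1.0, 2.0], [3.0, 4.0]],   # shard held by worker 1
          [[5.0, 6.0]]]               # shard held by worker 2
n, sums = reduce(reduce_partials, map(map_shard, shards))
means = [s / n for s in sums]         # global per-feature means
```

Because the reducer is associative, the partial results can be combined in any order and on any machine, which is exactly what makes the computation data-parallel.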
Large-Scale Machine Learning:
The Problems, Algorithms, and Challenges
To seed discussion, I will attempt to organize research efforts
in large-scale machine learning by looking at common
computational problems across all of machine learning, and the
challenges of creating efficient parallel algorithms for them.
I'll begin by identifying four common types of computational
bottlenecks, or prototype algorithmic problems, that occur
across all of machine learning: N-body problems, graph
operations, linear algebra, and optimization. Within each
category, I'll discuss what we can or cannot learn from the
existing body of work in scientific computing, highlight a few
of the most successful and recent specific serial algorithms
that have been developed for concreteness, and discuss what
makes them easy or hard to parallelize. I'll synthesize some of
these observations to obtain a list of desiderata for parallel
machine learning algorithms research and software toolkits.
1 Billion Instances, 1 Thousand Machines, and 3.5 Hours
Training conditional maximum entropy models on massive data sets
requires significant computational resources, but by distributing
the computation, training time can be significantly reduced. Recent
theoretical results have demonstrated that conditional maximum
entropy models trained by mixing the weights of independently
trained models converge at the same rate as traditional
distributed schemes, while training significantly faster. This
efficiency is achieved primarily by reducing network communication
costs, a cost not usually considered but actually quite crucial.
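The weight-mixture scheme the abstract describes can be sketched in miniature: train a model independently on each shard, then average the learned parameters, so the only communication is one weight vector per machine. The per-shard trainer below is a toy one-parameter least-squares learner, not the paper's maxent trainer:

```python
def train_shard(xs, ys, lr=0.01, epochs=200):
    """Toy per-shard trainer: SGD on squared error for y ~ w * x."""
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w -= lr * (w * x - y) * x   # gradient of 0.5 * (w*x - y)^2
    return w

def weight_mixture(ws):
    """Uniform weight mixture: average the independently trained weights."""
    return sum(ws) / len(ws)

# Two 'machines', each with its own shard; both shards follow y = 2x.
shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 1.0], [6.0, 2.0])]
w = weight_mixture([train_shard(xs, ys) for xs, ys in shards])
# w lands close to 2.0, and only two scalar weights crossed the 'network'.
```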
FPGA-based MapReduce Framework for Machine Learning
Machine learning algorithms are becoming increasingly important
in our daily life. However, training on very large-scale
datasets is usually very slow. FPGAs are reconfigurable platforms
that can achieve high parallelism and data throughput, and much
work has been done on accelerating machine learning algorithms
on FPGAs. In this paper, we adapt Google's MapReduce model to
the FPGA by realizing an on-chip MapReduce framework for machine
learning algorithms. A processor scheduler is implemented for
maximum utilization of computation resources and load
balancing. In accordance with the characteristics of many
machine learning algorithms, a common data access scheme is
carefully designed to maximize data throughput for large-scale
datasets. This framework hides task control, synchronization,
and communication from designers to shorten development
cycles. In a case study of RankBoost acceleration, up to 31.8x
speedup is achieved versus a CPU-based design, which is comparable
with a fully manually designed version. We also discuss the
implementations of two other machine learning algorithms, SVM
and PageRank, to demonstrate the capability of the framework.
Large-Scale Graph-based Transductive Inference
We consider the issue of scalability of graph-based
semi-supervised learning (SSL) algorithms. In this context, we
propose a fast graph node ordering algorithm that improves
parallel spatial locality by being cache cognizant. This
approach makes a linear speedup achievable on a shared-memory
parallel machine, and thus means that graph-based SSL
can scale to very large data sets. We use the above algorithm
and a multi-threaded implementation to solve an SSL problem on a
120 million node graph in a reasonable amount of time.
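Graph-based SSL of the kind this abstract studies can be illustrated with a minimal label-propagation loop: labeled nodes are clamped and each unlabeled node repeatedly takes the average score of its neighbours. This is a sketch of the general algorithm family, not the paper's cache-cognizant implementation:

```python
def label_propagation(neighbors, labels, n_iters=100):
    """Graph-based SSL by label propagation: labeled nodes are clamped,
    unlabeled nodes converge to the average of their neighbours."""
    scores = {v: labels.get(v, 0.5) for v in neighbors}
    for _ in range(n_iters):
        for v in neighbors:
            if v not in labels:                       # clamp labeled nodes
                nbrs = neighbors[v]
                scores[v] = sum(scores[u] for u in nbrs) / len(nbrs)
    return scores

# Chain 0-1-2-3 with node 0 labeled 1.0 and node 3 labeled 0.0.
g = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
s = label_propagation(g, {0: 1.0, 3: 0.0})
# The unlabeled scores interpolate between the two labels:
# s[1] is approximately 2/3 and s[2] approximately 1/3.
```

Each sweep touches every edge of the graph, which is why node ordering and cache locality dominate the running time at the 120-million-node scale the abstract reports.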
Splash Belief Propagation: Efficient
Parallelization Through Asynchronous Scheduling
In this work we focus on approximate parallel inference in loopy
graphical models using loopy belief propagation. We demonstrate
that the natural, fully synchronous parallelization of belief
propagation is highly inefficient. By bounding the achievable
parallel performance of loopy belief propagation on chain
graphical models we develop a theoretical understanding of the
parallel limitations of belief propagation. We then introduce
Splash belief propagation, a parallel asynchronous approach
which achieves the optimal bounds and demonstrates linear to
super-linear scaling on large graphical models. Finally we
discuss how these ideas may be generalized to parallel iterative
graph algorithms in the context of our new GraphLab framework.
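The scheduling idea behind Splash can be illustrated without the full BP machinery: keep a priority queue keyed by residual (how much a vertex's value would change) and always update the highest-residual vertex first, instead of sweeping every vertex synchronously. In this sketch a toy averaging fixed point stands in for the message updates:

```python
import heapq

def residual_schedule(neighbors, values, budget):
    """Asynchronous residual scheduling on a toy averaging fixed point:
    each update sets a vertex to the mean of its neighbours, and the
    vertex with the largest pending change is always processed first."""
    def proposed(v):
        nbrs = neighbors[v]
        return sum(values[u] for u in nbrs) / len(nbrs)

    pq = [(-abs(proposed(v) - values[v]), v) for v in neighbors]
    heapq.heapify(pq)
    for _ in range(budget):
        _, v = heapq.heappop(pq)      # highest residual; stale entries are harmless
        values[v] = proposed(v)       # apply the update immediately
        for u in neighbors[v]:        # neighbours' residuals may have changed
            heapq.heappush(pq, (-abs(proposed(u) - values[u]), u))
    return values

g = {0: [1], 1: [0, 2], 2: [1]}
vals = residual_schedule(g, [0.0, 10.0, 0.0], budget=50)
# The chain reaches consensus after a handful of prioritized updates.
```

Updates applied immediately (rather than at the end of a synchronous round) are what let information travel the length of a chain in far fewer updates, which is the intuition behind the bounds in the talk.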
Understanding Errors in Approximate Distributed Latent Dirichlet Allocation
Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors
Very Large-scale Low-rank Approximation
FPGA-based MapReduce Framework for Machine Learning
Map-Reduce for Massive Matrices
1 Billion Instances, 1 Thousand Machines, and 3.5 Hours
Regression for Large Datasets using an Ensemble of GPU-accelerated ELMs
Scalable Learning in Computer Vision
Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics
Distributed Backpropagation-Decorrelation Learning
Large-Scale Graph-based Transductive Inference
Locally-connected Hierarchical Neural Networks for GPU-accelerated Object Recognition
Mining Very Large Databases of Time-Series: Speeding up Dynamic Time Warping using GPGPU
If you have any suggestions for additional related publications, please let us know.
"The Landscape of Parallel Computing Research: A View from Berkeley"
UC Berkeley Technical Report, 2006.
"Asynchronous distributed learning of topic models."
Neural Information Processing Systems (NIPS) 2008
"Distributed Inference for Latent Dirichlet Allocation"
Neural Information Processing Systems (NIPS) 2007
"Distributed Parallel Inference on Large Factor Graphs."
Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
"Residual Splash for Optimally Parallelizing Belief Propagation."
In Artificial Intelligence and Statistics (AISTATS), 2009.
"Scalable Semidefinite Manifold Learning."
IEEE International Workshop on Machine Learning For Signal Processing
"QUIC-SVD: Fast SVD Using Cosine Trees."
Advances in Neural Information Processing Systems (NIPS) 21, 2008.
"Fast Kernel-Based Independent Component Analysis."
IEEE Transactions on Signal Processing (2009)
"Algorithms for Parallel Boosting"
International Conference on Machine Learning and Applications (ICMLA), 2005.
"Online Parallel Boosting"
AAAI Conference on Artificial Intelligence, 2004.
"Logarithmic Time Parallel Bayesian Inference"
Conference on Uncertainty in Artificial Intelligence (UAI).
"SVM Optimization: Inverse Dependence on Training Set Size."
International Conference on Machine Learning (ICML), 2008.
"Parallel Exact Inference on the Cell Broadband Engine Processor"
Supercomputing (SC): International Conference for High Performance
Computing, Networking, Storage and Analysis, 2008.
"Junction Tree Decomposition for Parallel Exact Inference."
IEEE International Parallel & Distributed Processing Symposium (IPDPS'08)
"Parallel Exact Inference"
ParCo Parallel Computing 2007
"Performance Evaluation of the NVIDIA GeForce 8800 GTX GPU for Machine Learning"
International Conference on Computational Science (ICCS), 2008.