# Parallel Computing 2007: Bring your own parallel application


Published on May 7, 2007

Author: Foxsden

Source: slideshare.net

## Description

Discussion of parallel implementation of some datamining applications that might be relevant on multicore systems.

Parallel Computing 2007: Bring your own parallel application. February 26-March 1, 2007. Geoffrey Fox, Community Grids Laboratory, Indiana University, 505 N Morton Suite 224, Bloomington IN. [email_address]

### Intel’s Application Stack

Figure annotations: “Discussed here”; “Rest mainly classic parallel computing”.

### K-Means

The diagrams come from Wikipedia

Take N data points x in some space (can be relatively abstract such as space of chemical properties)

We want to cluster into c components based on distance in space

The algorithm assumes you have a guess $c_k$ for the cluster centers, $k = 1..c$

Associate each of the N points with one and only one cluster by minimizing the distance to the $c_k$

Replace $c_k$ by the centroid of the points associated with it

Iterate algorithm

Problem used later in deterministic annealing version of K-Means

### K-Means illustrated

a) Shows the initial randomized centers and a number of points.

b) Centers have been associated with the points and have been moved to the respective centroids.

c) Now, the association is shown in more detail, once the centroids have been moved.

d) Again, the centers are moved to the centroids of the corresponding associated points.

### Parallel K-Means

This algorithm is data parallel over the N points x

Assign $N/N_{proc}$ points to each of the $N_{proc}$ processors; no ordering is needed in the simple algorithm

Broadcast the initial cluster centers $c_k$ to each processor

Each processor independently calculates the nearest $c_k$ for each data point it is responsible for

It further calculates partial sums for the c centroids and error estimates (used for convergence)

{Sums over all points} are {Sums over processors (sums over all points in given processor)}

Apply MPI_Allreduce for the global sums, with the (same) c results placed in each processor

All processors calculate the new $c_k$ and iterate
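The steps above can be sketched in plain Python. This is a minimal single-process simulation, not the MPI code behind the slides: the names `local_partial_sums`, `simulated_allreduce`, and `parallel_kmeans` are hypothetical, the "processors" are just list slices, and the plain addition of partial sums stands in for what `MPI_Allreduce` would do across real processes. Error estimates for convergence are left out, so the loop runs a fixed number of iterations.

```python
def nearest_center(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def local_partial_sums(points, centers):
    """Work done independently on one processor: per-center sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in points:
        k = nearest_center(point, centers)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def simulated_allreduce(partials):
    """Stand-in for MPI_Allreduce (assumption): add partial sums from all processors."""
    total_sums = [row[:] for row in partials[0][0]]
    total_counts = partials[0][1][:]
    for sums, counts in partials[1:]:
        for k in range(len(total_counts)):
            total_counts[k] += counts[k]
            for d in range(len(total_sums[k])):
                total_sums[k][d] += sums[k][d]
    return total_sums, total_counts


def parallel_kmeans(points, centers, n_proc, iterations=10):
    # Cyclic decomposition: about N / n_proc points each; ordering does not matter.
    chunks = [points[i::n_proc] for i in range(n_proc)]
    for _ in range(iterations):
        partials = [local_partial_sums(chunk, centers) for chunk in chunks]
        sums, counts = simulated_allreduce(partials)
        # Every processor would compute the same new centers from the global sums.
        centers = [
            [s / counts[k] for s in sums[k]] if counts[k] else centers[k]
            for k in range(len(centers))
        ]
    return centers
```

Only the `c` sums and counts cross processor boundaries in each iteration; the points themselves never move.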

### MPI Parallel Divkmeans clustering of PubChem

AVIDD Linux cluster, 5,273,852 structures (PubChem compound collection, Nov 2005). David Wild, Indiana.

### Performance of Parallel K-Means

There is an amount of distance calculation proportional to $n \cdot c$, with $n = N/N_{proc}$, for c clusters and N points on $N_{proc}$ processors

The global sum calculation is proportional to $c \log_2 N_{proc}$

So the overhead is $f_{comm} = \log_2 N_{proc} \, t_{comm} / (n \, t_{calc})$

The appearance of $\log_2 N_{proc}$ is quite common, as global sums are often used

That’s why MPI has MPI_Allreduce, with the hope that it can be optimized on whatever network is available

Notice these MPI collectives are often not optimized and rarely used except by Marine Corps

Note this problem has information dimension 1
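The overhead formula follows from dividing the communication cost, $c \log_2 N_{proc} \, t_{comm}$, by the calculation cost, $n \, c \, t_{calc}$; the factor c cancels. A small helper makes the scaling easy to explore. The function name and the sample numbers in the usage note are illustrative assumptions, not measurements from the talk.

```python
import math


def kmeans_comm_overhead(n_points, n_proc, t_comm, t_calc):
    """f_comm = log2(N_proc) * t_comm / (n * t_calc), with n = N / N_proc."""
    n = n_points / n_proc  # points held by each processor
    return math.log2(n_proc) * t_comm / (n * t_calc)
```

For example, `kmeans_comm_overhead(1024, 4, 1.0, 1.0)` gives `2 / 256`: with fixed N, the overhead grows both through the logarithm and through the shrinking grain size n.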

### Find Maximum of a distributed array

ALLREDUCE can do many reductions, typically after the user has done the reduction internally to each processor
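The two-stage pattern (reduce locally first, then reduce globally) can be sketched as follows. This is a simulation in one Python process: `simulated_allreduce_max` is a hypothetical stand-in for an `MPI_Allreduce` call with a maximum operation, and the array values are made up for illustration.

```python
def local_max(chunk):
    """Stage 1: each processor reduces its own part of the array."""
    return max(chunk)


def simulated_allreduce_max(local_values):
    """Stage 2 stand-in (assumption): combine one value per processor and
    hand the same global result back to every processor."""
    global_value = max(local_values)
    return [global_value] * len(local_values)


array = [3, 17, 4, 42, 8, 15, 23, 16]
n_proc = 4
chunks = [array[i::n_proc] for i in range(n_proc)]
results = simulated_allreduce_max([local_max(chunk) for chunk in chunks])
```

Only one number per processor enters the global step, however large the local chunks are.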

### ALLREDUCE on a multicore chip

On a shared memory machine, one can use a different strategy by “transposing” the decomposition, so that in the global reduction you parallelize over the c centers, not over the geometric spatial decomposition

Each core sums over contributions to a given center

Computational complexity is $\max(1, c/N_{proc}) \times$ (dimension of vector x)

The distributed version is $c \log_2 N_{proc} \times$ (dimension of vector x)

### Transposing Partial Sums

Let the result of the parallel computation be the partial sum $C(i,k)$ for processor i calculating centroid k

$1 \le i \le N_{proc}$ and $1 \le k \le c$

Take the special case $c = N_{proc} = 4$

Calculate partial sums locally:

| Processor | k = 1 | k = 2 | k = 3 | k = 4 |
|---|---|---|---|---|
| 1 | C(1,1) | C(1,2) | C(1,3) | C(1,4) |
| 2 | C(2,1) | C(2,2) | C(2,3) | C(2,4) |
| 3 | C(3,1) | C(3,2) | C(3,3) | C(3,4) |
| 4 | C(4,1) | C(4,2) | C(4,3) | C(4,4) |

Transpose and sum along rows in each processor to get 100% efficiency:

| Processor | Sum held after transposing |
|---|---|
| 1 | C(1,1)+C(2,1)+C(3,1)+C(4,1) |
| 2 | C(1,2)+C(2,2)+C(3,2)+C(4,2) |
| 3 | C(1,3)+C(2,3)+C(3,3)+C(4,3) |
| 4 | C(1,4)+C(2,4)+C(3,4)+C(4,4) |

The MPI solution cannot transpose for free and so uses a tree in this direction
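A minimal sketch of the transposed reduction, assuming the partial sums sit in shared memory as a list of rows `partial[i][k]` for $C(i,k)$. The function name is hypothetical and the loop over cores runs sequentially here; on a real multicore chip each pass of the outer loop would be one thread.

```python
def transposed_reduce(partial):
    """partial[i][k] holds C(i,k): processor i's partial sum for centroid k.
    Core `core` owns centers core, core + n_proc, ... and sums their columns."""
    n_proc = len(partial)
    n_centers = len(partial[0])
    totals = [0] * n_centers
    for core in range(n_proc):  # each iteration would be an independent thread
        for k in range(core, n_centers, n_proc):
            totals[k] = sum(partial[i][k] for i in range(n_proc))
    return totals
```

No two cores write the same entry of `totals`, so this reduction needs no locking.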

Continuing the Intel Homework Set

### Clustering by Deterministic Annealing

One can refine this by using multiscale methods and annealing the system in position resolution (Gurewitz and Rose)

Deterministically find cluster centers $y_j$ using “mean field approximation” – could use slower Monte Carlo

Annealing avoids local minima

### Deterministic Annealing

Method does not need to assume a number of clusters

See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998

Parallelization is similar to ordinary K-Means as we are calculating global sums which are decomposed into local averages and then summed over components calculated in each processor

I found it interesting that clustering (and K-Means) is very important in Chemical Informatics for finding related compounds

The field does not seem to know about these multi-resolution methods

### Frequent Itemsets Mining

We have a transaction database TDB whose records $T_i$ are sets of items $\{i_1, i_2, \ldots, i_m\}$

The $i_k$ are items from a source vocabulary $\{s_1, \ldots, s_N\}$ and we wish to find frequently occurring itemsets $\{s_A, s_B, \ldots\}$ based on the number of times an itemset appears, in any order, in a transaction

I looked at two algorithms – Apriori and Frequent Pattern Growth

Apriori focuses on the itemsets searching from smallest to largest systematically

Natural for short transactions and small vocabularies

Frequent Pattern Growth focuses on transactions after re-ordering them in order of item frequency

Superior for finding long itemsets

Effectively generates a new (compact) database with re-ordered items
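To make the "smallest to largest" search concrete, here is a compact sketch of the Apriori idea in Python. It is a textbook-style illustration rather than the implementation examined for the talk; the function name, the `min_support` parameter (minimum number of transactions), and the grocery items in the usage note are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join step: unions of frequent itemsets that have exactly `size` items.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent
```

With transactions `{"bread", "milk"}`, `{"bread", "butter"}`, `{"bread", "milk", "butter"}`, `{"milk"}` and `min_support=2`, the pair milk and butter occurs only once, so the three-item set is pruned without ever being counted.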

### Parallel Frequent Itemsets Mining

Parallelize by partitioning the transaction database and independently calculating frequent patterns from each partition

Use global reduction to accumulate itemset counts from each partition

Now the global reduction sums counts over candidate patterns and goes together with a pruning that only considers patterns with an occurrence greater than some threshold

This pruning is not easy to do before the global sums (in spite of the claims of at least one paper)

The “transposed multicore” ALLREDUCE would be a good strategy
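The following sketch illustrates why pruning belongs after the global sum. It counts candidate itemsets of one fixed size per partition, adds the counts, and only then applies the threshold. The function names are hypothetical, the partitions are plain lists standing in for processors, and `Counter.update` plays the role of the global reduction.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, size):
    """Independent work on one partition: count every itemset of the given size."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), size):
            counts[itemset] += 1
    return counts


def global_frequent(partitions, size, threshold):
    """Sum local counts (the global reduction), then prune by threshold."""
    total = Counter()
    for partition in partitions:
        total.update(local_counts(partition, size))  # adds counts key by key
    return {itemset: n for itemset, n in total.items() if n > threshold}
```

In the test data used to check this sketch, the pair `("a", "b")` occurs once in each of two partitions. Either partition alone would prune it at `threshold=1`, yet its global count of 2 passes.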

### Transposing Partial Itemset Counts

Let the result of the parallel computation be the partial sum $C(i,k)$ for processor i counting occurrences of itemset k

$1 \le i \le N_{proc}$ and $1 \le k \le c$

Take the unrealistic special case $c = N_{proc} = 4$

Calculate partial sums locally:

| Processor | k = 1 | k = 2 | k = 3 | k = 4 |
|---|---|---|---|---|
| 1 | C(1,1) | C(1,2) | C(1,3) | C(1,4) |
| 2 | C(2,1) | C(2,2) | C(2,3) | C(2,4) |
| 3 | C(3,1) | C(3,2) | C(3,3) | C(3,4) |
| 4 | C(4,1) | C(4,2) | C(4,3) | C(4,4) |

Multicore algorithm: transpose and sum along rows in each processor to get 100% efficiency:

| Processor | Sum held after transposing |
|---|---|
| 1 | C(1,1)+C(2,1)+C(3,1)+C(4,1) |
| 2 | C(1,2)+C(2,2)+C(3,2)+C(4,2) |
| 3 | C(1,3)+C(2,3)+C(3,3)+C(4,3) |
| 4 | C(1,4)+C(2,4)+C(3,4)+C(4,4) |

Distributed MPI_ALLREDUCE: the MPI solution cannot transpose for free and so uses a tree in this direction

### (Mixed) Integer Programming

We are solving an optimization problem such as minimize $f(x) = C^T x$ (for linear programming)

Subject to constraints (which are also linear for linear programming) such as $A_1^T x = b_1$ or a linear inequality bounding $A_2^T x$ by 0

With constraints that some (mixed case) or all the elements of x are integers (possibly 0 or 1)

The non-integer problem is solvable by the Simplex method or, in polynomial time, by interior point methods (Karmarkar)

The integer programming problem is NP-complete

### Integer Programming Parallelization

Typically one does not parallelize the linear program solver but rather runs it sequentially and instead parallelizes a branch and bound (or cut) search over possible solutions in the NP-complete case

e.g. search over integer choices for x

The hard integer programming problem consists of:

Divide the space into subspaces

Find upper and lower bounds on f(x) in each subspace

If the lower bound on f(x) in a subspace is greater than the current minimum of the upper bounds of f(x) in other subspaces (i.e. the upper bound of f(x) in any subspace), then one can prune this subspace

If a subspace is still active and its upper bound > lower bound, then further divide it into subspaces and iterate the process

Parallelism comes from “data parallelism” over subspaces, which is suitable for thread-based systems

There is typically important shared knowledge, such as the current minimum upper bound and other information from one subspace that can be re-used by others

Shared (in-memory) database for performance
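As a concrete instance of the bound-and-prune loop, here is a sequential branch and bound sketch for a small 0/1 problem chosen purely for illustration: pick items to minimize total cost while their weights cover a demand. The problem, the function names, and the use of a greedy fractional relaxation as the lower bound are all assumptions. The `best` dictionary plays the part of the shared "current minimum upper bound"; in a threaded version the two recursive calls would be tasks that read and update it.

```python
def fractional_lower_bound(items, start, cost_so_far, remaining):
    """Lower bound from the relaxed problem: cover the remaining demand with the
    cheapest cost-per-weight items, allowing a fraction of the last one."""
    if remaining <= 0:
        return cost_so_far
    bound = cost_so_far
    for cost, weight in sorted(items[start:], key=lambda cw: cw[0] / cw[1]):
        if weight >= remaining:
            return bound + cost * remaining / weight
        bound += cost
        remaining -= weight
    return float("inf")  # demand cannot be met in this subspace


def branch_and_bound(items, demand):
    """items: list of (cost, weight) with positive weights. Returns (cost, indices)."""
    best = {"cost": float("inf"), "choice": None}  # shared minimum upper bound

    def search(index, chosen, cost_so_far, remaining):
        if remaining <= 0:  # feasible point: a new upper bound candidate
            if cost_so_far < best["cost"]:
                best["cost"], best["choice"] = cost_so_far, tuple(chosen)
            return
        if index == len(items):
            return
        if fractional_lower_bound(items, index, cost_so_far, remaining) >= best["cost"]:
            return  # prune: this subspace cannot beat the best known solution
        cost, weight = items[index]
        search(index + 1, chosen + [index], cost_so_far + cost, remaining - weight)
        search(index + 1, chosen, cost_so_far, remaining)

    search(0, [], 0, demand)
    return best["cost"], best["choice"]
```

The earlier a good feasible solution is found, the more subspaces are pruned, which is why sharing the bound between workers matters for performance.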

### Computer Chess I

Games like computer chess are a special case of the general branch and bound strategy

The space is the set of all moves, where N moves by white and black is 2N plies; at each ply there are roughly 35 legal moves, so the complexity is $35^{2N}$

Evaluation of one set of moves to depth 2N is completed by evaluating the final position f(x) (x is the set of moves) by rules reflecting chess wisdom, summarized by a number (Queen=10, Pawn=1 etc.)

Deep Blue parallelized the calculation of f(x) but here we explore subspace parallelization

We follow work done at Caltech using a 512-node nCUBE, which competed as WAYCOOL, with poor reliability and results, in the 1987 and 1988 ACM Computer Chess Championships

### Computer Chess II

The upper-lower bound approach is replaced by a minimax principle

Assume f( x ) positive is good for white; then at each move white looks at each subspace spawned from the white move and chooses the one with the largest f( x )

In evaluating the subspace we assume that at each stage, the side on move makes the best choice

White always maximizes f(x) at her move and black minimizes f(x) at his move

Of course, as N is finite and the evaluation function is approximate, this is not precise, but it gets better and better the larger N is

Note human players tend to use more pattern recognition and less brute force evaluation

Computer games are unimaginative but have fewer errors

### Computer Chess III

Pruning is illustrated below; as it is advantageous (if white is to move) to get a large (good) value of f(x) as early as possible, one sorts moves at each node and looks at the most plausible first

This reduces the effective branching ratio from 35 to 6

Figure: minimax search tree with alternating “White Maximizes” and “Black Minimizes” levels. The dotted lines show subspaces that never need to be searched; this requires that one have done a complete depth search at the first subspaces looked at.
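The effect of move ordering can be seen in a small alpha-beta sketch. As before, the tree is a nested list of leaf scores, which is an illustrative assumption; `counter` is a hypothetical one-element list used only to count how many leaves get evaluated.

```python
def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf"), counter=None):
    """Minimax value with alpha-beta pruning; counter[0] counts evaluated leaves."""
    if not isinstance(node, list):
        if counter is not None:
            counter[0] += 1
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, counter))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # remaining children never need to be searched
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, counter))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value
```

On `[[3, 5], [6, 9], [1, 2]]` this evaluates 5 of the 6 leaves. With the best move searched first, `[[6, 9], [3, 5], [1, 2]]`, it evaluates only 4, and the result is the same in both cases.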

### Computer Chess IV

Threads were spawned in groups of 4 in the Caltech example at different depths of the tree, and the project achieved a speedup of over 100; the larger the number of plies N gets, the more parallelism there will be

Figure: search tree, labelled “Increasing search depth”.

### Computer Chess V

We have subsets of threads (4 in this example) synchronizing on node minimax value

This is a global variable and there are (as in other branch and bound) very important performance gains from a shared position database

This allows scores to be stored for positions and re-used

In chess there are many transpositions leading to identical positions

1 e4 e5 2 Nf3 Nc6 is identical to (less usual) 1 Nf3 Nc6 2 e4 e5

There was only a few percent overhead for a distributed database on Caltech distributed memory implementation

Queuing of update requests ensured no errors from multiple threads accessing same location

Multicore architecture should be excellent for this and other large branch and bound and related search algorithms, as it supports shared databases and fast thread synchronization

Note that in Deep Fritz vs. Vladimir Kramnik (human world champion) in November 2006, the program ran on a personal computer containing two Intel Core 2 Duo CPUs, capable of evaluating 8 million positions per second, and searching to an average depth of 17 to 18 ply in the middlegame. Deep Fritz won 4-2
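A shared position database of the kind described on this slide might look like the toy below. Everything here is an assumption made for illustration: the class name, the evaluation function passed in, and especially the position key, which treats a position as the unordered sets of moves made by each side. That shortcut is enough to identify the transposition quoted above but is not how a real engine identifies positions. A lock serializes updates, in the spirit of the queued update requests mentioned for the Caltech implementation.

```python
import threading


class PositionDatabase:
    """Thread-safe score cache shared by all search threads (toy sketch)."""

    def __init__(self, evaluate):
        self._evaluate = evaluate
        self._scores = {}
        self._lock = threading.Lock()
        self.evaluations = 0  # how many times the evaluator actually ran

    @staticmethod
    def key(moves):
        # Toy assumption: white's moves and black's moves as unordered sets.
        return frozenset(moves[0::2]), frozenset(moves[1::2])

    def score(self, moves):
        k = self.key(moves)
        with self._lock:  # one thread at a time checks and updates an entry
            if k not in self._scores:
                self.evaluations += 1
                self._scores[k] = self._evaluate(moves)
            return self._scores[k]


db = PositionDatabase(evaluate=lambda moves: len(moves))
lines = [["e4", "e5", "Nf3", "Nc6"], ["Nf3", "Nc6", "e4", "e5"]]
threads = [threading.Thread(target=db.score, args=(line,)) for line in lines * 2]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Four threads ask for the two move orders, but the evaluator runs once because both orders map to the same key.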

### Wikipedia SVM Example

We are finding optimal hyperplane splitting two samples

Samples are training set

Normal w to the splitting hyperplane is given by $w = \sum_{i=1}^{n} y_i \alpha_i x_i$

Two samples denoted by crosses ($y_i = 1$) or circles ($y_i = -1$)

### Support Vector Machines SVM I

These divide sets by (in simplest case) hyperplanes into two in an optimal least squares fashion

Minimize $f(\alpha) = 0.5\, \alpha^T G \alpha - \sum_{i=1}^{n} \alpha_i$

Subject to $\sum_{i=1}^{n} y_i \alpha_i = 0$ and $0 \le \alpha_i \le C$

With $G_{ij} = y_i y_j K(x_i, x_j)$ for kernel K

This is a training problem where we have a total of n data points from two populations, with $y_i = +1$ for the first and $y_i = -1$ for the second

$K(x_i, x_j) = x_i \cdot x_j$ is the simplest case, when division is by a hyperplane in the space in which x is a vector, but Gaussian forms are often used: $K = \exp(-\text{constant}\, \|x_i - x_j\|^2)$

G is an n by n dense matrix (n is the number of data points)

This is a quadratic programming (QP) problem
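To make the objects in this formulation concrete, the sketch below builds the matrix G for a linear or Gaussian kernel and evaluates the objective for a given vector of multipliers. It does not solve the QP. The function names and the two-point data set in the usage note are illustrative assumptions, and the multipliers are written `alpha` on the assumption that this is the symbol in the slide's formulas.

```python
import math


def linear_kernel(a, b):
    """K(x_i, x_j) = x_i . x_j"""
    return sum(p * q for p, q in zip(a, b))


def gaussian_kernel(a, b, constant=1.0):
    """K(x_i, x_j) = exp(-constant * |x_i - x_j|^2)"""
    return math.exp(-constant * sum((p - q) ** 2 for p, q in zip(a, b)))


def gram_matrix(xs, ys, kernel):
    """G_ij = y_i * y_j * K(x_i, x_j): dense and n by n."""
    n = len(xs)
    return [[ys[i] * ys[j] * kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]


def dual_objective(alpha, G):
    """f(alpha) = 0.5 * alpha^T G alpha - sum_i alpha_i"""
    n = len(alpha)
    quad = sum(alpha[i] * G[i][j] * alpha[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)
```

For the points `(1.0, 0.0)` with label +1 and `(-1.0, 0.0)` with label -1, the linear kernel gives a G whose entries are all 1, and `alpha = [0.5, 0.5]` satisfies the equality constraint with objective value -0.5. Storing G takes n² numbers, which is the memory problem raised on the next slide.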

### Support Vector Machines SVM II

Differentiating with respect to $\alpha$ gives linear equations that must be solved iteratively to satisfy the inequality constraints

The solver matrix G is both large ($10^6$ by $10^6$) and can be dense, and this requires large storage space, which often exceeds available memory

As in much quadratic programming one can use conjugate gradient solution methods as this identifies systematically the important directions in space (roughly large eigenvalues of positive definite symmetric matrix G)

There are several papers on parallel SVM but I did not see substantial use of parallel implementations

There were two approaches

Either solve the matrix problems in parallel or

Split up dataset and solve multiple subproblems

### Support Vector Machines SVM III

Solve the matrix problems in parallel

Interestingly one does not solve full G but iterates up from smaller (~150 by 150) problems and so data parallelism does not exploit size n

Need more reliable SVM solvers for large matrices?

Split up dataset and solve multiple subproblems – Scalable!

Here the difficulty is that essentially you have changed the algorithm and it is not clear how best to combine the solutions of the subproblems

But original SVM is full of heuristics (choice of K) so other heuristics may be allowed!

Note whereas multicore appears especially attractive for search problems, it is not so clear for SVM

Multicore does not address the huge size of matrix G

High performance matrix solvers are available for distributed memory machines

I suspect there are better “approximate” SVM solvers that will do well on multicore and reduce dimension of G but this is research

### Some Parallelization Results

From “Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems”

This paper reviews much previous work

Super linear speedup in (a) due to extra memory
