Advanced Usage
As mentioned above, ParallelAccelerator aims to optimize implicitly parallel Julia programs that are safe to parallelize. It also tries to be non-invasive, which means a user function or program should continue to work as expected even when only a part of it is accelerated. It is still important to know what parts are accelerated, however. As a general guideline, we encourage users to write programs using high-level array operations rather than writing explicit for-loops which can have unrestricted mutations or unknown side-effects. High-level operations are more amenable to analysis and optimization provided by ParallelAccelerator.
To help users verify program correctness, the optimizations of ParallelAccelerator can be turned off by setting the environment variable PROSPECT_MODE=none before running Julia. Programs that use ParallelAccelerator will still run (including those that use runStencil, described below), but no optimizations or Julia-to-C translation will take place. Users can also use @noacc at the function call site to call the original version of the function.
Map and Reduce
Array operations that work uniformly on all elements of input arrays and produce an output array of equal size are called point-wise operations. Point-wise binary operations in Julia usually have a . prefix in the operator name. These operations are translated internally into data-parallel map operations by ParallelAccelerator. The following are recognized by @acc as map operations:
- Unary functions: -, +, acos, acosh, angle, asin, asinh, atan, atanh, cbrt, cis, cos, cosh, exp10, exp2, exp, expm1, lgamma, log10, log1p, log2, log, sin, sinh, sqrt, tan, tanh, abs, copy, erf
- Binary functions: -, +, .+, .-, .*, ./, .\, .%, .>, .<, .<=, .>=, .==, .<<, .>>, .^, div, mod, rem, &, |, $, min, max
Array assignments are also recognized and converted into in-place map operations. An expression like a = a .+ b will be turned into an in-place map that takes two input arrays, a and b, and updates a in-place.
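As a plain-Julia illustration (runnable without ParallelAccelerator), the snippet below uses only operations from the lists above; inside an @acc function, expressions of exactly this shape are the ones translated into data-parallel maps:

```julia
# Point-wise operations of the kind @acc recognizes as maps.
a = [1.0, 4.0, 9.0]
b = [1.0, 2.0, 3.0]

c = sqrt.(a) .+ b   # unary map (sqrt) combined with a binary map (.+)
a = a .+ b          # recognized as an in-place map updating a
```

Under @acc, chained point-wise operations like `sqrt.(a) .+ b` are fused into a single parallel loop instead of allocating an intermediate array per operation.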
Array operations that compute a single result by repeatedly applying an associative and commutative operator to all input array elements are called reduce operations. The following are recognized by @acc as reduce operations: minimum, maximum, sum, prod, any, all.
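For example, in plain Julia these reductions compute the following; inside an @acc function the same calls are executed as parallel reductions:

```julia
# Reductions from the list above, shown in plain Julia.
v = [3.0, 1.0, 4.0, 1.0, 5.0]

s  = sum(v)        # reduction with +
m  = minimum(v)    # reduction with min
p  = prod(v)       # reduction with *
ok = all(v .> 0)   # point-wise map (.>) feeding an all-reduction
```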
We also support range operations to a limited extent. For example, a[r] = b[r], where r is either a BitArray or a UnitRange (e.g., 1:s), is internally converted to parallel operations when the ranges can be statically inferred to be compatible. However, such support is still experimental, and occasionally ParallelAccelerator will complain about not being able to optimize them. We are working on improving this feature to provide more coverage and better error messages.
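The two range forms can be sketched in plain Julia as follows; under @acc, assignments of this shape are the candidates for conversion to parallel operations (remember this support is experimental):

```julia
# Range assignments of the form a[r] = b[r].
a = zeros(10)
b = collect(1.0:10.0)

r = 2:5              # UnitRange: compatible sizes on both sides
a[r] = b[r]

mask = b .> 8.0      # BitArray selecting elements of b greater than 8
a[mask] = b[mask]
```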
Parallel Comprehension
Array comprehensions in Julia are in general also parallelizable because, unlike general loops, their iteration variables have no inter-dependencies. The @acc macro will turn them into an internal form that we call cartesianarray:

    A = Type[ f(x1, x2, ...) for x1 in r1, x2 in r2, ... ]

becomes:

    cartesianarray((i1, i2, ...) -> begin x1 = r1[i1]; x2 = r2[i2]; f(x1, x2, ...) end,
                   Type, (length(r1), length(r2), ...))
This cartesianarray function is also exported by ParallelAccelerator and can be used directly by the user. The two forms above are both acceptable programs and are equivalent in semantics. Each produces an N-dimensional array whose elements are of Type, where N is the number of x and r variables; currently only up to 3 dimensions are supported.
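To make the equivalence concrete, here is a sequential plain-Julia sketch of cartesianarray's semantics (the `_sketch` suffix marks it as a hypothetical stand-in; the real cartesianarray exported by ParallelAccelerator runs the body in parallel):

```julia
# Sequential sketch of cartesianarray: build an array of element type T
# with the given dims, filling each cell by calling f on its indices.
function cartesianarray_sketch(f, T, dims)
    A = Array{T}(undef, dims)
    for I in CartesianIndices(dims)
        A[I] = f(Tuple(I)...)
    end
    return A
end

r1, r2 = 1:3, 10:10:40
A = cartesianarray_sketch((i1, i2) -> r1[i1] * r2[i2], Int,
                          (length(r1), length(r2)))

# The equivalent comprehension form:
B = Int[ x1 * x2 for x1 in r1, x2 in r2 ]
```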
It should be noted, however, that not all comprehensions are safe to parallelize. For example, if the function f above reads and writes a variable outside of the comprehension, then running it in parallel can produce a non-deterministic result. It is therefore the responsibility of the user to avoid using @acc in such situations.
Another difference between a parallel comprehension and the aforementioned map operations is that array indexing in the body of a parallel comprehension remains explicit and therefore goes through the necessary bounds-checking to ensure safety. In map operations, on the other hand, such bounds-checking is skipped.
Stencil
Stencils are commonly found in scientific computing and image processing. A stencil computation is one that computes new values for all elements of an array based on the current values of their neighboring elements. Since Julia's base library does not provide such an API, ParallelAccelerator exports a general runStencil interface to help with stencil programming:

    runStencil(kernel :: Function, buffer1, buffer2, ...,
               iteration :: Int, boundaryHandling :: Symbol)
As an example, the following (taken from our Gaussian blur example) performs a 5x5 stencil computation (note the use of Julia's do-block syntax, which lets the user write a lambda function):

    runStencil(buf, img, iterations, :oob_skip) do b, a
        b[0,0] =
            (a[-2,-2] * 0.0030 + a[-1,-2] * 0.0133 + a[0,-2] * 0.0219 + a[1,-2] * 0.0133 + a[2,-2] * 0.0030 +
             a[-2,-1] * 0.0133 + a[-1,-1] * 0.0596 + a[0,-1] * 0.0983 + a[1,-1] * 0.0596 + a[2,-1] * 0.0133 +
             a[-2, 0] * 0.0219 + a[-1, 0] * 0.0983 + a[0, 0] * 0.1621 + a[1, 0] * 0.0983 + a[2, 0] * 0.0219 +
             a[-2, 1] * 0.0133 + a[-1, 1] * 0.0596 + a[0, 1] * 0.0983 + a[1, 1] * 0.0596 + a[2, 1] * 0.0133 +
             a[-2, 2] * 0.0030 + a[-1, 2] * 0.0133 + a[0, 2] * 0.0219 + a[1, 2] * 0.0133 + a[2, 2] * 0.0030)
        return a, b
    end
It takes two input arrays, buf and img, and performs an iterative stencil loop (ISL) for the number of iterations given by iterations. The stencil kernel is specified by a lambda function that takes two arrays, b and a (corresponding to buf and img), and computes the value of the output buffer using relative indices, as if a cursor were traversing all array elements. [0,0] represents the current cursor position. The return statement in this lambda swaps the positions of a and b to specify a buffer rotation that should happen between stencil iterations. runStencil assumes that all input and output buffers have the same dimension and size.
Stencil boundary handling can be specified as one of the following symbols:

- :oob_skip: Writing to the output is skipped when input indexing is out-of-bounds.
- :oob_wraparound: Indexing is "wrapped around" at the array boundaries, so it is always safe.
- :oob_dst_zero: Write 0 to the output array when any of the input indices is out-of-bounds.
- :oob_src_zero: Assume 0 is returned by a read operation when indexing is out-of-bounds.
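As a concrete reference for one of these modes, the following plain-Julia sketch (a hypothetical helper, not the exported API) performs a single 2-D stencil iteration with :oob_skip semantics: the write to the output is simply skipped whenever any relative read would fall outside the array:

```julia
# One sequential :oob_skip stencil iteration over 2-D arrays.
# `offsets` pairs each relative index (di, dj) with its coefficient,
# mirroring the relative a[di,dj] reads inside a runStencil kernel.
function stencil_oob_skip!(dst, src, offsets)
    m, n = size(src)
    for j in 1:n, i in 1:m
        inbounds = all(1 <= i + di <= m && 1 <= j + dj <= n
                       for ((di, dj), _) in offsets)
        inbounds || continue   # :oob_skip - leave dst[i, j] untouched
        dst[i, j] = sum(c * src[i + di, j + dj] for ((di, dj), c) in offsets)
    end
    return dst
end

# A 3x3 averaging kernel expressed as offset/coefficient pairs.
kernel = [((di, dj), 1 / 9) for di in -1:1, dj in -1:1]
src = ones(4, 4)
dst = zeros(4, 4)
stencil_oob_skip!(dst, src, kernel)
```

Only the interior of `dst` is written; the one-cell border keeps its previous contents, which is exactly the :oob_skip behavior described above.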
Just as with parallel comprehensions, accessing variables outside the body of the runStencil lambda expression is allowed. However, accessing outside array values is not supported, and reading/writing the same outside variable can cause non-determinism.

All arrays that need to be relatively indexed can be specified as input buffers. runStencil does not impose any implicit buffer rotation order, and the user can choose not to rotate buffers in the return statement. There can be multiple output buffers as well. Finally, the call to runStencil does not have any return value, and inputs are rotated iterations - 1 times if rotation is specified.
ParallelAccelerator exports a naive Julia implementation of runStencil that runs without using @acc; its purpose is mostly correctness checking. When @acc is used with the environment variable PROSPECT_MODE=none, instead of parallelizing the stencil computation, @acc will expand the call to runStencil into a fast sequential implementation.