Harlan, new Lispy language for GPU programming

swannodette · on July 1, 2013

Pretty cool, region inference is handled via miniKanren, a relational / constraint logic programming embedding in Scheme (also available in Clojure as core.logic):

http://github.com/eholk/harlan/blob/master/harlan/middle/inf...

eholk · on July 1, 2013

We did start out with a miniKanren type inferencer and region inferencer, but I replaced it a couple months back with a unified type and region inferencer. Region inference is definitely an interesting challenge for miniKanren, since a lot of the constraints it introduces are soft. There are many legal region assignments, but deciding the best one requires some heuristics.

octo_t · on July 1, 2013

Have you looked at GPUVerify[1] from Imperial College to verify the correctness of code produced?

[1] - http://multicore.doc.ic.ac.uk/tools/GPUVerify/

swannodette · on July 1, 2013

I see, thanks for the update! :)

BruceIV · on July 1, 2013

I read through the user guide in the git repo, but it hasn't been updated for a year, and doesn't tell me what I'm most interested in (namely, what kind of optimizations does the compiler do, and what's the win for this language over CUDA, besides being able to write kernels in Lisp instead of C++) - does anyone have any insight?

eholk · on July 1, 2013

I really should update the user guide, since the language has changed a lot in the last year.

I have an earlier post that discusses a couple of the optimizations that Harlan does: http://blog.theincredibleholk.org/blog/2013/06/10/some-simpl...

To me the win for Harlan over CUDA is its region system, which lets you work with more intricate pointer structures in the GPU. For example, there is an interpreter for the Lambda Calculus as one of the test cases, which would be much harder to do in straight CUDA: https://github.com/eholk/harlan/blob/master/test/lambda3.kfc

Very soon, Harlan will have support for higher order procedures, which is also not available in CUDA.

iskander · on July 1, 2013

How do you support higher order procedures? Do you do control flow analysis & defunctionalization or are these "real" higher order functions?

eholk · on July 1, 2013

It's basically defunctionalization. It's hard to get access to function pointers on the GPU to do "real" higher order functions.

seanmcdirmid · on July 1, 2013

What version of CUDA are you talking about. I though the latest version supported indirect function pointers (finally).

iskander · on July 2, 2013

That's good news! Did that change with Kepler? Any links to more info?

iskander · on July 1, 2013

I wish they'd at least give us a computed goto...but alas, every switch is destined for predicated branching.

BruceIV · on July 1, 2013

Thanks! What kind of performance do you get with some of those more intricate pointer structures? My own research right now is focused on how to build some common abstract data types with less pointer indirection, so as to be more easily vectorizable.

eholk · on July 1, 2013

Lately I've been more focused on getting the features running and less on performance, so I haven't completely explored the impact of using these structures.

I'd be interested to hear more about what you're doing. Do you have any links you could share?

BruceIV · on July 1, 2013

I've actually only really just started, I'm less than a year into my PhD, so I don't have anything published yet. My current project involves doing some non-trivial calculations on a large matrix of multi-precision integers - the obvious way to lay out the data (an array of dynamically resizable vectors) is dead wrong for GPU (the memory bandwidth is completely swamped by fetching from a different pointer for each thread - I got 100x slowdown from the CPU reference implementation), so right now I'm redoing the experiments based on a vector of fixed-length arrays (the 0th elements of all the vectors from the first formulation go in one array, the 1st elements in the next, and so forth).

iskander · on July 3, 2013

"Array of structs" -> "Struct of arrays"?

I guess if you generalize this transformation sufficiently you end up doing whole-program flattening like NESL and DPH. Or is there a less intrusive way to pull that off?

carterschonwald · on July 14, 2013

Sounds like in his case the unboxed vector soa rep might be enough, though I'd enjoy hearing more about this too. (Doing some gpu work with Haskell a bunch this fall)

BruceIV · on July 4, 2013

I don't think there's a less intrusive way to pull it off, but with sufficient compiler support you should be able to make it fairly transparent to the programmer.

BruceIV · on July 1, 2013

I found a couple relevant links - a blog post of the author's about GPU optimizations [1], and an early paper on this work [2].

[1] http://blog.theincredibleholk.org/blog/2013/06/10/some-simpl... [2] http://www.cs.indiana.edu/~eholk/papers/parco2011.pdf

z3phyr · on July 2, 2013

Eric is also working on the Rust language for GPU. https://github.com/eholk/rust/tree/nvptx

Ihmahr · on July 1, 2013

This is very much needed to make gpu more mainstream.

vault_ · on July 1, 2013

Things like NumbaPro[1] and OpenACC[2] are going to go much further towards making gpu programming more mainstream than a lisp dialect.

NumbaPro is a python library that let's you JIT python into CUDA or CPU vector extensions via decorators.

OpenACC is similar (or possibly developed out of) OpenMP in that it takes a directive driven approach to accelerating code. Given a loop you can add a pragma above it with instructions of how it should run a GPU and it will make it happen.

[1] http://docs.continuum.io/numbapro/ [2] http://www.openacc-standard.org/

EDIT: I forgot to mention that I believe we'll be seeing less writing of GPU code directly and more using of libraries that do the hard work for you. Some algorithms on GPUs are quite tricky to get right, and even with a language that makes them easy to express they are non-trivial to design and test. (Nvidia already provides Thrust with cuda which provides a nice STL like library that implements a lot of common routines)

[3] http://docs.nvidia.com/cuda/thrust/

iskander · on July 1, 2013

>Harlan already has native support for rich data structures, including trees and ragged arrays. Very soon the language will support higher order procedures.

Structured data and nonuniform computations are very hard to express right now on the GPU and typically require a long and terrible process of "creative" CUDA programming. Harlan seems to be pushing the boundaries of what sorts of GPU programs we can effectively/efficiently implement. But, because the syntax is unusual, you're saying it's irrelevant compared to NumbaPro...which is what, a thin wrapper over elementwise array operations?

vault_ · on July 1, 2013

Syntax is syntax, whether it's Lisp or C. What I mean is I hardly think that a new language is what GPU programming needs to break into the mainstream.

What I believe GPU programming needs is tools that people can easily add to their current environments or that they can easily extend their current algorithms with. NumbaPro, OpenACC, and Thrust allow that.

In addition, the reason it's hard to express structured data and nonuniform computation is because those aren't things GPU architecture excels at. It excels at doing uniform operations over large chunks of memory. I'm sure that Harlan is pushing the boundary of what you can do on a GPU, but it's not going to exceed the limitations imposed by the hardware.

iskander · on July 1, 2013

>In addition, the reason it's hard to express structured data and nonuniform computation is because those aren't things GPU architecture excels at.

If you write simple GPU kernels (like those that Theano/Thrust/Copperhead/NumbaPro let you easily express) then you're mostly stuck "doing uniform operations over large chunks of memory". However, the latest GPUs are packed with features for going beyond this simple model. There's a rich set of global atomic operations, a fast register shuffle, better caching and most importantly: dynamic parallelism via nested kernel invocations. We're not programming for the G80 any more, Keplers can run a much larger swath of programs. Sure, you won't reach the theoretical peak FLOPS by traversing irregular structures via recursive kernels but you might still beat the pants off a CPU.

phren0logy · on July 1, 2013

I'm not sure if you are joking (given that it is a lisp, which is notoriously not mainstream), but my limited experience with GPU programming with the existing tools has been pretty painful. I, for one, welcome this development.

burke · on July 1, 2013

The lisp part isn't necessary (or necessarily helpful), but it's certainly an improvement on embedding C-as-a-string in more C.

...though I am concerned that abstracting this stuff too much makes it less obvious how to optimize for the GPU. This may be a language for people that already have a solid grasp on GPU programming.

octo_t · on July 1, 2013

>> embedding C-as-a-string in more C

Um. OpenCL/Cuda look absolutely nothing like that.

_pmf_ · on July 2, 2013

>> embedding C-as-a-string in more C >Um. OpenCL/Cuda look absolutely nothing like that.

Absolutely right. You can also include an external C file from C++ [1]; that makes it so much prettier.

[1] http://developer.amd.com/tools-and-sdks/heterogeneous-comput...

bajsejohannes · on July 1, 2013

Yes, anything to make it easier to get started. I've tried before, but got lost in figuring out which SDK to download. Too bad this toolkit doesn't help in that process.

Happily, writing this answer I discovered that OS X 10.7+ ships with OpenCL already! Can't believe I missed that last time.

For anyone else wanting to try it, the first example is pretty straight forward:

https://developer.apple.com/library/mac/#documentation/Perfo...

(You'll also want the code examples at https://developer.apple.com/library/mac/#documentation/Perfo... )

reeses · on July 1, 2013

The OpenCL implementation provided by Apple is less than ideal. It lacks a number of optimizations that are provided by, for example, AMD's OpenCL driver packages for Linux and Windows.

That, coupled with the wretched GPU options available on the Mac Pro, make the current stack not really cost effective. (I think a 5770 is still $250.)

However, it is better than CPU-only.

I've found slaving a cheap Linux box with expensive GPUs to be a much more gratifying experience.

eholk · on July 1, 2013

I've done most of my development on a Mac, but I keep finding many bugs in Apple's OpenCL implementation. Intel's seems to follow the standard most closely, which is useful for keeping me honest.

octo_t · on July 1, 2013

OpenCL is used to JIT OpenGL calls based on the hardware. Its magic.

pi18n · on July 1, 2013

Are GLSL and OpenCL equally capable?

bajsejohannes · on July 1, 2013

Not exactly sure what you're asking, but GLSL and OpenCL are for two different domains: GLSL for graphics and OpenCL for calculations that are not intended to show up as graphics.

They both run on the same hardware, but I don't have enough (or really any) experience with OpenCL to say if it feels more or less powerful than GLSL.

Edit: It turns out that I have no OpenCL hardware support (mid 2011 MBA), so I guess the hardware isn't necessarily completely similar (i.e. it can support GLSL and not OpenCL)

Impossible · on July 1, 2013

This isn't entirely true. As of OpenGL 4.3 (latest spec), OpenGL adds compute shaders as well. Their primary usage is to make interopt with the graphics pipeline less painful than it would be with OpenCL, but in theory you could use OpenGL (or DX11 on Windows) compute shaders for non-graphics tasks, and they'd have roughly the same capabilities as CUDA or OpenCL.

Ihmahr · on July 2, 2013

Not joking.

s-phi-nl · on July 1, 2013

This reminds me of the Chapel programming language (http://chapel.cray.com/), which seeks to be a relatively high-level language for all parallel computation.

They differ in that Harlan is just for GPUs, while Chapel aims to be good for all parallel computation. Also, Harlan uses Lisp syntax, while Chapel's syntax is C-based.

Any comments on similarities/differences between them from those who know either one better than I?

SeanDav · on July 1, 2013

A pity this is not available under Windows (unless I missed something).

eholk · on July 1, 2013

It's not, although I don't think it should be too hard to port. I haven't done it yet because I don't have ready access to a Windows machine with an OpenCL-capable GPU.