The key to Pogo’s speed is the use of GPUs (Graphics Processing Units). The standard CPU (Central Processing Unit) in a computer is very flexible at one-off tasks, such as those that arise when running a word processing package or a web browser. However, when it comes to graphics rendering, the same (often very simple) calculation is run for each pixel, each time with a slightly different input such as a coordinate location. GPUs were developed to address this problem, typically featuring hundreds of cores which can be run in parallel, with the cores optimised to run the same calculation each time. Additionally, in order to allow for the highest possible frame rates, GPUs are designed to have high memory bandwidth, i.e. they can access their memory much faster than a traditional CPU can access the system RAM. This is in part achieved by having memory integrated into the GPU itself.
The two key features, lightweight parallel capabilities and high memory bandwidth, enable Pogo’s speed. The explicit time domain solver updates values at every node at each time step, a highly parallel problem requiring fast access to memory.
Pogo uses Nvidia’s CUDA to access the GPU. This is a flexible technology, allowing the software to run on any CUDA-capable card. It does mean that only Nvidia cards are suitable for Pogo, however. It is advised that the CUDA compute version (which Nvidia calls the compute capability) is greater than 2.0 (see the list on the Nvidia website), although cards below this level can no longer be purchased. Note that the latest version of the CUDA toolkit only supports v3.0 onwards, so it is likely that Pogo will be unable to run on these older cards in the near future.
The main performance-limiting parameter is the memory bandwidth. Pogo (and the explicit time domain method generally) is bandwidth limited; this means that the speed is limited by how quickly the solver can load data from memory rather than how quickly the calculation can be done. In practice this means that the number of GPU cores is largely irrelevant; instead the run time is inversely proportional to the memory bandwidth, so doubling the bandwidth roughly halves the run time. Typical good bandwidths are around 300 GB/s. Gaming GPUs are a good choice for Pogo, being cheap due to market competition, yet generally possessing high bandwidths. However, for more dedicated multi-GPU systems, it may be necessary to go for professional cards such as those from the Tesla range due to practical constraints such as cooling.
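As a rough illustration of the bandwidth-limited behaviour described above, the sketch below estimates how the run time on one card compares with another purely from their memory bandwidths. The function name and the 300 GB/s reference value are illustrative assumptions, not part of Pogo, and the model ignores everything except bandwidth.

```python
def relative_run_time(bandwidth_gb_s, reference_bandwidth_gb_s=300.0):
    """Estimate run time relative to a reference card.

    Assumes the solver is purely bandwidth limited, so run time scales
    as 1 / bandwidth. Core count, clock speed and pre/post-processing
    are deliberately ignored (a simplifying assumption).
    """
    if bandwidth_gb_s <= 0 or reference_bandwidth_gb_s <= 0:
        raise ValueError("bandwidths must be positive")
    return reference_bandwidth_gb_s / bandwidth_gb_s


if __name__ == "__main__":
    # A card with twice the reference bandwidth should take about
    # half as long under this model.
    print(relative_run_time(600.0))
```

A value below 1 means the card is expected to be faster than the reference; the figure should be treated as an order-of-magnitude guide only.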
The second consideration is model size. A typical GPU has far less memory than the system RAM, and this can limit the size of model which can be stored in the GPU memory. Note that the entire model must fit in GPU memory; transferring data from system RAM to the GPU is very slow and would have to be done at every time step, so Pogo does not support this. Most 2D models can be run on a single card; however, medium-to-large 3D models will need multiple cards.
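Because the whole model has to fit in GPU memory, a first estimate of the number of cards is simply the model size divided by the usable memory per card, rounded up. The sketch below assumes the model's memory footprint is already known in GB; the function name and the `usable_fraction` safety margin are hypothetical choices for illustration, and the extra storage needed at the boundaries between cards is ignored.

```python
import math


def cards_needed(model_size_gb, card_memory_gb, usable_fraction=0.9):
    """Estimate how many identical cards are needed to hold a model.

    usable_fraction is an assumed safety margin for memory that is not
    available to the solver; it is not a figure taken from Pogo.
    """
    if model_size_gb <= 0 or card_memory_gb <= 0:
        raise ValueError("sizes must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_gb = card_memory_gb * usable_fraction
    return math.ceil(model_size_gb / usable_gb)


if __name__ == "__main__":
    print(cards_needed(8.0, 11.0))   # small model, one 11 GB card
    print(cards_needed(30.0, 11.0))  # larger model, several 11 GB cards
```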
The amount of time Pogo takes to run is dependent on the GPU (as discussed above, this is primarily dependent on memory bandwidth; more information can be found in the paper doi:10.1016/j.jcp.2013.10.017). It can also depend on the CPU and other system parameters for pre- and post-processing (if these are included in the run-time comparison).
There is a card calculator below which may be useful for estimating the requirements for a particular problem.
At the time of writing a good individual card for Pogo would be the RTX 2080 Ti; this has 11GB of memory and an impressive 616GB/s bandwidth. The developers of Pogo currently use a 1080 Ti for local testing. Given a suitable system, i.e. with adequate cooling, four of these cards could be combined together.
Imperial has several systems which are used to run Pogo. The primary one for the largest jobs consists of 8 K80 cards; these are dual-GPU cards with 12 GB per GPU, making 8 x 2 x 12 = 192 GB total memory. This has proved suitable for almost all 3D problems (inevitably there is always a PhD student who wants to run the biggest model possible!). Bandwidth for the K80 is 240 GB/s per GPU, making 16 x 240 = 3840 GB/s in total. Nvidia have released a 48 GB card, the RTX 8000; eight of these combined would enable even larger models to be run.
Below is a simple calculator which can be used to estimate the specs of GPUs required to run a typical job. Note that several simplifying assumptions are made and the results should be used as approximate indicators only (particularly the simulation time, which is an order-of-magnitude estimate); we do not accept responsibility for GPUs incorrectly purchased based on this!
The Pogo solver primarily runs on Linux. A Windows version can also be run, although this has not been extensively tested. The developers primarily use Ubuntu (14.04, 16.04 and 17.10) for smaller runs, but the larger systems use Red Hat. Pogo is known to run on CentOS 6 and openSUSE. Efforts have been made to ensure it is compatible across the majority of current Linux distributions, and support can be provided to assist with this.
Pogo performs well across multiple GPUs, with very little resulting overhead. It achieves this by efficiently splitting the model into separate sections, one for each GPU. At each time step, the boundary values for each section are calculated first, then these are transferred to the other GPUs. While the transfer is happening, the remaining, internal, parts of each section are calculated, so the transfer time is generally hidden behind useful computation.
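The ordering described above (boundary first, then transfer, then interior) can be sketched with a toy 1D explicit wave-equation update split into two sections. This is not Pogo's implementation: it is plain sequential Python, so the overlap of transfer and computation is only represented by the order of the steps, and all function names and the constant `c2` (the squared Courant number) are assumptions made for illustration.

```python
def step_nodes(u_prev, u_curr, u_next, indices, c2):
    """Explicit central-difference update for the listed node indices."""
    for i in indices:
        u_next[i] = (2.0 * u_curr[i] - u_prev[i]
                     + c2 * (u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]))


def run_single(u0, n_steps, c2=0.25):
    """Reference run: whole model in one array, end nodes held at zero."""
    n = len(u0)
    u_prev, u_curr = list(u0), list(u0)
    for _ in range(n_steps):
        u_next = [0.0] * n
        step_nodes(u_prev, u_curr, u_next, range(1, n - 1), c2)
        u_prev, u_curr = u_curr, u_next
    return u_curr


def run_split(u0, n_steps, c2=0.25):
    """Same model split into two sections, each standing in for one GPU."""
    n = len(u0)
    m = n // 2
    if m < 2 or n - m < 2:
        raise ValueError("model too small to split")
    # Left section: global nodes 0..m-1, plus a ghost copy of node m (local m).
    left_prev, left_curr = list(u0[:m + 1]), list(u0[:m + 1])
    # Right section: ghost copy of node m-1 (local 0), then nodes m..n-1.
    right_prev, right_curr = list(u0[m - 1:]), list(u0[m - 1:])
    for _ in range(n_steps):
        left_next = [0.0] * len(left_curr)
        right_next = [0.0] * len(right_curr)
        # 1. Update the node on each side of the interface first.
        step_nodes(left_prev, left_curr, left_next, [m - 1], c2)
        step_nodes(right_prev, right_curr, right_next, [1], c2)
        # 2. "Transfer" the new boundary values into the neighbour's ghost
        #    node. On real hardware this copy would run asynchronously.
        right_next[0] = left_next[m - 1]
        left_next[m] = right_next[1]
        # 3. Update the remaining internal nodes; on real hardware this
        #    work would proceed while the transfer in step 2 is in flight.
        step_nodes(left_prev, left_curr, left_next, range(1, m - 1), c2)
        step_nodes(right_prev, right_curr, right_next,
                   range(2, len(right_curr) - 1), c2)
        left_prev, left_curr = left_curr, left_next
        right_prev, right_curr = right_curr, right_next
    # Drop the ghost nodes and stitch the sections back together.
    return left_curr[:m] + right_curr[1:]


if __name__ == "__main__":
    u0 = [0.0] * 16
    u0[7] = 1.0  # initial pulse next to the interface between sections
    single = run_single(u0, 20)
    split = run_split(u0, 20)
    print(max(abs(a - b) for a, b in zip(single, split)))
```

The split run reproduces the single-array run because each section only ever needs its neighbour's boundary value from the previous step, which is why the exchange can be overlapped with the interior update.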
In some cases it may be necessary to use multiple GPUs; as highlighted above this may be the case if a single card would not be big enough to run the model on its own. It is also possible to improve speed by using multiple GPUs, although while the transfer generally has very little overhead associated with it, the final performance is dependent on the entire system.
Clearly there are physical requirements for a system to be able to house multiple GPUs, particularly since high-performance GPUs are large. Cooling is also extremely important. When setting up a multi-GPU system, it is therefore advised that users purchase an entire system, including the GPUs, rather than fitting multiple cards into a separately purchased chassis. Such a system will have been designed to cope with the power, cooling and physical requirements, and is likely to be covered by warranty. Most manufacturers of such systems will provide a choice of GPUs.
In gaming systems with multiple GPUs, SLI (Scalable Link Interface) is sometimes proposed. This enables image output from separate cards to be combined prior to sending to the display. CUDA (and general-purpose GPU computing more broadly) cannot use this technology, so Pogo will not see any benefit from it.
Pogo performs preprocessing calculations on standard CPU architecture, so this may be considered when purchasing a suitable system. We have no advice regarding specific CPUs themselves - most modern systems should provide sufficient performance.
Regarding RAM, we advise having at least twice as much standard RAM as total GPU memory, in order to give sufficient space for preprocessing calculations. For example, for our 8 x K80 machine, with 8 x 24 GB = 192 GB of GPU memory, we have 512 GB of RAM, which is more than double.
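The rule of thumb above is easy to check for a planned system. The sketch below is a minimal illustration; the function names are hypothetical, and the factor of 2 is simply the guideline stated above.

```python
def recommended_system_ram_gb(gpu_memory_gb_per_card, n_cards, factor=2.0):
    """Minimum system RAM suggested by the 'twice the GPU memory' guideline."""
    if gpu_memory_gb_per_card <= 0 or n_cards <= 0:
        raise ValueError("memory and card count must be positive")
    return factor * gpu_memory_gb_per_card * n_cards


def ram_is_sufficient(system_ram_gb, gpu_memory_gb_per_card, n_cards):
    """True if the system RAM meets the guideline for the given GPUs."""
    return system_ram_gb >= recommended_system_ram_gb(
        gpu_memory_gb_per_card, n_cards)


if __name__ == "__main__":
    # The 8 x K80 machine: 24 GB per card, 512 GB of system RAM.
    print(recommended_system_ram_gb(24, 8))
    print(ram_is_sufficient(512, 24, 8))
```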