HPC FAQ

'''Over utilisation of CPU threads/cores'''<br>
Trying to run multiple HPC simulations across the same CPU threads. If, for example, you have 4 CPU threads on your computer and you run two simulations that both request 4 threads, you are effectively overloading the CPU hardware by requesting 8 threads in total. This will slow down the simulations by more than a factor of 2. The most efficient approach in this case is to run both simulations using 2 threads each, noting that any other CPU intensive tasks being performed also need to be considered.<br>
By default, from the 2020-01 release onwards, the number of CPU threads used is four (4). Previously, the default was two (2). You can control the number of threads requested either by using the <font color="blue"><tt>-nt<number_threads></tt></font> run time option, e.g. -nt2, or by using the TCF command <font color="blue"><tt>CPU Threads</tt></font>. The -nt run time option overrides <font color="blue"><tt>CPU Threads</tt></font>.<br>
Note: If Windows hyperthreading is active, there will typically be two threads for each physical core. For computationally intensive processes such as TUFLOW, it is recommended that hyperthreading is deactivated so there is one thread for each core.<br>
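As a minimal sketch, both ways of requesting 2 threads are shown below (the executable and file names are placeholders):
<pre>
! TCF command: limit this simulation to 2 CPU threads
CPU Threads == 2
</pre>
or equivalently from the command line, where -nt2 overrides any CPU Threads command in the TCF:
<pre>
TUFLOW.exe -nt2 my_model.tcf
</pre>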
 
Not all HPC models will show an increase in run time greater than 20% when changing from the pre-2020 to post-2020 releases. Models that are controlled by the <u>[[HPC_Adaptive_Timestepping | Wave Celerity or Courant Control Numbers]]</u> are likely to be similar in run time. However, especially where the cell size is smaller than the depth, the Wu approach is vastly superior to Smagorinsky, and the more sophisticated Wu solution may start causing the <u>[[HPC_Adaptive_Timestepping | Diffusion Control Number]]</u> to control the timestepping, resulting in longer run times.<br>
Despite the possible increase in run time for some models, the Wu turbulence scheme is warranted, particularly as cell sizes are typically getting smaller.<br>
<br>
 
= Is it possible to run two TUFLOW HPC simulations using a single GPU, and how does this affect performance? =
Yes, it is possible to run two (or more) simulations on a single GPU. The performance depends on whether the models need more than half of the GPU’s resources. At worst, the models complete just as fast as if they had run one after the other. At best, the GPU has enough resources to run both side by side and complete them as quickly as if they had run on two GPUs. Realistically, each model may use a bit more than half of the resources, causing some slowdown. A model run’s startup phase can require 4-6 times the RAM of the model calculations, so the available memory needs to be sufficient to handle two (or more) models starting up side by side. If it is not sufficient, the models’ startup phases can be staggered.<br>
<br>
 
= With Wu turbulence scheme being the new default, are old models using Smagorinsky wrong? =
Turbulence is pronounced in areas of highly transient flow, e.g. high velocities, bends, ledges and flow contraction/expansion. Where the flow is more benign and/or bed roughness is high, turbulence is less important, as it only applies where there are strong spatial velocity gradients. For example, for uniform flow in a straight rectangular channel the turbulence term is zero, as there is no spatial velocity gradient.<br>
The problem with Smagorinsky, which is a large scale eddy turbulence model originally developed for coastal modelling, is that it is cell size dependent (the eddy viscosity is proportional to the cell surface area) and tends to zero as the cell size tends to zero. This has historically not been a major issue, as cell sizes have typically been greater than the depth. The general recommendation in the <u>[https://docs.tuflow.com/classic-hpc/manual/latest/ TUFLOW Manual]</u> is to be careful of using cell sizes significantly smaller than the depth (see Section 1.4). However, as cells have become finer and finer with the advent of GPU models, this issue has increasingly emerged. It is particularly pertinent if a quadtree or flexible mesh is being used with cells that are very small relative to the depth. If this is the case, bigger differences will be present for bigger events, where the water levels and velocities are higher.<br>
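For illustration, a standard form of the Smagorinsky eddy viscosity (coefficient conventions vary between implementations, so this is indicative rather than TUFLOW’s exact formulation) makes the cell size dependence explicit:
:<math>\nu_t = (C_s \Delta)^2 \sqrt{2 S_{ij} S_{ij}}</math>
where <math>\Delta</math> is the filter length scale, proportional to the square root of the cell area. The eddy viscosity therefore scales with the cell area and vanishes as the cell size tends to zero.<br>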
TUFLOW, many years ago, changed from purely Constant or purely Smagorinsky to Smagorinsky plus (a small amount of) Constant. This improved the absorption of eddies into the streamlines behind a bluff body and helped, to varying degrees, the modelling at finer cell sizes. This worked well at the time; however, the new Wu turbulence scheme has now been implemented to improve the situation further.<br>
The Smagorinsky/Constant turbulence combination has served the industry well and can continue to be used where the cell sizes are not significantly smaller than the depth where highly transient flows are occurring. If the model is well calibrated (using conventional parameters), continuing to use the Smagorinsky/Constant turbulence option is fine. Therefore, it is not considered that TUFLOW (or other good 2D solvers) has been producing questionable results, but that an improved turbulence representation is needed for 2D schemes with fine-scale cells. This is especially the case for the new Quadtree mesh option and for flexible meshes that utilise fine-scale cells.<br>
<br>
 
= Why is my HPC model unstable? Why am I getting a timestep error? Why are my control numbers so low? =
If there are no wet cells at the beginning of the simulation, the adaptive timestep can become quite large. Once the flow increases rapidly, instabilities can develop, which lead to oscillations in variables that grow over time, eventually producing NaNs in the solution. Two situations can occur:
* The solver cannot reduce the large timestep to a sufficiently small timestep within the ten default trials; the simulation becomes unstable and the model is stopped with an error.
* The solver manages to repeat the timestep with a sufficiently reduced timestep and the simulation continues, however the control numbers are lowered in the process, resulting in smaller timesteps and longer run times.
The suggestions below can be implemented to eliminate the instability and/or the decrease in control numbers:
* Specify an initial water level for the whole model with the <font color="blue"><tt>Set IWL</tt></font> command, or locally with the <font color="blue"><tt>Read GIS IWL</tt></font> command. The wet cells can limit the adaptive timestep through the <u>[[HPC_Adaptive_Timestepping#HPC_2D_Timestep | Shallow Wave Celerity Number]]</u> and prevent the HPC solver from using big timesteps.
* Use the <font color="blue"><tt>Timestep Maximum</tt></font> command to cap the maximum timestep. A good starting value might be half the cell size in metres, e.g. if the cell size is 5m, set a Timestep Maximum of 2.5 seconds. The .hpc.tlf file can be checked to see if further refinement is needed.
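As a minimal sketch of both suggestions in a TCF (the values are placeholders to be adapted to the model):
<pre>
! Wet the whole model with an initial water level (placeholder value)
Set IWL == 0.5
! Cap the adaptive timestep at half of a 5m cell size
Timestep Maximum == 2.5
</pre>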
<br>
 
 
= Should I see 100% GPU utilisation when no other processes are running on the GPU? =
HPC still does a small amount of work on the CPU, such as the model initiation and the final step of data reduction for model volume, control numbers and stability checks. Frequent map outputs, particularly for large datasets, might also contribute to lower GPU utilisation, as the writing of outputs happens on the CPU. Even for a perfectly configured, 2D only model it isn't possible to see 100% GPU utilisation. If there are any 1D features in the model, the GPU utilisation will be lower still, as 1D is processed on the CPU only. A model with a 1D ESTRY connection can potentially be doing a lot of work on the CPU, perhaps as much as 90% CPU and 10% GPU. If the CPU hardware is not matched correctly with the GPU card, it can become a bottleneck for HPC GPU runs even with only a few 1D elements. We are investigating the possibility of parallelising 1D for future releases so it is able to run on the GPU.<br>
<br>
 
= Why can't TUFLOW Classic be parallelised like TUFLOW HPC? =
It is due to Classic's implicit solution using matrices, which means some steps in the calculations have dependencies within the numerical loops, so they cannot be parallelised, or are difficult to parallelise with any worthwhile benefit. We have started work on parallelising sections of the code, but the reduction in run times would not be as significant as for an explicit scheme. Explicit schemes (like TUFLOW GPU or TUFLOW FV) have no dependencies in their numerical loops, so no variable on the left hand side of an equation also appears on the right hand side (i.e. everything on the right hand side is from the previous timestep, except for values at the model’s boundaries).<br>
It is really important to understand that different schemes can have vastly different run times, and being parallelised does not necessarily mean that one scheme is faster than another:
*Implicit schemes like TUFLOW Classic use much bigger timesteps than explicit schemes, hence in a single core, like-for-like comparison they are faster, and often a lot faster, than explicit schemes.
*An explicit scheme that is parallelised will run a single simulation faster by around a factor of 5 on an 8 core machine – you will never get a speed-up of 8 on an 8 core machine, as there is an overhead in managing the computations across the cores.
*Users are often running two or more simulations at the same time, for example different events (100 year, 20 year, …), different durations, etc. In these situations, even if a scheme is parallelised, it is usually better, and sometimes much better, to run each simulation unparallelised on its own core. For example, if you have four simulations and four cores, definitely don’t run them parallelised; run all four at once unparallelised, as in the sketch below. If a fifth simulation is started, it will slow down the other simulations.<br>
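A minimal sketch of the four-simulations example as a hypothetical Windows batch file (the executable and .tcf names are placeholders; -b runs TUFLOW in batch mode):
<pre>
start "" TUFLOW.exe -b -nt1 event_100y.tcf
start "" TUFLOW.exe -b -nt1 event_020y.tcf
start "" TUFLOW.exe -b -nt1 event_010y.tcf
start "" TUFLOW.exe -b -nt1 event_005y.tcf
</pre>
Each run is limited to a single CPU thread with -nt1, so the four simulations each get their own core.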
<br>
 
= What is the difference between an implicit solution and an explicit solution? =
Numerical methods allow us to move from a continuous to a discrete model of the world, in which quantities are computed at a finite number of points in distance and time (a mesh or grid), thereby approximating the partial differential equations by a system of algebraic equations.<br>
The mathematical description of many physical processes involves differential equations. In the case of the shallow water equations, the rates of change of depth and velocity at a particular point in space and time are described as functions of the depth and velocity, and the spatial gradients of the depth and velocity, at that location. The solution evolves according to its current state. The depth and velocity at some small time increment into the future can be estimated from the current values and their computed rates of change, but here is the problem: the rates of change keep changing and are a function of the solution as it evolves. The small change in the solution that occurs during the time increment influences the derivatives, and therefore the change in the solution depends on itself. The bigger the time increment, the stronger the dependence. There are two fundamentally different approaches to finding solutions for problems where the rates of change (in time or space) of a set of variables are a complex function of themselves.<br>
The first (explicit) approach is to estimate the future state based solely on the current state and the current rate of change. The simplest method is known as the “forward Euler method”, which is easy to understand and implement, but often unstable for systems with little or no energy loss mechanism, and even when it is stable the solution is usually quite sensitive to the size of the timestep. Higher-order methods break the time increment down into sub-steps in order to account for the “rate of change of the rates of change”. Such methods include the second order Euler and the fourth order Runge-Kutta method (which is what TUFLOW HPC uses). Explicit methods attempt to capture all of the physics that the equations admit, and will often become unstable if the timestep is not sufficiently small to capture the physics accurately. Hence explicit schemes for fluid flow problems have to adjust the timestep according to flow velocity (Courant number), wave speed (Celerity number) and diffusion speed (Peclet number).<br>
The second (implicit) approach is to estimate the future state based on the current state and the future rate of change. There are two common approaches to this “circular reference” problem. One is to reformulate the equations as a matrix problem, and the other is to use an iterative approach whereby the future state is repeatedly updated after successive calculations of the rate of change at the future time – also known as a backward implicit scheme. Both approaches can be very stable and enable larger timesteps, but the solution is permitted to “skip over” physics that happens on timescales smaller than the time increment. The timestep must still be appropriate based on Courant, Celerity and Peclet numbers, but due to the iterative nature of the solution the timestep can often be 10-20 times larger than that required for an explicit solution.<br>
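A worked illustration (a simple stand-in equation, not the shallow water equations): for the decay equation <math>dy/dt = -ky</math>, the explicit forward Euler update is
:<math>y^{n+1} = y^n + \Delta t \, (-k y^n) = (1 - k \Delta t) \, y^n,</math>
which becomes unstable (oscillates with growing amplitude) if <math>\Delta t > 2/k</math>. The implicit backward Euler update is
:<math>y^{n+1} = y^n + \Delta t \, (-k y^{n+1}) \quad \Rightarrow \quad y^{n+1} = \frac{y^n}{1 + k \Delta t},</math>
which is stable for any timestep, although a large timestep will smooth over the fast dynamics.<br>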
<br>

= What are the two times in my _TUFLOW Simulations.log? =
There are three pieces of information in this section of the file:<br>
<ol>[[File:TUFLOW simulations log.png]]</ol>
*In green is the 'clock time', i.e. the time that has elapsed on a clock: the time the simulation finished minus the start time.
*In blue is the compute hardware the engine ran on.
*In red is the 'total processor time' that has been used for the simulation, including simulation start up. If multiple cores or GPUs have been used, it is the total time across all cores or GPUs. For example, if 12 cores all compute for 10 minutes, the compute time is 12 x 10 min = 2 hours, despite only 10 minutes elapsing in clock time. If the simulation is set to use only a single core or GPU, then this should be less than the clock time.
 
<br>
<br>
 
= What is the difference between a finite difference scheme and a finite volume scheme? =
A finite difference scheme considers the solution data at discrete points (or nodes) in space and attempts to compute the time derivatives based on the solution data and its spatial derivatives evaluated at those discrete points. While this approach can be relatively simple to implement, the solution is often non-conservative – i.e. the total sum of a certain property (that should be conserved) over all points in the model might increase or decrease slightly from one timestep to the next, even in the absence of internal sources or boundary fluxes.<br>
A finite volume scheme uses a mesh that defines interconnected volumes (or cells). The solution data for each cell represents the volume integral (or average) of a conserved property (e.g. mass and momentum) over that cell. The fluxes of the conserved quantities across cell faces are computed, and the time derivatives for each cell are computed according to the total sum of inflows and outflows. The solution scheme is a little more involved to implement, but is guaranteed to be conservative – the model-wide integral of conserved properties remains constant save for internal sources and boundary fluxes.<br>
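A sketch of why the finite volume approach conserves by construction (a generic 1D update, not TUFLOW’s specific discretisation): each cell is updated from the fluxes through its faces,
:<math>U_i^{n+1} = U_i^{n} - \frac{\Delta t}{\Delta x} \left( F_{i+1/2} - F_{i-1/2} \right),</math>
and when summed over all cells the interior fluxes cancel in pairs, so the total <math>\textstyle\sum_i U_i</math> changes only through the boundary fluxes.<br>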
TUFLOW Classic is an implicit finite difference scheme. This means that it can use larger timesteps, but it can miss short time-scale physics and it is non-conservative. The exact scheme used (Stelling and Syme) becomes reasonably conservative when the timestep is appropriate and the number of convergence iterations is sufficient. However, as the scheme utilises a matrix solution, it requires a particular cell ordering for computations, and this makes it very difficult to parallelise. This is why TUFLOW Classic remains a single CPU-core process.<br>
TUFLOW HPC utilises an explicit finite volume scheme. This means that it has to use smaller timesteps, and it is guaranteed to capture the shortest time-scale physics that the given spatial resolution admits. The solution conserves mass and momentum to numerical precision. The scheme is not as computationally efficient as the implicit finite difference scheme of TUFLOW Classic – if forced to execute on a single CPU core, it is many times slower than Classic. However, as the cell-by-cell computations of fluxes and derivatives are completely independent, the scheme is well suited to highly parallelised compute hardware such as modern GPUs. The end result is that with a good GPU, TUFLOW HPC can be up to 100 times faster than Classic for some models.<br>
<br>

{{Tips Navigation
|uplink=[[ HPC_Modelling_Guidance | Back to HPC Modelling Guidance]]
}}