HPC FAQ
'''Over utilisation of CPU threads/cores'''<br>
Trying to run multiple HPC simulations across the same CPU threads. If, for example, you have 4 CPU threads on your computer and you run two simulations that both request 4 threads, then effectively you are overloading the CPU hardware by requesting 8 threads in total. This will slow down the simulations by more than a factor of 2. The most efficient approach in this case is to run both simulations using 2 threads each, noting that any other CPU intensive tasks you are running also need to be considered.<br>
By default, from the 2020-01 release onwards, the number of CPU threads used is four (4); previously, the default was two (2). You can control the number of threads requested either with the <font color="blue"><tt>-nt<number_threads></tt></font> run time option, e.g. -nt2, or with the TCF command <font color="blue"><tt>CPU Threads</tt></font>. The -nt run time option overrides <font color="blue"><tt>CPU Threads</tt></font>.<br>
Note: If Windows hyperthreading is active, there will typically be two threads for each physical core. For computationally intensive processes such as TUFLOW, it is recommended that hyperthreading is deactivated so there is one thread for each core.<br>
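For example, to limit a simulation to two CPU threads, either add the following command to the TCF (a minimal sketch; adjust the thread count to suit your hardware and any other simulations running):<br>
<font color="blue"><tt>CPU Threads</tt></font> <font color="red"><tt>==</tt></font><tt> 2</tt> <br>
or append the run time option to the simulation command line, for example (the executable and .tcf names are illustrative only and may differ for your installation):<br>
<tt>TUFLOW_iSP_w64.exe -b -nt2 my_model.tcf</tt><br>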
= Why is my model using post-2020 HPC slower than pre-2020 HPC? =
From the 2020 release onwards, TUFLOW includes a number of new features (Quadtree, Sub-Grid Sampling, Wu turbulence model) which make the solution scheme more computationally complex. As such, 2020 and newer releases of TUFLOW run on average about 20% slower than their predecessors, even when the new features are not used. This applies to models with an unchanged cell size; using Quadtree and SGS can warrant changing to a larger cell size in parts of the model, decreasing run times by far more than 20%.<br>
For comparison, the pre-2020 default turbulence settings can be specified with:<br>
<font color="blue"><tt>Viscosity Formulation </tt></font> <font color="red"><tt>==</tt></font><tt> Smagorinsky</tt> <br>
<font color="blue"><tt>Viscosity Coefficients</tt></font> <font color="red"><tt>==</tt></font><tt> 0.5, 0.05</tt> <br>
Not all HPC models will show a run time increase greater than 20% when changing from the Smagorinsky formulation to the Wu turbulence scheme.<br>
Despite the possible increase in run time for some models, the Wu turbulence scheme is warranted, particularly as cell sizes are typically getting smaller.<br>
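If the new default scheme is to be specified explicitly, a minimal sketch is below, assuming the Wu scheme is selected via the same <font color="blue"><tt>Viscosity Formulation</tt></font> command with its default coefficients; confirm the current command syntax and defaults against the <u>[https://docs.tuflow.com/classic-hpc/manual/latest/ TUFLOW Manual]</u>:<br>
<font color="blue"><tt>Viscosity Formulation</tt></font> <font color="red"><tt>==</tt></font><tt> Wu</tt> <br>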
<br>
= Is it possible to run two TUFLOW HPC simulations using a single GPU and how does this affect performance? =
Yes, it is possible to run two (or more) simulations on a single GPU. The performance depends on whether each model needs more than half of the GPU’s resources. At worst, the models complete just as fast as if they ran one after the other. At best, the GPU has enough resources to run both side by side and complete them as if they ran on two GPUs. Realistically, a model may use a bit more than half of the resources, causing some slowdown. A model run’s startup phase can require 4-6 times the RAM of the model calculations, so the available memory needs to be sufficient to handle two (or more) models starting up side by side. If it is not, the models' startup phases can be staggered, as sketched below.<br>
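A minimal Windows batch file sketch of staggering two simulations on one machine is shown below; the executable name, .tcf file names and the 10 minute delay are illustrative assumptions only, and the <tt>-b</tt> run time option runs the simulations without waiting for user input:<br>
<pre>
:: Hypothetical batch file: launch the first simulation, wait for its
:: memory-hungry startup phase to pass, then launch the second.
start "" TUFLOW_iSP_w64.exe -b model_event01.tcf
timeout /t 600
start "" TUFLOW_iSP_w64.exe -b model_event02.tcf
</pre>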
<br>
= With the Wu turbulence scheme being the new default, are old models using Smagorinsky wrong? =
Turbulence is pronounced in areas of highly transient flow, e.g. high velocities, bends, ledges and flow contraction/expansion. Where the flow is more benign and/or bed roughness is high, turbulence is not so important, as it only applies where there are strong spatial velocity gradients; for example, for uniform flow in a straight rectangular channel the turbulence term is zero as there is no spatial velocity gradient.<br>
The problem with Smagorinsky, which is a large scale eddy turbulence model originally developed for coastal modelling, is that it is cell size dependent (the eddy viscosity is proportional to the cell surface area) and tends to zero as the cell size tends to zero. This has historically not been a major issue, as cell sizes have typically been greater than the depth. The general recommendation in the <u>[https://docs.tuflow.com/classic-hpc/manual/latest/ TUFLOW Manual]</u> is to be careful of using cell sizes significantly smaller than the depth (see Section 1.4). However, as cells have become finer and finer with the advent of GPU models this issue has increasingly emerged, and it is particularly pertinent if a quadtree or flexible mesh is used with cells that are very small relative to the depth. If this is the case, larger differences will be present for larger events, where the water levels and velocities are higher.<br>
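For reference, the classical Smagorinsky eddy viscosity has the form (a generic large eddy form rather than TUFLOW's exact implementation):<br>
<math>\nu_t = \left(C_s \, \Delta\right)^2 \left|\bar{S}\right|</math><br>
where <math>C_s</math> is the Smagorinsky coefficient, <math>\Delta</math> the filter length scale (in practice the cell size) and <math>\left|\bar{S}\right|</math> the magnitude of the resolved strain rate. Since <math>\nu_t \propto \Delta^2</math> (the cell surface area), the eddy viscosity vanishes as the cell size tends to zero, which is the cell size dependence described above.<br>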
TUFLOW, many years ago, changed from purely Constant or purely Smagorinsky to Smagorinsky plus (a small amount of) Constant. This improved the absorption of eddies into the streamlines behind a bluff body and helped, to varying degrees, with modelling at finer cell sizes. This worked well for the time being; however, a new cell size independent turbulence scheme has now been implemented to improve the situation further.<br>
The Smagorinsky/Constant turbulence combination has served the industry well and can continue to be used where cell sizes are not significantly smaller than the depth in areas of highly transient flow. If the model is well calibrated (using conventional parameters), continuing to use the Smagorinsky/Constant turbulence option is fine. Therefore, it is not considered that TUFLOW (or other good 2D solvers) have been producing questionable results, but that an improved turbulence representation is needed for 2D schemes with fine-scale cells. This is especially the case for the new Quadtree mesh option and for flexible meshes that utilise fine-scale cells.<br>
With the Wu turbulence scheme, the same viscosity parameter(s) can be applied across a wide range of scales from flume tests to large rivers.<br>
<br>
= Why is my HPC model getting unstable? =
If there are no wet cells at the beginning of the simulation, the adaptive timestep can become quite large. Once the flow increases rapidly, instabilities can develop, which lead to oscillations in variables that grow over time, eventually producing NaNs in the solution. Two situations can occur:
* The solver cannot reduce the large timestep to a sufficiently small value within the default ten trials; the simulation becomes unstable and the model is stopped with an error.
* The solver manages to reduce the timestep sufficiently and the simulation recovers, however:
** Warning 2550 - instability timestep corrections might be recorded for some or all of the cells in the model at the end of the simulation.
** If the wave celerity or Courant control number exceeds its target by up to 20%, or the diffusion number by up to 10%, the step is still accepted, but the next timestep is factored down by the same percentage.
** If there are NaNs, or a control number exceeds its target by more than 20% (or 10% for the diffusion number), the step is rejected, the timestep is halved, a repeat step is performed and the control number is cut to 90% of its previous value. The control number is then only allowed to increase by 0.001% per timestep, so it takes around 10,000 timesteps to creep back up to the original value if there are no other reduced timesteps, keeping the model running slower than it could and preventing it from reaching the usual control number limits for a while.
The suggestions below can be implemented to eliminate the instability and/or the reduction in control numbers:
* Specify an initial water level for the whole model with the <font color="blue"><tt>Set IWL</tt></font> command, or locally with the <font color="blue"><tt>Read GIS IWL</tt></font> command. The wet cells can limit the adaptive timestep through the <u>[[HPC_Adaptive_Timestepping#HPC_2D_Timestep | Shallow Wave Celerity Number]]</u>, and prevent the HPC solver from using excessively large timesteps.
* Use the <font color="blue"><tt>Timestep Maximum</tt></font> command to cap the maximum timestep so it does not get too high. A good starting value is half the cell size in metres, e.g. if the cell size is 5m, set Timestep Maximum to 2.5 seconds (see the sketch below). The .hpc.tlf file can be checked to determine whether further refinement is needed.
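A minimal TCF sketch of these commands is below; the values shown are illustrative for a 5m cell size model and should be adjusted to suit (the initial water level of 0.0 is a hypothetical example):<br>
<font color="blue"><tt>Set IWL</tt></font> <font color="red"><tt>==</tt></font><tt> 0.0</tt> <tt>! initial water level in model elevation units; alternatively use Read GIS IWL locally</tt><br>
<font color="blue"><tt>Timestep Maximum</tt></font> <font color="red"><tt>==</tt></font><tt> 2.5</tt> <tt>! seconds, approximately half the 5m cell size</tt><br>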
<br>
= Should I see 100% GPU utilisation when no other processes are running on GPU? =
HPC still does a small amount of work on the CPU, such as the model initiation and the final step of data reduction for model volume, control numbers and stability checks. Frequent map outputs, particularly for large datasets, can also contribute to lower GPU utilisation, as the writing of outputs happens on the CPU. Even in a perfect world with a 2D only model it is not possible to see 100% GPU utilisation. If there are any 1D features in the model, the GPU utilisation will be even lower, as 1D is processed on the CPU only. A model with a 1D ESTRY connection can potentially be doing a lot of work on the CPU, perhaps as much as 90% CPU and 10% GPU. If the CPU hardware is not matched correctly with the GPU card, it can become a bottleneck for HPC.<br>
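GPU utilisation can be monitored while the model is running, for example with NVIDIA's <tt>nvidia-smi</tt> utility that ships with the GPU driver, e.g. <tt>nvidia-smi -l 5</tt> to refresh the readout every 5 seconds (the refresh interval is arbitrary).<br>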
<br>
=Why can't TUFLOW Classic be parallelised like TUFLOW HPC?=
This is due to Classic's implicit solution using matrices, which means some steps in the calculations have dependencies within the numerical loops and so cannot be parallelised, or are difficult to parallelise with any worthwhile benefit. Work has started on parallelising sections of the code, but the reduction in run times would not be as significant as for an explicit scheme. Explicit schemes (like TUFLOW GPU or FV) have no dependencies in their numerical loops: no variable on the right hand side of the update equation appears on the left (i.e. everything on the right hand side is from the previous timestep, except for values at the model’s boundaries).<br>
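As a simple illustration (a generic 1D sketch rather than TUFLOW's actual equations), an explicit update has the form:<br>
<math>u_i^{n+1} = u_i^{n} + \Delta t \, F\left(u_{i-1}^{n}, u_i^{n}, u_{i+1}^{n}\right)</math><br>
so every cell's new value depends only on known values from timestep <math>n</math> and all cells can be updated independently in parallel, whereas an implicit scheme requires solving a coupled matrix system such as <math>A \, u^{n+1} = b\left(u^{n}\right)</math> every timestep, where the unknowns at the new time level depend on each other.<br>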
It is really important to understand that different schemes can have vastly different run times, and that being parallelised does not necessarily mean that one scheme is faster than another:
*Implicit schemes like TUFLOW Classic use much larger timesteps than explicit schemes, hence in a single core, like-for-like comparison they are faster, and often a lot faster, than explicit schemes.
*An explicit scheme that is parallelised will run a single simulation faster by around a factor of 5 on an 8 core machine; you will never get a speed-up of 8 on an 8 core machine, as there is an overhead in managing the computations across the cores.
*Users often run two or more simulations at the same time, for example different events (100 year, 20 year, etc.) and different durations. In these situations, even if a scheme is parallelised, it is usually better, and sometimes much better, to run each simulation unparallelised on its own core. For example, if you have four simulations and four cores, definitely don't run them parallelised, but run all four at once unparallelised. If a fifth simulation is started up, this will then slow down the other simulations.<br>
<br>
=What are the two times in my _TUFLOW Simulations.log?=
There are three pieces of information in this section of the file:<br>
<ol>[[File:TUFLOW simulations log.png]]</ol>
*In green is the 'clock time', i.e. the time that has elapsed on a clock: the time the simulation finished minus the time it started.
*In blue is the engine compute hardware.
*In red is the 'total processor time' that has been used for the simulation, including the simulation start up. If multiple cores or GPUs have been used, it is the total time across all cores or GPUs. For example, if 12 cores all compute for 10 minutes, the compute time is 12 x 10 min = 2 hours, despite only 10 minutes elapsing in clock time. If the simulation is set to use only a single core or GPU then this should be less than the clock time.
<br>
<br>
{{Tips Navigation
|uplink=[[ HPC_Modelling_Guidance | Back to HPC Modelling Guidance]]
}}