HPC FAQ
=Will TUFLOW HPC and TUFLOW Classic results match?=
No. There are a number of reasons for the differences:
* Solution Scheme: TUFLOW Classic uses a 2nd order ADI (Alternating Direction Implicit) finite difference solution of the 2D SWE, while the HPC solver uses a 2nd order explicit finite volume TVD (Total Variation Diminishing) solution (a 1st order HPC solution is also available, however 2nd order HPC is preferred for higher accuracy). As there is no exact solution of the equations (hence all the different solvers!), the two schemes produce different results. However, in 2nd order mode (the default) the two schemes are generally consistent, with testing thus far indicating that Classic and HPC 2nd order produce peak level differences usually within a few percent of the depth in the primary conveyance flow paths. Due to the different numerical approaches, greater differences can occur in areas adjoining the main flow paths and around the edge of the inundation extent, where floodwaters are still rising or are sensitive to a minor rise in main flow path levels, and where upstream controlled weir flow occurs across thick or wide embankments. <br>
* HQ boundary treatment: Significant differences may occur at 2D HQ boundaries for models using TUFLOW release 2020-01-AB and earlier. Classic treats the 2D HQ boundary as one HQ boundary across the whole HQ line, setting a water level based on the total flow across the line. Due to model splitting to parallelise the 2D domain across CPU or GPU cores, HPC applies the HQ boundary slope to each individual cell along the boundary. As of the 2020-10-AA release, the HPC default for HQ boundaries is similar to Classic, treating the 2D HQ boundary as one. Nevertheless, as with all HQ boundaries, the effect of the boundary should be well away from the area of interest, and sensitivity testing carried out to demonstrate this.<br>
* Turbulence scheme: As of the 2020 release, TUFLOW HPC uses the cell size insensitive Wu turbulence scheme, as opposed to the Smagorinsky turbulence scheme used by Classic and earlier HPC releases. It is acceptable for well calibrated models to revert back to the Smagorinsky turbulence scheme if required.
* Timestepping: HPC uses adaptive timestepping to maintain stability, whereas Classic uses a fixed timestep.
* If using 1st order HPC: For deep, fast flowing waterways, 1st order HPC tends to produce higher water levels and steeper gradients compared with the Classic and HPC 2nd order solutions. These differences can exceed 10% of the primary flow path depth. Typically, lower Manning's n values are required for HPC 1st order (or the original TUFLOW GPU) to achieve a similar result to TUFLOW Classic or HPC 2nd order. <br>
<br>
'''Over utilisation of CPU threads/cores'''<br>
This occurs when multiple HPC simulations are run across the same CPU threads. If, for example, you have 4 CPU threads on your computer and you run two simulations that both request 4 threads, you are effectively overloading the CPU hardware by requesting 8 threads in total. This will slow down the simulations by more than a factor of 2. The most efficient approach in this case is to run both simulations using 2 threads each, noting that any other CPU intensive tasks you are performing also need to be considered.<br>
By default, from the 2020-01 release onwards the number of CPU threads used is four (4); previously, the default was two (2). You can control the number of threads requested either by using the <font color="blue"><tt>-nt<number_threads></tt></font> run time option, e.g. -nt2, or by using the TCF command <font color="blue"><tt>CPU Threads</tt></font>. The -nt run time option overrides <font color="blue"><tt>CPU Threads</tt></font> (an illustrative example is given below).<br>
Note: If Windows hyperthreading is active there typically will be two threads for each physical core. For computationally intensive processes such as TUFLOW, it is recommended that hyperthreading is deactivated so there is one thread for each core.<br>
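For example, the batch file lines below are a minimal sketch of running two simulations with two CPU threads each via the -nt run time option (the executable name and .tcf file names are placeholders and will differ for your setup):<br>
<tt>start "" TUFLOW_iSP_w64.exe -b -nt2 Model_A_001.tcf</tt><br>
<tt>start "" TUFLOW_iSP_w64.exe -b -nt2 Model_B_001.tcf</tt><br>
Alternatively, the same limit can be set in each TCF:<br>
<font color="blue"><tt>CPU Threads</tt></font> <font color="red"><tt>==</tt></font><tt> 2</tt><br>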
= How much faster is TUFLOW HPC compared to Classic? =
This largely depends on the hardware (CPU and GPU) used to run the models and its performance. On average, HPC using GPU hardware runs about 10 to 20 times faster than Classic and about 30 to 40 times faster than HPC using the default number of CPU threads. Even though HPC on CPU hardware is, with default settings, slower than Classic, more CPU threads can be used to achieve faster run times. As TUFLOW Classic is not parallelised, it can only run on one CPU thread and its runtime cannot be improved with more CPU resources.<br>
For further information and discussion see: <u>[[Hardware_Benchmarking_Topic_HPC_on_CPU_vs_GPU | Hardware Benchmarking Topic HPC on CPU vs GPU]]</u>.<br>
<br>
<br>
= Why is my model using post 2020 HPC slower than with previous releases? =
From the 2020 release TUFLOW includes a number of new features (Quadtree, Sub-Grid Sampling, Wu turbulence model) which make the solution scheme more computationally complex. As such, the 2020 and newer releases of TUFLOW run on average about 20% slower than their predecessors, even when the new features are not used. This applies to models with an unchanged cell size; using Quadtree and SGS can warrant changing to a bigger cell size in some parts of the model and decrease runtimes by far more than the 20%.<br>
A well calibrated model can revert back to the Smagorinsky turbulence scheme if required using the following TCF commands:<br>
<font color="blue"><tt>Viscosity Formulation </tt></font> <font color="red"><tt>==</tt></font><tt> Smagorinsky</tt> <br>
<font color="blue"><tt>Viscosity Coefficients</tt></font> <font color="red"><tt>==</tt></font><tt> 0.5, 0.05</tt> <br>
Not all HPC models will show an increase in run time bigger than 20% when moving to the 2020 or later releases; for some models, however, the Wu turbulence scheme can increase run times further, hence the option above to revert to Smagorinsky for well calibrated models.<br>
Despite the possible increase in runtime for some models, the Wu turbulence scheme is warranted particularly as cell sizes are typically getting smaller.<br>
<br>
= Is it possible to run two TUFLOW HPC simulations using a single GPU and how does this affect performance? =
Yes, it is possible to run two (or more) simulations on a single GPU. The performance depends on whether each model needs more than half of the GPU’s resources. At worst, the models complete just as fast as if they had run one after the other. At best, the GPU has enough resources to run both side by side and complete them as if they had run on two GPUs. Realistically, a model may use a bit more than half of the resources, causing some slowdown. A model run’s startup phase can require 4-6 times more RAM than the model calculations, so the available memory needs to be sufficient to handle two (or more) models starting up side by side. If it is not sufficient, the models' startup phases can be staggered (see the sketch below).<br>
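As an illustrative sketch only (the executable name, .tcf file names and delay are placeholders), a Windows batch file can stagger the startup phases:<br>
<tt>start "" TUFLOW_iSP_w64.exe -b Model_A_001.tcf</tt><br>
<tt>timeout /t 300</tt><br>
<tt>start "" TUFLOW_iSP_w64.exe -b Model_B_001.tcf</tt><br>
The <tt>timeout</tt> delay (here 300 seconds) simply gives the first model time to finish its memory hungry startup phase before the second model is launched; adjust it to suit the model size.<br>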
<br>
= With the Wu turbulence scheme being the new default, are old models using Smagorinsky wrong? =
Turbulence is pronounced in areas of highly transient flow, e.g. in the wake zones behind structures, bluff bodies and other obstructions. <br>
The problem with Smagorinsky is that the calculated eddy viscosity scales with the cell size, so as cells become small relative to the water depth the viscosity tends towards zero and the turbulence is under-represented. <br>
TUFLOW, many years ago, changed from purely Constant or purely Smagorinsky to Smagorinsky plus (a small amount of) Constant. This improved the absorption of eddies into the streamlines behind a bluff body and helped, to varying degrees, the modelling at finer cell sizes. This worked well at the time, however a new cell size insensitive turbulence scheme has now been implemented to improve the situation further. <br>
The Smagorinsky/Constant turbulence combination has served the industry well and can continue to be used where the cell sizes are not significantly smaller than the depth where highly transient flows are occurring. If the model is well calibrated (using conventional parameters), continuing to use the Smagorinsky/Constant turbulence option is fine. Therefore, it is not considered that TUFLOW (or other good 2D solvers) have been producing questionable results, but that an improved turbulence representation is needed for 2D schemes with fine-scale cells. This is especially the case for models with cell sizes smaller than the water depth in areas of highly transient flow. <br>
<br>
= Why is my HPC model getting unstable? =
If there are no wet cells at the beginning of the simulation, the adaptive timestep can get quite big. When water first enters the model, the solver has to rapidly reduce the timestep and one of two things can happen:
* If the solver can't reduce the big timestep to a sufficiently small timestep within the ten default trials, the simulation becomes unstable.
* If the solver is able to reduce the timestep to a sufficiently small timestep within the ten default trials, the simulation continues running. This, however, comes at the price of slower run times:
** Warning 2550 - instability timestep corrections might be recorded for some or all of the cells in the model at the end of the simulation.
** If a wave celerity or courant control number exceeds its target by up to 20%, or the diffusion number by up to 10%, the step is still accepted, but the next timestep is factored down by the same percentage.
** If there are NaNs, or a control number exceeds its target by more than the above percentages, the step is rejected, the timestep is cut in half, a repeat step is performed and the control number is cut to 90% of its previous value. The control number is then allowed to increase by only 0.001% per timestep, so it will take around 10,000 timesteps to creep back up to the original value if there are no other reduced timesteps, keeping the model running slower than it could.
The below suggestions can be implemented to eliminate the instability and/or the decrease in control numbers:
* Specify an initial water level for the whole model with the <font color="blue"><tt>Set IWL</tt></font> TCF command, so the model does not start completely dry.
* Use <font color="blue"><tt>Timestep Maximum</tt></font> to cap how large the adaptive timestep can become (see the illustrative commands below).<br>
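As a minimal sketch only (the values are placeholders and must be chosen to suit the model, e.g. based on the downstream water level and the cell size), the two commands could look like:<br>
<font color="blue"><tt>Set IWL</tt></font> <font color="red"><tt>==</tt></font><tt> 20.5</tt> <tt>! illustrative initial water level in the model's elevation units</tt> <br>
<font color="blue"><tt>Timestep Maximum</tt></font> <font color="red"><tt>==</tt></font><tt> 5</tt> <tt>! illustrative cap on the adaptive timestep in seconds</tt> <br>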
<br>
= Should I expect zero mass error for TUFLOW HPC models? =
The HPC solver uses a finite volume scheme which is volume conservative and shouldn't produce any mass error for 2D only models. Mass error can still occur when coupling HPC with the ESTRY engine, which isn't volume conservative. The cause could be 1D structures, 1D/2D linking or 1D/2D timestep synchronisation. Where there is not a one to one synchronisation of the 1D and 2D timesteps, mass error may occur due to the interpolation of the 1D/2D boundary values over time. A 'healthy' model will usually report up to 1% mass error. Higher mass error is an indication the solution is not converging, usually in isolated locations. In nearly all cases it is because of poor data (e.g. a cliff edge between datasets) or model schematisation (e.g. a large pit connected to one tiny 2D cell, or a boundary poorly digitised so it is not roughly perpendicular to the flow). The total model mass error can also be observed in the TUFLOW Summary File (*.tsf) output, as opposed to the .tlf file.<br>
<br>
= My HPC model uses double precision due to an unstable 1D channel. How can I make it stable in single precision? =
Currently, it is not possible to run the 1D ESTRY engine in double precision and the 2D TUFLOW engine in single precision for the same model. There are two ways to reduce the mass balance errors and switch to single precision:
* Improve the stability of the 1D features and 1D/2D links.
* Remove the 1D channel completely and use a Quadtree mesh to sufficiently refine the resolution of the creek in 2D.
<br>
= Should I see 100% GPU utilisation when no other processes are running on GPU? =
HPC still does a small amount of work on the CPU, such as model initiation and the final step of data reduction for model volume, control numbers and stability checks. Frequent map outputs, particularly for large models, can also contribute to lower GPU utilisation, as the writing of outputs happens on the CPU. So even in a perfect world with a 2D only model it isn't possible to see 100% GPU utilisation. If there are any 1D features in the model the GPU utilisation will be lower again, as the 1D is processed on the CPU only. A model with a 1D ESTRY connection can potentially be doing a lot of work on the CPU, perhaps as much as 90% CPU and 10% GPU. If the CPU hardware is not matched correctly with the GPU card, it can become a bottleneck for HPC GPU runs even with only a few 1D elements.<br>
<br>
=Why can't TUFLOW Classic be parallelised like TUFLOW HPC?=
It is due to its implicit solution using matrices, which means some steps in the calculations have dependencies within the numerical loops, so they cannot be parallelised, or are difficult to parallelise with any worthwhile benefit. We have started work on parallelising sections of the code, but the reduction in run times would not be as significant as for an explicit scheme. Explicit schemes (like TUFLOW GPU or FV) have no dependencies in their numerical loops, so no variable on the left hand side of an equation appears on the right hand side (i.e. everything on the right hand side is from the previous timestep, except for values at the model’s boundaries).<br>
It is really important to understand that different schemes can have vastly different run times and being parallelised does not necessarily mean that one scheme is faster than another:
*Implicit schemes like TUFLOW Classic use much bigger timesteps than explicit schemes, hence in a single core, like-for-like comparison they are faster, and often a lot faster, than explicit schemes.
*An explicit scheme that is parallelised will run a single simulation faster by around a factor of 5 on an 8 core machine; you will never get a speed-up of 8 on an 8 core machine as there is an overhead in managing the computations across the cores.
*Users often run two or more simulations at the same time, for example different events (100 year, 20 year, …), different durations, etc. In these situations, even if a scheme is parallelised, it is usually better, and sometimes much better, to run each simulation unparallelised on its own core. For example, if you have four simulations and four cores, definitely don't run them parallelised, but run all four at once unparallelised. If a fifth simulation is started it will slow down the other simulations.<br>
<br>
=What are the two times in my _TUFLOW Simulations.log?=
There are three pieces of information in this section of the file:<br>
<ol>[[File:TUFLOW simulations log.png]]</ol>
*In green is the 'clock time', i.e. the time that has elapsed on a clock. It is the time the simulation has finished minus the start time.
*In blue is the engine compute hardware.
*In red is the 'total processor time' that has been used for the simulation, including simulation start up. If multiple cores or GPUs have been used it is the total time across each core or GPU. For example, if 12 cores all compute for 10 minutes, the compute time is 12x10 min = 2 hours, despite only 10 minutes elapsing in clock time. If the simulation is set to use only a single core or GPU then this should be less than the clock time.
<br>
<br>
{{Tips Navigation
|uplink=[[ HPC_Modelling_Guidance | Back to HPC Modelling Guidance]]
}}