HPC FAQ: Difference between revisions
| Chris Huxley (talk | contribs) | |||
| (80 intermediate revisions by 5 users not shown) | |||
| Line 1: | Line 1: | ||
| =Will TUFLOW HPC and TUFLOW Classic results match?= | =Will TUFLOW HPC and TUFLOW Classic results match?= | ||
| No. There are a number of reasons for the differences: | |||
| No.  TUFLOW Classic uses a 2nd order ADI (Alternating Direction Implicit) finite difference solution of the 2D SWE, while the HPC solver uses a 2nd order explicit finite volume TVD (Total Variation Diminishing) solution (a 1st order HPC solution is also available).  As there is no exact solution of the equations (hence all the different solvers!), the two schemes produce different results.  <br> | |||
| * Solution Scheme: TUFLOW Classic uses a 2nd order ADI (Alternating Direction Implicit) finite difference solution of the 2D SWE, while the HPC solver uses a 2nd order explicit finite volume TVD (Total Variation Diminishing) solution (a 1st order HPC solution is also available, however 2nd order HPC is preferred for higher accuracy).  As there is no exact solution of the equations (hence all the different solvers!), the two schemes produce different results. However, in 2nd order mode (default) the two schemes are generally consistent with testing thus far indicating Classic and HPC 2nd order produce peak level differences usually within a few percentage points of the depth in the primary conveyance flow paths.  Greater differences can occur in areas adjoining the main flow paths and around the edge of the inundation extent where floodwaters are still rising or are sensitive to a minor rise in main flow path levels, or where upstream controlled weir flow across thick or wide embankments occurs due to the different numerical approaches.  <br> | |||
| * HQ boundary treatment: Significant differences may occur at 2D HQ boundaries for models using TUFLOW release 2020-01-AB and earlier.  Classic treats the 2D HQ boundary as one HQ boundary across the whole HQ line, setting a water level based on the total flow across the line.  Due to model splitting to parallelise the 2D domain across CPU or GPU cores, HPC in those releases applies the HQ boundary slope to each individual cell along the boundary. As of the 2020-10-AA release, the new HPC default for HQ boundaries is similar to Classic, treating the 2D HQ boundary as one. Nevertheless, as with all HQ boundaries, the effect of the boundary should be well away from the area of interest, and sensitivity testing should be carried out to demonstrate this.<br> | |||
| * Turbulence scheme: As of the 2020 release, TUFLOW HPC uses the cell size insensitive Wu turbulence scheme, as opposed to the Smagorinsky turbulence scheme used by Classic and earlier HPC releases. It is acceptable for well calibrated models to revert to the Smagorinsky turbulence scheme if required. | |||
| * Timestepping: HPC uses adaptive timestepping to maintain stability, whereas Classic uses a fixed timestep. | |||
| * If using 1st order HPC: For deep, fast flowing waterways, 1st order HPC tends to produce higher water levels and steeper gradients compared with the Classic and HPC 2nd order solutions.  These differences can exceed 10% of the primary flow path depth. Typically, lower Manning’s n values are required for HPC 1st order (or the original TUFLOW GPU), to achieve a similar result to TUFLOW Classic or HPC 2nd order. <br> | |||
| <br> | <br> | ||
| However, in 2nd order mode the two schemes are generally consistent with testing thus far indicating Classic and HPC 2nd order produce peak level differences usually within a few percentage points of the depth in the primary conveyance flow paths.  Greater differences can occur in areas adjoining the main flow paths and around the edge of the inundation extent where floodwaters are still rising or are sensitive to a minor rise in main flow path levels, or where upstream controlled weir flow across thick or wide embankments occurs due to the different numerical approaches.  <br> | |||
| <br> | |||
| For deep, fast flowing waterways, 1st order HPC tends to produce higher water levels and steeper gradients compared with the Classic and HPC 2nd order solutions.  These differences can exceed 10% of the primary flow path depth.   | |||
| Typically, lower Manning’s n values are required for HPC 1st order (or the original TUFLOW GPU), to achieve a similar result to TUFLOW Classic or HPC 2nd order. <br> | |||
| <br> | |||
| Significant differences may occur at 2D HQ boundaries.  Classic treats the 2D HQ boundary as one HQ boundary across the whole HQ line, setting a water level based on the total flow across the line.  Due to model splitting to parallelise the 2D domain across CPU or GPU cores, HPC applies the HQ boundary slope to each individual cell along the boundary.  As with all HQ boundaries, the effect of the boundary should be well away from the area of interest, and sensitivity testing carried out to demonstrate this. | |||
| =Is recalibration necessary if I switch from TUFLOW Classic to TUFLOW HPC?= | =Is recalibration necessary if I switch from TUFLOW Classic to TUFLOW HPC?= | ||
| Yes, if transitioning from Classic to HPC (or any other solver), it is best practice to compare the results, and if there are unacceptable differences, or the model calibration has deteriorated, to fine-tune the model performance through adjustment of key parameters. | Yes, if transitioning from Classic to HPC (or any other solver), it is best practice to compare the results, and if there are unacceptable differences, or the model calibration has deteriorated, to fine-tune the model performance through adjustment of key parameters.<br> | ||
| <br> | |||
| Typically, between TUFLOW Classic and HPC 2nd order this would only require a slight adjustment to Manning’s n values, any additional form losses at bends/obstructions or eddy viscosity values.  Regardless, industry standard Manning’s n and other key parameters should only be used/needed.  Use of non-standard values is a strong indicator there are other issues such as inflows, poor boundary representation or missing/erroneous topography. <br> | Typically, between TUFLOW Classic and HPC 2nd order this would only require a slight adjustment to Manning’s n values, any additional form losses at bends/obstructions or eddy viscosity values.  Regardless, industry standard Manning’s n and other key parameters should only be used/needed.  Use of non-standard values is a strong indicator there are other issues such as inflows, poor boundary representation or missing/erroneous topography. <br> | ||
| A greater adjustment of parameters would be expected if transitioning between HPC 1st order (or the original TUFLOW GPU) and Classic or HPC 2nd order.<br> | |||
| <br> | <br> | ||
| A greater adjustment of parameters would be expected if transitioning between HPC 1st order (or the original TUFLOW GPU) and Classic or HPC 2nd order. | |||
| =Do I need to change anything to run a TUFLOW Classic model in TUFLOW HPC?= | =Do I need to change anything to run a TUFLOW Classic model in TUFLOW HPC?= | ||
| Line 20: | Line 18: | ||
| <font color="blue"><tt>Solution Scheme </tt></font> <font color="red"><tt>==</tt></font><tt> HPC </tt> <br> | <font color="blue"><tt>Solution Scheme </tt></font> <font color="red"><tt>==</tt></font><tt> HPC </tt> <br> | ||
| The following command is also required to run the model using GPU hardware:<br>  | The following command is also required to run the model using GPU hardware:<br>  | ||
| <font color="blue"><tt>Hardware </tt></font> <font color="red"><tt>==</tt></font><tt> GPU </tt> | <font color="blue"><tt>Hardware </tt></font> <font color="red"><tt>==</tt></font><tt> GPU </tt><br> | ||
| <br> | |||
| =Why does my TUFLOW HPC simulation take longer than TUFLOW Classic?= | =Why does my TUFLOW HPC simulation take longer than TUFLOW Classic?= | ||
| The primary reasons why the HPC may run slow are discussed below: | The primary reasons why the HPC may run slow are discussed below:<br> | ||
| '''If run on a single CPU thread, Classic is a more efficient scheme'''<br> | '''If run on a single CPU thread, Classic is a more efficient scheme'''<br> | ||
| If running on the same CPU hardware, a well-constructed Classic model on a good timestep is nearly always faster than HPC running on a single CPU thread (i.e. not using GPU hardware).  Running a single HPC simulation across multiple CPU threads may produce a faster simulation than Classic. HPC is best run using GPU hardware. HPC run using good GPU hardware should be faster than Classic on CPU. The <u>[[Hardware_Benchmarking | Computer Hardware  Benchmark]]</u> page includes guidance on the fastest available hardware for TUFLOW modelling. | If running on the same CPU hardware, a well-constructed Classic model on a good timestep is nearly always faster than HPC running on a single CPU thread (i.e. not using GPU hardware).  Running a single HPC simulation across multiple CPU threads may produce a faster simulation than Classic. HPC is best run using GPU hardware. HPC run using good GPU hardware should be faster than Classic on CPU. The <u>[[Hardware_Benchmarking | Computer Hardware  Benchmark]]</u> page includes guidance on the fastest available hardware for TUFLOW modelling.<br> | ||
| '''Over utilisation of CPU threads/cores'''<br> | '''Over utilisation of CPU threads/cores'''<br> | ||
| Trying to run multiple HPC simulations across the same CPU threads.  If, for example, you have 4 CPU threads on your computer and you run two simulations that both request 4 threads, then effectively you are overloading the CPU hardware by requesting 8 threads in total.  This will slow down the simulations by more than a factor of 2.  The most efficient approach in this case is to run both simulations using 2 threads each, noting that if you are performing other CPU intensive tasks, this also needs to be considered.<br> | Trying to run multiple HPC simulations across the same CPU threads.  If, for example, you have 4 CPU threads on your computer and you run two simulations that both request 4 threads, then effectively you are overloading the CPU hardware by requesting 8 threads in total.  This will slow down the simulations by more than a factor of 2.  The most efficient approach in this case is to run both simulations using 2 threads each, noting that if you are performing other CPU intensive tasks, this also needs to be considered.<br> | ||
| By default, the number of CPU threads taken is two (2). You can control the number of threads requested by either using the <font color="blue"><tt>-nt<number_threads></tt></font> run time option, e.g. -nt2, or use the TCF command <font color="blue"><tt>CPU Threads</tt></font>. The -nt run time option overrides <font color="blue"><tt>CPU Threads</tt></font>.<br> | By default, from the 2020-01 release onwards the number of CPU threads taken is four (4). Previously, the default was two (2). You can control the number of threads requested by either using the <font color="blue"><tt>-nt<number_threads></tt></font> run time option, e.g. -nt2, or use the TCF command <font color="blue"><tt>CPU Threads</tt></font>. The -nt run time option overrides <font color="blue"><tt>CPU Threads</tt></font>.<br> | ||
| Note:  If Windows hyperthreading is active there typically will be two threads for each physical core.  For computationally intensive processes such as TUFLOW, it is recommended that hyperthreading is deactivated so there is one thread for each core. | Note:  If Windows hyperthreading is active there typically will be two threads for each physical core.  For computationally intensive processes such as TUFLOW, it is recommended that hyperthreading is deactivated so there is one thread for each core.<br> | ||
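The thread arithmetic above can be sketched in a few lines of Python. This is only an illustration of the even-split advice; the function names are hypothetical and are not part of TUFLOW. The only TUFLOW syntax used is the <tt>-nt</tt> run time option described above.

```python
def threads_per_simulation(available_threads: int, simulations: int) -> int:
    """Split the available CPU threads evenly between concurrent simulations.

    Hypothetical helper: returns at least 1 thread, so if there are more
    simulations than threads the CPU is still overloaded.
    """
    if available_threads < 1 or simulations < 1:
        raise ValueError("thread and simulation counts must be positive")
    return max(1, available_threads // simulations)


def is_oversubscribed(available_threads: int, requested: list[int]) -> bool:
    """True if the simulations together request more threads than exist."""
    return sum(requested) > available_threads


def nt_option(threads: int) -> str:
    """Format the -nt<number_threads> run time option, e.g. -nt2."""
    return f"-nt{threads}"


# Two simulations on a 4 thread machine: 4 + 4 requested threads overloads
# the CPU, whereas 2 threads each does not.
share = threads_per_simulation(4, 2)
print(share, nt_option(share))
print(is_oversubscribed(4, [4, 4]), is_oversubscribed(4, [share, share]))
```

Other CPU intensive tasks running at the same time are not accounted for here and would reduce the number of threads that can sensibly be requested.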
| '''Poor GPU Hardware'''<br> | '''Poor GPU Hardware'''<br> | ||
| If running a simulation using a low end or old GPU device, simulations may only be marginally faster than running Classic or HPC on CPU hardware.  If running on a GPU device, high end NVidia graphics cards are strongly recommended.  The performance of different NVidia cards varies by orders of magnitude. The <u>[[Hardware_Benchmarking | Computer Hardware  Benchmark]]</u> page includes guidance on the fastest available hardware for TUFLOW modelling. | If running a simulation using a low end or old GPU device, simulations may only be marginally faster than running Classic or HPC on CPU hardware.  If running on a GPU device, high end NVidia graphics cards are strongly recommended.  The performance of different NVidia cards varies by orders of magnitude. The <u>[[Hardware_Benchmarking | Computer Hardware  Benchmark]]</u> page includes guidance on the fastest available hardware for TUFLOW modelling.<br> | ||
| '''The HPC adaptive timestep is reducing to an extremely small number'''<br> | '''The HPC adaptive timestep is reducing to an extremely small number'''<br> | ||
| See <u>[[HPC_Adaptive_Timestepping | HPC Adaptive Timestepping]]</u> | See <u>[[HPC_Adaptive_Timestepping | HPC Adaptive Timestepping]]</u><br> | ||
| <br> | |||
| = Why is the TUFLOW HPC adaptive timestepping | = Why is the TUFLOW HPC adaptive timestepping selecting very small timesteps?= | ||
| Common reasons for TUFLOW HPC selecting very small timesteps are: | Common reasons for TUFLOW HPC selecting very small timesteps are: | ||
| * The model has one or more erroneous deep cells.  The Celerity Control Number described further above reduces the timestep proportionally to the square root of the depth, so any unintended deep cells can cause a reduction in the timestep. | * The model has one or more erroneous deep cells.  The Celerity Control Number described further above reduces the timestep proportionally to the square root of the depth, so any unintended deep cells can cause a reduction in the timestep. | ||
| Line 47: | Line 47: | ||
| * .hpc.dt.csv file (this file contains every timestep)  | * .hpc.dt.csv file (this file contains every timestep)  | ||
| * “Minimum dt” map output (excellent for identifying the location of the minimum timestep adopted – add “dt” to <font color="blue"><tt>Map Output Data Types </tt></font> <font color="red"><tt>==</tt></font>) | * “Minimum dt” map output (excellent for identifying the location of the minimum timestep adopted – add “dt” to <font color="blue"><tt>Map Output Data Types </tt></font> <font color="red"><tt>==</tt></font>) | ||
| <br> | |||
| = I know TUFLOW Classic, do I need to be aware of anything different with TUFLOW HPC?= | = I know TUFLOW Classic, do I need to be aware of anything different with TUFLOW HPC?= | ||
| Line 55: | Line 56: | ||
| * Be more thorough in reviewing the model results.  Although this is best practice for any modelling, it is paramount for unconditionally stable solvers like HPC that thorough checks of the model’s flow patterns and of its performance at boundaries and links are carried out.  | * Be more thorough in reviewing the model results.  Although this is best practice for any modelling, it is paramount for unconditionally stable solvers like HPC that thorough checks of the model’s flow patterns and of its performance at boundaries and links are carried out.  | ||
| * The CME%, which is an excellent indicator that the Classic 2D solver is numerically converging, is not generally of use for HPC, which is volume conserving and effectively 0% subject to numerical precision.  Non-zero whole of model CME% for HPC 1D/2D linked models is usually an indication of either the 1D and 2D adaptive timesteps being significantly different, or a poorly configured 1D/2D link. | * The CME%, which is an excellent indicator that the Classic 2D solver is numerically converging, is not generally of use for HPC, which is volume conserving and effectively 0% subject to numerical precision.  Non-zero whole of model CME% for HPC 1D/2D linked models is usually an indication of either the 1D and 2D adaptive timesteps being significantly different, or a poorly configured 1D/2D link. | ||
| <br> | |||
| = How much faster is TUFLOW HPC compared to Classic? = | |||
| This depends largely on the performance of the hardware (CPU and GPU) used to run HPC models. On average, HPC using GPU hardware runs about 10 to 20 times faster than Classic and about 30 to 40 times faster than HPC using the default number of CPU threads. Although HPC on CPU hardware with default settings is slower than Classic, more CPU threads can be used to achieve faster run times. As TUFLOW Classic is not parallelised, it can only run on one CPU thread and its run time cannot be improved with more CPU resources.<br> | |||
| For further information and discussion see: <u>[[Hardware_Benchmarking_Topic_HPC_on_CPU_vs_GPU | Hardware Benchmarking Topic HPC on CPU vs GPU]]</u>.<br> | |||
| <br> | |||
| =Will results from TUFLOW HPC using CPU match with HPC using GPU?= | |||
| TUFLOW HPC using CPU should produce results identical to TUFLOW HPC using GPU, because both use the same solver. However, HPC GPU and HPC CPU are compiled by different compilers, which can produce minor differences at the level of numerical precision. Also note that minor differences between HPC CPU and HPC GPU can be amplified in a model that is already unstable. If there are large differences in modelling results, it could be an indicator of model instability.<br> | |||
| <br> | |||
| = Why is my model using post 2020 HPC slower than pre 2020 HPC? = | |||
| From 2020, TUFLOW includes a number of new features (Quadtree, Sub-Grid Sampling, Wu turbulence model) which make the solution scheme more computationally complex. As such, 2020 and newer releases of TUFLOW run on average about 20% slower than their predecessors, even when the new features are not used. This applies to models with unchanged cell size. Using Quadtree and SGS can warrant changing to a bigger cell size in some parts of the model, which can decrease run times by far more than the 20%.<br> | |||
| Further change in run time can be due to different timestepping applied with the new default mesh size insensitive turbulence model (Wu instead of Smagorinsky). To confirm this is the case, the model can be run with 2020 release and the following commands:<br> | |||
| <font color="blue"><tt>Viscosity Formulation </tt></font> <font color="red"><tt>==</tt></font><tt> Smagorinsky</tt> <br> | |||
| <font color="blue"><tt>Viscosity Coefficients</tt></font> <font color="red"><tt>==</tt></font><tt> 0.5, 0.05</tt> <br> | |||
| Not all HPC models will show an increase in run time bigger than 20% when changing from the pre 2020 to post 2020 releases. Models that are controlled by the <u>[[HPC_Adaptive_Timestepping | Wave Celerity or Courant Control Numbers]]</u> are likely to be similar in runtime. However, especially where the cell size is smaller than the depth, the Wu approach is vastly superior to the Smagorinsky, and the more sophisticated Wu solution may start causing the <u>[[HPC_Adaptive_Timestepping | Diffusion Control Number]]</u> to control the timestepping causing longer run times.<br> | |||
| Despite the possible increase in runtime for some models, the Wu turbulence scheme is warranted particularly as cell sizes are typically getting smaller.<br> | |||
| <br> | |||
| = Is it possible to run two TUFLOW HPC simulations using a single GPU and how does this affect performance? = | |||
| Yes, it is possible to run two (or more) simulations on a single GPU. The performance depends on whether each model needs more than half of the GPU’s resources. At worst, the models complete just as fast as if they ran one after the other. At best, the GPU has enough resources to run both side by side and complete them as if they ran on two GPUs. Realistically, a model may use a bit more than half of the resources, causing some slowdown. A model run’s startup phase can require 4-6x the RAM of the model calculations, so the available memory needs to be sufficient to handle two (or more) models starting up side by side. If it is not sufficient, the models' startup phases can be staggered.<br> | |||
| <br> | |||
| = With Wu turbulence scheme being the new default, are old models using Smagorinsky wrong? = | |||
| Turbulence is pronounced in areas of highly transient flow, e.g. high velocities, bends, ledges and flow contraction/expansion. Where the flow is more benign and/or bed roughness is high, turbulence is not so important as it only applies where there are strong spatial velocity gradients, for example, for uniform flow in a straight rectangular channel the turbulence term is zero as there is no spatial velocity gradient.<br> | |||
| The problem with Smagorinsky, which is a large scale eddy turbulence model originally developed for coastal modelling, is that it is cell size dependent (is proportional to cell surface area) and tends to zero as the cell size tends to zero. This has historically not been a major issue as cell sizes have typically been greater than the depth. The general recommendation in the <u>[https://docs.tuflow.com/classic-hpc/manual/latest/ TUFLOW Manual]</u> is to be careful of using cell sizes significantly smaller than the depth (see Section 1.4). However, as cells have been becoming finer and finer with the advent of GPU models this issue has increasingly emerged, and is particularly pertinent if using a quadtree or flexible mesh and very small cells relative to the depth are being used. If this is the case, bigger differences will be present for bigger events where the water level and velocities are higher.<br> | |||
| TUFLOW, many years ago, changed from purely Constant or purely Smagorinsky to Smagorinsky plus (a small amount of) Constant. This improved absorption of eddies into the streamlines behind a bluff body and helped, to varying degrees, the modelling at finer cell sizes. This worked well for the time being; however, a new cell size insensitive turbulence scheme has now been implemented to improve the situation further. <br>   | |||
| The Smagorinsky/Constant turbulence combination has served the industry well and can continue to be used where the cell sizes are not significantly smaller than the depth where highly transient flows are occurring. If the model is well calibrated (using conventional parameters), continuing to use the Smagorinsky/Constant turbulence option is fine. Therefore, it is not considered that TUFLOW (or other good 2D solvers) have been producing questionable results, but that an improved turbulence representation is needed for 2D schemes with fine-scale cells. This is especially the case for the new Quadtree mesh option and for flexible meshes that utilise fine-scale cells.<br> | |||
| With the Wu turbulence scheme, the same viscosity parameter(s) can be applied across a wide range of scales from flume tests to large rivers.<br> | |||
| <br> | |||
| = Why is my HPC model getting unstable? Why am I getting a timestep error? Why are my control numbers so low? = | |||
| If there are no wet cells at the beginning of the simulation, the adaptive timestep can get quite big. Once the flow increases rapidly, instabilities can develop, which leads to oscillations in variables that grow over time, eventually leading to NaNs in the solution. Two situations can occur: | |||
| * The solver cannot reduce the large timestep to a sufficiently small timestep within the ten default trials, so the simulation becomes unstable and the model is stopped with an error. | |||
| * The solver is able to reduce the timestep sufficiently within the ten default trials and the simulation continues running. This, however, comes at the price of a slower run time: | |||
| ** Warning 2550 - instability timestep corrections might be recorded for some or all of the cells in the model at the end of the simulation. | |||
| ** If a wave celerity or Courant control number exceeds its target by up to 20%, or the diffusion control number exceeds its target by up to 10%, the step is still accepted, but the next timestep is factored down by the same percentage. | |||
| ** If there are NaNs or a control number exceeds its target by more than 20% or 10% respectively, the step is rejected, the timestep is halved, a repeat step is performed and the control number is cut to 90% of its previous value. The control number is then allowed to increase by only 0.001% per timestep, so it takes 10,000 timesteps to creep back up to the original value if there are no other reduced timesteps. This keeps the model running slower than it could and short of the usual control number limits for a while. | |||
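The accept/reject rule and the slow recovery described above can be illustrated with a short Python sketch. It is not the HPC source code: the function names are hypothetical, the target value of 1.0 in the usage lines is a placeholder, and the 0.001% creep is assumed to be measured against the original control number (which is what makes the recovery take 10,000 timesteps).

```python
import math

UNITS_PER_PERCENT = 1000  # one unit = 0.001% of the original control number


def classify_step(control_number: float, target: float, tolerance: float) -> str:
    """Classify a timestep using the thresholds described above.

    tolerance is 0.20 for the wave celerity and Courant numbers and 0.10
    for the diffusion number.
    """
    if math.isnan(control_number):
        return "reject"
    if control_number <= target:
        return "accept"
    if control_number <= target * (1.0 + tolerance):
        return "accept, reduce next timestep"
    return "reject"


def steps_to_recover(reduced_to_percent: int = 90) -> int:
    """Count timesteps needed to creep from the cut value back to 100%."""
    target = 100 * UNITS_PER_PERCENT
    current = reduced_to_percent * UNITS_PER_PERCENT
    steps = 0
    while current < target:
        current += 1  # assumed creep of 0.001% of the original value per step
        steps += 1
    return steps


print(classify_step(1.1, target=1.0, tolerance=0.20))
print(classify_step(1.15, target=1.0, tolerance=0.10))
print(steps_to_recover())
```

Under the same assumption, a second rejected step before recovery (90% of 90%, i.e. 81%) would need 19,000 timesteps to recover, which is why avoiding rejected steps early in the simulation matters for run time.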
| The below suggestions can be implemented to eliminate the instability and/or the decrease in control numbers: | |||
| * Specify initial water level for the whole model with <font color="blue"><tt>Set IWL</tt></font>  command or locally with <font color="blue"><tt>Read GIS IWL</tt></font> command. The wet cells can limit the adaptive timestep through the <u>[[HPC_Adaptive_Timestepping#HPC_2D_Timestep | Shallow Wave Celerity Number]]</u>, and prevent the HPC solver from using big timesteps. | |||
| * Use the <font color="blue"><tt>Timestep Maximum</tt></font> command to cap the maximum timestep. A good Timestep Maximum value to start with might be half the cell size in metres, e.g. if the cell size is 5m, the Timestep Maximum would be 2.5 seconds. The .hpc.tlf file can be checked to see if further refinement is needed. | |||
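As a quick illustration of the starting-point rule of thumb above, the following Python sketch turns a cell size into a candidate <tt>Timestep Maximum</tt> line. The helper names are hypothetical, and the value is only a first guess to be refined against the .hpc.tlf file.

```python
def initial_timestep_maximum(cell_size_m: float) -> float:
    """Rule-of-thumb starting value: half the cell size (m), in seconds."""
    if cell_size_m <= 0:
        raise ValueError("cell size must be positive")
    return cell_size_m / 2.0


def timestep_maximum_command(cell_size_m: float) -> str:
    """Format the starting value as a control file command line."""
    return f"Timestep Maximum == {initial_timestep_maximum(cell_size_m):g}"


print(timestep_maximum_command(5))
```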
| <br> | |||
| = Should I expect zero mass error for TUFLOW HPC models? = | |||
| The HPC solver uses a finite volume scheme which is volume conservative and shouldn't produce any mass error for 2D only models. Mass error can still occur when coupling HPC with the ESTRY engine, which isn't volume conservative. The cause could be 1D structures, 1D/2D linking or 1D/2D timestep synchronisation. Where there is not a one to one synchronisation of the 1D and 2D timesteps, mass error may occur due to the interpolation of the 1D/2D boundary values over time. A 'healthy' model will usually report up to 1% mass error. Higher mass error is an indication the solution is not converging, usually in isolated locations. In nearly all cases it is because of poor data (e.g. cliff edge between data sets) or model schematisation (e.g. large pit connected to one tiny 2D cell or boundary poorly digitised so it is not roughly perpendicular to the flow). The total model mass error can also be observed in the TUFLOW Summary File (*.tsf) output as opposed to the tlf file.<br> | |||
| <br> | |||
| = My HPC model uses double precision due to an unstable 1D channel. How can I make it stable in single precision? = | |||
| Currently, it is not possible to run the 1D ESTRY engine in double precision and the 2D TUFLOW engine in single precision for the same model. There are two ways to reduce the mass balance errors and switch to single precision: | |||
| * Improve the stability of the 1D features and 1D/2D links so that they perform well in single precision. The _TSMB and _TSMB1d2d layers might help locate specific features that could be worked on. | |||
| * Remove the 1D channel completely and use a Quadtree mesh to sufficiently refine the resolution of the creek in 2D. | |||
| <br> | |||
| = Should I see 100% GPU utilisation when no other processes are running on GPU? = | |||
| HPC still does a small amount of work on CPU, such as the model initiation and the final step of data reduction for model volume, control numbers, and stability checks. Frequent map outputs, particularly for large datasets, might also contribute to lower GPU utilisation, as writing of outputs happens on CPU. Even in a perfect world with a 2D only model, it isn't possible to see 100% GPU utilisation. If there are any 1D features in the model, the GPU utilisation will be even lower as 1D is processed on CPU only. A model with a 1D ESTRY connection can potentially be doing a lot of work on CPU, perhaps as much as 90% CPU and 10% GPU. If the CPU hardware is not matched correctly with the GPU card, it can become a bottleneck for HPC GPU runs even with a few 1D elements.<br> | |||
| <br> | |||
| =Why can't TUFLOW Classic be parallelised like TUFLOW HPC?= | |||
| It is due to its implicit solution using matrices, which means some steps in the calculations have dependencies within the numerical loops, so they either cannot be parallelised or are difficult to parallelise with any worthwhile benefit.  We have started work on parallelising sections of the code, but the reduced run times would not be as significant as if using an explicit scheme.  Explicit schemes (like TUFLOW GPU or FV) have no dependencies in their numerical loops, so all variables on the right hand side of the equation do not appear on the left (i.e. everything on the right hand side is from the previous timestep, except for values at the model’s boundaries).<br> | |||
| It is really important to understand that different schemes can have vastly different run times and being parallelised does not necessarily mean that one scheme is faster than another: | |||
| *Implicit schemes like TUFLOW Classic use much bigger timesteps than explicit schemes, hence why on a single core, like-for-like comparison they are faster and often a lot faster than explicit schemes. | |||
| *An explicit scheme that is parallelised will run a single simulation faster by around a factor of 5 on an 8 core machine – you will never get a speed-up of 8 on an 8 core machine as there is an overhead in managing the computations across the cores.  | |||
| *Users are often doing two or more simulations at the same time.  For example different events (100, 20 year…, different durations, etc).  In these situations, even if a scheme is parallelised, it is usually better and sometimes much better, to run each simulation unparallelised on their own core.  For example, if you have four simulations and four cores, definitely don’t run them parallelised, but run all four at once unparallelised.  If a fifth simulation is started up this will then slow down the other simulations.<br> | |||
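The trade-off in the points above can be checked with some simple arithmetic, sketched here in Python. The function names are hypothetical, the factor of 5 on 8 cores is the figure quoted above, and the sketch assumes that unparallelised simulations running on separate cores do not slow each other down.

```python
import math


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of the ideal speed-up actually achieved."""
    return speedup / cores


def clock_time_one_at_a_time(n_sims: int, single_core_time: float, speedup: float) -> float:
    """Run each simulation parallelised across all cores, one after another."""
    return n_sims * single_core_time / speedup


def clock_time_side_by_side(n_sims: int, cores: int, single_core_time: float) -> float:
    """Run the simulations unparallelised, one per core, in as many rounds as needed."""
    return math.ceil(n_sims / cores) * single_core_time


# Eight simulations of 10 hours each (single core) on an 8 core machine.
print(parallel_efficiency(5, 8))              # share of ideal speed-up
print(clock_time_one_at_a_time(8, 10.0, 5))   # hours, parallelised in sequence
print(clock_time_side_by_side(8, 8, 10.0))    # hours, all at once unparallelised
```

With these numbers the batch finishes sooner when every simulation runs unparallelised on its own core, which is the point made in the last bullet above.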
| <br> | |||
| =What are the two times in my _TUFLOW Simulations.log?= | |||
| There are three pieces of information in this section of the file:<br> | |||
| <ol>[[File:TUFLOW simulations log.png]]</ol> | |||
| *In green is the 'clock time', i.e. the time that has elapsed on a clock. It is the time the simulation has finished minus the start time.  | |||
| *In blue is the engine compute hardware. | |||
| *In red is the 'total processor time' that has been used for the simulation, including simulation start up. If multiple cores or GPUs have been used, it is the total time across each core or GPU. For example, if 12 cores all compute for 10 minutes, the compute time is 12x10 min = 2 hours, despite only 10 minutes elapsing in clock time. If the simulation is set to use only a single core or GPU then this should be less than the clock time.  | |||
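The relationship between the two times can be reproduced with Python's standard <tt>datetime</tt> module. This is only an illustration of the worked example above; the function name is hypothetical and it assumes every core or GPU is busy for the whole of the clock time.

```python
from datetime import timedelta


def total_processor_time(clock_time: timedelta, devices: int) -> timedelta:
    """Processor time summed across all cores or GPUs used by the simulation."""
    if devices < 1:
        raise ValueError("at least one core or GPU is required")
    return clock_time * devices


# 12 cores computing for 10 minutes of clock time.
print(total_processor_time(timedelta(minutes=10), 12))
```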
| <br> | |||
| <br> | |||
| {{Tips Navigation | |||
| |uplink=[[ HPC_Modelling_Guidance | Back to HPC Modelling Guidance]] | |||
| }} | |||
Latest revision as of 11:14, 20 November 2024
Will TUFLOW HPC and TUFLOW Classic results match?
No. There are a number of reasons for the differences:
- Solution Scheme: TUFLOW Classic uses a 2nd order ADI (Alternating Direction Implicit) finite difference solution of the 2D SWE, while the HPC solver uses a 2nd order explicit finite volume TVD (Total Variation Diminishing) solution (a 1st order HPC solution is also available, however 2nd order HPC is preferred for higher accuracy). As there is no exact solution of the equations (hence all the different solvers!), the two schemes produce different results. However, in 2nd order mode (default) the two schemes are generally consistent, with testing thus far indicating that Classic and HPC 2nd order produce peak level differences usually within a few percentage points of the depth in the primary conveyance flow paths. Greater differences can occur in areas adjoining the main flow paths and around the edge of the inundation extent where floodwaters are still rising or are sensitive to a minor rise in main flow path levels, or, due to the different numerical approaches, where upstream controlled weir flow occurs across thick or wide embankments.
- HQ boundary treatment: Significant differences may occur at 2D HQ boundaries for models using TUFLOW release 2020-01-AB and earlier. Classic treats the 2D HQ boundary as one HQ boundary across the whole HQ line, setting a water level based on the total flow across the line. Due to model splitting to parallelise the 2D domain across CPU or GPU cores, HPC applies the HQ boundary slope to each individual cell along the boundary. As of the 2020-10-AA release, the new HPC default for HQ boundaries is similar to Classic, treating the 2D HQ boundary as one. Nevertheless, as with all HQ boundaries, the effect of the boundary should be well away from the area of interest, and sensitivity testing should be carried out to demonstrate this.
- Turbulence scheme: As of the 2020 release, TUFLOW HPC uses the cell size insensitive Wu turbulence scheme as opposed to the Smagorinsky turbulence scheme used by Classic and earlier HPC releases. It is acceptable for well calibrated models to revert to the Smagorinsky turbulence scheme if required.
- Timestepping: HPC uses adaptive timestepping to maintain stability, whereas Classic uses a fixed timestep.
- If using 1st order HPC: For deep, fast flowing waterways, 1st order HPC tends to produce higher water levels and steeper gradients compared with the Classic and HPC 2nd order solutions.  These differences can exceed 10% of the primary flow path depth. Typically, lower Manning’s n values are required for HPC 1st order (or the original TUFLOW GPU), to achieve a similar result to TUFLOW Classic or HPC 2nd order. 
Is recalibration necessary if I switch from TUFLOW Classic to TUFLOW HPC?
Yes, if transitioning from Classic to HPC (or any other solver), it is best practice to compare the results, and if there are unacceptable differences, or the model calibration has deteriorated, to fine-tune the model performance through adjustment of key parameters.
Typically, between TUFLOW Classic and HPC 2nd order this would only require a slight adjustment to Manning’s n values, any additional form losses at bends/obstructions, or eddy viscosity values. Regardless, only industry standard Manning’s n and other key parameters should be used/needed. Use of non-standard values is a strong indicator there are other issues such as inflows, poor boundary representation or missing/erroneous topography.
A greater adjustment of parameters would be expected if transitioning between HPC 1st order (or the original TUFLOW GPU) and Classic or HPC 2nd order.
Do I need to change anything to run a TUFLOW Classic model in TUFLOW HPC?
For single 2D domain models, no, other than inserting the following basic TCF commands: 
Solution Scheme  == HPC  
The following command is also required to run the model using GPU hardware:
 
Hardware  == GPU 
Why does my TUFLOW HPC simulation take longer than TUFLOW Classic?
The primary reasons why HPC may run slowly are discussed below:
If run on a single CPU thread, Classic is a more efficient scheme
If running on the same CPU hardware, a well-constructed Classic model on a good timestep is nearly always faster than HPC running on a single CPU thread (i.e. not using GPU hardware). Running a single HPC simulation across multiple CPU threads may produce a faster simulation than Classic. HPC is best run using GPU hardware. HPC run using good GPU hardware should be faster than Classic on CPU. The Computer Hardware Benchmark page includes guidance on the fastest available hardware for TUFLOW modelling.
Over utilisation of CPU threads/cores
Running multiple HPC simulations across the same CPU threads overloads the CPU. If, for example, you have 4 CPU threads on your computer and you run two simulations that both request 4 threads, then effectively you are overloading the CPU hardware by requesting 8 threads in total. This will slow down the simulations by more than a factor of 2. The most efficient approach in this case is to run both simulations using 2 threads each, noting that if you are performing other CPU intensive tasks, this also needs to be considered.
By default, from the 2020-01 release onwards the number of CPU threads taken is four (4). Previously, the default was two (2). You can control the number of threads requested by either using the -nt<number_threads> run time option, e.g. -nt2, or use the TCF command CPU Threads. The -nt run time option overrides CPU Threads.
Note:  If Windows hyperthreading is active there typically will be two threads for each physical core.  For computationally intensive processes such as TUFLOW, it is recommended that hyperthreading is deactivated so there is one thread for each core.
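The thread arithmetic above can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical and not part of TUFLOW, and it simply divides the available threads evenly between simulations before formatting the -nt run time option described above.

```python
def threads_per_simulation(available_threads: int, n_simulations: int) -> int:
    """Largest equal share of threads that keeps the total request within
    what the machine has. If there are more simulations than threads, the
    result is still 1 each, which over-requests; run fewer at once instead."""
    if available_threads < 1 or n_simulations < 1:
        raise ValueError("thread and simulation counts must be positive")
    return max(1, available_threads // n_simulations)


def nt_option(threads: int) -> str:
    """Format the -nt<number_threads> run time option, e.g. -nt2."""
    return f"-nt{threads}"


# Two simulations on a 4-thread machine: 2 threads each.
share = threads_per_simulation(available_threads=4, n_simulations=2)
print(nt_option(share))  # -nt2
```

If other CPU intensive tasks are running, reduce available_threads accordingly before dividing.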
Poor GPU Hardware
If running a simulation using a low end or old GPU device, simulations may only be marginally faster than running Classic or HPC on CPU hardware. If running on a GPU device, high end NVidia graphics are strongly recommended. The performance of different NVidia cards varies by orders of magnitude. The Computer Hardware Benchmark page includes guidance on the fastest available hardware for TUFLOW modelling.
The HPC adaptive timestep is reducing to an extremely small number
See  HPC Adaptive Timestepping
Why is the TUFLOW HPC adaptive timestepping selecting very small timesteps?
Common reasons for TUFLOW HPC selecting very small timesteps are:
- The model has one or more erroneous deep cells. The Celerity Control Number described further above reduces the timestep proportionally to the square root of the depth, so any unintended deep cells can cause a reduction in the timestep.
- Poorly configured or schematised 2D boundary or 1D/2D link causing uncontrolled or inaccurate flow patterns. The high velocities may cause the Courant Number to control the timestep or the high velocity differentials can cause the Diffusion Number to force the timestep downwards. In these situations, Classic would often become unstable alerting the modeller to an issue. However, HPC will remain stable relying on the modeller to perform more thorough reviews of flow patterns at boundaries and 1D/2D links.
- If using the SRF (Storage Reduction Factor), this proportionally reduces the Δx and Δy length values in the control number formulae. This may further reduce the minimum timestep if a cell with an SRF value greater than 0.0 is the controlling cell. For example, applying an SRF of 0.8 to reduce the storage of a cell by 80%, or a factor of 5, also reduces the controlling timestep for that cell by a factor of 5.
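As a rough illustration of the first and third points, the sketch below assumes the celerity limit takes the common shallow water form dt = N·Δx/√(gh) with zero velocity, and that the SRF scales Δx by (1 - SRF). The function name, the control number of 1.0 and the example depths are assumptions for illustration; this is not the exact formula used inside the HPC solver.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def celerity_timestep(dx, depth, srf=0.0, control_number=1.0):
    """Timestep allowed by an assumed celerity limit dt = N * dx_eff / sqrt(g*h).

    dx_eff = dx * (1 - srf) mimics the SRF shortening the cell length
    used in the control number formulae.
    """
    if depth <= 0.0 or not 0.0 <= srf < 1.0:
        raise ValueError("depth must be positive and 0 <= srf < 1")
    dx_eff = dx * (1.0 - srf)
    return control_number * dx_eff / math.sqrt(G * depth)


normal = celerity_timestep(dx=5.0, depth=2.0)
deep = celerity_timestep(dx=5.0, depth=50.0)            # one erroneous deep cell
reduced = celerity_timestep(dx=5.0, depth=2.0, srf=0.8)

print(f"2 m deep cell:   dt = {normal:.3f} s")
print(f"50 m deep cell:  dt = {deep:.3f} s")
print(f"SRF 0.8 applied: dt = {reduced:.3f} s")
```

Under these assumptions a cell 25 times deeper than its neighbours cuts the allowable timestep by a factor of 5, the same reduction as an SRF of 0.8.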
To review and isolate the location of the minimum timestep the timesteps are output to:
- Console window and .hpc.tlf file
- .hpc.dt.csv file (this file contains every timestep)
- “Minimum dt” map output (excellent for identifying the location of the minimum timestep adopted – add “dt” to Map Output Data Types ==)
I know TUFLOW Classic, do I need to be aware of anything different with TUFLOW HPC?
Yes!  TUFLOW Classic tells you where your model has deficient or erroneous data, or where the model is poorly set up by going unstable, or producing a high mass error (a sign of poor numerical convergence of the matrix solution).  The best approach when developing a Classic model is to keep the timestep high (typically a half to a quarter of the cell size in metres), and if the simulation becomes unstable to investigate why.  In most cases, there will be erroneous data or poor set up such as a badly orientated boundary, connecting a large 1D culvert to a single SX cell, etc. 
 
HPC, however, remains stable by reducing its timestep and does not alert the modeller to these issues.  Therefore, the following tips are highly recommended, otherwise there will be a strong likelihood that any deficient aspects to the modelling won’t be found till much further down the track, potentially causing costly reworking.  So, it’s very much modeller beware!
- Use of excessively small timesteps is a strong indicator of poor model health (see discussion further above).
- If the timestepping is erratic (i.e. not changing smoothly), or there is a high occurrence of repeated timesteps, these are indicators of an issue in the model data or set up.
- Be more thorough in reviewing the model results. Although this is best practice for any modelling, it is paramount for unconditionally stable solvers like HPC that thorough checks of the model’s flow patterns and its performance at boundaries and links are carried out.
- The CME%, which is an excellent indicator that the Classic 2D solver is numerically converging, is not generally of use for HPC, which is volume conserving and effectively 0% subject to numerical precision. Non-zero whole of model CME% for HPC 1D/2D linked models is usually an indication of either the 1D and 2D adaptive timesteps being significantly different, or a poorly configured 1D/2D link.
How much faster is TUFLOW HPC compared to Classic?
This depends largely on the hardware used to run HPC models (CPU or GPU) and its performance. On average, HPC using GPU hardware runs about 10 to 20 times faster than Classic and about 30 to 40 times faster than HPC using the default number of CPU threads. Although HPC on CPU hardware is slower than Classic with default settings, more CPU threads can be used to achieve faster run times. As TUFLOW Classic is not parallelised, it can only run on one CPU thread and its run time cannot be improved with more CPU resources.
For further information and discussion see:  Hardware Benchmarking Topic HPC on CPU vs GPU.
Will results from TUFLOW HPC using CPU match with HPC using GPU?
TUFLOW HPC using CPU should produce near-identical results to TUFLOW HPC using GPU, because it uses the same solver. However, HPC GPU and HPC CPU are compiled by different compilers, which can produce minor differences at the level of numerical precision. Also note that minor differences between HPC CPU and HPC GPU can be amplified in a model that is already unstable. Large differences in modelling results could therefore be an indicator of model instability.
Why is my model using post 2020 HPC slower than pre 2020 HPC?
From the 2020 release, TUFLOW includes a number of new features (Quadtree, Sub-Grid Sampling, Wu turbulence model) that make the solution scheme more computationally complex. As such, 2020 and newer releases of TUFLOW run on average about 20% slower than their predecessors, even when the new features are not used. This applies to models with an unchanged cell size. Using Quadtree and SGS can warrant changing to a bigger cell size in some parts of the model, which can decrease run times by far more than the 20%.
Further change in run time can be due to different timestepping applied with the new default mesh size insensitive turbulence model (Wu instead of Smagorinsky). To confirm this is the case, the model can be run with 2020 release and the following commands:
Viscosity Formulation  == Smagorinsky 
Viscosity Coefficients == 0.5, 0.05 
Not all HPC models will show an increase in run time bigger than 20% when changing from the pre 2020 to post 2020 releases. Models that are controlled by the  Wave Celerity or Courant Control Numbers are likely to be similar in runtime. However, especially where the cell size is smaller than the depth, the Wu approach is vastly superior to the Smagorinsky, and the more sophisticated Wu solution may start causing the  Diffusion Control Number to control the timestepping causing longer run times.
Despite the possible increase in runtime for some models, the Wu turbulence scheme is warranted particularly as cell sizes are typically getting smaller.
Is it possible to run two TUFLOW HPC simulations using a single GPU and how does this affect performance?
Yes, it is possible to run two (or more) simulations on a single GPU. The performance depends on whether the model needs more than half of the GPU’s resources. At worst, the models complete just as fast as if they ran one after the other. At best, the GPU has enough resources to run both side by side and complete them as if they ran on two GPUs. Realistically, the model may use a bit more than half of the resources, causing some slowdown. The startup phase of a model run can require 4-6x the RAM of the model calculations, so the available memory needs to be sufficient to handle two (or more) models starting up side by side. If it is not sufficient, the models' startup phases can be staggered.
With Wu turbulence scheme being the new default, are old models using Smagorinsky wrong?
Turbulence is pronounced in areas of highly transient flow, e.g. high velocities, bends, ledges and flow contraction/expansion. Where the flow is more benign and/or bed roughness is high, turbulence is not so important as it only applies where there are strong spatial velocity gradients, for example, for uniform flow in a straight rectangular channel the turbulence term is zero as there is no spatial velocity gradient.
The problem with Smagorinsky, which is a large scale eddy turbulence model originally developed for coastal modelling, is that it is cell size dependent (is proportional to cell surface area) and tends to zero as the cell size tends to zero. This has historically not been a major issue as cell sizes have typically been greater than the depth. The general recommendation in the TUFLOW Manual is to be careful of using cell sizes significantly smaller than the depth (see Section 1.4). However, as cells have been becoming finer and finer with the advent of GPU models this issue has increasingly emerged, and is particularly pertinent if using a quadtree or flexible mesh and very small cells relative to the depth are being used. If this is the case, bigger differences will be present for bigger events where the water level and velocities are higher.
TUFLOW, many years ago, changed from purely Constant or purely Smagorinsky to Smagorinsky plus (a small amount of) Constant. This improved absorption of eddies into the streamlines behind a bluff body and helped, by varying degrees, the modelling at finer cell sizes. This worked well for the time being; however, a new cell size insensitive turbulence scheme (Wu) has now been implemented to improve the situation further.
  
The Smagorinsky/Constant turbulence combination has served the industry well and can continue to be used where the cell sizes are not significantly smaller than the depth where highly transient flows are occurring. If the model is well calibrated (using conventional parameters), continuing to use the Smagorinsky/Constant turbulence option is fine. Therefore, it is not considered that TUFLOW (or other good 2D solvers) have been producing questionable results, but that an improved turbulence representation is needed for 2D schemes with fine-scale cells. This is especially the case for the new Quadtree mesh option and for flexible meshes that utilise fine-scale cells.
With the Wu turbulence scheme, the same viscosity parameter(s) can be applied across a wide range of scales from flume tests to large rivers.
Why is my HPC model getting unstable? Why am I getting a timestep error? Why are my control numbers so low?
If there are no wet cells at the beginning of the simulation, the adaptive timestep can get quite big. Once the flow increases rapidly, instabilities can develop, which leads to oscillations in variables that grow over time, eventually leading to NaNs in the solution. Two situations can occur:
- The solver can't reduce the big timestep to a sufficiently small timestep within the ten default trials; the simulation becomes unstable and the model is stopped with an error.
- The solver is able to reduce the timestep sufficiently within the ten default trials and the simulation continues running. This, however, comes at the price of a slower run time:
  - Warning 2550 - instability timestep corrections might be recorded for some or all of the cells in the model at the end of the simulation.
  - If the Wave Celerity or Courant control number exceeds its target by up to 20%, or the Diffusion number by up to 10%, the step is still accepted, but the next timestep is factored down by the same percentage.
  - If there are NaNs, or a control number exceeds its target by more than 20% (Wave Celerity, Courant) or 10% (Diffusion), the step is rejected, the timestep is halved, a repeat step is performed and the control number is cut to 90% of its previous value. The control number is then allowed to increase by only 0.001% per timestep. It will take 10,000 timesteps to creep back up to the original value if there are no other reduced timesteps, keeping the model running slower than it could and not reaching the usual control number limits for a while.
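The acceptance rules above can be summarised in a short Python sketch. It is not the solver's code: the function names are hypothetical, the input is assumed to be the ratio of the achieved control number to its target, and the recovery is assumed to be linear (0.001% of the original target per timestep), which is the reading that gives the 10,000 timesteps quoted above.

```python
import math
from fractions import Fraction


def classify_step(ratio, is_diffusion=False):
    """Classify a timestep from ratio = achieved / target control number.

    Tolerances follow the text: 20% for the wave celerity and Courant
    numbers, 10% for the diffusion number. NaN always forces a repeat.
    """
    tolerance = 0.10 if is_diffusion else 0.20
    if math.isnan(ratio) or ratio > 1.0 + tolerance:
        return "repeat"             # reject, halve the timestep, cut target to 90%
    if ratio > 1.0:
        return "accept_and_reduce"  # accept, factor the next timestep down
    return "accept"


def recovery_steps(cut=Fraction(9, 10), increment=Fraction(1, 100000)):
    """Timesteps for the target to creep from `cut` back to 1.0 (linear growth)."""
    return math.ceil((1 - cut) / increment)


print(classify_step(1.15))                     # accept_and_reduce
print(classify_step(1.15, is_diffusion=True))  # repeat
print(classify_step(float("nan")))             # repeat
print(recovery_steps())                        # 10000
```

Exact fractions are used for the recovery count so that floating point rounding does not turn 10,000 into 9,999 or 10,001.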
 
The below suggestions can be implemented to eliminate the instability and/or the decrease in control numbers:
- Specify an initial water level for the whole model with the Set IWL command, or locally with the Read GIS IWL command. The wet cells can limit the adaptive timestep through the Shallow Wave Celerity Number, and prevent the HPC solver from using big timesteps.
- Use the Timestep Maximum command to cap the timestep so it does not get too high. A good Timestep Maximum value to start with might be half the cell size in metres, e.g. if the cell size is 5m, set the Timestep Maximum to 2.5 seconds. The .hpc.tlf file can be checked to see if further refinement is needed.
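The starting value suggested above is simple enough to script when preparing many models. The helper below is hypothetical and assumes the command is written in the usual "Command == value" TCF form; check the resulting line against the TUFLOW manual before use.

```python
def timestep_maximum_command(cell_size_m: float) -> str:
    """Build a starting Timestep Maximum line: half the cell size, in seconds."""
    if cell_size_m <= 0:
        raise ValueError("cell size must be positive")
    return f"Timestep Maximum == {cell_size_m / 2:g}"


print(timestep_maximum_command(5))  # Timestep Maximum == 2.5
```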
Should I expect zero mass error for TUFLOW HPC models?
The HPC solver uses a finite volume scheme which is volume conservative and shouldn't produce any mass error for 2D only models. Mass error can still occur when coupling HPC with the ESTRY engine, which isn't volume conservative. The cause could be 1D structures, 1D/2D linking or 1D/2D timestep synchronisation. Where there is not a one to one synchronisation of the 1D and 2D timesteps, mass error may occur due to the interpolation of the 1D/2D boundary values over time. A 'healthy' model will usually report up to 1% mass error. Higher mass error is an indication the solution is not converging, usually in isolated locations. In nearly all cases it is because of poor data (e.g. cliff edge between data sets) or model schematisation (e.g. large pit connected to one tiny 2D cell or boundary poorly digitised so it is not roughly perpendicular to the flow). The total model mass error can also be observed in the TUFLOW Summary File (*.tsf) output as opposed to the tlf file.
My HPC model uses double precision due to unstable 1D channel. How can I make it stable in single precision?
Currently, it is not possible to run the 1D ESTRY engine in double precision and the 2D TUFLOW engine in single precision for the same model. There are two ways to reduce the mass balance errors and switch to single precision:
- Improve stability of the 1D features and 1D/2D links to perform well in single precision. The _TSMB and _TSMB1d2d layers might help locating specific features that could be worked on.
- Remove 1D channel completely and use Quadtree mesh to sufficiently refine resolution of the creek in 2D.
Should I see 100% GPU utilisation when no other processes are running on GPU?
HPC still does a small amount of work on CPU, such as the model initiation and the final step of data reduction for model volume, control numbers, and stability checks. Frequent map outputs, especially for large datasets, might also contribute to lower GPU utilisation as the writing of outputs happens on CPU. Even in a perfect world with a 2D only model, it isn't possible to see 100% GPU utilisation. If there are any 1D features in the model the GPU utilisation will be even lower, as 1D is processed on CPU only. A model with a 1D ESTRY connection can potentially be doing a lot of work on CPU, perhaps as much as 90% CPU and 10% GPU. If the CPU hardware is not matched correctly with the GPU card, it can become a bottleneck for HPC GPU runs even with a few 1D elements.
Why can't TUFLOW Classic be parallelised like TUFLOW HPC?
It is due to its implicit solution using matrices, which means some steps in the calculations have dependencies within the numerical loops, so they are impossible or difficult to parallelise with any worthwhile benefit. We have started work on parallelising sections of the code, but the reduced run times would not be as significant as if using an explicit scheme. Explicit schemes (like TUFLOW GPU or FV) have no dependencies in their numerical loops, so the variables on the right hand side of the equation do not appear on the left (i.e. everything on the right hand side is from the previous timestep, except for values at the model’s boundaries).
It is really important to understand that different schemes can have vastly different run times and being parallelised does not necessarily mean that one scheme is faster than another:
- Implicit schemes like TUFLOW Classic use much bigger timesteps than explicit schemes, hence why on a single core, like-for-like comparison they are faster and often a lot faster than explicit schemes.
- An explicit scheme that is parallelised will run a single simulation faster by around a factor of 5 on an 8 core machine. You will never get a speed-up of 8 on an 8 core machine as there is an overhead in managing the computations across the cores.
- Users are often doing two or more simulations at the same time.  For example different events (100, 20 year…, different durations, etc).  In these situations, even if a scheme is parallelised, it is usually better, and sometimes much better, to run each simulation unparallelised on its own core.  For example, if you have four simulations and four cores, definitely don’t run them parallelised, but run all four at once unparallelised.  If a fifth simulation is started up this will then slow down the other simulations.
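The last two points can be checked with simple arithmetic. The sketch below uses the speed-up of about 5 on 8 cores quoted above, assumes every simulation takes the same time on a single core, and ignores memory or disk contention. The function names and the 10 hour run time are illustrative only.

```python
import math


def batch_time_parallelised(n_sims, serial_time, speedup):
    """Run the simulations one after another, each spread across all cores."""
    return n_sims * serial_time / speedup


def batch_time_unparallelised(n_sims, serial_time, n_cores):
    """Run one single-core simulation per core, in waves of n_cores."""
    return math.ceil(n_sims / n_cores) * serial_time


# Eight simulations of 10 hours each (single core) on an 8 core machine.
print(batch_time_parallelised(8, 10, speedup=5))    # 16.0 hours
print(batch_time_unparallelised(8, 10, n_cores=8))  # 10 hours
```

Because the parallel speed-up (5) is less than the core count (8), the batch finishes sooner when each simulation has a core to itself. A ninth simulation would need a second wave and double the unparallelised batch time.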
What are the two times in my _TUFLOW Simulations.log?
There are three pieces of information in this section of the file:
- In green is the 'clock time', i.e. the time that has elapsed on a clock. It is the time the simulation has finished minus the start time.
- In blue is the engine compute hardware.
- In red is the 'total processor time' that has been used for the simulation, including simulation start up. If multiple cores or GPUs have been used it is the total time across each core or GPU. For example, if 12 cores all compute for 10 minutes, the compute time is 12x10 min = 2 hours, despite only 10 minutes elapsing in clock time. If the simulation is set to use only a single core or GPU then this should be less than the clock time.
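The relationship between the two times can be reproduced with a minimal Python sketch. It assumes every core is busy for the whole clock time and ignores start up; the function name is hypothetical.

```python
from datetime import timedelta


def total_processor_time(clock_time: timedelta, n_devices: int) -> timedelta:
    """Processor time summed over all cores or GPUs that computed in parallel."""
    return clock_time * n_devices


# 12 cores computing for 10 minutes of clock time.
print(total_processor_time(timedelta(minutes=10), 12))  # 2:00:00
```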
