Hardware Selection Advice: Difference between revisions

*TUFLOW FV - Run on GPU Hardware: A single model run uses the GPU(s) for computation. In general terms, the maximum model size is dependent on the available GPU and CPU RAM, and the runtime is driven by the CUDA core speed, the number of CUDA cores available and the GPU architecture. GPU performance is complex and is not easily inferred from GPU clock speed and core count alone; it is also very dependent on the ‘generation’ or architecture of the GPU. As TUFLOW FV requires some data exchange between GPU and CPU, the motherboard bus speeds and CPU speeds also play a role, but typically a much lesser one compared to the GPU CUDA compute.<br>
 
The <u>[[Hardware_Benchmarking_-_Results#CPU_Results | Hardware Benchmarking]]</u> page shows recently run combinations of GPU, CPU and RAM. These can be compared with the system intended for purchase. The recommendation is to seek advice from an appropriate computer hardware vendor who can advise on the compatibility and optimisation of the setup.<br>
<br>
 
=GPU Advice=
*TUFLOW HPC on GPU Hardware can be run in either single or double precision. However, for the vast majority of flood applications single precision is sufficient. We typically run our models on single precision. If you are unsure we recommend running with both the single and double precision solvers and comparing your results.
The precision solver you require will determine the type of GPU card that is best suited for your compute. For any given generation/architecture of cards, the “gaming” cards such as the GTX GeForce and RTX provide excellent single precision performance – typically comparable to that of the “scientific” cards such as the Tesla series. If double precision is required then the scientific cards are substantially faster, but these are also significantly more expensive. The Quadro series cards sit in between for both double precision performance and cost. When checking the specifications of a card, it should provide a breakdown of the single and double precision throughput in FLOPS. Single precision compute is typically sufficient for TUFLOW HPC modelling.
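To illustrate the difference between the two precisions, the following sketch (a hypothetical Python/NumPy illustration, not TUFLOW code) accumulates many small increments in both single and double precision and compares the totals:

```python
import numpy as np

# Hypothetical illustration (not TUFLOW code): summing ten million small
# increments shows how single precision results can drift slightly from
# double precision results, while usually remaining close in relative terms.
increments = np.full(10_000_000, 0.1, dtype=np.float64)

total64 = increments.sum(dtype=np.float64)                     # double precision
total32 = increments.astype(np.float32).sum(dtype=np.float32)  # single precision

rel_diff = abs(total64 - float(total32)) / total64
print(f"float64 sum: {total64:.6f}")
print(f"float32 sum: {float(total32):.6f}")
print(f"relative difference: {rel_diff:.2e}")
```

The same idea applies when comparing single and double precision model runs: a small relative difference suggests single precision is adequate, while a large divergence indicates the double precision solver may be needed.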
 
For the higher end GPU cards, users may wish to consider server-based computers rather than workstations, and also weigh the cost of an extra TUFLOW licence against the cost of the high end hardware.
 
===CPU RAM===
TUFLOW HPC on GPU hardware still uses the CPU to compute and store data (in CPU RAM) during model initialisation and for all 1D calculations. While we are working on improving our CPU RAM usage, we currently find that CPU RAM is often the limiter on the size of the model domain you can run, particularly if running over multiple GPU cards. During initialisation and simulation a model will typically require 4-6 times the amount of CPU RAM relative to GPU RAM. As an example, for a model that utilises 11GB of GPU RAM (typical memory for a high-end gaming card, corresponding to about a 50 million cell model), the CPU RAM required during initialisation will typically be in the range 44GB to 66GB. A model that fully utilises two 11GB GPUs (i.e. a 100 million cell model) may require as much as 128GB of CPU RAM during initialisation. Note that anything more than 256GB of CPU RAM exceeds the limitations of consumer chipsets available in 2025 and requires more expensive workstation hardware; additionally, users should consult a hardware expert to check the limitations of specific hardware.
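The 4-6 times rule of thumb above can be sketched as a small helper (a hypothetical Python illustration, not part of TUFLOW; the function name and factors are assumptions based on the figures in this section):

```python
# Hypothetical helper estimating CPU RAM needed during initialisation,
# using the 4-6x GPU RAM rule of thumb described above.
def estimate_cpu_ram_gb(gpu_ram_gb, low_factor=4, high_factor=6):
    """Return the (low, high) CPU RAM estimate in GB for a given GPU RAM usage."""
    return gpu_ram_gb * low_factor, gpu_ram_gb * high_factor

low, high = estimate_cpu_ram_gb(11)    # one 11GB card (~50 million cells)
print(f"Single 11GB GPU: {low}-{high} GB CPU RAM")

low2, high2 = estimate_cpu_ram_gb(22)  # two 11GB cards (~100 million cells)
print(f"Dual 11GB GPUs: {low2}-{high2} GB CPU RAM")
```

For the two-GPU case this gives an estimate of 88-132GB, consistent with the "as much as 128GB" figure quoted above.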
 
=== RAM Reliability (ECC vs non-ECC) ===
ECC (Error-Correcting Code) RAM detects and corrects memory errors, improving reliability, while non-ECC RAM cannot. Use of non-ECC memory may raise concerns about memory errors affecting simulation results. However, large [https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf field studies] show errors are usually caused by physical faults in specific DIMMs (Dual In-line Memory Modules, the removable RAM sticks), not uniform random events. Most DIMMs experience no errors, while a small number produce the vast majority of faults. Modern DDR5 memory also includes on-die correction that silently fixes some errors before they leave the chip.
 
A failing DIMM on a non-ECC system is more likely to cause crashes or obvious corruption than a silent incorrect result. In numerical solvers, bit flips often trigger instability or failure rather than plausible but wrong outputs. For a single TUFLOW workstation, ECC is generally not required solely to protect result quality, though it may be beneficial for servers, critical workloads, or environments operating many machines.
 
For non-ECC memory, if additional confidence is desired, run a memory test (for example [https://www.memtest.org/ Memtest86+]) for multiple passes after installation. Consistent errors indicate defective hardware that should be replaced.
 
===CUDA Cores, GPU Clock speed, and FLOPs ===