Hardware Benchmarking

Introduction

We frequently get asked, "What is the minimum or recommended hardware for TUFLOW modelling?". This is always a tricky question, as the answer depends on the type and size of the models you are going to run. For a small model, TUFLOW should run on any modern PC or laptop capable of running Windows XP or later. However, large models may require a more powerful computer running a 64 bit version of Windows.
The tables below, showing computer specifications and model run-times, should help you compare systems.
This page outlines a hardware benchmark model which is available to download from the TUFLOW website. The model can be simulated without a TUFLOW dongle (licence), which makes it easy to benchmark on a range of computers; the results are compiled below.
We have typically found that the CPU speed is the largest influence on TUFLOW runtimes, with the RAM speed also having an influence for large models. In order to quantify this we are compiling the computational times recorded on a range of different machines.

Benchmark Model

The benchmark model is based on a “challenge” issued prior to the 2012 Flood Managers Association (FMA) Conference in Sacramento, USA. There is more information on the model setup and purpose in the FMA challenge model introduction.
This hardware benchmark is based on the second challenge, which involves a coastal river in flood with two ocean outlets. The model has been modified slightly (mainly in terms of the outputs). It is set up to run using both the TUFLOW "classic" (CPU) and TUFLOW GPU (graphics card) solvers for a range of cell sizes.
Cell sizes

{| class="wikitable"
! Cell Size (m) !! Number of cells
|-
| 30 || 80,887
|-
| 15 || 323,536
|-
| 10 (GPU only) || 727,865
|}

The model runs for three days of simulation time (72 hours). The approximate run time for the 30m model on the CPU is likely to be ~20 minutes, and for the 15m version approximately 4 hours. Given the runtime for the CPU model at 10m resolution is likely to be more than 12 hours, this run is skipped in the benchmark (it can still be run with a licence if desired).
To participate in the benchmark, please follow the steps below:

  • Download the model from http://www.tuflow.com/Download/TUFLOW/Benchmark_Models/FMA2_GPU_CPU_Benchmark.zip
  • Extract the model on a local drive of the computer you would like to benchmark.
  • Navigate to the TUFLOW\runs\ folder and run the "Run_Benchmark.bat" file. This checks if you are running a 32 or 64 bit system and then runs the benchmark. This also generates some output files that contain more information on the processor, memory and GPU that you are using.
  • Email the _ TUFLOW Simulations.log, cpu.txt, ram.txt and GPU.txt files to support@tuflow.com and we will include these in the results tables below.

An nVidia graphics card that is CUDA compatible is required to run the GPU models. For more information on this please see the release notes.
The computer information is determined in the batch file using the wmic and dxdiag commands.
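As an indication of what these commands involve, the sketch below shows one way the same information could be gathered on Windows. It is an illustrative example only, not the contents of the distributed Run_Benchmark.bat; in particular, the 32/64 bit check here simply reports the result, whereas the real batch file uses it to run the benchmark with the appropriate TUFLOW executable.

<pre>
@echo off
REM Illustrative sketch only - not the distributed Run_Benchmark.bat.
REM Captures the system information that is emailed in with the results.

REM Processor name, stock clock speed and core count -> cpu.txt
wmic cpu get Name,MaxClockSpeed,NumberOfCores /format:list > cpu.txt

REM Installed memory modules and their speed -> ram.txt
wmic memorychip get Capacity,Speed /format:list > ram.txt

REM Full DirectX report, which includes the graphics card details -> GPU.txt
REM (dxdiag may take a few moments to finish writing the report)
dxdiag /t GPU.txt

REM Report whether this is a 32 or 64 bit system
if "%PROCESSOR_ARCHITECTURE%"=="AMD64" (
    echo 64 bit system detected
) else (
    echo 32 bit system detected
)
</pre>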

CPU Results

The following table summarises the runtimes for a range of computers. More will be added when additional results are obtained. The table is ordered based on the combined 30m and 15m runtimes, with the fastest computers at the top of the table.
Runtimes for CPU benchmarks

{| class="wikitable"
! Processor Name !! Processor Frequency (GHz)** !! RAM size (GB) !! RAM frequency (MHz) !! Runtime 30m (mins) !! Runtime 15m (mins) !! Runtime 10m (mins) !! Runtime Combined (mins) !! System Name
|-
| Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz || 4 || 32 || 1333 || 20.5 || 220.4 || N/A || 240.9 || BRA
|-
| Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz || 4 || 16 || 1600 || 22.68 || 244.25 || N/A || 266.93 || RH1
|-
| Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz || 3 || 64 || 2133 || 21.23 || 247.55 || N/A || 268.78 || MON
|-
| Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz || 3.4 || 8 || 1600 || 23.9 || 256.7 || N/A || 280.6 || PAR
|-
| Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz || 3.5 || 32 || 2133 || 23.6 || 269.25 || N/A || 292.85 || RH2
|-
| Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz || 3.5 || 128 || 2133 || 24.58 || 277.1 || N/A || 301.63 || PTR
|-
| Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz || 3.6 || 32 || 1600 || 25.8 || 268.3 || N/A || 294.12 || CCA
|-
| Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz || 2.8 || 8 || 1600 || 26.9 || 284.1 || N/A || 311.05 || EUK
|-
| Intel(R) Core(TM) i5-4570S CPU @ 2.90GHz || 2.9 || 8 || 1600 || 27.65 || 283.71 || N/A || 311.36 || LM2
|-
| Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.50GHz || 3.1 || 16 || 2133 || 24.93 || 291.7 || N/A || 316.63 || MBA
|-
| Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz || 3.5 || 32 || 1600 || 28.5 || 285.9 || N/A || 314.4 || RH3
|-
| Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz || 3.2 || 16 || 1600 || 31.1 || 297.43 || N/A || 328.53 || RH3
|-
| Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz || 2.7 || 16 || 1600 || 31.7 || 301.5 || N/A || 333.2 || MJS
|-
| Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz || 2.7 || 32 || 1600 || 29.1 || 308.12 || N/A || 337.22 || JT1
|-
| Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz || 3.3 || 64 || 2133 || 29.2 || 317.1 || N/A || 346.3 || EOG
|-
| Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz || 3.3 || 64 || 2133 || 33.08 || 317.86 || N/A || 350.94 || JAC
|-
| Intel(R) Xeon(R) CPU E5-2670 V3 @ 2.30GHz || 2.3 || 96 || 2133 || 28.4 || 333.35 || N/A || 361.75 || RK2
|-
| Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz || 3.4 || 32 || 1600 || 39.0 || 334.4 || N/A || 373.4 || XEO
|-
| Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz || 3.6 || 32 || 1600 || 44.18 || 335.82 || N/A || 380.00 || DCO
|-
| Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.80GHz || 2.8 || 8 || 800 || 47.43 || 343.23 || N/A || 390.66 || AJI
|-
| Intel(R) Core(TM) i5-4300U CPU @ 3.30GHz || 1.9 || 8 || 1600 || 35.63 || 365.81 || N/A || 393.98 || LP1
|-
| Intel(R) Xeon(R) W3565 CPU @ 3.20GHz || 3.2 || 12 || 1333 || 37.88 || 356.1 || N/A || 401.44 || LP2
|-
| 2 x Intel(R) Xeon(R) X5680 CPU @ 3.33GHz || 3.3 || 64 || 1333 || 40.5 || 368.9 || N/A || 409.35 || WMD
|-
| Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz || 2.2 || 16 || 1333 || 40.3 || 375.33 || N/A || 415.63 || FFN
|-
| 2 x Intel(R) Xeon(R) CPU E5-2643 V3 @ 3.40GHz || 3.4 || 128 || 2133 || 40.5 || 377.1 || N/A || 418.14 || XYG
|-
| Intel(R) Xeon(R) E5-2630 CPU @ 2.30GHz || 2.3 || 64 || 1333 || 40.1 || 393.92 || N/A || 434.02 || HUH
|-
| Intel(R) Xeon(R) E5-1603 0 CPU @ 2.80GHz || 2.8 || 16 || 1600 || 40.85 || 395.81 || N/A || 436.66 || LMD
|-
| 2 x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.80GHz || 2.3 || 38 || 1333 || 41.3 || 401.12 || N/A || 444.42 || RH5
|-
| Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz || 2.7 || 8 || 1600 || 39.5 || 420.7 || N/A || 460.2 || HUK
|-
| Intel(R) Core(TM) i7-920 CPU @ 2.67GHz || 2.67 || 12 || 1066 || 45.05 || 420.7 || N/A || 465.75 || REJ
|-
| Intel(R) Xeon(R) CPU W3505 @ 2.53GHz || 2.53 || 4 || 1333 || 49.12 || 453.5 || N/A || 502.62 || JT2
|}

GPU Results

The following table summarises the runtimes for a range of computers. More will be added when additional results are obtained. The table is ordered based on the combined 30m, 15m and 10m runtimes with the fastest computers at the top of the table.
The GPU benchmark only uses a single GPU card. TUFLOW GPU can be run across multiple nVidia GPU devices; however, the benefit of this is typically more noticeable for larger models with more than 1 million cells. A number of additional benchmarking tests have been completed using a 2m model and multiple GPU cards (see Large Model GPU Benchmarking below).
Runtimes for GPU benchmarks

{| class="wikitable"
! Processor Name !! Graphics Card !! GPU RAM (GB) !! Number of CUDA Cores* !! Runtime 30m (mins) !! Runtime 15m (mins) !! Runtime 10m (mins) !! Combined Runtime (mins) !! System Name
|-
| Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz || NVIDIA GeForce GTX 980 || 4 || 2,048 || 1.4 || 7.8 || 24.4 || 33.5 || BRA
|-
| Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz || NVIDIA GeForce GTX 980 || 4 || 2,048 || 1.77 || 8.42 || 25.35 || 35.54 || PTR
|-
| Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz || NVIDIA GeForce GTX 980 || 4 || 2,048 || 1.8 || 8.7 || 25.2 || 35.7 || EOG
|-
| Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz || NVIDIA GeForce GTX 980 || 4 || 2,048 || 1.73 || 9.05 || 24.95 || 35.73 || JAC
|-
| Intel(R) Xeon(R) CPU E5-2670 V3 @ 2.30GHz || NVIDIA GeForce GTX 980 || 4 || 2,048 || 1.95 || 8.76 || 25.16 || 35.84 || RK2
|-
| Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz || NVIDIA GeForce GTX TITAN Black || 4 || 2,880 || 2.05 || 10.56 || 30.78 || 43.39 || DCO
|-
| 2 x Intel(R) Xeon(R) CPU E5-2643 V3 @ 3.40GHz || NVIDIA Quadro K6000 || 4 || 2,880 || 2.63 || 11.45 || 32.23 || 46.31 || XYG
|-
| Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz || NVIDIA GeForce GTX 770 || 2 || 1,536 || 1.9 || 11.5 || 36.8 || 50.2 || PAR
|-
| Intel(R) Xeon(R) E5-2630 CPU @ 2.30GHz || NVIDIA GeForce GTX 680 || 2 || 1,536 || 2.35 || 12.95 || 41.5 || 56.8 || HUH
|-
| Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz || NVIDIA GeForce GTX 690 || 2 || 1,536 || 2.3 || 13.7 || 43.6 || 59.6 || XEO
|-
| Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz || NVIDIA Tesla K20c || 5 || 2,496 || 2.13 || 13.82 || 44.47 || 60.42 || MBA
|-
| Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz || NVIDIA Quadro K4200 || 4 || 1,344 || 2.5 || 16.8 || 55.1 || 74.3 || CCA
|-
| 2 x Intel(R) Xeon(R) CPU X5680 @ 3.33GHz || NVIDIA Tesla C2075 || 4 || 448 || 3.4 || 19.1 || 58.4 || 80.85 || WMD
|-
| Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz || NVIDIA GeForce GTX 750 Ti || 2 || 640 || 2.93 || 18.9 || 61.48 || 83.31 || MON
|-
| Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz || NVIDIA GeForce GTX 750 Ti || 2 || 640 || 4.78 || 18.555 || 60.4 || 83.76 || RH1
|-
| Intel(R) Core(TM) 2 Quad CPU Q9550 @ 2.80GHz || NVIDIA Quadro 4000 || 4 || 768 || 5.2 || 32.23 || 103.99 || 141.24 || AJI
|-
| Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz || NVIDIA Quadro K3100M || 4 || 768 || 5.2 || 37.42 || 107.33 || 149.95 || JT1
|-
| Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz || NVIDIA GeForce GTX 560M || 2 || 192 || 6.78 || 46.8 || 154.72 || 208.3 || FFN
|-
| Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz || NVIDIA NVS 5200M || 1 || 96 || 12.7 || 89.3 || 303.2 || 405.2 || MJS
|}
* It is noted that the number of CUDA cores is not provided as an output from the '''dxdiag''' command; this information has been sourced from the nVidia website.<br>
** The cpu.txt output only provides the 'out of the box' processor speed. If you have overclocked your CPU, please send these details through to TUFLOW Support so we can add the correct clock speed.

Discussion

The preliminary results below are based on the benchmark data submitted so far.

Preliminary CPU Results

The comparison of the CPU results below presents a few interesting points for discussion:

  • The runtimes for both models display similar variance as a percentage of the total time across hardware capabilities (26% and 21% relative standard deviation for the 30m and 15m models respectively).
  • The runtimes for both the 15m and 30m models show variance that is largely, but not entirely, linked to CPU frequency. The results are dispersed, perhaps reflecting chip variability, chipset or other system factors.
  • The difference in runtime between the fastest and slowest hardware (~300%) is much less than the difference in average runtime between the 30m and 15m models (970%). Thus, nothing can improve your model runtime like efficient model design!
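Here the relative standard deviation is simply the standard deviation of the runtimes divided by their mean, expressed as a percentage:

<math>\mathrm{RSD} = 100 \times \frac{\sigma}{\mu}</math>

where <math>\sigma</math> and <math>\mu</math> are the standard deviation and mean of the runtimes across the benchmarked systems.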



Preliminary GPU Results

  • Similar to the CPU results, decreasing the model cell size increases the variability in runtime per CUDA core.
  • Unlike the CPU results, the variability in runtime between cards is greater than the variability associated with the change in model cell size. Thus, it could be argued that the runtime of your GPU model is more dependent on the type of card you have than the runtime of your CPU model is on the processor frequency.
  • From the preliminary results, the NVIDIA GTX 980 seems a crowd favourite and performs well, returning the fastest combined runtimes in the table. It is likely that, as model size increases, the Titan Black and K6000 with 2,880 CUDA cores will deliver faster runtimes.



Average Reduction in Runtime from CPU to GPU

When comparing the CPU and GPU runtimes for the 15m and 30m models, on average the following runtime improvements are achieved:

  • 12.6x reduction in runtime for the 30m model (80,000 cells)
  • 23.8x reduction in runtime for the 15m model (325,000 cells)
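For a single system the reduction factor is simply the CPU runtime divided by the GPU runtime for the same grid. For example, using the BRA results from the tables above for the 30m model:

<math>\frac{20.5\ \text{min (CPU)}}{1.4\ \text{min (GPU)}} \approx 14.6</math>

The 12.6x and 23.8x figures above are the corresponding averages across all of the submitted results.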

These results highlight the relationship between the GPU/CPU runtime reduction and the number of cells in a model: the reduction ratio increases with the size of the model (number of cells). Up to a 100x reduction in runtime has been recorded using an 18,000,000 cell GPU model; refer to [[Hardware_Benchmarking#Large_Model_GPU_Benchmarking|Large Model GPU Benchmarking]] below.

[[File:Bench_CPU_vs_GPU.png | 800px ]]


Large Model GPU Benchmarking

In addition to the benchmarking completed on the 10m, 15m and 30m models, a number of tests were completed by running the FMA Demo Model 2 at a 2m resolution on up to four GPU cards. The 2m model has approximately 18.2 million cells and was simulated for the following test cases:

  • Run with 1 x NVIDIA Geforce GTX 680 GPU card
  • Run with 2 x NVIDIA Geforce GTX 680 GPU cards
  • Run with 3 x NVIDIA Geforce GTX 680 GPU cards
  • Run with 4 x NVIDIA Geforce GTX 680 GPU cards
  • Run with CPU Only

The five runs detailed above were also re-run on the 10m grid.

Explanation of Tabulated Results

The large model benchmarking results are summarised in the table below. The contents of each column are detailed as follows:

  • 2m Runtime: Total time for the 2m model to complete.
  • 10m Runtime: Total time for the 10m model to complete.
  • 2m Runtime (realtime (mins) / simtime (hour)): Number of minutes in 'real time' to run 60 mins of model time. For example, with 1 GPU card it takes 51.3 mins to run 60 mins of model time. With four GPU cards it takes 22.4 mins to run 60 mins of model time.
  • 10m Runtime (realtime (mins) / simtime (hour)): Number of minutes in 'real time' to run 60 mins of model time.
  • 2m CPU/GPU Speedup Factor: How much faster the GPU/Multi GPU runs are compared to the CPU only for the 2m model.
  • 10m CPU/GPU Speedup Factor: How much faster the GPU/Multi GPU runs are compared to the CPU only for the 10m model.
  • 2m MultiGPU Speedup Factor: How much faster the Multi GPU runs complete compared to when only a single GPU card is used
  • 10m MultiGPU Speedup Factor: How much faster the Multi GPU runs complete compared to when only a single GPU card is used.
Runtimes for large model GPU benchmarks

{| class="wikitable"
! Run ID !! 2m Runtime (min) !! 10m Runtime (min) !! 2m Runtime (realtime (mins) / simtime (hour)) !! 10m Runtime (realtime (mins) / simtime (hour)) !! 2m CPU/GPU Speedup Factor !! 10m CPU/GPU Speedup Factor !! 2m MultiGPU Speedup Factor !! 10m MultiGPU Speedup Factor
|-
| 1 x NVIDIA Geforce GTX 680 GPU || 513.2 || 4.6 || 51.3 || 0.5 || 44 || 18 || 1 || 1
|-
| 2 x NVIDIA Geforce GTX 680 GPU || 318.5 || 3.5 || 31.8 || 0.01 || 71 || 23 || 1.6 || 1.3
|-
| 3 x NVIDIA Geforce GTX 680 GPU || 230.6 || 3.2 || 23.1 || 0.01 || 98 || 26 || 2.2 || 1.4
|-
| 4 x NVIDIA Geforce GTX 680 GPU || 223.7 || 3.6 || 22.4 || 0.01 || 101 || 23 || 2.3 || 1.3
|-
| CPU Only || 23478.3 || 81.5 || 2347.8 || 0.14 || NA || NA || NA || NA
|}
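As a worked example of the MultiGPU speedup factor, the four-card 2m run takes 22.4 minutes of real time per hour of model time compared with 51.3 minutes for a single card:

<math>\frac{51.3}{22.4} \approx 2.3</math>

which is the 2m MultiGPU Speedup Factor reported for the 4 x NVIDIA Geforce GTX 680 run in the table above.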

Discussion

The results of the large model GPU testing indicate:

  • The GPU is 44-101 times faster than the CPU for the 2m grid, depending on the number of GPU cards.
  • The GPU is 18-26 times faster than the CPU for the 10m grid, depending on the number of GPU cards.
  • Using multiple GPUs is 1.6 to 2.3 times faster than a single GPU for the 2m model.
  • Using multiple GPUs is 1.3 to 1.4 times faster than a single GPU for the 10m model.
  • GPU performance increases with increasing model size, as does the benefit of using multiple GPUs, where initialisation isn't the major factor in run times.
  • If you have done any testing on much larger models, we would love to hear how you have gone! Please send details to support@tuflow.com.

Back to TUFLOW Benchmarking