Organisation Cloud Software Execution

The TUFLOW End User Licence Agreement was updated in 2018 allowing companies to host their own licences on the cloud. The only restrictions associated with users running TUFLOW simulations on their own company public or private cloud environment are:

  1. The licence must be a “Network” type (use of “Local” licences is not permitted on the cloud).
  2. Usage of TUFLOW software on a virtual machine is confined to Authorised Users within the Licensee's Network. This clause means companies cannot on-sell access to TUFLOW licences hosted in the cloud or otherwise (excluding TUFLOW vendor contract arrangements).

Configuration of your cloud environment is your own responsibility. There are numerous ways TUFLOW licensing and simulation can be configured in a cloud environment, depending on the cloud provider (Microsoft Azure, Google Cloud, Amazon Web Services, etc.) and internal company protocols. We recommend engaging a professional with suitable cloud architecture expertise to design your bespoke system. Clients who have already migrated to the cloud have done so in a variety of ways:

  • Some use a hardware lock (USB) dongle that resides in their office on a physical computer or server. Cloud virtual machines link to the network licence via the IP address of the hardware lock.
  • Others use a software lock. Software locks are a digital licence file that is locked to a dedicated host computer, server or virtual machine. When using a software lock, please select the host carefully, as the software licence will be bound to it. Relocating the licence to a new host will require TUFLOW sales staff to reissue the licence, which incurs a small administration fee.
Please Note: Network licence rentals can be used to upscale the available licences on your cloud system when demand requires it. 
Refer to the TUFLOW Pricelist for more information.

This detailed report from the TUFLOW Library discusses some benefits, challenges and solutions relating to cloud computing to help people who are setting up their own system: Running TUFLOW on the Cloud (Whitepaper)

Common Questions Answered (FAQ)

How do I execute a simulation on the cloud? Can I still use batch files?

Running a simulation on the cloud can be very similar to running it on any other computer. You can access a VM remotely just as you would any other remote computer, using Remote Desktop, SSH, VNC, an X-Server client, etc. - whatever you are used to and whatever is set up on the VM. However, that assumes the VM is configured for that type of access and is running when you need to connect to it. If you want to make use of the real benefits of the cloud, such as the ability to run on many computers at once that start automatically only when needed, such an interactive process would be very cumbersome, and you may want to consider more advanced techniques like Azure Batch, AWS Batch, or Google Cloud Batch. In either case, the VMs running the model will need access to a TUFLOW licence server (see "Do I need a different licence to run models on the cloud?" below), they will always need CodeMeter installed and configured to find the licence you plan to use, and they will need appropriate drivers for hardware like GPUs.
When running on the cloud, consider that you may not have network access to locations where you would normally store your results. You may need to set up storage in the cloud separate from the VM, but connected to it, to collect your results and still have them available to you once the VM stops running.
If you're using remote access to desktop VMs, you can still use batch files or scripts as you're used to (a minimal sketch follows below). If you move to batch services, you will need more involved scripting: you would typically not use batch files, but instead split the work into separate tasks for the cloud platform to schedule on available computers. Keep in mind that this is a substantial and complex task, requiring development and IT skills. If you plan on this type of cloud use, plan ahead and have a working, tested solution in place before you take on a deadline.
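As a minimal sketch of the desktop-VM approach, the batch file below runs two simulations in sequence, just as it would locally. The executable path and control file names are placeholders for your own installed TUFLOW version and models:

```bat
:: Minimal sketch: run two TUFLOW simulations in sequence on a cloud VM.
:: The executable path and .tcf file names are placeholders.
set TUFLOW_EXE=C:\TUFLOW\Releases\2023-03-AE\TUFLOW_iSP_w64.exe

:: -b runs in batch mode, so the console doesn't wait for a key press.
start "TUFLOW" /wait "%TUFLOW_EXE%" -b Model_Run_001.tcf
start "TUFLOW" /wait "%TUFLOW_EXE%" -b Model_Run_002.tcf
```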

Do I need a different TUFLOW executable to run models on the cloud?

No, you can use the same executable appropriate to the operating system you are on. Keep in mind that running TUFLOW with a licence does require that CodeMeter is installed as well and configured to find the licence. And if you are using a GPU on the cloud, you will need to have the appropriate NVIDIA drivers with CUDA installed, and a GPU licence available.
Although you do use the same executable, it may be advantageous to provide some additional command line options to TUFLOW when you run it on the cloud. Since you typically won't be present and looking at the screen, consider using the `-nc` switch, which prevents user interaction on the console. Also, the familiar `-b` option prevents the simulation from waiting for a key press at the end of the simulation. And finally, given the possible cost of running models at scale, you would do well to test your model with the `-t` switch before sending it to the cloud. In addition to command line options, learn about TUFLOW override files, which let you override configuration that may need to be different on the cloud VM, like the location where TUFLOW should write results.
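To illustrate, a sketch combining these switches (the executable path and control file name are placeholders):

```bat
:: Sketch: test locally first, then run unattended on the cloud VM.
set TUFLOW_EXE=C:\TUFLOW\Releases\2023-03-AE\TUFLOW_iSP_w64.exe

:: -t runs TUFLOW in test mode to check the model without a full simulation.
"%TUFLOW_EXE%" -t -b My_Model_001.tcf

:: -nc suppresses console interaction; -b avoids the final key press,
:: so the run completes without anyone at the screen.
"%TUFLOW_EXE%" -nc -b My_Model_001.tcf
```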

What steps do I need to take to run my model on the cloud?

In no particular order:

  • Assuming you have chosen a cloud provider, make sure you understand the answers to the previous questions. If some of this is too technical, go over it with staff who have appropriate IT skills and administrative access.
  • With regard to the model itself, ensure that it has no references to files on computers that wouldn't be accessible from the cloud VM running the model. Ideally, construct your model configuration so that it is self-contained within a single folder and runs wherever you put it (see the sketch after this list).
  • Ensure you have sufficient TUFLOW licences available and accessible to your cloud VMs to run the number of simulations you plan to run in parallel on the cloud.
  • Ensure you have sufficient quota for the storage and cloud resources you need to run the number of simulations you plan to run, specifically when using the 'Batch' services mentioned under Q1.
  • Ensure you have the right level of access to make use of the cloud resources you need, and that you're able to use and manage them when you do.
  • Ensure that what you're planning on the cloud complies with your company and client's security policies for the work. Think about where the cloud computers are, how data is transferred to and from the cloud, and who has access.
  • If you can, pick a region that puts the compute and storage relatively close to your own location, ensuring that your access (or perhaps your clients' access) to them over the internet can achieve good total network speeds.
  • Test your model before putting it on the cloud, and test your preferred method of running a model on the cloud before scaling it up.
  • Make sure your model configuration matches your actual needs before sending it to the cloud. Consider the frequency of writing outputs, whether you need check files, etc.
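To illustrate the self-contained folder idea above, TUFLOW control files can reference inputs relative to the control file's location rather than by absolute paths, so the whole model folder can be moved to the cloud VM intact. The file names below are hypothetical:

```
! Sketch of relative references in a .tcf - file names are placeholders.
! Paths resolve relative to this control file, keeping the model portable.
Geometry Control File == ..\model\My_Model.tgc
BC Control File == ..\model\My_Model.tbc
BC Database == ..\bc_dbase\bc_dbase.csv
Output Folder == ..\results\
```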

When in doubt, feel free to contact TUFLOW Support and TUFLOW Sales with questions, but keep in mind that we can only offer limited guidance when it comes to the specifics of your chosen cloud provider, and that your company's IT policies may further limit your options.

How can I download the simulation results?

This depends on your chosen solution. If you have cloud VMs that have access to your company's internal network, you may be able to copy the results automatically (with a script or batch file) after a simulation completes, and no download would be needed. If you have cloud VMs that you interactively use remotely, you can use whatever tools you would use from any remote machine, like OneDrive, Dropbox, FTP, SSH, to name but a few. However, all cloud service providers also provide cloud storage, and it may be cheaper and faster to keep unprocessed results in the cloud. Once a run completes, you typically do not want to keep the results on storage that is local to the VM that ran the model (e.g. its C: or D: drive on a Windows computer), unless you plan to use the same VM for post-processing of the results. But you can set up network file shares in the cloud that can be connected to your VM as extra drives or mounts, or you can make use of blob storage like Azure Blob, S3 Buckets, etc. Depending on the cloud service provider, there will be relatively user-friendly tools to access these remotely and download your data later.
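As an example of the blob storage route, a hedged sketch using Microsoft's azcopy tool to pull a results folder down from Azure Blob storage (the storage account, container, path and SAS token are placeholders; other providers offer equivalent tools):

```bat
:: Sketch: download a run's results from Azure Blob storage with azcopy.
:: Storage account, container, path and SAS token are placeholders.
azcopy copy "https://mystorageaccount.blob.core.windows.net/results/Run_001?<SAS-token>" "D:\Downloads\Run_001" --recursive
```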
For particularly massive datasets, some cloud providers also offer services where they put the data on physical media and ship it to you. However, keep in mind that this needs to be reserved well in advance and takes some time to execute after you complete the work, and the service may not be available for the smaller volumes you may need. Finally, at the risk of stating the obvious: perform the download on a good internet connection. Cloud providers charge a small amount per GB downloaded, and in return they offer very good download speeds for your data. But your own internet connection may end up limiting how quickly you get your data to your computer.

What are the benefits of running a simulation in the cloud rather than locally?

Not all benefits apply in all cases, but consider these:

  • You can get access to as many cloud VMs (and GPUs) as you need to run as many simulations in parallel as you need, provided you have sufficient licences and quota with the provider.
  • If you only need compute infrequently, it's there in the cloud when you need it and you only pay for it when you use it.
  • If your workload suddenly increases (which may be a good thing), you can quickly increase the amount of compute with cloud computing, provided you're set up to do so.
  • Most cloud providers offer access to a variety of very capable hardware that may allow you to run larger or longer-running models than you could on your own hardware.
  • If you collaborate with others from various locations (wherever they are in the world), having the results in the cloud may be a real benefit.

However, there are some potential downsides to consider as well:

  • If you make efficient use of hardware you own, the compute is likely cheaper per model run than cloud computing, especially on-demand compute.
  • Although it's not very complicated to set up a VM for cloud runs and get up and running, it may be complicated to do so in a way that satisfies your company's or client's security policies.
  • Similarly, just running some models on an interactively accessible VM may be simple, but developing scripts for automated model running may require time and skills that prevent you from doing so yourself.

Do I need to add in any extra commands in my control files?

If your model is self-contained and could run from its folder on any computer, perhaps not. However, you may want to change where a VM in the cloud tries to write its results, for example. You can achieve that with extra commands in your control files, but also consider the use of TUFLOW override control files, which you can tailor to the cloud VMs you're using without affecting the control files you use for running or testing locally. To keep the costs of storage and transport manageable, as well as saving some run time, configure your model to write only the outputs you need. This includes selecting the right variables to output, at the appropriate time intervals. Have a look at our Output Management Advice webinar (15 minutes) for more tips on that.
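As a sketch of the override mechanism, assuming the `_TUFLOW_Override.tcf` convention described in the TUFLOW manual (a control file placed alongside the TUFLOW executable, applied on top of every model run on that machine), with placeholder paths:

```
! Sketch of a _TUFLOW_Override.tcf on the cloud VM - paths are placeholders.
! Redirect results and logs to the VM's fast local disk; copy them to
! durable cloud storage after the run completes.
Output Folder == D:\TUFLOW\Results\
Log Folder == D:\TUFLOW\Logs\
```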
Also look at the command line switches mentioned in the answer to Q2.

Do I need a different licence to run models on the cloud?

Not necessarily, but there are some things to keep in mind. If your existing licences are on a dongle, they would need to be network licences and the server they are installed on would have to be accessible over the network from the cloud VMs you're looking to run models on. If you have sufficient existing network licences you can use in this manner, including licences for special hardware you'd be using on the cloud (like a GPU), you will not need different licences.
You can also set up a dedicated VM to run a small CodeMeter network licence server in the cloud for software network licences. But keep in mind that licences on such a server cannot be moved elsewhere - they are bound to this specific VM. Access to this licence server would be limited to VMs in the cloud, on the same virtual network as the licence server. Or you'd need to have someone with the appropriate IT skills make the licence server accessible from all locations you need access from.
Alternatively, you may be able to make use of web licences; please contact TUFLOW Sales for more information on that.

What can go wrong when running models on the cloud?

For starters, almost everything that can go wrong when running models locally can also go wrong on the cloud, although power failures and losses of network connection are exceedingly rare there.
Common problems arise from the differences in the computer's environment: software you may have installed that batch files rely on, software required to run TUFLOW (CodeMeter, NVIDIA drivers for GPU), access to networked resources you get inputs from, or write results to, etc.
Also, if you're using Batch services from your cloud provider, once a VM completes its tasks, it may disappear. If something went wrong during the run, you may have very limited access to information about what went wrong, so you want to be careful about logging and where logs are written to.
Similarly, but much simpler: if you run models interactively on a desktop VM, once you turn it off, you will no longer have access to its local storage. And once you remove the VM to save on cost, keep in mind that its attached disk storage will be removed as well, so ensure you have your results in a safe place before that.
Finally, access to licences using CodeMeter from the cloud VM can sometimes cause complications, as can user access to the VM or the data, depending on your IT setup. None of these should stop you from trying, but ensure everything works as you expect before scaling up to many model runs at once.
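Given that an ephemeral VM can disappear along with its logs, a hedged sketch of preserving TUFLOW's log files regardless of the run outcome (executable path, folders and share names are placeholders):

```bat
:: Sketch: copy TUFLOW log files (.tlf) off the VM even if the run failed.
:: Executable path, folders and share are placeholders.
set TUFLOW_EXE=C:\TUFLOW\Releases\2023-03-AE\TUFLOW_iSP_w64.exe
"%TUFLOW_EXE%" -nc -b My_Model_001.tcf
robocopy "D:\TUFLOW\Logs" "\\cloudshare\logs\Run_001" *.tlf /E
```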

If I stop the cloud VM after models are finished, can I still download the results?

If the results were written to local storage on the VM (like the default C: or D: drive on a Windows VM), you will only be able to access these when the cloud VM is running. If you stopped it, you could restart it to gain access again. Once you delete the VM, data on those volumes will be deleted as well, and cannot be recovered.
To be able to access results in the cloud even when a VM is stopped, or deleted, copy the results to a network share on the cloud. On the VM, you may be able to mount this storage as a network share, or tools will be available to perform a copy to cloud storage, depending on the cloud provider and operating system you are using.
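For instance, a hedged sketch of mounting an Azure file share as a drive on a Windows VM, so results can be copied straight to durable storage (the storage account, share name and access key are placeholders; other providers offer equivalents):

```bat
:: Sketch: mount an Azure file share as drive Z: on a Windows VM.
:: Storage account, share name and access key are placeholders.
net use Z: \\mystorageaccount.file.core.windows.net\results /user:AZURE\mystorageaccount <storage-account-key>

:: Results copied to Z: survive the VM being stopped or deleted.
robocopy "D:\TUFLOW\Results\Run_001" "Z:\Run_001" /E
```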

Why is my run on the cloud slower than I expected based on the specs?

Although cloud hardware may be faster for some use cases (and certainly a lot more expensive to purchase), it is not guaranteed to run your TUFLOW model faster. Performance mostly depends on how modern the NVIDIA hardware architecture is, how many CUDA cores it has available, and specific metrics of the hardware like the amount of memory, the clock speed of the memory, the clock speed of the cores, and how the GPU is connected to the rest of the hardware. For a good assessment of whether you should expect better performance, refer to our Hardware Benchmarking pages.
If you're wondering why TUFLOW software doesn't benefit from these supposedly faster and more expensive GPUs, consider that a GPU has many different features, and TUFLOW only makes use of an important subset of these. Also, most TUFLOW models are executed using the single-precision floating-point executable, which is faster than the double-precision executable. Desktop GPUs are highly optimised for single-precision compute, because this is what benefits gaming and, as it happens, TUFLOW runs. Data centre GPUs are more optimised for double-precision compute, but most TUFLOW simulations don't benefit in result quality from using it.
Even when the hardware should be faster according to benchmarks, it's possible that you have some other restrictions. For one, if your cloud environment shares GPUs between many users, the part of the GPU available to your model run may only see a small percentage of the performance it would show with exclusive access to the GPU. This is particularly true in Virtual Desktop Infrastructure (VDI) setups. The way TUFLOW uses the GPU is very different from normal graphics processing, and VDI solutions are often not good for model running.
Another common cause of slow runs is writing results directly to network shares accessed over connections that are orders of magnitude slower than local disk access. In these situations, the recommendation is to write results locally (with minimal overhead) on the cloud VM and then copy the results to other storage in one go when the run completes. Even if you perform this copy while another run starts, you'll find that running first and copying after is a lot faster than writing directly to the network share. To understand why, imagine writing and sending an email one word at a time versus writing it all and sending it in one go. The amount of typing is roughly the same, but the word-at-a-time process takes far longer, with far more data going back and forth over the network. Writing results to the network one small part at a time, instead of all at once, is analogous.
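A hedged sketch of that run-first-copy-after pattern, with placeholder paths:

```bat
:: Sketch: write results to the VM's local disk, then copy in one pass.
set TUFLOW_EXE=C:\TUFLOW\Releases\2023-03-AE\TUFLOW_iSP_w64.exe
"%TUFLOW_EXE%" -nc -b My_Model_001.tcf

:: /E copies subfolders; /MT:16 uses multiple threads for a faster transfer.
robocopy "D:\TUFLOW\Results\Run_001" "\\cloudshare\results\Run_001" /E /MT:16
```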

How can I lower the cost of running simulations on the cloud?

The first step would be to select the hardware that's best suited to your needs, at the lowest price, from the most affordable provider.
Secondly, if you get cloud hardware on-demand, you're paying the highest rates for the flexibility this affords. You can also reserve instances of specific hardware types for periods like one or three years (depending on the cloud provider), dramatically lowering the price - but you will then have to pay for the reserved instances for the entire period. If your organisation is large enough, it can be worthwhile to have access to a pool of reserved resources, as long as the business achieves high utilisation over time, so that you only pay on-demand prices when you exceed your reserved instances.
If you do end up using on-demand hardware, ensure you only run it when you're actually using it. By automatically turning off VMs when the work is done and copied to appropriate storage, you can save on compute costs - you're not paying for how much power they use, you're paying for the hours they're on. And keep data on cheap storage like blob storage or online file shares, where you pay only for the size you're using, instead of keeping expensive VMs around that have massive virtual hard drives that you're paying for as long as they exist, empty or not.
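As a sketch of automating that, assuming Azure and its `az` CLI (other providers offer equivalents; resource names are placeholders), a run script can deallocate the VM once the results are safely copied off:

```bat
:: Sketch: copy results off the VM, then deallocate it so billing stops.
:: Requires the Azure CLI and a signed-in identity with permission to
:: deallocate the VM; resource group and VM name are placeholders.
robocopy "D:\TUFLOW\Results" "\\cloudshare\results" /E
az vm deallocate --resource-group my-resource-group --name my-tuflow-vm
```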
Don't download data repeatedly, especially if you need access to it frequently. If you only need access to a small part of the data, it may be worth it to do so remotely. But if you need to process entire files, or multiple users need a copy, it will be more economical to download the data to your network once and use it from there.
You may have heard about 'spot pricing' for VMs. This may be suitable if you're running many small simulations in sequence and you're not under strict time pressure to deliver results, but in many cases it won't be ideal, especially if your model is not set up with restart files that get stored away from the VM. The discounts on VMs obtained through spot pricing can be substantial, but we find that the price difference rarely outweighs the added complexity.
If you find that the number of licences you need to scale up model running on the cloud is the main limiting factor for cost, contact TUFLOW Sales to discuss options for your situation.
Finally, read through these questions, and take the advice given to heart. Optimising your model configuration and making the right choices when running on the cloud can save a lot of run time, and thus cost.

Is there a developed service to run large numbers of model runs on the cloud, if we cannot set it up ourselves?

As of 2019, TUFLOW offer an on-demand cloud simulation service that may suit your needs if your project is sufficiently urgent or large. As of 2023, you may find third parties providing services on the cloud as well, and TUFLOW may support use of its software in such services.

Which machine size / hardware type do you recommend for my model runs?

Hardware selection is very specific to the modelling requirements of each organisation and project. There is no one-size-fits-all recommendation to make.
However, some comments that generally apply:

  • As with physical hardware, top speed comes at a premium. If you compare model run times between different VM sizes, you may find that running on the slower machines works out cheaper for a certain amount of work than using the faster ones. Of course, you will also have to consider project lead time and the time your licences are occupied.
  • For most cloud providers, the number of vCPU cores scales together with the type and number of available GPUs. And together with vCPUs, the amount of available RAM and storage scales up as well. As a result, you may end up with a lot of unused resources on some machine types.
  • If you're considering purchasing cloud infrastructure for permanent use, keep in mind that Virtual Desktop Infrastructure solutions often share resources like GPUs between many users. You may find that a specific type of hardware works really well in a test setup, where you're the only user on it, but performs really poorly when under load from many users. If you purchase access to a cloud VM directly, you will have it all to yourself, but additional infrastructure on top of the VMs may affect your performance greatly.
  • Conversely, when selecting a VM type that provides access to only a fraction of a physical data centre machine, you don't have to worry about a negative performance impact. This type of sharing (which your IT can also achieve in your own data centre with NVIDIA MIG) still ensures that you always get full access to a dedicated part of the GPU, and performance should be as expected.

Technical Terms Glossary

A brief explanation of some of the technical terms used in this FAQ, and relevant to cloud model running:

  • Azure Batch: A cloud computing service provided by Microsoft Azure for running large-scale parallel and batch compute jobs.
  • Azure Blob: Microsoft Azure's object storage solution for the cloud.
  • AWS Batch: Amazon Web Services' batch computing service that enables the processing of a large number of batch jobs.
  • Blob Storage (Binary Large Object Storage): A storage service for large amounts of unstructured data.
  • CodeMeter: A software technology developed by WIBU, used by TUFLOW for software licensing and protection.
  • CUDA: A parallel computing platform and application programming interface created by NVIDIA.
  • GPU (Graphics Processing Unit): A specialized processor designed to accelerate graphics rendering.
  • Google Cloud Batch: A batch computing service provided by Google Cloud, similar in functionality to Azure Batch and AWS Batch.
  • Network File Shares: Storage locations on a network that multiple users can access to store and retrieve files, as on a regular file system.
  • NVIDIA MIG (Multi-Instance GPU): A technology that provides hardware partitioning of NVIDIA GPUs.
  • On-Demand Compute: A cloud computing service model where computing resources are made available immediately to the user as needed.
  • Remote Desktop: A program or feature that allows a user to connect to a computer in another location, see that computer's desktop, and interact with it as if it were local.
  • S3 Buckets: Amazon Web Services' scalable storage buckets in the cloud.
  • Spot Pricing: A pricing model in cloud computing where available compute capacity can be purchased at potentially lower costs compared to on-demand rates.
  • SSH (Secure Shell): A cryptographic network protocol for operating network services securely over an unsecured network.
  • Virtual Desktop Infrastructure (VDI): Technology that hosts a desktop operating system on a centralized server in a data center.
  • VM (Virtual Machine): A software emulation of a physical computer that runs an operating system and applications just like a physical machine.
  • VNC (Virtual Network Computing): A graphical desktop-sharing system that uses the Remote Frame Buffer protocol to remotely control another computer.
  • X-Server: A program that manages and displays graphical user interfaces in a Unix or Linux environment.