User-to-cloud network path monitoring is an essential part of network performance management in the cloud age. It drives a large part of your users’ digital experience. Many people think “because it is in the cloud, it is all taken care of”, but that is not always the case, and there is a lot you can do to optimize it… provided you have the right visibility.
Cloud: It is all taken care of, isn’t it?
One of the main reasons for moving workloads to the cloud is to outsource your datacenter infrastructure and get a better one in return. The cloud is more secure, more stable, more scalable, more automated and more flexible (see this article for more). Basically, it is all taken care of, and done in a much better way than you could ever do it yourself. It also takes away a lot of things you had to manage until then (datacenter operations, systems, backup, virtualization, capacity planning, connectivity…).
Among other things, you rely on your cloud service provider(s) to provide the right connectivity on public networks for your users to access your applications through the Internet. In principle, they monitor network traffic and network performance as part of that service.
Most people conclude that they do not need to monitor connectivity from their users to the cloud. They assume their cloud provider does it in accordance with best practices.
Network performance on the user to cloud path: what can go wrong?
Your cloud provider obviously monitors its internet reachability. But this monitoring does not reflect everything that can still happen to the connectivity between your users and your applications.
Network incidents on public and cloud networks
Network outages can impact the availability or performance of your cloud providers. These incidents can occur at different levels: ISPs, peering / transit points, tier one networks or the cloud gateways themselves. They can also affect essential services like DNS.
They may also be the consequence of attacks like BGP hijacks and DDoS targeting the infrastructure of your CSP.
Depending on the cause and the part of the infrastructure which is impacted, the scope of the degradation may range from certain users / geographies to all of them, and from a few services to all of them.
An example of this would be the incident impacting IBM Cloud in June 2020; the network outage was caused by a failure at one of IBM Cloud’s 3rd party network providers (details here).
Unpredictable changes at multiple layers
The path from your users to the cloud is influenced by a series of network infrastructures whose behaviors can be unpredictable:
- The users’ own behavior, their local network connectivity and the security gateways they use
- How the users’ operators route the traffic to your platform based on their peering / transit arrangements
- The BGP policy influencing the AS path in both directions (if you are interested in this specific topic, we recommend that you take a look at this article)
- All the congestion and degradation taking place on the path (driven either by traffic or by device malfunctions)
- The destination to which DNS servers will point your users (your hostname may be resolved to different IP addresses based on your DNS setup)
- The behavior inside your cloud infrastructure at the network level, but also how the load is managed on the server side
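The DNS point in the list above can be checked with very little code. The following is a minimal sketch, not a monitoring tool: it records the set of IP addresses a hostname resolves to, so that two checks (from different locations or at different times) can be compared. The function names and the `resolver` parameter are illustrative assumptions; only `socket.getaddrinfo` comes from the Python standard library.

```python
import socket


def resolved_addresses(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted set of IP addresses a hostname resolves to.

    `resolver` defaults to the standard library's socket.getaddrinfo; it is a
    parameter only so the function can be exercised without network access.
    """
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def destination_changed(previous, current):
    """True when the resolved address set differs between two checks."""
    return set(previous) != set(current)
```

Running `resolved_addresses` for your own hostname from several vantage points and comparing the results with `destination_changed` gives a first indication of whether users in different places are being sent to different destinations.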
Uneven cloud coverage
Depending on where each of your users is located, your CDN (Content Delivery Network) and cloud platform may or may not be well located.
If you are using one of the leading cloud service providers (say GCP, Azure, AWS or Alibaba), you have the choice of spreading your compute capacity across multiple regions to put your front end (and possibly some of the back end compute) closer to your users. This is a good starting point, but not all CSPs are created equal, and that applies to geographical coverage too. The same thinking applies to CDN providers.
Some areas are poorly covered by global providers (Africa and LATAM being the most obvious examples).
Prices for the same services can also vary from one region to another. This makes a massive difference to the actual coverage, because it can make certain zones prohibitively expensive.
Let’s assume that you are running a multi cloud infrastructure. You may reduce the gap by leveraging each provider’s specifics to your advantage.
Nevertheless, for users located in certain regions, your CSP will remain far away from a latency standpoint.
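To get a feel for what "far away from a latency standpoint" means, you can compute a physical lower bound on the round-trip time between a user and a cloud region. The sketch below is a back-of-the-envelope estimate under stated assumptions: it uses the great-circle distance (haversine formula) and a signal speed in optical fiber of roughly 200,000 km/s, about two thirds of the speed of light in vacuum. Real routes are longer than the great circle and add queuing and processing delays, so measured latency will be higher than this floor.

```python
import math

EARTH_RADIUS_KM = 6371.0        # mean Earth radius
FIBER_SPEED_KM_PER_S = 200_000  # approximation: about 2/3 of c in vacuum


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km):
    """Lower bound on round-trip time over fiber, in milliseconds."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000
```

For example, a user 5,000 km from the nearest region cannot see less than about 50 ms of round-trip latency, whatever the quality of the networks in between.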
What should you do to optimize your user to cloud network path?
What are the key steps to optimize the path from your users to your cloud platform?
- Understand where your users are and, beyond the rough numbers, which regions / countries generate the largest part of your revenue
- List all the hosts and services used to deliver your digital service (including your CDN and 3rd party services) and make sure you have a clear view of how they are hosted
- Identify the user to cloud connectivity gaps which are structural for your strategic regions
- Keep track of the events / incidents which impact user performance day to day, and distinguish quickly between those on which you and your providers can take action and those on which no action can be taken
Getting the right observability: network path monitoring
The very first thing you need is observability on the user to cloud path and the network performance attached to it.
You need to:
- Know where your users are located and what experience they get.
- Understand which parts of your app are close or far (meaning low or high latency) for the key ISPs (Internet Service Providers) offering connectivity to your users.
- Monitor this 365 days a year and locate where degradations and changes on the route to your cloud platform are coming from.
Here is an example of a strong degradation on a route. Packet loss, latency and the number of hops from users to the application platform spiked for a period of time. It is good to know whether such changes are one-off events or whether they persist.
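The distinction between a one-off event and a lasting degradation can be sketched in a few lines. The example below is only an illustration under assumptions of our own: the baseline is the median of the first few latency samples, a sample counts as degraded when it exceeds twice that baseline, and a degradation is called sustained when at least three consecutive samples are degraded. The function names and thresholds are hypothetical and should be tuned to your own measurements.

```python
from statistics import median


def degraded_flags(latencies_ms, baseline_window=5, factor=2.0):
    """Flag samples whose latency exceeds `factor` times the baseline median."""
    baseline = median(latencies_ms[:baseline_window])
    return [value > factor * baseline for value in latencies_ms]


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def classify(latencies_ms, sustained_after=3, **kwargs):
    """Return 'normal', 'one-off' or 'sustained' for a latency series."""
    run = longest_run(degraded_flags(latencies_ms, **kwargs))
    if run == 0:
        return "normal"
    return "sustained" if run >= sustained_after else "one-off"
```

The same approach can be applied to packet loss or hop count series by changing the threshold rule.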
Understand the root cause for network performance to the cloud
Once you have identified the event, you have to determine whether its cause is within the scope of what you control directly or indirectly:
- Did the cloud destination (host) change location?
- Did the route change?
- Is there congestion on the way? Where is it located?
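The three questions above can be answered by comparing two snapshots of the same path, one taken before the event and one during it. The following is a minimal sketch assuming you already collect traceroute-like data; the `Hop` and `PathSnapshot` types, the `compare_paths` function and the 30 ms threshold are all hypothetical. Per-hop round-trip times are noisy in practice, so a single large jump should be treated as a hint about where to look, not as proof of congestion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    address: str
    rtt_ms: float


@dataclass(frozen=True)
class PathSnapshot:
    destination_ip: str
    hops: tuple  # Hop entries ordered from the user to the destination


def compare_paths(before, after, jump_ms=30.0):
    """Compare two snapshots and report what changed between them."""
    findings = {
        "destination_changed": before.destination_ip != after.destination_ip,
        "route_changed": [h.address for h in before.hops]
        != [h.address for h in after.hops],
        "latency_jump_at": None,
    }
    # Look for the first hop in the new path where the round-trip time
    # increases by at least `jump_ms` compared with the previous hop.
    previous_rtt = 0.0
    for hop in after.hops:
        if hop.rtt_ms - previous_rtt >= jump_ms:
            findings["latency_jump_at"] = hop.address
            break
        previous_rtt = hop.rtt_ms
    return findings
```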
Comparing cloud performance in multi cloud deployments
If the structure of the path changes or its network performance is not satisfactory, what are your options?
If you run a multi cloud architecture, you can consider:
- switching to another gateway for the region concerned, or
- using another way to access your cloud (e.g. AWS Global Accelerator) or redirecting to another cloud provider.
In the same way, CDN providers offer very different regional coverage. For static content, switching to another CDN provider for the region concerned can be an excellent option.
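Choosing between gateways, cloud providers or CDNs for a given region comes down to comparing measurements taken from that region. As a rough sketch, assuming you have a latency and a packet loss figure per region and per candidate, the selection could look like this. The function name, the data layout and the 1% loss ceiling are illustrative assumptions, not a recommendation.

```python
def best_entry_point(measurements, max_loss_pct=1.0):
    """Pick, for each region, the candidate with the lowest latency.

    `measurements` maps (region, candidate) to (latency_ms, loss_pct).
    Candidates whose packet loss exceeds `max_loss_pct` are ignored.
    """
    best = {}
    for (region, candidate), (latency_ms, loss_pct) in measurements.items():
        if loss_pct > max_loss_pct:
            continue
        if region not in best or latency_ms < best[region][1]:
            best[region] = (candidate, latency_ms)
    return {region: candidate for region, (candidate, _) in best.items()}
```

A region that is missing from the result has no candidate under the loss ceiling, which is itself a useful signal of a structural coverage gap.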
3rd party service providers
Your 3rd party providers also have an underlying infrastructure whose performance will vary on a per region basis.
To make decisions on your architecture, you need hard data on
- the performance from the location of your users to the different elements of your platform
- the route taken and
- the resulting performance (network latency, packet loss, number of hops and stability).
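The metrics listed above can be derived from a simple series of probe results. The sketch below assumes each probe yields a round-trip time in milliseconds (or `None` when the probe was lost) and a hop count; it uses the population standard deviation of the latency as a rough stability indicator. The function name and the output keys are hypothetical, and the input lists are assumed to be non-empty.

```python
from statistics import median, pstdev


def summarize(rtts_ms, hop_counts):
    """Summarize probe results into latency, loss, hops and stability."""
    answered = [rtt for rtt in rtts_ms if rtt is not None]
    return {
        "loss_pct": 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms),
        "median_latency_ms": median(answered) if answered else None,
        # Spread of the latency samples, used here as a stability indicator.
        "jitter_ms": pstdev(answered) if len(answered) > 1 else 0.0,
        "max_hops": max(hop_counts),
        # More than one distinct hop count suggests the route is changing.
        "distinct_hop_counts": len(set(hop_counts)),
    }
```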
To find out more on how to manage performance in public networks, take a look at this article.
We also recommend that you take a look at how to deploy cloud services at global scale with the best possible user experience: article.