I’ve been working a lot with cloud networking lately. I will share some of my discoveries with you! 😄
Let me start with two statements that I have seen made about cloud networking:
Cloud networking is easy! – Not necessarily so. I’ll explain more.
We don’t need networking in cloud! – Wrong. You do, but in basic implementations it’s not visible to you.
This post will be divided into different areas describing the different components in cloud networking. You will see that there are many things in common between Azure and AWS.
"The mobile-first, cloud-first is a very rich canvas for innovation - it is not the device that is mobile, it is the person that is mobile."
Satya Nadella, Microsoft CEO
What makes cloud computing different?
"You don't generate your own electricity. Why generate your own computing?"
Jeff Bezos, Amazon.
Most importantly, the service you use is provided by someone else and managed on your behalf. If you're using Google Documents, you don't have to worry about buying umpteen licenses for word-processing software or keeping them up-to-date. Nor do you have to worry about viruses that might affect your computer or about backing up the files you create. Google does all that for you. One basic principle of cloud computing is that you no longer need to worry how the service you're buying is provided: with Web-based services, you simply concentrate on whatever your job is and leave the problem of providing dependable computing to someone else.
Cloud services are available on-demand and often bought on a "pay-as-you-go" or subscription basis. So you typically buy cloud computing the same way you'd buy electricity, telephone services, or Internet access from a utility company. Sometimes cloud computing is free or paid-for in other ways (Hotmail is subsidized by advertising, for example). Just like electricity, you can buy as much or as little of a cloud computing service as you need from one day to the next. That's great if your needs vary unpredictably: it means you don't have to buy your own gigantic computer system and risk having it sit there doing nothing.
It's public or private
Now we all have PCs on our desks, we're used to having complete control over our computer systems—and complete responsibility for them as well. Cloud computing changes all that. It comes in two basic flavors, public and private, which are the cloud equivalents of the Internet and Intranets. Web-based email and free services like the ones Google provides are the most familiar examples of public clouds.
The world's biggest online retailer, Amazon, became the world's largest provider of public cloud computing in early 2006. When it found it was using only a fraction of its huge, global, computing power, it started renting out its spare capacity over the Net through a new entity called Amazon Web Services (AWS).
Private cloud computing works in much the same way but you access the resources you use through secure network connections, much like an Intranet. Companies such as Amazon also let you use their publicly accessible cloud to make your own secure private cloud, known as a Virtual Private Cloud (VPC), using virtual private network (VPN) connections.
Within a VPC/VNET, there are system routes. If 10.0.0.0/22 is assigned to the VPC/VNET, there will be a system route along the lines of “10.0.0.0/22 local”. Subnets are then deployed in the VPC/VNET and there is full connectivity between them due to the system route. This route points to a virtual router which is the responsibility of AWS/Azure. Normally this router has a “leg” in each subnet, at the first IP address of the subnet, for example 10.0.0.1 for the 10.0.0.0/24 subnet.
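To make the addressing concrete, here is a small sketch using Python’s standard `ipaddress` module. It splits the 10.0.0.0/22 example into /24 subnets and lists the first IP address of each, which is where the virtual router normally sits. The function name and the choice of /24 subnets are just my assumptions for illustration.

```python
import ipaddress


def subnets_with_router_ip(vnet_cidr, new_prefix=24):
    """Return (subnet, first IP) pairs for each subnet carved out of the VPC/VNET."""
    vnet = ipaddress.ip_network(vnet_cidr)
    result = []
    for subnet in vnet.subnets(new_prefix=new_prefix):
        # The virtual router normally takes the first address after the network address.
        router_ip = subnet.network_address + 1
        result.append((str(subnet), str(router_ip)))
    return result


if __name__ == "__main__":
    for subnet, router_ip in subnets_with_router_ip("10.0.0.0/22"):
        print(f"{subnet}: virtual router at {router_ip}")
```

Every one of these subnets falls inside 10.0.0.0/22, which is why the single system route is enough to give full connectivity between them.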
In AWS, system routes cannot be overridden. It doesn’t matter if you try to add a static route or a more specific (longer-prefix) route, the system route always takes precedence.
In Azure, you CAN override system routes. These are called UDR, User Defined Routes. While very useful, it can be confusing, and a little dangerous, that when several routes have the same prefix, Azure prefers a user-defined route first, then a BGP route, and only then the system route.
This means that a BGP route of equal length will be preferred to the system route. I learned this because I was advertising 0.0.0.0/0 over BGP and it made a host lose internet connectivity because it was following the BGP route instead of the system route.
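Here is a rough model of that selection logic, to show why my 0.0.0.0/0 advertisement won. It assumes longest prefix match first and then, for equal prefixes, user-defined before BGP before system routes. This is my simplified sketch, not Azure’s actual implementation, and the `Route` class, the source labels, and the next-hop names are made up for the example.

```python
import ipaddress
from dataclasses import dataclass

# Assumed tie-breaker for equal-length prefixes: lower number wins.
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str
    source: str    # "user", "bgp" or "system"
    next_hop: str  # free-text label, for illustration only


def select_route(routes, destination):
    """Pick the longest matching prefix, then break ties by route source."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen,
                       SOURCE_PRIORITY[r.source]),
    )


if __name__ == "__main__":
    routes = [
        Route("10.0.0.0/22", "system", "local"),
        Route("0.0.0.0/0", "system", "internet"),
        Route("0.0.0.0/0", "bgp", "vpn-gateway"),
    ]
    print(select_route(routes, "8.8.8.8"))   # the BGP default route
    print(select_route(routes, "10.0.1.5"))  # the local system route
```

In this model, traffic to the internet follows the BGP default route rather than the system default route, while traffic inside the VNET still uses the more specific local route.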
In AWS, to provide internet connectivity to a subnet, an Internet Gateway (IGW) must be attached to the VPC, or a NAT Gateway must be deployed, and the route table associated with the subnet must have a route towards the IGW or NAT Gateway. A subnet with a route straight to the IGW is often referred to as a public subnet.
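As a quick illustration, the check below looks at a route table and reports where the default route points. The route table is a plain list of dictionaries that I made up for the example, not the output format of any AWS tool, and the resource IDs are placeholders. The only thing it relies on is that IGW IDs start with `igw-` and NAT Gateway IDs start with `nat-`, and it only considers the IPv4 default route.

```python
def internet_path(route_table):
    """Return how the subnet reaches the internet, based on its default route."""
    for route in route_table:
        if route["destination"] != "0.0.0.0/0":
            continue
        target = route["target"]
        if target.startswith("igw-"):
            return "internet gateway"
        if target.startswith("nat-"):
            return "nat gateway"
    # No default route towards an IGW or NAT Gateway: no internet connectivity.
    return None


if __name__ == "__main__":
    # Hypothetical route tables with placeholder IDs.
    public_rt = [
        {"destination": "10.0.0.0/22", "target": "local"},
        {"destination": "0.0.0.0/0", "target": "igw-0example"},
    ]
    isolated_rt = [{"destination": "10.0.0.0/22", "target": "local"}]
    print(internet_path(public_rt))
    print(internet_path(isolated_rt))
```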
In Azure, internet connectivity is provided by default. My only guess as to why is that they want to let people easily get started even if they don’t have any networking knowledge. I don’t like this default though. Subnets shouldn’t have internet access unless I decide that they should. This also means that if you don’t want the subnet to have internet access, you have to play stupid tricks by writing ACLs, and by doing so you risk blocking access to Azure services. I don’t think this is a sensible default and at the very least Azure should provide an easy mechanism to remove internet access for a subnet.
As I described above, both in AWS and Azure there is a virtual router that lives in each and every subnet. Because broadcasts aren’t supported, there are some tricks behind the scenes, like answering ARP requests on behalf of other hosts, in order for virtual machines to have connectivity between each other.
One thing I’ve noticed in Azure, which is kind of annoying, is that the virtual router does not reply to ICMP. This makes troubleshooting more difficult as you can’t ping the virtual router or see it as a hop in a traceroute.
The other thing with Azure is that the virtual router inserts itself into EVERY flow. I was very surprised by this as it is not really documented anywhere and certainly not in Azure’s official documentation. The way I discovered this was that I had two devices talking BGP to each other, all the expected routes were there, and still traffic was being dropped somewhere. I did extensive troubleshooting and then I found something weird. I had two devices in the same subnet, let’s say devices A and B, where A was 10.0.0.4 and B was 10.0.0.5. Let’s say that A’s MAC was 0001.aaaa.aaaa and that B’s MAC was 0001.bbbb.bbbb. I noticed that A’s ARP entry for B did not have the MAC address 0001.bbbb.bbbb. This was confusing as they are in the same subnet. Instead I saw a MAC address of 0123.4567.89ab. What was this MAC address? It turns out that the Azure virtual router acts as a “man in the middle” and replies to ARP requests with its own MAC.
All traffic is then relayed through the virtual router. Even though I had two devices on the same subnet, they were not directly communicating with each other. This broke my forwarding of traffic because my traffic was now not only between these two devices, the traffic was hitting Azure’s routing tables as well. I hadn’t put any routes into those because I shouldn’t be needing them. I then had to update Azure’s routing tables with static routes in order to get traffic through. I don’t really like how the virtual router inserts itself into every flow.
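If you want to spot this behaviour quickly, one way is to look for several IP addresses in the ARP table that resolve to the same MAC address. The sketch below does that on a plain dictionary of IP to MAC. How you collect the ARP table from your device is up to you, and the addresses are the example values from above plus a few I made up.

```python
from collections import defaultdict


def shared_macs(arp_table):
    """Return MAC addresses that more than one IP address resolves to."""
    by_mac = defaultdict(list)
    for ip, mac in arp_table.items():
        by_mac[mac.lower()].append(ip)
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}


if __name__ == "__main__":
    # Example ARP table as seen from device A (10.0.0.4); values are illustrative.
    arp_table = {
        "10.0.0.1": "0123.4567.89ab",
        "10.0.0.5": "0123.4567.89ab",
        "10.0.0.6": "0123.4567.89ab",
    }
    print(shared_macs(arp_table))
```

If every neighbour in the subnet shows up behind the same MAC, something is answering ARP on their behalf and your traffic is not going directly host to host.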
Both AWS and Azure offer VPN gateways, whether they be called Virtual Network Gateway (Azure) or Virtual Private Gateway (AWS). AWS now also has the Transit Gateway. The AWS VGW/TGW is fairly straightforward. It’s fast to create, it supports BGP, and it’s easy to configure. The only caveat was that it didn’t support IKEv2 until recently, but it does now. Make sure to enable route propagation to get your routes from the VGW to your route table. Keep in mind that it may not be possible to change some settings for the VGW after it has been created.
Azure though is the exact opposite. It does support IKEv2 and has done for a long time but that’s pretty much the only positive thing I can say about it. It takes a LONG time to create the VNG, often 30 minutes or so. Why, Azure, why?! When you create the VNG, you need to point out which VNET it belongs to. This can NOT be changed at a later stage. You also can’t change from a policy-based VNG to a dynamic VNG. You also have to select a SKU to size your VNG. Excuse me, I thought this was cloud?!
The worst part of Azure’s VNG though, is that it’s very poorly documented how you do active/active BGP. I basically had to reverse engineer this to figure it out. First off, the VNG is deployed into something that is called a GatewaySubnet.
This is a subnet that you need so that the VNG can get an IP address that you will later peer to. This means that you don’t get a shared subnet between Azure’s VNG and your VPN device, something like 169.254.0.0/30, which is the case in AWS. Instead you have to run multi-hop eBGP to Azure but not only that, you also need to peer from a loopback because you can’t configure Azure to have two BGP peers on your side. This also means that you need to come up with a link network yourself, and then put a static route to Azure’s BGP peer IP over that link network. This was not documented anywhere to be found.
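For reference, this is what a shared /30 link network like the 169.254.0.0/30 example gives you: exactly two usable addresses, one per side of the tunnel. The sketch uses Python’s standard `ipaddress` module, and the function name is my own.

```python
import ipaddress


def link_network_endpoints(cidr):
    """Return the usable host addresses of a point-to-point link network."""
    network = ipaddress.ip_network(cidr)
    return [str(host) for host in network.hosts()]


if __name__ == "__main__":
    link = "169.254.0.0/30"
    print(link_network_endpoints(link))
    # 169.254.0.0/16 is the link-local range, so this check is True.
    print(ipaddress.ip_network(link).is_link_local)
```

When you have to come up with the link network yourself, a check like this is a cheap way to confirm that each side has an address in the same small subnet.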
Azure’s VNG is a job poorly done and I can’t imagine why they’ve designed it this way. It feels like someone lacking enough networking knowledge designed the VNG. I know this is probably not the case but that’s just what it feels like.
Also keep in mind that you can’t start out with a VNG that is static and then move to dynamic. Considering how long it takes to create and delete VNGs, a change like that means you’ll likely lose connectivity to your on-premises environment for an hour or more.
I’ve tried to cover as many things as I could but I’m sure I’ve missed something. Networking in the cloud is different and not always as cloud-like as you would expect, especially in the case of Azure. While they do support you overriding system routes, that’s pretty much the only advantage they have over AWS. It’s quite obvious that they are a couple of years behind AWS feature-wise. I can only hope that they seriously reconsider the way they deploy and configure the VNG.
I hope you find some good information in this post as it’s quite difficult to find the proper documentation, especially for Azure. Thanks for reading!
Lower upfront costs and reduced infrastructure costs.
Easy to grow your applications.
Only pay for what you use.
Everything managed under SLAs.
Scale up or down at short notice.
Overall environmental benefit (lower carbon emissions) of many users efficiently sharing large systems. (But see the box below.)