NSX outage causing a Nutanix cluster failure
A few weeks back, I started a POC running the Nutanix 3000 series and NSX. The theory was to consolidate VPN, load balancing, storage and compute into a single appliance:
Please forgive the lack of a flat network; that is a political and cultural challenge, as not everyone has grasped the concept yet. Sometimes to go forward we need to go backwards.
Back to NSX. According to the deployment and rules, NSX shouldn’t have touched the local vSwitch, only the port groups associated with the DVswitch. How wrong I was.
While I was testing some firewall rule sets on the logical routers rather than the Edge, NSX managed to block all communication on 192.168.5.0/24. Anyone familiar with Nutanix knows that 192.168.5.0/24 is the network the CVMs communicate over. In non-technical terms, the CVMs provide the clustered storage of a Nutanix stack and present it to the cluster through a vmkernel port. The CVMs also have a management port, which is used for SSH connections.
See the example of the vmk port and the CVM.
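To confirm from an ESXi host shell whether the host can still reach its CVM over this internal network, a check along the following lines may help. This is only a sketch and was not part of the original troubleshooting: the interface name vmk1 and the CVM address 192.168.5.2 are assumed Nutanix defaults and should be verified on your own hosts. With DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: check the host-to-CVM path on the internal 192.168.5.0/24 network.
# ASSUMPTIONS: vmk1 and 192.168.5.2 are typical Nutanix defaults; adjust both.
VMK_IF="${VMK_IF:-vmk1}"
CVM_IP="${CVM_IP:-192.168.5.2}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

check_cvm_path() {
  # Show the IPv4 configuration of every vmkernel interface on the host.
  run esxcli network ip interface ipv4 get
  # Ping the CVM through the chosen vmkernel interface, three packets.
  run vmkping -I "$VMK_IF" -c 3 "$CVM_IP"
}

check_cvm_path
```

If the vmkernel interface shows the expected address but the ping fails, something between the host and the CVM is dropping the traffic.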
The end result of the failure was:
- No storage
- No VMs to manage the network (they are all virtual)
- All communication blocked on 192.168.5.0/24
- The CVMs were the only running machines, reachable via their management ports, as they didn’t use the pooled storage
So what did we do?
We tried to restart the CVM cluster.
Running cluster stop in the CLI failed to stop any processes, as there were none to stop. Running cluster start then failed because none of the prerequisite services had started. All the CVMs were responding on their management IPs. The local vSwitch was rebuilt via the CLI, as Nutanix uses a specific vSwitch name, but this didn’t resolve the issue.
The decision I took was to remove the NSX VIBs installed on each ESX host, as these were somehow blocking traffic on the 192.168.5.0/24 network. The following VIBs are installed on each host that uses NSX: esx-dvfilter-switch-security, esx-vsip and esx-vxlan.
Below are the commands used to remove the NSX-specific VIBs:
# esxcli software vib remove --vibname=esx-dvfilter-switch-security
# esxcli software vib remove --vibname=esx-vsip
# esxcli software vib remove --vibname=esx-vxlan
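With several hosts to clean up, the same three removals can be wrapped in a small loop. The sketch below is an illustration rather than the script used during the incident; the helper names run and remove_nsx_vibs are made up for this example. It defaults to DRY_RUN=1, which only prints the esxcli commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: remove the three NSX VIBs named above, one at a time.
NSX_VIBS="esx-dvfilter-switch-security esx-vsip esx-vxlan"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

remove_nsx_vibs() {
  # NSX_VIBS is deliberately unquoted so the list splits into words.
  for vib in $NSX_VIBS; do
    run esxcli software vib remove --vibname="$vib"
  done
}

remove_nsx_vibs
```

Setting DRY_RUN=0 on the host executes the removals for real; as described here, the hosts were rebooted afterwards.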
Once the VIBs were removed and the hosts rebooted, the 192.168.5.0/24 network started communicating again and the VMs on the management network were manageable, as they were not on a VXLAN.
The root cause of this issue is that the NSX kernel firewall still enforces policy even when the NSX Manager and Controllers are down. To prevent this from happening again, either the rule base can be crafted to allow this traffic, or the CVMs can be added to the exception list in NSX, as the NSX Manager was. The exception list is a list of machines on which the distributed firewall (DFW) does not enforce policy.
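Adding the CVMs to the exception list can be done in the NSX UI; it can also be scripted against the NSX Manager REST API. The sketch below rests on assumptions: the endpoint path /api/2.1/app/excludelist/{memberID} is quoted from memory of the NSX for vSphere API and should be checked against the API guide for your version, and the manager hostname, user name and VM ID vm-101 are hypothetical placeholders. It prints the curl command by default (DRY_RUN=1).

```shell
#!/bin/sh
# Sketch: add a VM to the NSX firewall exception (exclusion) list via REST.
# ASSUMPTIONS: the endpoint path is recalled from the NSX for vSphere API;
# NSX_MGR, NSX_USER and the VM ID below are hypothetical placeholders.
NSX_MGR="${NSX_MGR:-nsxmgr.example.local}"
NSX_USER="${NSX_USER:-admin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = run it

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

exclude_vm() {
  # $1 is the vCenter managed object ID of the VM, for example vm-101.
  # -k skips certificate verification (lab use); -u prompts for the password.
  run curl -k -u "$NSX_USER" -X PUT "https://$NSX_MGR/api/2.1/app/excludelist/$1"
}

exclude_vm vm-101
```

Each CVM would need its own call, using its managed object ID from vCenter.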
In larger environments, NSX should be on its own isolated cluster, ideally using block storage.