vCloud Director Cluster got you down?
But there are some trials you must encounter and complete before things work smoothly.
TL;DR: I failed at first, then found the problem: the host firewall was blocking inter-cell communication. This post is about the log entry and how it doesn’t make clear that there is an error to resolve.
My final trial is over after many days (well, about 8 business days, though it did not take all 8 to find this problem).
vCloud Director requires a Red Hat based system, like RHEL 5 or CentOS 5.5, and both of these Linux distributions include a host-based firewall built on iptables. This turned out to be relevant to my final challenge, though that was not immediately clear.
Let’s start with the error message I was receiving:
2011-04-27 16:24:08,895 | INFO | ActiveMQ Transport: tcp:///192.168.1.1:61616 | CellDiscoveryAgent | REMOVED Failed Cell in Broker Network. Cell UUID: 859fc787-1111-1111-1111-d2ab88a58088, Broker URI: tcp://192.168.1.1:61616 |
Notice that the severity column (the second column, with the pipe symbol as delimiter) says INFO. That suggests this is not a critical error by any means, and things were in fact operating just fine, though some actions had high latency (thanks Mr. Boche for the laughs at vBeers and latency).
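Since the severity column can't be trusted for this message, one option is to search the log for the message text itself. Here is a minimal sketch, assuming every line follows the same five-column, pipe-delimited layout as the sample above. The names `LogEntry`, `parse_line`, and `find_failed_cells` are my own and not part of vCloud Director.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    severity: str
    thread: str
    component: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one pipe-delimited log line into its five columns."""
    parts = line.split("|", 4)
    if len(parts) < 5:
        return None  # not in the assumed five-column format
    timestamp, severity, thread, component, message = (p.strip() for p in parts)
    # The sample line ends with a trailing pipe; drop it from the message.
    message = message.rstrip("|").strip()
    return LogEntry(timestamp, severity, thread, component, message)


def find_failed_cells(lines: Iterable[str]) -> List[LogEntry]:
    """Return entries that report a removed cell, whatever their severity."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and "REMOVED Failed Cell" in entry.message:
            hits.append(entry)
    return hits
```

Feed it the lines of whichever cell log the message shows up in. Any hit deserves a look, even though the log calls it INFO.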
Searching for this error via Google found me nothing.
Searching the VMware community discussion boards also failed me (though I found lots of references to DNS, DNS, and more DNS, i.e. forward and reverse DNS MUST be usable; edit your hosts files if you are doing NAT). BTW: make sure your DNS works and that both forward and reverse resolution return correct values. This goes not only for vCloud Director but for all of the VMware pieces required to make a successful VMware cluster.
Oh, and in case I forgot, make sure your DNS works.
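If you want to check the forward and reverse part quickly, here is a rough sketch using Python's standard `socket` module. The function name `dns_round_trip_ok` is mine, and the resolver arguments exist only so the check can be exercised without a real DNS server. It assumes you pass the fully qualified name that the reverse lookup is expected to return.

```python
import socket
from typing import Callable, Sequence, Tuple

# Shape of socket.gethostbyaddr's result: (hostname, aliases, ip_addresses)
ReverseResult = Tuple[str, Sequence[str], Sequence[str]]


def dns_round_trip_ok(
    hostname: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], ReverseResult] = socket.gethostbyaddr,
) -> bool:
    """True if hostname resolves to an IP whose reverse lookup names it again."""
    try:
        ip = forward(hostname)
        primary, aliases, _ = reverse(ip)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False
    names = {primary.lower(), *(alias.lower() for alias in aliases)}
    return hostname.lower() in names
```

Run it once per host in the cluster, e.g. `dns_round_trip_ok("vcd-cell1.example.com")` (a made-up hostname), and treat any `False` as something to fix before going further.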
Now back to that iptables firewall issue I mentioned earlier…(and I’ll stop mentioning DNS)
Though I had left the firewall on and opened up the ports required for things to work (HTTP, HTTPS, and SSH so I could manage things), port 61616 was never opened. (The top of page 13 of vcd_10_install.pdf lists 61611 and 61616 as needing to be open between vCloud Director hosts.)
I missed that part of the documentation (my bad, I was moving too quickly and missed this vital piece of information), but the kicker in all of this?
Things worked just fine (for some values of fine), but there were times when there was high latency between the tasks. Powering on a vApp (or single VM) would sometimes be fast with all the tasks being delivered and completed quickly (call it 15 seconds or less). Other times … each task would be delivered and completed quickly, but the next task in the queue would take between 2-5 minutes to be sent to vCenter.
For a vApp with external and internal networks, this could mean that you are waiting upwards of 20 minutes to power on a VM, or to even shut it down so you can change some piece of the VM configuration. (more on this below)
Now, besides my failure to fully read the documentation dealing with ports 61611 and 61616, this message should have a severity of at least WARNING, potentially ERROR, because when the cells cannot communicate, it seems that the message queueing breaks down in a way that creates this high task-to-task latency (or, in rare cases, the eventual failure to complete the action).
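Given how quiet the log is about it, a more direct test is to try opening a TCP connection from one cell to the other on the two ports the install guide lists. This is only a sketch with names of my own choosing (`CELL_PORTS`, `port_reachable`, `check_cell`). A `True` only tells you something accepted the connection, and a `False` can mean either a firewall in the way or nothing listening on that port.

```python
import socket
from typing import Dict, Iterable

# Ports the install guide lists as required between vCloud Director hosts.
CELL_PORTS = (61611, 61616)


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a plain TCP connect; refused or timed-out attempts count as unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_cell(host: str, ports: Iterable[int] = CELL_PORTS) -> Dict[int, bool]:
    """Map each port to whether a connection to the given cell succeeded."""
    return {port: port_reachable(host, port) for port in ports}
```

Run from one cell against the other, for example `check_cell("vcd-cell2.example.com")` (hypothetical hostname). In my situation I would expect this to have shown 61616 as unreachable long before I went digging through logs.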
Here’s to hoping others can use the above information to help them pass this trial and enjoy their vCloud Director experience.
Now, about those tasks…
Within your vApp you have virtual machines (or VMs). When you power on your vApp (or the VM itself), vCloud Director (vCD for short) sends off a task to reconfigure the VM (attach a network interface), then a task to reconfigure the VM again (attach a second network interface), then a task to initiate a power-on event, etc.
The first task completes in the same second (according to vCenter), then you sit around waiting, and during this time you have no idea what is going on.
If you are logged into vCenter, you’ll see the first task get delivered, acted upon, and completed. Then nothing, for long periods.