Upgrading a RabbitMQ Cluster

I had a RabbitMQ cluster running 3.0.0, and bug #25556, “allow multiple URIs to be specified against an upstream”, was killing me. I had to federate some exchanges to another cluster, but I couldn’t specify a set of upstream hosts, only a single one. This was quite annoying! I would get results like this:

$ rabbitmqctl set_parameter federation-upstream my_upstream_name '{ "uri": [ "amqp://user:[email protected]:5672/%2f", "amqp://user:[email protected]:5672/%2f" ], "max-hops": 2 }'

Error: Validation failed

uri should be binary, actually was [<<"amqp://user:[email protected]:5672/%2f">>,
<<"amqp://user:[email protected]:5672/%2f">>]

This was driving me crazy. The documentation specifically states that you can specify a set of upstream URIs:

To connect to an upstream cluster, you can specify multiple URIs in a single upstream. The federation link process will choose one of these URIs at random each time it attempts to connect.

I tried all kinds of formats to see if there was some parsing bug. Eventually I started slogging through changelogs and found the bug causing the issue. (Hey RabbitMQ people, how about a public bugtracker?) The only fix was to upgrade the cluster.

When upgrading your cluster, the mnesia database is supposed to automagically update itself, too. I’ve experienced problems with this in the past, and heard the same from friends, where the internal state gets a bit funky, and the only way to avoid this is a fresh rebuild from the ground up. So, instead of relying on the mnesia upgrade, we export the broker definitions (via API or the management GUI), rebuild the cluster, and re-import the definitions. You still need to be mindful of the order you bring your cluster up and down.

Here’s the general procedure (which did NOT work this time, read to the end before you follow anything here!):

  1. Export broker definitions to a file using the API or GUI.
  2. Upgrade to new RabbitMQ.
  3. Blow away old mnesia database in /var/lib/rabbitmq/mnesia (or better yet, move the mnesia dir out of there but keep it as a backup, you’ll see why!).
  4. Start RabbitMQ on host01 of the cluster.
  5. Import broker definitions file to host01.
  6. Verify that it worked and your config is sane.
  7. Start all other hosts and join_cluster them to host01 using the usual method (stop_app, reset, join_cluster, start_app).
  8. Have a beer.
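
For the export and import steps, here is a minimal sketch using the management plugin's HTTP API with curl. The port (15672), the guest credentials, and the function and file names are assumptions for illustration, not something taken from my cluster; adjust them to your setup.

```shell
#!/usr/bin/env bash
# Sketch: export and re-import broker definitions over the management HTTP API.
# Assumptions: the management plugin listens on port 15672 and
# MGMT_USER/MGMT_PASS name an administrator account. The function names
# are made up for this example.

MGMT_URL="${MGMT_URL:-http://localhost:15672}"
MGMT_USER="${MGMT_USER:-guest}"
MGMT_PASS="${MGMT_PASS:-guest}"

# Save the broker definitions (users, vhosts, exchanges, queues,
# bindings and so on) to the file given as $1.
export_definitions() {
    curl --fail --silent --show-error -u "$MGMT_USER:$MGMT_PASS" \
        -o "$1" "$MGMT_URL/api/definitions"
}

# Upload a previously exported definitions file ($1) to the broker.
import_definitions() {
    curl --fail --silent --show-error -u "$MGMT_USER:$MGMT_PASS" \
        -H "content-type: application/json" \
        -X POST --data-binary "@$1" "$MGMT_URL/api/definitions"
}

# Usage:
#   export_definitions broker-defs.json   # before the upgrade
#   import_definitions broker-defs.json   # on the rebuilt host01
```

The `--fail` flag makes curl return a non-zero exit status on an HTTP error, so a rejected import does not pass silently in a script.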

The problem was that 3.2.0 did not want to read the broker definitions JSON file from a 3.0.0 export. There must be some small format change that screwed it up, and I was getting errors like this one:

error: "bad_request",
reason: "Validation failed name not recognised: undefined (my_vhost/federation/undefined)"

Fun! The broker had most of its configuration restored, but some things looked off, the dashboard GUI had a few undefined sprinkled around, and I didn’t trust it at all. So, plan B – the mnesia auto-upgrade. Luckily, I hadn’t just deleted the mnesia dir; I had backed it up to my home dir first:

$ cp -R /var/lib/rabbitmq/mnesia ~
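
A slightly more careful variant of that backup, as a sketch: `cp -Rp` also preserves ownership, permissions and timestamps, which matters if the files have to go back under a broker that runs as its own user. The backup location and the function name are hypothetical, and the broker should be stopped while the copy runs.

```shell
#!/usr/bin/env bash
# Sketch: take a timestamped copy of the mnesia directory.
# Assumptions: BACKUP_ROOT is a hypothetical destination, and
# rabbitmq-server is stopped so the files are not changing underneath us.

MNESIA_DIR="${MNESIA_DIR:-/var/lib/rabbitmq/mnesia}"
BACKUP_ROOT="${BACKUP_ROOT:-$HOME/rabbitmq-backups}"

backup_mnesia() {
    dest="$BACKUP_ROOT/mnesia-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$BACKUP_ROOT" || return 1
    # -R copies the whole tree, -p keeps owner, mode and timestamps.
    cp -Rp "$MNESIA_DIR" "$dest" || return 1
    # Print where the backup went so the caller can record it.
    echo "$dest"
}

# Usage:
#   backup_dir="$(backup_mnesia)" && echo "saved to $backup_dir"
```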

This saved my butt, but I still didn’t fully trust the auto-upgrade. Perhaps it would work – but I was running 3.0.0, and any x.0.0 is going to have lots of bugs. Also, this was a heavy-flux development environment that had multiple people changing configuration using multiple techniques for quite some time, so the internal state had much opportunity to get weird if it wanted to – and it had done so in the past on this cluster. So how do I get the cluster back up, but trust that the configuration is stable? Here’s what I ended up doing:

  1. Stop rabbitmq-server on all cluster nodes, one at a time, until you get to the last (it must be a disk node, not RAM).
  2. The last node you shut down has the most up-to-date broker state in its mnesia database; let’s call it host01 for the sake of discussion.
  3. Back up the mnesia database from the last node, then delete it.
  4. Upgrade RabbitMQ on all cluster machines.
  5. On the last node, host01, make sure RabbitMQ is not running, and kill epmd (the Erlang port mapper daemon) as well, just to be sure. Caveat: don’t kill epmd if you have other Erlang apps running, of course.
  6. Delete the (now fresh) mnesia directory if it’s there.
  7. Copy the old mnesia directory into place.
  8. Start RabbitMQ on host01.
  9. Verify that the broker config looks sane via the GUI or whatever works for you.
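
On the epmd caveat: one way to avoid stepping on other Erlang apps is to ask epmd what is registered before stopping it. This is a sketch, not what I ran; the function name is made up, and it relies on `epmd -names` printing one `name ... at port ...` line per registered node.

```shell
#!/usr/bin/env bash
# Sketch: stop epmd only when no Erlang node is still registered with it.
# Assumption: `epmd -names` lists registered nodes as lines starting
# with "name ", and exits non-zero when epmd is not running.

stop_epmd_if_idle() {
    # If epmd cannot be reached it is not running; nothing to do.
    names="$(epmd -names 2>/dev/null)" || return 0

    if printf '%s\n' "$names" | grep -q '^name '; then
        echo "epmd still has registered nodes, leaving it alone:" >&2
        printf '%s\n' "$names" >&2
        return 1
    fi

    # No nodes registered, so it is safe to ask epmd to exit.
    epmd -kill
}

# Usage (after stopping rabbitmq-server):
#   stop_epmd_if_idle || echo "another Erlang app is using epmd"
```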

Yes, you could just leave the mnesia directory in place and have it upgrade when you start the new Rabbit version up – but copying it out forces you to make that backup, so if RabbitMQ’s automagical upgrade screws it up, you can at least restore the original cluster’s state.

[root@host01 ~]# service rabbitmq-server stop
Stopping rabbitmq-server: rabbitmq-server.
[root@host01 ~]# killall epmd
[root@host01 ~]# rm -rf /var/lib/rabbitmq/mnesia
[root@host01 ~]# cp -R ~/mnesia /var/lib/rabbitmq
[root@host01 ~]# service rabbitmq-server start
Starting rabbitmq-server: SUCCESS
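
If you prefer the command line to the GUI for the sanity check, something like the following sketch dumps what the restored broker knows about, vhost by vhost. The function name is made up, and it assumes `rabbitmqctl -q` prints bare rows with no header; newer releases may add a header row that you would need to filter out.

```shell
#!/usr/bin/env bash
# Sketch: print the restored broker's users, then the exchanges, queues,
# parameters and policies of every vhost, for eyeballing against
# what you expect.
# Assumption: `rabbitmqctl -q list_vhosts` prints one vhost per line.

show_broker_state() {
    rabbitmqctl -q list_users </dev/null || return 1

    rabbitmqctl -q list_vhosts </dev/null | while read -r vhost; do
        echo "== vhost: $vhost"
        rabbitmqctl -q list_exchanges -p "$vhost" name type </dev/null
        rabbitmqctl -q list_queues -p "$vhost" name messages </dev/null
        rabbitmqctl -q list_parameters -p "$vhost" </dev/null
        rabbitmqctl -q list_policies -p "$vhost" </dev/null
    done
}

# Usage:
#   show_broker_state > after-upgrade.txt
```

Saving the output to a file makes it easy to diff against the same listing taken after the re-import.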

Then, you export the broker defs via the API or GUI, blow away the mnesia directory again, and re-import the defs. This way, the internal data structures are all rebuilt from the definition file, and any inconsistencies in the mnesia data structures won’t carry over.

...after exporting broker definition file...

[root@host01 ~]# service rabbitmq-server stop
Stopping rabbitmq-server: rabbitmq-server.
[root@host01 ~]# killall epmd
[root@host01 ~]# rm -rf /var/lib/rabbitmq/mnesia
[root@host01 ~]# service rabbitmq-server start
Starting rabbitmq-server: SUCCESS

...now re-import the broker definition file...

After that, join all your other cluster members to the now-configured node in the usual way. All should be well.
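
For reference, “the usual way” looks roughly like this on each remaining node. It is a sketch: the function name is made up, and the node name `rabbit@host01` assumes the default `rabbit` node name prefix.

```shell
#!/usr/bin/env bash
# Sketch: join the local node to the already-configured node.
# Assumption: the seed node is called rabbit@host01 (default "rabbit"
# prefix plus the host name used in this post).

join_seed_node() {
    seed="${1:-rabbit@host01}"

    # Each step only runs if the previous one succeeded.
    rabbitmqctl stop_app &&
    rabbitmqctl reset &&
    rabbitmqctl join_cluster "$seed" &&
    rabbitmqctl start_app &&
    rabbitmqctl cluster_status
}

# Usage, on host02, host03, ... with rabbitmq-server already started:
#   join_seed_node rabbit@host01
```

Note that `reset` wipes the local node's state, which is what you want here since the definitions now live on host01.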

I hate to do stuff that seems like computer voodoo, but sometimes these excessive steps work around real-world glitches that can cause a lot of pain. I’d rather not have an insane broker with wacky internal state, rare as it is – it has happened.


Aaron K.

Aaron Kondziela is a technologist and serial entrepreneur living in New York City.

