SUMMARY: External failover machine

From: Bugs (bb1@humboldt.edu)
Date: Tue Sep 10 2002 - 12:05:17 EDT


Thanks to all of you for the solutions. I have included all the replies,
because they are all so helpful.

My problem:
On Fri, 6 Sep 2002, Bugs wrote:

->
->Hi Managers,
->We would like an automated system in which, if our web server fails,
->another machine takes over the IP and displays a web page saying that
->the web server is down and to check again in XX minutes.
->Then, when the web server comes back up, it must get back its IP and
->notify the temp machine to relinquish the IP and go back to standby
->mode.
->We cannot make the change in the nameserver because of the time it
->would take for the new name/IP mapping to propagate through caches
->across the internet.
->
->This will be a short term solution until a load balancing system is
->implemented.
->
->If anyone is already doing this, or has ideas they wish to share about
->it, please reply.
->Thanks!

The solution we are using...

assume the following in nameserver:
master = 137.150.1.1
slave = 137.150.1.2
www = 137.150.1.3 (webserver)

A webserver on slave has a web page with info about the web server
being unavailable.
The following script, submitted by cron every minute, runs on slave:

start of script ##########################################
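#!/bin/csh

# webmonitor Author: Bugs 09-Sep-2002
# Run from cron every minute on slave.

# Send a single echo request to the web server.
ping -c 1 www
# The script treats exit status 1 from ping as "no reply".
if ($status == "1") then
        echo webserver is dead
        # Take over the web server's address as an alias on tu1.
        ifconfig tu1 alias 137.150.1.3
endif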

end of script ##########################################

If it can't ping 137.150.1.3, it assigns the www address to itself.

on master:
When master is fixed and starts back up, it issues this command:
# disable 137.150.1.3 on slave
rsh -l root yew /usr/sbin/ifconfig tu1 -alias 137.150.1.3 abort
# enable 137.150.1.3 on master
/usr/sbin/ifconfig tu1 alias 137.150.1.3

Of course, a .rhosts file has to exist on slave.
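
The handback on master could be sketched in sh roughly as below. This is
only an illustration under stated assumptions: the variable and function
names (SLAVE_HOST, IFACE, VIP, DRY_RUN, run, vip_answers, reclaim_vip) are
made up for the example, the ping check after the rsh is an addition that
is not in the commands above, and DRY_RUN defaults to printing the commands
rather than running them.

```shell
#!/bin/sh
# Sketch only. SLAVE_HOST, IFACE, VIP, DRY_RUN and the function names are
# illustrative; the host name and addresses are the ones quoted above.
SLAVE_HOST=${SLAVE_HOST:-yew}
IFACE=${IFACE:-tu1}
VIP=${VIP:-137.150.1.3}
DRY_RUN=${DRY_RUN:-1}          # 1 = print commands instead of running them

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# True if some host currently answers for the service address.
vip_answers() {
    ping -c 1 "$VIP" >/dev/null 2>&1
}

reclaim_vip() {
    # Ask slave to drop the alias. rsh reports only its own failures, not
    # the remote command's exit status, so the result is checked with ping.
    run rsh -l root "$SLAVE_HOST" /usr/sbin/ifconfig "$IFACE" -alias "$VIP" abort
    if vip_answers; then
        echo "$VIP still answers; not configuring it here" >&2
        return 1
    fi
    # Nobody answers for the address any more, so configure it locally.
    run /usr/sbin/ifconfig "$IFACE" alias "$VIP"
}

# Act only when asked to, e.g. "sh reclaim.sh reclaim" from a boot script.
case "${1:-}" in
    reclaim) reclaim_vip ;;
esac
```

With DRY_RUN left at 1 the slave never really drops the alias, so the ping
check will normally report that the address still answers; set DRY_RUN=0
to run the commands for real.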

Thanks again for all your great replies...

=====================================================
>From: Simon Millard <simon.g.millard@ntlworld.com>

Relatively straightforward, this one.

You need two ip addresses per machine.

When the master starts up, it starts with a temporary IP address, then
checks for the existence of the main IP address. If the main IP address
is found, then it assumes the role of slave.

If, however, it does not find the master ip, it assumes the role of the
master and sets the ip address accordingly.

The slave does the same, always looking for the existence of the master
and determining if it will be the master and start your application.

If the slave server starts up and cannot find the master, it assumes the
master role displaying your messages.

Now, your slave server can monitor the master server to determine its
existence and, if it disappears, can start up the applications manually.
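
The start-up decision described above could be sketched in sh as follows.
This is an illustration, not Simon's actual implementation: MAIN_IP, IFACE,
DRY_RUN and the function names are invented for the example, the address
and interface are borrowed from the summary at the top, and a single ping
stands in for "checks for the existence of the main IP address".

```shell
#!/bin/sh
# Sketch only. MAIN_IP, IFACE, DRY_RUN and the function names are
# illustrative; the address and interface are borrowed from the summary.
MAIN_IP=${MAIN_IP:-137.150.1.3}
IFACE=${IFACE:-tu1}
DRY_RUN=${DRY_RUN:-1}          # 1 = print commands instead of running them

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# True if another host already answers on the main address.
main_ip_in_use() {
    ping -c 1 "$MAIN_IP" >/dev/null 2>&1
}

# Prints "slave" if the main address is already taken, else "master".
decide_role() {
    if main_ip_in_use; then
        echo slave
    else
        echo master
    fi
}

start_up() {
    role=$(decide_role)
    echo "role: $role"
    if [ "$role" = master ]; then
        # Nobody holds the main address, so claim it; the application
        # would be started after this point.
        run ifconfig "$IFACE" alias "$MAIN_IP"
    fi
}

# Act only when asked to, e.g. "sh role.sh start" late in the boot sequence.
case "${1:-}" in
    start) start_up ;;
esac
```

Both machines can run the same script, since each one only takes the main
address when nothing else answers on it.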

There are, however, several problems with this.

1. You may have just lost a cable, so you don't really want to failover
without taking the master down first.

2. Has the master really disappeared? In one application, I had a
dedicated network link (FDDI) between the master and slave, as the
master was sending transaction information to the slave to ensure that the
slave databases were as current as the live ones. Failover was automatic.
To fail back, the master became the slave and the data was transferred back
the other way to re-sync the databases. Once synced, the slave was
shut down and the master started, and they sorted themselves out as master and
slave.

I hope the above makes some sense, as this was done before DEC introduced
its DECsafe product, which became DECASE then TruCluster.

Simon

=====================================================
>From: Mandell.Degerness@gems2.gov.bc.ca

I can easily imagine a script which sends a "heartbeat" (ping or other) to
the web server. At the point that the web server fails, it performs the
ifconfig change to add the web server address as an alias (NOTE: the web
server should have the IP address as an alias as well - see the next step).
When the web server is booting, it should signal the temporary server to
relinquish the IP address before attempting to configure the address for
itself.

Of course, the optimal solution is to use Available Server Environment
(ASE/TruCluster) to perform this same set of tasks.

=====================================================
>From: alan@nabeth.cxo.cpqcorp.net

        What version of Tru64 UNIX are you running? Failover
        has been a feature of ASE and TruCluster since about
        the beginning of time.
.
.
.

        I believe TruCluster replaces ASE, but check the documentation
        and QuickSpecs to be sure. For a small enough number of hosts
        you can use V5.1A to use a private LAN for cluster communication.
        You'll still have the license cost, but the hardware is less
        expensive (another NIC for each system).

        As for trying to do it by hand, I can't offer any advice.
        There may be people who have, and in a restrictive enough
        environment, it may be enough easier to set up than a full
        cluster or ASE to warrant the work. However, if you ever
        run into shared storage issues, you'll want the
        clustering software to keep the data safe.

=====================================================
>From: broderic@MIT.EDU

This is easy if you use a virtual IP for the service that is different from the
IP tied to the network interface. I.e., get a second address on the same
network the machine sits on. Make the name mapped to this IP the name that
users specify to access the webserver. If the hostname is already using this
name, get a new IP/name for the system first. Then use

ifconfig tu0 alias/delete X.X.X.X

to add/remove the second virtual "service" IP on the interface, all of which can be
done programmatically.
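
As a rough illustration of doing this from a script, a small sh helper
might look like the one below. The names IFACE, DRY_RUN, run and service_ip
are made up for the example, 192.0.2.10 is a documentation address, and the
removal uses the "-alias" form that appears elsewhere in this thread.

```shell
#!/bin/sh
# Sketch only. IFACE, DRY_RUN, run and service_ip are illustrative names.
IFACE=${IFACE:-tu0}
DRY_RUN=${DRY_RUN:-1}          # 1 = print commands instead of running them

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# service_ip add|remove ADDRESS
service_ip() {
    if [ $# -ne 2 ]; then
        echo "usage: service_ip add|remove ADDRESS" >&2
        return 2
    fi
    case "$1" in
        add)    run ifconfig "$IFACE" alias "$2" ;;
        remove) run ifconfig "$IFACE" -alias "$2" ;;
        *)      echo "usage: service_ip add|remove ADDRESS" >&2
                return 2 ;;
    esac
}

# Example: sh service_ip.sh add 192.0.2.10
if [ $# -eq 2 ]; then
    service_ip "$1" "$2"
fi
```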

                                                        _Mike
=====================================================
>From: Greg Freemyer <freemyer@NorcrossGroup.com>

I'm sure you know TruClusters would do that easily, but .....

If all you want is something quick and dirty, it should be easy to do what you want.

You need 3 IPs.

IP 1 - Permanent IP for the main server
IP 2 - Permanent IP for the slave server for failure notification.
IP 3 - A virtual IP that is assigned to main if it is up, the slave if it is down.

Then have the slave ping IP 1 continuously. If the pings work, it does nothing.

If the pings fail, it adds IP 3 as an alias.

The only hard part is that when you bring up the alias you have to send out unsolicited ARP packets to notify everybody else on the subnet that the IP moved. I'm sure Tru64 does this with their cluster, but I don't know if it is a user-level app you can use as well, or if it is a kernel feature.

If you can't use the Tru64 app, then send_arp is supposed to be pretty portable. Find it as a sub-component of heartbeat (http://www.linux-ha.org).

Pseudo code is like:

Failed=FALSE

while true:
do
    switch Failed:
    case FALSE
        if ! ping
            Failed=TRUE
            ifconfig alias IP3
            send_arp
            /sbin/init.d/httpd restart (You may not need this)
         endcase
    case TRUE
        if ping
            Failed=FALSE
            ifconfig -alias IP3
        endcase
    sleep 10
done
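
One possible runnable take on the pseudocode above, in sh, is sketched
below. It is an illustration under assumptions: MASTER_IP (IP 1), VIP
(IP 3), IFACE, DRY_RUN and the function names are invented for the example,
the addresses are the ones from the summary at the top, and announce_vip is
only a placeholder because the send_arp arguments are not given in this
thread. The optional httpd restart is left out.

```shell
#!/bin/sh
# Sketch only. MASTER_IP, VIP, IFACE, DRY_RUN and the function names are
# illustrative; the addresses come from the summary at the top.
MASTER_IP=${MASTER_IP:-137.150.1.1}
VIP=${VIP:-137.150.1.3}
IFACE=${IFACE:-tu1}
DRY_RUN=${DRY_RUN:-1}          # 1 = print commands instead of running them
failed=0                       # 0 = master believed up, 1 = we hold the VIP

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# True if the master's permanent address answers a single ping.
master_alive() {
    ping -c 1 "$MASTER_IP" >/dev/null 2>&1
}

# Placeholder for the unsolicited ARP step (for example send_arp from
# heartbeat); its arguments are not guessed here.
announce_vip() {
    echo "announce $VIP (unsolicited ARP would go here)"
}

# One pass of the state machine: act only on a change of state.
step() {
    if [ "$failed" -eq 0 ]; then
        if ! master_alive; then
            failed=1
            run ifconfig "$IFACE" alias "$VIP"
            announce_vip
        fi
    else
        if master_alive; then
            failed=0
            run ifconfig "$IFACE" -alias "$VIP"
        fi
    fi
}

# Loop only when asked to, e.g. "sh failover.sh loop" on the slave.
case "${1:-}" in
    loop) while :; do step; sleep 10; done ;;
esac
```

Keeping the state in a variable means the alias is added or removed once
per transition rather than on every pass through the loop.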

Then on the master, be sure to set up the alias IP permanently, and you should do a send_arp every minute or so while it is up.

====
If you want a more permanent solution, the whole heartbeat package I mention above can be ported to Tru64. It already supports Linux/OpenBSD/FreeBSD/Solaris, so it should not be too hard to add Tru64 support.

I may do that myself after they release their next version. (They have been having problems recently with build tool changes, and trying to add a new OS while they are doing that would be problematic.)

=====================================================

>From: Danny Petterson <Danny.Petterson@dmsn.dk>

I suppose you would just make a script that pings the other server and, if
the ping gets no response, changes its own IP address to the primary webserver's
IP using ifconfig.

On the primary, make the default IP a temporary IP. The last thing the
boot sequence should do is ping the preferred IP (which may be on the
backup server). If the preferred IP answers (because it is in use by the
backup server), the primary uses rsh/ssh/whatever to log on to the backup
server and change its IP address back to the backup's default. After that,
the primary changes its own IP to the required one using ifconfig.

Greetings
Danny Petterson
Denmark

=====================================================
>From: aru.arunasalam@us.abb.com

What you might need is a load balancer, which will transparently
switch you over to a different web server if one dies. We found the
cheapest one to be the CoyotePoint Equalizer, but Cisco also has a
LocalDirector product that you may want to try. Try the cdw.com web site for
prices on various network load balancing products. If you are using
Windows, then Windows 2000 Advanced Server also supports some load
balancing capabilities.

Aru.

=====================================================
>From: "THOMAS, JEFFREY W." <jeff.thomas@peninsula.org>

Here are some links that may help:

http://www.samag.com/documents/s=1146/sam0109c/0109c.htm
http://linux-ha.org/

=====================================================
>From: Colin Bull <c.bull@videonetworks.com>

We are doing this with clustering, but it is almost as easy to do it
without.

We check if interface is live by -

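 if ping -c 1 10.254.190.57 > /dev/null ; then
        # Address already answers: report it and leave the interface alone.
        echo `ping -c 1 10.254.190.57`
        echo "The IP address is already up; please verify with ifconfig."
        echo "There may be a manual instance running."
 else
        # Nothing answers: bring the address up as an alias on nr0.
        ifconfig nr0 alias 10.254.190.57 netmask 255.255.255.0
 fi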

This can be wrapped in a script to check every minute or so.

To discontinue the interface

ifconfig nr0 -alias 10.254.190.57 abort

This can be done in an rsh command so that the main server is in control.
=====================================================

Bugs Brouillard Unix system administrator
Humboldt State Univ. Information Technology Services
Arcata, Calif.

email bb1@humboldt.edu


