summary: gridware application on 2-node cluster failed

From: Dr. Martin Körfer (koerfer@mpch-mainz.mpg.de)
Date: Mon Mar 07 2005 - 04:47:04 EST


The only answer I received to my request below:

Did you try the mailing list?

http://gridengine.sunsource.net/project/gridengine/maillist.html

 -Ron Chen

Of course I did.
There were some hints about the error message described, but none of them solved my
problem.
So I experimented and found something I would call a
"workaround" that I can live with.

- I used the single Tru64 UNIX Alpha-Server "gridsrv" as qmaster
- I started install_execd on the cluster member "server1...."
  => no sge_execd was running
- so on "server1...." I ran:
  #.../rcsge -migrate
  => same error message: .......commd - qmaster not enrolled at commd-
- then I went to "gridsrv" and ran the same command there (the other way round):
  #.../rcsge -migrate
  => the qmaster was started successfully on the single Alpha-Server
     and, surprisingly, sge_execd was now running on "server1....",
     so I could use it as an execution host.
This met my needs, so I stopped investigating the problem further.
Still, it would be interesting to know what caused it.
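
A minimal sketch of how the "which daemons are running" checks in the steps
above could be scripted. It is not part of the original procedure: the function
name missing_daemons is hypothetical, and it assumes that `ps -e` lists all
processes on the host and that the four daemon names are the ones mentioned in
this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): read a process listing on stdin
# and print the name of every expected SGE daemon that does not appear in it.
missing_daemons() {
    listing=$(cat)
    for d in sge_commd sge_qmaster sge_schedd sge_execd; do
        case "$listing" in
            *"$d"*) ;;            # daemon name found in the listing
            *) echo "$d" ;;       # daemon name not found
        esac
    done
}

# Usage (assumes `ps -e` shows all processes on this platform):
#   ps -e | missing_daemons
```

On a host that is only an execution host, sge_qmaster and sge_schedd would be
expected in the output; the interesting case in this thread is sge_execd
showing up as missing.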

Martin

-------------------------------------------------------------------------------
Original request:

Hi managers,

after a system crash -> successful restore of a 2-node ES40 / HSG80 cluster
running Tru64 V5.1a PK6, all services restarted successfully except for
"SGE 5.3-gridware".
It came up with the error message:

-unable to contact qmaster via "server1.mpch-mainz.mpg.de" commd - qmaster not
enrolled at commd-

where "server1.mpch-mainz.mpg.de" is one of the cluster nodes, used as "qmaster".

-> no "sge_qmaster" was started
-> no "sge_execd" was started

Using a single Alpha-Server (not in the cluster environment) as "qmaster", I
succeeded -> all daemons running.
I then used "server1.mpch-mainz.mpg.de" as an execution host: starting
"install_execd" on it ran without error, but only "sge_commd" was running,
!not! "sge_execd" (which does run on the other "execution hosts" that are
not in the cluster).
On the second cluster member, "server2.mpch-mainz.mpg.de", I got the same
result as on "server1".

Trying a brand-new installation of the "SGE 5.3" software with

#/soft/gridware/sge/inst_sge

I eventually ended up with the error message:

Grid Engine qmaster and scheduler startup
-----------------------------------------

Starting qmaster and scheduler daemon. Please wait ...
   starting sge_qmaster
starting program: /soft/gridware/sge/bin/tru64/sge_commd
using service "sge_commd"
bound to port 536
Reading in complexes:
        Complex "host".
        Complex "queue".
Reading in execution hosts.
Reading in administrative hosts.
Reading in parallel environments:
        PE "make".
Reading in scheduler configuration
   starting sge_schedd

error: getting configuration: unable to contact qmaster via "" commd - qmaster
not enrolled at commd
error: can't get configuration from qmaster -- backgrounding

-> "sge_commd" and "sge_schedd" were started, but
   "sge_qmaster" and "sge_execd" were missing

So I came to the conclusion that, due to the system restore on the cluster,
something is missing (possibly a "socket" or something else).
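
The startup log above shows sge_commd being looked up as a service and bound to
port 536. One way to look for a difference between the cluster nodes and the
single Alpha-Server (again only a guess at where to look, not a confirmed
cause) is to compare the sge_commd entry in the services file on each host.
A minimal sketch, assuming the usual `name port/protocol` layout of
/etc/services; the function name service_port is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a services file on stdin and print the port
# number of the first entry whose service name matches the argument.
service_port() {
    awk -v name="$1" '$1 == name { split($2, a, "/"); print a[1]; exit }'
}

# Usage (run on each host and compare the results):
#   service_port sge_commd < /etc/services
```

An empty result on one host, or a port that differs between hosts, would be a
concrete difference to follow up on.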

Does anybody have any idea why "sge_qmaster" and "sge_execd" were not started
on the cluster nodes, but do run on the single Alpha-Server?

Right now (after a week of working on it) I am out of ideas.

Any help would be appreciated

Thanks in advance

Martin Körfer

--
Dr.Martin Körfer
Max-Planck-Institut für Chemie
Elektronik
J.J.Becherweg 27
55128 Mainz
Tel.: -49-6131-305488
Fax:  -49-6131-305318

