
diff --git a/tools/README.hb_report b/tools/README.hb_report
index 043898184c..ed6fef4c96 100644
--- a/tools/README.hb_report
+++ b/tools/README.hb_report
@@ -1,297 +1,305 @@
Heartbeat reporting
===================
Dejan Muhamedagic <dmuhamedagic@suse.de>
v1.0
`hb_report` is a utility to collect all information relevant to
Heartbeat over a given period of time.
Quick start
-----------
Run `hb_report` on one of the nodes or on the host which serves as
a central log server. Run `hb_report` without parameters to see usage.
A few examples:
1. Last night during the backup there were several warnings
encountered (logserver is the log host):
+
logserver# hb_report -f 3:00 -t 4:00 /tmp/report
+
collects everything from all nodes from 3am to 4am last night.
The files are stored in /tmp/report and compressed to a tarball
/tmp/report.tar.gz.
2. Just found a problem during testing:
node1# date : note the current time
node1# /etc/init.d/heartbeat start
node1# nasty_command_that_breaks_things
node1# sleep 120 : wait for the cluster to settle
node1# hb_report -f time /tmp/hb1
Introduction
------------
Managing clusters is cumbersome. Heartbeat v2 with its numerous
configuration files and multi-node clusters just adds to the
complexity. No wonder then that most problem reports were less
than optimal. This is an attempt to rectify that situation and
make life easier for both the users and the developers.
On security
-----------
`hb_report` is a fairly complex program. As some of you are
-probably going to run it as root let us state a few important
+probably going to run it as `root` let us state a few important
things you should keep in mind:
-1. Don't run `hb_report` as root! It is fairly simple to setup
+1. Don't run `hb_report` as `root`! It is fairly simple to set up
things in such a way that root access is not needed. I won't go
into details, just to stress that all information collected
should be readable by accounts belonging to the haclient group.
2. If you still have to run this as root, well, don't use the
`-C` option.
3. Of course, every possible precaution has been taken not to
disturb processes, or touch or remove files out of the given
destination directory. If you (by mistake) specify an existing
directory, `hb_report` will bail out soon. If you specify a
-relative path, it won't work either. The final product of
-`hb_report` is a tarball. However, the destination directory is
-not removed on any node, unless the user specifies `-C`. If you're
-too lazy to cleanup the previous run, do yourself a favour and
-just supply a new destination directory. You've been warned. If
-you worry about the space used, just put all your directories
-under /tmp and setup a cronjob to remove those directories once a
-week:
+relative path, it won't work either.
+
+The final product of `hb_report` is a tarball. However, the
+destination directory is not removed on any node, unless the user
+specifies `-C`. If you're too lazy to clean up the previous run,
+do yourself a favour and just supply a new destination directory.
+You've been warned. If you worry about the space used, just put
+all your directories under `/tmp` and set up a cronjob to remove
+those directories once a week:
..........
for d in /tmp/*; do
test -d $d ||
continue
test -f $d/description.txt || test -f $d/.env ||
continue
grep -qs 'By: hb_report' $d/description.txt ||
grep -qs '^UNIQUE_MSG=Mark' $d/.env ||
continue
rm -r $d
done
..........
Mode of operation
-----------------
Cluster data collection is straightforward: just run the same
procedure on all nodes and collect the reports. There is,
apart from many small ones, one large complication: central
syslog destination. So, in order to allow this to be fully
automated, we should sometimes run the procedure on the log host
too. Actually, if there is a log host, then the best way is to
run `hb_report` there.
-We use ssh for the remote program invocation. Even though it is
+We use `ssh` for the remote program invocation. Even though it is
possible to run `hb_report` without ssh by doing a more menial job,
the overall user experience is much better if ssh works. Anyway,
how else do you manage your cluster?
Another ssh related point: In case your security policy
proscribes loghost-to-cluster-over-ssh communications, then
you'll have to copy the log file to one of the nodes and point
`hb_report` to it.
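For example, something along these lines should work (the host
names and paths here are only placeholders):
..........
node1# scp logserver:/var/log/ha-log /var/tmp/
node1# hb_report -f 3:00 -t 4:00 -l /var/tmp/ha-log /tmp/report
..........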
Prerequisites
-------------
1. ssh
+
This is not strictly required, but you won't regret having
password-less ssh. It is not too difficult to set up and will save
you a lot of time. If you can't have it, for example because your
security policy does not allow such a thing, or you just prefer
menial work, then you will have to resort to the semi-manual
semi-automated report generation. See below for instructions.
++
+If you need to supply a password for your passphrase/login, then
+please use the `-u` option.
2. Times
+
In order to find files and messages in the given period and to
parse the `-f` and `-t` options, `hb_report` uses perl and one of the
`Date::Parse` or `Date::Manip` perl modules. Note that you need
-only one of these.
+only one of these. Furthermore, on nodes which have no logs and
+where you don't run `hb_report` directly, no date parsing is
+necessary. In other words, if you run this on a loghost then you
+don't need these perl modules on the cluster nodes.
+
On rpm based distributions, you can find `Date::Parse` in
`perl-TimeDate` and on Debian and its derivatives in
`libtimedate-perl`.
3. Core dumps
+
-To backtrace core dumps gdb is needed and the Heartbeat packages
+To backtrace core dumps `gdb` is needed and the Heartbeat packages
with the debugging info. The debug info packages may be installed
at the time the report is created. Let's hope that you will need
this only rarely.
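If you want to check these prerequisites by hand before running
`hb_report`, a sketch like the following should do (the node names
are only examples):
..........
node1# ssh -T -o Batchmode=yes node2 true
node1# perl -e 'use Date::Parse' || perl -e 'use Date::Manip'
node1# which gdb
..........
Each command exits non-zero (or prints an error) when the
corresponding prerequisite is missing.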
What is in the report
---------------------
1. Heartbeat related
- heartbeat version/release information
- heartbeat configuration (CIB, ha.cf, logd.cf)
- heartbeat status (output from crm_mon, crm_verify, ccm_tool)
- pengine transition graphs (if any)
- backtraces of core dumps (if any)
- heartbeat logs (if any)
2. System related
- general platform information (`uname`, `arch`, `distribution`)
-- system statistics (`uptime`, `top`, `ps`)
+- system statistics (`uptime`, `top`, `ps`, `netstat -i`, `arp`)
3. User created :)
- problem description (template to be edited)
4. Generated
- problem analysis (generated)
It is preferred that Heartbeat is running at the time of the
report, but it is not absolutely required. `hb_report` will also do a
quick analysis of the collected information.
Times
-----
Specifying times can at times be a nuisance. That is why we have
chosen to use one of the perl modules--they do allow a certain
freedom when talking dates. You can either read the instructions
at the
http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES[Date::Parse
examples page]
or just rely on common sense and try stuff like:
3:00 (today at 3am)
15:00 (today at 3pm)
2007/9/1 2pm (September 1st at 2pm)
`hb_report` will (probably) complain if it can't figure out what
you mean.
Try to delimit the event as closely as possible in order to reduce
the size of the report, but still leave a minute or two around
for good measure.
Note that `-f` is not an optional option. And don't forget to quote
dates when they contain spaces.
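If you are not sure whether a particular date will be understood,
you can feed it to the same perl module that `hb_report` uses:

# perl -MDate::Parse -e 'print str2time("2007/9/1 2pm"), "\n"'

It prints the corresponding UNIX time, or nothing if the date could
not be parsed.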
Should I send all this to the rest of Internet?
-----------------------------------------------
We make an effort to remove sensitive data from the Heartbeat
configuration (CIB, ha.cf, and transition graphs). However, you
_have_ to tell us what is sensitive! Use the `-p` option to specify
additional regular expressions to match variable names which may
contain information you don't want to leak. For example:
# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report
We look by default for variable names matching "passw.*" and for
the stonith_host ha.cf directive.
Logs and other files are not filtered. Please filter them
yourself if necessary.
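To illustrate, an attribute whose name matches one of the patterns
keeps its name but has its value masked, roughly like this:

<nvpair name="password" value="secret"/>

becomes

<nvpair name="password" value="****"/>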
Logs
----
It may be tricky to find syslog logs. The scheme used is to log a
unique message on all nodes and then look it up in the usual
syslog locations. This procedure is not foolproof, in particular
if the syslog files are in a non-standard directory. We look in
/var/log /var/logs /var/syslog /var/adm /var/log/ha
/var/log/cluster. In case we can't find the logs, please supply
their location:
# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1
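The marking scheme mentioned above is simple enough to reproduce by
hand if you want to check where your syslog messages end up (the
mark string below is just an example):
..........
# logger -p daemon.info Mark:HB_REPORT:test
# grep -l "Mark:HB_REPORT:test" /var/log/* 2>/dev/null
..........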
If you have different log locations on different nodes, well,
-perhaps you'd like to make them the same. Or read about the
-manual report collection.
+perhaps you'd like to make them the same and make life easier for
+everybody.
The log files are collected from all hosts where they are found. In
case your syslog is configured to log to both the log server and
local files and `hb_report` is run on the log server, you will end
up with multiple logs with the same content.
Files starting with "ha-" are preferred. In case syslog sends
messages to more than one file, and one of them is named ha-log or
ha-debug, those will be favoured over syslog or messages.
If there is no separate log for Heartbeat, possibly unrelated
messages from other programs are included. We don't filter logs,
just pick a segment for the period you specified.
NB: Don't have a central log host? Read the CTS README and set up
one.
Manual report collection
------------------------
So, your ssh doesn't work. In that case, you will have to run
this procedure on all nodes. Use `-S` so that we don't bother with
ssh:
# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1
If you also have a log host which is not in the cluster, then
you'll have to copy the log to one of the nodes and tell us where
it is:
# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1
Furthermore, to prevent `hb_report` from asking you to edit the
report to describe the problem on every node, use `-D` on all but
one:
# hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1
If you reconsider and want the ssh setup, take a look at the CTS
README file for instructions.
Analysis
--------
The point of analysis is to extract the most important
information from what is probably several thousand lines worth of
text. Perhaps this should more properly be called a report review,
as it is rather simple, but let's pretend that we are doing
something utterly sophisticated.
The analysis consists of the following:
- compare files coming from different nodes; if they are equal,
make one copy in the top level directory, remove duplicates,
and create soft links instead
- print errors, warnings, and lines matching `-L` patterns from logs
- report if there were coredumps and by whom
- report crm_verify results
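When files turn out to be identical, the result is, for example, a
layout like this (one copy at the top, symbolic links in the
per-node directories):
..........
report/ha.cf
report/node1/ha.cf -> ../ha.cf
report/node2/ha.cf -> ../ha.cf
..........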
The goods
---------
1. Common
+
- ha-log (if found on the log host)
- description.txt (template and user report)
- analysis.txt
2. Per node
+
- ha.cf
- logd.cf
- ha-log (if found)
- cib.xml (`cibadmin -Ql` or `cp` if Heartbeat is not running)
- ccm_tool.txt (`ccm_tool -p`)
- crm_mon.txt (`crm_mon -1`)
- crm_verify.txt (`crm_verify -V`)
- pengine/ (only on DC, directory with pengine transitions)
- sysinfo.txt (static info)
- sysstats.txt (dynamic info)
- backtraces.txt (if coredumps found)
- DC (well...)
+- RUNNING or STOPPED
diff --git a/tools/hb_report.in b/tools/hb_report.in
index c02a3df378..f4ee7fbee9 100755
--- a/tools/hb_report.in
+++ b/tools/hb_report.in
@@ -1,608 +1,663 @@
#!/bin/sh
# Copyright (C) 2007 Dejan Muhamedagic <dmuhamedagic@suse.de>
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
. @sysconfdir@/ha.d/shellfuncs
. $HA_NOARCHBIN/utillib.sh
PROG=`basename $0`
# FIXME: once this is part of the package!
PROGDIR=`dirname $0`
echo "$PROGDIR" | grep -qs '^/' || {
test -f @sbindir@/$PROG &&
PROGDIR=@sbindir@
test -f $HA_NOARCHBIN/$PROG &&
PROGDIR=$HA_NOARCHBIN
}
LOGD_CF=`findlogdcf @sysconfdir@ $HA_DIR`
export LOGD_CF
-: ${SSH_OPTS="-T -o Batchmode=yes"}
+: ${SSH_OPTS="-T"}
LOG_PATTERNS="CRIT: ERROR:"
#
# the instance where user runs hb_report is the master
# the others are slaves
#
if [ x"$1" = x__slave ]; then
SLAVE=1
fi
#
# if this is the master, allow ha.cf and logd.cf in the current dir
# (because often the master is the log host)
#
if [ "$SLAVE" = "" ]; then
[ -f ha.cf ] && HA_CF=ha.cf
[ -f logd.cf ] && LOGD_CF=logd.cf
fi
usage() {
cat<<EOF
usage: hb_report -f time [-t time] [-u user] [-l file] [-p patt] [-L patt]
[-e prog] [-SDC] dest
-f time: time to start from
-t time: time to finish at (dflt: now)
- -u user: ssh user to access other nodes (dftl: hacluster)
+ -u user: ssh user to access other nodes (dflt: empty, hacluster, root)
-l file: log file
-p patt: regular expression to match variables to be removed;
this option is additive (dflt: "passw.*")
-L patt: regular expression to match in log files for analysis;
this option is additive (dflt: $LOG_PATTERNS)
-e prog: your favourite editor
-D : don't invoke editor to write description
-C : remove the destination directory
-S : single node operation; don't try to start report
collectors on other nodes
dest : destination directory
EOF
[ "$1" != short ] &&
cat<<EOF
. the multifile output is first stored in a directory {dest}
of which a tarball {dest}.tar.gz is created
. the time specification is as in either Date::Parse or
Date::Manip, whatever you have installed; Date::Parse is
preferred
	. we try to figure out where the logfile is; if we can't, please
	clue us in
Examples
hb_report -f 2pm /tmp/report_1
hb_report -f "2007/9/5 12:30" -t "2007/9/5 14:00" /tmp/report_2
hb_report -f 1:00 -t 3:00 -l /var/log/cluster/ha-debug /tmp/report_3
hb_report -f "09sep07 2:00" -u hbadmin /tmp/report_4
hb_report -f 18:00 -p "usern.*" -p "admin.*" /tmp/report_5
. WARNING . WARNING . WARNING . WARNING . WARNING . WARNING .
We try to sanitize the CIB and the peinputs files. If you
have more sensitive information, please supply additional
patterns yourself. The logs and the crm_mon, ccm_tool, and
crm_verify output are *not* sanitized.
IT IS YOUR RESPONSIBILITY TO PROTECT THE DATA FROM EXPOSURE!
EOF
exit
}
#
# these are "global" variables
#
setvarsanddefaults() {
now=`perl -e 'print time()'`
# used by all
DESTDIR=""
FROM_TIME=""
TO_TIME=0
HA_LOG=""
UNIQUE_MSG="Mark:HB_REPORT:$now"
SANITIZE="passw.*"
REMOVE_DEST=""
# used only by the master
NO_SSH=""
SSH_USER=""
- TRY_SSH="hacluster"
+ TRY_SSH="hacluster root"
SLAVEPIDS=""
NO_DESCRIPTION=""
}
chkdirname() {
[ "$1" ] || usage short
[ $# -ne 1 ] && fatal "bad directory name: $1"
echo $1 | grep -qs '^/' ||
fatal "destination directory must be an absolute path"
[ "$1" = / ] &&
fatal "no root here, thank you"
}
chktime() {
[ "$1" ] || fatal "bad time specification: $2"
}
msgcleanup() {
fatal "destination directory $DESTDIR exists, please cleanup"
}
nodistdirectory() {
fatal "could not create the destination directory $DESTDIR"
}
time2str() {
perl -e "use POSIX; print strftime('%x %X',localtime($1));"
}
#
# find log files
#
logmarks() {
sev=$1 msg=$2
- forall "logger -p $HA_LOGFACILITY.$sev $msg"
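+	# log the unique mark locally and, when ssh works, on every other node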
+ c="logger -p $HA_LOGFACILITY.$sev $msg"
+
+ for n in `getnodes`; do
+ if [ "$n" = "`uname -n`" ]; then
+ $c
+ else
+ [ "$ssh_good" ] &&
+ echo $c | ssh $ssh_opts $n
+ fi
+ done
}
findlog() {
if [ "$HA_LOGFACILITY" ]; then
findmsg $UNIQUE_MSG | awk '{print $1}'
else
echo ${HA_DEBUGFILE:-$HA_LOGFILE}
fi
}
#
# this is how we pass environment to other hosts
#
dumpenv() {
cat<<EOF
FROM_TIME=$FROM_TIME
TO_TIME=$TO_TIME
HA_LOG=$HA_LOG
DESTDIR=$DESTDIR
UNIQUE_MSG=$UNIQUE_MSG
SANITIZE="$SANITIZE"
REMOVE_DEST="$REMOVE_DEST"
EOF
}
send_config() {
for node in `getnodes`; do
[ "$node" = "$WE" ] && continue
dumpenv |
- ssh $SSH_OPTS $SSH_USER@$node "mkdir -p $DESTDIR; cat > $DESTDIR/.env"
+ ssh $ssh_opts $node "mkdir -p $DESTDIR; cat > $DESTDIR/.env"
done
}
start_remote_collectors() {
for node in `getnodes`; do
[ "$node" = "$WE" ] && continue
- ssh $SSH_OPTS $SSH_USER@$node "$PROGDIR/hb_report __slave $DESTDIR" |
+ ssh $ssh_opts $node "$PROGDIR/hb_report __slave $DESTDIR" |
(cd $DESTDIR && tar xf -) &
SLAVEPIDS="$SLAVEPIDS $!"
done
}
#
# does ssh work?
#
-findsshuser() {
- for n in `getnodes`; do
- [ "$node" = "$WE" ] && continue
- trysshusers $n $TRY_SSH && break
- done
+testsshuser() {
+ if [ "$2" ]; then
+ ssh -T -o Batchmode=yes $2@$1 true 2>/dev/null
+ else
+ ssh -T -o Batchmode=yes $1 true 2>/dev/null
+ fi
}
-checkssh() {
- for n in `getnodes`; do
- [ "$node" = "$WE" ] && continue
- checksshuser $n $SSH_USER || return 1
+findsshuser() {
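+	# try each candidate user (current user first, then hacluster and root)
+	# and echo the first one that can reach every other node over ssh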
+ for u in "" $TRY_SSH; do
+ rc=0
+ for n in `getnodes`; do
+ [ "$node" = "$WE" ] && continue
+ testsshuser $n $u || {
+ rc=1
+ break
+ }
+ done
+ if [ $rc -eq 0 ]; then
+ echo $u
+ return 0
+ fi
done
- return 0
+ return 1
}
#
# the usual stuff
#
getbacktraces() {
flist=`find_files $HA_VARLIB/cores $1 $2`
[ "$flist" ] &&
getbt $flist > $3
}
getpeinputs() {
n=`basename $3`
flist=$(
if [ -f $3/ha-log ]; then
grep " $n peng.*PEngine Input stored" $3/ha-log | awk '{print $NF}'
else
find_files $HA_VARLIB/pengine $1 $2
fi | sed "s,$HA_VARLIB/,,g"
)
[ "$flist" ] &&
(cd $HA_VARLIB && tar cf - $flist) | (cd $3 && tar xf -)
}
touch_DC_if_dc() {
dc=`crmadmin -D 2>/dev/null | awk '{print $NF}'`
if [ "$WE" = "$dc" ]; then
touch $1/DC
fi
}
#
# some basic system info and stats
#
sys_info() {
echo "Heartbeat version: `hb_ver`"
crm_info
echo "Platform: `uname`"
echo "Kernel release: `uname -r`"
echo "Architecture: `arch`"
[ `uname` = Linux ] &&
echo "Distribution: `distro`"
}
sys_stats() {
set -x
uptime
ps axf
ps auxw
top -b -n 1
netstat -i
+ arp -an
set +x
}
#
# replace sensitive info with '****'
#
sanitize() {
for f in $1/ha.cf $1/cib.xml $1/pengine/*; do
[ -f "$f" ] && sanitize_one $f
done
}
#
# remove duplicates if files are same, make links instead
#
consolidate() {
for n in `getnodes`; do
if [ -f $1/$2 ]; then
rm $1/$n/$2
else
mv $1/$n/$2 $1
fi
ln -s ../$2 $1/$n
done
}
#
# some basic analysis of the report
#
checkcrmvfy() {
for n in `getnodes`; do
if [ -s $1/$n/crm_verify.txt ]; then
echo "WARN: crm_verify reported warnings at $n:"
cat $1/$n/crm_verify.txt
fi
done
}
checkbacktraces() {
for n in `getnodes`; do
[ -s $1/$n/backtraces.txt ] && {
echo "WARN: coredumps found at $n:"
egrep 'Core was generated|Program terminated' \
$1/$n/backtraces.txt |
sed 's/^/ /'
}
done
}
checklogs() {
logs=`find $1 -name ha-log`
[ "$logs" ] || return
pattfile=`maketempfile` ||
fatal "cannot create temporary files"
for p in $LOG_PATTERNS; do
echo "$p"
done > $pattfile
echo ""
echo "Log patterns:"
for n in `getnodes`; do
cat $logs | grep -f $pattfile
done
rm -f $pattfile
}
#
# check if files have same content in the cluster
#
cibdiff() {
- crm_diff -c -n $1 -o $2
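+	# compare CIBs only if both were captured in the same cluster state
+	# (both RUNNING or both STOPPED)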
+ d1=`dirname $1`
+ d2=`dirname $2`
+ if [ -f $d1/RUNNING -a -f $d2/RUNNING ] ||
+ [ -f $d1/STOPPED -a -f $d2/STOPPED ]; then
+ crm_diff -c -n $1 -o $2
+ else
+ echo "can't compare cibs from running and stopped systems"
+ fi
}
txtdiff() {
diff $1 $2
}
diffcheck() {
+ [ -f "$1" ] || {
+ echo "$1 does not exist"
+ return 1
+ }
+ [ -f "$2" ] || {
+ echo "$2 does not exist"
+ return 1
+ }
case `basename $1` in
ccm_tool.txt)
txtdiff $1 $2;; # worddiff?
cib.xml)
cibdiff $1 $2;;
ha.cf)
txtdiff $1 $2;; # confdiff?
crm_mon.txt|sysinfo.txt)
txtdiff $1 $2;;
esac
}
analyze_one() {
rc=0
node0=""
for n in `getnodes`; do
if [ "$node0" ]; then
diffcheck $1/$node0/$2 $1/$n/$2
rc=$((rc+$?))
else
node0=$n
fi
done
return $rc
}
analyze() {
flist="ccm_tool.txt cib.xml crm_mon.txt ha.cf sysinfo.txt"
for f in $flist; do
perl -e "printf \"Diff $f... \""
ls $1/*/$f >/dev/null 2>&1 || continue
if analyze_one $1 $f; then
echo "OK"
consolidate $1 $f
else
echo "varies"
fi
done
checkcrmvfy $1
checkbacktraces $1
checklogs $1
}
#
# description template, editing, and other notes
#
mktemplate() {
cat<<EOF
Please edit this template and describe the issue/problem you
encountered. Then, post to
Linux-HA@lists.linux-ha.org
or file a bug at
http://old.linux-foundation.org/developer_bugzilla/
See http://linux-ha.org/ReportingProblems for detailed
description on how to report problems.
Thank you.
Date: `date`
By: $PROG $userargs
Subject: [short problem description]
Severity: [choose one] enhancement minor normal major critical blocking
-Component: [choose one] CRM LRM CCM RA fencing comm GUI other
+Component: [choose one] CRM LRM CCM RA fencing heartbeat comm GUI tools other
Detailed description:
---
[...]
---
-$(
-if [ -f $DESTDIR/sysinfo.txt ]; then
- cat $DESTDIR/sysinfo.txt
-else
- for n in `getnodes`; do
- [ -f $DESTDIR/$n/sysinfo.txt ] &&
- echo "Info $n:"; sed 's/^/ /' $DESTDIR/$n/sysinfo.txt
- done
-fi
-)
EOF
+
+ if [ -f $DESTDIR/sysinfo.txt ]; then
+ echo "Common system info found:"
+ cat $DESTDIR/sysinfo.txt
+ else
+ for n in `getnodes`; do
+ if [ -f $DESTDIR/$n/sysinfo.txt ]; then
+ echo "System info $n:"
+ sed 's/^/ /' $DESTDIR/$n/sysinfo.txt
+ fi
+ done
+ fi
}
edittemplate() {
if ec=`pickfirst $EDITOR vim vi emacs nano`; then
$ec $1
else
warning "could not find a text editor"
fi
}
finalword() {
cat<<EOF
The report is saved in $DESTDIR.tar.gz.
Thank you for taking time to create this report.
EOF
}
checksize() {
ls -s $DESTDIR.tar.gz | awk '$1>=100{exit 1}' ||
cat <<EOF
NB: the size of the tarball exceeds 100kb; if posted to the
mailing list it will first have to be approved by the moderator.
Try reducing the period (use the -f and -t options).
EOF
}
[ $# -eq 0 ] && usage
-# check for the major prereq
+# check for the major prereq for a) parameter parsing and b)
+# parsing logs
+#
+NO_str2time=""
t=`str2time "12:00"`
if [ "$t" = "" ]; then
- fatal "please install the perl Date::Parse module"
+ NO_str2time=1
+ [ "$SLAVE" ] ||
+ fatal "please install the perl Date::Parse module"
fi
WE=`uname -n` # who am i?
THIS_IS_NODE=""
getnodes | grep -wqs $WE && # are we a node?
THIS_IS_NODE=1
getlogvars
#
# part 1: get and check options; and the destination
#
if [ "$SLAVE" = "" ]; then
setvarsanddefaults
userargs="$@"
args=`getopt -o f:t:l:u:p:L:e:SDCh -- "$@"`
[ $? -ne 0 ] && usage
eval set -- "$args"
while [ x"$1" != x ]; do
case "$1" in
-h) usage;;
-f) FROM_TIME=`str2time "$2"`
chktime "$FROM_TIME" "$2"
shift 2;;
-t) TO_TIME=`str2time "$2"`
chktime "$TO_TIME" "$2"
shift 2;;
-u) SSH_USER="$2"; shift 2;;
-l) HA_LOG="$2"; shift 2;;
-e) EDITOR="$2"; shift 2;;
-p) SANITIZE="$SANITIZE $2"; shift 2;;
-L) LOG_PATTERNS="$LOG_PATTERNS $2"; shift 2;;
-S) NO_SSH=1; shift 1;;
-D) NO_DESCRIPTION=1; shift 1;;
-C) REMOVE_DEST=1; shift 1;;
--) shift 1; break;;
*) usage short;;
esac
done
[ $# -ne 1 ] && usage short
DESTDIR=$1
chkdirname $DESTDIR
[ "$FROM_TIME" ] || usage short
fi
# this only on master
if [ "$SLAVE" = "" ]; then
#
# part 2: ssh business
#
# find out if ssh works
- if [ "$NO_SSH" = "" ]; then
+ ssh_good=""
+ if [ -z "$NO_SSH" ]; then
[ "$SSH_USER" ] ||
SSH_USER=`findsshuser`
- [ "$SSH_USER" ] && checkssh || # check if it works on _all_ nodes
- SSH_USER=""
+ if [ $? -eq 0 ]; then
+ ssh_good=1
+ if [ "$SSH_USER" ]; then
+ ssh_opts="-l $SSH_USER $SSH_OPTS"
+ else
+ ssh_opts="$SSH_OPTS"
+ fi
+ fi
fi
# final check: don't run if the destination directory exists
[ -d $DESTDIR ] && msgcleanup
- [ "$SSH_USER" ] &&
+ [ "$ssh_good" ] &&
for node in `getnodes`; do
[ "$node" = "$WE" ] && continue
- ssh $SSH_OPTS $SSH_USER@$node "test -d $DESTDIR" &&
+ ssh $ssh_opts $node "test -d $DESTDIR" &&
msgcleanup
done
fi
if [ "$SLAVE" ]; then
DESTDIR=$2
[ -d $DESTDIR ] || nodistdirectory
. $DESTDIR/.env
else
mkdir -p $DESTDIR
[ -d $DESTDIR ] || nodistdirectory
fi
if [ "$SLAVE" = "" ]; then
#
# part 3: log marks to be searched for later
# important to do this now on _all_ nodes
#
if [ "$HA_LOGFACILITY" ]; then
sev="info"
cfdebug=`getcfvar debug` # prefer debuglog if set
[ "$cfdebug" -a "$cfdebug" -gt 0 ] &&
sev="debug"
logmarks $sev $UNIQUE_MSG
fi
#
# part 4: start this program on other nodes
#
- if [ "$SSH_USER" ]; then
+ if [ "$ssh_good" ]; then
send_config
start_remote_collectors
else
[ `getnodes | wc -w` -gt 1 ] &&
warning "ssh does not work to all nodes"
fi
fi
# only cluster nodes need their own directories
[ "$THIS_IS_NODE" ] && mkdir -p $DESTDIR/$WE
#
# part 5: find the logs and cut out the segment for the period
#
if [ "$HA_LOG" ]; then # log provided by the user?
[ -f "$HA_LOG" ] || { # not present
[ "$SLAVE" ] || # warning if not on slave
warning "$HA_LOG not found; we will try to find log ourselves"
HA_LOG=""
}
fi
if [ "$HA_LOG" = "" ]; then
HA_LOG=`findlog`
[ "$HA_LOG" ] &&
cnt=`fgrep -c $UNIQUE_MSG < $HA_LOG`
fi
nodecnt=`getnodes | wc -w`
if [ "$cnt" ] && [ $cnt -eq $nodecnt ]; then
info "found the central log!"
info "you can ignore warnings about missing logs"
fi
if [ -f "$HA_LOG" ]; then
- dumplog $HA_LOG $FROM_TIME $TO_TIME |
- if [ "$THIS_IS_NODE" ]; then
- cat > $DESTDIR/$WE/ha-log
+ if [ "$NO_str2time" ]; then
+ warning "a log was found, but we cannot slice it"
+ warning "please install the perl Date::Parse module"
else
- cat > $DESTDIR/ha-log # we are log server, probably
+ dumplog $HA_LOG $FROM_TIME $TO_TIME |
+ if [ "$THIS_IS_NODE" ]; then
+ cat > $DESTDIR/$WE/ha-log
+ else
+ cat > $DESTDIR/ha-log # we are log server, probably
+ fi
fi
else
warning "could not find the log file on $WE"
fi
#
# part 6: get all other info (config, stats, etc)
#
if [ "$THIS_IS_NODE" ]; then
getconfig $DESTDIR/$WE
getpeinputs $FROM_TIME $TO_TIME $DESTDIR/$WE
getbacktraces $FROM_TIME $TO_TIME $DESTDIR/$WE/backtraces.txt
touch_DC_if_dc $DESTDIR/$WE
sanitize $DESTDIR/$WE
sys_info > $DESTDIR/$WE/sysinfo.txt
sys_stats > $DESTDIR/$WE/sysstats.txt 2>&1
fi
#
# part 7: endgame:
# slaves tar their results to stdout, the master waits
# for them, analyses results, asks the user to edit the
# problem description template, and prints final notes
#
if [ "$SLAVE" ]; then
(cd $DESTDIR && tar cf - $WE)
else
wait $SLAVEPIDS
analyze $DESTDIR > $DESTDIR/analysis.txt
mktemplate > $DESTDIR/description.txt
[ "$NO_DESCRIPTION" ] || {
echo press enter to edit the problem description...
read junk
edittemplate $DESTDIR/description.txt
}
cd $DESTDIR/..
- tar czf $DESTDIR.tar.gz $DESTDIR/
+ tar czf $DESTDIR.tar.gz `basename $DESTDIR`
finalword
checksize
fi
[ "$REMOVE_DEST" ] &&
rm -r $DESTDIR
diff --git a/tools/utillib.sh b/tools/utillib.sh
index 05e259120a..2187624d9d 100644
--- a/tools/utillib.sh
+++ b/tools/utillib.sh
@@ -1,384 +1,354 @@
# Copyright (C) 2007 Dejan Muhamedagic <dmuhamedagic@suse.de>
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public
# License as published by the Free Software Foundation; either
# version 2.1 of the License, or (at your option) any later version.
#
# This software is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
#
# ha.cf/logd.cf parsing
#
getcfvar() {
[ -f $HA_CF ] || return
sed 's/#.*//' < $HA_CF |
grep -w "^$1" |
sed 's/^[^[:space:]]*[[:space:]]*//'
}
iscfvarset() {
test "`getcfvar \"$1\"`"
}
iscfvartrue() {
getcfvar "$1" |
egrep -qsi "^(true|y|yes|on|1)"
}
getnodes() {
getcfvar node
}
-#
-# ssh
-#
-checksshuser() {
- ssh -o Batchmode=yes $2@$1 true 2>/dev/null
-}
-trysshusers() {
- n=$1
- shift 1
- for u; do
- if checksshuser $n $u; then
- echo $u
- break
- fi
- done
-}
-
#
# logging
#
syslogmsg() {
severity=$1
shift 1
logtag=""
[ "$HA_LOGTAG" ] && logtag="-t $HA_LOGTAG"
logger -p ${HA_LOGFACILITY:-"daemon"}.$severity $logtag $*
}
#
# find log destination
#
uselogd() {
iscfvartrue use_logd &&
return 0 # if use_logd true
iscfvarset logfacility ||
iscfvarset logfile ||
iscfvarset debugfile ||
return 0 # or none of the log options set
false
}
findlogdcf() {
for f in \
`which strings > /dev/null 2>&1 &&
strings $HA_BIN/ha_logd | grep 'logd\.cf'` \
`for d; do echo $d/logd.cf $d/ha_logd.cf; done`
do
if [ -f "$f" ]; then
echo $f
return 0
fi
done
return 1
}
getlogvars() {
savecf=$HA_CF
if uselogd; then
[ -f "$LOGD_CF" ] ||
fatal "could not find logd.cf or ha_logd.cf"
HA_CF=$LOGD_CF
fi
HA_LOGFACILITY=`getcfvar logfacility`
HA_LOGFILE=`getcfvar logfile`
HA_DEBUGFILE=`getcfvar debugfile`
HA_SYSLOGMSGFMT=""
iscfvartrue syslogmsgfmt &&
HA_SYSLOGMSGFMT=1
HA_CF=$savecf
}
findmsg() {
# this is tricky, we try a few directories
syslogdir="/var/log /var/logs /var/syslog /var/adm /var/log/ha /var/log/cluster"
favourites="ha-*"
mark=$1
log=""
for d in $syslogdir; do
[ -d $d ] || continue
log=`fgrep -l "$mark" $d/$favourites` && break
log=`fgrep -l "$mark" $d/*` && break
done 2>/dev/null
echo $log
}
#
# print a segment of a log file
#
str2time() {
perl -e "\$time='$*';" -e '
eval "use Date::Parse";
if (!$@) {
print str2time($time);
} else {
eval "use Date::Manip";
if (!$@) {
print UnixDate(ParseDateString($time), "%s");
}
}
'
}
getstamp() {
if [ "$HA_SYSLOGMSGFMT" -o "$HA_LOGFACILITY" ]; then
awk '{print $1,$2,$3}'
else
awk '{print $2}' | sed 's/_/ /'
fi
}
linetime() {
l=`tail -n +$2 $1 | head -1 | getstamp`
str2time "$l"
}
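# binary search over line timestamps: echoes the number of the line
# whose time is closest to the given time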
findln_by_time() {
logf=$1
tm=$2
first=1
last=`wc -l < $logf`
while [ $first -le $last ]; do
mid=$(((last+first)/2))
tmid=`linetime $logf $mid`
if [ -z "$tmid" ]; then
warning "cannot extract time: $logf:$mid"
return
fi
if [ $tmid -gt $tm ]; then
last=$((mid-1))
elif [ $tmid -lt $tm ]; then
first=$((mid+1))
else
break
fi
done
echo $mid
}
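# print the log segment between from_time and to_time; a to_time of 0
# means up to the end of the log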
dumplog() {
logf=$1
from_time=$2
to_time=$3
from_line=`findln_by_time $logf $from_time`
if [ -z "$from_line" ]; then
warning "couldn't find line for time $from_time; corrupt log file?"
return
fi
tail -n +$from_line $logf |
if [ "$to_time" != 0 ]; then
to_line=`findln_by_time $logf $to_time`
if [ -z "$to_line" ]; then
warning "couldn't find line for time $to_time; corrupt log file?"
return
fi
head -$((to_line-from_line+1))
else
cat
fi
}
#
# find files newer than a and older than b
#
touchfile() {
t=`maketempfile` &&
perl -e "\$file=\"$t\"; \$tm=$1;" -e 'utime $tm, $tm, $file;' &&
echo $t
}
find_files() {
dir=$1
from_time=$2
to_time=$3
from_stamp=`touchfile $from_time`
findexp="-newer $from_stamp"
if [ "$to_time" -a "$to_time" -gt 0 ]; then
to_stamp=`touchfile $to_time`
findexp="$findexp ! -newer $to_stamp"
fi
find $dir -type f $findexp
rm -f $from_stamp $to_stamp
}
#
# coredumps
#
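# ask gdb which program generated a core (the "Core was generated by"
# line) and resolve that name to a full path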
findbinary() {
random_binary=`which cat 2>/dev/null` # suppose we are lucky
binary=`gdb $random_binary $1 < /dev/null 2>/dev/null |
grep 'Core was generated' | awk '{print $5}' |
sed "s/^.//;s/[.']*$//"`
[ x = x"$binary" ] && return
fullpath=`which $binary 2>/dev/null`
if [ x = x"$fullpath" ]; then
[ -x $HA_BIN/$binary ] && echo $HA_BIN/$binary
else
echo $fullpath
fi
}
getbt() {
which gdb > /dev/null 2>&1 || {
warning "please install gdb to get backtraces"
return
}
for corefile; do
absbinpath=`findbinary $corefile`
[ x = x"$absbinpath" ] && return 1
echo "====================== start backtrace ======================"
ls -l $corefile
gdb -batch -n -quiet -ex ${BT_OPTS:-"thread apply all bt full"} -ex quit \
$absbinpath $corefile 2>/dev/null
echo "======================= end backtrace ======================="
done
}
#
# heartbeat configuration/status
#
iscrmrunning() {
crmadmin -D >/dev/null 2>&1
}
dumpstate() {
crm_mon -1 | grep -v '^Last upd' > $1/crm_mon.txt
cibadmin -Ql > $1/cib.xml
ccm_tool -p > $1/ccm_tool.txt 2>&1
}
getconfig() {
- cp -p $HA_CF $1/
+ [ -f $HA_CF ] &&
+ cp -p $HA_CF $1/
[ -f $LOGD_CF ] &&
cp -p $LOGD_CF $1/
if iscrmrunning; then
dumpstate $1
+ touch $1/RUNNING
else
cp -p $HA_VARLIB/crm/cib.xml $1/ 2>/dev/null
+ touch $1/STOPPED
fi
[ -f "$1/cib.xml" ] &&
crm_verify -V -x $1/cib.xml >$1/crm_verify.txt 2>&1
}
#
# remove values of sensitive attributes
#
# this is not proper xml parsing, but it will work under the
# circumstances
sanitize_xml_attrs() {
sed $(
for patt in $SANITIZE; do
echo "-e /name=\"$patt\"/s/value=\"[^\"]*\"/value=\"****\"/"
done
)
}
sanitize_hacf() {
awk '
$1=="stonith_host"{ for( i=5; i<=NF; i++ ) $i="****"; }
{print}
'
}
sanitize_one() {
file=$1
compress=""
echo $file | grep -qs 'gz$' && compress=gzip
echo $file | grep -qs 'bz2$' && compress=bzip2
if [ "$compress" ]; then
decompress="$compress -dc"
else
compress=cat
decompress=cat
fi
tmp=`maketempfile` && ref=`maketempfile` ||
fatal "cannot create temporary files"
touch -r $file $ref # save the mtime
if [ "`basename $file`" = ha.cf ]; then
sanitize_hacf
else
$decompress | sanitize_xml_attrs | $compress
fi < $file > $tmp
mv $tmp $file
touch -r $ref $file
rm -f $ref
}
#
# keep the user posted
#
fatal() {
- echo "ERROR: $*" >&2
+ echo "`uname -n`: ERROR: $*" >&2
exit 1
}
warning() {
- echo "WARN: $*" >&2
+ echo "`uname -n`: WARN: $*" >&2
}
info() {
- echo "INFO: $*" >&2
+ echo "`uname -n`: INFO: $*" >&2
}
pickfirst() {
for x; do
which $x >/dev/null 2>&1 && {
echo $x
return 0
}
done
return 1
}
-#
-# run a command everywhere
-#
-forall() {
- c="$*"
- for n in `getnodes`; do
- if [ "$n" = "`uname -n`" ]; then
- $c
- else
- if [ "$SSH_USER" ]; then
- echo $c | ssh $SSH_OPTS $SSH_USER@$n
- fi
- fi
- done
-}
-
#
# get some system info
#
distro() {
which lsb_release >/dev/null 2>&1 && {
lsb_release -d
return
}
relf=`ls /etc/debian_version 2>/dev/null` ||
relf=`ls /etc/slackware-version 2>/dev/null` ||
relf=`ls -d /etc/*-release 2>/dev/null` && {
for f in $relf; do
test -f $f && {
echo "`ls $f` `cat $f`"
return
}
done
}
warning "no lsb_release no /etc/*-release no /etc/debian_version"
}
hb_ver() {
which dpkg > /dev/null 2>&1 && {
dpkg-query -f '${Version}' -W heartbeat 2>/dev/null ||
dpkg-query -f '${Version}' -W heartbeat-2
return
}
which rpm > /dev/null 2>&1 && {
rpm -q --qf '%{version}' heartbeat
return
}
# more packagers?
}
crm_info() {
$HA_BIN/crmd version 2>&1
}
