diff --git a/tools/README.hb_report b/tools/README.hb_report
index cbe12e04c0..7879e9eb93 100644
--- a/tools/README.hb_report
+++ b/tools/README.hb_report
@@ -1,297 +1,297 @@
 Heartbeat reporting
 ===================
 Dejan Muhamedagic
 v1.0

 `hb_report` is a utility to collect all information relevant to
 Heartbeat over the given period of time.

 Quick start
 -----------

 Run `hb_report` on one of the nodes or on the host which serves as
 a central log server. Run `hb_report` without parameters to see usage.

 A few examples:

 1. Last night during the backup there were several warnings
 encountered (logserver is the log host):
 +
-	logserver# /usr/share/heartbeat/hb_report -f 3:00 -t 4:00 /tmp/report
+	logserver# hb_report -f 3:00 -t 4:00 /tmp/report
 +
 collects everything from all nodes from 3am to 4am last night.
 The files are stored in /tmp/report and compressed to a tarball
 /tmp/report.tar.gz.

 2. Just found a problem during testing:

 	node1# date : note the current time
 	node1# /etc/init.d/heartbeat start
 	node1# nasty_command_that_breaks_things
 	node1# sleep 120 : wait for the cluster to settle
-	node1# /usr/share/heartbeat/hb_report -f time /tmp/hb1
+	node1# hb_report -f time /tmp/hb1

 Introduction
 ------------

 Managing clusters is cumbersome. Heartbeat v2 with its numerous
 configuration files and multi-node clusters just adds to the
 complexity. No wonder then that most problem reports were less
 than optimal. This is an attempt to rectify that situation and
 make life easier for both the users and the developers.

 On security
 -----------

 `hb_report` is a fairly complex program. As some of you are
 probably going to run it as root, let us state a few important
 things you should keep in mind:

 1. Don't run `hb_report` as root! It is fairly simple to set
 things up in such a way that root access is not needed. I won't
 go into details, except to stress that all information collected
 should be readable by accounts belonging to the haclient group.

 2. If you still have to run this as root.
 Well, don't use the `-C` option.

 3. Of course, every possible precaution has been taken not to
 disturb processes, or touch or remove files outside the given
 destination directory. If you (by mistake) specify an existing
 directory, `hb_report` will bail out soon. If you specify a
 relative path, it won't work either.

 The final product of `hb_report` is a tarball. However, the
 destination directory is not removed on any node, unless the user
 specifies `-C`. If you're too lazy to clean up the previous run,
 do yourself a favour and just supply a new destination directory.
 You've been warned. If you worry about the space used, just put
 all your directories under /tmp and set up a cronjob to remove
 those directories once a week:
 ..........
 	for d in /tmp/*; do
 		test -d "$d" || continue
 		test -f "$d/description.txt" || test -f "$d/.env" || continue
 		grep -qs 'By: hb_report' "$d/description.txt" ||
 			grep -qs '^UNIQUE_MSG=Mark' "$d/.env" || continue
 		rm -r "$d"
 	done
 ..........

 Mode of operation
 -----------------

 Cluster data collection is straightforward: just run the same
 procedure on all nodes and collect the reports. There is, apart
 from many small ones, one large complication: a central syslog
 destination. So, in order to allow this to be fully automated,
 we should sometimes run the procedure on the log host too.
 Actually, if there is a log host, then the best way is to run
 `hb_report` there.

 We use ssh for the remote program invocation. Even though it is
 possible to run `hb_report` without ssh by doing a more menial
 job, the overall user experience is much better if ssh works.
 Anyway, how else do you manage your cluster?

 Another ssh related point: In case your security policy
 proscribes loghost-to-cluster-over-ssh communications, then
 you'll have to copy the log file to one of the nodes and point
 `hb_report` to it.

 Prerequisites
 -------------

 1. ssh
 +
 This is not strictly required, but you won't regret having a
 password-less ssh.
 It is not too difficult to set up and will save you
 a lot of time. If you can't have it, for example because your
 security policy does not allow such a thing, or you just prefer
 menial work, then you will have to resort to the semi-manual,
 semi-automated report generation. See below for instructions.

 2. Times
 +
 In order to find files and messages in the given period and to
 parse the `-f` and `-t` options, `hb_report` uses perl and one of the
 `Date::Parse` or `Date::Manip` perl modules. Note that you need
 only one of these.
 +
 On rpm based distributions, you can find `Date::Parse` in
 `perl-TimeDate` and on Debian and its derivatives in
 `libtimedate-perl`.

 3. Core dumps
 +
 To backtrace core dumps, gdb is needed along with the Heartbeat
 packages containing the debugging info. The debug info packages
 may be installed at the time the report is created. Let's hope
 that you will need this only rarely.

 What is in the report
 ---------------------

 1. Heartbeat related
 - heartbeat version/release information
 - heartbeat configuration (CIB, ha.cf, logd.cf)
 - heartbeat status (output from crm_mon, crm_verify, ccm_tool)
 - pengine transition graphs (if any)
 - backtraces of core dumps (if any)
 - heartbeat logs (if any)

 2. System related
 - general platform information (`uname`, `arch`, `distribution`)
 - system statistics (`uptime`, `top`, `ps`)

 3. User created :)
 - problem description (template to be edited)

 4. Generated
 - problem analysis (generated)

 It is preferred that Heartbeat is running at the time of the
 report, but not absolutely required. `hb_report` will also do a
 quick analysis of the collected information.

 Times
 -----

 Specifying times can at times be a nuisance. That is why we have
 chosen to use one of the perl modules--they do allow certain
 freedom when specifying dates. You can either read the instructions
 at the
 http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES[Date::Parse
 examples page]
 or just rely on common sense and try stuff like:

 	3:00          (today at 3am)
 	15:00         (today at 3pm)
 	2007/9/1 2pm  (September 1st at 2pm)

 `hb_report` will (probably) complain if it can't figure out what
 you mean.

 Try to delimit the event as closely as possible in order to reduce
 the size of the report, but still leaving a minute or two around
 for good measure.

 Note that `-f` is not an optional option. And don't forget to quote
 dates when they contain spaces.

 Should I send all this to the rest of the Internet?
 ---------------------------------------------------

 We make an effort to remove sensitive data from the Heartbeat
 configuration (CIB, ha.cf, and transition graphs). However, you
 _have_ to tell us what is sensitive! Use the `-p` option to specify
 additional regular expressions to match variable names which may
 contain information you don't want to leak. For example:

 	# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

 We look by default for variable names matching "pass.*" and the
 stonith_host ha.cf directive.

 Logs and other files are not filtered. Please filter them
 yourself if necessary.

 Logs
 ----

 It may be tricky to find syslog logs. The scheme used is to log a
 unique message on all nodes and then look it up in the usual
 syslog locations. This procedure is not foolproof, in particular
 if the syslog files are in a non-standard directory. We look in
 /var/log /var/logs /var/syslog /var/adm /var/log/ha
 /var/log/cluster. In case we can't find the logs, please supply
 their location:

 	# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1

 If you have different log locations on different nodes, well,
 perhaps you'd like to make them the same. Or read about the
 manual report collection.

 The log files are collected from all hosts where found. In case
 your syslog is configured to log to both the log server and local
 files and `hb_report` is run on the log server, you will end up
 with multiple logs with the same content.

 Files starting with "ha-" are preferred.
 In case syslog sends
 messages to more than one file, if one of them is named ha-log or
 ha-debug those will be favoured over syslog or messages.

 If there is no separate log for Heartbeat, possibly unrelated
 messages from other programs are included. We don't filter logs,
 just pick a segment for the period you specified.

 NB: Don't have a central log host? Read the CTS README and set
 one up.

 Manual report collection
 ------------------------

 So, your ssh doesn't work. In that case, you will have to run
 this procedure on all nodes. Use `-S` so that we don't bother
 with ssh:

 	# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

 If you also have a log host which is not in the cluster, then
 you'll have to copy the log to one of the nodes and tell us where
 it is:

 	# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1

 Furthermore, to prevent `hb_report` from asking you to edit the
 report to describe the problem on every node, use `-D` on all but
 one:

 	# hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1

 If you reconsider and want the ssh setup, take a look at the CTS
 README file for instructions.

 Analysis
 --------

 The point of analysis is to extract the most important
 information from probably several thousand lines worth of text.
 Perhaps this should more properly be called a report review, as it
 is rather simple, but let's pretend that we are doing something
 utterly sophisticated.

 The analysis consists of the following:

 - compare files coming from different nodes; if they are equal,
 make one copy in the top level directory, remove duplicates,
 and create soft links instead
 - print errors, warnings, and lines matching `-L` patterns from logs
 - report if there were coredumps and by whom
 - report crm_verify results

 The goods
 ---------

 1. Common
 +
 - ha-log (if found on the log host)
 - description.txt (template and user report)
 - analysis.txt

 2. Per node
 +
 - ha.cf
 - logd.cf
 - ha-log (if found)
 - cib.xml (`cibadmin -Ql` or `cp` if Heartbeat is not running)
 - ccm_tool.txt (`ccm_tool -p`)
 - crm_mon.txt (`crm_mon -1`)
 - crm_verify.txt (`crm_verify -V`)
 - pengine/ (only on DC, directory with pengine transitions)
 - sysinfo.txt (static info)
 - sysstats.txt (dynamic info)
 - backtraces.txt (if coredumps found)
 - DC (well...)

diff --git a/tools/hb_report.in b/tools/hb_report.in
index bf4146f52d..c02a3df378 100755
--- a/tools/hb_report.in
+++ b/tools/hb_report.in
@@ -1,600 +1,608 @@
 #!/bin/sh

 # Copyright (C) 2007 Dejan Muhamedagic
 #
 # This program is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public
 # License as published by the Free Software Foundation; either
 # version 2.1 of the License, or (at your option) any later version.
 #
 # This software is distributed in the hope that it will be useful,
 # but WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 # General Public License for more details.
 #
 # You should have received a copy of the GNU General Public
 # License along with this library; if not, write to the Free Software
 # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 #

 . @sysconfdir@/ha.d/shellfuncs
 . $HA_NOARCHBIN/utillib.sh

 PROG=`basename $0`
+# FIXME: once this is part of the package!
+PROGDIR=`dirname $0`
+echo "$PROGDIR" | grep -qs '^/' || {
+	test -f @sbindir@/$PROG &&
+		PROGDIR=@sbindir@
+	test -f $HA_NOARCHBIN/$PROG &&
+		PROGDIR=$HA_NOARCHBIN
+}

 LOGD_CF=`findlogdcf @sysconfdir@ $HA_DIR`
 export LOGD_CF

 : ${SSH_OPTS="-T -o Batchmode=yes"}
 LOG_PATTERNS="CRIT: ERROR:"

 #
 # the instance where user runs hb_report is the master
 # the others are slaves
 #
 if [ x"$1" = x__slave ]; then
 	SLAVE=1
 fi

 #
 # if this is the master, allow ha.cf and logd.cf in the current dir
 # (because often the master is the log host)
 #
 if [ "$SLAVE" = "" ]; then
 	[ -f ha.cf ] && HA_CF=ha.cf
 	[ -f logd.cf ] && LOGD_CF=logd.cf
 fi

 usage() {
 	cat< $DESTDIR/.env"
 	done
 }
 start_remote_collectors() {
 	for node in `getnodes`; do
 		[ "$node" = "$WE" ] && continue
-		ssh $SSH_OPTS $SSH_USER@$node "$HA_NOARCHBIN/hb_report __slave $DESTDIR" |
+		ssh $SSH_OPTS $SSH_USER@$node "$PROGDIR/hb_report __slave $DESTDIR" |
 			(cd $DESTDIR && tar xf -) &
 		SLAVEPIDS="$SLAVEPIDS $!"
 	done
 }

 #
 # does ssh work?
 #
 findsshuser() {
 	for n in `getnodes`; do
 		[ "$n" = "$WE" ] && continue
 		trysshusers $n $TRY_SSH && break
 	done
 }
 checkssh() {
 	for n in `getnodes`; do
 		[ "$n" = "$WE" ] && continue
 		checksshuser $n $SSH_USER || return 1
 	done
 	return 0
 }

 #
 # the usual stuff
 #
 getbacktraces() {
 	flist=`find_files $HA_VARLIB/cores $1 $2`
 	[ "$flist" ] &&
 		getbt $flist > $3
 }
 getpeinputs() {
 	n=`basename $3`
 	flist=$(
 		if [ -f $3/ha-log ]; then
 			grep " $n peng.*PEngine Input stored" $3/ha-log | awk '{print $NF}'
 		else
 			find_files $HA_VARLIB/pengine $1 $2
 		fi | sed "s,$HA_VARLIB/,,g"
 	)
 	[ "$flist" ] &&
 		(cd $HA_VARLIB && tar cf - $flist) | (cd $3 && tar xf -)
 }
 touch_DC_if_dc() {
 	dc=`crmadmin -D 2>/dev/null | awk '{print $NF}'`
 	if [ "$WE" = "$dc" ]; then
 		touch $1/DC
 	fi
 }

 #
 # some basic system info and stats
 #
 sys_info() {
 	echo "Heartbeat version: `hb_ver`"
 	crm_info
 	echo "Platform: `uname`"
 	echo "Kernel release: `uname -r`"
 	echo "Architecture: `arch`"
 	[ `uname` = Linux ] && echo "Distribution: `distro`"
 }
 sys_stats() {
 	set -x
 	uptime
 	ps axf
 	ps auxw
 	top -b -n 1
 	netstat -i
 	set +x
 }

 #
 # replace sensitive info with '****'
 #
 sanitize() {
 	for f in $1/ha.cf $1/cib.xml $1/pengine/*; do
 		[ -f "$f" ] && sanitize_one $f
 	done
 }

 #
 # remove duplicates if files are same, make links instead
 #
 consolidate() {
 	for n in `getnodes`; do
 		if [ -f $1/$2 ]; then
 			rm $1/$n/$2
 		else
 			mv $1/$n/$2 $1
 		fi
 		ln -s ../$2 $1/$n
 	done
 }

 #
 # some basic analysis of the report
 #
 checkcrmvfy() {
 	for n in `getnodes`; do
 		if [ -s $1/$n/crm_verify.txt ]; then
 			echo "WARN: crm_verify reported warnings at $n:"
 			cat $1/$n/crm_verify.txt
 		fi
 	done
 }
 checkbacktraces() {
 	for n in `getnodes`; do
 		[ -s $1/$n/backtraces.txt ] && {
 			echo "WARN: coredumps found at $n:"
 			egrep 'Core was generated|Program terminated' \
 				$1/$n/backtraces.txt |
 				sed 's/^/ /'
 		}
 	done
 }
 checklogs() {
 	logs=`find $1 -name ha-log`
 	[ "$logs" ] || return
 	pattfile=`maketempfile` ||
 		fatal "cannot create temporary files"
 	for p in $LOG_PATTERNS; do
 		echo "$p"
 	done > $pattfile
 	echo ""
 	echo "Log patterns:"
 	for n in `getnodes`; do
 		cat $logs | grep -f $pattfile
 	done
 	rm -f $pattfile
 }

 #
 # check if files have same content in the cluster
 #
 cibdiff() {
 	crm_diff -c -n $1 -o $2
 }
 txtdiff() {
 	diff $1 $2
 }
 diffcheck() {
 	case `basename $1` in
 	ccm_tool.txt)
 		txtdiff $1 $2;; # worddiff?
 	cib.xml)
 		cibdiff $1 $2;;
 	ha.cf)
 		txtdiff $1 $2;; # confdiff?
 	crm_mon.txt|sysinfo.txt)
 		txtdiff $1 $2;;
 	esac
 }
 analyze_one() {
 	rc=0
 	node0=""
 	for n in `getnodes`; do
 		if [ "$node0" ]; then
 			diffcheck $1/$node0/$2 $1/$n/$2
 			rc=$((rc+$?))
 		else
 			node0=$n
 		fi
 	done
 	return $rc
 }
 analyze() {
 	flist="ccm_tool.txt cib.xml crm_mon.txt ha.cf sysinfo.txt"
 	for f in $flist; do
 		perl -e "printf \"Diff $f... \""
 		ls $1/*/$f >/dev/null 2>&1 || continue
 		if analyze_one $1 $f; then
 			echo "OK"
 			consolidate $1 $f
 		else
 			echo "varies"
 		fi
 	done
 	checkcrmvfy $1
 	checkbacktraces $1
 	checklogs $1
 }

 #
 # description template, editing, and other notes
 #
 mktemplate() {
 	cat<=100{exit 1}' || cat < $DESTDIR/$WE/ha-log
 	else
 		cat > $DESTDIR/ha-log # we are log server, probably
 	fi
 else
 	warning "could not find the log file on $WE"
 fi

 #
 # part 6: get all other info (config, stats, etc)
 #
 if [ "$THIS_IS_NODE" ]; then
 	getconfig $DESTDIR/$WE
 	getpeinputs $FROM_TIME $TO_TIME $DESTDIR/$WE
 	getbacktraces $FROM_TIME $TO_TIME $DESTDIR/$WE/backtraces.txt
 	touch_DC_if_dc $DESTDIR/$WE
 	sanitize $DESTDIR/$WE
 	sys_info > $DESTDIR/$WE/sysinfo.txt
 	sys_stats > $DESTDIR/$WE/sysstats.txt 2>&1
 fi

 #
 # part 7: endgame:
 # slaves tar their results to stdout, the master waits
 # for them, analyses results, asks the user to edit the
 # problem description template, and prints final notes
 #
 if [ "$SLAVE" ]; then
 	(cd $DESTDIR && tar cf - $WE)
 else
 	wait $SLAVEPIDS
 	analyze $DESTDIR > $DESTDIR/analysis.txt
 	mktemplate > $DESTDIR/description.txt
 	[ "$NO_DESCRIPTION" ] || {
 		echo press enter to edit the problem description...
 		read junk
 		edittemplate $DESTDIR/description.txt
 	}
 	cd $DESTDIR/..
 	tar czf $DESTDIR.tar.gz $DESTDIR/
 	finalword
 	checksize
 fi

 [ "$REMOVE_DEST" ] && rm -r $DESTDIR