diff --git a/README-testing b/README-testing
index d647d17..6d936f0 100644
--- a/README-testing
+++ b/README-testing
@@ -1,204 +1,210 @@
There's a booth-test RPM available that contains two types of tests.
It installs the necessary files into `/usr/share/booth/tests`.

=== Live tests (booth operation)

BEWARE: Run this with _test_ clusters only!

The live testing utility tests booth operation using the given
`booth.conf`:

$ /usr/share/booth/tests/test/live_test.sh booth.conf

It is possible to run only specific tests. Run the script without
arguments to see usage, the list of tests, and the netem network
emulation functions.

+There are some restrictions on how booth.conf is formatted.
+Several tickets may be defined, but only the first ticket is used
+for testing. This ticket must have the expire and timeout
+parameters configured.
+
Example booth.conf:

------------
transport="UDP"
port="9929"
arbitrator="10.2.12.53"
arbitrator="10.2.13.82"
site="10.2.12.101"
site="10.2.13.101"
site="10.121.187.99"
ticket="ticket-A"
    expire = 30
    timeout = 3
    retries = 3
    before-acquire-handler = /usr/share/booth/service-runnable d-src1
------------

A split brain condition is also tested. For that to work, all sites
need `iptables` installed. The supplied script `booth_path` is used
to manipulate iptables rules.

==== Pacemaker configuration

This is a sample Pacemaker configuration for a single-node cluster:

primitive booth ocf:pacemaker:booth-site
primitive d-src1 ocf:heartbeat:Dummy
+rsc_ticket global-d-src1 ticket-A: d-src1

Please adjust to your environment.

==== Network environment emulation

To introduce packet loss or network delays, set the NETEM_ENV
environment variable. The following netem network emulation settings
are supported (run the script without arguments for the full list):

- loss: all servers emulate packet loss (30% by default)
- single_loss: the first site in the configuration emulates packet
  loss (30% by default)
- net_delay: all servers emulate packet delay (100ms by default,
  with a random variation of 10%)

A value can be supplied by appending ':' and the value to the
emulation name. For instance:

# NETEM_ENV=loss:50 /usr/share/booth/tests/test/live_test.sh booth.conf

It is not necessary to run the test script on one of the sites. Just
copy the script and make the test `booth.conf` available locally:

$ scp testsite:/usr/share/booth/tests/test/live_test.sh .
$ scp testsite:/etc/booth/booth.conf .
$ sh live_test.sh booth.conf

You need at least two sites and one arbitrator. It is enough for the
configuration to contain just one ticket. It is not necessary to
configure the `before-acquire-handler`.

Notes:

- (BEWARE!) the supplied configuration file is copied to
  /etc/booth/booth.conf on all sites/arbitrators, overwriting any
  existing configuration
- the utility uses ssh to manage booth on all sites/arbitrators and
  logs in as user `root`
- ssh public key authentication must work without prompting for a
  passphrase, otherwise testing is impractical (see the example at
  the end of this section)
- the log file is ./test_booth.log (it is actually a shell trace,
  with timestamps if you're running bash)
- if one of the tests fails, an hb_report is created

If you want to open a bug report, please attach all hb_reports and
`test_booth.log`.
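
One way to set up the required passwordless root ssh access is
sketched below; the host names are placeholders for your sites and
arbitrators, so adjust them to your environment:

$ ssh-keygen -t ed25519
$ for h in site1 site2 arbitrator1; do ssh-copy-id root@$h; done
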
=== Simple tests (commandline, config file)

Run (as non-root)

$ python test/runtests.py

to run the tests written in python.

=== Unit tests

These use gdb and pexpect to set boothd state to some configured
value, inject some input, and look at the output.

$ python script/unit-test.py src/boothd unit-tests/

Or, if using the 'booth-test' RPM:

$ python unit-test.py src/boothd unit-tests/

This must (currently?) be run as a non-root user; another optional
argument is the test to start from, e.g. '003'.

Basically, boothd is started with the config file
`unit-tests/booth.conf`, and gdb gets attached to it. Then, some
ticket state is set, incoming messages are delivered, and the
outgoing messages and the resulting state are compared to the
expected values.

`unit-tests/_defaults.txt` has default values for the initial state
and message data.

Each test file consists of headers and key/value pairs:

--------------------
ticket:
    state           ST_STABLE

message0:           # optional comment for the log file
    header.cmd      OP_ACCEPTING
    ticket.id       "asdga"

outgoing0:
    header.cmd      OP_PREPARING
    last_ack_ballot 42

finally:
    new_ballot      1234
--------------------

A few details about the above example:

* Ticket states in RAM (`ticket`, `finally`) are written in host
  endianness.
* Message data (`messageN`, `outgoingN`) are automatically converted
  via `htonl` and `ntohl`, respectively. They are delivered/checked
  in the order defined by the integer `N` component.
* Strings are copied via `strcpy()`.
* `ticket` and `messageN` are assignment chunks.
* `finally` and `outgoingN` are compare chunks.
* In `outgoingN` you can check _both_ message data (keys with a `.`
  in them) and ticket state.
* Symbolic names are usable; GDB translates them for us.
* The test scripts in `unit-tests/` need to be named with 3 digits,
  an underscore, some text, and `.txt`.
* The "fake" `crm_ticket` script gets the current test via
  `UNIT_TEST`; test scripts can pass additional information via
  `UNIT_TEST_AUX`.

==== Tips and Hints

There's another special header: `gdb__N__`. These lines are sent to
GDB after injecting a message, but before waiting for an outgoing
line.

Values that contain `§` are sent as multiple lines to GDB. This
means that a stanza like

--------------------
gdb0:
    watch booth_conf->ticket[0].owner § commands § bt § c § end
--------------------

will cause a watchpoint to be set, and when it is triggered a
backtrace (`bt`) is written to the log file.

This makes it easy to ask for additional data or to check for a
call-chain when hitting bugs that can be reproduced via such a
unit-test.

# vim: set ft=asciidoc :
diff --git a/test/live_test.sh b/test/live_test.sh
index 9613012..2ba8b5a 100755
--- a/test/live_test.sh
+++ b/test/live_test.sh
@@ -1,1009 +1,1009 @@
#!/bin/sh
#
# see README-testing for more information
# do some basic booth operation tests for the given config
#
PROG=`basename $0`

usage() {
    cat<[:]] $PROG [ ...]
EOF
    if [ $1 -eq 0 ]; then
        list_all
        examples
    fi
    exit
}
list_all() {
    echo "Tests:"
    grep "^test_.*{$" $0 | sed 's/test_//;s/(.*//;s/^/ /'
    echo
    echo "Netem functions:"
    grep "^NETEM_ENV_.*{$" $0 | sed 's/NETEM_ENV_//;s/(.*//;s/^/ /'
}
examples() {
    cat</dev/null 2>&1
    for h in $arbitrators; do
        stop_arbitrator $h
        rc=$((rc|$?))
    done >/dev/null 2>&1
    wait_timeout
    return $rc
}
start_booth() {
    local h rc
    for h in $sites; do
        start_site $h
        rc=$((rc|$?))
    done >/dev/null 2>&1
    for h in $arbitrators; do
        start_arbitrator $h
        rc=$((rc|$?))
    done >/dev/null 2>&1
    wait_timeout
    return $rc
}
restart_booth() {
    local h procs
    for h in $sites; do
        restart_site $h &
        procs="$!
$procs" done >/dev/null 2>&1 for h in $arbitrators; do restart_arbitrator $h done >/dev/null 2>&1 wait $procs wait_timeout } is_we_server() { local h for h in $sites $arbitrators; do ip a l | fgrep -wq $h && return done return 1 } sync_conf() { local h rc=0 for h in $sites $arbitrators; do rsync -q $cnf $h:$run_cnf rc=$((rc|$?)) done return $rc } dump_conf() { echo "test configuration file $cnf:" grep -v '^#' $cnf | grep -v '^[[:space:]]*$' | sed "s/^/$cnf: /" } forall() { local h rc=0 for h in $sites $arbitrators; do runcmd $h $@ rc=$((rc|$?)) done return $rc } forall_sites() { local h rc=0 for h in $sites; do runcmd $h $@ rc=$((rc|$?)) done return $rc } forall_fun() { local h rc=0 f=$1 for h in $sites $arbitrators; do $f $h rc=$((rc|$?)) [ $rc -ne 0 ] && break done return $rc } # run on all hosts whatever function produced on stdout forall_fun2() { local h rc=0 f f=$1 shift 1 for h in $sites $arbitrators; do $f $@ | ssh $SSH_OPTS $h rc=$((rc|$?)) [ $rc -ne 0 ] && break done return $rc } run_site() { local n=$1 h shift 1 h=`echo $sites | awk '{print $'$n'}'` runcmd $h $@ } run_arbitrator() { local n=$1 h shift 1 h=`echo $arbitrators | awk '{print $'$n'}'` runcmd $h $@ } get_port() { grep "^port" | sed -n 's/.*="//;s/"//p' } get_servers() { grep "^$1" | sed -n 's/.*="//;s/"//p' } get_rsc() { awk '/before-acquire-handler/{print $NF}' $cnf } break_external_prog() { run_site $1 crm configure "location $PREFNAME `get_rsc` rule -inf: defined \#uname" } show_pref() { run_site $1 crm configure show $PREFNAME > /dev/null } repair_external_prog() { run_site $1 crm configure delete __pref_booth_live_test } get_tkt() { grep "^ticket=" | head -1 | sed 's/ticket=//;s/"//g' } get_tkt_settings() { awk ' -n && /^ / && /expire|timeout|renewal-freq/ { +n && /^[[:space:]]*(expire|timeout|renewal-freq)/ { sub(" = ", "=", $0); gsub("-", "_", $0); - sub("^ ", "T_", $0); + sub("^[[:space:]]*", "T_", $0); print next } -n && /^$/ {exit} +n && (/^$/ || /^ticket.*/) {exit} /^ticket.*'$tkt'/ {n=1} ' $cnf } wait_exp() { sleep $T_expire } wait_renewal() { sleep $T_renewal_freq } wait_timeout() { local t=2 [ "$T_timeout" -gt $t ] && t=$T_timeout sleep $t } set_netem_env() { local modfun args modfun=`echo $1 | sed 's/:.*//'` args=`echo $1 | sed 's/[^:]*//;s/:/ /g'` if ! is_function NETEM_ENV_$modfun; then echo "NETEM_ENV_$modfun: doesn't exist" exit 1 fi NETEM_ENV_$modfun $args } reset_netem_env() { [ -z "$NETEM_ENV" ] && return [ -n "$__NETEM_RESET" ] && return __NETEM_RESET=1 forall $0 $run_cnf __netem__ netem_reset } setup_netem() { [ -z "$NETEM_ENV" ] && return __NETEM_RESET= for env in $NETEM_ENV; do set_netem_env $env done trap "reset_netem_env" EXIT } cib_status() { local h=$1 stat stat=`runcmd $h crm_ticket -L | grep "^$tkt" | awk '{print $2}'` test "$stat" != "-1" } is_cib_granted() { local stat h=$1 stat=`runcmd $h crm_ticket -L | grep "^$tkt" | awk '{print $2}'` [ "$stat" = "granted" ] } check_cib_consistency() { local h gh="" rc=0 for h in $sites; do if is_cib_granted $h; then [ -n "$gh" ] && rc=1 # granted twice gh="$gh $h" fi done [ -z "$gh" ] && gh="none" if [ $rc -eq 0 ]; then echo $gh return $rc fi cat<= 0 ? x : -x; } } ' | sort -n | tail -1 } booth_leader_consistency() { test `booth_list_fld 2 | sort -u | wc -l` -eq 1 } check_booth_consistency() { local tlist rc maxdiff tlist=`forall booth list 2>/dev/null | grep $tkt | sed 's/commit:.*//;s/NONE/none/'` maxdiff=`echo "$tlist" | max_booth_time_diff` test "$maxdiff" -eq 0 rc=$? 
echo "$tlist" | booth_leader_consistency rc=$(($rc | $?<<1)) test $rc -eq 0 && return cat</dev/null wait_timeout } run_report() { local start_ts=$1 end_ts=$2 name=$3 local quick_opt="" logmsg "running hb_report" hb_report -Q 2>&1 | grep -sq "illegal.option" || quick_opt="-Q" hb_report $hb_report_opts $quick_opt -f "`date -d @$((start_ts-5))`" \ -t "`date -d @$((end_ts+60))`" \ -n "$sites $arbitrators" $name 2>&1 | logmsg } runtest() { local start_ts end_ts rc booth_status local start_time end_time local usrmsg TEST=$1 start_time=`date` start_ts=`date +%s` echo -n "Testing: $1... " can_run_test $1 || return 0 echo "starting booth test $1 ..." | logmsg if is_function setup_$1; then if ! setup_$1; then echo "setup test $1 failed" | logmsg return 1 fi fi setup_netem test_$1 rc=$? case $rc in 0) # wait a bit more if we're losing packets [ -n "$PKT_LOSS" ] && wait_timeout check_$1 rc=$? if [ $rc -eq 0 ]; then usrmsg="SUCCESS" else usrmsg="check FAIL: $rc" fi ;; $ERR_SETUP_FAILED) usrmsg="setup FAIL" ;; *) usrmsg="test FAIL: $rc" ;; esac end_time=`date` end_ts=`date +%s` echo "finished booth test $1 ($usrmsg)" | logmsg is_function recover_$1 && recover_$1 reset_netem_env sleep 3 all_booth_status booth_status=$? if [ $rc -eq 0 -a $booth_status -eq 0 ]; then echo OK [ "$GET_REPORT" ] && run_report $start_ts $end_ts $TEST else echo "$usrmsg (running hb_report ... $1.tar.bz2; see also $logf)" [ $booth_status -ne 0 ] && echo "unexpected: some booth daemons not running" run_report $start_ts $end_ts $TEST fi revoke_ticket } # # the tests # # most tests start by granting ticket grant_ticket() { run_site $1 booth grant -w $tkt >/dev/null } ## TEST: grant ## # just a grant test_grant() { grant_ticket 1 } check_grant() { check_consistency `get_site 1` } ## TEST: longgrant ## # just a grant followed by three expire times test_longgrant() { grant_ticket 1 || return $ERR_SETUP_FAILED wait_exp wait_exp wait_exp } check_longgrant() { check_consistency `get_site 1` } ## TEST: longgrant2 ## # just a grant followed by three expire times setup_longgrant2() { grant_ticket 1 || return $ERR_SETUP_FAILED } test_longgrant2() { local i for i in `seq 10`; do wait_exp done } check_longgrant2() { check_consistency `get_site 1` } ## TEST: grant_noarb ## # just a grant with no arbitrators test_grant_noarb() { local h for h in $arbitrators; do stop_arbitrator $h || return $ERR_SETUP_FAILED done >/dev/null 2>&1 sleep 1 grant_ticket 1 || return $ERR_SETUP_FAILED } check_grant_noarb() { check_consistency `get_site 1` } recover_grant_noarb() { local h for h in $arbitrators; do start_arbitrator $h done >/dev/null 2>&1 } applicable_grant_noarb() { [ -n "$arbitrators" ] } ## TEST: revoke ## # just a revoke test_revoke() { grant_ticket 1 || return $ERR_SETUP_FAILED revoke_ticket } check_revoke() { check_consistency } ## TEST: grant_elsewhere ## # just a grant to another site test_grant_elsewhere() { run_site 1 booth grant -w -s `get_site 2` $tkt >/dev/null } check_grant_elsewhere() { check_consistency `get_site 2` } ## TEST: grant_site_lost ## # grant with one site lost test_grant_site_lost() { stop_site `get_site 2` || return $ERR_SETUP_FAILED wait_timeout grant_ticket 1 || return $ERR_SETUP_FAILED check_cib `get_site 1` || return 1 wait_exp } check_grant_site_lost() { check_consistency `get_site 1` } recover_grant_site_lost() { start_site `get_site 2` } ## TEST: grant_site_reappear ## # grant with one site lost then reappearing test_grant_site_reappear() { stop_site `get_site 2` || return $ERR_SETUP_FAILED sleep 1 grant_ticket 1 || 
return $ERR_SETUP_FAILED check_cib `get_site 1` || return 1 wait_timeout start_site `get_site 2` || return $ERR_SETUP_FAILED wait_timeout wait_timeout } check_grant_site_reappear() { check_consistency `get_site 1` && is_cib_granted `get_site 1` } recover_grant_site_reappear() { start_site `get_site 2` } ## TEST: simultaneous_start_even ## # simultaneous start of even number of members test_simultaneous_start_even() { local serv grant_ticket 2 || return $ERR_SETUP_FAILED stop_booth || return $ERR_SETUP_FAILED wait_timeout for serv in $(echo $sites | sed "s/`get_site 1` //"); do start_site $serv & done for serv in $arbitrators; do start_arbitrator $serv & done wait_renewal start_site `get_site 1` wait_timeout wait_timeout } check_simultaneous_start_even() { check_consistency `get_site 2` } ## TEST: slow_start_granted ## # slow start test_slow_start_granted() { grant_ticket 1 || return $ERR_SETUP_FAILED stop_booth || return $ERR_SETUP_FAILED wait_timeout for serv in $sites; do start_site $serv wait_timeout done for serv in $arbitrators; do start_arbitrator $serv wait_timeout done } check_slow_start_granted() { check_consistency `get_site 1` } ## TEST: restart_granted ## # restart with ticket granted test_restart_granted() { grant_ticket 1 || return $ERR_SETUP_FAILED restart_site `get_site 1` || return $ERR_SETUP_FAILED wait_timeout } check_restart_granted() { check_consistency `get_site 1` } ## TEST: reload_granted ## # reload with ticket granted test_reload_granted() { grant_ticket 1 || return $ERR_SETUP_FAILED reload_site `get_site 1` || return $ERR_SETUP_FAILED wait_timeout } check_reload_granted() { check_consistency `get_site 1` } ## TEST: restart_granted_nocib ## # restart with ticket granted (but cib empty) test_restart_granted_nocib() { grant_ticket 1 || return $ERR_SETUP_FAILED stop_site_clean `get_site 1` || return $ERR_SETUP_FAILED wait_timeout start_site `get_site 1` || return $ERR_SETUP_FAILED wait_timeout wait_timeout wait_timeout } check_restart_granted_nocib() { check_consistency `get_site 1` } ## TEST: restart_notgranted ## # restart with ticket not granted test_restart_notgranted() { grant_ticket 1 || return $ERR_SETUP_FAILED stop_site `get_site 2` || return $ERR_SETUP_FAILED sleep 1 start_site `get_site 2` || return $ERR_SETUP_FAILED wait_timeout } check_restart_notgranted() { check_consistency `get_site 1` } ## TEST: failover ## # ticket failover test_failover() { grant_ticket 1 || return $ERR_SETUP_FAILED stop_site_clean `get_site 1` || return $ERR_SETUP_FAILED booth_status `get_site 1` && return $ERR_SETUP_FAILED wait_exp wait_timeout wait_timeout wait_timeout } check_failover() { check_consistency any } recover_failover() { start_site `get_site 1` } ## TEST: split_leader ## # split brain (leader alone) test_split_leader() { grant_ticket 1 || return $ERR_SETUP_FAILED run_site 1 $iprules stop $port >/dev/null wait_exp wait_timeout wait_timeout check_cib any || return 1 run_site 1 $iprules start $port >/dev/null wait_timeout wait_timeout wait_timeout } check_split_leader() { check_consistency any } recover_split_leader() { run_site 1 $iprules start $port >/dev/null } ## TEST: split_follower ## # split brain (follower alone) test_split_follower() { grant_ticket 1 || return $ERR_SETUP_FAILED run_site 2 $iprules stop $port >/dev/null wait_exp wait_timeout run_site 2 $iprules start $port >/dev/null wait_timeout } check_split_follower() { check_consistency `get_site 1` } ## TEST: split_edge ## # split brain (leader alone) test_split_edge() { grant_ticket 1 || return 
$ERR_SETUP_FAILED run_site 1 $iprules stop $port >/dev/null wait_exp run_site 1 $iprules start $port >/dev/null wait_timeout wait_timeout } check_split_edge() { check_consistency any } ## TEST: external_prog_failed ## # external test prog failed test_external_prog_failed() { grant_ticket 1 || return $ERR_SETUP_FAILED break_external_prog 1 show_pref 1 || return $ERR_SETUP_FAILED wait_renewal wait_timeout } check_external_prog_failed() { check_consistency any && [ `booth_where_granted` != `get_site 1` ] } recover_external_prog_failed() { repair_external_prog 1 } applicable_external_prog_failed() { [ -n `get_rsc` ] } # # environment modifications # # packet loss at one site 30% NETEM_ENV_single_loss() { run_site 1 $0 $run_cnf __netem__ netem_loss ${1:-30} PKT_LOSS=${1:-30} } # packet loss everywhere 30% NETEM_ENV_loss() { forall $0 $run_cnf __netem__ netem_loss ${1:-30} PKT_LOSS=${1:-30} } # network delay 100ms NETEM_ENV_net_delay() { forall $0 $run_cnf __netem__ netem_delay ${1:-100} } # duplicate packets NETEM_ENV_duplicate() { forall $0 $run_cnf __netem__ netem_duplicate ${1:-10} } # reorder packets NETEM_ENV_reorder() { forall $0 $run_cnf __netem__ netem_reorder ${1:-25} ${2:-50} } [ -f "$cnf" ] || { echo "ERROR: configuration file $cnf doesn't exist" usage 1 } sites=`get_servers site < $cnf` arbitrators=`get_servers arbitrator < $cnf` port=`get_port < $cnf` : ${port:=9929} site_cnt=`echo $sites | wc -w` arbitrator_cnt=`echo $arbitrators | wc -w` tkt=`get_tkt < $cnf` eval `get_tkt_settings` if [ "$1" = "__netem__" ]; then shift 1 _JUST_NETEM=1 local_netem_env $@ exit fi [ -z "$sites" ] && { echo no sites in $cnf usage 1 } [ -z "$T_expire" ] && { echo set $tkt expire time in $cnf usage 1 } if [ -z "$T_renewal_freq" ]; then T_renewal_freq=$((T_expire/2)) fi exec 2>$logf BASH_XTRACEFD=2 PS4='+ `date +"%T"`: ' set -x WE_SERVER="" is_we_server && WE_SERVER=1 PREFNAME=__pref_booth_live_test sync_conf || exit restart_booth all_booth_status || { start_booth all_booth_status || { echo "some booth servers couldn't be started" exit 1 } } revoke_ticket dump_conf | logmsg TESTS="$@" : ${TESTS:="grant longgrant grant_noarb grant_elsewhere grant_site_lost grant_site_reappear revoke simultaneous_start_even slow_start_granted restart_granted reload_granted restart_granted_nocib restart_notgranted failover split_leader split_follower split_edge external_prog_failed"} for t in $TESTS; do runtest $t done
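
# Illustrative invocations (the test names and values below are
# examples only; see README-testing and the usage output for details):
#
#   sh live_test.sh booth.conf                   # run the default test set
#   sh live_test.sh booth.conf grant failover    # run only selected tests
#   NETEM_ENV=loss:50 sh live_test.sh booth.conf # run with 50% packet loss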