diff --git a/README-testing b/README-testing
index 71994f5..c5aadfe 100644
--- a/README-testing
+++ b/README-testing
@@ -1,226 +1,231 @@
 There's a booth-test package which contains the following types of
 tests.
 
 It installs the necessary files into `/usr/share/booth/tests`.
 
 === Live tests (booth operation)
 
 BEWARE: Run this with _test_ clusters only!
 
 The live testing utility tests booth operation using the given
 `booth.conf`:
 
     $ /usr/share/booth/tests/test/live_test.sh booth.conf
 
 It is possible to run only specific tests. Run the script without
 arguments to see usage, the list of tests, and the netem network
 emulation functions.
 
 There are some restrictions on how booth.conf is formatted. Several
 tickets may be defined, and all of them will be tested, one after
 another (each ticket is tested separately). The tickets must have the
 expire and timeout parameters configured.
 
 Example booth.conf:
 
 ------------
 transport="UDP"
 port="9929"
 arbitrator="10.2.12.53"
 arbitrator="10.2.13.82"
 site="10.2.12.101"
 site="10.2.13.101"
 site="10.121.187.99"
 ticket="ticket-A"
     expire = 30
     timeout = 3
     retries = 3
     before-acquire-handler = /usr/share/booth/service-runnable d-src1
 ------------
 
 A split-brain condition is also tested. For that to work, all sites
 need `iptables` installed. The supplied script `booth_path` is used to
 manipulate iptables rules.
 
 ==== Pacemaker configuration
 
 This is a sample pacemaker configuration for a single-node cluster:
 
     primitive booth ocf:pacemaker:booth-site
     primitive d-src1 ocf:heartbeat:Dummy
     rsc_ticket global-d-src1 ticket-A: d-src1
 
 Additionally, you may add an ocf:booth:sharedrsc resource to check
 that the ticket is always granted to only one site:
 
     primitive shared ocf:booth:sharedrsc \
         params dir="10.2.13.82:/var/tmp/boothtestdir"
     rsc_ticket global-shared ticket-A: shared
 
 Please adjust to your environment.
 
 ==== Network environment emulation
 
 To introduce packet loss or network delays, set the NETEM_ENV
 environment variable. There are currently three netem network
 emulation settings supported:
 
 - loss: all servers emulate packet loss (30% by default)
 
 - single_loss: the first site in the configuration emulates packet
   loss (30% by default)
 
 - net_delay: all servers emulate packet delay (100ms by default, with
   random variation of 10%)
 
 A setting's parameter can be supplied by appending ':' and the value
 to the emulator name. For instance:
 
     # NETEM_ENV=loss:50 /usr/share/booth/tests/test/live_test.sh booth.conf
 
 It is not necessary to run the test script on one of the sites. Just
 copy the script and make the test `booth.conf` available locally:
 
     $ scp testsite:/usr/share/booth/tests/test/live_test.sh .
     $ scp testsite:/etc/booth/booth.conf .
     $ sh live_test.sh booth.conf
 
 You need at least two sites and one arbitrator.
 
 The configuration can contain just one ticket. It is not necessary to
 configure the `before-acquire-handler`.
 
 Notes:
 
 - (BEWARE!) the supplied configuration file is copied to
   /etc/booth/booth.conf on all sites/arbitrators, thus overwriting any
   existing configuration
 
 - the utility uses ssh to manage booth at all sites/arbitrators and
   logs in as user `root`
 
 - ssh public key authentication must work without prompting for a
   passphrase (otherwise it is impractical)
 
 - the log file is ./test_booth.log (it is actually a shell trace, with
   timestamps if you're running bash)
 
 - if one of the tests fails, an hb_report is created
 
 If you want to open a bug report, please attach all hb_reports and
 `test_booth.log`.
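+
+As a recap of the above (an illustration only; adjust the configuration
+path and the netem value to your environment), a run from a local copy
+of the script that exercises the network-delay emulation could look
+like this:
+
+------------
+$ scp testsite:/usr/share/booth/tests/test/live_test.sh .
+$ scp testsite:/etc/booth/booth.conf .
+$ NETEM_ENV=net_delay:200 sh live_test.sh booth.conf
+------------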
 
 === Simple tests (commandline, config file)
 
 Run (as non-root)
 
     # make check
 
 or
 
     # make test/runtests.py
     # python test/runtests.py
 
 to run the tests written in python.
 
 It is also possible to run the tests as root when the
 "--allow-root-user" parameter is used or if the BOOTH_RUNTESTS_ROOT_USER
 environment variable is defined.
+By default the tests use a TCP port derived from the current PID, in
+the range from 9929 to 10937, so that multiple instances can run in
+parallel. Pass the "--single-instance" parameter or define the
+BOOTH_RUNTESTS_SINGLE_INSTANCE environment variable to make the tests
+use only the single port 9929; parallel instances will then fail.
+(Example invocations are recapped at the end of this file.)
 
 === Unit tests
 
 These use gdb and pexpect to set the boothd state to some configured
 value, inject some input, and look at the output.
 
     # python script/unit-test.py src/boothd unit-tests/
 
 Or, if using the 'booth-test' RPM,
 
     # python unit-test.py src/boothd unit-tests/
 
 This must (currently?) be run as a non-root user; another optional
 argument is the test to start from, e.g. '003'.
 
 Basically, boothd is started with the config file
 `unit-tests/booth.conf`, and gdb gets attached to it. Then, some ticket
 state is set, incoming messages are delivered, and the outgoing
 messages and the resulting state are compared to the expected values.
 
 `unit-tests/_defaults.txt` has default values for the initial state and
 message data.
 
 Each test file consists of headers and key/value pairs:
 
 --------------------
 ticket:
     state ST_STABLE
 
 message0: # optional comment for the log file
     header.cmd OP_ACCEPTING
     ticket.id "asdga"
 
 outgoing0:
     header.cmd OP_PREPARING
     last_ack_ballot 42
 
 finally:
     new_ballot 1234
 --------------------
 
 A few details about the above example:
 
 * Ticket states in RAM (`ticket`, `finally`) are written in
   host-endianness.
 * Message data (`messageN`, `outgoingN`) are automatically converted
   via `htonl` and `ntohl`, respectively. They are delivered/checked in
   the order defined by the integer `N` component.
 * Strings are done via `strcpy()`
 * `ticket` and `messageN` are assignment chunks
 * `finally` and `outgoingN` are compare chunks
 * In `outgoingN` you can check _both_ message data (keys with a `.` in
   them) and ticket state
 * Symbolic names are usable, GDB translates them for us
 * The test scripts in `unit-tests/` need to be named with 3 digits, an
   underscore, some text, and `.txt`
 * The "fake" `crm_ticket` script gets the current test via `UNIT_TEST`;
   test scripts can pass additional information via `UNIT_TEST_AUX`.
 
 ==== Tips and Hints
 
 There's another special header: `gdb__N__`. These lines are sent to GDB
 after injecting a message, but before waiting for an outgoing line.
 
 Values that contain `§` are sent as multiple lines to GDB. This means
 that a stanza like
 
 --------------------
 gdb0:
     watch booth_conf->ticket[0].owner § commands § bt § c § end
 --------------------
 
 will cause a watchpoint to be set, and when it is triggered a backtrace
 (`bt`) is written to the log file.
 
 This makes it easy to ask for additional data or check for a call-chain
 when hitting bugs that can be reproduced via such a unit-test.
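+
+To recap the run modes of the simple tests described above (shown
+purely as examples): the default invocation picks a per-PID port and is
+safe to run in parallel, while either of the other two forms forces the
+single fixed port 9929.
+
+------------
+$ python test/runtests.py
+$ python test/runtests.py --single-instance
+$ BOOTH_RUNTESTS_SINGLE_INSTANCE=1 python test/runtests.py
+------------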
 
 # vim: set ft=asciidoc :
diff --git a/test/boothtestenv.py.in b/test/boothtestenv.py.in
index 12d9807..26a40cb 100644
--- a/test/boothtestenv.py.in
+++ b/test/boothtestenv.py.in
@@ -1,71 +1,75 @@
 import os
 import subprocess
 import time
 import tempfile
 import unittest
 
 from assertions import BoothAssertions
+from utils import use_single_instance
 
 class BoothTestEnvironment(unittest.TestCase, BoothAssertions):
     abs_test_src_path = os.path.abspath('TEST_SRC_DIR')
     example_config_path = os.path.join(abs_test_src_path, '../conf/booth.conf.example')
     abs_test_build_path = os.path.abspath('TEST_BUILD_DIR')
     boothd_path = os.path.join(abs_test_build_path, '../src/boothd')
 
     def setUp(self):
         if not self._testMethodName.startswith('test_'):
             raise RuntimeError("unexpected test method name: " + self._testMethodName)
         self.test_name = self._testMethodName[5:]
         self.test_path = os.path.join(self.test_run_path, self.test_name)
         os.makedirs(self.test_path)
         # Give all users permissions for the temp directory so boothd running
         # as "hacluster" can delete the lock file
         if os.geteuid() == 0:
             os.chmod(self.test_path, 0o777)
 
-        self.ensure_boothd_not_running()
+        # It's not a good idea to kill other instances, so only call the
+        # following function when single-instance mode is used
+        if use_single_instance():
+            self.ensure_boothd_not_running()
 
     def ensure_boothd_not_running(self):
         # Need to redirect STDERR in case we're not root, in which
         # case netstat's -p option causes a warning.  However we only
         # want to kill boothd processes which we own; -p will list the
         # pid for those and only those, which is exactly what we want
         # here.
         subprocess.call("netstat -tpln 2>&1 | perl -lne 'm,LISTEN\s+(\d+)/boothd, and kill 15, $1'", shell=True)
 
     def get_tempfile(self, identity):
         tf = tempfile.NamedTemporaryFile(
             prefix='%s.%d.' % (identity, time.time()),
             dir=self.test_path,
             delete=False
         )
         return tf.name
 
     def init_log(self):
         self.log_file = self.get_tempfile('log')
         os.putenv('HA_debugfile', self.log_file) # See cluster-glue/lib/clplumbing/cl_log.c
 
     def read_log(self):
         if not os.path.exists(self.log_file):
             return ''
         l = open(self.log_file)
         msgs = ''.join(l.readlines())
         l.close()
         return msgs
 
     def check_return_code(self, pid, return_code, expected_exitcode):
         if return_code is None:
             print("pid %d still running" % pid)
             if expected_exitcode is not None:
                 self.fail("expected exit code %d, not long-running process" % expected_exitcode)
         else:
             print("pid %d exited with code %d" % (pid, return_code))
             if expected_exitcode is None:
                 msg = "should not exit"
             else:
                 msg = "should exit with code %s" % expected_exitcode
             msg += "\nLog follows (see %s)" % self.log_file
             msg += "\nN.B. expect mlockall/setscheduler errors when running tests non-root"
             msg += "\n-----------\n%s" % self.read_log()
             self.assertEqual(return_code, expected_exitcode, msg)
diff --git a/test/runtests.py.in b/test/runtests.py.in
index 687495c..3b93f62 100644
--- a/test/runtests.py.in
+++ b/test/runtests.py.in
@@ -1,80 +1,82 @@
 #!PYTHON_SHEBANG
 
 import os
 import shutil
 import sys
 import tempfile
 import time
 import unittest
 
 sys.path.append('TEST_SRC_DIR')
 sys.path.append('TEST_BUILD_DIR')
 
 from clienttests import ClientConfigTests
 from sitetests import SiteConfigTests
 #from arbtests import ArbitratorConfigTests
+from utils import use_single_instance
+
 if __name__ == '__main__':
     # Likely assumption for the root exclusion is the amount of risk
     # associated with what naturally accompanies root privileges:
     # - accidental overwrite (eventually also deletion) of unrelated,
     #   legitimate and perhaps vital files
     # - accidental termination of unrelated, legitimate and perhaps
     #   vital processes
     # - and so forth, possibly amplified with awkward parallel test
     #   suite run scenarios (containers partly sharing state, etc.)
     #
     # Nonetheless, there are cases like self-contained CI runs where
     # all these concerns are absent, so allow opt-in relaxing of this.
     # Alternatively, the config generator could inject particular
     # credentials for a booth process to use, but that might come too
     # late to address the above concerns reliably.
     if (os.geteuid() == 0 and
             "--allow-root-user" not in sys.argv and
             not(os.environ.get("BOOTH_RUNTESTS_ROOT_USER"))):
         sys.stderr.write("Must be run non-root; aborting.\n")
         sys.exit(1)
 
     tmp_path = '/tmp/booth-tests'
     if not os.path.exists(tmp_path):
         os.makedirs(tmp_path)
     test_run_path = tempfile.mkdtemp(prefix='%d.' % time.time(), dir=tmp_path)
     if os.geteuid() == 0:
         # Give all users at least rx permissions for the temp directory
         # so hacluster running booth can delete the lock file
         os.chmod(test_run_path, 0o755)
 
     suite = unittest.TestSuite()
     testclasses = [
         SiteConfigTests,
         #ArbitratorConfigTests,
         ClientConfigTests,
     ]
     for testclass in testclasses:
         testclass.test_run_path = test_run_path
         suite.addTests(unittest.TestLoader().loadTestsFromTestCase(testclass))
 
     runner_args = {
         #'verbosity' : 2,
     }
     major, minor, micro, releaselevel, serial = sys.version_info
     if major > 2 or (major == 2 and minor >= 7):
         # New in 2.7
         runner_args['buffer'] = True
         runner_args['failfast'] = True
         pass
 
-    if os.geteuid() != 0:
+    if os.geteuid() != 0 and use_single_instance():
         # not root, so safe
         # needed because old instances might still use the UDP port.
         os.system("killall boothd")
 
     runner = unittest.TextTestRunner(**runner_args)
     result = runner.run(suite)
 
     if result.wasSuccessful():
         shutil.rmtree(test_run_path)
         sys.exit(0)
     else:
         print("Left %s for debugging" % test_run_path)
         sys.exit(1)
diff --git a/test/serverenv.py b/test/serverenv.py
index 1b8300a..1892046 100644
--- a/test/serverenv.py
+++ b/test/serverenv.py
@@ -1,227 +1,232 @@
 import os
 import re
 import time
 
 from boothrunner import BoothRunner
 from boothtestenv import BoothTestEnvironment
-from utils import get_IP
+from utils import get_IP, use_single_instance
 
 class ServerTestEnvironment(BoothTestEnvironment):
     '''
     boothd site/arbitrator will hang in the setup phase while attempting
     to connect to an unreachable peer during ticket_catchup().  In a test
     environment we don't have any reachable peers.  Fortunately, we can
     still successfully launch a daemon by only listing our own IP in the
     config file.
     '''
     typical_config = """\
 # This is like the config in the manual
 transport="UDP"
 port="9929"
 # Here's another comment
 #arbitrator="147.2.207.14"
 site="147.4.215.19"
 #site="147.18.2.1"
 ticket="ticketA"
 ticket="ticketB"
 """
     site_re = re.compile('^site=".+"', re.MULTILINE)
     working_config = re.sub(site_re, 'site="%s"' % get_IP(), typical_config, 1)
 
+    if not use_single_instance():
+        # use a port based on the current PID
+        port_re = re.compile('^port=".+"', re.MULTILINE)
+        working_config = re.sub(port_re, 'port="%s"' % (9929 + (os.getpid() % 1009)), working_config, 1)
+
     def run_booth(self, expected_exitcode, expected_daemon,
                   config_text=None, config_file=None, lock_file=True,
                   args=(), debug=False, foreground=False):
         '''
         Runs boothd.  Defaults to using a temporary lock file and the
         standard config file path.
 
         There are four possible types of outcome:
 
         - boothd exits non-zero without launching a daemon (setup phase
           failed, e.g. due to invalid configuration file)
         - boothd exits zero after launching a daemon (successful operation)
         - boothd does not exit (running in foreground mode)
         - boothd does not exit (setup phase hangs, e.g. while attempting
           to connect to a peer during ticket_catchup())
 
         Arguments:
           config_text
             a string containing the contents of a configuration file to use
           config_file
             path to a configuration file to use
           lock_file
             False: don't pass a lockfile parameter to booth via -l
             True: pass a temporary lockfile parameter to booth via -l
             string: pass the given lockfile path to booth via -l
           args
             iterable of extra args to pass to booth
           expected_exitcode
             an integer, or False if booth is not expected to terminate
             within the timeout
           expected_daemon
             True iff a daemon is expected to be launched (this means
             running the server in foreground mode via -S; even though in
             this case the server's not technically a daemon, we still
             want to treat it like one by checking the lockfile before
             and after we kill it)
           debug
             True means pass the -D parameter
           foreground
             True means pass the -S parameter
 
         Returns a (pid, return_code, stdout, stderr, runner) tuple,
         where return_code/stdout/stderr are None iff pid is still running.
         '''
         if expected_daemon and expected_exitcode is not None and expected_exitcode != 0:
             raise RuntimeError("Shouldn't ever expect daemon to start and then failure")
 
         if not expected_daemon and expected_exitcode == 0:
             raise RuntimeError("Shouldn't ever expect success without starting daemon")
 
         self.init_log()
 
         runner = BoothRunner(self.boothd_path, self.mode, args)
 
         if config_text:
             config_file = self.write_config_file(config_text)
         if config_file:
             runner.set_config_file(config_file)
 
         if lock_file is True:
             lock_file = os.path.join(self.test_path, 'boothd-lock.pid')
         if lock_file:
             runner.set_lock_file(lock_file)
 
         if debug:
             runner.set_debug()
         if foreground:
             runner.set_foreground()
 
         runner.show_args()
         (pid, return_code, stdout, stderr) = runner.run(expected_exitcode)
         self.check_return_code(pid, return_code, expected_exitcode)
 
         if expected_daemon:
             self.check_daemon_handling(runner, expected_daemon)
         elif return_code is None:
             # This isn't strictly necessary because we ensure no
             # daemon is running from within test setUp(), but it's
             # probably a good idea to tidy up after ourselves anyway.
             self.kill_pid(pid)
 
         return (pid, return_code, stdout, stderr, runner)
 
     def write_config_file(self, config_text):
         config_file = self.get_tempfile('config')
         c = open(config_file, 'w')
         c.write(config_text)
         c.close()
         return config_file
 
     def kill_pid(self, pid):
         print("killing %d ..." % pid)
         os.kill(pid, 15)
         print("killed")
 
     # Wait for the lock file to appear if must_exist is True, or disappear
     # if must_exist is False, for a maximum of timeout seconds
     def wait_for_lock_file(self, lock_file, must_exist = True, timeout = 30):
         start = time.time()
         wait = 0.1
         while True:
             if must_exist and os.path.exists(lock_file) and os.path.getsize(lock_file) > 0:
                 return True
             if not must_exist and not os.path.exists(lock_file):
                 return True
             elapsed = time.time() - start
             if elapsed + wait > timeout:
                 wait = timeout - elapsed
             appear_str = "appear" if must_exist else "disappear"
             print("Waiting for lock file %s to %s for %.1fs ..." % (lock_file, appear_str, wait))
             time.sleep(wait)
             elapsed = time.time() - start
             if elapsed >= timeout:
                 return False
             wait *= 2
 
     def check_daemon_handling(self, runner, expected_daemon):
         '''
         Check that the lock file contains a pid referring to a running
         daemon.  Then kill the daemon, and ensure that the lock file
         vanishes (bnc#749763).
         '''
         self.wait_for_lock_file(runner.lock_file, True, 30)
         daemon_pid = self.get_daemon_pid_from_lock_file(runner.lock_file)
         err = "lock file should contain pid"
         if not expected_daemon:
             err += ", even though we didn't expect a daemon"
         self.assertTrue(daemon_pid is not None, err)
 
         daemon_running = self.is_pid_running_daemon(daemon_pid)
         err = "pid in lock file should refer to a running daemon"
         self.assertTrue(daemon_running, err)
 
         if daemon_running:
             self.kill_pid(int(daemon_pid))
             self.wait_for_lock_file(runner.lock_file, False, 30)
             time.sleep(1)
             daemon_pid = self.get_daemon_pid_from_lock_file(runner.lock_file)
             self.assertTrue(daemon_pid is None,
                             'bnc#749763: lock file should vanish after daemon is killed')
 
     def get_daemon_pid_from_lock_file(self, lock_file):
         '''
         Returns the pid contained in lock_file, or None if it doesn't exist.
         '''
         if not os.path.exists(lock_file):
             print("%s does not exist" % lock_file)
             return None
         l = open(lock_file)
         lines = l.readlines()
         l.close()
         self.assertEqual(len(lines), 1, "Lock file should contain one line")
         pid = re.search('\\bbooth_pid="?(\\d+)"?', lines[0]).group(1)
         print("lockfile contains: <%s>" % pid)
         return pid
 
     def is_pid_running_daemon(self, pid):
         '''
         Returns true iff the given pid refers to a running boothd process.
         '''
         path = "/proc/%s" % pid
         pid_running = os.path.isdir(path)
 
         # print "======"
         # import subprocess
         # print subprocess.check_output(['lsof', '-p', pid])
         # print subprocess.check_output(['ls', path])
         # print subprocess.check_output(['cat', "/proc/%s/cmdline" % pid])
         # print "======"
 
         if not pid_running:
             return False
 
         c = open("/proc/%s/cmdline" % pid)
         cmdline = "".join(c.readlines())
         print(cmdline)
         c.close()
         if cmdline.find('boothd') == -1:
             print('no boothd in cmdline:', cmdline)
             return False
         # self.assertRegexpMatches(
         #     cmdline,
         #     'boothd',
         #     "lock file should refer to pid of running boothd"
         # )
         return True
 
     def _test_buffer_overflow(self, expected_error, **args):
         (pid, ret, stdout, stderr, runner) = \
             self.run_booth(expected_exitcode=1, expected_daemon=False, **args)
         self.assertRegexpMatches(stderr, expected_error)
diff --git a/test/utils.py b/test/utils.py
index f872a22..2746556 100644
--- a/test/utils.py
+++ b/test/utils.py
@@ -1,14 +1,19 @@
 import socket
+import os
+import sys
 
 def get_IP():
     # IPv4 only for now
     s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
     try:
         s.connect(('147.4.215.19', 9929))
         ret = s.getsockname()[0]
     except:
         ret = '127.0.0.1'
     finally:
         s.close()
     return ret
+
+def use_single_instance():
+    return ("--single-instance" in sys.argv) or (os.environ.get("BOOTH_RUNTESTS_SINGLE_INSTANCE") is not None)
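+
+# Illustrative sketch only, not used by the tests themselves: serverenv.py
+# derives its per-instance port as 9929 + (os.getpid() % 1009), which is how
+# the 9929..10937 range mentioned in README-testing comes about.  A
+# hypothetical helper expressing the same arithmetic:
+def runtests_port(pid=None):
+    if pid is None:
+        pid = os.getpid()
+    return 9929 + (pid % 1009)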