This chapter introduces the IRIX Checkpoint and Restart (IRIX CPR) facility. It describes how to checkpoint and restart a process, and how to set IRIX CPR control options.
IRIX Checkpoint and Restart (CPR) is a facility for saving a running process or set of processes and, at some later time, restarting the saved process or processes from the point already reached, without starting all over again. The checkpoint image is saved in a set of disk files, and restarted by reading the saved state from these files to resume execution.
The cpr command provides a command-line interface for checkpointing, restarting checkpointed processes, checking the status of checkpoint and restart operations, and deleting files that contain images of checkpointed processes.
Checkpointing is useful for halting and continuing resource-intensive programs that take a long time to run. IRIX CPR can help when you need to:
Improve a system's load balancing and scheduling
Run complex simulation or modeling applications
Replace hardware for high-availability or fail-safe applications
Processes can continue to run after checkpoint and can be checkpointed multiple times.
The IRIX 6.5.20 release adds support for Trusted IRIX attributes to Checkpoint and Restart (CPR) processes. For more information, see “Using CPR in Trusted IRIX” in Chapter 3.
A statefile is a directory containing information about a process or set of processes (including the names of open files and system objects). Statefiles contain all available information about a running process, to enable restart. The new process(es) should behave just as if the old process(es) had continued. Statefiles are stored as files inside a directory and are protected by normal IRIX security mechanisms.
A checkpoint owner is the owner of all checkpointed processes and the resulting statefiles. Only the checkpoint owner or superuser is permitted to perform a checkpoint. If targeted processes have multiple owners, only the superuser is permitted to checkpoint them. Only the checkpoint owner or superuser can restart checkpointed process(es) from a statefile. If the superuser performed a checkpoint, only the superuser can restart it.
A process group is a set of processes that constitute a logical job—they share the same process group ID. For example, modern UNIX shells arrange pipelined programs into a process group, so they all can be suspended and managed with the shell's job control facilities. You can determine the process group ID using the -j option of the ps command; for more information see the ps(1) man page. Programmers can change the process group ID using the setpgid() system call; for more information see the setpgid(2) man page.
A process session is a set of processes started from the same physical or logical terminal. Such processes share the same session ID. You can determine the process group ID and the session ID (SID) of any process by using the -j option to the ps command; for more information see the ps(1) man page. Programmers can change the session ID using the setsid() system call; for more information, see the setsid(2) man page.
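Both IDs can be extracted from ps -j output with a small filter. The following is a minimal sketch, not part of CPR: the function name pgid_sid_of is hypothetical, and it assumes the header line labels the columns PID, PGID, and SID. Check the header printed by ps on your system before relying on it.

```shell
# Hypothetical helper: reads `ps -j` style output on standard input and
# prints "PGID SID" for the process ID given as $1.
# Column positions are taken from the header line (assumed labels:
# PID, PGID, SID) rather than hard-coded.
pgid_sid_of() {
    awk -v target="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $col["PID"] == target { print $col["PGID"], $col["SID"]; found = 1 }
        END { exit !found }
    '
}

# Example use for the current shell:
#   ps -j | pgid_sid_of $$
```

The two values printed are the ones to use with the process group and process session forms of the checkpoint command.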
An IRIX array session is a set of conceptually related processes running on different nodes in an array. Support is provided by the array services daemon, which knows about array configuration and provides functions for describing and administering the processes of a single job. The principal use of array services is to run jobs that are large enough to span two or more machines.
A process hierarchy is the set of all child processes with a common parent. The process hierarchy is identified by giving the process ID of the parent process. A process session is one example of a process hierarchy, but by no means the only example.
A share group is a group of processes created from a common ancestor by sproc() system calls; for more information see the sproc(2) man page. The sproc() call is like fork(), except that after sproc(), the new child process can share the virtual address space of the parent process. The parent and child each have their own program counter value and stack pointer, but text and data space are visible to both processes. This provides a mechanism for building parallel programs.
An IRIX job is a group of related processes all descended from a point of entry process and identified by a unique job ID. A job can contain multiple process groups, sessions, or array sessions, and all processes in one of these subgroups are always contained within one job. For more information see the job_limits(5) man page.
To verify that CPR runs on your system, check that the eoe.sw.cpr subsystem is installed:
$ versions eoe.sw.cpr
I = Installed, R = Removed
   Name          Date      Description
I  eoe           09/28/96  IRIX Execution Environment, 6.3
I  eoe.sw        09/14/96  IRIX Execution Environment Software
I  eoe.sw.cpr    09/14/96  Checkpoint and Restart
If no CPR subsystem is installed, see “Installing CPR” in Chapter 2 for instructions on installing CPR.
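This check can be scripted. The following is a minimal sketch, not part of CPR: the function name cpr_is_installed is hypothetical, and it assumes the versions output layout shown above, where the first field of each line is the status flag and the second is the subsystem name.

```shell
# Hypothetical helper: reads `versions` output on standard input and
# succeeds only if eoe.sw.cpr is listed with the "I" (installed) flag.
cpr_is_installed() {
    while read -r flag name rest; do
        if [ "$flag" = "I" ] && [ "$name" = "eoe.sw.cpr" ]; then
            return 0
        fi
    done
    return 1
}

# Example use on an IRIX system:
#   versions eoe.sw.cpr | cpr_is_installed && echo "CPR is installed"
```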
To checkpoint a set of processes (one process or more), use the -c option of the cpr command, providing a statefile name, and specifying a process ID with the -p option. For example, to checkpoint process 1111 into statefile ckptSep7, enter the following:
$ cpr -c ckptSep7 -p 1111
To checkpoint all processes in a process group, enter the process group ID (for example, 123) followed by the :GID modifier:
$ cpr -c statefile -p 123:GID
To checkpoint all processes in a process session, enter the process session ID (for example, 345) followed by the :SID modifier:
$ cpr -c statefile -p 345:SID
To checkpoint all processes in an IRIX array session, enter the array session ID (for example, 0x8000abcd00001111) followed by the :ASH modifier:
$ cpr -c statefile -p 0x8000abcd00001111:ASH
To checkpoint all processes in a process hierarchy, enter the parent process ID (for example, 567) followed by the :HID modifier:
$ cpr -c statefile -p 567:HID
To checkpoint all processes in an sproc() share group, enter the share group ID (for example, 789) followed by the :SGP modifier:
$ cpr -c statefile -p 789:SGP
To checkpoint all processes in an IRIX job, enter the job ID (for example, 0x8000abcd00001234) followed by the :JID modifier:
$ cpr -c statefile -p 0x8000abcd00001234:JID
It is possible to combine process designators using the comma separator, as in the following example. All processes are recorded in the same statefile.
$ cpr -c ckptSep8 -p 1113,1225,1397:HID
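When the list of designators is assembled by a script, it can help to validate the modifiers before calling cpr. The following is a minimal sketch under stated assumptions: the function name cpr_targets is hypothetical, it accepts only the modifiers listed in this section (GID, SID, ASH, HID, SGP, JID) plus plain process IDs, and it only prints the command line rather than running it.

```shell
# Hypothetical helper: joins process designators into one -p argument.
# Each designator is either ID or ID:TYPE, where TYPE is one of the
# modifiers described in this section.
cpr_targets() {
    list=
    for d in "$@"; do
        case $d in
            *:GID|*:SID|*:ASH|*:HID|*:SGP|*:JID) ;;   # known modifier
            *:*) echo "unknown modifier in '$d'" >&2; return 1 ;;
            *) ;;                                     # plain process ID
        esac
        list=${list:+$list,}$d
    done
    [ -n "$list" ] || return 1
    printf '%s\n' "$list"
}

# Print the resulting command line without running it:
echo cpr -c ckptSep8 -p "$(cpr_targets 1113 1225 1397:HID)"
```

Rejecting an unknown modifier up front avoids handing cpr a malformed designator in the middle of a longer list.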
The -w option specifies that cpr use the attribute file located in the current working directory (versus $HOME/.cpr).
$ cpr -c -w ckptDec13 -p 1113
You can place the statefile anywhere, provided you have write permission for the target directory, and provided there is enough disk space to store the checkpoint images. You might want to include the date as part of the statefile name, or you might want to number statefiles consecutively. The -f option of the cpr command forces an overwrite of an existing statefile.
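A dated, consecutively numbered naming scheme can be generated by a short helper. This is a minimal sketch, not a CPR feature: the function name next_statefile and the ckpt-YYYYMMDD pattern are illustrative choices, and the helper checks only write permission on the target directory, not free disk space.

```shell
# Hypothetical helper: prints a statefile path of the form
# DIR/ckpt-YYYYMMDD, adding a numeric suffix (.1, .2, ...) when that
# name is already taken, so that -f is not needed to avoid a clash.
next_statefile() {
    dir=${1:-.}
    [ -d "$dir" ] && [ -w "$dir" ] || {
        echo "cannot write to $dir" >&2
        return 1
    }
    base="ckpt-$(date +%Y%m%d)"
    name=$base
    n=0
    while [ -e "$dir/$name" ]; do
        n=$((n + 1))
        name="$base.$n"
    done
    printf '%s\n' "$dir/$name"
}

# Example use (the directory is illustrative):
#   cpr -c "$(next_statefile /var/tmp)" -p 1111
```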
The C shell (csh), Korn shell (ksh or, after IRIX 6.3, sh), Tops C shell (tcsh), and GNU shell (bash) all support job control. The Bourne shell (bsh, formerly sh) does not. Jobs can be suspended with Ctrl+Z, backgrounded with the bg built-in command, or foregrounded with fg. All job control shells provide the jobs built-in command with an -l option to list process ID numbers and a -p option to show the process group ID of a job.
To restart a set of processes (one process or more), use the -r option of the cpr command, providing just the statefile name. For example, to restart the set of processes checkpointed in ckptSep7, enter the following:
$ cpr -jm -r ckptSep7
Use the -j option if you want to perform interactive job control after restart. Otherwise, the process group restored belongs to init, effectively disabling job control.
Use the -m option if you want to migrate the checkpointed memory to the location in the system topology where the restart operation is executing.
$ cpr -c -w ckptDec13 -p 1113
Use the -w option if you want to use the attribute file located in the current working directory (instead of $HOME/.cpr).
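A restart can be wrapped so that obvious mistakes are caught before cpr runs. The following is a minimal sketch under stated assumptions: the function name restart_statefiles and the CPR variable are hypothetical, the wrapper passes through only the -j and -m options described above, and it relies on a statefile being a directory. Passing several statefile names after -r is assumed here; consult the cpr man page for the exact syntax.

```shell
# Assumption: CPR names the command to run. Set CPR=echo for a dry run
# that prints the arguments instead of restarting anything.
CPR=${CPR:-cpr}

# Hypothetical wrapper.
# usage: restart_statefiles [-j] [-m] statefile...
restart_statefiles() {
    opts=
    while [ $# -gt 0 ]; do
        case $1 in
            -j|-m|-jm|-mj) opts="$opts $1"; shift ;;
            *) break ;;
        esac
    done
    [ $# -gt 0 ] || { echo "no statefile given" >&2; return 2; }
    # A statefile is a directory, so reject anything else early.
    for sf in "$@"; do
        [ -d "$sf" ] || { echo "$sf: not a statefile directory" >&2; return 1; }
    done
    # $opts is intentionally unquoted so each option is a separate word.
    $CPR $opts -r "$@"
}

# Example use:
#   restart_statefiles -jm ckptSep7
```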
You may restart more than one statefile with the same cpr command. If a restart involves more than one process, all restarts must succeed before any process is allowed to run; otherwise all restarts fail. Restart failure can occur for any of the following reasons:
unavailable PID
The original process ID is not available (already in use), and the option to allow ANY process ID was not in effect.
component unavailable
Application binaries or libraries are no longer available on the system, and neither the REPLACE nor SUBSTITUTE option was in effect.
security and data integrity
The user lacks proper permission to restart the statefile, or the restart will destroy or replace data without proper authorization. Only the checkpoint owner and the superuser may restart a set of processes.
resource limitation
System resources such as disk space, memory (swap space), or number of processes allowed, ran out during restart.
file contents change
If the CONTENTS action was used for FILE policies in the user's cpr attribute file, the restart could fail if file contents have changed between checkpoint and the restart. For more information, see “FILE Policy”.
other fatal failure
Some important part of a process restart failed for unknown reasons.
The statefile remains unchanged after restart; cpr does not delete it automatically. To free disk space, use the -D option of cpr; for more information, see the section “Deleting Statefiles”.
If a checkpoint is issued against an interactive process or a group of processes rooted at an interactive process, it can be restarted interactively using the -j option of the cpr command. This option makes processes interactive and job-controllable. The restarted processes run in the foreground, even if the original ones ran in the background. Users may issue job control signals to background the process if desired. An interactive job is defined as a process with a controlling terminal; for more information see the termio(7) man page. Only one controlling terminal is restored even if the original process had multiple controlling terminals.
The -m option of the cpr command migrates process memory so it is restored to the location in the system topology where the restart operation is executing, for example, within a specific cpuset, within the global cpuset, and so on. Without this option, the default restart behavior on NUMA systems is to restore process memory back to where it was at the time of the checkpoint. See the migration(3) man page for scenarios that may prevent pages from migrating properly. This option has no effect on non-NUMA systems.
To obtain information about checkpoint status, employ the -i option of the cpr command, providing the statefile name. You may query more than one statefile at a time. For example, to get information about the set of processes checkpointed in ckptSep7, either before or after restart, enter the following command:
$ cpr -i ckptSep7
This displays information about the statefile revision number, process names, credential information for the processes, the current working directory, open file information, the time when the checkpoint was done, and so forth.
To delete a statefile and its associated open files and system objects, use the -D option of the cpr command, providing a statefile name. You may delete more than one statefile at a time. For example, to delete the statefile ckptSep7, enter the following command:
$ cpr -D ckptSep7
Only the checkpoint owner and the superuser may delete a statefile directory. Once a checkpoint statefile has been deleted, restart is no longer possible.
The cview command brings up a graphical interface for CPR and provides access to some features of the cpr command. As of the IRIX 6.5.16 release, new features are no longer being added to the cview command or interface. The cview command will be removed in the next major release of the IRIX operating system. The checkpoint control panel, shown in Figure 1-1, displays a list of processes that may be checkpointed.
Checkpoint options may be set in step II, and are explained in the section “Checkpoint and Restart Attributes”. Click the right tab at the bottom to switch panels.
The restart control panel, shown in Figure 1-2, displays a list of statefiles that may be restarted. The buttons near the bottom query checkpoints and delete statefiles.
The cpr command reads an attribute file at start-up time to set checkpoint configuration and control restart behavior. Typical defaults are given in the /etc/cpr_proto sample file. You can control CPR behavior by creating a similar .cpr attribute file in your home directory (if $HOME is not set, cpr consults the password entry). The CPR attribute file consists of one or more CKPT attribute definitions, each in the following format:
CKPT IDtype IDvalue {
    policy: instance: action
    ...
}
Possible values for IDtype are similar to the process ID modifiers given with the -p option of cpr when checkpointing, and are shown in Table 1-1. IDvalue specifies the process ID or process set ID.
Table 1-1. IDtype Modifier Options
IDtype | Process Type Designation |
---|---|
PID | UNIX process ID or POSIX thread ID. |
GID | UNIX process group ID; see setpgrp(2). |
SID | UNIX process session ID; see setsid(2). |
ASH | IRIX array session ID; see array_sessions(5). |
JID | IRIX job ID; see job_limits(5). |
HID | Process hierarchy (tree) rooted at the given process ID. |
SGP | IRIX sproc() shared group; see sproc(2). |
* | A wild card for anything. |
The policy lines inside the CKPT block specify default actions for CPR to take. Possible values for policy are shown in Table 1-2.
Table 1-2. Policy Names and Actions
Policy Name | Domain of Action |
---|---|
FILE | Policies for handling open files. |
WILL | Actions on the original process after checkpoint. |
CDIR | Policy on the original working directory; see chdir(2). |
RDIR | Policy on the original root directory; see chroot(2). |
FORK | Policy on original process ID. |
The FILE policy can take an optional instance field. This field specifies files that have a unique disposition, other than the default action. For example, in one case you want to replace a file, but in another case you want to append to a file. The instance field is enclosed in double quotes and may contain wildcards. For example, /tmp/* identifies all files in the /tmp directory, and /* identifies all files in the system.
The following action keywords are available for the FILE policy:
The following action keywords are available for the CDIR and RDIR policies:
The FORK policy can take an optional instance field, either PID or JID. If no instance is specified, the specified action is applied to all instances. The following action keywords are available for the FORK policy:
There is no attribute equivalent to the cpr -u option for operating system upgrade.
The $HOME/.cpr file specifies a user's CPR default attributes. Here is an example of a custom .cpr attribute file:
CKPT PID 1111 {
    FILE: "/tmp/*": REPLACE
    WILL: CONT
    FORK: PID: ANY
}
This saves and restores all /tmp files, allows the process to continue after checkpoint, and permits process ID substitution if needed.
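An attribute file like this one can also be generated per process, for example into a working directory that is later used with the -w option. The following is a minimal sketch: the function name write_cpr_attrs is hypothetical, the policies are copied from the example above, and it assumes the attribute file in the working directory is also named .cpr.

```shell
# Hypothetical helper: writes an attribute file for process ID $1 into
# directory $2 (default: the current directory).
# Assumption: the attribute file read with -w is named ".cpr".
write_cpr_attrs() {
    pid=$1
    dir=${2:-.}
    [ -n "$pid" ] || { echo "usage: write_cpr_attrs PID [DIR]" >&2; return 2; }
    cat > "$dir/.cpr" <<EOF
CKPT PID $pid {
    FILE: "/tmp/*": REPLACE
    WILL: CONT
    FORK: PID: ANY
}
EOF
}

# Example use, followed by a checkpoint that reads ./.cpr:
#   write_cpr_attrs 1113 .
#   cpr -c -w ckptDec13 -p 1113
```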