This chapter describes how to install and administer IRIX Checkpoint and Restart (CPR), and how to configure statefiles.
The system administrator is responsible for the following CPR tasks:
Install CPR software on server systems as required
Help users employ CPR on server systems and workstations
Prevent statefiles from filling up available disk space
Delete, or encourage users to delete, unneeded old statefiles
The subsystems that make up CPR are listed in Table 2-1.
Table 2-1. CPR Product Subsystems
Subsystem Name | Contents |
eoe.sw.cpr | Checkpoint and restart software |
 | CPR reference manual pages |
eoe.books.cpr | This guide as an InfoSearch document |
If CPR is not already installed, follow this procedure to install the software:
On the server, become superuser and invoke the inst command, specifying the location of the CD-ROM software distribution:
$ /bin/su -
Password:
# inst -f /CDROM/dist
Prevent installation of all default subsystems using the keep subcommand:
Inst> keep *
For additional information on inst, see the IRIX Admin: Software Installation and Licensing Guide, or the inst(1M) man page.
Make subsystem selections. To install CPR software, the man pages, and the CPR manuals for IRIS InSight, enter the following commands:
Inst> install eoe.*.cpr
Inst> list i
Inst> go
The list subcommand with the i argument displays all the subsystems marked for installation. The go subcommand starts installation, which takes some time.
For additional information on available subsystems, see the IRIX Release Notes.
Ensure that the following line exists in the /var/sysgen/system/ file (change cprstub to cpr if necessary):
USE: cpr
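The check in the last step can be scripted. The following is a minimal sketch, assuming a POSIX shell and grep; the function name cpr_use_state is hypothetical, and the file to inspect is passed as an argument because its name depends on your system.

```shell
# Report which CPR line a system configuration file contains.
# Prints "cpr", "cprstub", or "missing".
# cpr_use_state is a hypothetical helper, not part of CPR.
cpr_use_state() {
    file=$1
    if grep '^USE:[[:space:]]*cpr[[:space:]]*$' "$file" >/dev/null 2>&1; then
        echo cpr
    elif grep '^USE:[[:space:]]*cprstub[[:space:]]*$' "$file" >/dev/null 2>&1; then
        echo cprstub
    else
        echo missing
    fi
}
```

A result of cprstub means the line still needs to be changed to cpr; a result of missing means the line must be added.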
Because of their potential size and longevity, checkpoint images (statefiles) are one aspect of CPR where intervention by the system administrator may be required.
A statefile can reside anywhere on a filesystem where the user has write permission, provided there is enough disk space to store it. A statefile tends to be slightly larger than the process it checkpoints.
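Before taking a large checkpoint, it can help to confirm that the target filesystem has room. The following is a minimal sketch, assuming a df command that accepts the POSIX -P and -k options; the function name has_room_for_statefile is hypothetical, and the required size in kilobytes is an estimate that you supply.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 KB free.
# has_room_for_statefile is a hypothetical helper, not part of CPR.
has_room_for_statefile() {
    dir=$1
    need_kb=$2
    # With -P, df prints a header and then one line per filesystem;
    # field 4 of that line is the available space in kilobytes.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}
```

For example, has_room_for_statefile /tmp 200000 succeeds only if about 200 MB is free on the filesystem that holds /tmp. Allow some margin above the process size when choosing the estimate.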
As the system administrator, you might want to create a policy saying that checkpoint images stored in temporary directories (such as /tmp or /var/spool) are not guaranteed to remain there. If users want to preserve a statefile indefinitely, they should place it in a permanent directory that they own themselves, such as their home directory.
Checkpoint images contain much information about a process, including process set IDs, copies of user data and stack memory, kernel execution states, signal vectors, a list of open files and devices, pipeline setup, shared memory, array job states, and so on.
To obtain information about a statefile directory, run the cpr command with the -i option:
$ cpr -i statefile ...
This displays information about the statefile revision number, process names, credential information for the processes, the current working directory, open file information, the time when the checkpoint was done, and so forth.
There is no automated way to tell whether a user has restarted from a statefile; you need to ask the owner.
First ask the checkpoint owner to remove unneeded statefiles. If there is no reply, and checkpoints are filling the available disk space, look for the oldest statefiles, especially ones in a series, as the best candidates for removal.
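Finding the oldest statefiles can be partly scripted. The following is a minimal sketch, assuming that statefile directories are kept together under one parent directory (a site convention, not a CPR requirement) and that a directory's modification time approximates the checkpoint time. The function name oldest_statefiles is hypothetical.

```shell
# List the immediate subdirectories of a parent directory, oldest first,
# one per line as "size-in-KB <TAB> path".
# oldest_statefiles is a hypothetical helper, not part of CPR.
oldest_statefiles() {
    parent=$1
    # ls -t sorts by modification time; -r reverses the order to oldest first.
    ls -dtr "$parent"/*/ 2>/dev/null | while IFS= read -r dir; do
        size_kb=$(du -sk "$dir" | awk '{ print $1 }')
        printf '%s\t%s\n' "$size_kb" "${dir%/}"
    done
}
```

The output is only a list of candidates. Confirm with the owner, or inspect each directory with cpr -i, before deleting anything.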
To delete an entire statefile directory, run the cpr command with the -D option:
$ cpr -D statefile ...
Only the checkpoint owner and the superuser may delete a statefile. Once a statefile has been deleted, the checkpoint cannot be restarted unless the statefile is restored from backups.
If you want to restrict user access to CPR, or if some users abuse the facility by leaving large statefile directories around, you can follow this procedure:
To temporarily disable CPR, change the mode of the /usr/sbin/cpr command to 000. To permanently shut off CPR, use the inst command to remove the eoe.sw.cpr subsystem.
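The temporary method can be sketched as a pair of shell functions. This is an illustration under stated assumptions: the function names are hypothetical, and the mode used to re-enable the command must be the one you recorded (for example with ls -l) before disabling it.

```shell
# Remove all permission bits from a command so that users cannot run it.
# disable_command and enable_command are hypothetical helpers.
disable_command() {
    chmod 000 "$1"
}

# Restore a previously recorded mode, passed as the second argument.
enable_command() {
    chmod "$2" "$1"
}
```

As superuser, you would run disable_command /usr/sbin/cpr to disable CPR, and later enable_command /usr/sbin/cpr followed by the recorded mode to restore it.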
The following system objects are checkpoint safe. See “Checkpoint-Safe Objects” in Chapter 3 for complete coverage of checkpoint safety.
UNIX processes, process groups, terminal control sessions, IRIX array sessions, process hierarchies, sproc() groups (see the sproc(2) man page), and random process sets
All user memory areas, including user stack and data regions
System states, including process and user information, signal disposition and signal mask, scheduling information, owner credentials, accounting data, resource limits, current directory, root directory, locked memory, and user semaphores
System calls, if applications handle return values and error numbers correctly, although slow system calls may return partial results
Undelivered and queued signals are saved at checkpoint and delivered at restart
Open files (including NFS-mounted files), mapped files, file locks, and inherited file descriptors; this includes open pipes with pipeline data
Special files /dev/tty, /dev/console, /dev/zero, /dev/null, and ccsync(7M)
UNIX System V shared memory (but the original shared memory ID is not restored); see the shmop(2) man page
IRIX jobs; see the job_limits(5) man page
Jobs started with CHALLENGEarray services, provided they have a unique ASH number; see the array_services(5) man page
Applications using the prctl() PR_ATTACHADDR option; see the prctl(2) man page
Applications using blockproc() and unblockproc(); see the blockproc(2) man page
The Power Fortran join synchronization accelerator; see the ccsync(7M) man page
R10000 counters; see the libperfex(3c) and perfex(1) man pages
The following system objects are not checkpoint safe. See “Limitations and Caveats” in Chapter 3 for more complete coverage of unsupported system objects.
Network socket connections; see the socket(2) man page
X terminals and X11 client sessions
Special devices such as tape drives and CD-ROMs
Files opened with setuid credentials that cannot be reestablished
UNIX System V semaphores and messages (as opposed to System V shared memory); see the semop(2) and msgop(2) man pages
This section provides a guide to various error messages that could appear during checkpoint and restart operations, and what these messages might indicate.
Checkpointing can fail for any of the reasons shown in Table 2-2.
Table 2-2. Checkpoint Failure Messages
Error Message | Problem Indicated |
Permission denied | Search permission denied on a pathname component of statefile. |
Resource busy | A resource required by the target process is in use by the system. |
Checkpoint error | An uncheckpointable resource is associated with the target process. |
File exists | The pathname designated by statefile already exists. |
Invalid argument | An invalid argument was passed to a function call. |
Too many symbolic links | A symbolic link loop occurred during pathname resolution. |
No such file or directory | The pathname to statefile is nonexistent. |
Not a directory | A component of the path prefix is not a directory. |
Filename too long | The pathname to statefile exceeds the maximum length allowed. |
No space left on device | Space remaining on disk is insufficient for the statefile. |
Operation not permitted | The calling process does not have appropriate privileges. |
Read-only file system | The requested statefile would reside on a read-only filesystem. |
No such process | The process or process group specified by ID does not exist. |
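Some of the messages in Table 2-2 point to an obvious first check. The following is a minimal sketch that maps a few of them to a suggested action; the function name checkpoint_hint is hypothetical, and the suggestions are illustrative readings of the table, not output produced by CPR.

```shell
# Print a suggested first check for a checkpoint failure message.
# checkpoint_hint is a hypothetical helper; the hints paraphrase Table 2-2.
checkpoint_hint() {
    case $1 in
        "Permission denied")
            echo "check search permission on each directory in the statefile path" ;;
        "File exists")
            echo "choose a different statefile name or delete the old statefile" ;;
        "No space left on device")
            echo "free disk space or write the statefile to another filesystem" ;;
        "Read-only file system")
            echo "write the statefile to a writable filesystem" ;;
        "No such process")
            echo "verify the process or process group ID" ;;
        "Checkpoint error")
            echo "look for objects that are not checkpoint safe" ;;
        *)
            echo "see Table 2-2" ;;
    esac
}
```

For example, checkpoint_hint "File exists" prints the suggestion to choose a different statefile name or delete the old statefile.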
Restart can fail for any of the reasons shown in Table 2-3.
Table 2-3. Restart Failure Messages
Error Message | Problem Indicated |
Permission denied | Search permission denied on a path component of statefile. |
Resource temporarily unavailable | Total number of processes for user exceeds system limit. |
Checkpoint error | An unrestartable resource is associated with the target process. |
Resource deadlock avoided | Attempted locking of a system resource would have resulted in a deadlock situation. |
Invalid argument | An invalid argument was passed to the function call. |
Too many symbolic links | A symbolic link loop occurred during pathname resolution. |
Filename too long | The pathname to statefile exceeds the maximum length. |
No such file or directory | The pathname to statefile is nonexistent. |
Not enough space | Restarting the target process requires more memory than allowed by the hardware or by available swap space. |
Not a directory | A component of the path prefix is not a directory. |
Operation not permitted | The real user ID of the calling process does not match the real user ID of one or more processes recorded in the checkpoint, or the calling process does not have appropriate privileges to restart one or more of the target processes. |