Caché fits into all common high-availability configurations supplied by operating system providers including Microsoft, IBM, HP, and EMC. Caché provides easy-to-use, often automatic, mechanisms that integrate with the operating system to provide high availability.
There are four general approaches to system failover. In order of increasing availability, they are no failover, cold failover, warm failover, and hot failover.
Each strategy has varying recovery time, expense, and user impact, as outlined in the following table.
Approach       Recovery Time   Expense               User Impact
No Failover    Unpredictable   No cost to low cost   High
Cold Failover  Minutes         Moderate              Moderate
Warm Failover  Seconds         Moderate to high      Low
Hot Failover   Immediate       Moderate to high      None
There are variations on these strategies; for example, many large enterprise clients have implemented hot failover and also use cold failover for disaster recovery.
It is important to differentiate between failover and disaster recovery. Failover is a methodology to resume system availability in an acceptable period of time, while disaster recovery is a methodology to resume system availability when all failover strategies have failed.
If you require further information to help you develop a failover and backup strategy tailored for your environment, or to review your current practices, please contact the InterSystems Worldwide Response Center (WRC).
No Failover
With no failover in place, your Caché database integrity is still protected from production system failure. Structural database integrity is maintained by the Caché write image journal (WIJ) technology. Logical integrity is maintained through global journaling and transaction processing. While WIJ, global journaling, and transaction processing are optional, InterSystems highly recommends using them.
If a production system failure occurs, such as a hardware failure, the database and application are generally unaffected. Disk degradation, of course, is an exception. Disk redundancy and good backup procedures are vital to mitigate problems arising from disk failure.
With no failover strategy in place, system failures can result in significant downtime, depending on the cause and your ability to isolate and resolve it. If a CPU has failed, you replace it and restart, while application users wait for the system to become available. For many applications that are not business-critical this risk may be acceptable. Customers that adopt this approach share the following common traits:
Some clients cannot afford to purchase adequate redundancy to achieve higher availability. With these clients in mind, InterSystems strives to make Caché 100% reliable.
Cold Failover
A common and often inexpensive approach to recovery after failure is to maintain a standby system to assume the production workload in the event of a production system failure. A typical configuration has two identical computers with shared access to a disk subsystem.
After a failure, the standby system takes over the applications formerly running on the failed system. Microsoft Windows Clusters, HP MC/Serviceguard, Tru64 UNIX TruClusters, OpenVMS Clusters, and IBM HACMP provide a common approach for implementing cold failover. In these technologies, the standby system monitors a heartbeat that the production system sends at frequent, regular intervals. If the heartbeat stops for a set period of time, the standby system automatically assumes the IP address and the disk formerly associated with the failed system. The standby system can then run any applications (Caché, for example) that were on the failed system: when it takes over an application, it executes a preconfigured start script to bring the databases online. Users can then reconnect to the databases, which are now running on the standby server. Again, WIJ, global journaling, and transaction processing are used to maintain structural and data integrity.
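The heartbeat logic described above can be illustrated with a minimal sketch. It is not how any of the named cluster products is implemented; the class name, the timeout value, and the `on_takeover` hook (standing in for the start script) are hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HeartbeatMonitor:
    """Standby-side tracker for heartbeats from the production system (illustrative only)."""
    timeout_seconds: float            # hypothetical threshold; real products make this configurable
    on_takeover: Callable[[], None]   # hypothetical hook standing in for the start script
    last_beat: Optional[float] = None
    taken_over: bool = False

    def record_heartbeat(self, now: float) -> None:
        """Remember when the most recent heartbeat arrived."""
        self.last_beat = now

    def check(self, now: float) -> bool:
        """Return True once the standby has taken over."""
        if self.taken_over:
            return True
        if self.last_beat is None:
            return False  # no heartbeat seen yet, so there is nothing to time out
        if now - self.last_beat > self.timeout_seconds:
            self.taken_over = True
            self.on_takeover()  # runs exactly once
        return self.taken_over


def simulate() -> list:
    """Feed three heartbeats, then silence, and collect takeover events."""
    events = []
    monitor = HeartbeatMonitor(timeout_seconds=5.0,
                               on_takeover=lambda: events.append("takeover"))
    for t in (0.0, 1.0, 2.0):
        monitor.record_heartbeat(t)
        monitor.check(t)
    monitor.check(6.0)  # 4 s of silence: still within the timeout
    monitor.check(8.0)  # 6 s of silence: takeover fires
    monitor.check(9.0)  # already taken over: the hook is not called again
    return events
```

In a real cluster the takeover step would also claim the IP address and the shared disk before the start script runs; the sketch leaves those steps to the hook.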
Customers generally configure the failover server to mirror the main server with an identical CPU and memory capacity to sustain production workloads for an extended period of time. The following diagram depicts a common configuration:
Cold Failover Configuration
State of PROD    IP address of PROD   IP address of STDBY
FUNCTIONAL       191.10.25.1          191.10.25.50
OUT OF SERVICE   N/A                  191.10.25.1
Note:
Shadow journaling, where the production journal file is continuously applied to a standby database, includes inherent latency and is therefore not recommended as an approach to high availability. Any use of a shadow system for availability or disaster recovery needs should take these latency issues into consideration.
Warm Failover
The warm failover approach exploits a standby system that is immediately available to accept user connections after a production system failure. This type of failover requires concurrent access to disk files, as provided, for example, by OpenVMS clusters and Tru64 UNIX TruClusters.
In this type of failover two or more servers, each running an instance of Caché and each with access to all disks, concurrently provide access to all data. If one machine fails, users can immediately reconnect to the cluster of servers.
A simple example is a group of OpenVMS servers with cluster-mounted disks. Each server has an instance of Caché running. If one server fails, the users can reconnect to another server and begin working again.
Warm Failover Configuration
State                  A           B           C
Normal                 300 users   300 users   300 users
B fails                300 users   0 users     300 users
B users log on again   450 users   0 users     450 users
The 600 users on A and C are unaware of B's failure, but the 300 users that were on the failed server are affected.
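The arithmetic in the table can be reproduced with a short sketch. It assumes, as the table does, that displaced users spread evenly across the surviving servers; in practice the split depends on how users reconnect. The function name `redistribute` is hypothetical.

```python
def redistribute(users_by_server: dict, failed: str) -> dict:
    """Return user counts after the users of a failed server reconnect elsewhere.

    Assumes an even split across the surviving servers (an illustration,
    not a description of any actual load-balancing behavior).
    """
    survivors = [name for name in users_by_server if name != failed]
    if not survivors:
        raise ValueError("no surviving servers to accept reconnecting users")

    displaced = users_by_server[failed]
    share, remainder = divmod(displaced, len(survivors))

    result = dict(users_by_server)
    result[failed] = 0
    for index, name in enumerate(survivors):
        # Hand out any leftover users one at a time to the first survivors.
        extra = 1 if index < remainder else 0
        result[name] = users_by_server[name] + share + extra
    return result


after_failure = redistribute({"A": 300, "B": 300, "C": 300}, failed="B")
```

With the figures from the table, `after_failure` is `{"A": 450, "B": 0, "C": 450}`, so each surviving server needs enough spare capacity for a 50% increase in load.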
Hot Failover
The hot failover approach can be complicated and expensive, but comes closest to ensuring 100% uptime. It requires the same failover capabilities as cold or warm failover, but also requires that the state of a running user process be preserved so that the process can resume on a failover server. One approach, for example, uses a three-tier configuration of clients and servers.
Hot Failover Configuration
Thousands of users on terminal browsers connect through TCP sockets to a bank of application servers. Each application server has a backup server ready to automatically start in case of a server failure. In turn, the application servers are each connected to a bank of data servers, each with its own backup server.
If a data server fails, any application server waiting for a response automatically resubmits its request to a different data server while the backup server is started. Similarly, any user terminal that sends a request to an application server that fails automatically reissues its request to an alternate application server.
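The resubmission behavior described above might look like the following sketch. It assumes that a request can safely be sent again to another server, which is something the application has to guarantee; the exception class, the function names, and the server stubs are all hypothetical and do not correspond to any Caché API.

```python
from typing import Callable, Sequence


class ServerUnavailable(Exception):
    """Raised by a (hypothetical) server stub that cannot answer."""


def submit_with_failover(request: str,
                         servers: Sequence[Callable[[str], str]]) -> str:
    """Try each server in turn and return the first successful response."""
    last_error = None
    for server in servers:
        try:
            return server(request)
        except ServerUnavailable as error:
            last_error = error  # this server failed; resubmit to the next one
    raise ServerUnavailable("all servers failed") from last_error


def failed_server(request: str) -> str:
    """Stub for a data server that is down."""
    raise ServerUnavailable("data server 1 is down")


def healthy_server(request: str) -> str:
    """Stub for a data server that answers normally."""
    return f"result for {request}"


response = submit_with_failover("q1", [failed_server, healthy_server])
```

The caller sees only the final response, which is what makes the failure invisible to the user. The same pattern applies one tier up, where a user terminal reissues its request to an alternate application server.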