Caché fits into all common high-availability configurations supplied
by operating system providers including Microsoft, IBM, HP, and EMC. Caché
provides easy-to-use, often automatic, mechanisms that integrate with the operating
system to provide high availability.
There are four general approaches to system failover. In order of increasing
availability they are:
- No failover
- Cold failover
- Warm failover
- Hot failover
Each strategy has varying recovery time, expense, and user impact, as outlined
in the following table.
There are variations on these strategies; for example, many large enterprise
clients have implemented hot failover and also use cold failover for disaster recovery.
It is important to differentiate between failover and disaster recovery.
Failover is a methodology to resume system availability in an acceptable period
of time, while disaster recovery is a methodology to resume system availability
when all failover strategies have failed.
If you require further information to help you develop a failover and backup
strategy tailored for your environment, or to review your current practices, please
contact the InterSystems Worldwide Response Center (WRC).
With no failover in place, your Caché database integrity is still protected
from production system failure. Structural database integrity is maintained by Caché
write image journal (WIJ) technology. Logical integrity is maintained through global
journaling and transaction processing. While WIJ, global journaling, and transaction
processing are optional, InterSystems highly recommends using them.
If a production system failure occurs, such as a hardware failure, the database
and application are generally unaffected. Disk degradation, of course, is an exception.
Disk redundancy and good backup procedures are vital to mitigate problems arising
from disk failure.
With no failover strategy in place, system failures can result in significant
downtime, depending on the cause and your ability to isolate and resolve it. For
example, if a CPU fails, you must replace it and restart while application users wait
for the system to become available. For many applications that are not business-critical,
this risk may be acceptable. Customers that adopt this approach share the following common traits:
- Clear and detailed operational recovery procedures
- Well-trained, responsive staff
- Ability to replace hardware quickly
- Disk redundancy (RAID and/or disk mirroring)
- Enabled global journaling and WIJ
- 24x7 maintenance contracts with all vendors
- Expectations from application users who tolerate moderate downtime
- Management acceptance of risk of an extended outage
Some clients cannot afford to purchase adequate redundancy to achieve higher
availability. With these clients in mind, InterSystems strives to make Caché
100% reliable.
A common and often inexpensive approach to recovery after failure is to maintain
a standby system to assume the production workload in the event of a production system
failure. A typical configuration has two identical computers with shared access to
a disk subsystem.
After a failure, the standby system takes over the applications formerly running
on the failed system. Microsoft Windows Clusters, HP MC/Serviceguard, Tru64 UNIX TruClusters,
OpenVMS Clusters, and IBM HACMP provide a common approach for implementing cold failover.
In these technologies, the standby system senses a heartbeat from the production system
on a frequent and regular basis. If the heartbeat stops for a sustained period
of time, the standby system automatically assumes the IP address and the disk formerly
associated with the failed system. The standby can then run any applications (Caché,
for example) that were on the failed system. In this scenario, when the standby system
takes over the application, it executes a pre-configured start script to bring the
databases online. Users can then reconnect to the databases that are now running on
the standby server. Again, WIJ, global journaling, and transaction processing are
used to maintain structural and data integrity.
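
The heartbeat-and-takeover sequence described above can be sketched as a small simulation. This is a minimal illustration only, not how any of the named cluster products are implemented: the `StandbyMonitor` class, the 3-second timeout, and the `start_script` hook are all hypothetical names and values chosen for the example, and time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StandbyMonitor:
    """Toy model of a standby node watching a production heartbeat."""
    timeout: float                      # seconds of silence tolerated (assumed value)
    start_script: Callable[[], None]    # hypothetical hook that brings databases online
    last_heartbeat: float = 0.0
    active: bool = False                # True once the standby has taken over
    log: List[str] = field(default_factory=list)

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the production system."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Take over if the heartbeat has been silent longer than the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.log.append("assume IP address and shared disk")
            self.start_script()
            self.active = True
        return self.active


def simulate() -> List[str]:
    events: List[str] = []
    monitor = StandbyMonitor(
        timeout=3.0,
        start_script=lambda: events.append("start script ran"),
    )
    # Production sends heartbeats at t=0, 1, 2 and then fails.
    for t in (0.0, 1.0, 2.0):
        monitor.heartbeat(t)
        monitor.check(t)
    # The standby keeps checking; takeover happens once silence exceeds the timeout.
    for t in (3.0, 4.0, 5.0, 6.0):
        if monitor.check(t):
            events.append(f"standby active at t={t}")
            break
    return monitor.log + events
```

In this sketch the last heartbeat arrives at t=2, so the standby does not act until the silence exceeds the timeout (at t=6). A longer timeout reduces false takeovers caused by brief network interruptions, at the cost of a longer outage before users can reconnect.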
Customers generally configure the failover server to mirror the main server
with an identical CPU and memory capacity to sustain production workloads for an extended
period of time. The following diagram depicts a common configuration:
Cold Failover Configuration
Note:
Shadow journaling, where the production journal file is continuously applied
to a standby database, includes inherent latency and is therefore not recommended
as an approach to high availability. Any use of a shadow system for availability or
disaster recovery needs should take these latency issues into consideration.
The warm failover approach exploits a standby system that is immediately available
to accept user connections after a production system failure. This type of failover
requires the concurrent access to disk files provided, for example, by OpenVMS clusters
and Tru64 UNIX TruClusters.
In this type of failover two or more servers, each running an instance of Caché
and each with access to all disks, concurrently provide access to all data. If one
machine fails, users can immediately reconnect to the cluster of servers.
A simple example is a group of OpenVMS servers with cluster-mounted disks. Each
server has an instance of Caché running. If one server fails, the users can
reconnect to another server and begin working again.
Warm Failover Configuration
The 600 users on A and C are unaware of B's failure, but the 300 users that
were on the failed server are affected.
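
The effect of a warm failover on user sessions can be illustrated with a short sketch. It assumes three servers named A, B, and C with 300 users each (the even split across A and C is an assumption, since the text gives only the combined figure of 600), and it assumes that displaced users reconnect round-robin to the surviving servers. The function name and the session map are hypothetical and are not part of any Caché API.

```python
from typing import Dict, List


def reconnect_after_failure(
    sessions: Dict[str, List[str]], failed: str
) -> Dict[str, List[str]]:
    """Return a new session map with the failed server's users spread over survivors."""
    survivors = sorted(name for name in sessions if name != failed)
    if not survivors:
        raise RuntimeError("no surviving server to reconnect to")
    # Sessions on surviving servers are untouched by the failure.
    result = {name: list(sessions[name]) for name in survivors}
    # Users on the failed server lose their connection and reconnect round-robin.
    for i, user in enumerate(sessions.get(failed, [])):
        result[survivors[i % len(survivors)]].append(user)
    return result


def example() -> Dict[str, int]:
    sessions = {
        server: [f"{server}-user{n}" for n in range(300)]
        for server in ("A", "B", "C")
    }
    after = reconnect_after_failure(sessions, failed="B")
    return {server: len(users) for server, users in after.items()}
```

Under these assumptions, `example()` reports 450 users on each of A and C after B fails. The surviving servers must therefore have enough spare capacity to absorb the reconnecting users.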
The hot failover approach can be complicated and expensive, but comes closest
to ensuring 100% uptime. It requires the same failover capability as a cold or
warm configuration, but also requires that the state of a running user process be
preserved so that the process can resume on a failover server. One approach, for
example, uses a three-tier configuration of clients and servers.
Hot Failover Configuration
Thousands of users on terminal browsers connect through TCP sockets to a bank
of application servers. Each application server has a backup server ready to automatically
start in case of a server failure. In turn, the application servers are each connected
to a bank of data servers, each with its own backup server.
If a data server fails, any application server waiting for a response automatically
resubmits its request to a different data server while the backup server is started.
Similarly, any user terminal that sends a request to an application server that fails
automatically reissues its request to an alternate application server.
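
The automatic resubmission described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual mechanism used in a Caché deployment: data servers are represented as plain Python callables, `ServerDown`, `make_server`, and `submit_with_failover` are hypothetical names, and the request is assumed to be safe to send more than once.

```python
from typing import Callable, List, Sequence


class ServerDown(Exception):
    """Raised by a (simulated) data server that cannot answer."""


def submit_with_failover(request: str, servers: Sequence[Callable[[str], str]]) -> str:
    """Try each server in order, resubmitting the same request after a failure."""
    errors: List[str] = []
    for server in servers:
        try:
            return server(request)
        except ServerDown as exc:
            # Remember why this server failed, then move on to the next one.
            errors.append(str(exc))
    raise ServerDown("all data servers failed: " + "; ".join(errors))


def make_server(name: str, healthy: bool) -> Callable[[str], str]:
    """Build a simulated data server that either answers or raises ServerDown."""
    def handle(request: str) -> str:
        if not healthy:
            raise ServerDown(f"{name} is down")
        return f"{name} handled {request}"
    return handle


def demo() -> str:
    servers = [
        make_server("data1", healthy=False),  # simulated failed data server
        make_server("data2", healthy=True),   # alternate data server
    ]
    return submit_with_failover("GET record 42", servers)
```

Here `demo()` returns the response from `data2` because `data1` raises `ServerDown`. In a real system, resubmission of this kind is only safe when the request has no partial side effects on the failed server, which is one reason transaction processing matters in a hot failover design.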