High Availability



SUMMARY

  • Jan 20, 2017

The main objective of high availability is to avoid losing service from an IT system. To that end, IT systems are duplicated at an alternate site, with complete copies of data and functions, in case the main site goes down for some reason. This is part of what is called IT Service Recovery (ITSR) or IT Disaster Recovery (ITDR), which in turn is part of a Business Continuity Plan (BCP). Which of these terms is used depends on how the company has approached the subject.

Unfortunately, we do not support clustering at the moment. Clustering is just one method that is sometimes used to achieve high availability, and it is not always relevant or needed.

However, we do provide the following options to achieve a similar result:
o Agents can send logs to two or more destinations at the same time, so the logs are duplicated at the alternate site while the main site is in operation. Each agent simply has two destinations configured all the time. If one site is down, the agents keep sending logs to the site that is up and working. In v5, the agents cache as much as they can for the site whose Snare Servers are down, before the logs wrap. Once the site is back up, the agents resume sending logs to both sites.
o All agents can send logs to one Snare Server (SS), and that main SS can then send them on to the IT DR SS using the reflector. In the event of an outage at the main site, the customer can change their internal DNS to point to the alternate IT DR SS, and logs resume sending to the alternate site. This assumes the agents use a DNS name, not an IP address, in their destination fields. If the protocol is TCP or TLS, there is minimal chance of losing logs, as they will be cached. The DNS change would be triggered when the business declares a disaster and initiates its IT DR/BCP failover processes. Most customers have a set of procedures they perform when failing over to their IT DR site. Not everything is automatic, and DNS redirection is one of the manual steps.
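
The behaviour described in the first option (send to every destination, cache for whichever site is down, resume once it is back) can be illustrated with a minimal Python sketch. This is not Snare agent code. The `Destination` class, the `forward` function and the `cache_size` value are hypothetical names chosen for illustration, and the bounded queue only approximates the "cache until the logs wrap" behaviour by dropping the oldest entries when it is full.

```python
from collections import deque


class Destination:
    """One log destination with a bounded cache for undelivered events (hypothetical)."""

    def __init__(self, name, send, cache_size=1000):
        self.name = name
        # send is any callable(event) that raises ConnectionError when the site is down.
        self.send = send
        # A deque with maxlen drops its oldest entry when full, i.e. the cache "wraps".
        self.cache = deque(maxlen=cache_size)

    def flush(self):
        """Try to deliver the backlog in order; return True if the cache is now empty."""
        while self.cache:
            try:
                self.send(self.cache[0])
            except ConnectionError:
                return False  # site still down, keep the remaining events cached
            self.cache.popleft()
        return True

    def deliver(self, event):
        """Deliver the backlog first, then the new event; cache the event on failure."""
        if self.flush():
            try:
                self.send(event)
                return
            except ConnectionError:
                pass
        self.cache.append(event)


def forward(event, destinations):
    """Send one event to every configured destination, independently of the others."""
    for destination in destinations:
        destination.deliver(event)
```

Because each destination keeps its own cache, an outage at one site never delays delivery to the other. Once the cache is full, the oldest events for the failed site are lost, which is why a long outage can still leave a gap at that site.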

So both options provide methods to:
o Ensure data is duplicated at an alternate site
o Ensure that logs are still collected in the event of a failure of the main site
o Allow the log archive to be accessed, and reports to be produced, from the second site.
o Note that you may need a process to extract any customised objectives created on the master system and load them into the alternate system, to keep the various reports in sync (there are options in the admin tools area to load them). This is not much different from making system changes on a Microsoft cluster, which have to be manually replicated to the other systems so that they all work from the same base configuration.

The Snare Agent Manager (SAM) does not have HA yet. However, a customer can run an alternate SAM with appropriate license keys. If they change the DNS entry that the agents phone home to so that it points to the alternate SAM, the agents will relicense from the alternate SAM with no downtime. The agents hold a 30-day key that keeps them running through network disruptions, which gives the failover process time to complete if the primary SAM is down for any length of time.
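
As a rough planning aid, the 30-day key can be treated as a deadline for completing SAM failover. The sketch below assumes the 30 days are counted from the agent's last successful licensing contact, which the text above does not state. The function names are hypothetical and are not part of any Snare tool.

```python
from datetime import datetime, timedelta

# From the text: agents hold a 30-day key.
KEY_LIFETIME = timedelta(days=30)


def key_valid_until(last_licensed):
    """Latest time the agent keeps running without reaching a SAM (assumed start point)."""
    return last_licensed + KEY_LIFETIME


def agent_stays_licensed(last_licensed, sam_reachable_again):
    """True if a SAM (primary or alternate) is reachable again before the key expires."""
    return sam_reachable_again <= key_valid_until(last_licensed)
```

For example, under this assumption an agent last licensed on 20 January 2017 would keep running until 19 February 2017, so the DNS change to the alternate SAM would need to be in effect by then.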