OpenNMS Helps Keep Tabs On Networks - page 3
Early on, Balog was intrigued by the notion of five-nines availability.
Many of the service level agreements he had seen insisted on 99.999% uptime, which allows only around 30 seconds of downtime per month. This seemed strange to him, since the widely used commercial tool HP OpenView polled machines only every five minutes. At that polling rate, the shortest outage the tool can record is five minutes long: far more than 30 seconds, and more than even the roughly 4.5 minutes allowed by 99.99% uptime.
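The arithmetic behind those downtime budgets can be checked with a short sketch. The function name and the 30-day month are assumptions made for illustration; the figures quoted above are rounded, and for a 30-day month the budgets work out to about 26 seconds (99.999%) and about 4.3 minutes (99.99%).

```python
# Illustrative sketch: how much downtime an availability target allows.
# The function name and the 30-day month are assumptions, not from the article.

def downtime_budget_seconds(availability_percent: float,
                            period_days: float = 30.0) -> float:
    """Return the downtime (in seconds) allowed over the period."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability_percent / 100.0)


# Five-minute polling, the interval the article attributes to HP OpenView.
POLL_INTERVAL_SECONDS = 5 * 60

for target in (99.99, 99.999):
    budget = downtime_budget_seconds(target)
    exceeds = POLL_INTERVAL_SECONDS > budget
    print(f"{target}% uptime allows {budget:.1f} s per 30 days; "
          f"one 300 s polling gap exceeds it: {exceeds}")
```

Under both targets, a single five-minute polling gap is already larger than the whole monthly budget, which is the mismatch Balog noticed.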
Polling time also became a big issue as the number of machines increased. "Managing ten to a hundred machines is easy," remarked Balog. It becomes more challenging as the number of machines grows.
The OpenNMS solution uses a "down-time model" that temporarily shortens the polling interval when an outage is detected (to 30 seconds by default), so that the end of the outage is measured more precisely. This model allows customers to strike a reasonable balance between performance and capability. OpenNMS can currently monitor up to 20,000 devices from a single instance. Since the software is open source, capable coders can add more instances if needed.
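The idea of a down-time model can be sketched in a few lines. This is not OpenNMS code: the function names are hypothetical, and the five-minute normal interval is an assumption (the article only gives the 30-second outage default).

```python
# Minimal sketch of a down-time polling model (hypothetical, not OpenNMS code).
NORMAL_INTERVAL_S = 300  # assumed normal polling interval: five minutes
OUTAGE_INTERVAL_S = 30   # faster polling while an outage is in progress


def next_poll_interval(service_is_up: bool) -> int:
    """Poll slowly while the service is up, quickly while it is down."""
    return NORMAL_INTERVAL_S if service_is_up else OUTAGE_INTERVAL_S


def poll_times(outage_start: int, outage_end: int, horizon: int) -> list:
    """Return the times (seconds) at which polls occur up to the horizon.

    The service is treated as down for outage_start <= t < outage_end.
    """
    times = []
    t = 0
    while t <= horizon:
        times.append(t)
        up = not (outage_start <= t < outage_end)
        t += next_poll_interval(up)
    return times


# An outage from t=400 s to t=700 s, observed over 20 minutes.
print(poll_times(400, 700, 1200))
```

In this example the outage is first seen at the 600-second poll, after which polls arrive every 30 seconds, so the recovery at 700 seconds is noticed at 720 seconds rather than at the next five-minute mark.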
Data collection tends to throttle performance and capability. One of Balog's original four customers, Rackspace, couldn't collect data fast enough. To mitigate the problem, OpenNMS was modified to collect 200,000 data points from approximately 24,000 interfaces every five minutes, or 2.4 million data points an hour from a single instance of OpenNMS. The limitation turned out to be the speed at which the disk controller could write the data, not OpenNMS itself.
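The collection figures can be restated as rates, which helps explain why disk writes became the bottleneck. The snippet below is only a back-of-the-envelope sketch; the constant names are made up for illustration and the inputs are the numbers quoted above.

```python
# Back-of-the-envelope rates from the figures quoted in the article.
DATA_POINTS_PER_CYCLE = 200_000  # data points collected per cycle
INTERFACES = 24_000              # approximate number of interfaces
CYCLE_SECONDS = 5 * 60           # one collection cycle every five minutes

cycles_per_hour = 3600 // CYCLE_SECONDS
points_per_hour = DATA_POINTS_PER_CYCLE * cycles_per_hour
points_per_second = DATA_POINTS_PER_CYCLE / CYCLE_SECONDS
points_per_interface = DATA_POINTS_PER_CYCLE / INTERFACES

print(f"{cycles_per_hour} cycles per hour")
print(f"{points_per_hour:,} data points per hour")
print(f"{points_per_second:.0f} data points per second on average")
print(f"{points_per_interface:.1f} data points per interface per cycle")
```

That is a sustained average of several hundred writes per second, or roughly eight data points per interface in each cycle.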
Making almost every aspect of OpenNMS configurable lets customers easily tweak their own systems. Flexible configurations also let Balog provide personalized services to customers who want an optimized turnkey solution for their network monitoring and notification systems.