|
OpenNMS: A Study in Deployment
Why Do We Need a Systems Management Tool?Current trends in the IT world continue to accelerate the rate of change in every area. Applications, server platforms and networks are no longer the slow moving entities they once were. They are subject to change on an almost daily basis. In this environment, it becomes more and more important for the IT Operations team to quickly detect, and respond to changes, or anomalous events. My employer is a relatively new business. Applications would be customized packages, and a large section of its core IT systems would be outsourced. Slowly, those package based solutions morphed into custom applications and, for a variety of reasons, these outsourced systems were brought back in-house a couple of years ago. This presented those of us in the IT department with some interesting challenges. One of those challenges was how to go about managing our newly re-acquired IT infrastructure and applications. When we first decided to move our core systems from an outsourced to an in-house IT Operations Department, our requirements were limited. Checking the availability of some services and the load on the network and key servers was about as much as we thought we needed. It became obvious over time that this was rather optimistic. Each new service added seemed to result in a new management tool being installed on a System Administrator's workstation. At one point we had three separate network monitoring systems, three separate performance management tools and a plethora different scripts, web pages and command line tools. The DBA team had one tool, the Network Admins another, the Unix and Windows teams yet another. We sent out critical alerts by email, pager, and SMS, often to completely inappropriate people. The company was growing, and it looked like it was beginning to need a grown-up systems management tool, but which one?
What Do We Expect from a Systems Management Application?There is definitely a "sweet spot" for systems management applications. Some are suited to smaller environments, others are most definitely suited to enterprise scale environments with more demanding requirements. Unsurprisingly the enterprise scale products often come with enterprise scale price tags and learning curves. We had a few key requirements:
Why OpenNMS?OpenNMS checked a lot of these boxes. It was (mostly) java, so we could run it on our Sun hardware. OpenNMS was already running in environments an order of magnitude larger than ours. It had a lot of the enterprise level features absent from other Open Source products. There were documents available on the Internet [1],[2] that pointed to its extensibility. It was based on a lot of familiar components (tomcat, postgres, rrdtool). Finally, in Open Source terms, it was a relatively mature product. We took a cautious approach deploying OpenNMS. Simplest to replace, and therefore first to go were the existing network monitoring products. Only after a month of parallel running with OpenNMS did we decommission our existing solutions. Second to go were the diverse collection of emails that were sent by applications or batch jobs. We replaced the destination email addresses with some mailboxes that delivered the notifications directly into OpenNMS. This turned out to be a bigger win than we'd expected. By having a central point where application alerts could be received and processed, we revealed hidden issues with applications that had existed for weeks or months. This was painful at first. The respective teams were often uncomfortable in having their problems aired to the world. Once we started to address these problems, however, and the frequency of the alerts started to reduce, we started to see real benefits. The operations team had a single console to monitor applications, and we could reduce the number of application support staff on call. The next target was system performance data collected by our existing tools. That which could be readily moved into OpenNMS went quickly. Platform specific data collectors (such as those which collected from Microsoft hosts using WMI) had any important alerts channeled in to OpenNMS. Our current focus, now that we believe our OpenNMS installation is mature, is back in application space. We are extending the end-to-end monitoring capabilities of OpenNMS to our web services providers. We are also starting to use it to retrieve instrumentation data directly from applications themselves, as well as their hosts.
Did We Meet Our Requirements?Here's how things shook out:
The "sweet spot" for OpenNMS seems to be about as wide as any Open Source solution and getting bigger by the month. We look forward to enhancements in the web user interface, a new JMX based data collector and support for event correlation in the near future.
Conclusions and Lessons LearnedWhat are the key lessons we have learned? In no particular order:
It's not been a totally smooth ride. We had repeated problems with memory leaks within the Java virtual machine (admittedly, not OpenNMS's fault). We also had a few nasty problems with corruption of OpenNMS's back-end database, which are now fixed. There were also a lot of "d'oh!" moments along the way, as we got up to speed with what is a pretty complex application. None of these problems ever seemed like show stoppers at the time. This had much to do with help extended by the development team and user community, to whom we extend our thanks. References[1] From black boxes to enterprises, Part 3: Hands-on JMX integration
[2] Integrating OpenNMS with OpenManage Server Agent to Manage Dell PowerEdge Servers Running Linux
|