Posted: September 10, 2020 | by Joachim Haller (Red Hat Accelerator, Sudoer)
If you’re an IT manager and need a high-level overview of system monitoring, this is it.
How can we demonstrate progress?
How can we defend our budgets?
How can we show that we play an important role and need (more) resources?
These are some of the questions facing every IT manager. Sysadmins often live under the radar, but we still face constant budgetary challenges from management: “Why does it cost so much?”, “Does the team have to be this big?”, “Do we really need this new tool?”, and so on.
Well-monitored budgets can make operations more efficient and often clear the path for improved resources and tools. Badly managed budgets, on the other hand, cause confusion and spark discussions about cost savings, staff reductions, and potential outsourcing.
Why monitor?
The need to know and control is the foundation of monitoring, but in this context, it’s about justifying the cost of infrastructure, staff, and tools. For many bystanders, IT is an abyss of black magic. Most realize it is a necessary part of daily operations, yet there is always concern over costs that regularly increase as technology evolves.
So, the concern for cost and hunt for efficiency drive the quest for a crystal ball (or a tech solution) that can bring clarity and provide evidence when it comes to IT investments.
Monitoring tools
Monitoring has evolved into different segments. We now have network monitoring tools like SolarWinds, Nagios Core, and Zabbix, and infrastructure monitoring tools like OpManager, VMware vROps, and perhaps even my old companion WhatsUp Gold.
Then there are application monitoring tools like Raygun, Pingdom, and AppDynamics. And in the realm of performance monitoring, there are tools like Traceview, Datadog, and Dynatrace.
Naturally, many of these tools cover several segments, but I just wanted to mention a few of the available options for reference.
How about new tools?
There is something about monitoring that constantly seems to require completely new tools that can discover more, act faster, be more proactive, shed light on more things. But most of all, from a management perspective, these tools need to be able to calculate cost and efficiency in order to support the budgets that inevitably will refer back to IT spending.
Since monitoring reports and dashboards are used by managers, the tools are usually colorful and seemingly easy to understand, which means that “anyone” can become an “expert.” As a result, decisions about investing in a new monitoring tool involve many people besides the sysadmin. This can feel like a nuisance, but if you understand their need for information, it is possible to limit their involvement to reporting and keep them away from discovery or automatic remediation. Split up the responsibilities and define actual areas of expertise.
KPI—gone wrong
By general configuration, I mean enabling the default triggers, plus some custom-made ones that safeguard a newly constructed KPI (key performance indicator). “If you can monitor something, why not?” seems to be an all too common starting point. This can be an easy mistake during the initial setup of a new monitoring tool. Avoid this trap by starting with the basics, and wait a bit before implementing all the bells and whistles.
Personally, I am a bit skeptical when it comes to KPIs because they can send the organization in the completely wrong direction, as in this example:
The KPI team, located in a different country, were not operating any servers and had limited understanding of IT, but realized that servers need to be patched regularly so as not to fall behind. They decided to set up a trigger that checked when a server was patched and subsequently rebooted. The KPI was set to 10 times per month. All sysadmins protested: they could not possibly disturb the 24/7 business operations that often by rebooting the servers. The current mode of operation was to collect patches and reboot twice per month. An additional reboot would only happen in case of an emergency. The arguments were not accepted, and the KPI of 10 was maintained. So the admins injected a text line that would trigger the monitoring tool to show that they all were compliant with the KPI. They also filled in the change management tool to show that they planned 10 reboots per server per month. This is how “shadow IT” gains ground, and what actually goes on is hidden.
Trigger happy
Another common scenario is to enable all default triggers and set them to automatically create incident records and take corrective actions. All of a sudden, the incident management system is flooded with automatic tickets, and the servers are tuned and trimmed, which, in turn, causes the application to fail. Complexity becomes overwhelming, and IT performance and reliability plummet, leading to angry users and managers.
My only advice here is to be careful when enabling default triggers and adding actions, because things can easily get out of control, and you will end up firefighting, trying to roll back system changes that cause application failures. At worst, you have forgotten to turn off the triggers, which will immediately “correct” whatever changes you apply.
Simple is hard; complex is easy
With so many features available in new monitoring tools, it is difficult not to use them. Most likely, the new tool was expensive, and managers and other engaged specialists want to see some bang for their time and investment. Here is where you, as a sysadmin, need to stand your ground, start simple, and then go step by step. Identify the different areas of monitoring and have the tool monitor them as-is to establish a baseline. Then progressively add triggers or actions in the areas you identified and gauge the feedback from each. Never make too many changes at once, and use the change management tool to keep track of what has been done, when, and by whom.
One day, not too long after the initiation of the new monitoring tool, you will look back and realize that complexity has come all by itself.
Automated actions
There are, of course, many actions you can hand over to the monitoring system to initiate, log, and report on. If an action fails several times, have the system try something else. If other things happen, trigger an alert for them, and you or the helpdesk can decide the best course of action. The benefit of automating actions is that you and other service functions will have more time to innovate, to come up with new things and ideas that will drive the organization forward. Monitoring is about maintaining a state, so do what you can to automate it.
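As a rough illustration of the "retry, then hand over" pattern described here, the following Python sketch tries an automated action a limited number of times and raises an alert for a human if it keeps failing. It is not taken from any particular monitoring product; `run_with_escalation`, `restart_service`, and the `alerts` list are hypothetical names used only for this example.

```python
from typing import Callable


def run_with_escalation(action: Callable[[], bool],
                        escalate: Callable[[str], None],
                        max_attempts: int = 3) -> bool:
    """Try an automated action; escalate to a human after repeated failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            if action():
                # Log the result so it can show up in reports later.
                print(f"attempt {attempt}: action succeeded")
                return True
            print(f"attempt {attempt}: action reported failure")
        except Exception as exc:
            print(f"attempt {attempt}: action raised {exc!r}")
    # The action failed every time: stop retrying and let a person decide.
    escalate(f"automated action failed {max_attempts} times; manual review needed")
    return False


# Hypothetical stand-ins for a real corrective action and a real alert channel.
alerts = []


def restart_service() -> bool:
    return False  # pretend the restart never works


result = run_with_escalation(restart_service, alerts.append)
```

Capping the number of attempts matters: an action that retries forever is exactly the kind of trigger that quietly "corrects" things until the application fails.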
Management decisions
As a sysadmin, you can easily have monitoring tools trigger events for you and eliminate recurring events, thus ensuring that the user community can go about their daily tasks. However, you also need to provide management with information that is useful to them: what you are doing, the percentage of availability, and the reasons why availability drops. Provide information that will support new investments, be it in licenses, hardware, cloud resources, user or sysadmin training, etc.
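To make the availability figure concrete, here is a minimal Python sketch that turns a list of outages into an availability percentage plus downtime grouped by reason. The function name, the record layout, and the sample outages are all assumptions made up for illustration; a real monitoring tool would export this data in its own format.

```python
from collections import defaultdict


def availability_report(period_minutes, outages):
    """Return (availability in percent, downtime minutes per reason).

    `outages` is a list of (reason, minutes_down) tuples.
    """
    downtime_by_reason = defaultdict(float)
    for reason, minutes in outages:
        downtime_by_reason[reason] += minutes
    total_down = sum(downtime_by_reason.values())
    availability = 100.0 * (period_minutes - total_down) / period_minutes
    return availability, dict(downtime_by_reason)


# Hypothetical sample data for a 30-day month.
period = 30 * 24 * 60
outages = [
    ("patching reboot", 20),
    ("disk full", 45),
    ("patching reboot", 25),
]

availability, reasons = availability_report(period, outages)
print(f"Availability: {availability:.2f}%")
# List the biggest causes of downtime first.
for reason, minutes in sorted(reasons.items(), key=lambda item: -item[1]):
    print(f"  {reason}: {minutes:.0f} min")
```

Grouping downtime by reason is the part managers tend to find most useful, since it connects a drop in availability to something that an investment could fix.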
If you manage to show how much you have automated and what valuable things you have been able to do with that time saved, this is a real winner. This way, management, and perhaps many of your colleagues, can see the benefit of automation in boosting efficiency and creating new value.
Keep it simple, and keep it clear. Display simple relationships that make it easy for management to understand why, for example, paying for license upgrades will not only improve system speed but also provide better availability for the users. The managers who receive your monitoring reports will also need to explain and justify investments to their own managers, so keep that in mind too.
Sysadmin is not finance
Some monitoring tools allow you to set financial parameters, and my best advice is to involve those who actually work with finance. That way, the effort and time spent or saved translate into real values that management can trust to inform their decisions. With repetitive tasks out of the way and time available for new projects and innovation, it will become clear that effective monitoring with the right triggers and automated actions is the best thing since SPAM in a can. The monitoring can trigger automated corrective actions, availability goes up, and the numbers have credibility. That matters.
Wrapping up
Monitoring provides control and visibility for sysadmins, and it is a great help in their daily work, but it is also an important ally for IT managers to show proof of performance, justify improvements, and support automation. Don’t enable everything or just run it out of the box. Start with the basics and work your way up from there. Complex is easy; simple is hard. Use the monitoring reports to help managers understand. Involve the right people and departments to set up monitoring so that it also becomes useful for managers to make informed decisions about savings, improvements, and necessary investments. Use monitoring to your advantage by supporting management decisions with valid information, and the whole organization will benefit.
https://www.redhat.com/sysadmin/linux-system-monitoring-managers