By Funksen
on Wed 27 of Jan., 2010 13:24 CET
Problem:
My company needs to monitor servers, services, switches and UPSs. The goal is to set up a monitoring system that checks the devices and services and sends alarms to several people. It should distinguish between critical and non-critical services and devices.
Preface:
Everything you do here happens at your own risk!
I'm using FreeBSD 7.2 for this task, to be more precise a jailed instance of it, so you should know how to install and update FreeBSD. Please update your ports tree before starting to make sure you get the newest version of Nagios. I'll describe in another post how to update your system.
Tip: Back up, because you will break something!
Solution:
As you can see from the title, I decided to use Nagios. You can find a lot of resources at:
http://www.nagios.org/
http://www.monitoringexchange.org/
http://nagios.manubulon.com/
Installing
Okay, let's start the actual work! We need something to look at the output of our work, and the standard Apache will do that for us.
>cd /usr/ports/www/apache22 && make install clean
Just use the default options for the Apache port; you don't need to change anything for this task.
Now we need Nagios:
>cd /usr/ports/net-mgmt/nagios && make install clean
Enable the embedded Perl option and hit OK. None of the X11 options are needed as long as you don't use X11. When the installer asks which options PHP should be compiled with, you have to check Apache, or the mod_php module won't be compiled.
For the Nagios plugins, enable all options; they only need a few megabytes of disk space. FreeBSD will fetch them from SourceForge.
Now keep waiting a little bit (how long depends on the power of your CPU core(s)) until the installer asks whether it should create a group "nagios". Answer Yes. After that you'll be asked to create a user called "nagios". Answer Yes. A few moments later the Nagios installer finishes and gives you some advice, which we'll follow now.
Configuration
First of all, we'll edit Apache's httpd.conf. This is needed so that the Nagios GUI is displayed (properly).
>vi /usr/local/etc/apache22/httpd.conf
Check that the PHP module is loaded:
LoadModule php5_module libexec/apache22/libphp5.so
To enable CGI, delete the # in front of the following line. You can also add the .pl extension if you want to run Perl scripts:
AddHandler cgi-script .cgi .pl
Now search for the section of httpd.conf that holds the other Alias and ScriptAlias directives and add:
ScriptAlias /nagios/cgi-bin/ /usr/local/www/nagios/cgi-bin/
Alias /nagios/ /usr/local/www/nagios/
As I don't cover security here, we'll make Nagios visible to everyone. If you want to restrict access, please read the Apache manual:
http://httpd.apache.org/docs/2.0/howto/auth.html
With that in mind, add the following lines to the Directory section for the static Nagios pages (/usr/local/www/nagios):
Order deny,allow
Allow from all
php_flag engine on
php_admin_value open_basedir /usr/local/www/nagios/:/var/spool/nagios/
and this line to the Directory section for the CGI application (/usr/local/www/nagios/cgi-bin):
Options +ExecCGI
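If you would rather keep the Nagios settings out of the main httpd.conf, the same directives can live in a file of their own. The sketch below is a hypothetical helper (write_nagios_apache_conf is my name, not part of Nagios or Apache). It writes the ScriptAlias and Alias lines and the two Directory sections described above into DIR/nagios.conf. It assumes that your httpd.conf pulls in that directory with an Include line. As far as I know, the FreeBSD apache22 port's default config has "Include etc/apache22/Includes/*.conf", but check yours before relying on it. Don't add the same lines to httpd.conf as well.

```shell
#!/bin/sh
# write_nagios_apache_conf DIR
# Hypothetical helper: writes the Nagios web settings into DIR/nagios.conf.
# Assumes the default FreeBSD port paths used in this post.
write_nagios_apache_conf() {
    conf_dir=$1
    cat > "$conf_dir/nagios.conf" <<'EOF'
# ScriptAlias must come before Alias: when several aliases match,
# Apache uses the first one listed.
ScriptAlias /nagios/cgi-bin/ /usr/local/www/nagios/cgi-bin/
Alias /nagios/ /usr/local/www/nagios/

# Static pages and PHP (open to everyone, no authentication!)
<Directory /usr/local/www/nagios>
    Order deny,allow
    Allow from all
    php_flag engine on
    php_admin_value open_basedir /usr/local/www/nagios/:/var/spool/nagios/
</Directory>

# CGI programs of the web interface
<Directory /usr/local/www/nagios/cgi-bin>
    Options +ExecCGI
</Directory>
EOF
}
```

Root's default csh can't use sh functions, so source the file from an sh session. Then run:
>write_nagios_apache_conf /usr/local/etc/apache22/Includes
>apachectl configtest
The configtest lets Apache check the syntax before you start it.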
Okay, now it should be possible to start the web server:
>apachectl start
If you are working in a jail, ps -ax doesn't work properly, so just type http://[IP of your server]/nagios into the address bar of your browser. The rest is up to Apache: if it serves the page, you should see the Nagios start page.
If the web server doesn't start in the jail, you may have forgotten to load the kernel module accf_http. You can check whether it's loaded with:
>kldstat | grep accf
You should see something like:
5 1 0xc6c22000 2000 accf_http.ko
Kernel modules can't be loaded inside a jail; you have to do that on the jail host:
>kldload accf_http
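Checking by eye works. If you want the check in a script on the jail host, a small filter over kldstat's output is enough. kldstat_has is a hypothetical helper of mine. It only assumes the kldstat output format shown above, where the file name is the last column. To load the module automatically at boot, the usual FreeBSD way is to add the line accf_http_load="YES" to /boot/loader.conf on the jail host.

```shell
#!/bin/sh
# kldstat_has NAME
# Hypothetical helper: reads `kldstat` output on stdin and succeeds (exit 0)
# if a file called NAME.ko is listed. The file name is the last column.
kldstat_has() {
    awk -v want="$1.ko" '
        $NF == want { found = 1 }
        END { exit !found }     # exit 0 if found, 1 otherwise
    '
}
```

On the jail host, from sh:
>kldstat | kldstat_has accf_http || kldload accf_http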
Congratulations, you've installed Nagios, but you can't monitor anything yet. So far you've only installed the static website; now we have to get the Nagios service up.
Nagios service
Preparation:
I'll try to give you a crash course in using Nagios. Before starting, make sure you know what SNMP is, how to use it, and how to use snmpwalk.
You should find your config files under /usr/local/etc/nagios.
There you'll find nagios.cfg-sample, cgi.cfg-sample and resource.cfg-sample. Keep a copy of these files in another location, in my case a folder called sample, and rename the originals to nagios.cfg, cgi.cfg and so on:
>cd /usr/local/etc/nagios
>mkdir sample
>cp *.cfg-sample sample/
>mv nagios.cfg-sample nagios.cfg
and the same for cgi.cfg-sample and resource.cfg-sample, or use my rename script. Now you should have 3 .cfg files and 2 folders: sample and objects.
Let's go to the objects folder and do exactly the same:
>cd objects
>mkdir sample
>cp *.cfg-sample sample/
>mv commands.cfg-sample commands.cfg
and so on...
In the end you should have a fileset like this:
commands.cfg
contacts.cfg
localhost.cfg
printer.cfg
sample
switch.cfg
templates.cfg
timeperiods.cfg
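My rename script isn't included in this post, so the following is not it. It is a minimal sketch of the same steps, assuming that every file to activate ends in .cfg-sample. install_samples is a hypothetical name. It keeps a pristine copy of each sample in DIR/sample, renames the original to .cfg, and refuses to overwrite a .cfg file that already exists.

```shell
#!/bin/sh
# install_samples DIR
# Hypothetical helper: for every DIR/*.cfg-sample, keep a copy in DIR/sample/
# and rename the original to *.cfg. Existing .cfg files are never overwritten.
install_samples() {
    dir=$1
    mkdir -p "$dir/sample" || return 1
    for sample in "$dir"/*.cfg-sample; do
        [ -e "$sample" ] || continue        # the glob matched nothing
        cp "$sample" "$dir/sample/" || return 1
        target=${sample%-sample}            # nagios.cfg-sample -> nagios.cfg
        if [ -e "$target" ]; then
            echo "skipping $target: file already exists" >&2
            continue
        fi
        mv "$sample" "$target" || return 1
    done
}
```

From sh, run it once for each folder:
>install_samples /usr/local/etc/nagios
>install_samples /usr/local/etc/nagios/objects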
Actual work
The time of mindless copy and paste is now over; you have to start thinking.
The whole trick with Nagios is understanding inheritance. The Nagios team did a lot of work for you, so let's have a look. To be able to view all hosts and services, edit cgi.cfg and set the parameter
use_authentication=0
from 1 to 0, or you'll get an error message. (As the comments in this file say, this is a bad idea on a production system.)
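If you prefer not to open an editor for a one-line change, a key=value setting like this can be rewritten with a short awk filter. set_cfg_option is a hypothetical helper. It assumes the key starts at the beginning of its line, as use_authentication does in the sample cgi.cfg, and it leaves commented-out lines alone. It writes back through cat so the file keeps its owner and permissions.

```shell
#!/bin/sh
# set_cfg_option FILE KEY VALUE
# Hypothetical helper: replaces every line starting with "KEY=" by "KEY=VALUE".
# Commented lines ("#KEY=...") are not touched. Writing back with cat keeps
# the file's owner and permissions.
set_cfg_option() {
    file=$1 key=$2 value=$3
    awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 { print key "=" value; next }
        { print }
    ' "$file" > "$file.tmp" || return 1
    cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}
```

For example, from sh:
>set_cfg_option /usr/local/etc/nagios/cgi.cfg use_authentication 0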
You can find the Nagios configuration under /usr/local/etc/nagios/nagios.cfg
Here you just need to define which object configuration files should be used. For example:
# You can specify individual object config files as shown below:
cfg_file=/usr/local/etc/nagios/objects/commands.cfg
cfg_file=/usr/local/etc/nagios/objects/contacts.cfg
cfg_file=/usr/local/etc/nagios/objects/timeperiods.cfg
cfg_file=/usr/local/etc/nagios/objects/templates.cfg
# Definitions for monitoring the local (FreeBSD) host
cfg_file=/usr/local/etc/nagios/objects/localhost.cfg
These files define the behavior of Nagios. A little later we'll add some of our own object files to monitor a Windows server, but for now, use this file set. Let's see what these files are for:
commands.cfg
The commands Nagios uses are defined here: how to check hosts and services, and how to send mails.
templates.cfg
This is where inheritance starts. This file is very important, because the "skeletons" (templates) of the things to monitor are defined here.
contacts.cfg
If a host or service switches to a warning or critical state, somebody has to be contacted. These contacts are defined here.
printer.cfg
Ready-to-use example configuration for monitoring printers.
switch.cfg
Ready-to-use example configuration for monitoring switches.
timeperiods.cfg
Defines when a staff member should be alerted.
Let's monitor the localhost!
If you've done everything as I told you,
>/usr/local/bin/nagios /usr/local/etc/nagios/nagios.cfg
will show something like:
Nagios Core 3.2.0
Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-12-2009
License: GPL
Website: http://www.nagios.org
Nagios 3.2.0 starting... (PID=66618)
Local time is Wed Jan 27 12:05:58 UTC 2010
and you should be glad!
Have a look at your web server and click on Host Groups.
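Nagios can also check its configuration without starting. nagios -v <config file> reports errors and warnings, and nagios -d <config file> starts it in the background as a daemon instead of in the foreground. start_nagios is a hypothetical wrapper that only starts the daemon when the check passes. The binary path is a parameter, so nothing beyond the paths used in this post is assumed.

```shell
#!/bin/sh
# start_nagios BIN CFG
# Hypothetical wrapper: verify the configuration with "BIN -v CFG" first and
# only start Nagios as a daemon ("BIN -d CFG") if the check succeeded.
start_nagios() {
    bin=$1 cfg=$2
    if ! "$bin" -v "$cfg"; then
        echo "configuration check failed, Nagios not started" >&2
        return 1
    fi
    "$bin" -d "$cfg"
}
```

From sh:
>start_nagios /usr/local/bin/nagios /usr/local/etc/nagios/nagios.cfg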
Let's monitor a printer!
Start with the easy stuff.
>vi /usr/local/etc/nagios/nagios.cfg
and delete the hash mark (#) in front of this line:
cfg_file=/usr/local/etc/nagios/objects/printer.cfg
Now you've told Nagios to read printer.cfg on startup.
>vi /usr/local/etc/nagios/objects/templates.cfg
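The same edit can be scripted, which helps once you enable more object files; switch.cfg works the same way. enable_cfg_file is a hypothetical helper. It removes a leading # from the matching cfg_file= line in nagios.cfg, or appends the line if nagios.cfg doesn't contain it yet.

```shell
#!/bin/sh
# enable_cfg_file MAIN_CFG OBJECT_CFG
# Hypothetical helper: makes sure "cfg_file=OBJECT_CFG" is active in MAIN_CFG.
# A commented-out copy ("#cfg_file=...") is uncommented; if the line is
# missing entirely, it is appended at the end.
enable_cfg_file() {
    main_cfg=$1 obj_cfg=$2
    awk -v line="cfg_file=$obj_cfg" '
        { stripped = $0; sub(/^#[ \t]*/, "", stripped) }
        stripped == line { print line; found = 1; next }
        { print }
        END { if (!found) print line }
    ' "$main_cfg" > "$main_cfg.tmp" || return 1
    cat "$main_cfg.tmp" > "$main_cfg" && rm -f "$main_cfg.tmp"
}
```

For example, from sh:
>enable_cfg_file /usr/local/etc/nagios/nagios.cfg /usr/local/etc/nagios/objects/printer.cfg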
Here you find the section about a generic host:
define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
and, a few lines below, the printer template:
define host{
name generic-printer ; The name of this host template
use generic-host ; Inherit default values from the generic-host template
check_period 24x7 ; By default, printers are monitored round the clock
check_interval 5 ; Actively check the printer every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 10 ; Check each printer 10 times (max)
check_command check-host-alive ; Default command to check if printers are "alive"
notification_period workhours ; Printers are only used during the workday
notification_interval 30 ; Resend notifications every 30 minutes
notification_options d,r ; Only send notifications for specific host states
contact_groups admins ; Notifications get sent to the admins by default
register 0 ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
As you can see, generic-printer inherits from generic-host. So if you make changes in generic-host, the generic-printer skeleton will change too! You can override an attribute just by setting it one level deeper. So if you add
retain_status_information 0
to generic-printer, it will override the value 1 inherited from generic-host, and so on.
I want to monitor an HP 3600n LaserJet, so I'll open printer.cfg:
>vi /usr/local/etc/nagios/objects/printer.cfg
and add a new host:
define host{
use generic-printer
host_name NikosPrinter
alias HP3600n @ ITroom
address 100.100.100.101
hostgroups network-printers
notes_url http://100.100.100.66/wiki/index.php/Drucker
action_url http://100.100.100.185
}
The host gets its default settings from generic-printer, which in turn gets its defaults from generic-host.
use: the template the settings are inherited from
host_name: how the host is named in Nagios
alias: a longer description shown in the web interface
address: IP address or FQDN (but prefer the IP)
hostgroups: used to group hosts if you have more printers
notes_url: a link to our internal wiki, where you can find more information
action_url: a link to the web interface of the printer
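With more than a few printers, typing these blocks gets repetitive. printer_host is a hypothetical helper that prints a minimal host definition inheriting from generic-printer, using the attributes explained above. notes_url and action_url are left out because they differ per site, so add them by hand if you use them. The default host group network-printers is taken from the example above.

```shell
#!/bin/sh
# printer_host NAME ALIAS ADDRESS [HOSTGROUP]
# Hypothetical helper: prints a host definition that inherits everything
# else from the generic-printer template in templates.cfg.
printer_host() {
    cat <<EOF
define host{
        use             generic-printer
        host_name       $1
        alias           $2
        address         $3
        hostgroups      ${4:-network-printers}
        }
EOF
}
```

For example, from sh, this appends the definition from above to printer.cfg:
>printer_host NikosPrinter 'HP3600n @ ITroom' 100.100.100.101 >> /usr/local/etc/nagios/objects/printer.cfg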
Okay, the host is defined; now the services:
define service{
use generic-service
host_name NikosPrinter
service_description Printer Status
check_command check_hpjd!-C public
normal_check_interval 10 ; Check the service every 10 minutes under normal conditions
retry_check_interval 1 ; Re-check the service every minute until its final/hard state is determined
notification_interval 0
}
service_description: name of the service in the web interface
normal_check_interval: check the service every 10 minutes under normal conditions
retry_check_interval: re-check the service every minute until its final/hard state is determined
notification_interval: how often (in minutes) a notification is re-sent while the problem persists; 0 means only one notification is sent. I don't want to get spammed by printers, so I think one mail is enough.
define service{
use generic-service
hostgroup_name network-printers
service_description PING
check_command check_ping!3000.0,80%!5000.0,100%
normal_check_interval 10
retry_check_interval 1
notification_interval 0
}
Okay, the same as above, but here I ping a whole host group (hostgroup_name) instead of a single host name.
Let's have a look at the commands:
check_command check_hpjd!-C public
Arguments are separated by “!”
How the standard macros work is described in the Nagios documentation, so I won't go any further into this.
And here is the matching definition from commands.cfg:
define command{
command_name check_hpjd
command_line $USER1$/check_hpjd -H $HOSTADDRESS$ $ARG1$
}
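To see what Nagios will actually run, you can do the substitution by hand. Everything after the first ! in check_command becomes $ARG1$, $ARG2$ and so on, and $HOSTADDRESS$ is the host's address. expand_check is a hypothetical illustration, not part of Nagios, and it only handles $ARGn$, $HOSTADDRESS$ and $USER1$. $USER1$ is set in resource.cfg, where it points at the plugin directory. Here it is passed in explicitly, so check your resource.cfg for the real value. /usr/local/libexec/nagios below is my assumption for the FreeBSD plugins port.

```shell
#!/bin/sh
# expand_check TEMPLATE CHECK_COMMAND HOSTADDRESS USER1
# Hypothetical illustration of Nagios macro substitution (not part of Nagios):
#   TEMPLATE       the command_line from commands.cfg
#   CHECK_COMMAND  the check_command value, e.g. "check_hpjd!-C public"
# Only $ARGn$, $HOSTADDRESS$ and $USER1$ are handled.
expand_check() {
    awk -v tmpl="$1" -v check="$2" -v addr="$3" -v user1="$4" '
        # replace every literal occurrence of old in s by new
        function replace_all(s, old, new,    out, i) {
            out = ""
            while ((i = index(s, old)) > 0) {
                out = out substr(s, 1, i - 1) new
                s = substr(s, i + length(old))
            }
            return out s
        }
        BEGIN {
            # parts[1] is the command name, parts[2..n] are $ARG1$..$ARG(n-1)$
            n = split(check, parts, "!")
            s = tmpl
            for (k = 2; k <= n; k++)
                s = replace_all(s, "$ARG" (k - 1) "$", parts[k])
            s = replace_all(s, "$HOSTADDRESS$", addr)
            s = replace_all(s, "$USER1$", user1)
            print s
        }'
}
```

For the printer above, the following prints /usr/local/libexec/nagios/check_hpjd -H 100.100.100.101 -C public:
>expand_check '$USER1$/check_hpjd -H $HOSTADDRESS$ $ARG1$' 'check_hpjd!-C public' 100.100.100.101 /usr/local/libexec/nagios
Running the printed line by hand is a quick way to debug a check that fails.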
Restart Nagios:
>ps -ax | grep nagios
>kill [pid]
>/usr/local/bin/nagios /usr/local/etc/nagios/nagios.cfg
Afterwards, the printer and its services should show up in the web interface.
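Instead of killing and restarting the process, Nagios 3 also rereads its configuration when it receives a SIGHUP. If you start it as a daemon (with -d), it writes its PID to the file named by lock_file in nagios.cfg, so you don't need ps at all. When it is started in the foreground as above, there is no lock file and this sketch won't help. reload_nagios is a hypothetical helper. Look up the lock file path in your own nagios.cfg, and it's a good idea to run nagios -v on the config first.

```shell
#!/bin/sh
# reload_nagios LOCKFILE
# Hypothetical helper: sends SIGHUP to the Nagios process whose PID is stored
# in LOCKFILE (the lock_file= setting in nagios.cfg; only written in -d mode).
reload_nagios() {
    lockfile=$1
    if [ ! -r "$lockfile" ]; then
        echo "cannot read lock file $lockfile" >&2
        return 1
    fi
    pid=$(cat "$lockfile")
    case $pid in
        ''|*[!0-9]*) echo "no valid PID in $lockfile" >&2; return 1 ;;
    esac
    kill -HUP "$pid"
}
```

From sh, call reload_nagios followed by the lock_file path from your nagios.cfg.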