Shell Script to Check Linux System Health

September 10, 2014

In this article we introduce a shell script that performs a Linux system health check. The script collects system information and status such as hostname, kernel version, uptime, and CPU, memory and disk usage. It uses the hostname, uname, uptime, who, mpstat, lscpu, ps, top, df, free and bc commands to gather system information, and cut, grep, awk and sed for text processing. The output is a text file generated in the current directory. A variable holds the email address to which the script can send the report file. Apart from reporting system status, the script checks CPU load and filesystem usage against predefined thresholds.

Remember: make sure all of the above commands are installed and working, so that the script can output all results correctly.
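A quick way to verify this before running the script is to test each command with `command -v`. The sketch below is only an illustration: `missing_commands` is a hypothetical helper that is not part of linuxhealthcheck.sh, and the command list is taken from the description above.

```shell
#!/bin/bash
# Hypothetical helper (not part of linuxhealthcheck.sh): print every
# command from the argument list that cannot be found in PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        # command -v prints nothing and returns non-zero when the
        # command is not available.
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Command list assumed from the article's description of the script.
missing=$(missing_commands hostname uname uptime who mpstat lscpu ps top \
                           df free bc cut grep awk sed)

if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names are joined on one line.
    echo "Missing commands:" $missing
else
    echo "All required commands are available."
fi
```

On most distributions mpstat comes from the sysstat package, as the script itself points out.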

Understanding the linuxhealthcheck.sh Script

#!/bin/bash
#The first line tells the system which shell to use.
#Here we set the email address to send the report to. If no email address is provided, the report file is only saved.
EMAIL='alerts@account.com'
#We create a function so that it is easy to manage what to do with the output.
function sysstat {

#Print the header, hostname (hostname command), kernel version (uname -r), uptime (uptime command) and last reboot time (who -b).
#sed cuts the uptime output down to the uptime value only, and awk picks the date and time fields from the who -b output.
echo -e "

####################################################################

Health Check Report (CPU,Process,Disk Usage, Memory)

####################################################################

Hostname : `hostname`

Kernel Version : `uname -r`

Uptime : `uptime | sed 's/.*up \([^,]*\), .*/\1/'`

Last Reboot Time : `who -b | awk '{print $3,$4}'`

*********************************************************************

CPU Load - > Threshold < 1 Normal > 1 Caution , > 2 Unhealthy

*********************************************************************

"

#Here we check whether the mpstat command is available on the system.
MPSTAT=`which mpstat`

#Here we save the exit code of the previous command.
MPSTAT=$?

#If the exit status is not 0, the mpstat command was not found (it is not installed on the system).
if [ $MPSTAT != 0 ]

then

echo "Please install mpstat!"

echo "On Debian based systems:"

echo "sudo apt-get install sysstat"

echo "On RHEL based systems:"

echo "yum install sysstat"

else

echo -e ""

#Here we check in the same way whether lscpu is installed.
LSCPU=`which lscpu`

LSCPU=$?

if [ $LSCPU != 0 ]

then

RESULT=$RESULT" lscpu required to produce accurate results"

else

#If lscpu is installed, we can get the number of CPUs on the system and print statistics for each one using the mpstat command.
cpus=`lscpu | grep -e "^CPU(s):" | cut -f2 -d: | awk '{print $1}'`

i=0

#Here we loop over the CPUs to get and print the usage statistics for each one.
while [ $i -lt $cpus ]

do

#Here we get the statistics for one CPU and print them. awk is used because grep cannot select by column: it checks whether the third field equals the variable $i (which runs from 0 to the number of CPUs minus 1) and prints the %usr value for that CPU.
echo "CPU$i : `mpstat -P ALL | awk -v var=$i '{ if ($3 == var ) print $4 }' `"

#Here we increment the $i variable for the loop.
let i=$i+1

done

fi

#Here we get the system load average from the uptime command; awk and cut extract the first (1-minute) value.
#The status line uses the same pipeline, and a final awk prints Unhealthy if the value is above 2, Caution if it is above 1, and Normal otherwise.
echo -e "

Load Average : `uptime | awk -F'load average:' '{ print $2 }' | cut -f1 -d,`

Heath Status : `uptime | awk -F'load average:' '{ print $2 }' | cut -f1 -d, | awk '{if ($1 > 2) print "Unhealthy"; else if ($1 > 1) print "Caution"; else print "Normal"}'`

"

fi

#With the ps command we get the list of processes, and awk keeps only the needed columns (PID, %MEM, RSS and command). sort then orders the lines by the third column (RSS), largest first, and head keeps only the top 10.
#With the top command in batch mode we get the processes using the most CPU; the combination of head and tail keeps the column header and the top 10 processes.
echo -e "

******************************************************************

Process

******************************************************************

Top memory using processs/application

PID %MEM RSS COMMAND

`ps aux | awk '{print $2, $4, $6, $11}' | sort -k3rn | head -n 10`

Top CPU using process/application

`top b -n1 | head -17 | tail -11`

**********************************************************************

Disk Usage - > Threshold < 90 Normal > 90% Caution > 95 Unhealthy

**********************************************************************

"
#We get disk usage with the df command. The -P option gives POSIX-format output, which keeps each filesystem on one line (this resolved problems with network shares and similar entries). We write the output to a temporary file so that we can work with the information more than once.
df -Pkh | grep -v 'Filesystem' > /tmp/df.status

#We create a loop to process df.status line by line.
while read DISK

do

#Here we take a line from df.status and print the result formatted with the awk command.
LINE=`echo $DISK | awk '{print $1,"\t",$6,"\t",$5," used","\t",$4," free space"}'`

echo -e $LINE

echo

done < /tmp/df.status

echo -e "

Heath Status"

echo

#Here is almost the same loop, but we check the disk usage and print Normal if the value is less than 90, Caution if it is between 90 and 95, and Unhealthy if it is 95 or greater.
while read DISK

do

USAGE=`echo $DISK | awk '{print $5}' | cut -f1 -d%`

if [ $USAGE -ge 95 ]

then

STATUS='Unhealthy'

elif [ $USAGE -ge 90 ]

then

STATUS='Caution'

else

STATUS='Normal'

fi

LINE=`echo $DISK | awk '{print $1,"\t",$6}'`

#here we print result with status
echo -ne $LINE "\t\t" $STATUS

echo

done < /tmp/df.status

#here we remove df.status file
rm /tmp/df.status

#Here we get the total, used and free memory and the total, used and free swap (in MB) and save them to variables.
TOTALMEM=`free -m | head -2 | tail -1| awk '{print $2}'`
#Variables like this one store the value in GB as a decimal number (we use bc for the arithmetic, since shell arithmetic only handles integers). bc omits the leading zero of results below 1, so the if prints a 0 first when the value is greater than 0 but less than 1024.
TOTALBC=`echo "scale=2;if($TOTALMEM<1024 && $TOTALMEM > 0) print 0;$TOTALMEM/1024"| bc -l`
USEDMEM=`free -m | head -2 | tail -1| awk '{print $3}'`
USEDBC=`echo "scale=2;if($USEDMEM<1024 && $USEDMEM > 0) print 0;$USEDMEM/1024"|bc -l`
FREEMEM=`free -m | head -2 | tail -1| awk '{print $4}'`
FREEBC=`echo "scale=2;if($FREEMEM<1024 && $FREEMEM > 0) print 0;$FREEMEM/1024"|bc -l`

TOTALSWAP=`free -m | tail -1| awk '{print $2}'`
TOTALSBC=`echo "scale=2;if($TOTALSWAP<1024 && $TOTALSWAP > 0) print 0;$TOTALSWAP/1024"| bc -l`
USEDSWAP=`free -m | tail -1| awk '{print $3}'`
USEDSBC=`echo "scale=2;if($USEDSWAP<1024 && $USEDSWAP > 0) print 0;$USEDSWAP/1024"|bc -l`
FREESWAP=`free -m | tail -1| awk '{print $4}'`
FREESBC=`echo "scale=2;if($FREESWAP<1024 && $FREESWAP > 0) print 0;$FREESWAP/1024"|bc -l`

#We print the values in GB, and we get the percentage of free memory by dividing Free by Total. The swap values are printed in the same way.
echo -e "

********************************************************************

Memory

********************************************************************

Physical Memory

Total\tUsed\tFree\t%Free

${TOTALBC}GB\t${USEDBC}GB \t${FREEBC}GB\t$(($FREEMEM * 100 / $TOTALMEM ))%

Swap Memory

Total\tUsed\tFree\t%Free

${TOTALSBC}GB\t${USEDSBC}GB\t${FREESBC}GB\t$(($FREESWAP * 100 / $TOTALSWAP ))%
"

}

#Here we build the file name from the hostname and the current date and time.
FILENAME="health-`hostname`-`date +%y%m%d`-`date +%H%M`.txt"

#Here we run the function and save the result to the generated file.
sysstat > $FILENAME

#Here we tell the user where the report was saved.
echo -e "Reported file $FILENAME generated in current directory." $RESULT

#Here we check whether the user provided an email address to send the report to.
if [ "$EMAIL" != '' ]

then

#If an email address was provided, we check whether the mail command is available to send it.
STATUS=`which mail`
#If the mail command does not exist on the system (the previous command returned a non-zero exit code), we warn the user that it is not installed.
if [ "$?" != 0 ]

then

echo "The program 'mail' is currently not installed."

#If mail is installed, we send the report to the user by email.
else

cat $FILENAME | mail -s "$FILENAME" $EMAIL

fi

fi
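As written, the script mails every report. If a mail is wanted only when some check is not Normal, one possible approach is to scan the finished report for a Caution or Unhealthy status first. The sketch below assumes the report layout shown in this article; `needs_alert` is a hypothetical helper and is not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): succeed when the
# report file given as $1 contains a Caution or Unhealthy status.
needs_alert() {
    # The banner lines always mention both words as thresholds,
    # so lines containing "Threshold" are skipped first.
    grep -v 'Threshold' "$1" | grep -Eq 'Caution|Unhealthy'
}

# Example with a small made-up report; the real script would pass "$FILENAME".
report=$(mktemp)
printf '%s\n' \
    'Disk Usage - > Threshold < 90 Normal > 90% Caution > 95 Unhealthy' \
    '/dev/root / Normal' > "$report"

if needs_alert "$report"; then
    echo "alert"
else
    echo "no alert"
fi
rm -f "$report"
```

In the script, the mail command could then be wrapped in `if needs_alert "$FILENAME"`, so that healthy reports are only saved to disk.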

Don't copy the script from above, as it may not work; download the linuxsystemhealth.sh script from here.
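One case where the script is known to fail is the percentage calculation in the memory section: `$(($FREESWAP * 100 / $TOTALSWAP ))` divides by zero on a system that has no swap. A minimal sketch of a guard follows; `percent_free` is a hypothetical helper name, not something the original script defines.

```shell
#!/bin/bash
# percent_free is a hypothetical helper, not part of the original script.
# Arguments: free amount and total amount, both whole numbers (MB).
percent_free() {
    local free_mb="$1" total_mb="$2"
    if [ "$total_mb" -gt 0 ]; then
        # Integer arithmetic, the same formula the script uses.
        echo "$(( free_mb * 100 / total_mb ))%"
    else
        # Nothing of this kind is configured (for example no swap).
        echo "n/a"
    fi
}

percent_free 248 248   # prints 100%
percent_free 0 0       # prints n/a
```

The two percentage expressions in the memory section could then be replaced by `$(percent_free $FREEMEM $TOTALMEM)` and `$(percent_free $FREESWAP $TOTALSWAP)`.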


System health report

[root@linoxide script]# cat health-linoxide.com-140831-0909.txt | more

#####################################################################
Health Check Report (CPU,Process,Disk Usage, Memory)
#####################################################################

Hostname : linoxide.com
Kernel Version : 3.15.4-x86_64-linoxide342
Uptime : 7 days
Last Reboot Time : 2014-08-27 08:46

*********************************************************************
CPU Load - > Threshold < 1 Normal > 1 Caution , > 2 Unhealthy
*********************************************************************

CPU0 : 0.06

Load Average : 0.00

Heath Status : Normal

*********************************************************************
Process
*********************************************************************

=> Top memory using processs/application

PID %MEM RSS COMMAND
1361 12.6 127896 /usr/lib/systemd/systemd-journald
1642 7.1 72252 /usr/sbin/rsyslogd
1644 1.9 20148 /usr/bin/python
2340 1.8 19092 /sbin/dhclient
1634 1.4 14748 /usr/sbin/NetworkManager
31410 0.8 8724 sshd:
31441 0.7 7888 sshd:
31432 0.7 7784 sshd:
2558 0.5 5988 /usr/sbin/sshd
1 0.5 5412 /sbin/init

=> Top CPU using process/application
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 98572 5412 2796 S 0.0 0.5 0:11.82 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.51 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:03.33 kworker/u2:0
7 root 20 0 0 0 0 S 0.0 0.0 0:04.17 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
10 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 khelper
11 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs

*********************************************************************
Disk Usage - > Threshold < 90 Normal > 90% Caution > 95 Unhealthy
*********************************************************************

/dev/root / 11% used 13G free space

devtmpfs /dev 0% used 495M free space

tmpfs /dev/shm 0% used 496M free space

tmpfs /run 1% used 495M free space

tmpfs /sys/fs/cgroup 0% used 496M free space

tmpfs /tmp 0% used 496M free space

Heath Status

/dev/root / Normal
devtmpfs /dev Normal
tmpfs /dev/shm Normal
tmpfs /run Normal
tmpfs /sys/fs/cgroup Normal
tmpfs /tmp Normal

*********************************************************************
Memory
*********************************************************************

=> Physical Memory

Total Used Free %Free

0.96GB 0.46GB 0.50GB 52%

=> Swap Memory

Total Used Free %Free

0.24GB 0GB 0.24GB 100%

[root@linoxide script]#
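The CPU load thresholds in this report are absolute values, so on a machine with many cores a load above 2 does not necessarily mean trouble. A possible refinement, sketched here under the assumption that load should be judged per core, divides the load average by the core count before applying the same thresholds. `load_status` is a hypothetical name, and reading `/proc/loadavg` and calling `nproc` are choices made for this sketch rather than what the script does.

```shell
#!/bin/bash
# load_status is a hypothetical helper, not part of the original script.
# Arguments: a load average and the number of CPU cores.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        per_core = load / cores
        if (per_core > 2) {
            print "Unhealthy"
        } else if (per_core > 1) {
            print "Caution"
        } else {
            print "Normal"
        }
    }'
}

load_status 0.00 1    # prints Normal
load_status 40 32     # prints Caution
load_status 70 32     # prints Unhealthy

# On Linux, the live values can be read like this (assumes nproc exists).
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```

With one core the result is the same as the script's own check, so the sample report above would still read Normal.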


Comments (27)


  1. Keith says:

    Thanks for this great script and details, however I can't download the script from your link. Access forbidden.

    Can you fix it?

  2. lfzyx says:

    When I download the shell script

    403 Forbidden

  3. Bobbin Zachariah says:

    Sorry , its fixed now !!

  4. muzammil says:

    please fix this error

    inuxhealthcheck.sh: 132: /home/muzamil/Desktop/linuxhealthcheck.sh: Syntax error: "}" unexpected

  5. Amit says:

    Hey thanks it's very useful.....g8...

  6. Kesavan says:

    Should this be used under only root privileges??

  7. Sharath says:

    Great script. Thank you so much for sharing!!!

  8. Bobbin Zachariah says:

    Welcome Sharath :-)

    • nani says:

      The script is really good!

      If in-case i just want mail alert only when the threshold limit exceeds normal for all the checks.

      As its hard to check for individual servers and there thresholds.

      Appreciate your reply.

      Thanks in advance.

  9. Nitin says:

    hi Bobbin,
    I get a division by 0 error at this line
    ${TOTALSBC}GB\t${USEDSBC}GB\t${FREESBC}GB\t$(($FREESWAP * 100 / $TOTALSWAP))%

    ./linuxhealthcheck.sh: line 112: 0 * 100 / 0 : division by 0 (error token is "0 ")

    Regards
    Nitin

    • Bobbin Zachariah says:

      Most probably you got this error because you don't have swap, or it less than 1 GB.
      In script it takes swap in GB, so if less, it will take 0.

      • Octopus says:

        Your script works only under some conditions and in a specific hardware environment. We must review your script between lines 83-87, because temporary files are misinterpreted and affected by the hardware environment on which the Linux system is running. Today web servers are equipped with SSD drives. Swap partitions do not like SSDs.
        Best Regards

  10. ashutosh says:

    I am getting this error while running this script in my directory "u01":

    which: no lscpu in (/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin)
    Please suggest.

  11. Cris says:

    Hi Bobbin,

    First of all, This is a great tool. Thank you!

    I'm having an problem getting the result i need in the Memory portion. Everytime i run i the below error occurs.

    [root@dev01 ~]# ./linuxhealthcheck.sh
    ./linuxhealthcheck.sh: line 99: bc: command not found
    ./linuxhealthcheck.sh: line 101: bc: command not found
    ./linuxhealthcheck.sh: line 103: bc: command not found
    ./linuxhealthcheck.sh: line 105: bc: command not found
    ./linuxhealthcheck.sh: line 107: bc: command not found
    ./linuxhealthcheck.sh: line 109: bc: command not found

    Is there any apps i need to install?

    Thanks,
    Cris

  12. Saikrishna says:

    Hello,

    Great Script :)

    But I want to run the same script for multiple servers ? Do I need to use for loop via. ssh ?

    Can you help me ?

    Thanks

  13. ranjith says:

    Hey, thanks for the script its simple and neat.
    is there any alternative to mail application on debian, I couldt find any repo named mail.

    thank you

  14. hugich says:

    First of all thank you for this script!
    May i suggest a change?

    #here with uptime command we get load average for system, and cut command helps to process result.
    Load Average : `uptime | awk -F'load average:' '{ print $2 }' | cut -f1 -d,`

    #same as before, but with awk command we check if system is Normal (if value less than 1, Caution (if between 1 and 2) and Unhealthy.
    Heath Status : `uptime | awk -F'load average:' '{ print $2 }' | cut -f1 -d, | awk '{if ($1 > 2) print "Unhealthy"; else if ($1 > 1) print "Caution"; else print "Normal"}'`

    in Health Status uptime return the value for a single core, in a multicore environment you should change this script considering that. For example in a 32 core env the Unhealty state will be only if $1 > 64 Caution with $1 > 32 and so on.

    Again thank you!

  15. Sasha says:

    When I run the code, I get this:

    ./linuxhealthcheck.sh: line 43: bc: command not found

    what does this mean?

  16. Lucian says:

    Many thanks !

  17. Woj says:

    microsecadmin@uamicrweb1:~$ ./linuxhealthcheck.sh
    Reported file health-uamicrweb1-161130-1425.txt generated in current directory.
    ./linuxhealthcheck.sh: line 134: wojciech.boczarski@microsec.co.uk: command not found
    The program 'mail' is currently not installed.
    microsecadmin@uamicrweb1:~$ mail
    Cannot open mailbox /var/mail/microsecadmin: Permission denied
    No mail for microsecadmin
    microsecadmin@uamicrweb1:~$ sudo mail
    No mail for root

    any ideas how to fix that?
