Description: Candidates should be able to measure hardware resource and network bandwidth, identify and troubleshoot resource problems.
Key Knowledge Areas:
Measure CPU usage
Measure memory usage
Measure disk I/O
Measure network I/O
Measure firewalling and routing throughput
Map client bandwidth usage
Match / correlate system symptoms with likely problems
Estimate throughput and identify bottlenecks in a system including networking
The following is a partial list of the used files, terms and utilities:
iostat
netstat
w
top
sar
processes blocked on I/O
blocks out
vmstat
pstree, ps
Isof
uptime
swap
blocks in
Computers hang, People Nags and Applications run with lags. Handling all of this problems is on administrators' shoulders. It doesn't matter if we are talking about a single computer or we are some where in the cloud. We need some tools to explore the behavior of a system in order to find a solution. Performance problems have two main roots, Sometimes lack of system resources such as CPU, Memory, HDD or even Network cause them, Sometimes we as humankind make mistakes. When there is a paradox between what we expect, what is in need and how we configure computers hang.
Measure memory usage
RAM is like a gateway of a town, everything goes through RAM. If we have a process, it should be loaded in RAM first inorder to be served by CPU later. CPU feeds its Caches from RAM data. How about User activities? If user ask to read some data From HardDisk it is loaded in RAM and then user can work with that.
As it seeam RAM is pretty busy and always in need.Linux uses some Techniques to over come problem, to make it simple lets explain Memory Terminology
Item
Description
Page
The blocks that are used in memory, Ususally 4K size
Paging
Used to get memory from secondary storage to primary storage
Swap
emulated memory on HDD
Virtual Memory
Total allocatable memory, [known as process address space]Linux supports Tera Bytes of virtual memory and use TLB to allocate physial memory.
Translation Look aside Buffer(TLB)
Kind of cache which speedup translation between RAM and virtual memory, stored in RAM
Page cache
Recently used memory pages are stored here(not cpu cache ) Used for parked data
Dirty Cache
Data which are waiting to go to HDD from RAM(Works Like a buffer)
Buffers
Used for some Block devices and cache file system metadata.not really important
Lets explorer whats going inside with some tools and commands.
root@server1:~# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 98012 79256 12136 319060 4 17 170 93 85 174 4 1 95 0 0
the command vmstat show virtual memory status and some other information. For periodic check with interval you can use vmstat 2 5. Lets try free command:
root@server1:~# free -h
total used free shared buff/cache available
Mem: 971M 637M 93M 36M 241M 116M
Swap: 1.0G 403M 617M
if you use -h option you can see human readable values, you can use -m for megabyte and -g for gigabyte view.
as you can see 241M is allocated to buff/cache. to clear buff/cache:
echo 3 > /proc/sys/vm/drop_caches ; free -h
this system just has 1 GigaByte of Ram, so obviously it starts using swap, Whats swappiness?
Swappiness is the kernel parameter that defines how much (and how often) your Linux kernel will copy RAM contents to swap. This parameter's default value is “60” and it can take anything from “0” to “100”. The higher the value of the swappiness parameter, the more aggressively your kernel will swap.
uptime command shows uptime obviously :), number of current logged users, and Load average in last 1 min, Last 5 min and last 15 min.The way that uptime calculate system load average is base on CPU Load average and Disk I/O.But the way it shows Load Average is base on # of CPU cores. So to find out % of Real System Load we need some calculations:
(Load average / number of cpu cores) x 100 ===> %real load average in our system
so if you are going beyond your CPU capability.You need to investigate more. Sometimes CPU is the bottleneck and some times Disk is bottleneck. top command is here to help us :
top command gives us information about system uptime, load avarage, and detailed info about processes. top also has some tricks:
Key
Description
press "1"
shows all cpu cores load
"shift" + "<" or ">"
sort top based on different culoms
"shift" + "p"
sort based CPU usage
"shift" + "m"
sort based memory usage
press "c"
shows absolute patch of process
press "z"
will running process in color
press "d" and then delay number
by default top command runs every 3.0 second
press "k" and insetr PID of process
Kill the process by using PID
press "r" and PID of process
to renice a process
"shift" + "w"
ro write top command results
press "q"
to exit
So top command is usefull for monitoring both CPU and RAM and also show us I/O wait processes.For more detailed lets take a look at third line of top command result. Linux has two Spaces from OS point of view, User space and System space , top command classified process this way :
us:% CPU time spent in user space #####Linux has two Spaces from OS point of view, User space and Kernel space
sy:% CPU time spent in kernel space####
ni:% CPU time spent on low priority processes
id:% CPU time spent idle
wa:io wait cpu time (or) % CPU time spent in wait (on disk)####### IO WAIT ######
hi: % CPU time spent servicing/handling hardware interrupts
si:% CPU time spent servicing/handling software interrupts
st:% CPU time in involuntary wait by virtual cpu while hypervisor is servicing another processor
did you see wa? it shows IO wait processes and it happens when the bottleneck is Disk. In this condition a process needs some data to be read from the Hard Disk, but Hard Disk is pretty busy and can not read Data When is needed. So poor process must be waited till Disk dose its jobs. Process is I/O Blocked and sleep :) from CPU point of view process goes in an "Unintrruptable Sleep".This Condition is so bad because we can't even kill that process :( so monitor processes with top command and always make sure that % of I/O waited process is Zero.
Measure disk I/O
If finally we have detected that the problem is Disk speed, We should monitor Disk I/O, top command is amazing but we need to know what program has caused the problem. we use iotop [need to be installed based on your distro, kernel >= 2.6 ]:
There is another tool which gives us less info but its quick and handy, Like vmstat that we have talked about, we have iostat which gives us a snap shot of current situation of Disk I/O: [iostat is a part of sysstat package and need to be installed]:
There is another tool which is sar, yes that is funny because that is the name of a bird in Farsi :) but here it is acronym for "System Activity Report". For running sar command you should define delay and time for it:
root@server1:~# sar 1 7
Linux 4.10.0-28-generic (server1) 11/29/2017 _x86_64_ (4 CPU)
12:26:42 AM CPU %user %nice %system %iowait %steal %idle
12:26:43 AM all 0.25 0.00 0.00 0.00 0.00 99.75
12:26:44 AM all 0.76 0.00 0.25 0.00 0.00 98.99
12:26:45 AM all 0.50 0.00 0.50 0.00 0.00 99.00
12:26:46 AM all 0.50 0.00 0.75 0.00 0.00 98.74
12:26:47 AM all 0.51 0.00 0.51 0.00 0.00 98.99
12:26:48 AM all 0.50 0.00 0.25 0.00 0.00 99.25
12:26:49 AM all 0.75 0.00 0.75 0.00 0.00 98.49
Average: all 0.54 0.00 0.43 0.00 0.00 99.03
Some times we need to know what files are opened in out system. Weather we have performance issue or just for monitoring whats going on before rebooting the computer. lsof gives us " LiSt of Open Files" :
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root cwd DIR 8,1 4096 2 /
systemd 1 root rtd DIR 8,1 4096 2 /
systemd 1 root txt REG 8,1 1577232 135885 /lib/systemd/systemd
systemd 1 root mem REG 8,1 18976 136484 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
systemd 1 root mem REG 8,1 262408 136292 /lib/x86_64-linux-gnu/libblkid.so.1.1.0
systemd 1 root mem REG 8,1 14608 136324 /lib/x86_64-linux-gnu/libdl-2.23.so
systemd 1 root mem REG 8,1 456632 136429 /lib/x86_64-linux-gnu/libpcre.so.3.13.2
systemd 1 root mem REG 8,1 1868984 136300 /lib/x86_64-linux-gnu/libc-2.23.so
systemd 1 root mem REG 8,1 138696 136446 /lib/x86_64-linux-gnu/libpthread-2.23.so
systemd 1 root mem REG 8,1 286824 136376 /lib/x86_64-linux-gnu/libmount.so.1.1.0
systemd 1 root mem REG 8,1 64144 136282 /lib/x86_64-linux-gnu/libapparmor.so.1.4.0
systemd 1 root mem REG 8,1 92864 136363 /lib/x86_64-linux-gnu/libkmod.so.2.3.0
systemd 1 root mem REG 8,1 117288 136290 /lib/x86_64-linux-gnu/libaudit.so.1.0.0
systemd 1 root mem REG 8,1 55904 136416 /lib/x86_64-linux-gnu/libpam.so.0.83.1
systemd 1 root mem REG 8,1 252152 136457 /lib/x86_64-linux-gnu/libseccomp.so.2.2.3
systemd 1 root mem REG 8,1 31712 136454 /lib/x86_64-linux-gnu/librt-2.23.so
systemd 1 root mem REG 8,1 23128 136303 /lib/x86_64-linux-gnu/libcap.so.2.24
systemd 1 root mem REG 8,1 130224 136458 /lib/x86_64-linux-gnu/libselinux.so.1
systemd 1 root mem REG 8,1 162632 136272 /lib/x86_64-linux-gnu/ld-2.23.so
lsof gives us a very very long list of open files, you can see part of that above. But we usually use it with grep and also we can use it for a specific user lsof -u myuser.
Measure Network I/O
By spreading Internet and growth of personal Networks, the importance of Network grows. When we want explore what has caused problem in network, we should consider whole path, from beginning to the end and visa versa. That is Like talking about car traffic in big cities. Lets get familiar with some commands and tools.
iftop
Let add another member to top series commands, iftop. iftop gives you info about Sessions on your NIC and sort them based on transfer rate.
Like top, you can use < > keys to sort based on different columns, and you can prees 1 or 2 or 3 inorder to sort based on last 2secs, 10 secs or 40 secs network activities. also you can pick specific Interface by using iftop -i eth1
nload
Networl Load nload is another Handy tool, this time for monitoring current bandwidth.
For Testing Connection between two nodes of network we have nice iperf tool.
### MAchine A 192.168.10.131 ### as server###
root@server1:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.10.131 port 5001 connected with 192.168.10.130 port 41784
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 3.44 GBytes 2.95 Gbits/sec
###machine B 192.168.10.130 ### as client ###
iperf -c 192.168.10.131
------------------------------------------------------------
Client connecting to 192.168.10.131, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.10.130 port 41784 connected with 192.168.10.131 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 3.44 GBytes 2.96 Gbits/sec
as both of nodes in my example are virtual machines, you can see amazing 2.95 Gbits/sec bandwidth. Who knows some day in future this might happened in real world :)
And finally, lets introduce a tiny tool to check you internet speed from command line:
apt install speedtest-cli
root@server1:~# speedtest
Retrieving speedtest.net configuration...
Retrieving speedtest.net server list...
Testing from WestHost (107.182.226.165)...
Selecting best server based on latency...
Hosted by Sprint (New York City, NY) [2.13 km]: 581.08 ms
Testing download speed........................................
Download: 1.24 Mbit/s
Testing upload speed..................................................
Upload: 3.23 Mbit/s