Compare commits


10 Commits

8 changed files with 235 additions and 62 deletions

11
.gitignore vendored

@@ -1,9 +1,12 @@
 *
-!psi-alerts.sh
-!psi-alerts@.service
-!psi-monitor.service
-!psi-by-example
 !.gitignore
 !CONFIGURE.md
 !INSTALL.md
 !README.md
+!psi-alerts-user.service
+!psi-alerts.sh
+!psi-alerts@.service
+!psi-by-example
+!psi-monitor-user.service
+!psi-monitor.service
+!psi-monitor.sh

View File

@@ -1,19 +1,26 @@
 # CONFIGURE
 Included in this project are a number of systemd units:
 * psi-monitor.service
   * uses psi-monitor executable (in /usr/bin/)
-* psi-alerts@.service (system template service)
-  * uses psi-alerts.sh script
+* psi-alerts@.service (systemd template service)
+  * uses psi-alerts.sh script in */usr/local/bin/*
+* psi-alerts-user.service (systemd user service)
+  * also uses psi-alerts.sh script in *~/bin/* (or wherever you want to
+    put it)
 The `psi-alerts.sh` is essentially a daemon (a systemd simple service), and for
 now the systemd template needs to be instantiated with the username that will
-execute `psi-alerts.sh`. Also, a systemd unit override should be created, like
-so:
+execute `psi-alerts.sh` (if using the systemd template). Also, a systemd unit
+override should be created, like so:
 ```
+sudo cp psi-alerts@.service /etc/systemd/system/
 sudo systemctl edit psi-alerts@<user>.service
 ```
+--OR--
+```
+cp psi-alerts-user.service ~/.config/systemd/user/psi-alerts.service
+systemctl --user edit psi-alerts.service
+```
 This will open an editor, and in later versions of systemd the comment code will be included, clearly showing where the override should be entered:
 ```
@@ -32,17 +39,21 @@ Environment=SSH_HOST="localhost"
 Environment=SSH_PORT=5999
 Environment=SSH_ID_PATH="~user/.ssh/psi-alerts"
 Environment=CLEAR_THRESHOLD="5.0"
+ExecStart= # Clear ExecStart for user unit
+ExecStart=/path/to/psi-alerts.sh --user # User unit
 ### Edits below this comment will be discarded
 ### /etc/systemd/system/psi-alerts@.service
 # [Unit]
 # Description=Pressure Stall Information (PSI) alerts
-# PartOf=multi-user.target
+# PartOf=multi-user.target # system template
+# PartOf=default.target # user service
 # After=psi-monitor.service
 #
 # [Service]
-# User=%i
+#
+# User=%i # User unit will not have User=%i
 # Type=simple
 # ExecStart=psi-alerts.sh
 #
@@ -85,5 +96,5 @@ All of these are required except where noted, there are no default options
 (SMS and email will still work, as they don't use SSH)
 * **CLEAR_THRESHOLD**: The percentage threshold the some avg300 threshold
 should be below before considering the alert cleared. This will depend
-highly on the workload running on
+highly on the workload running on the system.

View File

@@ -4,4 +4,60 @@ First, clone this repository with the `--recurse-submodules` flag:
 $ git clone --recurse-submodules https://git.eldon.me/trey/psi-alerts.git
 ```
+`--recurse-submodules` is only necessary if you wish to use the modified
+psi-by-example program for `psi-monitor`. I found this too noisy to be of use,
+it alerts too quickly so I wrote my own with relaxed timing.
+If you want to use the psi-by-example/psi-monitor code, you'll need to compile
+it:
+```
+gcc -o psi-monitor psi-monitor.c
+```
+## Using the systemd template unit
+1. Copy the `psi-alerts.sh` and `psi-monitor.sh` scripts to */usr/local/bin*:
+```
+sudo cp psi-alerts.sh /usr/local/bin
+sudo cp psi-monitor.sh /usr/local/bin/psi-monitor
+### OR ###
+sudo cp psi-by-example/psi-monitor /usr/local/bin
+```
+2. Copy the systemd units to */etc/systemd/system*:
+```
+sudo cp psi-alerts@.service psi-monitor.service /etc/systemd/system/
+```
+## Using the systemd user units
+1. Copy the `psi-alerts.sh` and `psi-monitor.sh` scripts to *~/bin* (or
+wherever you want them):
+```
+cp -a psi-alerts.sh psi-monitor.sh ~/bin/
+```
+2. Copy the systemd user units to *~/.config/systemd/user/*
+```
+cp psi-alerts-user.service ~/.config/systemd/user/psi-alerts.service
+cp psi-monitor-user.service ~/.config/systemd/user/psi-monitor.service
+```
+# CONFIGURE
+See *CONFIGURE.md* in this repository
+# ENABLE and START
+## system template instance:
+```
+sudo systemctl enable --now psi-monitor.service psi-alerts@<user>.service
+```
+## User instance
+```
+systemctl --user enable --now psi-monitor.service psi-alerts.service
+```

12
psi-alerts-user.service Normal file

@@ -0,0 +1,12 @@
[Unit]
Description=Pressure Stall Information (PSI) alerts
PartOf=default.target
After=psi-monitor.service
[Service]
Type=simple
ExecStart=psi-alerts.sh
[Install]
WantedBy=default.target

View File

@@ -42,6 +42,7 @@ svc="psi-monitor.service"
 cpu="/proc/pressure/cpu"
 mem="/proc/pressure/memory"
 io="/proc/pressure/io"
+user="$(whoami)"
 host="$(hostname)"
 email_to="${EMAIL_TO}"
 sms_dst="${SMS_DST}"
@@ -55,11 +56,21 @@ notification_cmd="${NOTIFICATION_CMD}"
 notification_hist_cmd="${NOTIFICATION_HIST_CMD}"
 notification_opts="${NOTIFICATION_OPTS}"
 id_idx="${NOTIFICATION_IDX}"
+user=false
+if [[ -n "${1}" ]]; then
+    if [[ "${1}" == "-u" ]] || \
+       [[ "${1}" == "--user" ]]; then
+        user=true
+    fi
+fi
 get_ssh_agent () {
     for dir in /tmp/ssh-*; do
         if [[ -O ${dir} ]]; then
             # only choose the last agent
+            export SSH_AGENT_PID=$(ps -eaf | grep '[s]sh-agent' | \
+                grep ${user} | awk '{print $2}')
             export SSH_AUTH_SOCK=$(ls ${dir}/agent.* | tail -1)
         fi
     done
@@ -76,17 +87,23 @@ print_psi () {
     cat "${(P)$(tr '[[:upper:]]' '[[:lower:]]' <<< "${psi_file}")}"
 }
-print_pidstat () {
+print_stats () {
     local psi_type="${1}"
     case "${psi_type}" in
         CPU)
+            top -bcn1 -o %CPU -w 512 | head -n 30
+            printf "\n\n"
             pidstat -ul --human
             ;;
         IO)
+            sudo iotop --batch --only --iter=10
+            printf "\n\n"
             pidstat -dl --human
             ;;
         MEM)
+            top -bcn1 -o %MEM -w 512 | head -n 30
+            printf "\n\n"
             pidstat -rl --human
             ;;
         *)
@@ -123,7 +140,7 @@ send_notice () {
             print "Connection to notification daemon failed!" >&2
             false
         else
-            echo ${notification_id}
+            print ${notification_id}
             true
         fi
     elif [[ -n "${ssh_id_path}" ]]; then
@@ -132,11 +149,11 @@ send_notice () {
             print "Connection to notification daemon failed!" >&2
             false
         else
-            echo ${notification_id}
+            print ${notification_id}
             true
         fi
     else
-        echo "No SSH notifications configured. Returning." >&2
+        print "No SSH notifications configured. Returning." >&2
         false
     fi
     #set +x
@@ -145,7 +162,7 @@ send_notice () {
 send () {
     #set -x
     if [[ "${#@}" -lt 2 ]] && [[ "${#@}" -gt 3 ]]; then
-        echo "Wrong number of arguments to send()!" >&2
+        print "Wrong number of arguments to send()!" >&2
         return false
     fi
@@ -177,12 +194,13 @@ send () {
         subj="PSI on ${host} ${psi_type} triggered!"
         current_alarms="${psi_type}"
     fi
+    print_psi "${psi_type}" >> ${email}
+    printf "\n\n" >> ${email}
     # is this an email or SMS?
     if [[ ! "${dst}" =~ "@${sms_domain}" ]]; then
         for p in $(tr '|' ' ' <<< "${current_alarms}"); do
-            printf "\npidstat info for ${p}\n\n" >> ${email}
-            print_pidstat "${p}" >> ${email}
+            printf "\n\nStatistics info for ${p}\n\n" >> ${email}
+            print_stats "${p}" >> ${email}
             printf "\n\n" >> ${email}
         done
     fi
@@ -226,7 +244,7 @@ exec_notices () {
             send "${psi_type}" "${current_alarms}" "${email_to}"
             ;;
         *)
-            echo "Something went wrong!" >&2
+            print "Something went wrong!" >&2
             false
             ;;
     esac
@@ -247,7 +265,7 @@ check_dunst_id_is_visible () {
         "${notification_hist_cmd} | jq '.data[0][].id.data'"); then
         if ! ids=$(ssh -qi "${ssh_id_path}" -p ${ssh_port} -l "${ssh_user}" \
             "${ssh_host}" "${notification_hist_cmd} | jq '.data[0][].id.data'"); then
-            echo "Connection to dunst failed!" >&2
+            print "Connection to dunst failed!" >&2
             return 2
         fi
     fi
@@ -261,64 +279,65 @@ check_dunst_id_is_visible () {
 }
 local current_alarm=""
-local last_alarm=""
 typeset -A notice_sent
 typeset -A secs
 integer last_dunst_id=-1
 local last_line=""
-set -x
+#set -x
 while true; do
-    local line=$(journalctl -u ${svc} -n1)
+    if ${user}; then
+        line=$(journalctl --user -u ${svc} -n1)
+    else
+        line=$(journalctl -u ${svc} -n1)
+    fi
+    now=$(date +%s)
+    last_timestamp=$(date -d "$(awk '{print $1" "$2" "$3}' <<< "${line}")" +%s)
+    time_diff=$(( now - last_timestamp ))
     if [[ "${last_line}" == "${line}" ]]; then
-        # line hasn't changed since last run, do nothing
+        # last line hasn't changed, check to see if we can clear alarms
+        if (( time_diff >= 3 )); then
+            # haven't seen a monitor alert for 3 seconds, see if we can clear them
+            if [[ -n "${current_alarms}" ]]; then
+                typeset -a alarms=( $(tr '|' ' ' <<< "$current_alarms") )
+                for alarm in ${alarms}; do
+                    integer elapsed=$(( now - ${secs[${alarm}]} ))
+                    if is_clear "${alarm}" && (( elapsed >= 300 )); then
+                        current_alarms=$(sed -E "s/${alarm}\|?//; s/|$//" <<< "${current_alarms}")
+                        unset "notice_sent[${alarm}]"
+                        unset "secs[${alarm}]"
+                    fi
+                done
+            fi
+            sleep 1
+            continue
+        fi
         sleep 1
         continue
     fi
     last_line="${line}"
-    local now=$(date +%s)
-    local last_timestamp=$(date -d $(awk '{print $1" "$2" "$3}' <<< "${line}") +%s)
-    local time_diff=$(( now - last_timestamp ))
-    if (( time_diff >= 3 )); then
-        # haven't seen a monitor alert for 3 seconds, see if we can clear them
-        if [[ -n "${current_alarms}" ]]; then
-            typeset -a alarms=( $(tr '|' ' ' <<< "$current_alarms") )
-            for alarm in ${alarms}; do
-                integer elapsed=$(( now - ${secs[${alarm}]} ))
-                if is_clear "${alarm}" && (( elapsed >= 300 )); then
-                    current_alarms=$(sed -E "s/${alarm}\|?//" <<< "${current_alarms}")
-                    last_alarm=$(awk -F'|' '{print $NF}' <<< "${current_alarms}")
-                    unset "notice_sent[${alarm}]"
-                    unset "secs[${alarm}]"
-                fi
-            done
-        fi
-        sleep 1
-        continue
-    fi
-    local psi_type="$(grep -Eo "(CPU|MEM|IO) PSI event" <<< "${line}" | grep -Eo "CPU|MEM|IO")"
-    if [[ -n "${psi_type}" ]]; then
-        secs+=(${psi_type} ${now})
-        if [[ "${psi_type}" != "${last_alarm}" ]]; then
+    if (( time_diff < 3 )); then
+        local psi_type="$(grep -Eo "(CPU|MEM|IO) PSI event" <<< "${line}" | grep -Eo "CPU|MEM|IO")"
+        if [[ -n "${psi_type}" ]]; then
+            secs+=(${psi_type} ${now})
             if [[ ! ${notice_sent[${psi_type}]} ]]; then
                 last_dunst_id=$(exec_notices "${psi_type}" "${current_alarms}")
                 notice_sent+=(${psi_type} true)
             elif (( last_dunst_id >= 0 )) && check_dunst_id_is_visible "${last_dunst_id}"; then
-                last_alarm="${psi_type}"
                 sleep 1
                 continue
             fi
-        fi
-        last_alarm="${psi_type}"
-        if [[ -z "${current_alarms}" ]]; then
-            current_alarms="${psi_type}"
-        else
-            if ! grep -q "${psi_type}" <<< "${current_alarms}"; then
-                current_alarms="${current_alarms}|${psi_type}"
+            if [[ -z "${current_alarms}" ]]; then
+                current_alarms="${psi_type}"
+            else
+                if ! grep -q "${psi_type}" <<< "${current_alarms}"; then
+                    current_alarms="${current_alarms}|${psi_type}"
+                fi
             fi
         fi
     fi
     sleep 1
 done
-set +x
+#set +x
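
Reviewer sketch: the loop in this hunk keeps `current_alarms` as a `|`-separated string and edits it with `grep` and `sed`. The add and remove steps can be illustrated in isolation as below. This is a minimal sketch in portable shell rather than zsh, and the helper names `alarm_add` and `alarm_remove` are hypothetical (neither exists in the script being diffed).

```shell
# Hypothetical helpers, not part of psi-alerts.sh.

# Print list $1 with type $2 appended, unless it is already present.
alarm_add () {
    case "|$1|" in
        *"|$2|"*) printf '%s\n' "$1" ;;      # already in the list
        "||")     printf '%s\n' "$2" ;;      # list was empty
        *)        printf '%s\n' "$1|$2" ;;   # append with separator
    esac
}

# Print list $1 with every entry equal to $2 dropped, leaving no
# leading, trailing, or doubled separators behind.
alarm_remove () {
    printf '%s\n' "$1" | awk -v drop="$2" -F'|' '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i != drop && $i != "")
                out = (out == "" ? $i : out "|" $i)
        print out
    }'
}

alarms=""
alarms=$(alarm_add "$alarms" CPU)
alarms=$(alarm_add "$alarms" IO)
alarms=$(alarm_remove "$alarms" CPU)
printf '%s\n' "$alarms"   # prints: IO
```

Matching whole entries between separators avoids partial matches and leaves the list well formed after the last entry is removed.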

10
psi-monitor-user.service Normal file

@@ -0,0 +1,10 @@
[Unit]
Description=Pressure Stall Information (PSI) Monitor
PartOf=default.target
[Service]
Type=simple
ExecStart=/home/trey/bin/psi-monitor.sh 80
[Install]
WantedBy=default.target

62
psi-monitor.sh Executable file

@@ -0,0 +1,62 @@
#!/usr/bin/env zsh
#
# Pressure Stall Information monitor
#
# Copyright © 2023 Trey Blancher $(base64 -d <<< dHJleUBibGFuY2hlci5uZXQK)
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation, either version 3 of the License, or (at your option)
# any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
#
# Submodules may be distributed under a separate software license; see the
# LICENSE file within each submodule.
#
# This script monitors the three pressure stall information files
# /proc/pressure{cpu,io,memory} and reports if any resource is above threshold
# for the "some" values. It takes an optional single argument, the threshold at
# which to alert. If this is not supplied, it defaults to a threshold of 30.0
# percent.
#
local cpu="/proc/pressure/cpu"
local cpu_ctr=0
local io="/proc/pressure/io"
local io_ctr=0
local mem="/proc/pressure/memory"
local mem_ctr=0
local threshold=30.0
if [[ -n "${1}" ]]; then
    threshold=${1}
fi
# main loop
while true; do
    local cpu_pct=$(grep 'some' ${cpu} | awk '{print $2}' | awk -F'=' '{print $2}')
    local io_pct=$(grep 'some' ${io} | awk '{print $2}' | awk -F'=' '{print $2}')
    local mem_pct=$(grep 'some' ${mem} | awk '{print $2}' | awk -F'=' '{print $2}')
    if (( cpu_pct > threshold )); then
        cpu_ctr=$(( ${cpu_ctr} + 1 ))
        printf "CPU PSI event %d triggered.\n" ${cpu_ctr}
    fi
    if (( io_pct > threshold )); then
        io_ctr=$(( ${io_ctr} + 1 ))
        printf "IO PSI event %d triggered.\n" ${io_ctr}
    fi
    if (( mem_pct > threshold )); then
        mem_ctr=$(( ${mem_ctr} + 1 ))
        printf "MEM PSI event %d triggered.\n" ${mem_ctr}
    fi
    sleep 10
done
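
Reviewer sketch: the script above takes the second field of the `some` line in each PSI file, which is the `avg10` value, and compares it with the threshold. The same step can be tried against a captured sample line instead of the live `/proc/pressure` files, as below. This is a minimal sketch in portable shell, not the script's own code. The helper names `psi_some_avg10` and `psi_above` are hypothetical, and the sample values are made up for illustration.

```shell
# Hypothetical helpers, not part of psi-monitor.sh.

# Read PSI file contents on stdin and print the avg10 value of the "some" line.
psi_some_avg10 () {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }'
}

# Exit 0 when value $1 is strictly greater than threshold $2.
# awk does the comparison because POSIX shell arithmetic is integer-only.
psi_above () {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Made-up sample in the layout the kernel uses for /proc/pressure/cpu.
sample='some avg10=42.50 avg60=12.00 avg300=3.10 total=123456'
pct=$(printf '%s\n' "$sample" | psi_some_avg10)

if psi_above "$pct" 30.0; then
    printf 'CPU PSI event: some avg10=%s\n' "$pct"
fi
```

On a live system the input would come from a redirect such as `psi_some_avg10 < /proc/pressure/cpu`, assuming the kernel was built with PSI enabled.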