# Pressure Stall Information (PSI) - Alerts

## PURPOSE

This project aims to deliver Pressure Stall Information (PSI) alerts via
standard Linux graphical desktop notifications (through `libnotify`-compatible
daemons and CLI programs) and email (email-to-SMS is also supported). This can
alert the system administrator of CPU, I/O, or Memory (RAM) pressure in near
real time.

## PREREQUISITES

* A Linux system with kernel 5.2.0 or greater, with the /proc filesystem
  enabled
* systemd
* zsh
* sysstat (for pidstat)
* ssh (OpenSSH, for desktop notifications)
* psi-by-example (a modified version is included in this project as a
  submodule)
* a libnotify-compatible desktop notification system
  * any notification program should use the `--print-id` parameter if possible
    * both `notify-send` and `dunstify` (part of
      [dunst](https://dunst-project.org/)) support this
  * note: this has only been tested with `dunst`, since it has the capability
    of showing notification history
    * `notify-send` specifically does not appear to retain a history, so the
      `check_dunst_id_is_visible` function won't work with it (and the logic
      that skips sending a new notification when one is already visible will
      be broken)
      * since I don't use `notify-send`, I'm not sure how to solve this
        * patches welcome!
* jq (for the aforementioned `dunst` integration)

## History

When I first learned about [Pressure Stall
Information](https://docs.kernel.org/accounting/psi.html) (PSI), I was
intrigued. It provides a real-time view into the resource contention that
Linux system administrators typically need to worry about: CPU, I/O, and
Memory (RAM). During this research, I found [this
post](https://unixism.net/2019/08/linux-pressure-stall-information-psi-by-example/),
complete with a C code example; however, it was light on I/O details, and the
example C code the author provided didn't include Memory pressure at all (so I
modified it to include Memory pressure).

A quick and dirty description of PSI: whenever one or more processes are
stalled waiting on a measurable resource (CPU, I/O, or RAM), the kernel tracks
the percentage of time spent stalled. Initially, the percentage will be low,
but as resource contention increases, more and more time is spent with
processes stalled on that resource. When some processes (but not all of them)
are stalled on the resource, PSI reports this as "some" pressure. When all
non-idle processes are stalled on the resource at the same time, PSI reports
this as "full" pressure.

The pressure information is exposed in the _/proc_ filesystem in three virtual
files: _/proc/pressure/cpu_, _/proc/pressure/io_, and _/proc/pressure/memory_.
Each file reports both "some" and "full" pressure, with output like the
following:

```
some avg10=0.02 avg60=0.43 avg300=0.55 total=711489361
full avg10=0.02 avg60=0.43 avg300=0.54 total=681874430
```

This example is taken from _/proc/pressure/io_, for I/O pressure. (The "full"
CPU pressure information really only applies to cgroups, which this project
doesn't pay close attention to at this time.) The percentages measure the
average resource pressure over the last 10s, 60s, and 300s (5 minutes). The
total is the number of microseconds that processes spent waiting for the
resource; this counter is reset on boot and updates continuously as processes
wait for the resource. Processes almost always wait for a resource at least
briefly, even if only for hundreds of microseconds or less, so even when the
percentages are all zeroes the total counter will be nonzero (at least for the
"some" metrics). The "full" metrics will also have a nonzero total, except for
CPU, because the full CPU total only really applies to cgroups (and cgroups
are out of scope for this project at present).
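
To get a concrete feel for the format, here is a minimal C sketch (not part of
this project) that reads one of these files and prints the "some" averages;
the default path and the field handling are simply based on the example output
above.

```c
/* psi_read.c - minimal sketch: print the "some" averages from a PSI file.
 * Build: cc -o psi_read psi_read.c
 * Usage: ./psi_read /proc/pressure/io
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "/proc/pressure/io";
    FILE *fp = fopen(path, "r");
    if (!fp) {
        perror(path);
        return EXIT_FAILURE;
    }

    char line[256];
    while (fgets(line, sizeof line, fp)) {
        double avg10, avg60, avg300;
        unsigned long long total;

        /* Only the "some" line is parsed here, matching this project's focus
         * on "some" pressure. */
        if (sscanf(line, "some avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   &avg10, &avg60, &avg300, &total) == 4) {
            printf("%s: some avg10=%.2f%% avg60=%.2f%% avg300=%.2f%% "
                   "(total=%llu us stalled)\n",
                   path, avg10, avg60, avg300, total);
        }
    }

    fclose(fp);
    return EXIT_SUCCESS;
}
```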

The monitor code (from psi-by-example, listed above) only considers the "some"
pressure for all three resources, which will usually alert before the system
becomes critical (and, in the case of full Memory usage/thrashing, completely
unusable for any workload). Thus the alerts should come in well before the
full resource pressure gets maxed out.
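
For context on how such a monitor works: the kernel's PSI interface lets a
program register a trigger by writing a threshold to one of the pressure files
and then waiting for `POLLPRI` events. The sketch below loosely follows the
example in the [kernel documentation](https://docs.kernel.org/accounting/psi.html)
(it is not this project's modified _monitor.c_), and the 150 ms of stall within
a 1 s window is an arbitrary threshold chosen for illustration.

```c
/* psi_trigger.c - sketch of a PSI trigger on "some" I/O pressure.
 * Build: cc -o psi_trigger psi_trigger.c
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Notify when "some" I/O stall time exceeds 150 ms within any 1 s window. */
    const char trig[] = "some 150000 1000000";

    int fd = open("/proc/pressure/io", O_RDWR | O_NONBLOCK);
    if (fd < 0) {
        perror("/proc/pressure/io");
        return EXIT_FAILURE;
    }
    if (write(fd, trig, strlen(trig) + 1) < 0) {
        perror("write trigger");
        return EXIT_FAILURE;
    }

    struct pollfd fds = { .fd = fd, .events = POLLPRI };
    for (;;) {
        if (poll(&fds, 1, -1) < 0) {
            perror("poll");
            return EXIT_FAILURE;
        }
        if (fds.revents & POLLERR) {
            fprintf(stderr, "event source went away\n");
            return EXIT_FAILURE;
        }
        if (fds.revents & POLLPRI)
            printf("I/O pressure threshold crossed\n");
    }
}
```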

Now, I don't know C very well, but the _monitor.c_ code was easy enough to
extend to include Memory pressure. However, _create_load.c_ only creates CPU
and I/O load (generating real memory load is too detrimental to system
performance).

This was developed on an [SSDNodes VPS](https://ssdnodes.com) (Virtual Private
Server), which is a KVM virtual machine backed by SSD hardware. It is very
well provisioned with virtual hardware: 8 vCPUs, 32 GiB RAM, and 640 GiB of
SSD disk space. Currently, there is very little load on this system, even
with four different websites on it, their corresponding database engines, and
an nginx reverse proxy. I plan on putting
[mailcow-dockerized](https://docs.mailcow.email/) on this VPS soon, which has
the potential to increase the load significantly.

Now, once the regular workload of this VPS increases, my current configuration
may become too noisy. However, I've tried to configure `psi-alerts.sh` in such
a way that it only alerts once when the pressure on a resource increases, and
won't alert again until that pressure subsides (and the "some" percentages
drop below the configurable threshold for at least five minutes).
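
As an illustration of that debounce idea only (the real logic lives in
`psi-alerts.sh`, and the threshold, cool-down, and poll interval below are
made-up values, not the script's defaults), the state machine looks roughly
like this:

```c
/* debounce.c - illustrative sketch of "alert once, then stay quiet until the
 * pressure has been below the threshold for five minutes".
 * Build: cc -o debounce debounce.c
 */
#include <stdbool.h>
#include <stdio.h>

#define THRESHOLD_PCT 10.0      /* hypothetical "some" avg10 threshold (%) */
#define COOLDOWN_SECS (5 * 60)  /* pressure must stay below it this long   */
#define POLL_INTERVAL 10        /* hypothetical seconds between samples    */

/* Decide whether a new alert should fire for the latest avg10 sample. */
static bool should_alert(double some_avg10)
{
    static bool alerted = false;  /* already alerted for this episode? */
    static int calm_secs = 0;     /* how long pressure has stayed low  */

    if (some_avg10 >= THRESHOLD_PCT) {
        calm_secs = 0;            /* still under pressure: reset the calm timer */
        if (!alerted) {
            alerted = true;       /* first crossing of this episode */
            return true;
        }
        return false;
    }

    if (alerted) {
        calm_secs += POLL_INTERVAL;
        if (calm_secs >= COOLDOWN_SECS)
            alerted = false;      /* five calm minutes: re-arm the alert */
    }
    return false;
}

int main(void)
{
    /* Toy run: pressure rises, stays high, then subsides. Only the first
     * sample above the threshold should produce an alert. */
    const double samples[] = { 2.0, 12.5, 30.0, 15.0, 4.0, 1.0, 0.5 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("avg10=%5.1f%% -> %s\n", samples[i],
               should_alert(samples[i]) ? "ALERT" : "quiet");
    return 0;
}
```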

## TODO

* finish INSTALL section
  * figure out how we're actually going to build all submodules
    * and install them
* finish CONFIGURE section
  * about defining an instance and editing it
    * `sudo systemctl edit psi-alerts@<user>.service`
      * mainly for `Environment=` variables
* consider reworking this for a user service, not a system service
  * this could make desktop notifications simpler, and avoid having to use
    SSH keys without passphrases
    * possibly learn how to connect to an existing ssh-agent
  * need to become much more familiar with user services
* consider reworking all code in a compiled language (other than C)
  * time to learn Go
  * or continue learning Rust
    * need to know how to use kernel syscalls in these languages (if possible)
  * also, convert the psi-alerts.sh script to either of these languages