# Pressure Stall Information (PSI) - Alerts

## PURPOSE

This project aims to deliver Pressure Stall Information (PSI) alerts via
standard Linux graphical desktop notifications (through `libnotify`-compatible
daemons and CLI programs) and email (email-to-SMS is also supported). This can
alert the system administrator of CPU, I/O, or Memory (RAM) pressure in near
real time.

## PREREQUISITES

* A Linux system with kernel 5.2.0 or greater, with the /proc filesystem
  enabled
* systemd
* zsh
* sysstat (for pidstat)
* ssh (OpenSSH, for desktop notifications)
* psi-by-example (a modified version is included in this project as a
  submodule)
* a libnotify-compatible desktop notification system
  * any notification program should use the `--print-id` parameter if possible
    * both `notify-send` and `dunstify` (part of
      [dunst](https://dunst-project.org/)) support this
  * note: this has only been tested with `dunst`, since it has the capability
    of showing notification history
    * `notify-send` specifically does not appear to retain a history, so the
      `check_dunst_id_is_visible` function won't work with it (and the logic
      that skips sending a new notification when one is already visible will
      be broken)
      * since I don't use `notify-send`, I'm not sure how to solve this
        * patches welcome!
* jq (for the aforementioned `dunst` integration)

## History

When I first learned about [Pressure Stall
Information](https://docs.kernel.org/accounting/psi.html) (PSI), I was
intrigued. It provides a real-time view into the resource contention that
Linux system administrators typically need to worry about: CPU, I/O, and
Memory (RAM). During this research, I found [this
post](https://unixism.net/2019/08/linux-pressure-stall-information-psi-by-example/),
complete with a C code example; however, it was light on I/O details, and the
example C code the author provided didn't include Memory pressure at all (so I
modified it to include Memory pressure).

A quick and dirty description of PSI: whenever one or more processes are
stalled waiting on a measurable resource (CPU, I/O, or RAM), the kernel tracks
the percentage of time spent stalled. Initially, the percentage will be low,
but as resource contention increases, more and more time is spent with
processes stalled on that resource. When some processes (but not all of them)
are stalled on the resource, PSI reports this as "some" pressure. When all
non-idle processes are stalled on the resource at the same time, PSI reports
this as "full" pressure.

The pressure information is exposed in the _/proc_ filesystem in three virtual
files: _/proc/pressure/cpu_, _/proc/pressure/io_, and _/proc/pressure/memory_.
Each file reports both "some" and "full" pressure, with output like the
following:

```
some avg10=0.02 avg60=0.43 avg300=0.55 total=711489361
full avg10=0.02 avg60=0.43 avg300=0.54 total=681874430
```

This example is taken from _/proc/pressure/io_, for I/O pressure. (The "full"
CPU pressure information really only applies to cgroups, which this project
doesn't pay close attention to at this time.) The percentages measure the
average resource pressure over the last 10s, 60s, and 300s (5 minutes). The
total is the number of microseconds that processes spent waiting for the
resource; this counter is reset on boot and updates continuously as processes
wait for the resource. Processes almost always wait for a resource at least
briefly, even if only for hundreds of microseconds or less, so even when the
percentages are all zeroes the total counter will be nonzero (at least for the
"some" metrics). The "full" metrics will also have a nonzero total, except for
CPU, because the full CPU total only really applies to cgroups (and cgroups
are out of scope for this project at present).
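
To get a concrete feel for the format, here is a minimal C sketch (not part of
this project) that reads one of these files and prints the "some" averages;
the default path and the field handling are simply based on the example output
above.

```c
/* psi_read.c - minimal sketch: print the "some" averages from a PSI file.
 * Build: cc -o psi_read psi_read.c
 * Usage: ./psi_read /proc/pressure/io
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "/proc/pressure/io";
    FILE *fp = fopen(path, "r");
    if (!fp) {
        perror(path);
        return EXIT_FAILURE;
    }

    char line[256];
    while (fgets(line, sizeof line, fp)) {
        double avg10, avg60, avg300;
        unsigned long long total;

        /* Only the "some" line is parsed here, matching this project's focus
         * on "some" pressure. */
        if (sscanf(line, "some avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   &avg10, &avg60, &avg300, &total) == 4) {
            printf("%s: some avg10=%.2f%% avg60=%.2f%% avg300=%.2f%% "
                   "(total=%llu us stalled)\n",
                   path, avg10, avg60, avg300, total);
        }
    }

    fclose(fp);
    return EXIT_SUCCESS;
}
```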

The monitor code (from psi-by-example, listed above) only considers the "some"
pressure for all three resources, which will usually alert before the system
becomes critical (and, in the case of full Memory usage/thrashing, completely
unusable for any workload). Thus the alerts should come in well before the
full resource pressure gets maxed out.
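
For context on how such a monitor works: the kernel's PSI interface lets a
program register a trigger by writing a threshold to one of the pressure files
and then waiting for `POLLPRI` events. The sketch below loosely follows the
example in the [kernel documentation](https://docs.kernel.org/accounting/psi.html)
(it is not this project's modified _monitor.c_), and the 150 ms of stall within
a 1 s window is an arbitrary threshold chosen for illustration.

```c
/* psi_trigger.c - sketch of a PSI trigger on "some" I/O pressure.
 * Build: cc -o psi_trigger psi_trigger.c
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Notify when "some" I/O stall time exceeds 150 ms within any 1 s window. */
    const char trig[] = "some 150000 1000000";

    int fd = open("/proc/pressure/io", O_RDWR | O_NONBLOCK);
    if (fd < 0) {
        perror("/proc/pressure/io");
        return EXIT_FAILURE;
    }
    if (write(fd, trig, strlen(trig) + 1) < 0) {
        perror("write trigger");
        return EXIT_FAILURE;
    }

    struct pollfd fds = { .fd = fd, .events = POLLPRI };
    for (;;) {
        if (poll(&fds, 1, -1) < 0) {
            perror("poll");
            return EXIT_FAILURE;
        }
        if (fds.revents & POLLERR) {
            fprintf(stderr, "event source went away\n");
            return EXIT_FAILURE;
        }
        if (fds.revents & POLLPRI)
            printf("I/O pressure threshold crossed\n");
    }
}
```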

Now, I don't know C very well, but the _monitor.c_ code was easy enough to
extend to include Memory pressure. However, _create_load.c_ only creates CPU
and I/O load (generating real memory load is too detrimental to system
performance).

This was developed on an [SSDNodes VPS](https://ssdnodes.com) (Virtual Private
Server), which is a KVM virtual machine backed by SSD hardware. It is very
well provisioned with virtual hardware: 8 vCPUs, 32 GiB RAM, and 640 GiB of
SSD disk space. Currently, there is very little load on this system, even
with four different websites on it, their corresponding database engines, and
an nginx reverse proxy. I plan on putting
[mailcow-dockerized](https://docs.mailcow.email/) on this VPS soon, which has
the potential to increase the load significantly.

Now, once the regular workload of this VPS increases, my current configuration
may become too noisy. However, I've tried to configure `psi-alerts.sh` in such
a way that it only alerts once when the pressure on a resource increases, and
won't alert again until that pressure subsides (and the "some" percentages
drop below the configurable threshold for at least five minutes).
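
As an illustration of that debounce idea only (the real logic lives in
`psi-alerts.sh`, and the threshold, cool-down, and poll interval below are
made-up values, not the script's defaults), the state machine looks roughly
like this:

```c
/* debounce.c - illustrative sketch of "alert once, then stay quiet until the
 * pressure has been below the threshold for five minutes".
 * Build: cc -o debounce debounce.c
 */
#include <stdbool.h>
#include <stdio.h>

#define THRESHOLD_PCT 10.0      /* hypothetical "some" avg10 threshold (%) */
#define COOLDOWN_SECS (5 * 60)  /* pressure must stay below it this long   */
#define POLL_INTERVAL 10        /* hypothetical seconds between samples    */

/* Decide whether a new alert should fire for the latest avg10 sample. */
static bool should_alert(double some_avg10)
{
    static bool alerted = false;  /* already alerted for this episode? */
    static int calm_secs = 0;     /* how long pressure has stayed low  */

    if (some_avg10 >= THRESHOLD_PCT) {
        calm_secs = 0;            /* still under pressure: reset the calm timer */
        if (!alerted) {
            alerted = true;       /* first crossing of this episode */
            return true;
        }
        return false;
    }

    if (alerted) {
        calm_secs += POLL_INTERVAL;
        if (calm_secs >= COOLDOWN_SECS)
            alerted = false;      /* five calm minutes: re-arm the alert */
    }
    return false;
}

int main(void)
{
    /* Toy run: pressure rises, stays high, then subsides. Only the first
     * sample above the threshold should produce an alert. */
    const double samples[] = { 2.0, 12.5, 30.0, 15.0, 4.0, 1.0, 0.5 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("avg10=%5.1f%% -> %s\n", samples[i],
               should_alert(samples[i]) ? "ALERT" : "quiet");
    return 0;
}
```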

## TODO

* finish INSTALL section
  * figure out how we're actually going to build all submodules
    * and install them
* finish CONFIGURE section
  * about defining an instance and editing it
    * `sudo systemctl edit psi-alerts@<user>.service`
      * mainly for `Environment=` variables
* consider reworking this for a user service, not a system service
  * this could make desktop notifications simpler, and avoid having to use
    SSH keys without passphrases
    * possibly learn how to connect to an existing ssh-agent
  * need to become much more familiar with user services
* consider reworking all code in a compiled language (other than C)
  * time to learn Go
  * or continue learning Rust
    * need to know how to use kernel syscalls in these languages (if possible)
  * also, convert the psi-alerts.sh script to either of these languages