Initial commit of CONFIGURE.md and INSTALL.md, updated README.md

2023-08-12 11:27:29 -04:00
parent 354088b245
commit 42f94bbf77
3 changed files with 140 additions and 11 deletions
@@ -0,0 +1,51 @@
+# CONFIGURE
+Included in this project are a number of systemd units:
+
+* psi-monitor.service
+    * uses the psi-monitor executable (in /usr/bin/)
+* psi-alerts@.service (system template service)
+    * uses the psi-alerts.sh script
+
+The `psi-alerts.sh` script is essentially a daemon (a systemd simple service).
+For now, the systemd template needs to be instantiated with the name of the user
+that will execute `psi-alerts.sh`.  A systemd unit override (drop-in) should
+also be created, like so:
+
+```
+sudo systemctl edit psi-alerts@<user>.service
+```
+
+This opens an editor.  In newer versions of systemd, the file also contains comment lines that clearly show where the override should be entered:
+
+```
+### Editing /etc/systemd/system/psi-alerts@trey.service.d/override.conf
+### Anything between here and the comment below will become the contents of the drop-in file
+
+[Service]
+Environment=EMAIL_TO="email@domain.tld"
+Environment=SMS_DST="phone_number@sms.domain.tld"
+Environment=NOTIFICATION_CMD="dunstify"
+Environment=NOTIFICATION_OPTS="--timeout=0 --printid --urgency=critical --icon=/usr/share/icons/breeze-dark/emblems/16/emblem-warning.svg"
+Environment=NOTIFICATION_IDX=15
+Environment=SSH_USER="username"
+Environment=SSH_HOST="localhost"
+Environment=SSH_PORT=5999
+Environment=SSH_ID_PATH="~trey/.ssh/psi-alerts"
+Environment=CLEAR_THRESHOLD="5.0"
+
+### Edits below this comment will be discarded
+
+### /etc/systemd/system/psi-alerts@.service
+# [Unit]
+# Description=Pressure Stall Information (PSI) alerts
+# PartOf=multi-user.target
+# After=psi-monitor.service
+#
+# [Service]
+# User=%i
+# Type=simple
+# ExecStart=psi-alerts.sh
+#
+# [Install]
+# WantedBy=multi-user.target
+```
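+
+Each `Environment=` line in the drop-in becomes an ordinary environment variable
+of the service process, which is how `psi-alerts.sh` picks up these settings.
+As a hedged illustration only (for example, toward a compiled rewrite), the C
+sketch below reads `CLEAR_THRESHOLD` with a fallback; the `env_double` helper
+and the fallback value are hypothetical, not taken from `psi-alerts.sh`.
+
+```c
+#include <errno.h>
+#include <stdlib.h>
+
+/* Hypothetical helper: read a floating-point setting such as CLEAR_THRESHOLD
+ * from the environment (where systemd's Environment= lines end up), falling
+ * back to `fallback` when the variable is unset, empty, or not a number. */
+double env_double(const char *name, double fallback)
+{
+    const char *raw = getenv(name);
+    if (raw == NULL || *raw == '\0')
+        return fallback;
+
+    char *end = NULL;
+    errno = 0;
+    double value = strtod(raw, &end);
+    /* Reject partial parses (e.g. "5.0x") and out-of-range values. */
+    if (errno != 0 || end == raw || *end != '\0')
+        return fallback;
+    return value;
+}
+```
+
+With the drop-in above, `env_double("CLEAR_THRESHOLD", 10.0)` returns 5.0; if
+the variable is unset or malformed it returns the hypothetical fallback, 10.0.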
+
@@ -0,0 +1,7 @@
+# INSTALL
+First, clone this repository with the `--recurse-submodules` flag:
+```
+$ git clone --recurse-submodules https://git.eldon.me/trey/psi-alerts.git
+```
+
+
@@ -2,28 +2,90 @@
 ## PURPOSE

 This project aims to deliver Pressure Stall Information (PSI) alerts via
-standard Linux graphical desktop (through `libnotify` compatible daemons and
-CLI programs), and email (email-to-SMS is also supported).  This can alert the
-system administrator of CPU, I/O, or Memory (RAM) pressure in near real time.
+standard Linux graphical desktop notifications (through `libnotify` compatible
+daemons and CLI programs), and email (email-to-SMS is also supported).  This
+can alert the system administrator of CPU, I/O, or Memory (RAM) pressure in
+near real time.

 ## PREREQUISITES
-* A Linux system with kernel 5.2.0 or greater
+* A Linux system with kernel 5.2.0 or greater, with the /proc filesystem
+  enabled and PSI support built in (`CONFIG_PSI=y`; some distributions also
+  require booting with `psi=1`)
 * systemd
 * zsh
+* sysstat (for pidstat)
 * ssh (OpenSSH, for desktop notifications)
 * psi-by-example (a modified version of this is included in this project as a
  submodule)
+* a libnotify-compatible desktop notification system
+
+## HISTORY
+
+When I first learned about [Pressure Stall
+Information](https://docs.kernel.org/accounting/psi.html) (PSI), I was
+intrigued.  It provides a real-time view into system performance and the
+resource contention Linux system administrators typically need to worry about:
+CPU, I/O, and Memory (RAM).  During my research, I found [this
+post](https://unixism.net/2019/08/linux-pressure-stall-information-psi-by-example/),
+complete with a C code example.  However, it was light on I/O details, and the
+author's example C code didn't include Memory pressure at all (so I modified it
+to include Memory pressure).
+
+A quick and dirty description of PSI:  whenever one or more tasks (processes
+or threads) are stalled waiting for a resource (CPU, I/O, or RAM), the kernel
+records that stalled time.  The reported percentages are the share of
+wall-clock time during which tasks were stalled:  they start out low, and climb
+as contention increases and more and more work sits waiting on that resource.
+Time during which at least one task is stalled on the resource counts as
+"some" contention.  Time during which all non-idle tasks are stalled on the
+resource at the same time counts as "full" contention (so "full" is always a
+subset of "some").
+
+The pressure information is exposed in the _/proc_ filesystem in three virtual
+files:  _/proc/pressure/cpu_, _/proc/pressure/io_, and _/proc/pressure/memory_.
+Each file reports both a "some" and a "full" line (on kernels before 5.13,
+_/proc/pressure/cpu_ has only the "some" line), in the following format:

-## INSTALL
-First, clone this repository with the `--recurse-submodules` flag:
 ```
-$ git clone --recurse-submodules https://git.eldon.me/trey/psi-alerts.git
+some avg10=0.02 avg60=0.43 avg300=0.55 total=711489361
+full avg10=0.02 avg60=0.43 avg300=0.54 total=681874430
 ```

-## CONFIGURE
-Included in this project are a number of systemd units:
-    * psi-monitor.service
-    * psi-alerts@.service (template service)
+This example is taken from _/proc/pressure/io_, for I/O pressure.  Full CPU
+pressure is only really meaningful per cgroup (at the system level the kernel
+reports it as zero), and this project doesn't pay close attention to cgroups at
+this time.  The avg10, avg60, and avg300 percentages are the average resource
+pressure over the last 10 seconds, 60 seconds, and 300 seconds (5 minutes).
+The total is the cumulative number of microseconds that tasks have spent
+stalled on the resource; this counter starts at zero on boot and continuously
+increases as processes wait for the resource.  Processes always have to wait
+for a resource to some degree, even if only for hundreds of microseconds or
+less, so even when all the percentages are zero, the "some" total will be
+nonzero.  The "full" totals will be nonzero too, except for CPU, because full
+CPU pressure only really applies to cgroups (which are out of scope for this
+project at present).
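+
+Each line is a keyword followed by `key=value` fields, so it can be parsed
+with a single `sscanf` call.  And because `total` only ever grows, the stall
+percentage over an interval of your choosing can be derived from two samples:
+the increase in `total` divided by the elapsed time, both in microseconds.  The
+C sketch below illustrates both ideas; `struct psi_line`, `parse_psi_line`, and
+`psi_pct_between` are hypothetical names, not part of psi-by-example.  The
+plain ratio will not exactly match avg10/avg60/avg300, which the kernel
+computes as exponentially weighted moving averages.
+
+```c
+#include <stdio.h>
+#include <string.h>
+
+/* Hypothetical container for one line of /proc/pressure/{cpu,io,memory}. */
+struct psi_line {
+    char kind[5];                /* "some" or "full" */
+    double avg10, avg60, avg300; /* percent of wall-clock time stalled */
+    unsigned long long total;    /* cumulative stall time, microseconds */
+};
+
+/* Parse e.g. "some avg10=0.02 avg60=0.43 avg300=0.55 total=711489361".
+ * Returns 0 on success, -1 if the line does not match the format. */
+int parse_psi_line(const char *line, struct psi_line *out)
+{
+    int n = sscanf(line, "%4s avg10=%lf avg60=%lf avg300=%lf total=%llu",
+                   out->kind, &out->avg10, &out->avg60, &out->avg300,
+                   &out->total);
+    if (n != 5)
+        return -1;
+    if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
+        return -1;
+    return 0;
+}
+
+/* Percentage of wall-clock time spent stalled between two samples of the
+ * "total" counter.  Returns 0.0 for an empty interval, or if the counter went
+ * backwards (e.g. the samples straddle a reboot). */
+double psi_pct_between(unsigned long long total_before,
+                       unsigned long long total_after,
+                       unsigned long long elapsed_us)
+{
+    if (elapsed_us == 0 || total_after < total_before)
+        return 0.0;
+    return 100.0 * (double)(total_after - total_before) / (double)elapsed_us;
+}
+```
+
+For example, if two readings taken 1 s apart (timed with, say,
+`clock_gettime(CLOCK_MONOTONIC, ...)`) show `total` growing by 250000, then
+`psi_pct_between()` reports 25.0, i.e. tasks were stalled for a quarter of that
+second.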
+
+The monitor code (from psi-by-example, listed above) only considers the "some"
+pressure for all three resources, so it will usually alert before the system
+becomes critical (and, in the case of memory thrashing, completely unusable for
+any workload).  Thus the alerts should arrive well before the full resource
+pressure gets maxed out.
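+
+The kernel's PSI trigger interface (available since 5.2) is the usual way to
+watch "some" pressure without busy polling:  writing
+`"some <stall us> <window us>"` to a pressure file registers a trigger, and
+`poll()` reports `POLLPRI` whenever the stall time within a window exceeds the
+threshold.  The C sketch below is based on the example in the kernel's PSI
+documentation, not on _monitor.c_ itself; the function names and the
+150 ms per 1 s values are illustrative assumptions.  Depending on the kernel
+version, registering a trigger may require elevated privileges.
+
+```c
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+/* Build a trigger string such as "some 150000 1000000": fire when tasks are
+ * stalled for 150 ms in total within a 1 s window (the kernel expects the
+ * window to be between 500 ms and 10 s).  Returns the string length, or -1
+ * if it did not fit in buf. */
+int format_psi_trigger(char *buf, size_t size, const char *kind,
+                       unsigned stall_us, unsigned window_us)
+{
+    int n = snprintf(buf, size, "%s %u %u", kind, stall_us, window_us);
+    return (n < 0 || (size_t)n >= size) ? -1 : n;
+}
+
+/* Register a trigger on e.g. "/proc/pressure/memory" and block until it
+ * fires.  Returns 1 on a pressure event, 0 if the event source went away,
+ * and -1 on error. */
+int wait_for_pressure(const char *path, const char *trigger)
+{
+    int fd = open(path, O_RDWR | O_NONBLOCK);
+    if (fd < 0)
+        return -1;
+    /* The kernel documentation's example writes the terminating NUL too. */
+    if (write(fd, trigger, strlen(trigger) + 1) < 0) {
+        close(fd);
+        return -1;
+    }
+    struct pollfd pfd = { .fd = fd, .events = POLLPRI };
+    int rc = -1;
+    if (poll(&pfd, 1, -1) > 0) {
+        if (pfd.revents & POLLERR)
+            rc = 0;   /* event source is gone */
+        else if (pfd.revents & POLLPRI)
+            rc = 1;   /* stall threshold exceeded within the window */
+    }
+    close(fd);
+    return rc;
+}
+```
+
+A monitor would typically register one such trigger per resource file and
+`poll()` all of the descriptors together, rather than blocking on each one in
+turn as this single-file sketch does.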
+
+Now, I don't know C very well, but this _monitor.c_ code was easy enough to
+extend to include memory pressure.  However, _create_load.c_ only creates CPU
+and I/O load (generating memory load is too detrimental to system performance).
+
+This was developed on an [SSDNodes VPS](https://ssdnodes.com) (Virtual Private
+Server), which is a KVM virtual machine, backed by SSD hardware.  It is very
+well provisioned with virtual hardware:  8 vCPUs, 32GiB RAM, and 640GiB SSD
+disk space.  Currently, there is very little load on this system, even with
+four different websites on it, with corresponding database engines, and an
+nginx reverse proxy.  I plan on putting
+[mailcow-dockerized](https://docs.mailcow.email/) on this VPS soon, which has
+the potential to increase the load significantly.
+
+Once the regular workload of this VPS increases, my current configuration may
+become too noisy.  However, I've tried to write `psi-alerts.sh` in such a way
+that it only alerts once when the pressure on a resource increases, and won't
+alert again until that pressure subsides (that is, until the "some" percentages
+drop below the configurable threshold for at least five minutes).
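+
+The "alert once, then stay quiet until the pressure subsides" behavior is a
+small hysteresis state machine.  The C sketch below illustrates the idea only;
+it is not a translation of `psi-alerts.sh`.  The struct and function names, the
+caller-supplied monotonic clock in seconds, and treating `clear_threshold` as
+the configurable threshold (presumably `CLEAR_THRESHOLD` from the drop-in) are
+all assumptions; 300 s mirrors the five minutes described above.
+
+```c
+#include <stdbool.h>
+
+#define QUIET_PERIOD_S 300.0  /* pressure must stay low this long to re-arm */
+
+/* Hypothetical per-resource alert state; zero-initialize before use. */
+struct alert_state {
+    bool alerted;        /* an alert was sent and has not cleared yet */
+    bool below;          /* the "some" percentage is below the threshold */
+    double below_since;  /* time (seconds) it first dropped below */
+};
+
+/* Feed one sample: `event` is true when the monitor reports pressure,
+ * `some_pct` is the current "some" percentage, and `now` is a monotonic time
+ * in seconds.  Returns true only when a new alert should be sent. */
+bool alert_update(struct alert_state *st, bool event, double some_pct,
+                  double clear_threshold, double now)
+{
+    if (some_pct < clear_threshold) {
+        if (!st->below) {
+            st->below = true;
+            st->below_since = now;
+        } else if (st->alerted && now - st->below_since >= QUIET_PERIOD_S) {
+            st->alerted = false;   /* pressure has subsided: re-arm */
+        }
+    } else {
+        st->below = false;         /* still under pressure: reset the timer */
+    }
+
+    if (event && !st->alerted) {
+        st->alerted = true;
+        return true;
+    }
+    return false;
+}
+```
+
+Calling `alert_update()` once per sample, and notifying only when it returns
+true, yields a single alert per pressure episode; any climb back above the
+threshold restarts the five-minute quiet period.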

 ## TODO
 * finish INSTALL section
@@ -33,3 +95,12 @@ Included in this project are a number of systemd units:
    * about defining an instance and editing it
        * `sudo systemctl edit psi-alerts@<user>.service`
        * mainly for `Environment=` variables
+    * consider reworking this as a user service, not a system service
+        * this could make desktop notifications simpler, and avoid having to
+          use SSH keys without passphrases
+        * need to become much more familiar with user services
+* consider reworking all code in a compiled language (other than C)
+    * time to learn Go
+    * or continue learning Rust
+    * need to know how to use kernel syscalls in these languages (if possible)
+    * also, convert psi-alerts.sh script to either of these languages