This is going to be a long post. As planned in an earlier post, the Moodle upgrade from 4.1 to 4.5 was to be done along with a server upgrade from Ubuntu 20.04 to 24.04. I thought I would be done in a day or two, but tweaking and troubleshooting continued for a week. Then setting up backups, documentation and so on went on for another week.
Instead of a bitnami MySQL server, I opted for a minimal Ubuntu 24.04 VM image for both the database server and the php application server.
Server types
The existing servers:
Server-php = Standard B2s (2 vcpus, 4 GiB memory), Ubuntu 20.04
30 GB OS disk, 320 GB data disk.
Since B2s is capped at 23 Mbps, it doesn't make sense to use SSD.
LMS-Database-VM = Standard B2s (2 vcpus, 4 GiB memory), Debian 10.13 (bitnami mysql)
30 GB OS disk.
The new VMs were created with Ubuntu 24.04 LTS minimal.
Chose standard SSD for the OS disk. No Azure backup.
Azure prompts to save the private key in pem format, saved as ppk also using puttygen.
The db VM configuration was as follows:
sudo apt install mysql-server
sudo apt install mysqltuner
sudo apt install mysql-client
sudo nano /etc/ssh/sshd_config
to change the port from 22.
Then change the opened port in Azure portal also.
(Had to reboot for the change to take effect; just restarting the ssh daemon did not work.)
sudo ufw status
to verify ufw is not running, then rebooted.
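For reference, the change itself is one line in /etc/ssh/sshd_config (8800 here just matches the port used for the new servers later in this post; use whatever port you chose):
Port 8800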
Verified that updates are enabled -
sudo apt-get install unattended-upgrades
showed that it is already installed.
The php server configuration:
Changed port 22 as above
Then installed apache and php
sudo apt install apache2
sudo apt install php libapache2-mod-php php-cli php-mysql
and rebooted.
As mentioned in my previous post,
we have to attach the newly created data disk using the portal, and then set it up from inside the VM:
parted /dev/sda
mklabel gpt
(parted) mkpart
Partition name? []? lmsdata
File system type? [ext2]? ext4
Start? 0%
End? 100%
(parted) print
To update fstab with the UUID, used blkid to get the UUID.
blkid was showing only PARTUUID and not UUID - because the partition was not yet formatted!
But first, mount and test -
mount did not work either, because it was not formatted. So,
sudo mkfs.ext4 /dev/sda1
and then blkid to get the UUID.
nano /etc/fstab to add the line
UUID=333dd333-ca33-3333-33c7-abcdef /var/www/our_data_disk ext4 defaults,nofail 1 2
rebooted, OK.
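To recap, the working order of operations for a fresh data disk was roughly this (device and mount point as above; a sketch, not a copy-paste of the exact session):
sudo parted /dev/sda mklabel gpt
sudo parted /dev/sda mkpart lmsdata ext4 0% 100%
sudo mkfs.ext4 /dev/sda1
sudo blkid /dev/sda1
sudo mkdir -p /var/www/our_data_disk
# add the UUID line to /etc/fstab, then
sudo mount -a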
Apache configuration: added
AcceptPathInfo On
in /etc/apache2/apache2.conf (needed for Moodle's slash arguments).
Also, from previous server,
<DirectoryMatch "^/.*/\.git/">
Require all denied
</DirectoryMatch>
Also, if needed: handling 40x errors, and hiding internal paths.
PHP setup, and installing the required extensions:
1. All the settings were OK in /etc/php/8.3/apache2/php.ini
except post_max_size and upload_max_filesize, which were raised from 2M to 64M.
2. extensions - installed one by one with
sudo apt install php8.3-ctype
sudo apt install php8.3-curl
sudo apt install php8.3-dom
sudo apt install php8.3-gd
sudo apt install php8.3-iconv
sudo apt install php8.3-intl
sudo apt install php8.3-json
sudo apt install php8.3-mbstring
sudo apt install php8.3-pcre
sudo apt install php8.3-simplexml
sudo apt install php8.3-spl
sudo apt install php8.3-xml
sudo apt install php8.3-zip
sudo apt install php8.3-mysql
# for CMS, we need postgres also
sudo apt-get install php8.3-pgsql
#and restart apache with
sudo systemctl restart apache2
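For reference, the separately packaged extensions could also be installed in one go; the rest of the list (ctype, iconv, json, pcre, spl, and - via php8.3-xml - dom and simplexml) ships with the base PHP packages, as far as I can tell:
sudo apt install php8.3-curl php8.3-gd php8.3-intl php8.3-mbstring php8.3-xml php8.3-zip php8.3-mysql php8.3-pgsql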
A good guide for LAMP installation - https://www.digitalocean.com/community/tutorials/how-to-install-lamp-stack-on-ubuntu
We need postgres for CMS. So,
sudo apt install postgresql
edited the config file to make any changes needed -
nano /etc/postgresql/*/main/postgresql.conf
Checked the old server for the values; no changes to be made.
listen_addresses also did not need to be changed, since localhost is enough for us.
Initially I had thought of using rclone over ssh to copy the files. But without the pem file on the command line, we would need to create a config file;
so, just copied the script over with filezilla for now.
Next, rclone over ssh
Unfortunately,
rclone lsd php2024:
2024/10/08 04:50:01 Failed to create file system for "php2024:": failed to parse private key file: ssh: no key found
This points to the config file, whose location is shown by
rclone config file
The config file had both key_file and key_pem set. So, deleted key_file line.
But ... doesn't work.
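For reference, an sftp remote in rclone.conf would normally look something like this (host, user and key path here are placeholders); I did not debug it further:
[php2024]
type = sftp
host = remote.domain
user = remoteuser
port = 8800
key_file = /home/ouruser/our_key.pem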
So, trying with sftp.
sftp -P 8800 -i key.pem username@ipaddress
put -r directoryname
works, but all the timestamps are set to today's date.
scp can retain timestamps
scp -rpC sourceDirName username@server:destDirName
scp -P 8800 -i /home/ouruser/our_key.pem -rpC sourcedir remoteuser@remote.domain:/home/remoteuser/destdir
Need to change permissions and owners as per old server.
changed to
chmod 755 moodledata
chown -R www-data:www-data moodledata
etc.
Takes a few minutes for our biggest instance with >100GB of data.
For the moodle code directories, I just copied within the remote machine instead of copying the identical code from the old server.
to move hidden directories also,
mv -f /path/subfolder/{.,}* /path/
And on trying rsync, got
rsync: command not found
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.3]
Solution - install rsync on remote machine -
(from the sites-available directory) rsync -av -e 'ssh -p 8800 -i /home/ouruser/our_key.pem' . ouruser@52.172.103.81:/home/ouruser/varwww
But for certificates in /etc/letsencrypt, we can't use this, since it has to be done as root.
So,
zip -r /home/ouruser/le.zip letsencrypt
For other instances, just copying files between directories -
cp -rT sourcedir destdir
Added the autossh server user as per
this post, for database connections over ssh tunnel. Initial setup - on remote server,
sudo adduser theserveruser
and then, after copying the public key to authorized_keys, had to edit it as root with nano, removing carriage returns. Then it worked.
sudo systemctl enable our-tunnel-service-name
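For reference, the tunnel service is a small systemd unit roughly along these lines (a sketch; user, key path, host and ports are placeholders - the real unit is from the post linked above):
[Unit]
Description=Persistent SSH tunnel to the database VM
After=network-online.target

[Service]
User=ouruser
Environment=AUTOSSH_GATETIME=0
ExecStart=/usr/bin/autossh -M 0 -N -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -L 3306:127.0.0.1:3306 -p 8800 -i /home/ouruser/our_key.pem theserveruser@db-server-ip
Restart=always

[Install]
WantedBy=multi-user.target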
Copied root crontab.
sudo su -
apt install cron
crontab -e
for pg_dump
sudo su postgres
pg_dump ourdb > /tmp/ourdbdump.sql
Checked the usernames in pg_service.conf in root's directory on the server, which is readable only by root.
sudo su postgres
psql
create database dbname;
create user username;
(To change, alter user username with encrypted password 'MyPassWord';)
grant all privileges on database dbname to username;
But tried
psql -U ourdb_admin -d ourdb < /tmp/ourdbdump.sql -W
and got the error
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: Peer authentication failed for user "ourdb_admin"
tried adding a line to
sudo nano /etc/postgresql/16/main/pg_hba.conf
local all all md5
and then
sudo systemctl restart postgresql
psql -h 127.0.0.1 -U ourdb_admin -d ourdb < /tmp/ourdbdump.sql -W
But lots of errors.
Did again as postgres user. Still, errors like role read_only_user does not exist etc.
When trying to open with dbeaver using the admin user, got
permission denied for schema postgres
So, even after grant all privileges, we need to do:
Step 1
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA name_schema TO name_user;
Step 2
GRANT USAGE ON SCHEMA name_schema TO name_user;
But, when trying that,
ERROR: schema "postgres" does not exist
From "How to switch databases in psql?" on Stack Overflow -
we should use \connect ourdb
list users with \du
list schemas with
\dnS
Maybe I should drop the database and do it again with --no-owner.
drop database ourdb;
pg_dump --no-owner --dbname=postgresql://ourdb_admin:ourdbpw@localhost:5432/ourdb > ourdbdumpasadmin.sql
pg_dump: error: query failed: ERROR: permission denied for sequence batch_job_execution_seq
alter role ourdb_admin superuser;
That worked.
Then, trying that dump on the new server. First:
create database ourdb;
alter role ourdb_admin superuser;
create user admin_readonly;
GRANT CONNECT ON DATABASE mydb TO xxx;
-- This assumes you're actually connected to mydb..
GRANT USAGE ON SCHEMA public TO xxx;
GRANT SELECT ON mytable TO xxx; -- but we want all tables
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT ON TABLES TO xxx;
alter default privileges in schema postgres grant select on tables to admin_readonly;
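Putting that together with this site's names (a sketch, assuming the objects live in the public schema of ourdb and that we are connected to ourdb):
GRANT CONNECT ON DATABASE ourdb TO admin_readonly;
GRANT USAGE ON SCHEMA public TO admin_readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO admin_readonly;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO admin_readonly;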
Next, the mysql databases.
First, set up db users.
Used this to set up mysql and create the database.
At first, got
Can't connect to MySQL server on '127.0.0.1:3306'
Checking the tunnel,
sudo service tunnel status
× tunnel.service -
---snip---
autossh[14025]: Host key verification failed.
The solution was to ssh into the host once as root (sudo ssh), so that the host key got stored in /root/.ssh/known_hosts.
Then restarted autossh, and it worked.
CREATE DATABASE master_db;
CREATE USER 'admin'@'%' IDENTIFIED BY 'thepassword';
(did not do ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';)
GRANT ALL ON master_db.* TO 'admin'@'%';
Used the existing backup script on the old mysql server to take the backup,
$MYSQLDUMP --defaults-extra-file=/home/ouruser/.my.cnf -u $MyUSER -h $MyHOST -p$MyPASS $db | $GZIP -9 > $FILE
gzip -d thefile.gz
date;mysql -h 127.0.0.1 -u db_admin -p master_db < Database-VM.09-10-2024;date
showed that our biggest instance took around 2 hours 15 minutes to get restored.
After copying over the config files and creating the home directories of the various websites, tried enabling one of the sites and got errors.
systemctl reload apache2
Job for apache2.service failed.
See "systemctl status apache2.service" and "journalctl -xeu apache2.service" for details.
root# systemctl status apache2.service
● apache2.service - The Apache HTTP Server
---snip---
apachectl[8721]: Invalid command 'RewriteEngine', perhaps misspelled or defined by a module not included in the server configuration
Oct 08 15:22:40 SSSVV-php-2024 systemd[1]: apache2.service: Control process exited, code=exited, status=1/FAILURE
Oct 08 15:22:40 SSSVV-php-2024 systemd[1]: Reload failed for apache2.service - The Apache HTTP Server.
a2enmod rewrite
Solved. Then, checking with a browser, testing by editing
/etc/hosts on Linux and
C:\Windows\System32\drivers\etc\hosts
on Windows.
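For testing before switching DNS, an entry like this (IP address and hostname are placeholders) points the browser at the new server:
20.0.0.1    lms.ourdomain.org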
The browser was getting "connection refused".
Next, the Moodle code upgrade - installed git and checked out the 4.5 stable branch in the code directory:
sudo apt install git
git branch --track MOODLE_405_STABLE origin/MOODLE_405_STABLE
git checkout MOODLE_405_STABLE
sudo chown -R azureuser:www-data *
sudo -u www-data /usr/bin/php admin/cli/maintenance.php --enable
sudo chmod -R 775 .
sudo -u www-data /usr/bin/php admin/cli/upgrade.php
First error was:
== Environment ==
!! max_input_vars !!
[System] this test must pass - PHP setting max_input_vars must be at least 5000.
Corrected that by setting max_input_vars to 9000 in the ini files (the line needs to be uncommented too) -
sudo nano /etc/php/8.3/apache2/php.ini
and
sudo nano /etc/php/8.3/cli/php.ini
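i.e., the edited line in both files:
max_input_vars = 9000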
Then,
!! Asynchronous backups disabled !!
[System] if this test fails, it indicates a potential problem - Your site is currently configured to use synchronous backups. Asynchronous
backups provide a better user experience, allowing users to do other
operations while a backup or restore is in progress. Asynchronous backups
are enabled for all sites from Moodle 4.5 LTS. Synchronous backups will be
removed from Moodle LMS the version after 4.5 LTS.
OK - we can enable that after finishing the upgrade.
But then,
Database transaction aborted automatically in upgrade.php
Default exception handler: Exception - Undefined constant "MESSAGE_DEFAULT_LOGGEDIN" Debug:
Error code: generalexceptionmessage
* line 37 of /mod/hvp/db/messages.php: Error thrown
Tried deleting the mod_hvp plugin directory and then upgrading. It reports successfully upgraded from 4.5 to 4.5, but on disabling maintenance mode, the home page shows a 500 error. So, tried putting back version 4.1 - dropped the db and restored it again.
sudo mysql -p
Enter password:
Welcome to the MySQL monitor.
DROP DATABASE master_db;
Query OK, 563 rows affected (19.94 sec)
mysql> CREATE DATABASE master_db;
Query OK, 1 row affected (0.03 sec)
mysql> GRANT ALL ON master_db.* TO 'db_admin'@'%';
Query OK, 0 rows affected (0.02 sec)
10 minutes for importing this db, of our smallest instance.
mod_hvp is supported only up to 4.4, so upgrading only to 4.4. Again, error 500, this time in the moove theme, so trying the upgrade to 4.4 with the moove plugin directory deleted.
Upgrade to 4.4 successful.
Downloaded and installed the latest Moove theme. Under
Administration | Appearance | Moove | Custom theme settings
(server/admin/settings.php?section=themesettingmoove#theme_moove_advanced)
added the Raw SCSS
footer#page-footer .madeby {display:none;}
to remove the intrusive ad.
mod_hvp was upgraded to 4.4.
It gave a message about the php cache being 2M and needing to be increased to >5M.
Checked php.ini for a 2M setting, but did not find any.
Then, copied this directory to the other moodle install directories - along with the hidden files like .git directory etc, instead of doing git clone.
cp -r ../firstInstance/* .
cp -r ../firstInstance/.* .
Then edited config.php to point to the appropriate db for each instance, copied over missing files like html files, customsearch, etc - verified by doing an
ls > textfile.txt
on old server as well as new server, and doing a diff of the two textfiles to check for missing files.
Verified that the mail setup on moodle was working, by sending a test email. It went to spam for both the old server and the new server, but XOAuth-based mail sending is working.
Enabled Async backups - According to
https://tracker.moodle.org/browse/MDL-81172
Site administration > Courses > Asynchronous backup/restore
Confirm Asynchronous backups are still disabled
Enable Asynchronous backups
Enabled all the tickboxes in
admin/settings.php?section=asyncgeneralsettings
The very next day, users were unable to log in. Found that the database server had run out of disk space; binlog files were the culprit.
But the bin log is already commented out.
/etc/mysql/my.cnf includes
mysql.conf.d/mysql.cnf where it is commented out.
With this version of MySQL, binary logging is enabled by default,
and needs a startup option to not do binlogs.
For purging the existing binlogs (I had just deleted the files directly),
the proper statement to use is PURGE BINARY LOGS - see the sketch below.
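A minimal example of purging from the mysql client (the cutoff date is just an illustration):
sudo mysql -e "PURGE BINARY LOGS BEFORE '2024-10-10 00:00:00';"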
We need to specify skip-log-bin to prevent binary logs -
# nano /etc/mysql/mysql.conf.d/mysqld.cnf
and added skip-log-bin near the end,
# disable binary log
skip-log-bin
# server-id = 1
# log_bin = /var/log/mysql/mysql-bin.log
# binlog_expire_logs_seconds = 2592000
and then restarted with sudo systemctl restart mysql.service
The newly created instance, which I had cloned from the smallest one of our existing instances, had around 2000 users which needed to be deleted. Doing bulk delete from the web gui was timing out. Doing one by one, saw that each delete would take around 10 seconds from the web ui. Bulk delete of 10 at a time would be possible, but doing that 200 times would be a pain. So, decided to do user delete using
moosh, writing a script to wait for a second after each user was deleted to avoid overloading the database server.
A script created by chatgpt for deleting, as well as this one-liner, did not work:
while read in; do echo moosh -n user-delete "$in";sleep 2 | bash; done < todelete.txt
What did work was a flat file of explicit lines,
moosh -n user-delete theusername@wewantto.del ; sleep 2;
moosh -n user-delete theusername2@wewantto.del ; sleep 2;
etc. Most probably the issues were due to some sort of bash variable quoting problem, I guess. Ran it using
screen for some 15 hours. The progress was interrupted twice by two users who could not be deleted without timeouts. Just skipped those users and deleted the rest.
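In hindsight, a corrected one-liner (quoting fixed, with todelete.txt holding one username per line) would probably be:
while read -r u; do moosh -n user-delete "$u"; sleep 2; done < todelete.txt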
While registering the cloned Moodle instance (just to remove the annoying 'Your site is not registered' message), it was showing a 'modify registration' screen instead of the 'new registration' screen.
Even after doing this, perhaps due to cloudflare caching, it kept going back to the 'modify registration' page. Finally, tried copy-pasting the new token into the submit url; it said invalid token. But when I registered again, it showed both the old site and the new site as our cloned site, so I clicked on 'move registration to new site', and registration worked. So maybe the token replacement in the url did work.
The web server was allowing directory listings by default. To disable this, first, tried editing apache2.conf
<Directory /var/www/>
Options -Indexes FollowSymLinks
instead of
<Directory /var/www/>
Options Indexes FollowSymLinks
Caused apache2 restart to fail, so put it back. It turns out it needs to be
Options -Indexes +FollowSymLinks
Then it works.
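So the working stanza ends up as (a sketch, assuming the stock Ubuntu apache2.conf block otherwise):
<Directory /var/www/>
        Options -Indexes +FollowSymLinks
        AllowOverride None
        Require all granted
</Directory>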
The database server was showing a load average of 8 or 9, when a safe value for a 2-CPU machine would be around 1.5. This was leading to slow performance. First, checked for slow queries, as in
https://hnsws.blogspot.com/2024/07/moodle-mysql-database-cpu-usage-at-200.html
set global slow_query_log_file ='/var/log/mysql/slow-query.log';
set global slow_query_log = 'ON';
show variables like '%slow%';
tail -f /var/log/mysql/slow-query.log
did not show any. Disabled the binary logs as
noted above. The problem still persists with 150 to 180% cpu usage for mysql. Took several instances out of proxying via cloudflare - that made the sites load a bit snappier. But the database is still overloaded.
Checking Tasks in the biggest Moodle instance, I see that there is a scheduled task, scheduled as "ASAP" so that it starts running as soon as it finishes,
Issue certificates task
\mod_customcert\task\issue_certificates_task
This is doing heavy database reads - > 27K records, and taking nearly 5 minutes to complete. And once it is done, it starts again.
We can either
(a) clean up the database so that only a few certificates remain - probably not desirable, since we would need the teachers / students to retain their certificates in their accounts
(b) change the scheduling to only once a day, at midnight
According to the docs, the scheduled task emails the certificate to the users / teachers if they have not yet received it.
https://docs.moodle.org/403/en/Custom_certificate_module
So hopefully it should be fine if I change the scheduling to only once a day in the middle of the night - so I did that.
(Site Administration > General > Messaging > Notification settings )
That is the page where we can disable all email / mobile / web notifications and also disable the new-login notification. So, disabled new login notifications for every instance. I asked if I could turn off all notifications except email notifications for password reset etc. for the five instances, but apparently they need the notifications to increase user engagement.
Then, as I did previously, I cleared the task_adhoc table of old entries (sketch below). The load average dropped from 8.5 to 2.5, but I need it to go below 1.5.
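The kind of cleanup used, as a sketch (the mdl_ table prefix and the week-old-failed-entries cutoff are assumptions - check against the actual schema and data before running anything like this):
sudo mysql master_db -e "DELETE FROM mdl_task_adhoc WHERE faildelay > 0 AND nextruntime < UNIX_TIMESTAMP() - 7*86400;"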
So, went through the task log on each moodle instance to see which tasks take 15 seconds or more - Site Administration > Server > Tasks > Task logs - and later, any task over 1 second.
Our biggest instance -
Category enrolment sync task - 3 sec
From
https://docs.moodle.org/401/en/Category_enrolments
Warning: The use of the category enrolments plugin may cause performance problems. If so, it is recommended that you use Cohort sync instead, though this will require a cohort sync enrolment method to be added to each course in the category.
Changed this from * to */5 (every 5 minutes) in Site Administration > Server > Tasks > Scheduled tasks.
Award badge - currently set to */5, changed to */7
sending badges - */5 changed to */9
Then, an instance with lots of users -
Update Course Progress Data 18 sec
*/5 -> */11
Global search indexing 10 sec
*/30 -> */43
Analytics cleanup 10 sec
Once an hour, left it as is.
Then the newly cloned instance -
Global search indexing 2 sec
*/30 -> */47
Analytics cleanup 7 sec
Once an hour, left it as is.
Then, the development instance -
Cleanup old sessions 2sec
Changed from * to */5
Then the instance we're running for a sister org -
All tasks report failed, with the reason OAuth token refresh failed.
This is probably due to a password change - would need to put in the correct password and reconnect to that system account, in Site Administration > Server > OAuth 2 services.
The largest of our Moodle instances was complaining of failed tasks - send user notifications - every day, once the Tasks max fail delay time was exceeded. And there were no send user notifications tasks showing success at all in the task logs. Checking email sending with a test email - works fine. So, it was probably the mobile messaging notifications which were the culprit. To start up our customized airnotifier instance:
sudo apt install nodejs
sudo apt install npm
sudo npm install forever -g
then tried startscript.sh - seemed to work.
In top, node briefly showed up.
sudo apt install net-tools
sudo netstat -tunlp
Node seems to be listening on 8080.
This seems to indicate that if email is working, and sending notifications to mobile is working, this error should go away.
Just now enabled mobile notifications from
Site admin -> General -> Notification settings
Tested with mobile app - working.
But when I tried killing node and restarting the process using a cron job, after a few hours node showed high CPU usage. That was probably the wrong way to do it - should maybe have done a server reboot, since the cron restart was probably the reason for the node CPU usage. Did not have to go through all the possibilities as in this post -
https://stackoverflow.com/questions/13375735/node-js-cpu-100 - because the log file itself had the reason.
In the log file, there were lots of entries like
errorInfo: {
code: 'messaging/registration-token-not-registered',
message: 'Requested entity was not found.'
},
"So apparently, this error happens when the FCM token I'm trying to send to is not registered anymore, as evidenced by the "messaging/registration-token-not-registered" error code. In that case I just need to remove this token from the user's token and be done with it."
Then, the err.log showed "Error: listen EADDRINUSE: address already in use :::8080" - maybe that was the culprit. Rebooting the server solved the issue.
I wanted to use cloudflare proxying and cloudflare origin certificates for all the instances as in
https://hnsws.blogspot.com/2022/11/cloudflare-origin-server-certificate.html . But unfortunately, with cloudflare proxying, the server was timing out within two minutes for certain admin tasks like bulk-adding users, uploading, etc. Disabled Cloudflare proxying, put back the LetsEncrypt certificates, and then, in php.ini, set max_execution_time to 600 (10 minutes) instead of 30 (the default), as suggested by Moodle, and also the Timeout value in /etc/apache2/apache2.conf - it was already set to 300, bumped it up to 600. When proxied via cloudflare, it still times out in two minutes, though the official limit is 400 seconds - Cloudflare connection limits.
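For reference, the two edited settings (values as above):
in /etc/php/8.3/apache2/php.ini:
max_execution_time = 600
and in /etc/apache2/apache2.conf:
Timeout 600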
sudo apt install certbot python3-certbot-apache
https://www.digitalocean.com/community/tutorials/how-to-secure-apache-with-let-s-encrypt-on-ubuntu-20-04
But errors like
Renewal configuration file /etc/letsencrypt/renewal/name.org.conf is broken.
The error was: expected /etc/letsencrypt/live/name.org/cert.pem to be a symlink
Skipping.
Maybe the way to fix is to delete the letsencrypt directory and try again.
So, tried renaming the /etc/letsencrypt directory and re-running
sudo certbot --apache
After apache crashed initially with errors that the file /etc/letsencrypt/options-ssl-apache.conf was not found, just disabled all the ssl sites with
sudo a2dissite *ssl*
sudo service apache2 restart
sudo certbot --apache
All OK.
The moodledata backups did not go so smoothly. An initial test with a smaller instance - using rclone sync to copy and sync the moodledata to a shared google drive - was taking more than an hour to do the sync even after copying all the files, so this would not be practical for our large instance. Then, internet searches led to the thought of trying restic.
Restic with rclone to Google Shared Drive worked for the smaller instance, but failed for the larger instance, probably due to too many retries with google rate limiting. So, the options were to use restic with ssh to a local machine, or restic and Azure Blob Storage. The latter worked, with the caveat that I had to use the account-level key and not a SAS token - with a SAS token, the repo creation would fail.
Instead of trying ssh and local server, first tried using azure blob storage, since that is easy to set up.
Set up a new blob container with private access only, inside our existing storage account.
We get the storage access key from
Azure portal -> Storage Account -> Security + networking -> Access keys
key1 & key2 - can show key and connection string
Regenerated key1 and key2 (last used 1200+ days ago.)
But we need SAS - searching for SAS in the account, we find
Shared access signature under Security + networking.
There, changed the expiry date, enabled Container access also, and clicked generate SAS and connection string.
An initial trial with restic 0.16.4, installed by apt on Ubuntu 24 -
restic -r azure:our-container:/our-repo init
- gave an invalid backend error;
needed to use the official binary instead.
bzip2 -d restic_0.17.1_linux_amd64.bz2
chmod +x restic_0.17*
which restic
sudo mv restic_0.17* /usr/bin/restic
restic version
A possible workaround - use the account key just to do the init, and the SAS token thereafter?
So: the SAS token doesn't work, but the account key works. And our largest instance's initial backup, 100+ GB, completed in just over an hour. So, this would be the way to go. Will add more details soon.
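A minimal sketch of the working setup with the account key (container, repo and paths here are placeholders):
export AZURE_ACCOUNT_NAME=ourstorageaccount
export AZURE_ACCOUNT_KEY='the-account-key'
restic -r azure:our-container:/our-repo init
restic -r azure:our-container:/our-repo backup /var/www/our_data_disk/moodledata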
The email-sending credentials with a php script were not working initially, so
set up ssmtp on both the php server and the db server, with email notifications for disk space usage done via bash scripts.
sudo apt install ssmtp
sudo apt install mailutils
While testing the mail command on the command line, if Ctrl-D doesn't work (like on Mac systems?), we can use the syntax
echo "[email body]" | mail -s "[subject]" [recipient]
Changed from the php disk alert to the bash disk alert.
After a day, our largest instance started showing 'Failed task - convert submission' in admin alerts. This seems to be assignfeedback_editpdf\task\convert_submission -
Simply
sudo apt install ghostscript
and a restart of apache (or a reboot, just to make sure), and the problem went away.