Red Hat high-availability failover cluster
See also:
- OpenSVC
- resource-agents package
Resources:
- myvip
- fence_node-1
- fence_node-2
- ping
- srvweb
- ClusterMon-External
Installation
Prerequisites
- Time synchronization
- SELinux disabled
- NetworkManager service stopped
- Firewall rules
- /etc/hosts configuration
Time synchronization (NTP)
The nodes must have their date and time synchronized (see NTP).
Check
date
Example with clush (cluster_shell_parallele)
echo date |clush -B -w node-[1-2]
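Assuming chronyd is the NTP client (an assumption; adjust for ntpd), a sketch that shows the current offset on both nodes:
# Sketch, assuming chronyd: display the system time offset on each node
clush -B -w node-[1-2] "chronyc tracking | grep 'System time'"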
SELinux disabled
setenforce 0
sed -i.bak "s/SELINUX=enforcing/SELINUX=permissive/g" /etc/selinux/config
Check
sestatus
NetworkManager service stopped and disabled
systemctl stop NetworkManager
systemctl disable NetworkManager
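A quick check (sketch; the expected outputs are assumptions based on standard systemctl behaviour):
systemctl is-active NetworkManager   # expected: inactive
systemctl is-enabled NetworkManager  # expected: disabled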
Firewall
If the firewall is enabled:
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --add-service=high-availability
Or
Disable the firewall:
systemctl stop firewalld
systemctl disable firewalld
#rpm -e firewalld
Check
iptables -L -n -v
Name resolution
Each node must be able to ping the other nodes by name. It is recommended to use /etc/hosts rather than DNS.
- /etc/hosts
127.0.0.1      localhost localhost.localdomain localhost4 localhost4.localdomain4
::1            localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1      node-1.localdomain
192.168.97.221 node-1.localdomain node-1
192.168.97.222 node-2.localdomain node-2
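A quick check to run on each node (sketch): every node name should resolve through /etc/hosts and answer a ping.
# Sketch: check name resolution and reachability of both nodes
for h in node-1 node-2; do
  getent hosts "$h"
  ping -c 1 -W 1 "$h" > /dev/null && echo "$h OK" || echo "$h KO"
done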
Install
Install packages
yum install -y pacemaker pcs psmisc policycoreutils-python
echo "P@ssw0rd" | passwd hacluster --stdin systemctl start pcsd.service systemctl enable pcsd.service #unset http_proxy #export no_proxy=localhost,127.0.0.1,node-1,node-2 pcs cluster auth node-1 node-2 #-u hacluster -p passwd #pcs cluster setup --start --name my_cluster node-1 node-2 pcs cluster setup --name my_cluster node-1 node-2 pcs cluster start --all pcs cluster enable --all
The corosync.conf file is created automatically.
- /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: my_cluster
    transport: udpu
}
nodelist {
    node {
        ring0_addr: node-1
        nodeid: 1
    }
    node {
        ring0_addr: node-2
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
Check the corosync configuration (1)
corosync-cfgtool -s
Must report "no faults".
Must not show the address 127.0.0.1.
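A sketch automating those two checks (any output indicates a problem):
# Sketch: report a ring bound to the loopback, or a status line without "no faults"
corosync-cfgtool -s | grep '127\.0\.0\.1'
corosync-cfgtool -s | grep 'status' | grep -v 'no faults'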
Check the corosync configuration (2)
corosync-cmapctl | grep members
pcs status corosync
Configuration
Prevent Resources from Moving after Recovery
pcs resource defaults resource-stickiness=100
No quorum
#pcs property set no-quorum-policy=ignore
pcs property set no-quorum-policy=freeze
Fencing / STONITH configuration
Tests in preparation for fencing via iDRAC
See https://www.devops.zone/tricks/connecting-ssh-drac-reboot-server/
Test fencing
/usr/sbin/fence_drac5 --ip=192.168.96.221 --username=root --password=calvin --ssh -c 'admin1->'
Test with OpenManage (/opt/dell/srvadmin/sbin/racadm)
racadm -r 192.168.96.221 -u root -p calvin get iDRAC.Info
Test via SSH to the iDRAC. To reboot the server by connecting to the iDRAC over SSH:
ssh root@192.168.96.221 racadm serveraction powercycle
If there is no STONITH / fence device, disable stonith; otherwise the VIP will refuse to start.
# If no stonith / fence
pcs property set stonith-enabled=false
Check
crm_verify -LVVV
Configuration
#pcs stonith create fence_node-1 fence_drac5 ipaddr=192.168.96.221 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-1 stonith-action=poweroff
pcs stonith create fence_node-1 fence_drac5 ipaddr=192.168.96.221 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-1 op monitor interval="60s"
pcs stonith create fence_node-2 fence_drac5 ipaddr=192.168.96.222 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=node-2 op monitor interval="60s"
pcs stonith level add 1 node-1 fence_node-1
pcs stonith level add 1 node-2 fence_node-2
Forbid suicide (a node fencing itself)
pcs constraint location fence_node-1 avoids node-1
pcs constraint location fence_node-2 avoids node-2
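To verify that the location constraints are in place (sketch, same approach as the "prefer" check further down):
pcs constraint --full | grep fence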
Test the fencing
#stonith_admin --reboot node-1
pcs stonith fence node-1
Adding resources
Add the VIP resource (virtual IP address)
pcs resource create myvip IPaddr2 ip=192.168.97.230 cidr_netmask=24 nic=bond0 op monitor interval=30s on-fail=fence
#pcs constraint location myvip prefers node-1=INFINITY
pcs constraint location myvip prefers node-1=100
pcs constraint location myvip prefers node-2=50
#pcs resource meta myvip resource-stickiness=100
Add the ping resource
pcs resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=192.168.97.250 --clone
pcs constraint location myvip rule score=-INFINITY pingd lt 1 or not_defined pingd
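The pingd node attribute evaluated by the rule can be checked with crm_mon; it should equal 1000 per reachable host (multiplier=1000):
# -A shows the node attributes (including pingd), -1 prints once and exits
crm_mon -A1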
Add the Apache resource
Beforehand, http://localhost/server-status must be configured (a minimal sketch follows) and the Apache service must be stopped on all nodes.
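A minimal server-status configuration as a sketch (the file name is an example, any file under conf.d works; Require local matches the 127.0.0.1 statusurl used below):
# Sketch: restrict server-status to local requests
cat > /etc/httpd/conf.d/server-status.conf <<'EOF'
<Location /server-status>
    SetHandler server-status
    Require local
</Location>
EOF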
curl http://localhost/server-status
systemctl stop httpd.service
systemctl disable httpd.service
pcs resource create srvweb apache configfile="/etc/httpd/conf/httpd.conf" statusurl="http://127.0.0.1/server-status" op monitor interval=1min #--clone
# The web server always runs with the VIP
pcs constraint colocation add srvweb with myvip
# The VIP first, then the web server
pcs constraint order myvip then srvweb
Operations
Move the VIP
pcs resource move myvip node-1
pcs resource move myvip node-2
Roll back - undo the VIP move
#pcs constraint --full |grep prefer
pcs constraint remove cli-prefer-myvip
pcs resource relocate run
Reset the error counter
#pcs resource failcount reset res1
#crm_resource -P
pcs resource cleanup
Move all resources back to the primary node (ignoring resource stickiness)
#pcs resource relocate show
pcs resource relocate run
Maintenance on a single resource
#pcs resource update fence_node-1 meta target-role=stopped
#pcs resource update fence_node-1 meta is-managed=false
#pcs resource update fence_node-1 op monitor enabled=false
#pcs resource disable fence_node-1
pcs resource unmanage fence_node-1
Cluster-wide maintenance
pcs property set maintenance-mode=true
End of maintenance
pcs property set maintenance-mode=false
Stopping the cluster
pcs cluster stop --all
pcs cluster disable --all
Diagnostics / Monitoring
Passive diagnostics
# Check the corosync configuration syntax
corosync -t
# Check cluster communication
corosync-cfgtool -s
# Check the node's network
corosync-cmapctl | grep members
Check
pcs cluster pcsd-status
pcs cluster verify
pcs status corosync
crm_mon -1 --fail
crm_mon -1Af
journalctl --since yesterday -p err
journalctl -u pacemaker.service --since "2017-02-24 16:00" -p warning
Monitoring script (these commands must return no output)
LANG=C pcs status | egrep "Stopped|standby|OFFLINE|UNCLEAN|Failed|error"
crm_verify -LVVV
LANG=C pcs resource relocate show | sed -ne '/Transition Summary:/,$p' | grep -v '^Transition Summary:'
crm_mon -1f | grep -q fail-count
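A sketch of a wrapper for a monitoring tool, combining these checks into a single exit code (script name and logic are illustrative):
#!/bin/sh
# Sketch: exit 0 if the cluster looks healthy, 2 otherwise
rc=0
LANG=C pcs status | egrep -q "Stopped|standby|OFFLINE|UNCLEAN|Failed|error" && rc=2
crm_verify -L > /dev/null 2>&1 || rc=2
crm_mon -1f | grep -q fail-count && rc=2
exit $rc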
If the /usr/local/bin/crm_logger.sh script is in place (see the ACL section below):
tailf /var/log/messages |grep "ClusterMon-External:"
Monitoring script: which node is active
LANG=C crm_resource --resource myvip --locate |cut -d':' -f2 |tr -d ' '
Does the web server respond correctly on the VIP address? (The return code must be 0.)
#curl -4 -m 1 --connect-timeout 1 http://192.168.97.230/ > /dev/null 2>&1
curl -4 -m 1 --connect-timeout 1 http://192.168.97.230/cl.html > /dev/null 2>&1
#echo $?
ACL
Read-only account with the rights to run crm_mon.
Warning: this account can retrieve the iDRAC/iLO password:
pcs stonith --full |grep passwd
Implementation
#adduser rouser
#usermod -a -G haclient rouser
usermod -a -G haclient process
pcs property set enable-acl=true
pcs acl role create read-only description="Read access to cluster" read xpath /cib
#pcs acl user create rouser read-only
pcs acl user create process read-only
#crm_mon --daemonize --as-html /var/www/html/cl.html
- /usr/local/bin/crm_logger.sh
#!/bin/sh
# https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/High_Availability_Add-On_Reference/Red_Hat_Enterprise_Linux-7-High_Availability_Add-On_Reference-en-US.pdf
logger -t "ClusterMon-External" "${CRM_notify_node:-x} ${CRM_notify_rsc:-x} \
  ${CRM_notify_task:-x} ${CRM_notify_desc:-x} ${CRM_notify_rc:-x} \
  ${CRM_notify_target_rc:-x} ${CRM_notify_status:-x} ${CRM_notify_recipient:-x}"
exit
chmod 755 /usr/local/bin/crm_logger.sh
chown root.root /usr/local/bin/crm_logger.sh
pcs resource create ClusterMon-External ClusterMon update=10000 user=process extra_options="-E /usr/local/bin/crm_logger.sh --watch-fencing" htmlfile=/var/www/html/cl.html pidfile=/tmp/crm_mon-external.pid op monitor on-fail="restart" interval="60" clone
Colocation - keep the monitoring page active where the VIP is
Only needed if the resource is not cloned
pcs constraint colocation add ClusterMon-External with myvip
Test
curl 192.168.97.230/cl.html
Active diagnostics
In case of problems
pcs resource debug-start resource_id
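For example, with the srvweb resource defined above:
pcs resource debug-start srvweb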
Adding a second interface for the heartbeat
Redundant Ring Protocol (RRP): if rrp_mode is set to active, Corosync uses both interfaces actively; if set to passive, Corosync sends messages alternately over the available networks.
Before modifying the configuration, put the cluster in maintenance mode:
pcs property set maintenance-mode=true
- /etc/hosts
192.168.21.10 node1
192.168.22.10 node1b
192.168.21.11 node2
192.168.22.11 node2b
Add rrp_mode and ring1_addr:
- /etc/corosync/corosync.conf
totem {
    rrp_mode: active
}
nodelist {
    node {
        ring0_addr: node1
        ring1_addr: node1b
        nodeid: 1
    }
    node {
        ring0_addr: node2
        ring1_addr: node2b
        nodeid: 2
    }
}
pcs cluster reload corosync
pcs cluster status corosync
corosync-cfgtool -s
pcs property unset maintenance-mode
Incident recovery
#crm_resource -P
pcs resource cleanup
pcs resource relocate run
#pcs cluster start --all
Crash-tests
Test 1: hard crash
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
Test 2: power failure - unplug the power cable
Test 3: network failure
ifdown bond0
Test 4: loss of the gateway ping on one of the nodes
iptables -A OUTPUT -d 192.168.97.250/32 -p icmp -j REJECT
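To remove the rule after the test (the inverse -D of the rule above):
iptables -D OUTPUT -d 192.168.97.250/32 -p icmp -j REJECT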
Test 5: fork bomb - the node no longer responds, except to ping
Fork bomb
:(){ :|:& };:
Test 6: loss of the iDRAC connection - unplug the cable
Cleanup - wipe everything
pcs cluster stop --force #--all
pcs cluster destroy --force #--all
systemctl stop pcsd
systemctl stop corosync
systemctl stop pacemaker
yum remove -y pcsd corosync pacemaker
userdel hacluster
rm -rf /dev/shm/qb-*-data /dev/shm/qb-*-header
rm -rf /etc/corosync
rm -rf /var/lib/corosync
rm -rf /var/lib/pcsd
rm -rf /var/lib/pacemaker
rm -rf /var/log/cluster/
rm -rf /var/log/pcsd/
rm -f /var/log/pacemaker.log*
Errors
1 Dell hardware error
UEFI0081: Memory size has changed from the last time the system was started. No action is required if memory was added or removed.
2 Fork-bomb test
error: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
Miscellaneous
To view / check the cluster properties
#pcs property set symmetric-cluster=true
pcs property
Resources
List
pcs resource standards
ocf lsb service systemd stonith
pcs resource providers
heartbeat openstack pacemaker
List the agents: examples
pcs resource agents systemd
pcs resource agents ocf:heartbeat
Default timeout for resources
pcs resource op defaults timeout=240s
Stop all resources
pcs property set stop-all-resources=true
pcs property unset stop-all-resources
ocf:pacemaker:ping
/usr/lib/ocf/resource.d/pacemaker/ping
ocf:heartbeat:apache
/usr/lib/ocf/resource.d/heartbeat/apache
egrep '^#.*OCF_RESKEY' /usr/lib/ocf/resource.d/heartbeat/apache
export OCF_ROOT=/usr/lib/ocf/
/usr/lib/ocf/resource.d/heartbeat/apache meta-data
Other: list all resources
crm_resource --list
Dump CIB (Cluster Information Base)
pcs cluster cib
pcs cluster cib cib-dump.xml
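The dumped CIB can then be edited offline and pushed back (sketch):
pcs cluster cib cib-dump.xml
# ... edit cib-dump.xml ...
pcs cluster cib-push cib-dump.xml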
Adding a service resource
pcs resource create CRON systemd:crond
#pcs resource op add CRON start interval=0s timeout=1800s
UPDATE
pcs resource update ClusterMon-External htmlfile='/tmp/cl.html'
UNSET
pcs resource update ClusterMon-External htmlfile=
Stonith
pcs property list --all |grep stonith
Confirm that the node is really powered off.
Warning: if it is not actually off, there is a risk of serious problems.
pcs stonith confirm node2
Failcount
crm_mon --failcounts
pcs resource failcount show resource_id
pcs resource failcount reset resource_id
Refresh the state and reset the failcount
pcs resource cleanup resource_id
Install from scratch
echo "P@ssw0rd" |passwd hacluster --stdin systemctl start pcsd.service systemctl enable pcsd.service pcs cluster auth -u hacluster -p P@ssw0rd 8si-pms-pps-srv-1 8si-pms-pps-srv-2 pcs cluster setup --name my_cluster 8si-pms-pps-srv-1 8si-pms-pps-srv-2 pcs cluster start --all pcs cluster enable --all pcs resource defaults resource-stickiness=100 pcs property set no-quorum-policy=freeze pcs stonith create fence_8si-pms-pps-srv-1 fence_drac5 ipaddr=172.18.202.230 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=8si-pms-pps-srv-1 op monitor interval="60s" pcs stonith create fence_8si-pms-pps-srv-2 fence_drac5 ipaddr=172.18.202.231 login=root passwd=calvin secure=1 cmd_prompt="/admin1->" pcmk_host_list=8si-pms-pps-srv-2 op monitor interval="60s" pcs stonith level add 1 8si-pms-pps-srv-1 fence_8si-pms-pps-srv-1 pcs stonith level add 1 8si-pms-pps-srv-2 fence_8si-pms-pps-srv-2 pcs constraint location fence_8si-pms-pps-srv-1 avoids 8si-pms-pps-srv-1 pcs constraint location fence_8si-pms-pps-srv-2 avoids 8si-pms-pps-srv-2 pcs resource create myvip IPaddr2 ip=172.18.202.226 cidr_netmask=24 nic=bond0 op monitor interval=30s #on-fail=fence pcs constraint location myvip prefers 8si-pms-pps-srv-1=100 pcs constraint location myvip prefers 8si-pms-pps-srv-2=50 #pcs resource meta myvip resource-stickiness=60 # l'utilisateur process doit appartenir au groupe haclient #usermod -a -G haclient process pcs property set enable-acl=true pcs acl role create read-only description="Read access to cluster" read xpath /cib pcs acl user create process read-only pcs resource create ClusterMon-External ClusterMon update=10000 user=process extra_options="-E /usr/local/bin/crm_logger.sh --watch-fencing" htmlfile=/var/www/html/cl.html pidfile=/tmp/crm_mon-external.pid op monitor on-fail="restart" interval="60" clone pcs resource create appmgr systemd:appmgr pcs constraint colocation add appmgr with myvip
See also:
- General concepts
- keepalived
- Ricci / Luci / Ccs / cman (ccs, ricci and luci are deprecated)
- Css
- Others
Fencing
Cluster
