Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

KVM: Enable HA heartbeat on ShareMountPoint#12773

Draft
weizhouapache wants to merge 5 commits intoapache:4.22from
weizhouapache:4.22-kvm-ha-on-shared-mount-point
Draft

KVM: Enable HA heartbeat on ShareMountPoint#12773
weizhouapache wants to merge 5 commits intoapache:4.22from
weizhouapache:4.22-kvm-ha-on-shared-mount-point

Conversation

@weizhouapache
Copy link
Member

Description

This PR improves KVM HA heartbeat to support SharedMountPoint

Tested

  • add shared mount point pool
  • kvm hosts write heartbeat to the mount point
  • deploy vms
  • set global settingforce.ha to true
  • force shutdown one of the kvm hosts
  • after 3 mins, the VMs are started on other kvm hosts.

Example of heartbeat files

[root@kvm1 ~]# ls -l /mnt/cloudstack*/KVMHA
/mnt/cloudstack1/KVMHA:
total 1
-rw-r--r--. 1 root root 11 Mar  9 12:28 hb-10.1.34.125
-rw-r--r--. 1 root root 11 Mar  9 12:09 hb-10.1.35.76

/mnt/cloudstack2/KVMHA:
total 1
-rw-r--r--. 1 root root 11 Mar  9 12:28 hb-10.1.34.125
-rw-r--r--. 1 root root 11 Mar  9 12:09 hb-10.1.35.76

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@weizhouapache weizhouapache force-pushed the 4.22-kvm-ha-on-shared-mount-point branch from 296835a to 93b95aa Compare March 9, 2026 12:47
@codecov
Copy link

codecov bot commented Mar 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 3.70%. Comparing base (c3d6a8c) to head (4e4da44).
⚠️ Report is 5 commits behind head on 4.22.

❗ There is a different number of reports uploaded between BASE (c3d6a8c) and HEAD (4e4da44). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (c3d6a8c) HEAD (4e4da44)
unittests 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##               4.22   #12773       +/-   ##
=============================================
- Coverage     17.61%    3.70%   -13.91%     
=============================================
  Files          5917      448     -5469     
  Lines        531402    38028   -493374     
  Branches      64971     7036    -57935     
=============================================
- Hits          93586     1409    -92177     
+ Misses       427262    36432   -390830     
+ Partials      10554      187    -10367     
Flag Coverage Δ
uitests 3.70% <ø> (-0.01%) ⬇️
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@weizhouapache
Copy link
Member Author

@blueoranguran package

@weizhouapache weizhouapache requested a review from Copilot March 9, 2026 16:02
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Enhances KVM HA heartbeat to support SharedMountPoint primary storage by introducing a new heartbeat script and extending HA support checks across KVM storage and monitor components.

Changes:

  • Added a new KVM “SharedMountPoint” heartbeat script (kvmsmpheartbeat.sh) that writes heartbeat files to a mountpoint.
  • Expanded HA-supported storage pool types to include SharedMountPoint in driver, monitor, and pool logic.
  • Added a setType hook to KVMStoragePool and set the pool type after creation / retrieval to ensure HA detection works.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
scripts/vm/hypervisor/kvm/kvmsmpheartbeat.sh New local mountpoint-based heartbeat writer/checker script for SharedMountPoint HA.
plugins/storage/volume/default/.../CloudStackPrimaryDataStoreDriverImpl.java Allows HA support for SharedMountPoint in primary storage capability logic.
plugins/hypervisors/kvm/.../LibvirtStoragePool.java Enables HA support for SharedMountPoint and selects the correct heartbeat script.
plugins/hypervisors/kvm/.../KVMStoragePoolManager.java Sets pool type after creation; exposes getStorageAdaptor.
plugins/hypervisors/kvm/.../KVMStoragePool.java Introduces a default setType method for backward compatibility.
plugins/hypervisors/kvm/.../LibvirtCheckVMActivityOnStoragePoolCommandWrapper.java Ensures retrieved pool has its type set before HA checks.
plugins/hypervisors/kvm/.../KVMHAMonitor.java Expands heartbeat monitoring to include SharedMountPoint pools.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +168 to +201
if [ ! -f "$hbFile" ]; then
# signal large difference if file missing
return 999999
fi
now=$(date +%s)
hb=$(cat "$hbFile" 2>/dev/null)
if [ -z "$hb" ]; then
return 999998
fi
diff=`expr $now - $hb 2>/dev/null`
if [ $? -ne 0 ]
then
return 999997
fi
if [ -z "$interval" ]; then
# if no interval provided, consider 0 as success
if [ $diff -gt 0 ]; then
return $diff
else
return 0
fi
fi
if [ $diff -gt $interval ]
then
return $diff
fi
return 0
}

if [ "$rflag" == "1" ]
then
check_hbLog
diff=$?
if [ $diff == 0 ]
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return in bash only supports exit codes 0–255; values like 999999 will wrap modulo 256, so diff=$? will not contain the intended “seconds ago” value and can lead to incorrect DEAD/ALIVE reporting. Consider having check_hbLog print the computed diff (e.g., via echo) and use its return code only as success/failure, or store the diff in a global variable and return 0/1.

Suggested change
if [ ! -f "$hbFile" ]; then
# signal large difference if file missing
return 999999
fi
now=$(date +%s)
hb=$(cat "$hbFile" 2>/dev/null)
if [ -z "$hb" ]; then
return 999998
fi
diff=`expr $now - $hb 2>/dev/null`
if [ $? -ne 0 ]
then
return 999997
fi
if [ -z "$interval" ]; then
# if no interval provided, consider 0 as success
if [ $diff -gt 0 ]; then
return $diff
else
return 0
fi
fi
if [ $diff -gt $interval ]
then
return $diff
fi
return 0
}
if [ "$rflag" == "1" ]
then
check_hbLog
diff=$?
if [ $diff == 0 ]
hb_diff=0
if [ ! -f "$hbFile" ]; then
# signal large difference if file missing
hb_diff=999999
return 1
fi
now=$(date +%s)
hb=$(cat "$hbFile" 2>/dev/null)
if [ -z "$hb" ]; then
hb_diff=999998
return 1
fi
diff=`expr $now - $hb 2>/dev/null`
if [ $? -ne 0 ]
then
hb_diff=999997
return 1
fi
if [ -z "$interval" ]; then
# if no interval provided, consider 0 as success
if [ $diff -gt 0 ]; then
hb_diff=$diff
return 1
else
hb_diff=0
return 0
fi
fi
if [ $diff -gt $interval ]
then
hb_diff=$diff
return 1
fi
hb_diff=0
return 0
}
if [ "$rflag" == "1" ]
then
check_hbLog
status=$?
diff="${hb_diff:-0}"
if [ $status -eq 0 ]

Copilot uses AI. Check for mistakes.
Comment on lines +201 to +206
if [ $diff == 0 ]
then
echo "=====> ALIVE <====="
else
echo "=====> Considering host as DEAD because last write on [$hbFile] was [$diff] seconds ago, but the max interval is [$interval] <======"
fi
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing a numeric comparison but uses the string operator == and unquoted $diff. Use numeric operators (-eq, -ne) and quote variables in [ tests (or switch to [[ ... ]]) to avoid mis-comparisons or test errors if the variable is empty/unset.

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +110
if [ $? -gt 0 ]
then
return
fi
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$? here reflects only the exit status of the last command in the pipeline (awk), not whether ps/grep matched anything. As written, this check won’t behave as intended and is effectively redundant with the later -z "$vmPids" guard. Consider removing this $? block, or use set -o pipefail (with care for script-wide impact) or replace the pipeline with a more direct PID lookup (e.g., pgrep) and check for empty output.

Suggested change
if [ $? -gt 0 ]
then
return
fi

Copilot uses AI. Check for mistakes.
Comment on lines +124 to +130
mounts=$(cat /proc/mounts | grep "$MountPoint")
if [ $? -gt 0 ]
then
# mount point not present — we don't remount in local-only script
# nothing to do here; keep for compatibility with original flow
:
else
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grep "$MountPoint" can produce false positives when $MountPoint is a substring of another mount target (e.g., /mnt/cloudstack1 matching /mnt/cloudstack10). This can incorrectly treat the mount as present and trigger deleteVMs. Consider using a stricter mount check (e.g., matching the mount target field exactly or using findmnt).

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mounts=$(cat /proc/mounts | grep "$MountPoint")
if [ $? -gt 0 ]
then
# mount point not present — we don't remount in local-only script
# nothing to do here; keep for compatibility with original flow
:
else
mounts=$(cat /proc/mounts | grep "$MountPoint ")
if [ $? -gt 0 ]
then
# mount point not present — we don't remount in local-only script
# nothing to do here; keep for compatibility with original flow
:
else


public class KVMHAMonitor extends KVMHABase implements Runnable {

public static final List<StoragePoolType> STORAGE_POOL_TYPES_WITH_HA_SUPPORT = List.of(StoragePoolType.NetworkFilesystem, StoragePoolType.SharedMountPoint);
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HA-supported pool types are now defined in multiple places (e.g., also in CloudStackPrimaryDataStoreDriverImpl), and LibvirtStoragePool depends on KVMHAMonitor just to reuse this constant. Consider centralizing this list in a small shared utility/helper in an appropriate common package for the KVM plugin (or a single authoritative location) to avoid duplication and reduce cross-class coupling.

Copilot uses AI. Check for mistakes.
@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✖️ debian ✔️ suse15. SL-JID 17048

@weizhouapache
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17054

@weizhouapache
Copy link
Member Author

@blueorangutan test

@blueorangutan
Copy link

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@weizhouapache weizhouapache added this to the 4.22.1 milestone Mar 10, 2026
@weizhouapache
Copy link
Member Author

@sureshanaparti @rajujith
added this to 4.22.1 milestone, a small improvement but may benefit several users

cc @DaanHoogland @NuxRo

@blueorangutan
Copy link

[SF] Trillian test result (tid-15586)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 55643 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12773-t15586-kvm-ol8.zip
Smoke tests completed. 146 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestListIdsParams>:teardown Error 1.12 test_list_ids_parameter.py
test_01_snapshot_root_disk Error 7.74 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 47.53 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 47.54 test_snapshots.py
ContextSuite context=TestSnapshotStandaloneBackup>:teardown Error 31.34 test_snapshots.py
test_01_snapshot_usage Error 25.61 test_usage.py
test_01_vpn_usage Error 1.08 test_usage.py

weizhouapache and others added 3 commits March 10, 2026 17:01
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@rajujith rajujith moved this to In Progress in Apache CloudStack 4.22.1 Mar 11, 2026
Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/storage/KVMStoragePoolManager.java:399

  • isPoolSupportHA() now includes SharedMountPoint, so this block will add SharedMountPoint primary pools to _haMonitor. However, deleteStoragePool(...) below still removes pools from _haMonitor only when type == NetworkFilesystem, which can leave SharedMountPoint pools registered in the HA monitor after deletion/disconnect. Update the removal condition to stay consistent with the expanded HA-support set.
    private synchronized KVMStoragePool createStoragePool(String name, String host, int port, String path, String userInfo, StoragePoolType type, Map<String, String> details, boolean primaryStorage) {
        StorageAdaptor adaptor = getStorageAdaptor(type);
        KVMStoragePool pool = adaptor.createStoragePool(name, host, port, path, userInfo, type, details, primaryStorage);
        pool.setType(type);

        // LibvirtStorageAdaptor-specific statement
        if (pool.isPoolSupportHA() && primaryStorage) {
            KVMHABase.HAStoragePool storagePool = new KVMHABase.HAStoragePool(pool, host, path, PoolType.PrimaryStorage);
            _haMonitor.addStoragePool(storagePool);
        }

plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/KVMHAMonitor.java:145

  • The debug log messages in this method are now used for both NetworkFilesystem and SharedMountPoint, but still say "Found NFS storage pool". Also, this "Found ... continuing" message is logged even after the pool is marked for removal (e.g. when storage == null). Adjust the log text/placement to reflect "libvirt storage pool" and only log "found" when the pool is actually present and running.
    private void checkForNotExistingLibvirtStoragePools(Set<String> removedPools, String uuid) {
        try {
            Connect conn = LibvirtConnection.getConnection();
            StoragePool storage = conn.storagePoolLookupByUUIDString(uuid);
            if (storage == null || storage.getInfo().state != StoragePoolState.VIR_STORAGE_POOL_RUNNING) {
                if (storage == null) {
                    logger.debug(String.format("Libvirt storage pool [%s] not found, removing from HA list.", uuid));
                } else {
                    logger.debug(String.format("Libvirt storage pool [%s] found, but not running, removing from HA list.", uuid));
                }

                removedPools.add(uuid);
            }

            logger.debug(String.format("Found NFS storage pool [%s] in libvirt, continuing.", uuid));


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +118 to +129
mounts=$(cat /proc/mounts | grep "$MountPoint")
if [ $? -gt 0 ]
then
# mount point not present — we don't remount in local-only script
# nothing to do here; keep for compatibility with original flow
:
else
# mount exists; if not in read-check mode, consider deleting VMs similar to original behavior
if [ "$rflag" == "0" ]
then
deleteVMs $MountPoint
fi
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mount-point detection uses grep "$MountPoint" on /proc/mounts, which can yield false positives when the mount point is a prefix of another path (e.g. /mnt/cloudstack1 matching /mnt/cloudstack10). This can cause the script to treat the mount as present/absent incorrectly and trigger follow-on actions. Use an exact match on the mount target field (e.g. findmnt -n -o TARGET --target "$MountPoint" or matching the second column in /proc/mounts).

Suggested change
mounts=$(cat /proc/mounts | grep "$MountPoint")
if [ $? -gt 0 ]
then
# mount point not present — we don't remount in local-only script
# nothing to do here; keep for compatibility with original flow
:
else
# mount exists; if not in read-check mode, consider deleting VMs similar to original behavior
if [ "$rflag" == "0" ]
then
deleteVMs $MountPoint
fi
if grep -q "^[^ ]\+ $MountPoint " /proc/mounts
then
# mount exists; if not in read-check mode, consider deleting VMs similar to original behavior
if [ "$rflag" == "0" ]
then
deleteVMs $MountPoint
fi
else
# mount point not present — we don't remount in local-only script
# nothing to do here; keep for compatibility with original flow
:

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +131
# mount exists; if not in read-check mode, consider deleting VMs similar to original behavior
if [ "$rflag" == "0" ]
then
deleteVMs $MountPoint
fi
fi

Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleteVMs is executed when the mount exists (the grep succeeds) and the script is in write mode. In the original kvmheartbeat.sh, VMs are only killed after a remount attempt when the expected mount is missing. As written, this can kill running VMs on every successful heartbeat run whenever the storage is mounted. The condition should be inverted/adjusted so VM killing is only done for the intended failure/remount scenario (or removed entirely for the local-only heartbeat flow).

Suggested change
# mount exists; if not in read-check mode, consider deleting VMs similar to original behavior
if [ "$rflag" == "0" ]
then
deleteVMs $MountPoint
fi
fi
# mount exists; in local-only heartbeat we do not kill VMs based on mount state
# original kvmheartbeat.sh only deleted VMs after a failed remount; there is
# no such remount logic here, so VM deletion is intentionally disabled
:
fi

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +195
return 999999
fi
now=$(date +%s)
hb=$(cat "$hbFile" 2>/dev/null)
if [ -z "$hb" ]; then
return 999998
fi
diff=`expr $now - $hb 2>/dev/null`
if [ $? -ne 0 ]
then
return 999997
fi
if [ -z "$interval" ]; then
# if no interval provided, consider 0 as success
if [ $diff -gt 0 ]; then
return $diff
else
return 0
fi
fi
if [ $diff -gt $interval ]
then
return $diff
fi
return 0
}

if [ "$rflag" == "1" ]
then
check_hbLog
diff=$?
if [ $diff == 0 ]
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_hbLog returns large numeric values (e.g. return 999999 / return $diff) and the caller reads them via $?. In bash, function return/exit codes are limited to 0-255, so values wrap modulo 256 and can incorrectly report ALIVE (e.g. a diff of 512 seconds becomes exit code 0). Use stdout to pass the diff (or store it in a variable) and use the exit status purely as a boolean (alive/dead).

Suggested change
return 999999
fi
now=$(date +%s)
hb=$(cat "$hbFile" 2>/dev/null)
if [ -z "$hb" ]; then
return 999998
fi
diff=`expr $now - $hb 2>/dev/null`
if [ $? -ne 0 ]
then
return 999997
fi
if [ -z "$interval" ]; then
# if no interval provided, consider 0 as success
if [ $diff -gt 0 ]; then
return $diff
else
return 0
fi
fi
if [ $diff -gt $interval ]
then
return $diff
fi
return 0
}
if [ "$rflag" == "1" ]
then
check_hbLog
diff=$?
if [ $diff == 0 ]
echo 999999
return 1
fi
now=$(date +%s)
hb=$(cat "$hbFile" 2>/dev/null)
if [ -z "$hb" ]; then
echo 999998
return 1
fi
diff=`expr $now - $hb 2>/dev/null`
if [ $? -ne 0 ]
then
echo 999997
return 1
fi
if [ -z "$interval" ]; then
# if no interval provided, consider 0 as success
if [ $diff -gt 0 ]; then
echo "$diff"
return 1
else
echo 0
return 0
fi
fi
if [ $diff -gt $interval ]
then
echo "$diff"
return 1
fi
echo 0
return 0
}
if [ "$rflag" == "1" ]
then
diff=$(check_hbLog)
status=$?
if [ "$status" -eq 0 ]

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

KV Host HA/instance HA with shared mountpoint

5 participants