Alert Reference

Objective

This document provides reference information on the types of alerts supported by Volterra. Use it to understand the details of each alert and the action required in response.

Key Points

The following apply to Volterra alerts:

  • There is no separate alert for health score. This is because the health score is composed of multiple components. For example, the health score of a site is computed from the data-plane connection status to the Regional Edge (RE) sites, the control-plane connection status, and the K8s API server status in the site. There are individual alerts defined for each of these conditions, but no alert is available for the health score itself.

Note: You can obtain the health score of a site in VoltConsole. You can also obtain it using the API https://www.volterra.io/docs/api/graph-connectivity#operation/ves.io.schema.graph.connectivity.CustomAPI.NodeQuery with "field_selector":{"healthscore":{"types":["HEALTHSCORE_OVERALL"]}}.

  • The amount of time before an alert is generated is not the same for all alerts. The duration is determined by the severity of the alert. For example, an alert is raised as soon as the tunnel connection to an RE goes down, whereas a health check alert for a service is raised only if the condition persists for 10 minutes. This keeps the alert volume at a manageable level and avoids generating alerts for temporary or transient failure conditions.
  • Changing the threshold for alerts is not supported.
  • Volterra does not support user-defined alerts through an API. However, if the existing alerts do not satisfy your requirements, you can create a support request for a new alert in VoltConsole.
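
The delay before alert generation described above can be pictured as a hold timer: the alert fires only once its condition has been continuously true for a given duration. The following is a minimal sketch of that idea, not Volterra's implementation. The `HoldTimer` class and the two hold values are assumptions chosen to mirror the two examples in the text (immediate for a tunnel down, 10 minutes for a service health check).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HoldTimer:
    """Hypothetical helper: fires once a condition has held for hold_seconds."""

    hold_seconds: float
    _since: Optional[float] = None  # time at which the condition first became true

    def observe(self, condition_true: bool, now: float) -> bool:
        """Record one sample and return True if the alert should be firing."""
        if not condition_true:
            self._since = None  # condition cleared, so the timer resets
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.hold_seconds


# Assumed hold times, taken from the two examples in the text.
tunnel_down = HoldTimer(hold_seconds=0)     # raised as soon as the tunnel is down
health_check = HoldTimer(hold_seconds=600)  # raised after 10 minutes
```

A transient failure that clears before the hold time elapses resets the timer, so no alert is raised for it.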

Alerts & Descriptions

The following table presents alerts and associated details such as group, type, severity, and associated actions.

Alert Name Type Group Severity Description Action
CaptchaChallengeFailure CAPTCHA Challenge Failure event Security major CAPTCHA challenge failed. Consider blocking the relevant users/IPs using FastACL, Network Policy or Service Policy.
ErrorRateAnomaly Error Rate Anomaly custom Timeseries-Anomaly minor Error rate anomaly detected. Metric looks abnormal and needs attention.
FluentbitOutputErrors Log Collection Error metric Infrastructure major Fluentbit has output errors. Collect info and open issue. Monitor Grafana fluent dashboard. Let L2 fix during working hours.
JsChallengeFailure JS Challenge Failure event Security major JS challenge failed. Consider blocking the relevant users/IPs using FastACL, Network Policy or Service Policy.
KubeAPIErrorsHigh K8S API Error metric Infrastructure major API server is returning errors for some requests. Check kube-apiserver log to see the detail. Contact support if the issue persists.
KubeAPILatencyHigh K8S API Error metric IaaS-CaaS minor Kubernetes API latency at 99th percentile is too high for more than 2 seconds. Possible intermittent problem which may occur during parallel application updates. Check HW utilization of CE site. If the problem persists for longer than an hour, contact support.
KubeCronJobRunning K8S Job Too Long metric IaaS-CaaS minor Kubernetes CronJob running for more than an hour. Job can be stuck or it is expected to run longer. Check logs from Kubernetes Pod. Contact support in case of non-customer vk8s workload.
KubeDaemonSetMisScheduled K8S Daemonset Error metric IaaS-CaaS minor Some pods of DaemonSet are running where they are not supposed to run. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet.
KubeDaemonSetNotScheduled K8S Daemonset Error metric IaaS-CaaS minor Some pods of DaemonSet are not scheduled. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet.
KubeDaemonSetRolloutStuck K8S Daemonset Error metric IaaS-CaaS minor Kubernetes DaemonSet desired Pods are not scheduled or ready. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s DaemonSet.
KubeDeploymentGenerationMismatch K8S Deployment Error metric IaaS-CaaS minor Deployment generation does not match. This indicates that the Deployment has failed but has not been rolled back. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeDeploymentReplicasMismatch K8S Deployment Error metric IaaS-CaaS minor Kubernetes Deployment has not matched the expected number of Pod replicas for more than 1hr. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeJobFailed K8S Job Failed metric IaaS-CaaS minor Kubernetes Job failed to complete in last 2 hours. Check Kubernetes Job and Pod status, events and logs in vK8s cluster. Contact support in case of etcd job.
KubeMetricsMissing Kubernetes Metrics Missing metric Infrastructure critical Essential Kubernetes metrics are missing. All Kubernetes alerts are affected as well. Check whether the kube-state-metrics workload is running and inspect its logs. Restart this service on the cluster where this alert appeared.
KubeNodeUnschedulable K8S Node Scheduling Disabled metric IaaS-CaaS minor Node has Scheduling Disabled. TODO
KubePersistentVolumeFullInFourDays K8S PVC Error metric IaaS-CaaS major Based on recent sampling, the PersistentVolumeClaim is expected to fill up within four days. Resize PVC or clean disk.
KubePersistentVolumeSpaceLow K8S PVC Error metric IaaS-CaaS major Kubernetes PersistentVolumeClaim is getting out of space. Resize PVC or clean disk.
KubePodCPUThrottlingHigh K8S Pod CPU Throttled metric IaaS-CaaS major Kubernetes Pod container is throttling its CPU limits. Increase flavor for vk8s Deployment or StatefulSet definition. Contact support in case of non vk8s Pod.
KubePodContainerTooMuchMemory metric IaaS-CaaS critical More than 90% of allowed memory is being used by container. Add more replicas.
KubePodCrashLooping K8S Pod Crashing metric IaaS-CaaS minor Kubernetes Pod container restarting often. Possible causes can be out of memory limit (OOM), liveness probe or container entrypoint failure. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubePodNotReady K8S Pod Not Ready metric IaaS-CaaS minor Pod has been in a non-ready state for more than 10 min. The reason might be readiness probe failures, scheduling failures due to exhausted quotas, or a broken node. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeStatefulSetGenerationMismatch K8S StatefulSet Error metric IaaS-CaaS minor StatefulSet generation does not match. This indicates that the StatefulSet has failed but has not been rolled back.
KubeStatefulSetReplicasMismatch K8S StatefulSet Error metric IaaS-CaaS minor Kubernetes StatefulSet has not matched the expected number of Pod replicas for longer than 15 minutes. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeStatefulSetUpdateNotRolledOut K8S StatefulSet Error metric IaaS-CaaS minor StatefulSet update has not been rolled out. Check Kubernetes Pod status, events and logs in vK8s cluster. Contact support in case of non vk8s Deployment.
KubeVersionMismatch K8S Internal Error metric Infrastructure minor There are different versions of Kubernetes components running. This can be caused by failure during Volterra Software Upgrade. Check Volterra Software Upgrade status. Ignore if upgrade is in progress.
LoggingForwardFailed Log Collection Error metric Infrastructure critical Log collection has failed to forward logs for more than 15 minutes. Node is not sending logs. Check Fluentd status and health. Inspect fluentbit logs for errors. If no fluentbit instance can reach fluentd, restart the fluentd instances. If the problem persists for more than 2 hours, escalate to L2.
LoggingOutputQueueStucked Log Collection Error metric Infrastructure major Fluentbit output queue is stuck. Restart fluentbit. Escalate to L2 if the problem persists for more than 2 hours.
LoggingRetriesFailed Log Collection Error metric Infrastructure critical Log collector has tried too many times to forward logs in last 15 minutes. Check network connectivity between CE and RE site.
MaliciousUserDetected Malicious User Detected event Security major Malicious user detected. Consider blocking the relevant user using FastACL, Network Policy or Service Policy.
NodeAideFilesAddedRemoved Node Error event Infrastructure major Monitored files on filesystem were unexpectedly modified. Use logs to verify which files were modified and why.
NodeAideFilesChanged Node Error event Infrastructure critical Monitored files on filesystem were unexpectedly modified. Use logs to verify which files were modified. Create an issue in GitLab for tracking. Immediately escalate to the Security Team (L2).
NodeAideNotRunning Node Error event Infrastructure critical Aide check did not run in past 24 hours. Check the service status and notify the security team.
NodeFilesystemFilesFillingUp Node Filesystem Error metric Infrastructure minor Filesystem at node is predicted to run out of files within the next 24 hours. Check disk usage at Site dashboard. Deprovision workload or add new node into site.
NodeFilesystemOutOfFiles Node Filesystem Error metric Infrastructure minor Filesystem at node has only a few percent available inodes left. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE. Contact support if the problem persists.
NodeFilesystemOutOfSpace Node Filesystem Error metric Infrastructure major Filesystem at node has only a few percent available space left. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE.
NodeFilesystemSpaceFillingUp Node Filesystem Error metric Infrastructure minor Filesystem at node is predicted to run out of space within the next 24 hrs. Check disk usage at Site dashboard. Deprovision workload or add new node into site. Do disk resize in case of cloud CE.
NodeLoadHigh Node Load High metric Infrastructure minor Node has higher load than 1 per CPU for more than 10 mins. Add new node into site or deprovision workload. If the problem persists for more than 30 minutes, escalate to L2 immediately.
NodeNicMgmtDegraded Node NIC Error event Infrastructure critical Management NIC configuration issues detected on node. Check the network connectivity.
NodeNicTxTimeout Node NIC Error event Infrastructure critical Node network TX timeouts detected. Check the network connectivity.
NodeNotReady K8S Node Error metric Infrastructure critical Site node is down. Pods cannot be scheduled or deprovisioned since node is not responding. Check Node and HW status in console UI. Reboot node. If the problem persists for longer than 1 hour, contact support.
NodeTooManyPods K8S Node Error metric Infrastructure minor Number of running pods is near maximum. Add a new node to the affected site or deprovision some workload.
NodeUSBDeviceConnected USB Device Detected event Infrastructure major New USB device connected to the node. No action required.
NodeUSBDeviceDisconnected USB Device Disconnected event Infrastructure major USB device disconnected from the node. No action required.
RequestRateAnomaly Request Rate Anomaly custom Timeseries-Anomaly minor Request rate anomaly detected. Metric looks abnormal and needs attention.
RequestThroughputAnomaly Request Throughput Anomaly custom Timeseries-Anomaly minor Request throughput anomaly detected. Metric looks abnormal and needs attention.
ResponseLatencyAnomaly Response Latency Anomaly custom Timeseries-Anomaly minor Response latency anomaly detected. Metric looks abnormal and needs attention.
ResponseThroughputAnomaly Response Throughput Anomaly custom Timeseries-Anomaly major Response throughput anomaly detected. Metric looks abnormal and needs attention.
SSOCreated SSO Provider Created event UAM major New UAM SSO provider was created. No action required.
SSODeleted SSO Provider Deleted event UAM major Existing UAM SSO provider was deleted. No action required.
ServiceClientErrorPerSourceSite Virtual Host Client Error metric Virtual-Host major More than 10% of the requests from site to service failed due to client error. Some clients are sending invalid requests to the virtual-host. Consider blocking the relevant users/IPs using Volterra Policy features.
ServiceEndpointHealthcheckFailure Endpoint healthcheck failure metric Virtual-Host minor Healthcheck failed for virtual-host endpoint. Check the health of the origin servers. Check connectivity of origin servers to Volterra.
ServiceServerErrorPerSourceSite Virtual Host Server Error metric Virtual-Host major Proxy is seeing excessive errors from upstream origin servers. Check the health of the origin servers. Check connectivity of origin servers to Volterra.
SiteBgpToTGWDown Site BGP to TGW Down metric Ves-Software critical Site's BGP peering to TGW is down. Verify network connectivity on given site and status of AWS VM.
SiteCertificateExpiration K8S Client Certificate Error metric Infrastructure minor Kubernetes certificate is expiring for your Volterra Site. To avoid interruption, upgrade to the latest available Volterra Software Version. Upgrade Volterra Software Version to latest available.
SiteCustomerTunnelInterfaceDown Customer Tunnel Interface Down metric Infrastructure major Connection from CE to a single RE is down. Some functionality will be limited. Check physical and network connectivity of the CE.
SiteDeleted Site Deleted event Infrastructure critical Entire site was deleted. No action required.
SiteHardwareChanged Site Hardware Changed metric Infrastructure minor Customer Edge node changed certified hardware. No action required.
SiteHttpProbeDown RE to Customer Site Tunnel Down metric Infrastructure major HTTP check from connected Regional Edge to Customer Edge has failed. Check the network connectivity.
SiteHttpUnhealthy Remote HTTP check failed metric IaaS-CaaS major Communication with Volterra services at site is failing. Check the network connectivity.
SiteNodeHeartbeatMissed Site Heartbeat Down metric Infrastructure major Node at site did not send heartbeat for more than 20 minutes. Check network connectivity and power status of node in Site. If the node is running, try rebooting it.
SiteNonconformingVersion Site Running Nonconforming Version metric Infrastructure major Site is running unsupported software version. Update the site's software version.
SitePhysicalInterfaceDown Physical Interface Down metric Infrastructure critical One of the physical interfaces of CE went down. Check physical and network connectivity of the CE.
SitePhysicalInterfaceDown Physical Interface Down event Infrastructure critical Physical interface on node is down. Check the network connectivity.
SiteRegistrationApproved Site Registration Approved event Infrastructure major Site registration was approved and is waiting for configuration. Check registration object for failure.
SiteRegistrationDeleted Site Registration Deleted event Infrastructure major The site node registration was deleted. No action required.
SiteRegistrationDuplicateName Site Registration Duplicate Name Error event Infrastructure major Cannot register node with given name, the same name is already registered. Choose different node name.
SiteRegistrationPending Site Registration Pending event Infrastructure major Site registration is in pending state. Check registration object for failure.
SiteSSHFailedLogin SSH Failed Login event UAM major Failed SSH login to node detected. Validate access with respect to your internal security policies.
SiteSSHLoginWithLockOutCert SSH Login with Lock out Cert event UAM critical SSH login to node with lock out cert detected. Validate access with respect to your internal security policies.
SiteSSHLoginWithOfflineCert SSH Login with OFFLINE certificate event UAM critical SSH login to node with OFFLINE ssh-cert detected. Security incident on PRODUCTION. No action needed on test/crt/staging environments.
SiteSSHPasswordLogin SSH Password Login event UAM critical SSH login to node using password authentication detected. Validate access with respect to your internal security policies.
SiteSSHPubkeyLogin SSH Pubkey Login event UAM major SSH login using key to node detected. Validate access with respect to your internal security policies.
SiteSudoExecuted Sudo Command Executed event UAM major Privileged command execution at node detected. Validate command with respect to your internal security policies.
SiteTGWTunnelDown Site's tunnel interfaces to TGW are down metric Ves-Software critical Site's tunnel interfaces to TGW are down. Verify network connectivity on given site and status of AWS VM.
SiteTGWTunnelDown Site TGW Tunnel Down metric Ves-Software critical Site's tunnel interfaces to TGW are down. Verify network connectivity on given site and status of AWS VM.
SiteTunnelConnectionDown IPSec/SSL Tunnel Connection Down event Infrastructure critical IPSec/SSL tunnel connection to the site is down. Check the network connectivity.
SiteTunnelInterfaceDown Tunnel Interface Down metric Infrastructure critical Connections from both REs to the CE are down. Majority of functionality will be impacted. Check physical and network connectivity of the CE.
SiteUpgradeFailing Site Upgrade Failing metric Infrastructure critical Volterra software upgrade is failing at Site. It retries every 10 minutes and keeps updating the status. Check Volterra Software status message info. Contact support if the problem persists for more than 30 minutes.
UserCreated User Created event UAM major New UAM user was created. No action required.
UserDeleted User Deleted event UAM major Existing UAM user was deleted. No action required.
UserUpdated User Updated event UAM major Existing UAM user was updated. No action required.
VerHighHugepagesMemoryUsage VER High Hugepages Memory Usage metric Ves-Software critical Hugepages Memory reached critical level. Escalate to L2.
VesArgoLowCountersAvailable Argo Low Available Counters metric Ves-Software critical Argo has low available counters on node. This may lead to service crash. Escalate to L2.
VesArgoMemoryLow Argo Memory Low metric Ves-Software major Argo is low on free memory. Increase Argo memory size.
VesArgoTooManySynPackets Argo Too Many Syn Packets metric Ves-Software critical Argo has too many syn packets on VIP. Check for traffic source and/or escalate to L2.
VesKubeCronJobRunning K8S CronJob Runs Too Long metric Ves-Software minor CronJob is taking more than 1h to complete. Create issue and let L2 solve it during working hours.
VesKubeDaemonSetMisScheduled K8S Daemonset Error metric Ves-Software minor Some Pods of DaemonSet are running where they are not supposed to run. If the problem persists for more than 30 minutes, delete old Pods. If the problem still persists, escalate to L2 immediately.
VesKubeDaemonSetNotScheduled K8S Daemonset Error metric Ves-Software minor Some Pods of DaemonSet are not scheduled. If the problem persists for more than 30 minutes, describe the DaemonSet resource and escalate to L2 immediately.
VesKubeDaemonSetRolloutStuck K8S Daemonset Error metric Ves-Software major Only part of the desired Pods of DaemonSet are scheduled and ready. If the problem persists for more than 30 minutes, describe the DaemonSet resource and escalate to L2 immediately.
VesKubeDeploymentGenerationMismatch K8S Deployment Error metric Ves-Software major Deployment generation does not match. This indicates that the Deployment has failed but has not been rolled back. If the problem persists for more than 30 minutes, delete old pods. If the problem still persists, escalate to L2 immediately.
VesKubeDeploymentReplicasMismatch K8S Deployment Error metric Ves-Software major Deployment has not matched the expected number of replicas for longer than an hour. If the problem persists for more than 30 minutes, describe the Deployment resource and escalate to L2 immediately.
VesKubeJobFailed K8S Job Failed metric Ves-Software critical Kubernetes Job failed to complete in last 2 hours. Create issue and let L2 solve it during working hours.
VesKubeLongCronJobRunning K8S Long CronJob Runs Too Long metric Ves-Software minor CronJob is taking more than 2h to complete. Create issue and let L2 solve it during working hours.
VesKubePersistentVolumeFullInFourDays K8S PVC Error metric IaaS-CaaS major Based on recent sampling, the PersistentVolumeClaim is expected to fill up within four days. TODO
VesKubePersistentVolumeSpaceLow K8S PVC Error metric IaaS-CaaS major Kubernetes PersistentVolumeClaim is getting out of space. Resize PVC or clean disk.
VesKubePodCPUThrottlingHigh K8S Pod CPU Throttled metric Infrastructure major Kubernetes Pod container is throttling its CPU limits. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent limit increase proposal.
VesKubePodCPUThrottlingHigh K8S Pod CPU Throttled metric Infrastructure critical Pod container is throttling its CPU limits. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent limit increase proposal.
VesKubePodCPUThrottlingLongTime K8S Pod CPU Throttled for Long Time metric Infrastructure major Kubernetes Pod container is throttling its CPU limits for a long time. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent CPU limit increase proposal.
VesKubePodContainerTooMuchMemory metric Ves-Software major More than 90% of allowed memory is being used by container. Increase limits in Deployment or StatefulSet definition. File an Issue with permanent limit increase proposal.
VesKubePodCrashLooping K8S Pod Crashing metric Ves-Software critical Pod container is crashing. Check the reason for pod crash. Service is crashing. Very often the reason is low memory limits. Check why the pod is crashing. If the problem persists for more than 15 minutes, or the pod is continuously crashing and it affects other services, escalate to L2. If the reason is oomkill, raise resource limits. If it is an error, contact the service's team. If it is a scheduling problem, contact the SRE team.
VesKubePodEtcdBackupCrashLooping K8S EtcdBackup Crashing metric Ves-Software major ETCD backup job is crashing. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesKubePodFluentbitCrashLooping K8S Fluentbit Crashing metric Ves-Software major Fluentbit service is crashing. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesKubePodNotReady K8S Pod Not Ready metric Ves-Software critical Pod has been in a non-ready state for longer than 10 minutes. If the problem persists for more than 30 minutes, escalate to L2 immediately.
VesKubeStatefulSetGenerationMismatch K8S StatefulSet Error metric Ves-Software major StatefulSet generation does not match. This indicates that the StatefulSet has failed but has not been rolled back. If the problem persists for more than 30 minutes, delete old pods. If the problem still persists, escalate to L2 immediately.
VesKubeStatefulSetReplicasMismatch K8S StatefulSet Error metric Ves-Software major StatefulSet has not matched the expected number of replicas for longer than 15 minutes. If the problem persists for more than 30 minutes, describe the StatefulSet resource and escalate to L2 immediately.
VesKubeStatefulSetUpdateNotRolledOut K8S StatefulSet Error metric Ves-Software major StatefulSet update has not been rolled out. If the problem persists for more than 30 minutes, delete old pods. If the problem still persists, escalate to L2 immediately.
VesSvcRecoveredPanic Service recovered from panic event Ves-Software critical A panic was encountered and recovered in execution of service. Check the service logs for details.
ViewActionError View Action Error event IaaS-CaaS major View action finished with error. Check the validity of your view variables.
VoltShareDecryptionError VoltShare Decryption Error metric VoltShare major Decrypt operation has failures. Check secret policy or admin policy.
VoltShareEncryptionError VoltShare Encryption Error metric VoltShare major Encrypt operation has failures. Check secret policy or admin policy.
WafTooManySecurityEvents Security Events metric Security major Virtual Host WAF security events detected. Consider blocking the relevant users/IPs using FastACL or Network Policy or Service Policy.
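
When working with the table above in a script, it can help to treat each row as a record and filter by group and severity. The following is a minimal sketch under that assumption. The `AlertRow` type and `at_least` helper are hypothetical names, and only a handful of rows are copied from the table for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlertRow:
    """Hypothetical record for one table row (description and action omitted)."""

    name: str
    kind: str      # "event", "metric", or "custom"
    group: str
    severity: str  # "minor", "major", or "critical"


# Ordering of the three severities used in the table.
SEVERITY_RANK = {"minor": 0, "major": 1, "critical": 2}

# A few rows copied from the table above.
ROWS = [
    AlertRow("CaptchaChallengeFailure", "event", "Security", "major"),
    AlertRow("KubeMetricsMissing", "metric", "Infrastructure", "critical"),
    AlertRow("NodeLoadHigh", "metric", "Infrastructure", "minor"),
    AlertRow("SiteTunnelInterfaceDown", "metric", "Infrastructure", "critical"),
    AlertRow("ErrorRateAnomaly", "custom", "Timeseries-Anomaly", "minor"),
]


def at_least(rows: List[AlertRow], min_severity: str,
             group: Optional[str] = None) -> List[str]:
    """Return names of alerts at or above min_severity, optionally in one group."""
    floor = SEVERITY_RANK[min_severity]
    return [
        row.name
        for row in rows
        if SEVERITY_RANK[row.severity] >= floor
        and (group is None or row.group == group)
    ]
```

For example, `at_least(ROWS, "critical")` selects the two critical Infrastructure alerts in this sample.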

TSA Severity vs Anomaly Scores

The following table lists the Time-Series Anomaly (TSA) scores and the associated alert severity for various metrics. The table also shows the absolute threshold for each metric.

Metric                Severity   Score   Absolute Threshold
Request Rate          minor      0.6     NA
Request Rate          major      1.5     50 rps
Request Rate          critical   3.0     100 rps
Request Throughput    minor      0.6     NA
Request Throughput    major      1.5     2500 kbps
Request Throughput    critical   3.0     5000 kbps
Response Throughput   minor      0.6     NA
Response Throughput   major      1.5     25000 kbps
Response Throughput   critical   3.0     50000 kbps
Response Latency      minor      0.6     NA
Response Latency      major      1.5     250 ms
Response Latency      critical   3.0     500 ms
Error Rate            minor      0.6     NA
Error Rate            major      1.5     5 erps
Error Rate            critical   3.0     10 erps

Note: For more information on the Volterra TSA, see the Time-Series Anomaly Detection guide.
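
The severity table above can be read as a lookup from anomaly score and metric value to severity. The following sketch encodes the table values and applies one possible interpretation: a severity applies when the score reaches the listed score and the metric value reaches the absolute threshold, where one is given. That combination rule is an assumption, since this document does not state how the two columns interact, and `tsa_severity` is a hypothetical helper name.

```python
from typing import Optional

# Per metric: (severity, minimum score, absolute threshold or None for "NA"),
# ordered from most to least severe. Values are copied from the table.
# Units: rps, kbps, kbps, ms, erps respectively.
TSA_LEVELS = {
    "Request Rate": [("critical", 3.0, 100), ("major", 1.5, 50), ("minor", 0.6, None)],
    "Request Throughput": [("critical", 3.0, 5000), ("major", 1.5, 2500), ("minor", 0.6, None)],
    "Response Throughput": [("critical", 3.0, 50000), ("major", 1.5, 25000), ("minor", 0.6, None)],
    "Response Latency": [("critical", 3.0, 500), ("major", 1.5, 250), ("minor", 0.6, None)],
    "Error Rate": [("critical", 3.0, 10), ("major", 1.5, 5), ("minor", 0.6, None)],
}


def tsa_severity(metric: str, score: float, value: float) -> Optional[str]:
    """Return the highest severity whose score and absolute threshold are both met.

    Assumption: both conditions must hold. Returns None if no level applies.
    """
    for severity, min_score, abs_threshold in TSA_LEVELS[metric]:
        if score >= min_score and (abs_threshold is None or value >= abs_threshold):
            return severity
    return None
```

Under this reading, a Request Rate anomaly with a score of 3.2 at only 60 rps would be classified as major rather than critical, because the 100 rps absolute threshold is not met.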