Oshyn DevOps: Metrics & Alerts in a Production AWS Sitecore Solution

A major tenet of DevOps is to address small issues before they become big issues. An important method for discovering issues when they’re small is to set up monitoring with alerts. At Oshyn DevOps, we make sure that we have alerts on every piece of the Sitecore solution that can keep us aware of the health of the system and notify us of any changes that require investigation.

In the production monitoring space, the first two things we tend to look at for Sitecore environments are:

  1. Metrics - The data generated by the different cloud resources, such as disk space, CPU, and memory on VMs; throughput on a WAF (web application firewall); database connections; and CPU and connections on a distributed cache. These metrics number in the tens of thousands for any reasonably sized, highly available Sitecore solution.
  2. Alerts - We create alerts by taking the above metrics and putting thresholds on them. When a threshold is crossed, the appropriate staff are notified so the problem can be remediated before it affects the site.
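
To make the idea of "a metric plus a threshold" concrete, here is a minimal Python sketch that assembles the parameters for one CloudWatch alarm, using the RDS CPU sample threshold (greater than 85%) listed later in this article. The alarm name, the database identifier, and the 5-minute period are illustrative assumptions, not values from a real environment. The dictionary keys follow the parameters of boto3's `put_metric_alarm` call; the sketch only builds the request and does not contact AWS.

```python
def build_alarm(name, namespace, metric, dimensions, threshold, comparison,
                period_s=300, evaluation_periods=1, statistic="Average",
                sns_topic_arn=None):
    """Return keyword arguments describing one CloudWatch metric alarm."""
    params = {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        # CloudWatch expects dimensions as a list of Name/Value pairs.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": statistic,
        "Period": period_s,                    # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": comparison,
    }
    if sns_topic_arn:
        # Optional: where notifications go when the alarm fires.
        params["AlarmActions"] = [sns_topic_arn]
    return params


# Hypothetical names; only the 85% threshold comes from the article.
rds_cpu_alarm = build_alarm(
    name="sitecore-prod-rds-cpu-high",
    namespace="AWS/RDS",
    metric="CPUUtilization",
    dimensions={"DBInstanceIdentifier": "sitecore-prod-db"},
    threshold=85,
    comparison="GreaterThanThreshold",
)

# With AWS credentials configured, the dictionary could be submitted as:
#   boto3.client("cloudwatch").put_metric_alarm(**rds_cpu_alarm)
```

Keeping the alarm definition as plain data makes it easy to review thresholds and to generate one alarm per resource from a single template.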

Understanding how these alerting mechanisms work in your cloud hosting provider is crucial to the overall uptime of your Sitecore solution. This is one of the reasons Oshyn currently only supports DevOps for Azure and AWS. This article focuses on our highly available Sitecore production monitoring in AWS, but the same approach applies in Azure, using the Azure toolsets.

For a typical highly available production Sitecore implementation on AWS, Oshyn’s DevOps teams receive a minimum of 50 different alerts, and typically over 100. A non-exhaustive list is:

Group / Metric                                    Sample Threshold

AWS/EC2 (for each VM)
  Win Memory Available                            Less than 2GB
  Win LogicalDisk % Free Space                    Less than 20%
  Win Web Service Connection Attempts/sec         Greater than 500
  Win Web Service Get Requests/sec                Greater than 200
  Win Web Service Current Connections             Greater than 1,000
  Win Web Service Current Anonymous Users         Greater than 5,000
  Linux DISK_FREE                                 Less than 10GB
  Linux mem_used_percent                          Greater than 75%
  Linux swap_used_percent                         Greater than 75%

AWS/RDS (for each RDS instance)
  CPU Utilization                                 Greater than 85%
  Database Connections                            Greater than 1,500
  Freeable Memory                                 Less than 1GB

AWS/ApplicationELB (for each load balancer)
  Request Count / Second                          Greater than 1,000
  Healthy Host Count                              Less than 2

AWS/ElastiCache (for each cache cluster)
  CurrConnections                                 Greater than 100
  Freeable Memory                                 Less than 1GB
  CacheHits (per second)                          Greater than 500

Log File Metrics
  Errors in all Sitecore Instance Logs            More than 5 errors within 5 minutes
  Errors in Identity Server Log                   More than 5 errors within 5 minutes
  Errors in XConnect Service Log                  More than 5 errors within 5 minutes
  Errors in XConnect Automation Engine Log        More than 5 errors within 5 minutes
  Errors in XConnect Index Worker Log             More than 5 errors within 5 minutes
  Errors in XConnect Processing Engine Log        More than 5 errors within 5 minutes

External Sensors (for each website)
  HTTPS Certificate Validity                      Up/Down
  HTTP/HTTPS Connectivity for External Users      Up/Down
  HTTP/HTTPS Connectivity for Content Editors     Up/Down
  HTTP/HTTPS Advanced Sensor for Search Results   Up/Down
  HTTP/HTTPS Advanced Sensor for Contact Us       Up/Down
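
The log file rules above ("more than 5 errors within 5 minutes") amount to counting events in a sliding time window. In CloudWatch, this kind of rule is commonly expressed as a log metric filter feeding an alarm; the Python sketch below only illustrates the counting logic itself. The class name and its defaults are hypothetical, and it assumes error timestamps arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorRateWindow:
    """Track error timestamps and report when the window limit is exceeded."""

    def __init__(self, max_errors=5, window=timedelta(minutes=5)):
        self.max_errors = max_errors
        self.window = window
        self._timestamps = deque()

    def record(self, ts):
        """Record one error at time `ts`; return True if an alert should fire."""
        self._timestamps.append(ts)
        cutoff = ts - self.window
        # Drop errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        # "More than 5 errors" means the sixth error inside the window fires.
        return len(self._timestamps) > self.max_errors


window = ErrorRateWindow()
start = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary example time

# Six errors, ten seconds apart: only the sixth crosses the threshold.
results = [window.record(start + timedelta(seconds=10 * i)) for i in range(6)]

# A single error fifteen minutes later lands in an otherwise empty window.
later = window.record(start + timedelta(minutes=15))
```

Here `results` ends with the only `True` value, and `later` is `False` because the earlier errors have aged out.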

It’s important to note that the thresholds are adjusted based on the overall size of the resource and the utilization of the site we are monitoring. Having less than 1GB free memory is an emergency if the resource has 16GB, but isn’t if the resource has 2GB. Similarly, what constitutes an event that warrants further investigation also depends on the site we are monitoring. A spike in current connections greater than 1,000 for one website may be alarming but not for another. The only way to know is to monitor the site for some time prior to setting the thresholds. Using the CloudWatch metric tools, you can see over time what constitutes normal and can therefore set your thresholds to detect anything abnormal before it evolves into an outage event.
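
As a rough illustration of both points, the Python sketch below scales a free-memory threshold to the size of the resource and derives a connection threshold from observed history. The 10% free-memory fraction, the "mean plus three standard deviations" rule, and the sample numbers are assumptions chosen for the example, not Oshyn's actual method or real measurements.

```python
import statistics


def free_memory_threshold_gb(total_gb, min_free_fraction=0.10):
    """Alert when free memory drops below a fraction of total memory."""
    return total_gb * min_free_fraction


def baseline_threshold(samples, sigmas=3.0):
    """Derive an upper threshold from history: mean plus N standard deviations."""
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    return mean + sigmas * spread


# 1GB free is below the limit on a 16GB resource, but well above it on 2GB.
limit_16gb = free_memory_threshold_gb(16)
limit_2gb = free_memory_threshold_gb(2)
alert_on_16gb = 1.0 < limit_16gb
alert_on_2gb = 1.0 < limit_2gb

# Hypothetical samples of "current connections" collected over a quiet period.
observed_connections = [400, 420, 380, 410, 390]
connection_limit = baseline_threshold(observed_connections)
```

With these made-up samples the connection threshold lands a little above 440, so a spike to 1,000 would stand out clearly, while a busier site with a higher baseline would get a correspondingly higher limit.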

Figure: CloudWatch metrics over time

After the alerts are configured in CloudWatch, they look like this:

Figure: Alerts configuration in CloudWatch

And they are replicated into Oshyn’s monitoring system, based on PRTG, which looks like this:

Figure: Monitoring system

And like this on mobile:

Figure: Mobile monitoring

When these alerts are triggered, emails are sent, messages are posted to a Slack channel, and push notifications are sent to apps running on mobile devices. This is how Oshyn DevOps is able to identify and remediate small issues before they become outages that cost your business sales and damage its reputation in the marketplace.