How to find the cause of CPU spikes
Question
When we get AWS CloudWatch alerts about high CPU usage on a server (shard/cluster), what is the best way to find the root cause in FusionReactor?
- Check whether the issue is still ongoing or has returned to normal.
  - If the alert is from an earlier time (e.g., overnight), note the time and go to that timeframe in the Cloud UI (use the time selector in the top bar).
- Check the Metrics screen:
  - Look at the number of transactions running at that time.
  - Review their completion time and CPU usage to identify when the spike started.
- View previous timeframes:
  - On the Web Requests, Transactions, or JDBC screens, go to ‘Saved in Cloud’ to view historical data.
- Investigate requests during the spike:
  - Identify requests with increased completion time and investigate the cause.
  - Open slow or erroring requests and check their traces and sub-transactions to determine where the extra time was spent.
- Check transaction types:
  - Determine if many larger scheduled tasks were executing simultaneously.
- Analyze the Applications screen:
  - Ensure you are looking at the correct timeframe.
  - Check performance graphs for anomalies or unexpected increases in throughput.
- Review Logging for the timeframe in question:
  - Look for unusual behavior.
  - Check the client IP addresses sending requests to see whether a bot is making repeated requests.
- Explore Traces for affected Services:
  - Run a query for the specific timeframe.
  - Focus on services with the longest durations to identify the root cause.
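
If you have exported the CPU samples and request logs for the period, two of the steps above (pinning down when the spike started, then checking which client IPs were busiest in that window) can be sketched offline. This is only an illustration, not a FusionReactor or CloudWatch feature: the function names, the 80% threshold, and the log line layout (`<ISO timestamp> <client IP> <method> <path>`) are all assumptions made for the example, and the sample data is invented.

```python
from collections import Counter
from datetime import datetime


def find_spike_window(samples, threshold=80.0):
    """Return (start, end) of the first contiguous run of samples at or
    above `threshold`, or None if no sample reaches it.

    `samples` is a time-ordered list of (timestamp, cpu_percent) pairs.
    The 80% default is an assumed alert threshold, not a product default.
    """
    start = end = None
    for ts, cpu in samples:
        if cpu >= threshold:
            if start is None:
                start = ts
            end = ts
        elif start is not None:
            break  # the first spike has ended
    return (start, end) if start is not None else None


def top_client_ips(log_lines, start, end, limit=3):
    """Count requests per client IP for lines timestamped within [start, end].

    Assumes a hypothetical log layout: '<ISO timestamp> <client IP> ...'.
    Lines that do not match the layout are skipped.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            ts = datetime.fromisoformat(parts[0])
        except ValueError:
            continue
        if start <= ts <= end:
            counts[parts[1]] += 1
    return counts.most_common(limit)


# Invented sample data for illustration only.
samples = [
    (datetime(2024, 5, 1, 2, 0), 35.0),
    (datetime(2024, 5, 1, 2, 5), 91.0),
    (datetime(2024, 5, 1, 2, 10), 95.5),
    (datetime(2024, 5, 1, 2, 15), 40.0),
]
log_lines = [
    "2024-05-01T02:04:59 198.51.100.4 GET /home",
    "2024-05-01T02:05:10 203.0.113.7 GET /search",
    "2024-05-01T02:06:30 203.0.113.7 GET /search",
    "2024-05-01T02:07:45 198.51.100.4 GET /home",
    "2024-05-01T02:09:00 203.0.113.7 GET /search",
    "2024-05-01T02:20:00 203.0.113.7 GET /search",
]

window = find_spike_window(samples)
top_ips = top_client_ips(log_lines, window[0], window[1])
print("Spike window:", window[0], "to", window[1])
print("Busiest client IPs:", top_ips)
```

A single IP that dominates the count during the spike window is a candidate for bot traffic, but it is only a hint: confirm it against the request details and traces in FusionReactor before drawing a conclusion.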