1. CIRCUIT – An Adobe Developer Event
Presented by ICF Interactive
Monitoring AEM - Going
above and beyond CPU,
Disk, and Memory
Michael Chan
ICFI Interactive
2. Introduction
Who Am I
• Michael Chan, Systems Engineer & Architect for ICFI
Interactive Managed Services
• Former Java & C developer
• With past experience in
– Unix security
– Network monitoring
– Systems (network, storage, server) integration
– Ecommerce
• Primary responsibilities at ICFI (among others)
– Build out systems infrastructure, including systems
automation, logging, and monitoring
– Enable engineers to quickly assess and respond to
systems issues
3. Purpose of session
Session will cover:
• Introduce systems monitoring concepts
• Provide practical ideas and examples on how to
monitor your website and AEM stack
• Use data to make correlations for root-cause
analysis
Session will not cover:
• Which monitoring software to use
• How to implement x or y feature in your monitoring
software
• What alerting strategies you should use
4. Goals of systems monitoring
• Maintain site availability
– Can users access the site?
• Identify performance issues
– Are users waiting too long?
• Troubleshoot problems
– How do I identify root cause?
• Identify long-term trends
– Is the application slowing down?
– Do we need faster hardware?
5. Monitoring tools out there (not exhaustive)
Open Source (free!)
• Nagios
• Icinga
• Zabbix
SaaS
• Application-performance focused
– AppDynamics
– New Relic
• Boundary
• Datadog
6. Monitoring software considerations
What I have found most important
• Easy to use
– Has a convenient GUI
– Easy to add servers, applications
• Easy to view and interpret data
– Need to be able to view data and quickly make correlations
• Extensible
– Easy to customize, e.g. monitor Publisher listening on port 4506 instead
of 4503
– Support for plugins and especially custom scripts, necessary for
application-specific monitoring
• Other considerations
– Can the setup configs be version controlled in Git?
– Is there an API for the monitoring system, to create/modify configs?
Tip: everyone’s needs are different; use what makes sense for you!
7. Basic monitoring – CPU, network, disk
Good questions to ask when monitoring these
• CPU Load Average
– What percentage of CPU is the application utilizing?
– Is there surplus CPU capacity left?
• Network Statistics
– e.g. Bytes in/out, Packets in/out
– How much traffic are our servers receiving?
– Do any network spikes correlate with slower application performance?
• Disk (IOPS, throughput)
– How much is the application utilizing the disk?
– Is the application hitting any Disk I/O thresholds?
Tip: benchmark your Network and Disk I/O thresholds to discover your
hardware limitations.
Note: AEM may hit CPU limits well before CPU utilization reaches 100%. Threads
often wait on another thread’s operation to complete, and until that thread
finishes, the rest are waiting or blocked. Slowness can therefore begin at
just 50%-75% CPU utilization.
9. Simple web monitoring – Apache performance stats
Apache, mod_status module
• Provides performance statistics
• Note: the status path, e.g. /server-status, should not be reachable from the public internet
root@Client Prod CQ Disp 1a i-a678d2db:~# curl -s http://localhost/server-status | html2text | more
****** Apache Server Status for localhost ******
Server Version: Apache/2.2.15 (Unix) Communique/4.1.2 mod_ssl/2.2.15 OpenSSL/1.0.1e-fips
Server Built: Jul 18 2014 02:31:29
====================================================================
Current Time: Wednesday, 29-Jul-2015 01:37:00 GMT
Restart Time: Sunday, 26-Jul-2015 03:39:30 GMT
Parent Server Generation: 4
Server uptime: 2 days 21 hours 57 minutes 30 seconds
Total accesses: 3430869 - Total Traffic: 114.6 GB
CPU Usage: u43.79 s19.41 cu0 cs0 - .0251% CPU load
13.6 requests/sec - 477.1 kB/second - 35.0 kB/request
41 requests currently being processed, 21 idle workers
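The same statistics can be collected by a script rather than read by eye. The sketch below assumes the machine-readable variant of the status page (`/server-status?auto`, which returns plain `Key: value` lines such as `BusyWorkers` and `IdleWorkers`) is enabled; the function names are illustrative, not part of any monitoring product.

```python
from urllib.request import urlopen


def parse_status_auto(text):
    """Parse the 'Key: value' lines of mod_status ?auto output into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a Key: value line
        value = value.strip()
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            stats[key.strip()] = value  # non-numeric, e.g. the Scoreboard string
    return stats


def worker_utilization(stats):
    """Fraction of Apache workers that are busy (0.0 to 1.0)."""
    busy = stats["BusyWorkers"]
    idle = stats["IdleWorkers"]
    return busy / (busy + idle)


def fetch_status(url="http://localhost/server-status?auto", timeout=5):
    """Fetch and parse the status page; the URL is an assumption."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_status_auto(resp.read().decode("utf-8", "replace"))
```

With the figures shown above (41 busy, 21 idle workers), `worker_utilization` would report roughly 0.66; graphing that value over time makes worker exhaustion visible before requests start queueing.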
10. Web monitoring – STM / RUM – nice to have
Synthetic Transaction Monitoring
• (also known as active monitoring) is website monitoring that is
done using a web browser emulation or scripted recordings of
web transactions.
• Examples
– Selenium
– Neustar
– Keynote
• Advantages
– Repeatable process
• e.g. can ensure that the process of “login, add product to shopping cart, checkout”
works between code releases
– Can be used as a control
– Cheap
• Disadvantages
– Monitors only what you decided to test against
– Not as thorough as RUM
11. Web monitoring – RUM / STM – nice to have
Real User Monitoring
• (RUM) is a passive monitoring technology that records all user
interaction with a website or client interacting with a server or
cloud-based application.
• Examples
– Google Analytics
– New Relic
– Keynote
– Many, many more
• Advantages
– Real-user “testing” data
– Monitoring for issues as they occur
– Identifies browser-related issues
• Disadvantages
– Expensive
– Too much information (information overload)
12. Adobe WEM monitoring – basic checks for Author, Publisher
Ports to monitor (are they accessible)?
• Author – 4502
• Publisher – 4503
Suggested pages to monitor
• Sling login page - /system/sling/cqform/defaultlogin.html
– Should always work!
– Response times almost always the same
– If the Sling login page is up but, for example, the homepage is not, that can indicate a
content or code-related issue
curl -s http://localhost:4503/system/sling/cqform/defaultlogin.html | grep QUICKSTART_HOMEPAGE
<!-- QUICKSTART_HOMEPAGE - (string used for readyness detection, do not remove) -->
• Homepage, important landing pages
– If Publisher hosts multiple farms & host-specific sling mappings are used, you may
need to pass host-header:
curl -H "Host: www.citytechinc.com" http://localhost:4503/us/en.html
– Above example is another reason why a customizable monitoring solution is needed
• Nagios has a check_http plugin that supports sending a Host header with requests
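For monitoring systems that accept custom scripts, the two curl checks above can be expressed as one small check. This is a minimal sketch using only the Python standard library; `check_page` and `evaluate_response` are hypothetical names, and the marker string and Host header are passed in by the caller.

```python
import time
from urllib.request import Request, urlopen


def evaluate_response(status, body, marker=None):
    """Return (ok, detail) for an already-fetched response."""
    if status != 200:
        return False, "unexpected HTTP status %d" % status
    if marker is not None and marker not in body:
        return False, "marker %r not found in body" % marker
    return True, "OK"


def check_page(url, host=None, marker=None, timeout=10):
    """Fetch url, optionally sending a Host header, and evaluate it.

    Returns (ok, detail, elapsed_seconds)."""
    headers = {"Host": host} if host else {}
    started = time.monotonic()
    try:
        with urlopen(Request(url, headers=headers), timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", "replace")
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        return False, "request failed: %s" % exc, time.monotonic() - started
    ok, detail = evaluate_response(status, body, marker)
    return ok, detail, time.monotonic() - started
```

For example, `check_page("http://localhost:4503/system/sling/cqform/defaultlogin.html", marker="QUICKSTART_HOMEPAGE")` mirrors the login-page check, and `check_page("http://localhost:4503/us/en.html", host="www.citytechinc.com")` mirrors the host-header check. Recording the elapsed time as well as the pass/fail result gives a response-time baseline for each page.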
13. Adobe WEM monitoring – error.log, critical errors
Files to monitor
• error.log, keywords (AEM 5.5, 5.6, although some may still be
applicable to 6.x)
– critical errors
• OutOfMemoryMonitor CQ shutting down
• StackOverflowError
• Maximum threads reached
• Java OutOfMemoryErrors, e.g.
– java.lang.OutOfMemoryError: unable to create new native thread
• too many open files
– Non-critical errors (error count is useful)
• RecursionTooDeepException
• Failed to mmap tar file / java.lang.OutOfMemoryError: Map failed
14. Adobe WEM monitoring – error.log, repository related
Files to monitor
• error.log, repository-related keywords
– critical errors
• tar files read-only
– Non-critical errors (error count is useful, with alarm set when
threshold is exceeded)
• failed to retrieve state of(.+)node
• failed to retrieve state of intermediary node
• Failed to read bundle
• Repository error during page import
• Unable to create version
• lucene(.+)Unknown(.+)node
• lucene(.+)query result node
Tip: When encountering important repository errors, make sure to
update your monitoring software to detect them!
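The error.log keywords from the last two slides can be turned into a counting script. The sketch below is one possible approach, not a definitive rule set: the pattern lists are copied from the slides, matching is case-insensitive (an assumption made for leniency), each line is counted once against the first pattern it matches, and the "Map failed" variant of OutOfMemoryError is deliberately excluded from the critical list because the slides class it as non-critical.

```python
import re
from collections import Counter

CRITICAL = [
    r"OutOfMemoryMonitor CQ shutting down",
    r"StackOverflowError",
    r"Maximum threads reached",
    r"java\.lang\.OutOfMemoryError(?!: Map failed)",
    r"too many open files",
    r"tar files read-only",
]
NON_CRITICAL = [
    r"RecursionTooDeepException",
    r"Failed to mmap tar file",
    r"java\.lang\.OutOfMemoryError: Map failed",
    r"failed to retrieve state of intermediary node",
    r"failed to retrieve state of(.+)node",
    r"Failed to read bundle",
    r"Repository error during page import",
    r"Unable to create version",
    r"lucene(.+)Unknown(.+)node",
    r"lucene(.+)query result node",
]


def _compile(patterns):
    return [(p, re.compile(p, re.IGNORECASE)) for p in patterns]


def scan(lines):
    """Return (critical, non_critical) Counters keyed by pattern."""
    critical, non_critical = Counter(), Counter()
    groups = [(critical, _compile(CRITICAL)), (non_critical, _compile(NON_CRITICAL))]
    for line in lines:
        for counter, compiled in groups:
            hit = next((p for p, rx in compiled if rx.search(line)), None)
            if hit is not None:
                counter[hit] += 1
                break  # count each line once, critical patterns first
    return critical, non_critical
```

A monitoring check might alert as soon as the critical counter is non-empty, and alert on the non-critical counter only when its total exceeds a threshold for the interval being scanned.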
18. Adobe WEM monitoring – access.log, cont.
Files to monitor
• access.log
– Cache-busting requests
• Contains query strings, e.g.
– http://www.citytechinc.com/us/en.html?hi=test
• Extensionless, e.g.
– GET /athletes/athletes.34360.html/career
– Extensions
• .js, .css
• Images - .bmp, .jpg, .jpeg, .png
Tip: calculate the percentage of cache-busting requests
over time as a baseline to compare against.
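A baseline percentage like the one in the tip can be computed directly from access.log. The following sketch assumes each log line contains a quoted request line of the form `"GET /path HTTP/1.1"`, and it treats a request as cache-busting when it carries a query string or when the last path segment has no extension; both rules are simplifying assumptions drawn from the examples above.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches the quoted request line, e.g. "GET /us/en.html HTTP/1.1"
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def is_cache_busting(target):
    """True if the request target has a query string or no extension."""
    parts = urlsplit(target)
    if parts.query:
        return True
    last_segment = posixpath.basename(parts.path)
    # A path ending in "/" has an empty last segment and also counts.
    return "." not in last_segment


def cache_busting_percentage(lines):
    """Percentage of parsable requests that are cache-busting."""
    total = busting = 0
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # ignore lines without a recognizable request
        total += 1
        if is_cache_busting(match.group(1)):
            busting += 1
    return 100.0 * busting / total if total else 0.0
```

Running this per hour or per day and storing the result gives the baseline; a sudden rise then stands out as a likely source of extra Publisher load.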
24. Adobe WEM monitoring – JCR queries
------------- 07/26/2015 01:44:33 +0000 org.archive.jmx.Client SlowQueries: --------------
creationTime: Sun Jul 26 01:40:06 GMT 2015 duration: 2788ms language: xpath occurrenceCount: 1 position: 1
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/shifters' and jcr:content/@cq:tags = 'trek-americas:product/shifters/nonLocking' and jcr:content/@cq:tags = 'trek-americas:brand/trek'))]
creationTime: Sun Jul 26 01:36:34 GMT 2015 duration: 1766ms language: xpath occurrenceCount: 8729 position: 2
statement: /jcr:root/var/eventing/jobs//element(*, slingevent:Job)[jcr:contains(., '/com/day/cq/replication/job') and not(@slingevent:finished)]
creationTime: Sun Jul 26 01:40:33 GMT 2015 duration: 809ms language: xpath occurrenceCount: 1 position: 3
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/shifters' and jcr:content/@cq:tags = 'trek-americas:product/shifters/nonLocking' and jcr:content/@cq:tags = 'trek-americas:brand/trek'))]
creationTime: Sun Jul 26 01:41:15 GMT 2015 duration: 790ms language: xpath occurrenceCount: 1 position: 4
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:brand/trek' and jcr:content/@cq:tags = 'trek-americas:product/ulocks' and jcr:content/@cq:tags = 'trek-americas:product/ulocks/titanium'))]
creationTime: Sun Jul 26 01:40:05 GMT 2015 duration: 782ms language: xpath occurrenceCount: 1 position: 5
statement: /jcr:root/content/trek-us/en/home/products//element(*, cq:Page)[jcr:contains(., '*') and (jcr:content/@cq:template = '/apps/trek-americas/templates/productDetail-page') and ((jcr:content/@cq:tags = 'trek-americas:product/levers' and jcr:content/@cq:tags = 'trek-americas:electric/zwave' and jcr:content/@cq:tags = 'trek-americas:product/ulocks/titanium'))] order by jcr:content/content-par/productdetail/@releasedate descending
Tip: By default, the slow query statistics cover all queries since AEM startup. However, the counters
can be reset if you want, for example, 10-minute “summaries” of the slowest queries.
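Slow-query output in this format can be reduced to numbers a monitoring system can graph. The sketch below assumes the output has been captured as text with each record header (`creationTime ... duration ... occurrenceCount ...`) and each `statement:` on a single line; the function name and dictionary keys are illustrative.

```python
import re

HEADER_RE = re.compile(
    r"duration: (\d+)ms language: (\w+) occurrenceCount: (\d+)"
)


def parse_slow_queries(text):
    """Return one dict per slow-query record found in text."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.search(line)
        if header:
            records.append({
                "duration_ms": int(header.group(1)),
                "language": header.group(2),
                "occurrences": int(header.group(3)),
                "statement": "",
            })
        elif records and line.startswith("statement:"):
            # Attach the statement to the most recent header.
            records[-1]["statement"] = line[len("statement:"):].strip()
    return records


def most_frequent(records, limit=3):
    """Records sorted by how often the query ran, most frequent first."""
    return sorted(records, key=lambda r: r["occurrences"], reverse=True)[:limit]
```

Sorting by occurrence count rather than duration surfaces a different kind of problem: in the sample above, the replication-job query is only the second slowest but ran 8729 times.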
25. Adobe WEM monitoring – misc.
Other possible things to monitor
• Running workflows
• Bundle status - installed, active
• Replication queues - total, blocked
Data for all of the above can be retrieved via curl!
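As one illustration of the bundle-status check, the sketch below summarizes a JSON bundle listing that has already been fetched with curl. It assumes the listing comes from the Felix web console (for example `/system/console/bundles.json`) and that it contains a `data` array whose entries have `symbolicName` and `state` fields; verify both the path and the field names against your own instance before relying on them.

```python
import json

# States treated as healthy in this sketch; fragment bundles never become Active.
HEALTHY_STATES = {"Active", "Fragment"}


def unhealthy_bundles(payload):
    """Return (symbolicName, state) for every bundle not in a healthy state.

    payload is the JSON text of the bundle listing (assumed shape)."""
    doc = json.loads(payload)
    return [
        (bundle.get("symbolicName"), bundle.get("state"))
        for bundle in doc.get("data", [])
        if bundle.get("state") not in HEALTHY_STATES
    ]
```

An empty result means every bundle is active; anything else is worth alerting on, since a single bundle stuck in Installed or Resolved can take down parts of the application.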
26. JVM monitoring – heap usage
Heap usage
• Useful for viewing AEM memory usage and GC issues
• Can be obtained via JMX
– Example using free cmdline-jmxclient.jar tool:
# java -jar /usr/local/bin/cmdline-jmxclient.jar - i-d4bb64dd.ct-prod.ctmsp.com:12345 'java.lang:name=PS Old Gen,type=MemoryPool' Usage
07/26/2015 20:12:20 +0000 org.archive.jmx.Client Usage:
committed: 4462215168
init: 894828544
max: 14316601344
used: 4158743792
• Also viewable via jmap command
# jmap -heap 31470
Attaching to process ID 31470, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.5-b03
using thread-local object allocation.
Parallel GC with 1 thread(s)
- additional output trimmed -
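The raw byte counts in the JMX Usage output are easier to alert on as a percentage. This is a minimal sketch that parses the `committed`/`init`/`max`/`used` lines from captured text; the function names are hypothetical, and it falls back to `committed` when `max` is reported as undefined (-1), which some memory pools do.

```python
import re

USAGE_RE = re.compile(r"^\s*(committed|init|max|used):\s*(-?\d+)\s*$", re.MULTILINE)


def parse_usage(text):
    """Extract the four MemoryUsage byte counts from jmxclient output."""
    return {name: int(value) for name, value in USAGE_RE.findall(text)}


def used_percent(usage):
    """Used memory as a percentage of max (or of committed if max is undefined)."""
    limit = usage["max"] if usage.get("max", -1) > 0 else usage["committed"]
    return 100.0 * usage["used"] / limit
```

For the Old Gen sample above this works out to roughly 29% of the maximum. The trend matters more than any single value: old-generation usage that keeps climbing after full GCs is the pattern to watch for.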
29. JVM Monitoring – GC pause times
Why monitor JVM pause times?
• These are “stop-the-world” events during which the application is unresponsive while the
JVM collects garbage
• Sometimes garbage collection fails to free memory, so GCs run constantly – this
incurs serious CPU usage
• Pauses should be monitored because they directly hurt performance
How to monitor?
• Pause times can be added to stdout via JVM options
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
2015-07-27T18:50:30.212+0000: [Full GC [PSYoungGen: 98121K->0K(6107264K)] [ParOldGen: 6144935K->1561525K(6291456K)] 6243056K->1561525K(12398720K) [PSPermGen: 193509K->193465K(193600K)], 5.7558230 secs] [Times: user=22.98 sys=0.00, real=5.75 secs]
2015-07-27T18:50:42.432+0000: [GC [PSYoungGen: 5916288K->81734K(5998080K)] 7477813K->1643259K(12289536K), 0.1018320 secs] [Times: user=0.52 sys=0.00, real=0.10 secs]
• Pause times also can be added via:
-XX:+PrintGCApplicationStoppedTime
Total time for which application threads were stopped: 0.0001780 seconds
Total time for which application threads were stopped: 0.0001920 seconds
Tip: Even if you don’t have time to enable monitoring via JMX, at least print GC output
to a log file for later analysis when AEM is slowing down!
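Once GC output is going to a log file, pause times can be pulled out with a short script. The sketch below assumes log lines in the format shown above, with one GC event per line as the JVM writes them; it reads the wall-clock `real=` value from each event and, separately, the "application threads were stopped" lines. The function names and the 1-second threshold are illustrative choices.

```python
import re

GC_RE = re.compile(r"\[(Full GC|GC)\b.*?real=(\d+\.\d+) secs\]")
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: (\d+\.\d+) seconds"
)


def gc_pauses(lines):
    """Return (kind, real_seconds) for each GC event line."""
    pauses = []
    for line in lines:
        match = GC_RE.search(line)
        if match:
            pauses.append((match.group(1), float(match.group(2))))
    return pauses


def stopped_times(lines):
    """Return the stopped-time values, in seconds, in log order."""
    return [float(m.group(1)) for m in map(STOPPED_RE.search, lines) if m]


def long_pauses(lines, threshold_seconds=1.0):
    """GC events whose wall-clock pause met or exceeded the threshold."""
    return [p for p in gc_pauses(lines) if p[1] >= threshold_seconds]
```

Applied to the two sample events above, only the 5.75-second Full GC would be reported as a long pause; the 0.10-second young-generation collection would not.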
31. Summary
• Monitor all homepage & landing pages, for
all individual Publishers and Dispatchers
• Use AEM logs and tools to provide info on
AEM status and performance (access/error/request
logs, rlog.jar, thread status, slow queries page),
and customize your monitoring to record this data
• Use JMX and verbose GC logging to record
JVM memory heap usage, and GC pause
times