Published on February 3, 2014
Pointers to useful KB articles when dealing with CPU bottleneck issues

Within Data ONTAP, work can be classified as high or low priority, and some low-priority work is non-essential or opportunistic. As system load increases, kernel optimizations are also likely to produce non-linear scaling (that is, at a higher load level the system processes work more efficiently than at a lower load level). As a result, it can appear that a CPU bottleneck has been reached when in fact the system could do more client work. Deeper analysis can determine the type of work, but it is non-trivial and outside the scope of this KB article.

A general strategy for analyzing bottlenecks is to use both service metrics (protocol/volume/LUN latency) and component metrics (CPU, disk I/O, network I/O) to get a holistic view of the system and reduce the chance of drawing a false conclusion.

High CPU does not always mean there is a problem on the filer. Before drawing any conclusion, it helps to understand how the CPUs on a NetApp filer are designed to function. If you have a NetApp Support account, take a look at this excellent KB article on the CPUs inside a NetApp filer:
  https://kb.netapp.com/support/index?page=content&id=3010150

Block reclamation scanners cause kahuna bottleneck.
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=210480
What is the 'wafl scan status' command?
  https://kb.netapp.com/support/index?page=content&id=3011346
How does Data ONTAP make use of multiple CPUs?
  https://kb.netapp.com/support/index?page=content&id=3010150
What causes high CPU during disk scrub although raid.scrub.perf_impact is set to low?
  https://kb.netapp.com/support/index?page=content&id=3011323
Data ONTAP 8: sysstat shows high CPU utilization on multiple processor system
  https://kb.netapp.com/support/index?page=content&id=2013653
How does Data ONTAP schedule work across multiple physical CPUs?
  https://kb.netapp.com/support/index?page=content&id=3010118
If the filer acts as a SnapMirror destination, it runs the deswizzler after a SnapMirror update, which can cause high CPU usage. What is the deswizzler, or deswizzling?
  https://kb.netapp.com/support/index?page=content&actp=LIST&id=3011866
You can monitor the deswizzler work with the command wafl scan status:
  https://kb.netapp.com/support/index?page=content&id=3011346
How to find out if there is a performance issue when CPU utilization is high on a storage system?
  https://kb.netapp.com/support/index?page=content&id=3011266
Using the stats command interactively in repeat mode
  https://library.netapp.com/ecmdocs/ECMM1278265/html/sysadmin/monitoring/task/t_oc_mntr_sysinfo-stats-command-repeat-mode.html
High CPU usage or a panic might occur on a NetApp storage system running Data ONTAP 8.0.2 with the string: process on cpu0 hung (pmcsas_asyncd_0) for 5002 milliseconds!
  https://kb.netapp.com/support/index?page=content&id=2017021
CPU hung panic during early boot up
  https://kb.netapp.com/support/index?page=content&id=2014013
Diagnosing NetApp CPU Issues – Kahuna Bottlenecks
  http://dosysadminsdream.wordpress.com/2013/01/24/diagnosing-netapp-cpu-issues-kahunabottlenecks/

FACT: "High CPU on a storage controller does not always mean a CPU bottleneck or a performance problem. In Data ONTAP, high CPU means only that the system is doing a lot of work. If the storage controller is not busy with user protocol workload, it is doing background work such as deswizzling or disk scrubbing. But if user workload is introduced into this system, Data ONTAP is able to throttle this scanner work down in order to dedicate the CPU to the user workload."

FACT: "During disk scrubbing, the system checks the disk blocks of all disks for media errors and parity consistency. If Data ONTAP finds media errors or inconsistencies, it fixes them by reconstructing the data from the other disks and rewriting it, and that is why you see high CPU load at that time. To minimise the performance impact, you can schedule the disk scrub for non-peak hours or set the RAID scrub impact to low by using:"

filer>options raid.scrub.perf_impact low

NetApp performance diagnosis commands

Note: Don't forget to turn on session logging in PuTTY, as the output will often exceed the screen length. Also note that certain commands may not be available at the admin prompt [priv set admin]; you may have to switch to a higher privilege level with [priv set advanced] or [priv set diag].

filer>sysstat -x 1

Gives you a second-by-second readout of the filer's performance. In particular, look at the CP Time and CP Type columns: if you are constantly hitting 100% CP Time and the CP Type shows lots of B's (back-to-back CPs), this indicates that the NVRAM cache is being flooded and the filer is struggling to write all the incoming data quickly enough.
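The CP Time / CP Type check described above can be automated once the two columns have been copied out of a sysstat -x capture. The following is a minimal sketch in Python, not a NetApp tool: the function names, the (cp_time, cp_type) tuple layout, and both thresholds are hypothetical, and extracting the columns from the raw output is left to the reader.

```python
from typing import List, Tuple

# One sample per sysstat interval: (CP time in percent, CP type string).
# This layout is an assumption for illustration, not a sysstat format.
Sample = Tuple[int, str]


def back_to_back_ratio(samples: List[Sample]) -> float:
    """Fraction of intervals whose CP type starts with 'B' (back-to-back)."""
    if not samples:
        return 0.0
    hits = sum(1 for _, cp_type in samples if cp_type.strip().startswith("B"))
    return hits / len(samples)


def saturated_ratio(samples: List[Sample], cp_time_full: int = 100) -> float:
    """Fraction of intervals in which CP time reached cp_time_full percent."""
    if not samples:
        return 0.0
    hits = sum(1 for cp_time, _ in samples if cp_time >= cp_time_full)
    return hits / len(samples)


def nvram_pressure(samples: List[Sample], threshold: float = 0.5) -> bool:
    """True when both ratios reach the (hypothetical) threshold.

    This mirrors the rule of thumb in the text: CP time constantly at 100%
    together with many back-to-back CPs suggests the filer cannot flush
    incoming writes fast enough.
    """
    return (saturated_ratio(samples) >= threshold
            and back_to_back_ratio(samples) >= threshold)


if __name__ == "__main__":
    busy = [(100, "B"), (100, "B"), (100, "T"), (100, "B")]
    quiet = [(20, "T"), (35, "T")]
    print(nvram_pressure(busy))
    print(nvram_pressure(quiet))
```

The 0.5 threshold is only a starting point; what counts as "lots of B's" depends on the workload and the length of the capture.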

filer>lun stats -z
Then wait 10 seconds, and then:
filer>lun stats -o

The first command clears the LUN statistics and resets all counters to zero. Wait for a chosen period of time, then run the second command to get a snapshot of LUN performance over that period. Look especially for excessive Partner Ops, which indicate a misconfiguration somewhere (particularly prevalent when ALUA has not been enabled on a LUN and the DSM version is trying to use all paths at once).

filer>priv set diag
filer>statit -b
Then wait 5 seconds, and then:
filer>statit -e

These privileged commands give a detailed look at the filer's disk performance. The first begins (-b) the performance snapshot and the second ends (-e) it.

You may also refer to the following PDFs:
Monitoring Storage Performance using NetApp Operations Manager
  http://media.netapp.com/documents/tr-4090.pdf
NetApp Storage Monitoring Using HP OpenView
  http://www.netapp.com/us/media/tr-3688.pdf

For checking CPU:
filer>sysstat -m 1
and also:
filer>priv set diag
filer>sysstat -M 1
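CPU figures from these commands are only half of the picture; the general strategy is to read them together with a service metric such as latency. A minimal Python sketch of that reasoning follows. The function name, the result strings, and the 90% and 20 ms thresholds are hypothetical values chosen for illustration, not NetApp guidance.

```python
def assess(cpu_busy_pct: float, latency_ms: float,
           cpu_high: float = 90.0, latency_high: float = 20.0) -> str:
    """Combine a component metric (CPU) with a service metric (latency).

    Both thresholds are illustrative assumptions; pick values that match
    your own baseline.
    """
    cpu_is_high = cpu_busy_pct >= cpu_high
    latency_is_high = latency_ms >= latency_high

    if cpu_is_high and latency_is_high:
        # Clients are suffering while the CPU is busy: worth a deeper look.
        return "possible CPU bottleneck"
    if cpu_is_high:
        # Busy CPU but healthy latency: often background work such as
        # scrubbing or deswizzling, which is throttled when clients need CPU.
        return "high CPU, clients unaffected"
    if latency_is_high:
        # Latency is poor but the CPU is not busy: look at disk or network.
        return "latency issue, check other components"
    return "no issue indicated"


if __name__ == "__main__":
    print(assess(95.0, 4.0))
    print(assess(95.0, 40.0))
```

A single sample proves little; the same check applied over a series of measurements gives a more reliable picture.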

If you are unable to make sense of the output, do not worry: contact NetApp technical support by phone or email; they are really good. In most cases they will ask you to collect logs and upload them to the NetApp Support site. To help you do this, NetApp support will direct you to the following log collection tools:

Tool: Perfstat
C:\>perfstat -f [filer] -t 5 -i 6 > [case number].perfstat.out
Download the perfstat tool from the NetApp Support Site:
  http://support.netapp.com/NOW/download/tools/perfstat/

Tool: NSanity
Collects details of all SAN-related components for end-to-end diagnosis. For full command information, check the NSanity page on the NOW site:
  http://support.netapp.com/NOW/download/tools/nsanity/

How to upload a file to NetApp
  https://kb.netapp.com/support/index?page=content&id=1010090

Bugs that are linked to high CPU utilization

BUG 91653: Volume SnapMirror source has high CPU usage
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=91653
BUG 110630: Wildcard searches from CIFS on large directories are CPU-intensive
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=110630
BUG 168255: WAFL scanners may cause excessive latency on an idle system
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=168255
BUG 698798: High CPU utilization with many concurrent 'block ownership' and 'blocks used' scanners
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=698798
BUG 721610: High CPU usage in check_acm()
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=721610
BUG 595957: High CPU utilization on Cluster-Mode storage systems that have a high number of SAS shelves and disks
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=595957
BUG 590193: WAFL background file system scanner may cause high CPU usage
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193
BUG 222545: CPU utilization can experience high numbers without associated load
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=222545
BUG 164124: Kerberos replay cache can cause high CPU usage
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=164124
BUG 22458: Filer's CPU usage is high when many CIFS files are open and there are many Change Notify requests
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=22458
BUG 12249: High CPU utilization when many Change Notify requests or virus scanning
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=12249
BUG 227649: Using SNMP to retrieve disk configuration causes high CPU usage
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=227649
BUG 158789: CPU utilization much higher after 7G upgrade on systems with quota-enabled volumes
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=158789
BUG 245502: System slow or unusable due to high CPU activity
  http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=245502