On Production SQL Server, we have following config: 3 Dell PowerEdge R630 servers, combined into Availability Group All 3 are connected to single Dell SAN storage unit which is a RAID array From time to time, on PRIMARY we are seeing messages similar to below: > SQL Server has encountered 11 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\Data\MyDatabase.mdf] in database id 8. The OS file handle is 0x0000000000001FBC. The offset of the latest long I/O is: 0x000004295d0000. The duration of the long I/O is: 37397 ms. We are novice in performance troubleshooting What are the most common ways or best practices in troubleshooting this particular issue related to storage ? What performance counters, tools, monitors, apps, etc. must be used to narrow down to the root cause of such messages ? Might be there is a Extended Events that can help, or some kind of audit / logging ?
meme (imported from SE)
This is far less often a disk issue, and far more often a networking issue. You know, the N in SAN? If you go to your SAN team and start talking about the disks being slow, they're gonna show you a fancy graph with 0 millisecond latency on it and then point a stapler at you. Instead, ask them about the network path to the SAN. Get speeds, if it's multipathed, etc. Get numbers from them about the speeds you should be seeing. Ask if they have benchmarks from when the servers were set up. Then you can use [Crystal Disk Mark] or [diskpd] to validate those speeds. If they don't line up, again, it's most likely the networking. You should also search your error log for messages that contain "FlushCache" and "saturation", because those can also be signs of network contention. One thing you can do to avoid those things as a DBA is make sure that your maintenance and any other data-heavy tasks (like ETL) aren't going on at the same time. That can definitely put a lot of pressure on storage networking. You may also want to check the answers here for more suggestions: [Slow checkpoint and 15 second I/O warnings on flash storage](https://dba.stackexchange.com/questions/233705/slow-checkpoint-and-15-second-i-o-warnings-on-flash-storage) I blogged about a similar topic here: [From The Server To The SAN](https://www.brentozar.com/archive/2019/01/from-the-server-to-the-san/) : https://crystalmark.info/en/software/crystaldiskmark/ : https://gallery.technet.microsoft.com/DiskSpd-A-Robust-Storage-6ef84e62
Ok, for anyone interested, We solved issue in Question couple months ago simply by installing directly attached SSD drives into each of 3 servers, and moving DB data and log files from SAN to those SSD drives Here summary on what I did to research on this issue (using recommendations from all the posts it this question), before we decided to install SSD drives: 1) started collecting PerfMon counters for following drives at all 3 servers: `Disk F:` is logical disk based on SAN, contains MDF data files `Disk I:` is logical disk based on SAN, contains LDF log files `Disk T:` is directly attached SSD, dedicated solely to tempDB Picture below is average values collected for 2 weeks period [![Disk Performance Counters]] `Disk I: (LDF)` has such a small IO and Latency is very low, so Disk I: can be ignored You can see that `Disk T: (TempDB)` has bigger IO compared to `Disk F: (MDF)`, and it has much better Latency at the same time - 0 ms Obviously something is wrong with Disk F: where data files reside, it has high Latency and Avg Disk Write Queue, despite low IO 2) Checked Latency for individual databases using query from this website https://www.brentozar.com/blitz/slow-storage-reads-writes/ Few active databases on Primary server had 150-250 ms read latency and 150-450 ms write latency What is interesting, master and msdb database files had read latency up to 90 ms which is suspicious given the small size of their data and low IO - another indication something is wrong with SAN 3) There were no specific timings During which "SQL Server has encountered occurences..." messages showed up There were no maintenance or disk heavy ETL running when those messages were logged 4) Windows Event Viewer Did not show any other entries that would hint the problem, except "SQL Server has encountered occurences..." 5) Started checking top 10 queries From sp_BlitzCache (cpu, reads, etc.), and omptimizing where possible No super IO heavy queries that would churn tons of data and impacting storage heavily, though Indexing in databases is OK, I maintain it 6) We do not have SAN team We only have 1 sysadmin who helps on occassion Network path to SAN - it is multipathed, each of 3 servers have 2 network cables leading to switches and then to SAN, and its supposed to be 1 Gigabyte / sec 7) There were no CrystalDiskMark results Or any other benchmark test results from when the servers were setup up, so I do not know what the speeds *should* be, and its not possible to benchmark at this point to see what the speeds currently are, as it would have impacted Production 8) Setup Extended Events session on checkpoint event for database in question XE session helped to discover that during "SQL Server has encountered occurences..." messages, checkpoint happened really slow (up to 90 seconds) 9) SQL Server Error Log Contained "FlushCache" "Saturation" entries These supposed to show up when checkpoint time for given database exceeds recovery interval settings Details showed that amount of data that checkpoint is trying to flush is small and it is taking long time to complete, and the overall speed is about 0.25 MB / sec... weird 10) Finally, this picture shows storage troubleshooting chart: [![Slow Disk IO Troubleshooting Steps]] It appears we simply have a "Hardware Problem:- Work with system admin/hardware vendor to fix any misconfiguration of SAN, old/faulty drivers, controllers, firmware, etc." In another question "Slow checkpoint..." https://dba.stackexchange.com/questions/233705/slow-checkpoint-and-15-second-i-o-warnings-on-flash-storage Sean had very nice list of what items have to be checked at hardware and software level to troubleshoot Our sysadmin could not check all things from the list, so we simply choose to throw some hardware at this issue - it was not expensive at all Summary: Moving to local SSD resolved not only storage performance issues but also data safety that I was concerned about (if SAN fails, all 3 servers lose their data at the same time) Now each server has local copy of DB data, and full/diff/log backups are done to the mentioned SAN No more "SQL Server has encountered occurences..." messages in Windows Event Viewer logs, and performance of backups, integrity checks, index rebuilds, queries etc. has increased significantly : https://i.stack.imgur.com/2MQei.png : https://i.stack.imgur.com/SMlFE.png
Why storing the data on a SAN? What's the point? All database performance is tied to Disk I/O and you are using 3 servers with only one device for the I/O behind them. That makes no sense... and unfortunately so common. I spend my life encountering poorly designed hardware platforms where people just try to design a large scale computer. All CPU power here, all disks there... hopefully there is not such a thing as remote RAM. And the saddest is they compensate the lack of efficiency of this design with huge servers that cost ten time more than they should. I saw $400k infra slower than a $1k laptop. A SQL server software is a very advanced piece of software, it is designed to take advantage of any bits of hardware, CPU cores, CPU cache, TLB, RAM, disk controllers, hard drive cache... They almost include all filesystem logic. They are developed on regular computer and benchmarked on high end systems. Therfore a SQL server must have its own disks. Installing them on a SAN is like "emulating" a computer, you lose all performance optimisations. SANs are for storing backups, immutable files, and files you just append data to (logs). Datacenter administrators tend to put all they can on SANs because this way they have only one pool of storage to manage, it's more easy than caring for storage on each server. It's a "I don't want to do my job" choice, and a very bad one, because then they have to deal with performance problems and all the company suffer from this. Just install software on the hardware it is designed for. Keep it simple. Care for I/O bandwidth, cache and context switch overhead, ressource jitter (happens when ressource is shared). You'll end up maintaining 1/10th of the devices for the same raw output power, save your ops team lot of headaches, gain performance that makes your end users happy and more productive, make your company a better place to work in, and save lot of energy (the planet will thank you). You said in comments, you are considering to put SSD in your server. You will not recognize your setup with dedicated SSDs, compared to a SAN you'll get something like 500x improvement even with data and transaction log files on same drive. A state of the art SQL Server would have fast separate SSD for data and transaction log on different hardware controllers channels (most server motherboard have several). But compared to your current setup we are talking of sci-fi there. Just give SSD a try.
We have a similar setup and recently encountered these messages in the logs. We are using a DELL Compellent SAN. Here are some things to check when receiving these messages that helped us find a solution - Review your windows performance counters for your disks that the warning messages are pointing to, specifically: - Disk avg. read time - Disk avg. write time - Disk read bytes/sec - Disk write bytes/sec - Disk Transfers/sec - Avg. disk queue length - The above are averages. If you have many database files on one drive these averages can skew the result and mask a bottle neck on specific database files. Check out [this](https://www.sqlskills.com/blogs/paul/how-to-examine-io-subsystem-latencies-from-within-sql-server/) query from Paul S. Randal which returns average latency for each file from the dmv `sys.dm_io_virtual_file_stats`. In our case the average latency reported was acceptable, but underneath the covers we had many files with > 200 ms average latency. - Check the timings. Is there any pattern? Does it happen more frequently at certain a time in the night? If so check if any maintenance jobs are running at that time or any scheduled activity which may increase disk activity and expose a bottle neck in your IO subsystem. - Check the windows event viewer for errors. If your switch or SAN is being overloaded or not setup properly for your application you may find some messages in this log, and it is good to take this information to your SAN admin. In our case we were receiving iSCSI connection errors often throughout the day, hinting at the problem. - Review your SQL Server code. When you receive these messages you shouldn't immediately think it is an IO subsystem issue and pass it to your SAN admin. You need to do your part and review the database. Do you have really bad queries being run often churning through tons of data? Bad indexing? Excessive transaction log writes? You can use some open source queries to get a health check on your database, an example for checking how your query plan looks is [sp_blitzCache](https://www.brentozar.com/blitzcache/) - Don't ignore these. Today you may be receiving them a few times a day... then several months later when your workload increases and you forgot to monitor them they start to increase. Receiving lots of these messages can prevent SQL Server from accessing a certain file, and if it is **tempdb**, that is not good. In our case it got so bad that SQL Server shut itself down. Our solution was upgrading our switch to a SAN switch. Yes, these are all points to cover within SQL Server. What led us to finding out it was the switch was that we were receiving about 1500 iSCSI pdu disconnect errors in the Windows application event viewer on the SQL Server every day. That prompted the investigation by our SAN admins into the switch. Immediately after upgrading, the iSCSI errors were gone and average latency came down to around 50 ms for all the files, and that correlated to better performance in the application. With these points in mind hopefully you can find your solution.