Over time, you tend to learn a Solaris performance trick or two. Or three. Or more. That's cool, it's how stuff works: You learn, you do, you remember.
Performance analysis and tuning is just like that: You learn a trick from someone more senior than you, you apply it, you feel like a hero, you learn the next trick.
But a bag of tricks is not enough. Without a system, you end up trying things at random and wasting time hunting the problem with a hit-and-miss approach, based on gut feeling alone.
Therefore, I'm always glad to listen to Ulrich Gräf when he does one of his famous performance tuning workshops (if you're lucky, you can catch Uli blogging in German here), because he'll give you the full view, the context and the system too, when it comes to performance analysis.
So here's my personal cheat sheet for Oracle Solaris Performance Analysis, including some guidance on how to systematically catch that elusive bottleneck.
Note: Sorry for the long silence: I was on a vacation and we're in the process of relocating to a new home. Lots of stuff to do... Hopefully regular programming will continue now :).
Back to performance tuning. But where to start? If the "system is slow" or the "web server doesn't perform", or simply "performance sucks", where's the best place to begin?
Defining the Problem and the Solution
Before you start hunting down your performance problem, the first step is to actually define it:
- What exactly is "too slow"? Is it MB/s? Java transactions per minute? Frames per second? Web page load time? Average database query response time? The more specifically you can define performance, the better you can figure out what's going on. This also helps with testing assumptions or defining the overall playground for your performance bottleneck hunt. So: Please ask what the specific performance metric is and how it is measured, and get as much data as you can on what values have been measured under what circumstances so far.
- How slow is "slow"? How much would be "good enough"? Once you know what the actual measurable performance metric is, ask what the currently measured value is, and why it is unacceptable. That will give you a starting point and a feeling for how significant the issue is. Then, ask what value would make for an acceptable result. That will give you a performance goal to optimize to. And once you reach it, you'll know you're done and that you can celebrate!
- Collect all facts, assumptions and constraints. This is important, too: What's the server model, CPU count, frequency and RAM size? How many networking cards, HBAs, disks, etc. are attached to it? How's everything wired together? What versions of Solaris and the application are running? What update? What patches? Which of this information is certain, and which is just assumed or guessed? What actions have been tried so far and with what results? What options does your user/customer still have in terms of equipment, budget or manpower? The more you know about your environment, the more potential explanations for the experienced performance issue you can generate, and the more potential solutions you'll be able to come up with.
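If you like to write things down in a structured way, the first two questions boil down to a small record: a metric, its unit, the value measured today, and the value that would be "good enough". Here's a minimal Python sketch of that idea. The class name, its fields and the example numbers are all made up for illustration, they don't come from any Solaris tool.

```python
from dataclasses import dataclass


@dataclass
class PerformanceGoal:
    """Hypothetical record of a performance problem definition."""
    metric: str             # what is measured, e.g. "web page load time"
    unit: str               # unit of the metric, e.g. "s" or "MB/s"
    current: float          # value measured today
    target: float           # value that would be "good enough"
    higher_is_better: bool  # True for throughput, False for latency

    def is_met(self, measured: float) -> bool:
        """Return True once a measurement reaches the target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target

    def gap(self) -> float:
        """Factor by which the current value must improve to hit the target."""
        if self.higher_is_better:
            return self.target / self.current
        return self.current / self.target


# Illustrative numbers only: pages load in 4 s today, 1 s would be acceptable.
goal = PerformanceGoal("web page load time", "s",
                       current=4.0, target=1.0, higher_is_better=False)
print(f"{goal.metric}: need a {goal.gap():.1f}x improvement")
```

Writing the goal down like this forces the "higher is better or lower is better?" question early, and it gives you an unambiguous test for when you're done.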
Hunting for Bottlenecks
Now that we know the problem, the rules and the playing field, we're ready to start hunting for the problem. The "problem" is always a bottleneck: Data flows in and out of the system all the time, and the point at which its flow is constrained is the point we're looking for: The bottleneck.
Find the bottleneck. Remove it. Measure. Find the next bottleneck. Continue, until you've reached your "acceptable performance" goal.
Every system has a bottleneck; we just need to figure out where it is. And the best way to find it is to follow the data.
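The find, remove, measure cycle is really just a loop with an exit condition. Here's a toy Python sketch of it, assuming a pretend system whose throughput is simply limited by its slowest stage. The function names, the capacities and the "quadruple the slowest stage" fix are all hypothetical stand-ins for real measurement and real tuning work.

```python
def tune_until_good_enough(measure, find_bottleneck, remove, good_enough,
                           max_rounds=10):
    """Repeat find -> remove -> measure until the goal is reached."""
    results = [measure()]                 # baseline measurement
    for _ in range(max_rounds):
        if good_enough(results[-1]):
            break                         # goal reached: stop tuning
        remove(find_bottleneck())
        results.append(measure())         # always re-measure after a change
    return results


# Toy model (made-up numbers): MB/s each stage can move.
capacity = {"cpu": 400.0, "ram": 800.0, "disk": 100.0, "network": 120.0}


def measure():
    # Overall throughput is limited by the slowest stage.
    return min(capacity.values())


def find_bottleneck():
    # The stage with the lowest capacity constrains the data flow.
    return min(capacity, key=capacity.get)


def remove(stage):
    # Stand-in for a real fix, e.g. faster disks or another network link.
    capacity[stage] *= 4


results = tune_until_good_enough(measure, find_bottleneck, remove,
                                 good_enough=lambda mb_s: mb_s >= 300.0)
print(results)  # [100.0, 120.0, 400.0]
```

Notice what happens in the toy run: fixing the disk only moves throughput from 100 to 120 MB/s, because the network immediately becomes the next bottleneck. That's why you measure after every single change.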
Peeling the System Onion from the Inside Out
At the end of the day, every computer system is just a huge machine where data comes in at one end and leaves at the other. In between, it is copied multiple times through multiple levels of a huge memory hierarchy. We only need to step through that hierarchy systematically until we find our bottleneck.
A computer system is like an onion: You start with the CPU core in the middle, you know, the piece that does the actual work. Then you have many layers of different types of memory surrounding it like an onion. Data always flows from the outside in, gets processed, then it flows back from the inside out.
Let's look at all the types of memory from the inside out:
The CPU: The innermost part of the system is a single CPU core with its execution pipeline. Data is passed in and comes out of cores through Level 1 Cache, then Level 2 and for newer CPUs even Level 3 Cache.
RAM: Data that exits the CPU at its highest level cache is stored in RAM, and data that enters the CPU comes from RAM. This is also true for disk, graphics and networking data: Devices exchange their input/output data with the CPU through memory, so to the CPU they look like a special kind of RAM.
Disk: Data that needs to live beyond power cycles ultimately needs to be written to disk (or to the network, but we'll get to that in a minute). As with every outer layer of our data onion model, disks can store much more than the layer inside them (RAM), but they're even farther away from the CPU in terms of latency. Since disk latency hasn't improved much over the years while CPU performance kept growing roughly in step with Moore's Law, a performance gap has formed. That gap is now being bridged through flash memory.
Network: Finally, data that leaves the system travels through the network. This is the outermost layer that can be crossed by our data before it reaches another system.
That's it, these are the four places to look for bottlenecks in any system: CPU, RAM, Disk and Network. In that order. Just remember the onion model, and you'll have a strategy to guide you through the next performance hunt.
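To make the "in that order" part concrete, here's a small Python sketch that walks the four layers from the inside out and stops at the first one that looks saturated. It assumes you have already boiled each layer down to a single utilization number between 0 and 1. That reduction, the 0.8 threshold and the sample values are all assumptions for illustration, real analysis needs more than one number per layer.

```python
# The four layers of the onion, from the inside out.
ONION_LAYERS = ("cpu", "ram", "disk", "network")


def first_suspect(utilization, threshold=0.8):
    """Return the innermost layer at or above the threshold, or None.

    `utilization` maps layer names to a value between 0.0 and 1.0.
    The 0.8 default is an arbitrary illustrative cut-off, not a rule.
    """
    for layer in ONION_LAYERS:
        if utilization.get(layer, 0.0) >= threshold:
            return layer
    return None


# Made-up sample: disk and network both look busy, disk is checked first.
sample = {"cpu": 0.35, "ram": 0.60, "disk": 0.95, "network": 0.85}
suspect = first_suspect(sample)
print(suspect)  # disk
```

The point is not the threshold, it's the fixed order: you only move to the next layer once you've ruled out the one inside it.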
A Cliffhanger With a Free Plug
In the next blog entry, we'll discuss each of the big four places in more detail, and learn some tricks on how to figure out whether they happen to be a problem or not.
Meanwhile, if you happen to be in Germany in November, mark your calendars for two interesting talks during the German Oracle User's Group (DOAG) conference:
Jörg Möllenkamp (of c0t0d0s0.org fame) will present on Thursday, Nov. 18th, 10:00 about "Performance Analyse – oder: Was macht eigentlich mein Solaris?".
And Ulrich Gräf, my personal performance hero from the beginning, will present on Friday, Nov. 19th, 14:00 about "I/O-Performance – Design und Analyse".
Don't miss these two talks!
Update: Meanwhile the second article in this series has been published, with some favorite performance analysis commands and examples. Read My Favorite Oracle Solaris Performance Analysis Commands now.
I realize that the onion model is not the only methodology for finding a performance bottleneck in a system and that it's very simplistic (for example, we could talk for ages about lock contention, software optimization etc.). But in the end, everything that's a problem with a system is somehow related to the four layers explained above. That's why peeling the onion has always been very useful to me.
But now I'm curious: What's your favourite performance analysis hunting model? Leave a comment and share your wisdom!