Sometimes it just happens: you have a bloated Java application on your hands, and it does not perform well. You may have built this application yourself or inherited it as-is. It doesn't matter; the point is that you do not have the slightest idea what is wrong here.
The Java ecosystem has an abundance of diagnostic tools (thanks to the interfaces exposed by the JVM itself), but they are mostly focused on specific, narrow classes of problems. Despite calling themselves intuitive, they assume you have a lot of background knowledge about the JVM and profiling techniques. Honestly, even a seasoned Java developer (I'm speaking for myself here) can feel lost the first time they look at JProfiler, YourKit or Mission Control.
If you have a performance problem on your hands, the first thing you need to do is classify it: is it in Java, in the database, or somewhere else? Is it a CPU problem or a memory problem? Once you know what kind of problem you have, you can choose your next diagnostic approach consciously.
Are we CPU bound?
One of the first things you would naturally do is check the CPU usage of your process. The OS can show you per-process CPU usage, which is useful, but the next question is which threads are consuming it. The OS can show you per-thread usage too; you can even get the OS IDs for your Java threads using jstack and correlate them ... manually (sigh).
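For the record, the manual correlation goes roughly like this (the PID is a placeholder): `top -H` reports hot OS threads with decimal thread IDs, while jstack prints each Java thread's native ID in hex in its `nid=` field, so a hex conversion sits in between:

```shell
# top -H -p <pid> lists the hottest OS threads with decimal thread IDs.
# jstack <pid> prints each Java thread's native ID as hex: nid=0x....
# Convert the decimal OS thread ID into the form jstack uses:
printf 'nid=0x%x\n' 12345    # -> nid=0x3039
# ...then grep for that nid in the jstack output to find the thread name.
```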
A simple tool showing CPU usage per Java thread is something I had badly wanted for years.
Besides CPU usage, JMX has another invaluable metric - a per-thread allocation counter.
Collecting information over JMX is safe and can be done on a live application instance (if you do not have a JMX port open, SJK can connect using the process ID).
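Under the hood, both numbers come from the ThreadMXBean. A minimal in-process sketch of reading them is below (tools like ttop collect the same data remotely over JMX; the cast to the HotSpot-specific `com.sun.management.ThreadMXBean` is an assumption that you are running on a HotSpot/OpenJDK VM):

```java
import java.lang.management.ManagementFactory;

public class ThreadStats {
    public static void main(String[] args) {
        // HotSpot's extended ThreadMXBean exposes per-thread allocation counters
        com.sun.management.ThreadMXBean tmx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        for (long id : tmx.getAllThreadIds()) {
            long cpuNanos = tmx.getThreadCpuTime(id);          // -1 if unsupported
            long allocBytes = tmx.getThreadAllocatedBytes(id); // -1 if disabled
            String name = tmx.getThreadInfo(id).getThreadName();
            System.out.printf("%-40s cpu=%dms alloc=%db%n",
                    name, cpuNanos / 1_000_000, allocBytes);
        }
    }
}
```

A monitoring tool samples these counters periodically and reports the delta as a rate, which is what ttop's `user=` / `alloc=` columns show.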
Below is an example of ttop command output.
2014-10-01T19:27:22.825+0400 Process summary
  process cpu=101.80%
  application cpu=100.50% (user=86.21% sys=14.29%)
  other: cpu=1.30%
  GC cpu=0.00% (young=0.00%, old=0.00%)
  heap allocation rate 123mb/s
  safe point rate: 1.5 (events/s) avg. safe point pause: 0.14ms
  safe point sync time: 0.00% processing time: 0.02% (wallclock time)
 user=83.66% sys=14.02% alloc= 121mb/s - Proxy:ExtendTcpProxyService1:TcpAcceptor:TcpProcessor
 user= 0.97% sys= 0.08% alloc= 411kb/s - RMI TCP Connection(35)-10.139.200.51
 user= 0.61% sys=-0.00% alloc= 697kb/s - Invocation:Management
 user= 0.49% sys=-0.01% alloc= 343kb/s - RMI TCP Connection(33)-10.128.46.114
 user= 0.24% sys=-0.01% alloc=  10kb/s - PacketPublisher
 user= 0.00% sys= 0.10% alloc=  11kb/s - PacketReceiver
 user= 0.00% sys= 0.07% alloc=  22kb/s - RMI TCP Connection(31)-10.139.207.76
 user= 0.00% sys= 0.05% alloc=  20kb/s - RMI TCP Connection(25)-10.139.207.76
 user= 0.12% sys=-0.07% alloc= 2217b/s - Cluster|Member(Id=18, Timestamp=2014-10-01 15:58:3 ...
 user= 0.00% sys= 0.04% alloc= 6657b/s - JMX server connection timeout 76
 user= 0.00% sys= 0.03% alloc=  526b/s - PacketListener1P
 user= 0.00% sys= 0.02% alloc= 1537b/s - Proxy:ExtendTcpProxyService1
 user= 0.00% sys= 0.02% alloc= 6011b/s - JMX server connection timeout 49
 user= 0.00% sys= 0.01% alloc=    0b/s - DistributedCache
Besides CPU and allocation, it also collects "true" GC usage and safe point statistics. The latter two metrics are not available via JMX, so they are reported only for process ID connections.
The CPU usage picture will give you a good insight into what to do next: should you profile your Java hot spots, or is all the time spent waiting for results from the database?
Another common class of Java problems is related to garbage collection. If that is the case, GC logs are the first place to look.
Do you have them enabled? If not, it is not a big deal: you can enable GC logging on a running JVM process using the jinfo command. You can also use SJK's gc command to peek at the GC activity of your Java process (though it is not as complete as GC logs).
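On a HotSpot JDK 7/8 VM, that looks roughly like the sketch below (the PID is a placeholder; these flags are "manageable" and so can be flipped at runtime, while on JDK 9+ GC logging moved to the unified -Xlog:gc framework):

```shell
# JDK 7/8: turn on GC logging flags on a live process (<pid> is a placeholder)
jinfo -flag +PrintGCDetails <pid>
jinfo -flag +PrintGCTimeStamps <pid>

# Or peek at GC activity with SJK's gc command over a PID connection
java -jar sjk.jar gc -p <pid>
```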
If GC logs confirm that GC is causing your problems, the next step is to identify where that garbage comes from.
Commercial profilers are good at memory profiling, but this kind of analysis slows down target application dramatically.
Mission Control stands out of the pack here: it can profile by sampling TLAB allocation failures. This technique is cheap and generally produces good results, though it is inherently biased and may sometimes mislead you.
For a long time, jmap and its class histogram were my main memory profiling instruments. A class histogram is simple and accurate.
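For reference, getting a class histogram with jmap is a one-liner (the PID is a placeholder; the :live variant forces a full GC first, so only survivors are counted):

```shell
# Histogram of all objects on the heap, live and dead (<pid> is a placeholder)
jmap -histo <pid>

# Force a full GC first, then count only live objects
jmap -histo:live <pid>
```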
The dead heap histogram is calculated as the difference between the object population before and after a forced GC (using jmap's class histogram command under the hood).
The dead young heap histogram forces a full GC, then waits 10 seconds (by default), then produces a dead object histogram using the technique described above. Thus you see a summary of freshly allocated garbage.
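Assuming SJK's hh (heap histogram) command, usage looks roughly like this (the PID is a placeholder and the option names are from memory, so check `sjk hh --help` for your version):

```shell
# Plain heap histogram, live and dead objects (<pid> is a placeholder)
java -jar sjk.jar hh -p <pid>

# Dead object histogram: diff of histograms around a forced GC
java -jar sjk.jar hh -p <pid> --dead

# Dead-young: force GC, wait, then histogram the freshly allocated garbage
java -jar sjk.jar hh -p <pid> --dead-young
```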
These methods cannot tell you where in your code that garbage was allocated (that is a job for Mission Control et al.). Though, if you know what your top garbage objects are, you may already know where they are allocated.