Ongoing site issues

 Posted Jan 29, 2007 in Misc
 

The feed appears to be fixed (thanks, Mox!).

However, I’m still maxing out the CPU regularly even with WP-Cache — just writing a comment can trigger this, and I don’t know why. If you see “Account suspended” when you visit here, that’s why.

Next steps are going to involve turning off some of the dynamic stuff on the sidebars…

  15 Responses to “Ongoing site issues”

  1. Raph,
    If you want, I can set you up a little sandbox to play with and test. It will have a one-button WordPress installation.

  2. Raph, your symptoms of a single comment causing a CPU overload made me think of a similar problem my girlfriend’s blog ran into, and we have a dedicated server. Her blog was crashing our server one to three times a day.

    The main suspect was the spam filter, Spam Karma. Our theory was that the CPU was being choked because Spam Karma compares each comment to its entire database, and that database was huge: we had it set to keep spam for 30 days, and we were constantly getting hit with comment spam.

    She installed a WP plugin called “Bad Behavior”, which rejects malformed HTTP requests (or something like that) before they get to Spam Karma. She also trimmed the Spam Karma database to keep only a week’s worth of comments. Bad Behavior drastically reduced the number of comments that Spam Karma ever saw, and shrinking the database meant that when Spam Karma did check a comment, it had far less work to do.

    Ever since we made those changes, the site has not crashed. If you’ve got CPU limits, it’s quite possible that your spam filter (if you’ve got one) is actually choking your CPU. I can’t guarantee that’s your problem, but I hope this information helps.

    *This is how we understand it to work. How it actually works may be different.

  3. We use two different spam filter plugins: Trackback Validator, which silently discards any trackbacks which come from bogus webpages; and Spam Nuke, which lets us dump the ‘spam’ category from the DB altogether.

    We get hundreds of bogus trackbacks in a day, but we empty the spam cache multiple times daily, and optimize the SQL tables a couple of times a week… not emptying it was what led to the high SQL query times before. Literally half the comments DB had become spam.

    Here’s what I just posted to the webhost’s support forums:

    On 1/24 and previous, I had virtually no reports of problems in my logs. I’d had some issues with slow SQL queries from time to time (I run a pretty popular WordPress blog), but nothing too dramatic.

    On 1/25 and since, my cpu_exceeded_logs are full. Like, from 0k I have gone to

    2007-01-25.log 85 k
    2007-01-26.log 268 k
    2007-01-27.log 50 k
    2007-01-28.log 410 k
    2007-01-29.log 317 k (this is just so far today!)

    I added WP-Cache to my installs. I removed some plugins from my site. I upgraded to WP 2.1. None of it seems to have made any difference, really.

    The result is constant unavailability of the site, as I trigger the CPU utilization suspension.

    The cpu exceeded logs look like this:

    Mon Jan 29 15:14:36 2007: used 0.05 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET /wp-content/plugins/live-comment-preview.php/commentPreview.js HTTP/1.1
    Mon Jan 29 15:14:37 2007: used 0.07 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET /wp-content/plugins/flags/flag_de.gif HTTP/1.1
    Mon Jan 29 15:14:37 2007: used 0.06 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET /images/widefill.jpg HTTP/1.1
    Mon Jan 29 15:14:40 2007: used 0.39 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET /areae/wiki/tiki-index.php?page=Business HTTP/1.1
    Mon Jan 29 15:14:43 2007: used 0.38 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET /areae/wiki/tiki-index.php?page=Potential+partners HTTP/1.1
    Mon Jan 29 15:14:44 2007: used 0.51 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET /2005/12/12/the-future-of-content/feed/ HTTP/1.1
    Mon Jan 29 15:14:44 2007: used 0.72 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET /2006/12/15/announcing-areae/ HTTP/1.1
    Mon Jan 29 15:14:45 2007: used 0.04 seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET / HTTP/1.1
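One way to see where the CPU time is going is to total the reported seconds per request path. The following is a minimal Python sketch, assuming every log line follows the format quoted above; the regular expression and the `cpu_by_path` name are illustrative and not part of any hosting tool.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the excerpt above:
# "<date>: used <secs> seconds of cpu time for HTTP Request: <host> : <METHOD> <path> HTTP/1.1"
LINE_RE = re.compile(
    r"used (?P<cpu>\d+(?:\.\d+)?) seconds of cpu time for HTTP Request: "
    r"\S+ : (?P<method>[A-Z]+) (?P<path>\S+) HTTP/"
)

def cpu_by_path(lines):
    """Sum reported CPU seconds per request path, ignoring the query string."""
    totals = defaultdict(float)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip anything that does not fit the assumed format
        path = match.group("path").split("?", 1)[0]
        totals[path] += float(match.group("cpu"))
    # Most expensive paths first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

PREFIX = "used {} seconds of cpu time for HTTP Request: https://www.raphkoster.com : GET {} HTTP/1.1"
sample = [
    "Mon Jan 29 15:14:40 2007: " + PREFIX.format("0.39", "/areae/wiki/tiki-index.php?page=Business"),
    "Mon Jan 29 15:14:43 2007: " + PREFIX.format("0.38", "/areae/wiki/tiki-index.php?page=Potential+partners"),
    "Mon Jan 29 15:14:44 2007: " + PREFIX.format("0.51", "/2005/12/12/the-future-of-content/feed/"),
    "Mon Jan 29 15:14:44 2007: " + PREFIX.format("0.72", "/2006/12/15/announcing-areae/"),
]

for path, seconds in cpu_by_path(sample):
    print(f"{seconds:.2f}  {path}")
```

Run against a full day's log, a ranking like this would show whether the load comes from WordPress pages, feeds, the wiki, or static files.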

    I called support on Sunday, and was told that it just looked like the site had crossed some threshold and was just a bit too popular for the shared hosting.

    For comparison, these are my daily stats reported by Webalizer. As you can see, the stats before and after the CPU load problems are pretty comparable — in fact, they’re higher earlier in the month.

    Day  Hits           Files          Pages         Visits        Sites        KBytes
     22   92532  3.83%   74636  3.75%  15196  3.17%   7792  3.67%  5557  7.43%  1429379  3.50%
     23  161079  6.67%  137852  6.93%  20109  4.19%  10151  4.78%  7471 10.00%  2301545  5.63%
     24  108496  4.49%   91006  4.57%  21032  4.38%   9159  4.31%  6387  8.55%  2073752  5.07%
     25   86156  3.57%   71544  3.59%  17489  3.64%   8641  4.07%  5828  7.80%  1572410  3.85%
     26   77583  3.21%   64505  3.24%  17359  3.62%   8346  3.93%  5642  7.55%  1416050  3.46%
     27   66050  2.73%   55036  2.77%  15577  3.24%   7702  3.63%  5214  6.98%  1169107  2.86%
     28   69142  2.86%   58317  2.93%  17648  3.68%   9230  4.35%  6789  9.08%  1309004  3.20%

    So, does something look fishy? I don’t see how I could have suddenly tipped over to such high load.

  4. Hm, wonder why it takes 60 milliseconds of CPU time just to fetch an image? Do you have a complex .htaccess?

  5. “We get hundreds of bogus trackbacks in a day, but we empty the spam cache multiple times daily, and optimize the SQL tables a couple of times a week… not emptying it was what led to the high SQL query times before.”

    Doesn’t seem like you should have to do that. Not counting avoiding spam altogether, even. I’m not familiar with MySQL, but does it have the ability to trace and analyze? Have you looked at that? Perhaps trackbacks need their own table… is that determined by comment_type? Just guessing. Could always just blame SL hype for the outages. 🙂

  6. Mox:

    All that’s in the .htaccess is this little hack, which I can’t remember what it was even for… permalinks, I think?

    # BEGIN WordPress

    RewriteEngine On
    RewriteBase /
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]

    # END WordPress

    So I don’t think it could be that.

    Robusticus, emptying the spam cache and optimizing it is just a habit at this point. I doubt we need to do it as often as we do, we just do. 🙂 Also, emptying the spam frequently lets us filter it manually to pull out misidentified spams, which does happen a couple of times a day.

  7. The footer now reports the number of SQL queries and the time it took to gen the page.

  8. Hmmm, I use Akismet, which comes with WordPress 2.1. Does that work well for you? You might also try adding a CAPTCHA, which might reduce the database load (if the CAPTCHA doesn’t make expensive queries).

  9. Raph, have you considered being less interesting?

  10. You said you had installed WP-Cache2, but it does not appear to be caching comment pages (such as this one). Even a 2-5 minute cache on these pages would probably be helpful. Query caching in general would be great, but it doesn’t work for all of WordPress 2.0’s queries because they change on every request (they use the NOW() timestamp).

    That’s all I can come up with for the moment outside of separating static content from dynamic content on different servers or the like, but if the dynamic content isn’t being served up efficiently, then there’s no sense in pursuing any sort of app partitioning until that’s set.

    Actually, it does seem that the majority of latency is on pages with a lot of queries. This page alone has 47 queries to render. The reason it might be impacting now vs. before could have something to do with the growing number of total rows that have to be looked at to respond to a query, along with inefficient indexing.

    Do you have slow queries reporting in your MySQL log during these times of heavy CPU?
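MySQL's query cache matches statements by their exact text and skips statements that call non-deterministic functions such as NOW(). One common workaround is to pass a literal timestamp rounded down to a coarse window, so that every request inside the window sends identical SQL. Here is a minimal Python sketch of that idea; the query is hypothetical and is not WordPress's actual SQL, and the function names and the 60-second window are assumptions for illustration.

```python
from datetime import datetime, timezone

def rounded_timestamp(now, window_seconds=60):
    """Round `now` down to the start of its window, formatted as a MySQL DATETIME literal."""
    epoch = int(now.timestamp())
    floored = epoch - (epoch % window_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

def published_posts_query(now, window_seconds=60):
    """Build a query whose text stays the same for a whole window."""
    # Hypothetical query for illustration only. The timestamp is generated
    # here, not taken from user input, so it is embedded as a literal.
    stamp = rounded_timestamp(now, window_seconds)
    return (
        "SELECT ID FROM wp_posts "
        f"WHERE post_status = 'publish' AND post_date_gmt <= '{stamp}'"
    )

first = published_posts_query(datetime(2007, 1, 29, 15, 14, 36, tzinfo=timezone.utc))
second = published_posts_query(datetime(2007, 1, 29, 15, 14, 59, tzinfo=timezone.utc))
later = published_posts_query(datetime(2007, 1, 29, 15, 15, 0, tzinfo=timezone.utc))

print(first == second)  # same minute, same statement text
print(first == later)   # next minute, different statement text
```

The trade-off is freshness: a post scheduled for 15:14:30 would not appear until the next window starts, which is usually acceptable for a blog.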

  11. The number of hits in a given day only gives you an overall idea how busy your site is, and not how busy it is during the peak period. It’s possible that you just have more users hitting your site during the peak periods, which is causing this problem.

    A couple of ideas:

    1- Continue tweaking WordPress and turning off features. You’ve done a bunch of this already, so more in-depth tuning will likely involve digging into the PHP code and timing how long specific parts are taking.

    2- Separate your static and dynamic content. Something like static.raphkoster.com could serve up all of your images and take some load off of your dynamic content server.

    3- Get a different service where you have some control over the server configuration. Virtual servers are great for this and are only slightly more expensive than shared hosting plans. In addition, they typically have a much lower customer-to-CPU ratio, so you don’t have other sites affecting the performance of yours. You can change or remove the CPU limits completely, and tweak PHP settings and Apache resource limits.

    I’ll volunteer to help you out with any of that stuff if you would like any assistance.

  12. “Raph, have you considered being less interesting?”

    Yeah, I tried that already. Like, posts about site issues, for example. But you’re still here!

    “You said you had installed WP-Cache2, but it does not appear to be caching comment pages (such as this one).”

    Right now, WP-Cache2 is set to 3600 seconds, and default settings. It shows 35 cached pages. Oddly, every single one of the pages it is caching is a /feed page, except for the front page. Why it isn’t caching these pages, I have no idea. It is supposed to be doing so.

    “Do you have slow queries reporting in your MySQL log during these times of heavy CPU?”

    There are a few slow queries popping up, but there is no relationship between those and the heavy CPU reports. The heavy CPU reports are constant — multiple log lines every second. The SQL warnings come a handful of times a day.

    “The number of hits in a given day only gives you an overall idea how busy your site is, and not how busy it is during the peak period. It’s possible that you just have more users hitting your site during the peak periods, which is causing this problem.”

    Again, the CPU errors are happening 24/7. I got the account suspended message last night at 12:30am Pacific, for example. The logs show it as occurring all night long.

    Also, I do have stats showing usage by hour too… again, nothing too dramatic has changed since last week.

  13. Number of queries increasing with number of comments is suspect. Might be able to save some cycles by combining those, at least, even if not reducing the number of base queries. Not sure if that would help, though.

  14. Since your site and its usage don’t seem to have changed, I would begin to suspect something external to your site, but on the same server as yours. I don’t see many references on the Internet to that particular error message – most of them are from other BlueHost customers. BlueHost’s explanation of the error is that it occurs when your site exceeds 20% CPU usage.

    My guess would be that the resource limits per-site aren’t strictly enforced until the whole server is over some threshold, then it begins to deny access to those who exceed their guaranteed limits. Some other site on the server could be pushing the server over that threshold.

    I found this link which recommends asking to be moved to a new server: http://www.cubecart.com/site/forums/index.php?showtopic=21177

    It makes sense that if you get moved to a new server, it will have fewer customers with smaller sites. Also, if you’ve been on the same server for several years and you get moved to a “new” server, it will likely have a faster CPU. 20% of a faster CPU might be double (or more) what you are allocated now.
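As a rough check against that 20% figure, the CPU seconds in the log excerpt quoted earlier in this thread can be divided by the wall-clock time they span. This is a minimal Python sketch; how BlueHost actually measures usage is not known here, so the `cpu_share` calculation is an assumption, and a nine-second sample is far too short to be conclusive.

```python
from datetime import datetime

def cpu_share(entries):
    """Total CPU seconds divided by the wall-clock span covered by the entries."""
    times = [datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y") for stamp, _ in entries]
    span = (max(times) - min(times)).total_seconds()
    if span <= 0:
        raise ValueError("entries must cover more than one second")
    return sum(cpu for _, cpu in entries) / span

# (timestamp, CPU seconds) pairs copied from the cpu exceeded log excerpt
entries = [
    ("Mon Jan 29 15:14:36 2007", 0.05),
    ("Mon Jan 29 15:14:37 2007", 0.07),
    ("Mon Jan 29 15:14:37 2007", 0.06),
    ("Mon Jan 29 15:14:40 2007", 0.39),
    ("Mon Jan 29 15:14:43 2007", 0.38),
    ("Mon Jan 29 15:14:44 2007", 0.51),
    ("Mon Jan 29 15:14:44 2007", 0.72),
    ("Mon Jan 29 15:14:45 2007", 0.04),
]

share = cpu_share(entries)
print(f"{share:.1%} of one CPU over the sampled span")
```

By the same arithmetic, the identical workload on a CPU twice as fast would use roughly half as many CPU seconds, which is the sense in which a move to newer hardware could double the effective allocation.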

  15. […] that’s why. Next steps are going to involve turning off some of the dynamic stuff on […] Read More… Published Monday, January 29, 2007 1:30 PM by Raph’s Website Filed under: […]
