Site freezes when processing MYSQL operation

Carl

Active Member
#1
Hello,

My site running on LS Enterprise 4-core became unresponsive on two occasions in about a week. Both time it happened after an attempt to update the databases by the site. i.e a person submits a form with data to update to the DB. The site would freeze, any connection attempt would be stuck on "connecting to (site address)...". I had to hard restart the server with /usr/local/lsws/bin/lswsctrl stop and /usr/local/lsws/bin/lswsctrl start both times to get the site working again and the same MYSQL operation would work fine this time, and on.

Not sure if something MYSQL related is the cause of the freeze, but any ideas what could be the cause? Could there be some settings (i.e connection settings,process limit)I need to check for the site?


Thanks
 

Carl

Active Member
#3
Hi,

After some research I found 'SHOW ENGINE INNODB STATUS' is supposed to show the latest deadlock if it happens, however no deadlocks were shown on my result. And I think a deadlock would timeout after 50 second by default, set by the 'innodb_lock_wait_timeout' variable, so no reason for my site being unresponsive for longer than that.

Since restarting LS doesn't start the MYSQL process, so if MYSQL is really locked up and is causing the site to be unresponsive, restarting LS shouldn't have fixed the problem. I am thinking perhaps something in LS was causing it.

Any other ideas?
 

Carl

Active Member
#4
Just had a freeze happen again, here was the MYSQL processlist while site was frozen

+-------+------+-----------+-----------+---------+------+-------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+------+-----------+-----------+---------+------+-------+------------------+
| 69919 | root | localhost | {hidden} | Sleep | 268 | | NULL |
| 69944 | root | localhost | {hidden} | Sleep | 254 | | NULL |
| 69964 | root | localhost | {hidden} | Sleep | 225 | | NULL |
| 70046 | root | localhost | {hidden} | Sleep | 97 | | NULL |
| 70048 | root | localhost | {hidden} | Sleep | 95 | | NULL |
| 70059 | root | localhost | {hidden} | Sleep | 70 | | NULL |
| 70061 | root | localhost | {hidden} | Sleep | 61 | | NULL |
| 70062 | root | localhost | {hidden} | Sleep | 61 | | NULL |
| 70063 | root | localhost | {hidden} | Sleep | 59 | | NULL |
| 70068 | root | localhost | {hidden} | Sleep | 50 | | NULL |
| 70070 | root | localhost | {hidden} | Sleep | 49 | | NULL |
| 70071 | root | localhost | {hidden} | Sleep | 48 | | NULL |
| 70079 | root | localhost | {hidden} | Sleep | 33 | | NULL |
| 70080 | root | localhost | {hidden} | Sleep | 30 | | NULL |
| 70081 | root | localhost | {hidden} | Sleep | 27 | | NULL |
| 70082 | root | localhost | {hidden} | Sleep | 25 | | NULL |
| 70083 | root | localhost | {hidden} | Sleep | 16 | | NULL |
| 70084 | root | localhost | {hidden} | Sleep | 14 | | NULL |
| 70085 | root | localhost | {hidden} | Sleep | 14 | | NULL |
| 70086 | root | localhost | {hidden} | Sleep | 11 | | NULL |
| 70087 | root | localhost | {hidden} | Sleep | 9 | | NULL |
| 70088 | root | localhost | {hidden} | Sleep | 8 | | NULL |
| 70089 | root | localhost | NULL | Query | 0 | init | show processlist |
| 70090 | root | localhost | {hidden} | Sleep | 6 | | NULL |
| 70091 | root | localhost | {hidden} | Sleep | 5 | | NULL |
| 70092 | root | localhost | {hidden} | Sleep | 4 | | NULL |
+-------+------+-----------+-----------+---------+------+-------+------------------+
26 rows in set (0.00 sec)

Restarted LS and checked processlist again

+-------+------+-----------+-----------+---------+------+-------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-------+------+-----------+-----------+---------+------+-------+------------------+
| 69944 | root | localhost | {hidden} | Sleep | 327 | | NULL |
| 69964 | root | localhost | {hidden} | Sleep | 298 | | NULL |
| 70046 | root | localhost | {hidden} | Sleep | 170 | | NULL |
| 70048 | root | localhost | {hidden} | Sleep | 168 | | NULL |
| 70059 | root | localhost | {hidden} | Sleep | 143 | | NULL |
| 70061 | root | localhost | {hidden} | Sleep | 134 | | NULL |
| 70062 | root | localhost | {hidden} | Sleep | 134 | | NULL |
| 70063 | root | localhost | {hidden} | Sleep | 132 | | NULL |
| 70068 | root | localhost | {hidden} | Sleep | 123 | | NULL |
| 70070 | root | localhost | {hidden} | Sleep | 122 | | NULL |
| 70071 | root | localhost | {hidden} | Sleep | 121 | | NULL |
| 70079 | root | localhost | {hidden} | Sleep | 106 | | NULL |
| 70080 | root | localhost | {hidden} | Sleep | 103 | | NULL |
| 70085 | root | localhost | {hidden} | Sleep | 87 | | NULL |
| 70095 | root | localhost | {hidden} | Sleep | 68 | | NULL |
| 70097 | root | localhost | {hidden} | Sleep | 62 | | NULL |
| 70098 | root | localhost | {hidden} | Sleep | 60 | | NULL |
| 70100 | root | localhost | {hidden} | Sleep | 51 | | NULL |
| 70101 | root | localhost | {hidden} | Sleep | 50 | | NULL |
| 70102 | root | localhost | {hidden} | Sleep | 47 | | NULL |
| 70103 | root | localhost | {hidden} | Sleep | 46 | | NULL |
| 70104 | root | localhost | {hidden} | Sleep | 38 | | NULL |
| 70105 | root | localhost | {hidden} | Sleep | 37 | | NULL |
| 70106 | root | localhost | {hidden} | Sleep | 37 | | NULL |
| 70107 | root | localhost | {hidden} | Sleep | 36 | | NULL |
| 70108 | root | localhost | {hidden} | Sleep | 35 | | NULL |
| 70124 | root | localhost | NULL | Query | 0 | init | show processlist |
+-------+------+-----------+-----------+---------+------+-------+------------------+
27 rows in set (0.00 sec)

Basically the same and nothing unusual there..so I'm guessing the freeze is not caused by MYSQL related things.
 

NiteWave

Administrator
#5
maybe the server is under sql-injection attack -- too many "Sleep" in sql process list.
or something like that -- because of wrong php code.

maybe you can simulate the behavior
1. in a test php script, say test.php, connect to mysql, and run "sleep" only
2. benchmark test.php, while access test.php in browser, you may reproduce the issue -- unresponsive
3. restart lsws may "fix" this issue.
or just kill all lsphp processes, may "fix" it as well.
but it's not a real fix. and all web servers may fail as well. not only lsws.

see if you can make above test easily.
 

Carl

Active Member
#6
I've looked online and apparently having a large number of "Sleep" connections are perfectly normal. I suspect it could be caused by a denial of service attack judging by the symptom. We are currently waiting on our hosting service provider with more info, meanwhile what are some of the settings in LS we can change to combat a DDoS attack?
 

NiteWave

Administrator
#9
the cause may come from
1)lsws
2)lsphp
3)mysql server

as you said, restart lsws have fixed the issue. sounds like the issue is at lsws side.
per above discussion, it's more likely connection issue between lsphp and mysql.
to further confirm this, when "The site would freeze", can you check
1) if there is lsphp process hang there for long time
#ps -ef|grep lsphp | grep site-user-name
luckily, you may find the process id, and strace it, or see how long it has run there:
#ps -o lstart pid-of-lsphp-process
then kill it(lsphp process), see if the site becomes normal ?

if no luck to find the process, just
2)killall lsphp5
and check by "ps -ef|grep lsphp" to ensure killall has succeeded

or just
3)restart mysql server
only.

if 1) or 2) or 3) can fix the issue ( -- no need restart lsws), it should narrow down the scope before continue investigating.
 

Carl

Active Member
#10
Did some testing when the site froze again, here are the latest info i've found.

1) We have two separate sites running in the same LS virtual host, one is our main site running on Zend Framework + Social Engine, the other is our blog site running on Wordpress. When the problem occurs and the main site became unresponsive, the blog site is still working fine. This lead me to believe the way LS works with some code in the main site may be causing the problem. As we've never experienced this problem when the site was ran on Apache.

2) When the main site was frozen I've tried to restart MySQL first, but that did not solve the problem. However when MySQL was shutdown the main site did respond with a 'unable to connect to MySQL server' error message generated by Zend when accessed. After MySQL was started again, the site still hangs like before, stuck at 'connecting to xxxxx'. I would think MySQL wasn't the cause for the hang as restarting it didn't fixe the problem, and our Wordpress blog site that uses the same MySQL server (but a different DB than our main site) still works fine.

3) While the main site was frozen, ps -ef | grep lsphp returned the following

0 64372 64366 0 Tue05PM ?? 0:06.25 lsphp5
-2 83538 64372 0 7:59AM ?? 0:00.19 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83633 64372 0 8:01AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83659 64372 0 8:02AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83689 64372 0 8:04AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83843 64372 0 8:11AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83879 64372 0 8:14AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83882 64372 0 8:14AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83886 64372 0 8:14AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83901 64372 0 8:15AM ?? 0:00.08 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83944 64372 0 8:17AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83982 64372 0 8:19AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83983 64372 0 8:19AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84045 64372 0 8:22AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84205 64372 0 8:22AM ?? 0:00.02 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84354 64372 0 8:22AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84361 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84362 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84364 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84365 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84366 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84368 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84369 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84370 64372 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
0 84372 365 0 8:23AM ttys000 0:00.00 grep lsphp

The parent lsphp5 process started on Tuesday 5PM, been running for quite a long time, i am guessing that the other processes (83538-84370) running under lsphp5 were requests to access the site's homepage index.php, but is hanging there since the site is stuck.

4) I then did a killall lsphp5 and the site is working again.

Then I did a ps -ef | grep lsphp again after and got

-2 83538 1 0 7:59AM ?? 0:00.19 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83633 1 0 8:01AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83659 1 0 8:02AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83689 1 0 8:04AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83879 1 0 8:14AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83882 1 0 8:14AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83886 1 0 8:14AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83901 1 0 8:15AM ?? 0:00.08 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83944 1 0 8:17AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 83983 1 0 8:19AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84205 1 0 8:22AM ?? 0:00.02 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84369 1 0 8:23AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84376 1 0 8:24AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84386 1 0 8:24AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84391 1 0 8:24AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84392 1 0 8:24AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84395 1 0 8:24AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
-2 84417 1 0 8:25AM ?? 0:00.01 lsphp5:/usr/local/lsws/{mysite}/index.php
0 84419 64366 0 8:26AM ?? 0:00.02 lsphp5
-2 84421 84419 0 8:26AM ?? 0:00.24 lsphp5
-2 84422 84419 0 8:26AM ?? 0:00.39 lsphp5
-2 84423 84419 0 8:26AM ?? 0:00.08 lsphp5
-2 84424 84419 0 8:26AM ?? 0:00.02 lsphp5
-2 84432 84419 0 8:26AM ?? 0:02.22 lsphp5
-2 84433 84419 0 8:26AM ?? 0:01.20 lsphp5
-2 84436 84419 0 8:26AM ?? 0:00.38 lsphp5
-2 84437 84419 0 8:26AM ?? 0:00.74 lsphp5
-2 84438 84419 0 8:26AM ?? 0:09.50 lsphp5
0 84445 365 0 8:27AM ttys000 0:00.00 grep lsphp

I noticed the process that were "hanging" was no longer run under lsphp5, but rather under /sbin/launchd (1), a new 'parent' lsphp5 spawned (84419) and it spawned a number of children (84421-84445).

Doing ps -ef | grep lsphp a later no longer return those lsphp5:/usr/local/lsws/{mysite}/index.php processes, I'm guessing those processes timed out eventually.

Again, if MySQL isn't the cause, the fact that the blog site was still working while the main site was hanging lead me to believe the way LS works with some code in the main site may be the cause of the problem. Let me know what do you think about all this.
 
#11
>Doing ps -ef | grep lsphp a later no longer return those lsphp5:/usr/local/lsws/{mysite}/index.php processes, I'm guessing those processes timed out eventually.
they are run away processes and disappear themselves, so it's not an issue at all. just ignore it.

>Again, if MySQL isn't the cause, the fact that the blog site was still working while the main site was hanging lead me to believe the way LS works with some code in the main site may be the cause of the problem. Let me know what do you think about all this.
now the issue becomes more clear. since "killall lsphp5" can temporarily fix it, so the issue is relating to connections between lsphp5 and mysql(no direct link to litespeed process)

maybe it's related to lsphp suEXEC Daemon mode. can you change it to
lsws web admin -> Server -> External App -> lsphp5 -> Run On Start Up: No
 

Carl

Active Member
#12
Unfortunately the main site uses APC caching heavily, and suEXEC Daemon mode is required for APC to work properly. The blog site however doesn't use APC.
 

Carl

Active Member
#14
We are using PHP 5.3.29, it should be compatible with APC.

There was a freeze last night, this time killall lsphp5 didn't fix the problem but a server restart did..

I also noticed in the error log that during the freeze there are nothing but the following 2 types of logs:

2015-01-26 19:28:58.617 [INFO] [99.237.126.186:63770-0#site] Abort request processing by PID:20513, kill: 1, begin time: 2, sent time: 2, req processed: 0
2015-01-26 19:28:58.722 [INFO] [CLEANUP] Send signal: 30 to process: 20513


- Looks like logs of processing aborted and cleaned up, note that the req processed is always 0 during the freeze. the begin time and sent time varies. This type of log happen pretty much constantly, like every 3-5 sec, I think every time a request to the site made, it prints out this log, but I'm not too sure.

2015-01-26 19:57:14.417 [INFO] [99.249.216.97:61812-0#site] Connection idle time: 1801 while in state: 5 watching for event: 25,close!
2015-01-26 19:57:14.417 [NOTICE] [99.249.216.97:61812-0#site] Content len: 0, Request line: 'GET /toronto/clear-spirit-70-distillery-lane/unit-3306-C3096332 HTTP/1.1'
2015-01-26 19:57:14.418 [INFO] [99.249.216.97:61812-0#site] Cookie len: {long messy cookie value}
2015-01-26 19:57:14.418 [NOTICE] [99.249.216.97:61812-0#site] Redirect: #1, URL: /index.php
2015-01-26 19:57:14.418 [INFO] [99.249.216.97:61812-0#site] HttpExtConnector state: 8, request body sent: 0, response body size: 0, response body sent:0, left in buffer: 0, attempts: 0.
2015-01-26 19:57:14.418 [INFO] [99.249.216.97:61812-0#site] Lsapi connection state: 3, watching event: 25, respState: 1, m_iTotalPending:0, m_iPacketLeft:0, m_iCurRespHeader:7, req sent for 1801 seconds,Total processing time: 1801.
2015-01-26 19:57:14.418 [INFO] [99.249.216.97:61812-0#site] abort request..., code: 4
2015-01-26 19:57:14.418 [INFO] [99.249.216.97:61812-0#site] Abort request processing by PID:20457, kill: 1, begin time: 1801, sent time: 1801, req processed: 0
2015-01-26 19:57:14.540 [INFO] [CLEANUP] Send signal: 30 to process: 20457


- Looks like connection timeouts after 1800(we set this time) sec, makes since since the site is unresponsive, this log occurs once in a while.

Does those logs, perhaps more so the first one suggests where the problem lies?

--------

Also looking further back though the error log I've also noticed that when the site is running normally -

The "Abort request processing by PID:94759, kill: 1, begin time: 0, sent time: 0, req processed: 54" type message occurs infrequently aswell. But req processed is sometimes 0 as well here.

The " [CLEANUP] send signal: 30 to process: 20513" type message occurs also but very rarely.

Is it normal to see these messages when site is running normally?
 
Last edited:
#15
>There was a freeze last night, this time killall lsphp5 didn't fix the problem but a server restart did..
hmm... lsws restart, not linux reboot ?

>Looks like connection timeouts after 1800(we set this time) sec
usually it's not good to set this value long, the default is 10 sec ?
do you mean lsphp5 external app->Max Idle Time:1800 seconds
or Environment:LSAPI_MAX_PROCESS_TIME=1800 ?

can you show your lsphp5 settings and the reason for long "processing time"?
let's see if there is other better way to archive same goal.
 

Carl

Active Member
#16
We have some important scheduled tasks that are triggered via a HTTP requests to a page, the task can take up to 30 mins. If set less the schedule it will timeout before it is completed.
 

mistwang

LiteSpeed Staff
#17
I think it is a dead lock in APC's SHM access code, so as long as PHP process using the same SHM segment, it cannot process anything as it will be locked immediately when load a script from APC. You should strace the PHP process when it happens, if it stuck at "futex()" call, it is a dead lock. then you can check the process with GDB by attaching to that process,

gdb
attach <pid_of_a_php_process>
bt
It will show you the stack back trace. You will know which PHP module causes the problem.
 

Carl

Active Member
#19
>There was a freeze last night, this time killall lsphp5 didn't fix the problem but a server restart did..
hmm... lsws restart, not linux reboot ?
Yes lsws restart


Also the variable we set 1800 to is Connection Timeout (secs) under Server>tuning
 
Last edited:

Carl

Active Member
#20
I think it is a dead lock in APC's SHM access code, so as long as PHP process using the same SHM segment, it cannot process anything as it will be locked immediately when load a script from APC. You should strace the PHP process when it happens, if it stuck at "futex()" call, it is a dead lock. then you can check the process with GDB by attaching to that process,



It will show you the stack back trace. You will know which PHP module causes the problem.
I'm assuming that is to strace the existing lsphp5 processes first.

Assuming it is APC that is causing the problem, is there any settings in APC we can tune to avoid this problem without turning off suEXE Daemon mode? I noticed the problem usually occur 5-7 days after the previous LS restart to fix the problem, I'm thinking perhaps there is a certain build up of something somewhere and it breaks when reaches a certain threshold. Would you think increasing the max total ram size of APC would help? Even though right now it is set at 500mb and our usage never really went over 200mb in opcode + user cache.

Mean while we have set a scheduled task to restart lsws everyday in low-traffic time to reset server + APC state, will see if the problem occurs again...
 
Top