[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[gfarm-discuss-ja:61109] エラーログが全く出力されずにGFMDダウンした件について
- From: ogasawara <ogatak1978@xxxxxxxxx>
- Date: Fri, 11 Nov 2011 15:31:22 +0900
各位
お世話になっております。小笠原です。
昨日突然gfmdがsegfault等のメッセージ無く、2度ダウンしました。障害発生時
の状況としましては、
・サービス稼働中、接続が行われていた
・Gfarm-v2.5.1 gfmdで冗長化している
・メタ2台、ファイルノード4台、クライアント複数サーバで構成
(ファイルシステムノードを増やしたほうが耐障害性はやはり上がるのでしょうか?)
・rsyncで大量にファイルをGfarm上に転送していた(1回目のgfmdダウン時に停
止、以降実行せず)
今回の障害発生時のログを下記にまとめました。
■master gfmd サーバ
Nov 10 18:55:07 master-gfmd gfmd[20089]: [1002257] error at 124595662:1
replication to fsnode: src=2 dst=0
Nov 10 18:55:07 master-gfmd gfmd[20089]: [1002257] error at 124595663:3
replication to fsnode: src=2 dst=0
Nov 10 18:55:07 master-gfmd gfmd[20089]: [1002257] error at 124595676:1
replication to fsnode: src=2 dst=0
上記ログが平日186968回(1秒あたり約2回)、休日487920回(1秒あたり約5.6回)出力
!! MasterGFMDダウン ※ここで1回目のGFMDダウン(実行していたrsync停止)
数分後、gfmdプロセスがいない事をpsで確認し、/etc/init.d/gfmd startでgfmd
起動開始
Nov 10 19:09:32 master-gfmd gfmd[28086]: [1002737] metadata replication
master mode
gfmdにメモリが乗り、各gfsd,clientと接続し動作開始
Nov 10 19:13:37 master-gfmd gfmd[28086]: [1002284]
GFS_PROTO_FHREMOVE(123560631, 2, fsnode): busy, waiting for some time
... 上記ログ約1000回連続
Nov 10 19:13:42 master-gfmd gfmd[28086]: [1002478] inum 124595802 gen 3:
fewer replicas (0 out of 1) will be created
... 上記ログ約600回連続
Nov 10 19:14:04 master-gfmd gfmd[28086]: [1002284]
GFS_PROTO_FHREMOVE(123564859, 1, fsnode): busy, waiting for some time
... 上記ログ約7500回連続
Nov 10 19:14:22 master-gfmd gfmd[28086]: [1002257] error at 124596223:1
replication to fsnode: src=2 dst=0
... 上記ログ約100回連続
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002284]
GFS_PROTO_FHREMOVE(117896871, 7, fsnode): busy, waiting for some time
... 上記ログ約7000回連続
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002787] back_channel(fsnode)
GFS_PROTO_STATUS request: disconnecting: status rpc unresponsive
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002787] back_channel(fsnode)
GFS_PROTO_FHREMOVE request: disconnecting: broken pipe
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002781] channel(fsnode):
aborted: connection aborted
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002284]
GFS_PROTO_FHREMOVE(123558267, 1, fsnode): busy, waiting for some time
... 上記ログ約5800回連続
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002223] waiting removal of
(123668935, 1, fsnode): connection aborted
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002223] waiting removal of
(123671866, 1, fsnode): connection aborted
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002223] waiting removal of
(123675722, 1, fsnode): connection aborted
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002223] waiting removal of
(123668959, 1, fsnode): connection aborted
... 上記ログ約340回連続
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002787] back_channel(fsnode)
GFS_PROTO_STATUS request: disconnecting: status rpc unresponsive
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002223] waiting removal of
(123591651, 1, fsnode): connection aborted
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002787] back_channel(fsnode)
GFS_PROTO_FHREMOVE request: disconnecting: broken pipe
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002781] channel(fsnode):
aborted: connection aborted
Nov 10 19:14:33 master-gfmd gfmd[28086]: [1002284]
GFS_PROTO_FHREMOVE(123558267, 1, fsnode): busy, waiting for some time
... 上記ログ約5600回連続 19:15:04頃まで出力され続け、
Nov 10 19:15:11 master-gfmd gfmd[28086]: [1002257] error at 124596428:1
replication to fsnode: src=2 dst=0
Nov 10 19:15:12 master-gfmd gfmd[28086]: [1002257] error at 124596421:1
replication to fsnode: src=2 dst=0
Nov 10 19:15:12 master-gfmd gfmd[28086]: [1002257] error at 124596443:1
replication to fsnode: src=2 dst=0
Nov 10 19:15:12 master-gfmd gfmd[28086]: [1002257] error at 124596450:2
replication to fsnode: src=2 dst=0
Nov 10 19:15:12 master-gfmd gfmd[28086]: [1002257] error at 124596450:3
replication to fsnode: src=2 dst=0
!! MasterGFMDダウン ※ここで2回目のGFMDダウン
■ slave gfmd サーバ
!! MasterGFMDダウン ※ここで1回目のGFMDダウン
Nov 10 18:55:14 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 10 18:55:14 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 10 18:55:26 slave-gfmd gfmd[7274]: [1002993]
gfmd_channel(master-gfmd) : connection refused
Nov 10 18:55:26 slave-gfmd gfmd[7274]: [1002994] connecting to the
master gfmd failed, sleep 10 sec: connection refused
Nov 10 18:55:36 slave-gfmd gfmd[7274]: [1002993]
gfmd_channel(master-gfmd) : connection refused
Nov 10 18:55:36 slave-gfmd gfmd[7274]: [1002994] connecting to the
master gfmd failed, sleep 20 sec: connection refused
Nov 10 18:55:56 slave-gfmd gfmd[7274]: [1002993]
gfmd_channel(master-gfmd) : connection refused
slaveのgfmdですが、ログを確認したところ
Nov 2 23:19:04 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 2 23:19:04 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 2 23:19:32 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 4 13:03:51 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 4 13:03:51 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 4 13:04:06 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 4 15:31:58 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 4 15:31:58 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 4 15:32:07 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 6 12:05:53 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 6 12:05:53 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 6 12:06:13 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 6 15:14:44 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 6 15:14:44 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 6 15:15:13 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 6 21:40:32 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 6 21:40:32 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 6 21:40:44 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 7 00:00:34 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 7 00:00:34 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 7 00:00:44 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 7 13:22:14 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 7 13:22:14 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 7 13:22:16 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 7 22:45:31 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 7 22:45:31 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 7 22:45:47 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 7 23:54:19 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 7 23:54:19 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 7 23:54:47 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
Nov 8 09:12:38 slave-gfmd gfmd[7274]: [1002784]
gfmd_channel(master-gfmd): disconnected
Nov 8 09:12:38 slave-gfmd gfmd[7274]: [1000286] ((null)@master-gfmd)
disconnected
Nov 8 09:12:48 slave-gfmd gfmd[7274]: [1002998]
gfmd_channel(master-gfmd) : connected
障害発生前から不定期に接続が切れているのですが、問題なかったでしょうか。
■各gfsd
...
Nov 10 18:55:05 fsnode gfsd[14426]: [1002187]
GFS_PROTO_REPLICATION_REQUEST: replica_recv: no such file or directory
Nov 10 18:55:06 fsnode gfsd[14616]: [1002187]
GFS_PROTO_REPLICATION_REQUEST: replica_recv: no such file or directory
Nov 10 18:55:06 fsnode gfsd[14618]: [1002187]
GFS_PROTO_REPLICATION_REQUEST: replica_recv: no such file or directory
上記ログ大量に出力(毎日約5万ほど出力)
!! MasterGFMDダウン
Nov 10 18:55:14 fsnode gfsd[16584]: [1002386] back channel disconnected
Nov 10 18:55:14 fsnode gfsd[16584]: [1000550] connecting to gfmd at
slave-gfmd:10601 failed, sleep 10 sec: connection refused
Nov 10 18:55:24 fsnode gfsd[16584]: [1000550] connecting to gfmd at
slave-gfmd:10601 failed, sleep 20 sec: connection refused
Nov 10 18:55:44 fsnode gfsd[16584]: [1000550] connecting to gfmd at
slave-gfmd:10601 failed, sleep 40 sec: connection refused
Nov 10 18:56:24 fsnode gfsd[16584]: [1000550] connecting to gfmd at
slave-gfmd:10601 failed, sleep 80 sec: connection refused
Nov 10 18:57:44 fsnode gfsd[16584]: [1000550] connecting to gfmd at
slave-gfmd:10601 failed, sleep 160 sec: connection refused
Nov 10 19:00:24 fsnode gfsd[16584]: [1000550] connecting to gfmd at
slave-gfmd:10601 failed, sleep 320 sec: connection refused
Nov 10 19:02:28 fsnode gfsd[25573]: [1002294] (root@client) gfmd
protocol: GFS_PROTO_CLOSE error on compound_begin result: unexpected EOF
Nov 10 19:02:28 fsnode gfsd[25573]: [1002294] (root@client) gfmd
protocol: closing all descriptor error on compound_begin result:
unexpected EOF
Nov 10 19:02:28 fsnode gfsd[25573]: [1000451] (root@client) disconnected
Nov 10 19:13:37 fsnode gfsd[16025]: [1002441]
/tmp/.gfarm-gfsd127.0.0.1-10600/sock: remaining socket found and removed
Nov 10 19:13:37 fsnode gfsd[16025]: [1002443]
/tmp/.gfarm-gfsd127.0.0.1-10600: remaining socket directory found and
removed
Nov 10 19:13:37 fsnode gfsd[16025]: [1002441]
/tmp/.gfarm-gfsd192.168.x.x-10600/sock: remaining socket found and removed
Nov 10 19:13:37 fsnode gfsd[16025]: [1002443]
/tmp/.gfarm-gfsd192.168.x.x-10600: remaining socket directory found and
removed
今回それらしきログすら出力されず突然ダウンし、対策どころか障害原因を突き
止める事すら難しい状態と
なっているのですがlog levelをdebugに設定しないと致命的な障害が発生した場
合でも原因を追求できないのでしょうか。
現在のgfmdのメモリ使用量ですが、
PID USER PR VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5706 root 15 43.4g 41g 1532 S 0.7 66.5 63:11 /usr/sbin/gfmd -P
/var/run/gfmd.pid -f /etc/gfmd.conf -s local4
物理メモリ使用量が41GB、障害発生時も同等の使用量でした。
各サーバのGfarmの設定を確認したところ、ファイルシステムノードの4台の内の
1台で、
metadb_server_hostをmaster gfmdではなくslave gfmdを指定していました。接
続状態を確認したところ、
# netstat -a | grep 10601
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 fsnode:51806 master-gfmd:10601 ESTABLISHED
...
master gfmdに対して接続を行なっているようには見受けられるのですが、この
設定により
どのような影響が考えられるでしょうか。今回のgfmdダウンへの一因になってい
ますでしょうか。
又、gfarmで現在の書き込み数や読み込み数等の動的な値を確認する術はありま
すでしょうか。
最後に障害が発生しgfmdがダウンした場合、gfarm2fsでmountしている各クライ
アントの
接続を指定した秒数以上通信が行われていない場合エラーを返しmountを解除す
る等で
クライアントの一時的な高負荷を避ける方法はありますでしょうか。
よろしくお願いいたします。