Hello,
I have a cluster of 8 machines: 7 compute nodes (g0[1-7]) and 1 management node (mgt). The management node has a shared directory, /share, which is mounted on all compute nodes over InfiniBand with RDMA. However, some clients freeze at random. After I enabled NFS client debug logging with the following commands:
Code:
rpcdebug -m rpc -s all
rpcdebug -m nfs -s all

journalctl -fl shows:
Code:
...
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=9813091 slotid=2 max_slotid=2 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply lookup: 0
Dec 31 12:37:29 g02 kernel: NFS: nfs_update_inode(0:44/118169504428 fh_crc=0x51ea6285 ct=1 info=0x427e7f)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/118169504428), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: open file(results/H2O_run-123_train.txt)
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=23814709 slotid=1 max_slotid=1 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Dec 31 12:37:29 g02 kernel: NFS reply lookup: -2
Dec 31 12:37:29 g02 kernel: NFS: dentry_delete(intel64_lin/x86_64, 80c)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=9813091 slotid=2 max_slotid=2 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply lookup: 0
Dec 31 12:37:29 g02 kernel: NFS: nfs_update_inode(0:44/118169504428 fh_crc=0x51ea6285 ct=1 info=0x427e7f)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/118169504428), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: open file(results/H2O_run-123_train.txt)
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=23814709 slotid=1 max_slotid=1 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Dec 31 12:37:29 g02 kernel: NFS reply lookup: -2
Dec 31 12:37:29 g02 kernel: NFS: dentry_delete(intel64_lin/x86_64, 80c)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: atomic_open(0:44/40808747028), libc.so.6
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=53210473 slotid=0 max_slotid=0 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
...

(mgt-ib is the address of the management node.)

Restarting the network with
Code:
systemctl restart networking

does not help. The only ways I have found to fix the problem are to reboot the node, or to kill the tasks on the node and remount the shared directory.

Here is my nfs.conf on the management node:
Code:
#
# This is a general configuration for the
# NFS daemons and tools
#
[general]
pipefs-directory=/run/rpc_pipefs
#
[nfsrahead]
# nfs=15000
# nfs4=16000
#
[exports]
# rootdir=/export
#
[exportfs]
# debug=0
#
[gssd]
# verbosity=0
# rpc-verbosity=0
# use-memcache=0
# use-machine-creds=1
# use-gss-proxy=0
# avoid-dns=1
# limit-to-legacy-enctypes=0
# context-timeout=0
# rpc-timeout=5
# keytab-file=/etc/krb5.keytab
# cred-cache-directory=
# preferred-realm=
# set-home=1
# upcall-timeout=30
# cancel-timed-out-upcalls=0
#
[lockd]
# port=0
# udp-port=0
#
[exportd]
# debug="all|auth|call|general|parse"
# manage-gids=n
# state-directory-path=/var/lib/nfs
# threads=1
# cache-use-ipaddr=n
# ttl=1800
[mountd]
# debug="all|auth|call|general|parse"
manage-gids=y
# descriptors=0
# port=0
# threads=1
# reverse-lookup=n
# state-directory-path=/var/lib/nfs
# ha-callout=
# cache-use-ipaddr=n
# ttl=1800
#
[nfsdcld]
# debug=0
# storagedir=/var/lib/nfs/nfsdcld
#
[nfsdcltrack]
# debug=0
# storagedir=/var/lib/nfs/nfsdcltrack
#
[nfsd]
# debug=0
threads=16
# host=
# port=0
# grace-time=90
# lease-time=90
udp=y
# tcp=y
# vers3=y
# vers4=y
# vers4.0=y
# vers4.1=y
# vers4.2=y
rdma=y
rdma-port=20049
[statd]
# debug=0
# port=0
# outgoing-port=0
# name=
# state-directory-path=/var/lib/nfs/statd
# ha-callout=
# no-notify=0
#
[sm-notify]
# debug=0
# force=0
# retry-time=900
# outgoing-port=
# outgoing-addr=
# lift-grace=y
#
[svcgssd]
# principal=

The mount option is:
Code:
mount -o rdma,port=20049 mgt-ib:/share /share

cat /etc/fstab on g02 gives:
Code:
mgt-ib:/share /share nfs4 rw,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=172.16.7.2,local_lock=none,addr=172.16.7.200 0 0

cat /etc/exports on the management node:
Code:
/share 172.16.7.0/24(rw,sync,no_subtree_check,insecure,no_root_squash)

nfsstat -s on the management node:
Code:
Server rpc stats:
calls      badcalls   badfmt     badauth    badclnt
340861724  0          0          0          0

Server nfs v4:
null            compound
27 0%           340876261 99%

Server nfs v4 operations:
op0-unused      op1-unused      op2-future      access          close
0 0%            0 0%            0 0%            94072660 7%     93096428 6%
commit          create          delegpurge      delegreturn     getattr
8524 0%         1382 0%         0 0%            92195288 6%     232114827 17%
getfh           link            lock            lockt           locku
21463569 1%     0 0%            1452 0%         0 0%            947 0%
lookup          lookup_root     nverify         open            openattr
11986903 0%     0 0%            0 0%            93272170 6%     0 0%
open_conf       open_dgrd       putfh           putpubfh        putrootfh
0 0%            34 0%           338905383 25%   0 0%            35 0%
read            readdir         readlink        remove          rename
11054156 0%     692804 0%       74575 0%        9568 0%         1600 0%
renew           restorefh       savefh          secinfo         setattr
0 0%            0 0%            1848 0%         0 0%            293884 0%
setcltid        setcltidconf    verify          write           rellockowner
0 0%            0 0%            0 0%            10738467 0%     0 0%
bc_ctl          bind_conn       exchange_id     create_ses      destroy_ses
0 0%            4 0%            56 0%           36 0%           22 0%
free_stateid    getdirdeleg     getdevinfo      getdevlist      layoutcommit
594 0%          0 0%            0 0%            0 0%            0 0%
layoutget       layoutreturn    secinfononam    sequence        set_ssv
0 0%            0 0%            1 0%            341016756 25%   0 0%
test_stateid    want_deleg      destroy_clid    reclaim_comp    allocate
2102813 0%      0 0%            15 0%           29 0%           0 0%
copy            copy_notify     deallocate      ioadvise        layouterror
247 0%          0 0%            0 0%            0 0%            0 0%
layoutstats     offloadcancel   offloadstatus   readplus        seek
0 0%            0 0%            0 0%            0 0%            162 0%
write_same
0 0%

and nfsstat -c on g02 (after remount):
Code:
Client rpc stats:
calls      retrans    authrefrsh
89832614   0          89830228

Client nfs v4:
null              read              write             commit            open
5 0%              3607936 4%        381364 0%         6036 0%           3380554 3%
open_conf         open_noat         open_dgrd         close             setattr
0 0%              24548294 27%      0 0%              27912591 31%      153 0%
fsinfo            renew             setclntid         confirm           lock
12 0%             0 0%              0 0%              0 0%              16 0%
lockt             locku             access            getattr           lookup
0 0%              15 0%             28497 0%          175939 0%         2094388 2%
lookup_root       remove            rename            link              symlink
4 0%              726 0%            102 0%            0 0%              0 0%
create            pathconf          statfs            readlink          readdir
53 0%             8 0%              0 0%              121 0%            3325 0%
server_caps       delegreturn       getacl            setacl            fs_locations
20 0%             27658681 30%      0 0%              0 0%              0 0%
rel_lkowner       secinfo           fsid_present      exchange_id       create_session
0 0%              0 0%              0 0%              9 0%              6 0%
destroy_session   sequence          get_lease_time    reclaim_comp      layoutget
4 0%              615 0%            1 0%              5 0%              0 0%
getdevinfo        layoutcommit      layoutreturn      secinfo_no        test_stateid
0 0%              0 0%              0 0%              0 0%              34034 0%
free_stateid      getdevicelist     bind_conn_to_ses  destroy_clientid  seek
18 0%             0 0%              0 0%              3 0%              0 0%
allocate          deallocate        layoutstats       clone
0 0%              0 0%              0 0%              0 0%

I only installed the following packages to enable InfiniBand: rdma-core, infiniband-diags, ibutils, opensm. All nodes run the same system:
Code:
Linux version 6.1.0-26-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30)

How can I solve this? Thank you.

Statistics: Posted by nahso4 — 2024-12-31 05:43 — Replies 1 — Views 50
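
For reference, the remount workaround mentioned in the post (kill the tasks using the share, then remount it) could be sketched roughly as below. This is only a sketch under assumptions: the function name `remount_share` and the `DRY_RUN` variable are hypothetical names introduced here for illustration, `fuser` is assumed to be installed (psmisc), and the commands are assumed to run as root. The mount options are the ones quoted in the post. On a hung mount, `fuser` or `umount` may themselves block, so this is not guaranteed to recover a frozen client.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): kill whatever is using
# the share, force-unmount it, and mount it again with the posted options.
# Set DRY_RUN=1 to print the commands instead of running them.
remount_share() {
    mountpoint=${1:-/share}          # local mount point from the post
    server=${2:-mgt-ib:/share}       # export on the management node
    run=${DRY_RUN:+echo}             # "echo" in dry-run mode, empty otherwise

    $run fuser -km "$mountpoint"     # kill processes holding files on the mount
    $run umount -f "$mountpoint"     # force the unmount
    $run mount -o rdma,port=20049 "$server" "$mountpoint"
}
```

Running `DRY_RUN=1 remount_share` first shows the three commands without touching the mount.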