Channel: Debian User Forums

General Questions • [Software] NFS client mounted over InfiniBand freezes periodically

Hello,
I have a cluster of 8 machines: 7 compute nodes (g0[1-7]) and 1 management node (mgt). The management node has a shared directory, /share, which is mounted on all compute nodes over InfiniBand with RDMA. However, some clients freeze at random. After I enabled NFS client logging with the following:

Code:

rpcdebug -m rpc -s all
rpcdebug -m nfs -s all
journalctl -fl shows:

Code:

...
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=9813091 slotid=2 max_slotid=2 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply lookup: 0
Dec 31 12:37:29 g02 kernel: NFS: nfs_update_inode(0:44/118169504428 fh_crc=0x51ea6285 ct=1 info=0x427e7f)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/118169504428), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: open file(results/H2O_run-123_train.txt)
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=23814709 slotid=1 max_slotid=1 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Dec 31 12:37:29 g02 kernel: NFS reply lookup: -2
Dec 31 12:37:29 g02 kernel: NFS: dentry_delete(intel64_lin/x86_64, 80c)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=9813091 slotid=2 max_slotid=2 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply lookup: 0
Dec 31 12:37:29 g02 kernel: NFS: nfs_update_inode(0:44/118169504428 fh_crc=0x51ea6285 ct=1 info=0x427e7f)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/118169504428), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: open file(results/H2O_run-123_train.txt)
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=23814709 slotid=1 max_slotid=1 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Dec 31 12:37:29 g02 kernel: NFS reply lookup: -2
Dec 31 12:37:29 g02 kernel: NFS: dentry_delete(intel64_lin/x86_64, 80c)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: atomic_open(0:44/40808747028), libc.so.6
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=53210473 slotid=0 max_slotid=0 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
...
(mgt-ib is the address of the management node.)
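
For reference, the flags=0x00000040 value that nfs41_handle_sequence_flag_errors keeps reporting can be decoded bit by bit. The following is only a minimal sketch: the function name decode_seq4_flags is hypothetical, and the bit values are the SEQ4_STATUS_* constants defined for the SEQUENCE operation in RFC 8881. It does not explain why the server sets the flag, only which flag it is.

```shell
#!/bin/sh
# Decode the SEQUENCE status flags that the NFSv4.1 client logs as "flags=0x...".
# decode_seq4_flags is a hypothetical helper name; the bit values below are the
# SEQ4_STATUS_* constants from RFC 8881 (SEQUENCE operation).
decode_seq4_flags() {
    flags=$(( $1 ))    # accepts hex input such as 0x00000040
    while read -r bit name; do
        # Print the name of every bit that is set in the flags word.
        if [ $(( flags & $bit )) -ne 0 ]; then
            echo "$name"
        fi
    done <<'EOF'
0x00000001 SEQ4_STATUS_CB_PATH_DOWN
0x00000002 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING
0x00000004 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED
0x00000008 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED
0x00000010 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
0x00000020 SEQ4_STATUS_ADMIN_STATE_REVOKED
0x00000040 SEQ4_STATUS_RECALLABLE_STATE_REVOKED
0x00000080 SEQ4_STATUS_LEASE_MOVED
0x00000100 SEQ4_STATUS_RESTART_RECLAIM_NEEDED
0x00000200 SEQ4_STATUS_CB_PATH_DOWN_SESSION
0x00000400 SEQ4_STATUS_BACKCHANNEL_FAULT
0x00000800 SEQ4_STATUS_DEVID_CHANGED
0x00001000 SEQ4_STATUS_DEVID_DELETED
EOF
}

# The value seen throughout the log above:
decode_seq4_flags 0x00000040
# prints: SEQ4_STATUS_RECALLABLE_STATE_REVOKED
```

This agrees with the "Recallable state revoked on server mgt-ib!" lines in the same log, which suggests the repeated test_stateid calls are the client checking its state after each such notification.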

Restarting the network with

Code:

systemctl restart networking
does not help. The only things that resolve the problem are rebooting the node, or killing the tasks on the node and remounting the shared directory.

Here is my nfs.conf on the management node:

Code:

#
# This is a general configuration for the
# NFS daemons and tools
#
[general]
pipefs-directory=/run/rpc_pipefs
#
[nfsrahead]
# nfs=15000
# nfs4=16000
#
[exports]
# rootdir=/export
#
[exportfs]
# debug=0
#
[gssd]
# verbosity=0
# rpc-verbosity=0
# use-memcache=0
# use-machine-creds=1
# use-gss-proxy=0
# avoid-dns=1
# limit-to-legacy-enctypes=0
# context-timeout=0
# rpc-timeout=5
# keytab-file=/etc/krb5.keytab
# cred-cache-directory=
# preferred-realm=
# set-home=1
# upcall-timeout=30
# cancel-timed-out-upcalls=0
#
[lockd]
# port=0
# udp-port=0
#
[exportd]
# debug="all|auth|call|general|parse"
# manage-gids=n
# state-directory-path=/var/lib/nfs
# threads=1
# cache-use-ipaddr=n
# ttl=1800
[mountd]
# debug="all|auth|call|general|parse"
manage-gids=y
# descriptors=0
# port=0
# threads=1
# reverse-lookup=n
# state-directory-path=/var/lib/nfs
# ha-callout=
# cache-use-ipaddr=n
# ttl=1800
#
[nfsdcld]
# debug=0
# storagedir=/var/lib/nfs/nfsdcld
#
[nfsdcltrack]
# debug=0
# storagedir=/var/lib/nfs/nfsdcltrack
#
[nfsd]
# debug=0
threads=16
# host=
# port=0
# grace-time=90
# lease-time=90
udp=y
# tcp=y
# vers3=y
# vers4=y
# vers4.0=y
# vers4.1=y
# vers4.2=y
rdma=y
rdma-port=20049
[statd]
# debug=0
# port=0
# outgoing-port=0
# name=
# state-directory-path=/var/lib/nfs/statd
# ha-callout=
# no-notify=0
#
[sm-notify]
# debug=0
# force=0
# retry-time=900
# outgoing-port=
# outgoing-addr=
# lift-grace=y
#
[svcgssd]
# principal=
The mount command is:

Code:

mount -o rdma,port=20049 mgt-ib:/share /share
cat /etc/fstab on g02 shows:

Code:

mgt-ib:/share           /share          nfs4            rw,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=172.16.7.2,local_lock=none,addr=172.16.7.200   0 0
cat /etc/exports on the management node:

Code:

/share 172.16.7.0/24(rw,sync,no_subtree_check,insecure,no_root_squash)
nfsstat -s on the management node:

Code:

Server rpc stats:
calls      badcalls   badfmt     badauth    badclnt
340861724   0          0          0          0

Server nfs v4:
null             compound
27        0%     340876261 99%

Server nfs v4 operations:
op0-unused       op1-unused       op2-future       access           close
0         0%     0         0%     0         0%     94072660  7%     93096428  6%
commit           create           delegpurge       delegreturn      getattr
8524      0%     1382      0%     0         0%     92195288  6%     232114827 17%
getfh            link             lock             lockt            locku
21463569  1%     0         0%     1452      0%     0         0%     947       0%
lookup           lookup_root      nverify          open             openattr
11986903  0%     0         0%     0         0%     93272170  6%     0         0%
open_conf        open_dgrd        putfh            putpubfh         putrootfh
0         0%     34        0%     338905383 25%     0         0%     35        0%
read             readdir          readlink         remove           rename
11054156  0%     692804    0%     74575     0%     9568      0%     1600      0%
renew            restorefh        savefh           secinfo          setattr
0         0%     0         0%     1848      0%     0         0%     293884    0%
setcltid         setcltidconf     verify           write            rellockowner
0         0%     0         0%     0         0%     10738467  0%     0         0%
bc_ctl           bind_conn        exchange_id      create_ses       destroy_ses
0         0%     4         0%     56        0%     36        0%     22        0%
free_stateid     getdirdeleg      getdevinfo       getdevlist       layoutcommit
594       0%     0         0%     0         0%     0         0%     0         0%
layoutget        layoutreturn     secinfononam     sequence         set_ssv
0         0%     0         0%     1         0%     341016756 25%     0         0%
test_stateid     want_deleg       destroy_clid     reclaim_comp     allocate
2102813   0%     0         0%     15        0%     29        0%     0         0%
copy             copy_notify      deallocate       ioadvise         layouterror
247       0%     0         0%     0         0%     0         0%     0         0%
layoutstats      offloadcancel    offloadstatus    readplus         seek
0         0%     0         0%     0         0%     0         0%     162       0%
write_same
0         0%
and nfsstat -c on g02 (after remount):

Code:

Client rpc stats:
calls      retrans    authrefrsh
89832614   0          89830228

Client nfs v4:
null             read             write            commit           open
5         0%     3607936   4%     381364    0%     6036      0%     3380554   3%
open_conf        open_noat        open_dgrd        close            setattr
0         0%     24548294 27%     0         0%     27912591 31%     153       0%
fsinfo           renew            setclntid        confirm          lock
12        0%     0         0%     0         0%     0         0%     16        0%
lockt            locku            access           getattr          lookup
0         0%     15        0%     28497     0%     175939    0%     2094388   2%
lookup_root      remove           rename           link             symlink
4         0%     726       0%     102       0%     0         0%     0         0%
create           pathconf         statfs           readlink         readdir
53        0%     8         0%     0         0%     121       0%     3325      0%
server_caps      delegreturn      getacl           setacl           fs_locations
20        0%     27658681 30%     0         0%     0         0%     0         0%
rel_lkowner      secinfo          fsid_present     exchange_id      create_session
0         0%     0         0%     0         0%     9         0%     6         0%
destroy_session  sequence         get_lease_time   reclaim_comp     layoutget
4         0%     615       0%     1         0%     5         0%     0         0%
getdevinfo       layoutcommit     layoutreturn     secinfo_no       test_stateid
0         0%     0         0%     0         0%     0         0%     34034     0%
free_stateid     getdevicelist    bind_conn_to_ses destroy_clientid seek
18        0%     0         0%     0         0%     3         0%     0         0%
allocate         deallocate       layoutstats      clone
0         0%     0         0%     0         0%     0         0%
I only installed the following packages to enable InfiniBand: rdma-core, infiniband-diags, ibutils, opensm. All nodes run the same system:

Code:

Linux version 6.1.0-26-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30)
How can I solve this? Thank you.

Statistics: Posted by nahso4 — 2024-12-31 05:43 — Replies 1 — Views 50


