PSI run fails #7

Open
bzzbzz7 opened this issue Dec 26, 2024 · 13 comments

bzzbzz7 commented Dec 26, 2024

alice acts as the server and bob as the client.
Version: easy-psi:0.3.0beta
docker: 20.10

alice log

[2024-12-26 10:36:53.978] [info] [main.cc:43] SecretFlow PSI Library v0.3.0beta Copyright 2023 Ant Group Co., Ltd.
[2024-12-26 10:36:53.983] [info] [main.cc:55] Kuscia task id: noebdjuz
I1226 10:36:54.031548  1094 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=31371.
W1226 10:36:54.031599  1094 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
[413.762]       perfetto.cc:45899 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024 KB, total sessions:1, uid:0 session name: ""
[2024-12-26 10:36:54.040] [info] [launch.cc:119] PSI config: {"protocol_config":{"protocol":"PROTOCOL_KKRT","role":"ROLE_SENDER","ecdh_config":{"curve":"CURVE_FOURQ"},"kkrt_config":{"bucket_size":"1048576"},"rr22_config":{"bucket_size":"1048576"}},"input_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/alice.csv"},"output_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/result/noebdjuz/"},"keys":["id1"],"skip_duplicates_check":true,"recovery_config":{"folder":"/home/kuscia/var/storage/data/tmp/noebdjuz/"},"left_side":"ROLE_RECEIVER"}
[2024-12-26 10:36:54.040] [info] [sender.cc:41] [KkrtPsiSender::Init] start
[2024-12-26 10:36:54.040] [info] [interface.cc:78] [AbstractPsiParty::Init] start

bob log

[2024-12-26 10:36:56.709] [info] [main.cc:43] SecretFlow PSI Library v0.3.0beta Copyright 2023 Ant Group Co., Ltd.
[2024-12-26 10:36:56.713] [info] [main.cc:55] Kuscia task id: noebdjuz
I1226 10:36:56.753854  1102 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=23750.
W1226 10:36:56.753904  1102 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
[416.494]       perfetto.cc:45899 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024 KB, total sessions:1, uid:0 session name: ""
[2024-12-26 10:36:56.772] [info] [launch.cc:119] PSI config: {"protocol_config":{"protocol":"PROTOCOL_KKRT","role":"ROLE_RECEIVER","ecdh_config":{"curve":"CURVE_FOURQ"},"kkrt_config":{"bucket_size":"1048576"},"rr22_config":{"bucket_size":"1048576"}},"input_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/bob.csv"},"output_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/result/noebdjuz/test1_result.csv"},"keys":["id2"],"skip_duplicates_check":true,"recovery_config":{"folder":"/home/kuscia/var/storage/data/tmp/noebdjuz/"},"left_side":"ROLE_RECEIVER"}
[2024-12-26 10:36:56.772] [info] [receiver.cc:37] [KkrtPsiReceiver::Init] start
[2024-12-26 10:36:56.772] [info] [interface.cc:78] [AbstractPsiParty::Init] start
[2024-12-26 10:37:37.450] [info] [main.cc:43] SecretFlow PSI Library v0.3.0beta Copyright 2023 Ant Group Co., Ltd.
[2024-12-26 10:37:37.453] [info] [main.cc:55] Kuscia task id: noebdjuz
I1226 10:37:37.491086  1159 external/com_github_brpc_brpc/src/brpc/server.cpp:1158] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=23750.
W1226 10:37:37.491119  1159 external/com_github_brpc_brpc/src/brpc/server.cpp:1164] Builtin services are disabled according to ServerOptions.has_builtin_services
[457.222]       perfetto.cc:45899 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024 KB, total sessions:1, uid:0 session name: ""
[2024-12-26 10:37:37.504] [info] [launch.cc:119] PSI config: {"protocol_config":{"protocol":"PROTOCOL_KKRT","role":"ROLE_RECEIVER","ecdh_config":{"curve":"CURVE_FOURQ"},"kkrt_config":{"bucket_size":"1048576"},"rr22_config":{"bucket_size":"1048576"}},"input_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/bob.csv"},"output_config":{"type":"IO_TYPE_FILE_CSV","path":"/home/kuscia/var/storage/data/result/noebdjuz/test1_result.csv"},"keys":["id2"],"skip_duplicates_check":true,"recovery_config":{"folder":"/home/kuscia/var/storage/data/tmp/noebdjuz/"},"left_side":"ROLE_RECEIVER"}
[2024-12-26 10:37:37.504] [info] [receiver.cc:37] [KkrtPsiReceiver::Init] start
[2024-12-26 10:37:37.504] [info] [interface.cc:78] [AbstractPsiParty::Init] start
[2024-12-26 10:37:43.206] [info] [interface.cc:136] [AbstractPsiParty::Init][Check csv pre-process] start
[2024-12-26 10:37:43.255] [info] [interface.cc:145] [AbstractPsiParty::Init][Check csv pre-process] end
[2024-12-26 10:37:43.276] [info] [interface.cc:183] [AbstractPsiParty::Init] end
[2024-12-26 10:37:43.276] [info] [receiver.cc:42] [KkrtPsiReceiver::Init] end
[2024-12-26 10:37:43.276] [info] [receiver.cc:47] [KkrtPsiReceiver::PreProcess] start
[2024-12-26 10:37:43.276] [info] [bucket_psi.cc:514] psi protocol=2, rank=0 item_size=9892
[2024-12-26 10:37:43.276] [info] [bucket_psi.cc:514] psi protocol=2, rank=1 item_size=9892
[2024-12-26 10:37:43.309] [info] [arrow_csv_batch_provider.cc:75] Reach the end of csv file /home/kuscia/var/storage/data/bob.csv.
[2024-12-26 10:37:43.310] [info] [arrow_csv_batch_provider.cc:75] Reach the end of csv file /home/kuscia/var/storage/data/bob.csv.
[2024-12-26 10:37:43.360] [info] [receiver.cc:86] [KkrtPsiReceiver::PreProcess] end
[2024-12-26 10:37:43.360] [info] [receiver.cc:91] [KkrtPsiReceiver::Online] start
[2024-12-26 10:37:43.373] [info] [bucket.cc:37] psi protocol=2, rank=0, inputs_size=9892
[2024-12-26 10:37:43.373] [info] [bucket.cc:37] psi protocol=2, rank=1, inputs_size=9892
[2024-12-26 10:37:43.373] [info] [bucket.cc:50] run psi bucket_idx=0, bucket_item_size=9892 
[2024-12-26 10:37:43.373] [info] [thread_pool.cc:30] Create a fixed thread pool with size 15
[2024-12-26 10:37:43.619] [info] [receiver.cc:161] [KkrtPsiReceiver::Online] end
[2024-12-26 10:37:43.619] [info] [receiver.cc:166] [KkrtPsiReceiver::PostProcess] start
[2024-12-26 10:37:43.619] [info] [receiver.cc:176] [KkrtPsiReceiver::PostProcess] end
[2024-12-26 10:37:43.619] [info] [interface.cc:188] [AbstractPsiParty::Finalize] start
[2024-12-26 10:37:43.619] [info] [interface.cc:202] [AbstractPsiParty::Finalize][Generate result] start
[2024-12-26 10:37:43.626] [info] [key.cc:91] Executing sort scripts: tail -n +2 /tmp/psi_index_59d7e26c-c1b7-4ba7-a157-2ce74393c4c7.csv | LC_ALL=C sort -n --parallel=16 --buffer-size=1G --stable --field-separator=, --key=1,1  >>/tmp/sorted_psi_index_17fb36b9-21d3-47f0-9dc2-afdef6fb4137.csv
[2024-12-26 10:37:43.778] [info] [key.cc:93] Finished sort scripts: tail -n +2 /tmp/psi_index_59d7e26c-c1b7-4ba7-a157-2ce74393c4c7.csv | LC_ALL=C sort -n --parallel=16 --buffer-size=1G --stable --field-separator=, --key=1,1  >>/tmp/sorted_psi_index_17fb36b9-21d3-47f0-9dc2-afdef6fb4137.csv, ret=0
[2024-12-26 10:37:43.813] [info] [key.cc:91] Executing sort scripts: tail -n +2 /home/kuscia/var/storage/data/result/noebdjuz/tmp-sort-in-113a56f0-99ef-4fab-a727-ab8e6feda902 | LC_ALL=C sort  --parallel=16 --buffer-size=1G --stable --field-separator=, --key=1,1  >>/home/kuscia/var/storage/data/result/noebdjuz/tmp-sort-out-113a56f0-99ef-4fab-a727-ab8e6feda902
[2024-12-26 10:37:43.910] [info] [key.cc:93] Finished sort scripts: tail -n +2 /home/kuscia/var/storage/data/result/noebdjuz/tmp-sort-in-113a56f0-99ef-4fab-a727-ab8e6feda902 | LC_ALL=C sort  --parallel=16 --buffer-size=1G --stable --field-separator=, --key=1,1  >>/home/kuscia/var/storage/data/result/noebdjuz/tmp-sort-out-113a56f0-99ef-4fab-a727-ab8e6feda902, ret=0
[2024-12-26 10:37:43.913] [info] [interface.cc:218] [AbstractPsiParty::Finalize][Generate result] end
[2024-12-26 10:37:43.914] [info] [interface.cc:250] [AbstractPsiParty::Finalize] end
[2024-12-26 10:37:43.919] [info] [launch.cc:95] Trace has been written to /tmp/psi_9e1c74a8-fe49-4161-b566-0b6885831023.trace.
[463.642]       perfetto.cc:47470 Tracing session 1 ended, total sessions:0
[2024-12-26 10:37:43.922] [info] [main.cc:117] Report: {"original_count":"9892","intersection_count":"9892"}
[2024-12-26 10:37:43.922] [info] [main.cc:118] Thank you for trusting SecretFlow PSI Library.
I1226 10:37:43.922899  1159 external/com_github_brpc_brpc/src/brpc/server.cpp:1218] Server[yacl::link::transport::internal::ReceiverServiceImpl] is going to quit
[2024-12-26 10:37:43.923] [warning] [channel.h:160] Channel destructor is called before WaitLinkTaskFinish, try stop send thread
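
In the logs above, bob's second run ends with a `Report:` line, while alice's log stops at `[AbstractPsiParty::Init] start`. A minimal sketch for pulling the report counts out of a saved log, assuming the `Report:` line format shown above (`psi_report` and `bob.log` are hypothetical names, not part of the PSI library):

```shell
#!/bin/sh
# Read a PSI log on stdin and print the counts from its "Report:" line.
# Prints nothing when the log has no such line, which suggests that
# the party never reached the end of the run.
psi_report() {
    sed -n 's/.*Report: {"original_count":"\([0-9]*\)","intersection_count":"\([0-9]*\)"}.*/original=\1 intersection=\2/p'
}

# Usage (bob.log is a hypothetical file holding the log above):
#   psi_report < bob.log
```

Run against bob's log this would print `original=9892 intersection=9892`; run against alice's log it would print nothing.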

kj (KusciaJob) status

Status:
  Approve Status:
    Alice:          JobAccepted
    Bob:            JobAccepted
  Completion Time:  2024-12-26T02:37:45Z
  Conditions:
    Last Transition Time:  2024-12-26T02:36:47Z
    Status:                True
    Type:                  JobValidated
  Last Reconcile Time:     2024-12-26T02:37:45Z
  Phase:                   Failed
  Stage Status:
    Alice:     JobCreateStageSucceeded
    Bob:       JobCreateStageSucceeded
  Start Time:  2024-12-26T02:36:47Z
  Task Status:
    Noebdjuz:  Failed
Events:        <none>

kt (KusciaTask) status

Status:
  Allocated Ports:
    Domain ID:  bob
    Named Port:
      noebdjuz-0/psi:  26835
    Domain ID:         alice
    Named Port:
      noebdjuz-0/psi:  31371
  Completion Time:     2024-12-26T02:37:45Z
  Conditions:
    Last Transition Time:  2024-12-26T02:36:48Z
    Status:                True
    Type:                  ResourceCreated
    Last Transition Time:  2024-12-26T02:36:53Z
    Status:                True
    Type:                  Running
    Last Transition Time:  2024-12-26T02:37:45Z
    Status:                False
    Type:                  Success
  Last Reconcile Time:     2024-12-26T02:37:45Z
  Message:                 The remaining no-failed party task counts 1 are less than the threshold 2 that meets the conditions for task success. pending party[], running party[bob], successful party[], failed party[alice]
  Party Task Status:
    Domain ID:  bob
    Phase:      Failed
    Domain ID:  alice
    Phase:      Failed
  Phase:        Failed
  Pod Statuses:
    alice/noebdjuz-0:
      Create Time:      2024-12-26T02:36:48Z
      Namespace:        alice
      Node Name:        4d7621b26247
      Pod Name:         noebdjuz-0
      Pod Phase:        Failed
      Ready Time:       2024-12-26T02:36:53Z
      Reason:           ContainerStatusUnknown
      Start Time:       2024-12-26T02:36:50Z
      Termination Log:  container[secretflow] terminated state reason "ContainerStatusUnknown", message: "The container could not be located when the pod was terminated"
  Service Statuses:
    alice/noebdjuz-0-psi:
      Create Time:   2024-12-26T02:36:48Z
      Namespace:     alice
      Port Name:     psi
      Port Number:   31371
      Ready Time:    2024-12-26T02:36:53Z
      Scope:         Cluster
      Service Name:  noebdjuz-0-psi
  Start Time:        2024-12-26T02:36:48Z
Events:              <none>

alice pod log

Error from server: Get "https://172.17.0.2:10250/containerLogs/alice/noebdjuz-0/secretflow?follow=true": proxy error from 0.0.0.0:6443 while dialing 172.17.0.2:10250, code 502: 502 Bad Gateway

6fj (Member) commented Dec 26, 2024

What does the deployment environment for alice and bob look like? Is alice still working normally at the moment?

bzzbzz7 changed the title from "PSI run reports an error" to "PSI run fails" on Dec 26, 2024

bzzbzz7 (Author) commented Dec 26, 2024

alice and bob are deployed with Docker, using the default configuration.

docker run -itd --init --name=ezpsi-alice \
      --volume=${volume_data_path}:/app/data \
      --volume=${volume_data_path}:/home/kuscia/var/storage/data \
      --volume=${volume_log_path}/pods:/home/kuscia/var/stdout/pods \
      --volume=${volume_log_path}/kuscia:/home/kuscia/var/logs \
      --volume=${volume_log_path}/easypsi:/app/log/easypsi \
      --volume=${volume_log_path}/pods:/app/log/pods \
      --volume=${volume_pad_config_path}:/app/config \
      --volume=${volume_pad_db_path}:/app/db \
      --volume=${volume_pad_script_path}:/app/tmp/scripts \
      --workdir=/home/kuscia \
      -p ${web_port}:8080 \
      -p ${kuscia_port}:1080 \
      -e NODE_ID=${node_id} \
      -e HOST_PATH=${volume_data_path} \
      ${EASYPSI_IMAGE}
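
The command above mounts `${volume_log_path}/pods` to `/home/kuscia/var/stdout/pods`, so task pod logs may still be readable from the host when fetching them through `kubectl` returns the 502 error shown earlier. A minimal sketch, assuming the log files sit somewhere under that directory, end in `.log`, and have the task id in their path (the layout is an assumption, and `task_logs` is a hypothetical helper):

```shell
#!/bin/sh
# List log files under a pods log directory whose path contains a task id.
# $1: host directory mounted to /home/kuscia/var/stdout/pods
# $2: Kuscia task id, for example noebdjuz
task_logs() {
    find "$1" -type f -path "*$2*" -name '*.log' | sort
}

# Usage on the host:
#   task_logs "${volume_log_path}/pods" noebdjuz
```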

bzzbzz7 (Author) commented Dec 26, 2024

After a retry it succeeded again, which is strange.

bzzbzz7 (Author) commented Dec 26, 2024

After creating a new PSI task, the following error appeared:

[root@115472c330ed kuscia]# kubectl get pod -A
NAMESPACE   NAME         READY   STATUS    RESTARTS   AGE
bob         awdtmyxq-0   0/1     Pending   0          4m50s

[root@115472c330ed kuscia]# kubectl -n bob describe po awdtmyxq-0
Name:             awdtmyxq-0
Namespace:        bob
Priority:         0
Service Account:  default
Node:             <none>
Labels:           kuscia.secretflow/communication-role-client=true
                  kuscia.secretflow/communication-role-server=true
                  kuscia.secretflow/controller=kusciatask
                  kuscia.secretflow/pod-identity=94c9b94f-db97-45dd-9c23-c8e7630e708f-0
                  kuscia.secretflow/pod-role=
                  kuscia.secretflow/task-resource-uid=a52600a9-f70f-460b-85a6-2d96b5a27210
                  kuscia.secretflow/task-uid=94c9b94f-db97-45dd-9c23-c8e7630e708f
Annotations:      kuscia.secretflow/config-template-volumes: config-template
                  kuscia.secretflow/initiator: bob
                  kuscia.secretflow/task-id: awdtmyxq
                  kuscia.secretflow/task-resource: awdtmyxq-6e49cab20cd6
                  kuscia.secretflow/task-resource-group: awdtmyxq
Status:           Pending
IP:               
IPs:              <none>
Containers:
  secretflow:
    Image:      secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/psi-anolis8:0.3.0beta
    Port:       25719/TCP
    Host Port:  0/TCP
    Command:
      sh
    Args:
      -c
      /root/main --kuscia /etc/kuscia/task-config.conf
    Environment:
      KUSCIA_DOMAIN_ID:        bob
      TASK_ID:                 awdtmyxq
      TASK_CLUSTER_DEFINE:     {"parties":[{"name":"alice", "role":"", "services":[{"portName":"psi", "endpoints":["awdtmyxq-0-psi.alice.svc"]}]}, {"name":"bob", "role":"", "services":[{"portName":"psi", "endpoints":["awdtmyxq-0-psi.bob.svc"]}]}], "selfPartyIdx":1, "selfEndpointIdx":0}
      ALLOCATED_PORTS:         {"ports":[{"name":"psi", "port":25719, "scope":"Cluster", "protocol":"HTTP"}]}
      TASK_INPUT_CONFIG:       {
                                 "sf_psi_config_map": {
                                   "bob": {
                                     "link_config": {
                                       "recv_timeout_ms": "30000",
                                       "http_timeout_ms": 30000
                                     },
                                     "psi_config": {
                                       "protocol_config": {
                                         "protocol": "PROTOCOL_RR22",
                                         "role": "ROLE_SENDER",
                                         "ecdh_config": {
                                           "curve": "CURVE_FOURQ"
                                         },
                                         "kkrt_config": {
                                           "bucket_size": "1048576"
                                         },
                                         "rr22_config": {
                                           "bucket_size": "1048576"
                                         }
                                       },
                                       "input_config": {
                                         "type": "IO_TYPE_FILE_CSV",
                                         "path": "/home/kuscia/var/storage/data/bob.csv"
                                       },
                                       "output_config": {
                                         "type": "IO_TYPE_FILE_CSV",
                                         "path": "/home/kuscia/var/storage/data/result/awdtmyxq/"
                                       },
                                       "keys": ["id2"],
                                       "skip_duplicates_check": true,
                                       "recovery_config": {
                                         "folder": "/home/kuscia/var/storage/data/tmp/awdtmyxq/"
                                       },
                                       "left_side": "ROLE_RECEIVER"
                                     }
                                   },
                                   "alice": {
                                     "link_config": {
                                       "recv_timeout_ms": "30000",
                                       "http_timeout_ms": 30000
                                     },
                                     "psi_config": {
                                       "protocol_config": {
                                         "protocol": "PROTOCOL_RR22",
                                         "role": "ROLE_RECEIVER",
                                         "ecdh_config": {
                                           "curve": "CURVE_FOURQ"
                                         },
                                         "kkrt_config": {
                                           "bucket_size": "1048576"
                                         },
                                         "rr22_config": {
                                           "bucket_size": "1048576"
                                         }
                                       },
                                       "input_config": {
                                         "type": "IO_TYPE_FILE_CSV",
                                         "path": "/home/kuscia/var/storage/data/alice.csv"
                                       },
                                       "output_config": {
                                         "type": "IO_TYPE_FILE_CSV",
                                         "path": "/home/kuscia/var/storage/data/result/awdtmyxq/test3_result.csv"
                                       },
                                       "keys": ["id1"],
                                       "skip_duplicates_check": true,
                                       "recovery_config": {
                                         "folder": "/home/kuscia/var/storage/data/tmp/awdtmyxq/"
                                       },
                                       "left_side": "ROLE_RECEIVER"
                                     }
                                   }
                                 }
                               }
      KUSCIA_PORT_PSI_NUMBER:  25719
    Mounts:
      /etc/kuscia/task-config.conf from config-template (rw,path="task-config.conf")
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  config-template:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        awdtmyxq-configtemplate
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  kuscia.secretflow/namespace=bob
Tolerations:     kuscia.secretflow/agent:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From              Message
  ----     ------            ----   ----              -------
  Warning  FailedScheduling  3m6s   kuscia-scheduler  0/1 nodes are available: failed to get task resource bob/ for pod. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling., can not find related task resource.
  Warning  FailedScheduling  2m32s  kuscia-scheduler  domain [alice] can not reserve resources for pods

wangzul commented Dec 26, 2024

Check whether disk and memory usage on the host is too high.

bzzbzz7 (Author) commented Dec 26, 2024

Inside alice's pod there is only one node:

[root@115472c330ed kuscia]# kubectl get node
NAME           STATUS   ROLES   AGE     VERSION
115472c330ed   Ready    agent   3h32m   v0.6.0b0
[root@115472c330ed kuscia]# 
[root@115472c330ed kuscia]# 
[root@115472c330ed kuscia]# kubectl describe node 115472c330ed
Name:               115472c330ed
Roles:              agent
Labels:             beta.kubernetes.io/arch=x86_64
                    beta.kubernetes.io/os=linux
                    domain=bob
                    kubernetes.io/apiVersion=0.26.6
                    kubernetes.io/arch=x86_64
                    kubernetes.io/hostname=115472c330ed
                    kubernetes.io/os=linux
                    kubernetes.io/role=agent
                    kuscia.secretflow/namespace=bob
                    kuscia.secretflow/runtime=runp
Annotations:        node.alpha.kubernetes.io/ttl: 0
CreationTimestamp:  Thu, 26 Dec 2024 10:24:11 +0800
Taints:             kuscia.secretflow/agent=v1:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  115472c330ed
  AcquireTime:     <unset>
  RenewTime:       Thu, 26 Dec 2024 13:56:57 +0800
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                     Message
  ----                 ------  -----------------                 ------------------                ------                     -------
  NetworkUnavailable   False   Thu, 26 Dec 2024 13:55:28 +0800   Thu, 26 Dec 2024 13:55:28 +0800   RouteCreated               RouteController created a route
  PIDPressure          False   Thu, 26 Dec 2024 13:55:28 +0800   Thu, 26 Dec 2024 13:55:28 +0800   AgentHasSufficientPID      Agent has sufficient PID available
  MemoryPressure       False   Thu, 26 Dec 2024 13:56:37 +0800   Thu, 26 Dec 2024 13:55:28 +0800   AgentHasSufficientMemory   Agent has sufficient memory available, total=31.2GB, available=9.2GB
  DiskPressure         False   Thu, 26 Dec 2024 13:56:37 +0800   Thu, 26 Dec 2024 13:55:28 +0800   AgentHasNoDiskPressure     Agent has no disk pressure. @agent_volume: space=62.5GB/199.9GB(31.3%) inode=635.9k/104.9M(0.6%)
  OutOfDisk            False   Thu, 26 Dec 2024 13:56:37 +0800   Thu, 26 Dec 2024 13:55:28 +0800   AgentHasSufficientDisk     Agent has sufficient disk space available. @agent_volume: free_space=137.4GB, free_inode=104.2M
  Ready                True    Thu, 26 Dec 2024 13:56:37 +0800   Thu, 26 Dec 2024 13:55:37 +0800   AgentReady                 Agent is ready
Addresses:
  InternalIP:  172.17.0.3
Capacity:
  cpu:      16
  memory:   32724764Ki
  pods:     500
  storage:  209611780Ki
Allocatable:
  cpu:      16
  memory:   9614944Ki
  pods:     500
  storage:  143666528Ki
System Info:
  Machine ID:                 367e8547-a38d-4166-b569-eca3953594fe
  System UUID:                
  Boot ID:                    1735177200-1735192528552861589
  Kernel Version:             4.18.0-305.3.1.el8.x86_64
  OS Image:                   docker://linux/anolis:8.8 (guest)
  Operating System:           linux
  Architecture:               x86_64
  Container Runtime Version:  
  Kubelet Version:            v0.6.0b0
  Kube-Proxy Version:         
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
Non-terminated Pods:          (0 in total)
  Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----    ------------  ----------  ---------------  -------------  ---
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  storage            0         0
Events:              <none>
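
The node output above reports much less allocatable memory than capacity (9614944Ki against 32724764Ki). A minimal sketch that extracts both figures from saved `kubectl describe node` output, assuming the `Capacity:` and `Allocatable:` layout shown above (`mem_summary` and `node.txt` are hypothetical names):

```shell
#!/bin/sh
# Read `kubectl describe node` output on stdin and summarize memory.
# Only the "memory:" rows under Capacity: and Allocatable: are used;
# values are expected in Ki, as in the output above.
mem_summary() {
    awk '
        /^Capacity:/    { section = "cap" }
        /^Allocatable:/ { section = "alloc" }
        /^System Info:/ { section = "" }
        $1 == "memory:" && section != "" {
            v = $2
            sub(/Ki$/, "", v)
            mem[section] = v + 0
        }
        END {
            if (mem["cap"] > 0)
                printf "allocatable=%.1fGiB capacity=%.1fGiB ratio=%d%%\n",
                    mem["alloc"] / 1048576, mem["cap"] / 1048576,
                    mem["alloc"] * 100 / mem["cap"]
        }
    '
}

# Usage (node.txt is a hypothetical file holding the output above):
#   mem_summary < node.txt
# With the figures above this prints:
#   allocatable=9.2GiB capacity=31.2GiB ratio=29%
```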

Host node:

CONTAINER ID   NAME                                  CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
115472c330ed   ezpsi-bob                             10.77%    1.938GiB / 31.21GiB   6.21%     16.5MB / 16.3MB   12.1MB / 429MB    298
4d7621b26247   ezpsi-alice                           8.88%     2.619GiB / 31.21GiB   8.39%     14.7MB / 17.2MB   903MB / 410MB     294


top - 13:59:25 up  4:19,  2 users,  load average: 2.02, 3.06, 3.08
Tasks: 542 total,   2 running, 539 sleeping,   0 stopped,   1 zombie
%Cpu(s):  4.5 us,  2.8 sy,  0.7 ni, 76.3 id,  0.0 wa,  1.1 hi,  0.6 si, 14.1 st
MiB Mem : 70.5/31957.8  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                             ]
MiB Swap:  0.0/5120.0   [                                                                                                    ]

@wangzul

wangzul commented Dec 26, 2024

df -h
free -h
Please run these on the host and share the output.

@bzzbzz7
Author

bzzbzz7 commented Dec 26, 2024

[root@localhost ~]# df -lh
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs              16G     0   16G   0% /dev
tmpfs                 16G     0   16G   0% /dev/shm
tmpfs                 16G   19M   16G   1% /run
tmpfs                 16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/cl-root   44G   22G   23G  49% /
/dev/vda1           1014M  242M  773M  24% /boot
/dev/vdb1            200G   63G  138G  32% /data
tmpfs                3.2G  1.2M  3.2G   1% /run/user/42
tmpfs                3.2G     0  3.2G   0% /run/user/0
overlay              200G   63G  138G  32% /data/docker-data/overlay2/177dc3e9510b8e819f0bc9561f3000907214fb7e67c68c740d29e9e65e8151af/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/df36456ed80e7eec1ab5061e3e1b958814e28ec2c6f348ae2d127b6380c3e505/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/e4b8c72e735083aba0172667f67371e627519b3c52749714832a32bbd934c788/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/413003c78fdd8f8f420156054cd9b208cbc6824ac3d4a08c0f46f420d59c2a4b/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/a5265c3829abb6563cc526b04f8205ec0090f0b3226fb2749a358afabb5428c8/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/b7bf6688d86bbfe4c2836aba4b7bf2186ffeeff49ba2c4a5c7a5ec8b238a5e42/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/e16a66f0dd1d0d57564603fe40d3d887e1e4dd57b81b1a6a84e61083e3ed1835/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/05de43c97051043dc70061a5743e6b1e895d8e58620def5e70d0e4484d99b26b/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/173c9507fc3382cdf816469044ddd533b34a726fcb640487c748146792a2976a/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/97baa635ddb2d7d42f6a363b85453f4b4bb150e3c68460bdc501590dac9d99ad/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/aee01638f05c67a1fbfd3c9a973fa5ffda45c6c32265893af25393c4a30972b2/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/03ea55d3a99a33db92a81c6ca98bdcd9a08fe5ce51c60a365b656a155274ebb4/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/d2dd967ce696a4f3841930d312686966585a0f281c6bae9f0c42bddb0e1d0c64/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/024664da584ec4291ac6d884eb22a3cf4cc77dcf3c276e983fd168d25575c667/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/f179caa4e8d6c7655564bcdc85caf156f241d47f2f3a5ce75efbdcd78d71426d/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/4ee2668dfa9c76ae84361198bcf6c516446271f0d91a2d573d38ebd1e9e61d0e/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/095c30069cd5aee5ffebb982d481e05e3a9de611e874ed42411be2e8b778cbc4/merged
overlay              200G   63G  138G  32% /data/docker-data/overlay2/851622f93558a0fbad3c9430201e7bb8d21d46fbe930cd7fae98c2c7f2ef53d2/merged
[root@localhost ~]# 
[root@localhost ~]# 
[root@localhost ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi        21Gi       2.5Gi        27Mi       7.2Gi       9.2Gi
Swap:         5.0Gi          0B       5.0Gi

@wangzul

wangzul commented Dec 26, 2024

Please provide the /home/kuscia/var/logs/kuscia.log log.
Once you have collected it, try rerunning the task to see whether the failure reproduces every time.

@bzzbzz7
Author

bzzbzz7 commented Dec 26, 2024

It succeeded after a retry. Here is the kuscia log:
kuscia.log

@wangzul

wangzul commented Dec 26, 2024

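It succeeded after a retry. Here is the kuscia log: kuscia.log

Please tell us roughly when the error occurred, so we can locate the relevant log entries.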

@bzzbzz7
Author

bzzbzz7 commented Dec 26, 2024

Around Dec 26, 13:54
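
For anyone narrowing a log down to a reported failure time like this one, a minimal sketch follows. It is not part of Kuscia or easy-psi. It assumes each relevant log line contains a timestamp of the form `YYYY-MM-DD HH:MM:SS`, which may not match the actual kuscia.log layout; the function name `lines_near` and the 5-minute window are illustrative.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layout; adjust the pattern if the real log differs.
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def lines_near(lines, center, minutes=5):
    """Return the lines whose first timestamp lies within +/- minutes of center."""
    lo = center - timedelta(minutes=minutes)
    hi = center + timedelta(minutes=minutes)
    selected = []
    for line in lines:
        match = TS_RE.search(line)
        if not match:
            continue  # no timestamp on this line, skip it
        try:
            ts = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # looked like a timestamp but is not a valid date
        if lo <= ts <= hi:
            selected.append(line)
    return selected
```

Calling `lines_near(open(path), datetime(2024, 12, 26, 13, 54))` on a log file would keep only the entries from 13:49 to 13:59 on that day.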

@wangzul

wangzul commented Dec 30, 2024

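Could you also provide /home/kuscia/var/logs/k3s.log?
In the meantime, try freeing up some memory: keep roughly 10-12 GB free, then rerun the task a few times and see whether it still fails.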
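
A minimal sketch of how the suggested headroom could be checked on the host before rerunning. It is not part of Kuscia or easy-psi. It assumes a Linux host exposing `/proc/meminfo`; the function names and the 10 GiB default are illustrative. For reference, the node's reported Allocatable memory of 9614944Ki is about 9.17 GiB, close to the 9.2Gi `available` figure in the `free -h` output, so the check would fail on the numbers posted in this thread.

```python
GIB_IN_KIB = 1024 * 1024


def parse_meminfo_kib(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values in KiB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def has_headroom(meminfo_text, required_gib=10.0):
    """True if MemAvailable is at least required_gib (hypothetical threshold)."""
    available_kib = parse_meminfo_kib(meminfo_text).get("MemAvailable", 0)
    return available_kib / GIB_IN_KIB >= required_gib


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        text = f.read()
    available_gib = parse_meminfo_kib(text).get("MemAvailable", 0) / GIB_IN_KIB
    print(f"MemAvailable: {available_gib:.2f} GiB, enough: {has_headroom(text)}")
```

`MemAvailable` is used rather than `MemFree` because it includes reclaimable page cache, which is what the `available` column of `free` reports.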
