Use infiniband #291
Comments
From your ibstat log, Port 1 is down. Please get a local expert to help you with your InfiniBand adapters and verify that your connection is correct before you try CaffeOnSpark. Since everybody's setup is different, we don't have the bandwidth to troubleshoot your hardware settings.
I met the same problem.
RDMABuffer::RDMABuffer(RDMAChannel* channel, uint8_t* addr, size_t size)
self_ = ibv_reg_mr(channel_->adapter_.pd_, addr_, size,
id_ = channel_->buffers_.size();
channel_->SendMR(self_, id_);
}
Refer to http://caffe.berkeleyvision.org/installation.html
Dec 06 05:16:08 server01 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem module to start at boot time....
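The constructor quoted above aborts when `ibv_reg_mr` returns NULL. Memory registration pins pages, so one thing worth ruling out is the locked-memory limit (`ulimit -l`) of the user running the executor. This is an assumption about a possible cause, not something the thread confirms. The helper name `memlock_allows` and the 64 MiB figure below are made up for illustration.

```python
import resource


def memlock_allows(size_bytes, limits=None):
    """Return True if a buffer of size_bytes fits under the soft RLIMIT_MEMLOCK.

    `limits` is an optional (soft, hard) pair so the check can be exercised
    without touching the real process limits; by default the current
    process limits are read.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    soft, _hard = limits
    if soft == resource.RLIM_INFINITY:
        return True
    return size_bytes <= soft


if __name__ == "__main__":
    # Hypothetical buffer size; the real size depends on the model being trained.
    wanted = 64 * 1024 * 1024
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft limit:", soft, "hard limit:", hard)
    print("64 MiB registration fits:", memlock_allows(wanted))
```

If the check reports that the buffer does not fit, raising the limit for the Spark executor user is a reasonable next experiment. If it does fit, the limit is probably not the problem and the port state remains the first thing to fix.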
I ran into a problem with my current configuration. The executor log is as follows:
I1205 17:02:12.401198 7160 layer_factory.hpp:77] Creating layer data
I1205 17:02:12.401211 7160 net.cpp:99] Creating Layer data
I1205 17:02:12.401216 7160 net.cpp:407] data -> data
I1205 17:02:12.401224 7160 net.cpp:407] data -> label
I1205 17:02:12.401321 7160 net.cpp:149] Setting up data
I1205 17:02:12.401330 7160 net.cpp:156] Top shape: 100 1 28 28 (78400)
I1205 17:02:12.401335 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401337 7160 net.cpp:164] Memory required for data: 314000
I1205 17:02:12.401341 7160 layer_factory.hpp:77] Creating layer label_data_1_split
I1205 17:02:12.401347 7160 net.cpp:99] Creating Layer label_data_1_split
I1205 17:02:12.401351 7160 net.cpp:433] label_data_1_split <- label
I1205 17:02:12.401356 7160 net.cpp:407] label_data_1_split -> label_data_1_split_0
I1205 17:02:12.401362 7160 net.cpp:407] label_data_1_split -> label_data_1_split_1
I1205 17:02:12.401396 7160 net.cpp:149] Setting up label_data_1_split
I1205 17:02:12.401402 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401407 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401409 7160 net.cpp:164] Memory required for data: 314800
I1205 17:02:12.401412 7160 layer_factory.hpp:77] Creating layer conv1
I1205 17:02:12.401422 7160 net.cpp:99] Creating Layer conv1
I1205 17:02:12.401425 7160 net.cpp:433] conv1 <- data
I1205 17:02:12.401430 7160 net.cpp:407] conv1 -> conv1
I1205 17:02:12.402066 7160 net.cpp:149] Setting up conv1
I1205 17:02:12.402081 7160 net.cpp:156] Top shape: 100 20 24 24 (1152000)
I1205 17:02:12.402084 7160 net.cpp:164] Memory required for data: 4922800
I1205 17:02:12.402097 7160 layer_factory.hpp:77] Creating layer pool1
I1205 17:02:12.402107 7160 net.cpp:99] Creating Layer pool1
I1205 17:02:12.402110 7160 net.cpp:433] pool1 <- conv1
I1205 17:02:12.402115 7160 net.cpp:407] pool1 -> pool1
I1205 17:02:12.402153 7160 net.cpp:149] Setting up pool1
I1205 17:02:12.402161 7160 net.cpp:156] Top shape: 100 20 12 12 (288000)
I1205 17:02:12.402164 7160 net.cpp:164] Memory required for data: 6074800
I1205 17:02:12.402168 7160 layer_factory.hpp:77] Creating layer conv2
I1205 17:02:12.402176 7160 net.cpp:99] Creating Layer conv2
I1205 17:02:12.402180 7160 net.cpp:433] conv2 <- pool1
I1205 17:02:12.402186 7160 net.cpp:407] conv2 -> conv2
I1205 17:02:12.403599 7160 net.cpp:149] Setting up conv2
I1205 17:02:12.403615 7160 net.cpp:156] Top shape: 100 50 8 8 (320000)
I1205 17:02:12.403620 7160 net.cpp:164] Memory required for data: 7354800
I1205 17:02:12.403630 7160 layer_factory.hpp:77] Creating layer pool2
I1205 17:02:12.403637 7160 net.cpp:99] Creating Layer pool2
I1205 17:02:12.403641 7160 net.cpp:433] pool2 <- conv2
I1205 17:02:12.403647 7160 net.cpp:407] pool2 -> pool2
I1205 17:02:12.403690 7160 net.cpp:149] Setting up pool2
I1205 17:02:12.403698 7160 net.cpp:156] Top shape: 100 50 4 4 (80000)
I1205 17:02:12.403702 7160 net.cpp:164] Memory required for data: 7674800
I1205 17:02:12.403705 7160 layer_factory.hpp:77] Creating layer ip1
I1205 17:02:12.403713 7160 net.cpp:99] Creating Layer ip1
I1205 17:02:12.403717 7160 net.cpp:433] ip1 <- pool2
I1205 17:02:12.403723 7160 net.cpp:407] ip1 -> ip1
I1205 17:02:12.406860 7160 net.cpp:149] Setting up ip1
I1205 17:02:12.406877 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.406879 7160 net.cpp:164] Memory required for data: 7874800
I1205 17:02:12.406890 7160 layer_factory.hpp:77] Creating layer relu1
I1205 17:02:12.406898 7160 net.cpp:99] Creating Layer relu1
I1205 17:02:12.406901 7160 net.cpp:433] relu1 <- ip1
I1205 17:02:12.406909 7160 net.cpp:394] relu1 -> ip1 (in-place)
I1205 17:02:12.407634 7160 net.cpp:149] Setting up relu1
I1205 17:02:12.407649 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.407654 7160 net.cpp:164] Memory required for data: 8074800
I1205 17:02:12.407657 7160 layer_factory.hpp:77] Creating layer ip2
I1205 17:02:12.407667 7160 net.cpp:99] Creating Layer ip2
I1205 17:02:12.407672 7160 net.cpp:433] ip2 <- ip1
I1205 17:02:12.407680 7160 net.cpp:407] ip2 -> ip2
I1205 17:02:12.407815 7160 net.cpp:149] Setting up ip2
I1205 17:02:12.407825 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407829 7160 net.cpp:164] Memory required for data: 8078800
I1205 17:02:12.407835 7160 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I1205 17:02:12.407840 7160 net.cpp:99] Creating Layer ip2_ip2_0_split
I1205 17:02:12.407843 7160 net.cpp:433] ip2_ip2_0_split <- ip2
I1205 17:02:12.407848 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_0
I1205 17:02:12.407856 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_1
I1205 17:02:12.407891 7160 net.cpp:149] Setting up ip2_ip2_0_split
I1205 17:02:12.407898 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407902 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407904 7160 net.cpp:164] Memory required for data: 8086800
I1205 17:02:12.407908 7160 layer_factory.hpp:77] Creating layer accuracy
I1205 17:02:12.407917 7160 net.cpp:99] Creating Layer accuracy
I1205 17:02:12.407920 7160 net.cpp:433] accuracy <- ip2_ip2_0_split_0
I1205 17:02:12.407924 7160 net.cpp:433] accuracy <- label_data_1_split_0
I1205 17:02:12.407930 7160 net.cpp:407] accuracy -> accuracy
I1205 17:02:12.407939 7160 net.cpp:149] Setting up accuracy
I1205 17:02:12.407944 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.407948 7160 net.cpp:164] Memory required for data: 8086804
I1205 17:02:12.407950 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.407954 7160 net.cpp:99] Creating Layer loss
I1205 17:02:12.407958 7160 net.cpp:433] loss <- ip2_ip2_0_split_1
I1205 17:02:12.407963 7160 net.cpp:433] loss <- label_data_1_split_1
I1205 17:02:12.407966 7160 net.cpp:407] loss -> loss
I1205 17:02:12.407972 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.408217 7160 net.cpp:149] Setting up loss
I1205 17:02:12.408229 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.408233 7160 net.cpp:159] with loss weight 1
I1205 17:02:12.408239 7160 net.cpp:164] Memory required for data: 8086808
I1205 17:02:12.408243 7160 net.cpp:225] loss needs backward computation.
I1205 17:02:12.408248 7160 net.cpp:227] accuracy does not need backward computation.
I1205 17:02:12.408252 7160 net.cpp:225] ip2_ip2_0_split needs backward computation.
I1205 17:02:12.408255 7160 net.cpp:225] ip2 needs backward computation.
I1205 17:02:12.408258 7160 net.cpp:225] relu1 needs backward computation.
I1205 17:02:12.408262 7160 net.cpp:225] ip1 needs backward computation.
I1205 17:02:12.408263 7160 net.cpp:225] pool2 needs backward computation.
I1205 17:02:12.408267 7160 net.cpp:225] conv2 needs backward computation.
I1205 17:02:12.408270 7160 net.cpp:225] pool1 needs backward computation.
I1205 17:02:12.408272 7160 net.cpp:225] conv1 needs backward computation.
I1205 17:02:12.408277 7160 net.cpp:227] label_data_1_split does not need backward computation.
I1205 17:02:12.408279 7160 net.cpp:227] data does not need backward computation.
I1205 17:02:12.408282 7160 net.cpp:269] This network produces output accuracy
I1205 17:02:12.408288 7160 net.cpp:269] This network produces output loss
I1205 17:02:12.408299 7160 net.cpp:282] Network initialization done.
I1205 17:02:12.408339 7160 solver.cpp:60] Solver scaffolding done.
I1205 17:02:12.411540 7160 CaffeNet.cpp:240] RDMA adapter: mlx5_0
I1205 17:02:12.414819 7160 CaffeNet.cpp:388] 0-th RDMA addr: 01000000360100000899f800
I1205 17:02:12.414834 7160 CaffeNet.cpp:388] 1-th RDMA addr:
I1205 17:02:12.414849 7160 JniCaffeNet.cpp:145] 0-th local addr: 01000000360100000899f800
I1205 17:02:12.414856 7160 JniCaffeNet.cpp:145] 1-th local addr:
17/12/05 17:02:12 INFO executor.Executor: Finished task 1.0 in stage 2.0 (TID 5). 931 bytes result sent to driver
17/12/05 17:02:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/12/05 17:02:12 INFO executor.Executor: Running task 1.0 in stage 3.0 (TID 7)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1565.0 B, free 18.9 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 14 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.6 KB, free 21.4 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 105.0 B, free 21.5 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 11 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 392.0 B, free 21.9 KB)
I1205 17:02:12.636529 7160 common.cpp:61] 1-th string is NULL
F1205 17:02:12.639581 7160 rdma.cpp:327] Check failed: self_ Failed to register memory region.
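The Caffe lines in the log above use the glog prefix (severity letter, month and day, time, thread id, `file:line]`), while the Spark lines use a different format. A minimal sketch for pulling out the lines that matter from a long executor log follows. The function name `suspicious_lines` and the choice of what counts as suspicious (error or fatal severity, an address line with nothing after the colon, a message containing "is NULL") are assumptions made for this example.

```python
import re

# glog prefix: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file:line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\]]+)\]\s*(?P<msg>.*)$"
)


def suspicious_lines(log_text):
    """Return (level, source, message) tuples for lines worth a closer look."""
    hits = []
    for raw in log_text.splitlines():
        match = GLOG_LINE.match(raw.strip())
        if match is None:
            continue  # Spark (log4j) lines and anything else are skipped
        level, src, msg = match.group("level", "src", "msg")
        empty_addr = re.search(r"addr:\s*$", msg) is not None
        if level in ("E", "F") or empty_addr or "is NULL" in msg:
            hits.append((level, src, msg))
    return hits


SAMPLE_LOG = """\
I1205 17:02:12.414819 7160 CaffeNet.cpp:388] 0-th RDMA addr: 01000000360100000899f800
I1205 17:02:12.414834 7160 CaffeNet.cpp:388] 1-th RDMA addr:
17/12/05 17:02:12 INFO executor.Executor: Finished task 1.0 in stage 2.0 (TID 5). 931 bytes result sent to driver
I1205 17:02:12.636529 7160 common.cpp:61] 1-th string is NULL
F1205 17:02:12.639581 7160 rdma.cpp:327] Check failed: self_ Failed to register memory region.
"""

if __name__ == "__main__":
    for level, src, msg in suspicious_lines(SAMPLE_LOG):
        print(level, src, msg)
```

On this log the filter picks out the empty `1-th RDMA addr:` line, the `1-th string is NULL` line, and the fatal check in `rdma.cpp:327`, which is the order in which they appear before the abort.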
The InfiniBand information (ibstat output) is as follows:
omnisky@slave1:~/zzh/mnist$ ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.21.1000
    Hardware version: 0
    Node GUID: 0xec0d9a0300397dc2
    System image GUID: 0xec0d9a0300397dc2
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 2
        LMC: 0
        SM lid: 2
        Capability mask: 0x2651e84a
        Port GUID: 0xec0d9a0300397dc2
        Link layer: InfiniBand
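The ibstat output above shows the port in state Down with physical state Polling, so the link is not usable yet. The same check can be scripted before a job is submitted. This is a minimal sketch that parses ibstat's text output; the function names `parse_ibstat_ports` and `ports_not_active` are made up for this example, and it assumes a usable port reports `State: Active`.

```python
def parse_ibstat_ports(text):
    """Return {(ca_name, port_number): {field: value}} from ibstat text output."""
    ports = {}
    ca, port = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            # Start of a new adapter section, e.g. CA 'mlx5_0'
            ca, port = line.split("'")[1], None
        elif line.startswith("Port ") and line.endswith(":"):
            # Port header, e.g. "Port 1:" ("Port GUID: ..." does not end with ':')
            port = int(line[len("Port "):-1])
            ports[(ca, port)] = {}
        elif port is not None and ":" in line:
            key, value = line.split(":", 1)
            ports[(ca, port)][key.strip()] = value.strip()
    return ports


def ports_not_active(text):
    """List (ca, port, state, physical_state) for every port that is not Active."""
    return [
        (ca, port, fields.get("State"), fields.get("Physical state"))
        for (ca, port), fields in parse_ibstat_ports(text).items()
        if fields.get("State") != "Active"
    ]


SAMPLE_IBSTAT = """\
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 2
Port GUID: 0xec0d9a0300397dc2
Link layer: InfiniBand
"""

if __name__ == "__main__":
    for ca, port, state, phys in ports_not_active(SAMPLE_IBSTAT):
        print(f"{ca} port {port}: State={state}, Physical state={phys}")
```

In practice the text would come from running `ibstat` on each worker node, for example through `subprocess.run(["ibstat"], capture_output=True, text=True)`. A non-empty result from `ports_not_active` means the cabling, switch, or subnet manager needs attention before CaffeOnSpark can use RDMA.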
I want to know how to make Spark use InfiniBand. Do I need to modify any configuration files, or change the InfiniBand configuration? Please help me.