This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

Use infiniband #291

Open
loveheng opened this issue Dec 5, 2017 · 2 comments

loveheng commented Dec 5, 2017

I have run into a problem with my current configuration.


I1205 17:02:12.401198 7160 layer_factory.hpp:77] Creating layer data
I1205 17:02:12.401211 7160 net.cpp:99] Creating Layer data
I1205 17:02:12.401216 7160 net.cpp:407] data -> data
I1205 17:02:12.401224 7160 net.cpp:407] data -> label
I1205 17:02:12.401321 7160 net.cpp:149] Setting up data
I1205 17:02:12.401330 7160 net.cpp:156] Top shape: 100 1 28 28 (78400)
I1205 17:02:12.401335 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401337 7160 net.cpp:164] Memory required for data: 314000
I1205 17:02:12.401341 7160 layer_factory.hpp:77] Creating layer label_data_1_split
I1205 17:02:12.401347 7160 net.cpp:99] Creating Layer label_data_1_split
I1205 17:02:12.401351 7160 net.cpp:433] label_data_1_split <- label
I1205 17:02:12.401356 7160 net.cpp:407] label_data_1_split -> label_data_1_split_0
I1205 17:02:12.401362 7160 net.cpp:407] label_data_1_split -> label_data_1_split_1
I1205 17:02:12.401396 7160 net.cpp:149] Setting up label_data_1_split
I1205 17:02:12.401402 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401407 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401409 7160 net.cpp:164] Memory required for data: 314800
I1205 17:02:12.401412 7160 layer_factory.hpp:77] Creating layer conv1
I1205 17:02:12.401422 7160 net.cpp:99] Creating Layer conv1
I1205 17:02:12.401425 7160 net.cpp:433] conv1 <- data
I1205 17:02:12.401430 7160 net.cpp:407] conv1 -> conv1
I1205 17:02:12.402066 7160 net.cpp:149] Setting up conv1
I1205 17:02:12.402081 7160 net.cpp:156] Top shape: 100 20 24 24 (1152000)
I1205 17:02:12.402084 7160 net.cpp:164] Memory required for data: 4922800
I1205 17:02:12.402097 7160 layer_factory.hpp:77] Creating layer pool1
I1205 17:02:12.402107 7160 net.cpp:99] Creating Layer pool1
I1205 17:02:12.402110 7160 net.cpp:433] pool1 <- conv1
I1205 17:02:12.402115 7160 net.cpp:407] pool1 -> pool1
I1205 17:02:12.402153 7160 net.cpp:149] Setting up pool1
I1205 17:02:12.402161 7160 net.cpp:156] Top shape: 100 20 12 12 (288000)
I1205 17:02:12.402164 7160 net.cpp:164] Memory required for data: 6074800
I1205 17:02:12.402168 7160 layer_factory.hpp:77] Creating layer conv2
I1205 17:02:12.402176 7160 net.cpp:99] Creating Layer conv2
I1205 17:02:12.402180 7160 net.cpp:433] conv2 <- pool1
I1205 17:02:12.402186 7160 net.cpp:407] conv2 -> conv2
I1205 17:02:12.403599 7160 net.cpp:149] Setting up conv2
I1205 17:02:12.403615 7160 net.cpp:156] Top shape: 100 50 8 8 (320000)
I1205 17:02:12.403620 7160 net.cpp:164] Memory required for data: 7354800
I1205 17:02:12.403630 7160 layer_factory.hpp:77] Creating layer pool2
I1205 17:02:12.403637 7160 net.cpp:99] Creating Layer pool2
I1205 17:02:12.403641 7160 net.cpp:433] pool2 <- conv2
I1205 17:02:12.403647 7160 net.cpp:407] pool2 -> pool2
I1205 17:02:12.403690 7160 net.cpp:149] Setting up pool2
I1205 17:02:12.403698 7160 net.cpp:156] Top shape: 100 50 4 4 (80000)
I1205 17:02:12.403702 7160 net.cpp:164] Memory required for data: 7674800
I1205 17:02:12.403705 7160 layer_factory.hpp:77] Creating layer ip1
I1205 17:02:12.403713 7160 net.cpp:99] Creating Layer ip1
I1205 17:02:12.403717 7160 net.cpp:433] ip1 <- pool2
I1205 17:02:12.403723 7160 net.cpp:407] ip1 -> ip1
I1205 17:02:12.406860 7160 net.cpp:149] Setting up ip1
I1205 17:02:12.406877 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.406879 7160 net.cpp:164] Memory required for data: 7874800
I1205 17:02:12.406890 7160 layer_factory.hpp:77] Creating layer relu1
I1205 17:02:12.406898 7160 net.cpp:99] Creating Layer relu1
I1205 17:02:12.406901 7160 net.cpp:433] relu1 <- ip1
I1205 17:02:12.406909 7160 net.cpp:394] relu1 -> ip1 (in-place)
I1205 17:02:12.407634 7160 net.cpp:149] Setting up relu1
I1205 17:02:12.407649 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.407654 7160 net.cpp:164] Memory required for data: 8074800
I1205 17:02:12.407657 7160 layer_factory.hpp:77] Creating layer ip2
I1205 17:02:12.407667 7160 net.cpp:99] Creating Layer ip2
I1205 17:02:12.407672 7160 net.cpp:433] ip2 <- ip1
I1205 17:02:12.407680 7160 net.cpp:407] ip2 -> ip2
I1205 17:02:12.407815 7160 net.cpp:149] Setting up ip2
I1205 17:02:12.407825 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407829 7160 net.cpp:164] Memory required for data: 8078800
I1205 17:02:12.407835 7160 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I1205 17:02:12.407840 7160 net.cpp:99] Creating Layer ip2_ip2_0_split
I1205 17:02:12.407843 7160 net.cpp:433] ip2_ip2_0_split <- ip2
I1205 17:02:12.407848 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_0
I1205 17:02:12.407856 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_1
I1205 17:02:12.407891 7160 net.cpp:149] Setting up ip2_ip2_0_split
I1205 17:02:12.407898 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407902 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407904 7160 net.cpp:164] Memory required for data: 8086800
I1205 17:02:12.407908 7160 layer_factory.hpp:77] Creating layer accuracy
I1205 17:02:12.407917 7160 net.cpp:99] Creating Layer accuracy
I1205 17:02:12.407920 7160 net.cpp:433] accuracy <- ip2_ip2_0_split_0
I1205 17:02:12.407924 7160 net.cpp:433] accuracy <- label_data_1_split_0
I1205 17:02:12.407930 7160 net.cpp:407] accuracy -> accuracy
I1205 17:02:12.407939 7160 net.cpp:149] Setting up accuracy
I1205 17:02:12.407944 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.407948 7160 net.cpp:164] Memory required for data: 8086804
I1205 17:02:12.407950 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.407954 7160 net.cpp:99] Creating Layer loss
I1205 17:02:12.407958 7160 net.cpp:433] loss <- ip2_ip2_0_split_1
I1205 17:02:12.407963 7160 net.cpp:433] loss <- label_data_1_split_1
I1205 17:02:12.407966 7160 net.cpp:407] loss -> loss
I1205 17:02:12.407972 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.408217 7160 net.cpp:149] Setting up loss
I1205 17:02:12.408229 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.408233 7160 net.cpp:159] with loss weight 1
I1205 17:02:12.408239 7160 net.cpp:164] Memory required for data: 8086808
I1205 17:02:12.408243 7160 net.cpp:225] loss needs backward computation.
I1205 17:02:12.408248 7160 net.cpp:227] accuracy does not need backward computation.
I1205 17:02:12.408252 7160 net.cpp:225] ip2_ip2_0_split needs backward computation.
I1205 17:02:12.408255 7160 net.cpp:225] ip2 needs backward computation.
I1205 17:02:12.408258 7160 net.cpp:225] relu1 needs backward computation.
I1205 17:02:12.408262 7160 net.cpp:225] ip1 needs backward computation.
I1205 17:02:12.408263 7160 net.cpp:225] pool2 needs backward computation.
I1205 17:02:12.408267 7160 net.cpp:225] conv2 needs backward computation.
I1205 17:02:12.408270 7160 net.cpp:225] pool1 needs backward computation.
I1205 17:02:12.408272 7160 net.cpp:225] conv1 needs backward computation.
I1205 17:02:12.408277 7160 net.cpp:227] label_data_1_split does not need backward computation.
I1205 17:02:12.408279 7160 net.cpp:227] data does not need backward computation.
I1205 17:02:12.408282 7160 net.cpp:269] This network produces output accuracy
I1205 17:02:12.408288 7160 net.cpp:269] This network produces output loss
I1205 17:02:12.408299 7160 net.cpp:282] Network initialization done.
I1205 17:02:12.408339 7160 solver.cpp:60] Solver scaffolding done.
I1205 17:02:12.411540 7160 CaffeNet.cpp:240] RDMA adapter: mlx5_0
I1205 17:02:12.414819 7160 CaffeNet.cpp:388] 0-th RDMA addr: 01000000360100000899f800
I1205 17:02:12.414834 7160 CaffeNet.cpp:388] 1-th RDMA addr:
I1205 17:02:12.414849 7160 JniCaffeNet.cpp:145] 0-th local addr: 01000000360100000899f800
I1205 17:02:12.414856 7160 JniCaffeNet.cpp:145] 1-th local addr:
17/12/05 17:02:12 INFO executor.Executor: Finished task 1.0 in stage 2.0 (TID 5). 931 bytes result sent to driver
17/12/05 17:02:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/12/05 17:02:12 INFO executor.Executor: Running task 1.0 in stage 3.0 (TID 7)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1565.0 B, free 18.9 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 14 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.6 KB, free 21.4 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 105.0 B, free 21.5 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 11 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 392.0 B, free 21.9 KB)
I1205 17:02:12.636529 7160 common.cpp:61] 1-th string is NULL
F1205 17:02:12.639581 7160 rdma.cpp:327] Check failed: self_ Failed to register memory region.


The InfiniBand information is as follows:


omnisky@slave1:~/zzh/mnist$ ibstat
CA 'mlx5_0'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.21.1000
	Hardware version: 0
	Node GUID: 0xec0d9a0300397dc2
	System image GUID: 0xec0d9a0300397dc2
	Port 1:
		State: Down
		Physical state: Polling
		Rate: 10
		Base lid: 2
		LMC: 0
		SM lid: 2
		Capability mask: 0x2651e84a
		Port GUID: 0xec0d9a0300397dc2
		Link layer: InfiniBand


I want to know how to make Spark use InfiniBand. Do I need to modify the configuration files, or change InfiniBand's own configuration? Please help me.

@junshi15
Collaborator

junshi15 commented Dec 5, 2017

From your ibstat log:

Port 1:
State: Down

Your port is down. Please get a local expert to help you with the InfiniBand adapters and verify that your connection is correct before you try CaffeOnSpark. Since everybody's setup is different, we don't have the bandwidth to troubleshoot your hardware settings.
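As a side note, this kind of diagnosis can be automated before launching a job. Below is a hypothetical sketch (not part of CaffeOnSpark) that parses `ibstat`-style output and reports any port whose State is not Active:

```python
def down_ports(ibstat_text: str) -> list:
    """Return (port, state) pairs for ports not in the Active state."""
    bad = []
    port = None
    for line in ibstat_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Port "):
            # Remember which port the following fields belong to.
            port = stripped.rstrip(":")
        elif stripped.startswith("State:") and port is not None:
            state = stripped.split(":", 1)[1].strip()
            if state != "Active":
                bad.append((port, state))
    return bad

sample = """CA 'mlx5_0'
Port 1:
State: Down
Physical state: Polling"""
print(down_ports(sample))  # [('Port 1', 'Down')]
```

Running this on the ibstat output from this issue would flag Port 1 immediately, before any RDMA registration is attempted.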

@mygithub20152015

mygithub20152015 commented Dec 6, 2017

I am hitting the same problem.

RDMABuffer::RDMABuffer(RDMAChannel* channel, uint8_t* addr, size_t size)
    : channel_(channel),
      addr_(addr),
      size_(size) {

  //*******************************************************
  // Case 1: with CPU memory, ibv_reg_mr() succeeds, but the code fails later:
  //   addr_ = reinterpret_cast<uint8_t*>(malloc(size));
  //
  //   http://server01:8042/node/containerlogs/container_1512543960414_0001_01_000003/root/stderr/?start=0
  //   F1206 02:14:43.892500 18704 math_functions.cu:79] Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
  //   *** Check failure stack trace: ***
  //
  // Case 2: with GPU memory, ibv_reg_mr() fails. Please help.
  //   CUDA_CHECK(cudaMalloc(&addr_, size));
  //
  //   http://server01:8042/node/containerlogs/container_1512543960414_0001_01_000003/root/stderr/?start=0
  //   F1205 17:02:12.639581 7160 rdma.cpp:327] Check failed: self_ Failed to register memory region.
  //*******************************************************

  self_ = ibv_reg_mr(channel_->adapter_.pd_, addr_, size,
                     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
  CHECK(self_) << "Failed to register memory region";

  id_ = channel_->buffers_.size();
  channel_->buffers_.push_back(this);

  channel_->SendMR(self_, id_);
  peer_ = channel_->memory_regions_queue_.pop();
}
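For case 2, ibv_reg_mr() on cudaMalloc'd memory generally requires GPUDirect RDMA support, i.e. the nv_peer_mem kernel module must be loaded on every node (see the systemctl output further down). A hypothetical pre-flight helper (not part of CaffeOnSpark) could check for the module by scanning /proc/modules-style text before attempting registration:

```python
def nv_peer_mem_loaded(modules_text: str) -> bool:
    """Return True if nv_peer_mem appears as a module name in /proc/modules-style text."""
    return any(line.split()[0] == "nv_peer_mem"
               for line in modules_text.splitlines() if line.strip())

# Example against /proc/modules-style snippets:
sample = "nv_peer_mem 16384 0 - Live 0x0000000000000000\nib_core 311296 8 - Live 0x0"
print(nv_peer_mem_loaded(sample))                           # True
print(nv_peer_mem_loaded("ib_core 311296 8 - Live 0x0"))    # False
```

In production one would read the real /proc/modules; the parsing logic above is the same either way.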

//*******************************************************
root@5ec610095991:~/CaffeOnSpark/caffe-public# more Makefile.config

## Refer to http://caffe.berkeleyvision.org/installation.html
# Parallelization over InfiniBand or RoCE
INFINIBAND := 1

//*******************************************************
root@server01:/rt/data/alexNet2# ibv_devices
device node GUID
------ ----------------
mlx5_0 ec0d9a0300397dd2

//*******************************************************
root@server01:/rt/data/alexNet2# ibv_devinfo
hca_id:	mlx5_0
	transport:		InfiniBand (0)
	fw_ver:			12.21.1000
	node_guid:		ec0d:9a03:0039:7dd2
	sys_image_guid:		ec0d:9a03:0039:7dd2
	vendor_id:		0x02c9
	vendor_part_id:		4115
	hw_ver:			0x0
	board_id:		MT_2180110032
	phys_port_cnt:		1
	Device ports:
		port:	1
			state:		PORT_ACTIVE (4)
			max_mtu:	4096 (5)
			active_mtu:	4096 (5)
			sm_lid:		1
			port_lid:	2
			port_lmc:	0x00
			link_layer:	InfiniBand

//*******************************************************
root@5ec610095991:~/CaffeOnSpark/caffe-public# nvidia-smi
Wed Dec 6 07:34:09 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69 Driver Version: 384.69 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:04:00.0 Off | N/A |
| 20% 33C P8 16W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:06:00.0 Off | N/A |
| 20% 36C P8 17W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:07:00.0 Off | N/A |
| 20% 33C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 20% 34C P8 8W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 108... Off | 00000000:0C:00.0 Off | N/A |
| 20% 28C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 108... Off | 00000000:0D:00.0 Off | N/A |
| 20% 27C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:0E:00.0 Off | N/A |
| 20% 31C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 108... Off | 00000000:0F:00.0 Off | N/A |
| 20% 31C P8 9W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

//*******************************************************
[root@server00 01_basic-client-server]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/nvidia/cuda 8.0-devel 7e0c5ccdc1eb 2 weeks ago 1.681 GB

//*******************************************************
Installed Mellanox OFED for Ubuntu on the host:
MLNX_OFED_LINUX-4.2-1.0.0.0-ubuntu16.04-x86_64.tgz

//*******************************************************
[root@server01 ~]# systemctl status nv_peer_mem
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem module to start at boot time.
Loaded: loaded (/etc/rc.d/init.d/nv_peer_mem; bad; vendor preset: disabled)
Active: active (exited) since Wed 2017-12-06 05:16:08 EST; 1min 32s ago
Docs: man:systemd-sysv-generator(8)
Process: 2055 ExecStart=/etc/rc.d/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)

Dec 06 05:16:08 server01 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem module to start at boot time....
Dec 06 05:16:08 server01 nv_peer_mem[2055]: starting... OK
Dec 06 05:16:08 server01 systemd[1]: Started LSB: Activates/Deactivates nv_peer_mem module to start at boot time.
