\documentclass[12pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage{longtable}
\usepackage[utf8]{inputenc}
%\usepackage[T1]{fontenc}
\usepackage{lscape}
%\usepackage{pdfsync}
\usepackage{multirow}
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{lastpage}
\usepackage{afterpage}
\usepackage{lettrine}
\usepackage{color,soul}
\usepackage[dvipsnames]{xcolor}
\usepackage{colortbl}
\usepackage{enumitem}
\usepackage{tikz}
\usepackage{titlesec}
%Palatino font
%\usepackage{pxfonts}
%\usepackage{libertine}
\usepackage[scaled=0.88]{beraserif}
\usepackage[scaled=0.85]{berasans}
\usepackage[scaled=0.84]{beramono}
\usepackage{mathpazo}
%\linespread{1.05}
\usepackage[T1,small,euler-digits]{eulervm}
\usepackage[nomessages]{fp}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{siunitx}
\usepackage{bm}
\definecolor{bleuUCLclair}{rgb}{.09, 0.569, 1}
\definecolor{bleuUCLfonce}{rgb}{ .13, .52, .86}
\definecolor{redBurn}{rgb}{.91, 0.29, 0.08}
\usepackage[colorlinks=true,urlcolor=redBurn,linkcolor=black]{hyperref}
\addtolength{\topmargin}{-1.5cm}
\addtolength{\textheight}{1.5cm}
\addtolength{\textwidth}{2cm}
\addtolength{\footskip}{2cm}
\setlength{\evensidemargin}{-0.5cm}
\setlength{\oddsidemargin}{-0.5cm}
\setlength{\arrayrulewidth}{0.25pt}
\renewcommand{\baselinestretch}{1.1} % Interligne
\newenvironment{maliste}%
{ \begin{list}%
{\textcolor{bleuUCLfonce}{$\bullet$}\hspace{0.5cm}}%
{\setlength{\labelwidth}{50pt}%
\setlength{\leftmargin}{25pt}%
\setlength{\itemsep}{30pt}}}%
{ \end{list} }
%\renewcommand{\headrulewidth}{0.0pt}
%\newcommand{\clearemptydoublepage}{%
% \newpage{\pagestyle{empty}\cleardoublepage}}
%section like title in longtable
\newcommand{\seclong}[1]{\multicolumn{2}{@{}l}{{\Large\sffamily #1}}
\vspace{0.5cm}
\\}
%enumerate on two columns
\newcounter{listlong}
\newcommand{\newlistlong}{\setcounter{listlong}{1}}
\newcommand{\iteml}[1]{%
\hspace{4.5cm}\textcolor{redBurn}{\arabic{listlong}}\stepcounter{listlong}%
&%
#1%
\\%
}
%left in column
\newcommand{\lcol}[1]{%
\begin{minipage}[t]{.35\textwidth}%
#1%
\end{minipage}%
}
\title{\vspace{-1cm}
\begin{flushleft} {\sffamily Rebuttal for paper \emph{ASCOM\_2018\_119} }\end{flushleft}}
\date{\vspace{-1.7cm}\begin{flushleft}\sffamily DeepSphere: Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmological applications, Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, Raphael Sgier \end{flushleft}}
\pagestyle{fancy}
\fancyhf{}
\fancyfoot[R]{\sffamily\thepage\ / \pageref{LastPage}}
\fancyfoot[L]{ }
\fancypagestyle{plain}{%
\fancyhf{}%
\fancyfoot[R]{\sffamily\thepage\ / \pageref{LastPage}}
\fancyfoot[L]{ }
}
\renewcommand{\headrulewidth}{0.0pt}
\newcommand{\hlc}[2][yellow]{ {\sethlcolor{#1} \hl{#2}} }
\titleformat{\section}
{\bfseries\scshape}{Reviewer \# \thesection}{1em}{}
\titleformat{\subsection}
{\normalfont\scshape}{Comment \# \thesubsection}{1em}{}
\usepackage[framemethod=default]{mdframed}
%\mdfsetup{skipabove=\topskip,skipbelow=\topskip}
\global\mdfdefinestyle{comment}{%
linecolor=red,linewidth=0.1cm,%
leftmargin=-0.5cm,rightmargin=-0.5cm, innerleftmargin=0.4cm,innerrightmargin=0.4cm,
topline=false,bottomline=false
}
\global\mdfdefinestyle{manuscript}{%
linecolor=gray!20,linewidth=0.05cm,backgroundcolor=gray!20,%
leftmargin=-0.5cm,rightmargin=-0.5cm, innerleftmargin=0.4cm,innerrightmargin=0.4cm
}
\renewcommand{\subsectionautorefname}{Comment}
\newcommand{\nati}[1]{{\color[rgb]{.1,.6,.1}{NP: #1}}}
\newcommand{\mdeff}[1]{{\color[rgb]{.1,.6,.1}{MD: #1}}}
\newcommand{\TK}[1]{{\color{red}{TK: #1}}}
\newcommand{\todo}[1]{{\color[rgb]{.6,.1,.6}{TODO: #1}}}
\newcommand{\figref}[1]{Figure~\ref{fig:#1}}
\newcommand{\tabref}[1]{Table~\ref{tab:#1}}
\newcommand{\secref}[1]{Section~\ref{sec:#1}}
%\newcommand{\secref}[1]{\S\ref{sec:#1}}
\newcommand{\eqnref}[1]{(\ref{eqn:#1})}
\renewcommand{\b}[1]{{\bm{#1}}} % bold symbol
% MATH SYMBOLS
\newcommand{\1}{\b{1}} % all-ones vector
\newcommand{\0}{\b{0}} % all-zero vector
\newcommand{\g}[1]{\b{#1}}
\newcommand{\G}{\mathcal{G}}
\newcommand{\V}{\mathcal{V}}
\newcommand{\E}{\mathcal{E}}
\newcommand{\C}{\mathcal{C}}
\newcommand{\B}{\mathcal{B}}
\renewcommand{\L}{\b{L}}
\newcommand{\tL}{\tilde{\L}}
\newcommand{\W}{\b{W}}
\newcommand{\I}{\b{I}}
\newcommand{\D}{\b{D}}
\newcommand{\U}{\b{U}}
\newcommand{\x}{\b{x}}
\newcommand{\X}{\b{X}}
\newcommand{\y}{\b{y}}
\newcommand{\Y}{\b{Y}}
\newcommand{\bu}{\b{u}}
\newcommand{\f}{\b{f}}
\newcommand{\trans}{^\intercal}
\newcommand{\R}{\mathbb{R}}
\newcommand{\bLambda}{\b{\Lambda}}
\newcommand{\blambda}{\b{\lambda}}
\newcommand{\bO}{\mathcal{O}}
\newcommand{\T}{\mathcal{T}}
\DeclareMathOperator*{\esp}{E}
\DeclareMathOperator*{\var}{Var}
\DeclareMathOperator*{\vect}{vec}
\DeclareMathOperator*{\argmin}{arg \, min}
\newcommand{\pkg}[1]{\texttt{#1}}
\begin{document}
\maketitle
% Note: it is very convenient to refer to other comments using the referencing system of \LaTeX, such as here see \autoref{comment:errorEq2}.
The authors would like to thank the reviewers for their time and comments, which significantly helped improve the manuscript.
We believe we addressed all the raised issues in the following rebuttal.
\section*{Answer to the editor}
\begin{mdframed}[style=comment]
Thank you for submitting your manuscript to Astronomy and Computing. I have received comments from reviewers on your manuscript. Your paper should become acceptable for publication pending suitable minor revision and modification of the article in light of the appended reviewer comments. In particular, there is a request from a reviewer to have access to the data used in your analysis.
Though this is designated a minor revision, there are quite a few points that need to be addressed before the paper is acceptable for publication. When resubmitting your manuscript, please carefully consider all issues mentioned in the reviewers' comments, outline every change made point by point, and provide suitable rebuttals for any comments not addressed.
\end{mdframed}
We thank the editor for handling our submission and organizing the peer-review process.
We have revised our manuscript following the suggestions of the reviewers.
We believe that we addressed most requests and that the manuscript is now ready for publication.
We thank the reviewers for taking the time to review our manuscript and provide constructive comments that improved the quality of our contribution.
Currently, access to the dataset can be requested via Zenodo.
As this process is not anonymous, we provide in this rebuttal (in our answer to Reviewer~\#2) a link for the reviewers to access it anonymously.
\newpage
\section{}
\subsection{}
\begin{mdframed}[style=comment]
I have read with great interest the paper by Perraudin and collaborators presenting a methodology (and the associated code) for performing deep machine learning on the sphere using novel algorithms.
The paper is well-written, and the results are clearly presented. The literature as well as the advantages and limitations of new and existing algorithms are discussed thoroughly and fairly. I recommend this paper for publication, provided the following comments are addressed.
\end{mdframed}
Dear reviewer, we thank you for your time and thorough comments.
The manuscript improved thanks to your feedback.
You'll find an answer to all the points you mentioned in this rebuttal.
\subsection{}
\begin{mdframed}[style=comment]
One fairly significant concern I have is how convincing the application to other machine learning algorithms is. The authors compare deep sphere to a SVM classifier, in the case of distinguishing from two cosmological models from density maps. This classification is obviously a simplification of what would be done in real cosmological analyses, but is sufficient for the purpose of demonstration here. However, the metrics adopted do not clearly highlight the superiority of DeepSphere. The SVM takes pixel histograms or power spectrum densities (PSDs) of each input map. It is not surprising that DeepSphere outperforms them given that it takes the whole map as input. Histograms and PSDs are very restrictive compressions of the data - dropping a lot of spatial information. An SVM or even a standard fully connected neural network (NN) would perform a lot better. It is unclear how that would compare to DeepSphere. But I would certainly expect DeepSphere to be a lot easier to train and have many fewer parameters, which I think is the key advantage that needs to be highlighted here. This is analogous to comparing a standard convolutional NN to a standard fully-connected NN. With appropriate depth both will extract the same information and perform equally well. But of course the CNN has fewer parameters and is easier to train since its structure naturally extracts spatial information. One wouldn't compare the CNN to a SVM trained on pixel histograms or PSDs, since those would not have access to the same information. It would be more natural to compare the CNN and the NN trained on the same maps, and examine the number of parameters, the time and number of examples needed for training to achieve the same performance, etc. This would truly highlight how the structure of the NN is a key advantage.
\end{mdframed}
We agree that the ML algorithms should be tested on the same input data for a fair comparison.
We performed the comparison with these features as it reflects the traditional two-step approach of classic machine learning: a) designing features, then b) applying a ``simple'' classifier. Deep learning changed this tactic with networks that learn the features directly from the raw data.
The PS and histograms were chosen because they are standard statistics used by the cosmology community for analysing real data.
This is why we chose them as the reference point.
In fact, the single step of moving away from these hand-picked statistics to ML approaches is already interesting for physicists.
It would indeed be interesting to compare the performance of the CNN to other ML-based algorithms, but we believe it is a major undertaking if it is to be conducted fairly.
Given that our method gives a significant advantage over the traditional statistics used in cosmology, we believe that the current scope of the paper is already interesting for cosmologists.
We would be keen to leave more detailed comparisons of ML-algorithms for future work.
%In practice, those models are limited by the computational power and the dataset size.
This being said, we did try to train an SVM on the raw data, but were unable to obtain over $60\%$ accuracy in the three noiseless cases, which is far behind every other tested model.
Practically, the classifier would either overfit or be over-regularized.
We hence decided not to report the raw-data SVM results in the paper.
This information has now been added to the manuscript.
% \nati{FYI, I did try again to be entirely sure about it.}
While fully-connected NNs (FCNNs) generalize CNNs, they don't perform as well as CNNs on tasks where the symmetries exploited by the CNN are relevant.\footnote{For cosmological data on the sphere, a desired symmetry, highlighted in the introduction, is for the model to be invariant to rotations.}
That is similar to our comparison of the FCN and CNN variants.
While the CNN is a generalization, it doesn't perform as well as the FCN. We agree that showing this in practice would be a great addition to the paper. Nevertheless, we were unable to make an FCNN reach a baseline performance.
In practice, the large number of parameters required by an FCNN severely restricts what can be tested on conventional hardware. In our case the input number of pixels (for order $1$) is $1024^2$, which limits the number of neurons that can be used in the architecture. For completeness, we tried an FCNN and were only able to squeeze $128$ neurons in the first layer before hitting an out-of-memory error. With this architecture and a relative amount of noise of $0.5$, we were unable to regularize the network such that it would classify the validation set. (Remember that all reported classifiers obtained close to $100\%$ accuracy in this setting.) Similar conclusions were observed with other settings. These results were not included as they are simply too far behind the SVM classifiers.
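As a rough illustration of this memory constraint (a back-of-the-envelope estimate, assuming a dense first layer whose weights are stored in single precision at $4$ bytes each), the first layer alone holds
\begin{equation*}
1024^2 \times 128 = 2^{20} \times 2^{7} = 2^{27} \approx 1.3 \times 10^{8} \text{ weights}, \qquad 2^{27} \times 4 \text{ bytes} = 512 \text{ MiB},
\end{equation*}
before accounting for the gradients, the optimizer state, and the activations, which all add to this footprint.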
%Furthermore, the harmonic resemblance is another sign that the constructed graph is able to capture the spherical structure of the HEALPix sampling, even when increasing the number of neurons for small part of the sphere (order $2$ and $4$).
\subsection{}
\begin{mdframed}[style=comment]
Page 4: "There is no known point set that achieves the analogue of uniform sampling in Euclidean space and allows exact and invertible discrete spherical harmonic decompositions of arbitrary but band-limited functions." This is ambiguous. There exists multiple schemes for performing exact SHT on band-limited signals. Those sampling are more efficient than healpix - often in terms of number of pixels, and sometimes in terms of performance of the spherical harmonic transforms - and more accurate since the transform is exact (at least numerically - so typical errors are machine precision). Healpix transforms are only exact to 1e-7 when using iterative schemes, and that depends on Nside and lmax. Please clarify this paragraph. Obviously healpix is very advantageous for the pooling and other operations that take advantage of its hierarchical nature. This is well discussed in the paper - in particular how the fast graph algorithms could accommodate other sampling schemes.
\end{mdframed}
That is right.
Thanks for pointing it out.
The confusion was due to our ignorance.
We removed the part about exact SHTs from the paragraph as we don't need to make such a point.
%\mdeff{That sentence mostly came from \url{https://healpix.jpl.nasa.gov/html/intronode2.htm}.}
%\mdeff{uniform sampling == same distance between all neighboring pixels}
\subsection{}
\begin{mdframed}[style=comment]
"Our graph is constructed as an approximation of sphere S2, a 2D manifold embedded in R3, Indeed, [33] showed that the graph Laplacian converges to the Laplace-Beltrami when the number of pixels goes to infinity. While our construction will not exactly respect the setting defined by [33], we observe empirically strong evidence of convergence." This paragraph is unclear. Could you please add a few words to clarify what the approach of 33 is?
\end{mdframed}
We tried to clarify the paragraph in the following manner:
\begin{mdframed}[style=manuscript]
Our graph is constructed as an approximation of the sphere $S^2$, a 2D manifold embedded in $\mathbb{R}^3$.
Indeed, \cite{belkin2007convergence} showed that the graph Laplacian converges to the Laplace-Beltrami when the number of pixels goes to infinity providing uniform sampling of the manifold and a fully connected graph built with exponentially decaying weights.
While our construction does not exactly respect their setting (the sampling is deterministic and the graph is not fully connected), we empirically observe a strong correspondence between the eigenmodes of both Laplacians (see Appendix A).
\end{mdframed}
While this particular point deserves more details, we believe that a deeper discussion does not belong in this paper.
In fact, we are currently investigating this convergence with the goal of constructing better graphs whose Fourier modes would converge to
% (or even be equivalent to)
the spherical harmonics.
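
As a sketch of the kind of construction this convergence result refers to (the notation is ours, and the kernel width $t$ and the normalization below are assumptions to be checked against \cite{belkin2007convergence}): given $n$ points $\x_1, \dots, \x_n$ sampled on the manifold, a fully connected graph with exponentially decaying weights can be written as
\begin{equation*}
\W_{ij} = \exp\left( - \frac{\| \x_i - \x_j \|_2^2}{4t} \right), \qquad \D = \operatorname{diag}(\W \1), \qquad \L = \D - \W.
\end{equation*}
The convergence statement then concerns the limit where $n$ grows and $t$ shrinks at a suitable rate, in which a properly normalized $\L$ applied to the samples of a smooth function approaches the Laplace-Beltrami operator applied to that function.
Our graph departs from this setting in the two ways listed above: the points $\x_i$ are the deterministic HEALPix pixel centers, and each pixel is only connected to its neighbors.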
\subsection{}
\begin{mdframed}[style=comment]
Figure 3: What was resolution used for those maps? Does the projection of the eigenvectors depend on the resolution and band-limit?
\end{mdframed}
Those maps have been produced with a resolution of $N_{side}=16$. The information has been added to the paper. (High resolution is costly as diagonalizing the Laplacian matrix scales as $O(N_{pix}^3)$.)
Since we do not have any convergence result yet, we are not sure how the projection error evolves. However, if there is convergence (and we believe there is, provided the correct set of graph weights is selected), the eigenvectors associated with the lowest $\ell$ probably converge first.
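To give an order of magnitude, recall that a HEALPix map has $N_{pix} = 12 N_{side}^2$ pixels, so that
\begin{equation*}
N_{pix} = 12 \times 16^2 = 3072 \quad \text{for } N_{side} = 16.
\end{equation*}
Doubling $N_{side}$ multiplies $N_{pix}$ by $4$ and hence, under the cubic scaling of a dense diagonalization, the cost by $4^3 = 64$.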
\subsection{}
\begin{mdframed}[style=comment]
Page 7: "That is much more efficient than filtering with spherical harmonics, even though HEALPix was de- signed as an iso-latitude sampling that has a fast spherical transform. That is especially true for smooth kernels which require a low polynomial degree K. Figure 4 compares the speed of low-pass filtering for Gaussian smoothing using the spherical harmonics and the graph-based method presented here. A naive implementation of our method is ten to twenty times faster for Nside = 2048, with K = 20 and K = 5, respectively, than using the spherical harmonic transform at lmax = 3Nside implemented by the highly optimized libsharp [38] library used by HEALPix." This is not so clear. Is the graph method also performing a Gaussian smoothing with the same characteristics? If so, what are those? How precise is the graph convolution approximating the true harmonic-space Gaussian convolution in this case? This is relevant since some approximations (for instance when analyzing cosmic microwave background data) require high-precision Gaussian smoothing. Finally, is lipsharp optimizing Gaussian smoothing in any way?
\end{mdframed}
The experiment whose result is reported in Figure 4 is intended to report filtering speed, independently of the chosen filter (we used a Gaussian for convenience).
While libsharp is probably not optimizing Gaussian smoothing in particular (w.r.t. filtering with any other filter), it is a highly optimized library for SHT.
The approximation quality is not a concern here as, in the context of neural networks, filters are learned.
That is, graph filters are not approximating any predefined true harmonic-space filters anyway.
It is, however, true that replacing spherical filtering (using the SHT) with graph filtering is an interesting prospect (in terms of speed) that might be possible in the future as suggested by Appendix B (Figure B.13).
We feel that, in its current state, graph filtering is probably not a good enough approximation to be usable when high-precision filtering is required.
Nevertheless, theoretical improvements on a potential equivalence (or convergence) of the Laplacian eigenvectors to the spherical harmonics (see our answer to comment 1.4) could potentially make it competitive.
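
To make the cost argument concrete, the following sketch assumes that the polynomial filters of degree $K$ are parametrized with Chebyshev polynomials $T_k$ of a rescaled Laplacian $\tL = 2\L/\lambda_{\max} - \I$ (the coefficients $\theta_k$ and the intermediate signals $\bar{\x}_k = T_k(\tL)\x$ are notation introduced for this answer):
\begin{equation*}
h_\theta(\tL)\,\x = \sum_{k=0}^{K} \theta_k \bar{\x}_k, \qquad \bar{\x}_0 = \x, \quad \bar{\x}_1 = \tL \x, \quad \bar{\x}_k = 2 \tL \bar{\x}_{k-1} - \bar{\x}_{k-2}.
\end{equation*}
Each step of the recursion is one multiplication by the sparse matrix $\tL$.
As long as the number of neighbors per pixel does not grow with the resolution, filtering therefore costs $O(K N_{pix})$ operations, to be compared with the $O(N_{pix}^{3/2})$ usually quoted for the SHT on iso-latitude samplings.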
\section{}
\label{sec:dataset}
\begin{mdframed}[style=comment]
First of all, I would like to thank the authors for writing such a well thought out paper on such an interesting method.
The paper contains an extremely well described technique for performing convolutions on the sphere by first transforming the sphere into pixel space using graphs and then using graph-based convolutional filters that are radially symmetric. These convolutions are then used in neural networks, with an example showing classification of different cosmological scenarios directly from convergence maps. There are several benefits to this method over previously described neural networks on spheres (such as using spherical harmonics, which is slow, or using 2D CNNs, which has to learn to deal with distortions from the projection). I found the approximation of the graph approaching spherical harmonics extremely fascinating and will certainly looking more into this myself in the future!
I will happily recommend this paper for publication. Firstly though, I would like access to the data (privately rather than via emailing the group) so that I can check the notebooks thoroughly. The notebooks are extremely well written but I would like to run them first. As well as this I think there are a few extra applications which will improve the manuscript further, and make the method more appealing for users. I would recommend that these ideas are at least seriously considered by the authors and hopefully implemented:
\end{mdframed}
Dear reviewer, we would like to thank you in return for your time, your kind words, and your thorough review.
The manuscript substantially improved thanks to your work.
You'll find an answer to all the points you mentioned in this rebuttal.
The data is available on request on Zenodo, but the access is not anonymous.
For the special purpose of the review, we copied the data from Zenodo to a cloud storage.
It can be accessed via a public link:
\url{https://polybox.ethz.ch/index.php/s/Oi2NjdF8dnMdwot}
The link will expire on 31 March.
The password is ``DeepSphere101''.
Please do not share these details with anyone or use them for purposes other than the review.
\subsection{}
\begin{mdframed}[style=comment]
The first, and most important is an example using masks. Since the authors mention how their method should in principle work well with masks, especially in comparison to HEALPix and Clebsh-Gordan transform spherical harmonic methods I think that it would be extremely enlightening to present this work, probably in the form of a performance vs. time plot. With this being done, the informational content of a masked version of the cosmological application could easily be included. This would, I'm sure, be attractive to the cosmological community as well as the ML community who are always looking for good ways to deal with masked data.
\end{mdframed}
We agree with the reviewer that an example with an irregular mask is what is most missing.
A performance versus time plot would indeed be a great way to present such a comparison.
As the reviewer hints, this example would be better presented in the light of a comparison to other formulations of spherical CNNs.
While such a comparison would without any doubt be insightful on many aspects (not only for masked data, but also on the role of isotropic filters or the importance of rotation equivariance), it is better carried out on diverse datasets and tasks, and deserves, in our opinion, a follow-up paper to be properly addressed.
The point of this paper was to show (i) that spherical CNNs are a great model for cosmological applications, and (ii) that a graph-based spherical CNN has certain undeniable advantages.
As hinted in the conclusion, we are actively working on a more thorough comparison of graph-based and SHT-based spherical CNNs.
On the practical side, a proper comparison requires either modifying the available implementations of \cite{cohen2018sphericalcnn,esteves2017sphericalcnn,kondor2018clebsch} to use the HEALPix sampling, or transforming the cosmological maps to the sampling schemes they support.
Moreover, implementing the pooling operation for irregular masks is not trivial.
All in all, that is a tremendous amount of engineering work that we preferred to defer.
Keep in mind that, while they are not irregular masks, we already use $1/12$, $1/48$, and $1/192$ parts of the sphere for training and inference.
While an SHT-based method needs to consider the whole sphere to work with this data, our graph-based method only considers the used part (and still takes the curvature of the surface into account).
The computational advantage of doing so has been added to Figure 4.
As stated above, the performance comparison will however have to wait for our follow-up work.
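For concreteness, with $N_{pix} = 12 N_{side}^2$ and taking $N_{side} = 1024$ (inferred from the $1024^2$ pixels quoted for order $1$ in our answer to comment 1.2), the three parts of the sphere contain
\begin{equation*}
\frac{N_{pix}}{12} = 1024^2 = 1\,048\,576, \qquad \frac{N_{pix}}{48} = 512^2 = 262\,144, \qquad \frac{N_{pix}}{192} = 256^2 = 65\,536
\end{equation*}
pixels, respectively.
Assuming the cost of graph filtering grows linearly with the number of pixels kept, these parts are roughly $12$, $48$, and $192$ times cheaper to filter than the full sphere.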
% Our main motivation was to build a "large" dataset with a relatively low number of cosmological maps.
% In terms of neural network, the available implementation of \cite{cohen2018sphericalcnn,kondor2018clebsch} for spherical CNNs and Clebsh-Gordan CNNs are not scalable enough to be trained on the cosmological maps dataset and do not use the HEALPix sampling. Hence a comparison with these methods requires a different setting, which far beyond the scope of our contribution.
%\nati{We could compare the price of one convolution. What do you think? We can write the theoretical complexity. This is a sensitive point!}
%\mdeff{measure the computational time of filtering on a 1/12, 1/48 or 1/192 of the sphere with the graph. Add the result to Figure 4 and compare with filtering of the full sphere (with SHT and graph).}
\subsection{}
\begin{mdframed}[style=comment]
Secondly, using only SVM as a comparison might skew how well this technique appears to works in a deep learning setting. In principle (and I say this not knowing for sure) 2D CNNs should probably be able to work as well as DeepSphere (although no where near as elegantly) by taking in a projected 2D image of the sky and using *loads* of filters to learn about the distortion at different parts of the projected sky. It would be interesting (and I think a fairly easy test) to check how well DeepSphere performs against a 2D CNN on projections.
\end{mdframed}
%\nati{Please check if you like what I did in light of the discussion we had.}
We decided to add this experiment.
Note that making this comparison was technically not as trivial as it may seem because of the 2D projection (see Appendix D).
We shared the opinion of the reviewer that a 2D CNN should probably work as well as DeepSphere ``by taking in a projected 2D image of the sky and using *loads* of filters to learn about the distortion at different parts of the projected sky.''
We were however positively surprised by the results: the performance of the 2D CNN did in general not match DeepSphere (see Figure 9 and the new paragraph in Section 4.5).
We did however not use *loads* of filters as suggested by the reviewer.
% the optimization was challenging and we already had to regularize significantly the network.
While adding more filters did slightly improve the performance (never to the level of DeepSphere), it made the optimization more challenging, to the point where we could not obtain consistent results when restarting the training process.
Eventually, we decided to report the results for the architecture with the same number of parameters as DeepSphere.
% \TK{makes sense to me.}
% \todo{experiment in prograss by nati}
% \nati{\textbf{Very important point!!!}
% \begin{itemize}
% \item We can do this comparison...
% \item Since we used 1/12 of the sphere, all the pixel grids we use are somehow squared. I am now using this as a projection. @Tomek, do you know if this is ok? Or do you know how to do the projection in a simple and efficient manner?
% \item I think that traditional 2D CNNs will be as good as our algorithm. We should be very careful in the way we formulate this in our answer and in the paper.
% \end{itemize}
% }
% \TK{I would be tempted to leave it to future work! It is an interesting comparison, but requires more work and it brings only a bit more to already awesome paper.}
% \mdeff{Agreed that we should do at least one thing they asked for.
% \begin{itemize}
% \item Testing on the 1/12 of the sphere is not what he asked for. On one hand, a global projection would exhibit more distortion. On the other, we never use the whole sphere.
% \item Keep in mind that there is no orientation that is consistent for the 12 base faces (i.e., where is up?).
% \item Agreed that the performance should be very similar. Both are doing local computations. And locally (3 to 5 pixels across), the sphere is isomorphic to an Euclidean plane.
% \item The only difference, in my opinion, would be due to the anisotropic filters the 2D CNN has access to.
% \item Not sure we should include this result in the main paper. Hard to justify in the story, especially as a comparison with alternative spherical CNNs is missing. Let's see the results.
% \end{itemize}
% }
\subsection{}
\begin{mdframed}[style=comment]
Similarly, I think it would useful to check how well the DeepSphere CNN does against the FCN when the input data is rotated randomly during training - I imagine that the results should match or probably exceed the FCN results and I think this would be a neat little test which doesn't involve needing extra data, just more augmentation.
\end{mdframed}
% That comparison is not warranted.
% or justified. pointless is maybe too aggressive
We do not believe this comparison is useful, for the following reason.
There are two ways to deal with symmetries one wants to be invariant to: (i) build them into the architecture of the model, or (ii) augment the dataset such that a (more general) model learns them ``brute-force''.
With respect to rotation invariance, the FCN architecture is of the first kind, while the CNN is of the second kind.
Assuming the task is truly invariant to said symmetry,\footnote{As argued in the introduction, cosmological maps are rotation invariant or equivariant.} the performance of the first kind of models would be exactly the same as without augmentation, while the second kind would only catch up after seeing enough data to learn the invariance.
% That is a waste of resources which is only justified if it's not possible to back the desired invariance in the architecture.
Augmentation is only justified if it is not possible to bake the desired invariance into the architecture.
As the reviewer writes in comment 2.8, we should be ``building a NN to suit [our] purposes''.
We updated the manuscript to reflect this argument more precisely.
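As an illustrative formalization (the symbols $f_\theta$ for a model with parameters $\theta$, $R$ for a rotation, and $\mathbf{x}$ for an input map are introduced here for this answer only), an idealized model of the first kind satisfies
\[
  f_\theta(R\mathbf{x}) = f_\theta(\mathbf{x}) \quad \mbox{for all rotations } R \mbox{ and all } \theta,
\]
so that, in this idealized case and assuming the label is unchanged by the rotation, a rotated copy $R\mathbf{x}$ of a training sample yields the same prediction, and hence the same loss, as $\mathbf{x}$ itself.
A model of the second kind can at best satisfy this relation approximately, and only for the parameters $\theta$ reached after training on sufficiently many rotated samples.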
%\TK{Seems a bit strong to me.. maybe we would write it in a softer language?}
%\nati{Now it is OK. But is it still strong enough. We should only point reviewer incoherence if necessary.}
% \nati{Technically, testing this is complicated as rotation on HEALPix are complicated and expensive. We probably will have to break a lot of code to do it.
% @Michael: what do you think?}
% \mdeff{Agreed.}
\subsection*{Minor comments}
\begin{mdframed}[style=comment]
I should say, even though I made the above comments, I thought the paper was really excellent and well written. There were a couple of very minor points which I would like the authors to expand slightly on or consider.
\end{mdframed}
Thank you very much for your comments. We addressed them as best we could.
\subsection{}
\begin{mdframed}[style=comment]
While our construction will not exactly respect the setting defined by [33], we observe empirically strong evidence of convergence.
- Could you add a plot comparing the graph Laplacian as a function of number of pixels and the Laplace-Beltrami operator so the reader can explicitly see the convergence?
\end{mdframed}
Good point.
To experimentally show convergence, we would need to build a series of graph Laplacians that converges to the Laplace-Beltrami operator.
This is complicated and requires solving the theoretical problem first.
Hence, we instead built one Laplacian and observed some similarities between the two operators.
We agree that our wording was unfortunate and we have modified the manuscript to mean that we empirically observed a strong correspondence rather than a proper convergence.
The convergence (or even exact correspondence) is a problem we are currently investigating and plan to release as a separate contribution.
See also comment 1.4 and our answer there.
%\todo{the new figure A.12 is maybe a weak proof of convergence}
%\mdeff{@nati, what do you think?} \nati{I like it. I totally agree with the new wording.}
\subsection{}
\begin{mdframed}[style=comment]
We found out that the one proposed above works well for our purpose, and did not investigate other approaches, leaving it to future work.
- Are you really going to search through different weighting schemes in future work? I would be interested if you were, but could you speculate on what you expect to achieve by looking at different ones?
\end{mdframed}
Yes, we are investigating how to best set the edge weights (the sole degree of freedom when building a graph).
The question is: what edge weights should be set such that the graph Laplacian converges (or is equivalent) to the Laplace-Beltrami operator (up to a certain bandwidth)?
Ideally, that should work for any sampling of the sphere.
(Equivalently, the Fourier modes converge or are equivalent to the spherical harmonics.)
A proof of convergence or equivalence would allow a graph convolution that is truly equivariant to rotation.
Moreover, it would enable high-precision filtering using the graph rather than the SHT (for speed).
See also comment 1.6 and our answer there.
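Stated as a sketch (the notation below is ours and simplifies the actual problem): with $\mathbf{W}$ the weighted adjacency matrix, $\mathbf{D}$ the diagonal degree matrix, and, for instance, the combinatorial Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$, we are looking for weights $\mathbf{W}$ such that, for a band-limited function $f$ sampled at the pixel centers $x_i$,
\[
  (\mathbf{L}\mathbf{f})[i] \approx -c \, (\Delta_{\mathbb{S}^2} f)(x_i), \qquad \mathbf{f}[i] = f(x_i),
\]
where $\Delta_{\mathbb{S}^2}$ denotes the Laplace-Beltrami operator on the sphere and $c > 0$ is a normalization constant.
The minus sign only reflects the convention that $\mathbf{L}$ is positive semi-definite whereas $\Delta_{\mathbb{S}^2}$ is negative semi-definite.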
% \nati{@michael, how much shall we say here?}
% \mdeff{I like to be open. It's mostly written in the paper anyway.}
\subsection{}
\begin{mdframed}[style=comment]
Thanks to the softmax, the output $\bf{y}\in\mathbb{R}^{N_{classes}}$ of a neural network is a probability distribution over the classes, i.e., $y_i$ is the probability that the input sample belongs to class $i$.
- This is a common misconception which is incorrect and needs restating. It is *not* a probability distribution for the class, rather it is a discretised conditional distribution for the class given both the data and the weights and biases of the trained network. One way to see that the value of $y_i$ cannot be considered a probability (which implies P(y|x), i.e. the probability of a class given the data) is that if the network is untrained, would you retrieve back the correct probability? It's wholly dependent on the network.
\end{mdframed}
Thanks for making it precise. We completely agree and updated the manuscript accordingly.
% \nati{By the way, this mean that the reviewer is knowledgeable with ML, and may be Bayesian. He might be working on another spherical approach.}
% \mdeff{The guy likes to be precise.}
\subsection{}
\begin{mdframed}[style=comment]
For example, on images, the subject of interest is most often around the center of the picture. (Also... Contrast that with images, where the subject of interest is most often around the center of the picture.)
- I don't see that this is general at all, only if you are considering say, preprocessed postage stamps of galaxies. I can't think of many other cases in modern machine learning where the images contain the informative part only in the centre of an image. I do agree that the informative part is often localised though. Maybe this should be updated.
\end{mdframed}
Think of the MNIST digits, cells in biomedical imaging, faces in portraits, etc.
Most photographers prefer the subject of interest to be centered.
In any case, what matters is not that the informative part is only in the center, but that a fully connected layer allows a network to focus its attention on specific parts of the domain.
With average pooling, all pixels are treated equally.
\subsection{}
\begin{mdframed}[style=comment]
Than can be seen as intrinsic data augmentation, as a CNN would need to see many rotated versions of the same data to learn the invariance.
- I think you mean "This" rather than "Than". I think, however, you are making a more important point here than you realise. It is not really "intrinsic data augmentation" but rather building a NN to suit your purposes. This should be done by the entire community as, at the moment, too many people blindly build their networks without considering their data and then struggle to get the best results out. The symmetries of the data should be the first thing to influence architecture of a network, exactly as you have reasoned. You might want to add a statement about how you know what your symmetries are and thus, you know how to build a well reasoned network which is more likely to work (and that other should do the same).
\end{mdframed}
Thanks for pointing out the typo; we fixed it.
We do realize it is an important point, and agree that some people apply architectures to problems other than the ones they were devised for without much thought.
A large part of our paper is actually about using an appropriate NN architecture.
The point you suggest adding is actually made in the Experiments section, where we describe the data and task, and motivate the architecture choice.
We wanted the Method section to be more generic.
% Note that the task plays a role as well (i.e., two different tasks on the same data might benefit from different architectures).
% \mdeff{that's in contradiction with his comment 2.3 where he asks us to do data augmentation with the CNN}
\subsection{}
\begin{mdframed}[style=comment]
As such, the SHT is only performed once on the input (no back and forth SHT between each layer). While this clever trick lowers the complexity of the convolution to $\mathcal{O}(N_{pix})$, the non-linearity is $\mathcal{O}(N^{3/2}_{pix})$.
- I don't understand what the difference is between the convolution and the "non-linearity". I think this sentence either needs explaining further. Is the process $\mathcal{O}(N_{pix})$ or $\mathcal{O}(N^{3/2}_{pix})$?
\end{mdframed}
In \cite{kondor2018clebsch}, the convolution is performed in the spectral (harmonic) domain with a simple Hadamard product.
Instead of recomposing the signal and applying a traditional non-linearity such as the ReLU in the pixel domain, \cite{kondor2018clebsch} proposes to apply a Clebsch–Gordan product in the spectral domain.
(As a non-linear operation that preserves rotation equivariance, it plays the role of the non-linearity of the NN.)
Their convolution costs $\mathcal{O}(N_{pix})$, and their non-linearity costs $\mathcal{O}(N^{3/2}_{pix})$.
Stacking those layers results in a NN of overall $\mathcal{O}(N^{3/2}_{pix})$ time complexity.
We clarified the sentence.
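Spelled out, with $n$ denoting a fixed number of layers (a symbol introduced here for illustration), the per-layer costs add up and the larger term dominates:
\[
  n \left( \mathcal{O}(N_{pix}) + \mathcal{O}(N_{pix}^{3/2}) \right) = \mathcal{O}(N_{pix}^{3/2}).
\]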
\subsection{}
\begin{mdframed}[style=comment]
When mentioning that you are 10 to 20 times faster than the SHT it should be noted that libsharp has amazing distributed mpi functionality which helps, although of course your method is genuinely lower order computationally (and can be distributed simply on a GPU).
\end{mdframed}
True.
In the experiment, both were compared on a single core.
They might, however, scale differently, given that one can be distributed through MPI and the other on a GPU.
The manuscript was updated.
\subsection{}
\begin{mdframed}[style=comment]
In Appendix A. (my personal favourite part) could you measure the percent difference that different eigenvectors differ from the spherical harmonics due to incomplete discretisation of the sphere, perhaps plotting side by side plots like Figure 3, but with SH too (or their differences). By the way, Figure A.12 is placed in the wrong section.
\end{mdframed}
% (To plot the difference between the graph eigenvectors and the spherical harmonics on the sphere, one would have to align the subspaces.) Impossible to rotate arbitrarily on HEALPix.
We cannot compare individual eigenvectors and spherical harmonics because of arbitrary rotations.
While any rotation of the basis of a subspace (the space spanned by the eigenvectors and harmonics corresponding to a given degree $\ell$) does not alter the spanned subspace, the difference between the basis vectors before and after the rotation can be 100\%.
Your suggestion, however, made us realize that Figure A.12 was not ideal for comparing the eigenvectors to the spherical harmonics.
As such, we replaced it with another that we hope is clearer.
This new figure should satisfy your request.
We also moved Figure A.12 to the correct subsection.
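The argument can be made explicit as follows (the matrix notation is ours, introduced for this answer). Let the columns of $\mathbf{U}_\ell \in \mathbb{R}^{N_{pix} \times (2\ell+1)}$ form an orthonormal basis of the subspace of degree $\ell$, and let $\mathbf{Q}$ be any $(2\ell+1) \times (2\ell+1)$ orthogonal matrix.
Then $\mathbf{U}_\ell \mathbf{Q}$ is an equally valid orthonormal basis of the same subspace, and the orthogonal projector onto it is unchanged:
\[
  (\mathbf{U}_\ell \mathbf{Q}) (\mathbf{U}_\ell \mathbf{Q})^\top = \mathbf{U}_\ell \mathbf{Q} \mathbf{Q}^\top \mathbf{U}_\ell^\top = \mathbf{U}_\ell \mathbf{U}_\ell^\top.
\]
The individual basis vectors, however, can change completely: if $\mathbf{Q}$ merely swaps two columns, each swapped vector becomes orthogonal to the one it replaced.
This is why a vector-by-vector percent difference is not meaningful, whereas quantities that depend only on the subspace are.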
\subsection{}
\begin{mdframed}[style=comment]
In Appendix C. is there a connection between $t$ and $\sigma$ that you can make, or do you just fit them to be similar? If you just fit them to be similar I'm not sure where the relative differences come from. If they are related, however, then the relative differences make sense, but I can't see what the connection is.
\end{mdframed}
In the current version of the manuscript and the code, we simply fitted a few $t$ to match the different $\sigma$.
In theory, there should be a connection between the two values, provided that the graph Laplacian has converged toward the Laplace-Beltrami operator.
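As an analogy only (this is the Euclidean case, not a result about our graph): the solution of the heat equation $\partial_t u = \Delta u$ on $\mathbb{R}^d$ at time $t$ is the initial condition convolved with the kernel
\[
  \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{\|x\|^2}{4t} \right),
\]
which is a Gaussian of standard deviation $\sigma = \sqrt{2t}$ along each axis.
We would expect a relation of this form, up to a constant that depends on how the graph Laplacian is scaled, to hold on the graph once convergence is established.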
% Again, this is left for a future contribution.
\subsection{}
\begin{mdframed}[style=comment]
A question that I would like to know personally, are you thinking about a future upgrade to non-radial kernels?
\end{mdframed}
We thought about it but are not sure that it would be beneficial.
We are also unsure about the proper way to do it.
%The main difficulty is that a graph is not intrinsically oriented.
%However, as the graph framework we are using only works with radial filters, it is not a simple question.
%A simple trick to make the network sensitive to direction is to add a per-channel fully connected layer.
%We are currently investigating other potential graph techniques that would lead to non-radial kernels.
\subsection{Details}
\begin{mdframed}[style=comment]
Finally, I noticed a few points in the text where there were typos or things which were difficult to follow and I have put a note of them below and tried to correct them as much as I can so that you don't have to put too much effort in - I can't wait to see this paper published. Sorry that the list is so extensive (I'm a bit of a stickler for details and trying to write papers concisely):
\end{mdframed}
Thank you very much for taking the time to make these corrections.
We found them very helpful in improving the quality of the paper.
We implemented most of them; see below.
\begin{mdframed}[style=comment]
Convolutional Neural Networks (CNNs) have been proposed as an alternative analysis tool in cosmology thanks to their ability to automatically design relevant statistics to maximise the precision of the parameter estimation, while maintaining robustness to noise [9–16].
- There are many of us in the community who are not overly keen on "precision" of parameter estimation being the target for machine learning. Rather, we would prefer robustness of parameter estimation. It might be worth changing this sentence a bit so as not to upset some of the referenced authors ;)
\end{mdframed}
Precision, understood as the final size of the posterior distribution on the measured parameters (including systematic errors), is the key interest of cosmological analysis.
The robustness to noise is also crucial, as demonstrated in [9].
%We expanded the sentence to include this.
We added a footnote to specify this.
% \TK{not sure what do to here.. will people get upset? "Precision" is commonly used in cosmology}
\begin{mdframed}[style=comment]
When you have multiple = in equations could you split them on to separate lines? This would make the logical steps for following the equations a lot clearer.
\end{mdframed}
Thank you for the suggestion. We would normally adopt it, but the paper is already quite long, and breaking up the lines in equations would make it even longer, so we decided against it.
\begin{mdframed}[style=comment]
You change a lot between ``node'', ``vertex'' and ``pixel''. Could you stick to one of these? I think that ``vertex'' or ``pixel'' makes the most sense (``pixel'' due to the HEALPix pixelisation of the maps or vertex because of the use of the $v$ notation). Node is quite confusing in several places because there is an inherent connection between node and neuron in ML literature, and since there is swapping between the three different terms it's a little hard to keep track.
\end{mdframed}
We agree that using different terms for the same notion can confuse readers.
While ``node'' and ``vertex'' are synonyms in referring to the elements of a graph, a ``pixel'' is a different concept referring to a point of the (spherical) sampling.
A pixel can be represented by a vertex, but they are not the same.
As such, we decided to replace all occurrences of ``node'' with ``vertex'' but keep the two concepts of ``vertex'' and ``pixel'' separate.
% \nati{What do you think? This avoid repetition. But we could be more consistent...}
% \TK{Agreed. Do you want to make the change @nati?}
% \mdeff{Agreed. Node is the same as vertex. Pixel is not necessarily.}
\begin{mdframed}[style=comment]
You change between math font and text font for the names of your layers when inside and outside of equations. It would look a lot neater sticking with one choice. (Sorry, that seems quite petty and I don't mean it to be).
\end{mdframed}
We have made the paper consistent by using a math font for layer names.
%\nati{@all, when you read the paper, please check that I did not forget some of them. Every layer has been changed to math symbols as we use them like function. For example, SM becomes $SM$.}
%\mdeff{Agreed. I checked and fixed one missing.}
\begin{mdframed}[style=comment]
In Appendix C. you talk about Kroneker a lot, but earlier you talk about Kroneker $\delta$ (which is it's correct name). It's probably worth sticking with the full name.
\end{mdframed}
We have modified the manuscript according to the suggestion.
\begin{mdframed}[style=comment]
Sky maps are rotation invariant: rotating maps on the sphere doesn’t change their interpretation, as only the statistics of the maps are relevant.
- I think that you mean that rotating maps on the sphere doesn't change their interpretation *when* only the statistics of the maps are relevant. Say you wanted to compare the NGC against the SGC, then rotating the map would obviously change the results.
\end{mdframed}
True.
Equivariance is desired in such a case (rather than invariance).
We clarified the sentence.
\begin{mdframed}[style=comment]
The flexibility of modelling the data domain with a graph allows to easily model data that spans only a part of the sphere, or data that is not uniformly sampled.
- I think you're missing a "one" after allows. "The flexibility of modelling the data domain with a graph allows one to easily model data that spans only a part of the sphere, or data that is not uniformly sampled."
\end{mdframed}
Thank you for the suggestion. We updated the text.
\begin{mdframed}[style=comment]
This kind of maps can be created using the gravitational lensing technique [see 29, for review]
- It should be "These" and not "This". "These kind of maps can be created using the gravitational lensing technique [see 29, for review]"
\end{mdframed}
Thank you for spotting this mistake. We updated the manuscript.
\begin{mdframed}[style=comment]
A CNN is composed of the following main building blocks [31]: (i) a convolution, (ii) a non-linearity, (iii) a down-sampling operation, (iv) a pooling operation, and (v), optionally, normalization.
- I think that this is generalising a CNN a bit too much. In my mind (and in my personal uses of CNNs) steps (iii) and (iv) are not always necessary either. Perhaps you want to say that you wish to tackle common tasks which often arise in CNNs: i) convolution, (ii) non-linearity, (iii) down-sampling operation, (iv) pooling operation, and (v), optionally, normalisation.
\end{mdframed}
We agree that (iii) and (iv) are optional and updated the sentence accordingly.
\begin{mdframed}[style=comment]
Likewise, down-sampling can be achieved by taking one pixel out of n.
- I'm afraid I don't understand what this means at all, could you rephrase it?
\end{mdframed}
We rephrased it.
\begin{mdframed}[style=comment]
A rhombus is a quadrilateral whose four sides all have the same length.
- I'm not sure that this really needs to be said.
\end{mdframed}
Agreed. We removed the footnote.
\begin{mdframed}[style=comment]
As each pixel is subdivided in four, the second coarser resolution is $N_{pix} = N_{side}^2\times12=2^2\times12=48$ pixels (middle sphere in Figure 5), the third is $N_{pix} = N_{side}^2\times12=42\times12=192$ pixels, etc., where $N_side =1,2,4,8,\dots$ is the grid resolution parameter.
- Might it be more concise to write "The resolution changes as $N_{pix} = N^2_{side}\times12$ such that $N_{pix} = 48$ for $N_{side} = 2$ and $N_{pix} = 192$ with $N_{side}=3$."
\end{mdframed}
Thank you for the suggestion. We updated the text.
\begin{mdframed}[style=comment]
... is a measure of the variation of the eigenvector $\bf{u}_i$ is on the graph defined by the Laplacian L.
- I don't think the "is" after $\bf{u}_i$ should be there.
\end{mdframed}
This has been corrected.
\begin{mdframed}[style=comment]
Given the convolution kernel $h:\mathbb{R}_+ \to \mathbb{R}$, a signal $f\in\mathbb{R}^{N_{pix}}$ on the graph is filtered as
- You don't mention what $\mathbb{R}_+$ is or why $h$ defines that map.
\end{mdframed}
We are not sure what the reviewer means here.
$\mathbb{R}_+$ is the set of non-negative real numbers and $h$ is a function that maps any such number to a real number.
We think that $\mathbb{R}_+$ is a common enough notation not to require an explicit definition.
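For context, and as a sketch in notation that we assume matches the manuscript ($\mathbf{L} = \mathbf{U} \Lambda \mathbf{U}^\top$ being the eigendecomposition of the graph Laplacian, with eigenvalues $\lambda_i$ on the diagonal of $\Lambda$), the kernel acts on the eigenvalues:
\[
  h(\mathbf{L}) f = \mathbf{U} \, h(\Lambda) \, \mathbf{U}^\top f,
\]
where $h(\Lambda)$ is the diagonal matrix with entries $h(\lambda_i)$.
Since the eigenvalues $\lambda_i$ of a graph Laplacian are non-negative, it suffices to define $h$ on $\mathbb{R}_+$.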
%\nati{This seems clear to me. Specifying what $\mathbb{R}_+$ is would be a bit too much I believe. @all, is it also for you?}
%\mdeff{That is also quite clear to me.}
\begin{mdframed}[style=comment]
This localization of the kernel $h$ can be useful to visualize kernels, as explained in Appendix C.
- Most of Appendix C is not involved with the localisation of the kernel so it might be worth expanding this sentence to "This localisation of the kernel $h$ can be useful to visualise kernels, as shown in an example of heat diffusion presented Appendix C. " Also, it is a little odd that Appendix C is mentioned before Appendix B in the text.
\end{mdframed}
We followed both suggestions.
\begin{mdframed}[style=comment]
However, when considering only parts of the sphere, one can observe important border effects (see Figure B.13 and Appendix B).
- I would just refer the reader to the appendix rather than the figure and the appendix. It is a bit more concise. "However, when considering only parts of the sphere, one can observe important border effects (see Appendix B)."
\end{mdframed}
We followed the suggestion.
\begin{mdframed}[style=comment]
... where $T_ig[j]=g[i-j]$ is, up to a flip, a translation operator.
- I don't understand what you mean by a "flip", a change of sign, the inversion of the kernel?
\end{mdframed}
We clarified the meaning of a flip.
\begin{mdframed}[style=comment]
Similarly as (2), the convolution of the signal $f$ by a kernel $g$ is the scalar product of $f$ with translated versions $T_ig$ of the kernel $g$.
- I think this should be "As with equation (2)" or "Similarly, as in equation (2)" but I can't tell which. Also, it would be really useful if you referred to equations by "equation (x)" rather than just "(x)" throughout - much easier to read that way.
\end{mdframed}
% \nati{I do not know what we should do.}
Ideally, we would replace equation references such as ``(1)'' with ``Equation (1)'', but given the large number of equation references, we prefer to keep the short form to avoid lengthening the paper further.
% \mdeff{Agree to not introduce more words.}
\begin{mdframed}[style=comment]
The 1-neighborhood of a node is the set of nodes that are directly connected to it. The k-neighborhood of a node is the set of nodes that are connected to it through paths of length k.
- I'm a little confused here, I think the neighbourhood is every pixel which is closer than the distance of some pixel. The value of k would then be the length of the set of ordered distances for all pixels up to this pixel. Is that correct? For example, take a 3x3 square (on Euclidean space for simplicity), if the central pixel has coordinates (0, 0) then the 1-neighbourhood would be all pixels at a length of 1 pixel or less, i.e. (-1, 0), (+1, 0), (0, -1) and (0, +1) who are all at the same distance so the set of distances is {1} pixel and the length of the set is 1. The 2-neighbourhood would then be all pixels which were a distance $\sqrt{2}$ pixels or less away, i.e. (-1, 0), (+1, 0), (0, -1), (0, +1), (-1, -1), (-1, +1), (+1, -1), (+1, +1) so the set of distances would be {1, $\sqrt{2}$} pixels and the length of the set is 2. Is this correct? If so k is not a length. If this is not correct, I think I misunderstand something and so this should be updated in the text.
\end{mdframed}
The neighborhood is defined in the general context of graphs, where vertices don't necessarily have coordinates.
One way to define a distance between two vertices is to count the number of edges in the shortest path between them.
We clarified this point.
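A common way to write this (the symbols $d$ and $\mathcal{N}_k$ are introduced here for illustration) is to let $d(i,j)$ be the number of edges in a shortest path between vertices $v_i$ and $v_j$, and to define the neighborhood of radius $k$ as
\[
  \mathcal{N}_k(i) = \{ j : d(i,j) \leq k \}.
\]
In the reviewer's $3 \times 3$ example, and assuming each pixel is connected to its four axis-aligned neighbors only, the pixels $(\pm 1, 0)$ and $(0, \pm 1)$ are at distance $d = 1$ from the center, while the corner pixels $(\pm 1, \pm 1)$ are at distance $d = 2$ because reaching them takes two edges.
Here $k$ counts edges; it is not the Euclidean distance $\sqrt{2}$.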
\begin{mdframed}[style=comment]
Similarly, each line of the matrix $\sum_{k=0}^K\theta_k\bf{L}^k$ defines an irregular patch of radius $K$.
- What do you mean by "line"? Is it a row or a column or something else which I've misunderstood?
\end{mdframed}
We mean ``column'' and have updated the manuscript.
\begin{mdframed}[style=comment]
... where $\tilde{\bf{L}}= \frac{2}{\lambda_{max}}\bf{L}-\bf{I}=-\frac{2}{\lambda_{max}}\bf{D}^{-1/2}\bf{W}\bf{D}^{-1/2}$ is the rescaled Laplacian with eigenvalues $\tilde{\Lambda}$ in [-1, 1].
- Could you put the equation in its own 2 line align environment to make it easier to read?
\end{mdframed}
Inspired by the reviewer's suggestion, we changed the layout of that equation.
\begin{mdframed}[style=comment]
By construction of our graph, $|\mathcal{E}| < 8N_{pix}$ and the overall computational cost of the convolution reduces to $\mathcal{O}(N_{pix})$ operations. That is much more efficient than filtering with spherical harmonics, even though HEALPix was designed as an iso-latitude sampling that has a fast spherical transform.
- This could be written as "By construction of our graph, $|\mathcal{E}| < 8N_{pix}$ and the overall computational cost of the convolution reduces to $\mathcal{O}(N_{pix})$ operations and as such is much more efficient than filtering with spherical harmonics, even though HEALPix was designed as an iso-latitude sampling that has a fast spherical transform." to make it easier to read.
\end{mdframed}
Thanks for the suggestion. We changed the manuscript.
\begin{mdframed}[style=comment]
That is especially true for smooth kernels which require a low polynomial degree K.
- It is unclear what "That" refers to. If making the above change then "That" could be changed to "This" and it would make sense.
\end{mdframed}
Thanks for spotting this.
\begin{mdframed}[style=comment]
Coarsening can be naturally designed for hierarchical pixelisation schemes, as each subdivision divides a cell in an equal number of child sub-cells.
- I think you mean "where" instead of "as"
\end{mdframed}
% \todo{Assigned: @tomek}
% \nati{I think the "as" is correct here. We use it for "because". Not sure...}
Thank you for the suggestion, we adopted it.
\begin{mdframed}[style=comment]
Coarsening is the reverse operation: merging the sub-cells toward the goal of summarizing the data supported on them.
- This has an unusual syntax which makes it difficult to follow. I think you mean something like "To coarsen, the sub-cells are merged to summarise the data supported on them."
\end{mdframed}
% \todo{Assigned: @tomek}
% \nati{I prefer the version we have. But I leave it up to you to decide.}
We updated the text as suggested.
\begin{mdframed}[style=comment]
Given a map $\bf{x}\in\mathbb{R}^{N_{pix}}$, pooling defines $\bf{y}\in\mathbb{R}^{N'_{pix}}$ such as
- It should be "such that" not "such as". "Given a map $\bf{x}\in\mathbb{R}^{N_{pix}}$, pooling defines $\bf{y}\in\mathbb{R}^{N'_{pix}}$ such that"
\end{mdframed}
Thank you for spotting this mistake. We have corrected the manuscript.
\begin{mdframed}[style=comment]
... where $f$ is a function which operates on sets (possibly of varying sizes) and $N_{pix}/N'_{pix}$ is the down-sampling factor, which for HEALPix is $|\mathcal{C}(i)| = N_{pix}/N'_{pix} = (N_{side} / N'_{side})^2 = 4^p$.
- Can you explain what $p$ is and it would probably be worth splitting this equation into a multi-line align environment for clarity.
\end{mdframed}
$p$ was introduced to highlight the fact that the coarsening factor is a power of 4 by construction of HEALPix.
It is defined as $p=\log_2(N_{side} / N'_{side})$.
We have updated the manuscript with this clarification and with an equation environment.
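As a worked example of this definition: coarsening from $N_{side} = 4$ to $N'_{side} = 2$ gives $p = \log_2(4/2) = 1$, and coarsening from $N_{side} = 8$ to $N'_{side} = 2$ gives $p = \log_2(8/2) = 2$, so that
\[
  |\mathcal{C}(i)| = (4/2)^2 = 4^1 = 4 \qquad \mbox{and} \qquad |\mathcal{C}(i)| = (8/2)^2 = 4^2 = 16,
\]
respectively.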
\begin{mdframed}[style=comment]
The tail is composed of multiple fully connected layers (FC) followed by an optional softmax layer (SM).
- This needs to be expanded to state "The tail is composed of multiple fully connected layers (FC) followed by an optional softmax layer (SM) if the network is used for discrete classification."
\end{mdframed}
We followed the suggestion.
\begin{mdframed}[style=comment]
A non-linear function $\sigma(\cdot)$ is applied after every linear GC and FC layer, except for the last FC layer where it is set to the identity.
- Why specify this is identity? It could take any activation - you have already mentioned that it could be a softmax.
\end{mdframed}
We have removed the identity specification.
\begin{mdframed}[style=comment]
The rectified linear unit (ReLU) $\sigma(\cdot) = \textrm{max}(\cdot, 0)$ is a common choice.
- Is this the choice you adopt? You should mention that you do, if you indeed do.
\end{mdframed}
We followed the suggestion.
\begin{mdframed}[style=comment]
Note that the output $\bf{Y}\in\mathbb{R}^{N_{pix}\times F_{out}}$ of the last GC (or the output $\bf{Y}\in\mathbb{R}^{N_{stat}\times F_{out}}$ of the ST), ...
- What is ST, you don't mention it before and only once after?
\end{mdframed}
Thanks for spotting this.
It was a leftover from an earlier draft.
\begin{mdframed}[style=comment]
The output’s size of the neural network depends on the task.
- It would be more correct to write "The size of the output of the neural network depends on the task."
\end{mdframed}
We followed the suggestion.
\begin{mdframed}[style=comment]
Its radius should be large enough to capture statistics of interest.
- I think you mean the radius of the kernel. Perhaps this could be rewritten.
\end{mdframed}
% \todo{Assign: @tomek}
% \todo{To check: @all}
% \nati{The manuscript seems clear to me. But let me know if you think otherwise.}
% \mdeff{Radius of the field, not kernel or filter.}
We updated the text.
\begin{mdframed}[style=comment]
For example, a small a partial sky observation can provide only limited information of cosmological relevance.
- There is just a little mix up in this sentence, maybe a copy and paste accident.
\end{mdframed}
Thanks for spotting it. We have fixed this mistake.
\begin{mdframed}[style=comment]
The cost (or loss) function $C(\bf{Y},\bar{\bf{Y}}) = C(NN_\theta(\bf{X}),\bar{\bf{Y}})$ measures how good the prediction $\bf{Y}$ is for sample $\bf{X}$, given the ground truth $\bar{\bf{Y}}$.
- This is true for supervised learning only and should be stated.
\end{mdframed}
We do not think it is necessary to specify that this is true for supervised learning only.
The phrase ``ground truth'' should be sufficient to indicate that.
%This cost cannot be computed in an unsupervised setting anyway.
% \nati{In an unsupervised setting, this can just not be computed... Specifying it seems redundant to me.}
% \todo{To check: @michael}
% \mdeff{Agreed. The whole paper is about supervized learning anyway.}
\begin{mdframed}[style=comment]
For global prediction, the cost is as if $N_{pix} = 1$.
- I think this should be "For global prediction $N_{pix} = 1$."
\end{mdframed}
We changed the text to:
\begin{mdframed}[style=manuscript]
For global prediction, we have $N_{pix} = 1$.
\end{mdframed}
%\todo{To check: @michael}
%\mdeff{Agreed.}
\begin{mdframed}[style=comment]
We emphasize that the cost function and the SM layer is the sole difference between a neural network engineered for classification or regression.
- This should be "We emphasise that the cost function and the SM layer are the sole differences between a neural network engineered for classification or regression."
\end{mdframed}
We followed the suggestion.
\begin{mdframed}[style=comment]
The goal of learning is to find the parameters $\theta$ of the neural network that minimize the risk $R(\theta) = E\left[C\left(NN_\theta(\bf{X}),\bar{\bf{Y}}\right)\right]$.
- It is the goal of training, not of learning (in common parlance anyway). Also, you should say that E[] is the expectation rather than leaving people to guess according to the next sentence.
\end{mdframed}
% According to us, the goal of training is to minimize the empirical risk, which does not contain the expectation.
% On the contrary, the goal of learning is to minimize the expected risk, i.e., we additionally do not want to overfit the training set.
% Hence we left ``learning'' in the text.
% Nevertheless, we specified the expectation operator.
% \nati{@Michael, please check this one carefully, also in the text, I am not sure my notation is correct}
% \mdeff{Debatable. I can see overfitting as an anti-goal of training. Doesn't matter much to me. Agreeing with him is simpler. Hence:}
We followed the suggestions.
\begin{mdframed}[style=comment]
The optimization is performed by computing an error gradient w.r.t. all the parameters by back-propagation and updating them with a form of stochastic gradient descent (SGD):
- This sentence is just a bit turned upside-down, it should be "The optimisation is performed by computing an error gradient w.r.t. all the parameters and updating each parameter via back propagation using a form of stochastic gradient descent (SGD):"
\end{mdframed}
% \todo{To check: @michael, \nati{again I disagree with the reviewer}}
% \mdeff{Agree with Nathanaël.}
We are not sure we understand the reviewer's suggestion.
Back-propagation is a technique to compute gradients, not to update the parameters.
The update itself is done with SGD.
We believe that the original formulation is more accurate than the proposed one.
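To make the distinction explicit, a plain SGD step can be sketched as follows, reusing $C$ and $NN_\theta$ from the manuscript.
The learning rate $\eta$ and the mini-batch $B$ of training pairs $(x, y)$ are notation introduced here for illustration only; they are assumptions and do not appear in the manuscript.
% Assumed notation: \eta (learning rate) and B (mini-batch) are illustrative only.
\begin{equation*}
\theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{(x, y) \in B} \nabla_\theta C\left(NN_\theta(x), y\right).
\end{equation*}
Back-propagation is the procedure that evaluates the gradient $\nabla_\theta C$ on the right-hand side, while the assignment to $\theta$ is the SGD update.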
\begin{mdframed}[style=comment]
As this formulation uses the standard 2D convolution, all the optimizations developed for images can be applied, what makes it computationally efficient.
- It should be "which makes it computationally efficient" instead of "what". (It's possibly "that" and not "which", but I'm never sure which ;))
\end{mdframed}
This mistake has been corrected.
\begin{mdframed}[style=comment]
A straightforward way to impose locality in the original domain is to impose smoothness in the spectral domain, by Heisenberg’s uncertainty principle.
- What do you mean by "by Heisenberg’s uncertainty principle." That seemed to come out of nowhere. This needs explaining further.
\end{mdframed}
We added the general idea behind Heisenberg's uncertainty principle.
% \mdeff{I prefer to leave "by Heisenberg [...]". Readers will otherwise wonder why the recipe... I borrowed the text from your below justification.}
% The link with the Heisenberg's uncertainty principle does exist but is too complex to explain shortly.
% We decided to remove ``by Heisenberg’s uncertainty principle'' and keep the rest of the sentence as a general Fourier analysis recipe.
% \nati{The reviewer is correct. Our justification is shady. I am not sure how to make this clear, without writing too much. It is kind of a general consent that smoothness in one domain implies localization in the other. The reason is that for the continuous Fourier transform, we have:
% \begin{equation*}
% \widehat{f(t\cdot)}(\omega)=\frac{1}{t}\hat{f}\left(\frac{\omega}{t}\right).
% \end{equation*}
% Hence by concentrating $f$ in the time domain, i.e. making $t$ smaller, we also reduce the derivative of $\widehat{f(t\cdot)}$ in the spectral domain, making it smoother.
% Heisenberg uncertainty principle bounds the concentration in both domains simultaneously, i.e. we have (to be checked if added to the paper)
% \begin{equation*}
% \text{var}({f})\text{var}({\hat{f}}) \geq \frac{1}{16 \pi^2}
% \end{equation*}
% Hence the theorem says that you cannot concentrate arbitrarily in one domain without de-concentrating in the other.
% Now the question is: how do we fix the text? Also Heisenberg does not exist in the discrete case...
% }
\begin{mdframed}[style=comment]
All the above methods cannot be easily accelerated when the data lies on a part of the sphere only.
- You mention your method as part of the "above methods", but then you go on to say that using graphs does allow you to accelerate the method when data is only on part of the sphere.
\end{mdframed}
Indeed.
We corrected the text.
\begin{mdframed}[style=comment]
By interwinding graph convolutional layers and recurrent layers [65], they can for example model structured time series such as traffic on road networks [66], or recursively complete matrices for recommendation [67].
- I'm not sure what interwinding is, I guess you mean by making GC recurrent, or do you mean using GC and then RL separately? Either way there should be commas around "for example".
\end{mdframed}
Thanks for spotting this.