Create RStudio_user.rst #950

Open
wants to merge 47 commits into
base: main
e3d58b6
Create Studio_user.rst
jcolomb Apr 26, 2023
4542a93
Update Studio_user.rst
jcolomb Apr 26, 2023
52a2a2f
Update Studio_user.rst
jcolomb Apr 26, 2023
78eee42
Update Studio_user.rst
jcolomb Apr 26, 2023
24ae8d9
Update Studio_user.rst
jcolomb Apr 26, 2023
c2f7dad
Update Studio_user.rst
jcolomb Apr 26, 2023
6d5db5e
Update Studio_user.rst
jcolomb Apr 26, 2023
1aff71a
Update Studio_user.rst
jcolomb Apr 27, 2023
fa08b07
Merge branch 'datalad-handbook:main' into master
jcolomb Jun 1, 2023
ca3cc1c
rename file + change beginning (for review before applying to all cha…
jcolomb Jun 1, 2023
71c55f3
Apply suggestions from code review
jcolomb Jul 6, 2023
bc87dfc
ignoremore
jcolomb Jul 6, 2023
f0f8466
Rewriting second part, following the Max and Bobby interactions
jcolomb Jul 11, 2023
fecb534
add images
jcolomb Jul 11, 2023
2f97f94
ref typo ?
jcolomb Jul 16, 2023
ad710fd
Update docs/usecases/RStudio_user.rst
jcolomb Jul 16, 2023
474b6d9
Update docs/usecases/RStudio_user.rst
jcolomb Jul 16, 2023
73ede19
Update docs/usecases/RStudio_user.rst
jcolomb Aug 4, 2023
2ab1360
adding correct path for images
jcolomb Aug 7, 2023
ac69b49
look at code and commands synthax
jcolomb Aug 7, 2023
2f0d118
Merge pull request #2 from datalad-handbook/main
jcolomb Aug 10, 2023
9e928bb
typos
jcolomb Aug 14, 2023
f8c870d
Update docs/usecases/RStudio_user.rst
jcolomb Aug 14, 2023
53e6b62
typo
jcolomb Aug 14, 2023
7dc7feb
Update docs/usecases/RStudio_user.rst
jcolomb Aug 14, 2023
68d6f63
Update intro.rst
jcolomb Aug 14, 2023
435bdde
original .gitignore
jcolomb Oct 6, 2023
e29d9c7
Apply suggestions from code review: mostly typos
jcolomb Oct 6, 2023
a65103f
speel check
jcolomb Oct 6, 2023
b7577eb
DataLad spelling
jcolomb Oct 10, 2023
99e9c58
moving gintonic info into a box
jcolomb Oct 10, 2023
0ef0935
add notes on push
jcolomb Oct 10, 2023
c9c8697
Merge pull request #3 from datalad-handbook/main
jcolomb Oct 10, 2023
746d3ed
trying to correct new image address
jcolomb Oct 10, 2023
72e0dbe
correct copypaste error
jcolomb Oct 10, 2023
c8776da
adding some precisions
jcolomb Oct 18, 2023
299983f
adding comments on datalad run use in practice
jcolomb Oct 18, 2023
1422d0a
add reference to git-annex intro
jcolomb Oct 18, 2023
c64774f
debug links
jcolomb Oct 18, 2023
4460734
trying to clean handbook links
jcolomb Oct 18, 2023
3c6bc95
Fix anchor format
adswa Dec 18, 2023
85e8e5e
fix heading
adswa Dec 18, 2023
2ed4587
Merge branch 'datalad-handbook:main' into master
jcolomb Dec 18, 2023
7074015
add tab for gitusernote
jcolomb Dec 18, 2023
a4a5155
fix references
adswa Dec 18, 2023
d870dde
formatting fix
adswa Dec 19, 2023
ab18dc1
add space
adswa Dec 19, 2023
221 changes: 221 additions & 0 deletions docs/usecases/RStudio_user.rst
@@ -0,0 +1,221 @@
.. _usecase_Rstat:

My first steps as an RStudio user
---------------------------------

.. index:: ! Usecase; R users quickstart

This use case sketches typical entry points for R and RStudio users.

#. A repository with submodules for data and code is cloned.
#. R scripts are developed in RStudio.
#. R scripts are run using ``datalad run``.

This is a "hello world" type of analysis, used for demonstration purposes only.

The Challenge
^^^^^^^^^^^^^

Max has been using RStudio together with GitHub for a long time and knows how Git
works. Max has learned that Git alone will not work for their new project,
because there will be too many files and some dataset files will be too large.
Max has read the DataLad handbook basics and has decided to use DataLad.
Max still wants to use RStudio and a combination of R and Python scripts for the
data analysis.

Bobby is a data manager who has already learned (the hard way) how to use DataLad
from within RStudio. They have also created a GIN repository with submodules
for data and for code, using the Tonic tool and templates.


Setting up
^^^^^^^^^^

Max first wants to clone the repository onto their computer, so they use the RStudio
`create a new project` function with the SSH address of the parent repository.
Max cannot see the content of the submodules and asks Bobby for help.

Bobby comes over and runs ``datalad get . -n -r`` in the terminal window of RStudio.

They then explain:

- RStudio only supports basic Git commands, which do not clone the content of submodules.
- DataLad commands are run in the terminal window. DataLad has no R package and does not run in the R console.
- The command ``datalad get .`` downloads all files; here it is combined with two options:

  - ``-n`` means that annexed file content is not downloaded
  - ``-r`` means that the command is run recursively in all submodules
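
Bobby's command can be wrapped in a small shell helper. This is only a sketch: the function name ``fetch_subdatasets`` is made up for illustration, while ``datalad get`` with ``-n`` and ``-r`` is the command discussed above.

```shell
# Hypothetical helper (the name is an assumption, not part of DataLad).
# Installs all subdatasets below a path without downloading annexed content.
fetch_subdatasets() {
    # -n: do not download annexed file content
    # -r: recurse into all subdatasets (submodules)
    datalad get -n -r "${1:-.}"
}

# Usage, from the terminal tab of the RStudio project:
#   fetch_subdatasets .
```

Annexed file content can still be fetched later, file by file, with ``datalad get <path>``.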

Max has learned something and realises that git-annex is probably not set up in their repository.
After reading the handbook, they see that the create command needs to be forced
when the folder already exists, so they run
``datalad create --force -r`` in the parent repository.
Now they are sure that DataLad is set up to work in the repository and all submodules,
since they used the ``-r`` option.

Contributor


I feel like this can conclude with a statement that with this setup, everything is good to go for DataLad commands from the console, for example for saving changes, pushing modifications, pulling updates, or adding siblings.

Contributor Author


Added some information to the box; indeed, pushing would need additional setup in this scenario.

Working on the code
^^^^^^^^^^^^^^^^^^^


Contributor


Maybe for this section it makes sense to focus on the topic of reproducible execution with datalad run.
I think there is no need to spend too much work on rewriting content about the difference between files kept in Git versus in git-annex (instead, add references to existing parts of the handbook).

Contributor Author


I kept this in the following version because:

  • Beginners may not care about reproducible execution.
  • Information about the difference between Git and git-annex is necessary to explain RStudio's behavior. I was personally very surprised by that behavior and needed testing and thinking to understand what happens. I will try to add some references for more information.

THAT IS WHERE NEW WRITING STOPS.


1. Code and data: Git and git-annex
2. Using ``datalad run``?
3. Dealing with (relative) paths
4. Oops, how to undo a ``datalad save`` done in the wrong repository?

.. gitusernote:: Take home messages

   DataLad commands run in the terminal, not in the R console.

   The simplest way to tell DataLad not to use git-annex for your code files is the ``datalad create -c text2git --force`` command.

   The ``datalad run Rscript "path-to-script.r"`` command will run your script.

   Use additional options to read or write annexed files (and to give more information in the commit message).

   In your R script, use paths relative to the project, not relative to the location of the code.


Working with existing repositories
----------------------------------

I usually create my repositories online and then clone them onto my computer
using the RStudio `create a new project` function.
Because RStudio only supports basic Git functions, this does not clone submodules.
I therefore use the ``datalad get . -n -r`` command to do that.
As an RStudio user, I use the integrated terminal tab (next to the console tab),
so that the command is executed in the right folder.

Details:

- ``get .`` means download all files
- ``-n`` means that annexed file content is not downloaded
- ``-r`` means that the command is run in all submodules

Note that if you cloned "pure Git" repositories, DataLad commands will not use git-annex.
To make DataLad use git-annex (in all submodules), you need to run
``datalad create --force -r`` in the parent repository.



Why git-annex
-------------

As you probably know if you are reading this, Git
does not work well with large or numerous files, and I want to use
DataLad to circumvent these issues. The rough idea is that data files
should be annexed, while the code should use the normal Git
workflow, which is more powerful and convenient for text files.

.. gitusernote:: annexed files

   When files are added via git-annex, their content is moved elsewhere and the file becomes a kind of link to the real content. In the RStudio file pane, clicking on the symlink will open the file content, but in read-only mode. So if you annex your code, you will not be able to make changes and save them directly in RStudio. In addition, the advantages of Git for text files are lost, as annexed content is treated like binary files: each new version is saved in its entirety.

   To save changes to an annexed file, one first needs to unlock the file in question (using the ``datalad unlock <filename>`` command). Then you can overwrite the file and save its new state.
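
The unlock, overwrite, and save cycle described in the note could look like the following sketch. The function name and its two arguments are assumptions made for illustration; only ``datalad unlock`` and ``datalad save`` come from DataLad itself.

```shell
# Hypothetical helper: overwrite an annexed file with new content.
replace_annexed_file() {
    target=$1        # annexed file to overwrite
    new_content=$2   # file holding the new content
    # Unlock first: a locked annexed file is read-only
    datalad unlock "$target" || return 1
    cp "$new_content" "$target" || return 1
    # Saving records the new state of the file
    datalad save -m "Update $target" "$target"
}

# Usage:
#   replace_annexed_file <annexed-file> <file-with-new-content>
```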

DataLad default: all annexed!?
------------------------------

I am used to writing code and version controlling it with Git (all inside
RStudio). My usual workflow is to modify files, save them, and then
commit all the changes at once. I push these changes to a remote
repository from time to time.

My first reflex was to keep this workflow but run ``datalad save`` in
the terminal window of RStudio instead of the commit step. This does
not work, because DataLad uses git-annex by default for all files
(see the detail box if it is unclear why this fails). It also uses
git-annex for files that were previously added via ``git add``. Therefore,
you should tell DataLad not to use git-annex for your code files in order to
keep your usual workflow.



The simplest way to tell DataLad not to use git-annex for your code
files is the ``datalad create -c text2git --force`` command (``--force`` is
necessary if you change an existing repository). Note that all text
files will be added to Git with this configuration, so if you have large text files
(.csv or .json files), you will need to be more precise about which text
files should not be annexed. See
<http://handbook.datalad.org/en/inm7/basics/101-124-procedures.html#>
for details on how text2git changes ``.gitattributes`` to achieve that.
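
One way to keep large text files in the annex after applying text2git is an extra rule in ``.gitattributes``. The helper below is a sketch under assumptions: the function name is hypothetical, and it relies on the git-annex ``annex.largefiles`` attribute and on Git letting later lines in ``.gitattributes`` override earlier ones.

```shell
# Hypothetical helper: make git-annex handle all files matching a pattern,
# even when a more general rule sends text files to Git.
annex_by_pattern() {
    pattern=$1
    attributes_file=${2:-.gitattributes}
    # Later lines in .gitattributes override earlier ones for the same attribute
    printf '%s annex.largefiles=anything\n' "$pattern" >> "$attributes_file"
}

# Usage: annex_by_pattern '*.csv', then save the changed file, e.g. with
#   datalad save -m "Annex CSV files" .gitattributes
```

Quoting the pattern matters: without the single quotes the shell would expand ``*.csv`` to the matching file names before the helper sees it.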

Using ``datalad run``?
----------------------

Do I have to use datalad run?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In theory, you can run your R script the way you are used to, as long as all files are present locally and you are not overwriting annexed files. If you need to access files that are only on the server (because you dropped them), you need to run ``datalad get`` to download them first. If you need to overwrite files that were saved via git-annex (that is, files that are not text files), you need to unlock them. You can unlock your whole repository, including files in submodules, by running ``datalad unlock -r .``.

.. gitusernote:: locking

   To lock the files again, you can use ``datalad save`` (and its variants). This will not create a new commit (unless changes other than the unlock were made).
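
Put together, running a script without ``datalad run`` means doing the get, unlock, and save steps by hand. The following is a minimal sketch for a script with a single input and a single output; the function name and its arguments are assumptions made for illustration.

```shell
# Hypothetical helper: run an R script by hand on annexed files.
run_script_by_hand() {
    script=$1   # path to the R script
    input=$2    # annexed file the script reads
    output=$3   # annexed file the script overwrites
    # Download the input content if it is not present locally
    datalad get "$input" || return 1
    # Unlock the output only if it already exists (as a file or a symlink)
    if [ -e "$output" ] || [ -L "$output" ]; then
        datalad unlock "$output" || return 1
    fi
    Rscript "$script" || return 1
    # Saving records the result and locks the output again
    datalad save -m "Run $script by hand" "$output"
}
```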




How to use datalad run
^^^^^^^^^^^^^^^^^^^^^^

Because DataLad runs in the terminal, it needs a terminal command to run the script. For R, that command is ``Rscript``: ``datalad run Rscript "<path-to-script.r>"``. Note that the path is relative to the terminal's current directory; if you are using RStudio projects, the terminal tab opens by default in the working directory of the project. If your code is in one submodule and the data is in another, you should run this command from the parent repository.

To access annexed files, we need to use the input and output options:

.. code-block:: bash

   # datalad run options go before the command; the command comes last, as one quoted string
   $ datalad run \
       --input "file1.csv" \
       --input "data/file2.json" \
       --output "figures/*.png" \
       --explicit \
       "Rscript <path-to-script.r> {inputs} {outputs}"



Behavior explained:

- Input: files are downloaded if they are not present, so that they can be read. Note that they are not unlocked (this is not needed for reading) and that they are not dropped again after being read.
- Output: files are unlocked so that they can be overwritten. If the files are not present (dropped), they are not downloaded. This may make your code fail: if it does, either get the files manually before running ``datalad run``, or remove them in the R code (``file.remove()``). Otherwise it will work, and DataLad will even detect when a file has not been modified and make no commit.
- ``--explicit``: normally, ``datalad run`` only runs in clean repositories, and this includes all submodules. With ``--explicit``, DataLad only checks that the output files are clean, and only output files are saved. Please use it with care, as the script and data you use are not checked and provenance information can be lost.
- ``{inputs}`` and ``{outputs}``: if you add these placeholders, DataLad passes the input and output paths as arguments to the ``Rscript`` command. You can access them in the R script with ``args <- commandArgs(trailingOnly = TRUE)`` (then get them with ``args[i]``, where ``i`` starts at 1).
- At the end, DataLad usually runs ``datalad save -r``, so that modifications made by the code in the whole repository, including submodules, are saved (except when ``--explicit`` is given, see above). This includes any intermediate files created by your code when it is run from the terminal, that is, using ``Rscript "path-to-code.R"`` (running from the terminal can create more files than running the code directly in RStudio).

You can specify as many input and output files as you need. You can use ``*`` to match several files with a similar ending (in the example, all .png figures will be unlocked), and you can list files that are not annexed to give more information in the commit message.

.. gitusernote:: using datalad run

   Unlocking a file makes the repository state "unclean", so if you use ``datalad run``, you need to declare the outputs with the ``--output`` option; you cannot unlock the files manually beforehand.

   The commit message only reflects the options given; whether the code actually uses these input and output files is not checked.

   You can write these ``datalad`` commands in a shell script file in RStudio; pushing the run button will then run them in the terminal.

   Using ``datalad run`` correctly is sometimes tricky, and since it saves each time, it can make the repository history quite messy. Make sure to give good commit messages.
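
To make a descriptive commit message part of every run, the call can be wrapped in a shell function. This is a sketch, not a recommended interface: the function name, its four arguments, and the file names in the usage comment are invented for illustration, and it assumes the script path contains no spaces. The ``-m`` option of ``datalad run`` sets the commit message.

```shell
# Hypothetical wrapper around `datalad run` for an R script with one input
# file and one output pattern.
run_r_script() {
    message=$1   # commit message describing the analysis step
    script=$2    # path to the R script, relative to the current directory
    input=$3     # file the script reads
    output=$4    # file or glob pattern the script writes
    # All datalad options come before the command; the command itself is one
    # quoted string so that {inputs} and {outputs} are expanded by DataLad.
    datalad run \
        -m "$message" \
        --input "$input" \
        --output "$output" \
        "Rscript $script {inputs} {outputs}"
}

# Usage (file names are placeholders):
#   run_r_script "Plot the cleaned data" code/plot.R data/clean.csv "figures/*.png"
```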


The advantage of using ``datalad run`` over running the code directly is that R code cannot always access annexed files directly: they may be locked, or even be present only on the server and not on the computer. For each input and output file, you would need to get it or unlock it manually before running the code, and then save it again. ``datalad run`` does all of that automatically.

In addition, ``datalad run`` writes specific information into the commit message, so that it is easy to understand what was done and so that the ``datalad rerun`` command can be used.
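
As a sketch of how this record can be used later: commits created by ``datalad run`` carry a ``[DATALAD RUNCMD]`` marker in their message, so they can be listed with ``git log`` and re-executed with ``datalad rerun``. The two function names below are hypothetical.

```shell
# List commits created by `datalad run`; their messages start with
# "[DATALAD RUNCMD]". -F makes git treat the pattern as a fixed string.
list_run_commits() {
    git log --oneline -F --grep='[DATALAD RUNCMD]'
}

# Re-execute the command recorded in a commit (default: the latest commit).
rerun_commit() {
    datalad rerun "${1:-HEAD}"
}

# Usage:
#   list_run_commits
#   rerun_commit <hash>
```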


Dealing with (relative) paths
-----------------------------

You may work on your code in a submodule using your usual Git workflow. It is still best practice to write the paths in your code as if it were run from the parent repository. You may also run the code from there.

My current workflow is to have two RStudio projects open. I work in the parent repository, but make commits and push in the code repository.
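
When a terminal is open inside the code submodule, the parent repository can be located with Git before running a script. The helper below is a sketch: its name is made up, it only goes up one level of nesting, and it assumes a Git version that supports ``git rev-parse --show-superproject-working-tree``.

```shell
# Hypothetical helper: print the directory from which paths should be resolved.
project_root() {
    # Absolute path of the superproject's working tree; empty if the
    # current repository is not a submodule
    super=$(git rev-parse --show-superproject-working-tree 2>/dev/null)
    if [ -n "$super" ]; then
        printf '%s\n' "$super"
    else
        git rev-parse --show-toplevel
    fi
}

# Usage: run a script from the parent repository, e.g.
#   cd "$(project_root)" && Rscript <path-to-script.r>
```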

Undo ``datalad save``
---------------------

Sometimes one goes too fast and runs ``datalad save`` in a repository that was not ready to be saved, or one runs ``datalad run`` and wants to undo it. This is a bit complex and needs some manual intervention.

The handbook explains well what to do: https://handbook.datalad.org/en/0.17/basics/101-137-history.html#untracking-accidentally-saved-contents-stored-in-git-annex

- You need to manually find the hash of the commit you want to go back to, and what was changed in git-annex since then. You can do that in RStudio via the history button of the Git tab (with some patience if you want to go far back).
- Unlock all files that were created, using ``datalad unlock <filename>``.
- Then go back to the earlier commit with ``git reset --mixed <hash>``.

The save (but not the run) has now been undone, and the files are present as untracked content (both the files that were put in git-annex and the files that were put in Git).
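
The steps above can be sketched as a shell function. The function name is an assumption, the hash and the file list still have to be looked up by hand, and the function rewrites the branch history, so it is best tried on a copy of the repository first.

```shell
# Hypothetical helper: undo saves back to a given commit.
# Usage: undo_save <hash> <file>...
undo_save() {
    target_commit=$1
    shift
    # Unlock every file that was added since the target commit, so that
    # regular files rather than annex symlinks remain after the reset
    for file in "$@"; do
        datalad unlock "$file" || return 1
    done
    # Move the branch back; --mixed leaves the files in the working tree
    git reset --mixed "$target_commit"
}
```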