The steps in this section only ever need to be done once on any particular system.
Google Cloud configuration:

1. Install Google Cloud SDK: https://cloud.google.com/sdk/docs/install.
1. Log in to your work Google Account: run gcloud auth login and follow instructions.
1. Obtain Google application credentials: run gcloud auth application-default login and follow instructions.
Check that you have the make
utility installed, and if not (which is unlikely), install it using your system package manager.
Check that you have java
installed.
Run make setup-dev
to install/update the necessary packages and activate the development environment. You need to do this every time you open a new shell.
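Taken together, the one-time setup boils down to the following commands (assuming the Google Cloud SDK and make are already installed):

# Authenticate with your work Google Account.
gcloud auth login
# Obtain Google application default credentials.
gcloud auth application-default login
# Install/update packages and activate the development environment
# (re-run this last command in every new shell).
make setup-dev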
It is recommended to use VS Code as an IDE for development.
All pipelines in this repository are intended to be run in Google Dataproc. Running them locally is not currently supported.
In order to run the code:
1. Manually edit your local workflow/dag.yaml file and comment out the steps you do not want to run.
1. Manually edit your local pyproject.toml file and modify the version of the code, for example to 1.2.3+jdoe or 1.2.3+jdoe.myfeature.
1. Run make build.
1. Submit the Dataproc job with poetry run python workflow/workflow_template.py (run it with --help to see usage).

An example command sequence is shown after the checklist below.

When making changes, and especially when implementing a new module or feature, it's essential to ensure that all relevant sections of the code base are modified:

- [ ] Run make check. This will run the linter and formatter to ensure that the code is compliant with the project conventions.
- [ ] Develop unit tests for your code and run make test. This will run all unit tests in the repository, including the examples appended in the docstrings of some methods.
- [ ] Update the configuration if necessary.
- [ ] Update the documentation and check it with make build-documentation. This will start a local server to browse it (the URL will be printed, usually http://127.0.0.1:8000/).
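As a rough illustration, once workflow/dag.yaml and pyproject.toml have been edited, a typical sequence looks like this (the exact arguments to the workflow script depend on your configuration):

# Build the code with the version set in pyproject.toml.
make build
# Show the available options for submitting the Dataproc job.
poetry run python workflow/workflow_template.py --help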
For more details on each of these steps, see the sections below.
Documentation pages are generated with ::: directives. They will automatically generate sections of the documentation based on class and method docstrings. Be sure to update them for:

- docs/reference/dataset (example: docs/reference/dataset/study_index/study_index_finngen.md)
- docs/reference/step (example: docs/reference/step/finngen.md)
- config/datasets/gcp.yaml
- config/step/my_STEP.yaml (example: config/step/my_finngen.yaml)
- src/otg/config.py (example: FinnGenStepConfig class in that module)
- src/otg/dataset/ (example: src/otg/dataset/study_index.py → StudyIndexFinnGen)
- src/otg/STEP.py (example: src/otg/finngen.py)
- tests/conftest.py (example: mock_study_index_finngen in that module)
- tests/data_samples (example: tests/data_samples/finngen_studies_sample.json)
- tests/ (example: tests/dataset/test_study_index.py → test_study_index_finngen_creation)
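For illustration, such a reference page (for example docs/reference/dataset/study_index/study_index_finngen.md) usually contains little more than a single directive; the exact module path below is an assumption based on the repository layout and should be adjusted to the actual class location:

::: otg.dataset.study_index.StudyIndexFinnGen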
If you see errors related to BLAS/LAPACK libraries, see this StackOverflow post for guidance.
If you see various errors thrown by Pyenv or Poetry, they can be hard to specifically diagnose and resolve. In this case, it often helps to remove those tools from the system completely. Follow these steps:
1. exit
1. curl -sSL https://install.python-poetry.org | python3 - --uninstall
1. rm -rf ~/.cache/pypoetry
1. rm -rf ~/.cache/pre-commit
1. pyenv shell system
1. Edit ~/.bashrc to remove the lines related to Pyenv configuration
1. rm -rf ~/.pyenv
After that, open a fresh shell session and run make setup-dev
again.
Officially, PySpark requires Java version 8 (a.k.a. 1.8) or above to work. However, if you have a very recent version of Java, you may experience issues, as it may introduce breaking changes that PySpark hasn't had time to integrate. For example, as of May 2023, PySpark did not work with Java 20.
If you are encountering problems with initialising a Spark session, try using Java 11.
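A sketch of checking and switching the Java version on Ubuntu (package and command names are assumptions that may differ on other systems):

# Check which Java version is currently active.
java -version
# Install Java 11 and select it as the default.
sudo apt install openjdk-11-jdk
sudo update-alternatives --config java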
If you see an error message thrown by pre-commit, which looks like this (SyntaxError: Unexpected token '?'
), followed by a JavaScript traceback, the issue is likely with your system NodeJS version.
One solution which can help in this case is to upgrade your system NodeJS version. However, this may not always be possible. For example, the Ubuntu repository is several major versions behind the latest release as of July 2023.
Another solution which helps is to remove Node, NodeJS, and npm from your system entirely. In this case, pre-commit will not try to rely on a system version of NodeJS and will install its own, suitable one.
On Ubuntu, this can be done using sudo apt remove node nodejs npm
, followed by sudo apt autoremove
. But in some cases, depending on your existing installation, you may need to also manually remove some files. See this StackOverflow answer for guidance.
After running these commands, you are advised to open a fresh shell, and then also reinstall Pyenv and Poetry to make sure they pick up the changes (see relevant section above).
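If the tools need to be reinstalled manually, the upstream installers can be used (a sketch; check the official Pyenv and Poetry documentation for the current instructions):

# Reinstall Pyenv using the official installer.
curl https://pyenv.run | bash
# Reinstall Poetry using the official installer.
curl -sSL https://install.python-poetry.org | python3 -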
This section contains various technical information on how to develop and run the code.
Airflow code is located in src/airflow
. Make sure to execute all of the instructions from that directory, unless stated otherwise.
We will be running a local Airflow setup using Docker Compose. First, make sure it is installed (this and subsequent commands are tested on Ubuntu):
sudo apt install docker-compose
Next, verify that you can run Docker. This should say "Hello from Docker":
docker run hello-world
If the command above raises a permission error, fix it and reboot:
sudo usermod -a -G docker $USER
newgrp docker
This section is adapted from instructions from https://airflow.apache.org/docs/apache-airflow/stable/tutorial/pipeline.html. When you run the commands, make sure your current working directory is src/airflow
.
# Download the latest docker-compose.yaml file.
curl -sLfO https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml

# Make expected directories.
mkdir -p ./config ./dags ./logs ./plugins

# Construct the modified Docker image with additional PIP dependencies.
docker build . --tag opentargets-airflow:2.7.1

# Set environment variables.
cat << EOF > .env
AIRFLOW_UID=$(id -u)
AIRFLOW_IMAGE_NAME=opentargets-airflow:2.7.1
EOF
Now modify docker-compose.yaml
and add the following to the x-airflow-common → environment section:
GOOGLE_APPLICATION_CREDENTIALS: '/opt/airflow/config/application_default_credentials.json'
AIRFLOW__CELERY__WORKER_CONCURRENCY: 32
AIRFLOW__CORE__PARALLELISM: 32
AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG: 32
AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY: 16
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: 1
Now start Airflow:

docker-compose up

The Airflow UI will now be available at http://localhost:8080/home. The default username and password are both airflow.
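To sanity-check that the services have come up, you can query the webserver's health endpoint (assuming the default port mapping from docker-compose.yaml):

# Should report the metadatabase and scheduler as healthy.
curl -s http://localhost:8080/health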
In order to be able to access Google Cloud and do work with Dataproc, Airflow will need to be configured. First, obtain Google default application credentials by running this command and following the instructions:
gcloud auth application-default login
Next, copy the file into the config/
subdirectory which we created above:
cp ~/.config/gcloud/application_default_credentials.json config/
Now open the Airflow UI and edit the google_cloud_default connection, setting its keyfile path to /opt/airflow/config/application_default_credentials.json.
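To verify that Airflow can see the connection, you can query it through the CLI inside one of the containers (the service name below is taken from the stock docker-compose.yaml and may differ in your setup):

# Print the google_cloud_default connection as registered in Airflow.
docker-compose exec airflow-webserver airflow connections get google_cloud_default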
Workflows, which must be placed under the dags/ directory, will appear in the "DAGs" section of the UI, which is also the main page. They can be triggered manually by opening a workflow and clicking on the "Play" button in the upper right corner.
In order to restart a failed task, click on it and then click on "Clear task".
Note that when you add a new workflow under dags/, Airflow will not pick it up immediately. By default the filesystem is only scanned for new DAGs every 300s. However, once the DAG is added, updates are applied nearly instantaneously.
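If the 300s default is too slow for local development, the scan interval can be lowered by adding one more variable to the x-airflow-common → environment section of docker-compose.yaml (an optional tweak, not part of the original setup):

AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: 30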
Also, if you edit the DAG while an instance of it is running, it might cause problems with the run, as Airflow will try to update the tasks and their properties in the DAG according to the file changes.
Ingestion and analysis of genetic and functional genomic data for the identification and prioritisation of drug targets.
This project is still in the experimental phase. Please refer to the roadmap section for more information.
For information on how to contribute to the project, see the contributing section.
For all development information, including running the code, troubleshooting, or contributing, see the development section.