
1 How-to

Note: ~ in the following is /home/rhapsody

1.1 Without Docker

1.1.1 Files & Scripts

1.1.1.1 Retrieve the latest version

To retrieve the latest version of the main script RHAPSODY_WP3_PreDiab.Rmd and the utils/* scripts, you can copy/paste the git command below.
This will clone (i.e., download) the scripts from GitHub.
The scripts will be downloaded into your home directory, within a directory named WP3.

git clone https://github.com/mcanouil/RHAPSODY.git ~/WP3/scripts

Once the download is complete, you should get the following directory tree in your home (`cd ~`).

root2
 °--WP3
     °--scripts
         ¦--docker
         ¦   ¦--Dockerfile
         ¦   ¦--Dockerfile_update
         ¦   ¦--login.html
         ¦   ¦--logo.png
         ¦   ¦--markdown_packages.R
         ¦   °--r_packages.R
         ¦--docker_analysis
         ¦   ¦--opal_credentials.txt
         ¦   ¦--rhapsody.R
         ¦   °--rhapsody.sh
         ¦--docs
         ¦   ¦--howto.Rmd
         ¦   ¦--index.html
         ¦   °--RHAPSODY_Logo_WEB_Color.png
         ¦--opal_credentials.txt
         ¦--README.md
         ¦--RHAPSODY_WP3_PreDiab.Rmd
         °--utils
             ¦--ethnicity_template.csv
             ¦--ethnicity_template.R
             ¦--handleVCF.sh
             ¦--Install_Rpackages.R
             ¦--RHAPSODY_WP3_PreDiab_DEBUG.R
             ¦--RHAPSODY_WP3_PreDiab_OPAL_DEBUG.R
             °--RunAnalysis.R
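
A quick way to confirm that the clone is complete is to test for a few of the files shown in the tree above. The sketch below is illustrative only: the function name `check_scripts` and the selection of files are assumptions, not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the RHAPSODY repository):
# print every expected file that is missing under the given scripts directory.
check_scripts() (
  scripts_dir=$1
  for f in \
    RHAPSODY_WP3_PreDiab.Rmd \
    opal_credentials.txt \
    utils/RunAnalysis.R \
    utils/handleVCF.sh \
    utils/Install_Rpackages.R
  do
    [ -f "$scripts_dir/$f" ] || printf 'missing: %s\n' "$f"
  done
)
```

Calling `check_scripts ~/WP3/scripts` prints nothing when all the listed files are present.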

1.1.1.2 Description

  • ~/WP3/scripts/README.pdf
    a short README file;
  • ~/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd
    an Rmarkdown script which performs:
    • OPAL node access,
    • phenotype QC,
    • preliminary modelling,
    • VCF formatting/filtering,
    • variant analysis;
  • ~/WP3/scripts/opal_credentials.txt
    the server, login and password information to access phenotype data on your local node;
  • ~/WP3/scripts/utils/RHAPSODY_WP3_PreDiab_DEBUG.R
    a small part of the main Rmarkdown script, which includes only node access and phenotype QC;
  • ~/WP3/scripts/utils/RunAnalysis.R
    an example of an R script to run the whole analysis in bash (parameters have to be filled in according to your setup)
    using `Rscript ~/WP3/scripts/utils/RunAnalysis.R`;
  • ~/WP3/scripts/utils/handleVCF.sh
    the bash/shell script used to format/filter VCF files prior to the variant analysis.
    This bash/shell script is also included within the main script, but you can use it directly as a standalone script. If you do so (or if you have already done the formatting and do not want to do it again), the parameter format_vcfs must be set to FALSE;
  • ~/WP3/scripts/utils/Install_Rpackages.R
    an R script which installs a predefined version of all R packages loaded in the Rmarkdown script, only if needed (also available within the main script);
  • ~/WP3/scripts/utils/check_analysis.Rmd
    an Rmarkdown script to quickly check if the analyses went well (i.e., compute a quantile-quantile plot, a Manhattan plot and a plot comparing chromosomal position to sequential position);
  • ~/WP3/scripts/utils/README.Rmd
    an Rmarkdown script to create README.pdf;
  • ~/WP3/scripts/utils/howto.Rmd
    a how-to document and its HTML version howto.html.

1.1.2 Check environment settings

1.1.2.1 R and R packages version

Before running anything, you can check the current settings of your cluster/grid/computer/laptop.
For reproducibility:

  • The default R version is R version 3.4.2 (2017-09-28).
  • The default R packages are listed below.
    package   version   package     version   package      version    package     version
    broom     0.5.0     data.tree   0.7.8     kableExtra   0.9.0      parallel    3.4.2
    scales    1.0.0     devtools    1.13.6    knitr        1.20       readxl      1.1.0
    cowplot   0.9.3     grid        3.4.2     lme4         1.1-18-1   writexl     01.0
    viridis   0.5.1     Hmisc       4.1-1     lmerTest     3.0-1      tidyverse   1.2.1
  • To quickly check and install all R packages that are missing or installed in a different version, you can source/run the script ~/WP3/scripts/utils/Install_Rpackages.R.
    • Interactively using R

      source("~/WP3/scripts/utils/Install_Rpackages.R")
    • Non-interactively using bash/shell

      Rscript ~/WP3/scripts/utils/Install_Rpackages.R
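
The R version itself can be compared with the default one from the shell. The following is a minimal sketch, assuming the first line of `R --version` has the usual form `R version x.y.z (date) ...`; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the output of `R --version` on standard input and
# print the "x.y.z" part of its first line, which looks like:
#   R version 3.4.2 (2017-09-28) -- "Short Summer"
r_version_from() {
  sed -n '1s/^R version \([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: compare the expected version ($1) with the actual one ($2).
check_r_version() (
  expected=$1
  actual=$2
  if [ "$actual" = "$expected" ]; then
    echo "OK: R $actual"
  else
    echo "WARNING: R $actual found, $expected expected"
  fi
)
```

For example: `check_r_version 3.4.2 "$(R --version | r_version_from)"`.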

1.1.2.2 VCFtools version

If VCFtools is not available on your machine, please install it following the instructions on https://vcftools.github.io/.
For reproducibility:

  • The default version is VCFtools (0.1.16).
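
Since the analysis is later given the directory that contains the VCFtools binary, it can help to check that directory up front. This is a sketch only; the function name `has_vcftools` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an executable file named "vcftools" exists
# in the given directory (e.g., /usr/local/bin).
has_vcftools() {
  [ -x "$1/vcftools" ] && [ ! -d "$1/vcftools" ]
}
```

For example, `has_vcftools /usr/local/bin && /usr/local/bin/vcftools --version` should print the installed version if the binary is there.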

1.1.3 Run the analysis

1.1.3.1 OPAL database credentials

To avoid hard-coding the OPAL database credentials in the script, you should create a text file with the server, username and password to access your local OPAL database (i.e., RHAPSODY node hosting the phenotype data).
You can also modify the default file ~/WP3/scripts/opal_credentials.txt.

http://localhost:8080
administrator
password

NOTE: The username provided in that file should have “download” rights.
The data are downloaded locally into the R session for the duration of the analysis, then all local data are deleted (i.e., the data do not leave the R session in any way).
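
The credentials file can be created from the shell so that only its owner can read it. The sketch below assumes the three-line layout shown above (server, username, password); the function names and the strict three-line check are assumptions, not requirements stated by the script.

```shell
#!/bin/sh
# Hypothetical helper: write server, username and password (one per line)
# to the given file.
write_opal_credentials() (
  file=$1 server=$2 user=$3 password=$4
  umask 077   # a newly created file is readable by its owner only
  printf '%s\n%s\n%s\n' "$server" "$user" "$password" > "$file"
)

# Hypothetical helper: succeed only if the file has exactly three lines,
# none of them empty.
check_opal_credentials() {
  [ "$(grep -c '' "$1")" -eq 3 ] && [ "$(grep -c . "$1")" -eq 3 ]
}
```

For example: `write_opal_credentials /media/credentials/opal_credentials.txt http://localhost:8080 administrator password`.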

1.1.3.2 Imputation quality (VCF)

Please check where the imputation quality is stored in your VCF files.
Depending on where (e.g., locally, Sanger Imputation Server, Michigan Imputation Server, etc.) and with which software (e.g., impute2, PBWT, etc.) you imputed your genetic data, this information might be stored as INFO or R2 (the most frequent names, but it might be something else).
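
One way to find the tag name is to list the INFO fields declared in the VCF header. The following is a minimal sketch, assuming a standard VCF header with `##INFO=<ID=...,...>` lines; the function name `list_info_tags` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read an uncompressed VCF on standard input and print
# the ID of every "##INFO=<ID=...>" header line, stopping at the first
# non-header line (i.e., the first variant record).
list_info_tags() {
  awk '
    /^##INFO=<ID=/ { sub(/^##INFO=<ID=/, ""); sub(/,.*/, ""); print; next }
    !/^#/          { exit }
  '
}
```

For a compressed file: `gzip -dc /media/vcf/chr22.vcf.gz | list_info_tags`. The imputation quality tag should be one of the printed names.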

1.1.3.3 Set the parameters

A template script (i.e., ~/WP3/scripts/utils/RunAnalysis.R) is available to help run the analysis as a bash/shell command.

  1. Set the output directory.
    By default, outputs are generated in the current directory (./).

    working_directory <- "/media/rhapsody_output" 
  2. Set a cohort name and an author name.
    The cohort name is an identifier for a cohort and array. For example, “DESIR_Metabochip” for the results from the Metabochip Array in the DESIR cohort.
    The author name is only used to know who performed the analysis, if there are questions at a later point.

    cohort_name <- "Cohort_Name"
    author_name <- "Firstname LASTNAME"
  3. Set the number of CPUs you are going to use to run the Linear Mixed Model.

    n_cpu <- 2
  4. Set the path to your file opal_credentials.txt. For example, ~/WP3/scripts/opal_credentials.txt, if you modified the default file (it is better to use an absolute path, i.e., without ~/).

    opal_credentials <- "/media/credentials/opal_credentials.txt"
  5. Set the path to your (RAW) VCF files

    vcf_directory <- "/media/vcf"
  6. Set the imputation quality tag for your VCF

    imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
  7. Set the path to VCFtools.
    Default location of the software is usually /usr/local/bin.

    vcftools_binary_path <- "/usr/local/bin"
  8. Define if you want to format/split the VCF files.
    By default, the main script formats and splits the VCF files.

    format_vcfs <- TRUE
  9. Define the step up to which the analysis should run.
    1. Node access
    2. Phenotype QC
    3. Run Mixed Model without variants
    4. Check VCF files
    5. Format VCF files
    6. Run Mixed Model with (imputed) variants
    7. Check output files

    analysis_step <- 7
  10. Define if genetic variants analyses should be performed.

    variants_analysis <- TRUE
  11. Define if genomic components (e.g., principal components) should be used.

    genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX

Template: ~/WP3/scripts/utils/RunAnalysis.R

# set the output directory or leave as is; 
# output will be generated where the Rmarkdown file is
working_directory <- "/media/rhapsody_output" 

cohort_name <- "Cohort_Name"
author_name <- "Firstname LASTNAME"

n_cpu <- 2

opal_credentials <- "/media/credentials/opal_credentials.txt"

vcf_directory <- "/media/vcf"
imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
vcftools_binary_path <- "/usr/local/bin"

format_vcfs <- TRUE
analysis_step <- 7
variants_analysis <- TRUE

genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX



# Run the analysis
dir.create(path = working_directory, showWarnings = FALSE, mode = "0777")
rmarkdown::render(
  input = "/home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd", 
  output_format = "html_document", 
  output_file = paste0(
    "RHAPSODY_WP3_PreDiab_", 
    cohort_name, "_step", 
    analysis_step, ".html"
  ), 
  output_dir = working_directory,
  params = list(
    cohort_name = cohort_name, 
    author_name = author_name, 
    opal_credentials = opal_credentials,
    vcf_input_directory = vcf_directory, 
    imputation_quality_tag = imputation_quality_tag, 
    vcftools_binary_path = vcftools_binary_path,
    output_directory = working_directory,
    analysis_step = analysis_step,
    format_vcfs = format_vcfs,
    variants_analysis = variants_analysis,
    chunk_size = 1000,
    exclude_X = TRUE,
    genomic_component = genomic_component,
    n_cpu = n_cpu,
    echo = FALSE, # Should R code be printed in the report
    warning = FALSE, # Should warnings be printed in the report
    message = FALSE, # Should messages be printed in the report
    debug = FALSE
  ),
  encoding = "UTF-8"
)
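
The name of the report follows from the `paste0()` call in the template, so a wrapper script can compute it to check that a run produced its output. This is a sketch; the function name `report_name` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper mirroring the paste0() call of the template:
# $1 is the cohort name, $2 the analysis step.
report_name() {
  printf 'RHAPSODY_WP3_PreDiab_%s_step%s.html\n' "$1" "$2"
}
```

For example, `[ -f "/media/rhapsody_output/$(report_name Cohort_Name 7)" ]` succeeds once the report for step 7 exists in the output directory.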

1.1.3.4 Start the analysis

1.1.3.4.1 Standard use

Once you have created an R script using the template (i.e., ~/WP3/scripts/utils/RunAnalysis.R), you can start the analysis with a simple command.

Rscript ~/WP3/scripts/utils/RunAnalysis.R

1.1.3.4.2 Non-standard use

Use this procedure if you want to format your VCF files before running the script.

  1. The best way to proceed is to set the parameters as shown below.

    format_vcfs <- FALSE
    analysis_step <- 4
    variants_analysis <- FALSE
  2. Run the script within bash/shell

    Rscript ~/WP3/scripts/utils/RunAnalysis.R
    This will create the following files in the output directory you set up:
    • ./RHAPSODY_WP3_PreDiab_Cohort_Name_step4.html
      The report with the phenotype QC and without the genetic variant analyses.
    • ./vcffilespath.txt
      The list of all available VCF files to format using ./handleVCF_Cohort_Name.sh. The files are also listed in the report in Section 6.2.1 Check available VCF files.

      /media/vcf/chr1.vcf.gz
      /media/vcf/chr2.vcf.gz
      ...
      /media/vcf/chr*.vcf.gz
      ...
      /media/vcf/chr22.vcf.gz
    • ./handleVCF_Cohort_Name.sh
      The command (with the correct parameters) to format the VCF using ~/WP3/scripts/utils/handleVCF.sh. This is also stated in the report in Section 6.2.2.3.1 Manually format VCFs.

      #!/bin/sh
      
      ./handleVCF.sh Cohort_Name /media/vcf INFO /usr/local/bin 
  3. Run the script to format the VCF

    sh ./handleVCF_Cohort_Name.sh
  4. Run the whole analysis with the proper parameters

    format_vcfs <- FALSE # keep this to FALSE
    analysis_step <- 7
    variants_analysis <- TRUE

    Rscript ~/WP3/scripts/utils/RunAnalysis.R
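
Before formatting the VCF files manually, it can be useful to confirm that every file listed in vcffilespath.txt is still on disk. The sketch below assumes one path per line, as in the listing above; the function name `missing_vcfs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every path listed in the given file
# (one path per line) that does not exist as a regular file.
missing_vcfs() {
  while IFS= read -r path; do
    if [ -n "$path" ] && [ ! -f "$path" ]; then
      printf 'missing: %s\n' "$path"
    fi
  done < "$1"
}
```

Calling `missing_vcfs ./vcffilespath.txt` prints nothing when all the listed files exist.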

1.2 Using Docker & Rstudio-server

1.2.1 Install Docker

If Docker is not installed on your cluster/grid/computer/laptop, you can install it following the instructions on docs.docker.com.

1.2.2 Deploy an image within a container

1.2.2.1 Pull the image

You can download the latest Docker image from the GitHub Container Registry (ghcr.io):

docker pull ghcr.io/mcanouil/rhapsody:latest

To pull a particular version (i.e., not the latest), you can add the tag version.
Tags are listed in the repository.

docker pull ghcr.io/mcanouil/rhapsody:1.3.0

1.2.2.2 Run a container

The easiest way to start the Docker container using the ghcr.io/mcanouil/rhapsody image is to run the following Docker command.

docker run \
  --name rhapsody \
  --detach \
  ghcr.io/mcanouil/rhapsody:latest

With those settings, the Docker container will not have access to any data stored on your cluster/grid/computer/laptop.
In order to allow the Docker container to see some data, you have to use the --volume from:to argument.

docker run \
  --name rhapsody \
  --detach \
  --volume /path/to/vcf:/media/vcf \
  ghcr.io/mcanouil/rhapsody:latest

If you want to configure the Docker container more precisely (e.g., CPUs, memory, etc.), you can find all the parameters at the following address: https://docs.docker.com/engine/reference/commandline/run/
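
As an illustration of such settings, the sketch below prints (without executing it) a `docker run` command that limits CPUs and memory and publishes a port. `--cpus`, `--memory` and `--publish` are standard `docker run` options; the values, the function name `print_run_cmd` and the assumption that Rstudio-server listens on port 8787 inside the container are to be adapted to your setup.

```shell
#!/bin/sh
# Dry-run sketch: print a docker run command, do not execute it.
# $1: host directory with the VCF files (assumed to contain no spaces)
# $2: number of CPUs, $3: memory limit (e.g., 8g)
print_run_cmd() (
  vcf_dir=$1 cpus=$2 memory=$3
  printf 'docker run --name rhapsody --detach'
  printf ' --cpus %s --memory %s' "$cpus" "$memory"
  printf ' --publish 8787:8787'          # assumed Rstudio-server port
  printf ' --volume %s:/media/vcf' "$vcf_dir"
  printf ' ghcr.io/mcanouil/rhapsody:latest\n'
)
```

For example, `print_run_cmd /path/to/vcf 2 8g` prints a command that can be reviewed before it is pasted into the terminal.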

1.2.3 Connect to the container

Open a web browser and type http://localhost:8787.
You will see an authentication box with Username and Password:

  • Username: rhapsody
  • Password: wp3

1.2.4 Files & Scripts

1.2.4.1 Retrieve the latest version

To retrieve the latest version of the main script RHAPSODY_WP3_PreDiab.Rmd and the utils/* scripts, you can copy/paste the git command below.
This will pull (i.e., download) the scripts from GitHub.

sudo git -C /home/rhapsody/WP3/scripts pull origin master

Once the download is complete, you should get the following directory tree in your home (`cd ~`).

home
 °--rhapsody
     °--WP3
         °--scripts
             ¦--docker
             ¦   ¦--Dockerfile
             ¦   ¦--Dockerfile_update
             ¦   ¦--login.html
             ¦   ¦--logo.png
             ¦   ¦--markdown_packages.R
             ¦   °--r_packages.R
             ¦--docker_analysis
             ¦   ¦--opal_credentials.txt
             ¦   ¦--rhapsody.R
             ¦   °--rhapsody.sh
             ¦--docs
             ¦   ¦--howto.Rmd
             ¦   ¦--index.html
             ¦   °--RHAPSODY_Logo_WEB_Color.png
             ¦--opal_credentials.txt
             ¦--README.md
             ¦--RHAPSODY_WP3_PreDiab.Rmd
             °--utils
                 ¦--ethnicity_template.csv
                 ¦--ethnicity_template.R
                 ¦--handleVCF.sh
                 ¦--Install_Rpackages.R
                 ¦--RHAPSODY_WP3_PreDiab_DEBUG.R
                 ¦--RHAPSODY_WP3_PreDiab_OPAL_DEBUG.R
                 °--RunAnalysis.R

1.2.4.2 Description

  • /home/rhapsody/WP3/scripts/README.pdf
    a short README file;
  • /home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd
    an Rmarkdown script which performs:
    • OPAL node access,
    • phenotype QC,
    • preliminary modelling,
    • VCF formatting/filtering,
    • variant analysis;
  • /home/rhapsody/WP3/scripts/opal_credentials.txt
    the server, login and password information to access phenotype data on your local node;
  • /home/rhapsody/WP3/scripts/utils/RHAPSODY_WP3_PreDiab_DEBUG.R
    a small part of the main Rmarkdown script, which includes only node access and phenotype QC;
  • /home/rhapsody/WP3/scripts/utils/RunAnalysis.R
    an example of an R script to run the whole analysis in bash (parameters have to be filled in according to your setup)
    using `Rscript ~/WP3/scripts/utils/RunAnalysis.R`;
  • /home/rhapsody/WP3/scripts/utils/handleVCF.sh
    the bash/shell script used to format/filter VCF files prior to the variant analysis.
    This bash/shell script is also included within the main script, but you can use it directly as a standalone script. If you do so (or if you have already done the formatting and do not want to do it again), the parameter format_vcfs must be set to FALSE;
  • /home/rhapsody/WP3/scripts/utils/Install_Rpackages.R
    an R script which installs a predefined version of all R packages loaded in the Rmarkdown script, only if needed (also available within the main script);
  • /home/rhapsody/WP3/scripts/utils/check_analysis.Rmd
    an Rmarkdown script to quickly check if the analyses went well (i.e., compute a quantile-quantile plot, a Manhattan plot and a plot comparing chromosomal position to sequential position);
  • /home/rhapsody/WP3/scripts/utils/README.Rmd
    an Rmarkdown script to create README.pdf;
  • /home/rhapsody/WP3/scripts/utils/howto.Rmd
    a how-to document and its HTML version howto.html.

1.2.5 Run the analysis

1.2.5.1 OPAL database credentials

To avoid hard-coding the OPAL database credentials in the script, you should create a text file with the server, username and password to access your local OPAL database (i.e., RHAPSODY node hosting the phenotype data).
You can also modify the default file /home/rhapsody/WP3/scripts/opal_credentials.txt.

http://localhost:8080
administrator
password

NOTE: The username provided in that file should have “download” rights.
The data are downloaded locally into the R session for the duration of the analysis, then all local data are deleted (i.e., the data do not leave the R session in any way).

1.2.5.2 Imputation quality (VCF)

Please check where the imputation quality is stored in your VCF files.
Depending on where (e.g., locally, Sanger Imputation Server, Michigan Imputation Server, etc.) and with which software (e.g., impute2, PBWT, etc.) you imputed your genetic data, this information might be stored as INFO or R2 (the most frequent names, but it might be something else).

1.2.5.3 Set the parameters

A template script (i.e., /home/rhapsody/WP3/scripts/utils/RunAnalysis.R) is available to help run the analysis as a bash/shell command.

  1. Set the output directory.
    By default, outputs are generated in the current directory (./).

    working_directory <- "/media/rhapsody_output" 
  2. Set a cohort name and an author name.
    The cohort name is an identifier for a cohort and array. For example, “DESIR_Metabochip” for the results from the Metabochip Array in the DESIR cohort.
    The author name is only used to know who performed the analysis, if there are questions at a later point.

    cohort_name <- "Cohort_Name"
    author_name <- "Firstname LASTNAME"
  3. Set the number of CPUs you are going to use to run the Linear Mixed Model.

    n_cpu <- 2
  4. Set the path to your file opal_credentials.txt. For example, ~/WP3/scripts/opal_credentials.txt, if you modified the default file (it is better to use an absolute path, i.e., without ~/).

    opal_credentials <- "/media/credentials/opal_credentials.txt"
  5. Set the path to your (RAW) VCF files

    vcf_directory <- "/media/vcf"
  6. Set the imputation quality tag for your VCF

    imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
  7. Set the path to VCFtools.
    Default location of the software is usually /usr/local/bin.

    vcftools_binary_path <- "/usr/local/bin"
  8. Define if you want to format/split the VCF files.
    By default, the main script formats and splits the VCF files.

    format_vcfs <- TRUE
  9. Define the step up to which the analysis should run.
    1. Node access
    2. Phenotype QC
    3. Run Mixed Model without variants
    4. Check VCF files
    5. Format VCF files
    6. Run Mixed Model with (imputed) variants
    7. Check output files

    analysis_step <- 7
  10. Define if genetic variants analyses should be performed.

    variants_analysis <- TRUE
  11. Define if genomic components (e.g., principal components) should be used.

    genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX
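
When the analysis is driven from a shell wrapper, the value chosen for analysis_step can be validated against the seven steps listed in item 9 before R is started. This is a sketch only; the function name `step_name` is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the name of an analysis step (1 to 7, as listed
# in item 9) or fail with a message on standard error for any other value.
step_name() {
  case $1 in
    1) echo "Node access" ;;
    2) echo "Phenotype QC" ;;
    3) echo "Run Mixed Model without variants" ;;
    4) echo "Check VCF files" ;;
    5) echo "Format VCF files" ;;
    6) echo "Run Mixed Model with (imputed) variants" ;;
    7) echo "Check output files" ;;
    *) echo "invalid analysis_step: $1" >&2; return 1 ;;
  esac
}
```

For example, `step_name 4` prints the name of the step used in the non-standard procedure, and `step_name 8` fails.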

Template: ~/WP3/scripts/utils/RunAnalysis.R

# set the output directory or leave as is; 
# output will be generated where the Rmarkdown file is
working_directory <- "/media/rhapsody_output" 

cohort_name <- "Cohort_Name"
author_name <- "Firstname LASTNAME"

n_cpu <- 2

opal_credentials <- "/media/credentials/opal_credentials.txt"

vcf_directory <- "/media/vcf"
imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
vcftools_binary_path <- "/usr/local/bin"

format_vcfs <- TRUE
analysis_step <- 7
variants_analysis <- TRUE

genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX



# Run the analysis
dir.create(path = working_directory, showWarnings = FALSE, mode = "0777")
rmarkdown::render(
  input = "/home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd", 
  output_format = "html_document", 
  output_file = paste0(
    "RHAPSODY_WP3_PreDiab_", 
    cohort_name, "_step", 
    analysis_step, ".html"
  ), 
  output_dir = working_directory,
  params = list(
    cohort_name = cohort_name, 
    author_name = author_name, 
    opal_credentials = opal_credentials,
    vcf_input_directory = vcf_directory, 
    imputation_quality_tag = imputation_quality_tag, 
    vcftools_binary_path = vcftools_binary_path,
    output_directory = working_directory,
    analysis_step = analysis_step,
    format_vcfs = format_vcfs,
    variants_analysis = variants_analysis,
    chunk_size = 1000,
    exclude_X = TRUE,
    genomic_component = genomic_component,
    n_cpu = n_cpu,
    echo = FALSE, # Should R code be printed in the report
    warning = FALSE, # Should warnings be printed in the report
    message = FALSE, # Should messages be printed in the report
    debug = FALSE
  ),
  encoding = "UTF-8"
)

1.2.5.4 Start the analysis

1.2.5.4.1 Standard use

Once you have created an R script using the template (i.e., /home/rhapsody/WP3/scripts/utils/RunAnalysis.R), you can start the analysis with a simple command.

Rscript /home/rhapsody/WP3/scripts/utils/RunAnalysis.R

1.2.5.4.2 Non-standard use

Use this procedure if you want to format your VCF files before running the script.

  1. The best way to proceed is to set the parameters as shown below.

    format_vcfs <- FALSE
    analysis_step <- 4
    variants_analysis <- FALSE
  2. Run the script within bash/shell

    Rscript /home/rhapsody/WP3/scripts/utils/RunAnalysis.R
    This will create the following files in the output directory you set up:
    • ./RHAPSODY_WP3_PreDiab_Cohort_Name_step4.html
      The report with the phenotype QC and without the genetic variant analyses.
    • ./vcffilespath.txt
      The list of all available VCF files to format using ./handleVCF_Cohort_Name.sh. The files are also listed in the report in Section 6.2.1 Check available VCF files.

      /media/vcf/chr1.vcf.gz
      /media/vcf/chr2.vcf.gz
      ...
      /media/vcf/chr*.vcf.gz
      ...
      /media/vcf/chr22.vcf.gz
    • ./handleVCF_Cohort_Name.sh
      The command (with the correct parameters) to format the VCF using /home/rhapsody/WP3/scripts/utils/handleVCF.sh. This is also stated in the report in Section 6.2.2.3.1 Manually format VCFs.

      #!/bin/sh
      
      ./handleVCF.sh Cohort_Name /media/vcf INFO /usr/local/bin 
  3. Run the script to format the VCF

    sh ./handleVCF_Cohort_Name.sh
  4. Run the whole analysis with the proper parameters

    format_vcfs <- FALSE # keep this to FALSE
    analysis_step <- 7
    variants_analysis <- TRUE

    Rscript /home/rhapsody/WP3/scripts/utils/RunAnalysis.R
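
Because a full run can take a long time, it may be convenient to keep a log of each Rscript call. The following is a minimal sketch; the function name `run_logged` and the log file location are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: run a command, append its standard output and standard
# error to a log file, record the exit status in the log and return it.
# $1: log file, remaining arguments: the command to run.
run_logged() (
  log=$1
  shift
  status=0
  "$@" >> "$log" 2>&1 || status=$?
  printf 'exit status: %s\n' "$status" >> "$log"
  exit "$status"   # leaves the subshell only; becomes the function's status
)
```

For example: `run_logged /media/rhapsody_output/run.log Rscript /home/rhapsody/WP3/scripts/utils/RunAnalysis.R`.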

1.3 Using Docker as an app

1.3.1 Install Docker

If Docker is not installed on your cluster/grid/computer/laptop, you can install it following the instructions on docs.docker.com.

1.3.2 Files & Scripts

1.3.2.1 Retrieve the latest version

To retrieve the latest version of the main script RHAPSODY_WP3_PreDiab.Rmd and the utils/* scripts, you can copy/paste the git command below.
This will pull (i.e., download) the scripts from GitHub.

sudo git -C /home/rhapsody/WP3/scripts pull origin master

Once the download is complete, you should get the following directory tree in your home (`cd ~`).

home
 °--rhapsody
     °--WP3
         °--scripts
             ¦--docker
             ¦   ¦--Dockerfile
             ¦   ¦--Dockerfile_update
             ¦   ¦--login.html
             ¦   ¦--logo.png
             ¦   ¦--markdown_packages.R
             ¦   °--r_packages.R
             ¦--docker_analysis
             ¦   ¦--opal_credentials.txt
             ¦   ¦--rhapsody.R
             ¦   °--rhapsody.sh
             ¦--docs
             ¦   ¦--howto.Rmd
             ¦   ¦--index.html
             ¦   °--RHAPSODY_Logo_WEB_Color.png
             ¦--opal_credentials.txt
             ¦--README.md
             ¦--RHAPSODY_WP3_PreDiab.Rmd
             °--utils
                 ¦--ethnicity_template.csv
                 ¦--ethnicity_template.R
                 ¦--handleVCF.sh
                 ¦--Install_Rpackages.R
                 ¦--RHAPSODY_WP3_PreDiab_DEBUG.R
                 ¦--RHAPSODY_WP3_PreDiab_OPAL_DEBUG.R
                 °--RunAnalysis.R

1.3.2.2 Description

  • /home/rhapsody/WP3/scripts/README.pdf
    a short README file;
  • /home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd
    an Rmarkdown script which performs:
    • OPAL node access,
    • phenotype QC,
    • preliminary modelling,
    • VCF formatting/filtering,
    • variant analysis;
  • /home/rhapsody/WP3/scripts/opal_credentials.txt
    the server, login and password information to access phenotype data on your local node;
  • /home/rhapsody/WP3/scripts/utils/RHAPSODY_WP3_PreDiab_DEBUG.R
    a small part of the main Rmarkdown script, which includes only node access and phenotype QC;
  • /home/rhapsody/WP3/scripts/utils/RunAnalysis.R
    an example of an R script to run the whole analysis in bash (parameters have to be filled in according to your setup)
    using `Rscript ~/WP3/scripts/utils/RunAnalysis.R`;
  • /home/rhapsody/WP3/scripts/utils/handleVCF.sh
    the bash/shell script used to format/filter VCF files prior to the variant analysis.
    This bash/shell script is also included within the main script, but you can use it directly as a standalone script. If you do so (or if you have already done the formatting and do not want to do it again), the parameter format_vcfs must be set to FALSE;
  • /home/rhapsody/WP3/scripts/utils/Install_Rpackages.R
    an R script which installs a predefined version of all R packages loaded in the Rmarkdown script, only if needed (also available within the main script);
  • /home/rhapsody/WP3/scripts/utils/check_analysis.Rmd
    an Rmarkdown script to quickly check if the analyses went well (i.e., compute a quantile-quantile plot, a Manhattan plot and a plot comparing chromosomal position to sequential position);
  • /home/rhapsody/WP3/scripts/utils/README.Rmd
    an Rmarkdown script to create README.pdf;
  • /home/rhapsody/WP3/scripts/utils/howto.Rmd
    a how-to document and its HTML version howto.html.

1.3.3 Run the analysis

1.3.3.1 OPAL database credentials

To avoid hard-coding the OPAL database credentials in the script, you should create a text file with the server, username and password to access your local OPAL database (i.e., RHAPSODY node hosting the phenotype data).
You can also modify the default file /home/rhapsody/WP3/scripts/opal_credentials.txt.

http://localhost:8080
administrator
password

NOTE: The username provided in that file should have “download” rights.
The data are downloaded locally into the R session for the duration of the analysis, then all local data are deleted (i.e., the data do not leave the R session in any way).

1.3.3.2 Imputation quality (VCF)

Please check where the imputation quality is stored in your VCF files.
Depending on where (e.g., locally, Sanger Imputation Server, Michigan Imputation Server, etc.) and with which software (e.g., impute2, PBWT, etc.) you imputed your genetic data, this information might be stored as INFO or R2 (the most frequent names, but it might be something else).

1.3.3.3 Set the parameters

A template script (i.e., /home/rhapsody/WP3/scripts/utils/RunAnalysis.R) is available to help run the analysis as a bash/shell command.

  1. Set the output directory.
    By default, outputs are generated in the current directory (./).

    working_directory <- "/media/rhapsody_output" 
  2. Set a cohort name and an author name.
    The cohort name is an identifier for a cohort and array. For example, “DESIR_Metabochip” for the results from the Metabochip Array in the DESIR cohort.
    The author name is only used to know who performed the analysis, if there are questions at a later point.

    cohort_name <- "Cohort_Name"
    author_name <- "Firstname LASTNAME"
  3. Set the number of CPUs you are going to use to run the Linear Mixed Model.

    n_cpu <- 2
  4. Set the path to your file opal_credentials.txt. For example, /home/rhapsody/WP3/scripts/opal_credentials.txt, if you modified the default file (it is better to use an absolute path, i.e., without ~/).

    opal_credentials <- "/media/credentials/opal_credentials.txt"
  5. Set the path to your (RAW) VCF files

    vcf_directory <- "/media/vcf"
  6. Set the imputation quality tag for your VCF

    imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
  7. Set the path to VCFtools.
    Default location of the software is usually /usr/local/bin.

    vcftools_binary_path <- "/usr/local/bin"
  8. Define if you want to format/split the VCF files.
    By default, the main script formats and splits the VCF files.

    format_vcfs <- TRUE
  9. Define the step up to which the analysis should run.
    1. Node access
    2. Phenotype QC
    3. Run Mixed Model without variants
    4. Check VCF files
    5. Format VCF files
    6. Run Mixed Model with (imputed) variants
    7. Check output files

    analysis_step <- 7
  10. Define if genetic variants analyses should be performed.

    variants_analysis <- TRUE
  11. Define if genomic components (e.g., principal components) should be used.

    genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX

Template: /home/rhapsody/WP3/scripts/utils/RunAnalysis.R

# set the output directory or leave as is; 
# output will be generated where the Rmarkdown file is
working_directory <- "/media/rhapsody_output" 

cohort_name <- "Cohort_Name"
author_name <- "Firstname LASTNAME"

n_cpu <- 2

opal_credentials <- "/media/credentials/opal_credentials.txt"

vcf_directory <- "/media/vcf"
imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
vcftools_binary_path <- "/usr/local/bin"

format_vcfs <- TRUE
analysis_step <- 7
variants_analysis <- TRUE

genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX



# Run the analysis
dir.create(path = working_directory, showWarnings = FALSE, mode = "0777")
rmarkdown::render(
  input = "/home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd", 
  output_format = "html_document", 
  output_file = paste0(
    "RHAPSODY_WP3_PreDiab_", 
    cohort_name, "_step", 
    analysis_step, ".html"
  ), 
  output_dir = working_directory,
  params = list(
    cohort_name = cohort_name, 
    author_name = author_name, 
    opal_credentials = opal_credentials,
    vcf_input_directory = vcf_directory, 
    imputation_quality_tag = imputation_quality_tag, 
    vcftools_binary_path = vcftools_binary_path,
    output_directory = working_directory,
    analysis_step = analysis_step,
    format_vcfs = format_vcfs,
    variants_analysis = variants_analysis,
    chunk_size = 1000,
    exclude_X = TRUE,
    genomic_component = genomic_component,
    n_cpu = n_cpu,
    echo = FALSE, # Should R code be printed in the report
    warning = FALSE, # Should warnings be printed in the report
    message = FALSE, # Should messages be printed in the report
    debug = FALSE
  ),
  encoding = "UTF-8"
)
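
The template above builds the report file name from `cohort_name` and `analysis_step` and writes it to `working_directory`. As a small sketch, the same name can be rebuilt in the shell to locate the report once the run has finished. `report_path` is a hypothetical helper, and the directory passed to it is assumed to be the output directory as seen from where you run the command (e.g., the host directory mounted on /media/rhapsody_output).

```shell
#!/bin/sh
# Hypothetical helper (not part of the RHAPSODY scripts).
# Mirrors the paste0() call of the template:
#   "RHAPSODY_WP3_PreDiab_" + cohort_name + "_step" + analysis_step + ".html"
report_path() {
  out_dir="$1"
  cohort="$2"
  step="$3"
  # ${out_dir%/} drops one trailing slash, if any, to avoid "//" in the result.
  printf '%s/RHAPSODY_WP3_PreDiab_%s_step%s.html\n' "${out_dir%/}" "$cohort" "$step"
}
```

For instance, `ls -l "$(report_path /path/to/rhapsody_output Cohort_Name 7)"` checks that the report of a complete run exists.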

1.3.3.4 Start the analysis

1.3.3.4.1 Standard use

Once you have created an R script from the template (i.e., /home/rhapsody/WP3/scripts/utils/RunAnalysis.R), you can start the analysis with a single command.

docker run \
  --name rhapsody \
  --detach \
  --volume /path/to/vcf/:/media/vcf \
  --volume /path/to/opal_credentials:/media/credentials \
  --volume /path/to/RunAnalysis:/media/RunAnalysis \
  --volume /path/to/rhapsody_output:/media/rhapsody_output \
  --rm \
  ghcr.io/mcanouil/rhapsody:latest Rscript /media/RunAnalysis/RunAnalysis.R

In that configuration, the script /home/rhapsody/WP3/scripts/utils/RunAnalysis.R will be as follows:

# set the output directory or leave as is; 
# output will be generated where the Rmarkdown file is
working_directory <- "/media/rhapsody_output" 

cohort_name <- "Cohort_Name"
author_name <- "Firstname LASTNAME"

n_cpu <- 2

opal_credentials <- "/media/credentials/opal_credentials.txt"

vcf_directory <- "/media/vcf"
imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
vcftools_binary_path <- "/usr/local/bin"

format_vcfs <- TRUE
analysis_step <- 7
variants_analysis <- TRUE

genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX



# Run the analysis
dir.create(path = working_directory, showWarnings = FALSE, mode = "0777")
rmarkdown::render(
  input = "/home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd", 
  output_format = "html_document", 
  output_file = paste0(
    "RHAPSODY_WP3_PreDiab_", 
    cohort_name, "_step", 
    analysis_step, ".html"
  ), 
  output_dir = working_directory,
  params = list(
    cohort_name = cohort_name, 
    author_name = author_name, 
    opal_credentials = opal_credentials,
    vcf_input_directory = vcf_directory, 
    imputation_quality_tag = imputation_quality_tag, 
    vcftools_binary_path = vcftools_binary_path,
    output_directory = working_directory,
    analysis_step = analysis_step,
    format_vcfs = format_vcfs,
    variants_analysis = variants_analysis,
    chunk_size = 1000,
    exclude_X = TRUE,
    genomic_component = genomic_component,
    n_cpu = n_cpu,
    echo = FALSE, # Should R code be printed in the report
    warning = FALSE, # Should warnings be printed in the report
    message = FALSE, # Should messages be printed in the report
    debug = FALSE
  ),
  encoding = "UTF-8"
)
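
Before starting the container with the docker run command above, it may be worth checking that the four host paths passed to `--volume` exist and contain the expected files, since Docker typically creates a missing host path as an empty directory. The sketch below is an assumption-laden illustration: `check_mounts` is a hypothetical helper, and the file names `opal_credentials.txt` and `RunAnalysis.R` are taken from the paths used in this section.

```shell
#!/bin/sh
# Hypothetical helper (not part of the RHAPSODY scripts).
# Arguments: host directories for VCFs, credentials, RunAnalysis script, output.
check_mounts() {
  vcf_dir="$1"
  cred_dir="$2"
  run_dir="$3"
  out_dir="$4"
  status=0
  # Every directory given to --volume should already exist on the host.
  for dir in "$vcf_dir" "$cred_dir" "$run_dir" "$out_dir"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      status=1
    fi
  done
  # Files the analysis reads from the mounted directories.
  if [ ! -f "$cred_dir/opal_credentials.txt" ]; then
    echo "missing file: $cred_dir/opal_credentials.txt" >&2
    status=1
  fi
  if [ ! -f "$run_dir/RunAnalysis.R" ]; then
    echo "missing file: $run_dir/RunAnalysis.R" >&2
    status=1
  fi
  return "$status"
}
```

A possible use is `check_mounts /path/to/vcf /path/to/opal_credentials /path/to/RunAnalysis /path/to/rhapsody_output && echo "ready"`, run just before the docker run command.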

1.3.3.4.2 Non-standard use

Follow this procedure if you want to format your VCF files yourself before running the full analysis.

  1. The best way to proceed is to set the parameters as shown below.

    format_vcfs <- FALSE
    analysis_step <- 4
    variants_analysis <- FALSE
  2. Run the script from a shell (e.g., bash)

    docker run \
      --name rhapsody \
      --detach \
      --volume /path/to/vcf/:/media/vcf \
      --volume /path/to/opal_credentials:/media/credentials \
      --volume /path/to/RunAnalysis:/media/RunAnalysis \
      --volume /path/to/rhapsody_output:/media/rhapsody_output \
      --rm \
      ghcr.io/mcanouil/rhapsody:latest Rscript /media/RunAnalysis/RunAnalysis.R

    In that configuration, the script /home/rhapsody/WP3/scripts/utils/RunAnalysis.R will be as follows:

    # set the output directory or leave as is; 
    # output will be generated where the Rmarkdown file is
    working_directory <- "/media/rhapsody_output" 
    
    cohort_name <- "Cohort_Name"
    author_name <- "Firstname LASTNAME"
    
    n_cpu <- 2
    
    opal_credentials <- "/media/credentials/opal_credentials.txt"
    
    vcf_directory <- "/media/vcf"
    imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
    vcftools_binary_path <- "/usr/local/bin"
    
    format_vcfs <- FALSE
    analysis_step <- 4
    variants_analysis <- FALSE
    
    genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX
    
    
    
    # Run the analysis
    dir.create(path = working_directory, showWarnings = FALSE, mode = "0777")
    rmarkdown::render(
      input = "/home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd", 
      output_format = "html_document", 
      output_file = paste0(
        "RHAPSODY_WP3_PreDiab_", 
        cohort_name, "_step", 
        analysis_step, ".html"
      ), 
      output_dir = working_directory,
      params = list(
        cohort_name = cohort_name, 
        author_name = author_name, 
        opal_credentials = opal_credentials,
        vcf_input_directory = vcf_directory, 
        imputation_quality_tag = imputation_quality_tag, 
        vcftools_binary_path = vcftools_binary_path,
        output_directory = working_directory,
        analysis_step = analysis_step,
        format_vcfs = format_vcfs,
        variants_analysis = variants_analysis,
        chunk_size = 1000,
        exclude_X = TRUE,
        genomic_component = genomic_component,
        n_cpu = n_cpu,
        echo = FALSE, # Should R code be printed in the report
        warning = FALSE, # Should warnings be printed in the report
        message = FALSE, # Should messages be printed in the report
        debug = FALSE
      ),
      encoding = "UTF-8"
    )
    This will create the following files in the output directory you set up:
    • /media/rhapsody_output/RHAPSODY_WP3_PreDiab_Cohort_Name_step4.html
      The report with the phenotype QC and without the genetic variant analyses.
    • /media/rhapsody_output/vcffilespath.txt
      The list of all available VCF files to format using ./handleVCF_Cohort_Name.sh. The files are also listed in the report in Section 6.2.1 Check available VCF files.

      /media/vcf/chr1.vcf.gz
      /media/vcf/chr2.vcf.gz
      ...
      /media/vcf/chr*.vcf.gz
      ...
      /media/vcf/chr22.vcf.gz
    • /media/rhapsody_output/handleVCF_Cohort_Name.sh
      The command (with the correct parameters) to format the VCF using ~/WP3/scripts/utils/handleVCF.sh. This is also stated in the report in Section 6.2.2.3.1 Manually format VCFs.

      #!/bin/sh
      
      ./handleVCF.sh Cohort_Name /media/vcf INFO /usr/local/bin 
  3. Run the script to format the VCF

    docker run \
      --name rhapsody \
      --detach \
      --volume /path/to/vcf/:/media/vcf \
      --volume /path/to/opal_credentials:/media/credentials \
      --volume /path/to/RunAnalysis:/media/RunAnalysis \
      --volume /path/to/rhapsody_output:/media/rhapsody_output \
      --rm \
      ghcr.io/mcanouil/rhapsody:latest sh /media/rhapsody_output/handleVCF_Cohort_Name.sh
  4. Run the whole analysis with the proper parameters

    format_vcfs <- FALSE # keep this set to FALSE
    analysis_step <- 7
    variants_analysis <- TRUE

    docker run \
      --name rhapsody \
      --detach \
      --volume /path/to/vcf/:/media/vcf \
      --volume /path/to/opal_credentials:/media/credentials \
      --volume /path/to/RunAnalysis:/media/RunAnalysis \
      --volume /path/to/rhapsody_output:/media/rhapsody_output \
      --rm \
      ghcr.io/mcanouil/rhapsody:latest Rscript /media/RunAnalysis/RunAnalysis.R

    In that configuration, the script ~/WP3/scripts/utils/RunAnalysis.R will be as follows:

    # set the output directory or leave as is; 
    # output will be generated where the Rmarkdown file is
    working_directory <- "/media/rhapsody_output" 
    
    cohort_name <- "Cohort_Name"
    author_name <- "Firstname LASTNAME"
    
    n_cpu <- 2
    
    opal_credentials <- "/media/credentials/opal_credentials.txt"
    
    vcf_directory <- "/media/vcf"
    imputation_quality_tag <- "INFO" # To be set according to VCF (could also be "R2")
    vcftools_binary_path <- "/usr/local/bin"
    
    format_vcfs <- FALSE # keep this set to FALSE
    analysis_step <- 7
    variants_analysis <- TRUE
    
    genomic_component <- NULL # A csv file with SUBJID as first column and PC01 to PCXX
    
    
    
    # Run the analysis
    dir.create(path = working_directory, showWarnings = FALSE, mode = "0777")
    rmarkdown::render(
      input = "/home/rhapsody/WP3/scripts/RHAPSODY_WP3_PreDiab.Rmd", 
      output_format = "html_document", 
      output_file = paste0(
        "RHAPSODY_WP3_PreDiab_", 
        cohort_name, "_step", 
        analysis_step, ".html"
      ), 
      output_dir = working_directory,
      params = list(
        cohort_name = cohort_name, 
        author_name = author_name, 
        opal_credentials = opal_credentials,
        vcf_input_directory = vcf_directory, 
        imputation_quality_tag = imputation_quality_tag, 
        vcftools_binary_path = vcftools_binary_path,
        output_directory = working_directory,
        analysis_step = analysis_step,
        format_vcfs = format_vcfs,
        variants_analysis = variants_analysis,
        chunk_size = 1000,
        exclude_X = TRUE,
        genomic_component = genomic_component,
        n_cpu = n_cpu,
        echo = FALSE, # Should R code be printed in the report
        warning = FALSE, # Should warnings be printed in the report
        message = FALSE, # Should messages be printed in the report
        debug = FALSE
      ),
      encoding = "UTF-8"
    )
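
Steps 1 and 4 above differ only in the three assignments `format_vcfs`, `analysis_step` and `variants_analysis`. Rather than editing RunAnalysis.R by hand between the two runs, the change can be scripted. The following is a minimal sketch, assuming the three parameters are assigned with `<-` at the start of a line as in the template; `set_analysis_params` is a hypothetical helper and not part of the RHAPSODY scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of the RHAPSODY scripts).
# Usage: set_analysis_params <script> <format_vcfs> <analysis_step> <variants_analysis>
set_analysis_params() {
  script="$1"
  format="$2"
  step="$3"
  variants="$4"
  tmp_file="$(mktemp)" || return 1
  # Only lines of the form "name <- value" are rewritten; the "name = name"
  # lines inside rmarkdown::render() are left untouched.
  # Any trailing comment on the rewritten lines is dropped.
  sed \
    -e "s/^\([[:space:]]*format_vcfs\) <- .*/\1 <- $format/" \
    -e "s/^\([[:space:]]*analysis_step\) <- .*/\1 <- $step/" \
    -e "s/^\([[:space:]]*variants_analysis\) <- .*/\1 <- $variants/" \
    "$script" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
  # Write back in place so the permissions of the original file are kept.
  cat "$tmp_file" > "$script" && rm -f "$tmp_file"
}
```

With this helper, `set_analysis_params /path/to/RunAnalysis/RunAnalysis.R FALSE 4 FALSE` prepares step 1, and `set_analysis_params /path/to/RunAnalysis/RunAnalysis.R FALSE 7 TRUE` prepares step 4.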