Overview
Version: main
Last update: 2025-06-23
This workshop demonstrates a typical workflow of a SimpleVM user. Your goal will be to identify pathogenic bacteria that the World Health Organisation (WHO) classified as the “greatest threat to human health” in 2017.
You will search for those microbes in publicly available metagenomic datasets that are stored in the Sequence Read Archive (SRA). In metagenomics, microbial genetic material is extracted from environmental samples like human gut, soil, freshwater or biogas plants in order to investigate the functions and interactions of the microbial community.
In order to find those microbes, you will have to interact with the de.NBI Cloud via SimpleVM. This workshop is divided into several parts.
In the first part you will learn the basic concept of virtual machines and how to configure them.
Since you are new to SimpleVM, it is resource-saving to start with a VM that has a few cores and a small amount of RAM.
You start this tutorial from your profile page.
If you do not have a de.NBI Cloud account, please register for one via this link. You can read more about the registration process in our de.NBI Cloud wiki. Please make sure to click on “continue” if this button shows up.
If you successfully registered for a de.NBI Cloud account, you should be able to log in to the de.NBI Cloud Portal.
Hands On: Select the SimpleVMIntro23 project
- Click on the New Instance tab. If you are already a member of a SimpleVM project, you are offered a drop-down menu to select a project. In this case, please select the SimpleVMIntro23 project. If this is your first SimpleVM project, you can now select/generate a key (next point) or directly start a VM.
- If you have no SSH key set so far, just click on Generate Key and save the private key. During this workshop you will not need this file because you will access all VMs via the browser. However, for your future work with SimpleVM, we highly recommend reading our de.NBI Cloud wiki regarding SSH keys.
Hands On: Start a VM
- Choose a name for your VM.
- Select the de.NBI default flavor.
- In the image section, please click on the Research Environments tab and select the TheiaIDE-ubuntu22.04 image.
- Select the Conda tab and choose the following tools with their version numbers given below for installation via Conda:
- ncbi-genome-download (0.3.3)
- mash (2.2)
- csvtk (0.28.0)
- entrez-direct (16.2)
- jq (1.6)
- parallel (20230922)
The filter in the name column can be used to search for the packages (see the sketch after this list for a rough manual equivalent). You will learn in the next sections how to apply these tools.
- Select a URL path for Theia. You will access Theia via this URL.
- Grant access to all project members with a Cloud-portal-support tag. This way these members get SSH access to your VM and can help you in case something does not work as expected.
- Confirm the checkboxes and click on Start.
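For reference, the wizard's Conda installation is roughly equivalent to creating the environment manually. A minimal sketch, assuming conda with the conda-forge and bioconda channels (the environment name denbi matches the one used later in this tutorial):

# Not part of the workshop: a rough manual equivalent of the wizard's Conda setup
conda create -n denbi -c conda-forge -c bioconda \
  ncbi-genome-download=0.3.3 mash=2.2 csvtk=0.28.0 \
  entrez-direct=16.2 jq=1.6 parallel=20230922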
In the second part you will test whether SimpleVM correctly provisioned your VM and installed all your tools.
After the start of the machine has been triggered, some time may pass before the machine is available. As soon as it is, a green icon will indicate this.
Once the VM is available, you can use it for testing the tools and inspecting the data before you scale up your analysis in the next section.
Log in to the VM and verify that SimpleVM provisioned the VM correctly.
Hands On: Verify VM properties and tools
- Click on the Instances tab (Overview -> Instances). After you have initiated the start-up of the machine, you should have been automatically redirected there. Now open the “How to connect” dropdown of your machine. Click on the Theia IDE URL, which opens a new browser tab.
- Click on Terminal in the upper menu and select New Terminal.
- Inspect the VM before starting to work with it. Let's check whether the VM has the properties that SimpleVM promised you by typing the following commands in your newly opened terminal window.
- nproc tells you the number of processing units:
nproc
Does that correspond to the actual number of cores of the flavor you selected?
- free -h tells you the amount of RAM that is available to your VM. You will see that the total amount of memory (total column, Mem row) corresponds roughly to the RAM size of your selected flavor:
free -h
- You can also check what kind of processes are running on your VM by executing top or htop:
htop
Exit htop by typing q or F10.
- You can use the tools you selected in the previous part by running:
conda activate denbi
- Test if the needed commands are installed by running all of them with the -h parameter. You will get an explanation of their usage in the next chapter.
ncbi-genome-download -h
mash -h
csvtk -h
jq -h
If an error is reported, then something went wrong, and we have to either repeat the conda installation manually or install the tool in a different way.
- Remember that you have root permissions on the VM. You can install any tool that you need for your research. Let's test this statement by first fetching the latest information about available packages and then installing two commands (fortune-mod, cowsay) via sudo.
Update the package information:
sudo apt update
Install the commands:
sudo apt install -y fortune-mod cowsay
You can run both commands via:
/usr/games/fortune | /usr/games/cowsay
In the previous parts you have tested the SimpleVM environment. Now it is time to use a VM with more cores to scale up the analysis. For this reason you have either saved your installed tools by creating a snapshot or, if you are starting with this section, a snapshot has been prepared for you. You will now reuse one of these snapshots with a larger flavor. Further, we will also search for more metagenomic datasets via object storage and scale up our analysis by providing more cores to mash.
Hands On: Create a new VM based on snapshot
- Click on Overview -> Snapshots in the left menu and check which status your snapshot has. You can also filter by name in the top menu. If it has the status active, you can navigate to the New Instance tab (and select the SimpleVMIntro23 project).
- Provide again a name for your instance.
- In the flavor section, please choose the de.NBI large flavor, which has 28 cores.
- Click on the Snapshot tab to select the snapshot SimpleVMIntro23.
- Please create a volume for your VM and enter your name without whitespace (example: Max Mustermann -> MaxMustermann) as the volume name. Enter data (/vol/data) as the mount path and provide 1 GB as the storage size.
- Grant again access to all project members with a Cloud-portal-support tag. This way these members get SSH access to your VM and can help you in case something does not work as expected.
- Confirm the checkboxes and click on Start. While the VM is starting, please fill out our user survey.
Hands On
- You are now on the Instance Overview page. You can delete your old VM, which we used to create your snapshot. To do this, open the action selection of the old machine again by clicking on Show Actions and select Delete VM. Confirm the deletion of the machine.
- On your new VM, please click on How to connect. You should again see a link. Please click on the link to open Theia IDE in a new browser tab.
- Click on Terminal in the upper menu and select New Terminal.
- Activate the conda environment by running:
conda activate denbi
- Unfortunately, conda does not offer a MinIO client (mc) binary, which means that we have to install it manually. Download the binary:
wget https://dl.min.io/client/mc/release/linux-amd64/mc
Move it to a folder where other binaries are usually stored:
sudo mv mc /usr/local/bin/
Change the file permissions:
chmod a+x /usr/local/bin/mc
- Add the S3 config for our public SRA mirror on our Bielefeld Cloud site:
mc config host add sra https://openstack.cebitec.uni-bielefeld.de:8080 "" ""
- List which files are available for the SRA run SRR3984908:
mc ls sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR398/008/SRR3984908
- Check the size of these files:
mc du sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR398/008/SRR3984908
- You can read the first lines of these files by using mc cat:
mc cat sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR398/008/SRR3984908/SRR3984908_1.fastq.gz | zcat | head
- Search for the SRA run accessions we want to analyse and check the sum of their sizes (this may take a while to complete):
mc find --regex "SRR6439511.*|SRR6439513.*|ERR3277263.*|ERR929737.*|ERR929724.*" sra/ftp.era.ebi.ac.uk/vol1/fastq -exec "mc ls -r --json {}" \
  | jq -s 'map(.size) | add' \
  | numfmt --to=iec-i --suffix=B --padding=7
Explanation
- mc find reports all files whose names start with one of the following prefixes: SRR6439511., SRR6439513., ERR3277263., ERR929737., ERR929724..
- jq uses the JSON that is produced by mc find and sums up the size of all files (.size field).
- numfmt transforms the sum into a human-readable string.
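As a side note, if you ever prefer a local copy over streaming, mc also provides a cp subcommand (a sketch using the file listed above):

# Hypothetical: download one read file to the current directory
mc cp sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR398/008/SRR3984908/SRR3984908_1.fastq.gz .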
Download the genome sketch and the list of read datasets:
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/genomes.msh
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/reads.tsv
You can inspect the file by using cat:
cat reads.tsv
Create an output directory and define a function that streams one dataset into mash screen:
mkdir -p output
search(){
  # first column of the input line: path of the left read file
  left_read=$(echo $1 | cut -d ' ' -f 1);
  # second column: path of the right read file
  right_read=$(echo $1 | cut -d ' ' -f 2);
  # extract the SRA run accession from the file name
  sra_id=$(echo ${left_read} | rev | cut -d '/' -f 1 | rev | cut -d '_' -f 1 | cut -d '.' -f 1);
  # stream both files into mash screen and prefix every match with the accession
  mc cat $left_read $right_read | mash screen -p 3 genomes.msh - \
    | sed "s/^/${sra_id}\t/g" \
    | sed 's/\//\t/' > output/${sra_id}.txt ;
}
Explanation
In order to understand what this function does, let's take the following dataset as an example:
sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR643/001/SRR6439511/SRR6439511_1.fastq.gz sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR643/001/SRR6439511/SRR6439511_2.fastq.gz
where
- left_read is the left file (sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR643/001/SRR6439511/SRR6439511_1.fastq.gz)
- right_read is the right file (sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR643/001/SRR6439511/SRR6439511_2.fastq.gz)
- sra_id is the prefix of the file name (SRR6439511)
- mc cat streams the files into mash screen, which uses the sketched genomes (genomes.msh) to screen the datasets.
- Both seds just post-process the output, and every match is placed in the output folder.
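As a quick sanity check, you can run the sra_id extraction from the function on the example path by itself (a small demonstration, not part of the workflow):

# Hypothetical check of the accession extraction
path="sra/ftp.era.ebi.ac.uk/vol1/fastq/SRR643/001/SRR6439511/SRR6439511_1.fastq.gz"
echo ${path} | rev | cut -d '/' -f 1 | rev | cut -d '_' -f 1 | cut -d '.' -f 1
# prints: SRR6439511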
Export this function, so that we can use it in the next command.
export -f search
Run your analysis in parallel.
parallel -a reads.tsv search
where
- reads.tsv is the list of datasets that we want to scan.
- search is the function that we want to call.
Optional: This command will run for a few minutes. You could open a second terminal
and inspect the CPU utilization with htop.
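By default, GNU parallel starts one job per CPU core. If you want to cap the number of concurrent jobs explicitly, the -j flag does this (a sketch; 14 is an arbitrary example value, not a workshop requirement):

# Run at most 14 search jobs concurrently (hypothetical cap)
parallel -j 14 -a reads.tsv search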

Concatenate all results into one file:
cat output/*.txt > output.tsv
Plot a histogram of the shared k-mer hashes (third column):
csvtk -t plot hist -H -f 3 output.tsv -o output.pdf
You can open this file by clicking on the Explorer View and selecting the PDF.

Look up the scientific name and study title for every dataset with a match:
for sraid in $(ls -1 output/ | cut -f 1 -d '.'); do
esearch -db sra -query ${sraid} \
| esummary \
| xtract -pattern DocumentSummary -element @ScientificName,Title \
| sort | uniq \
| sed "s/^/${sraid}\t/g";
done > publications.tsv
Explanation
- for sraid in $(ls -1 output/ | cut -f 1 -d '.'); iterates over all datasets found in the output directory.
- esearch looks up the scientific name and title of the SRA study.
- sed adds the SRA ID to the output table. The first column is the SRA ID, the second column is the scientific name and the third column is the study title.
- All results are stored in the publications.tsv file.
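If you want to try the lookup for a single accession by hand, the pipeline from the loop works on its own (using one of the run accessions from this tutorial):

esearch -db sra -query SRR6439511 | esummary \
  | xtract -pattern DocumentSummary -element @ScientificName,Title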
Make the volume writable for the ubuntu user and copy the results to it:
sudo chown ubuntu:ubuntu /vol/data/
cp publications.tsv output.tsv /vol/data
Go to the Instance Overview page. Click on actions and detach the volume.

We now want to start a new VM. This time we would like to use RStudio in order to inspect and visualize our results.
Start a new VM. This time select again the de.NBI default flavor since we do not need that many resources anymore.
In the image tab please select Rstudio (RStudio-ubuntu22.04).
In the volume tab please choose the volume you created
in the previous part of the workshop.
Please use again /vol/data as the mount path. Click on Add + to add the volume.

Grant again access to all project members with a Cloud-portal-support tag.
This way these members get ssh access to your VM and can help you in case
something does not work as expected.

Confirm all checkboxes and click on Start. Since it takes some time until the VM is started, please complete the last part of the Unix tutorial in the meantime.
Again, it will take a while to start the machine. On the instance overview, select How to connect for the newly started VM
and click on the URL. A tab should be opened up in your browser.
Username: ubuntu
Password: simplevm
In RStudio please open a Terminal first by either selecting the Terminal tab, or by clicking on
Tools -> Terminal -> New Terminal.
Download the R Markdown notebook in the terminal:
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/analyse.Rmd
Furthermore, you have to install the necessary R libraries. Please switch back
to the R console.

Install the following libraries:
install.packages(c("ggplot2","RColorBrewer","rmarkdown"))
You can now open the analyse.Rmd R notebook via File -> Open File.
Then select Run -> Restart R and Run All Chunks.

Finally, you may want to publish your results once you are done with your research project. You could provide your data and tools via your snapshot and volumes to a reviewer, who could reproduce your results. Alternatively, you can also provide the Rmarkdown document together with the input data to reproduce the last part of the analysis and the visualization.
You can share your research results via Zenodo, Figshare and other providers who will generate a citable, stable Digital Object Identifier (DOI) for your results. re3data provides an overview of research data repositories that are suitable for your domain-specific research results.
In this part of the tutorial you will scale your analysis horizontally by using a SimpleVM Cluster. A SimpleVM Cluster consists of a master node and multiple worker nodes, all running the SLURM workload manager. SLURM allows you to submit scripts, so-called jobs, that are queued up; once free resources (CPUs, RAM) are available on one of the worker nodes, the script is executed on that node. This way you don't have to look up which nodes are free in order to run your jobs. In the following you will configure a cluster and submit your tools to the SLURM job scheduler.
Click on “New Cluster” in the left menu. If you cannot see the “New Cluster” item, reload the page.
The same snapshot will also be used for all worker nodes. The worker nodes will run the actual tools, so we need a flavor with more cores than the one
the master node is using. Therefore, please select de.NBI large as the flavor and start
two worker nodes by providing 2 as the worker count.

Click on the Clusters tab (Overview -> Clusters). After you have initiated the start-up of the cluster, you should have been automatically redirected there. Now open the “How to connect” dropdown of your machine. Click on the Theia IDE URL, which opens a new browser tab.
Click on Terminal in the upper menu and select New Terminal.

Check how many nodes are part of your cluster by using sinfo
sinfo
which will produce the following example output:
Code Out
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*       up   infinite      2   idle  bibigrid-worker-1-1-us6t5hdtlklq7h9,bibigrid-worker-1-2-us6t5hdtlklq7h9
The important columns here are STATE, which tells you whether the worker nodes are processing jobs
or are idle, and NODELIST, which lists the nodes of the cluster.
/vol/spool is the folder which is shared between all nodes. You should always submit jobs
from that directory.
cd /vol/spool
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/basic.sh
The script contains the following content:
#!/bin/bash
#Do not do anything for 30 seconds
sleep 30
#Print out the name of the machine where the job was executed
hostname
where
- sleep 30 delays the process for 30 seconds.
- hostname reports the name of the worker node.
You can now submit the job to the SLURM scheduler by using sbatch and, directly after that,
check whether SLURM is executing your script with squeue.
sbatch:
sbatch basic.sh
squeue:
squeue
which will produce the following example output:
Code Out
JOBID  PARTITION  NAME      USER    ST  TIME  NODES  NODELIST(REASON)
212    debug      basic.sh  ubuntu  R   0:03  1      bibigrid-worker-1-1-us6t5hdtlklq7h9
squeue tells you the state of your jobs and which nodes are actually executing them.
In this example you should see that bibigrid-worker-1-1-us6t5hdtlklq7h9 is running your job
with the name basic.sh (the ST column shows R for running).
SLURM writes the standard output of your job to a file named after the job ID (e.g. slurm-212.out),
which will contain the name of the worker node that executed your script.
Open the file with the following command:
cat slurm-*.out
Code Out: Example output
bibigrid-worker-1-1-us6t5hdtlklq7h9
One way to distribute jobs is to use so-called array jobs. With array jobs you specify how many times
your script should be executed. Every execution is assigned a number between 1 and the total number of
executions, and this number is stored in the variable SLURM_ARRAY_TASK_ID. If you specify --array=1-100,
then your script is executed 100 times and the SLURM_ARRAY_TASK_ID variable will get a value between
1 and 100. SLURM will distribute the jobs across your cluster.
Please fetch the modified script:
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/basic_array.sh
The script simply reads out the SLURM_ARRAY_TASK_ID variable and creates a correspondingly named file in an
output directory:
Code In
#!/bin/bash
# Create the output directory in case it was not created so far
mkdir -p output_array
# Do not do anything for 10 seconds
sleep 10
# Create a file named after the SLURM_ARRAY_TASK_ID
touch output_array/${SLURM_ARRAY_TASK_ID}
You can execute this script 100 times with the following command:
sbatch --array=1-100 basic_array.sh
If you now check the output_array folder, you should see the numbers from 1 to 100.
ls output_array
You can now reuse the search function from the third part of this tutorial and
submit an array job with the number of datasets we want to scan. Remember, the search function
searches for a list of genomes in a list of metagenomic datasets. Please download the updated script by using wget:
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/search.sh
This is the content of the script:
#!/bin/bash
# Create an output directory (-p avoids an error when several array tasks create it concurrently)
mkdir -p output_final
# Use the conda environment you installed in your snapshot and activate it
eval "$(conda shell.bash hook)"
conda activate denbi
# Add the S3 SRA OpenStack config
/vol/spool/mc config host add sra https://openstack.cebitec.uni-bielefeld.de:8080 "" ""
# Define the search function you have already used in part 3
search(){
  left_read=$(echo $1 | cut -d ' ' -f 1);
  right_read=$(echo $1 | cut -d ' ' -f 2);
  sra_id=$(echo ${left_read} | rev | cut -d '/' -f 1 | rev | cut -d '_' -f 1 | cut -d '.' -f 1);
  /vol/spool/mc cat $left_read $right_read | mash screen -p 3 genomes.msh - \
    | sed "s/^/${sra_id}\t/g" \
    | sed 's/\//\t/' > output_final/${sra_id}.txt ;
}
# Create a variable for the array task id
LINE_NUMBER=${SLURM_ARRAY_TASK_ID}
# Pick the corresponding line from reads2.tsv
LINE=$(sed "${LINE_NUMBER}q;d" reads2.tsv)
# Search for the datasets
search ${LINE}
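The sed "${LINE_NUMBER}q;d" idiom prints exactly line number LINE_NUMBER of the file and quits, so each array task picks its own dataset. A quick hypothetical check (once reads2.tsv is downloaded):

# Print only the third line of reads2.tsv, then quit
sed "3q;d" reads2.tsv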
You also need the list of fastq datasets (reads2.tsv) and a file containing a sketch of the genomes.
Fastq datasets:
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/reads2.tsv
Sketch:
wget https://openstack.cebitec.uni-bielefeld.de:8080/simplevm-workshop/genomes.msh
You have to download mc again since it was not saved as part of the snapshot:
wget https://dl.min.io/client/mc/release/linux-amd64/mc
Please set executable rights:
chmod a+x mc
Now submit one array task per dataset:
sbatch --array=1-386 search.sh
You could now check the state of your jobs by using squeue.
Please note that the job execution might take a few hours. The VM will be available even after the workshop.
If you are interested in the results, you could plot them later.
Concatenate all results into one file:
cat output_final/*.txt > output_final.tsv
Activate the conda environment:
conda activate denbi
Run csvtk on the output:
csvtk -t plot hist -H -f 3 output_final.tsv -o output_final.pdf
You can open this file by clicking on the Explorer View and selecting the PDF.

Since there are many matches with a low number of k-mer hashes, you could filter the table first and plot the filtered results.
sort -rnk 3,3 output_final.tsv | head -n 50 > output_final_top50.tsv
csvtk -t plot hist -H -f 3 output_final_top50.tsv -o output_final_top50.pdf
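Alternatively, the same top-50 filter can be expressed with csvtk itself (a sketch; assumes csvtk's sort subcommand with a numeric, reverse sort key):

# -H: no header, -t: tab-separated; 3:nr = column 3, numeric, reversed
csvtk -t -H sort -k 3:nr output_final.tsv | head -n 50 > output_final_top50.tsv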
You can inspect the top matches with less and check their description on the
SRA website by providing the SRA run accession
(example: ERR4181696) for further investigation.
less output_final_top50.tsv
Contributions
Author(s): Peter Belmann, Nils Hoffmann, dweinholz