1. Using a Google Cloud virtual machine for
Prosit peptide MS/MS and retention time prediction
Tobias Kind
UC Davis Genome Center
2019
2. General steps and URLs
1) VM Setup
2) Utility installation
3) Prosit installation
4) Benchmarks
…estimated time for setup: ca. 20 min
• Prosit
https://github.com/kusterlab/prosit
• Prosit data files
https://figshare.com/projects/Prosit/35582
• Google Cloud Console
https://console.cloud.google.com
• Nvidia GPU Cloud Image
https://console.cloud.google.com/marketplace/details/nvidia-ngc-public/nvidia_gpu_cloud_image
8. GPU recommendations for Prosit
Tesla P100: $941.77 per month - effective hourly rate $1.29
Tesla V100: $1,462.99 per month - effective hourly rate $2.004
Source: https://www.xcelerit.com/computing-benchmarks/insights/benchmarks-deep-learning-nvidia-p100-vs-v100-gpu/
During the prediction phase Prosit does not use the GPU heavily;
the cheapest GPU option is therefore recommended.
For training new models the faster GPU is recommended.
9. CPU and memory recommendations for Prosit
MEM rule of thumb: a 100k-line CSV input file requires 100 GByte RAM
CPU rule of thumb: 8 CPU cores minimum, 16 is better
During the prediction phase the initial calculation is single threaded;
later stages use multiple threads. Benchmark if needed.
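The memory rule of thumb above can be checked against a concrete input with a short shell sketch (the 100k-lines-per-100-GByte ratio is taken from this slide; peptidelist.csv is a placeholder file name):

```shell
# Estimate server RAM from input size, assuming the slide's rule of thumb
# that every 100k CSV lines need roughly 100 GByte RAM.
FILE=peptidelist.csv
[ -f "$FILE" ] || seq 100000 > "$FILE"   # demo data if no real input is present
LINES=$(wc -l < "$FILE")
RAM=$(( (LINES + 99999) / 100000 * 100 ))
echo "$FILE: $LINES lines, ~$RAM GByte RAM recommended"
```

If the estimate exceeds the VM's memory, split the input CSV before submitting it.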
16. htop is for monitoring CPU use
nvidia-smi is for monitoring GPU use
mc (Midnight Commander) is for file based processes
They should be run in independent windows
Utility installations
17. Go back to console and open another ssh window then type: htop
18. Go back to console and open another ssh window then type: nvidia-smi -l
If the GPU shows up, we have successfully deployed the engine with CUDA 10.1,
TensorFlow, and all other related dependencies
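A non-interactive variant of this GPU check, using standard nvidia-smi query flags (run on the VM itself):

```shell
# Query GPU name, driver and memory in CSV form; fall back with a hint
# if no NVIDIA driver/GPU is visible.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv 2>/dev/null \
  || echo "no NVIDIA GPU visible - check image and driver installation"
```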
19. Install mc (Midnight Commander)
1) on console type: sudo apt install mc
2) on console type: sudo mc (files copied will have root permission)
20. Add user to the docker group
See: https://cloud.google.com/container-registry/docs/troubleshooting
1) run the following command in console: sudo usermod -a -G docker ${USER}
2) restart VM!
3) Check if docker is executable without sudo: docker ps
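A quick sketch to verify the group change took effect (group membership only applies to new login sessions, hence the restart in step 2):

```shell
# Check docker group membership and sudo-less docker access.
id -nG | grep -qw docker && echo "user is in docker group" \
                         || echo "not in docker group yet - log out/in or restart"
docker ps >/dev/null 2>&1 && echo "docker works without sudo" \
                          || echo "docker not usable yet"
```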
STOP
…
START
22. Installation of Prosit from GitHub
1) Go to https://github.com/kusterlab/prosit
2) Copy: https://github.com/kusterlab/prosit.git
3) In google cloud type: git clone https://github.com/kusterlab/prosit.git
4) Check if Prosit is installed: ls -l
5) Type: cd prosit && ls -l && pwd
23. Download of trained model files from Figshare
1) Go to https://figshare.com/projects/Prosit/35582
and download the RT and MS/MS prediction model files
a) RT models https://figshare.com/articles/Prosit_-_Model_-_iRT/6965801
b) MS/MS models: https://figshare.com/articles/Prosit_-_Model_-_Fragmentation/6965753
It is recommended to process and pack these locally and then move
a ZIP file to the cloud VM, or follow the next slide to extract directly on the VM.
Make sure the config.yml and model.yml files plus the weight .hdf5 files are present.
24. Load and install trained models for MS/MS from Figshare
On cloud command line type:
1) cd prosit
2) ls -l
3) mkdir prosit-msms
4) cd prosit-msms
5) wget https://ndownloader.figshare.com/files/13687205 -O msms.zip
6) unzip msms.zip
7) cp prosit1/* .
8) ls -l
9) Check the three files config.yml, model.yml and weight_32_0.10211.hdf5
10) rm msms.zip
25. Load and install trained models for retention times (RT) from Figshare
On cloud command line type:
We need to go back to the prosit main directory, so type: cd .. && pwd && ls -l
1) mkdir prosit-iRT
2) cd prosit-iRT/
3) wget https://ndownloader.figshare.com/files/13698893 -O iRT.zip
4) unzip iRT.zip
5) cp model_irt_prediction/* .
6) ls -l
7) Check the three files config.yml, model.yml and weight_66_0.00796.hdf5
8) If the model file is named model.yaml, rename it with: mv model.yaml model.yml
9) rm iRT.zip
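After both downloads, a small sketch can verify that each model directory contains the files Prosit expects and fix the model.yaml naming issue (directory names as created on the previous slides; run from the prosit main directory):

```shell
# Check both model directories for config.yml, model.yml and a weights .hdf5;
# rename model.yaml to model.yml where needed.
for d in prosit-msms prosit-iRT; do
  [ -d "$d" ] || { echo "$d: directory missing"; continue; }
  [ -f "$d/model.yaml" ] && mv "$d/model.yaml" "$d/model.yml"
  for f in config.yml model.yml; do
    [ -f "$d/$f" ] || echo "$d: missing $f"
  done
  ls "$d"/*.hdf5 >/dev/null 2>&1 || echo "$d: missing weights .hdf5"
done
```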
26. Run the docker Prosit installation
1) Change back to the prosit main directory (cd ..)
2) on command line type: make build
The process pulls the Docker image from the Docker repository and
can take up to 5 minutes
27. Run the prosit example
Template
make server MODEL_SPECTRA=/home/user/prosit/prosit-msms/ MODEL_IRT=/home/user/prosit/prosit-iRT/
In my case, with user name tkind, this becomes:
make server MODEL_SPECTRA=/home/tkind/prosit/prosit-msms/ MODEL_IRT=/home/tkind/prosit/prosit-iRT/
1) the make command requires the model files as absolute paths
2) check your username and current directory (pwd)
3) replace the user with your username on your VM and save the
command for future use
4) save the command for further use in run-prosit.sh using the nano editor:
on the command line type nano run-prosit.sh, then copy/paste your command
5) change the file mode to executable: chmod +x run-prosit.sh
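Steps 4) and 5) can also be done in one go with a here-document instead of nano (the paths are the tkind example from above; substitute your own username):

```shell
# Write the launcher script and make it executable.
cat > run-prosit.sh <<'EOF'
#!/bin/sh
make server MODEL_SPECTRA=/home/tkind/prosit/prosit-msms/ \
            MODEL_IRT=/home/tkind/prosit/prosit-iRT/
EOF
chmod +x run-prosit.sh
```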
28. Start the Prosit server
1) use the newly created file: ./run-prosit.sh
2) or run the following command with your replaced user name and file location
make server MODEL_SPECTRA=/home/tkind/prosit/prosit-msms/ MODEL_IRT=/home/tkind/prosit/prosit-iRT/
3) Open another ssh terminal and execute jobs from there. The server window above
stays open as a logging window and for viewing potential errors
29. Run the Prosit example
1) in new ssh terminal type: cat README.md
2) then copy/paste:
curl -F "peptides=@examples/peptidelist.csv" http://127.0.0.1:5000/predict/generic
3) The server should receive the data and inform about the progress
4) The output window should list all the data
5) Prosit is ready to roll!
30. WARNING
Cloud VMs can accumulate costs very fast; it is imperative to stop the VM
in the cloud console when not in use. For the small P100 instance, costs of
around $1.50 per hour ($36 per day), including traffic, are incurred.
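Besides the cloud console, the VM can also be stopped from any machine with the gcloud CLI installed. A sketch (instance name and zone are placeholders; the echo makes this a dry run, remove it to actually execute):

```shell
# Stop a Compute Engine instance; CPU/GPU billing stops, but disk storage
# still accrues a small charge until the instance is deleted.
INSTANCE=my-prosit-vm      # placeholder: your VM name
ZONE=us-central1-a         # placeholder: your VM zone
echo gcloud compute instances stop "$INSTANCE" --zone="$ZONE"
```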
32. General use and preparation of digested FASTA peptide data
using Encyclopedia
1) Download Encyclopedia (current version 0.9.0): encyclopedia-0.9.0-executable.jar
2) Link: https://bitbucket.org/searleb/encyclopedia/downloads/
3) Download FASTA files from Uniprot via: https://www.uniprot.org/proteomes/
4) Convert >> Create Prosit CSV from FASTA
33. Upload and download of own data
1) via the ssh google upload button
2) via the terminal upload download
3) other ways (Cloud storage, FileZilla)
Recommendation:
All data should be compressed
via zip and unzip
Link: https://www.cloudbooklet.com/6-ways-to-transfer-files-in-google-cloud-platform/
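A sketch of the compress-then-transfer recommendation, here with gzip (the slide's zip/unzip works the same way; the file, instance name and zone are placeholders, and the gcloud line is echoed as a dry run):

```shell
# Compress the prediction output before moving it off the VM.
[ -f prosit-100k-out.csv ] || seq 1000 > prosit-100k-out.csv  # demo data
gzip -kf prosit-100k-out.csv          # keeps the original, writes .csv.gz
# Run this on your LOCAL machine to fetch the file (remove echo to execute):
echo gcloud compute scp my-prosit-vm:~/prosit-100k-out.csv.gz . --zone=us-central1-a
```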
34. Benchmark of 100k CSV data set
(100,000 digested peptides)
curl -F "peptides=@examples/prosit-100k.csv" http://127.0.0.1:5000/predict/generic > "prosit-100k-out.csv"
Tesla P100 instance (8 cores CPU): 10 min
Tesla V100 instance (16 cores CPU): 14 min
Results are inconclusive, probably due to single-thread CPU speed: the V100 instance used
a lower-clocked Xeon @ 2.20 GHz. Very little time is spent on the GPU for prediction.
35. File size and compute considerations
• A FASTA input file containing 40,000 proteins will be around 150 KByte in size.
• The tryptic digest file with one voltage (z=2,3) will have 4 million lines and will be around 100 MByte in size.
• The Prosit prediction file will have 130 million lines and will be 18 GByte in size.
• Around 100 GByte of main memory is needed on the Prosit server, or the input file has to be split
• 100k tryptic digests are processed in 10 min; 1 million tryptic digests can be processed in 100 min
• 1 million tryptic digests will cost $2.25 on the cloud; easy to deploy, easy to scale
• A new local Linux PC with a fast CPU, an 8 GByte GPU and 100 GByte RAM costs ~ $2,200