Compare commits


971 commits

SHA1 Message Date
0bb81f6566 Update README.md
making some of my repos public
2023-11-27 15:12:37 +00:00
32cd353e12 Update README.md
making some aspects of my repos public
2023-11-27 15:11:46 +00:00
1c616eeb71 bugfixes 2023-02-27 20:02:10 +00:00
a405cd8035 add consurf-only scripts 2023-02-25 17:08:22 +00:00
9bc18169cf tweaks to thesis figs 2023-02-24 22:38:37 +00:00
e9d841d989 table generator 2023-02-23 21:04:58 +00:00
c2d6eb49ea figure fixup 2023-02-21 21:01:00 +00:00
c1441dc0d6 import thesis plots from Misc 2023-02-21 20:41:59 +00:00
1571f430a5 fix HARDCODED HOMEDIRS 2023-02-19 18:01:03 +00:00
db7e2912e1 SNP -> SAV for thesis plots 2023-02-19 17:43:22 +00:00
777d3765cf a bunch of small changes 2023-02-19 17:22:41 +00:00
4640f9024a remove empty LF files 2022-10-11 14:19:39 +01:00
ca73bc9b48 lf_bp2: extend y 2022-10-11 12:04:55 +01:00
a74f11ce79 lf_bp2: extend y 2022-10-11 12:03:45 +01:00
39fe2da8be lf_bp2: extend y 2022-10-11 12:01:46 +01:00
763ffa55b8 lf_bp2: extend y 2022-10-11 11:56:41 +01:00
5966c4e21b lf_bp2: extend y 2022-10-11 11:53:17 +01:00
37e970dea3 lf_bp2: extend y axis 2022-10-11 11:51:36 +01:00
25cfe39c71 saving 2022-09-14 17:11:51 +01:00
4f01695391 site_snp_count_bp error handler 2022-09-09 19:37:38 +01:00
0a4be280c3 various bugfixes 2022-09-07 20:03:58 +01:00
590cec5e99 added pnca plots 2022-09-06 21:36:40 +01:00
ade1739753 saving 2022-09-06 21:36:40 +01:00
23bb087ea3 plot function fixes for position tiles 2022-09-06 17:06:01 +01:00
07826fbc91 alr: remove 113 2022-09-05 19:32:36 +01:00
bb022470d9 stuff 2022-09-05 19:27:17 +01:00
4a34d7a94d alr config 2022-09-05 19:03:49 +01:00
f1f0d3e62e alr config 2022-09-05 19:00:09 +01:00
e9fb582db9 msa indexes 2022-09-05 18:49:40 +01:00
1241ad0b22 various usability tweaks to LogoPlotMSA and position_annotation 2022-09-05 16:06:10 +01:00
f949592dd8 added consurf_colours_no_isd to ensure consurf plots are not messed up in the absence of the 0 category 2022-09-05 16:01:21 +01:00
2cec743ae0 added pnca plot dir to generate plots that weren't covered in the paper 2022-09-05 14:02:04 +01:00
1dacebbaf6 renamed files for lineage_diff_sensitivites.R 2022-09-05 13:19:06 +01:00
69a0da0a59 added script for lineage diff sensitivities 2022-09-04 21:36:49 +01:00
4963f18f1d added 113 to alr aa 2022-09-04 21:36:14 +01:00
58c25e23c0 added ml_iterator.py 2022-09-03 14:42:13 +01:00
4976e9d8af saving 2022-09-03 14:41:22 +01:00
2b953583e2 added combined model FS code and run script 2022-09-03 12:28:36 +01:00
78704dec5a saving 2022-09-03 12:28:21 +01:00
889bea1e63 ran ml_iterator for actual genes 2022-09-03 09:23:37 +01:00
d77389acfc running combined model with FS 2022-09-03 09:22:06 +01:00
c7351970a2 chmod 2022-09-02 21:14:02 +01:00
f9ce90e3f4 saving 2022-09-02 10:08:58 +01:00
93e958ae6a now running for combined gene actual 2022-09-02 10:04:27 +01:00
00ca7a6b27 tweaks 2022-09-02 09:52:11 +01:00
c845d96102 added combined_model_iterator.py that has oversampling 2022-09-02 09:50:51 +01:00
338dd329e9 running ml_iterator_CV for all targets with different CV thresholds 2022-09-01 16:29:32 +01:00
80b4a1850c tested to make sure the cols added don't break the code 2022-09-01 16:23:57 +01:00
de9e6a709b added 3 cols for snp counts for ML 2022-09-01 16:11:23 +01:00
2bf91681b4 preparing to rerun ML iterator 2022-09-01 13:41:02 +01:00
56b71c6ca2 added avg affinity and stability cols with mask for avg affinity 2022-09-01 13:04:37 +01:00
e03ce277b7 checked masked cols after running 2022-09-01 12:57:38 +01:00
f9129b9ebc added nca dist criteria for masking 2022-09-01 12:55:38 +01:00
f94eadf1d4 added 2022-09-01 12:54:41 +01:00
82e2da4f3b ml df stuff 2022-09-01 11:39:11 +01:00
c2b46286d8 added rpob ks test script 2022-08-31 22:03:39 +01:00
bc9d1a7149 added embb plotting scripts 2022-08-31 22:03:07 +01:00
a5d22540e1 renamed the previous version of count_vars_ML as such 2022-08-31 22:02:46 +01:00
14e655eeeb gid correction 2022-08-30 12:11:43 +01:00
317b97bc9c stuff 2022-08-30 12:10:56 +01:00
d7f348318c removed template files from rpob 2022-08-29 23:28:53 +01:00
f39bbdcce7 added katg and rpob files 2022-08-29 23:27:37 +01:00
7c2e4b898e added rpob plot scripts 2022-08-29 23:27:13 +01:00
8f97ab7cc8 rpob 2022-08-29 18:23:47 +01:00
6441be21ab added katg tables 2022-08-28 22:31:29 +01:00
d5da923a74 stuff 2022-08-28 13:17:43 +01:00
7bed6a1e22 added all scripts 2022-08-27 23:12:12 +01:00
da8a1069a8 saving work 2022-08-27 23:12:12 +01:00
c3067b9176 gg pairs fixup for alr 2022-08-27 17:15:40 +01:00
5b89beb2b5 readded removed positions in alr 2022-08-27 17:07:08 +01:00
927ab850a8 added DCS to alr config 2022-08-27 16:36:49 +01:00
c6f5a446c3 stuff 2022-08-27 15:35:48 +01:00
79b251047d allow position_annotation to specify colours 2022-08-27 15:35:48 +01:00
2cbc460f87 added output tables with active site 2022-08-26 21:50:33 +01:00
f290d8ec9e added appendix tables and frequent muts 2022-08-25 22:21:05 +01:00
741aad3dd1 added nsSNPs to lineage dist plots 2022-08-25 22:17:41 +01:00
ca619bc662 fix directory assumption 2022-08-25 18:02:25 +01:00
afa9166ca8 things 2022-08-25 17:58:04 +01:00
cd76a4b919 things 2022-08-25 17:58:04 +01:00
3b77b4b611 stuff 2022-08-25 17:58:04 +01:00
e4e8bd7278 added colors for tiles for gid plots and appendix tables 2022-08-25 17:50:49 +01:00
ac72634b48 added dir for embb for consistency and checks and moved others to version1 2022-08-25 10:19:25 +01:00
19b820e316 corrected the pe colour mapping 2022-08-24 20:55:36 +01:00
11b936f09b added more scripts 2022-08-24 20:04:29 +01:00
9aed99e805 various 2022-08-23 21:54:16 +01:00
0284122ef2 added replace bfactor for na 2022-08-23 16:31:17 +01:00
23b4f06017 added scripts 2022-08-23 16:30:50 +01:00
dd69da01f6 get_plotting_dfs 2022-08-23 15:35:47 +01:00
ee70845939 stuff 2022-08-23 15:21:15 +01:00
a2e7e6c26b plots 2022-08-22 23:11:42 +01:00
3d817fde0c more plot files 2022-08-22 22:57:56 +01:00
04253b961f lots of per-plot configs 2022-08-22 21:56:13 +01:00
13999a477d fixed source to contain plotting cols and pos_count correctly 2022-08-22 14:33:06 +01:00
4147a6b90f a massive waste of time 2022-08-22 13:05:53 +01:00
8d6c148fff renamed 2 to _v2 2022-08-22 11:43:13 +01:00
802d6f8495 renamed 2 to _v2 2022-08-22 11:41:42 +01:00
c9d7ea9fad AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 2022-08-17 18:42:26 +01:00
cd9b1ad245 stability and cons revised bp out 2022-08-16 14:45:31 +01:00
f244741e83 added Lin_count.R 2022-08-15 21:29:32 +01:00
2e29cf8011 config files 2022-08-15 20:39:59 +01:00
0cc7a8fcae config: add tile stuff for all targets. Other functions: many rewrites! 2022-08-15 20:39:59 +01:00
a3e5283a9b generated ggpairs plots finally 2022-08-15 19:05:22 +01:00
b68841b337 saving work 2022-08-14 22:57:23 +01:00
7c40e13771 added v2 of barplot layout 2022-08-14 22:56:08 +01:00
da8f8d90d4 removed setDT and replaced with dplyr alt in position_count_bp.R 2022-08-14 14:19:09 +01:00
65d697d3a2 saving work 2022-08-14 12:35:46 +01:00
939528024a turn off thing 2022-08-14 12:35:09 +01:00
fcbf87705f gg_pairs_all: output to a png in /tmp 2022-08-14 12:20:29 +01:00
2acea43bcf added maf column in appendix_tables 2022-08-14 12:18:29 +01:00
6f354ab390 oops! 2022-08-14 12:17:42 +01:00
c09d7530c9 added ORandSNP writing results 2022-08-13 21:19:40 +01:00
4609757efb added appendix tables script 2022-08-13 21:19:19 +01:00
d984d283c5 generated embb lineage plots 2022-08-13 21:18:01 +01:00
c8f3ddf892 colour changes 2022-08-13 15:49:24 +01:00
c6a720770d some stuff 2022-08-13 15:47:41 +01:00
365c322953 added FDR-corrected p-values for ks stats 2022-08-13 14:54:51 +01:00
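
For context, the pattern this commit describes is standard in R: run one two-sample KS test per stability parameter, then adjust the family of p-values for multiple testing. A minimal sketch with hypothetical data layout and column names (the repo's real ones may differ):

```r
# One KS test per parameter: drug-associated (DM) vs other (OM) mutations.
run_ks <- function(d) {
  ks.test(d$value[d$group == "DM"], d$value[d$group == "OM"])$p.value
}
pvals <- sapply(split(stats_df, stats_df$param), run_ks)
p_fdr <- p.adjust(pvals, method = "fdr")  # "fdr" is the Benjamini-Hochberg method
```
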
f5f1e388c3 modified ks test to output all stats needed in one script 2022-08-12 14:35:03 +01:00
22be845e1f added ggpairs code 2022-08-12 10:04:35 +01:00
318941e91b older file renamed for prominent effects 2022-08-11 21:09:48 +01:00
d6766d9d9c added plotting_colnames.R in scripts/plotting 2022-08-11 21:08:54 +01:00
1b08080078 added prominent_effects.R 2022-08-11 21:07:55 +01:00
b32398d16f output corr plots with log10 labels 2022-08-11 21:03:27 +01:00
3d3e74306c added log10 to corr plot labels 2022-08-11 21:01:16 +01:00
b302daaa60 rearranged corr plot cols and also added example for ggpairs 2022-08-11 20:56:34 +01:00
fdb3f00503 added dist_mutation_to_na2.pl script from UQ to calculate dist to na 2022-08-11 10:09:33 +01:00
e714034678 commented out test code 2022-08-10 20:03:40 +01:00
842efe4409 created prominent effect calculations 2022-08-10 20:02:43 +01:00
3af11ec3d3 wideP_consurf3 2022-08-10 14:08:43 +01:00
0bcbb44ae5 added old script to redundant 2022-08-10 14:08:23 +01:00
ccc7dd7bf2 added csv for all colnames 2022-08-10 14:08:23 +01:00
2f7558a883 added colnames to plot as names 2022-08-10 14:08:23 +01:00
4315adc556 wideP_consurf3 2022-08-10 14:07:11 +01:00
f6a3b7f066 added lineage plots in one 2022-08-10 12:58:58 +01:00
285b28b1d6 various plots 2022-08-10 11:06:13 +01:00
a6d93b3fa8 starting corr plots 2022-08-09 21:55:24 +01:00
cd86fcf8e8 added separate scripts for layout for convenience 2022-08-09 21:47:24 +01:00
d78b072732 comment out run 2022-08-09 20:00:42 +01:00
5ef9eb8826 saving 2022-08-09 19:58:32 +01:00
faf9ab9707 various 2022-08-09 19:56:47 +01:00
af03fc6fd6 various 2022-08-09 19:56:17 +01:00
415d05ab6e various stuff 2022-08-09 19:42:29 +01:00
5cbaef3d36 various heat-bar/position tile faff 2022-08-09 19:42:29 +01:00
94454d6fba added combined lineage plot 2022-08-09 19:34:07 +01:00
fe292e3717 fix: add NULL for aa_pos_lig[123] for configs that don't have them 2022-08-09 13:51:05 +01:00
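
The fix above is plain defensive defaulting: per the message, configs for targets without these ligand positions now define them as NULL, so shared plotting code can test for presence instead of erroring on an undefined variable. Roughly (the is.null() branch is an assumed usage, not quoted from the repo):

```r
# e.g. in a target's config.R that defines no ligand-binding positions:
aa_pos_lig1 <- NULL
aa_pos_lig2 <- NULL
aa_pos_lig3 <- NULL

# downstream plotting code can then branch safely:
if (!is.null(aa_pos_lig1)) message("annotating ligand positions")
```
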
79d83b6240 various changes 2022-08-09 13:46:05 +01:00
4ce2087224 various changes 2022-08-08 16:48:53 +01:00
5bdfd03443 refactored dm om plots and generated the final layout 2022-08-08 16:46:27 +01:00
28510471f0 consurf tile fixes 2022-08-08 15:37:28 +01:00
0234a8f77b more plot modifications dm and om plots mainly 2022-08-08 15:32:16 +01:00
4e6f10d1ba added outcome col to dm_om data 2022-08-07 11:21:25 +01:00
54b3dd9d42 various refactoring 2022-08-06 18:49:00 +01:00
c968089cd2 refactoring logo plots to add flame bar 2022-08-06 18:48:12 +01:00
f0a9eb4eec added complete obs count for corr plots and added placeholder for mcsm-na 2022-08-06 14:31:18 +01:00
f194b5ea4f added corr plot TODO for lineage counts 2022-08-06 13:30:41 +01:00
0e777108a9 saving and starting to write 2022-08-06 13:09:26 +01:00
569e372476 generated corr plots with MAF and provean 2022-08-06 13:08:42 +01:00
1a513913ce reflected that the factor thing has been added as a new addition 2022-08-05 20:00:04 +01:00
fe9c3f8afe added dm_om_plots.R 2022-08-05 19:59:07 +01:00
ae8bc8ae85 added as.factor in unpaired stats 2022-08-05 19:58:31 +01:00
9c955cddc0 still playing 2022-08-05 18:49:26 +01:00
92fba43dc1 mods 2022-08-05 18:32:23 +01:00
57433607bc lf_bp2: add a "monochrome" option 2022-08-05 18:28:10 +01:00
d5bc1c272e lf_bp2 2022-08-05 18:13:44 +01:00
33925dafe9 things 2022-08-05 16:13:57 +01:00
6cb9998c4c added the script to redundant 2022-08-05 14:45:33 +01:00
164113f665 moved the script version of dm_om_data.R to redundant; the function version has been trimmed for readability 2022-08-05 14:44:36 +01:00
cc27ffe82f added the rev script that I used to develop the trimmed version 2022-08-05 14:36:58 +01:00
05ab89ec09 trimmed down the dm_om_data.R 2022-08-05 14:36:02 +01:00
fae846395d fix 2022-08-05 12:52:51 +01:00
31148dfbdc lf_bp fuckage 2022-08-05 12:46:28 +01:00
c0f59bc9c9 breakage 2022-08-05 12:45:16 +01:00
14f8f5d6d4 generated lineage barplots and corr plots for conservation 2022-08-04 19:48:09 +01:00
424c1d184d added lineage barplots 2022-08-04 19:29:20 +01:00
7f9facc1e6 moved old corr files 2022-08-04 19:28:39 +01:00
dab8294a01 renamed folder 2022-08-04 18:52:03 +01:00
ad2e538ec2 separated plotting_thesis for generating plots 2022-08-04 18:47:18 +01:00
95131abc3c consurf fixes 2022-08-04 16:48:53 +01:00
bf41d01b39 multiple script updates and bug fixes 2022-08-04 16:48:31 +01:00
599cd7493f buggy bugs that bug me 2022-08-04 15:18:23 +01:00
e1b8e103ea saving from my panino, made lineage dist plots 2022-08-04 13:47:24 +01:00
1efb534f0f fix some derps 2022-08-04 13:46:52 +01:00
61c7a30835 minor tidy and label size adjustments for simple barplots 2022-08-04 10:26:18 +01:00
bcba0c359e moved code for structure figure to sep dir 2022-08-03 21:33:19 +01:00
aabe466599 added plots for thesis 2022-08-03 21:32:55 +01:00
41c4996426 more of the previous thing 2022-08-03 20:51:32 +01:00
bdbc97c40a fix many plot functions to stop them using the "g=ggplot()" pattern,
which annoyingly throws away lots of useful data that RShiny needs for
clickable plots. Also split the "flame bar" for ligand distance out into
separate functions in generate_distance_colour_map.R. This can now be
easily incorporated into any "wide" graph showing all positions.
2022-08-03 18:58:27 +01:00
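
This is the most descriptive entry in the log, and the pattern it fixes deserves a sketch. Building a plot as a throwaway `g = ggplot()` inside a function and only printing it discards the object that Shiny's interactivity helpers rely on; returning the ggplot object keeps the data and aesthetic mappings attached. A minimal illustration, with invented function and column names:

```r
library(ggplot2)

# Return the ggplot object itself rather than printing a throwaway copy;
# the object carries the data and mappings that clickable plots need.
position_count_plot <- function(plot_df) {
  ggplot(plot_df, aes(x = position, y = snp_count)) +
    geom_point()
}

# In a Shiny server the returned object enables click handling, e.g.:
#   output$p <- renderPlot(position_count_plot(plot_df))
#   observeEvent(input$p_click, print(nearPoints(plot_df, input$p_click)))
```
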
e498d46f8b consurf plot: add debug option. logoP SNP: distance flame tiles 2022-08-03 18:58:27 +01:00
f2709f3992 added replace b factor scripts for lig affinity and ppi2 2022-08-02 20:30:43 +01:00
87a3d7acf2 generated simple affinity plots for embb 2022-08-02 20:29:31 +01:00
214e9232c6 still trying affinity phew 2022-08-02 16:55:31 +01:00
d94aa10c9b saving work 2022-08-01 21:44:11 +01:00
66337c289c added scripts to generate mean stability for rpob 2022-08-01 21:41:55 +01:00
ccc877e811 attempting affinity stuff 2022-08-01 21:41:02 +01:00
0d8979dfcb separated cols 2022-08-01 14:09:46 +01:00
e750ee59aa saving 2022-08-01 13:41:43 +01:00
79c261963b sorting the ensemble and priority for ligand affinity 2022-08-01 13:36:05 +01:00
f3710bfaf5 added mcsm_mean_affinity_ensemble.R and replaceBfactor_pdb_stability.R 2022-07-31 19:25:28 +01:00
1bf66b145c separating mcsm_mean_stability_ensemble from combined script 2022-07-31 19:24:35 +01:00
06e5363112 added script mcsm_mean_stability_ensemble.R to get ensemble of averages across predictors for stability and affinity 2022-07-31 16:29:47 +01:00
4de1e800ec moved mut_landscape files to plotting 2022-07-31 12:01:56 +01:00
d6a63eed21 added mut_landscape.R that outputs the mutational positions and annotations for generating the structural landscape of all gene-targets 2022-07-31 12:01:56 +01:00
26f284d76e added test_func_combined.py 2022-07-29 00:14:39 +01:00
9cd6613da6 added cm_logo_skf_v2.py 2022-07-29 00:13:54 +01:00
e55906d2c7 saving work 2022-07-29 00:12:43 +01:00
1695e90b42 running none_complete with diff cv thresholds 2022-07-28 15:33:43 +01:00
85f59155fa uncommented all models for a full run 2022-07-28 15:28:31 +01:00
37f5199c5c added the cm_ml_iterator_TODO.py for later 2022-07-28 15:25:31 +01:00
e32308d984 moved old logs 2022-07-28 15:24:49 +01:00
fd0ccc9032 added genes_ml_logs 2022-07-28 15:23:49 +01:00
cd68e60f09 updated running_scripts with added rt and none and none_bts 2022-07-28 15:22:44 +01:00
d3d5698a3e added MultClfs_CVs.py and ml_iterator_CVs.py 2022-07-28 15:19:58 +01:00
b87f8d0295 trying diff cv thresholds for single gene 2022-07-28 15:19:13 +01:00
8d8a61675f various edits 2022-07-28 13:20:14 +01:00
90b9477520 ml_iterator tweaks 2022-07-28 13:19:58 +01:00
584b866f0b moved ../ml_functions/MultClfs_logo_skf.py to del 2022-07-28 12:25:29 +01:00
2c50124b1b moved logo_skf function to del as using the MultClfs for combined data 2022-07-28 12:24:24 +01:00
a6532ddfa3 just running for pnca 2022-07-27 17:11:30 +01:00
744bc8f4a1 added dummy classifier to models 2022-07-27 17:10:04 +01:00
c32005c99c adding other split types on ml_iterator 2022-07-27 15:58:15 +01:00
63c8876764 checking the splitTTS script to make sure other splits have been factored in 2022-07-27 15:52:20 +01:00
f4cab1fdfb fixed masking condition for ML training data for genes and wrote revised mask files out 2022-07-27 13:36:16 +01:00
0adf69f75a added model names sklearn 2022-07-16 15:35:42 +01:00
39e72b2dfb added all_estimators 2022-07-16 13:32:19 +01:00
a590354f15 saving work from panino 2022-07-12 10:14:06 +01:00
33e3b5a0a6 various bugs 2022-07-12 10:14:06 +01:00
6950c4b057 added reverse training as split type in SplitTTS.py 2022-07-11 20:03:06 +01:00
1965517681 added other split_type options, i.e. none and none with bts 2022-07-11 19:27:14 +01:00
ce730fbe57 uncommented 2022-07-10 13:29:01 +01:00
350be30f19 minor changes to run combined_model 2022-07-10 13:23:13 +01:00
057c98c2f1 updated rs 2022-07-10 13:04:51 +01:00
e37442efd2 updated ml_iterator function args 2022-07-10 13:03:14 +01:00
9594d0a328 uncommented models for logo 2022-07-10 12:53:51 +01:00
4d5b848471 iterator 2022-07-10 12:47:22 +01:00
01ff9d5be6 added instructions to run individual genes 2022-07-10 12:45:45 +01:00
3b4cfecc9f added target_count_numbers.py 2022-07-10 12:43:00 +01:00
de5c1270be added Mult_clfs_logo and Mult_clsf.py with consistency 2022-07-10 12:32:52 +01:00
06f2ce97b6 minor var name fix for MultClfs.py and ml_iterator 2022-07-09 11:02:37 +01:00
ef52fd7a94 saving before running 2022-07-09 10:59:26 +01:00
e07fa3bc05 rerunning ml_iterator.py 2022-07-09 10:58:14 +01:00
8bde6f0640 minor var name update in ml_iterator 2022-07-09 10:52:50 +01:00
8079dd7b6c reran to generate merged_df3 with correct dst for dst muts. modified combining_dfs_plotting.R 2022-07-08 21:33:57 +01:00
289c8913d0 added MultClds_SIMPLE.py to simplify my function to run without blind test 2022-07-08 13:54:49 +01:00
880ef46099 added CHECK_model 2022-07-08 13:53:44 +01:00
23799275a0 saving work from thinkpad 2022-07-08 13:53:17 +01:00
5577f5b195 fucking shit count vars 2022-07-07 20:27:03 +01:00
3e18193a36 added examples 2022-07-07 17:46:06 +01:00
f57f25f47a various changes to count_vars_ML.R 2022-07-07 12:33:20 +01:00
d14c3f9c4a added dummy classifier 2022-07-07 12:28:58 +01:00
a15d801c2a tried pca 2022-07-05 23:05:37 +01:00
8d831f3613 added different scaling options 2022-07-05 22:47:13 +01:00
ebef0c7967 added test script to see one gene 2022-07-05 16:06:24 +01:00
79cb89a019 saving work 2022-07-05 16:06:03 +01:00
652cf4802e added MultClfs_fi to add FI scores for models, in development 2022-07-05 14:19:35 +01:00
53c229f480 added random state to split in function 2022-07-05 14:15:43 +01:00
e5f882841e added cm_datai.py to get data for cm model for running fs later 2022-07-02 16:57:41 +01:00
b2d0b827ad added cm run for logo_skf for actual data 2022-07-02 16:57:11 +01:00
9071a87056 fs: cut down the number of iterations 2022-07-02 11:12:39 +01:00
7ba838b493 redo unconfusion 2022-07-02 10:32:19 +01:00
a166a37c0e undo wrong rename 2022-07-02 10:31:55 +01:00
5a81511163 filename 2022-07-02 10:28:10 +01:00
dccd3c8eb2 multiple changes 2022-07-02 10:25:55 +01:00
2fda32901b change more hardcoded CPU counts to os.cpu_count() 2022-07-02 10:25:13 +01:00
b8653c6afe ML scripts: {'n_jobs': os.cpu_count() } 2022-07-02 10:22:07 +01:00
11af00f1db changed ml output dirs and ready to run fs 2022-07-01 21:40:14 +01:00
57348f1874 added log file for cm run for complete data with skf 2022-07-01 20:38:34 +01:00
b7e1b51a31 added ml_iterator_fs.py 2022-07-01 20:38:08 +01:00
b5777a17c9 saving work 2022-07-01 20:37:41 +01:00
d812835713 added cm_logo_skf.py and placeholder for splits 2022-07-01 13:55:12 +01:00
952cfeb4c0 added MultClf for combined model to make changes with cv 2022-07-01 11:40:23 +01:00
0494765c9b fixed indentation 2022-07-01 11:38:59 +01:00
7eef463915 TEMP -> ml_iterator 2022-06-29 22:20:23 +01:00
50cb36f2b3 add iterating iterator of iterators (TEMP) 2022-06-29 22:12:31 +01:00
cbfa9ff31b lineage plots: be a function and then run over all pairs 2022-06-29 19:24:47 +01:00
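
The refactor named here is the usual map-over-pairs pattern. A sketch under assumptions: `lineage_dist_plot` and `plot_df` are placeholders for the repo's actual plotting function and data frame.

```r
# one plotting function, applied to every unordered pair of lineages
lineages <- unique(plot_df$lineage)
pairs    <- combn(lineages, 2, simplify = FALSE)
plots    <- lapply(pairs, function(p)
  lineage_dist_plot(plot_df[plot_df$lineage %in% p, ]))
```
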
8965bee5d6 LINEAGE2 plots: p value stars 2022-06-29 17:18:55 +01:00
087170a798 added .py 2022-06-29 12:08:35 +01:00
9aadb0329f added ml_functions dir 2022-06-29 12:06:47 +01:00
c85c965c3e added TODO for lineage2.R 2022-06-29 10:26:08 +01:00
aff7247e3b really!?!?!?!? 2022-06-28 21:52:04 +01:00
478df927cc horrible lineage analysis hell 2022-06-28 21:51:02 +01:00
ce0f12382e added a comment to data extraction 2022-06-28 13:45:59 +01:00
6c10776ea9 saving work 2022-06-25 14:15:21 +01:00
f99b5d1888 added GetMLData.py for combined model and added to functions including previous ones that have been moved there 2022-06-25 14:12:07 +01:00
5d38cde912 added all run scripts for different splits 2022-06-24 20:39:50 +01:00
e2bc384155 added FS to MultClfs.py and modified data for different splits for consistency 2022-06-24 20:35:53 +01:00
edb7aebd6a saving 2022-06-24 15:43:00 +01:00
1160ad7268 added running_scripts to keep track 2022-06-24 15:41:27 +01:00
b37a950fec optimised run_7030.py to generate output from dict now that the process function and parameter dicts have been added 2022-06-24 15:40:18 +01:00
7dc7e25016 appended sys.path to allow local imports 2022-06-24 13:41:07 +01:00
a15ab80bc6 added log_FS_pnca_7030.txt after running FS for pnca 2022-06-24 13:27:16 +01:00
96f4e7085a added test_MultClfs.py to test the functions now in a single script 2022-06-24 13:26:42 +01:00
a3c644d04b removed MultModelsCl.py and ProcessMultModelsCl.py as these are merged into a single script for convenience 2022-06-24 13:25:51 +01:00
fba1481c08 added MultClfs.py that contains my ML functions 2022-06-24 13:25:00 +01:00
19da36842b removed the two functions MultModelsCl.py and ProcessMultModelsCl.py as these have now been combined 2022-06-24 13:24:04 +01:00
ad99efedd7 saving work 2022-06-24 13:21:21 +01:00
3514e1b4ba added run_7030_LOOP.py to loop through the resampling data and get processed output 2022-06-23 21:29:54 +01:00
1d3190899d added ProcessMultModelsCl.py that processes the output for multiple models 2022-06-23 21:27:13 +01:00
4fe62c072b added metadata output for running multiple models 2022-06-23 21:25:00 +01:00
5dea35f97c added scripts for FS including test call, etc. 2022-06-23 14:53:01 +01:00
8fe0048328 saving work 2022-06-23 14:52:27 +01:00
0350784d52 changed blind_test_input_df to blind_test_df in MultModelsCl 2022-06-22 16:42:04 +01:00
bc12dbd7c2 added run_7030.py that runs as cmd for all gene targets and sampling methods and outputs a single csv 2022-06-21 20:37:53 +01:00
5b0ccdfec4 added ml_data_fg.py 2022-06-21 18:21:41 +01:00
11ef627150 removed _dissected files and renamed them to _fg 2022-06-21 18:20:22 +01:00
fe0986aa28 added script to run ml baseline models orig version with feature groups 2022-06-21 18:17:56 +01:00
137f19a285 saving work 2022-06-21 18:12:31 +01:00
7b378ca6f3 adding formatting to get all output from ML for feature groups starting with genomics 2022-06-21 14:08:12 +01:00
cadaed2ba7 ML logs 2022-06-20 21:55:47 +01:00
4c5afa614f python scripts for original analysis with logs 2022-06-20 21:54:48 +01:00
8d8fc03f72 added test script to test dissected model 2022-06-20 21:53:15 +01:00
e68a153883 working on dissected model, testing diff feature groups 2022-06-20 21:51:07 +01:00
135efcee41 added option to add confusion matrix and target numbers in the mult function 2022-06-20 17:08:22 +01:00
905327bf4e script to run models based on group of features 2022-06-20 14:59:02 +01:00
4ab99dcbd2 saving work for yesterday where uq runs were repeated 2022-06-20 14:57:11 +01:00
efeaf52cde added ml runs for complete data with _cd_ in filenames reflecting this 2022-06-18 19:36:33 +01:00
9bc26c1947 slight formatting for existing scripts 2022-06-18 19:35:49 +01:00
a53fce5455 added notes for running ml scripts 2022-06-18 14:45:48 +01:00
e176d018cb added log files for these ml runs 2022-06-18 14:44:02 +01:00
5bd8ba33f7 added scripts for reverse training 2022-06-18 14:43:35 +01:00
d85415daf8 added scripts for scaling law split 2022-06-18 14:42:46 +01:00
4037641dfa added data and ml scripts for 8020 splits 2022-06-18 14:42:02 +01:00
2e50a555a0 minor formatting consistency for 7030 scripts 2022-06-18 14:41:05 +01:00
e05e4e2e38 added script for running 7030 split 2022-06-17 18:28:26 +01:00
91e868736c changed dir to allow ml script to import functions 2022-06-17 18:27:20 +01:00
e6d3692445 changed dir for reading func in pnca_config.py 2022-06-17 16:37:07 +01:00
96d4e61dca added baseline config files for running v2 ml analysis 2022-06-17 14:14:26 +01:00
05dd9698c4 added aa_index data for running ml 2022-06-17 13:41:25 +01:00
39ccd6cdf4 initial adding of ml scripts for baseline models 2022-06-17 13:40:09 +01:00
f355846dae added active site indication for merged_dfs in count_vars_ML.R and also added 'gene_name' in combining_dfs.py 2022-06-15 18:36:28 +01:00
1204f1faba added scripts and files to make AA index work for all drug targets, add header to the aa index output and fetch the aa index headers 2022-06-15 11:24:07 +01:00
03321c261e working new_aa.sh 2022-06-13 22:05:41 +01:00
c4ae6d2412 improved aa script 2022-06-13 21:48:44 +01:00
2307a19d86 added example bash cmds 2022-06-13 21:22:49 +01:00
40c4d382f4 added eg to run aaindex from a diff dir 2022-06-13 21:15:12 +01:00
bd7d01c7e6 various aa_index_scripts 2022-06-13 09:42:48 +01:00
0c316e4a41 renamed aa_index folder to aa_index_scripts 2022-05-30 02:24:54 +01:00
650d357afc reran to output merged_df3 and merged_df2 csvs from count_vars_ML.R 2022-05-29 03:10:51 +01:00
f41cd0082e moved mmcsm_provean_ed_combine_CHECKS.py to scratch for when I need ED data for merging 2022-05-25 23:45:01 +01:00
1baf7fa9f0 lf 2022-05-25 23:44:13 +01:00
8e65d75b58 added checks before combining provean and mmcsm_lig 2022-05-25 08:51:37 +01:00
a2bcc3a732 added mmcsm_lig and provean dfs merges in combining_dfs.py 2022-05-25 08:50:33 +01:00
d8041fb494 added count_vars_ML.R to check numbers for revised counts 2022-05-05 19:32:34 +01:00
39566ceadd lineage labels gsub now redundant in combining_dfs_plotting.R 2022-05-05 19:31:00 +01:00
d61d11e020 Merge branch 'master' of https://git.tunstall.in/tanu/LSHTM_analysis 2022-05-05 13:36:03 +01:00
e54ae877a8 updated data extraction to ensure genes without common mutations and duplicate indices can run from the cmd 2022-05-05 13:35:24 +01:00
4f22cd3db1 various tweaks to make the RShiny dashboards work 2022-04-28 20:21:03 +01:00
5429b8fed7 saving data extraction updated script 2022-04-28 13:02:30 +01:00
e419d320ac saving data extraction with final processing 2022-04-27 12:09:14 +01:00
3c436f0c27 finally revised data processing is complete 2022-04-27 11:09:36 +01:00
ac0d14e116 done but getting an error when running from cmd 2022-04-26 17:03:40 +01:00
29e9d10e39 add check for lin index duplicates before output 2022-04-25 18:38:04 +01:00
1371704685 got to the lineage extraction bit 2022-04-25 18:37:01 +01:00
0867827ec6 saved section for generating revised dst 2022-04-25 16:51:28 +01:00
cb93cef3c7 did all other mappings until dst column 2022-04-23 11:14:34 +01:00
7a10b4f223 phewwww! finally resolved counts for ambiguous muts 2022-04-22 17:51:11 +01:00
3cd7b36c59 updated data_extraction 2022-04-22 15:18:08 +01:00
bf7060baa9 finally added all the lineage calculations 2022-04-15 14:41:04 +01:00
95a73efdd2 saving work with sections reflecting activities 2022-04-14 19:43:14 +01:00
e99c169b35 saving work, lineage bits are all over the place, need to rearrange 2022-04-14 19:39:47 +01:00
ae3a5500c9 mostly done, now adding lineage magicry 2022-04-14 19:27:21 +01:00
f05cb96346 added sections and slotted relevant bits from data_extraction to v2 2022-04-14 12:21:16 +01:00
e6faf80c20 updating ambiguous muts manipulation section in data_extraction_v2 2022-04-14 10:36:08 +01:00
6330a2e716 added v2 for data extraction 2022-04-14 10:16:12 +01:00
2518556c96 added str.strip() instead of str.lstrip() 2022-04-08 17:05:24 +01:00
ac78fe16cd logoP SNP adj 2022-03-06 14:48:53 +00:00
56a6aa8c7e lineage_dist: add all_lineages feature 2022-03-06 13:31:43 +00:00
8fa585417e tweaks for params 2022-03-02 19:42:24 +00:00
bf3194259e lineage: toggle for "all lineages" 2022-03-02 15:54:09 +00:00
6c6709e41e various changes 2022-03-02 11:44:04 +00:00
2274f01f23 combined msa and wt seq into 1 list so only one list is passed as an arg for plotting ED plots 2022-02-15 08:31:58 +00:00
d38521e03a added placeholder defaults for functions in R to make sure that R shiny layout works with a data set for meeting tomorrow 2022-02-14 19:33:00 +00:00
0460ca1708 added consurf plot function and corresponding test script 2022-02-14 11:16:30 +00:00
18e1f14455 saving changes 2022-02-14 11:14:40 +00:00
6ffb084546 added hbond residues in config for all genes 2022-02-09 15:59:18 +00:00
7a14655ecb added active site positions to all config.R 2022-02-03 18:02:27 +00:00
04cddbbf2b minor formatting, taken a line off the barplots_subcolours_aa_PS.R 2022-02-02 19:01:13 +00:00
3e5191f5c6 attempting to visualise consurf plot along with highlighting active site pos 2022-02-02 19:00:23 +00:00
9c3818fd98 added NOTE to the barplots_subcolours_aa_PS.R as I am trying it for other plots 2022-02-02 18:59:33 +00:00
d13484e8f5 added function lineage_plot_data.R and corresponding test script; also renamed corr_plot_df.R to corr_plot_data with its corresponding test script 2022-02-01 16:25:58 +00:00
3d45780c1a updated docs for dm_om_data.R 2022-02-01 16:23:03 +00:00
e795c00831 added gene conditions to test_dm_om_data.R 2022-02-01 10:54:10 +00:00
b2e035d9bc moved old dm_om_data_nf to redundant/ 2022-01-31 17:52:29 +00:00
ef1fb96b88 tested dm_om_data function with alr gene to make sure valid dfs are being returned 2022-01-31 17:50:12 +00:00
5779b3fe87 added function to get wf and lf data and corresponding test 2022-01-31 17:36:34 +00:00
a287b801f7 saving before committing 2022-01-31 17:36:00 +00:00
ea931b59f3 replaced ligand_distance with the variable 2022-01-30 09:47:56 +00:00
8df4a85798 deleted accidental blank line 2022-01-29 18:10:38 +00:00
6b7b5e6c98 saving work 2022-01-29 18:08:07 +00:00
cd772e9df1 moved corr_data.R to redundant/ 2022-01-29 17:34:18 +00:00
1035547309 added function to extract data for correlation plots and corresponding test script 2022-01-29 17:31:02 +00:00
a4a4890634 added FIXME to fix dm_om data for targets other than gid 2022-01-29 17:30:19 +00:00
5346431256 repurposing corr_data.R into a function to allow required params to be passed in 2022-01-29 17:24:15 +00:00
7317156bba updated docs for the logo functions and tested all of them again 2022-01-26 15:53:53 +00:00
2f7f40efb1 added a few tweaks to check logoplots 2022-01-26 15:22:13 +00:00
1b20f09075 tested edplot with alr gene 2022-01-26 13:35:57 +00:00
8750e3126a renamed logoP.R --> logoP_or.R 2022-01-26 12:37:08 +00:00
a2da95ef7c added ed_pfm_data.R function and its corresponding test 2022-01-26 11:55:38 +00:00
5f9a95ccb1 moved Header_TT.R from plotting/ to scripts 2022-01-26 11:54:38 +00:00
dec6c72fb5 more header_tt 2022-01-26 11:50:50 +00:00
586927ca56 wholesale change for Header_TT.R location 2022-01-26 11:39:03 +00:00
b133b8be24 save content 2022-01-26 11:34:59 +00:00
1c1e98ad4f added get_logo_heights() in my_logolas.R 2022-01-26 11:16:02 +00:00
92af9fd565 combined logolas and raw data msa plots into 1 script and called it the same as before logoP_msa.R 2022-01-26 11:06:04 +00:00
6365fff858 moved logoP_msa_raw to redundant 2022-01-26 11:04:21 +00:00
3bc5dcbad3 renamed logoP_msa.R --> logoP_msa_raw.R 2022-01-26 11:03:45 +00:00
6a9f4a0cab added logoP_logolas.R to plot logolas like plot to show ED regions 2022-01-24 17:23:32 +00:00
9aa62b33b1 added my_logolas.R 2022-01-24 13:46:50 +00:00
febb8f0f7f checked embb logo msa plot with chosen positions 2022-01-19 19:04:43 +00:00
b72ffb5d2c just tested logo plot msa with embb after correcting the OMINOUS fasta file 2022-01-19 18:57:04 +00:00
59eaf58747 now definitely checked for all targets 2022-01-18 17:43:21 +00:00
b9d173a2c4 checked combining_dfs.R for all targets 2022-01-18 17:42:04 +00:00
e2cdee2d08 added an additional check in combining_df_plotting.R when generating merged_df2, as muts NOT present in mcsm can create trouble; fixed that and ran it successfully for alr and katg 2022-01-18 17:36:54 +00:00
8f8a9db92c tried ED logo, but needs work 2022-01-18 16:53:16 +00:00
00094f036a playing with MSA plots to allow filtering of positions, arghhh 2022-01-18 15:30:41 +00:00
08bd8a2ee5 added config files for R plots 2022-01-18 11:15:03 +00:00
4e779b2945 saving work 2022-01-18 11:14:25 +00:00
ef4ac81a8a added config file for alr 2022-01-17 19:12:15 +00:00
c8c4afb28a added logoP_msa.R 2022-01-17 19:11:48 +00:00
68a092037b finding seq discrepancy in MSA for embb 2022-01-17 19:11:10 +00:00
af04c69d66 A MAAAADDD MAAADDD DAYYYYY, messy embb numbering agrrrhhhhh 2022-01-16 18:34:49 +00:00
07aedfe286 added both test logo plots in one script 2022-01-16 15:00:54 +00:00
efe6703673 attempted omitting snp from logo plot for OR 2022-01-16 14:45:39 +00:00
1244581cdd just added some minor tweaks for the logo plots with OR 2022-01-16 13:31:32 +00:00
9546355241 updated and tested logoP_snp.R. All done nicely 2022-01-15 13:17:26 +00:00
0334188801 added test scripts for logoP and logoP_snp.R 2022-01-14 16:10:49 +00:00
f640087922 added logoP_snp.R and renamed logo_plots_func.R to logoP.R 2022-01-14 16:09:57 +00:00
f27b536157 added option to remove empty positions from logo plot 2022-01-14 11:07:16 +00:00
4e2f5f35db added log option to function properly 2022-01-14 10:50:56 +00:00
3b7cea3c47 adding legend for logo plot 2022-01-14 10:18:01 +00:00
426a5cb0b5 added logo_plot function and test to check it 2022-01-13 18:55:13 +00:00
344a74a9e1 saving work for logo plots 2022-01-13 18:53:47 +00:00
7cbd9b4996 added barebone notes for logo_plots_func.R 2022-01-12 17:59:02 +00:00
3f7bc908ec going through functions and script for interactive plots 2022-01-12 17:58:16 +00:00
1f266c4cb8 more tidying and formatting for combining_dfs.py. Hopefully no more after today 2022-01-11 17:51:28 +00:00
c48fa1dbb0 minor tidy up to check interactive graphs Rshiny 2022-01-07 16:07:44 +00:00
7d60a09297 tidy up 2022-01-06 16:38:59 +00:00
bffa3c376c thorough checking and updates for final running of all gene targets 2022-01-05 17:55:35 +00:00
b66cf31219 added untracked files in scripts/plotting 2022-01-04 12:27:25 +00:00
3ab6a3dbc1 added untracked files in scripts and dynamut 2022-01-04 12:26:54 +00:00
00b84ccb1c handled rpob 5uhc position offset in mcsm_ppi2 2022-01-04 10:45:29 +00:00
46e2c93885 merged changes from the combining_dfs.py file from branch 'embb_dev' 2021-11-24 07:58:27 +00:00
e5aca5e24f fixed the duplicate column problem by removing them from combining_dfs.py 2021-11-24 07:57:20 +00:00
4f52627740 Merge branch 'embb_dev' 2021-11-19 08:05:46 +00:00
436bdafece added info re having run mcsm_na for RNAP 2021-11-19 07:51:13 +00:00
2925c89d11 ran mcsm_na for rpob's RNAP complex, i.e. 5UHC 2021-11-19 07:48:42 +00:00
cee10cc540 ran mcsm format for embb 2021-11-13 09:43:56 +00:00
c32de1bf0f Merge branch 'embb_dev' 2021-11-12 14:37:10 +00:00
4eaa0b5d2b saving work after running combining_dfs.py 2021-11-12 14:16:48 +00:00
dad8f526a2 added TESTING_plots.R 2021-11-09 13:55:21 +00:00
a5c7e1e9dd added FIXME and TODO related to alr in combining_dfs.py 2021-11-09 13:23:50 +00:00
ddae107314 saving work in LSHTM_analysis before combining data for targets 2021-11-09 12:44:11 +00:00
63ec8a1c37 added split_csv_chain.sh for mCSM-NA analysis in scripts/ 2021-10-29 14:00:25 +01:00
8035308cdf cherry-pick mcsm_na/run_format_results_mcsm_na.py from master to ensure consistency 2021-10-28 12:54:04 +01:00
e2bc1cdde1 Merge branch 'gidb_dev' (including a merge conflict after adding CLI arguments for mcsm_na/format_results_mcsm_na.py) 2021-10-28 12:45:39 +01:00
9cfb32afb8 pretending that we added the CLI arguments 2021-10-28 12:43:44 +01:00
45208f4c3d saving ppi2 format script on embb_dev branch 2021-10-28 12:22:46 +01:00
e4661963ab saving work after embb branch merge 2021-10-28 12:15:29 +01:00
3368e949e8 bring in embb stuff which was in the wrong branch 2021-10-28 11:18:13 +01:00
057291a561 much development 2021-10-28 10:41:43 +01:00
873fd3a121 added gene.lower to dynamut2 format result script 2021-10-19 11:12:34 +01:00
ba21188bd2 added notes 2021-10-18 13:58:06 +01:00
675b222181 added cmd option for dynamut2 formatting results 2021-10-18 13:52:29 +01:00
98325d763f fixed output filename in deepddg_format.py 2021-09-30 13:37:17 +01:00
af227f9864 moved deepddg_format.py from ind output dir to scripts 2021-09-30 13:35:33 +01:00
93a91518e1 fix runFoldx so that it looks for a missing rotabase.txt in the process_dir and also print the foldx command that will be run 2021-09-29 18:24:06 +01:00
d443ecea6b added separate script for splitting csv after adding chain ID. saves lots of post processing 2021-09-20 16:13:15 +01:00
daa3556ede split csv for isoniazid 2021-09-20 16:12:45 +01:00
5cd6c300a7 saving minor update to function fix 2021-09-17 13:35:48 +01:00
e115c3636c fixed lf_bp function with aes_string and reformulate 2021-09-17 13:33:19 +01:00
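
The fix named here is the standard pre-tidy-evaluation way to let an R plotting function take column names as strings: `aes_string()` for the aesthetics and `reformulate()` to build a one-sided formula, e.g. for faceting. A sketch with invented argument names (newer ggplot2 would use `.data[[x_col]]` instead):

```r
library(ggplot2)

lf_bp <- function(lf_df, x_col, y_col, facet_col) {
  ggplot(lf_df, aes_string(x = x_col, y = y_col)) +  # string column names
    geom_boxplot() +
    facet_wrap(reformulate(facet_col))               # builds ~facet_col
}
```
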
e2d7a6567e minor bug fixes to allow i_graphs for stability to render correctly 2021-09-16 18:59:02 +01:00
51aa321792 sorting out bp_subcolours in interaction 2021-09-16 12:44:42 +01:00
e8734b1c4b sorted merged_df2 and consequently others by position in combining_dfs_plotting.R 2021-09-16 12:43:36 +01:00
cb5d7aa5ab corrected foldx_outcome classification in combining_dfs.py: positive values are Destabilising and negative are Stabilising 2021-09-16 10:59:55 +01:00
56600ac3f8 added config/ with drug gene names 2021-09-16 10:05:28 +01:00
746889b075 saving work for the day after massive repurpose 2021-09-15 19:48:56 +01:00
1d16c6848e moved coloured_bp_data.R to redundant in light of updated function and reflected this in notes within get_plotting_dfs.R 2021-09-15 19:42:08 +01:00
96e6e8db5d saving work and tidying script 2021-09-15 19:37:39 +01:00
f0e66b2f7b added the scratch script as _v2 to play while repurposing bp_subcolours.R 2021-09-15 19:34:24 +01:00
2ac5ec410e added test_bp_subcolours.R 2021-09-15 19:33:52 +01:00
7550efbd4c added wideplot subcols generation within bp_subcolours.R to make it easier to call the whole thing as a function and use merged_df3 to generate plot without having to separately generate special data for it. Tested with real data on different stability params 2021-09-15 19:29:09 +01:00
449af7acf4 fixed pos_count calcs in function by specifying dplyr and changed summarize to summarise 2021-09-15 15:46:42 +01:00
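
Namespacing plus the summarize → summarise spelling guards against another attached package (classically plyr) masking dplyr's verbs. A minimal sketch; merged_df3 is the merged data frame named in other commits, and the column names are assumed:

```r
library(dplyr)

pos_count_df <- merged_df3 %>%
  dplyr::group_by(position) %>%              # qualified, so masking can't bite
  dplyr::summarise(pos_count = dplyr::n())
```
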
bf432cd054 more updates to pairs_panels to take colnames for plotting 2021-09-14 18:20:12 +01:00
b98977336c updated my_pairs_panel.R to make the dots coloured 2021-09-14 15:36:05 +01:00
996d67b423 added pretty colnames to corr_data.R 2021-09-13 10:24:41 +01:00
3f3fe89a6b added shorter scripts for each different processing for plots to make it easier to read code 2021-09-10 18:20:45 +01:00
27f0b15d4c tidied script plotting_data.R by removing superseded code 2021-09-10 18:19:56 +01:00
3ddbee8c90 finally moved foldx_outcome and deepddg_outcome calcs to combine_dfs.py in python script, i.e. cleaned source data 2021-09-10 18:19:01 +01:00
5c8a9e8f00 sorted combining_dfs.py with all other data files and tidied up get_plotting_dfs.R 2021-09-10 18:16:41 +01:00
4ba4ff602e added foldx_scaled and deepddg_scaled values to combine_df.py and also used that script to merge all the dfs so that merged_df2 and merged_df3 are in fact what we need for downstream processing 2021-09-10 16:58:36 +01:00
dda5d1ea93 moved old lineage_dist plot scripts to redundant 2021-09-09 16:16:18 +01:00
2bd85f7021 added lineage_dist_plots.R 2021-09-09 16:15:07 +01:00
93038fa17c added lineage_dist.R and renamed lineage_bp_data file to lineage_data 2021-09-09 16:14:14 +01:00
b7d50fbbcd added lineage_labels and mutation_info_labels to combining_dfs_plotting 2021-09-09 16:10:11 +01:00
03031d2eb6 moved all test scripts for functions to tests/ 2021-09-09 13:12:07 +01:00
2ee66c770b updated notes 2021-09-07 11:18:10 +01:00
686fd0cd80 updated running_plotting_scripts.R 2021-09-07 11:16:41 +01:00
c9519b3b56 moved old lineage_basic_barplot.R to redundant 2021-09-07 10:52:26 +01:00
3cee341170 replaced old lineage barplot with count and diversity combined plots sourced from function 2021-09-07 09:27:47 +01:00
50b89cdcd7 one function with tuned params to generate count and diversity barplot 2021-09-06 19:52:56 +01:00
869fca7f94 added function for generating lineage barplots and also a test script, along with adding a script for processing data and adding it to get_plotting_dfs.R 2021-09-06 19:50:50 +01:00
605eb54526 saving work for the day 2021-09-02 17:40:24 +01:00
a981580b7a separated get_plotting_dfs_with_lig.R 2021-09-02 12:51:31 +01:00
fcb4b85747 modified bp with option for adding stats and boxplots. Moved old one to redundant 2021-09-02 12:50:24 +01:00
826d3c72b7 added functions for bp with stat and tested them 2021-08-27 14:05:00 +01:00
edb409baef renamed dm_om barplot function script to lf_bp_stability.R 2021-08-27 13:03:39 +01:00
da9bb67706 added function for stats from lf data 2021-08-27 13:01:52 +01:00
6e01ef22c0 added stat_bp_stability.R which needs to be a function for generating stat plots 2021-08-26 16:37:56 +01:00
6d9412d232 playing with dm_om (other) plots data and graph on gid branch 2021-08-26 16:35:46 +01:00
0e44958585 added log10 OR and P values to myaf_or_calcs.R 2021-08-23 20:01:01 +01:00
182465d579 added corr plots as function for interactive graphs on shiny 2021-08-20 18:52:47 +01:00
f7aac58081 added format_results_dynamut2.py and ran shiny scripts for barplots 2021-08-19 16:25:38 +01:00
8cdf720702 added pdb_fasta_plot.R for generating some useful plots for shiny 2021-08-17 10:55:06 +01:00
8de22e14f3 extracting gid seq from pdb file using pdbtools 2021-08-17 10:53:26 +01:00
2511162a1d added aa_index/ with script that return dfs for plots for shiny perhaps 2021-08-13 16:22:11 +01:00
5529fbf63d added dynamut results formatting scripts, although needs to be rerun once b7 completes 2021-08-13 13:24:22 +01:00
64669eb05f indicated f for format for mcsm_na formatting script 2021-08-13 13:23:42 +01:00
5232731bc5 saving work 2021-08-12 17:37:56 +01:00
8fbf5bcadd extracted results for dynamut gid bissection b10_21 2021-08-12 17:35:12 +01:00
efe5e0e391 Merge branch 'master' into gidb_dev 2021-08-12 15:35:28 +01:00
fed8cd83a0 minor tidy up for script submit_dynamut 2021-08-12 15:33:57 +01:00
93fae9e5f5 reran b7 since previous run file output was 0 bytes 2021-08-12 15:29:36 +01:00
ca07351086 ran b9 and b10 for gid after Dynamut team reran due to server issues 2021-08-12 10:06:43 +01:00
cf0db2a9c0 saving dynamut and mcsm_na jobs submitted and retrieved 2021-08-11 17:32:15 +01:00
93482df47a added script for formatting mcsm_na results 2021-08-06 19:12:57 +01:00
50cf6ca3ac ran submit and get_results for one last batch for mcsm_na and did some bash formatting to get proper filenames, etc. 2021-08-06 19:09:29 +01:00
4733ec9db0 resuming work after conference 2021-08-05 16:54:34 +01:00
a9119d7f03 indicated which cols are not available for pnca as I ran these scripts for generating plots for the poster 2021-07-07 13:12:29 +01:00
55adc3fa60 added leg_title size for bp function 2021-07-07 13:11:13 +01:00
59a1213f65 generated pncA plot for poster for ps_combined 2021-07-07 11:38:07 +01:00
cb3a0f71da reran plots with current lig dist 2021-06-30 17:35:57 +01:00
ed2fc016ca added the almost done shiny for barplots subcolours 2021-06-30 17:20:04 +01:00
374764b136 renamed barplot_colour_function.R to bp_subcolours.R and reflected it in scripts using it. 2021-06-29 14:05:48 +01:00
db66fdb844 added barplots_subcolours.R that generates heatmap style barplots 2021-06-29 14:00:10 +01:00
c9a5e7de6b moved subcols script to redundant 2021-06-29 13:59:38 +01:00
ac09cfc4e0 moved barplot_colour_function.R to functions 2021-06-29 13:58:22 +01:00
3d4ccc51d7 updated running_plotting_scripts.txt with corr_plots.R 2021-06-28 17:30:25 +01:00
237e293ca3 moved corr_data and corr_PS_LIG.R to redundant 2021-06-28 17:29:31 +01:00
55b5d31c07 added corr_plots.R to generate corr plots by adding source data in get_plotting_dfs.R and tested with cmd 2021-06-28 17:27:50 +01:00
a7d26412e5 added corr data to get_plotting_dfs.R and generate corr plots 2021-06-28 17:25:45 +01:00
2993ab722a moved old logo plots scripts to redundant and updated running_plotting_scripts.txt to reflect these and how to run the single logo_plots.R to generate logo plots 2021-06-24 17:45:40 +01:00
8de2686401 added logo_plots.R that now produces all logo plots while sourcing the get_plotting_df.R script 2021-06-24 17:34:53 +01:00
0e15c05d8b checked logo_multiple_muts.R with the new sourcing script for data 2021-06-24 16:43:23 +01:00
6bbc3328b9 added get_plotting_dfs.R as a mother script to be sourced by all plotting scripts 2021-06-24 14:21:34 +01:00
9668452e98 made logo_plot.R source script that pull in all the data 2021-06-24 14:19:46 +01:00
c8be8407e8 moved my_pairs_panel.R to functions/ 2021-06-24 12:13:15 +01:00
ca2315523d fixed cmd running script problem for logo plots 2021-06-24 12:12:36 +01:00
552c5e77aa added first line to all func to run from 2021-06-24 10:02:14 +01:00
48a85ede0c saving work on logo plots before finishing 2021-06-23 16:49:18 +01:00
9d964e84b6 generated logo_plot.R from cmd, checked 2021-06-23 16:35:44 +01:00
015e34894f added test_plotting_data.R, and replaced input param of csv into df 2021-06-23 16:16:23 +01:00
8277b489d6 changes made to combining_dfs_plotting.R 2021-06-23 16:15:15 +01:00
4f4734f565 updated logo_plot.R with functions 2021-06-23 12:06:41 +01:00
5dec604742 moved combining_dfs_plotting.R to function and added test script for this as well 2021-06-22 18:15:15 +01:00
754cd70a6f added files that were moved to redundant 2021-06-22 18:06:08 +01:00
ea79f3b3c7 turned combining_dfs_plotting.R to a function and moved old script to redundant 2021-06-22 18:04:10 +01:00
cd5cbce3a0 updating script to sort out proper merging for plotting 2021-06-22 14:46:03 +01:00
7f2c1d7ed8 took extra lines from data extraction 2021-06-21 16:15:44 +01:00
a3c10eb842 added af_or to add to combining_dfs.py 2021-06-21 14:53:04 +01:00
1155959e67 added deep ddg formatted data to combinig_dfs.py 2021-06-21 12:56:06 +01:00
3ff9604002 added deepddg data to combining_df.py 2021-06-21 11:53:56 +01:00
25fcebe448 added function to add aa code for mcsm and gwas style mutations to a given file 2021-06-18 17:48:26 +01:00
926d181120 saving work before adding files 2021-06-18 17:47:09 +01:00
0e0f7c89df Merge branch 'gidb_dev' 2021-06-14 13:27:00 +01:00
0881181f4b added aa_prop.py and add_aa_prop.py to add aa properties for wt and mutant in a given file containing one letter code wt and mut cols as csv 2021-06-14 13:24:00 +01:00
58b5b63595 changed aa_prop_water to 3 categ according to KD, updated ref dict 2021-06-14 13:22:56 +01:00
2a8133898f added function and test for aa_prop_bp.R 2021-06-14 09:22:05 +01:00
140bdc6d96 added example for layout 2021-06-14 09:06:30 +01:00
851683811e weird pdbtools commit 2021-06-11 21:45:18 +01:00
6dd8cc6f44 added another aa dict type to reference_dict.py and calculated electrostatic changes for muts based on adding these properties to mcsm mut style snps. This will allow the calculation on a given file type since the ref dict can now easily be adapted. 2021-06-11 17:12:21 +01:00
6e8116bc16 calculating af_or using function and cmd options now 2021-06-11 15:12:08 +01:00
5f82c8b393 added script to test af_or_calcs 2021-06-11 13:33:25 +01:00
96af746726 added mychisq_or.R and af_or_calcs.R 2021-06-11 13:28:07 +01:00
ad0b814f9c moved old af_or_calcs.R to redundant 2021-06-11 13:27:40 +01:00
587be435e9 saving the correct af or script 2021-06-11 13:26:28 +01:00
4e38e2a80e saving work before converting to a function 2021-06-11 13:25:02 +01:00
d9e00b9a42 minor tweak to plotting_globals.R to make gene_match a global var 2021-06-11 11:21:20 +01:00
29fa99b914 moved functions/ in the scripts dir 2021-06-11 11:11:39 +01:00
2dc81e72f4 moved old bp scripts to redundant 2021-06-10 16:18:08 +01:00
ff7522eca2 moved plotting_func to functions and replaced 3 basic_barplots scripts with 1 2021-06-10 16:09:58 +01:00
b9d176afa4 added function for position_count_bp.R 2021-06-10 14:46:11 +01:00
cbb3749a21 added functions dir for further tidying and tested this with ind scripts for stability 2021-06-09 18:13:18 +01:00
912a439589 moved bp function script to function/ 2021-06-09 17:08:56 +01:00
826b8aa9d0 added shiny app and turned stability bp to function 2021-06-09 17:05:02 +01:00
f871dd39cd saving work 2021-06-09 16:27:05 +01:00
47a6ddbf72 repurposed and ran basic_barplots for lig and foldx including filenames 2021-06-09 11:33:08 +01:00
27d79105ef repurposed basic_barplots_foldx.R 2021-06-09 11:24:50 +01:00
8f73c7b804 updated how to run plotting scripts. This is a cleaner version to keep up-to-date 2021-06-08 16:53:07 +01:00
3f691281bc wrapper script basic_barplots_PS.R now takes cmd and calls functions to generate plots. Tested and verified. 2021-06-08 16:48:19 +01:00
1505a3c707 tidied plotting_data.R as a function returning a list of dfs 2021-06-08 16:00:28 +01:00
9af0249e0e added plotting_globals and text file with info on how to run plotting scripts 2021-06-04 17:26:01 +01:00
a5715bcccc tweaking basic bp to make generic 2021-06-04 17:23:41 +01:00
aaa24ca32d minor updates to dir.R 2021-06-04 15:05:52 +01:00
a1fef205da adapted combining_dfs.py and plotting.R for gid and attempting to make it generic 2021-06-04 14:36:16 +01:00
2c5c704d0b test branch commit 2021-06-04 09:43:48 +01:00
b77f55fcc2 saving before starting work 2021-06-04 09:38:17 +01:00
59430a49dd updated counts.py with wt seq counts 2021-03-03 11:54:48 +00:00
88229860e2 added adjusted p-values for DM muts comparison 2021-02-27 10:42:04 +00:00
9784bc1729 updated count.py with indel and stop codon count 2021-02-24 09:56:36 +00:00
8975a4cedf retrieved results for gid b8 and b9 2021-02-23 08:59:01 +00:00
3ec42edc57 retrieved gid b7 and submitted b8,b9 and b10 2021-02-22 09:31:29 +00:00
7925a408cd retrieved results for gid b6 2021-02-21 16:23:22 +00:00
aca73048c1 added count.py to count samples for quick checks 2021-02-21 16:07:33 +00:00
9b0d2f6550 saving work and generating revised_figure7 2021-02-20 16:17:38 +00:00
e8644d7af5 dynamut retrieved b5 and b6, submitted 6 and 7 2021-02-20 13:05:30 +00:00
d27b5898f7 code to retrieve results from batch 4 and 5 once ready 2021-02-19 12:09:26 +00:00
b12a7769ca updated .gitignore 2021-02-18 12:01:04 +00:00
0c563274b4 updated .gitignore to include temp dirs 2021-02-18 11:54:36 +00:00
8e68fb8f6a add files 2021-02-18 11:50:46 +00:00
ce5b545703 running dynamut in batches 2021-02-18 11:27:20 +00:00
ff2bc78645 renamed files in dynamut for consistency 2021-02-18 10:52:51 +00:00
84fccfbbb6 renamed file in mcsm_na to be consistent 2021-02-18 10:51:17 +00:00
501492f9fb renaming file 2021-02-18 10:48:06 +00:00
96e101cf15 renamed file run_submit to run_submit_dynamut 2021-02-18 10:45:35 +00:00
06df7369de renamed file run_results to run_get_results 2021-02-18 10:43:45 +00:00
4d3389264d ran mcsm_na for all 26 batches for gid 2021-02-16 13:55:31 +00:00
c02a55c167 submitting mcsm_na jobs manually 2021-02-16 10:51:06 +00:00
3925c4fd29 added get_results_mcsm_na.py and run_get_results.py to retrieve results for each batch run of 20 for mcsm_na 2021-02-15 12:22:52 +00:00
4df1c54674 saving work for mcsm_na 2021-02-15 12:22:19 +00:00
0d58c4800b added mcsm_na_temp 2021-02-12 17:40:02 +00:00
84970a667c added shell script to format muts for mcsm NA 2021-02-12 17:38:42 +00:00
a6f1f65acf added mcsm_na scripts to submit batches of 20 2021-02-12 16:51:41 +00:00
7c84e8b044 minor code tidy up 2021-02-12 16:50:34 +00:00
26fe956d47 tested and added note to reflect that tar.gz needs to be made into a cmd line option 2021-02-12 15:32:16 +00:00
2b6ffec100 checked tar.gz download from the script with example 2021-02-12 15:25:32 +00:00
a5f1878158 added tar.gz download within get_results.py 2021-02-12 15:24:51 +00:00
deb0aa8e58 separated defs and calls and added a separate script to test examples 2021-02-12 14:15:55 +00:00
6c458f8883 updating and cleaning get_results script 2021-02-12 12:04:49 +00:00
db1c950c39 updating get_results_def.py 2021-02-12 11:38:21 +00:00
c146ea0f43 added example files to test dynamut results fetching for single and multiple urls 2021-02-11 19:22:19 +00:00
72426fd949 updated with def for get_results.py for dynamut 2021-02-11 19:21:26 +00:00
9629d24169 extracting single mut url from the batch processing step 2021-02-11 17:19:04 +00:00
e67a716d82 added submit_def.py with example to run batch of 50 2021-02-11 14:36:32 +00:00
91f214f014 added split_csv.sh 2021-02-11 13:42:14 +00:00
e302aafacf uncommented some debug output for mcsm, pandas and numpy conflict. So temporarily resolved it by running from base env 2021-02-11 10:53:23 +00:00
acd51ab3e4 saving work in dynamut submit 2021-02-11 09:46:11 +00:00
e3189df74b dynamut scripts and minor change dir for rd_df.py 2021-02-10 15:40:33 +00:00
21451789e7 renamed files 2021-02-10 11:53:20 +00:00
53ff6a9a1a added sample test_snps 2021-02-10 10:38:08 +00:00
af1446253b updated minor changes 2021-02-10 10:37:44 +00:00
8d0bd8011f added deprecated shell scripts 2021-02-10 10:36:02 +00:00
6103254442 updated testing cmds for foldx 2021-02-10 10:32:09 +00:00
3280cdb2a1 added test2/ for testing updated foldx script 2021-02-10 10:16:28 +00:00
c8feba90bc added script to submit jobs 2021-02-09 20:16:27 +00:00
4f25acfa35 adding and saving files 2021-02-09 18:30:47 +00:00
d77491b507 testing dynamut script 2021-02-09 18:28:16 +00:00
aec61e2c1f Merge branch 'master' of https://git.tunstall.in/tanu/LSHTM_analysis 2021-02-09 16:12:34 +00:00
d26b17fd19 added dynamut dir 2021-02-09 16:11:07 +00:00
660ab31ce8 work from thinkpad 2021-02-09 16:03:02 +00:00
2f0e508679 add foldx5 wrapper 2021-02-09 15:45:21 +00:00
b5b54c7658 don't break when the pdb file is in a weird place with a weird name 2021-02-09 15:20:55 +00:00
73b705a563 check to handle missing I/O/P dirs if drug unset 2021-02-09 15:00:03 +00:00
0d2f5c55ef test2 runfoldx symlink 2021-02-09 14:43:03 +00:00
93f6707b8f various changes 2021-02-09 14:42:44 +00:00
80e00b0dfa renamed file runFoldx.py in test2/ to reflect this 2021-02-09 10:54:35 +00:00
f95f2a3c93 remove shell scripts run with subprocess() and launch foldx directly from python 2021-02-08 18:06:02 +00:00
d4a7e3b635 modifying script to avoid invoking bash as a subprocess 2021-02-08 16:59:42 +00:00
fab1fb0492 more debug 2021-02-08 16:16:53 +00:00
c9698d7550 fixup broken shell scripts 2021-02-08 15:44:21 +00:00
a67156bc87 test2 bugfixes 2021-02-08 15:24:22 +00:00
4d03a43c4a added user defined option for processing dir to allow me to specify external storage device for running it 2020-12-02 11:26:26 +00:00
619a828659 added chain_extract.py and pdb_chain_extract.py 2020-11-30 14:11:46 +00:00
a7d7bceb00 adding options to specify files by user 2020-11-27 13:02:15 +00:00
50744f046f added my_pdbtools containing pdbtools cloned from a git repo 2020-11-17 13:56:23 +00:00
2911678177 updating notes to running_scripts.py as running for another drug-target 2020-11-17 13:55:16 +00:00
d4cd5aea0a modified running script to mention chain info for foldx 2020-11-16 16:16:24 +00:00
91b7f73a63 added script to interrogate pdb files mainly for res numbers 2020-11-16 16:01:31 +00:00
073381e5a2 updated results summary in the data_extraction.py 2020-11-12 17:05:29 +00:00
e67fbfd986 handling missing dir for data_extraction.py 2020-11-12 13:21:06 +00:00
c7194b7423 added what is required as a minimum to run data_extraction 2020-11-06 19:04:27 +00:00
719f18a226 added base histogram script for af and or 2020-10-13 13:38:17 +01:00
784199d48f added ns prefix to SNPs for unambiguity 2020-10-13 13:37:22 +01:00
02fae30c29 changing labels in graphs for frontiers journal 2020-10-09 13:10:08 +01:00
e91d704929 renamed other_plots.R to other_plots_combined.R and changing labels to capital letters for journal 2020-10-09 12:17:24 +01:00
bf3e830f64 saving work minor changes perhaps 2020-10-08 16:03:12 +01:00
7158f5b2c9 added af and OR columns in the data 2020-10-06 19:39:59 +01:00
da3b23d502 indicated hardcoded active site residues for pnca 2020-10-06 19:12:32 +01:00
10e1baee82 script to subset data for dnds cals 2020-10-06 19:11:34 +01:00
4157b8137c added barplot_subcolours_aa_combined.R to combine and label these plots 2020-10-06 18:43:20 +01:00
861b2a7ee1 adjusted x axis position label for barplot_subcols_aa_LIG.R 2020-10-06 18:42:24 +01:00
4368c061c7 generated labelled ps_plots_combined.R and capital "P" for position in barplots coloured aa for Lig 2020-10-06 18:15:50 +01:00
315f7b1e0e output corr plots with coloured dots 2020-10-06 17:47:24 +01:00
0cdc507ba5 updated TASK in hist_af_or_combined.R 2020-10-06 16:43:59 +01:00
711781933c renamed dist_plots.R to dist_plots_check.R as it's exploratory 2020-10-06 16:39:24 +01:00
cc8443c7d4 added hist_af_or_combined.R to generate plots for output and moved previous run to scratch_plots/ 2020-10-06 16:33:25 +01:00
9b9ee07801 added hist_af.R 2020-10-06 15:07:42 +01:00
66db9ddd9c updated .gitignore 2020-10-06 09:55:19 +01:00
3ecba79eb9 added basic_barplots_foldx.R for supp figure 2020-10-06 09:53:34 +01:00
b63bbd6f15 moved not required plots to scratch 2020-10-06 09:52:54 +01:00
923cad81b5 saving predictions script 2020-09-30 14:09:08 +01:00
4f32ffd3b6 added predictions for ps and lig and output to results 2020-09-30 13:12:05 +01:00
c95db27b06 added prediction.R to do logistic regression 2020-09-30 10:04:49 +01:00
8e16b2635e added ../data_extraction_epistasis.py for getting list for epistasis work 2020-09-29 16:09:54 +01:00
6354caae3c added corr_data.R corr_PS_LIG_all.R corr_PS_LIG_v2.R 2020-09-29 16:08:25 +01:00
9ba7b32c14 added dist_plot.R to generate plots for writing results 2020-09-23 19:24:42 +01:00
a0755eeab6 added more analysis in extreme_muts.R to be tidied later 2020-09-23 19:23:34 +01:00
5f10ad8075 added fold and duet agreement to extreme_muts.R 2020-09-23 11:20:22 +01:00
4398c049ca added foldx scaled and foldx outcome to plotting_data.R 2020-09-23 11:12:41 +01:00
5deb12187e updated extreme_muts.R with number of budding hotspots and mult muts numbers 2020-09-23 11:02:13 +01:00
3318f3f85a Update README.md 2020-09-21 18:11:24 +01:00
e0561f29c0 Update README.md 2020-09-21 18:11:10 +01:00
7c6581f19a Update README.md 2020-09-21 18:09:55 +01:00
e8aff6129a Update README.md 2020-09-21 18:08:49 +01:00
7243f0c7e7 Update README.md 2020-09-21 18:07:58 +01:00
759efd876d updated gitignore for more tidying 2020-09-21 17:58:51 +01:00
2c013124ad updated gitignore to tidyup 2020-09-21 17:54:54 +01:00
1cf1f4e70e remove unneeded dir 2020-09-21 17:49:19 +01:00
759054de35 added ks_test_all_PS.R, ks_test_dr_PS.R, ks_test_dr_others_PS.R 2020-09-21 17:46:22 +01:00
535a5e86c0 saving combined bubble plot with labels 2020-09-18 18:19:55 +01:00
ad447b62df updated .gitignore to include .RData 2020-09-18 18:10:23 +01:00
07fa82520e added script basic_barplots_combined.R to combine basic barplots for PS and lig 2020-09-18 18:09:24 +01:00
0c5ef2e72c saving work 2020-09-18 18:07:48 +01:00
c5266770af added ggcorr all plot figure for supp 2020-09-18 12:46:12 +01:00
c0b8d56fea added ggcorr plots combined for all params 2020-09-18 11:56:19 +01:00
9ae9042033 saving work 2020-09-18 11:55:08 +01:00
dcf3d474e7 updated Header file and saving work 2020-09-17 20:12:08 +01:00
6e991e928a logo_combined.R, outputs logo plot with multiple mutations and log_or 2020-09-17 20:01:57 +01:00
91f8707d47 minor tweaks in logo and corr plots 2020-09-17 20:00:34 +01:00
bf5854aeaa updated corr plots to show points with no colours 2020-09-17 17:17:11 +01:00
36040d90f2 updated corr_PS_LIG.R to output both styles of corr plots 2020-09-17 17:04:03 +01:00
3999aa26a3 renamed corr_plot scripts 2020-09-17 16:38:40 +01:00
faf52e1790 updated plot name in corr_plots_foldx.R 2020-09-17 16:36:45 +01:00
afe650fd7f renamed file to denote corr adjusted and plain 2020-09-17 16:35:35 +01:00
c09330130e added scratch_plots/ggpairs_test.R to play with ggally for future 2020-09-17 15:32:40 +01:00
a6182f2b3d added plotting/corr_plots_style2.R; added my version of pairs.panel with the lower panel turned off. Also added a new script for corr plots using my version of pairs.panel 2020-09-17 15:31:37 +01:00
e8b58bfe28 saving work 2020-09-17 15:29:17 +01:00
b41b33b73c added new layout for dm_om and facet_lineage plot 2020-09-17 14:01:04 +01:00
32da321f32 updated with two outputs: labelled and unlabelled 2020-09-16 15:37:56 +01:00
c4e96ce7d9 renaming and moving files 2020-09-16 14:57:51 +01:00
3269a27dd2 renamed file in scratch plot/ 2020-09-16 14:53:53 +01:00
19287f3b4b playing with lineage_dist_dm_om 2020-09-16 13:23:49 +01:00
1bc7f83916 added dir scratch_plots/ to practice extra plots 2020-09-16 11:51:17 +01:00
ae11792f46 updated plotting_data.R with stability colours as variables 2020-09-16 11:47:38 +01:00
4d34f5b5d7 saving work 2020-09-15 13:34:26 +01:00
aa4294dff2 updated distribution scripts to try adding points 2020-09-15 13:33:28 +01:00
3bb2d3c78c updating lineage_country.R with different data slices 2020-09-15 13:14:33 +01:00
65b5a3c049 added ggridges_lineage_country.R for dist by country 2020-09-15 12:50:25 +01:00
4f943bb1a3 updated gitignore to include TO_DO/ 2020-09-14 17:26:28 +01:00
f72e81664d added mutate.py and run_mutate.sh to create MSA alignments for mutant sequences required to generate logoplot from sequence in R 2020-09-14 15:17:49 +01:00
da9075458f saving logoplot attempts 2020-09-14 15:13:52 +01:00
ff1e1cdaf1 added corr_plots_foldx.R 2020-09-11 20:28:18 +01:00
767418eb18 updated figure for multi mut plot 2020-09-11 19:30:20 +01:00
e2a8171113 added logo_multiple_muts.R 2020-09-11 18:12:06 +01:00
4e2bf1496a added check for active site mut count 2020-09-11 17:41:40 +01:00
0665bc80a9 saving extreme muts analysis 2020-09-11 16:43:27 +01:00
a14fc4dc33 added extreme_muts.R 2020-09-11 16:07:23 +01:00
8648320de7 added delta symbol to plotting_data.R and pretty labels for dr_other_muts figure 2020-09-11 14:40:37 +01:00
f6a440dc55 added plotting/other_plots_data.R 2020-09-11 12:52:17 +01:00
b0c4791206 results for electrostatic changes 2020-09-11 10:27:56 +01:00
b12d20e30f write merged_df3 files from combining_dfs_plotting 2020-09-11 09:51:53 +01:00
de3cfae795 add scripts/mut_electrostatic_changes.py 2020-09-10 20:18:35 +01:00
09a64932ce updated notes with supp table colnames 2020-09-10 20:15:00 +01:00
69e0c5d05f updated logo plot data to source from combining_df_plotting.R 2020-09-10 19:58:33 +01:00
3ac5ff7078 added logo plot 2020-09-10 19:56:33 +01:00
61b91bccb4 updated Header file with Logolas and ggseqlogo 2020-09-10 19:55:21 +01:00
96c95fb06c added merged_df3_short.csv for supp tables and struct figures 2020-09-10 19:17:05 +01:00
0699ebfc3a saving work 2020-09-10 19:16:24 +01:00
14182df12f saving other_plots.R 2020-09-10 17:53:49 +01:00
c9040cad21 Merge branch 'master' of github.com:tgttunstall/LSHTM_analysis 2020-09-10 16:14:46 +01:00
2540424308 changes 2020-09-10 16:06:14 +01:00
8874f9911f saving work yet again to be extra sure 2020-09-10 16:03:04 +01:00
01273a8184 saving recovered combining_dfs_plotting.R after editing 2020-09-10 15:52:22 +01:00
e023472091 move combining_dfs_plotting.R 2020-09-10 15:36:17 +01:00
9da808680d re-adding deleted combining_dfs_plotting.R 2020-09-10 15:28:10 +01:00
315b350466 updated gitignore and saving work 2020-09-10 14:45:10 +01:00
4cd6c2acd7 added boxplots and stats for other numerical params 2020-09-10 14:09:40 +01:00
a04c3d6b0d saving work after correlation plots 2020-09-09 20:56:07 +01:00
9476fac83b added correlation plots 2020-09-09 20:48:21 +01:00
687d5613aa renamed file 2020-09-09 19:11:06 +01:00
464186f5bd regenerated combined_or figure with correct muts 2020-09-09 19:03:52 +01:00
41ff051dc9 script to generate combined ps plot with af and or 2020-09-09 18:57:28 +01:00
926fe37ac3 saving work 2020-09-09 18:56:59 +01:00
ae760d193c renamed lineage_dist 2020-09-09 17:34:32 +01:00
6d50b961a5 corrected subcols_axis name in sucols_all_PS 2020-09-09 13:36:37 +01:00
f7b55d035a lineage dist plots combined generated 2020-09-09 13:18:57 +01:00
98cd8a74e8 generated lineage dist plots combined. needs tweaking 2020-09-09 12:53:53 +01:00
af7c55b713 plotting script with resolved gene metadata 2020-09-09 12:00:42 +01:00
3ebb0d8f06 updated dir.R 2020-09-09 11:45:09 +01:00
8507d16b8b add dirs and resolving_ambiguous_muts 2020-09-09 11:36:40 +01:00
8dbe532937 resolved ambiguous muts and generated clean output. Also separated dir.R 2020-09-09 11:26:13 +01:00
f10f8f6d2a changing category of ambiguous muts 2020-09-08 18:51:03 +01:00
e980085294 outputting revised all params file 2020-09-08 17:52:45 +01:00
f7ab799f74 hopefully finally sorted data merges! 2020-09-08 17:46:52 +01:00
e4608342a4 various changes 2020-09-08 17:13:02 +01:00
c72269dcd1 trying other num param plots 2020-09-07 17:17:56 +01:00
4bab45c634 ks test script added 2020-09-07 15:27:53 +01:00
739e9eadf8 Combining dfs for PS and lig in one 2020-09-07 14:05:46 +01:00
93e19e3186 lineage barplot script 2020-09-07 11:29:28 +01:00
e0a9da9893 updated gitignore 2020-09-04 22:46:07 +01:00
95574d469b updated combining_two_df.R for plots 2020-09-04 22:43:30 +01:00
ef3a97d664 script to plot lineage dist plots 2020-09-04 22:40:49 +01:00
5d0e2d94ce adding missing mutation col in combining_dfs 2020-09-04 21:04:18 +01:00
c48c5177ca resolving missing mutation info in combining script 2020-09-04 20:56:16 +01:00
d6552628e4 added running scripts doc 2020-08-26 17:20:01 +01:00
de14752a0c all barplots generated for ps and lig 2020-08-26 17:18:45 +01:00
8ee7d4234e reflected change in running_scripts doc 2020-08-26 16:41:10 +01:00
25ff220b1d renamed file to reflect that subcols_axis is a common script sourced by ps and lig plots 2020-08-26 16:40:36 +01:00
e0f14ed266 sorted subcols_axis script to generate correct axis cols for both PS and lig plots 2020-08-26 16:39:10 +01:00
2e53c8007a generated subcolour bps for PS 2020-08-26 12:45:09 +01:00
7e0bddd7d2 sourcing plotting_data for subcols_axis_PS 2020-08-26 12:07:04 +01:00
b5ad53f7d1 added ligand df in plotting 2020-08-26 10:02:44 +01:00
1a00ab614f added instructions on running plot scripts 2020-08-24 14:38:45 +01:00
e696064d42 generated replaced Bfactor pdbs 2020-08-24 14:37:28 +01:00
35a89f7761 rectified mcsm_mean_stability to average on raw values and then scale 2020-08-24 13:04:25 +01:00
739a648154 saving work to check merge conflicts resolved 2020-08-24 11:20:58 +01:00
9a97b2d2b4 sourced plotting script in mean_stability calcs 2020-08-21 17:33:09 +01:00
89358cf843 added plotting scripts from old run 2020-08-21 13:25:01 +01:00
40909d5951 script to format snp_info.txt 2020-08-21 13:23:29 +01:00
59cb57a795 updated script to combine dfs 2020-08-21 13:22:28 +01:00
208e0b6f62 sorted df by position for output in data_extraction 2020-08-14 17:57:12 +01:00
87f5a9ca05 tidy script for linking or_kinship with missense variant info 2020-08-14 16:41:11 +01:00
805868ce7e removed if clause for filenames 2020-08-13 18:39:16 +01:00
833e599550 added output file for checking 2020-08-11 18:34:02 +01:00
dbf8865203 saving work, ready for more remote working 2020-08-07 13:35:02 +01:00
779000ad4f added data checking script 2020-08-07 13:34:24 +01:00
0e2e24134c saving work 2020-08-07 13:33:44 +01:00
e46e5484e8 separating data processing from plotting, started with basic_barplots_PS script 2020-07-16 18:59:17 +01:00
8a8790a7d1 replaced single quotes with double in R scripts 2020-07-16 14:18:18 +01:00
eed3450236 mean stability values calcs and replaceBfactor plots 2020-07-16 14:12:08 +01:00
38759c6b0c calculating mean stability per position 2020-07-16 10:37:40 +01:00
bf4a427239 scripts generating axis coloured subcols bp for PS 2020-07-15 16:31:10 +01:00
636100d383 made tweaks to output plot filenames 2020-07-15 16:29:36 +01:00
b2b95b80fd adding plots as I tidy and generate 2020-07-15 13:50:07 +01:00
acc6a42880 saved work before adding plots 2020-07-15 13:36:20 +01:00
f8fef60475 saving work for today 2020-07-14 16:13:17 +01:00
f27c223bdd resolving merge conflicts due to shoddy data 2020-07-14 14:09:42 +01:00
8dc2fa7326 fixed white space prob with mcsm input with merge 2020-07-14 14:07:23 +01:00
5a2084ba11 remove white space in colnames before mcsm format output 2020-07-14 12:59:40 +01:00
c0827b56cc finding discrepancy in merging or dfs, grrrr 2020-07-13 18:31:29 +01:00
da0c03c2e0 trying to resolve copy warning in code 2020-07-13 12:20:43 +01:00
5655af42c0 added sanity checks for or_kinship calcs 2020-07-13 11:37:43 +01:00
167b051ae7 added sanity checks for or_kin 2020-07-10 15:24:57 +01:00
cb31c5c8f4 refactoring or_kin script minor changes only 2020-07-10 12:38:42 +01:00
6cedc3c14d refactoring or_kin script minor changes only 2020-07-10 12:37:41 +01:00
4fd3462fc8 added cleaned up af_or_calcs.R 2020-07-09 15:55:16 +01:00
7b50dc3a3a added consistent style scripts to format kd & rd values 2020-07-09 14:08:27 +01:00
08379c0def minor tidy up in foldx, mcsm and dssp scripts 2020-07-09 14:04:16 +01:00
44597ec563 renamed mcsm_wrapper to run_mcsm 2020-07-09 13:33:56 +01:00
c0fa9e3904 added dssp.py with refactored argparse 2020-07-09 12:58:55 +01:00
6725f08829 adding default dirs and filenames to argparse in foldx and mcsm 2020-07-09 12:57:08 +01:00
6961a9cdb3 minor edits to format mcsm data like sorting df 2020-07-09 11:15:56 +01:00
8931441fa5 ran foldx and mcsm (get) for 33k dataset 2020-07-08 20:30:32 +01:00
172fa18420 modified extraction to be explicit for extracting nsSNP for specified gene 2020-07-08 18:47:22 +01:00
436125745d minor changes in data extraction 2020-07-08 16:01:54 +01:00
65e6b28d9e data extraction tidy up 2020-07-08 13:26:33 +01:00
e4328df255 saving work for the day 2020-07-07 18:31:14 +01:00
8f460347b4 adding clean files for rerun of 35k dataset 2020-07-07 18:28:55 +01:00
0973717287 added script to combine all files in one 2020-07-07 16:06:11 +01:00
01ef04613a renamed files that combine dfs 2020-07-07 15:46:13 +01:00
56d1617561 testing combining df script 2020-07-03 19:23:23 +01:00
b09c15004b still fiddling with combining dfs 2020-07-03 19:22:46 +01:00
e0ba3108f6 added fixme: for some required changes 2020-07-02 14:16:40 +01:00
fb277a1484 added combining funct & combining_mcsm_foldx script 2020-07-01 16:41:58 +01:00
973a1a33da refactor foldx pipeline to include:
* command-line args
* creating necessary dirs automagically
* code cleanup, syntax errors, etc etc
2020-06-30 17:14:30 +01:00
e8a66a7a94 updated code and made it tidy 2020-06-25 14:40:44 +01:00
7032baa08d tidying script 2020-06-25 13:12:09 +01:00
cdb1ea1476 updated ref dict to create separate dicts 2020-06-24 14:10:39 +01:00
27a656dba1 added commonly used mutation format for missense muts in the gene_specific nssnp_info file 2020-06-24 13:34:35 +01:00
a9498f8e08 combined and output all ors 2020-06-23 17:34:54 +01:00
d8b272b0ae script for calculating various OR & output csv 2020-06-23 13:07:29 +01:00
c80c87235e further tidy for OR calcs 2020-06-23 12:19:26 +01:00
e3ae1c3a95 tidy scratch script for various OR calcs 2020-06-23 11:57:51 +01:00
263527f576 all OR calcs using sapply and output as df 2020-06-22 18:17:06 +01:00
6b5ced65e5 extracting other params from logistic 2020-06-22 14:11:16 +01:00
28e52d4194 script to combine ors and afs 2020-06-22 13:07:26 +01:00
c98ca7c8ae script to combine all ors 2020-06-19 14:43:23 +01:00
07258120de renamed files & added or kinship link file 2020-06-19 10:33:26 +01:00
c36197d75e updated AF and OR calcs script with argparse and minor tidyup 2020-06-18 18:37:55 +01:00
f9a8ed3dc7 getopt and commandArgs examples, and AF/OR update to use getopt() 2020-06-18 17:59:28 +01:00
864f814e1b removed merging df for AF_OR 2020-06-18 16:10:02 +01:00
3d1536f2b6 af and or calcs, not merging 2020-06-18 15:57:25 +01:00
c4f9e24007 formatting and adding or 2020-06-18 13:55:45 +01:00
b73c506587 added AF and OR calcs script and making it generic 2020-06-17 19:36:34 +01:00
2cebd338ba ran struc param analysis 2020-06-17 19:36:02 +01:00
96da4d8ed5 included the revised master file for 35k isolates 2020-06-16 11:39:11 +01:00
b28d0afded various debug, doc, and args 2020-05-25 14:27:25 +01:00
3a0ff9b35e added scratch/ 2020-05-22 12:03:11 +01:00
73762568e8 building script for inspecting pdb 2020-05-22 11:57:59 +01:00
bdad2dcfda fixing hetatm script 2020-05-21 12:54:10 +01:00
bc368be4b7 added script for pairwise alignment 2020-05-15 17:58:14 +01:00
cf3e507475 tidy up code 2020-05-15 13:48:50 +01:00
f28b287c86 script for saving pdb chains in single file 2020-05-15 13:44:57 +01:00
bc2844dffb renamed extract chain file 2020-05-15 10:59:19 +01:00
ea213d09aa added pdb_chain splitter code and wrapper 2020-05-13 16:54:20 +01:00
6b527baaff added pdbtools from github source and modified seq.py to exclude hetatm seq extraction 2020-05-12 14:08:08 +01:00
5ac87e76c7 adding commands for use of pdbtools 2020-05-12 12:50:49 +01:00
1d84846789 handle not ready (refresh) url 2020-04-21 17:12:18 +01:00
8b1a7fc71c moved scripts to /ind_scripts & added add col to formatting script 2020-04-20 12:52:10 +01:00
368496733a fixed indentation error and ran mcsm_wrapper dcs 2020-04-17 12:19:08 +01:00
bc03aab82d add wrapper and mcsm library 2020-04-16 17:45:24 +01:00
23c2ddf45f defined method for formatting mcsm_results 2020-04-14 11:30:36 +01:00
8b7ccccc49 saving work for the day 2020-04-11 19:00:39 +01:00
fb7588cedf added lambda func to normalise duet and aff values 2020-04-11 18:52:57 +01:00
9eb7747065 added script to format results 2020-04-10 19:32:47 +01:00
95147b577c saving work for today 2020-04-09 16:40:45 +01:00
e4df1c5095 adding separate script for getting results for mcsm 2020-04-09 15:42:56 +01:00
41f118223c refactoring bash into python to run mcsm 2020-04-08 18:27:51 +01:00
f42b6f725f minor tweaks 2020-04-08 18:27:09 +01:00
e1e0313fe8 combine df script with command line args and added method 2020-04-08 12:44:17 +01:00
6f545413bc correcting indentation 2020-04-08 12:43:37 +01:00
2ca5aea897 refactoring: added command line args to combine_dfs 2020-04-08 11:44:53 +01:00
60ac125d2b saving work for today 2020-04-07 17:57:34 +01:00
bead9a48bd adapted rd_df script to make it take command line args and define function 2020-04-07 17:42:59 +01:00
73ae22e8a2 tidy kd_df script 2020-04-07 17:42:06 +01:00
ae541ca16a adapted kd calc script with command line args and made it into a function 2020-04-07 16:45:59 +01:00
ded7307c22 kd script with command line args and as function 2020-04-07 16:39:50 +01:00
129808d4a5 updating kd script to take command line args 2020-04-07 16:13:54 +01:00
bf0568345e renamed file for consistency 2020-04-07 16:04:01 +01:00
91068f5bd1 modified dssp_df to handle multiple chains 2020-04-07 16:02:19 +01:00
3a1431d8ed added dssp.py that runs, processes and outputs csv 2020-04-07 15:08:18 +01:00
08da425e7e adding settings params 2020-04-06 19:04:35 +01:00
3905a81c38 refactoring code to make it take command line args 2020-04-06 19:03:41 +01:00
0191ee3493 logoplot from df and seqs with custom height 2020-03-29 17:11:17 +01:00
4f50786c6f added R header file to base dir to allow general access by R scripts 2020-03-28 17:56:39 +00:00
745dd343fd tidied combining plot scripts 2020-03-28 17:54:45 +00:00
4569e704c1 added mutate.py script for msa generation 2020-03-27 17:11:16 +00:00
c2a4e1b0ec saving work for the day 2020-03-27 17:08:33 +00:00
aad225b3d4 changed filename to the new combined output (mcsm+struct params) 2020-03-27 12:43:48 +00:00
e598f7d5dd combining mcsm and struct params 2020-03-27 12:39:02 +00:00
0eaff73114 tidy code and saving work for the day 2020-03-26 17:58:39 +00:00
f0becbe386 added script to combined dfs of structural params like kd, dssp & rd 2020-03-26 17:14:20 +00:00
51a9d814c2 changed outcols in dssp and kd outputs 2020-03-26 17:12:59 +00:00
dd2e2d03eb added residue depth processing to generate df 2020-03-26 15:44:20 +00:00
a074d29f6e tidy code and renamed kd.py to kd_df.py 2020-03-26 15:43:13 +00:00
73e0029b65 tidied and updated kd and dssp scripts & generated their respective outputs 2020-03-25 18:19:23 +00:00
37e1d43b76 updated kd.py to reflect a merging col for combining num params later 2020-03-25 15:20:54 +00:00
d44ab57f5a output from comb script & electrostatic mut changes calculated 2020-03-25 13:42:18 +00:00
954eb88c45 updated combining df scripts for duet & lig 2020-03-24 18:28:52 +00:00
d81be80305 minor changes to variable names in .R & .py 2020-03-24 10:36:51 +00:00
0001c727e0 renamed files to make more generic 2020-03-23 18:13:02 +00:00
d29b81a686 renamed files to make more generic 2020-03-23 17:48:39 +00:00
8c7efcb276 fixed bugs and tidy code 2020-03-23 17:43:06 +00:00
5adef195e0 delete old file 2020-03-23 17:40:19 +00:00
f686563c98 updated pnca_extraction and AF_OR calcs 2020-03-23 17:36:42 +00:00
53d19d5dd8 bug fixes and massive clean up of data extraction script 2020-03-23 13:33:25 +00:00
a5356cf88b saving from work 2020-02-27 15:16:20 +00:00
61f8dc57c9 renamed file and updated logo plot code 2020-02-26 12:00:32 +00:00
95f0e28fb2 added 2 logo plot scripts 2020-02-25 19:09:43 +00:00
7b393a2b13 updating mut_seq script 2020-02-25 18:13:18 +00:00
2805fdda40 hydrophobicity script 2020-02-25 10:42:58 +00:00
2ed0df41e2 remove old surface_res3.py 2020-02-20 12:23:56 +00:00
26e4652d63 fixup 2020-02-20 10:41:49 +00:00
ec25e9fd2d adding scripts for struct params 2020-02-16 15:14:36 +00:00
31dd74d5ac remove __pycache__, update .gitignore 2020-02-16 15:08:45 +00:00
2e8053dc72 test commit 2020-02-16 15:00:49 +00:00
f22f674097 added script to calculate electrostatic changes of mutations 2020-02-11 15:03:21 +00:00
56e7a96b00 updated ref dict to inc aa_calcprop 2020-02-11 15:02:32 +00:00
84389631d5 saving a and b labels in bubble plot with brackets 2020-02-02 11:39:35 +00:00
ec6f607655 added script for KS_test for DUET 2020-02-02 11:36:17 +00:00
7842d87a0d tidy code for lineage_dist_PS 2020-02-02 11:14:25 +00:00
31383e945b tidying script for lineage dist PS and separating KS test results 2020-02-02 11:11:49 +00:00
8df0491721 added bubble plot 2020-02-02 09:17:11 +00:00
12c24d974e added script for coloured axis for ligand affinity 2020-01-31 16:39:22 +00:00
c5d47c1a18 remove .Rhistory 2020-01-31 15:35:25 +00:00
340f490d70 Merge branch 'master' of https://git.tunstall.in/tanu/LSHTM_analysis 2020-01-31 15:34:58 +00:00
077231b240 remove .Rhistory 2020-01-31 15:32:32 +00:00
c73c5571a7 added subaxis plots for PS and lig separately 2020-01-31 15:30:08 +00:00
29022c5462 saving previous stuff from work 2020-01-30 08:26:21 +00:00
c10d54f104 tidy script for data extraction 2020-01-28 11:53:10 +00:00
366bb3960d Merge branch 'master' of github.com:tgttunstall/LSHTM_analysis 2020-01-28 10:17:24 +00:00
d787f6fd45 Update README.md 2020-01-28 10:14:08 +00:00
1e304e4f9d saving data_extraction from home 2020-01-28 10:13:01 +00:00
772fd63d9f saving previous work from home pc 2020-01-28 10:13:01 +00:00
87060c036f added coloured axis barplots 2020-01-28 10:13:01 +00:00
dd7e48d7e2 updated lineage dist for LIG for consistency 2020-01-28 10:13:01 +00:00
f43878def2 graphs for PS lineage dist for all and dr muts 2020-01-28 10:13:01 +00:00
a3c564790a saving data_extraction from home 2020-01-28 10:10:16 +00:00
0ebd8a6d4b saving previous work from home pc 2020-01-23 09:31:35 +00:00
c56a5e4497 added coloured axis barplots 2020-01-22 15:09:21 +00:00
5ebb4a2d25 updated lineage dist for LIG for consistency 2020-01-22 11:34:59 +00:00
4de4549430 graphs for PS lineage dist for all and dr muts 2020-01-22 10:12:09 +00:00
78c2a64cc9 Update README.md
Updated README.md
2020-01-14 11:29:13 +00:00
fee3c2c13c Update README.md 2020-01-14 11:22:41 +00:00
448 changed files with 2265045 additions and 7156 deletions

.gitignore (17 lines changed)
@@ -1,6 +1,23 @@
*.xls
*.xlsx
*.ods
*.tar.gz
.Rhistory
*.pyc
__pycache__
*/__pycache__
manual_*
*temp*
mcsm_analysis_fixme
meta_data_analysis
del
example*
scratch
historic
test
plotting_test
*old*
foldx/test/
TO_DO
.RData
scratch_plots

README.md
@@ -1,35 +1,45 @@
mCSM
=============
This contains scripts that do the following:
1. mcsm.py: function for submitting mcsm job and extracting results
2. run_mcsm.py: wrapper to call mcsm.py
Requires an additional 'Data' directory. Batteries not included.
foldx
=============
This contains scripts that do the following:
1. runFoldx.py: submitting foldx requests and extracting results
2. runfoldx.sh: is wrapped by runFoldx.py
Requires an additional 'Data' directory. Batteries not included:-)
## Assumptions
1. git repos are cloned to `~/git`
2. Requires a data directory with `input` and `output` subdirs. Can be specified on the CLI with `--datadir`, and optionally can be created with `mk_drug_dirs.sh <DRUG_NAME>`
## LSHTM\_analysis:
subdirs within this repo
```
meta_data_analysis/
    scripts/
        *.R
        *.py
    plotting/
        *.R
    mcsm/
        *.py
    foldx/
        *.py
        *.sh
mcsm_analysis/
    <drug>/
        scripts/
            *.R
            *.py
        mcsm/
            *.sh
            *.py
            *.R
        plotting/
            *.R
```
## ML\_analysis:
located in:
```
scripts/ml
```
More docs here as I write them.
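As a concrete sketch of the assumed layout (the base path and drug name below are placeholder examples; `mk_drug_dirs.sh` does the equivalent from the shell):

```
# Python sketch: create <datadir>/<drug>/input and <datadir>/<drug>/output.
# "~/git/Data" and "pyrazinamide" are illustrative placeholders only.
from pathlib import Path

def mk_drug_dirs(datadir, drug):
    base = Path(datadir).expanduser() / drug
    for sub in ("input", "output"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base

mk_drug_dirs("~/git/Data", "pyrazinamide")
```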

config/alr.R (new file)
@@ -0,0 +1,176 @@
gene = "alr"
drug = "cycloserine"
#==========
# LIGPLUS
#===========
aa_ligplus_dcs = c(66, 64, 70, 112, 196
, 236, 237, 252, 253
, 254, 255, 388)
aa_ligplus_dcs_hbond = c(255, 254, 237, 66, 196)
aa_ligplus_dcs_other = aa_ligplus_dcs[!aa_ligplus_dcs%in%aa_ligplus_dcs_hbond]
c1 = length(aa_ligplus_dcs_other) == length(aa_ligplus_dcs) - length(aa_ligplus_dcs_hbond)
#==========
# PLIP
#===========
aa_plip_dcs = c(66, 70, 112, 196, 237
, 252, 254, 255, 295
, 314, 343)
aa_plip_dcs_hbond = c(66, 70, 196, 237
, 252, 254, 255, 295
, 314, 343)
aa_plip_dcs_other = aa_plip_dcs[!aa_plip_dcs%in%aa_plip_dcs_hbond]
c2 = length(aa_plip_dcs_other) == length(aa_plip_dcs) - length(aa_plip_dcs_hbond)
#==========
# Arpeggio
#===========
aa_arpeg_dcs = c(64, 66, 70, 112, 157, 164
, 194, 196, 200, 236, 237, 252, 253
, 254, 255, 256, 295, 314, 342, 343
, 344, 386, 388)
aa_arpeg_dcs_other = aa_arpeg_dcs[!aa_arpeg_dcs%in%c(aa_ligplus_dcs_other
, aa_plip_dcs_other)]
c3 = length(aa_arpeg_dcs_other) == length(aa_arpeg_dcs) - ( length(aa_ligplus_dcs_other) + length(aa_plip_dcs_other) )
#######################################################################
#NEW AFTER ADDING PLP to structure! huh
# ADDED: 18 Aug 2022
# PLIP server for co factor PLP (CONFUSING!)
#and 2019 lit:lys42, M319, and Y364 : OFFSET is 24
#K42: K66, Y271:Y295, M319:M343, W89: W113, W203: W227, H209:H233, Q321:Q345
aa_pos_paper = sort(unique(c(66,70,112,113,164,196,227,233,237,252,254,255,295,342,343,344,345,388)))
plp_pos_paper = sort(unique(c(66, 70, 112, 196, 227, 237, 252, 254, 255, 388)))
#active_aa_pos = sort(unique(c(aa_pos_paper, active_aa_pos)))
aa_pos_plp = sort(unique(c(plp_pos_paper, 66, 70, 112, 237, 252, 254, 255, 196)))
#######################################################################
# this is post inspection on chimera
#remove_pos = c(295, 314, 342, 343, 344)
remove_pos = c(0)
#select :295.A, 314.A, 342.A, 343.A, 344.A
#===============
# Active site aa
#===============
active_aa_pos = sort(unique(c(aa_ligplus_dcs
, aa_plip_dcs
, aa_arpeg_dcs
, aa_pos_plp)))
active_aa_pos = active_aa_pos[!active_aa_pos%in%remove_pos]
#=================
# Drug binding aa
#=================
aa_pos_dcs = sort(unique(c(aa_ligplus_dcs
, aa_plip_dcs
, aa_arpeg_dcs)))
aa_pos_dcs = aa_pos_dcs[!aa_pos_dcs%in%remove_pos]
aa_pos_drug = aa_pos_dcs
#===============
# Co-factor: PLP aa
#===============
aa_pos_plp = aa_pos_plp
#aa_pos_plp = aa_pos_plp[!aa_pos_plp%in%remove_pos]
#===============
# Hbond aa
#===============
aa_pos_dcs_hbond = sort(unique(c(aa_ligplus_dcs_hbond
, aa_plip_dcs_hbond)))
aa_pos_dcs_hbond = aa_pos_dcs_hbond[!aa_pos_dcs_hbond%in%remove_pos]
#=======================
# Other interactions aa
#=======================
aa_pos_dcs_other = active_aa_pos[!active_aa_pos%in%aa_pos_dcs_hbond]
aa_pos_dcs_other = aa_pos_dcs_other[!aa_pos_dcs_other%in%remove_pos]
c3 = length(aa_pos_dcs_other) == length(active_aa_pos) - length(aa_pos_dcs_hbond)
#######################################################################
if ( all(c1, c2, c3) ) {
cat("\nPASS:All active site residues and interctions checked and identified for"
, "\ngene:", gene
, "\ndrug:", drug
, "\n==================================================="
, "\nActive site residues for:", length(active_aa_pos)
, "\n==================================================="
, "\n"
, active_aa_pos
, "\n=================================================="
, "\nDrug binding residues:", length(aa_pos_drug)
, "\n==================================================="
, "\n"
#, aa_pos_dcs
, aa_pos_drug
, "\n==================================================="
, "\nHbond residues:", length(aa_pos_dcs_hbond)
, "\n==================================================="
, "\n"
, aa_pos_dcs_hbond
, "\n=================================================="
, "\nOther interaction residues:", length(aa_pos_dcs_other)
, "\n==================================================="
, "\n"
, aa_pos_dcs_other
, "\n\nNO other co-factors or ligands present\n")
}
######################################################################
#NEW
# PLIP server for co factor PLP (CONFUSING!)
#and 2019 lit:lys42, M319, and Y364 : OFFSET is 24
#K42: K66, Y271:Y295, M319:M343, W89: W113, W203: W227, H209:H233, Q321:Q345
aa_pos_paper = sort(unique(c(66,70,112,113,164,196,227,233,237,252,254,255,295,342,343,344,345,388)))
plp_pos_paper = sort(unique(c(66, 70, 112, 196, 227, 237, 252, 254, 255, 388)))
#add_to_dcs = c(113, 227, 233, 345)
#add_to_plp = c(113, 227, 233, 345) # 227 not in plp and 227, 233 and 345 not with snp
#active_aa_pos = sort(unique(c(aa_pos_paper, active_aa_pos)))
#aa_pos_plp = sort(unique(c(plp_pos_paper, 66, 70, 112, 237, 252, 254, 255, 196, add_to_plp)))
aa_pos_plp = sort(unique(c(plp_pos_paper, 66, 70, 112, 237, 252, 254, 255, 196)))
#aa_pos_dcs = sort(unique(c(aa_pos_dcs, add_to_dcs)))
#aa_pos_drug = aa_pos_dcs
# add two key residues
#aa_pos_drug = sort(unique(c(319, 364, aa_pos_drug)))
#active_aa_pos = sort(unique(c(319, 364, active_aa_pos, aa_pos_plp)))
# FIXME: these should be populated!
aa_pos_lig1 = aa_pos_plp
aa_pos_lig2 = NULL
aa_pos_lig3 = NULL
tile_map=data.frame(tile=c("DCS","PLP"),
tile_colour=c("green","navyblue")) #darkslategrey
######
chain_suffix = ".A"
toString(paste0(aa_pos_drug, chain_suffix))
toString(paste0(aa_pos_plp, chain_suffix))
toString(paste0(active_aa_pos, chain_suffix))
common_pos = aa_pos_drug[aa_pos_drug%in%aa_pos_plp]
cat("\nCommon interacting partners:", length(common_pos))
common_pos
toString(paste0(common_pos, chain_suffix))
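The config above repeatedly merges and dedupes residue sets from LigPlus, PLIP and Arpeggio, and maps the 2019 paper's numbering onto the structure with a fixed +24 offset (K42 to K66, etc.). A minimal Python sketch of that pattern, using a small illustrative subset of residues rather than the full lists:

```
# Sketch of the merge/dedupe/offset pattern used in config/alr.R.
# Residue numbers are an illustrative subset, not the real lists.
PAPER_OFFSET = 24                  # paper K42 corresponds to structure K66

ligplus  = {64, 66, 70}
plip     = {66, 70, 112}
arpeggio = {64, 66, 70, 112, 157}
paper    = {42, 271, 319}          # K42, Y271, M319 in paper numbering

active_aa_pos = sorted(ligplus | plip | arpeggio
                       | {p + PAPER_OFFSET for p in paper})
print(active_aa_pos)               # [64, 66, 70, 112, 157, 295, 343]
```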

config/embb.R (new file)
@@ -0,0 +1,123 @@
gene = "embB"
drug = "ethambutol"
# interacting chain B
#==========
# LIGPLUS
#===========
aa_ligplus_emb = c(299, 302, 303, 306, 334, 594, 988, 1028)
aa_ligplus_emb_hbond = c(299, 594)
aa_ligplus_ca = c(952, 954, 959)
aa_ligplus_ca_hbond = c(952, 954, 959)
aa_ligplus_cdl = c(460, 665, 568, 601, 572, 579, 580, 583)
aa_ligplus_cdl_hbond = c(601, 568, 665)
aa_ligplus_dsl = c(435, 442, 489, 452, 330, 589, 509, 446, 445, 506, 592, 590, 514, 403, 515)
aa_ligplus_dsl_hbond = c(445, 590, 592, 403)
#==========
# PLIP
#===========
aa_plip_emb = c(299, 302, 303, 327, 594, 988, 1028)
aa_plip_emb_hbond = c(299, 327, 594)
aa_plip_ca = c(952, 954, 959)
aa_plip_cdl = c(456, 572, 579, 583, 568)
#aa_plip_cdl_sb = c(537, 568, 601, 665)
aa_plip_dsl = c(330, 435, 446, 452, 489, 506, 589, 590, 445, 403, 595)
aa_plip_dsl_hbond = c(445, 590)
#aa_plip_dsl_sb = c(403, 595)
#==========
# Arpeggio
#===========
# emb:1402, 1403
aa_arpeg_emb = c(298, 299, 302, 303, 306, 318, 327, 334, 403, 445, 592, 594, 988, 1028)
aa_arpeg_ca = c(847, 853, 854, 952, 954, 955, 956, 959, 960)
aa_arpeg_cdl = c(456, 457, 460, 461, 521, 525, 533, 537, 554, 558, 568
, 569, 572, 573, 575, 576, 579, 580, 582, 583, 586, 601, 605, 616, 658
, 661, 662, 665)
aa_arpeg_dsl = c(299, 322, 329, 330, 403, 435, 438, 439, 442, 445, 446
, 449, 452, 455, 486, 489, 490, 493, 506, 509, 510, 513, 514
, 515, 587, 589, 590, 592, 595)
##############################################################
active_aa_pos = sort(unique(c(aa_ligplus_emb
, aa_plip_emb
, aa_arpeg_emb
, aa_ligplus_ca
, aa_plip_ca
, aa_arpeg_ca
, aa_ligplus_cdl
, aa_plip_cdl
, aa_arpeg_cdl
, aa_ligplus_dsl
, aa_plip_dsl
, aa_arpeg_dsl)))
##############################################################
cat("\nNo. of active site residues for gene"
, gene, ":"
, length(active_aa_pos)
, "\nThese are:\n"
, active_aa_pos)
##############################################################
aa_pos_emb = sort(unique(c( aa_ligplus_emb
, aa_plip_emb
, aa_arpeg_emb)))
aa_pos_drug = aa_pos_emb
aa_pos_emb_hbond = sort(unique(c( aa_ligplus_emb_hbond
, aa_plip_emb_hbond)))
aa_pos_ca = sort(unique(c( aa_ligplus_ca
, aa_plip_ca
, aa_arpeg_ca)))
aa_pos_cdl = sort(unique(c( aa_ligplus_cdl
, aa_plip_cdl
, aa_arpeg_cdl )))
aa_pos_cdl_hbond = sort(unique(c( aa_ligplus_cdl_hbond )))
aa_pos_dsl = sort(unique(c( aa_ligplus_dsl
, aa_plip_dsl
, aa_arpeg_dsl)))
aa_pos_dsl_hbond = sort(unique(c( aa_ligplus_dsl_hbond
, aa_plip_dsl_hbond)))
cat("\n==================================================="
, "\nActive site residues for", gene, "comprise of..."
, "\n==================================================="
, "\nNo. of", drug, "binding residues:" , length(aa_pos_emb), "\n"
, aa_pos_emb
, "\nNo. of co-factor 'Ca' binding residues:", length(aa_pos_ca) , "\n"
, aa_pos_ca
, "\nNo. of ligand 'CDL' binding residues:" , length(aa_pos_cdl), "\n"
, aa_pos_cdl
, "\nNo. of ligand 'DPA' binding residues:" , length(aa_pos_dsl), "\n"
, aa_pos_dsl, "\n"
)
##############################################################
# var for position customisation for plots
# aa_pos_lig1 = aa_pos_ca
# aa_pos_lig2 = aa_pos_cdl
# aa_pos_lig3 = aa_pos_dsl
aa_pos_lig1 = aa_pos_dsl #slategray
aa_pos_lig2 = aa_pos_cdl #navy blue
aa_pos_lig3 = aa_pos_ca #purple
tile_map=data.frame(tile=c("EMB","DPA","CDL","Ca"),
tile_colour=c("green","darkslategrey","navyblue","purple"))
drug_main_res = c(299 , 302, 303 , 306 , 327 , 592 , 594, 988, 1028)

config/gid.R (new file)
@@ -0,0 +1,143 @@
gene = "gid"
drug = "streptomycin"
#rna_site = G518
#rna_bind_aa_pos = c(96, 97, 118, 163)
#binding_aa_pos = c(48, 51, 137, 200)
# SAM: 226
# SRY: 1601
#==========
# LIGPLUS
#===========
aa_ligplus_sry = c(118, 220, 223) # 526 (rna) and 7mg527
aa_ligplus_sry_hbond = c(118, 220, 223)
aa_ligplus_sam = c(148, 137, 138, 139
, 93, 69, 119, 120
, 220, 219, 118, 223)
aa_ligplus_sam_hbond = c(220, 223)
aa_ligplus_amp = c(123, 125, 213, 214)
aa_ligplus_amp_hbond = c(125, 123, 213)
aa_ligplus_rna = c(137, 47, 48, 38, 35, 36, 37, 94, 33, 97, 139, 138, 163, 165, 164, 199)
aa_ligplus_rna_hbond = c(33, 97, 37, 47, 137)
#==========
# PLIP
#===========
aa_plip_sry = c(118, 220, 223)
aa_plip_sry_hbond = c(118, 220, 223)
aa_plip_sam = c(92, 118, 119, 120, 139, 220, 223, 148)
aa_plip_sam_hbond = c(92, 118, 119, 120, 139, 220, 223)
aa_plip_amp = c(123, 125, 213)
aa_plip_amp_hbond = c(123, 125, 213)
aa_plip_rna = c(33, 34, 36, 37, 47, 48, 51, 97, 137, 199)
aa_plip_rna_hbond = c(33, 34, 36, 37, 47, 51, 137, 199)
#==========
# Arpeggio
#===========
aa_arpeg_sry = c(118, 148, 220, 223, 224)
aa_arpeg_sam = c(68, 69, 92, 93, 97, 117
, 118, 119, 120, 136, 137
, 138, 139, 140, 148, 218
, 219, 220, 221, 222, 223)
aa_arpeg_amp = c(123, 125, 213)
##############################################################
#=============
# Active site
#=============
active_aa_pos = sort(unique(c(
#rna_bind_aa_pos
#, binding_aa_pos
aa_ligplus_sry
, aa_ligplus_sam
, aa_ligplus_amp
, aa_ligplus_rna
, aa_plip_sry
, aa_plip_sam
, aa_plip_amp
, aa_plip_rna
, aa_arpeg_sry
, aa_arpeg_sam
, aa_arpeg_amp
)))
##############################################################
cat("\nNo. of active site residues for gene"
, gene, ":"
, length(active_aa_pos)
, "\nThese are:\n"
, active_aa_pos)
##############################################################
aa_pos_sry = sort(unique(c(
aa_ligplus_sry
, aa_plip_sry
, aa_arpeg_sry)))
aa_pos_drug = aa_pos_sry
aa_pos_sry_hbond = sort(unique(c(
aa_ligplus_sry_hbond
, aa_plip_sry_hbond)))
aa_pos_rna = sort(unique(c(
aa_ligplus_rna
, aa_plip_rna)))
aa_pos_rna_hbond = sort(unique(c(
aa_ligplus_rna_hbond
, aa_plip_rna_hbond)))
aa_pos_sam = sort(unique(c(
aa_ligplus_sam
, aa_plip_sam
, aa_arpeg_sam)))
aa_pos_sam_hbond = sort(unique(c(
aa_ligplus_sam_hbond
, aa_plip_sam_hbond)))
aa_pos_amp = sort(unique(c(
aa_ligplus_amp
, aa_plip_amp
, aa_arpeg_amp)))
aa_pos_amp_hbond = sort(unique(c(
aa_ligplus_amp_hbond
, aa_plip_amp_hbond)))
cat("\n==================================================="
, "\nActive site residues for", gene, "comprise of..."
, "\n==================================================="
, "\nNo. of", drug, "binding residues:" , length(aa_pos_sry), "\n"
, aa_pos_sry
, "\nNo. of RNA binding residues:" , length(aa_pos_rna), "\n"
, aa_pos_rna
, "\nNo. of ligand 'SAM' binding residues:", length(aa_pos_sam), "\n"
, aa_pos_sam
, "\nNo. of ligand 'AMP' binding residues:", length(aa_pos_amp), "\n"
, aa_pos_amp, "\n")
##############################################################
# var for position customisation for plots
#aa_pos_drug = #00ff00 # green # as STR doesn't bind
aa_pos_lig1 = aa_pos_sam #2f4f4f # darkslategrey
aa_pos_lig2 = aa_pos_rna #ff1493 #deeppink
aa_pos_lig3 = aa_pos_amp #000080 #navyblue
tile_map=data.frame(tile=c("STR","SAM","RNA","AMP"),
tile_colour=c("#00ff00","#2f4f4f","#ff1493","#000080"))
# green: #00ff00
# darkslategrey : #2f4f4f
# deeppink : #ff1493
# navyblue :#000080

config/katg.R (new file)
@@ -0,0 +1,116 @@
gene = "katG"
drug = "isoniazid"
#==========
# LIGPLUS
#===========
# hem (1500)
aa_ligplus_inh = c(107, 108, 137, 229, 230)
#aa_ligplus_inh_hbond # none
aa_ligplus_hem = c(94, 276, 315, 274, 270, 381, 273, 104, 314, 275,
100, 101, 321, 103, 269, 107, 266, 230, 380, 275, 314)
aa_ligplus_hem_hbond = c(94, 276, 315, 274, 270, 381)
aa_ligplus_hem_other = aa_ligplus_hem[!aa_ligplus_hem%in%aa_ligplus_hem_hbond]
c1 = length(aa_ligplus_hem_other) == length(aa_ligplus_hem) - length(aa_ligplus_hem_hbond)
#==========
# PLIP
#===========
aa_plip_inh = c(104, 229, 230)
aa_plip_inh_hbond = c(104, 229, 230)
aa_plip_hem = c(104, 107, 248, 252, 265, 275, 321, 412, 274, 276, 315)
aa_plip_hem_hbond = c(274, 276, 315)
#aa_plip_hem_sb = c(104, 276)
#aa_plip_hem_pi = c(107)
aa_plip_hem_other = aa_plip_hem[!aa_plip_hem%in%aa_plip_hem_hbond]
c2 = length(aa_plip_hem_other) == length(aa_plip_hem) - length(aa_plip_hem_hbond)
#==========
# Arpeggio
#===========
aa_arpeg_inh = c(104, 107, 108, 136, 137, 228, 229, 230, 232, 315)
aa_arpeg_inh_hbond = c(104, 137)
aa_arpeg_hem = c(94, 100, 101, 103, 104, 107, 230, 231, 232, 248
, 252, 265, 266, 269, 270, 272, 273, 274, 275, 276, 314, 315
, 317, 321, 378, 380, 408, 412)
#from here
##############################################################
#===============
# Active site aa
#===============
active_aa_pos = sort(unique(c(aa_ligplus_inh
, aa_plip_inh
, aa_arpeg_inh
, aa_ligplus_hem
, aa_plip_hem
, aa_arpeg_hem
)))
cat("\nNo. of active site residues for gene"
, gene, ":"
, length(active_aa_pos)
, "\nThese are:\n"
, active_aa_pos)
#=================
# Drug binding aa
#=================
aa_pos_inh = sort(unique(c( aa_ligplus_inh
, aa_plip_inh
, aa_arpeg_inh)))
aa_pos_drug = aa_pos_inh
#===============
# Hbond aa
#===============
aa_pos_inh_hbond = sort(unique(c( aa_plip_inh_hbond
, aa_arpeg_inh_hbond)))
#=======================
# Other interactions aa
#=======================
#---------------------------------------------
aa_pos_hem = sort(unique(c( aa_ligplus_hem
, aa_plip_hem
, aa_arpeg_hem)))
aa_pos_hem_hbond = sort(unique(c( aa_ligplus_hem_hbond
, aa_plip_hem_hbond
#, aa_arpeg_hem_hbond
)))
cat("\n==================================================="
, "\nActive site residues for", gene, "comprise of..."
, "\n==================================================="
, "\nNo. of", drug, "binding residues:" , length(aa_pos_inh) , "\n"
, aa_pos_inh
, "\nNo. of 'HEM' binding residues:" , length(aa_pos_hem) , "\n"
, aa_pos_hem, "\n")
##############################################################
# var for position customisation for plots
aa_pos_lig1 = aa_pos_hem
aa_pos_lig2 = NULL
aa_pos_lig3 = NULL
tile_map=data.frame(tile=c("INH","HEME"),
tile_colour=c("green","darkslategrey"))
#toString(aa_pos_hem)
#toString(aa_pos_drug)
#toString(active_aa_pos)

config/pnca.R (new file)
@@ -0,0 +1,61 @@
gene = "pncA"
drug = "pyrazinamide"
#===================================
#Iron centre --> purple
#Catalytic triad --> yellow
#Substrate binding --> teal and blue
#H-bond --> green
#====================================
#aa_plip = c(49, 51, 57, 71, 96 , 133, 134, 138)
#aa_ligplus = c(8, 13 , 49 , 133, 134 , 138, 137)
#active_aa_pos = sort(unique(c(aa_plip, aa_ligplus)))
#aa_pos_substrate = c(13, 68, 103, 137)
aa_pos_pza = c(13, 68, 103, 137)
aa_pos_fe = c(49, 51, 57, 71)
aa_pos_catalytic = c(8, 96, 138)
aa_pos_hbond = c(133, 134, 8, 138)
aa_pos_drug = aa_pos_pza
#==========
# Arpeggio
#===========
# all same except one extra
aa_arpeg = c(102)
##############################################################
active_aa_pos = sort(unique(c(aa_pos_pza
, aa_pos_fe
, aa_pos_catalytic
, aa_pos_hbond
, aa_arpeg)))
##############################################################
cat("\nNo. of active site residues for gene"
, gene, ":"
, length(active_aa_pos)
, "\nThese are:\n"
, active_aa_pos)
cat("\n==================================================="
, "\nActive site residues for", gene, "comprise of..."
, "\n==================================================="
, "\nNo. of", drug, "binding residues:" , length(aa_pos_pza) , "\n"
, aa_pos_pza
, "\nMetal coordination centre residues:" , length(aa_pos_fe) , "\n"
, aa_pos_fe
, "\nCatalytic triad residues:" , length(aa_pos_catalytic) , "\n"
, aa_pos_catalytic
, "\nH-bonding residues:" , length(aa_pos_hbond) , "\n"
, aa_pos_hbond , "\n")
##############################################################
# var for position customisation for plots
aa_pos_lig1 = aa_pos_fe
aa_pos_lig2 = NULL
aa_pos_lig3 = NULL
#aa_pos_lig2 = aa_pos_catalytic
#aa_pos_lig3 = aa_pos_hbond
tile_map=data.frame(tile=c("PZA","DPA","CDL","Ca"),
tile_colour=c("green","darkslategrey","navyblue","purple"))

config/rpob.R (new file)
@@ -0,0 +1,80 @@
gene = "rpoB"
drug = "rifampicin"
#==========
# LIGPLUS
#===========
# Error! No atom records found!
#==========
# PLIP
#===========
aa_plip_rfp = c(429, 432, 491, 487)
aa_plip_rfp_hbond = c(429, 432, 487)
# chainC: equivalent with offset (-6 from 5uhc) accounted
aa_plip_5uhc_rfp = c(430, 452, 483
, 491, 432, 433
, 448, 450, 459, 487)
aa_plip_5uhc_rfp_hbond = c(432, 433, 448, 450, 459, 487)
#==========
# Arpeggio
#===========
# rfp: 1894
aa_arpeg_rfp = c(170, 428, 429, 430, 431, 432
, 433, 435, 445, 448, 450, 452
, 453, 458, 483, 487, 491, 604
, 607, 674)
##############################################################
remove_pos = c(170, 674, 604)
active_aa_pos = sort(unique(c(aa_plip_rfp
, aa_plip_5uhc_rfp
, aa_arpeg_rfp)))
active_aa_pos = active_aa_pos[!active_aa_pos%in%remove_pos]
##############################################################
cat("\nNo. of active site residues for gene"
, gene, ":"
, length(active_aa_pos)
, "\nThese are:\n"
, active_aa_pos)
##############################################################
aa_pos_rfp = sort(unique(c(aa_plip_rfp
, aa_plip_5uhc_rfp
, aa_arpeg_rfp)))
aa_pos_rfp = aa_pos_rfp[!aa_pos_rfp%in%remove_pos]
aa_pos_drug = aa_pos_rfp
aa_pos_rfp_hbond = sort(unique(c(aa_plip_rfp_hbond
, aa_plip_5uhc_rfp_hbond)))
aa_pos_rfp_hbond = aa_pos_rfp_hbond[!aa_pos_rfp_hbond%in%remove_pos]
cat("\n==================================================="
, "\nActive site residues for", gene, "comprise of..."
, "\n==================================================="
, "\nNo. of", drug, "binding residues:" , length(aa_pos_rfp), "\n"
, aa_pos_rfp
, "\n\nNO other co-factors or ligands present\n")
##############################################################
# FIXME: these should be populated!
aa_pos_lig1 = NULL
aa_pos_lig2 = NULL
aa_pos_lig3 = NULL
tile_map=data.frame(tile=c("RFP"),
tile_colour=c("green"))
####
chain_suffix = ".C"
print(toString(paste0(aa_pos_drug, chain_suffix)))
# # equivalent residues on 5uhc:
# active_aa_pos_5uhc = active_aa_pos+6
# active_aa_pos_5uhc
# print(toString(paste0(active_aa_pos_5uhc, chain_suffix)))

format script for dynamut output (new file; filename not shown)
@@ -0,0 +1,162 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
#%%#####################################################################
def format_dynamut_output(dynamut_output_csv):
"""
@param dynamut_output_csv: file containing dynamut results for all muts,
produced by combining all dynamut_output batch results into one file
with bash scripts. This is run after run_get_results_dynamut.py.
Reads the csv into a pandas df and formats it.
@type string
@return formatted dynamut output
@type pandas df
"""
#############
# Read file
#############
dynamut_data_raw = pd.read_csv(dynamut_output_csv, sep = ',')
# strip white space from both ends in all columns
dynamut_data = dynamut_data_raw.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
dforig_shape = dynamut_data.shape
print('dimensions of input file:', dforig_shape)
#%%============================================================================
#####################################
# create binary cols for each param
# >=0: Stabilising
######################################
outcome_cols = ['ddg_dynamut', 'ddg_encom', 'ddg_mcsm','ddg_sdm', 'ddg_duet']
# col test: ddg_dynamut
#len(dynamut_data[dynamut_data['ddg_dynamut'] >= 0])
#dynamut_data['ddg_dynamut_outcome'] = dynamut_data['ddg_dynamut'].apply(lambda x: 'Stabilising' if x >= 0 else 'Destabilising')
#len(dynamut_data[dynamut_data['ddg_dynamut_outcome'] == 'Stabilising'])
print('\nCreating classification cols for', len(outcome_cols), 'columns'
, '\nThese are:')
for cols in outcome_cols:
print(cols)
tot_muts = dynamut_data[cols].count()
print('\nTotal entries:', tot_muts)
outcome_colname = cols + '_outcome'
print(cols, ':', outcome_colname)
c1 = len(dynamut_data[dynamut_data[cols] >= 0])
dynamut_data[outcome_colname] = dynamut_data[cols].apply(lambda x: 'Stabilising' if x >= 0 else 'Destabilising')
c2 = len(dynamut_data[dynamut_data[outcome_colname] == 'Stabilising'])
if c1 == c2:
print('\nPASS: outcome classification column created successfully'
, '\nColumn created:', outcome_colname
#, '\nNo. of stabilising muts: ', c1
#, '\nNo. of DEstabilising muts: ', tot_muts-c1
, '\n\nCateg counts:\n', dynamut_data[outcome_colname].value_counts() )
else:
print('\nFAIL: outcome classification numbers MISmatch'
, '\nexpected length:', c1
, '\nGot:', c2)
# Rename categ for: dds_encom
len(dynamut_data[dynamut_data['dds_encom'] >= 0])
dynamut_data['dds_encom_outcome'] = dynamut_data['dds_encom'].apply(lambda x: 'Increased_flexibility' if x >= 0 else 'Decreased_flexibility')
dynamut_data['dds_encom_outcome'].value_counts()
#%%=====================================================================
################################
# scale all ddg param values
#################################
# Rescale values in all ddg cols to between -1 and 1 so negative numbers
# stay negative and positive numbers stay positive
outcome_cols = ['ddg_dynamut', 'ddg_encom', 'ddg_mcsm','ddg_sdm', 'ddg_duet', 'dds_encom']
for cols in outcome_cols:
#print(cols)
col_max = dynamut_data[cols].max()
col_min = dynamut_data[cols].min()
print( '\n===================='
, '\nColname:', cols
, '\n===================='
, '\nMax: ', col_max
, '\nMin: ', col_min)
scaled_colname = cols + '_scaled'
print('\nCreated scaled colname for', cols, ':', scaled_colname)
col_scale = lambda x : x/abs(col_min) if x < 0 else (x/col_max if x >= 0 else 'failed')
dynamut_data[scaled_colname] = dynamut_data[cols].apply(col_scale)
col_scaled_max = dynamut_data[scaled_colname].max()
col_scaled_min = dynamut_data[scaled_colname].min()
print( '\n===================='
, '\nColname:', scaled_colname
, '\n===================='
, '\nMax: ', col_scaled_max
, '\nMin: ', col_scaled_min)
#%%=====================================================================
#############
# reorder columns
#############
dynamut_data.columns
dynamut_data_f = dynamut_data[['mutationinformation'
, 'ddg_dynamut'
, 'ddg_dynamut_scaled'
, 'ddg_dynamut_outcome'
, 'ddg_encom'
, 'ddg_encom_scaled'
, 'ddg_encom_outcome'
, 'ddg_mcsm'
, 'ddg_mcsm_scaled'
, 'ddg_mcsm_outcome'
, 'ddg_sdm'
, 'ddg_sdm_scaled'
, 'ddg_sdm_outcome'
, 'ddg_duet'
, 'ddg_duet_scaled'
, 'ddg_duet_outcome'
, 'dds_encom'
, 'dds_encom_scaled'
, 'dds_encom_outcome']]
if len(dynamut_data.columns) == len(dynamut_data_f.columns) and sorted(dynamut_data.columns) == sorted(dynamut_data_f.columns):
print('\nPASS: outcome_classification, scaling and column reordering completed')
else:
print('\nFAIL: Something went wrong...'
, '\nExpected length: ', len(dynamut_data.columns)
, '\nGot: ', len(dynamut_data_f.columns))
sys.exit()
return(dynamut_data_f)
#%%#####################################################################
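To make the rescaling step concrete, a small self-contained check of the same per-column rule used above (the values are made up):

```
# Worked example of the per-column scaling in format_dynamut_output():
# negatives are divided by |col_min|, non-negatives by col_max, so the
# result lies in [-1, 1] and keeps its sign. Values are illustrative.
vals = [-4.0, -1.0, 0.0, 2.0, 8.0]
col_min, col_max = min(vals), max(vals)
scaled = [v / abs(col_min) if v < 0 else v / col_max for v in vals]
print(scaled)  # [-1.0, -0.25, 0.0, 0.25, 1.0]
```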

format script for dynamut2 output (new file; filename not shown)
@@ -0,0 +1,137 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
#%%#####################################################################
def format_dynamut2_output(dynamut_output_csv):
"""
@param dynamut_output_csv: file containing dynamut2 results for all muts,
produced by combining all dynamut2_output batch results into one file
with bash scripts. Dynamut2 was run manually in batches.
Reads the csv into a pandas df and formats it.
@type string
@return formatted dynamut2 output
@type pandas df
"""
#############
# Read file
#############
dynamut_data_raw = pd.read_csv(dynamut_output_csv, sep = ',')
# strip white space from both ends in all columns
dynamut_data = dynamut_data_raw.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
dforig_shape = dynamut_data.shape
print('dimensions of input file:', dforig_shape)
#%%============================================================================
#####################################
# create binary cols for ddg_dynamut2
# >=0: Stabilising
######################################
outcome_cols = ['ddg_dynamut2']
# col test: ddg_dynamut
#len(dynamut_data[dynamut_data['ddg_dynamut'] >= 0])
#dynamut_data['ddg_dynamut_outcome'] = dynamut_data['ddg_dynamut'].apply(lambda x: 'Stabilising' if x >= 0 else 'Destabilising')
#len(dynamut_data[dynamut_data['ddg_dynamut_outcome'] == 'Stabilising'])
print('\nCreating classification cols for', len(outcome_cols), 'columns'
, '\nThese are:')
for cols in outcome_cols:
print(cols)
tot_muts = dynamut_data[cols].count()
print('\nTotal entries:', tot_muts)
outcome_colname = cols + '_outcome'
print(cols, ':', outcome_colname)
c1 = len(dynamut_data[dynamut_data[cols] >= 0])
dynamut_data[outcome_colname] = dynamut_data[cols].apply(lambda x: 'Stabilising' if x >= 0 else 'Destabilising')
c2 = len(dynamut_data[dynamut_data[outcome_colname] == 'Stabilising'])
if c1 == c2:
print('\nPASS: outcome classification column created successfully'
, '\nColumn created:', outcome_colname
#, '\nNo. of stabilising muts: ', c1
#, '\nNo. of DEstabilising muts: ', tot_muts-c1
, '\n\nCateg counts:\n', dynamut_data[outcome_colname].value_counts() )
else:
print('\nFAIL: outcome classification numbers MISmatch'
, '\nexpected length:', c1
, '\nGot:', c2)
#%%=====================================================================
################################
# scale all ddg_dynamut2 values
#################################
# Rescale values in the ddg_dynamut2 col to between -1 and 1 so negative
# numbers stay negative and positive numbers stay positive
outcome_cols = ['ddg_dynamut2']
for cols in outcome_cols:
#print(cols)
col_max = dynamut_data[cols].max()
col_min = dynamut_data[cols].min()
print( '\n===================='
, '\nColname:', cols
, '\n===================='
, '\nMax: ', col_max
, '\nMin: ', col_min)
scaled_colname = cols + '_scaled'
print('\nCreated scaled colname for', cols, ':', scaled_colname)
col_scale = lambda x : x/abs(col_min) if x < 0 else (x/col_max if x >= 0 else 'failed')
dynamut_data[scaled_colname] = dynamut_data[cols].apply(col_scale)
col_scaled_max = dynamut_data[scaled_colname].max()
col_scaled_min = dynamut_data[scaled_colname].min()
print( '\n===================='
, '\nColname:', scaled_colname
, '\n===================='
, '\nMax: ', col_scaled_max
, '\nMin: ', col_scaled_min)
#%%=====================================================================
#############
# reorder columns
#############
dynamut_data.columns
dynamut_data_f = dynamut_data[['mutationinformation'
, 'chain'
, 'ddg_dynamut2'
, 'ddg_dynamut2_scaled'
, 'ddg_dynamut2_outcome']]
if len(dynamut_data.columns) == len(dynamut_data_f.columns) and sorted(dynamut_data.columns) == sorted(dynamut_data_f.columns):
print('\nPASS: outcome_classification, scaling and column reordering completed')
else:
print('\nFAIL: Something went wrong...'
, '\nExpected length: ', len(dynamut_data.columns)
, '\nGot: ', len(dynamut_data_f.columns))
sys.exit()
return(dynamut_data_f)
#%%#####################################################################

dynamut/get_results_dynamut.py (new executable file)
@@ -0,0 +1,98 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
#%%#####################################################################
def get_results(url_file, host_url, output_dir, outfile_suffix):
# initialise empty df
dynamut_results_out_df = pd.DataFrame()
with open(url_file, 'r') as f:
for count, line in enumerate(f):
line = line.strip()
print('URL no.', count+1, '\n', line)
#batch_response = requests.get(line, headers=headers)
batch_response = requests.get(line)
batch_soup = BeautifulSoup(batch_response.text, features = 'html.parser')
# initialise empty df
#dynamut_results_df = pd.DataFrame()
for a in batch_soup.find_all('a', href=True, attrs = {'class':'btn btn-default btn-sm'}):
print ("Found the URL:", a['href'])
single_result_url = host_url + a['href']
snp = re.search(r'([A-Z]+[0-9]+[A-Z]+$)', single_result_url).group(0)
print(snp)
print('\nGetting results from:', single_result_url)
result_response = requests.get(single_result_url)
if result_response.status_code == 200:
print('\nFetching results for SNP:', snp)
# extract results using the html parser
soup = BeautifulSoup(result_response.text, features = 'html.parser')
#web_result_raw = soup.find(id = 'predictions').get_text()
ddg_dynamut = soup.find(id = 'ddg_dynamut').get_text()
ddg_encom = soup.find(id = 'ddg_encom').get_text()
ddg_mcsm = soup.find(id = 'ddg_mcsm').get_text()
ddg_sdm = soup.find(id = 'ddg_sdm').get_text()
ddg_duet = soup.find(id = 'ddg_duet').get_text()
dds_encom = soup.find(id = 'dds_encom').get_text()
param_dict = {"mutationinformation" : snp
, "ddg_dynamut" : ddg_dynamut
, "ddg_encom" : ddg_encom
, "ddg_mcsm" : ddg_mcsm
, "ddg_sdm" : ddg_sdm
, "ddg_duet" : ddg_duet
, "dds_encom" : dds_encom
}
results_df = pd.DataFrame.from_dict(param_dict, orient = "index").T
print('Result DF:', results_df, 'for URL:', line)
#dynamut_results_df = dynamut_results_df.append(results_df)#!1 too many!:-)
dynamut_results_out_df = pd.concat([dynamut_results_out_df, results_df])  # DataFrame.append was removed in pandas 2.x
#print(dynamut_results_out_df)
#============================
# Writing results file: csv
#============================
dynamut_results_dir = output_dir + 'dynamut_results/'
if not os.path.exists(dynamut_results_dir):
print('\nCreating dir: dynamut_results within:', output_dir )
os.makedirs(dynamut_results_dir)
print('\nWriting dynamut results df')
print('\nResults File:'
, '\nNo. of rows:', dynamut_results_out_df.shape[0]
, '\nNo. of cols:', dynamut_results_out_df.shape[1])
print(dynamut_results_out_df)
#dynamut_results_out_df.to_csv('/tmp/test_dynamut.csv', index = False)
# build out filename
out_filename = dynamut_results_dir + 'dynamut_output_' + outfile_suffix + '.csv'
dynamut_results_out_df.to_csv(out_filename, index = False)
# TODO: add as a cmd option
# Download .tar.gz file
prediction_number = re.search(r'([0-9]+$)', line).group(0)
tgz_url = f"{host_url}/dynamut/results_file/results_" + prediction_number + '.tar.gz'
tgz_filename = dynamut_results_dir + outfile_suffix + '_results_' + prediction_number + '.tar.gz'
response_tgz = requests.get(tgz_url, stream = True)
if response_tgz.status_code == 200:
print('\nDownloading tar.gz file:', tgz_url
, '\n\nSaving file as:', tgz_filename)
with open(tgz_filename, 'wb') as f:
f.write(response_tgz.raw.read())
#%%#####################################################################
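A hedged usage sketch for get_results(); the url file, output dir and suffix below are placeholders, and the host is assumed to be the public DynaMut server, since the code appends '/dynamut/results_file/...' to it:

```
# Hypothetical invocation; all argument values are placeholders.
get_results(url_file = 'dynamut_batch_urls.txt',        # one batch URL per line
            host_url = 'http://biosig.unimelb.edu.au',  # assumed DynaMut host
            output_dir = 'results/',                    # must end with '/'
            outfile_suffix = 'batch1')
```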

chain and mutation list (new file; filename not shown)
@@ -0,0 +1,817 @@
A G24V
A K27I
A K27E
A Y28L
A Y28H
A P29S
A V30A
A G32S
A G33S
A G34V
A G34A
A Q36P
A Q36H
A D37G
A P40T
A L43R
A L43P
A K46N
A V47I
A L48P
A L48R
A P52S
A D56H
A P57S
A A61S
A F62L
A D63G
A Y64C
A A65T
A A66T
A V68G
A I71F
A I71S
A V73A
A V73G
A A75P
A L76P
A T77R
A R78P
A R78G
A E81V
A E82D
A V83L
A V83G
A M84I
A M84T
A M84L
A T85A
A T85P
A T86P
A T86N
A S87L
A Q88P
A Q88E
A P89D
A W90R
A W90C
A W91G
A W91R
A W91L
A W91S
A P92T
A A93G
A A93D
A A93T
A D94N
A Y95F
A Y95S
A H97N
A H97P
A H97S
A Y98C
A Y98D
A Y98N
A G99R
A G99E
A P100T
A L101F
A L101M
A F102M
A F102S
A F102I
A I103N
A I103V
A I103T
A R104Q
A R104W
A M105I
A A106S
A A106V
A A106T
A A106R
A A106G
A A109T
A A109V
A A109S
A A109D
A A110V
A A110T
A G111D
A T112I
A Y113C
A I115V
A I115S
A I115T
A H116T
A H116E
A H116L
A H116G
A H116A
A H116Q
A H116F
A H116S
A H116P
A D117E
A G120S
A G121A
A G121S
A A122G
A A122D
A A122T
A A122V
A G123R
A G123E
A G124A
A G124Q
A G124D
A G124S
A G124H
A G124E
A G124R
A G124T
A G125D
A G125S
A M126Q
A M126I
A M126A
A M126L
A M126S
A Q127P
A R128Q
A R128L
A R128G
A R128W
A F129S
A A130E
A P131Q
A P131A
A P131L
A P131S
A L132R
A N133S
A N133D
A S134R
A W135S
A P136L
A N138S
A N138H
A N138D
A A139V
A A139P
A A139G
A S140N
A S140G
A S140I
A L141S
A L141F
A L141I
A L141V
A D142G
A D142N
A K143N
A K143E
A A144T
A A144V
A R145H
A R145C
A R145S
A R146L
A L148I
A W149R
A W149L
A W149G
A W149C
A V151L
A V151I
A K152E
A K152T
A K153Q
A Y155C
A Y155S
A Y155H
A G156D
A G156S
A K157N
A K157R
A K157Q
A K158S
A K158N
A L159I
A L159F
A L159P
A W161C
A W161R
A A162V
A A162E
A A162T
A D163N
A D163A
A L164R
A I165M
A I165L
A I165Y
A I165T
A V166I
A V166T
A F167S
A F167L
A F167C
A A168V
A A168T
A A168G
A G169S
A N170K
A C171V
A C171G
A A172T
A A172V
A L173R
A M176T
A M176I
A F178I
A F178S
A K179E
A T180M
A T180K
A G182R
A G182E
A F183L
A F183S
A G184D
A G184A
A G184C
A G186A
A G186S
A G186D
A R187P
A D189N
A D189G
A D189A
A D189Y
A W191R
A W191G
A E192A
A E192D
A D194N
A E195K
A V196G
A Y197D
A W204S
A L205R
A G206R
A E208K
A R209C
A S211N
A S211T
A K213E
A K213N
A R214L
A D215H
A D215E
A N218S
A P219L
A A222T
A Q224R
A M225V
A I228L
A N231K
A P232S
A P232R
A P232T
A P232A
A E233G
A E233Q
A G234R
A N236D
A G237A
A G237D
A P241H
A M242V
A M242T
A M242I
A A243T
A A244G
A V246R
A V246G
A I248T
A R249G
A R249C
A R249H
A T251K
A T251M
A F252L
A R253G
A R253W
A R254S
A R254C
A R254H
A R254L
A A256T
A A256V
A A256G
A M257I
A M257T
A M257V
A D259G
A D259E
A D259Y
A V260I
A V260E
A T262P
A A264V
A A264T
A V267A
A G268S
A G269S
A G269D
A T271P
A T271S
A T271I
A T271A
A F272L
A F272S
A F272V
A G273R
A G273C
A T275P
A T275A
A H276Q
A G277S
A G279D
A P280S
A P280Q
A A281V
A A281G
A A281T
A D282G
A G285C
A G285S
A G285V
A G285D
A G285A
A P286L
A P288H
A P288L
A E289A
A E289K
A A290V
A A290P
A A291D
A P292A
A Q295A
A Q295P
A Q295E
A M296V
A M296T
A G297V
A G297L
A L298S
A G299S
A G299C
A G299V
A G299A
A G299D
A W300S
A W300G
A W300R
A W300C
A S302R
A S302T
A G305C
A G305A
A T306A
A T306S
A G307R
A T308P
A T308S
A T308K
A T308A
A T308V
A T308I
A D311G
A A312P
A A312E
A A312V
A T314S
A T314N
A T314A
A S315T
A S315N
A S315I
A S315G
A S315R
A I317L
A I317V
A I317T
A E318K
A V320L
A V320A
A T322A
A T322M
A N323P
A N323S
A N323H
A T324N
A T324P
A T324S
A T324L
A P325S
A P325T
A T326P
A T326M
A K327T
A W328L
A W328S
A W328R
A W328C
A D329A
A D329E
A D329H
A S331T
A S331I
A S331R
A L333F
A L333C
A E334K
A I335V
A I335T
A I335N
A L336M
A Y337C
A Y337H
A Y337F
A Y337S
A G338S
A Y339N
A Y339C
A Y339S
A E340D
A E342G
A T344L
A T344K
A T344S
A T344M
A A348V
A A348G
A G349D
A Q352Y
A Y353H
A Y353F
A T354I
A D357H
A I364N
A D366N
A P367L
A F368L
A S374A
A S374P
A L378P
A L378M
A A379V
A A379T
A T380S
A T380P
A T380I
A T380A
A T380N
A D381A
A L382I
A L382R
A S383W
A S383A
A L384R
A R385P
A V386M
A V386E
A D387N
A Y390C
A R392W
A T394P
A T394M
A T394A
A R395C
A L398R
A E399D
A E399K
A H400Y
A H400P
A E402A
A E402K
A L404W
A D406A
A D406E
A D406G
A E407A
A E407K
A F408Y
A F408S
A F408L
A F408V
A A411D
A Y413C
A Y413F
A Y413H
A Y413S
A K414R
A I416M
A I416T
A I416L
A I416V
A D419H
A D419G
A D419Y
A D419V
A P422H
A P422L
A V423I
A A424V
A A424G
A R425K
A L427P
A L427R
A L427F
A L430A
A P432L
A P432T
A K433T
A Q434P
A L437R
A W438G
A Q439K
A Q439H
A Q439R
A Q439T
A D440G
A P441L
A V442L
A V442A
A V445I
A S446N
A D448A
A D448E
A V450I
A V450A
A G451D
A E452Q
A I455L
A L458H
A K459T
A S460N
A Q461P
A Q461R
A Q461E
A I462S
A R463L
A R463W
A S465P
A T468P
A V469L
A V469I
A Q471R
A V473L
A V473F
A S474Q
A T475I
A T475A
A A476E
A A476V
A A478R
A A479P
A A479G
A A479V
A A479Q
A A480Q
A A480S
A S481A
A S481L
A S482T
A F483L
A R484H
A R484G
A K488E
A R489C
A G490D
A G490C
A G490S
A G491S
A A492V
A A492D
A N493K
A G494S
A G494A
A G495S
A G495A
A G495C
A R496L
A R496C
A R498S
A P501S
A V503A
A V503S
A W505L
A V507I
A N508D
A D509E
A D509N
A P510A
A D511N
A D513N
A L514P
A L514V
A R515H
A K516R
A R519H
A T520A
A L521P
A E522K
A E523D
A Q525P
A Q525A
A Q525K
A Q525S
A E526D
A S527L
A N529T
A A532P
A A532V
A P533L
A G534A
A G534R
A K537E
A V538A
A F540S
A A541T
A D542E
A L546F
A C549S
A A550D
A A551S
A A555P
A A556S
A K557N
A G560R
A G560A
A G560S
A H561R
A N562H
A V565G
A P566L
A F567S
A F567L
A F567V
A T568P
A P569L
A G570F
A R571L
A A574V
A T579A
A T579S
A S583P
A F584V
A V586M
A L587R
A L587P
A E588G
A A591T
A G593C
A F594I
A F594L
A N596S
A Y597H
A Y597S
A Y597D
A L598F
A L598R
A G599R
A K600Q
A N602D
A P603L
A P605S
A A606P
A A606T
A E607D
A Y608D
A M609T
A L611R
A D612G
A A614T
A A614G
A A614E
A L616S
A T618M
A S620T
A A621T
A A621D
A M624V
A M624K
A M624I
A T625A
A T625K
A L627P
A V628I
A G629D
A G629C
A G630R
A G630V
A V633A
A V633I
A L634I
A A636T
A N637D
A N637H
A N637K
A Y638C
A Y638H
A G644D
A G644S
A G644V
A E648D
A A649T
A A649G
A S650F
A S650P
A E651D
A L653Q
A T654S
A N655D
A F657S
A F657L
A N660D
A L661M
A L662V
A D663G
A D663Y
A I666V
A T667P
A T667I
A W668C
A W668L
A A673V
A D675Y
A D675G
A D675H
A T677P
A Y678C
A Q679E
A Q679Y
A G680D
A K681Q
A K681T
A S684R
A K686E
A W689G
A W689R
A T690I
A T690P
A G691D
A S692R
A R693C
A R693H
A D695A
A L696Q
A L696P
A V697A
A F698V
A G699E
A G699V
A S700P
A S700F
A E703Q
A L704W
A L704S
A R705L
A R705G
A R705W
A L707R
A L707F
A E709A
A E709G
A V710I
A V710A
A Y711D
A A713S
A D714E
A D714N
A D714G
A P718S
A F720S
A D723N
A D723A
A A726T
A A727S
A A727T
A W728R
A D729N
A D729V
A D729G
A D729T
A V731M
A V731A
A N733S
A L734R
A D735A
A R736K
A R736S
A V739M
A R740S

dynamut/notes.txt Normal file

@ -0,0 +1,11 @@
Dynamut was painfully run for gid, part manually, part programmatically!
However, it was decided to ditch that and only run Dynamut2 for future targets.
Dynamut2 was run through the website in batches of 50 for
katG: 17 batches (00..16)
rpoB: 23 batches (00..22)
alr: 6 batches (00..05)
However, the API was used for rpoB batches (09-22) from 13 Oct 2021,
as jobs started to flake and fail through the website!
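A minimal sketch of that batching in Python (batch size 50 as above; the input filename is illustrative, borrowed from split_csv_chain.sh below):

# split a flat mutation list into batches of 50, one file per batch
batch_size = 50
with open('katg_mcsm_formatted_snps_chain.csv') as f:  # illustrative input
    muts = [line.strip() for line in f if line.strip()]
for b in range(0, len(muts), batch_size):
    with open('snp_batch_%02d.txt' % (b // batch_size), 'w') as out:
        out.write('\n'.join(muts[b:b + batch_size]) + '\n')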


@ -0,0 +1,100 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 12 12:15:26 2021
@author: tanu
"""
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# FIXME
# RE RUN when B07 completes!!!! as norm gets affected!
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#%% load packages
import os
import argparse  # used below; don't rely on the star imports to provide it
homedir = os.path.expanduser('~')
os.chdir (homedir + '/git/LSHTM_analysis/dynamut')
from format_results_dynamut import *
from format_results_dynamut2 import *
########################################################################
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug' , help = 'drug name (case sensitive)', default = None)
arg_parser.add_argument('-g', '--gene' , help = 'gene name (case sensitive)', default = None)
arg_parser.add_argument('--datadir' , help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir' , help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
#arg_parser.add_argument('--mkdir_name' , help = 'Output dir for processed results. This will be created if it does not exist')
arg_parser.add_argument('-m', '--make_dirs' , help = 'Make dir for input and output', action='store_true')
arg_parser.add_argument('--debug' , action = 'store_true' , help = 'Debug Mode')
args = arg_parser.parse_args()
#%%============================================================================
# variable assignment: input and output paths & filenames
drug = args.drug
gene = args.gene
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
#outdir_dynamut2 = args.mkdir_name
make_dirs = args.make_dirs
#=======
# dirs
#=======
if not datadir:
    datadir = homedir + '/git/Data/'
if not indir:
    indir = datadir + drug + '/input/'
if not outdir:
    outdir = datadir + drug + '/output/'
#if not mkdir_name:
outdir_dynamut = outdir + 'dynamut_results/'
outdir_dynamut2 = outdir + 'dynamut_results/dynamut2/'
# Input file
#infile_dynamut = outdir_dynamut + gene.lower() + '_dynamut_all_output_clean.csv'
infile_dynamut2 = outdir_dynamut2 + gene.lower() + '_dynamut2_output_combined_clean.csv'
# Formatted output filename
outfile_dynamut_f = outdir_dynamut2 + gene + '_dynamut_norm.csv'
outfile_dynamut2_f = outdir_dynamut2 + gene + '_dynamut2_norm.csv'
#%%========================================================================
#===============================
# CALL: format_results_dynamut
# DYNAMUT results
# #===============================
# print('Formatting results for:', infile_dynamut)
# dynamut_df_f = format_dynamut_output(infile_dynamut)
# # writing file
# print('Writing formatted dynamut df to csv')
# dynamut_df_f.to_csv(outfile_dynamut_f, index = False)
# print('Finished writing file:'
# , '\nFile:', outfile_dynamut_f
# , '\nExpected no. of rows:', len(dynamut_df_f)
# , '\nExpected no. of cols:', len(dynamut_df_f.columns)
# , '\n=============================================================')
#===============================
# CALL: format_results_dynamut2
# DYNAMUT2 results
#===============================
print('Formatting results for:', infile_dynamut2)
dynamut2_df_f = format_dynamut2_output(infile_dynamut2) # dynamut2
# writing file
print('Writing formatted dynamut2 df to csv')
dynamut2_df_f.to_csv(outfile_dynamut2_f, index = False)
print('Finished writing file:'
, '\nFile:', outfile_dynamut2_f
, '\nExpected no. of rows:', len(dynamut2_df_f)
, '\nExpected no. of cols:', len(dynamut2_df_f.columns)
, '\n=============================================================')
#%%#####################################################################


@ -0,0 +1,44 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os
homedir = os.path.expanduser('~')
os.chdir (homedir + '/git/LSHTM_analysis/dynamut')
from get_results_dynamut import *
########################################################################
# variables
my_host = 'http://biosig.unimelb.edu.au'
# Needed if things try to block the 'requests' user agent
#headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
# TODO: add cmd line args
# hard-coded for now; this run targets gid+streptomycin (see CALL below)
gene = 'gid'
drug = 'streptomycin'
datadir = homedir + '/git/Data/'
indir = datadir + drug + '/input/'
outdir = datadir + drug + '/output/'
outdir_dynamut_temp = outdir + 'dynamut_results/dynamut_temp/'
#==============================================================================
# batch 7 (previously 1b file): RETRIEVED 17 Aug 16:40
my_url_file = outdir_dynamut_temp + 'dynamut_result_url_gid_b7.txt'
my_suffix = 'gid_b7'
#==============================================================================
#==========================
# CALL: get_results()
# Data: gid+streptomycin
#==========================
# output file saves in dynamut_results/ (created if it doesn't exist) inside outdir
print('Fetching results from url file :', my_url_file, '\nsuffix:', my_suffix)
get_results(url_file = my_url_file
, host_url = my_host
, output_dir = outdir
, outfile_suffix = my_suffix)
########################################################################

dynamut/run_submit_dynamut.py Executable file

@ -0,0 +1,58 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os
homedir = os.path.expanduser('~')
os.chdir (homedir + '/git/LSHTM_analysis/dynamut')
from submit_dynamut import *
########################################################################
# variables
my_host = 'http://biosig.unimelb.edu.au'
my_prediction_url = f"{my_host}/dynamut/prediction_list"
print(my_prediction_url)
# TODO: add cmd line args
gene = 'gid'
drug = 'streptomycin'
datadir = homedir + '/git/Data/'
indir = datadir + drug + '/input/'
outdir = datadir + drug + '/output/'
outdir_dynamut = outdir + 'dynamut_results/'
my_chain = 'A'
my_email = 'tanushree.tunstall@lshtm.ac.uk'
#my_pdb_file = indir + 'gid_complex.pdb'
my_pdb_file = indir + gene + '_complex.pdb'
#==============================================================================
# Rerunning batch 7: 07.txt, # RAN: 12 Aug 15:22, previous run produced a 0-byte file!
my_mutation_list = outdir + 'snp_batches/50/snp_batch_07.txt'
my_suffix = 'gid_b7'
#==============================================================================
#==========================
# CALL: submit_dynamut()
# Data: gid+streptomycin
#==========================
print('\nSubmitting batch for:'
, '\nFilename : ' , my_mutation_list
, '\nbatch : ' , my_suffix
, '\ndrug : ' , drug
, '\ngene : ' , gene
, '\npdb file : ' , my_pdb_file)
submit_dynamut(host_url = my_host
, pdb_file = my_pdb_file
, mutation_list = my_mutation_list
, chain = my_chain
, email_address = my_email
, prediction_url = my_prediction_url
, output_dir = outdir_dynamut
, outfile_suffix = my_suffix)
#%%#####################################################################

dynamut/split_csv.sh Executable file

@ -0,0 +1,24 @@
#!/bin/bash
# FIXME: This is written for expediency to kickstart running dynamut, mcsm-PPI2 (batch of 50) and mCSM-NA (batch of 20)
# Usage: ~/git/LSHTM_analysis/dynamut/split_csv.sh <input file> <output dir> <chunk size in lines>
# copy your snp file to split into the dynamut dir
INFILE=$1
OUTDIR=$2
CHUNK=$3
mkdir -p ${OUTDIR}/${CHUNK}
cd ${OUTDIR}/${CHUNK}
# mkdir -p created 2 dir levels and we cd into them, hence ../..
split ../../${INFILE} -l ${CHUNK} -d snp_batch_
# use case
#~/git/LSHTM_analysis/dynamut/split_csv.sh gid_mcsm_formatted_snps.csv snp_batches 50
#~/git/LSHTM_analysis/dynamut/split_csv.sh embb_mcsm_formatted_snps.csv snp_batches 50
#~/git/LSHTM_analysis/dynamut/split_csv.sh pnca_mcsm_formatted_snps.csv snp_batches 50
#~/git/LSHTM_analysis/dynamut/split_csv.sh katg_mcsm_formatted_snps.csv snp_batches 50 #Date: 20/09/2021
# add .txt to the files
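The '.txt' renaming left as a manual step above can be sketched in Python (paths are assumptions; it mirrors the mv loop in split_csv_chain.sh below):

import glob, os

# rename bare split output (snp_batch_00, snp_batch_01, ...) to .txt
for f in glob.glob('snp_batches/50/snp_batch_*'):
    if not f.endswith('.txt'):
        os.rename(f, f + '.txt')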

dynamut/split_csv_chain.sh Executable file

@ -0,0 +1,41 @@
#!/bin/bash
# FIXME: This is written for expediency to kickstart running dynamut, mcsm-PPI2 (batch of 50) and mCSM-NA (batch of 20)
# Usage: ~/git/LSHTM_analysis/dynamut/split_csv.sh <input file> <output dir> <chunk size in lines>
# copy your snp file to split into the dynamut dir
# use sed to add chain ID to snp file and then split to avoid post processing
INFILE=$1
OUTDIR=$2
CHUNK=$3
mkdir -p ${OUTDIR}/${CHUNK}/chain_added
cd ${OUTDIR}/${CHUNK}/chain_added
# mkdir -p created 3 dir levels and we cd into them, hence ../../..
split ../../../${INFILE} -l ${CHUNK} -d snp_batch_
########################################################################
# use cases
# Date: 20/09/2021
# sed -e 's/^/A /g' katg_mcsm_formatted_snps.csv > katg_mcsm_formatted_snps_chain.csv
#~/git/LSHTM_analysis/dynamut/split_csv_chain.sh katg_mcsm_formatted_snps_chain.csv snp_batches 50
# Date: 01/10/2021
# sed -e 's/^/A /g' rpob_mcsm_formatted_snps.csv > rpob_mcsm_formatted_snps_chain.csv
#~/git/LSHTM_analysis/dynamut/split_csv_chain.sh rpob_mcsm_formatted_snps_chain.csv snp_batches 50
# Date: 02/10/2021
# sed -e 's/^/A /g' alr_mcsm_formatted_snps.csv > alr_mcsm_formatted_snps_chain.csv
#~/git/LSHTM_analysis/dynamut/split_csv_chain.sh alr_mcsm_formatted_snps_chain.csv snp_batches 50
# Date: 05/10/2021
#~/git/LSHTM_analysis/dynamut/split_csv_chain.sh alr_mcsm_formatted_snps_chain.csv snp_batches 20
# Date: 30/11/2021
#~/git/LSHTM_analysis/dynamut/split_csv_chain.sh katg_mcsm_formatted_snps_chain.csv snp_batches 20
for i in {00..40}; do mv snp_batch_${i} snp_batch_${i}.txt; done
# add .txt to the files
########################################################################

dynamut/submit_dynamut.py Executable file

@ -0,0 +1,89 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
#%%#####################################################################
def submit_dynamut(host_url
, pdb_file
, mutation_list
, chain
, email_address
, prediction_url
, output_dir
, outfile_suffix
):
"""
Makes a POST request for dynamut predictions.
@param host_url: valid host url for submitting the job
@type string
@param pdb_file: valid path to pdb structure
@type string
@param mutation_list: list of mutations (1 per line) of the format: {WT}<POS>{Mut}
@type string
@param chain: single-letter(caps)
@type chr
@param email_address: email address to inform of results
@type chr
@param prediction_url: dynamut url for prediction
@type string
@param output_dir: output dir
@type string
@param outfile_suffix: to append to outfile
@type string
@return writes a .txt file containing url for the snps processed with user provided suffix in filename
@type string
"""
with open(pdb_file, "rb") as pdb_file, open (mutation_list, "rb") as mutation_list:
files = {"wild": pdb_file
, "mutation_list": mutation_list}
body = {"chain": chain
, "email": email_address}
response = requests.post(prediction_url, files = files, data = body)
print(response.status_code)
if response.history:
print('\nPASS: valid submission. Fetching result url')
url_match = re.search('/dynamut/results_prediction/.+(?=")', response.text)
url = host_url + url_match.group()
print('\nURL for snp batch no ', str(outfile_suffix), ':', url)
#===============
# writing file: result urls
#===============
dynamut_temp_dir = output_dir + 'dynamut_temp/' # creates a temp dir within output_dir
if not os.path.exists(dynamut_temp_dir):
print('\nCreating dynamut_temp in output_dir', output_dir )
os.makedirs(dynamut_temp_dir)
out_url_file = dynamut_temp_dir + 'dynamut_result_url_' + str(outfile_suffix) + '.txt'
print('\nWriting output url file:', out_url_file
, '\nNow we wait patiently...')
myfile = open(out_url_file, 'a')
myfile.write(url)
myfile.close()
#%%#####################################################################
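Note: 'time' is imported above but currently unused; one plausible use is polling the result url before handing it to get_results(). A hedged sketch (the readiness check and interval are assumptions, not a documented Dynamut API):

import time
import requests

def wait_for_results(url, interval = 300, max_tries = 48):
    # assumption: a finished batch page serves HTTP 200 and contains the
    # per-SNP result buttons that get_results() scrapes
    for _ in range(max_tries):
        response = requests.get(url)
        if response.status_code == 200 and 'btn btn-default btn-sm' in response.text:
            return True
        time.sleep(interval)  # default: poll every 5 min, for up to 4 h
    return False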

foldx/cmd_change Normal file

@ -0,0 +1,3 @@
sed -i 's|/Users/Charlotte/Downloads/foldxMacC11/|/home/tanu/git/LSHTM_analysis/foldx/|g' *.sh
rm *.txt *.fxout *Repai*pdb


@ -0,0 +1,68 @@
PDB=$1
n=$2
#cd /home/tanu/git/LSHTM_analysis/foldx/
logger "Running mutrenamefiles_mac"
cp Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout Matrix_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Distances_${PDB}_Repair_${n}_PN.fxout Matrix_Distances_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,4d Matrix_Distances_${PDB}_Repair_${n}_PN.txt
cp Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout Matrix_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Electro_${PDB}_Repair_${n}_PN.fxout Matrix_Electro_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout Matrix_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout Matrix_Partcov_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout Matrix_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_SS_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.fxout AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,2d AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Electro_${PDB}_Repair_${n}_PN.fxout AllAtoms_Electro_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,2d AllAtoms_Electro_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.fxout AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,2d AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Partcov_${PDB}_Repair_${n}_PN.fxout AllAtoms_Partcov_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,2d AllAtoms_Partcov_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.fxout AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,2d AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.fxout AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,2d AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.fxout InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,5d InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Distances_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Distances_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,5d InteractingResidues_Distances_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Electro_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Electro_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,5d InteractingResidues_Electro_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,5d InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,5d InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,5d InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -i .bak -e 1,5d InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.txt


@ -0,0 +1,10 @@
PDB=$1
A=$2
B=$3
n=$4
OUTDIR=$5
cd ${OUTDIR}
logger "Running mutruncomplex"
foldx --command=AnalyseComplex --pdb="${PDB}_Repair_${n}.pdb" --analyseComplexChains=${A},${B} --water=PREDICT --vdwDesign=1
cp ${OUTDIR}/Summary_${PDB}_Repair_${n}_AC.fxout ${OUTDIR}/Summary_${PDB}_Repair_${n}_AC.txt
#sed -i .bak -e 1,8d ${OUTDIR}/Summary_${PDB}_Repair_${n}_AC.txt


@ -0,0 +1,68 @@
PDB=$1
logger "Running renamefiles_mac"
cp Dif_${PDB}_Repair.fxout Dif_${PDB}_Repair.txt
sed -i '.bak' -e 1,8d Dif_${PDB}_Repair.txt
cp Matrix_Hbonds_${PDB}_Repair_PN.fxout Matrix_Hbonds_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_SS_${PDB}_Repair_PN.txt
cp Matrix_Distances_${PDB}_Repair_PN.fxout Matrix_Distances_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,4d Matrix_Distances_${PDB}_Repair_PN.txt
cp Matrix_Volumetric_${PDB}_Repair_PN.fxout Matrix_Volumetric_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_SS_${PDB}_Repair_PN.txt
cp Matrix_Electro_${PDB}_Repair_PN.fxout Matrix_Electro_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_SS_${PDB}_Repair_PN.txt
cp Matrix_Disulfide_${PDB}_Repair_PN.fxout Matrix_Disulfide_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_SS_${PDB}_Repair_PN.txt
cp Matrix_Partcov_${PDB}_Repair_PN.fxout Matrix_Partcov_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_SS_${PDB}_Repair_PN.txt
cp Matrix_VdWClashes_${PDB}_Repair_PN.fxout Matrix_VdWClashes_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_SS_${PDB}_Repair_PN.txt
cp AllAtoms_Disulfide_${PDB}_Repair_PN.fxout AllAtoms_Disulfide_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,2d AllAtoms_Disulfide_${PDB}_Repair_PN.txt
cp AllAtoms_Electro_${PDB}_Repair_PN.fxout AllAtoms_Electro_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,2d AllAtoms_Electro_${PDB}_Repair_PN.txt
cp AllAtoms_Hbonds_${PDB}_Repair_PN.fxout AllAtoms_Hbonds_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,2d AllAtoms_Hbonds_${PDB}_Repair_PN.txt
cp AllAtoms_Partcov_${PDB}_Repair_PN.fxout AllAtoms_Partcov_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,2d AllAtoms_Partcov_${PDB}_Repair_PN.txt
cp AllAtoms_VdWClashes_${PDB}_Repair_PN.fxout AllAtoms_VdWClashes_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,2d AllAtoms_VdWClashes_${PDB}_Repair_PN.txt
cp AllAtoms_Volumetric_${PDB}_Repair_PN.fxout AllAtoms_Volumetric_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,2d AllAtoms_Volumetric_${PDB}_Repair_PN.txt
cp InteractingResidues_VdWClashes_${PDB}_Repair_PN.fxout InteractingResidues_VdWClashes_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,5d InteractingResidues_VdWClashes_${PDB}_Repair_PN.txt
cp InteractingResidues_Distances_${PDB}_Repair_PN.fxout InteractingResidues_Distances_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,5d InteractingResidues_Distances_${PDB}_Repair_PN.txt
cp InteractingResidues_Electro_${PDB}_Repair_PN.fxout InteractingResidues_Electro_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,5d InteractingResidues_Electro_${PDB}_Repair_PN.txt
cp InteractingResidues_Hbonds_${PDB}_Repair_PN.fxout InteractingResidues_Hbonds_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,5d InteractingResidues_Hbonds_${PDB}_Repair_PN.txt
cp InteractingResidues_Partcov_${PDB}_Repair_PN.fxout InteractingResidues_Partcov_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,5d InteractingResidues_Partcov_${PDB}_Repair_PN.txt
cp InteractingResidues_Volumetric_${PDB}_Repair_PN.fxout InteractingResidues_Volumetric_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,5d InteractingResidues_Volumetric_${PDB}_Repair_PN.txt
cp InteractingResidues_Disulfide_${PDB}_Repair_PN.fxout InteractingResidues_Disulfide_${PDB}_Repair_PN.txt
sed -i '.bak' -e 1,5d InteractingResidues_Disulfide_${PDB}_Repair_PN.txt


@ -0,0 +1,9 @@
INDIR=$1
PDB=$2
OUTDIR=$3
logger "Running repairPDB"
#foldx --command=RepairPDB --pdb="${PDB}.pdb" --ionStrength=0.05 --pH=7 --water=PREDICT --vdwDesign=1 --out-pdb=true --output-dir=${OUTDIR}
foldx --command=RepairPDB --pdb-dir=${INDIR} --pdb=${PDB} --ionStrength=0.05 --pH=7 --water=PREDICT --vdwDesign=1 --out-pdb=true --output-dir=${OUTDIR}


@ -0,0 +1,336 @@
#!/usr/bin/env python3
import subprocess
import os
import numpy as np
import pandas as pd
from contextlib import suppress
from pathlib import Path
import re
import csv
import argparse
#https://realpython.com/python-pathlib/
# FIXME
#strong dependency of file and path names
#cannot pass file with path. Need to pass them separately
#assumptions made for dir struc as standard
#datadir + drug + input
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
os.chdir(homedir + '/git/LSHTM_analysis/foldx/')
os.getcwd()
#=======================================================================
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug', help = 'drug name', default = None)
arg_parser.add_argument('-g', '--gene', help = 'gene name (case sensitive)', default = None)
arg_parser.add_argument('--datadir', help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir', help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
arg_parser.add_argument('-p', '--process_dir', help = 'Temp processing dir for running foldX. By default, it assumes homedir + <drug> + processing. Make sure it is somewhere with LOTS of storage as it writes all output!') #FIXME
arg_parser.add_argument('-pdb', '--pdb_file', help = 'PDB File to process. By default, it assumes a file called <gene>_complex.pdb in input_dir')
arg_parser.add_argument('-m', '--mutation_file', help = 'Mutation list. By default, assumes a file called <gene>_mcsm_formatted_snps.csv exists')
# FIXME: Doesn't work with 2 chains yet!
arg_parser.add_argument('-c1', '--chain1', help = 'Chain1 ID', default = 'A') # case sensitive
arg_parser.add_argument('-c2', '--chain2', help = 'Chain2 ID', default = 'B') # case sensitive
args = arg_parser.parse_args()
#=======================================================================
#%% variable assignment: input and output
#drug = 'pyrazinamide'
#gene = 'pncA'
#gene_match = gene + '_p.'
#%%=====================================================================
# Command line options
drug = args.drug
gene = args.gene
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
process_dir = args.process_dir
mut_filename = args.mutation_file
chainA = args.chain1
chainB = args.chain2
pdb_filename = args.pdb_file
# os.path.splitext will fail interestingly with file.pdb.txt.zip
#pdb_name = os.path.splitext(pdb_file)[0]
# Just the filename, thanks
#pdb_name = Path(in_filename_pdb).stem
#==============
# directories
#==============
if not datadir:
    datadir = homedir + '/' + 'git/Data'
if not indir:
    indir = datadir + '/' + drug + '/input'
if not outdir:
    outdir = datadir + '/' + drug + '/output'
#TODO: perhaps better handled by refactoring code to prevent generating lots of output files!
if not process_dir:
    process_dir = datadir + '/' + drug + '/' + 'processing'
#=======
# input
#=======
# FIXME
if pdb_filename:
    pdb_name = Path(pdb_filename).stem
else:
    pdb_filename = gene.lower() + '_complex.pdb'
    pdb_name = Path(pdb_filename).stem
infile_pdb = indir + '/' + pdb_filename
actual_pdb_filename = Path(infile_pdb).name
if mut_filename:
    mutation_file = mut_filename
else:
    mutation_file = gene.lower() + '_mcsm_formatted_snps.csv'
infile_muts = outdir + '/' + mutation_file
#=======
# output
#=======
out_filename = gene.lower() + '_foldx.csv'
outfile_foldx = outdir + '/' + out_filename
print('Arguments being passed:'
, '\nDrug:', args.drug
, '\ngene:', args.gene
, '\ninput dir:', indir
, '\noutput dir:', outdir
, '\npdb file:', infile_pdb
, '\npdb name:', pdb_name
, '\nactual pdb name:', actual_pdb_filename
, '\nmutation file:', infile_muts
, '\nchain1:', args.chain1
, '\noutput file:', outfile_foldx
, '\n=============================================================')
#=======================================================================
def getInteractionEnergy(filename):
    data = pd.read_csv(filename, sep = '\t')
    return data['Interaction Energy'].loc[0]

def getInteractions(filename):
    data = pd.read_csv(filename, index_col = 0, header = 0, sep = '\t')
    contactList = getIndexes(data, 1)
    number = len(contactList)
    return number

def formatMuts(mut_file, pdbname):
    with open(mut_file) as csvfile:
        readCSV = csv.reader(csvfile)
        muts = []
        for row in readCSV:
            mut = row[0]
            muts.append(mut)
    mut_list = []
    outfile = process_dir + '/individual_list_' + pdbname + '.txt'
    with open(outfile, 'w') as output:
        for m in muts:
            print(m)
            mut = m[:1] + chainA + m[1:]
            mut_list.append(mut)
            mut = mut + ';'
            print(mut)
            output.write(mut)
            output.write('\n')
    return mut_list

def getIndexes(data, value):
    colnames = data.columns.values
    listOfPos = list()
    result = data.isin([value])
    result.columns = colnames
    seriesdata = result.any()
    columnNames = list(seriesdata[seriesdata == True].index)
    for col in columnNames:
        rows = list(result[col][result[col] == True].index)
        for row in rows:
            listOfPos.append((row, col))
    return listOfPos

def loadFiles(df):
    # load a text file into an np matrix
    resultList = []
    f = open(df, 'r')
    for line in f:
        line = line.rstrip('\n')
        aVals = line.split('\t')
        fVals = list(map(np.float32, aVals))  # was sVals: an undefined name
        resultList.append(fVals)
    f.close()
    return np.asarray(resultList, dtype = np.float32)
#=======================================================================
def main():
    pdbname = pdb_name
    comp = '' # for complex only
    mut_filename = infile_muts #pnca_mcsm_snps.csv
    mutlist = formatMuts(mut_filename, pdbname)
    print(mutlist)
    nmuts = len(mutlist)
    print(nmuts)
    print(mutlist)
    print('start')
    #subprocess.check_output(['bash','repairPDB.sh', pdbname, process_dir])
    subprocess.check_output(['bash','repairPDB.sh', indir, actual_pdb_filename, process_dir])
    print('end')
    output = subprocess.check_output(['bash', 'runfoldx.sh', pdbname, process_dir])
    for n in range(1, nmuts + 1):
        print(n)
        with suppress(Exception):
            subprocess.check_output(['bash', 'runPrintNetworks.sh', pdbname, str(n), process_dir])
    for n in range(1, nmuts + 1):
        print(n)
        with suppress(Exception):
            subprocess.check_output(['bash', 'mutrenamefiles.sh', pdbname, str(n), process_dir])
    out = subprocess.check_output(['bash','renamefiles.sh', pdbname, process_dir])
    if comp == 'y':
        chain1 = chainA
        chain2 = chainB
        with suppress(Exception):
            subprocess.check_output(['bash','runcomplex.sh', pdbname, chain1, chain2, process_dir])
        for n in range(1, nmuts + 1):
            with suppress(Exception):
                subprocess.check_output(['bash','mutruncomplex.sh', pdbname, chain1, chain2, str(n), process_dir])
    interactions = ['Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS',
                    'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM',
                    'VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
    dGdatafile = process_dir + '/Dif_' + pdbname + '_Repair.txt'
    dGdata = pd.read_csv(dGdatafile, sep = '\t')
    ddG = []
    print('ddG')
    print(len(dGdata))
    for i in range(0, len(dGdata)):
        ddG.append(dGdata['total energy'].loc[i])
    nint = len(interactions)
    wt_int = []
    for i in interactions:
        filename = process_dir + '/Matrix_' + i + '_' + pdbname + '_Repair_PN.txt'
        wt_int.append(getInteractions(filename))
    print('wt')
    print(wt_int)
    ntotal = nint + 1
    print(ntotal)
    print(nmuts)
    data = np.empty((ntotal, nmuts))
    data[0] = ddG
    print(data)
    for i in range(0, len(interactions)):
        d = []
        p = 0
        for n in range(1, nmuts + 1):
            print(i)
            filename = process_dir + '/Matrix_' + interactions[i] + '_' + pdbname + '_Repair_' + str(n) + '_PN.txt'
            mut = getInteractions(filename)
            diff = wt_int[i] - mut
            print(diff)
            print(wt_int[i])
            print(mut)
            d.append(diff)
        print(d)
        data[i+1] = d
    interactions = ['ddG', 'Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS', 'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM',
                    'VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
    print(interactions)
    IE = []
    if comp == 'y':
        wtfilename = process_dir + '/Summary_' + pdbname + '_Repair_AC.txt'
        wtE = getInteractionEnergy(wtfilename)
        print(wtE)
        for n in range(1, nmuts + 1):
            print(n)
            filename = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) + '_AC.txt'
            mutE = getInteractionEnergy(filename)
            print(mutE)
            diff = wtE - mutE
            print(diff)
            IE.append(diff)
        print(IE)
        IEresults = pd.DataFrame(IE, columns = ['Interaction Energy'], index = mutlist)
        IEfilename = 'foldx_complexresults_' + pdbname + '.csv'
        IEresults.to_csv(IEfilename)
        print(len(IE))
        data = np.append(data, [IE], axis = 0)
        print(data)
        interactions = ['ddG','Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS', 'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM',
                        'VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS','Interaction Energy']
    mut_file = process_dir + '/individual_list_' + pdbname + '.txt'
    with open(mut_file) as csvfile:
        readCSV = csv.reader(csvfile)
        mutlist = []
        for row in readCSV:
            mut = row[0]
            mutlist.append(mut)
    print(mutlist)
    print(len(mutlist))
    print(data)
    results = pd.DataFrame(data, columns = mutlist, index = interactions)
    #results.append(ddG) # removed: no-op, DataFrame.append returns a new frame which was discarded
    #print(results.head())
    # my style formatted results
    results2 = results.T # transpose df
    results2.index.name = 'mutationinformation' # assign name to index
    results2 = results2.reset_index() # turn it into a column
    results2['mutationinformation'] = results2['mutationinformation'].replace({r'([A-Z]{1})[A-Z]{1}([0-9]+[A-Z]{1});' : r'\1 \2'}, regex = True) # capture mcsm style muts (i.e. strip the chain id)
    results2['mutationinformation'] = results2['mutationinformation'].str.replace(' ', '') # remove empty space
    results2.rename(columns = {'Distances': 'Contacts'}, inplace = True)
    # lower case columns
    results2.columns = results2.columns.str.lower()
    print('Writing file in the format below:\n'
          , results2.head()
          , '\nNo. of rows:', len(results2)
          , '\nNo. of cols:', len(results2.columns))
    outputfilename = outfile_foldx
    #outputfilename = 'foldx_results_' + pdbname + '.csv'
    #results.to_csv(outputfilename)
    results2.to_csv(outputfilename, index = False)

if __name__ == '__main__':
    main()
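A worked example of what formatMuts() writes: an mCSM-style mutation plus the chain id becomes one FoldX individual-list entry:

m = 'M76P' # mCSM style: {WT}{POS}{MUT}
chainA = 'A'
mut = m[:1] + chainA + m[1:] + ';'
print(mut) # MA76P; -- one such line per mutation in individual_list_<pdb>.txt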


@ -0,0 +1,7 @@
PDB=$1
n=$2
OUTDIR=$3
logger "Running runPrintNetworks"
cd ${OUTDIR}
foldx --command=PrintNetworks --pdb="${PDB}_Repair_${n}.pdb" --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}


@ -0,0 +1,9 @@
PDB=$1
A=$2
B=$3
OUTDIR=$4
cd ${OUTDIR}
logger "Running runcomplex"
foldx --command=AnalyseComplex --pdb="${PDB}_Repair.pdb" --analyseComplexChains=${A},${B} --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}
cp ${OUTDIR}/Summary_${PDB}_Repair_AC.fxout ${OUTDIR}/Summary_${PDB}_Repair_AC.txt
#sed -i .bak -e 1,8d ${OUTDIR}/Summary_${PDB}_Repair_AC.txt


@ -0,0 +1,9 @@
PDB=$1
OUTDIR=$2
cd ${OUTDIR}
pwd
ls
logger "Running runfoldx"
foldx --command=BuildModel --pdb="${PDB}_Repair.pdb" --mutant-file="individual_list_${PDB}.txt" --ionStrength=0.05 --pH=7 --water=PREDICT --vdwDesign=1 --out-pdb=true --numberOfRuns=1 --output-dir=${OUTDIR}
foldx --command=PrintNetworks --pdb="${PDB}_Repair.pdb" --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}
foldx --command=SequenceDetail --pdb="${PDB}_Repair.pdb" --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}

foldx/mutrenamefiles.sh Executable file

@ -0,0 +1,63 @@
PDB=$1
n=$2
OUTDIR=$3
cd ${OUTDIR}
cp Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout Matrix_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Distances_${PDB}_Repair_${n}_PN.fxout Matrix_Distances_${PDB}_Repair_${n}_PN.txt
sed -i '1,4d' Matrix_Distances_${PDB}_Repair_${n}_PN.txt
cp Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout Matrix_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Electro_${PDB}_Repair_${n}_PN.fxout Matrix_Electro_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout Matrix_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout Matrix_Partcov_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout Matrix_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_SS_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.fxout AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Electro_${PDB}_Repair_${n}_PN.fxout AllAtoms_Electro_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Electro_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.fxout AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Partcov_${PDB}_Repair_${n}_PN.fxout AllAtoms_Partcov_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Partcov_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.fxout AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.fxout AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.fxout InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Distances_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Distances_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Distances_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Electro_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Electro_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Electro_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.txt
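The four sed ranges above (5-190, 194-379, 383-568, 572-757) repeat for every Matrix_* file and, going by the RR/MM/SM/SS suffixes, carve the stacked PrintNetworks output into its four interaction blocks. A minimal Python sketch of the same split, assuming that fixed layout (186 rows per block here; a different chain length would need different offsets):

from pathlib import Path

def split_matrix_fxout(fxout_path, out_prefix):
    # 1-based inclusive line ranges, copied from the sed commands above
    blocks = {'RR': (5, 190), 'MM': (194, 379), 'SM': (383, 568), 'SS': (572, 757)}
    lines = Path(fxout_path).read_text().splitlines(keepends=True)
    for tag, (start, end) in blocks.items():
        Path(out_prefix + '_' + tag + '.txt').write_text(''.join(lines[start - 1:end]))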

64
foldx/renamefiles.sh Executable file

@ -0,0 +1,64 @@
PDB=$1
OUTDIR=$2
cd ${OUTDIR}
cp Dif_${PDB}_Repair.fxout Dif_${PDB}_Repair.txt
sed -i '1,8d' Dif_${PDB}_Repair.txt
cp Matrix_Hbonds_${PDB}_Repair_PN.fxout Matrix_Hbonds_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_SS_${PDB}_Repair_PN.txt
cp Matrix_Distances_${PDB}_Repair_PN.fxout Matrix_Distances_${PDB}_Repair_PN.txt
sed -i '1,4d' Matrix_Distances_${PDB}_Repair_PN.txt
cp Matrix_Volumetric_${PDB}_Repair_PN.fxout Matrix_Volumetric_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_SS_${PDB}_Repair_PN.txt
cp Matrix_Electro_${PDB}_Repair_PN.fxout Matrix_Electro_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_SS_${PDB}_Repair_PN.txt
cp Matrix_Disulfide_${PDB}_Repair_PN.fxout Matrix_Disulfide_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_SS_${PDB}_Repair_PN.txt
cp Matrix_Partcov_${PDB}_Repair_PN.fxout Matrix_Partcov_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_SS_${PDB}_Repair_PN.txt
cp Matrix_VdWClashes_${PDB}_Repair_PN.fxout Matrix_VdWClashes_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_SS_${PDB}_Repair_PN.txt
cp AllAtoms_Disulfide_${PDB}_Repair_PN.fxout AllAtoms_Disulfide_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Disulfide_${PDB}_Repair_PN.txt
cp AllAtoms_Electro_${PDB}_Repair_PN.fxout AllAtoms_Electro_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Electro_${PDB}_Repair_PN.txt
cp AllAtoms_Hbonds_${PDB}_Repair_PN.fxout AllAtoms_Hbonds_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Hbonds_${PDB}_Repair_PN.txt
cp AllAtoms_Partcov_${PDB}_Repair_PN.fxout AllAtoms_Partcov_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Partcov_${PDB}_Repair_PN.txt
cp AllAtoms_VdWClashes_${PDB}_Repair_PN.fxout AllAtoms_VdWClashes_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_VdWClashes_${PDB}_Repair_PN.txt
cp AllAtoms_Volumetric_${PDB}_Repair_PN.fxout AllAtoms_Volumetric_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Volumetric_${PDB}_Repair_PN.txt
cp InteractingResidues_VdWClashes_${PDB}_Repair_PN.fxout InteractingResidues_VdWClashes_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_VdWClashes_${PDB}_Repair_PN.txt
cp InteractingResidues_Distances_${PDB}_Repair_PN.fxout InteractingResidues_Distances_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Distances_${PDB}_Repair_PN.txt
cp InteractingResidues_Electro_${PDB}_Repair_PN.fxout InteractingResidues_Electro_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Electro_${PDB}_Repair_PN.txt
cp InteractingResidues_Hbonds_${PDB}_Repair_PN.fxout InteractingResidues_Hbonds_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Hbonds_${PDB}_Repair_PN.txt
cp InteractingResidues_Partcov_${PDB}_Repair_PN.fxout InteractingResidues_Partcov_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Partcov_${PDB}_Repair_PN.txt
cp InteractingResidues_Volumetric_${PDB}_Repair_PN.fxout InteractingResidues_Volumetric_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Volumetric_${PDB}_Repair_PN.txt
cp InteractingResidues_Disulfide_${PDB}_Repair_PN.fxout InteractingResidues_Disulfide_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Disulfide_${PDB}_Repair_PN.txt
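Each cp + sed -i pair above only strips a fixed FoldX banner: 8 lines for Dif_*, 4 for Matrix_Distances_*, 2 for AllAtoms_*, 5 for InteractingResidues_*. Since the .txt copies are read with pandas downstream, the same tables could be loaded directly; a sketch, assuming tab-separated bodies as in runFoldx.py:

import pandas as pd

# banner rows per FoldX output family, matching the sed ranges above
HEADER_ROWS = {'Dif': 8, 'Matrix_Distances': 4, 'AllAtoms': 2, 'InteractingResidues': 5}

def read_fxout(path, family):
    # skiprows drops the FoldX banner; the rest parses as a tab-separated table
    return pd.read_csv(path, sep='\t', skiprows=HEADER_ROWS[family])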

239965
foldx/rotabase.txt Normal file

File diff suppressed because it is too large

500
foldx/runFoldx.py Executable file

@ -0,0 +1,500 @@
#!/usr/bin/env python3
import subprocess
import os
import sys
import numpy as np
import pandas as pd
from contextlib import suppress
from pathlib import Path
import re
import csv
import argparse
import shutil
import time
#https://realpython.com/python-pathlib/
# FIXME
#strong dependency of file and path names
#cannot pass file with path. Need to pass them separately
#assumptions made for dir struc as standard
#datadir + drug + input
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
#os.chdir(homedir + '/git/LSHTM_analysis/foldx/')
#os.getcwd()
#=======================================================================
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug', help = 'drug name', default = None)
arg_parser.add_argument('-g', '--gene', help = 'gene name (case sensitive)', default = None)
arg_parser.add_argument('--datadir', help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir', help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
arg_parser.add_argument('-p', '--process_dir', help = 'Temp processing dir for running foldX. By default, it assumes homedir + <drug> + processing. Make sure it is somewhere with LOTS of storage as it writes all output!') #FIXME
arg_parser.add_argument('-P', '--pdb_file', help = 'PDB File to process. By default, it assumes a file called <gene>_complex.pdb in input_dir')
arg_parser.add_argument('-m', '--mutation_file', help = 'Mutation list. By default, assumes a file called <gene>_mcsm_formatted_snps.csv exists')
# FIXME: Doesn't work with 2 chains yet!
arg_parser.add_argument('-c1', '--chain1', help = 'Chain1 ID', default = 'A') # case sensitive
arg_parser.add_argument('-c2', '--chain2', help = 'Chain2 ID', default = 'B') # case sensitive
args = arg_parser.parse_args()
#=======================================================================
#%% variable assignment: input and output
#drug = 'pyrazinamide'
#gene = 'pncA'
#gene_match = gene + '_p.'
#%%=====================================================================
# Command line options
drug = args.drug
gene = args.gene
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
process_dir = args.process_dir
mut_filename = args.mutation_file
chainA = args.chain1
chainB = args.chain2
pdb_filename = args.pdb_file
# os.path.splitext will fail interestingly with file.pdb.txt.zip
#pdb_name = os.path.splitext(pdb_file)[0]
# Just the filename, thanks
#pdb_name = Path(in_filename_pdb).stem
# Handle the case where neither 'drug'
# nor (indir,outdir,process_dir) are defined
if not drug:
if not indir or not outdir or not process_dir:
print('ERROR: if "drug" is not specified, you must specify Input, Output, and Process directories')
sys.exit()
#==============
# directories
#==============
if not datadir:
datadir = homedir + '/' + 'git/Data'
if not indir:
indir = datadir + '/' + drug + '/input'
if not outdir:
outdir = datadir + '/' + drug + '/output'
#TODO: perhaps better handled by refactoring code to prevent generating lots of output files!
if not process_dir:
process_dir = datadir + '/' + drug + '/processing'
# Make all paths absolute in case the user forgot
indir = os.path.abspath(indir)
process_dir = os.path.abspath(process_dir)
outdir = os.path.abspath(outdir)
datadir = os.path.abspath(datadir)
#=======
# input
#=======
# FIXME
if pdb_filename:
pdb_filename = os.path.abspath(pdb_filename)
pdb_name = Path(pdb_filename).stem
infile_pdb = pdb_filename
else:
pdb_filename = gene.lower() + '_complex.pdb'
pdb_name = Path(pdb_filename).stem
infile_pdb = indir + '/' + pdb_filename
actual_pdb_filename = Path(infile_pdb).name
if mut_filename:
mutation_file = os.path.abspath(mut_filename)
infile_muts = mutation_file
print('User-provided mutation file in use:', infile_muts)
else:
mutation_file = gene.lower() + '_mcsm_formatted_snps.csv'
infile_muts = outdir + '/' + mutation_file
print('WARNING: Assuming default mutation file:', infile_muts)
#=======
# output
#=======
out_filename = gene.lower() + '_foldx.csv'
outfile_foldx = outdir + '/' + out_filename
print('Arguments being passed:'
, '\nDrug:', args.drug
, '\ngene:', args.gene
, '\ninput dir:', indir
, '\nprocess dir:', process_dir
, '\noutput dir:', outdir
, '\npdb file:', infile_pdb
, '\npdb name:', pdb_name
, '\nactual pdb name:', actual_pdb_filename
, '\nmutation file:', infile_muts
, '\nchain1:', args.chain1
, '\noutput file:', outfile_foldx
, '\n=============================================================')
# make sure rotabase.txt exists in the process_dir
rotabase_file = process_dir + '/' + 'rotabase.txt'
if Path(rotabase_file).is_file():
print(f'rotabase file: {rotabase_file} exists')
else:
print(f'ERROR: rotabase file: {rotabase_file} does not exist. Please download it and put it in {process_dir}')
sys.exit()
#### Delay for 10 seconds to check the params ####
print('Sleeping for 10 seconds to give you time to cancel')
time.sleep(10)
#=======================================================================
def getInteractionEnergy(filename):
data = pd.read_csv(filename,sep = '\t')
return data['Interaction Energy'].loc[0]
def getInteractions(filename):
data = pd.read_csv(filename, index_col = 0, header = 0, sep = '\t')
contactList = getIndexes(data,1)
number = len(contactList)
return number
def formatMuts(mut_file,pdbname):
with open(mut_file) as csvfile:
readCSV = csv.reader(csvfile)
muts = []
for row in readCSV:
mut = row[0]
muts.append(mut)
mut_list = []
outfile = process_dir + '/individual_list_' + pdbname + '.txt'
with open(outfile, 'w') as output:
for m in muts:
print(m)
mut = m[:1] + chainA+ m[1:]
mut_list.append(mut)
mut = mut + ';'
print(mut)
output.write(mut)
output.write('\n')
return mut_list
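# Example of what formatMuts emits: an mcsm-style mutation such as 'S104R'
# (a made-up example) gets the chain ID inserted after the wild-type residue
# and a terminating ';', giving the FoldX individual-list entry 'SA104R;'
# -- one entry per line in individual_list_<pdbname>.txt:
#   m = 'S104R'
#   m[:1] + chainA + m[1:] + ';'   # -> 'SA104R;' when chainA == 'A'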
def getIndexes(data, value):
colnames = data.columns.values
listOfPos = list()
result = data.isin([value])
result.columns = colnames
seriesdata = result.any()
columnNames = list(seriesdata[seriesdata==True].index)
for col in columnNames:
rows = list(result[col][result[col]==True].index)
for row in rows:
listOfPos.append((row,col))
return listOfPos
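# A shorter numpy formulation of getIndexes, left here as a sketch (returns
# the same (row_label, column_label) pairs for cells equal to `value`):
#   rows, cols = np.nonzero(data.values == value)
#   return [(data.index[r], data.columns[c]) for r, c in zip(rows, cols)]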
def loadFiles(df):
# load a text file in to np matrix
resultList = []
f = open(df,'r')
for line in f:
line = line.rstrip('\n')
aVals = line.split('\t')
fVals = list(map(np.float32, aVals)) # convert each tab-separated field to float32
resultList.append(fVals)
f.close()
return np.asarray(resultList, dtype=np.float32)
# TODO: put the subprocess call in a 'def'
#def repairPDB():
# subprocess.call(['foldx'
# , '--command=RepairPDB'
# , '--pdb-dir=' + indir
# , '--pdb=' + actual_pdb_filename
# , '--ionStrength=0.05'#
# , '--pH=7'
# , '--water=PREDICT'
# , '--vdwDesign=1'
# , 'outPDB=true'
# , '--output-dir=' + process_dir])
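# A sketch of that TODO'd wrapper (untested; mirrors the inline RepairPDB call
# in main() and uses the module-level indir/actual_pdb_filename/process_dir):
def repairPDB():
    subprocess.call(['foldx'
                     , '--command=RepairPDB'
                     , '--ionStrength=0.05'
                     , '--pH=7'
                     , '--water=PREDICT'
                     , '--vdwDesign=1'
                     , '--pdb-dir=' + indir
                     , '--pdb=' + actual_pdb_filename
                     , 'outPDB=true'
                     , '--output-dir=' + process_dir])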
#=======================================================================
def main():
pdbname = pdb_name
comp = '' # for complex only
mut_filename = infile_muts #pnca_mcsm_snps.csv
mutlist = formatMuts(mut_filename, pdbname)
print(mutlist)
nmuts = len(mutlist)
print(nmuts)
print(mutlist)
print('start')
# the shell route below is retained for reference only; the foldx subprocess
# call in the next stage is what actually runs
#print('\033[95mSTAGE: repair PDB\033[0m')
#subprocess.check_output(['bash', 'repairPDB.sh', indir, actual_pdb_filename, process_dir])
# once you decide to use the function:
# repairPDB()
# some common parameters for foldX
foldx_common = ['--ionStrength=0.05', '--pH=7', '--water=PREDICT', '--vdwDesign=1'] # a list, so each option reaches foldx as its own argument rather than one space-joined string
print('\033[95mSTAGE: repair PDB (foldx subprocess) \033[0m')
print('Running foldx RepairPDB for WT')
foldx_RepairPDB = ['foldx'
, '--command=RepairPDB'
, *foldx_common
# , '--pdb-dir=' + os.path.dirname(pdb_filename)
, '--pdb-dir=' + indir
, '--pdb=' + actual_pdb_filename
, 'outPDB=true'
, '--output-dir=' + process_dir]
print('CMD:', foldx_RepairPDB)
subprocess.call(foldx_RepairPDB)
print('\033[95mCOMPLETED STAGE: repair PDB\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Foldx commands BM, PN and SD (foldx subprocess) for WT\033[0m')
print('Running foldx BuildModel for WT')
foldx_BuildModel = ['foldx'
, '--command=BuildModel'
, *foldx_common
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--mutant-file=' + process_dir + '/' + 'individual_list_' + pdbname +'.txt'
, 'outPDB=true'
, '--numberOfRuns=1'
, '--output-dir=' + process_dir]
print('CMD:', foldx_BuildModel)
subprocess.call( foldx_BuildModel, cwd=process_dir)
print('Running foldx PrintNetworks for WT')
foldx_PrintNetworks = ['foldx'
, '--command=PrintNetworks'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir]
print('CMD:', foldx_PrintNetworks)
subprocess.call(foldx_PrintNetworks, cwd=process_dir)
print('Running foldx SequenceDetail for WT')
foldx_SequenceDetail = ['foldx'
, '--command=SequenceDetail'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir]
print('CMD:', foldx_SequenceDetail)
subprocess.call(foldx_SequenceDetail , cwd=process_dir)
print('\033[95mCOMPLETED STAGE: Foldx commands BM, PN and SD\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Print Networks (foldx subprocess) for MT\033[0m')
for n in range(1,nmuts+1):
print('\033[95mNETWORK:\033[0m', n)
print('Running foldx PrintNetworks for mutation', n)
foldx_PrintNetworksMT = ['foldx'
, '--command=PrintNetworks'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair_' + str(n) + '.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir]
print('CMD:', foldx_PrintNetworksMT)
subprocess.call( foldx_PrintNetworksMT , cwd=process_dir)
print('\033[95mCOMPLETED STAGE: Print Networks (foldx subprocess) for MT\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Rename Mutation Files (shell)\033[0m')
for n in range(1,nmuts+1):
print('\033[95mMUTATION:\033[0m', n)
print('\033[96mCommand:\033[0m mutrenamefiles.sh %s %s %s' % (pdbname, str(n), process_dir ))
#FIXME: bad design and needs to be done in a pythonic way
with suppress(Exception):
subprocess.check_output(['bash', 'mutrenamefiles.sh', pdbname, str(n), process_dir])
print('\033[95mCOMPLETED STAGE: Rename Mutation Files (shell)\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Rename Files (shell) for WT\033[0m')
# FIXME: this is bad design and needs to be done in a pythonic way
out = subprocess.check_output(['bash','renamefiles.sh', pdbname, process_dir])
print('\033[95mCOMPLETED STAGE: Rename Files (shell) for WT\033[0m')
print('\n==========================================================')
if comp=='y':
print('\033[95mSTAGE: Running foldx AnalyseComplex (foldx subprocess) for WT\033[0m')
chain1=chainA
chain2=chainB
foldx_AnalyseComplex = ['foldx'
, '--command=AnalyseComplex'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--analyseComplexChains=' + chain1 + ',' + chain2
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir]
print('CMD:',foldx_AnalyseComplex)
subprocess.call(foldx_AnalyseComplex, cwd=process_dir)
# FIXME why would we ever need to do this?!? Cargo-culted from runcomplex.sh
ac_source = process_dir + '/Summary_' + pdbname + '_Repair_AC.fxout'
ac_dest = process_dir + '/Summary_' + pdbname + '_Repair_AC.txt'
shutil.copyfile(ac_source, ac_dest)
print('\033[95mCOMPLETED STAGE: foldx AnalyseComplex (subprocess) for WT\033[0m')
for n in range(1,nmuts+1):
print('\033[95mSTAGE: Running foldx AnalyseComplex (foldx subprocess) for mutation:\033[0m', n)
foldx_AnalyseComplex = ['foldx'
, '--command=AnalyseComplex'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair_' + str(n) + '.pdb'
, '--analyseComplexChains=' + chain1 + ',' + chain2
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir]
print('CMD:', foldx_AnalyseComplex)
subprocess.call( foldx_AnalyseComplex , cwd=process_dir)
# FIXME why would we ever need to do this?!? Cargo-culted from runcomplex.sh
ac_mut_source = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) +'_AC.fxout'
ac_mut_dest = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) +'_AC.txt'
shutil.copyfile(ac_mut_source, ac_mut_dest)
print('\033[95mCOMPLETED STAGE: foldx AnalyseComplex (subprocess) for mutation:\033[0m', n)
print('\n==========================================================')
interactions = ['Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS', 'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
dGdatafile = process_dir + '/Dif_' + pdbname + '_Repair.txt'
dGdata = pd.read_csv(dGdatafile, sep = '\t')
ddG=[]
print('ddG')
print(len(dGdata))
for i in range(0,len(dGdata)):
ddG.append(dGdata['total energy'].loc[i])
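# NOTE: the loop above is equivalent to pulling the whole column at once:
#   ddG = dGdata['total energy'].tolist()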
nint = len(interactions)
wt_int = []
for i in interactions:
filename = process_dir + '/Matrix_' + i + '_'+ pdbname + '_Repair_PN.txt'
wt_int.append(getInteractions(filename))
print('wt')
print(wt_int)
ntotal = nint+1
print(ntotal)
print(nmuts)
data = np.empty((ntotal,nmuts))
data[0] = ddG
print(data)
for i in range(0,len(interactions)):
d=[]
p=0
for n in range(1, nmuts+1):
print(i)
filename = process_dir + '/Matrix_' + interactions[i] + '_' + pdbname + '_Repair_' + str(n) + '_PN.txt'
mut = getInteractions(filename)
diff = wt_int[i] - mut
print(diff)
print(wt_int[i])
print(mut)
d.append(diff)
print(d)
data[i+1] = d
interactions = ['ddG', 'Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS', 'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
print(interactions)
IE = []
if comp=='y':
wtfilename = process_dir + '/Summary_' + pdbname + '_Repair_AC.txt'
wtE = getInteractionEnergy(wtfilename)
print(wtE)
for n in range(1,nmuts+1):
print(n)
filename = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) + '_AC.txt'
mutE = getInteractionEnergy(filename)
print(mutE)
diff = wtE - mutE
print(diff)
IE.append(diff)
print(IE)
IEresults = pd.DataFrame(IE,columns = ['Interaction Energy'], index = mutlist)
IEfilename = 'foldx_complexresults_'+pdbname+'.csv'
IEresults.to_csv(IEfilename)
print(len(IE))
data = np.append(data,[IE], axis = 0)
print(data)
interactions = ['ddG','Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS','Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS','Interaction Energy']
mut_file = process_dir + '/individual_list_' + pdbname + '.txt'
with open(mut_file) as csvfile:
readCSV = csv.reader(csvfile)
mutlist = []
for row in readCSV:
mut = row[0]
mutlist.append(mut)
print(mutlist)
print(len(mutlist))
print(data)
results = pd.DataFrame(data, columns = mutlist, index = interactions)
#print(results.head())
# my style formatted results
results2 = results.T # transpose df
results2.index.name = 'mutationinformation' # assign name to index
results2 = results2.reset_index() # turn it into a columns
results2['mutationinformation'] = results2['mutationinformation'].replace({r'([A-Z]{1})[A-Z]{1}([0-9]+[A-Z]{1});' : r'\1 \2'}, regex = True) # capture mcsm style muts (i.e not the chain id)
results2['mutationinformation'] = results2['mutationinformation'].str.replace(' ', '') # remove empty space
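# Worked example of the regex above, with a hypothetical entry 'MA123T;':
#   re.sub(r'([A-Z]{1})[A-Z]{1}([0-9]+[A-Z]{1});', r'\1 \2', 'MA123T;')  # -> 'M 123T'
# group 1 keeps the wild-type residue, the unconsumed single [A-Z] is the
# chain ID that formatMuts inserted, and group 2 keeps position + mutant
# residue; combined with the str.replace(' ', '') above, 'MA123T;' becomes 'M123T'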
results2.rename(columns = {'Distances': 'Contacts'}, inplace = True)
# lower case columns
results2.columns = results2.columns.str.lower()
print('Writing file in the format below:\n'
, results2.head()
, '\nNo. of rows:', len(results2)
, '\nNo. of cols:', len(results2.columns))
outputfilename = outfile_foldx
#outputfilename = 'foldx_results_' + pdbname + '.csv'
#results.to_csv(outputfilename)
results2.to_csv(outputfilename, index = False)
print ('end')
if __name__ == '__main__':
main()

466
foldx/runFoldx5.py Executable file

@ -0,0 +1,466 @@
#!/usr/bin/env python3
import subprocess
import os
import sys
import numpy as np
import pandas as pd
from contextlib import suppress
from pathlib import Path
import re
import csv
import argparse
import shutil
import time
#https://realpython.com/python-pathlib/
# FIXME
#strong dependency of file and path names
#cannot pass file with path. Need to pass them separately
#assumptions made for dir struc as standard
#datadir + drug + input
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
#os.chdir(homedir + '/git/LSHTM_analysis/foldx/')
#os.getcwd()
#=======================================================================
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug', help = 'drug name', default = None)
arg_parser.add_argument('-g', '--gene', help = 'gene name (case sensitive)', default = None)
arg_parser.add_argument('--datadir', help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir', help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
arg_parser.add_argument('-p', '--process_dir', help = 'Temp processing dir for running foldX. By default, it assumes homedir + <drug> + processing. Make sure it is somewhere with LOTS of storage as it writes all output!') #FIXME
arg_parser.add_argument('-P', '--pdb_file', help = 'PDB File to process. By default, it assumes a file called <gene>_complex.pdb in input_dir')
arg_parser.add_argument('-m', '--mutation_file', help = 'Mutation list. By default, assumes a file called <gene>_mcsm_formatted_snps.csv exists')
# FIXME: Doesn't work with 2 chains yet!
arg_parser.add_argument('-c1', '--chain1', help = 'Chain1 ID', default = 'A') # case sensitive
arg_parser.add_argument('-c2', '--chain2', help = 'Chain2 ID', default = 'B') # case sensitive
args = arg_parser.parse_args()
#=======================================================================
#%% variable assignment: input and output
#drug = 'pyrazinamide'
#gene = 'pncA'
#gene_match = gene + '_p.'
#%%=====================================================================
# Command line options
drug = args.drug
gene = args.gene
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
process_dir = args.process_dir
mut_filename = args.mutation_file
chainA = args.chain1
chainB = args.chain2
pdb_filename = args.pdb_file
# os.path.splitext will fail interestingly with file.pdb.txt.zip
#pdb_name = os.path.splitext(pdb_file)[0]
# Just the filename, thanks
#pdb_name = Path(in_filename_pdb).stem
# Handle the case where neither 'drug'
# nor (indir,outdir,process_dir) are defined
if not drug:
if not indir or not outdir or not process_dir:
print('ERROR: if "drug" is not specified, you must specify Input, Output, and Process directories')
sys.exit()
#==============
# directories
#==============
if not datadir:
datadir = homedir + '/' + 'git/Data'
if not indir:
indir = datadir + '/' + drug + '/input'
if not outdir:
outdir = datadir + '/' + drug + '/output'
#TODO: perhaps better handled by refactoring code to prevent generating lots of output files!
if not process_dir:
process_dir = datadir + '/' + drug + '/processing'
# Make all paths absolute in case the user forgot
indir = os.path.abspath(indir)
process_dir = os.path.abspath(process_dir)
outdir = os.path.abspath(outdir)
datadir = os.path.abspath(datadir)
#=======
# input
#=======
# FIXME
if pdb_filename:
pdb_filename = os.path.abspath(pdb_filename)
pdb_name = Path(pdb_filename).stem
infile_pdb = pdb_filename
else:
pdb_filename = gene.lower() + '_complex.pdb'
pdb_name = Path(pdb_filename).stem
infile_pdb = indir + '/' + pdb_filename
actual_pdb_filename = Path(infile_pdb).name
if mut_filename:
mutation_file = os.path.abspath(mut_filename)
infile_muts = mutation_file
print('User-provided mutation file in use:', infile_muts)
else:
mutation_file = gene.lower() + '_mcsm_formatted_snps.csv'
infile_muts = outdir + '/' + mutation_file
print('WARNING: Assuming default mutation file:', infile_muts)
#=======
# output
#=======
out_filename = gene.lower() + '_foldx.csv'
outfile_foldx = outdir + '/' + out_filename
print('Arguments being passed:'
, '\nDrug:', args.drug
, '\ngene:', args.gene
, '\ninput dir:', indir
, '\nprocess dir:', process_dir
, '\noutput dir:', outdir
, '\npdb file:', infile_pdb
, '\npdb name:', pdb_name
, '\nactual pdb name:', actual_pdb_filename
, '\nmutation file:', infile_muts
, '\nchain1:', args.chain1
, '\noutput file:', outfile_foldx
, '\n=============================================================')
#### Delay for 10 seconds to check the params ####
print('Sleeping for 10 seconds to give you time to cancel')
time.sleep(10)
#=======================================================================
def getInteractionEnergy(filename):
data = pd.read_csv(filename,sep = '\t')
return data['Interaction Energy'].loc[0]
def getInteractions(filename):
data = pd.read_csv(filename, index_col = 0, header = 0, sep = '\t')
contactList = getIndexes(data,1)
number = len(contactList)
return number
def formatMuts(mut_file,pdbname):
with open(mut_file) as csvfile:
readCSV = csv.reader(csvfile)
muts = []
for row in readCSV:
mut = row[0]
muts.append(mut)
mut_list = []
outfile = process_dir + '/individual_list_' + pdbname + '.txt'
with open(outfile, 'w') as output:
for m in muts:
print(m)
mut = m[:1] + chainA+ m[1:]
mut_list.append(mut)
mut = mut + ';'
print(mut)
output.write(mut)
output.write('\n')
return mut_list
def getIndexes(data, value):
colnames = data.columns.values
listOfPos = list()
result = data.isin([value])
result.columns = colnames
seriesdata = result.any()
columnNames = list(seriesdata[seriesdata==True].index)
for col in columnNames:
rows = list(result[col][result[col]==True].index)
for row in rows:
listOfPos.append((row,col))
return listOfPos
def loadFiles(df):
# load a text file in to np matrix
resultList = []
f = open(df,'r')
for line in f:
line = line.rstrip('\n')
aVals = line.split('\t')
fVals = list(map(np.float32, aVals)) # convert each tab-separated field to float32
resultList.append(fVals)
f.close()
return np.asarray(resultList, dtype=np.float32)
# TODO: put the subprocess call in a 'def'
#def repairPDB():
# subprocess.call(['foldx'
# , '--command=RepairPDB'
# , '--pdb-dir=' + indir
# , '--pdb=' + actual_pdb_filename
# , '--ionStrength=0.05'#
# , '--pH=7'
# , '--water=PREDICT'
# , '--vdwDesign=1'
# , 'outPDB=true'
# , '--output-dir=' + process_dir])
#=======================================================================
def main():
pdbname = pdb_name
comp = '' # for complex only
mut_filename = infile_muts #pnca_mcsm_snps.csv
mutlist = formatMuts(mut_filename, pdbname)
print(mutlist)
nmuts = len(mutlist)
print(nmuts)
print(mutlist)
print('start')
# some common parameters for foldX
foldx_common = ['--ionStrength=0.05', '--pH=7', '--water=PREDICT', '--vdwDesign=1'] # a list, so each option reaches foldx5 as its own argument rather than one space-joined string
print('\033[95mSTAGE: repair PDB (foldx subprocess) \033[0m')
print('Running foldx RepairPDB for WT')
subprocess.call(['foldx5'
, '--command=RepairPDB'
, *foldx_common
, '--pdb-dir=' + os.path.dirname(pdb_filename)
, '--pdb=' + actual_pdb_filename
, 'outPDB=true'
, '--output-dir=' + process_dir])
print('\033[95mCOMPLETED STAGE: repair PDB\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Foldx commands BM, PN and SD (foldx subprocess) for WT\033[0m')
print('Running foldx BuildModel for WT')
subprocess.call(['foldx5'
, '--command=BuildModel'
, *foldx_common
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--mutant-file="individual_list_' + pdbname +'.txt"'
, 'outPDB=true'
, '--numberOfRuns=1'
, '--output-dir=' + process_dir], cwd=process_dir)
print('Running foldx PrintNetworks for WT')
subprocess.call(['foldx5'
, '--command=PrintNetworks'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
print('Running foldx SequenceDetail for WT')
subprocess.call(['foldx5'
, '--command=SequenceDetail'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
print('\033[95mCOMPLETED STAGE: Foldx commands BM, PN and SD\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Print Networks (foldx subprocess) for MT\033[0m')
for n in range(1,nmuts+1):
print('\033[95mNETWORK:\033[0m', n)
print('Running foldx PrintNetworks for mutation', n)
subprocess.call(['foldx5'
, '--command=PrintNetworks'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair_' + str(n) + '.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
print('\033[95mCOMPLETED STAGE: Print Networks (foldx subprocess) for MT\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Rename Mutation Files (shell)\033[0m')
for n in range(1,nmuts+1):
print('\033[95mMUTATION:\033[0m', n)
print('\033[96mCommand:\033[0m mutrenamefiles.sh %s %s %s' % (pdbname, str(n), process_dir ))
#FIXME: bad design and needs to be done in a pythonic way
with suppress(Exception):
subprocess.check_output(['bash', 'mutrenamefiles.sh', pdbname, str(n), process_dir])
print('\033[95mCOMPLETED STAGE: Rename Mutation Files (shell)\033[0m')
print('\n==========================================================')
print('\033[95mSTAGE: Rename Files (shell) for WT\033[0m')
# FIXME: this is bad design and needs to be done in a pythonic way
out = subprocess.check_output(['bash','renamefiles.sh', pdbname, process_dir])
print('\033[95mCOMPLETED STAGE: Rename Files (shell) for WT\033[0m')
print('\n==========================================================')
if comp=='y':
print('\033[95mSTAGE: Running foldx AnalyseComplex (foldx subprocess) for WT\033[0m')
chain1=chainA
chain2=chainB
subprocess.call(['foldx5'
, '--command=AnalyseComplex'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--analyseComplexChains=' + chain1 + ',' + chain2
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
# FIXME why would we ever need to do this?!? Cargo-culted from runcomplex.sh
ac_source = process_dir + '/Summary_' + pdbname + '_Repair_AC.fxout'
ac_dest = process_dir + '/Summary_' + pdbname + '_Repair_AC.txt'
shutil.copyfile(ac_source, ac_dest)
print('\033[95mCOMPLETED STAGE: foldx AnalyseComplex (subprocess) for WT\033[0m')
for n in range(1,nmuts+1):
print('\033[95mSTAGE: Running foldx AnalyseComplex (foldx subprocess) for mutation:\033[0m', n)
subprocess.call(['foldx5'
, '--command=AnalyseComplex'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair_' + str(n) + '.pdb'
, '--analyseComplexChains=' + chain1 + ',' + chain2
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
# FIXME why would we ever need to do this?!? Cargo-culted from runcomplex.sh
ac_mut_source = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) +'_AC.fxout'
ac_mut_dest = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) +'_AC.txt'
shutil.copyfile(ac_mut_source, ac_mut_dest)
print('\033[95mCOMPLETED STAGE: foldx AnalyseComplex (subprocess) for mutation:\033[0m', n)
print('\n==========================================================')
interactions = ['Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS', 'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
dGdatafile = process_dir + '/Dif_' + pdbname + '_Repair.txt'
dGdata = pd.read_csv(dGdatafile, sep = '\t')
ddG=[]
print('ddG')
print(len(dGdata))
for i in range(0,len(dGdata)):
ddG.append(dGdata['total energy'].loc[i])
nint = len(interactions)
wt_int = []
for i in interactions:
filename = process_dir + '/Matrix_' + i + '_'+ pdbname + '_Repair_PN.txt'
wt_int.append(getInteractions(filename))
print('wt')
print(wt_int)
ntotal = nint+1
print(ntotal)
print(nmuts)
data = np.empty((ntotal,nmuts))
data[0] = ddG
print(data)
for i in range(0,len(interactions)):
d=[]
p=0
for n in range(1, nmuts+1):
print(i)
filename = process_dir + '/Matrix_' + interactions[i] + '_' + pdbname + '_Repair_' + str(n) + '_PN.txt'
mut = getInteractions(filename)
diff = wt_int[i] - mut
print(diff)
print(wt_int[i])
print(mut)
d.append(diff)
print(d)
data[i+1] = d
interactions = ['ddG', 'Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS', 'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
print(interactions)
IE = []
if comp=='y':
wtfilename = process_dir + '/Summary_' + pdbname + '_Repair_AC.txt'
wtE = getInteractionEnergy(wtfilename)
print(wtE)
for n in range(1,nmuts+1):
print(n)
filename = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) + '_AC.txt'
mutE = getInteractionEnergy(filename)
print(mutE)
diff = wtE - mutE
print(diff)
IE.append(diff)
print(IE)
IEresults = pd.DataFrame(IE,columns = ['Interaction Energy'], index = mutlist)
IEfilename = 'foldx_complexresults_'+pdbname+'.csv'
IEresults.to_csv(IEfilename)
print(len(IE))
data = np.append(data,[IE], axis = 0)
print(data)
interactions = ['ddG','Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS','Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS','Interaction Energy']
mut_file = process_dir + '/individual_list_' + pdbname + '.txt'
with open(mut_file) as csvfile:
readCSV = csv.reader(csvfile)
mutlist = []
for row in readCSV:
mut = row[0]
mutlist.append(mut)
print(mutlist)
print(len(mutlist))
print(data)
results = pd.DataFrame(data, columns = mutlist, index = interactions)
#print(results.head())
# my style formatted results
results2 = results.T # transpose df
results2.index.name = 'mutationinformation' # assign name to index
results2 = results2.reset_index() # turn it into a columns
results2['mutationinformation'] = results2['mutationinformation'].replace({r'([A-Z]{1})[A-Z]{1}([0-9]+[A-Z]{1});' : r'\1 \2'}, regex = True) # capture mcsm style muts (i.e not the chain id)
results2['mutationinformation'] = results2['mutationinformation'].str.replace(' ', '') # remove empty space
results2.rename(columns = {'Distances': 'Contacts'}, inplace = True)
# lower case columns
results2.columns = results2.columns.str.lower()
print('Writing file in the format below:\n'
, results2.head()
, '\nNo. of rows:', len(results2)
, '\nNo. of cols:', len(results2.columns))
outputfilename = outfile_foldx
#outputfilename = 'foldx_results_' + pdbname + '.csv'
#results.to_csv(outputfilename)
results2.to_csv(outputfilename, index = False)
print ('end')
if __name__ == '__main__':
main()


@ -0,0 +1,10 @@
PDB=$1
A=$2
B=$3
n=$4
OUTDIR=$5
cd ${OUTDIR}
logger "Running mutruncomplex"
foldx --command=AnalyseComplex --pdb="${PDB}_Repair_${n}.pdb" --analyseComplexChains=${A},${B} --water=PREDICT --vdwDesign=1
cp ${OUTDIR}/Summary_${PDB}_Repair_${n}_AC.fxout ${OUTDIR}/Summary_${PDB}_Repair_${n}_AC.txt
#sed -i .bak -e 1,8d ${OUTDIR}/Summary_${PDB}_Repair_${n}_AC.txt


@ -0,0 +1,9 @@
INDIR=$1
PDB=$2
OUTDIR=$3
cd ${OUTDIR}
logger "Running repairPDB"
#foldx --command=RepairPDB --pdb="${PDB}.pdb" --ionStrength=0.05 --pH=7 --water=PREDICT --vdwDesign=1 outPDB=true --output-dir=${OUTDIR}
foldx --command=RepairPDB --pdb-dir=${INDIR} --pdb=${PDB} --ionStrength=0.05 --pH=7 --water=PREDICT --vdwDesign=1 outPDB=true --output-dir=${OUTDIR}


@ -0,0 +1,7 @@
PDB=$1
n=$2
OUTDIR=$3
logger "Running runPrintNetworks"
cd ${OUTDIR}
foldx --command=PrintNetworks --pdb="${PDB}_Repair_${n}.pdb" --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}


@ -0,0 +1,9 @@
PDB=$1
A=$2
B=$3
OUTDIR=$4
cd ${OUTDIR}
logger "Running runcomplex"
foldx --command=AnalyseComplex --pdb="${PDB}_Repair.pdb" --analyseComplexChains=${A},${B} --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}
cp ${OUTDIR}/Summary_${PDB}_Repair_AC.fxout ${OUTDIR}/Summary_${PDB}_Repair_AC.txt
#sed -i .bak -e 1,8d ${OUTDIR}/Summary_${PDB}_Repair_AC.txt


@ -0,0 +1,9 @@
PDB=$1
OUTDIR=$2
cd ${OUTDIR}
pwd
ls -l
logger "Running runfoldx"
foldx --command=BuildModel --pdb="${PDB}_Repair.pdb" --mutant-file="individual_list_${PDB}.txt" --ionStrength=0.05 --pH=7 --water=PREDICT --vdwDesign=1 --out-pdb=true --numberOfRuns=1 --output-dir=${OUTDIR}
foldx --command=PrintNetworks --pdb="${PDB}_Repair.pdb" --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}
foldx --command=SequenceDetail --pdb="${PDB}_Repair.pdb" --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}


@ -0,0 +1,2 @@
S2C
S2F

63
foldx/test2/mutrenamefiles.sh Executable file

@ -0,0 +1,63 @@
PDB=$1
n=$2
OUTDIR=$3
cd ${OUTDIR}
#cd /home/git/LSHTM_analysis/foldx/test2
cp Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout Matrix_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Hbonds_${PDB}_Repair_${n}_PN.fxout > Matrix_Hbonds_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Distances_${PDB}_Repair_${n}_PN.fxout Matrix_Distances_${PDB}_Repair_${n}_PN.txt
sed -i '1,4d' Matrix_Distances_${PDB}_Repair_${n}_PN.txt
cp Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout Matrix_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Volumetric_${PDB}_Repair_${n}_PN.fxout > Matrix_Volumetric_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Electro_${PDB}_Repair_${n}_PN.fxout Matrix_Electro_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Electro_${PDB}_Repair_${n}_PN.fxout > Matrix_Electro_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout Matrix_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Disulfide_${PDB}_Repair_${n}_PN.fxout > Matrix_Disulfide_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout Matrix_Partcov_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_Partcov_${PDB}_Repair_${n}_PN.fxout > Matrix_Partcov_SS_${PDB}_Repair_${n}_PN.txt
cp Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout Matrix_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -n '5,190p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_RR_${PDB}_Repair_${n}_PN.txt
sed -n '194,379p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_MM_${PDB}_Repair_${n}_PN.txt
sed -n '383,568p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_SM_${PDB}_Repair_${n}_PN.txt
sed -n '572,757p' Matrix_VdWClashes_${PDB}_Repair_${n}_PN.fxout > Matrix_VdWClashes_SS_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.fxout AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Disulfide_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Electro_${PDB}_Repair_${n}_PN.fxout AllAtoms_Electro_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Electro_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.fxout AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Hbonds_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Partcov_${PDB}_Repair_${n}_PN.fxout AllAtoms_Partcov_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Partcov_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.fxout AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_VdWClashes_${PDB}_Repair_${n}_PN.txt
cp AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.fxout AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -i '1,2d' AllAtoms_Volumetric_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.fxout InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_VdWClashes_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Distances_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Distances_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Distances_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Electro_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Electro_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Electro_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Hbonds_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Partcov_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Volumetric_${PDB}_Repair_${n}_PN.txt
cp InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.fxout InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.txt
sed -i '1,5d' InteractingResidues_Disulfide_${PDB}_Repair_${n}_PN.txt

64
foldx/test2/renamefiles.sh Executable file

@ -0,0 +1,64 @@
PDB=$1
OUTDIR=$2
cd ${OUTDIR}
#cd /home/git/LSHTM_analysis/foldx/test2
cp Dif_${PDB}_Repair.fxout Dif_${PDB}_Repair.txt
sed -i '1,8d' Dif_${PDB}_Repair.txt
cp Matrix_Hbonds_${PDB}_Repair_PN.fxout Matrix_Hbonds_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Hbonds_${PDB}_Repair_PN.fxout > Matrix_Hbonds_SS_${PDB}_Repair_PN.txt
cp Matrix_Distances_${PDB}_Repair_PN.fxout Matrix_Distances_${PDB}_Repair_PN.txt
sed -i '1,4d' Matrix_Distances_${PDB}_Repair_PN.txt
cp Matrix_Volumetric_${PDB}_Repair_PN.fxout Matrix_Volumetric_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Volumetric_${PDB}_Repair_PN.fxout > Matrix_Volumetric_SS_${PDB}_Repair_PN.txt
cp Matrix_Electro_${PDB}_Repair_PN.fxout Matrix_Electro_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Electro_${PDB}_Repair_PN.fxout > Matrix_Electro_SS_${PDB}_Repair_PN.txt
cp Matrix_Disulfide_${PDB}_Repair_PN.fxout Matrix_Disulfide_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Disulfide_${PDB}_Repair_PN.fxout > Matrix_Disulfide_SS_${PDB}_Repair_PN.txt
cp Matrix_Partcov_${PDB}_Repair_PN.fxout Matrix_Partcov_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_Partcov_${PDB}_Repair_PN.fxout > Matrix_Partcov_SS_${PDB}_Repair_PN.txt
cp Matrix_VdWClashes_${PDB}_Repair_PN.fxout Matrix_VdWClashes_${PDB}_Repair_PN.txt
sed -n '5,190p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_RR_${PDB}_Repair_PN.txt
sed -n '194,379p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_MM_${PDB}_Repair_PN.txt
sed -n '383,568p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_SM_${PDB}_Repair_PN.txt
sed -n '572,757p' Matrix_VdWClashes_${PDB}_Repair_PN.fxout > Matrix_VdWClashes_SS_${PDB}_Repair_PN.txt
cp AllAtoms_Disulfide_${PDB}_Repair_PN.fxout AllAtoms_Disulfide_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Disulfide_${PDB}_Repair_PN.txt
cp AllAtoms_Electro_${PDB}_Repair_PN.fxout AllAtoms_Electro_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Electro_${PDB}_Repair_PN.txt
cp AllAtoms_Hbonds_${PDB}_Repair_PN.fxout AllAtoms_Hbonds_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Hbonds_${PDB}_Repair_PN.txt
cp AllAtoms_Partcov_${PDB}_Repair_PN.fxout AllAtoms_Partcov_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Partcov_${PDB}_Repair_PN.txt
cp AllAtoms_VdWClashes_${PDB}_Repair_PN.fxout AllAtoms_VdWClashes_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_VdWClashes_${PDB}_Repair_PN.txt
cp AllAtoms_Volumetric_${PDB}_Repair_PN.fxout AllAtoms_Volumetric_${PDB}_Repair_PN.txt
sed -i '1,2d' AllAtoms_Volumetric_${PDB}_Repair_PN.txt
cp InteractingResidues_VdWClashes_${PDB}_Repair_PN.fxout InteractingResidues_VdWClashes_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_VdWClashes_${PDB}_Repair_PN.txt
cp InteractingResidues_Distances_${PDB}_Repair_PN.fxout InteractingResidues_Distances_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Distances_${PDB}_Repair_PN.txt
cp InteractingResidues_Electro_${PDB}_Repair_PN.fxout InteractingResidues_Electro_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Electro_${PDB}_Repair_PN.txt
cp InteractingResidues_Hbonds_${PDB}_Repair_PN.fxout InteractingResidues_Hbonds_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Hbonds_${PDB}_Repair_PN.txt
cp InteractingResidues_Partcov_${PDB}_Repair_PN.fxout InteractingResidues_Partcov_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Partcov_${PDB}_Repair_PN.txt
cp InteractingResidues_Volumetric_${PDB}_Repair_PN.fxout InteractingResidues_Volumetric_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Volumetric_${PDB}_Repair_PN.txt
cp InteractingResidues_Disulfide_${PDB}_Repair_PN.fxout InteractingResidues_Disulfide_${PDB}_Repair_PN.txt
sed -i '1,5d' InteractingResidues_Disulfide_${PDB}_Repair_PN.txt

239965
foldx/test2/rotabase.txt Normal file

File diff suppressed because it is too large

1
foldx/test2/runFoldx.py Symbolic link

@ -0,0 +1 @@
../runFoldx.py

250
foldx/test2/runFoldx_test.py Executable file

@ -0,0 +1,250 @@
#!/usr/bin/env python3
import subprocess
import os
import numpy as np
import pandas as pd
from contextlib import suppress
import re
import csv
def getInteractions(filename):
data = pd.read_csv(filename, index_col=0, header =0, sep="\t")
contactList = getIndexes(data,1)
print(contactList)
number = len(contactList)
return number
def formatMuts(mut_file,pdbname):
with open(mut_file) as csvfile:
readCSV = csv.reader(csvfile)
muts = []
for row in readCSV:
mut = row[0]
muts.append(mut)
mut_list = []
outfile = "/home/tanu/git/LSHTM_analysis/foldx/test2/individual_list_"+pdbname+".txt"
with open(outfile, "w") as output:
for m in muts:
print(m)
mut = m[:1]+'A'+m[1:]
mut_list.append(mut)
mut = mut + ";"
print(mut)
output.write(mut)
output.write("\n")
return mut_list
def getIndexes(data, value):
colnames = data.columns.values
listOfPos = list()
result = data.isin([value])
result.columns=colnames
seriesdata = result.any()
columnNames = list(seriesdata[seriesdata==True].index)
for col in columnNames:
rows = list(result[col][result[col]==True].index)
for row in rows:
listOfPos.append((row,col))
return listOfPos
def loadFiles(df):
# load a text file in to np matrix
resultList = []
f = open(df,'r')
for line in f:
line = line.rstrip('\n')
aVals = line.split("\t")
fVals = list(map(np.float32, aVals)) # convert each tab-separated field to float32
resultList.append(fVals)
f.close()
return np.asarray(resultList, dtype=np.float32)
#=======================================================================
def main():
pdbname = '3pl1'
mut_filename = "pnca_muts_sample.csv"
mutlist = formatMuts(mut_filename, pdbname)
print(mutlist)
nmuts = len(mutlist)+1
print(nmuts)
print(mutlist)
print("start")
output = subprocess.check_output(['bash', 'runfoldx.sh', pdbname])
print("end")
for n in range(1,nmuts):
print(n)
with suppress(Exception):
subprocess.check_output(['bash', 'runPrintNetworks.sh', pdbname,str(n)])
for n in range(1,nmuts):
print(n)
with suppress(Exception):
subprocess.check_output(['bash', 'mutrenamefiles.sh', pdbname,str(n)])
out = subprocess.check_output(['bash','renamefiles.sh',pdbname])
dGdatafile = "/home/tanu/git/LSHTM_analysis/foldx/test2/Dif_"+pdbname+"_Repair.txt"
dGdata = pd.read_csv(dGdatafile, sep="\t")
print(dGdata)
ddG=[]
for i in range(0,len(dGdata)):
ddG.append(dGdata['total energy'].loc[i])
print(ddG)
distfile = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Distances_"+pdbname+"_Repair_PN.txt"
wt_nc = getInteractions(distfile)
elecfileRR = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Electro_RR_"+pdbname+"_Repair_PN.txt"
wt_neRR = getInteractions(elecfileRR)
elecfileMM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Electro_MM_"+pdbname+"_Repair_PN.txt"
wt_neMM = getInteractions(elecfileMM)
elecfileSM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Electro_SM_"+pdbname+"_Repair_PN.txt"
wt_neSM = getInteractions(elecfileSM)
elecfileSS = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Electro_SS_"+pdbname+"_Repair_PN.txt"
wt_neSS = getInteractions(elecfileSS)
disufileRR = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Disulfide_RR_"+pdbname+"_Repair_PN.txt"
wt_ndRR = getInteractions(disufileRR)
disufileMM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Disulfide_MM_"+pdbname+"_Repair_PN.txt"
wt_ndMM = getInteractions(disufileMM)
disufileSM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Disulfide_SM_"+pdbname+"_Repair_PN.txt"
wt_ndSM = getInteractions(disufileSM)
disufileSS = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Disulfide_SS_"+pdbname+"_Repair_PN.txt"
wt_ndSS = getInteractions(disufileSS)
hbndfileRR = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Hbonds_RR_"+pdbname+"_Repair_PN.txt"
wt_nhRR = getInteractions(hbndfileRR)
hbndfileMM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Hbonds_MM_"+pdbname+"_Repair_PN.txt"
wt_nhMM = getInteractions(hbndfileMM)
hbndfileSM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Hbonds_SM_"+pdbname+"_Repair_PN.txt"
wt_nhSM = getInteractions(hbndfileSM)
hbndfileSS = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Hbonds_SS_"+pdbname+"_Repair_PN.txt"
wt_nhSS = getInteractions(hbndfileSS)
partfileRR = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Partcov_RR_"+pdbname+"_Repair_PN.txt"
wt_npRR = getInteractions(partfileRR)
partfileMM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Partcov_MM_"+pdbname+"_Repair_PN.txt"
wt_npMM = getInteractions(partfileMM)
partfileSM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Partcov_SM_"+pdbname+"_Repair_PN.txt"
wt_npSM = getInteractions(partfileSM)
partfileSS = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Partcov_SS_"+pdbname+"_Repair_PN.txt"
wt_npSS = getInteractions(partfileSS)
vdwcfileRR = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_VdWClashes_RR_"+pdbname+"_Repair_PN.txt"
wt_nvRR = getInteractions(vdwcfileRR)
vdwcfileMM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_VdWClashes_MM_"+pdbname+"_Repair_PN.txt"
wt_nvMM = getInteractions(vdwcfileMM)
vdwcfileSM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_VdWClashes_SM_"+pdbname+"_Repair_PN.txt"
wt_nvSM = getInteractions(vdwcfileSM)
vdwcfileSS = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_VdWClashes_SS_"+pdbname+"_Repair_PN.txt"
wt_nvSS = getInteractions(vdwcfileSS)
volufileRR = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Volumetric_RR_"+pdbname+"_Repair_PN.txt"
wt_nvoRR = getInteractions(volufileRR)
volufileMM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Volumetric_MM_"+pdbname+"_Repair_PN.txt"
wt_nvoMM = getInteractions(volufileMM)
volufileSM = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Volumetric_SM_"+pdbname+"_Repair_PN.txt"
wt_nvoSM = getInteractions(volufileSM)
volufileSS = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Volumetric_SS_"+pdbname+"_Repair_PN.txt"
wt_nvoSS = getInteractions(volufileSS)
dnc = []
dneRR = []
dneMM = []
dneSM = []
dneSS = []
dndRR = []
dndMM = []
dndSM = []
dndSS = []
dnhRR = []
dnhMM = []
dnhSM = []
dnhSS = []
dnpRR = []
dnpMM = []
dnpSM = []
dnpSS = []
dnvRR = []
dnvMM = []
dnvSM = []
dnvSS = []
dnvoRR = []
dnvoMM = []
dnvoSM = []
dnvoSS = []
for n in range(1, nmuts+1): # 1-indexed mutant models; range(1, nmuts) skipped the last mutation
filename = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Distances_"+pdbname+"_Repair_" + str(n)+"_PN.txt"
mut_nc = getInteractions(filename)
diffc = wt_nc - mut_nc
dnc.append(diffc)
filename = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Electro_RR_"+pdbname+"_Repair_" + str(n)+"_PN.txt"
mut_neRR = getInteractions(filename)
diffeRR = wt_neRR - mut_neRR
dneRR.append(diffeRR)
filename = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Disulfide_RR_"+pdbname+"_Repair_" + str(n)+"_PN.txt"
mut_ndRR = getInteractions(filename)
diffdRR = wt_ndRR - mut_ndRR
dndRR.append(diffdRR)
filename = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Hbonds_RR_"+pdbname+"_Repair_" + str(n)+"_PN.txt"
mut_nhRR = getInteractions(filename)
diffhRR = wt_nhRR - mut_nhRR
dnhRR.append(diffhRR)
filename = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Partcov_RR_"+pdbname+"_Repair_" + str(n)+"_PN.txt"
mut_npRR = getInteractions(filename)
diffpRR = wt_npRR - mut_npRR
dnpRR.append(diffpRR)
filename = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_VdWClashes_RR_"+pdbname+"_Repair_" + str(n)+"_PN.txt"
mut_nvRR = getInteractions(filename)
diffvRR = wt_nvRR - mut_nvRR
dnvRR.append(diffvRR)
filename = "/home/tanu/git/LSHTM_analysis/foldx/test2/Matrix_Volumetric_RR_"+pdbname+"_Repair_" + str(n)+"_PN.txt"
mut_nvoRR = getInteractions(filename)
diffvoRR = wt_nvoRR - mut_nvoRR
dnvoRR.append(diffvoRR)
print(dnc)
print(dneRR)
print(dndRR)
print(dnhRR)
print(dnpRR)
print(dnvRR)
print(dnvoRR)
results = pd.DataFrame([ddG, dnc, dneRR, dndRR, dnhRR, dnpRR, dnvRR, dnvoRR], columns=mutlist, index=["ddG","contacts","electro","disulfide","hbonds","partcov","VdWClashes","volumetric"])
# NB: ddG is already the first row above; the previous `results.append(ddG)`
# was a discarded no-op (DataFrame.append returns a new frame)
print(results)
results2 = results.T # transpose df
outputfilename = "foldx_results_"+pdbname+".csv"
# results.to_csv(outputfilename)
results2.to_csv(outputfilename)
if __name__ == "__main__":
main()

foldx/test2/runFoldx_test2.py Executable file

@ -0,0 +1,456 @@
#!/usr/bin/env python3
import subprocess
import os
import sys
import numpy as np
import pandas as pd
from contextlib import suppress
from pathlib import Path
import re
import csv
import argparse
import shutil
#https://realpython.com/python-pathlib/
# FIXME
#strong dependency of file and path names
#cannot pass file with path. Need to pass them separately
#assumptions made for dir struc as standard
#datadir + drug + input
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
#os.chdir(homedir + '/git/LSHTM_analysis/foldx/')
#os.getcwd()
#=======================================================================
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug', help = 'drug name', default = None)
arg_parser.add_argument('-g', '--gene', help = 'gene name (case sensitive)', default = None)
arg_parser.add_argument('--datadir', help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir', help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
arg_parser.add_argument('-p', '--process_dir', help = 'Temp processing dir for running foldX. By default, it assumes homedir + <drug> + processing. Make sure it is somewhere with LOTS of storage as it writes all output!') #FIXME
arg_parser.add_argument('-pdb', '--pdb_file', help = 'PDB File to process. By default, it assumes a file called <gene>_complex.pdb in input_dir')
arg_parser.add_argument('-m', '--mutation_file', help = 'Mutation list. By default, assumes a file called <gene>_mcsm_snps.csv exists')
# FIXME: Doesn't work with 2 chains yet!
arg_parser.add_argument('-c1', '--chain1', help = 'Chain1 ID', default = 'A') # case sensitive
arg_parser.add_argument('-c2', '--chain2', help = 'Chain2 ID', default = 'B') # case sensitive
args = arg_parser.parse_args()
#=======================================================================
#%% variable assignment: input and output
#drug = 'pyrazinamide'
#gene = 'pncA'
#gene_match = gene + '_p.'
#%%=====================================================================
# Command line options
drug = args.drug
gene = args.gene
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
process_dir = args.process_dir
mut_filename = args.mutation_file
chainA = args.chain1
chainB = args.chain2
pdb_filename = args.pdb_file
# os.path.splitext will fail interestingly with file.pdb.txt.zip
#pdb_name = os.path.splitext(pdb_file)[0]
# Just the filename, thanks
#pdb_name = Path(in_filename_pdb).stem
#==============
# directories
#==============
if not datadir:
datadir = homedir + '/' + 'git/Data'
if not indir:
indir = datadir + '/' + drug + '/input'
if not outdir:
outdir = datadir + '/' + drug + '/output'
#TODO: perhaps better handled by refactoring code to prevent generating lots of output files!
#if not process_dir:
# process_dir = datadir + '/' + drug + '/processing'
# Make all paths absolute in case the user forgot
indir = os.path.abspath(indir)
process_dir = os.path.abspath(process_dir)
outdir = os.path.abspath(outdir)
datadir = os.path.abspath(datadir)
#=======
# input
#=======
# FIXME
if pdb_filename:
pdb_name = Path(pdb_filename).stem
else:
pdb_filename = gene.lower() + '_complex.pdb'
pdb_name = Path(pdb_filename).stem
infile_pdb = indir + '/' + pdb_filename
actual_pdb_filename = Path(infile_pdb).name
#actual_pdb_filename = os.path.abspath(infile_pdb)
if mut_filename:
mutation_file = os.path.abspath(mut_filename)
infile_muts = mutation_file
print('User-provided mutation file in use:', infile_muts)
else:
mutation_file = gene.lower() + '_mcsm_formatted_snps.csv'
infile_muts = outdir + '/' + mutation_file
print('WARNING: Assuming default mutation file:', infile_muts)
#=======
# output
#=======
out_filename = gene.lower() + '_foldx.csv'
outfile_foldx = outdir + '/' + out_filename
print('Arguments being passed:'
, '\nDrug:', args.drug
, '\ngene:', args.gene
, '\ninput dir:', indir
, '\nprocess dir:', process_dir
, '\noutput dir:', outdir
, '\npdb file:', infile_pdb
, '\npdb name:', pdb_name
, '\nactual pdb name:', actual_pdb_filename
, '\nmutation file:', infile_muts
, '\nchain1:', args.chain1
, '\noutput file:', outfile_foldx
, '\n=============================================================')
#=======================================================================
def getInteractionEnergy(filename):
data = pd.read_csv(filename,sep = '\t')
return data['Interaction Energy'].loc[0]
def getInteractions(filename):
data = pd.read_csv(filename, index_col = 0, header = 0, sep = '\t')
contactList = getIndexes(data,1)
number = len(contactList)
return number
def formatMuts(mut_file,pdbname):
with open(mut_file) as csvfile:
readCSV = csv.reader(csvfile)
muts = []
for row in readCSV:
mut = row[0]
muts.append(mut)
mut_list = []
outfile = process_dir + '/individual_list_' + pdbname + '.txt'
with open(outfile, 'w') as output:
for m in muts:
print(m)
mut = m[:1] + chainA + m[1:]
mut_list.append(mut)
mut = mut + ';'
print(mut)
output.write(mut)
output.write('\n')
return mut_list
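# Illustrative sketch (not part of the original script): with chainA = 'A',
# formatMuts() turns each mcsm-style row into a FoldX individual-list entry by
# splicing the chain id after the wild-type letter and terminating with ';':
# csv row -> mut_list entry -> line written to the individual_list file
# 'S2C' -> 'SA2C' -> 'SA2C;'
# 'L4S' -> 'LA4S' -> 'LA4S;'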
def getIndexes(data, value):
colnames = data.columns.values
listOfPos = list()
result = data.isin([value])
result.columns = colnames
seriesdata = result.any()
columnNames = list(seriesdata[seriesdata==True].index)
for col in columnNames:
rows = list(result[col][result[col]==True].index)
for row in rows:
listOfPos.append((row,col))
return listOfPos
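# Minimal usage sketch for the two helpers above (the toy matrix and residue
# labels are assumptions, not real FoldX output): a FoldX network matrix marks
# an interacting residue pair with 1, so
# toy = pd.DataFrame([[0, 1], [1, 0]], index=['MET1', 'THR2'], columns=['MET1', 'THR2'])
# getIndexes(toy, 1) -> [('THR2', 'MET1'), ('MET1', 'THR2')]
# and a getInteractions()-style count over this matrix would be 2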
def loadFiles(df):
# load a text file in to np matrix
resultList = []
with open(df, 'r') as f:
for line in f:
line = line.rstrip('\n')
aVals = line.split('\t')
fVals = list(map(np.float32, aVals)) # was `sVals`: an undefined name (NameError)
resultList.append(fVals)
return np.asarray(resultList, dtype=np.float32)
# TODO: use this code pattern rather than invoking bash
#def repairPDB():
# subprocess.call(['foldx'
# , '--command=RepairPDB'
# , '--pdb-dir=' + indir
# , '--pdb=' + actual_pdb_filename
# , '--ionStrength=0.05'#
# , '--pH=7'
# , '--water=PREDICT'
# , '--vdwDesign=1'
# , 'outPDB=true'
# , '--output-dir=' + process_dir])
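# A minimal sketch of that TODO (an assumption, not the author's final
# design): one generic wrapper so each FoldX stage below stops repeating
# the same subprocess.call() boilerplate.
#def run_foldx(command, pdb, pdb_dir, extra_args=None):
# args = ['foldx'
# , '--command=' + command
# , '--pdb-dir=' + pdb_dir
# , '--pdb=' + pdb
# , '--water=PREDICT'
# , '--vdwDesign=1'
# , '--output-dir=' + process_dir]
# if extra_args:
# args += extra_args
# return subprocess.call(args, cwd=process_dir)
# e.g. run_foldx('PrintNetworks', pdbname + '_Repair.pdb', process_dir)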
#=======================================================================
def main():
pdbname = pdb_name
comp = '' # for complex only
mut_filename = infile_muts #pnca_mcsm_snps.csv
mutlist = formatMuts(mut_filename, pdbname)
print(mutlist)
nmuts = len(mutlist)
print(nmuts)
print(mutlist)
print('start')
#subprocess.check_output(['bash','repairPDB.sh', pdbname, process_dir])
print('\033[95mSTAGE: repair PDB\033[0m')
print('Repairing %s (input dir: %s, process dir: %s)' % (actual_pdb_filename, indir, process_dir))
#subprocess.check_output(['bash','repairPDB.sh', indir, actual_pdb_filename, process_dir])
# once you decide to use the function
# repairPDB(pdbname)
# FIXME: put this hack elsewhere
# NB: each flag must be its own argv element; a single space-separated string
# would reach foldx as one unrecognised argument
foldx_common = ['--ionStrength=0.05', '--pH=7', '--water=PREDICT', '--vdwDesign=1']
subprocess.call(['foldx'
, '--command=RepairPDB'
, *foldx_common
, '--pdb-dir=' + indir
, '--pdb=' + actual_pdb_filename
, 'outPDB=true'
, '--output-dir=' + process_dir])
print('\033[95mCOMPLETE: repair PDB\033[0m')
print('\033[95mSTAGE: run FoldX (subprocess)\033[0m')
print('Running FoldX BuildModel/PrintNetworks/SequenceDetail for %s in %s' % (pdbname, process_dir))
#output = subprocess.check_output(['bash', 'runfoldx.sh', pdbname, process_dir])
print('Running foldx BuildModel')
subprocess.call(['foldx'
, '--command=BuildModel'
, *foldx_common
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--mutant-file=individual_list_' + pdbname + '.txt' # no embedded quotes: subprocess passes each element verbatim
, 'outPDB=true'
, '--numberOfRuns=1'
, '--output-dir=' + process_dir], cwd=process_dir)
print('Running foldx PrintNetworks')
subprocess.call(['foldx'
, '--command=PrintNetworks'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
print('Running foldx SequenceDetail')
subprocess.call(['foldx'
, '--command=SequenceDetail'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
print('\033[95mCOMPLETE: run FoldX (subprocess)\033[0m')
print('\033[95mSTAGE: Print Networks (shell)\033[0m')
for n in range(1,nmuts+1):
print('\033[95mNETWORK:\033[0m', n)
#print('\033[96mCommand:\033[0m runPrintNetworks.sh %s %s %s' % (pdbname, str(n), process_dir ))
#with suppress(Exception):
#foldx --command=PrintNetworks --pdb="${PDB}_Repair_${n}.pdb" --water=PREDICT --vdwDesign=1 --output-dir=${OUTDIR}
print('Running foldx PrintNetworks for mutation', n)
subprocess.call(['foldx'
, '--command=PrintNetworks'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair_' + str(n) + '.pdb'
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
#subprocess.check_output(['bash', 'runPrintNetworks.sh', pdbname, str(n), process_dir])
print('\033[95mCOMPLETE: Print Networks (shell)\033[0m')
print('\033[95mSTAGE: Rename Mutation Files (shell)\033[0m')
for n in range(1,nmuts+1):
print('\033[95mMUTATION:\033[0m', n)
print('\033[96mCommand:\033[0m mutrenamefiles.sh %s %s %s' % (pdbname, str(n), process_dir ))
# FIXME: this is bad design and needs to be done in a pythonic way
with suppress(Exception):
subprocess.check_output(['bash', 'mutrenamefiles.sh', pdbname, str(n), process_dir])
print('\033[95mCOMPLETE: Rename Mutation Files (shell)\033[0m')
print('\033[95mSTAGE: Rename Files (shell)\033[0m')
# FIXME: this is bad design and needs to be done in a pythonic way
out = subprocess.check_output(['bash','renamefiles.sh', pdbname, process_dir])
print('\033[95mCOMPLETE: Rename Files (shell)\033[0m')
if comp=='y':
print('\033[95mSTAGE: Running foldx AnalyseComplex (subprocess)\033[0m')
chain1=chainA
chain2=chainB
#with suppress(Exception):
#subprocess.check_output(['bash','runcomplex.sh', pdbname, chain1, chain2, process_dir])
subprocess.call(['foldx'
, '--command=AnalyseComplex'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair.pdb'
, '--analyseComplexChains=' + chain1 + ',' + chain2
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
# FIXME why would we ever need to do this?!? Cargo-culted from runcomplex.sh
ac_source = process_dir + '/Summary_' + pdbname + '_Repair_AC.fxout'
ac_dest = process_dir + '/Summary_' + pdbname + '_Repair_AC.txt'
shutil.copyfile(ac_source, ac_dest)
for n in range(1,nmuts+1):
print('\033[95mSTAGE: Running foldx AnalyseComplex (subprocess) for mutation:\033[0m', n)
#with suppress(Exception):
# subprocess.check_output(['bash','mutruncomplex.sh', pdbname, chain1, chain2, str(n), process_dir])
subprocess.call(['foldx'
, '--command=AnalyseComplex'
, '--pdb-dir=' + process_dir
, '--pdb=' + pdbname + '_Repair_' + str(n) + '.pdb'
, '--analyseComplexChains=' + chain1 + ',' + chain2
, '--water=PREDICT'
, '--vdwDesign=1'
, '--output-dir=' + process_dir], cwd=process_dir)
# FIXME why would we ever need to do this?!? Cargo-culted from runcomplex.sh
ac_mut_source = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) +'_AC.fxout'
ac_mut_dest = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) +'_AC.txt' # was '_Repair)': typo that broke the copy destination
shutil.copyfile(ac_mut_source, ac_mut_dest)
print('\033[95mCOMPLETE: foldx AnalyseComplex (subprocess) for mutation:\033[0m', n)
interactions = ['Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS',
'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM',
'VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
dGdatafile = process_dir + '/Dif_' + pdbname + '_Repair.txt'
dGdata = pd.read_csv(dGdatafile, sep = '\t')
ddG=[]
print('ddG')
print(len(dGdata))
for i in range(0,len(dGdata)):
ddG.append(dGdata['total energy'].loc[i])
nint = len(interactions)
wt_int = []
for i in interactions:
filename = process_dir + '/Matrix_' + i + '_'+ pdbname + '_Repair_PN.txt'
wt_int.append(getInteractions(filename))
print('wt')
print(wt_int)
ntotal = nint+1
print(ntotal)
print(nmuts)
data = np.empty((ntotal,nmuts))
data[0] = ddG
print(data)
for i in range(0,len(interactions)):
d=[]
p=0
for n in range(1, nmuts+1):
print(i)
filename = process_dir + '/Matrix_' + interactions[i] + '_' + pdbname + '_Repair_' + str(n) + '_PN.txt'
mut = getInteractions(filename)
diff = wt_int[i] - mut
print(diff)
print(wt_int[i])
print(mut)
d.append(diff)
print(d)
data[i+1] = d
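# sign convention of the diffs above: diff = wt - mut, so e.g. for
# interactions[i] = 'Hbonds_RR' a positive entry means hydrogen bonds lost
# in the mutant and a negative entry means hydrogen bonds gained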
interactions = ['ddG', 'Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS', 'Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS']
print(interactions)
IE = []
if comp=='y':
wtfilename = process_dir + '/Summary_' + pdbname + '_Repair_AC.txt'
wtE = getInteractionEnergy(wtfilename)
print(wtE)
for n in range(1,nmuts+1):
print(n)
filename = process_dir + '/Summary_' + pdbname + '_Repair_' + str(n) + '_AC.txt'
mutE = getInteractionEnergy(filename)
print(mutE)
diff = wtE - mutE
print(diff)
IE.append(diff)
print(IE)
IEresults = pd.DataFrame(IE,columns = ['Interaction Energy'], index = mutlist)
IEfilename = 'foldx_complexresults_'+pdbname+'.csv'
IEresults.to_csv(IEfilename)
print(len(IE))
data = np.append(data,[IE], axis = 0)
print(data)
interactions = ['ddG','Distances','Electro_RR','Electro_MM','Electro_SM','Electro_SS','Disulfide_RR','Disulfide_MM','Disulfide_SM','Disulfide_SS','Hbonds_RR','Hbonds_MM','Hbonds_SM','Hbonds_SS','Partcov_RR','Partcov_MM','Partcov_SM','Partcov_SS','VdWClashes_RR','VdWClashes_MM','VdWClashes_SM','VdWClashes_SS','Volumetric_RR','Volumetric_MM','Volumetric_SM','Volumetric_SS','Interaction Energy']
mut_file = process_dir + '/individual_list_' + pdbname + '.txt'
with open(mut_file) as csvfile:
readCSV = csv.reader(csvfile)
mutlist = []
for row in readCSV:
mut = row[0]
mutlist.append(mut)
print(mutlist)
print(len(mutlist))
print(data)
results = pd.DataFrame(data, columns = mutlist, index = interactions)
# NB: ddG is already row 0 of `data`; the previous `results.append(ddG)` was a discarded no-op
#print(results.head())
# my style formatted results
results2 = results.T # transpose df
results2.index.name = 'mutationinformation' # assign name to index
results2 = results2.reset_index() # turn it into a columns
results2['mutationinformation'] = results2['mutationinformation'].replace({r'([A-Z]{1})[A-Z]{1}([0-9]+[A-Z]{1});' : r'\1 \2'}, regex = True) # capture mcsm style muts (i.e not the chain id)
results2['mutationinformation'] = results2['mutationinformation'].str.replace(' ', '') # remove empty space
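# e.g. 'SA2C;' --regex--> 'S 2C' --space removal--> 'S2C' (chain id and ';' dropped)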
results2.rename(columns = {'Distances': 'Contacts'}, inplace = True)
# lower case columns
results2.columns = results2.columns.str.lower()
print('Writing file in the format below:\n'
, results2.head()
, '\nNo. of rows:', len(results2)
, '\nNo. of cols:', len(results2.columns))
outputfilename = outfile_foldx
#outputfilename = 'foldx_results_' + pdbname + '.csv'
#results.to_csv(outputfilename)
results2.to_csv(outputfilename, index = False)
if __name__ == '__main__':
main()


@ -0,0 +1,3 @@
mutationinformation,ddg,contacts,electro_rr,electro_mm,electro_sm,electro_ss,disulfide_rr,disulfide_mm,disulfide_sm,disulfide_ss,hbonds_rr,hbonds_mm,hbonds_sm,hbonds_ss,partcov_rr,partcov_mm,partcov_sm,partcov_ss,vdwclashes_rr,vdwclashes_mm,vdwclashes_sm,vdwclashes_ss,volumetric_rr,volumetric_mm,volumetric_sm,volumetric_ss
S2C,0.30861700000000003,-2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,0.0,0.0
S2F,-0.6481899999999999,-8.0,-4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-1.0,-1.0,0.0,0.0


@ -0,0 +1,3 @@
mutationinformation,ddg,contacts,electro_rr,electro_mm,electro_sm,electro_ss,disulfide_rr,disulfide_mm,disulfide_sm,disulfide_ss,hbonds_rr,hbonds_mm,hbonds_sm,hbonds_ss,partcov_rr,partcov_mm,partcov_sm,partcov_ss,vdwclashes_rr,vdwclashes_mm,vdwclashes_sm,vdwclashes_ss,volumetric_rr,volumetric_mm,volumetric_sm,volumetric_ss
L4S,5.7629,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,4.0
L159R,1.66524,-56.0,-26.0,0.0,-2.0,-24.0,0.0,0.0,0.0,0.0,-2.0,0.0,-2.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,-1.0,-4.0,0.0,-4.0,0.0


@ -0,0 +1,34 @@
./runFoldx_test2.py -g pncA --datadir /home/tanu/git/LSHTM_analysis/foldx/test2 -i /home/tanu/git/LSHTM_analysis/foldx/test2 -o /home/tanu/git/LSHTM_analysis/foldx/test2/test2_output -p /home/tanu/git/LSHTM_analysis/foldx/test2/test2_process -pdb 3pl1.pdb -m pnca_muts_sample.csv -c1 A
============
# Example 1: pnca
# Delete processing output, copy rotabase.txt and individual_list_3pl1.txt in place, run a test
# get files from test/
============
#
clear; rm -rf test2_process/*; cp individual_list_3pl1.txt test2_process/ ; cp rotabase.txt test2_process/; ./runFoldx_test2.py -g pncA --datadir /home/tanu/git/LSHTM_analysis/foldx/test2 -i /home/tanu/git/LSHTM_analysis/foldx/test2 -o /home/tanu/git/LSHTM_analysis/foldx/test2/test2_output -p ./test2_process -pdb 3pl1.pdb -m /tmp/pnca_test_muts.csv -c1 A
============
# Example 2: gidb
============
clear
rm Unrecognized_molecules.txt
rm -rf test2_process/*
cp rotabase.txt test2_process/
./runFoldx.py \
-g gid \
--datadir /home/tanu/git/LSHTM_analysis/foldx/test2 \
-i /home/tanu/git/LSHTM_analysis/foldx/test2 \
-o /home/tanu/git/LSHTM_analysis/foldx/test2/test2_output \
-p ./test2_process \
-pdb gid_test2.pdb \
-m gid_test_snps.csv \
-c1 A
#==========
clear dir
#==========
rm Unrecognized_molecules.txt
find ~/git/LSHTM_analysis/foldx/test2/test2_process -type f -delete


@ -0,0 +1,361 @@
#!/usr/bin/env python3
#=======================================================================
#TASK:
#=======================================================================
#%% load packages
import os,sys
import subprocess
import argparse
#import requests
import re
#import time
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import numpy as np
from mcsm import *
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
os.chdir(homedir + '/git/LSHTM_analysis/mcsm')
os.getcwd()
#=======================================================================
#%% variable assignment: input and output
#drug = 'pyrazinamide'
#gene = 'pncA'
drug = 'isoniazid'
gene = 'KatG'
#drug = args.drug
#gene = args.gene
gene_match = gene + '_p.'
#==========
# data dir
#==========
datadir = homedir + '/' + 'git/Data'
#=======
# input:
#=======
# 1) result_urls (from outdir)
outdir = datadir + '/' + drug + '/' + 'output'
in_filename = gene.lower() + '_mcsm_output.csv' #(outfile, from mcsm_results.py)
infile = outdir + '/' + in_filename
print('Input filename:', in_filename
, '\nInput path(from output dir):', outdir
, '\n=============================================================')
#=======
# output
#=======
outdir = datadir + '/' + drug + '/' + 'output'
out_filename = gene.lower() + '_complex_mcsm_results.csv'
outfile = outdir + '/' + out_filename
print('Output filename:', out_filename
, '\nOutput path:', outdir
, '\n=============================================================')
#%%=====================================================================
def format_mcsm_output(mcsm_outputcsv):
"""
@param mcsm_outputcsv: file containing mcsm results for all muts,
produced by calling build_result_dict() for each mutation,
converting the results to a pandas df and writing it out as csv.
@type string
@return formatted mcsm output
@type pandas df
"""
#############
# Read file
#############
mcsm_data_raw = pd.read_csv(mcsm_outputcsv, sep = ',')
# strip white space from both ends in all columns
mcsm_data = mcsm_data_raw.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
dforig_shape = mcsm_data.shape
print('dimensions of input file:', dforig_shape)
#############
# rename cols
#############
# format colnames: all lowercase, remove spaces and use '_' to join
print('Assigning meaningful colnames i.e. without spaces and hyphens and reflecting units'
, '\n===================================================================')
my_colnames_dict = {'Predicted Affinity Change': 'PredAffLog' # relevant info from this col will be extracted and the column discarded
, 'Mutation information': 'mutationinformation' # {wild_type}<position>{mutant_type}
, 'Wild-type': 'wild_type' # one letter amino acid code
, 'Position': 'position' # number
, 'Mutant-type': 'mutant_type' # one letter amino acid code
, 'Chain': 'chain' # single letter (caps)
, 'Ligand ID': 'ligand_id' # 3-letter code
, 'Distance to ligand': 'ligand_distance' # angstroms
, 'DUET stability change': 'duet_stability_change'} # in kcal/mol
mcsm_data.rename(columns = my_colnames_dict, inplace = True)
#%%===========================================================================
#################################
# populate mutationinformation
# col which is currently blank
#################################
# populate mutationinformation column:mcsm style muts {WT}<POS>{MUT}
print('Populating column : mutationinformation which is currently empty\n', mcsm_data['mutationinformation'])
mcsm_data['mutationinformation'] = mcsm_data['wild_type'] + mcsm_data['position'].astype(str) + mcsm_data['mutant_type']
print('checking after populating:\n', mcsm_data['mutationinformation']
, '\n===================================================================')
# Remove spaces b/w pasted columns
print('removing white space within column: mutationinformation')
mcsm_data['mutationinformation'] = mcsm_data['mutationinformation'].str.replace(' ', '')
print('Correctly formatted column: mutationinformation\n', mcsm_data['mutationinformation']
, '\n===================================================================')
#%%===========================================================================
#############
# sanity check: drop duplicate muts
#############
# shouldn't exist as this should be eliminated at the time of running mcsm
print('Sanity check:'
, '\nChecking duplicate mutations')
if mcsm_data['mutationinformation'].duplicated().sum() == 0:
print('PASS: No duplicate mutations detected (as expected)'
, '\nDim of data:', mcsm_data.shape
, '\n===============================================================')
else:
print('FAIL (but not fatal): Duplicate mutations detected'
, '\nDim of df with duplicates:', mcsm_data.shape
, 'Removing duplicate entries')
mcsm_data = mcsm_data.drop_duplicates(['mutationinformation'])
print('Dim of data after removing duplicate muts:', mcsm_data.shape
, '\n===============================================================')
#%%===========================================================================
#############
# Create col: duet_outcome
#############
# classification based on DUET stability values
print('Assigning col: duet_outcome based on DUET stability values')
print('Sanity check:')
# count positive values in the DUET column
c = mcsm_data[mcsm_data['duet_stability_change']>=0].count()
DUET_pos = c.get(key = 'duet_stability_change')
# Assign category based on sign (+ve : Stabilising, -ve: Destabilising, Mind the spelling (British spelling))
mcsm_data['duet_outcome'] = np.where(mcsm_data['duet_stability_change']>=0, 'Stabilising', 'Destabilising')
mcsm_data['duet_outcome'].value_counts()
if DUET_pos == mcsm_data['duet_outcome'].value_counts()['Stabilising']:
print('PASS: DUET outcome assigned correctly')
else:
print('FAIL: DUET outcome assigned incorrectly'
, '\nExpected no. of stabilising mutations:', DUET_pos
, '\nGot no. of stabilising mutations', mcsm_data['duet_outcome'].value_counts()['Stabilising']
, '\n===============================================================')
#%%===========================================================================
#############
# Extract numeric
# part of ligand_distance col
#############
# Extract only the numeric part from col: ligand_distance
# number: '-?\d+\.?\d*'
mcsm_data['ligand_distance']
print('extracting numeric part of col: ligand_distance')
mcsm_data['ligand_distance'] = mcsm_data['ligand_distance'].str.extract(r'(\d+\.?\d*)') # raw string avoids invalid-escape warnings
mcsm_data['ligand_distance']
#%%===========================================================================
#############
# Create 2 columns:
# ligand_affinity_change and ligand_outcome
#############
# the numerical and categorical parts need to be extracted from column: PredAffLog
# regex used
# numerical part: '-?\d+\.?\d*'
# categorical part: '\b(\w+ing)\b'
print('Extracting numerical and categorical parts from the col: PredAffLog')
print('to create two columns: ligand_affinity_change and ligand_outcome'
, '\n===================================================================')
# 1) Extracting the predicted affinity change (numerical part)
mcsm_data['ligand_affinity_change'] = mcsm_data['PredAffLog'].str.extract(r'(-?\d+\.?\d*)', expand = True)
print(mcsm_data['ligand_affinity_change'])
# 2) Extracting the categorical part (Destabilizing and Stabilizing) using word boundary ('ing')
#aff_regex = re.compile(r'\b(\w+ing)\b')
mcsm_data['ligand_outcome']= mcsm_data['PredAffLog'].str.extract(r'(\b\w+ing\b)', expand = True)
print(mcsm_data['ligand_outcome'])
print(mcsm_data['ligand_outcome'].value_counts())
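# Illustrative example (the exact raw string is an assumption about the web
# output, not captured data): for PredAffLog = '-0.7 log(affinity fold change) Destabilizing',
# the numeric regex extracts ligand_affinity_change = '-0.7' and the
# r'(\b\w+ing\b)' regex extracts ligand_outcome = 'Destabilizing'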
#############
# changing spelling: British
#############
# ensuring spellings are consistent
american_spl = mcsm_data['ligand_outcome'].value_counts()
print('Changing to British spellings for col: ligand_outcome')
mcsm_data['ligand_outcome'].replace({'Destabilizing': 'Destabilising', 'Stabilizing': 'Stabilising'}, inplace = True)
print(mcsm_data['ligand_outcome'].value_counts())
british_spl = mcsm_data['ligand_outcome'].value_counts()
# compare series values since index will differ from spelling change
check = american_spl.values == british_spl.values
if check.all():
print('PASS: spelling change successful'
, '\nNo. of predicted affinity changes:\n', british_spl
, '\n===============================================================')
else:
print('FAIL: spelling change unsuccessful'
, '\nExpected:\n', american_spl
, '\nGot:\n', british_spl
, '\n===============================================================')
#%%===========================================================================
#############
# ensuring correct dtype columns
#############
# check dtype in cols
print('Checking dtypes in all columns:\n', mcsm_data.dtypes
, '\n===================================================================')
print('Converting the following cols to numeric:'
, '\nligand_distance'
, '\nduet_stability_change'
, '\nligand_affinity_change'
, '\n===================================================================')
# using apply method to change stability and affinity values to numeric
numeric_cols = ['duet_stability_change', 'ligand_affinity_change', 'ligand_distance']
mcsm_data[numeric_cols] = mcsm_data[numeric_cols].apply(pd.to_numeric)
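# NB: pd.to_numeric raises on unparseable values; if silently converting them
# to NaN were preferred (an assumption, not this script's behaviour), use
# mcsm_data[numeric_cols].apply(pd.to_numeric, errors = 'coerce')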
# check dtype in cols
print('checking dtype after conversion')
cols_check = mcsm_data.select_dtypes(include='float64').columns.isin(numeric_cols)
if cols_check.all():
print('PASS: dtypes for selected cols:', numeric_cols
, '\nchanged to numeric'
, '\n===============================================================')
else:
print('FAIL: dtype change to numeric for selected cols unsuccessful'
, '\n===============================================================')
print(mcsm_data.dtypes)
#%%===========================================================================
#############
# scale duet values
#############
# Rescale values in DUET_change col b/w -1 and 1 so negative numbers
# stay neg and pos numbers stay positive
duet_min = mcsm_data['duet_stability_change'].min()
duet_max = mcsm_data['duet_stability_change'].max()
duet_scale = lambda x : x/abs(duet_min) if x < 0 else (x/duet_max if x >= 0 else 'failed')
mcsm_data['duet_scaled'] = mcsm_data['duet_stability_change'].apply(duet_scale)
print('Raw duet scores:\n', mcsm_data['duet_stability_change']
, '\n---------------------------------------------------------------'
, '\nScaled duet scores:\n', mcsm_data['duet_scaled'])
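# Worked example (numbers are illustrative, not from the data): with
# duet_min = -2.0 and duet_max = 4.0, x = -1.0 -> -1.0/2.0 = -0.5 and
# x = 2.0 -> 2.0/4.0 = 0.5, so the most destabilising value maps to -1,
# the most stabilising to +1, and signs are preserved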
#%%===========================================================================
#############
# scale affinity values
#############
# rescale values in affinity change col b/w -1 and 1 so negative numbers
# stay neg and pos numbers stay positive
aff_min = mcsm_data['ligand_affinity_change'].min()
aff_max = mcsm_data['ligand_affinity_change'].max()
aff_scale = lambda x : x/abs(aff_min) if x < 0 else (x/aff_max if x >= 0 else 'failed')
mcsm_data['affinity_scaled'] = mcsm_data['ligand_affinity_change'].apply(aff_scale)
print('Raw affinity scores:\n', mcsm_data['ligand_affinity_change']
, '\n---------------------------------------------------------------'
, '\nScaled affinity scores:\n', mcsm_data['affinity_scaled'])
#=============================================================================
# Adding colname: wild_pos: sometimes useful for plotting and db
print('Creating column: wild_pos')
mcsm_data['wild_pos'] = mcsm_data['wild_type'] + mcsm_data['position'].astype(str)
print(mcsm_data['wild_pos'].head())
# Remove spaces b/w pasted columns
print('removing white space within column: wild_pos')
mcsm_data['wild_pos'] = mcsm_data['wild_pos'].str.replace(' ', '')
print('Correctly formatted column: wild_pos\n', mcsm_data['wild_pos'].head()
, '\n===================================================================')
#=============================================================================
# Adding colname: wild_chain_pos: sometimes useful for plotting and db and is explicit
print('Creating column: wild_chain_pos')
mcsm_data['wild_chain_pos'] = mcsm_data['wild_type'] + mcsm_data['chain'] + mcsm_data['position'].astype(str)
print(mcsm_data['wild_chain_pos'].head())
# Remove spaces b/w pasted columns
print('removing white space within column: wild_chain_pos')
mcsm_data['wild_chain_pos'] = mcsm_data['wild_chain_pos'].str.replace(' ', '')
print('Correctly formatted column: wild_chain_pos\n', mcsm_data['wild_chain_pos'].head()
, '\n===================================================================')
#=============================================================================
#%% ensuring dtypes are string for the non-numeric cols
#) char cols
char_cols = ['PredAffLog', 'mutationinformation', 'wild_type', 'mutant_type', 'chain'
, 'ligand_id', 'duet_outcome', 'ligand_outcome', 'wild_pos', 'wild_chain_pos']
#mcsm_data[char_cols] = mcsm_data[char_cols].astype(str)
cols_check_char = mcsm_data.select_dtypes(include='object').columns.isin(char_cols)
if cols_check_char.all():
print('PASS: dtypes for char cols:', char_cols, 'are indeed string'
, '\n===============================================================')
else:
print('FAIL: unexpected object-dtype (string) columns found outside char_cols'
, '\n===============================================================')
#mcsm_data['ligand_distance', 'ligand_affinity_change'].apply(is_numeric_dtype(mcsm_data['ligand_distance', 'ligand_affinity_change']))
print(mcsm_data.dtypes)
#=============================================================================
# Removing PredAff log column as it is not needed?
print('Removing col: PredAffLog since relevant info has been extracted from it')
mcsm_data_f = mcsm_data.drop(columns = ['PredAffLog'])
#=============================================================================
#sort df by position for convenience
print('Sorting df by position')
mcsm_data_fs = mcsm_data_f.sort_values(by = ['position'])
print('sorted df:\n', mcsm_data_fs.head())
# Ensuring column names are lowercase before output
mcsm_data_fs.columns = mcsm_data_fs.columns.str.lower()
#%%===========================================================================
#############
# sanity check before writing file
#############
expected_ncols_toadd = 5 # beware of hardcoded numbers
dforig_len = dforig_shape[1]
expected_cols = dforig_len + expected_ncols_toadd
if len(mcsm_data_fs.columns) == expected_cols:
print('PASS: formatting successful'
, '\nformatted df has expected no. of cols:', expected_cols
, '\ncolnames:', mcsm_data_fs.columns
, '\n----------------------------------------------------------------'
, '\ndtypes in cols:', mcsm_data_fs.dtypes
, '\n----------------------------------------------------------------'
, '\norig data shape:', dforig_shape
, '\nformatted df shape:', mcsm_data_fs.shape
, '\n===============================================================')
else:
print('FAIL: something went wrong in formatting df'
, '\nLen of orig df:', dforig_len
, '\nExpected number of cols to add:', expected_ncols_toadd
, '\nExpected no. of cols:', expected_cols, '(', dforig_len, '+', expected_ncols_toadd, ')'
, '\nGot no. of cols:', len(mcsm_data_fs.columns)
, '\nCheck formatting:'
, '\ncheck hardcoded value:', expected_ncols_toadd
, '\nis', expected_ncols_toadd, 'the no. of expected cols to add?'
, '\n===============================================================')
return mcsm_data_fs
#=======================================================================
# call function
mcsm_df_formatted = format_mcsm_output(infile)
# writing file
print('Writing formatted df to csv')
mcsm_df_formatted.to_csv(outfile, index = False)
print('Finished writing file:'
, '\nFile', outfile
, '\nExpected no. of rows:', len(mcsm_df_formatted)
, '\nExpected no. of cols:', len(mcsm_df_formatted.columns) # was len(mcsm_df_formatted), which counts rows
, '\n=============================================================')
#%%
#End of script


@ -0,0 +1,310 @@
#!/usr/bin/env python3
#=======================================================================
#TASK:
#=======================================================================
#%% load packages
import os,sys
import subprocess
import argparse
#import requests
import re
#import time
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import numpy as np
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
os.chdir(homedir + '/git/LSHTM_analysis/mcsm')
os.getcwd()
#=======================================================================
#%% variable assignment: input and output
drug = 'pyrazinamide'
gene = 'pncA'
gene_match = gene + '_p.'
#==========
# dirs
#==========
datadir = homedir + '/' + 'git/Data'
indir = datadir + '/' + drug + '/' + 'input'
outdir = datadir + '/' + drug + '/' + 'output'
#=======
# input:
#=======
# 1) result_urls (from outdir)
in_filename_mcsm_output = gene.lower() + '_mcsm_output.csv' #(outfile, from mcsm_results.py)
infile_mcsm_output = outdir + '/' + in_filename_mcsm_output
print('Input file:', infile_mcsm_output
, '\n=============================================================')
#=======
# output
#=======
out_filename_mcsm_norm = gene.lower() + '_complex_mcsm_norm.csv'
outfile_mcsm_norm = outdir + '/' + out_filename_mcsm_norm
print('Output file:', out_filename_mcsm_norm
, '\n=============================================================')
#=======================================================================
print('Reading input file')
mcsm_data_raw = pd.read_csv(infile_mcsm_output, sep = ',')
# strip white space from both ends in all columns
mcsm_data = mcsm_data_raw.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
# PredAffLog = affinity_change_log
# "DUETStability_Kcalpermol = DUET_change_kcalpermol
dforig_shape = mcsm_data.shape
print('dim of infile:', dforig_shape)
#############
# rename cols
#############
# format colnames: all lowercase, remove spaces and use '_' to join
print('Assigning meaningful colnames i.e. without spaces and hyphens and reflecting units'
, '\n===================================================================')
my_colnames_dict = {'Predicted Affinity Change': 'PredAffLog' # relevant info from this col will be extracted and the column discarded
, 'Mutation information': 'mutationinformation' # {wild_type}<position>{mutant_type}
, 'Wild-type': 'wild_type' # one letter amino acid code
, 'Position': 'position' # number
, 'Mutant-type': 'mutant_type' # one letter amino acid code
, 'Chain': 'chain' # single letter (caps)
, 'Ligand ID': 'ligand_id' # 3-letter code
, 'Distance to ligand': 'ligand_distance' # angstroms
, 'DUET stability change': 'duet_stability_change'} # in kcal/mol
mcsm_data.rename(columns = my_colnames_dict, inplace = True)
#%%===========================================================================
# populate mutationinformation column:mcsm style muts {WT}<POS>{MUT}
print('Populating column : mutationinformation which is currently empty\n', mcsm_data['mutationinformation'])
mcsm_data['mutationinformation'] = mcsm_data['wild_type'] + mcsm_data['position'].astype(str) + mcsm_data['mutant_type']
print('checking after populating:\n', mcsm_data['mutationinformation']
, '\n===================================================================')
# Remove spaces b/w pasted columns: not needed as white space removed at the time of import
#print('removing white space within column: \mutationinformation')
#mcsm_data['mutationinformation'] = mcsm_data['mutationinformation'].str.replace(' ', '')
#print('Correctly formatted column: mutationinformation\n', mcsm_data['mutationinformation']
# , '\n===================================================================')
#%% Remove whitespace from column
#orig_dtypes = mcsm_data.dtypes
#https://stackoverflow.com/questions/33788913/pythonic-efficient-way-to-strip-whitespace-from-every-pandas-data-frame-cell-tha/33789292
#mcsm_data.columns = mcsm_data.columns.str.strip()
#new_dtypes = mcsm_data.dtypes
#%%===========================================================================
# very important
print('Sanity check:'
, '\nChecking duplicate mutations')
if mcsm_data['mutationinformation'].duplicated().sum() == 0:
print('PASS: No duplicate mutations detected (as expected)'
, '\nDim of data:', mcsm_data.shape
, '\n===============================================================')
else:
print('FAIL (but not fatal): Duplicate mutations detected'
, '\nDim of df with duplicates:', mcsm_data.shape
, 'Removing duplicate entries')
mcsm_data = mcsm_data.drop_duplicates(['mutationinformation'])
print('Dim of data after removing duplicate muts:', mcsm_data.shape
, '\n===============================================================')
#%%===========================================================================
# create duet_outcome column: classification based on DUET stability values
print('Assigning col: duet_outcome based on DUET stability values')
print('Sanity check:')
# count positive values in the DUET column
c = mcsm_data[mcsm_data['duet_stability_change']>=0].count()
DUET_pos = c.get(key = 'duet_stability_change')
# Assign category based on sign (+ve : Stabilising, -ve: Destabilising, Mind the spelling (British spelling))
mcsm_data['duet_outcome'] = np.where(mcsm_data['duet_stability_change']>=0, 'Stabilising', 'Destabilising')
mcsm_data['duet_outcome'].value_counts()
if DUET_pos == mcsm_data['duet_outcome'].value_counts()['Stabilising']:
print('PASS: DUET outcome assigned correctly')
else:
print('FAIL: DUET outcome assigned incorrectly'
, '\nExpected no. of stabilising mutations:', DUET_pos
, '\nGot no. of stabilising mutations', mcsm_data['duet_outcome'].value_counts()['Stabilising']
, '\n===============================================================')
#%%===========================================================================
# Extract only the numeric part from col: ligand_distance
# number: '-?\d+\.?\d*'
mcsm_data['ligand_distance']
print('extracting numeric part of col: ligand_distance')
mcsm_data['ligand_distance'] = mcsm_data['ligand_distance'].str.extract(r'(\d+\.?\d*)') # raw string avoids invalid-escape warnings
mcsm_data['ligand_distance']
#%%===========================================================================
# create ligand_outcome column: classification based on affinity change values
# the numerical and categorical parts need to be extracted from column: PredAffLog
# regex used
# number: '-?\d+\.?\d*'
# category: '\b(\w+ing)\b'
print('Extracting numerical and categorical parts from the col: PredAffLog')
print('to create two columns: ligand_affinity_change and ligand_outcome'
, '\n===================================================================')
# Extracting the predicted affinity change (numerical part)
mcsm_data['ligand_affinity_change'] = mcsm_data['PredAffLog'].str.extract(r'(-?\d+\.?\d*)', expand = True)
print(mcsm_data['ligand_affinity_change'])
# Extracting the categorical part (Destabilizing and Stabilizing) using word boundary ('ing')
#aff_regex = re.compile(r'\b(\w+ing)\b')
mcsm_data['ligand_outcome']= mcsm_data['PredAffLog'].str.extract(r'(\b\w+ing\b)', expand = True)
print(mcsm_data['ligand_outcome'])
print(mcsm_data['ligand_outcome'].value_counts())
# ensuring spellings are consistent
american_spl = mcsm_data['ligand_outcome'].value_counts()
print('Changing to British spellings for col: ligand_outcome')
mcsm_data['ligand_outcome'].replace({'Destabilizing': 'Destabilising', 'Stabilizing': 'Stabilising'}, inplace = True)
print(mcsm_data['ligand_outcome'].value_counts())
british_spl = mcsm_data['ligand_outcome'].value_counts()
# compare series values since index will differ from spelling change
check = american_spl.values == british_spl.values
if check.all():
print('PASS: spelling change successful'
, '\nNo. of predicted affinity changes:\n', british_spl
, '\n===============================================================')
else:
print('FAIL: spelling change unsuccessful'
, '\nExpected:\n', american_spl
, '\nGot:\n', british_spl
, '\n===============================================================')
#%%===========================================================================
# check dtype in cols: ensure correct dtypes for cols
print('Checking dtypes in all columns:\n', mcsm_data.dtypes
, '\n===================================================================')
#1) numeric cols
print('Converting the following cols to numeric:'
, '\nligand_distance'
, '\nduet_stability_change'
, '\nligand_affinity_change'
, '\n===================================================================')
# using apply method to change stability and affinity values to numeric
numeric_cols = ['duet_stability_change', 'ligand_affinity_change', 'ligand_distance']
mcsm_data[numeric_cols] = mcsm_data[numeric_cols].apply(pd.to_numeric)
# check dtype in cols
print('checking dtype after conversion')
cols_check = mcsm_data.select_dtypes(include='float64').columns.isin(numeric_cols)
if cols_check.all():
print('PASS: dtypes for selected cols:', numeric_cols
, '\nchanged to numeric'
, '\n===============================================================')
else:
print('FAIL: dtype change to numeric for selected cols unsuccessful'
, '\n===============================================================')
#mcsm_data['ligand_distance', 'ligand_affinity_change'].apply(is_numeric_dtype(mcsm_data['ligand_distance', 'ligand_affinity_change']))
print(mcsm_data.dtypes)
#%%===========================================================================
# Normalise values in DUET_change col b/w -1 and 1 so negative numbers
# stay neg and pos numbers stay positive
duet_min = mcsm_data['duet_stability_change'].min()
duet_max = mcsm_data['duet_stability_change'].max()
duet_scale = lambda x : x/abs(duet_min) if x < 0 else (x/duet_max if x >= 0 else 'failed')
mcsm_data['duet_scaled'] = mcsm_data['duet_stability_change'].apply(duet_scale)
print('Raw duet scores:\n', mcsm_data['duet_stability_change']
, '\n---------------------------------------------------------------'
, '\nScaled duet scores:\n', mcsm_data['duet_scaled'])
#%%===========================================================================
# Normalise values in affinity change col b/w -1 and 1 so negative numbers
# stay neg and pos numbers stay positive
aff_min = mcsm_data['ligand_affinity_change'].min()
aff_max = mcsm_data['ligand_affinity_change'].max()
aff_scale = lambda x : x/abs(aff_min) if x < 0 else (x/aff_max if x >= 0 else 'failed')
mcsm_data['ligand_affinity_change']
mcsm_data['affinity_scaled'] = mcsm_data['ligand_affinity_change'].apply(aff_scale)
mcsm_data['affinity_scaled']
print('Raw affinity scores:\n', mcsm_data['ligand_affinity_change']
, '\n---------------------------------------------------------------'
, '\nScaled affinity scores:\n', mcsm_data['affinity_scaled'])
#=============================================================================
# Adding colname: wild_pos: sometimes useful for plotting and db
print('Creating column: wild_pos')
mcsm_data['wild_pos'] = mcsm_data['wild_type'] + mcsm_data['position'].astype(str)
print(mcsm_data['wild_pos'].head())
# Remove spaces b/w pasted columns
print('removing white space within column: wild_pos')
mcsm_data['wild_pos'] = mcsm_data['wild_pos'].str.replace(' ', '')
print('Correctly formatted column: wild_pos\n', mcsm_data['wild_pos'].head()
, '\n===================================================================')
#=============================================================================
#%% Adding colname: wild_chain_pos: sometimes useful for plotting and db and is explicit
print('Creating column: wild_chain_pos')
mcsm_data['wild_chain_pos'] = mcsm_data['wild_type'] + mcsm_data['chain'] + mcsm_data['position'].astype(str)
print(mcsm_data['wild_chain_pos'].head())
# Remove spaces b/w pasted columns
print('removing white space within column: wild_chain_pos')
mcsm_data['wild_chain_pos'] = mcsm_data['wild_chain_pos'].str.replace(' ', '')
print('Correctly formatted column: wild_chain_pos\n', mcsm_data['wild_chain_pos'].head()
, '\n===================================================================')
#=============================================================================
#%% ensuring dtypes are string for the non-numeric cols
#) char cols
char_cols = ['PredAffLog', 'mutationinformation', 'wild_type', 'mutant_type', 'chain'
, 'ligand_id', 'duet_outcome', 'ligand_outcome', 'wild_pos', 'wild_chain_pos']
#mcsm_data[char_cols] = mcsm_data[char_cols].astype(str)
cols_check_char = mcsm_data.select_dtypes(include='object').columns.isin(char_cols)
if cols_check_char.all():
print('PASS: dtypes for char cols:', char_cols, 'are indeed string'
, '\n===============================================================')
else:
print('FAIL: unexpected object-dtype (string) columns found outside char_cols'
, '\n===============================================================')
#mcsm_data['ligand_distance', 'ligand_affinity_change'].apply(is_numeric_dtype(mcsm_data['ligand_distance', 'ligand_affinity_change']))
print(mcsm_data.dtypes)
#%%
#=============================================================================
#%% Removing PredAff log column as it is not needed?
print('Removing col: PredAffLog since relevant info has been extracted from it')
mcsm_data_f = mcsm_data.drop(columns = ['PredAffLog'])
print(mcsm_data_f.head())
#=============================================================================
#%% sort df by position for convenience
print('Sorting df by position')
mcsm_data_fs = mcsm_data_f.sort_values(by = ['position'])
print('sorted df:\n', mcsm_data_fs.head())
#%%===========================================================================
expected_ncols_toadd = 6 # beware of hardcoded numbers
dforig_len = dforig_shape[1]
expected_cols = dforig_len + expected_ncols_toadd
if len(mcsm_data_fs.columns) == expected_cols:
print('PASS: formatting successful'
, '\nformatted df has expected no. of cols:', expected_cols
, '\ncolnames:', mcsm_data_fs.columns
, '\n----------------------------------------------------------------'
, '\ndtypes in cols:', mcsm_data_fs.dtypes
, '\n----------------------------------------------------------------'
, '\norig data shape:', dforig_shape
, '\nformatted df shape:', mcsm_data_fs.shape
, '\n===============================================================')
else:
print('FAIL: something went wrong in formatting df'
, '\nLen of orig df:', dforig_len
, '\nExpected number of cols to add:', expected_ncols_toadd
, '\nExpected no. of cols:', expected_cols, '(', dforig_len, '+', expected_ncols_toadd, ')'
, '\nGot no. of cols:', len(mcsm_data_fs.columns)
, '\nCheck formatting:'
, '\ncheck hardcoded value:', expected_ncols_toadd
, '\nis', expected_ncols_toadd, 'the no. of expected cols to add?'
, '\n===============================================================')
#%%============================================================================
# Ensuring column names are lowercase before output
mcsm_data_fs.columns = mcsm_data_fs.columns.str.lower()
# writing file
print('Writing formatted df to csv')
mcsm_data_fs.to_csv(outfile_mcsm_norm, index = False)
print('Finished writing file:'
, '\nFile:', outfile_mcsm_norm
, '\nExpected no. of rows:', len(mcsm_data_fs)
, '\nExpected no. of cols:', len(mcsm_data_fs.columns)
, '\n=============================================================')
#%%
#End of script

mcsm/ind_scripts/mcsm_results.py Executable file

@ -0,0 +1,149 @@
#!/usr/bin/env python3
#=======================================================================
#TASK:
#=======================================================================
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
import pandas as pd
from bs4 import BeautifulSoup
#import beautifulsoup4
from csv import reader
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
os.chdir(homedir + '/git/LSHTM_analysis/mcsm')
os.getcwd()
#=======================================================================
#%% variable assignment: input and output
#drug = 'pyrazinamide'
#gene = 'pncA'
#drug = 'isoniazid'
#gene = 'KatG'
drug = 'cycloserine'
gene = 'alr'
#drug = args.drug
#gene = args.gene
gene_match = gene + '_p.'
#==========
# data dir
#==========
datadir = homedir + '/' + 'git/Data'
#=======
# input:
#=======
# 1) result_urls (from outdir)
outdir = datadir + '/' + drug + '/' + 'output'
in_filename_url = gene.lower() + '_result_urls.txt' #(outfile, sub write_result_url)
infile_url = outdir + '/' + in_filename_url
print('Input filename:', in_filename_url
, '\nInput path(from output dir):', outdir
, '\n=============================================================')
#=======
# output
#=======
outdir = datadir + '/' + drug + '/' + 'output'
out_filename = gene.lower() + '_mcsm_output.csv'
outfile = outdir + '/' + out_filename
print('Output filename:', out_filename
, '\nOutput path:', outdir
, '\n=============================================================')
#=======================================================================
def scrape_results(out_result_url):
"""
Extract results data using the result url
@params out_result_url: txt file containing result url
one per line for each mutation
@type string
returns: mcsm prediction results (raw)
@type chr
"""
result_response = requests.get(out_result_url)
# if results_response is not None:
# page = results_page.text
if result_response.status_code == 200:
print('SUCCESS: Fetching results')
else:
print('FAIL: Could not fetch results'
, '\nCheck if url is valid')
# extract results using the html parser
soup = BeautifulSoup(result_response.text, features = 'html.parser')
# print(soup)
web_result_raw = soup.find(class_ = 'span4').get_text()
return web_result_raw
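# NB (hardening suggestion, not original behaviour): soup.find() returns None
# when no 'span4' element exists, so .get_text() would raise AttributeError;
# guarding the find() result and exiting early would fail fast on bad urls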
def build_result_dict(web_result_raw):
"""
Build dict of mcsm output for a single mutation
Format web results which is preformatted to enable building result dict
# preformatted string object: Problematic!
# make format consistent
@params web_result_raw: directly from html parser extraction
@type string
@returns result dict
@type {}
"""
# remove blank lines from web_result_raw
mytext = os.linesep.join([s for s in web_result_raw.splitlines() if s])
# affinity change and DUET stability change cols are split over
# multiple lines and Mutation information is empty!
mytext = mytext.replace('ange:\n', 'ange: ')
#print(mytext)
# initialise result_dict
result_dict = {}
for line in mytext.split('\n'):
fields = line.split(':')
# print(fields)
if len(fields) > 1: # since Mutation information is empty
dict_entry = dict([(x, y) for x, y in zip(fields[::2], fields[1::2])])
result_dict.update(dict_entry)
return result_dict
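# Minimal sketch of the parsing above (the raw text is an assumed example of
# the preformatted block, not captured output):
# raw = 'Predicted Affinity Change:\n-0.7 log(affinity fold change)\nChain: A'
# after the replace('ange:\n', 'ange: ') fix-up and the split/zip loop:
# build_result_dict(raw) -> {'Predicted Affinity Change': ' -0.7 log(affinity fold change)', 'Chain': ' A'}
# (values keep a leading space, hence the str.strip() applied at import time)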
#=====================================================================
#%% call function
#request_results(infile_url)
#response = requests.get('http://biosig.unimelb.edu.au/mcsm_lig/results_prediction/1586364780.41')
# dev leftover: a one-off test against a hard-coded result url; commented out
# so the script does not fire an extra network request before the main loop
#results_interim = scrape_results('http://biosig.unimelb.edu.au/mcsm_lig/results_prediction/1587053996.55')
#result_dict = build_result_dict(results_interim)
output_df = pd.DataFrame()
url_counter = 1 # 1-based counter for the progress messages below
infile_len = os.popen('wc -l < %s' % infile_url).read() # quicker than using Python :-)
print('Total URLs:',infile_len)
with open(infile_url, 'r') as urlfile:
for line in urlfile:
url_line = line.strip()
# response = request_results(url_line)
#response = requests.get(url_line)
results_interim = scrape_results(url_line)
result_dict = build_result_dict(results_interim)
print('Processing URL: %s of %s' % (url_counter, infile_len))
df = pd.DataFrame(result_dict, index=[url_counter])
url_counter += 1
output_df = output_df.append(df)
#print(output_df)
output_df.to_csv(outfile, index = None, header = True)

mcsm/ind_scripts/run_mcsm.py Executable file
@ -0,0 +1,240 @@
#!/usr/bin/env python3
#=======================================================================
#TASK:
#=======================================================================
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
import pandas as pd
from bs4 import BeautifulSoup
#from csv import reader
#=======================================================================
#%% specify input and curr dir
homedir = os.path.expanduser('~')
# set working dir
os.getcwd()
os.chdir(homedir + '/git/LSHTM_analysis/mcsm')
os.getcwd()
#=======================================================================
#%% command line args
#arg_parser = argparse.ArgumentParser()
#arg_parser.add_argument('-d', '--drug', help='drug name', default = 'pyrazinamide')
#arg_parser.add_argument('-g', '--gene', help='gene name', default = 'pncA') # case sensitive
#arg_parser.add_argument('-d', '--drug', help='drug name', default = 'TESTDRUG')
#arg_parser.add_argument('-g', '--gene', help='gene name (case sensitive)', default = 'testGene') # case sensitive
#args = arg_parser.parse_args()
#=======================================================================
#%% variable assignment: input and output
#drug = 'pyrazinamide'
#gene = 'pncA'
#drug = 'isoniazid'
#gene = 'KatG'
drug = 'cycloserine'
gene = 'alr'
#drug = args.drug
#gene = args.gene
gene_match = gene + '_p.'
#==========
# data dir
#==========
datadir = homedir + '/' + 'git/Data'
#==========
# input dir
#==========
indir = datadir + '/' + drug + '/' + 'input'
#==========
# output dir
#==========
outdir = datadir + '/' + drug + '/' + 'output'
#=======
# input files:
#=======
# 1) pdb file
in_filename_pdb = gene.lower() + '_complex.pdb'
infile_pdb = indir + '/' + in_filename_pdb
print('Input pdb file:', infile_pdb
, '\n=============================================================')
# 2) mcsm snps
in_filename_snps = gene.lower() + '_mcsm_snps.csv' #(outfile2, from data_extraction.py)
infile_snps = outdir + '/' + in_filename_snps
print('Input mutation file:', infile_snps
, '\n=============================================================')
#=======
# output files
#=======
# 1) result urls file
#result_urls_filename = gene.lower() + '_result_urls.txt'
#result_urls = outdir + '/' + result_urls_filename
# 2) invalid mutations file
#invalid_muts_filename = gene.lower() + '_invalid_mutations.txt'
#outfile_invalid_muts = outdir + '/' + invalid_muts_filename
#print('Result url file:', result_urls
# , '\n==================================================================='
# , '\nOutput invalid muations file:', outfile_invalid_muts
# , '\n===================================================================')
#%% global variables
host = "http://biosig.unimelb.edu.au"
prediction_url = f"{host}/mcsm_lig/prediction"
#=======================================================================
def format_data(data_file):
"""
Read file containing SNPs for mcsm analysis and remove duplicates
@param data_file csv file containing nsSNPs for given drug and gene.
csv file format:
single column with no headers with nsSNP format as below:
A1B
B2C
@type data_file: string
@return unique SNPs
@type list
"""
data = pd.read_csv(data_file, header = None, index_col = False)
data = data.drop_duplicates()
mutation_list = data[0].tolist()
# print(data.head())
return mutation_list
def request_calculation(pdb_file, mutation, chain, ligand_id, wt_affinity, prediction_url, output_dir, gene_name):
"""
Makes a POST request for a ligand affinity prediction.
@param pdb_file: valid path to pdb structure
@type string
@param mutation: single mutation of the format: {WT}<POS>{Mut}
@type string
@param chain: single-letter(caps)
@type chr
@param lig_id: 3-letter code (should match pdb file)
@type string
    @param wt_affinity: in nM
@type number
@param prediction_url: mcsm url for prediction
@type string
@return response object
@type object
"""
with open(pdb_file, "rb") as pdb_file:
files = {"wild": pdb_file}
body = {
"mutation": mutation,
"chain": chain,
"lig_id": ligand_id,
"affin_wt": wt_affinity
}
response = requests.post(prediction_url, files = files, data = body)
# print(response.status_code)
# result_status = response.raise_for_status()
if response.history:
# if result_status is not None: # doesn't work!
print('PASS: valid mutation submitted. Fetching result url')
# response = requests.post(prediction_url, files = files, data = body)
# return response
url_match = re.search('/mcsm_lig/results_prediction/.+(?=")', response.text)
url = host + url_match.group()
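        # e.g. the holding page contains href="/mcsm_lig/results_prediction/1587053996.55";
        # the (?=") lookahead ends the match just before the closing quote, so
        # url becomes host + '/mcsm_lig/results_prediction/1587053996.55'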
#===============
# writing file: result urls
#===============
out_url_file = output_dir + '/' + gene_name.lower() + '_result_urls.txt'
myfile = open(out_url_file, 'a')
myfile.write(url + '\n')
myfile.close()
else:
print('ERROR: invalid mutation! Wild-type residue doesn\'t match pdb file.'
, '\nSkipping to the next mutation in file...')
#===============
# writing file: invalid mutations
#===============
out_error_file = output_dir + '/' + gene_name.lower() + '_errors.txt'
failed_muts = open(out_error_file, 'a')
failed_muts.write(mutation + '\n')
failed_muts.close()
#def write_result_url(holding_page, out_result_url, host):
# """
# Extract and write results url from the holding page returned after
# requesting a calculation.
# @param holding_page: response object containing html content
# @type object
# @param out_result_url: txt file containing urls for mcsm results
# @type string
# @param host: mcsm server name
# @type string
# @return None, writes a file containing result urls (= total no. of muts)
# """
# if holding_page:
# url_match = re.search('/mcsm_lig/results_prediction/.+(?=")', holding_page.text)
# url = host + url_match.group()
#===============
# writing file
#===============
# myfile = open(out_result_url, 'a')
# myfile.write(url+'\n')
# myfile.close()
# print(myfile)
# return url
#%%
#=======================================================================
# variables to run mcsm lig predictions
#pdb_file = infile_snps_pdb
my_chain = 'A'
my_ligand_id = 'DCS'
my_affinity = 10
print('Result urls and error file (if any) will be written in: ', outdir)
# call function to format data to remove duplicate snps before submitting job
mcsm_muts = format_data(infile_snps)
mut_count = 1 # 1-based counter for the progress messages below
infile_snps_len = os.popen('wc -l < %s' % infile_snps).read() # quicker than using Python :-)
print('Total SNPs for', gene, ':', infile_snps_len)
for mcsm_mut in mcsm_muts:
print('Processing mutation: %s of %s' % (mut_count, infile_snps_len), mcsm_mut)
print('Parameters for mcsm_lig:', in_filename_pdb, mcsm_mut, my_chain, my_ligand_id, my_affinity, prediction_url, outdir, gene)
# function call: to request mcsm prediction
# which writes file containing url for valid submissions and invalid muts to respective files
holding_page = request_calculation(infile_pdb, mcsm_mut, my_chain, my_ligand_id, my_affinity, prediction_url, outdir, gene)
# holding_page = request_calculation(infile_pdb, mcsm_mut, my_chain, my_ligand_id, my_affinity, prediction_url, outdir, gene)
time.sleep(1)
mut_count += 1
# result_url = write_result_url(holding_page, result_urls, host)
print('Request submitted'
, '\nCAUTION: Processing will take at least ten'
, 'minutes, but will be longer for more mutations.')
#%%

mcsm/mcsm.py Normal file
@ -0,0 +1,494 @@
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import numpy as np
#from csv import reader
#from mcsm import * # not needed: this module defines these functions itself
#==============================
#%% global variables for defs
#==============================
#%%
def format_data(data_file):
"""
Read file containing SNPs for mcsm analysis and remove duplicates
@param data_file csv file containing nsSNPs for given drug and gene.
csv file format:
single column with no headers with nsSNP format as below:
A1B
B2C
@type data_file: string
@return unique SNPs
@type list
"""
data = pd.read_csv(data_file, header = None, index_col = False)
data = data.drop_duplicates()
mutation_list = data[0].tolist()
# print(data.head())
return mutation_list
# FIXME: documentation
def request_calculation(pdb_file, mutation, chain, ligand_id, wt_affinity, prediction_url, output_dir, gene_name, host):
"""
Makes a POST request for a ligand affinity prediction.
@param pdb_file: valid path to pdb structure
@type string
@param mutation: single mutation of the format: {WT}<POS>{Mut}
@type string
@param chain: single-letter(caps)
@type chr
@param lig_id: 3-letter code (should match pdb file)
@type string
    @param wt_affinity: in nM
@type number
@param prediction_url: mcsm url for prediction
@type string
@return response object
@type object
"""
with open(pdb_file, "rb") as pdb_file:
files = {"wild": pdb_file}
body = {
"mutation": mutation,
"chain": chain,
"lig_id": ligand_id,
"affin_wt": wt_affinity
}
response = requests.post(prediction_url, files = files, data = body)
#print(response.status_code)
#result_status = response.raise_for_status()
if response.history:
# if result_status is not None: # doesn't work!
print('PASS: valid mutation submitted. Fetching result url')
#return response
url_match = re.search('/mcsm_lig/results_prediction/.+(?=")', response.text)
url = host + url_match.group()
#===============
# writing file: result urls
#===============
out_url_file = output_dir + '/' + gene_name.lower() + '_result_urls.txt'
myfile = open(out_url_file, 'a')
myfile.write(url + '\n')
myfile.close()
else:
print('ERROR: invalid mutation! Wild-type residue doesn\'t match pdb file.'
, '\nSkipping to the next mutation in file...')
#===============
# writing file: invalid mutations
#===============
out_error_file = output_dir + '/' + gene_name.lower() + '_errors.txt'
failed_muts = open(out_error_file, 'a')
failed_muts.write(mutation + '\n')
failed_muts.close()
#=======================================================================
def scrape_results(result_url):
"""
Extract results data using the result url
    @params result_url: result url for a single mutation
            (one line from the result urls txt file)
@type string
returns: mcsm prediction results (raw)
@type chr
"""
result_response = requests.get(result_url)
# if results_response is not None:
# page = results_page.text
if result_response.status_code == 200:
print('Fetching results')
# extract results using the html parser
soup = BeautifulSoup(result_response.text, features = 'html.parser')
# print(soup)
web_result_raw = soup.find(class_ = 'span4').get_text()
#metatags = soup.find_all('meta')
metatags = soup.find_all('meta', attrs={'http-equiv':'refresh'})
#print('meta tags:', metatags)
if metatags:
print('WARNING: Submission not ready for URL:', result_url)
# TODO: Add logging
#if debug:
# debug.warning('submission not ready for URL:', result_url)
else:
return web_result_raw
else:
        sys.exit('FAIL: Could not fetch results'
                 '\nCheck if url is valid')
def build_result_dict(web_result_raw):
"""
Build dict of mcsm output for a single mutation
    Format web results, which are preformatted, to enable building the result dict
# preformatted string object: Problematic!
# make format consistent
@params web_result_raw: directly from html parser extraction
@type string
@returns result dict
@type {}
"""
# remove blank lines from web_result_raw
mytext = os.linesep.join([s for s in web_result_raw.splitlines() if s])
    # affinity change and DUET stability change cols are split over
    # multiple lines and Mutation information is empty!
mytext = mytext.replace('ange:\n', 'ange: ')
#print(mytext)
    # initialise result_dict
result_dict = {}
for line in mytext.split('\n'):
fields = line.split(':')
#print(fields)
        if len(fields) > 1: # since Mutation information is empty
dict_entry = dict([(x, y) for x, y in zip(fields[::2], fields[1::2])])
result_dict.update(dict_entry)
print(result_dict)
return result_dict
#%%
#=======================================================================
def format_mcsm_output(mcsm_outputcsv):
"""
@param mcsm_outputcsv: file containing mcsm results for all muts
which is the result of build_result_dict() being called for each
mutation and then converting to a pandas df and output as csv.
@type string
@return formatted mcsm output
@type pandas df
"""
#############
# Read file
#############
mcsm_data_raw = pd.read_csv(mcsm_outputcsv, sep = ',')
# strip white space from both ends in all columns
mcsm_data = mcsm_data_raw.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
dforig_shape = mcsm_data.shape
print('dimensions of input file:', dforig_shape)
#############
# rename cols
#############
# format colnames: all lowercase, remove spaces and use '_' to join
print('Assigning meaningful colnames i.e. without spaces or hyphens and reflecting units'
, '\n=======================================================')
my_colnames_dict = {'Predicted Affinity Change': 'PredAffLog' # relevant info from this col will be extracted and the column discarded
, 'Mutation information': 'mutationinformation' # {wild_type}<position>{mutant_type}
, 'Wild-type': 'wild_type' # one letter amino acid code
, 'Position': 'position' # number
, 'Mutant-type': 'mutant_type' # one letter amino acid code
, 'Chain': 'chain' # single letter (caps)
, 'Ligand ID': 'ligand_id' # 3-letter code
, 'Distance to ligand': 'ligand_distance' # angstroms
, 'DUET stability change': 'duet_stability_change'} # in kcal/mol
mcsm_data.rename(columns = my_colnames_dict, inplace = True)
#%%=====================================================================
#################################
# populate mutationinformation
# col which is currently blank
#################################
# populate mutationinformation column:mcsm style muts {WT}<POS>{MUT}
print('Populating column : mutationinformation which is currently empty\n', mcsm_data['mutationinformation'])
mcsm_data['mutationinformation'] = mcsm_data['wild_type'] + mcsm_data['position'].astype(str) + mcsm_data['mutant_type']
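# e.g. (hypothetical row) wild_type 'A', position 102, mutant_type 'V' -> 'A102V'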
print('checking after populating:\n', mcsm_data['mutationinformation']
, '\n=======================================================')
# Remove spaces b/w pasted columns
print('removing white space within column: mutationinformation')
mcsm_data['mutationinformation'] = mcsm_data['mutationinformation'].str.replace(' ', '')
print('Correctly formatted column: mutationinformation\n', mcsm_data['mutationinformation']
, '\n=======================================================')
#%%=====================================================================
#############
# sanity check: drop duplicate muts
#############
# shouldn't exist as this should be eliminated at the time of running mcsm
print('Sanity check:'
, '\nChecking duplicate mutations')
if mcsm_data['mutationinformation'].duplicated().sum() == 0:
print('PASS: No duplicate mutations detected (as expected)'
, '\nDim of data:', mcsm_data.shape
, '\n===================================================')
else:
print('WARNING: Duplicate mutations detected'
, '\nDim of df with duplicates:', mcsm_data.shape
, 'Removing duplicate entries')
mcsm_data = mcsm_data.drop_duplicates(['mutationinformation'])
print('Dim of data after removing duplicate muts:', mcsm_data.shape
, '\n===========================================================')
#%%=====================================================================
#############
# Create col: duet_outcome
#############
# classification based on DUET stability values
print('Assigning col: duet_outcome based on DUET stability values')
print('Sanity check:')
# count positive values in the DUET column
c = mcsm_data[mcsm_data['duet_stability_change']>=0].count()
DUET_pos = c.get(key = 'duet_stability_change')
# Assign category based on sign (+ve : Stabilising, -ve: Destabilising, Mind the spelling (British spelling))
mcsm_data['duet_outcome'] = np.where(mcsm_data['duet_stability_change']>=0, 'Stabilising', 'Destabilising')
print('DUET Outcome:', mcsm_data['duet_outcome'].value_counts())
#if DUET_pos == mcsm_data['duet_outcome'].value_counts()['Stabilising']:
# print('PASS: DUET outcome assigned correctly')
#else:
# print('FAIL: DUET outcome assigned incorrectly'
# , '\nExpected no. of stabilising mutations:', DUET_pos
# , '\nGot no. of stabilising mutations', mcsm_data['duet_outcome'].value_counts()['Stabilising']
# , '\n======================================================')
#%%=====================================================================
#############
# Extract numeric
# part of ligand_distance col
#############
# Extract only the numeric part from col: ligand_distance
# number: '-?\d+\.?\d*'
mcsm_data['ligand_distance']
print('extracting numeric part of col: ligand_distance')
mcsm_data['ligand_distance'] = mcsm_data['ligand_distance'].str.extract('(\d+\.?\d*)')
print('Ligand Distance:',mcsm_data['ligand_distance'])
#%%=====================================================================
#############
# Create 2 columns:
# ligand_affinity_change and ligand_outcome
#############
# the numerical and categorical parts need to be extracted from column: PredAffLog
# regex used
# numerical part: '-?\d+\.?\d*'
# categorical part: '\b(\w+ing)\b'
print('Extracting numerical and categorical parts from the col: PredAffLog')
print('to create two columns: ligand_affinity_change and ligand_outcome'
, '\n=======================================================')
# 1) Extracting the predicted affinity change (numerical part)
mcsm_data['ligand_affinity_change'] = mcsm_data['PredAffLog'].str.extract('(-?\d+\.?\d*)', expand = True)
print(mcsm_data['ligand_affinity_change'])
# 2) Extracting the categorical part (Destabilizing and Stabilizing) using word boundary ('ing')
#aff_regex = re.compile(r'\b(\w+ing)\b')
mcsm_data['ligand_outcome']= mcsm_data['PredAffLog'].str.extract(r'(\b\w+ing\b)', expand = True)
print(mcsm_data['ligand_outcome'])
print(mcsm_data['ligand_outcome'].value_counts())
#############
# changing spelling: British
#############
# ensuring spellings are consistent
american_spl = mcsm_data['ligand_outcome'].value_counts()
print('Changing to British spellings for col: ligand_outcome')
mcsm_data['ligand_outcome'].replace({'Destabilizing': 'Destabilising', 'Stabilizing': 'Stabilising'}, inplace = True)
print(mcsm_data['ligand_outcome'].value_counts())
british_spl = mcsm_data['ligand_outcome'].value_counts()
# compare series values since index will differ from spelling change
check = american_spl.values == british_spl.values
if check.all():
    print('PASS: spelling change successful'
          , '\nNo. of predicted affinity changes:\n', british_spl
          , '\n===================================================')
else:
    sys.exit('FAIL: spelling change unsuccessful'
             '\nExpected:\n' + str(american_spl)
             + '\nGot:\n' + str(british_spl)
             + '\n===================================================')
#%%=====================================================================
#############
# ensuring correct dtype for numeric columns
#############
# check dtype in cols
print('Checking dtypes in all columns:\n', mcsm_data.dtypes
, '\n=======================================================')
print('Converting the following cols to numeric:'
, '\nligand_distance'
, '\nduet_stability_change'
, '\nligand_affinity_change'
, '\n=======================================================')
# using apply method to change stability and affinity values to numeric
numeric_cols = ['duet_stability_change', 'ligand_affinity_change', 'ligand_distance']
mcsm_data[numeric_cols] = mcsm_data[numeric_cols].apply(pd.to_numeric)
# check dtype in cols
print('checking dtype after conversion')
cols_check = mcsm_data.select_dtypes(include='float64').columns.isin(numeric_cols)
if cols_check.all():
print('PASS: dtypes for selected cols:', numeric_cols
, '\nchanged to numeric'
, '\n===================================================')
else:
    sys.exit('FAIL: dtype change to numeric for selected cols unsuccessful'
             '\n===================================================')
print(mcsm_data.dtypes)
#%%=====================================================================
#############
# scale duet values
#############
# Rescale values in DUET_change col b/w -1 and 1 so negative numbers
# stay neg and pos numbers stay positive
duet_min = mcsm_data['duet_stability_change'].min()
duet_max = mcsm_data['duet_stability_change'].max()
duet_scale = lambda x : x/abs(duet_min) if x < 0 else (x/duet_max if x >= 0 else 'failed')
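# Worked sketch with assumed values duet_min = -2.0, duet_max = 1.0:
#   x = -1.0 -> -1.0/abs(-2.0) = -0.5 ; x = 0.5 -> 0.5/1.0 = 0.5
# i.e. negatives land in [-1, 0] and positives in [0, 1]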
mcsm_data['duet_scaled'] = mcsm_data['duet_stability_change'].apply(duet_scale)
print('Raw duet scores:\n', mcsm_data['duet_stability_change']
, '\n---------------------------------------------------------------'
, '\nScaled duet scores:\n', mcsm_data['duet_scaled'])
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# additional check added
c2 = mcsm_data[mcsm_data['duet_scaled']>=0].count()
DUET_pos2 = c2.get(key = 'duet_scaled')
if DUET_pos == DUET_pos2:
print('\nPASS: DUET values scaled correctly')
else:
print('\nFAIL: DUET values scaled numbers MISmatch'
, '\nExpected number:', DUET_pos
, '\nGot:', DUET_pos2
, '\n======================================================')
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#%%=====================================================================
#############
# scale affinity values
#############
# rescale values in affinity change col b/w -1 and 1 so negative numbers
# stay neg and pos numbers stay positive
aff_min = mcsm_data['ligand_affinity_change'].min()
aff_max = mcsm_data['ligand_affinity_change'].max()
aff_scale = lambda x : x/abs(aff_min) if x < 0 else (x/aff_max if x >= 0 else 'failed')
mcsm_data['affinity_scaled'] = mcsm_data['ligand_affinity_change'].apply(aff_scale)
print('Raw affinity scores:\n', mcsm_data['ligand_affinity_change']
, '\n---------------------------------------------------------------'
, '\nScaled affinity scores:\n', mcsm_data['affinity_scaled'])
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# additional check added
c_lig = mcsm_data[mcsm_data['ligand_affinity_change']>=0].count()
Lig_pos = c_lig.get(key = 'ligand_affinity_change')
c_lig2 = mcsm_data[mcsm_data['affinity_scaled']>=0].count()
Lig_pos2 = c_lig2.get(key = 'affinity_scaled')
if Lig_pos == Lig_pos2:
    print('\nPASS: Ligand affinity values scaled correctly')
else:
print('\nFAIL: Ligand affinity values scaled numbers MISmatch'
, '\nExpected number:', Lig_pos
, '\nGot:', Lig_pos2
, '\n======================================================')
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#%%=====================================================================
#############
# adding column: wild_pos
# useful for plots and db
#############
print('Creating column: wild_pos')
mcsm_data['wild_pos'] = mcsm_data['wild_type'] + mcsm_data['position'].astype(str)
print(mcsm_data['wild_pos'].head())
# Remove spaces b/w pasted columns
print('removing white space within created column: wild_pos')
mcsm_data['wild_pos'] = mcsm_data['wild_pos'].str.replace(' ', '')
print('Correctly formatted column: wild_pos\n', mcsm_data['wild_pos'].head()
, '\n=========================================================')
#%%=====================================================================
#############
# adding column: wild_chain_pos
# useful for plots and db, and it's explicit
#############
print('Creating column: wild_chain_pos')
mcsm_data['wild_chain_pos'] = mcsm_data['wild_type'] + mcsm_data['chain'] + mcsm_data['position'].astype(str)
print(mcsm_data['wild_chain_pos'].head())
# Remove spaces b/w pasted columns
print('removing white space within created column: wild_chain_pos')
mcsm_data['wild_chain_pos'] = mcsm_data['wild_chain_pos'].str.replace(' ', '')
print('Correctly formatted column: wild_chain_pos\n', mcsm_data['wild_chain_pos'].head()
, '\n=========================================================')
#%%=====================================================================
#############
# ensuring correct dtype in non-numeric cols
#############
#) char cols
char_cols = ['PredAffLog', 'mutationinformation', 'wild_type', 'mutant_type', 'chain', 'ligand_id', 'duet_outcome', 'ligand_outcome', 'wild_pos', 'wild_chain_pos']
#mcsm_data[char_cols] = mcsm_data[char_cols].astype(str)
cols_check_char = mcsm_data.select_dtypes(include = 'object').columns.isin(char_cols)
if cols_check_char.all():
print('PASS: dtypes for char cols:', char_cols, 'are indeed string'
, '\n===================================================')
else:
    sys.exit('FAIL: selected char cols are not all of string (object) dtype'
             '\n===================================================')
#mcsm_data['ligand_distance', 'ligand_affinity_change'].apply(is_numeric_dtype(mcsm_data['ligand_distance', 'ligand_affinity_change']))
print(mcsm_data.dtypes)
#%%=====================================================================
# Removing PredAffLog column as its relevant info has been extracted
print('Removing col: PredAffLog since relevant info has been extracted from it')
mcsm_data_f = mcsm_data.drop(columns = ['PredAffLog'])
#%%=====================================================================
# sort df by position for convenience
print('Sorting df by position')
mcsm_data_fs = mcsm_data_f.sort_values(by = ['position'])
print('sorted df:\n', mcsm_data_fs.head())
# Ensuring column names are lowercase before output
mcsm_data_fs.columns = mcsm_data_fs.columns.str.lower()
#%%=====================================================================
#############
# sanity check before writing file
#############
expected_ncols_toadd = 6 # beware hardcoding!
dforig_len = dforig_shape[1]
expected_cols = dforig_len + expected_ncols_toadd
if len(mcsm_data_fs.columns) == expected_cols:
print('PASS: formatting successful'
, '\nformatted df has expected no. of cols:', expected_cols
, '\n---------------------------------------------------'
, '\ncolnames:', mcsm_data_fs.columns
, '\n---------------------------------------------------'
, '\ndtypes in cols:', mcsm_data_fs.dtypes
, '\n---------------------------------------------------'
, '\norig data shape:', dforig_shape
, '\nformatted df shape:', mcsm_data_fs.shape
, '\n===================================================')
else:
print('FAIL: something went wrong in formatting df'
, '\nLen of orig df:', dforig_len
, '\nExpected number of cols to add:', expected_ncols_toadd
, '\nExpected no. of cols:', expected_cols, '(', dforig_len, '+', expected_ncols_toadd, ')'
, '\nGot no. of cols:', len(mcsm_data_fs.columns)
, '\nCheck formatting:'
, '\ncheck hardcoded value:', expected_ncols_toadd
, '\nis', expected_ncols_toadd, 'the no. of expected cols to add?'
, '\n===================================================')
sys.exit()
return mcsm_data_fs

mcsm/run_mcsm.py Executable file
@ -0,0 +1,219 @@
#!/usr/bin/env python3
# mCSM Wrapper
import os,sys
import subprocess
import argparse
import time # time.sleep() is used when submitting mutations
import pandas as pd
from mcsm import *
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug', help='drug name' , required=True)
arg_parser.add_argument('-g', '--gene', help='gene name (case sensitive)', required=True) # case sensitive
arg_parser.add_argument('-s', '--stage', help='mCSM Pipeline Stage', default = 'get', choices=['submit', 'get', 'format'], required=True)
arg_parser.add_argument('-H', '--host', help='mCSM Server', default = 'http://biosig.unimelb.edu.au')
arg_parser.add_argument('-U', '--url', help='mCSM Server URL', default = 'http://biosig.unimelb.edu.au/mcsm_lig/prediction')
arg_parser.add_argument('-c', '--chain', help='Chain ID as per PDB, Case sensitive', default = 'A')
arg_parser.add_argument('-l','--ligand', help='Ligand ID as per PDB, Case sensitive. REQUIRED only in "submit" stage', default = None)
arg_parser.add_argument('-a','--affinity', help='Affinity in nM. REQUIRED only in "submit" stage', default = 10) #0.99 for pnca, gid, embb. For SP targets (alr,katg, rpob), use 10.
arg_parser.add_argument('-pdb','--pdb_file', help = 'PDB File')
arg_parser.add_argument('-m','--mutation_file', help = 'Mutation File, mcsm style')
arg_parser.add_argument('--datadir', help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir', help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
# stage: submit, output url file
arg_parser.add_argument('--url_file', help = 'Output results url file. The result of stage "submit". By default, it creates an output result url file in the output dir: "output_dir + gene.lower() + _result_urls.txt"')
# stage: get, intermediate mcsm output file
arg_parser.add_argument('--outfile_scraped', help = 'Output mcsm results scraped. The result of stage "get". By default, it creates an interim output file in the output dir: "output_dir + gene.lower() +_mcsm_output.csv" ')
# stage: format, formatted output with scaled values, etc
# FIXME: Don't call this stage until you have ALL the interim results for your snps as the normalisation will be affected!
arg_parser.add_argument('--outfile_formatted', help = 'Output mcsm results formatted. The result of stage "format". By default, it creates a formatted output file in the output dir: "output_dir + gene.lower() + _complex_mcsm_norm.csv" ')
arg_parser.add_argument('--debug', action='store_true', help = 'Debug Mode')
args = arg_parser.parse_args()
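# Usage sketch (the drug/gene values below are illustrative, not prescriptive;
# <LIG_ID> stands for the ligand's 3-letter PDB code):
#   ./run_mcsm.py -d pyrazinamide -g pncA -s submit -l <LIG_ID> -a 0.99
#   ./run_mcsm.py -d pyrazinamide -g pncA -s get
#   ./run_mcsm.py -d pyrazinamide -g pncA -s format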
#=======================================================================
#%% variables
#host = "http://biosig.unimelb.edu.au"
#prediction_url = f"{host}/mcsm_lig/prediction"
#drug = ''
#gene = ''
#%%=====================================================================
# Command line options
gene = args.gene
drug = args.drug
stage = args.stage
chain = args.chain
ligand = args.ligand
affinity = args.affinity
pdb_filename = args.pdb_file
mutation_filename = args.mutation_file
result_urls = args.url_file
mcsm_output = args.outfile_scraped
outfile_format = args.outfile_formatted
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
DEBUG = args.debug
# Actual Globals :-)
host = args.host
prediction_url = args.url
# submit_mcsm globals
homedir = os.path.expanduser('~')
#os.chdir(homedir + '/git/LSHTM_analysis/mcsm')
gene_match = gene + '_p.'
#============
# directories
#============
if not datadir:
datadir = homedir + '/git/Data/'
if not indir:
    indir = datadir + drug + '/input/'
if not outdir:
    outdir = datadir + drug + '/output/'
#=======
# input
#=======
if pdb_filename:
in_filename_pdb = pdb_filename
else:
in_filename_pdb = gene.lower() + '_complex.pdb'
infile_pdb = indir + in_filename_pdb
#in_filename_snps = gene.lower() + '_mcsm_snps.csv' #(outfile_mcsm_snps, from data_extraction.py)
#infile_snps = outdir + '/' + in_filename_snps
if mutation_filename:
in_filename_snps = mutation_filename
else:
in_filename_snps = gene.lower() + '_mcsm_formatted_snps.csv'
infile_snps = outdir + in_filename_snps
#=======
# output
#=======
# mcsm_results globals
if not result_urls:
result_urls_filename = gene.lower() + '_result_urls.txt'
result_urls = outdir + result_urls_filename
if DEBUG:
print('DEBUG: Result URLs:', result_urls)
if not mcsm_output:
mcsm_output_filename = gene.lower() + '_mcsm_output.csv'
mcsm_output = outdir + mcsm_output_filename
if DEBUG:
print('DEBUG: mCSM output CSV file:', mcsm_output)
# format_results globals
#out_filename_format = gene.lower() + '_mcsm_processed.csv'
if not outfile_format:
out_filename_format = gene.lower() + '_complex_mcsm_norm.csv'
outfile_format = outdir + out_filename_format
if DEBUG:
print('DEBUG: formatted CSV output:', outfile_format)
#%%=====================================================================
def submit_mcsm():
# Example:
# chain = 'A'
# ligand_id = 'RMP'
# affinity = 10
print('Result urls and error file (if any) will be written in: ', outdir)
# call function to format data to remove duplicate snps before submitting job
mcsm_muts = format_data(infile_snps)
    mut_count = 1 # 1-based counter for the progress messages below
infile_snps_len = os.popen('wc -l < %s' % infile_snps).read() # quicker than using Python :-)
print('Total SNPs for', gene, ':', infile_snps_len)
for mcsm_mut in mcsm_muts:
print('Processing mutation: %s of %s' % (mut_count, infile_snps_len), mcsm_mut)
if DEBUG:
print('DEBUG: Parameters for mcsm_lig:', in_filename_pdb, mcsm_mut, chain, ligand, affinity, prediction_url, outdir, gene)
# function call: to request mcsm prediction
# which writes file containing url for valid submissions and invalid muts to respective files
holding_page = request_calculation(infile_pdb, mcsm_mut, chain, ligand, affinity, prediction_url, outdir, gene, host)
time.sleep(1)
mut_count += 1
# result_url = write_result_url(holding_page, result_urls, host)
print('Request submitted'
, '\nCAUTION: Processing will take at least ten'
, 'minutes, but will be longer for more mutations.')
#%%=====================================================================
def get_results():
output_df = pd.DataFrame()
    url_counter = 1 # 1-based counter for the progress messages below
success_counter = 1
infile_len = os.popen('wc -l < %s' % result_urls).read() # quicker than using Python :-)
print('Total URLs:', infile_len)
with open(result_urls, 'r') as urlfile:
for line in urlfile:
url_line = line.strip()
# call functions
results_interim = scrape_results(url_line)
if results_interim is not None:
print('Processing URL: %s of %s' % (url_counter, infile_len))
result_dict = build_result_dict(results_interim)
df = pd.DataFrame(result_dict, index=[url_counter])
output_df = output_df.append(df)
success_counter += 1
url_counter += 1
print('Total URLs: %s Successful: %s Failed: %s' % (url_counter-1, success_counter-1, (url_counter - success_counter)))
#print('\nOutput file created:', output_dir + gene.lower() + '_mcsm_output.csv')
output_df.to_csv(mcsm_output, index = None, header = True)
#%%=====================================================================
def format_results():
print('Input file:', mcsm_output
, '\n============================================================='
, '\nOutput file:', outfile_format
, '\n=============================================================')
# call function
mcsm_df_formatted = format_mcsm_output(mcsm_output)
# writing file
print('Writing formatted df to csv')
mcsm_df_formatted.to_csv(outfile_format, index = False)
print('Finished writing file:'
, '\nFile:', outfile_format
, '\nExpected no. of rows:', len(mcsm_df_formatted)
, '\nExpected no. of cols:', len(mcsm_df_formatted.columns)
, '\n=============================================================')
#%%=====================================================================
def main():
if stage == 'submit':
print('mCSM stage: submit mutations for mcsm analysis')
submit_mcsm()
elif stage == 'get':
print('mCSM stage: get results')
get_results()
elif stage == 'format':
print('mCSM stage: format results')
format_results()
else:
print('ERROR: invalid stage')
if __name__ == '__main__':
    main()

@ -1,512 +0,0 @@
###########################
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#%%%%%%%%%%%%%%%%%%%%%%%%
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#%%%%%%%%%%%%%%%%%%%%%%%%
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
###########################
# Data for bfactor figure
# PS average
# Lig average
###########################
head(my_df$Position)
head(my_df$ratioDUET)
# order data frame
df = my_df[order(my_df$Position),]
head(df$Position)
head(df$ratioDUET)
#***********
# PS: average by position
#***********
mean_DUET_by_position <- df %>%
group_by(Position) %>%
summarize(averaged.DUET = mean(ratioDUET))
#***********
# Lig: average by position
#***********
mean_Lig_by_position <- df %>%
group_by(Position) %>%
summarize(averaged.Lig = mean(ratioPredAff))
#***********
# cbind:mean_DUET_by_position and mean_Lig_by_position
#***********
combined = as.data.frame(cbind(mean_DUET_by_position, mean_Lig_by_position ))
# sanity check
# mean_PS_Lig_Bfactor
colnames(combined)
colnames(combined) = c("Position"
, "average_DUETR"
, "Position2"
, "average_PredAffR")
colnames(combined)
identical(combined$Position, combined$Position2)
n = which(colnames(combined) == "Position2"); n
combined_df = combined[,-n]
max(combined_df$average_DUETR) ; min(combined_df$average_DUETR)
max(combined_df$average_PredAffR) ; min(combined_df$average_PredAffR)
#=============
# output csv
#============
outDir = "~/Data/pyrazinamide/input/processed/"
outFile = paste0(outDir, "mean_PS_Lig_Bfactor.csv")
print(paste0("Output file with path will be:","", outFile))
head(combined_df$Position); tail(combined_df$Position)
write.csv(combined_df, outFile
, row.names = F)
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
require(data.table)
require(dplyr)
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
###########################
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
###########################
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
###########################
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#%%%%%%%%%%%%%%%%%%%%%%%%
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#%%%%%%%%%%%%%%%%%%%%%%%%
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
###########################
# Data for bfactor figure
# PS average
# Lig average
###########################
head(my_df$Position)
head(my_df$ratioDUET)
# order data frame
df = my_df[order(my_df$Position),]
head(df$Position)
head(df$ratioDUET)
#***********
# PS: average by position
#***********
mean_DUET_by_position <- df %>%
group_by(Position) %>%
summarize(averaged.DUET = mean(ratioDUET))
#***********
# Lig: average by position
#***********
mean_Lig_by_position <- df %>%
group_by(Position) %>%
summarize(averaged.Lig = mean(ratioPredAff))
#***********
# cbind:mean_DUET_by_position and mean_Lig_by_position
#***********
combined = as.data.frame(cbind(mean_DUET_by_position, mean_Lig_by_position ))
# sanity check
# mean_PS_Lig_Bfactor
colnames(combined)
colnames(combined) = c("Position"
, "average_DUETR"
, "Position2"
, "average_PredAffR")
colnames(combined)
identical(combined$Position, combined$Position2)
n = which(colnames(combined) == "Position2"); n
combined_df = combined[,-n]
max(combined_df$average_DUETR) ; min(combined_df$average_DUETR)
max(combined_df$average_PredAffR) ; min(combined_df$average_PredAffR)
#=============
# output csv
#============
outDir = "~/git/Data/pyrazinamide/input/processed/"
outFile = paste0(outDir, "mean_PS_Lig_Bfactor.csv")
print(paste0("Output file with path will be:","", outFile))
head(combined_df$Position); tail(combined_df$Position)
write.csv(combined_df, outFile
, row.names = F)
# read in pdb file complex1
inDir = "~/git/Data/pyrazinamide/input/structure"
inFile = paste0(inDir, "complex1_no_water.pdb")
# read in pdb file complex1
inDir = "~/git/Data/pyrazinamide/input/structure/"
inFile = paste0(inDir, "complex1_no_water.pdb")
complex1 = inFile
my_pdb = read.pdb(complex1
, maxlines = -1
, multi = FALSE
, rm.insert = FALSE
, rm.alt = TRUE
, ATOM.only = FALSE
, hex = FALSE
, verbose = TRUE)
#########################
#3: Read complex pdb file
##########################
source("Header_TT.R")
# list of 8
my_pdb = read.pdb(complex1
, maxlines = -1
, multi = FALSE
, rm.insert = FALSE
, rm.alt = TRUE
, ATOM.only = FALSE
, hex = FALSE
, verbose = TRUE)
rm(inDir, inFile)
#====== end of script
inDir = "~/git/Data/pyrazinamide/input/structure/"
inFile = paste0(inDir, "complex1_no_water.pdb")
complex1 = inFile
complex1 = inFile
my_pdb = read.pdb(complex1
, maxlines = -1
, multi = FALSE
, rm.insert = FALSE
, rm.alt = TRUE
, ATOM.only = FALSE
, hex = FALSE
, verbose = TRUE)
inFile
inDir = "~/git/Data/pyrazinamide/input/structure/"
inFile = paste0(inDir, "complex1_no_water.pdb")
complex1 = inFile
#inFile2 = paste0(inDir, "complex2_no_water.pdb")
#complex2 = inFile2
# list of 8
my_pdb = read.pdb(complex1
, maxlines = -1
, multi = FALSE
, rm.insert = FALSE
, rm.alt = TRUE
, ATOM.only = FALSE
, hex = FALSE
, verbose = TRUE)
rm(inDir, inFile, complex1)
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts")
getwd()
source("Header_TT.R")
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("Header_TT.R")
#########################################################
# TASK: replace B-factors in the pdb file with normalised values
# use the complex file with no water as mCSM lig was
# performed on this file. You can check it in the script: read_pdb file.
#########################################################
###########################
# 2: Read file: average stability values
# or mcsm_normalised file, output of step 4 mcsm pipeline
###########################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mean_PS_Lig_Bfactor.csv"); inFile
my_df <- read.csv(inFile
# , row.names = 1
# , stringsAsFactors = F
, header = T)
str(my_df)
source("read_pdb.R") # list of 8
# extract atom list into a variable
# since in the list this corresponds to data frame, variable will be a df
d = my_pdb[[1]]
# make a copy: required for downstream sanity checks
d2 = d
# sanity checks: B factor
max(d$b); min(d$b)
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(3,2))
#par(mfrow = c(3,2))
#1: Original B-factor
hist(d$b
, xlab = ""
, main = "B-factor")
plot(density(d$b)
, xlab = ""
, main = "B-factor")
# 2: DUET scores
hist(my_df$average_DUETR
, xlab = ""
, main = "Norm_DUET")
plot(density(my_df$average_DUETR)
, xlab = ""
, main = "Norm_DUET")
# Set the margin on all sides
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(3,2))
#par(mfrow = c(3,2))
#1: Original B-factor
hist(d$b
, xlab = ""
, main = "B-factor")
plot(density(d$b)
, xlab = ""
, main = "B-factor")
# 2: DUET scores
hist(my_df$average_DUETR
, xlab = ""
, main = "Norm_DUET")
plot(density(my_df$average_DUETR)
, xlab = ""
, main = "Norm_DUET")
#=========
# step 1_P1
#=========
# Be brave and replace in place now (don't run sanity check)
# this makes all the B-factor values in the non-matched positions as NA
d$b = my_df$average_DUETR[match(d$resno, my_df$Position)]
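# e.g. match(d$resno, my_df$Position) returns, for each atom's residue number,
# the row of my_df holding that position's averaged score (NA when absent),
# so d$b is filled position-wise with average_DUETR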
#=========
# step 2_P1
#=========
# count NA in Bfactor
b_na = sum(is.na(d$b)) ; b_na
# count number of 0's in B-factor
sum(d$b == 0)
# replace all NA in b factor with 0
d$b[is.na(d$b)] = 0
# sanity check: should be 0
sum(is.na(d$b))
# sanity check: should be True
if (sum(d$b == 0) == b_na){
print ("Sanity check passed: NA's replaced with 0's successfully")
} else {
print("Error: NA replacement NOT successful, Debug code!")
}
max(d$b); min(d$b)
# sanity checks: should be True
if(max(d$b) == max(my_df$average_DUETR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
if (min(d$b) == min(my_df$average_DUETR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
#=========
# step 3_P1
#=========
# sanity check: dim should be same before reassignment
# should be TRUE
dim(d) == dim(d2)
#=========
# step 4_P1
#=========
# assign it back to the pdb file
my_pdb[[1]] = d
max(d$b); min(d$b)
#=========
# step 5_P1
#=========
# output dir
getwd()
outDir = "~/git/Data/pyrazinamide/output/"
getwd()
outFile = paste0(outDir, "complex1_BwithNormDUET.pdb")
outFile = paste0(outDir, "complex1_BwithNormDUET.pdb"); outFile
outDir = "~/git/Data/pyrazinamide/input/structure"
outDir = "~/git/Data/pyrazinamide/input/structure/"
outFile = paste0(outDir, "complex1_BwithNormDUET.pdb"); outFile
write.pdb(my_pdb, outFile)
hist(d$b
, xlab = ""
, main = "repalced-B")
plot(density(d$b)
, xlab = ""
, main = "replaced-B")
# graph titles
mtext(text = "Frequency"
, side = 2
, line = 0
, outer = TRUE)
mtext(text = "DUET_stability"
, side = 3
, line = 0
, outer = TRUE)
#=========================================================
# Processing P2: Replacing B values with PredAff Scores
#=========================================================
# clear workspace
rm(list = ls())
#=========================================================
# Processing P2: Replacing B values with PredAff Scores
#=========================================================
# clear workspace
rm(list = ls())
###########################
# 2: Read file: average stability values
# or mcsm_normalised file, output of step 4 mcsm pipeline
###########################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mean_PS_Lig_Bfactor.csv"); inFile
my_df <- read.csv("../Data/mean_PS_Lig_Bfactor.csv"
# , row.names = 1
# , stringsAsFactors = F
, header = T)
str(my_df)
#=========================================================
# Processing P2: Replacing B factor with mean ratioLig scores
#=========================================================
#########################
# 3: Read complex pdb file
# form the R script
##########################
source("read_pdb.R") # list of 8
# extract atom list into a variable
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mean_PS_Lig_Bfactor.csv"); inFile
my_df <- read.csv(inFile
# , row.names = 1
# , stringsAsFactors = F
, header = T)
str(my_df)
# extract atom list into a variable
# since in the list this corresponds to data frame, variable will be a df
d = my_pdb[[1]]
# make a copy: required for downstream sanity checks
d2 = d
# sanity checks: B factor
max(d$b); min(d$b)
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(3,2))
#par(mfrow = c(3,2))
# 1: Original B-factor
hist(d$b
, xlab = ""
, main = "B-factor")
plot(density(d$b)
, xlab = ""
, main = "B-factor")
# 2: Pred Aff scores
hist(my_df$average_PredAffR
, xlab = ""
, main = "Norm_lig_average")
plot(density(my_df$average_PredAffR)
, xlab = ""
, main = "Norm_lig_average")
# 3: After the following replacement
#********************************
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(3,2))
#par(mfrow = c(3,2))
# 1: Original B-factor
hist(d$b
, xlab = ""
, main = "B-factor")
plot(density(d$b)
, xlab = ""
, main = "B-factor")
# 2: Pred Aff scores
hist(my_df$average_PredAffR
, xlab = ""
, main = "Norm_lig_average")
plot(density(my_df$average_PredAffR)
, xlab = ""
, main = "Norm_lig_average")
# 3: After the following replacement
#********************************
#=========
# step 1_P2: BE BRAVE and replace in place now (don't run step 0)
#=========
# this makes all the B-factor values in the non-matched positions as NA
d$b = my_df$average_PredAffR[match(d$resno, my_df$Position)]
#=========
# step 2_P2
#=========
# count NA in Bfactor
b_na = sum(is.na(d$b)) ; b_na
# count number of 0's in B-factor
sum(d$b == 0)
# replace all NA in b factor with 0
d$b[is.na(d$b)] = 0
# sanity check: should be 0
sum(is.na(d$b))
if (sum(d$b == 0) == b_na){
print ("Sanity check passed: NA's replaced with 0's successfully")
} else {
print("Error: NA replacement NOT successful, Debug code!")
}
max(d$b); min(d$b)
# sanity checks: should be True
if (max(d$b) == max(my_df$average_PredAffR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
if (min(d$b) == min(my_df$average_PredAffR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
#=========
# step 3_P2
#=========
# sanity check: dim should be same before reassignment
# should be TRUE
dim(d) == dim(d2)
#=========
# step 4_P2
#=========
# assign it back to the pdb file
my_pdb[[1]] = d
max(d$b); min(d$b)
#=========
# step 5_P2
#=========
write.pdb(my_pdb, "Plotting/structure/complex1_BwithNormLIG.pdb")
# output dir
getwd()
# output dir
outDir = "~/git/Data/pyrazinamide/input/structure/"
outFile = paste0(outDir, "complex1_BwithNormLIG.pdb")
outFile = paste0(outDir, "complex1_BwithNormLIG.pdb"); outFile
write.pdb(my_pdb, outFile)

@ -1,129 +0,0 @@
#########################################################
### A) Installing and loading required packages
#########################################################
#if (!require("gplots")) {
# install.packages("gplots", dependencies = TRUE)
# library(gplots)
#}
if (!require("tidyverse")) {
install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)
}
if (!require("ggplot2")) {
install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
}
if (!require("cowplot")) {
install.packages("copwplot", dependencies = TRUE)
library(ggplot2)
}
if (!require("ggcorrplot")) {
install.packages("ggcorrplot", dependencies = TRUE)
library(ggcorrplot)
}
if (!require("ggpubr")) {
install.packages("ggpubr", dependencies = TRUE)
library(ggpubr)
}
if (!require("RColorBrewer")) {
install.packages("RColorBrewer", dependencies = TRUE)
library(RColorBrewer)
}
if (!require ("GOplot")) {
install.packages("GOplot")
library(GOplot)
}
if(!require("VennDiagram")) {
install.packages("VennDiagram", dependencies = T)
library(VennDiagram)
}
if(!require("scales")) {
install.packages("scales", dependencies = T)
library(scales)
}
if(!require("plotrix")) {
install.packages("plotrix", dependencies = T)
library(plotrix)
}
if(!require("stats")) {
install.packages("stats", dependencies = T)
library(stats)
}
if(!require("stats4")) {
install.packages("stats4", dependencies = T)
library(stats4)
}
if(!require("data.table")) {
library(stats4)
}
if (!require("PerformanceAnalytics")){
install.packages("PerformanceAnalytics", dependencies = T)
library(PerformaceAnalytics)
}
if (!require ("GGally")){
install.packages("GGally")
library(GGally)
}
if (!require ("corrr")){
install.packages("corrr")
library(corrr)
}
if (!require ("psych")){
install.packages("psych")
library(psych)
}
if (!require ("dplyr")){
install.packages("dplyr")
library(psych)
}
if (!require ("compare")){
install.packages("compare")
library(psych)
}
if (!require ("arsenal")){
install.packages("arsenal")
library(psych)
}
####TIDYVERSE
# Install
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("kassambara/ggcorrplot")
library(ggcorrplot)
###for PDB files
#install.packages("bio3d")
if(!require(bio3d)){
install.packages("bio3d")
library(bio3d)
}

@ -1,27 +0,0 @@
#########################################################
# 1b: Define function: coloured barplot by subgroup
# LINK: https://stackoverflow.com/questions/49818271/stacked-barplot-with-colour-gradients-for-each-bar
#########################################################
ColourPalleteMulti <- function(df, group, subgroup){
# Find how many colour categories to create and the number of colours in each
categories <- aggregate(as.formula(paste(subgroup, group, sep="~" ))
, df
, function(x) length(unique(x)))
# return(categories) }
  category.start <- (scales::hue_pal(l = 100)(nrow(categories))) # Set the top of the colour palette
  category.end <- (scales::hue_pal(l = 40)(nrow(categories))) # set the bottom
  #return(category.start); return(category.end)}
  # Build colour palette
colours <- unlist(lapply(1:nrow(categories),
function(i){
colorRampPalette(colors = c(category.start[i]
, category.end[i]))(categories[i,2])}))
return(colours)
}
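# Usage sketch (hypothetical column names 'gene' and 'mutation'):
#   my_cols <- ColourPalleteMulti(my_df, "gene", "mutation")
#   ggplot(my_df, aes(x = gene, fill = interaction(gene, mutation))) +
#     geom_bar() + scale_fill_manual(values = my_cols)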
#########################################################

@ -1,299 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/")
getwd()
#########################################################
# TASK: To combine mcsm and meta data with af and or
#########################################################
########################################################################
# Installing and loading required packages #
########################################################################
source("Header_TT.R")
#require(data.table)
#require(arsenal)
#require(compare)
#library(tidyverse)
#################################
# Read file: normalised file
# output of step 4 mcsm_pipeline
#################################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mcsm_complex1_normalised.csv"); inFile
mcsm_data = read.csv(inFile
, row.names = 1
, stringsAsFactors = F
, header = T)
rm(inDir, inFile)
str(mcsm_data)
table(mcsm_data$DUET_outcome); sum(table(mcsm_data$DUET_outcome) )
# spelling Correction 1: DUET
mcsm_data$DUET_outcome[mcsm_data$DUET_outcome=='Stabilizing'] <- 'Stabilising'
mcsm_data$DUET_outcome[mcsm_data$DUET_outcome=='Destabilizing'] <- 'Destabilising'
# checks: should be the same as above
table(mcsm_data$DUET_outcome); sum(table(mcsm_data$DUET_outcome) )
head(mcsm_data$DUET_outcome); tail(mcsm_data$DUET_outcome)
# spelling Correction 2: Ligand
table(mcsm_data$Lig_outcome); sum(table(mcsm_data$Lig_outcome) )
mcsm_data$Lig_outcome[mcsm_data$Lig_outcome=='Stabilizing'] <- 'Stabilising'
mcsm_data$Lig_outcome[mcsm_data$Lig_outcome=='Destabilizing'] <- 'Destabilising'
# checks: should be the same as above
table(mcsm_data$Lig_outcome); sum(table(mcsm_data$Lig_outcome) )
head(mcsm_data$Lig_outcome); tail(mcsm_data$Lig_outcome)
# count na in each column
na_count = sapply(mcsm_data, function(y) sum(length(which(is.na(y))))); na_count
# sort by Mutationinformation
mcsm_data = mcsm_data[order(mcsm_data$Mutationinformation),]
head(mcsm_data$Mutationinformation)
# get freq count of positions and add to the df
setDT(mcsm_data)[, occurrence := .N, by = .(Position)]
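# e.g. if Position 10 appears on 3 rows, each of those rows gets occurrence = 3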
pos_count_check = data.frame(mcsm_data$Position, mcsm_data$occurrence)
###########################
# 2: Read file: meta data with AFandOR
###########################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile2 = paste0(inDir, "meta_data_with_AFandOR.csv"); inFile2
meta_with_afor <- read.csv(inFile2
, stringsAsFactors = F
, header = T)
rm(inDir, inFile2)
str(meta_with_afor)
# sort by Mutationinformation
head(meta_with_afor$Mutationinformation)
meta_with_afor = meta_with_afor[order(meta_with_afor$Mutationinformation),]
head(meta_with_afor$Mutationinformation)
# sanity check: should be True for all the mentioned columns
#is.numeric(meta_with_afor$OR)
na_var = c("AF", "OR", "pvalue", "logor", "neglog10pvalue")
c1 = NULL
for (i in na_var){
print(i)
c0 = is.numeric(meta_with_afor[,i])
c1 = c(c0, c1)
if ( all(c1) ){
print("Sanity check passed: These are all numeric cols")
} else{
print("Error: Please check your respective data types")
}
}
# If OR, and P value are not numeric, then convert to numeric and then count
# else they will say 0
na_count = sapply(meta_with_afor, function(y) sum(length(which(is.na(y))))); na_count
str(na_count)
# compare if the No of "NA" are the same for all these cols
na_len = NULL
for (i in na_var){
temp = na_count[[i]]
na_len = c(na_len, temp)
}
# extract how many NAs there are:
# all comparisons should be TRUE, leaving a single number,
# since all these cols should have the same no. of NAs
my_nrows = NULL
for ( i in 1: (length(na_len)-1) ){
#print(compare(na_len[i]), na_len[i+1])
c = compare(na_len[i], na_len[i+1])
if ( c$result ) {
my_nrows = na_len[i] }
else {
print("Error: Please check your numbers")
}
}
my_nrows
#=#=#=#=#=#=#=#=#
# COMMENT: AF, OR, pvalue, logor and neglog10pvalue
# share their 7 NAs at the same rows
#=#=#=#=#=#=#=#=#
# sanity check
#which(is.na(meta_with_afor$OR))
# initialise an empty df with nrows as extracted above
na_count_df = data.frame(matrix(vector(mode = 'numeric'
# , length = length(na_var)
)
, nrow = my_nrows
# , ncol = length(na_var)
))
# populate the df with the indices of the cols that are NA
for (i in na_var){
print(i)
na_i = which(is.na(meta_with_afor[i]))
na_count_df = cbind(na_count_df, na_i)
colnames(na_count_df)[which(na_var == i)] <- i
}
# Now compare these indices to ensure these are the same
c2 = NULL
for ( i in 1: ( length(na_count_df)-1 ) ) {
# print(na_count_df[i] == na_count_df[i+1])
c1 = identical(na_count_df[[i]], na_count_df[[i+1]])
c2 = c(c1, c2)
if ( all(c2) ) {
print("Sanity check passed: The indices for AF, OR, etc are all the same")
} else {
print ("Error: Please check indices which are NA")
}
}
rm( c, c0, c1, c2, i, my_nrows
, na_count, na_i, na_len
, na_var, temp
, na_count_df
, pos_count_check )
###########################
# 3: merging two dfs: with NA
###########################
# link col name = Mutationinformation
head(mcsm_data$Mutationinformation)
head(meta_with_afor$Mutationinformation)
#########
# merge 1a: meta data with mcsm
#########
merged_df2 = merge(x = meta_with_afor
, y = mcsm_data
, by = "Mutationinformation"
, all.y = T)
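# Aside: all.y = T keeps every row of mcsm_data (a right join): unmatched
# mcsm rows get NA in the meta columns, while meta rows with no structural
# match are dropped. merged_df2v2 below flips this with all.x = T.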
head(merged_df2$Position)
# sort by Position
head(merged_df2$Position)
merged_df2 = merged_df2[order(merged_df2$Position),]
head(merged_df2$Position)
merged_df2v2 = merge(x = meta_with_afor
,y = mcsm_data
, by = "Mutationinformation"
, all.x = T)
#!=!=!=!=!=!=!=!
# COMMENT: used all.y since position 186 is not part of the structure,
# hence doesn't have an mcsm value,
# but 186 is associated with a mutation
#!=!=!=!=!=!=!=!
# should be False
identical(merged_df2, merged_df2v2)
table(merged_df2$Position%in%merged_df2v2$Position)
rm(merged_df2v2)
#########
# merge 1b: remove duplicate mutation information
#########
#==#=#=#=#=#=#
# Cannot trust lineage, country from this df as the same mutation
# can have many different lineages
# but this should be good for the numerical corr plots
#=#=#=#=#=#=#=
merged_df3 = merged_df2[!duplicated(merged_df2$Mutationinformation),]
head(merged_df3$Position); tail(merged_df3$Position) # should be sorted
# sanity checks
# nrows of merged_df3 should be the same as the nrows of mcsm_data
if(nrow(mcsm_data) == nrow(merged_df3)){
print("sanity check: Passed")
} else {
print("Error!: check data, nrows is not as expected")
}
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# uncomment as necessary
# only needed if merged_df2v2 was used, i.e. non-structural pos are included
#mcsm = mcsm_data$Mutationinformation
#my_merged = merged_df3$Mutationinformation
# find the index where it differs
#diff_n = which(!my_merged%in%mcsm)
#check if it is indeed pos 186
#merged_df3[diff_n,]
# remove this entry
#merged_df3 = merged_df3[-diff_n,]
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
###########################
# 3b: merging two dfs: without NA
###########################
#########
# merge 2a: same as merge 1 but excluding NA
#########
merged_df2_comp = merged_df2[!is.na(merged_df2$AF),]
#########
# merge 2b: remove duplicate mutation information
#########
merged_df3_comp = merged_df2_comp[!duplicated(merged_df2_comp$Mutationinformation),]
# alternate way of deriving merged_df3_comp
foo = merged_df3[!is.na(merged_df3$AF),]
# compare dfs: foo and merged_df3_comp
all.equal(foo, merged_df3_comp)
summary(comparedf(foo, merged_df3_comp))
#=============== end of combining df
#clear variables
rm(mcsm_data
, meta_with_afor
, foo)
#rm(diff_n, my_merged, mcsm)
#=====================
# write_output files
#=====================
# output dir
outDir = "~/git/Data/pyrazinamide/output/"
getwd()
outFile1 = paste0(outDir, "merged_df3.csv"); outFile1
write.csv(merged_df3, outFile1)
#outFile2 = paste0(outDir, "merged_df3_comp.csv"); outFile2
#write.csv(merged_df3_comp, outFile2)
rm(outDir
, outFile1
# , outFile2
)
#============================= end of script

View file

@ -1,348 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/")
getwd()
#########################################################
# TASK: To combine mcsm and meta data with AF and OR
# by filtering for distance to ligand (<10Ang)
#########################################################
#########################################################
# Installing and loading required packages
#########################################################
#source("Header_TT.R")
#require(data.table)
#require(arsenal)
#require(compare)
#library(tidyverse)
#################################
# Read file: normalised file
# output of step 4 mcsm_pipeline
#################################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mcsm_complex1_normalised.csv"); inFile
mcsm_data = read.csv(inFile
, row.names = 1
, stringsAsFactors = F
, header = T)
rm(inDir, inFile)
str(mcsm_data)
table(mcsm_data$DUET_outcome); sum(table(mcsm_data$DUET_outcome) )
# spelling Correction 1: DUET
mcsm_data$DUET_outcome[mcsm_data$DUET_outcome=='Stabilizing'] <- 'Stabilising'
mcsm_data$DUET_outcome[mcsm_data$DUET_outcome=='Destabilizing'] <- 'Destabilising'
# checks
table(mcsm_data$DUET_outcome); sum(table(mcsm_data$DUET_outcome) )
head(mcsm_data$DUET_outcome); tail(mcsm_data$DUET_outcome)
# spelling Correction 2: Ligand
table(mcsm_data$Lig_outcome); sum(table(mcsm_data$Lig_outcome) )
mcsm_data$Lig_outcome[mcsm_data$Lig_outcome=='Stabilizing'] <- 'Stabilising'
mcsm_data$Lig_outcome[mcsm_data$Lig_outcome=='Destabilizing'] <- 'Destabilising'
# checks: should be the same as above
table(mcsm_data$Lig_outcome); sum(table(mcsm_data$Lig_outcome) )
head(mcsm_data$Lig_outcome); tail(mcsm_data$Lig_outcome)
########################### !!! only for mcsm_lig
# 4: Filter/subset data
# Lig plots < 10Ang
# Filter the lig plots for Dis_to_lig < 10Ang
###########################
# check range of distances
max(mcsm_data$Dis_lig_Ang)
min(mcsm_data$Dis_lig_Ang)
# count
table(mcsm_data$Dis_lig_Ang<10)
# subset data to have only values less than 10 Ang
mcsm_data2 = subset(mcsm_data, mcsm_data$Dis_lig_Ang < 10)
# sanity checks
max(mcsm_data2$Dis_lig_Ang)
min(mcsm_data2$Dis_lig_Ang)
# count no of unique positions
length(unique(mcsm_data2$Position))
# count no of unique mutations
length(unique(mcsm_data2$Mutationinformation))
# count destabilising and stabilising
table(mcsm_data2$Lig_outcome) #{RESULT: no of mutations within 10Ang}
#<<<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT: so as not to alter the script
mcsm_data = mcsm_data2
#<<<<<<<<<<<<<<<<<<<<<<<<<<<
#############################
# Extra sanity check:
# for mcsm_lig ONLY
# Dis_lig_Ang should be <10
#############################
if (max(mcsm_data$Dis_lig_Ang) < 10){
print ("Sanity check passed: lig data is <10Ang")
}else{
print ("Error: data should be filtered to be within 10Ang")
}
# clear variables
rm(mcsm_data2)
# count na in each column
na_count = sapply(mcsm_data, function(y) sum(is.na(y))); na_count
head(mcsm_data$Mutationinformation)
mcsm_data[mcsm_data$Mutationinformation=="Q10P",]
mcsm_data[mcsm_data$Mutationinformation=="L4S",]
# sort by Mutationinformation
mcsm_data = mcsm_data[order(mcsm_data$Mutationinformation),]
head(mcsm_data$Mutationinformation)
# check
mcsm_data[grep("Q10P", mcsm_data$Mutationinformation),]
mcsm_data[grep("A102T", mcsm_data$Mutationinformation),]
# get freq count of positions and add to the df
setDT(mcsm_data)[, occurrence := .N, by = .(Position)]
pos_count_check = data.frame(mcsm_data$Position, mcsm_data$occurrence)
###########################
# 2: Read file: meta data with AFandOR
###########################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile2 = paste0(inDir, "meta_data_with_AFandOR.csv"); inFile2
meta_with_afor <- read.csv(inFile2
, stringsAsFactors = F
, header = T)
str(meta_with_afor)
# sort by Mutationinformation
head(meta_with_afor$Mutationinformation)
meta_with_afor = meta_with_afor[order(meta_with_afor$Mutationinformation),]
head(meta_with_afor$Mutationinformation)
# sanity check: should be True for all the mentioned columns
#is.numeric(meta_with_afor$OR)
na_var = c("AF", "OR", "pvalue", "logor", "neglog10pvalue")
c1 = NULL
for (i in na_var){
print(i)
c0 = is.numeric(meta_with_afor[,i])
c1 = c(c0, c1)
if ( all(c1) ){
print("Sanity check passed: These are all numeric cols")
} else{
print("Error: Please check your respective data types")
}
}
# If OR and pvalue are not numeric, convert them to numeric before counting
# NAs, else their NA counts will be reported as 0.
# NOW count NAs in each column: if you did it before the conversion, the
# OR and pvalue columns would report 0 NAs since they were not numeric
na_count = sapply(meta_with_afor, function(y) sum(is.na(y))); na_count
str(na_count)
# check that the no. of NAs is the same for all these cols
na_len = NULL
na_var = c("AF", "OR", "pvalue", "logor", "neglog10pvalue")
for (i in na_var){
temp = na_count[[i]]
na_len = c(na_len, temp)
}
my_nrows = NULL
for ( i in 1: (length(na_len)-1) ){
#print(compare(na_len[i]), na_len[i+1])
c = compare(na_len[i], na_len[i+1])
if ( c$result ) {
my_nrows = na_len[i] }
else {
print("Error: Please check your numbers")
}
}
my_nrows
#=#=#=#=#=#=#=#=#
# COMMENT: AF, OR, pvalue, logor and neglog10pvalue
# all have 81 NAs (960 for pyrazinamide),
# and these NAs occur at the same rows
#=#=#=#=#=#=#=#=#
# sanity check
#which(is.na(meta_with_afor$OR))
# initialise an empty df with nrows as extracted above
na_count_df = data.frame(matrix(vector(mode = 'numeric'
# , length = length(na_var)
)
, nrow = my_nrows
# , ncol = length(na_var)
))
# populate the df with the indices of the cols that are NA
for (i in na_var){
print(i)
na_i = which(is.na(meta_with_afor[i]))
na_count_df = cbind(na_count_df, na_i)
colnames(na_count_df)[which(na_var == i)] <- i
}
# Now compare these indices to ensure these are the same
c2 = NULL
for ( i in 1: ( length(na_count_df)-1 ) ) {
# print(na_count_df[i] == na_count_df[i+1])
c1 = identical(na_count_df[[i]], na_count_df[[i+1]])
c2 = c(c1, c2)
if ( all(c2) ) {
print("Sanity check passed: The indices for AF, OR, etc are all the same")
} else {
print ("Error: Please check indices which are NA")
}
}
rm( c, c1, c2, i, my_nrows
, na_count, na_i, na_len
, na_var, temp
, na_count_df
, pos_count_check )
###########################
# 3: merging two dfs: with NA
###########################
# link col name = Mutationinformation
head(mcsm_data$Mutationinformation)
head(meta_with_afor$Mutationinformation)
#########
# merge 1a: meta data with mcsm
#########
merged_df2 = merge(x = meta_with_afor
, y = mcsm_data
, by = "Mutationinformation"
, all.y = T)
head(merged_df2$Position)
# sort by Position
head(merged_df2$Position)
merged_df2 = merged_df2[order(merged_df2$Position),]
head(merged_df2$Position)
merged_df2v2 = merge(x = meta_with_afor
, y = mcsm_data
, by = "Mutationinformation"
, all.x = T)
#!=!=!=!=!=!=!=!
# COMMENT: used all.y since position 186 is not part of the structure,
# hence doesn't have an mcsm value,
# but 186 is associated with a mutation
#!=!=!=!=!=!=!=!
# should be False
identical(merged_df2, merged_df2v2)
table(merged_df2$Position%in%merged_df2v2$Position)
rm(merged_df2v2)
#########
# merge 1b: remove duplicate mutation information
#########
#==#=#=#=#=#=#
# Cannot trust lineage, country from this df as the same mutation
# can have many different lineages
# but this should be good for the numerical corr plots
#=#=#=#=#=#=#=
merged_df3 = merged_df2[!duplicated(merged_df2$Mutationinformation),]
head(merged_df3$Position) ; tail(merged_df3$Position) # should be sorted
# sanity checks
# nrows of merged_df3 should be the same as the nrows of mcsm_data
if(nrow(mcsm_data) == nrow(merged_df3)){
print("sanity check: Passed")
} else {
print("Error!: check data, nrows is not as expected")
}
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# uncomment as necessary
# only needed if merged_df2v2 was used, i.e. non-structural pos are included
#mcsm = mcsm_data$Mutationinformation
#my_merged = merged_df3$Mutationinformation
# find the index where it differs
#diff_n = which(!my_merged%in%mcsm)
#check if it is indeed pos 186
#merged_df3[diff_n,]
# remove this entry
#merged_df3 = merged_df3[-diff_n,]
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
###########################
# 3b: merging two dfs: without NA
###########################
#########
# merge 2a: same as merge 1 but excluding NA
#########
merged_df2_comp = merged_df2[!is.na(merged_df2$AF),]
#########
# merge 2b: remove duplicate mutation information
#########
merged_df3_comp = merged_df2_comp[!duplicated(merged_df2_comp$Mutationinformation),]
# FIXME: add this as a sanity check. I have manually checked!
# alternate way of deriving merged_df3_comp
foo = merged_df3[!is.na(merged_df3$AF),]
# compare dfs: foo and merged_df3_comp
all.equal(foo, merged_df3_comp)
summary(comparedf(foo, merged_df3_comp))
#=============== end of combining df
#clear variables
rm(mcsm_data
, meta_with_afor
, foo)
#rm(diff_n, my_merged, mcsm)
#===============end of script
#=====================
# write_output files
#=====================
# Not required as this is a subset of the "combining_two_df.R" script

View file

@ -1,244 +0,0 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Jun 25 08:46:36 2019
@author: tanushree
"""
############################################
# load libraries
import os
import pandas as pd
from Bio import SeqIO
############################################
#********************************************************************
# TASK: Read in fasta files and create mutant sequences akin to a MSA,
# to allow generation of logo plots
# Requirements:
# input: Fasta file of protein/target for which mut seqs will be created
# path: "Data/<drug>/input/original/<filename>"
# output: MSA for mutant sequences
# path: "Data/<drug>/input/processed/<filename>"
#***********************************************************************
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
############# specify variables for input and output paths and filenames
homedir = os.path.expanduser('~') # spyder/python doesn't recognise tilde
basedir = "/git/Data/pyrazinamide/input"
# input
inpath = "/original"
in_filename_fasta = "/3pl1.fasta.txt"
infile_fasta = homedir + basedir + inpath + in_filename_fasta
print("Input file is:", infile_fasta)
inpath_p = "/processed"
in_filename_meta_data = "/meta_data_with_AFandOR.csv"
infile_meta_data = homedir + basedir + inpath_p + in_filename_meta_data
print("Input file is:", infile_meta_data)
# output: only path specified, filenames in respective sections
outpath = "/processed"
################## end of variable assignment for input and output files
#==========
#read files
#==========
#############
#fasta file
#############
#my_file = infile_fasta
my_fasta = str()
for seq_record in SeqIO.parse(infile_fasta, "fasta"):
    my_seq = seq_record.seq
    my_fasta = str(my_seq)  # convert to a string
    print(my_fasta)
# print( len(my_fasta) )
# print( type(my_fasta) )
len(my_fasta)
#############
# SNP info
#############
# read mutant_info file and extract cols with positions and mutant_info
# This should be all samples with pncA muts
#my_data = pd.read_csv('mcsm_complex1_normalised.csv') #335, 15
#my_data = pd.read_csv('meta_data_with_AFandOR.csv') #3093, 22
my_data = pd.read_csv(infile_meta_data) #3093, 22
list(my_data.columns)
#FIXME: You need a better way to identify this
# remove positions not in the structure
#pos_remove = 186
my_data = my_data[my_data.position != 186] #3092, 22
# if multiple positions, then try the example below;
# https://stackoverflow.com/questions/29017525/deleting-rows-based-on-multiple-conditions-python-pandas
#df = df[(df.one > 0) | (df.two > 0) | (df.three > 0) & (df.four < 1)]
#mut_info1 = my_data[['Position', 'Mutant_type']] #335, 2
mut_info1 = my_data[['position', 'mutant_type']].copy() #3092, 2; .copy() avoids the SettingWithCopyWarning noted below
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
###############
# data cleaning
################
# extract only those positions that have a frequency count of pos>1
###mut_info['freq_pos'] = mut_info.groupby('Position').count()#### dodgy
# add a column of frequency for each position
#mut_info1['freq_pos'] = mut_info1.groupby('Position')['Position'].transform('count') #335,3
mut_info1['freq_pos'] = mut_info1.groupby('position')['position'].transform('count') #3092,3
# sort by position
mut_info2 = mut_info1.sort_values(by=['position'])
#FIXME
#__main__:1: SettingWithCopyWarning:
#A value is trying to be set on a copy of a slice from a DataFrame.
#Try using .loc[row_indexer,col_indexer] = value instead
#See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#sort dataframe by freq values so the row indices are in order!
#mut_info2 = mut_info1.sort_values(by = 'freq_pos'
# , axis = 0
# , ascending = False
# , inplace = False
# , na_position = 'last')
#mut_info2 = mut_info2.reset_index( drop = True)
# count how many pos have freq 1 as you will need to exclude those
(mut_info2.freq_pos == 1).sum() #20
# extract entries with freq_pos>1
# should be 3092-20 = 3072
mut_info3 = mut_info2.loc[mut_info2['freq_pos'] >1] #3072
# reset index to allow iteration <<<<<<<< IMPORTANT
mut_info = mut_info3.reset_index(drop = True)
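# Aside: reset_index(drop = True) renumbers the rows 0..n-1. The generation
# loop below looks up mut_info['mutant_type'][i] by label, so stale row
# labels left over from the filtering above would raise a KeyError.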
del(mut_info1, mut_info2, mut_info3, my_data)
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
###################
# generate mut seqs
###################
mut_seqsL = []  # will collect one mutant sequence per row of mut_info
# iterate
for i, pos in enumerate(mut_info['position']):
    print('index:', i, 'position:', pos)
    mut = mut_info['mutant_type'][i]
    # print(mut)
    # print(type(mut))
    print('index:', i, 'position:', pos, 'mutant', mut)
    my_fastaL = list(my_fasta)
    offset_pos = pos-1  # due to counting starting from 0
    my_fastaL[offset_pos] = mut
    # print(my_fastaL)
    mut_seq = "".join(my_fastaL)
    # print(mut_seq + '\n')
    mut_seqsL.append(mut_seq)
    # print('original:', my_fasta, ',', 'replaced at', pos, 'with', mut, mut_seq)
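# Worked example (hypothetical 5-residue sequence): with my_fasta = "MRALI",
# pos = 4 and mut = 'W', offset_pos = 3, so my_fastaL becomes
# ['M', 'R', 'A', 'W', 'I'] and mut_seq = "MRAWI".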
###############
# sanity check
################
len_orig = len(my_fasta)
# checking if all the mutant sequences have the same length as the original fasta file sequence
for seqs in mut_seqsL:
    # print(seqs)
    # print(len(seqs))
    if len(seqs) != len_orig:
        print('sequence lengths mismatch' + '\n', 'mutant seq length:', len(seqs), 'vs original seq length:', len_orig)
    else:
        print('**Hooray** Length of mutant and original sequences match')
del(i, len_orig, mut, mut_seq, my_fastaL, offset_pos, pos, seqs)
############
# write file
############
#filepath = homedir +'/git/LSHTM_Y1_PNCA/combined_v3/logo_plot/snp_seqsfile'
#filepath = homedir + '/git/LSHTM_Y1_PNCA/mcsm_analysis/pyrazinamide/Data/gene_msa.txt'
print(outpath)
out_filename_gene = "/gene_msa.txt"
outfile_gene = homedir + basedir + outpath + out_filename_gene
print("Output file is:", outfile_gene)
with open(outfile_gene, 'w') as file_handler:
    for item in mut_seqsL:
        file_handler.write("{}\n".format(item))
# also write the sequences to Columns.csv in the current working directory
R = "\n".join(mut_seqsL)
with open('Columns.csv', 'w') as f:
    f.write(R)
#################################################################################
# extracting only positions with SNPs so that when you plot only those positions
################################################################################
#mut_seqsL = mut_seqsL[:3] #just trying with 3 seqs
# create a list of unique positions
pos = mut_info['position'] #3072
posL = list(set(list(pos))) #110
del(pos)
snp_seqsL = []  # will collect one SNP-only sequence per mutant sequence
for j, mut_seq in enumerate(mut_seqsL):
    print(j, mut_seq)
    # print(mut_seq[101]) # testing: should be P, T, V (in order of the mut_info file)
    mut_seqsE = list(mut_seq)
    # extract the specific positions (corresponding to SNPs) from each mutant sequence
    snp_seqL1 = [mut_seqsE[i-1] for i in posL]  # should be 110
    # print(snp_seqL1)
    # print(len(snp_seqL1))
    snp_seq_clean = "".join(snp_seqL1)
    snp_seqsL.append(snp_seq_clean)
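# Worked example (hypothetical): with posL = [2, 5] and mut_seq = "MRAWI",
# the comprehension picks the 1-based positions 2 and 5 -> ['R', 'I'],
# giving snp_seq_clean = "RI".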
###############
# sanity check
################
no_unique_snps = len(posL)
# checking if all the SNP-only sequences have length equal to the no. of unique SNP positions
for seqs in snp_seqsL:
    # print(seqs)
    # print(len(seqs))
    if len(seqs) != no_unique_snps:
        print('sequence lengths mismatch' + '\n', 'snp seq length:', len(seqs), 'vs no. of unique snps:', no_unique_snps)
    else:
        print('**Hooray** Length of SNP-only sequences matches the no. of unique SNPs')
del(mut_seq, mut_seqsE, mut_seqsL, seqs, snp_seqL1, snp_seq_clean)
############
# write file
############
#filepath = homedir +'/git/LSHTM_Y1_PNCA/combined_v3/logo_plot/snp_seqsfile'
#filepath = homedir + '/git/LSHTM_Y1_PNCA/mcsm_analysis/pyrazinamide/Data/snps_msa.txt'
print(outpath)
out_filename_snps = "/snps_msa.txt"
outfile_snps = homedir + basedir + outpath + out_filename_snps
print("Output file is:", outfile_snps)
with open(outfile_snps, 'w') as file_handler:
    for item in snp_seqsL:
        file_handler.write("{}\n".format(item))
# NOTE: this overwrites the Columns.csv written earlier for the full-length sequences
R = "\n".join(snp_seqsL)
with open('Columns.csv', 'w') as f:
    f.write(R)

View file

@ -1,9 +0,0 @@
#!/bin/bash
# run all bash scripts for mcsm
#./step0_check_duplicate_SNPs.sh
#./step1_lig_output_urls.sh
./step2_lig_results.sh
./step3a_results_format_interim.sh

View file

@ -1,25 +0,0 @@
#!/bin/bash
#*************************************
# need to be in the correct directory
#*************************************
##: comments for code
#: commented out code
#**********************************************************************
# TASK: Text file containing a list of SNPs, one SNP per line in the
# format (C2E). Sort by unique, which automatically removes duplicates,
# and save the file in the current directory
#**********************************************************************
infile="${HOME}/git/Data/pyrazinamide/input/processed/pnca_mis_SNPs_v2.csv"
outfile="${HOME}/git/Data/pyrazinamide/input/processed/pnca_mis_SNPs_v2_unique.csv"
# sort unique entries and output to current directory
sort -u ${infile} > ${outfile}
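# Aside (hypothetical input): given a file containing the lines
#   C2E
#   A102T
#   C2E
# `sort -u` emits each distinct line once, sorted: A102T then C2E.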
# count no. of unique snps mCSM will run on
count=$(wc -l < ${outfile})
# print to console no. of unique snps mCSM will run on
echo "${count} unique mutations for mCSM to run on"

View file

@ -1,104 +0,0 @@
#!/bin/bash
#**********************************************************************
# TASK: submit requests using curl: HANDLE redirects and refresh url.
# Iterate over mutation file and write/append result urls to a file
# Mutation file must have one mutation (format A1B) per line
# Requirements
# input: mutation list (format: A1B), complex struc: (pdb format)
# mutation: outFile from step0, one unique mutation/line, no chain ID
# path: "Data/<drug>/input/processed/<filename>"
# structure: pdb file of drug-target complex
# path: "Data/<drug>/input/structure/<filename>"
# output: should be n urls (n=no. of unique mutations in file)
# path: "Data/<drug>/input/processed/<filename>"
# NOTE: these are just result urls, not actual values for results
#**********************************************************************
############# specify variables for input and output paths and filenames
homedir="${HOME}"
#echo Home directory is ${homedir}
basedir="/git/Data/pyrazinamide/input"
# input
inpath_mut="/processed"
in_filename_mut="/pnca_mis_SNPs_v2_unique.csv"
infile_mut="${homedir}${basedir}${inpath_mut}${in_filename_mut}"
echo Input Mut filename: ${infile_mut}
inpath_struc="/structure"
in_filename_struc="/complex1_no_water.pdb"
infile_struc="${homedir}${basedir}${inpath_struc}${in_filename_struc}"
echo Input Struc filename: ${infile_struc}
# output
outpath="/processed"
out_filename="/complex1_result_url.txt"
outfile="${homedir}${basedir}${outpath}${out_filename}"
#echo Output filename: ${outfile}
################## end of variable assignment for input and output files
# iterate over mutation file (infile_mut); line by line and
# submit query using curl
# some useful messages
echo -n -e "Processing $(wc -l < ${infile_mut}) entries from ${infile_mut}\n"
COUNT=0
while read -r line; do
((COUNT++))
mutation="${line}"
# echo "${mutation}"
#pdb='../Data/complex1_no_water.pdb'
pdb="${infile_struc}"
chain="A"
lig_id="PZA"
affin_wt="0.99"
host="http://biosig.unimelb.edu.au"
call_url="/mcsm_lig/prediction"
#=========================================
##html field_names names required for curl
##complex_field:wild=@
##mutation_field:mutation=@
##chain_field:chain=@
##ligand_field:lig_id@
##energy_field:affin_wt
#=========================================
refresh_url=$(curl -L \
-sS \
-F "wild=@${pdb}" \
-F "mutation=${mutation}" \
-F "chain=${chain}" \
-F "lig_id=${lig_id}" \
-F "affin_wt=${affin_wt}" \
${host}${call_url} | grep "http-equiv")
#echo Refresh URL: $refresh_url
#echo Host+Refresh: ${host}${refresh_url}
# use regex to extract the relevant bit from the refresh url
# regex:sed -r 's/.*(\/mcsm.*)".*$/\1/g'
# Now build: result url using host and refresh url and write the urls to a file
result_url=$(echo $refresh_url | sed -r 's/.*(\/mcsm.*)".*$/\1/g')
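# Aside: the grep above keeps the server's <meta http-equiv="refresh" ...>
# line; the sed capture keeps everything from "/mcsm" up to the closing
# double quote. Hypothetical line for illustration:
#   in:  <meta http-equiv="refresh" content="2;URL=/mcsm_lig/results/abc123">
#   out: /mcsm_lig/results/abc123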
sleep 10
echo -e "${mutation} : processing entry ${COUNT}/$(wc -l < ${infile_mut})..."
# create output file with the added number of muts from file
# after much thought, bad idea as less generic!
#echo -e "${host}${result_url}" >> ../Results/$(wc -l < ${filename})_complex1_result_url.txt
echo -e "${host}${result_url}" >> ${outfile}
#echo -n '.'
done < "${infile_mut}"
#FIXME: stop executing if error else these echo statements are misleading!
echo
echo Output filename: ${outfile}
echo
echo Number of urls saved: $(wc -l < ${infile_mut})
echo
echo "Processing Complete"
# end of submitting query, receiving result url and storing results url in a file

View file

@ -1,76 +0,0 @@
#!/bin/bash
#********************************************************************
# TASK: submit result urls and fetch actual results using curl
# Iterate over each result url from the output of step1 stored in processed/
# Use curl to fetch results and extract relevant sections using hxtools
# and store these in another file in processed/
# Requirements:
# input: output of step1, file containing result urls
# path: "Data/<drug>/input/processed/<filename>"
# output: name of the file where extracted results will be stored
# path: "Data/<drug>/input/processed/<filename>"
# Optional: can make these command line args you pass when calling script
# by uncommenting code as indicated
#*********************************************************************
############################# uncomment: to make it command line args
#if [ "$#" -ne 2 ]; then
#if [ -z "$1" ]; then
# echo "
# Please provide both Input and Output files.
# Usage: batch_read_urls.sh INFILE OUTFILE
# "
# exit 1
#fi
# First argument: Input File
# Second argument: Output File
#infile=$1
#outfile=$2
############################ end of code block to make command line args
############# specify variables for input and output paths and filenames
homedir="${HOME}"
#echo Home directory is ${homedir}
basedir="/git/Data/pyrazinamide/input"
# input
inpath="/processed"
in_filename="/complex1_result_url.txt"
infile="${homedir}${basedir}${inpath}${in_filename}"
echo Input URL filename: ${infile}
# output
outpath="/processed"
out_filename="/complex1_output_MASTER.txt"
outfile="${homedir}${basedir}${outpath}${out_filename}"
echo Output filename: ${outfile}
################## end of variable assignment for input and output files
# Iterate over each result url, and extract results using hxtools
# which nicely cleans and formats html
echo -n "Processing $(wc -l < ${infile}) entries from ${infile}"
echo
COUNT=0
while read -r line; do
#COUNT=$(($COUNT+1))
((COUNT++))
curl --silent ${line} \
| hxnormalize -x \
| hxselect -c div.span4 \
| hxselect -c div.well \
| sed -r -e 's/<[^>]*>//g' \
| sed -re 's/ +//g' \
>> ${outfile}
#| tee -a ${outfile}
# echo -n '.'
echo -e "Processing entry ${COUNT}/$(wc -l < ${infile})..."
done < "${infile}"
echo
echo "Processing Complete"

View file

@ -1,74 +0,0 @@
#!/bin/bash
#********************************************************************
# TASK: Intermediate results processing.
# The output file has a convenient ":" delimiter that can be used to
# format the file into two columns (col1: field_desc and col2: values).
# However, the sections "PredictedAffinityChange:..." and
# "DUETstabilitychange:..." are split over multiple lines, which
# prevents this, and there are also empty lines that need to be
# omitted. This script joins those sections back onto single lines
# and removes the empty lines.
# Requirements:
# input: output of step2, file containing mcsm results as described above
# path: "Data/<drug>/input/processed/<filename>"
# output: replaces the file in place.
# Therefore first create a copy of the input file,
# renamed to remove the word "MASTER" and add the word "processed"
# file format: .txt
# NOTE: This replaces the file in place!
# the output is a txt file with no stray newlines, formatted as
# "<colname>:<value>"
#***********************************************************************
############# specify variables for input and output paths and filenames
homedir="${HOME}"
basedir="/git/Data/pyrazinamide/input"
inpath="/processed"
# Create input file: copy and rename output file of step2
oldfile="${homedir}${basedir}${inpath}/complex1_output_MASTER.txt"
newfile="${homedir}${basedir}${inpath}/complex1_output_processed.txt"
cp $oldfile $newfile
echo Input filename is ${oldfile}
echo
echo Output, i.e. copied, filename is ${newfile}
# output: no output per se
# Replacement in place inside the copied file
################## end of variable assignment for input and output files
#sed -i '/PredictedAffinityChange:/ { N; N; N; N; s/\n//g;}' ${newfile} \
# | sed -i '/DUETstabilitychange:/ {x; N; N; s/\n//g; p;d;}' ${newfile}
# Outputs records separated by a newline, that look something like this:
# PredictedAffinityChange:-2.2log(affinityfoldchange)-Destabilizing
# Mutationinformation:
# Wild-type:L
# Position:4
# Mutant-type:W
# Chain:A
# LigandID:PZA
# Distancetoligand:15.911&Aring;
# DUETstabilitychange:-2.169Kcal/mol
#
# PredictedAffinityChange:-1.538log(affinityfoldchange)-Destabilizing
# (...etc)
# This script brings everything into a convenient format for further processing in python.
sed -i '/PredictedAffinityChange/ {
N
N
N
N
s/\n//g
}
/DUETstabilitychange:/ {
N
N
s/\n//g
}
/^$/d' ${newfile}
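# Aside: within sed, each N appends the next input line to the pattern
# space, so four Ns join a PredictedAffinityChange record spread over 5
# physical lines and two Ns join a 3-line DUETstabilitychange record;
# s/\n//g deletes the embedded newlines and /^$/d drops blank lines.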

View file

@ -1,63 +0,0 @@
#!/usr/bin/python
###################
# load libraries
import os
import pandas as pd
from collections import defaultdict
####################
#********************************************************************
# TASK: Formatting results with nice colnames
# step3a processed the mcsm results to remove all newlines and
# brought data in a format where the delimiter ":" splits
# data into a convenient format of "colname": "value".
# this script formats the data and outputs a df with each row
# as a mutation and its corresponding mcsm_values
# Requirements:
# input: output of step3a, file containing "..._output_processed.txt"
# path: "Data/<drug>/input/processed/<filename>"
# output: formatted .csv file
# path: "Data/<drug>/input/processed/<filename>"
#***********************************************************************
############# specify variables for input and output paths and filenames
homedir = os.path.expanduser('~') # spyder/python doesn't recognise tilde
basedir = "/git/Data/pyrazinamide/input"
# input
inpath = "/processed"
in_filename = "/complex1_output_processed.txt"
infile = homedir + basedir + inpath + in_filename
print("Input file is:", infile)
# output
outpath = "/processed"
out_filename = "/complex1_formatted_results.csv"
outfile = homedir + basedir + outpath + out_filename
print("Output file is:", outfile)
################## end of variable assignment for input and output files
outCols=[
'PredictedAffinityChange',
'Mutationinformation',
'Wild-type',
'Position',
'Mutant-type',
'Chain',
'LigandID',
'Distancetoligand',
'DUETstabilitychange'
]
with open(infile) as f:
    lines = [line.rstrip('\n') for line in f]
outputs = defaultdict(list)
for item in lines:
    col, val = item.split(':', 1)  # split on the first ":" only, in case a value contains ":"
    outputs[col].append(val)
dfOut=pd.DataFrame(outputs)
pd.DataFrame.to_csv(dfOut, outfile, columns=outCols)
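# Aside: each "colname:value" line appends its value to outputs[col], so the
# columns of dfOut stay aligned provided every record contributes exactly one
# value per field; outCols then fixes the column order in the written csv.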

View file

@ -1,230 +0,0 @@
getwd()
#setwd("~/git/LSHTM_analysis/mcsm_complex1/Results")
getwd()
#=======================================================
# TASK: read formatted_results_df.csv to complete
# missing info, adding DUET categories, assigning
# meaningful colnames, etc.
# Requirements:
# input: output of step3b, python processing,
# path: Data/<drug>/input/processed/<filename>"
# output: NO output as the next scripts refers to this
# for yet more processing
#=======================================================
# specify variables for input and output paths and filenames
homedir = "~"
basedir = "/git/Data/pyrazinamide/input"
inpath = "/processed"
in_filename = "/complex1_formatted_results.csv"
infile = paste0(homedir, basedir, inpath, in_filename)
print(paste0("Input file is:", infile))
#======================================================
#TASK: To tidy the columns so you can generate figures
#=======================================================
####################
#### read file #####: this will be the output from python script (csv file)
####################
data = read.csv(infile
, header = T
, stringsAsFactors = FALSE)
dim(data)
str(data)
# clear variables
rm(homedir, basedir, inpath, in_filename, infile)
###########################
##### Data processing #####
###########################
# populate mutation information columns as currently it is empty
head(data$Mutationinformation)
tail(data$Mutationinformation)
# should not be blank: create mutation information
data$Mutationinformation = paste0(data$Wild.type, data$Position, data$Mutant.type)
head(data$Mutationinformation)
tail(data$Mutationinformation)
#write.csv(data, 'test.csv')
##########################################
# Remove duplicate SNPs as a sanity check
##########################################
# very important
table(duplicated(data$Mutationinformation))
# extract duplicated entries
dups = data[duplicated(data$Mutationinformation),] #0
# No of dups should match with the no. of TRUE in the above table
#u_dups = unique(dups$Mutationinformation) #10
sum( table(dups$Mutationinformation) )
#***************************************************************
# select non-duplicated SNPs and create a new df
df = data[!duplicated(data$Mutationinformation),]
#***************************************************************
# sanity check
u = unique(df$Mutationinformation)
u2 = unique(data$Mutationinformation)
table(u%in%u2)
# should all be 1
sum(table(df$Mutationinformation) == 1)
# sort df by Position
# MANUAL CHECKPOINT:
#foo <- df[order(df$Position),]
#df <- df[order(df$Position),]
# clear variables
rm(u, u2, dups)
####################
#### give meaningful colnames to reflect units to enable correct data type
####################
#=======
#STEP 1
#========
# make a copy of the PredictedAffinityColumn and call it Lig_outcome
df$Lig_outcome = df$PredictedAffinityChange
#make Predicted...column numeric and outcome column categorical
head(df$PredictedAffinityChange)
df$PredictedAffinityChange = gsub("log.*"
, ""
, df$PredictedAffinityChange)
# sanity checks
head(df$PredictedAffinityChange)
# should be numeric, check and if not make it numeric
is.numeric( df$PredictedAffinityChange )
# change to numeric
df$PredictedAffinityChange = as.numeric(df$PredictedAffinityChange)
# should be TRUE
is.numeric( df$PredictedAffinityChange )
# change the column name to indicate units
n = which(colnames(df) == "PredictedAffinityChange"); n
colnames(df)[n] = "PredAffLog"
colnames(df)[n]
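# Worked example for the gsub above (value format as in step3a's output):
# gsub("log.*", "", "-2.2log(affinityfoldchange)-Destabilizing") # -> "-2.2"
# i.e. everything from "log" onwards is stripped, leaving just the number.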
#========
#STEP 2
#========
# make Lig_outcome column categorical showing effect of mutation
head(df$Lig_outcome)
df$Lig_outcome = gsub("^.*-"
, "",
df$Lig_outcome)
# sanity checks
head(df$Lig_outcome)
# should be factor, check and if not change it to factor
is.factor(df$Lig_outcome)
# change to factor
df$Lig_outcome = as.factor(df$Lig_outcome)
# should be TRUE
is.factor(df$Lig_outcome)
#========
#STEP 3
#========
# gsub
head(df$Distancetoligand)
df$Distancetoligand = gsub("&Aring;"
, ""
, df$Distancetoligand)
# sanity checks
head(df$Distancetoligand)
# should be numeric, check if not change it to numeric
is.numeric(df$Distancetoligand)
# change to numeric
df$Distancetoligand = as.numeric(df$Distancetoligand)
# should be TRUE
is.numeric(df$Distancetoligand)
# change the column name to indicate units
n = which(colnames(df) == "Distancetoligand")
colnames(df)[n] <- "Dis_lig_Ang"
colnames(df)[n]
#========
#STEP 4
#========
#gsub
head(df$DUETstabilitychange)
df$DUETstabilitychange = gsub("Kcal/mol"
, ""
, df$DUETstabilitychange)
# sanity checks
head(df$DUETstabilitychange)
# should be numeric, check if not change it to numeric
is.numeric(df$DUETstabilitychange)
# change to numeric
df$DUETstabilitychange = as.numeric(df$DUETstabilitychange)
# should be TRUE
is.numeric(df$DUETstabilitychange)
# change the column name to indicate units
n = which(colnames(df) == "DUETstabilitychange"); n
colnames(df)[n] = "DUETStability_Kcalpermol"
colnames(df)[n]
#========
#STEP 5
#========
# create yet another extra column: classification of DUET stability only
df$DUET_outcome = ifelse(df$DUETStability_Kcalpermol >=0
, "Stabilizing"
, "Destabilizing") # spelling to be consistent with mcsm
table(df$Lig_outcome)
table(df$DUET_outcome)
#==============================
#FIXME
#Insert a venn diagram
#================================
#========
#STEP 6
#========
# assign wild and mutant colnames correctly
wt = which(colnames(df) == "Wild.type"); wt
colnames(df)[wt] <- "Wild_type"
colnames(df[wt])
mut = which(colnames(df) == "Mutant.type"); mut
colnames(df)[mut] <- "Mutant_type"
colnames(df[mut])
#========
#STEP 7
#========
# create an extra column: maybe useful for some plots
df$WildPos = paste0(df$Wild_type, df$Position)
# clear variables
rm(n, wt, mut)
################ end of data cleaning

View file

@ -1,275 +0,0 @@
##################
# load libraries
library(compare)
##################
getwd()
#=======================================================
# TASK:read cleaned data and perform rescaling
# of DUET stability scores
# of Pred affinity
# compare scaling methods with plots
# Requirements:
# input: R script, step3c_results_cleaning.R
# path: Data/<drug>/input/processed/<filename>"
# output: NO output as the next scripts refers to this
# for yet more processing
# output normalised file
#=======================================================
# specify variables for input and output paths and filenames
homedir = "~"
currdir = "/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/mcsm"
in_filename = "/step3c_results_cleaning.R"
infile = paste0(homedir, currdir, in_filename)
print(paste0("Input file is:", infile))
# output file
basedir = "/git/Data/pyrazinamide/input"
outpath = "/processed"
out_filename = "/mcsm_complex1_normalised.csv"
outfile = paste0(homedir, basedir, outpath, out_filename)
print(paste0("Output file is:", outfile))
####################
#### read file #####: this will be the output of my R script that cleans the data columns
####################
source(infile)
# This will output two dataframes:
# data: unclean data: 10 cols
# df : cleaned df: 13 cols
# you can remove data if you want as you will not need it
rm(data)
colnames(df)
#===================
#3a: PredAffLog
#===================
n = which(colnames(df) == "PredAffLog"); n
group = which(colnames(df) == "Lig_outcome"); group
#===================================================
# order according to PredAffLog values
#===================================================
# This makes it easier to see the results of rescaling when debugging
head(df$PredAffLog)
# ORDER BY PredAff scores: negative values at the top and positive at the bottom
df = df[order(df$PredAffLog),]
head(df$PredAffLog)
# sanity checks
head(df[,n]) # all negatives
tail(df[,n]) # all positives
# sanity checks
mean(df[,n])
#-0.9526746
tapply(df[,n], df[,group], mean)
#===========================
# Same as above: in 2 steps
#===========================
# find range of your data
my_min = min(df[,n]); my_min #
my_max = max(df[,n]); my_max #
#===============================================
# WITHIN GROUP rescaling 2: method "ratio"
# create column to store the rescaled values
# Rescaling separately (Less dangerous)
# =====> chosen one: preserves sign
#===============================================
df$ratioPredAff = ifelse(df[,n] < 0
, df[,n]/abs(my_min)
, df[,n]/my_max
)# 14 cols
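# Worked example (hypothetical range): with my_min = -3 and my_max = 1.5,
# a raw value of -1.5 maps to -1.5/3 = -0.5 and 0.75 maps to 0.75/1.5 = 0.5;
# negatives land in [-1, 0] and positives in [0, 1], so the sign (and hence
# the outcome class) is preserved.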
# sanity checks
head(df$ratioPredAff)
tail(df$ratioPredAff)
min(df$ratioPredAff); max(df$ratioPredAff)
tapply(df$ratioPredAff, df$Lig_outcome, min)
tapply(df$ratioPredAff, df$Lig_outcome, max)
# should be the same as below
sum(df$ratioPredAff < 0); sum(df$ratioPredAff > 0)
table(df$Lig_outcome)
#===============================================
# Hist and density plots to compare the rescaling
# methods: Base R
#===============================================
# uncomment as necessary
my_title = "Ligand_stability"
# my_title = colnames(df[n])
# Set the margin on all sides
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(2,2))
hist(df[,n]
, xlab = ""
, main = "Raw values"
)
hist(df$ratioPredAff
, xlab = ""
, main = "ratio rescaling"
)
# Plot density plots underneath
plot(density( df[,n] )
, main = "Raw values"
)
plot(density( df$ratioPredAff )
, main = "ratio rescaling"
)
# titles
mtext(text = "Frequency"
, side = 2
, line = 0
, outer = TRUE)
mtext(text = my_title
, side = 3
, line = 0
, outer = TRUE)
#clear variables
rm(my_min, my_max, my_title, n, group)
#===================
# 3b: DUET stability
#===================
dim(df) # 14 cols
n = which(colnames(df) == "DUETStability_Kcalpermol"); n # 10
group = which(colnames(df) == "DUET_outcome"); group #12
#===================================================
# order according to DUET scores
#===================================================
# This makes it easier to see the results of rescaling when debugging
head(df$DUETStability_Kcalpermol)
# ORDER BY DUET scores: negative values at the top and positive at the bottom
df = df[order(df$DUETStability_Kcalpermol),]
# sanity checks
head(df[,n]) # negatives
tail(df[,n]) # positives
# sanity checks
mean(df[,n])
tapply(df[,n], df[,group], mean)
#===============================================
# WITHIN GROUP rescaling 2: method "ratio"
# create column to store the rescaled values
# Rescaling separately (Less dangerous)
# =====> chosen one: preserves sign
#===============================================
# find range of your data
my_min = min(df[,n]); my_min
my_max = max(df[,n]); my_max
df$ratioDUET = ifelse(df[,n] < 0
, df[,n]/abs(my_min)
, df[,n]/my_max
) # 15 cols
# sanity check
head(df$ratioDUET)
tail(df$ratioDUET)
min(df$ratioDUET); max(df$ratioDUET)
# sanity checks
tapply(df$ratioDUET, df$DUET_outcome, min)
tapply(df$ratioDUET, df$DUET_outcome, max)
# should be the same as below (267 and 42)
sum(df$ratioDUET < 0); sum(df$ratioDUET > 0)
table(df$DUET_outcome)
#===============================================
# Hist and density plots to compare the rescaling
# methods: Base R
#===============================================
# uncomment as necessary
my_title = "DUET_stability"
#my_title = colnames(df[n])
# Set the margin on all sides
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(2,2))
hist(df[,n]
, xlab = ""
, main = "Raw values"
)
hist(df$ratioDUET
, xlab = ""
, main = "ratio rescaling"
)
# Plot density plots underneath
plot(density( df[,n] )
, main = "Raw values"
)
plot(density( df$ratioDUET )
, main = "ratio rescaling"
)
# graph titles
mtext(text = "Frequency"
, side = 2
, line = 0
, outer = TRUE)
mtext(text = my_title
, side = 3
, line = 0
, outer = TRUE)
# reorder by column name
#data <- data[c("A", "B", "C")]
colnames(df)
df2 = df[c("X", "Mutationinformation", "WildPos", "Position"
, "Wild_type", "Mutant_type"
, "DUETStability_Kcalpermol", "DUET_outcome"
, "Dis_lig_Ang", "PredAffLog", "Lig_outcome"
, "ratioDUET", "ratioPredAff"
, "LigandID","Chain")]
# sanity check
# should be True
#compare(df, df2, allowAll = T)
compare(df, df2, ignoreColOrder = T)
#TRUE
#reordered columns
#===================
# write output as csv file
#===================
#write.csv(df, "../Data/mcsm_complex1_normalised.csv", row.names = FALSE)
write.csv(df2, outfile, row.names = FALSE)

View file

@ -1,131 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
require(data.table)
require(dplyr)
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
###########################
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
###########################
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
###########################
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#%%%%%%%%%%%%%%%%%%%%%%%%
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#%%%%%%%%%%%%%%%%%%%%%%%%
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
###########################
# Data for bfactor figure
# PS average
# Lig average
###########################
head(my_df$Position)
head(my_df$ratioDUET)
# order data frame
df = my_df[order(my_df$Position),]
head(df$Position)
head(df$ratioDUET)
#***********
# PS: average by position
#***********
mean_DUET_by_position <- df %>%
group_by(Position) %>%
summarize(averaged.DUET = mean(ratioDUET))
#***********
# Lig: average by position
#***********
mean_Lig_by_position <- df %>%
group_by(Position) %>%
summarize(averaged.Lig = mean(ratioPredAff))
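# Aside: group_by(Position) %>% summarize(...) collapses my_df to one row
# per position, e.g. (hypothetical values) ratioDUET values of -0.2 and
# -0.4 at Position 4 average to -0.3.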
#***********
# cbind:mean_DUET_by_position and mean_Lig_by_position
#***********
combined = as.data.frame(cbind(mean_DUET_by_position, mean_Lig_by_position ))
# sanity check
# mean_PS_Lig_Bfactor
colnames(combined)
colnames(combined) = c("Position"
, "average_DUETR"
, "Position2"
, "average_PredAffR")
colnames(combined)
identical(combined$Position, combined$Position2)
n = which(colnames(combined) == "Position2"); n
combined_df = combined[,-n]
max(combined_df$average_DUETR) ; min(combined_df$average_DUETR)
max(combined_df$average_PredAffR) ; min(combined_df$average_PredAffR)
#=============
# output csv
#============
outDir = "~/git/Data/pyrazinamide/input/processed/"
outFile = paste0(outDir, "mean_PS_Lig_Bfactor.csv")
print(paste0("Output file with path will be:","", outFile))
head(combined_df$Position); tail(combined_df$Position)
write.csv(combined_df, outFile
, row.names = F)

View file

@ -1,250 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
require(cowplot)
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for OR and stability plots
# you need merged_df3_comp
# since these are matched
# to allow pairwise corr
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3_comp
#my_df = merged_df3
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# sanity check
# Ensure correct data type in columns to plot: need to be factor
is.numeric(my_df$OR)
#[1] TRUE
#<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
# FOR PS Plots
#<<<<<<<<<<<<<<<<<<<
PS_df = my_df
rm(my_df)
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> end of section 1
########################################################################
# Read file: call script for combining df for lig #
########################################################################
getwd()
source("combining_two_df_lig.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for OR and stability plots
# you need merged_df3_comp
# since these are matched
# to allow pairwise corr
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df2 = merged_df3_comp
#my_df2 = merged_df3
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df2)
str(my_df2)
# sanity check
# Ensure correct data type in columns to plot: need to be factor
is.numeric(my_df2$OR)
#[1] TRUE
# sanity check: should be <10
if (max(my_df2$Dis_lig_Ang) < 10){
print ("Sanity check passed: lig data is <10Ang")
}else{
print ("Error: data should be filtered to be within 10Ang")
}
#<<<<<<<<<<<<<<<<
# REASSIGNMENT
# FOR Lig Plots
#<<<<<<<<<<<<<<<<
Lig_df = my_df2
rm(my_df2)
#>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> end of section 1
#############
# Plots: Bubble plot
# x = Position, Y = stability
# size of dots = OR
# col: stability
#############
#=================
# generate plot 1: DUET vs OR by position as geom_points
#=================
my_ats = 20 # axis text size
my_als = 22 # axis label size
# Spelling Correction: made redundant as already corrected at the source
#PS_df$DUET_outcome[PS_df$DUET_outcome=='Stabilizing'] <- 'Stabilising'
#PS_df$DUET_outcome[PS_df$DUET_outcome=='Destabilizing'] <- 'Destabilising'
table(PS_df$DUET_outcome) ; sum(table(PS_df$DUET_outcome))
g = ggplot(PS_df, aes(x = factor(Position)
, y = ratioDUET))
p1 = g +
geom_point(aes(col = DUET_outcome
, size = OR)) +
theme(axis.text.x = element_text(size = my_ats
, angle = 90
, hjust = 1
, vjust = 0.4)
, axis.text.y = element_text(size = my_ats
, angle = 0
, hjust = 1
, vjust = 0)
, axis.title.x = element_text(size = my_als)
, axis.title.y = element_text(size = my_als)
, legend.text = element_text(size = my_als)
, legend.title = element_text(size = my_als) ) +
#, legend.key.size = unit(1, "cm")) +
labs(title = ""
, x = "Position"
, y = "DUET(PS)"
, size = "Odds Ratio"
, colour = "DUET Outcome") +
guides(colour = guide_legend(override.aes = list(size=4)))
p1
#=================
# generate plot 2: Lig vs OR by position as geom_points
#=================
my_ats = 20 # axis text size
my_als = 22 # axis label size
# Spelling Correction: made redundant as already corrected at the source
#Lig_df$Lig_outcome[Lig_df$Lig_outcome=='Stabilizing'] <- 'Stabilising'
#Lig_df$Lig_outcome[Lig_df$Lig_outcome=='Destabilizing'] <- 'Destabilising'
table(Lig_df$Lig_outcome)
g = ggplot(Lig_df, aes(x = factor(Position)
, y = ratioPredAff))
p2 = g +
geom_point(aes(col = Lig_outcome
, size = OR))+
theme(axis.text.x = element_text(size = my_ats
, angle = 90
, hjust = 1
, vjust = 0.4)
, axis.text.y = element_text(size = my_ats
, angle = 0
, hjust = 1
, vjust = 0)
, axis.title.x = element_text(size = my_als)
, axis.title.y = element_text(size = my_als)
, legend.text = element_text(size = my_als)
, legend.title = element_text(size = my_als) ) +
#, legend.key.size = unit(1, "cm")) +
labs(title = ""
, x = "Position"
, y = "Ligand Affinity"
, size = "Odds Ratio"
, colour = "Ligand Outcome"
) +
guides(colour = guide_legend(override.aes = list(size=4)))
p2
#======================
#combine using cowplot
#======================
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots"
getwd()
svg('PS_Lig_OR_combined.svg', width = 32, height = 12) #inches
#png('PS_Lig_OR_combined.png', width = 2800, height = 1080) #300dpi
theme_set(theme_gray()) # to preserve default theme
printFile = cowplot::plot_grid(plot_grid(p1, p2
, ncol = 1
, align = 'v'
, labels = c("A", "B")
, label_size = my_als+5))
print(printFile)
dev.off()

View file

@ -1,154 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
########################################################################
# Read file: call script for combining df for lig #
########################################################################
source("../combining_two_df_lig.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for Lig plots
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#%%%%%%%%%%%%%%%%%%%%%%%%
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#%%%%%%%%%%%%%%%%%%%%%%%%
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
#############################
# Extra sanity check:
# for mcsm_lig ONLY
# Dis_lig_Ang should be <10
#############################
if (max(my_df$Dis_lig_Ang) < 10){
print ("Sanity check passed: lig data is <10Ang")
}else{
print ("Error: data should be filtered to be within 10Ang")
}
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#==========================
# Plot: Barplot with scores (unordered)
# corresponds to Lig_outcome
# Stacked Barplot with colours: Lig_outcome @ position coloured by
# Lig_outcome. This is a barplot where each bar corresponds
# to a SNP and is coloured by its corresponding Lig_outcome.
#============================
#===================
# Data for plots
#===================
#%%%%%%%%%%%%%%%%%%%%%%%%
# REASSIGNMENT
df = my_df
#%%%%%%%%%%%%%%%%%%%%%%%%
rm(my_df)
# sanity checks
upos = unique(df$Position)
# should be a factor
is.factor(df$Lig_outcome)
#TRUE
table(df$Lig_outcome)
# should be -1 and 1: may not be in this case because you have filtered the data
# FIXME: normalisation before or after filtering?
min(df$ratioPredAff) #
max(df$ratioPredAff) #
# sanity checks
tapply(df$ratioPredAff, df$Lig_outcome, min)
tapply(df$ratioPredAff, df$Lig_outcome, max)
#******************
# generate plot
#******************
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
my_title = "Ligand affinity"
# axis label size
my_xaxls = 13
my_yaxls = 15
# axes text size
my_xaxts = 15
my_yaxts = 15
# no ordering of x-axis
g = ggplot(df, aes(factor(Position, ordered = T)))
g +
geom_bar(aes(fill = Lig_outcome), colour = "grey") +
theme( axis.text.x = element_text(size = my_xaxls
, angle = 90
, hjust = 1
, vjust = 0.4)
, axis.text.y = element_text(size = my_yaxls
, angle = 0
, hjust = 1
, vjust = 0)
, axis.title.x = element_text(size = my_xaxts)
, axis.title.y = element_text(size = my_yaxts ) ) +
labs(title = my_title
, x = "Position"
, y = "Frequency")
# for sanity and good practice
rm(df)
#======================= end of plot
# axis colours labels
# https://stackoverflow.com/questions/38862303/customize-ggplot2-axis-labels-with-different-colors
# https://stackoverflow.com/questions/56543485/plot-coloured-boxes-around-axis-label


@ -1,149 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages and functions #
########################################################################
source("../Header_TT.R")
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#==========================
###########################
# Data for DUET plots
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
# sanity check
is.factor(my_df$DUET_outcome)
my_df$DUET_outcome = as.factor(my_df$DUET_outcome)
is.factor(my_df$DUET_outcome)
#[1] TRUE
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#==========================
# Plot 2: Barplot with scores (unordered)
# corresponds to DUET_outcome
# Stacked Barplot with colours: DUET_outcome @ position coloured by
# DUET outcome. This is a barplot where each bar corresponds
# to a SNP and is coloured by its corresponding DUET_outcome
#============================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df = my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# sanity checks
upos = unique(df$Position)
# should be a factor
is.factor(df$DUET_outcome) # note: my_df was removed above
#[1] TRUE
table(df$DUET_outcome)
# should be -1 and 1
min(df$ratioDUET)
max(df$ratioDUET)
tapply(df$ratioDUET, df$DUET_outcome, min)
tapply(df$ratioDUET, df$DUET_outcome, max)
#******************
# generate plot
#******************
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
my_title = "Protein stability (DUET)"
# axis label size
my_xaxls = 13
my_yaxls = 15
# axes text size
my_xaxts = 15
my_yaxts = 15
# no ordering of x-axis
g = ggplot(df, aes(factor(Position, ordered = T)))
g +
geom_bar(aes(fill = DUET_outcome), colour = "grey") +
theme( axis.text.x = element_text(size = my_xaxls
, angle = 90
, hjust = 1
, vjust = 0.4)
, axis.text.y = element_text(size = my_yaxls
, angle = 0
, hjust = 1
, vjust = 0)
, axis.title.x = element_text(size = my_xaxts)
, axis.title.y = element_text(size = my_yaxts ) ) +
labs(title = my_title
, x = "Position"
, y = "Frequency")
# for sanity and good practice
rm(df)
#======================= end of plot
# axis colours labels
# https://stackoverflow.com/questions/38862303/customize-ggplot2-axis-labels-with-different-colors
# https://stackoverflow.com/questions/56543485/plot-coloured-boxes-around-axis-label


@ -1,202 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages and functions #
########################################################################
source("../Header_TT.R")
source("../barplot_colour_function.R")
########################################################################
# Read file: call script for combining df for lig #
########################################################################
source("../combining_two_df_lig.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for Lig plots
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
# sanity check
is.factor(my_df$Lig_outcome)
my_df$Lig_outcome = as.factor(my_df$Lig_outcome)
is.factor(my_df$Lig_outcome)
#[1] TRUE
#############################
# Extra sanity check:
# for mcsm_lig ONLY
# Dis_lig_Ang should be <10
#############################
if (max(my_df$Dis_lig_Ang) < 10){
print ("Sanity check passed: lig data is <10Ang")
}else{
print ("Error: data should be filtered to be within 10Ang")
}
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#==========================
# Plot: Barplot with scores (unordered)
# corresponds to Lig_outcome
# Stacked Barplot with colours: Lig_outcome @ position coloured by
# affinity scores. This is a barplot where each bar corresponds
# to a SNP and is coloured by its corresponding ligand affinity value.
# Normalised values (range between -1 and 1) to aid visualisation
# NOTE: since the barplot plots discrete values and colour = score, the
# number of colours equals the no. of unique normalised scores rather
# than a continuous scale, so the colour scale must be generated separately.
#============================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df = my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# sanity checks
table(df$Lig_outcome)
# should be -1 and 1: may not be in this case because you have filtered the data
# FIXME: normalisation before or after filtering?
min(df$ratioPredAff) #
max(df$ratioPredAff) #
# sanity checks
# very important!!!!
tapply(df$ratioPredAff, df$Lig_outcome, min)
tapply(df$ratioPredAff, df$Lig_outcome, max)
#******************
# generate plot
#******************
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
# My colour FUNCTION: based on group and subgroup
# in my case:
# df = df
# group = Lig_outcome
# subgroup = normalised score, i.e. ratioPredAff (or ratioLigR after rounding)
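# Illustrative aside: the real ColourPalleteMulti() lives in
# ../barplot_colour_function.R; the sketch below (hypothetical name, assumed
# behaviour, not the project's actual implementation) shows the usual
# group/subgroup palette idea: one base hue per group, shaded by the number
# of distinct subgroup values within that group.
ColourPalleteMultiSketch <- function(df, group, subgroup){
  # count distinct subgroup values within each group
  categories <- aggregate(as.formula(paste(subgroup, group, sep = "~"))
                          , df
                          , function(x) length(unique(x)))
  cat_light <- scales::hue_pal(l = 100)(nrow(categories)) # light end per group
  cat_dark  <- scales::hue_pal(l = 40)(nrow(categories))  # dark end per group
  # one colour ramp per group, with as many shades as subgroup values
  unlist(lapply(seq_len(nrow(categories)), function(i){
    colorRampPalette(c(cat_light[i], cat_dark[i]))(categories[i, 2])
  }))
}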
# Prepare data: round off ratioLig scores
# round off to 3 decimal places:
# 165 unique values if no rounding is performed: used to generate the original graph
# 156 unique values if rounded to 3 places
# FIXME: check if reducing precision creates any ML problems
# check unique values in normalised data
u = unique(df$ratioPredAff)
# <<<<< -------------------------------------------
# Run this section if rounding is to be used
# specify number for rounding
n = 3
df$ratioLigR = round(df$ratioPredAff, n)
u = unique(df$ratioLigR) # 156
# create an extra column called group which contains the "gp name and score"
# so colours can be generated for each unique values in this column
my_grp = df$ratioLigR
df$group <- paste0(df$Lig_outcome, "_", my_grp, sep = "")
# else
# uncomment the below if rounding is not required
#my_grp = df$ratioPredAff
#df$group <- paste0(df$Lig_outcome, "_", my_grp)
# <<<<< -----------------------------------------------
# Call the function to create the palette based on the group defined above
colours <- ColourPalleteMulti(df, "Lig_outcome", "my_grp")
my_title = "Ligand affinity"
# axis label size
my_xaxls = 13
my_yaxls = 15
# axes text size
my_xaxts = 15
my_yaxts = 15
# no ordering of x-axis
g = ggplot(df, aes(factor(Position, ordered = T)))
g +
geom_bar(aes(fill = group), colour = "grey") +
scale_fill_manual( values = colours
, guide = 'none') +
theme( axis.text.x = element_text(size = my_xaxls
, angle = 90
, hjust = 1
, vjust = 0.4)
, axis.text.y = element_text(size = my_yaxls
, angle = 0
, hjust = 1
, vjust = 0)
, axis.title.x = element_text(size = my_xaxts)
, axis.title.y = element_text(size = my_yaxts ) ) +
labs(title = my_title
, x = "Position"
, y = "Frequency")
# for sanity and good practice
rm(df)
#======================= end of plot
# axis colours labels
# https://stackoverflow.com/questions/38862303/customize-ggplot2-axis-labels-with-different-colors
# https://stackoverflow.com/questions/56543485/plot-coloured-boxes-around-axis-label


@ -1,192 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages and functions #
########################################################################
source("../Header_TT.R")
source("../barplot_colour_function.R")
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for DUET plots
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
# sanity check
is.factor(my_df$DUET_outcome)
my_df$DUET_outcome = as.factor(my_df$DUET_outcome)
is.factor(my_df$DUET_outcome)
#[1] TRUE
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#==========================
# Barplot with scores (unordered)
# corresponds to DUET_outcome
# Stacked Barplot with colours: DUET_outcome @ position coloured by
# stability scores. This is a barplot where each bar corresponds
# to a SNP and is coloured by its corresponding DUET stability value.
# Normalised values (range between -1 and 1) to aid visualisation
# NOTE: since the barplot plots discrete values and colour = score, the
# number of colours equals the no. of unique normalised scores rather
# than a continuous scale, so the colour scale must be generated separately.
#============================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df = my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# sanity checks
upos = unique(df$Position)
# should be a factor
is.factor(df$DUET_outcome) # note: my_df was removed above
#[1] TRUE
table(df$DUET_outcome)
# should be -1 and 1
min(df$ratioDUET)
max(df$ratioDUET)
tapply(df$ratioDUET, df$DUET_outcome, min)
tapply(df$ratioDUET, df$DUET_outcome, max)
#******************
# generate plot
#******************
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
# My colour FUNCTION: based on group and subgroup
# in my case:
# df = df
# group = DUET_outcome
# subgroup = normalised score, i.e. ratioDUET (or ratioDUETR after rounding)
# Prepare data: round off ratioDUET scores
# round off to 3 decimal places:
# 323 unique values if no rounding is performed: used to generate the original graph
# 287 unique values if rounded to 3 places
# FIXME: check if reducing precision creates any ML problems
# check unique values in normalised data
u = unique(df$ratioDUET)
# <<<<< -------------------------------------------
# Run this section if rounding is to be used
# specify number for rounding
n = 3
df$ratioDUETR = round(df$ratioDUET, n)
u = unique(df$ratioDUETR)
# create an extra column called group which contains the "gp name and score"
# so colours can be generated for each unique values in this column
my_grp = df$ratioDUETR
df$group <- paste0(df$DUET_outcome, "_", my_grp, sep = "")
# else
# uncomment the below if rounding is not required
#my_grp = df$ratioDUET
#df$group <- paste0(df$DUET_outcome, "_", my_grp, sep = "")
# <<<<< -----------------------------------------------
# Call the function to create the palette based on the group defined above
colours <- ColourPalleteMulti(df, "DUET_outcome", "my_grp")
my_title = "Protein stability (DUET)"
# axis label size
my_xaxls = 13
my_yaxls = 15
# axes text size
my_xaxts = 15
my_yaxts = 15
# no ordering of x-axis
g = ggplot(df, aes(factor(Position, ordered = T)))
g +
geom_bar(aes(fill = group), colour = "grey") +
scale_fill_manual( values = colours
, guide = 'none') +
theme( axis.text.x = element_text(size = my_xaxls
, angle = 90
, hjust = 1
, vjust = 0.4)
, axis.text.y = element_text(size = my_yaxls
, angle = 0
, hjust = 1
, vjust = 0)
, axis.title.x = element_text(size = my_xaxts)
, axis.title.y = element_text(size = my_yaxts ) ) +
labs(title = my_title
, x = "Position"
, y = "Frequency")
# for sanity and good practice
rm(df)
#======================= end of plot
# axis colours labels
# https://stackoverflow.com/questions/38862303/customize-ggplot2-axis-labels-with-different-colors
# https://stackoverflow.com/questions/56543485/plot-coloured-boxes-around-axis-label


@ -1,215 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#require(data.table)
#require(dplyr)
########################################################################
# Read file: call script for combining df for lig #
########################################################################
source("../combining_two_df_lig.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for Lig plots
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
# sanity check
is.factor(my_df$Lig_outcome)
my_df$Lig_outcome = as.factor(my_df$Lig_outcome)
is.factor(my_df$Lig_outcome)
#[1] TRUE
#############################
# Extra sanity check:
# for mcsm_lig ONLY
# Dis_lig_Ang should be <10
#############################
if (max(my_df$Dis_lig_Ang) < 10){
print ("Sanity check passed: lig data is <10Ang")
}else{
print ("Error: data should be filtered to be within 10Ang")
}
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#===========================
# Plot: Basic barplots
#===========================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df = my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# sanity checks
str(df)
if (identical(df$Position, df$position)){
print("Sanity check passed: Columns 'Position' and 'position' are identical")
} else{
print("Error!: Check column names and info contained")
}
#****************
# generate plot: No of stabilising and destabilising muts
#****************
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
svg('basic_barplots_LIG.svg')
my_ats = 25 # axis text size
my_als = 22 # axis label size
# uncomment as necessary for either directly outputting results or
# printing on the screen
g = ggplot(df, aes(x = Lig_outcome))
prinfFile = g + geom_bar(
#g + geom_bar(
aes(fill = Lig_outcome)
, show.legend = TRUE
) + geom_label(
stat = "count"
, aes(label = ..count..)
, color = "black"
, show.legend = FALSE
, size = 10) + theme(
axis.text.x = element_blank()
, axis.title.x = element_blank()
, axis.title.y = element_text(size=my_als)
, axis.text.y = element_text(size = my_ats)
, legend.position = c(0.73,0.8)
, legend.text = element_text(size=my_als-2)
, legend.title = element_text(size=my_als)
, plot.title = element_blank()
) + labs(
title = ""
, y = "Number of SNPs"
#, fill='Ligand Outcome'
) + scale_fill_discrete(name = "Ligand Outcome"
, labels = c("Destabilising", "Stabilising"))
print(prinfFile)
dev.off()
#****************
# generate plot: No of positions
#****************
#get freq count of positions so you can subset freq<1
#require(data.table)
setDT(df)[, pos_count := .N, by = .(Position)] #169, 36
head(df$pos_count)
table(df$pos_count)
# this is cumulative
#1 2 3 4 5 6
#5 24 36 56 30 18
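# same per-position count without data.table (assumed dplyr equivalent, unused):
# df = dplyr::add_count(df, Position, name = "pos_count")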
# use group by on this
snpsBYpos_df <- df %>%
group_by(Position) %>%
summarize(snpsBYpos = mean(pos_count))
table(snpsBYpos_df$snpsBYpos)
#1 2 3 4 5 6
#5 12 12 14 6 3
# this is what will get plotted
svg('position_count_LIG.svg')
my_ats = 25 # axis text size
my_als = 22 # axis label size
g = ggplot(snpsBYpos_df, aes(x = snpsBYpos))
prinfFile = g + geom_bar(
#g + geom_bar(
alpha = 0.5 # set directly: a constant inside aes() would create a spurious alpha scale
, show.legend = FALSE
) +
geom_label(
stat = "count", aes(label = ..count..)
, color = "black"
, size = 10
) +
theme(
axis.text.x = element_text(
size = my_ats
, angle = 0
)
, axis.text.y = element_text(
size = my_ats
, angle = 0
, hjust = 1
)
, axis.title.x = element_text(size = my_als)
, axis.title.y = element_text(size = my_als)
, plot.title = element_blank()
) +
labs(
x = "Number of SNPs"
, y = "Number of Sites"
)
print(prinfFile)
dev.off()
########################################################################
# end of Lig barplots #
########################################################################


@ -1,211 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages and functions #
########################################################################
source("../Header_TT.R")
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#==========================
###########################
# Data for DUET plots
# you need merged_df3
# or
# merged_df3_comp
# since these have unique SNPs
# I prefer to use the merged_df3
# because using the _comp dataset means
# we lose some muts and at this level, we should use
# as much info as available
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3
#my_df = merged_df3_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
# sanity check
is.factor(my_df$DUET_outcome)
my_df$DUET_outcome = as.factor(my_df$DUET_outcome)
is.factor(my_df$DUET_outcome)
#[1] TRUE
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#===========================
# Plot: Basic barplots
#===========================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df = my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# sanity checks
str(df)
if (identical(df$Position, df$position)){
print("Sanity check passed: Columns 'Position' and 'position' are identical")
} else{
print("Error!: Check column names and info contained")
}
#****************
# generate plot: No of stabilising and destabilising muts
#****************
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
svg('basic_barplots_DUET.svg')
my_ats = 25 # axis text size
my_als = 22 # axis label size
theme_set(theme_grey())
# uncomment as necessary for either directly outputting results or
# printing on the screen
g = ggplot(df, aes(x = DUET_outcome))
prinfFile = g + geom_bar(
#g + geom_bar(
aes(fill = DUET_outcome)
, show.legend = TRUE
) + geom_label(
stat = "count"
, aes(label = ..count..)
, color = "black"
, show.legend = FALSE
, size = 10) + theme(
axis.text.x = element_blank()
, axis.title.x = element_blank()
, axis.title.y = element_text(size=my_als)
, axis.text.y = element_text(size = my_ats)
, legend.position = c(0.73,0.8)
, legend.text = element_text(size=my_als-2)
, legend.title = element_text(size=my_als)
, plot.title = element_blank()
) + labs(
title = ""
, y = "Number of SNPs"
#, fill='DUET Outcome'
) + scale_fill_discrete(name = "DUET Outcome"
, labels = c("Destabilising", "Stabilising"))
print(prinfFile)
dev.off()
#****************
# generate plot: No of positions
#****************
#get freq count of positions so you can subset freq<1
#setDT(df)[, .(Freq := .N), by = .(Position)] #189, 36
setDT(df)[, pos_count := .N, by = .(Position)] #335, 36
table(df$pos_count)
# this is cumulative
#1 2 3 4 5 6
#34 76 63 104 40 18
# use group by on this
snpsBYpos_df <- df %>%
group_by(Position) %>%
summarize(snpsBYpos = mean(pos_count))
table(snpsBYpos_df$snpsBYpos)
#1 2 3 4 5 6
#34 38 21 26 8 3
foo = select(df, Mutationinformation
, WildPos
, wild_type
, mutant_type
, mutation_info
, position
, pos_count) #335, 7
getwd()
write.csv(foo, "../Data/pos_count_freq.csv")
svg('position_count_DUET.svg')
my_ats = 25 # axis text size
my_als = 22 # axis label size
g = ggplot(snpsBYpos_df, aes(x = snpsBYpos))
prinfFile = g + geom_bar(
#g + geom_bar(
alpha = 0.5 # set directly: a constant inside aes() would create a spurious alpha scale
, show.legend = FALSE
) +
geom_label(
stat = "count", aes(label = ..count..)
, color = "black"
, size = 10
) +
theme(
axis.text.x = element_text(
size = my_ats
, angle = 0
)
, axis.text.y = element_text(
size = my_ats
, angle = 0
, hjust = 1
)
, axis.title.x = element_text(size = my_als)
, axis.title.y = element_text(size = my_als)
, plot.title = element_blank()
) +
labs(
x = "Number of SNPs"
, y = "Number of Sites"
)
print(prinfFile)
dev.off()
########################################################################
# end of DUET barplots #
########################################################################


@ -1,175 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages and functions #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#==========================
###########################
# Data for PS Corr plots
# you need merged_df3_comp
# since these are matched
# to allow pairwise corr
###########################
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#===========================
# Plot: Correlation plots
#===========================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df = my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# sanity checks
str(df)
table(df$DUET_outcome)
# unique positions
length(unique(df$Position)) #{RESULT: unique positions for comp data}
# subset data to generate pairwise correlations
corr_data = df[, c("ratioDUET"
# , "ratioPredAff"
# , "DUETStability_Kcalpermol"
# , "PredAffLog"
# , "OR"
, "logor"
# , "pvalue"
, "neglog10pvalue"
, "AF"
, "DUET_outcome"
# , "Lig_outcome"
, "pyrazinamide"
)]
dim(corr_data)
rm(df)
# assign nice colnames (for display)
my_corr_colnames = c("DUET"
# , "Ligand Affinity"
# , "DUET_raw"
# , "Lig_raw"
# , "OR"
, "Log(Odds Ratio)"
# , "P-value"
, "-LogP"
, "Allele Frequency"
, "DUET_outcome"
# , "Lig_outcome"
, "pyrazinamide")
# sanity check
if (length(my_corr_colnames) == length(corr_data)){
print("Sanity check passed: corr_data and corr_names match in length")
}else{
print("Error: length mismatch!")
}
colnames(corr_data)
colnames(corr_data) <- my_corr_colnames
colnames(corr_data)
###############
# PLOTS: corr
# http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
###############
#default pairs plot
start = 1
end = which(colnames(corr_data) == "pyrazinamide"); end # should be the last column
offset = 1
my_corr = corr_data[start:(end-offset)]
head(my_corr)
#my_cols = c("#f8766d", "#00bfc4")
# deep blue :#007d85
# deep red: #ae301e
#==========
# psych: informative since it draws the ellipsoid
# https://jamesmarquezportfolio.com/correlation_matrices_in_r.html
# http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
#==========
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots"
getwd()
svg('DUET_corr.svg', width = 15, height = 15)
printFile = pairs.panels(my_corr[1:4]
, method = "spearman" # correlation method
, hist.col = "grey" ##00AFBB
, density = TRUE # show density plots
, ellipses = F # show correlation ellipses
, stars = T
, rug = F
, breaks = "Sturges"
, show.points = T
, bg = c("#f8766d", "#00bfc4")[unclass(factor(my_corr$DUET_outcome))]
, pch = 21
, jitter = T
#, alpha = .05
#, points(pch = 19, col = c("#f8766d", "#00bfc4"))
, cex = 3
, cex.axis = 2.5
, cex.labels = 3
, cex.cor = 1
, smooth = F
)
print(printFile)
dev.off()


@ -1,187 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
########################################################################
# Read file: call script for combining df for lig #
########################################################################
source("../combining_two_df_lig.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for Lig Corr plots
# you need merged_df3_comp
# since these are matched
# to allow pairwise corr
###########################
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df3_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
#############################
# Extra sanity check:
# for mcsm_lig ONLY
# Dis_lig_Ang should be <10
#############################
if (max(my_df$Dis_lig_Ang) < 10){
print ("Sanity check passed: lig data is <10Ang")
}else{
print ("Error: data should be filtered to be within 10Ang")
}
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#===========================
# Plot: Correlation plots
#===========================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df = my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# sanity checks
str(df)
table(df$Lig_outcome)
# unique positions
length(unique(df$Position)) #{RESULT: unique positions for comp data}
# subset data to generate pairwise correlations
corr_data = df[, c(#"ratioDUET",
"ratioPredAff"
# , "DUETStability_Kcalpermol"
# , "PredAffLog"
# , "OR"
, "logor"
# , "pvalue"
, "neglog10pvalue"
, "AF"
# , "DUET_outcome"
, "Lig_outcome"
, "pyrazinamide"
)]
dim(corr_data)
rm(df)
# assign nice colnames (for display)
my_corr_colnames = c(#"DUET",
"Ligand Affinity"
# ,"DUET_raw"
# , "Lig_raw"
# , "OR"
, "Log(Odds Ratio)"
# , "P-value"
, "-LogP"
, "Allele Frequency"
# , "DUET_outcome"
, "Lig_outcome"
, "pyrazinamide")
# sanity check
if (length(my_corr_colnames) == length(corr_data)){
print("Sanity check passed: corr_data and corr_names match in length")
}else{
print("Error: length mismatch!")
}
colnames(corr_data)
colnames(corr_data) <- my_corr_colnames
colnames(corr_data)
###############
# PLOTS: corr
# http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
###############
# default pairs plot
start = 1
end = which(colnames(corr_data) == "pyrazinamide"); end # should be the last column
offset = 1
my_corr = corr_data[start:(end-offset)]
head(my_corr)
#my_cols = c("#f8766d", "#00bfc4")
# deep blue :#007d85
# deep red: #ae301e
#==========
# psych: informative since it draws the ellipsoid
# https://jamesmarquezportfolio.com/correlation_matrices_in_r.html
# http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
#==========
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots"
getwd()
svg('Lig_corr.svg', width = 15, height = 15)
printFile = pairs.panels(my_corr[1:4]
, method = "spearman" # correlation method
, hist.col = "grey" ##00AFBB
, density = TRUE # show density plots
, ellipses = F # show correlation ellipses
, stars = T
, rug = F
, breaks = "Sturges"
, show.points = T
, bg = c("#f8766d", "#00bfc4")[unclass(factor(my_corr$Lig_outcome))]
, pch = 21
, jitter = T
# , alpha = .05
# , points(pch = 19, col = c("#f8766d", "#00bfc4"))
, cex = 3
, cex.axis = 2.5
, cex.labels = 3
, cex.cor = 1
, smooth = F
)
print(printFile)
dev.off()


@ -1,227 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
require(data.table)
########################################################################
# Read file: call script for combining df #
########################################################################
source("../combining_two_df.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#==========================
###########################
# Data for plots
# you need merged_df2, comprehensive one
# since this has one-many relationship
# i.e the same SNP can belong to multiple lineages
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df2
#my_df = merged_df2_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
is.factor(my_df$lineage)
my_df$lineage = as.factor(my_df$lineage)
is.factor(my_df$lineage)
#==========================
# Plot: Lineage barplot
# x = lineage y = No. of samples
# col = Lineage
# fill = lineage
#============================
table(my_df$lineage)
# lineage1 lineage2 lineage3 lineage4 lineage5 lineage6 lineageBOV
#3 104 1293 264 1311 6 6 105
#===========================
# Plot: Lineage Barplots
#===========================
#===================
# Data for plots
#===================
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df <- my_df
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(my_df)
# get freq count of positions so you can subset freq<1
#setDT(df)[, lineage_count := .N, by = .(lineage)]
#******************
# generate plot: barplot of mutation by lineage
#******************
sel_lineages = c("lineage1"
, "lineage2"
, "lineage3"
, "lineage4")
df_lin = subset(df, subset = lineage %in% sel_lineages )
#FIXME: add sanity check for numbers.
# Done this manually
############################################################
#########
# Data for barplot: Lineage barplot
# to show total samples and number of unique mutations
# within each lineage
##########
# Create df with lineage information & no. of unique mutations
# per lineage and total samples within lineage
# this is essentially a barplot with two y-axes
bar = as.data.frame(sel_lineages) #4, 1
total_snps_u = NULL
total_samples = NULL
for (i in sel_lineages){
#print(i)
curr_total = length(unique(df$id)[df$lineage==i])
total_samples = c(total_samples, curr_total)
print(total_samples)
foo = df[df$lineage==i,]
print(paste0(i, "======="))
print(length(unique(foo$Mutationinformation)))
curr_count = length(unique(foo$Mutationinformation))
total_snps_u = c(total_snps_u, curr_count)
}
print(total_snps_u)
bar$num_snps_u = total_snps_u
bar$total_samples = total_samples
bar
#*****************
# generate plot: lineage barplot with two y-axis
#https://stackoverflow.com/questions/13035295/overlay-bar-graphs-in-ggplot2
#*****************
# map onto generic names for the melt/plot steps below
y1 = bar$num_snps_u
y2 = bar$total_samples
x = sel_lineages
to_plot = data.frame(x = x
, y1 = y1
, y2 = y2)
to_plot
melted = melt(to_plot, id = "x")
melted
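# added shape check (assumes all four selected lineages are present):
# 4 lineages x 2 variables should melt to 8 rows
stopifnot(nrow(melted) == 2 * length(x))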
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
svg('lineage_basic_barplot.svg')
my_ats = 20 # axis text size
my_als = 22 # axis label size
g = ggplot(melted
, aes(x = x
, y = value
, fill = variable)
)
printFile = g + geom_bar(
#g + geom_bar(
stat = "identity"
, position = position_stack(reverse = TRUE)
, alpha=.75
, colour='grey75'
) + theme(
axis.text.x = element_text(
size = my_ats
# , angle= 30
)
, axis.text.y = element_text(size = my_ats
#, angle = 30
, hjust = 1
, vjust = 0)
, axis.title.x = element_text(
size = my_als
, colour = 'black'
)
, axis.title.y = element_text(
size = my_als
, colour = 'black'
)
, legend.position = "top"
, legend.text = element_text(size = my_als)
#) + geom_text(
) + geom_label(
aes(label = value)
, size = 5
, hjust = 0.5
, vjust = 0.5
, colour = 'black'
, show.legend = FALSE
#, check_overlap = TRUE
, position = position_stack(reverse = T)
) + labs(
title = ''
, x = ''
, y = "Number"
, fill = 'Variable'
, colour = 'black'
) + scale_fill_manual(
values = c('grey50', 'gray75')
, name=''
, labels=c('Mutations', 'Total Samples')
) + scale_x_discrete(
breaks = c('lineage1', 'lineage2', 'lineage3', 'lineage4')
, labels = c('Lineage 1', 'Lineage 2', 'Lineage 3', 'Lineage 4')
)
print(printFile)
dev.off()


@ -1,233 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
#require(data.table)
########################################################################
# Read file: call script for combining df for Lig #
########################################################################
source("../combining_two_df_lig.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for plots
# you need merged_df2 or merged_df2_comp
# since this is one-many relationship
# i.e the same SNP can belong to multiple lineages
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df2
#my_df = merged_df2_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
is.factor(my_df$lineage)
my_df$lineage = as.factor(my_df$lineage)
is.factor(my_df$lineage)
table(my_df$mutation_info)
#############################
# Extra sanity check:
# for mcsm_lig ONLY
# Dis_lig_Ang should be <10
#############################
if (max(my_df$Dis_lig_Ang) < 10){
print ("Sanity check passed: lig data is <10Ang")
}else{
print ("Error: data should be filtered to be within 10Ang")
}
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#==========================
# Plot: Lineage Distribution
# x = mcsm_values, y = dist
# fill = stability
#============================
#===================
# Data for plots
#===================
# subset only lineages1-4
sel_lineages = c("lineage1"
, "lineage2"
, "lineage3"
, "lineage4")
# uncomment as necessary
df_lin = subset(my_df, subset = lineage %in% sel_lineages ) #2037 35
# refactor
df_lin$lineage = factor(df_lin$lineage)
table(df_lin$lineage) #{RESULT: No of samples within lineage}
#lineage1 lineage2 lineage3 lineage4
#78 961 195 803
# when merged_df2_comp is used
#lineage1 lineage2 lineage3 lineage4
#77 955 194 770
length(unique(df_lin$Mutationinformation))
#{Result: No. of unique mutations the 4 lineages contribute to}
# sanity checks
r1 = 2:5 # when merged_df2 used: because there is missing lineages
if(sum(table(my_df$lineage)[r1]) == nrow(df_lin)) {
print ("sanity check passed: numbers match")
} else{
print("Error!: check your numbers")
}
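# position-free variant (added sketch): index the lineage table by name
# instead of r1, so missing lineages cannot shift the slice:
# sum(table(my_df$lineage)[sel_lineages]) == nrow(df_lin)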
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df <- df_lin
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(df_lin)
#******************
# generate distribution plot of lineages
#******************
# basic: could improve this!
library(plotly)
library(ggridges)
fooNames = c('Lineage 1', 'Lineage 2', 'Lineage 3', 'Lineage 4')
names(fooNames) = c('lineage1', 'lineage2', 'lineage3', 'lineage4')
g <- ggplot(df, aes(x = ratioPredAff)) +
geom_density(aes(fill = Lig_outcome)
, alpha = 0.5) +
facet_wrap( ~ lineage
, scales = "free"
, labeller = labeller(lineage = fooNames) ) +
coord_cartesian(xlim = c(-1, 1)
# , ylim = c(0, 6)
# , clip = "off"
) +
ggtitle("Kernel Density estimates of Ligand affinity by lineage")
ggplotly(g)
# 2 : ggridges (good!)
my_ats = 15 # axis text size
my_als = 20 # axis label size
fooNames = c('Lineage 1', 'Lineage 2', 'Lineage 3', 'Lineage 4')
names(fooNames) = c('lineage1', 'lineage2', 'lineage3', 'lineage4')
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
svg('lineage_dist_LIG.svg')
printFile = ggplot( df, aes(x = ratioPredAff
, y = Lig_outcome) ) +
geom_density_ridges_gradient( aes(fill = ..x..)
, scale = 3
, size = 0.3 ) +
facet_wrap( ~lineage
, scales = "free"
# , switch = 'x'
, labeller = labeller(lineage = fooNames) ) +
coord_cartesian( xlim = c(-1, 1)
# , ylim = c(0, 6)
# , clip = "off"
) +
scale_fill_gradientn( colours = c("#f8766d", "white", "#00bfc4")
, name = "Ligand Affinity" ) +
theme( axis.text.x = element_text( size = my_ats
, angle = 90
, hjust = 1
, vjust = 0.4)
# , axis.text.y = element_text( size = my_ats
# , angle = 0
# , hjust = 1
# , vjust = 0)
, axis.text.y = element_blank()
, axis.title.x = element_blank()
, axis.title.y = element_blank()
, axis.ticks.y = element_blank()
, plot.title = element_blank()
, strip.text = element_text(size = my_als)
, legend.text = element_text(size = 10)
, legend.title = element_text(size = my_als)
# , legend.position = c(0.3, 0.8)
# , legend.key.height = unit(1, 'mm')
)
print(printFile)
dev.off()
#=!=!=!=!=!=!
# COMMENT: When you look at all mutations, the lineage differences disappear...
# The pattern we are interested in is possibly only for dr_mutations
#=!=!=!=!=!=!
#===================================================
# COMPARING DISTRIBUTIONS
head(df$lineage)
df$lineage = as.character(df$lineage)
lin1 = df[df$lineage == "lineage1",]$ratioPredAff
lin2 = df[df$lineage == "lineage2",]$ratioPredAff
lin3 = df[df$lineage == "lineage3",]$ratioPredAff
lin4 = df[df$lineage == "lineage4",]$ratioPredAff
# ks test
ks.test(lin1,lin2)
ks.test(lin1,lin3)
ks.test(lin1,lin4)
ks.test(lin2,lin3)
ks.test(lin2,lin4)
ks.test(lin3,lin4)
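# added sketch: collect the six pairwise KS p-values and adjust them for
# multiple testing (not part of the original analysis)
ks_pvals = c(ks.test(lin1, lin2)$p.value, ks.test(lin1, lin3)$p.value
             , ks.test(lin1, lin4)$p.value, ks.test(lin2, lin3)$p.value
             , ks.test(lin2, lin4)$p.value, ks.test(lin3, lin4)$p.value)
p.adjust(ks_pvals, method = "BH")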


@ -1,212 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts/plotting") # thinkpad
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("../Header_TT.R")
#source("barplot_colour_function.R")
#require(data.table)
########################################################################
# Read file: call script for combining df for PS #
########################################################################
source("../combining_two_df.R")
#---------------------- PAY ATTENTION
# the above changes the working dir
#[1] "git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts"
#---------------------- PAY ATTENTION
#==========================
# This will return:
# df with NA:
# merged_df2
# merged_df3
# df without NA:
# merged_df2_comp
# merged_df3_comp
#===========================
###########################
# Data for plots
# you need merged_df2 or merged_df2_comp
# since this is one-many relationship
# i.e the same SNP can belong to multiple lineages
###########################
# uncomment as necessary
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
my_df = merged_df2
#my_df = merged_df2_comp
#<<<<<<<<<<<<<<<<<<<<<<<<<
# delete variables not required
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# quick checks
colnames(my_df)
str(my_df)
# Ensure correct data type in columns to plot: need to be factor
is.factor(my_df$lineage)
my_df$lineage = as.factor(my_df$lineage)
is.factor(my_df$lineage)
table(my_df$mutation_info)
########################################################################
# end of data extraction and cleaning for plots #
########################################################################
#==========================
# Plot: Lineage Distribution
# x = mcsm_values, y = dist
# fill = stability
#============================
#===================
# Data for plots
#===================
# subset only lineages1-4
sel_lineages = c("lineage1"
, "lineage2"
, "lineage3"
, "lineage4")
# uncomment as necessary
df_lin = subset(my_df, subset = lineage %in% sel_lineages )
# refactor
df_lin$lineage = factor(df_lin$lineage)
table(df_lin$lineage) #{RESULT: No of samples within lineage}
#lineage1 lineage2 lineage3 lineage4
#104 1293 264 1311
# when merged_df2_comp is used
#lineage1 lineage2 lineage3 lineage4
#99 1275 263 1255
length(unique(df_lin$Mutationinformation))
#{Result: No. of unique mutations the 4 lineages contribute to}
# sanity checks
r1 = 2:5 # when merged_df2 used: because there is missing lineages
if(sum(table(my_df$lineage)[r1]) == nrow(df_lin)) {
print ("sanity check passed: numbers match")
} else{
print("Error!: check your numbers")
}
#<<<<<<<<<<<<<<<<<<<<<<<<<
# REASSIGNMENT
df <- df_lin
#<<<<<<<<<<<<<<<<<<<<<<<<<
rm(df_lin)
#******************
# generate distribution plot of lineages
#******************
# basic: could improve this!
library(plotly)
library(ggridges)
g <- ggplot(df, aes(x = ratioDUET)) +
geom_density(aes(fill = DUET_outcome)
, alpha = 0.5) + facet_wrap(~ lineage,
scales = "free") +
ggtitle("Kernel Density estimates of Protein stability by lineage")
ggplotly(g)
# 2 : ggridges (good!)
my_ats = 15 # axis text size
my_als = 20 # axis label size
fooNames=c('Lineage 1', 'Lineage 2', 'Lineage 3', 'Lineage 4')
names(fooNames)=c('lineage1', 'lineage2', 'lineage3', 'lineage4')
# set output dir for plots
getwd()
setwd("~/git/Data/pyrazinamide/output/plots")
getwd()
svg('lineage_dist_PS.svg')
printFile = ggplot( df, aes(x = ratioDUET
, y = DUET_outcome) )+
#printFile=geom_density_ridges_gradient(
geom_density_ridges_gradient( aes(fill = ..x..)
, scale = 3
, size = 0.3 ) +
facet_wrap( ~lineage
, scales = "free"
# , switch = 'x'
, labeller = labeller(lineage = fooNames) ) +
coord_cartesian( xlim = c(-1, 1)
# , ylim = c(0, 6)
# , clip = "off"
) +
scale_fill_gradientn( colours = c("#f8766d", "white", "#00bfc4")
, name = "DUET" ) +
theme( axis.text.x = element_text( size = my_ats
, angle = 90
, hjust = 1
, vjust = 0.4)
# , axis.text.y = element_text( size = my_ats
# , angle = 0
# , hjust = 1
# , vjust = 0)
, axis.text.y = element_blank()
, axis.title.x = element_blank()
, axis.title.y = element_blank()
, axis.ticks.y = element_blank()
, plot.title = element_blank()
, strip.text = element_text(size=my_als)
, legend.text = element_text(size=10)
, legend.title = element_text(size=my_als)
# , legend.position = c(0.3, 0.8)
# , legend.key.height = unit(1, 'mm')
)
print(printFile)
dev.off()
#=!=!=!=!=!=!
# COMMENT: When you look at all mutations, the lineage differences disappear...
# The pattern we are interested in is possibly only for dr_mutations
#=!=!=!=!=!=!
#===================================================
# COMPARING DISTRIBUTIONS
head(df$lineage)
df$lineage = as.character(df$lineage)
lin1 = df[df$lineage == "lineage1",]$ratioDUET
lin2 = df[df$lineage == "lineage2",]$ratioDUET
lin3 = df[df$lineage == "lineage3",]$ratioDUET
lin4 = df[df$lineage == "lineage4",]$ratioDUET
# ks test
ks.test(lin1,lin2)
ks.test(lin1,lin3)
ks.test(lin1,lin4)
ks.test(lin2,lin3)
ks.test(lin2,lin4)
ks.test(lin3,lin4)
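# added sketch: the same six pairwise tests in one pass via combn
# (equivalent output, just tidier)
lin_list = split(df$ratioDUET, df$lineage)
combn(names(lin_list), 2, function(p){
  ks.test(lin_list[[p[1]]], lin_list[[p[2]]])$p.value
})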


@ -1,27 +0,0 @@
#########################
#3: Read complex pdb file
##########################
source("Header_TT.R")
# This script only reads the pdb file of your complex
# read in pdb file complex1
inDir = "~/git/Data/pyrazinamide/input/structure/"
inFile = paste0(inDir, "complex1_no_water.pdb")
complex1 = inFile
#inFile2 = paste0(inDir, "complex2_no_water.pdb")
#complex2 = inFile2
# list of 8
my_pdb = read.pdb(complex1
, maxlines = -1
, multi = FALSE
, rm.insert = FALSE
, rm.alt = TRUE
, ATOM.only = FALSE
, hex = FALSE
, verbose = TRUE)
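# added: quick top-level peek; read.pdb (bio3d-style) returns a "pdb" object
# whose $atom data.frame holds the coordinates edited downstream
str(my_pdb, max.level = 1)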
rm(inDir, inFile, complex1)
#====== end of script


@ -1,386 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts")
getwd()
########################################################################
# Installing and loading required packages #
########################################################################
source("Header_TT.R")
#########################################################
# TASK: replace B-factors in the pdb file with normalised values
# use the complex file with no water as mCSM lig was
# performed on this file. You can check it in the script: read_pdb file.
#########################################################
###########################
# 2: Read file: average stability values
# or mcsm_normalised file, output of step 4 mcsm pipeline
###########################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mean_PS_Lig_Bfactor.csv"); inFile
my_df <- read.csv(inFile
# , row.names = 1
# , stringsAsFactors = F
, header = T)
str(my_df)
#=========================================================
# Processing P1: Replacing B factor with mean ratioDUET scores
#=========================================================
#########################
# Read complex pdb file
# from the R script
##########################
source("read_pdb.R") # list of 8
# extract atom list into a variable
# since in the list this corresponds to data frame, variable will be a df
d = my_pdb[[1]]
# make a copy: required for downstream sanity checks
d2 = d
# sanity checks: B factor
max(d$b); min(d$b)
#*******************************************
# plot histograms for inspection
# 1: original B-factors
# 2: original DUET Scores
# 3: replaced B-factors with DUET Scores
#*********************************************
# Set the margin on all sides
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(3,2))
#par(mfrow = c(3,2))
#1: Original B-factor
hist(d$b
, xlab = ""
, main = "B-factor")
plot(density(d$b)
, xlab = ""
, main = "B-factor")
# 2: DUET scores
hist(my_df$average_DUETR
, xlab = ""
, main = "Norm_DUET")
plot(density(my_df$average_DUETR)
, xlab = ""
, main = "Norm_DUET")
# 3: After the following replacement
#********************************
#=========
# step 0_P1: DONT RUN once you have double checked the matched output
#=========
# sanity check: match and assign to a separate column to double check
# colnames(my_df)
# d$ratioDUET = my_df$average_DUETR[match(d$resno, my_df$Position)]
#=========
# step 1_P1
#=========
# Be brave and replace in place now (don't run sanity check)
# this makes all the B-factor values in the non-matched positions as NA
d$b = my_df$average_DUETR[match(d$resno, my_df$Position)]
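# added toy illustration (made-up values, not project data) of the match()
# pattern above: residues without a score come back as NA
toy_resno = c(10, 11, 12)  # residue numbers in the pdb
toy_pos   = c(11, 12)      # positions that actually have scores
toy_score = c(0.5, -0.2)
toy_score[match(toy_resno, toy_pos)] # NA 0.5 -0.2
rm(toy_resno, toy_pos, toy_score)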
#=========
# step 2_P1
#=========
# count NA in B-factor
b_na = sum(is.na(d$b)) ; b_na
# count number of 0's in B-factor
sum(d$b == 0)
#table(d$b)
# replace all NA in b factor with 0
d$b[is.na(d$b)] = 0
# sanity check: should be 0
sum(is.na(d$b))
# sanity check: should be True
if (sum(d$b == 0) == b_na){
print ("Sanity check passed: NA's replaced with 0's successfully")
} else {
print("Error: NA replacement NOT successful, Debug code!")
}
max(d$b); min(d$b)
# sanity checks: should be True
if(max(d$b) == max(my_df$average_DUETR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
if (min(d$b) == min(my_df$average_DUETR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
#=========
# step 3_P1
#=========
# sanity check: dim should be same before reassignment
# should be TRUE
dim(d) == dim(d2)
#=========
# step 4_P1
#=========
# assign it back to the pdb file
my_pdb[[1]] = d
max(d$b); min(d$b)
#=========
# step 5_P1
#=========
# output dir
getwd()
outDir = "~/git/Data/pyrazinamide/input/structure/"
outFile = paste0(outDir, "complex1_BwithNormDUET.pdb"); outFile
write.pdb(my_pdb, outFile)
#********************************
# Add the 3rd histogram and density plots for comparisons
#********************************
# Plots continued...
# 3: hist and density of replaced B-factors with DUET Scores
hist(d$b
, xlab = ""
, main = "repalced-B")
plot(density(d$b)
, xlab = ""
, main = "replaced-B")
# graph titles
mtext(text = "Frequency"
, side = 2
, line = 0
, outer = TRUE)
mtext(text = "DUET_stability"
, side = 3
, line = 0
, outer = TRUE)
#********************************
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
# NOTE: This replaced B-factor distribution has the same
# x-axis as the PredAff normalised values, but the distribution
# is affected since 0 is overinflated. This is because all the positions
# where there are no SNPs have been assigned 0.
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#######################################################################
#====================== end of section 1 ==============================
#######################################################################
#=========================================================
# Processing P2: Replacing B values with PredAff Scores
#=========================================================
# clear workspace
rm(list = ls())
###########################
# 2: Read file: average stability values
# or mcsm_normalised file, output of step 4 mcsm pipeline
###########################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mean_PS_Lig_Bfactor.csv"); inFile
my_df <- read.csv(inFile
# , row.names = 1
# , stringsAsFactors = F
, header = T)
str(my_df)
#rm(inDir, inFile)
#########################
# 3: Read complex pdb file
# from the R script
##########################
source("read_pdb.R") # list of 8
# extract atom list into a variable
# since in the list this corresponds to data frame, variable will be a df
d = my_pdb[[1]]
# make a copy: required for downstream sanity checks
d2 = d
# sanity checks: B factor
max(d$b); min(d$b)
#*******************************************
# plot histograms for inspection
# 1: original B-factors
# 2: original Pred Aff Scores
# 3: replaced B-factors with PredAff Scores
#********************************************
# Set the margin on all sides
par(oma = c(3,2,3,0)
, mar = c(1,3,5,2)
, mfrow = c(3,2))
#par(mfrow = c(3,2))
# 1: Original B-factor
hist(d$b
, xlab = ""
, main = "B-factor")
plot(density(d$b)
, xlab = ""
, main = "B-factor")
# 2: Pred Aff scores
hist(my_df$average_PredAffR
, xlab = ""
, main = "Norm_lig_average")
plot(density(my_df$average_PredAffR)
, xlab = ""
, main = "Norm_lig_average")
# 3: After the following replacement
#********************************
#=================================================
# Processing P2: Replacing B values with ratioPredAff scores
#=================================================
# use match to perform this replacement linking with "position no"
# in the pdb file, this corresponds to column "resno"
# in my_df, this corresponds to column "Position"
#=========
# step 0_P2: DONT RUN once you have double checked the matched output
#=========
# sanity check: match and assign to a separate column to double check
# colnames(my_df)
# d$ratioPredAff = my_df$average_PredAffR[match(d$resno, my_df$Position)] #1384, 17
#=========
# step 1_P2: BE BRAVE and replace in place now (don't run step 0)
#=========
# this sets the B-factor values at all non-matched positions to NA
d$b = my_df$average_PredAffR[match(d$resno, my_df$Position)]
#=========
# step 2_P2
#=========
# count NA in Bfactor
b_na = sum(is.na(d$b)) ; b_na
# count number of 0's in B-factor
sum(d$b == 0)
#table(d$b)
# replace all NA in b factor with 0
d$b[is.na(d$b)] = 0
# sanity check: should be 0
sum(is.na(d$b))
if (sum(d$b == 0) == b_na){
print ("Sanity check passed: NA's replaced with 0's successfully")
} else {
print("Error: NA replacement NOT successful, Debug code!")
}
max(d$b); min(d$b)
# sanity checks: should be True
if (max(d$b) == max(my_df$average_PredAffR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
if (min(d$b) == min(my_df$average_PredAffR)){
print("Sanity check passed: B-factors replaced correctly")
} else {
print ("Error: Debug code please")
}
#=========
# step 3_P2
#=========
# sanity check: dim should be same before reassignment
# should be TRUE
dim(d) == dim(d2)
#=========
# step 4_P2
#=========
# assign it back to the pdb file
my_pdb[[1]] = d
max(d$b); min(d$b)
#=========
# step 5_P2
#=========
# output dir
outDir = "~/git/Data/pyrazinamide/input/structure/"
outFile = paste0(outDir, "complex1_BwithNormLIG.pdb"); outFile
write.pdb(my_pdb, outFile)
#********************************
# Add the 3rd histogram and density plots for comparisons
#********************************
# Plots continued...
# 3: hist and density of replaced B-factors with PredAff Scores
hist(d$b
, xlab = ""
, main = "repalced-B")
plot(density(d$b)
, xlab = ""
, main = "replaced-B")
# graph titles
mtext(text = "Frequency"
, side = 2
, line = 0
, outer = TRUE)
mtext(text = "Lig_stability"
, side = 3
, line = 0
, outer = TRUE)
#********************************
###########
# end of output files with Bfactors
##########


@@ -1,257 +0,0 @@
getwd()
setwd("~/git/LSHTM_analysis/mcsm_analysis/pyrazinamide/scripts")
getwd()
#########################################################
# 1: Installing and loading required packages #
#########################################################
source("Header_TT.R")
#source("barplot_colour_function.R")
##########################################################
# Checking: Entire data frame and for PS #
##########################################################
###########################
#2) Read file: combined one from the script
###########################
source("combining_two_df.R")
# df with NA:
# merged_df2
# merged_df3:
# df without NA:
# merged_df2_comp:
# merged_df3_comp:
######################
# You need to check it
# with the merged_df3
########################
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# REASSIGNMENT
my_df = merged_df3
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#clear variables
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# should be true
identical(my_df$Position, my_df$position)
#################################
# Read file: normalised file
# output of step 4 mcsm_pipeline
#################################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mcsm_complex1_normalised.csv"); inFile
mcsm_data <- read.csv(inFile
, row.names = 1
, stringsAsFactors = F
, header = T)
str(mcsm_data)
my_colnames = colnames(mcsm_data)
#====================================
# subset my_df to include only the columns in mcsm data
my_df2 = my_df[my_colnames]
#====================================
# compare the two
head(mcsm_data$Mutationinformation)
head(mcsm_data$Position)
head(my_df2$Mutationinformation)
head(my_df2$Position)
# sort mcsm data by Mutationinformation
mcsm_data_s = mcsm_data[order(mcsm_data$Mutationinformation),]
head(mcsm_data_s$Mutationinformation)
head(mcsm_data_s$Position)
# now compare: should be TRUE, but is FALSE
# most likely due to differing rownames
identical(mcsm_data_s, my_df2)
# from library dplyr
setdiff(mcsm_data_s, my_df2)
#from lib compare
compare(mcsm_data_s, my_df2) # seems rownames are the problem
# FIXME: automate this
# write files: checked using meld and files are indeed identical
#write.csv(mcsm_data_s, "mcsm_data_s.csv", row.names = F)
#write.csv(my_df2, "my_df2.csv", row.names = F)
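# a possible automation of the above check (a sketch, assuming rownames are
# the only difference; not part of the original script):
# rownames(mcsm_data_s) = NULL
# rownames(my_df2) = NULL
# identical(mcsm_data_s, my_df2) # should now be TRUE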
#====================================================== end of section 1
##########################################################
# Checking: LIG(Filtered dataframe) #
##########################################################
# clear workspace
rm(list = ls())
###########################
#3) Read file: combined_lig from the script
###########################
source("combining_two_df_lig.R")
# df with NA:
# merged_df2 :
# merged_df3:
# df without NA:
# merged_df2_comp:
# merged_df3_comp:
######################
# You need to check it
# with the merged_df3
########################
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# REASSIGNMENT
my_df = merged_df3
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#clear variables
rm(merged_df2, merged_df2_comp, merged_df3, merged_df3_comp)
# should be true
identical(my_df$Position, my_df$position)
#################################
# Read file: normalised file
# output of step 4 mcsm_pipeline
#################################
inDir = "~/git/Data/pyrazinamide/input/processed/"
inFile = paste0(inDir, "mcsm_complex1_normalised.csv"); inFile
mcsm_data <- read.csv(inFile
, row.names = 1
, stringsAsFactors = F
, header = T)
str(mcsm_data)
###########################
# 4a: Filter/subset data: ONLY for LIGand analysis
# Lig plots < 10Ang
# Filter the lig plots for Dis_to_lig < 10Ang
###########################
# sanity checks
upos = unique(mcsm_data$Position)
# check range of distances
max(mcsm_data$Dis_lig_Ang)
min(mcsm_data$Dis_lig_Ang)
# Lig filtered: subset data to have only values less than 10 Ang
mcsm_data2 = subset(mcsm_data, mcsm_data$Dis_lig_Ang < 10)
rm(mcsm_data) #to avoid confusion
table(mcsm_data2$Dis_lig_Ang<10)
table(mcsm_data2$Dis_lig_Ang>10)
max(mcsm_data2$Dis_lig_Ang)
min(mcsm_data2$Dis_lig_Ang)
upos_f = unique(mcsm_data2$Position); upos_f
# colnames of df that you will need to subset the bigger df from
my_colnames = colnames(mcsm_data2)
#====================================
# subset bigger df i.e my_df to include only the columns in mcsm data2
my_df2 = my_df[my_colnames]
rm(my_df) #to avoid confusion
#====================================
# compare the two
head(mcsm_data2$Mutationinformation)
head(mcsm_data2$Position)
head(my_df2$Mutationinformation)
head(my_df2$Position)
# sort mcsm data by Mutationinformation
mcsm_data2_s = mcsm_data2[order(mcsm_data2$Mutationinformation),]
head(mcsm_data2_s$Mutationinformation)
head(mcsm_data2_s$Position)
# now compare: should be TRUE, but is FALSE
# most likely due to differing rownames
identical(mcsm_data2_s, my_df2)
# from library dplyr
setdiff(mcsm_data2_s, my_df2)
# from library compare
compare(mcsm_data2_s, my_df2) # seems rownames are the problem
#FIXME: automate this
# write files: checked using meld and files are indeed identical
#write.csv(mcsm_data2_s, "mcsm_data2_s.csv", row.names = F)
#write.csv(my_df2, "my_df2.csv", row.names = F)
##########################################################
# extract and write output file for SNP posn: all #
##########################################################
head(merged_df3$Position)
foo = merged_df3[order(merged_df3$Position),]
head(foo$Position)
snp_pos_unique = unique(foo$Position); snp_pos_unique
# sanity check:
table(snp_pos_unique == combined_df$Position)
#=====================
# write_output files
#=====================
outDir = "~/Data/pyrazinamide/input/processed/"
outFile1 = paste0(outDir, "snp_pos_unique.txt"); outFile1
print(paste0("Output file name and path will be:","", outFile1))
write.table(snp_pos_unique
, outFile1
, row.names = F
, col.names = F)
##############################################################
# extract and write output file for SNP posn: complete only #
##############################################################
head(merged_df3_comp$Position)
foo = merged_df3_comp[order(merged_df3_comp$Position),]
head(foo$Position)
snp_pos_unique = unique(foo$Position); snp_pos_unique
# outDir = "~/Data/pyrazinamide/input/processed/" # already set
outFile2 = paste0(outDir, "snp_pos_unique_comp.txt")
print(paste0("Output file name and path will be:", outFile2))
write.table(snp_pos_unique
, outFile2
, row.names = F
, col.names = F)
#============================== end of script

mcsm_na/examples.py Normal file

@@ -0,0 +1,56 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 12 12:15:26 2021
@author: tanu
"""
import os
homedir = os.path.expanduser('~')
os.chdir (homedir + '/git/LSHTM_analysis/mcsm_na')
from submit_mcsm_na import *
from get_results_mcsm_na import *
#%%#####################################################################
#EXAMPLE RUN for different stages
#=====================
# STAGE: submit_mcsm_na.py
#=====================
my_host = 'http://biosig.unimelb.edu.au'
my_prediction_url = f"{my_host}/mcsm_na/run_prediction_list"
print(my_prediction_url)
my_outdir = homedir + '/git/LSHTM_analysis/mcsm_na'
my_nuc_type = 'RNA'
my_pdb_file = homedir + '/git/Data/streptomycin/input/gid_complex.pdb'
my_mutation_list = homedir + '/git/LSHTM_analysis/mcsm_na/test_snps_b1.csv'
my_suffix = 'TEST'
#----------------------------------------------
# example 1: 2 snps in a file
#----------------------------------------------
submit_mcsm_na(host_url = my_host
, pdb_file = my_pdb_file
, mutation_list = my_mutation_list
, nuc_type = my_nuc_type
, prediction_url = my_prediction_url
, output_dir = my_outdir
, outfile_suffix = my_suffix)
#%%###################################################################
#=====================
# STAGE: get_results.py
#=====================
my_host = 'http://biosig.unimelb.edu.au'
my_outdir = homedir + '/git/LSHTM_analysis/mcsm_na'
#----------------------------------------------
# example 1: single url in a single file
#----------------------------------------------
my_url_file_single = homedir + '/git/LSHTM_analysis/mcsm_na/mcsm_na_temp/mcsm_na_result_url_gid_test_b1.txt'
print(my_url_file_single)
my_suffix = 'single'
get_results(url_file = my_url_file_single
, host_url = my_host
, output_dir = my_outdir
, outfile_suffix = my_suffix)


@@ -0,0 +1,134 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
#%%#####################################################################
def format_mcsm_na_output(mcsm_na_output_tsv):
"""
    @param mcsm_na_output_tsv: file containing mcsm_na results for all mutations,
    i.e. all mcsm_na batch results combined into one file via bash scripts
    (run after run_get_results_mcsm_na.py).
    @type string
    @return formatted mcsm_na output (the calling script writes it out as csv)
    @type pandas df
"""
#############
# Read file
#############
mcsm_na_data_raw = pd.read_csv(mcsm_na_output_tsv, sep = '\t')
# strip white space from both ends in all columns
mcsm_na_data = mcsm_na_data_raw.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
dforig_shape = mcsm_na_data.shape
print('dimensions of input file:', dforig_shape)
#############
# rename cols
#############
# format colnames: all lowercase and consistent colnames
mcsm_na_data.columns
print('Assigning meaningful colnames'
, '\n=======================================================')
my_colnames_dict = {'PDB_FILE': 'pdb_file' # relevant info from this col will be extracted and the column discarded
, 'CHAIN': 'chain'
, 'WILD_RES': 'wild_type' # one letter amino acid code
, 'RES_POS': 'position' # number
, 'MUT_RES': 'mutant_type' # one letter amino acid code
                        , 'RSA': 'rsa' # relative solvent accessibility (numeric)
                        , 'PRED_DDG': 'mcsm_na_affinity'} # predicted ddG (numeric)
mcsm_na_data.rename(columns = my_colnames_dict, inplace = True)
mcsm_na_data.columns
#%%============================================================================
#############
# create mutationinformation column
#############
#mcsm_na_data['mutationinformation'] = mcsm_na_data['wild_type'] + mcsm_na_data.position.map(str) + mcsm_na_data['mutant_type']
mcsm_na_data['mutationinformation'] = mcsm_na_data.loc[:,'wild_type'] + mcsm_na_data.loc[:,'position'].astype(int).apply(str) + mcsm_na_data.loc[:,'mutant_type']
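    # e.g. wild_type 'P', position 3, mutant_type 'S' -> mutationinformation 'P3S'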
#%%=====================================================================
#############
# Create col: mcsm_na_outcome
#############
# classification based on mcsm_na_affinity values
print('Assigning col: mcsm_na_outcome based on mcsm_na_affinity')
print('Sanity check:')
# count positive values in the mcsm_na_affinity column
c = mcsm_na_data[mcsm_na_data['mcsm_na_affinity']>=0].count()
mcsm_na_pos = c.get(key = 'mcsm_na_affinity')
# Assign category based on sign (+ve : I_affinity, -ve: R_affinity)
mcsm_na_data['mcsm_na_outcome'] = np.where(mcsm_na_data['mcsm_na_affinity']>=0, 'Increased_affinity', 'Reduced_affinity')
print('mcsm_na Outcome:', mcsm_na_data['mcsm_na_outcome'].value_counts())
#if mcsm_na_pos == mcsm_na_data['mcsm_na_outcome'].value_counts()['Increased_affinity']:
# print('PASS: mcsm_na_outcome assigned correctly')
#else:
# print('FAIL: mcsm_na_outcome assigned incorrectly'
# , '\nExpected no. of Increased_affinity mutations:', mcsm_na_pos
# , '\nGot no. of Increased affinity mutations', mcsm_na_data['mcsm_na_outcome'].value_counts()['Increased_affinity']
# , '\n======================================================')
#%%=====================================================================
#############
# scale mcsm_na values
#############
    # Rescale values in the mcsm_na_affinity col to between -1 and 1,
    # preserving the sign (negative stays negative, positive stays positive)
mcsm_na_min = mcsm_na_data['mcsm_na_affinity'].min()
mcsm_na_max = mcsm_na_data['mcsm_na_affinity'].max()
mcsm_na_scale = lambda x : x/abs(mcsm_na_min) if x < 0 else (x/mcsm_na_max if x >= 0 else 'failed')
mcsm_na_data['mcsm_na_scaled'] = mcsm_na_data['mcsm_na_affinity'].apply(mcsm_na_scale)
print('Raw mcsm_na scores:\n', mcsm_na_data['mcsm_na_affinity']
, '\n---------------------------------------------------------------'
, '\nScaled mcsm_na scores:\n', mcsm_na_data['mcsm_na_scaled'])
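    # (sketch, not in the original) the same piecewise scaling, vectorised with
    # numpy instead of the row-wise apply; assumes mcsm_na_min < 0 < mcsm_na_max:
    # mcsm_na_data['mcsm_na_scaled'] = np.where(
    #     mcsm_na_data['mcsm_na_affinity'] < 0
    #     , mcsm_na_data['mcsm_na_affinity'] / abs(mcsm_na_min)
    #     , mcsm_na_data['mcsm_na_affinity'] / mcsm_na_max)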
c2 = mcsm_na_data[mcsm_na_data['mcsm_na_scaled']>=0].count()
mcsm_na_pos2 = c2.get(key = 'mcsm_na_affinity')
if mcsm_na_pos == mcsm_na_pos2:
print('\nPASS: Affinity values scaled correctly')
else:
print('\nFAIL: Affinity values scaled numbers MISmatch'
, '\nExpected number:', mcsm_na_pos
, '\nGot:', mcsm_na_pos2
, '\n======================================================')
#%%=====================================================================
#############
# reorder columns
#############
mcsm_na_data.columns
mcsm_na_dataf = mcsm_na_data[['mutationinformation'
, 'mcsm_na_affinity'
, 'mcsm_na_scaled'
, 'mcsm_na_outcome'
, 'rsa'
, 'wild_type'
, 'position'
, 'mutant_type'
, 'chain'
, 'pdb_file']]
return(mcsm_na_dataf)
#%%#####################################################################


@@ -0,0 +1,52 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
#%%#####################################################################
def get_results(url_file, host_url, output_dir, outfile_suffix):
    # initialise empty df
#mcsm_na_results_out_df = pd.DataFrame()
with open(url_file, 'r') as f:
for count, line in enumerate(f):
line = line.strip()
print('URL no.', count+1, '\n', line)
#============================
# Writing results file: csv
#============================
mcsm_na_results_dir = output_dir + '/mcsm_na_results'
if not os.path.exists(mcsm_na_results_dir):
print('\nCreating dir: mcsm_na_results within:', output_dir )
os.makedirs(mcsm_na_results_dir)
# Download the .txt
prediction_number = re.search(r'([0-9]+\.[0-9]+$)', line).group(0)
print('CHECK prediction no:', prediction_number)
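            # e.g. a url ending in 'results_prediction/1613147445.16'
            # yields prediction_number '1613147445.16'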
txt_url = f"{host_url}/mcsm_na/static/results/" + prediction_number + '.txt'
print('CHECK txt url:', txt_url)
out_filename = mcsm_na_results_dir + '/' + outfile_suffix + '_output_' + prediction_number + '.txt.gz'
response_txt = requests.get(txt_url, stream = True)
if response_txt.status_code == 200:
print('\nDownloading .txt:', txt_url
, '\n\nSaving file as:', out_filename)
with open(out_filename, 'wb') as f:
f.write(response_txt.raw.read())
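            # (sketch, not in the original) an else branch here would surface
            # silent download failures:
            # else:
            #     print('FAIL: status code', response_txt.status_code, 'for url:', txt_url)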
#%%#####################################################################


@@ -0,0 +1 @@
http://biosig.unimelb.edu.au/mcsm_na/results_prediction/1613147445.16


@@ -0,0 +1,78 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 12 12:15:26 2021
@author: tanu
"""
#%% load packages
import os
homedir = os.path.expanduser('~')
os.chdir (homedir + '/git/LSHTM_analysis/mcsm_na')
from format_results_mcsm_na import *
########################################################################
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug' , help = 'drug name (case sensitive)', default = None)
arg_parser.add_argument('-g', '--gene' , help = 'gene name (case sensitive)', default = None)
arg_parser.add_argument('--datadir' , help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir' , help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
#arg_parser.add_argument('--mkdir_name' , help = 'Output dir for processed results. This will be created if it does not exist')
arg_parser.add_argument('-m', '--make_dirs' , help = 'Make dir for input and output', action='store_true')
arg_parser.add_argument('--debug' , action = 'store_true' , help = 'Debug Mode')
args = arg_parser.parse_args()
#%%============================================================================
# variable assignment: input and output paths & filenames
drug = args.drug
gene = args.gene
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
#outdir_ppi2 = args.mkdir_name
make_dirs = args.make_dirs
#=======
# dirs
#=======
if not datadir:
datadir = homedir + '/git/Data/'
if not indir:
indir = datadir + drug + '/input/'
if not outdir:
outdir = datadir + drug + '/output/'
#if not mkdir_name:
# outdir_na = outdir + 'mcsm_na_results/'
outdir_na = outdir + 'mcsm_na_results/'
# Input file
infile_mcsm_na = outdir_na + gene.lower() + '_output_combined_clean.tsv'
# Formatted output file
outfile_mcsm_na_f = outdir_na + gene.lower() + '_complex_mcsm_na_norm.csv'
#===========================================
# CALL: format_mcsm_na_output()
# Data: gid+streptomycin
# Data: rpob+rifampicin, date: 18/11/2021
#===========================================
print('Formatting results for:', infile_mcsm_na)
mcsm_na_df_f = format_mcsm_na_output(mcsm_na_output_tsv = infile_mcsm_na)
# writing file
print('Writing formatted df to csv')
mcsm_na_df_f.to_csv(outfile_mcsm_na_f, index = False)
print('Finished writing file:'
, '\nFile:', outfile_mcsm_na_f
, '\nExpected no. of rows:', len(mcsm_na_df_f)
, '\nExpected no. of cols:', len(mcsm_na_df_f.columns)
, '\n=============================================================')
#%%#####################################################################


@@ -0,0 +1,42 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 12 12:15:26 2021
@author: tanu
"""
#%% load packages
import os
homedir = os.path.expanduser('~')
os.chdir (homedir + '/git/LSHTM_analysis/mcsm_na')
from get_results_mcsm_na import *
########################################################################
# variables
my_host = 'http://biosig.unimelb.edu.au'
# TODO: add cmd line args
#gene = 'gid'
drug = 'streptomycin'
datadir = homedir + '/git/Data'
indir = datadir + '/' + drug + '/input'
outdir = datadir + '/' + drug + '/output'
#==============================================================================
# batch 26: 25.txt, RETRIEVED: 16 Feb:
# batch 27: 26.txt, RETRIEVED: 6 Aug:
my_url_file = outdir + '/mcsm_na_temp/mcsm_na_result_url_gid_b27.txt'
my_suffix = 'gid_b27'
#==============================================================================
#==========================
# CALL: get_results()
# Data: gid+streptomycin
#==========================
print('Downloading results for:', my_url_file, '\nsuffix:', my_suffix)
get_results(url_file = my_url_file
, host_url = my_host
, output_dir = outdir
, outfile_suffix = my_suffix)
#%%#####################################################################

mcsm_na/run_submit_mcsm_na.py Executable file

@@ -0,0 +1,49 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 12 12:15:26 2021
@author: tanu
"""
#%% load packages
import os
homedir = os.path.expanduser('~')
os.chdir (homedir + '/git/LSHTM_analysis/mcsm_na')
from submit_mcsm_na import *
########################################################################
# variables
my_host = 'http://biosig.unimelb.edu.au'
my_prediction_url = f"{my_host}/mcsm_na/run_prediction_list"
print(my_prediction_url)
# TODO: add cmd line args
gene = 'gid'
drug = 'streptomycin'
datadir = homedir + '/git/Data/'
indir = datadir + drug + '/input/'
outdir = datadir + drug + '/output/'
outdir_mcsm_na = outdir + 'mcsm_na_results/'
my_nuc_type = 'RNA'
my_pdb_file = indir + gene.lower() + '_complex.pdb'
#=============================================================================
# batch 26: 25.txt # RAN: 16 Feb:
# batch 27: 26.txt # RAN: 6 Aug:
# note: batch file numbering is off by one (batch 27 uses snp_batch_26.txt)
my_mutation_list = outdir + '/snp_batches/20/snp_batch_26.txt'
my_suffix = 'gid_b27'
#==============================================================================
#==========================
# CALL: submit_mcsm_na()
# Data: gid+streptomycin
#==========================
submit_mcsm_na(host_url = my_host
, pdb_file = my_pdb_file
, mutation_list = my_mutation_list
, nuc_type = my_nuc_type
, prediction_url = my_prediction_url
, output_dir = outdir_mcsm_na
, outfile_suffix = my_suffix)
#%%#####################################################################

mcsm_na/split_csv.sh Executable file

@@ -0,0 +1,27 @@
#!/bin/bash
# FIXME: This is written for expediency to kickstart running dynamut and mcsm-NA
# Usage: ~/git/LSHTM_analysis/mcsm_na/split_csv.sh <input file> <output dir> <chunk size in lines>
# copy your snp file to split into the mcsm_na dir
INFILE=$1
OUTDIR=$2
CHUNK=$3
mkdir -p ${OUTDIR}/${CHUNK}
cd ${OUTDIR}/${CHUNK}
split ../../${INFILE} -l ${CHUNK} -d snp_batch_
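# e.g. CHUNK=50 on a 120-line file yields snp_batch_00 and snp_batch_01
# (50 lines each) and snp_batch_02 (20 lines)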
# use case
#~/git/LSHTM_analysis/mcsm_na/split_csv.sh gid_mcsm_formatted_snps.csv snp_batches 50
#~/git/LSHTM_analysis/mcsm_na/split_csv.sh embb_mcsm_formatted_snps.csv snp_batches 50
#~/git/LSHTM_analysis/mcsm_na/split_csv.sh rpob_mcsm_formatted_snps_chain.csv snp_batches 20 # date: 17/11/2021
# accidentally replaced the original rpob batch files
#~/git/LSHTM_analysis/mcsm_na/split_csv.sh 5uhc_mcsm_formatted_snps_chain.csv snp_batches_5uhc 20 # date: 17/11/2021

mcsm_na/split_csv_chain.sh Executable file

@@ -0,0 +1,27 @@
#!/bin/bash
# FIXME: This is written for expediency to kickstart running dynamut, mcsm-PPI2 (batch of 50) and mCSM-NA (batch of 20)
# Usage: ~/git/LSHTM_analysis/mcsm_na/split_csv_chain.sh <input file> <output dir> <chunk size in lines>
# copy your snp file to split into the mcsm_na dir
# use sed to add chain ID to snp file and then split to avoid post processing
INFILE=$1
OUTDIR=$2
CHUNK=$3
mkdir -p ${OUTDIR}/${CHUNK}/chain_added
cd ${OUTDIR}/${CHUNK}/chain_added
# we are now 3 dirs deep (OUTDIR/CHUNK/chain_added), hence ../../../
split ../../../${INFILE} -l ${CHUNK} -d snp_batch_
########################################################################
# use cases
# Date: 29/10/2021, 5UHC (for rifampicin)
~/git/LSHTM_analysis/mcsm_na/split_csv_chain.sh rpob_mcsm_formatted_snps_chain.csv snp_batches 20
# add .txt to the files
for i in {00..56}; do mv snp_batch_${i} snp_batch_${i}_chain.txt; done
########################################################################

mcsm_na/split_format_csv.sh Executable file

@@ -0,0 +1,19 @@
#!/bin/bash
# FIXME: This is written for expediency to kickstart running dynamut and mcsm-NA
# Usage: ~/git/LSHTM_analysis/mcsm_na/split_format_csv.sh <input file> <output dir> <chunk size in lines>
# copy your snp file to split into the mcsm_na dir
INFILE=$1
OUTDIR=$2
CHUNK=$3
mkdir -p ${OUTDIR}/${CHUNK}
cd ${OUTDIR}/${CHUNK}
split ../../${INFILE} -l ${CHUNK} -d snp_batch_
for i in *; do mv "$i" "$i.txt"; done
sed -i 's/^/A /g' *.txt
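# e.g. a line 'P3S' becomes 'A P3S', matching the mutation-list format
# (chain id followed by the mutation) that mCSM-NA expects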

mcsm_na/submit_mcsm_na.py Normal file

@@ -0,0 +1,84 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
#%%#####################################################################
def submit_mcsm_na(host_url
, pdb_file
, mutation_list
, nuc_type
, prediction_url
, output_dir
, outfile_suffix
):
"""
Makes a POST request for mcsm_na predictions.
@param host_url: valid host url for submitting the job
@type string
@param pdb_file: valid path to pdb structure
@type string
    @param mutation_list: file listing mutations (1 per line) in the format: {chain} {WT}{POS}{MUT}, e.g. "A X1Z"
@type string
@param nuc_type: Nucleic acid type
@type string
@param prediction_url: mcsm_na url for prediction
@type string
@param output_dir: output dir
@type string
@param outfile_suffix: outfile_suffix
@type string
@return writes a .txt file containing url for the snps processed with user provided suffix in filename
@type string
"""
with open(pdb_file, "rb") as pdb_file, open (mutation_list, "rb") as mutation_list:
files = {"wild": pdb_file
, "mutation_list": mutation_list}
body = {"na_type": nuc_type
,"pred_type": 'list',
"pdb_code": ''} # apparently needs it even though blank!
response = requests.post(prediction_url, files = files, data = body)
print(response.status_code)
if response.history:
print('\nPASS: valid submission. Fetching result url')
url_match = re.search('/mcsm_na/results_prediction/.+(?=")', response.text)
url = host_url + url_match.group()
print('\nURL for snp batch no ', str(outfile_suffix), ':', url)
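            # e.g. url_match yields '/mcsm_na/results_prediction/1613147445.16',
            # which host_url prefixes to form the full result url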
#===============
# writing file: result urls
#===============
mcsm_na_temp_dir = output_dir + '/mcsm_na_temp' # creates a temp dir within output_dir
if not os.path.exists(mcsm_na_temp_dir):
print('\nCreating mcsm_na_temp in output_dir', output_dir )
os.makedirs(mcsm_na_temp_dir)
out_url_file = mcsm_na_temp_dir + '/mcsm_na_result_url_' + str(outfile_suffix) + '.txt'
print('\nWriting output url file:', out_url_file)
myfile = open(out_url_file, 'a')
myfile.write(url)
myfile.close()
#%%#####################################################################

mcsm_na/test_snps_b1.csv Normal file

@@ -0,0 +1,2 @@
A P3S
A I4N


@@ -0,0 +1,210 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 19 14:33:51 2020
@author: tanu
"""
#%% load packages
import os,sys
homedir = os.path.expanduser('~')
import subprocess
import argparse
import requests
import re
import time
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
sys.path.append(homedir + '/git/LSHTM_analysis/scripts')
from reference_dict import up_3letter_aa_dict
from reference_dict import oneletter_aa_dict
#%%============================================================================
def format_mcsm_ppi2_output(mcsm_ppi2_output_csv, gene_name):
"""
    @param mcsm_ppi2_output_csv: file containing mcsm_ppi2 results for all mcsm snps,
    i.e. all mcsm_ppi2 batch results combined into one file via bash scripts.
    @type string
    @param gene_name: gene name, used to apply gene-specific position offsets
    @type string
    @return formatted mcsm_ppi2 output (the calling script writes it out as csv)
    @type pandas df
"""
#############
# Read file
#############
mcsm_ppi2_data_raw = pd.read_csv(mcsm_ppi2_output_csv, sep = ',')
# strip white space from both ends in all columns
mcsm_ppi2_data = mcsm_ppi2_data_raw.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
dforig_shape = mcsm_ppi2_data.shape
print('dimensions of input file:', dforig_shape)
#############
# Map 3 letter
# code to one
#############
# initialise a sub dict that is lookup dict for
# 3-LETTER aa code to 1-LETTER aa code
lookup_dict = dict()
for k, v in up_3letter_aa_dict.items():
lookup_dict[k] = v['one_letter_code']
wt = mcsm_ppi2_data['wild-type'].squeeze() # converts to a series that map works on
mcsm_ppi2_data['w_type'] = wt.map(lookup_dict)
mut = mcsm_ppi2_data['mutant'].squeeze()
mcsm_ppi2_data['m_type'] = mut.map(lookup_dict)
# #############
# # CHECK
# # Map 1 letter
# # code to 3Upper
# #############
# # initialise a sub dict that is lookup dict for
# # 3-LETTER aa code to 1-LETTER aa code
# lookup_dict = dict()
# for k, v in oneletter_aa_dict.items():
# lookup_dict[k] = v['three_letter_code_upper']
# wt = mcsm_ppi2_data['w_type'].squeeze() #converts to a series that map works on
# mcsm_ppi2_data['WILD'] = wt.map(lookup_dict)
# mut = mcsm_ppi2_data['m_type'].squeeze()
# mcsm_ppi2_data['MUT'] = mut.map(lookup_dict)
# # check
# mcsm_ppi2_data['wild-type'].equals(mcsm_ppi2_data['WILD'])
# mcsm_ppi2_data['mutant'].equals(mcsm_ppi2_data['MUT'])
#%%=====================================================================
    # add offset-corrected position numbers for rpob, since 5uhc (chain 'C')
    # was used to run the analysis
geneL_sp = ['rpob']
if gene_name.lower() in geneL_sp:
offset = 6
chain_orig = 'A'
# Add offset corrected position number. matching with rpob nsSNPs used for mCSM-lig
# and also add corresponding chain id matching with rpob nsSNPs used for mCSM-lig
mcsm_ppi2_data['position'] = mcsm_ppi2_data['res-number'] - offset
mcsm_ppi2_data['chain'] = chain_orig
mcsm_ppi2_data['5uhc_offset'] = offset
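        # e.g. (illustrative numbers) 5uhc res-number 441 -> position 435 (441 - 6)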
#############
# rename cols
#############
# format colnames: all lowercase and consistent colnames
mcsm_ppi2_data.columns
print('Assigning meaningful colnames'
, '\n=======================================================')
my_colnames_dict = {'chain' : 'chain'
, 'position' : 'position'
, '5uhc_offset' : '5uhc_offset'
, 'wild-type' : 'wt_upper'
, 'res-number' : '5uhc_position'
, 'mutant' : 'mut_upper'
, 'distance-to-interface': 'interface_dist'
, 'mcsm-ppi2-prediction' : 'mcsm_ppi2_affinity'
, 'affinity' : 'mcsm_ppi2_outcome'
, 'w_type' : 'wild_type' # one letter amino acid code
, 'm_type' : 'mutant_type' # one letter amino acid code
}
else:
my_colnames_dict = {'chain' : 'chain'
, 'wild-type' : 'wt_upper'
, 'res-number' : 'position'
, 'mutant' : 'mut_upper'
, 'distance-to-interface': 'interface_dist'
, 'mcsm-ppi2-prediction' : 'mcsm_ppi2_affinity'
, 'affinity' : 'mcsm_ppi2_outcome'
, 'w_type' : 'wild_type' # one letter amino acid code
, 'm_type' : 'mutant_type' # one letter amino acid code
}
#%%==============================================================================
mcsm_ppi2_data.rename(columns = my_colnames_dict, inplace = True)
mcsm_ppi2_data.columns
#############
# create mutationinformation column
#############
#mcsm_ppi2_data['mutationinformation'] = mcsm_ppi2_data['wild_type'] + mcsm_ppi2_data.position.map(str) + mcsm_ppi2_data['mutant_type']
mcsm_ppi2_data['mutationinformation'] = mcsm_ppi2_data.loc[:,'wild_type'] + mcsm_ppi2_data.loc[:,'position'].astype(int).apply(str) + mcsm_ppi2_data.loc[:,'mutant_type']
#%%=====================================================================
#########################
# scale mcsm_ppi2 values
#########################
# Rescale values in mcsm_ppi2_affinity col b/w -1 and 1 so negative numbers
# stay neg and pos numbers stay positive
mcsm_ppi2_min = mcsm_ppi2_data['mcsm_ppi2_affinity'].min()
mcsm_ppi2_max = mcsm_ppi2_data['mcsm_ppi2_affinity'].max()
mcsm_ppi2_scale = lambda x : x/abs(mcsm_ppi2_min) if x < 0 else (x/mcsm_ppi2_max if x >= 0 else 'failed')
mcsm_ppi2_data['mcsm_ppi2_scaled'] = mcsm_ppi2_data['mcsm_ppi2_affinity'].apply(mcsm_ppi2_scale)
print('Raw mcsm_ppi2 scores:\n', mcsm_ppi2_data['mcsm_ppi2_affinity']
, '\n---------------------------------------------------------------'
, '\nScaled mcsm_ppi2 scores:\n', mcsm_ppi2_data['mcsm_ppi2_scaled'])
c = mcsm_ppi2_data[mcsm_ppi2_data['mcsm_ppi2_affinity']>=0].count()
mcsm_ppi2_pos = c.get(key = 'mcsm_ppi2_affinity')
c2 = mcsm_ppi2_data[mcsm_ppi2_data['mcsm_ppi2_scaled']>=0].count()
mcsm_ppi2_pos2 = c2.get(key = 'mcsm_ppi2_scaled')
if mcsm_ppi2_pos == mcsm_ppi2_pos2:
print('\nPASS: Affinity values scaled correctly')
else:
print('\nFAIL: Affinity values scaled numbers MISmatch'
, '\nExpected number:', mcsm_ppi2_pos
, '\nGot:', mcsm_ppi2_pos2
, '\n======================================================')
#%%=====================================================================
###################
# reorder columns
###################
mcsm_ppi2_data.columns
#---------------------
# Determine col order
#---------------------
core_cols = ['mutationinformation'
, 'mcsm_ppi2_affinity'
, 'mcsm_ppi2_scaled'
, 'mcsm_ppi2_outcome'
, 'interface_dist'
, 'wild_type'
, 'position'
, 'mutant_type'
, 'wt_upper'
, 'mut_upper'
, 'chain']
if gene_name.lower() in geneL_sp:
column_order = core_cols + ['5uhc_offset', '5uhc_position']
else:
column_order = core_cols.copy()
#--------------
# reorder now
#--------------
mcsm_ppi2_dataf = mcsm_ppi2_data[column_order]
#%%============================================================================
###################
# Sort df based on
# position columns
###################
    mcsm_ppi2_dataf = mcsm_ppi2_dataf.sort_values(by = ['position', 'mutant_type'], ascending = True) # avoid inplace on a column subset
return(mcsm_ppi2_dataf)
#%%#####################################################################


@@ -0,0 +1,82 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb 12 12:15:26 2021
@author: tanu
"""
#%% load packages
import sys, os
homedir = os.path.expanduser('~')
#sys.path.append(homedir + '/git/LSHTM_analysis/mcsm_ppi2')
from format_results_mcsm_ppi2 import *
########################################################################
#%% command line args
arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('-d', '--drug' , help = 'drug name (case sensitive)', default = None)
arg_parser.add_argument('-g', '--gene' , help = 'gene name (case sensitive)', default = None)
arg_parser.add_argument('--datadir' , help = 'Data Directory. By default, it assumes homedir + git/Data')
arg_parser.add_argument('-i', '--input_dir' , help = 'Input dir containing pdb files. By default, it assumes homedir + <drug> + input')
arg_parser.add_argument('-o', '--output_dir', help = 'Output dir for results. By default, it assumes homedir + <drug> + output')
arg_parser.add_argument('--input_file' , help = 'Input file: combined mcsm_ppi2 results. By default, it assumes <output_dir>/mcsm_ppi2/<gene>_output_combined_clean.csv')
#arg_parser.add_argument('--mkdir_name' , help = 'Output dir for processed results. This will be created if it does not exist')
arg_parser.add_argument('-m', '--make_dirs' , help = 'Make dir for input and output', action='store_true')
arg_parser.add_argument('--debug' , action = 'store_true' , help = 'Debug Mode')
args = arg_parser.parse_args()
#%%============================================================================
# variable assignment: input and output paths & filenames
drug = args.drug
gene = args.gene
datadir = args.datadir
indir = args.input_dir
outdir = args.output_dir
infile_mcsm_ppi2 = args.input_file
#outdir_ppi2 = args.mkdir_name
make_dirs = args.make_dirs
#=======
# dirs
#=======
if not datadir:
datadir = homedir + '/git/Data/'
if not indir:
indir = datadir + drug + '/input/'
if not outdir:
outdir = datadir + drug + '/output/'
#if not mkdir_name:
# outdir_ppi2 = outdir + 'mcsm_ppi2/'
outdir_ppi2 = outdir + 'mcsm_ppi2/'
# Input file
if not infile_mcsm_ppi2:
infile_mcsm_ppi2 = outdir_ppi2 + gene.lower() + '_output_combined_clean.csv'
# Formatted output file
outfile_mcsm_ppi2_f = outdir_ppi2 + gene.lower() + '_complex_mcsm_ppi2_norm.csv'
#==========================
# CALL: format_mcsm_ppi2_output()
# Data: gid+streptomycin
#==========================
print('Formatting results for:', infile_mcsm_ppi2)
mcsm_ppi2_df_f = format_mcsm_ppi2_output(mcsm_ppi2_output_csv = infile_mcsm_ppi2, gene_name = gene)
# writing file
print('Writing formatted df to csv')
mcsm_ppi2_df_f.to_csv(outfile_mcsm_ppi2_f, index = False)
print('Finished writing file:'
, '\nFile:', outfile_mcsm_ppi2_f
, '\nExpected no. of rows:', len(mcsm_ppi2_df_f)
, '\nExpected no. of cols:', len(mcsm_ppi2_df_f.columns)
, '\n=============================================================')
#%%#####################################################################

Some files were not shown because too many files have changed in this diff.