/home/tanu/git/LSHTM_analysis/scripts/ml/ml_data_cd_7030.py:548: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mask_check.sort_values(by = ['ligand_distance'], ascending = True, inplace = True)
1.22.4
1.4.1

aaindex_df contains non-numerical data

Total no. of non-numerical columns: 2

Selecting numerical data only

PASS: successfully selected numerical columns only for aaindex_df

Now checking for NA in the remaining aaindex_cols

Counting aaindex_df cols with NA 
ncols with NA: 4 columns 
Dropping these... 
Original ncols: 127

Revised df ncols: 123

Checking NA in revised df...

PASS: cols with NA successfully dropped from aaindex_df 
Proceeding with combining aa_df with other features_df

PASS: ncols match 
Expected ncols: 123 
Got: 123

Total no. of columns in clean aa_df: 123

Proceeding to merge, expected nrows in merged_df: 271

PASS: my_features_df and aa_df successfully combined 
nrows: 271 
ncols: 269
count of NULL values before imputation

or_mychisq          256
log10_or_mychisq    256
dtype: int64
count of NULL values AFTER imputation

mutationinformation    0
or_rawI                0
logorI                 0
dtype: int64

PASS: OR values imputed, data ready for ML

Total no. of features for aaindex: 123

No. of numerical features: 168 
No. of categorical features: 7

PASS: x_features has no target variable

No. of columns for x_features: 175

------------------------------------------------------------- 
Successfully split data with stratification [COMPLETE data]: 70/30 
Original data size: (271, 175) 
Train data size: (181, 175) 
Test data size: (90, 175) 
y_train numbers: Counter({0: 180, 1: 1}) 
y_train ratio: 180.0 
 
y_test_numbers: Counter({0: 89, 1: 1}) 
y_test ratio: 89.0 
-------------------------------------------------------------

index: 0 
ind: 1

Mask count check: True

index: 1 
ind: 2

Mask count check: True
Original Data
 Counter({0: 180, 1: 1}) Data dim: (181, 175)

Simple Random OverSampling
 Counter({0: 180, 1: 180})
(360, 175)

Simple Random UnderSampling
 Counter({0: 1, 1: 1})
(2, 175)

Simple Combined Over and UnderSampling
 Counter({0: 180, 1: 180})
(360, 175)
Traceback (most recent call last):
  File "/home/tanu/git/LSHTM_analysis/scripts/ml/./alr_cd_7030.py", line 19, in <module>
    setvars(gene,drug)
  File "/home/tanu/git/LSHTM_analysis/scripts/ml/ml_data_cd_7030.py", line 745, in setvars
    X_smnc, y_smnc = sm_nc.fit_resample(X, y)
  File "/home/tanu/anaconda3/envs/UQ/lib/python3.9/site-packages/imblearn/base.py", line 83, in fit_resample
    output = self._fit_resample(X, y)
  File "/home/tanu/anaconda3/envs/UQ/lib/python3.9/site-packages/imblearn/over_sampling/_smote/base.py", line 533, in _fit_resample
    X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
  File "/home/tanu/anaconda3/envs/UQ/lib/python3.9/site-packages/imblearn/over_sampling/_smote/base.py", line 324, in _fit_resample
    nns = self.nn_k_.kneighbors(X_class, return_distance=False)[:, 1:]
  File "/home/tanu/anaconda3/envs/UQ/lib/python3.9/site-packages/sklearn/neighbors/_base.py", line 749, in kneighbors
    raise ValueError(
ValueError: Expected n_neighbors <= n_samples,  but n_samples = 1, n_neighbors = 6
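
The ValueError above is consistent with the class counts printed earlier in this log: y_train holds a single sample of class 1, while the neighbour search asks for n_neighbors = 6. In imbalanced-learn, SMOTE-type samplers default to k_neighbors=5 and query one extra neighbour because each sample comes back as its own nearest neighbour (the `[:, 1:]` slice in the traceback drops it), which would account for the 6. A minimal sketch of a guard that could run before fit_resample follows. safe_smote_k is a hypothetical helper, not part of imbalanced-learn or of ml_data_cd_7030.py; it looks only at the labels and conservatively uses the smallest class count.

```python
from collections import Counter


def safe_smote_k(y, default_k=5):
    """Return a k_neighbors value usable for labels y, or None.

    Hypothetical helper: None means the smallest class has fewer than
    two samples, so there is no neighbour to interpolate towards.
    """
    counts = Counter(y)
    smallest = min(counts.values())
    # The sampler queries k_neighbors + 1 neighbours per sample (the sample
    # itself is included), so the class needs at least k_neighbors + 1 rows.
    if smallest < 2:
        return None
    return min(default_k, smallest - 1)


# Label counts taken from the log: Counter({0: 180, 1: 1})
y_train = [0] * 180 + [1]
print(safe_smote_k(y_train))                 # None
# Illustrative counts only, not from the log
print(safe_smote_k([0] * 180 + [1] * 4))     # 3
```

With only one class-1 sample there is nothing to interpolate between, so lowering k_neighbors alone would not help for this gene/drug pair; the helper returning None could be used to skip the SMOTE step and fall back to the random oversampling result shown earlier in the log. When it returns an integer, that value could be passed as the k_neighbors argument of the sampler.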