
Author:
Israel Avihail

Required libraries:
######################################################################################################
import of file data_analysis
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 - argparse
the ArgumentParser object holds all the information necessary to parse the command line into Python data types.
 - os
os.path.exists is used to check whether a path exists; if it does not exist, os.mkdir is used to create it.
 - pickle
serializes objects so they can be saved to a file and loaded into a program again later on.
 - Sequence
used to check whether an object is of type Sequence.
 - operator.itemgetter
returns a callable that fetches an item from its operand; we use it to pull items inside generators.
 - numpy
numpy is a Python library used for working with arrays.
 - pandas
used to read csv files and operate on them; Interval is used to check whether an object is of that type, and pd.IntervalIndex is used to convert values to intervals.
 - stats
we import scipy.stats to use its entropy function, which represents the effective size of a probability space.
 - sklearn
KMeans, GaussianNB, CategoricalNB, KNeighborsClassifier, StandardScaler: used for calculation and generation of Confusion Matrix PDFs.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end data_analysis
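
The os and pickle usage described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names save_object and load_object are hypothetical.

```python
import os
import pickle


def save_object(obj, directory, file_name):
    """Pickle obj into directory/file_name, creating the directory if needed."""
    # Check with os.path.exists, create with os.mkdir (as described above).
    if not os.path.exists(directory):
        os.mkdir(directory)
    path = os.path.join(directory, file_name)
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_object(path):
    """Load a previously pickled object back into the program."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Note that os.mkdir creates only the last path component; the parent folder has to exist already.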

import of file Classifier_Algorithm
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 - abc
This module provides the infrastructure for defining abstract base classes (ABCs) in Python; we use it to define an interface.

import of file dictionary_tree

 - copy
used for deep copies.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end Classifier_Algorithm
import of file entropy
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////
 - Fraction
used to convert two numbers into a rational number, and to check whether an object is of type Fraction.
 - log2
used to calculate entropy
////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end entropy
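
A minimal sketch of how Fraction and log2 can combine into an entropy calculation. It assumes the input is a list of class counts; the project's own entropy file may be organised differently.

```python
from fractions import Fraction
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as class counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count == 0:
            continue  # a class that never occurs contributes nothing
        probability = Fraction(count, total)  # exact rational probability
        p = float(probability)
        result -= p * log2(p)
    return result
```

For two equally likely classes, entropy([5, 5]) gives 1.0 bit; for a single class the entropy is 0.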
import of file entropy_discretization

 - combinations
returns r-length tuples in sorted order with no repeated elements; we use it in the entropy code as a generator that returns one tuple at a time from the list.
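
A short sketch of the generator pattern described above. The function name candidate_cut_sets and the idea of choosing cut points are assumptions made for illustration only.

```python
from itertools import combinations


def candidate_cut_sets(cut_points, r):
    """Yield every way to choose r cut points, one tuple at a time."""
    # combinations() is lazy: tuples are produced only when requested.
    # Sorting the input first guarantees the tuples come out in sorted order.
    for chosen in combinations(sorted(cut_points), r):
        yield chosen
```

Calling next() on the generator returns a single tuple, so a long list of candidates never has to be held in memory at once.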
##################################################################################################################### End Required libraries

How to add a custom classification algorithm:
#####################################################################################################################
step 1: add the algorithm to the package/folder "classification_algorithms" within the project
step 2: in the package/folder "classification_algorithms" within the "__init__.py" file:
	 2.1: import the new algorithm (class). example: "from classification_algorithms.algorithm_file import AlgorithmName"
	 2.2: add the algorithm (class) to "__algorithm__" list. example: "[AlgorithmName, <existing algorithms...>]"
*It is recommended that the classification algorithm implement the classification algorithm interface "ClassifierAlgorithm", which is located in the "project_util" package/folder.
example:
	from project_util.classifier_algorithm import ClassifierAlgorithm
	class AlgorithmName(ClassifierAlgorithm):

##################################################################################################################### end add custom classification models

Preparing and running within a Virtual Environment:
#####################################################################################################################
-- option 1 --
	step 1: create a virtual environment (Python must be installed beforehand)
		1.1. open a shell within the project folder ("ClassifyingModels")
		1.1.1 on Windows run the command:  py -m venv env
		1.1.2 on Unix/MacOs run the command: python3 -m venv env
	step 2: activate the environment
		2.1 on Windows run the command: .\env\Scripts\activate
		2.2 on Unix/MacOs run the command: source env/bin/activate
	step 3: install requirements
		3.1 on Windows run the command: py -m pip install -r requirements.txt
		3.2 on Unix/MacOs run the command: python3 -m pip install -r requirements.txt

-- option 2 --
	1. on Windows run the bat file within the project folder: "run_windows.bat"
	2. on Unix/MacOs (or Windows with a supporting shell such as git) run the sh file within the project folder: "run_unix_mac.sh"

When the environment is all set, the program can be run.

Important!!
After either option, when work with the program is finished, the environment needs to be deactivated:
	In the active environment's open shell run the command: deactivate

##################################################################################################################### end environment preparation

How to Run:
#####################################################################################################################
Command help section - To see this text (the help section) via the program  - run the command (in the open shell): classification_models.py -h
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

usage: classification_models.py [-h] WORKING_MODE ... DATA_PATH

Description: Analyse data with data mining tools.

positional arguments:
  DATA_PATH             Path directory of dataset files.
                        example: C:/Users/user/Desktop/data
                                        or
                                 ./resources

options:
  -h, --help            show this help message and exit

run modes:
  run modes define which mode the system should run in the current execution.
  example: classification_models.py all train.csv test.csv C:/Users/user/Desktop/data
                or
           classification_models.py preprocessing train.csv ./resources

  WORKING_MODE          run mode help
    preprocessing (p, pp)
                        in this mode only preprocessing is applied
    build_model (bm)    in this mode only model build is done
    run_model (rm, r)   in this mode the only operation done is running a model on test data
    all (ALL, a, A)     in this mode the whole program will be executed

Made by Israel Avihail.
For bugs & issues: bilbisli@gmail.com

////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end command help section
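
The run modes and their short aliases shown in the help text above can be produced with argparse sub-commands. The sketch below is an assumption about how such a parser might be laid out, not the project's actual code; only two of the four modes are included, and the names build_parser, mode, training_file, test_file and data_path are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="classification_models.py",
        description="Description: Analyse data with data mining tools.",
    )
    # Each run mode is a sub-command; aliases give the short forms.
    subparsers = parser.add_subparsers(
        title="run modes", dest="mode", metavar="WORKING_MODE", required=True
    )
    pre = subparsers.add_parser("preprocessing", aliases=["p", "pp"])
    pre.add_argument("training_file", metavar="TRAINING_FILE_NAME")

    run_all = subparsers.add_parser("all", aliases=["ALL", "a", "A"])
    run_all.add_argument("training_file", metavar="TRAINING_FILE_NAME")
    run_all.add_argument("test_file", metavar="TEST_FILE_NAME")

    # Declared after the sub-commands, so it is given last on the command line.
    parser.add_argument("data_path", metavar="DATA_PATH")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With dest="mode", the parsed value is the mode name exactly as typed, so "pp" and "preprocessing" both select the same sub-command but are stored differently.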

Run All mode help section - To see this text (the help section) via the program - run the command (in the open shell): classification_models.py a -h

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

usage: classification_models.py all [-h] [--fill FILL_BLANKS_TYPE] [--normalization] [--no-normalization]
                                    [--discretization DISCRETIZATION_TYPE] [--bins BIN_NUMBER [BIN_NUMBER ...]]
                                    [--algorithm ALGORITHM_TYPE] [--implementation IMPLEMENTATION_TYPE]
                                    [--result_name PREDICTION_RESULT_FILE_NAME]
                                    TRAINING_FILE_NAME TEST_FILE_NAME

positional arguments:
  TRAINING_FILE_NAME    Training dataset file name. example: train.csv
  TEST_FILE_NAME        Test dataset file name. example: test.csv

options:
  -h, --help            show this help message and exit
  --fill FILL_BLANKS_TYPE
                        Fill blank cells parameter. example: --fill all
  --normalization       Apply normalization. example: --normalization
  --no-normalization    Do not apply normalization. example: --no-normalization
  --discretization DISCRETIZATION_TYPE
                        Discretization type. example: --discretization equal_width
  --bins BIN_NUMBER [BIN_NUMBER ...]
                        Number of bins (intervals) the continuous data will be divided to. example: --bins=5
  --algorithm ALGORITHM_TYPE
                        Model algorithm type. example: --algorithm algorithm_type
                        options: naive_bayes, decision_tree, k_neighbors, k_means
  --implementation IMPLEMENTATION_TYPE
                        Apply built in/own implementations of classifying/discretization algorithms(if exists).
                        example: --implementation own
  --result_name PREDICTION_RESULT_FILE_NAME
                        Prediction result file name to save. example: --result_name test_predicition_DecisionTree_1.csv

////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end Run All Mode help section
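
A sketch of how a few of the options listed above could be declared with argparse. This is an illustration under assumptions, not the project's parser: the function name build_all_mode_parser and the destination names are hypothetical, and only some options are shown.

```python
import argparse


def build_all_mode_parser():
    parser = argparse.ArgumentParser(prog="classification_models.py all")
    parser.add_argument("--fill", metavar="FILL_BLANKS_TYPE")
    # Two flags writing to the same destination act as an on/off switch.
    parser.add_argument("--normalization", dest="normalization",
                        action="store_true")
    parser.add_argument("--no-normalization", dest="normalization",
                        action="store_false")
    # nargs="+" accepts one or more values and is rendered in the usage
    # line as BIN_NUMBER [BIN_NUMBER ...].
    parser.add_argument("--bins", metavar="BIN_NUMBER", type=int, nargs="+")
    parser.add_argument("--algorithm", metavar="ALGORITHM_TYPE")
    parser.add_argument("training_file", metavar="TRAINING_FILE_NAME")
    parser.add_argument("test_file", metavar="TEST_FILE_NAME")
    return parser
```

With an option declared this way, writing "--bins 5 train.csv test.csv" would make argparse try to read the file names as bin numbers. The "--bins=5" form, or placing the option after the file names, avoids this.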

Pre-Processing mode help section - To see this text (the help section) via the program - run the command (in the open shell): classification_models.py pp -h
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

usage: classification_models.py preprocessing [-h] [--fill FILL_BLANKS_TYPE] [--normalization] [--no-normalization] [--discretization DISCRETIZATION_TYPE]
                                              [--bins BIN_NUMBER [BIN_NUMBER ...]] [--implementation IMPLEMENTATION_TYPE] [--save_name FILE_NAME]
                                              TRAINING_FILE_NAME

positional arguments:
  TRAINING_FILE_NAME    Training dataset file name. example: train.csv

options:
  -h, --help            show this help message and exit
  --fill FILL_BLANKS_TYPE
                        Fill blank cells parameter. example: --fill all
  --normalization       Apply normalization. example: --normalization
  --no-normalization    Do not apply normalization. example: --no-normalization
  --discretization DISCRETIZATION_TYPE
                        Discretization type. example: --discretization equal_width
  --bins BIN_NUMBER [BIN_NUMBER ...]
                        Number of bins (intervals) the continuous data will be divided to. example: --bins 5
  --implementation IMPLEMENTATION_TYPE
                        Apply built in/own implementations of classifying/discretization algorithms(if exists).
                        example: --implementation own
  --save_name FILE_NAME
                        The name of the file to be saved after processing. example: name.csv

////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end preprocessing help section
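
To illustrate what "--discretization equal_width" together with "--bins 5" means, here is a minimal pure-Python sketch of equal-width binning. It is not the project's implementation (which works on pandas data); the function name equal_width_bins is hypothetical.

```python
def equal_width_bins(values, bin_count):
    """Map each value to the index of its equal-width bin (0 .. bin_count-1)."""
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    indices = []
    for value in values:
        index = int((value - low) / width)
        # The maximum value would land one past the end; keep it in the last bin.
        indices.append(min(index, bin_count - 1))
    return indices
```

Every bin covers a range of the same width, (max - min) / bin_count, regardless of how many values fall into it.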

Build Model mode help section - To see this text (the help section) via the program - run the command (in the open shell): classification_models.py bm -h
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

usage: classification_models.py build_model [-h] [--algorithm ALGORITHM_TYPE] [--implementation IMPLEMENTATION_TYPE]
                                            [--model_name MODEL_NAME]
                                            POST_PREPROCESSED_FILE_NAME

positional arguments:
  POST_PREPROCESSED_FILE_NAME
                        Training dataset file name (already undergone preprocessing). example: train_clean.csv

options:
  -h, --help            show this help message and exit
  --algorithm ALGORITHM_TYPE
                        Model algorithm type. example: --algorithm algorithm_type.
                        options: naive_bayes, decision_tree, k_neighbors, k_means
  --implementation IMPLEMENTATION_TYPE
                        Model algorithm type. example: --implementation built_in
  --model_name MODEL_NAME
                        The name of the model to be saved (as pickle). example: --model_name decision_tree_model_1

////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end Build Model help section

Run Model mode help section - To see this text (the help section) via the program - run the command (in the open shell): classification_models.py rm -h

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

positional arguments:
  TEST_FILE_NAME        Test dataset file name. example: test.csv

options:
  -h, --help            show this help message and exit
  --model_name TEST_FILE_NAME
                        Model file name that is already saved (as pickle). example: --model_name decision_tree_model_1
  --result_name PREDICTION_RESULT_FILE_NAME
                        Prediction result file name to save. example: --result_name test_predicition_DecisionTree_1.csv

////////////////////////////////////////////////////////////////////////////////////////////////////////////////// end Run Model mode help section


##################################################################################################################### end How to Run