docs/user_guide/discretisation/ArbitraryDiscretiser.rst

ArbitraryDiscretiser
====================

The :class:`ArbitraryDiscretiser()` sorts the variable values into contiguous intervals
whose limits are arbitrarily defined by the user.

The :class:`ArbitraryDiscretiser()` works only with numerical variables. The discretiser
will check that the variables entered by the user are present in the train set and cast
as numerical.

.. note::
   When setting up the discretiser, you must provide a dictionary with the variable
   names as keys and a list with the limits of the intervals as values.
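
In other words, the transformation is conceptually the same as `pandas.cut` with
user-defined bin edges. A minimal sketch of the idea, assuming made-up limits and
values for a hypothetical `MedInc` column (this is an illustration, not the
transformer's internals):

```python
import numpy as np
import pandas as pd

# Hypothetical interval limits, as the user would pass them in the
# binning dictionary, e.g. {'MedInc': [0, 2, 4, 6, np.inf]}.
limits = [0, 2, 4, 6, np.inf]

# Made-up values for illustration.
medinc = pd.Series([1.5, 3.2, 5.0, 8.7], name="MedInc")

# Sort the values into the user-defined contiguous intervals.
binned = pd.cut(medinc, bins=limits)
```

Each value is replaced by the interval that contains it; values above the last
finite limit fall into the open-ended top interval.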

Python implementation
---------------------

Let's take a look at how this transformer works. We'll use the California housing
dataset that comes with Scikit-learn.

Let's load the dataset:

.. code:: python

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_california_housing

    from feature_engine.discretisation import ArbitraryDiscretiser

    X, y = fetch_california_housing(return_X_y=True, as_frame=True)

Let's plot a histogram of a continuous variable.

.. code:: python

    X['MedInc'].hist(bins=20)
    plt.xlabel('MedInc')
    plt.ylabel('Number of obs')
If we return the interval values as integers, the discretiser has the option to return
the transformed variable as integer or as object. Why would we want the transformed
variables as object?

Categorical encoders in Feature-engine are designed to work with variables of type
object by default. Thus, if you wish to encode the returned bins further, say to try and
obtain monotonic relationships between the variable and the target, you can do so
seamlessly by setting `return_object` to True. You can find an example of how to use
this functionality in the notebooks linked under *Additional resources* below.
Additional resources
--------------------

Check also:

- `Jupyter notebook <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/ArbitraryDiscretiser.ipynb>`_
- `Jupyter notebook - Discretiser plus Mean Encoding <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/ArbitraryDiscretiser_plus_MeanEncoder.ipynb>`_

For more details about this and other feature engineering methods, check out these resources:

- `Feature Engineering for Machine Learning <https://www.trainindata.com/p/feature-engineering-for-machine-learning>`_, online course.
- `Feature Engineering for Time Series Forecasting <https://www.trainindata.com/p/feature-engineering-for-forecasting>`_, online course.
- `Python Feature Engineering Cookbook <https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587>`_, book.

Both our book and courses are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting `Sole <https://linkedin.com/in/soledad-galli>`_,
the main developer of Feature-engine.
docs/user_guide/discretisation/DecisionTreeDiscretiser.rst
DecisionTreeDiscretiser
=======================

Discretisation consists of transforming continuous variables into discrete features by creating
a set of contiguous intervals, or bins, that span the range of the variable values.

Discretisation is a common data preprocessing step in many data science projects, as it simplifies
continuous attributes and has the potential to improve model performance or speed up model training.

Decision tree discretisation
----------------------------

Decision trees make decisions based on discrete partitions over continuous features. During
training, a decision tree evaluates all possible feature values to find the best cut-point, that is,
the feature value at which the split maximises the information gain, or in other words, reduces the
impurity. It repeats the procedure at each node until it allocates all samples to certain leaf
nodes or end nodes. Hence, classification and regression trees can naturally find the optimal limits
of the intervals to maximise class coherence.

Discretisation with decision trees consists of using a decision tree algorithm to identify the optimal
partitions for each continuous variable. After finding the optimal partitions, we sort the variable's
values into those intervals.

Discretisation with decision trees is a supervised discretisation method, in that the interval
limits are found based on class or target coherence. In simpler words, we need the target variable
to train the decision trees.
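
The idea can be sketched with a scikit-learn tree directly (synthetic data; an
illustration of the principle, not feature-engine's exact implementation): fit a
shallow tree on one feature and the target, then read the split points off the
fitted tree and use them as interval limits.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target steps up around x = 3 and x = 7,
# so those are the cut-points a tree should recover.
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 500)
y = np.where(x > 7, 7.0, np.where(x > 3, 2.0, 0.0)) + rng.normal(0, 0.1, 500)

# A depth-2 tree yields at most 3 cut-points, i.e. at most 4 intervals.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(x.reshape(-1, 1), y)

# Internal nodes store a feature index >= 0; leaves are marked with -2.
cut_points = sorted(
    t for t, f in zip(tree.tree_.threshold, tree.tree_.feature) if f >= 0
)

# Use the learned cut-points as interval limits.
limits = [-np.inf] + cut_points + [np.inf]
binned = pd.cut(pd.Series(x), bins=limits)
```

Note how the tree's depth caps the number of intervals, which is why the tree
parameters need tuning, as discussed under the limitations above.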

Limitations
-----------

- We need to tune some of the decision tree parameters to obtain the optimal number of intervals.


Decision tree discretiser
-------------------------

The :class:`DecisionTreeDiscretiser()` applies discretisation based on the interval limits
found by decision trees: it uses decision trees to find the optimal interval limits, and
then sorts the variable into those intervals.

The transformed variable can either have the limits of the intervals as values, an ordinal number
representing the interval into which the value was sorted, or alternatively, the prediction of the
decision tree. In any case, the number of values of the variable will be finite.
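
As a rough sketch in plain pandas (the limits below are made up; feature-engine
derives them from a fitted tree), the first two output options look like this,
with the per-bin mean of a toy target standing in for the tree's prediction:

```python
import numpy as np
import pandas as pd

# Hypothetical limits, standing in for those a decision tree would find.
limits = [-np.inf, 3.0, 7.0, np.inf]
s = pd.Series([1.0, 5.0, 9.0])
y = pd.Series([10.0, 20.0, 30.0])  # toy target, for illustration only

as_intervals = pd.cut(s, bins=limits)              # limits of the intervals as values
as_ordinal = pd.cut(s, bins=limits, labels=False)  # ordinal number of the interval

# The third option replaces each value with the tree's prediction for its
# interval; the per-bin target mean approximates that here.
as_prediction = y.groupby(as_ordinal).transform("mean")
```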

In theory, decision tree discretisation creates discrete variables with a monotonic relationship
with the target, and hence, the transformed features would be more suitable to train linear models,
like linear or logistic regression.

Original idea
-------------

The method of decision tree discretisation is based on the winning solution of the KDD 2009 competition:

`Niculescu-Mizil, et al. "Winning the KDD Cup Orange Challenge with Ensemble
Selection". JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009
on the performance of linear models.
Code examples
-------------

In the following sections, we will do decision tree discretisation to showcase the functionality of
the :class:`DecisionTreeDiscretiser()`. We will discretise 2 numerical variables of the Ames house
prices dataset using decision trees.

First, we will transform the variables using the predictions of the decision trees, next, we will
return the interval limits, and finally, we will return the bin order.

Discretisation with the predictions of the decision tree
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First we load the data and separate it into a training set and a test set:
In the following output we see the predictor variables of the house prices dataset.

We set up the decision tree discretiser to find the optimal intervals using decision trees.

The :class:`DecisionTreeDiscretiser()` will optimise the depth of the decision tree classifier
or regressor by default, using cross-validation. That's why we need to select the appropriate
metric for the optimisation. In this example, we are using decision tree regression, so we select
the mean squared error metric.

We specify in the `bin_output` that we want to replace the continuous attribute values with the
The `binner_dict_` stores the details of each decision tree.
scoring='neg_mean_squared_error')}


With decision tree discretisation, each bin, that is, each prediction value in this case, does not
necessarily contain the same number of observations. Let's check that out with a visualisation:

.. code:: python

Expand All @@ -239,7 +239,7 @@ Rounding the prediction value
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes, the predictions can have many digits after the decimal point, which makes
visualisation and interpretation a bit uncomfortable. Fortunately, we can round those
values through the `precision` parameter:

.. code:: python
In this example, we are predicting house prices, which is a continuous target. The procedure for
classification models is identical; we just need to set the parameter `regression` to False.
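
The classification case can be sketched the same way with a scikit-learn tree
(synthetic data; not the transformer's internals):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Binary target that flips at x = 5: a depth-1 tree should place
# its single cut-point near that value.
rng = np.random.RandomState(42)
x = rng.uniform(0, 10, 300).reshape(-1, 1)
y = (x.ravel() > 5).astype(int)

clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(x, y)

cut_point = clf.tree_.threshold[0]  # the root node's split value
```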

Discretization with interval limits
Discretisation with interval limits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section, instead of replacing the original variable values with the predictions of the
decision tree, we will replace them with the limits of the intervals found by the decision trees:
4576.0,
inf]}

The :class:`DecisionTreeDiscretiser()` will use these limits with `pandas.cut` to discretize the
The :class:`DecisionTreeDiscretiser()` will use these limits with `pandas.cut` to discretise the
continuous variable values during transform:

.. code:: python
In the following output we see the interval limits into which the values of the variable were sorted.

To train machine learning algorithms we would follow that up with any categorical data encoding method.

Discretization with ordinal numbers
Discretisation with ordinal numbers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the last part of this guide, we will replace the variable values with the number of the bin into
The `binner_dict_` will also contain the limits of the intervals:
inf]}

When we apply transform, :class:`DecisionTreeDiscretiser()` will use these limits with `pandas.cut` to
discretise the continuous variable:

.. code:: python

In the following output we see the number of the bin into which the values of the variable were sorted.
Additional considerations
-------------------------

Decision tree discretisation uses scikit-learn's DecisionTreeRegressor or DecisionTreeClassifier under
the hood to find the optimal interval limits. These models do not support missing data. Hence, we need
to replace missing values with numbers before proceeding with the discretisation.
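
For instance, a quick median imputation beforehand (column name and values are
made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy feature with a missing value.
X = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0, 9600.0]})

# Replace missing values with the column median before discretising.
X_imputed = X.fillna(X.median())
```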

Tutorials, books and courses
----------------------------

For more details on how to use this transformer, check also:

- `Jupyter notebook <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/DecisionTreeDiscretiser.ipynb>`_
- `tree_pipe in cell 21 of this Kaggle kernel <https://www.kaggle.com/solegalli/feature-engineering-and-model-stacking>`_

For tutorials about this and other discretisation methods and feature engineering techniques, check out our courses and book:

- `Feature Engineering for Machine Learning <https://www.trainindata.com/p/feature-engineering-for-machine-learning>`_, online course.
- `Feature Engineering for Time Series Forecasting <https://www.trainindata.com/p/feature-engineering-for-forecasting>`_, online course.
- `Python Feature Engineering Cookbook <https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587>`_, book.

Both our book and courses are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting `Sole <https://linkedin.com/in/soledad-galli>`_,
the main developer of Feature-engine.