Data wrangling
This module is intended for transforming data into a format that can be read and processed by the prediction models of the epftoolbox library. At the moment, the module is limited to scaling operations.
The module is composed of two components:
- The DataScaler class.
- The scaling function.
The class DataScaler is the main block for performing scaling operations. It is based on the syntax of the scalers defined in the sklearn.preprocessing module of the scikit-learn library, and it implements several of the scaling algorithms that are standard in electricity price forecasting.
Besides the class, the module also provides a function scaling to scale a list of datasets while estimating the scaler using only one of them. This function is useful when scaling the training, validation, and test datasets: to obtain a realistic evaluation, one would ideally estimate the scaler using the training dataset and simply transform the other two.
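As an illustration of this pattern, the sketch below estimates [0, 1] scaling parameters on a training array only and applies them to validation and test arrays. It is written in plain NumPy rather than with the epftoolbox API, and the arrays and helper function are hypothetical.

```python
import numpy as np

# Hypothetical training/validation/test splits (random data for illustration)
rng = np.random.default_rng(0)
Xtrain = rng.uniform(10, 50, size=(100, 3))
Xval = rng.uniform(10, 50, size=(20, 3))
Xtest = rng.uniform(10, 50, size=(20, 3))

# Estimate the scaling parameters on the training data only
xmin = Xtrain.min(axis=0)
xmax = Xtrain.max(axis=0)

def to_unit_interval(X):
    """Scale each column to [0, 1] using the training min/max."""
    return (X - xmin) / (xmax - xmin)

# Transform all three splits with the training-based parameters
Xtrain_scaled = to_unit_interval(Xtrain)
Xval_scaled = to_unit_interval(Xval)
Xtest_scaled = to_unit_interval(Xtest)
```

Because the validation and test values may fall outside the training range, their scaled versions can slightly exceed [0, 1]; that is expected and is what keeps the evaluation realistic.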
- class epftoolbox.data.DataScaler(normalize)
Class to perform data scaling operations.
The scaling technique is defined by the normalize parameter, which takes one of the following values:
- 'Norm' for normalizing the data to the interval [0, 1].
- 'Norm1' for normalizing the data to the interval [-1, 1].
- 'Std' for standardizing the data to have zero mean and unit variance.
- 'Median' for normalizing the data based on the median, as defined here.
- 'Invariant' for scaling the data based on the asinh transformation (a variance-stabilizing transformation), as defined here.
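The five options can be sketched in plain NumPy as follows. The 'Norm', 'Norm1', and 'Std' formulas are standard; for 'Median' and 'Invariant' the exact spread measure used by the library is not spelled out here, so the median absolute deviation (MAD) below is an assumption for illustration only.

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# 'Norm': rescale each column to [0, 1]
norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 'Norm1': rescale each column to [-1, 1]
norm1 = 2 * norm - 1

# 'Std': zero mean and unit standard deviation per column
std = (X - X.mean(axis=0)) / X.std(axis=0)

# 'Median'-style: center on the median and divide by a robust spread
# (using the MAD here is an assumption, not the library's exact formula)
med = np.median(X, axis=0)
mad = np.median(np.abs(X - med), axis=0)
median_scaled = (X - med) / mad

# 'Invariant'-style: asinh applied to the median-scaled data,
# which stabilizes the variance of heavy-tailed price series
invariant = np.arcsinh(median_scaled)
```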
This class follows the same syntax as the scalers defined in the sklearn.preprocessing module of the scikit-learn library.
Parameters: normalize (str) – Type of scaling to be performed. Possible values are 'Norm', 'Norm1', 'Std', 'Median', or 'Invariant'.
Example
>>> from epftoolbox.data import read_data
>>> from epftoolbox.data import DataScaler
>>> df_train, df_test = read_data(path='.', dataset='PJM', begin_test_date='01-01-2016',
...                               end_test_date='01-02-2016')
Test datasets: 2016-01-01 00:00:00 - 2016-02-01 23:00:00
>>> df_train.tail()
                         Price  Exogenous 1  Exogenous 2
Date
2015-12-31 19:00:00  29.513832     100700.0      13015.0
2015-12-31 20:00:00  28.440134      99832.0      12858.0
2015-12-31 21:00:00  26.701700      97033.0      12626.0
2015-12-31 22:00:00  23.262253      92022.0      12176.0
2015-12-31 23:00:00  22.262431      86295.0      11434.0
>>> df_test.head()
                         Price  Exogenous 1  Exogenous 2
Date
2016-01-01 00:00:00  20.341321      76840.0      10406.0
2016-01-01 01:00:00  19.462741      74819.0      10075.0
2016-01-01 02:00:00  17.172706      73182.0       9795.0
2016-01-01 03:00:00  16.963876      72300.0       9632.0
2016-01-01 04:00:00  17.403722      72535.0       9566.0
>>> Xtrain = df_train.values
>>> Xtest = df_test.values
>>> scaler = DataScaler('Norm')
>>> Xtrain_scaled = scaler.fit_transform(Xtrain)
>>> Xtest_scaled = scaler.transform(Xtest)
>>> Xtrain_inverse = scaler.inverse_transform(Xtrain_scaled)
>>> Xtest_inverse = scaler.inverse_transform(Xtest_scaled)
>>> Xtrain[:3, :]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtrain_scaled[:3, :]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> Xtrain_inverse[:3, :]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
Methods

fit_transform(dataset) – Estimates a scaler object using the data in dataset and scales the data in dataset.
inverse_transform(dataset) – Inverse-scales the data in dataset.
transform(dataset) – Scales the data in dataset.
- fit_transform(dataset)
Method that estimates a scaler object using the data in dataset and scales the data in dataset.
Parameters: dataset (numpy.array) – Dataset used to estimate the scaler
Returns: Scaled data
Return type: numpy.array
- inverse_transform(dataset)
Method that inverse-scales the data in dataset. It must be called after the fit_transform method has been used to estimate the scaler.
Parameters: dataset (numpy.array) – Dataset to be inverse-scaled
Returns: Inverse-scaled data
Return type: numpy.array
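For the 'Norm' case, the round trip can be sketched in plain NumPy: the inverse transform simply undoes the [0, 1] scaling using the parameters stored during fitting. The array here is hypothetical and the code mirrors, rather than calls, the DataScaler API.

```python
import numpy as np

X = np.array([[10.0, 1000.0],
              [20.0, 2000.0],
              [30.0, 3000.0]])

# Parameters that fit_transform would store for 'Norm' scaling
xmin, xmax = X.min(axis=0), X.max(axis=0)

# Forward transform: scale each column to [0, 1]
X_scaled = (X - xmin) / (xmax - xmin)

# Inverse transform: recover the original values from the stored parameters
X_recovered = X_scaled * (xmax - xmin) + xmin

assert np.allclose(X_recovered, X)
```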
- transform(dataset)
Method that scales the data in dataset. It must be called after the fit_transform method has been used to estimate the scaler.
Parameters: dataset (numpy.array) – Dataset to be scaled
Returns: Scaled data
Return type: numpy.array
- epftoolbox.data.scaling(datasets, normalize)
Function that scales data and returns the scaled data and the DataScaler used for scaling.
It rescales all the datasets contained in the list datasets using the first dataset as reference. For example, if datasets=[X_1, X_2, X_3], the function estimates a DataScaler object using the array X_1, and transforms X_1, X_2, and X_3 using that DataScaler object.
Each dataset must be a numpy.array, and all datasets must have the same column dimension. For example, if datasets=[X_1, X_2, X_3], X_1 must be a numpy.array of size [n_1, m], X_2 of size [n_2, m], and X_3 of size [n_3, m], where n_1, n_2, and n_3 can be different.
The scaling technique is defined by the normalize parameter, which takes one of the following values:
- 'Norm' for normalizing the data to the interval [0, 1].
- 'Norm1' for normalizing the data to the interval [-1, 1].
- 'Std' for standardizing the data to have zero mean and unit variance.
- 'Median' for normalizing the data based on the median, as defined here.
- 'Invariant' for scaling the data based on the asinh transformation (a variance-stabilizing transformation), as defined here.
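A minimal NumPy sketch of this behaviour for the 'Norm' case, assuming min-max parameters estimated on the first array only (the helper name is hypothetical, and the real function additionally returns a DataScaler object):

```python
import numpy as np

def scale_with_reference(datasets):
    """Estimate [0, 1] scaling parameters on the first array and
    apply the same parameters to every array in the list."""
    ref = datasets[0]
    lo, hi = ref.min(axis=0), ref.max(axis=0)
    return [(X - lo) / (hi - lo) for X in datasets]

X1 = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
X2 = np.array([[1.0, 20.0], [3.0, 40.0]])  # fewer rows, same columns
scaled = scale_with_reference([X1, X2])
# scaled[0] spans [0, 1] exactly; scaled[1] is mapped with X1's parameters
```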
The function returns the scaled data together with a DataScaler object representing the scaling. This object can be used to scale other datasets using the same rules or to inverse-transform the data.
Parameters:
- datasets (list) – List of numpy.array objects to be scaled.
- normalize (str) – Type of scaling to be performed. Possible values are 'Norm', 'Norm1', 'Std', 'Median', or 'Invariant'.
Returns: List of scaled datasets and the DataScaler object used for scaling. Each dataset in the list is a numpy.array.
Return type: list, DataScaler
Example
>>> from epftoolbox.data import read_data
>>> from epftoolbox.data import scaling
>>> df_train, df_test = read_data(path='.', dataset='PJM', begin_test_date='01-01-2016',
...                               end_test_date='01-02-2016')
Test datasets: 2016-01-01 00:00:00 - 2016-02-01 23:00:00
>>> df_train.tail()
                         Price  Exogenous 1  Exogenous 2
Date
2015-12-31 19:00:00  29.513832     100700.0      13015.0
2015-12-31 20:00:00  28.440134      99832.0      12858.0
2015-12-31 21:00:00  26.701700      97033.0      12626.0
2015-12-31 22:00:00  23.262253      92022.0      12176.0
2015-12-31 23:00:00  22.262431      86295.0      11434.0
>>> df_test.head()
                         Price  Exogenous 1  Exogenous 2
Date
2016-01-01 00:00:00  20.341321      76840.0      10406.0
2016-01-01 01:00:00  19.462741      74819.0      10075.0
2016-01-01 02:00:00  17.172706      73182.0       9795.0
2016-01-01 03:00:00  16.963876      72300.0       9632.0
2016-01-01 04:00:00  17.403722      72535.0       9566.0
>>> Xtrain = df_train.values
>>> Xtest = df_test.values
>>> [Xtrain_scaled, Xtest_scaled], scaler = scaling([Xtrain, Xtest], 'Norm')
>>> Xtrain[:3, :]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtrain_scaled[:3, :]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> type(scaler)
<class 'epftoolbox.data._wrangling.DataScaler'>