Data wrangling

This module transforms data into a format that the prediction models of the epftoolbox library can read and process. At the moment, the module is limited to scaling operations.

The module is composed of two components:

The class DataScaler is the main block for performing scaling operations. It follows the syntax of the scalers defined in the sklearn.preprocessing module of the scikit-learn library and implements several scaling algorithms that are standard in electricity price forecasting.

Besides the class, the module provides a function scaling that scales a list of datasets while estimating the scaler using only one of them. This function is useful when scaling the training, validation, and test datasets: for a realistic evaluation, one would estimate the scaler using the training dataset and simply transform the other two.
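The intended pattern can be sketched in plain Python: estimate the scaling parameters from the first dataset in the list, then apply them unchanged to every dataset. The sketch below is a minimal pure-Python stand-in for illustration only (the helper names are hypothetical, and only 'Norm'-style min-max scaling is shown), not the library's implementation:

```python
def fit_minmax(reference):
    # Column-wise (min, max) pairs estimated from the reference dataset only.
    columns = list(zip(*reference))
    return [(min(c), max(c)) for c in columns]

def apply_minmax(dataset, params):
    # Scale every row with the previously estimated parameters.
    return [[(v - lo) / (hi - lo) for v, (lo, hi) in zip(row, params)]
            for row in dataset]

def scaling_sketch(datasets):
    # Mimics the semantics described above: fit on datasets[0],
    # then transform all datasets with the same parameters.
    params = fit_minmax(datasets[0])
    return [apply_minmax(d, params) for d in datasets], params

train = [[0.0, 10.0], [10.0, 20.0]]
test = [[5.0, 30.0]]
(scaled_train, scaled_test), params = scaling_sketch([train, test])
```

Note that values in the second dataset can fall outside [0, 1] (here, 30.0 maps to 2.0), because the parameters are estimated from the first dataset only.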


class epftoolbox.data.DataScaler(normalize)[source]

Class to perform data scaling operations

The scaling technique is defined by the normalize parameter, which takes one of the following values:

  • 'Norm' for normalizing the data to the interval [0, 1].
  • 'Norm1' for normalizing the data to the interval [-1, 1].
  • 'Std' for standardizing the data to follow a normal distribution.
  • 'Median' for normalizing the data based on the median, as defined here.
  • 'Invariant' for scaling the data based on the asinh transformation (a variance-stabilizing transformation), as defined here.
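For intuition, the per-column formulas behind these options can be sketched in plain Python. This is an illustration only, not the library's implementation; in particular, the exact median-based and invariant formulas are assumptions here (median/MAD centering, with asinh applied on top for 'Invariant'):

```python
import math
import statistics

def scale_column(x, normalize):
    # Illustrative per-column formulas; not the epftoolbox implementation.
    if normalize == 'Norm':      # min-max scaling to [0, 1]
        lo, hi = min(x), max(x)
        return [(v - lo) / (hi - lo) for v in x]
    if normalize == 'Norm1':     # min-max scaling to [-1, 1]
        lo, hi = min(x), max(x)
        return [2 * (v - lo) / (hi - lo) - 1 for v in x]
    if normalize == 'Std':       # zero mean, unit standard deviation
        mu, sd = statistics.mean(x), statistics.pstdev(x)
        return [(v - mu) / sd for v in x]
    if normalize == 'Median':    # center on the median, divide by MAD (assumed)
        med = statistics.median(x)
        mad = statistics.median([abs(v - med) for v in x])
        return [(v - med) / mad for v in x]
    if normalize == 'Invariant': # asinh on median/MAD-scaled data (assumed)
        med = statistics.median(x)
        mad = statistics.median([abs(v - med) for v in x])
        return [math.asinh((v - med) / mad) for v in x]
    raise ValueError(f"unknown normalize value: {normalize}")
```

For example, scale_column([0, 5, 10], 'Norm') returns [0.0, 0.5, 1.0].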

This class follows the same syntax as the scalers defined in the sklearn.preprocessing module of the scikit-learn library.

Parameters: normalize (str) – Type of scaling to be performed. Possible values are 'Norm', 'Norm1', 'Std', 'Median', or 'Invariant'.

Example

>>> from epftoolbox.data import read_data
>>> from epftoolbox.data import DataScaler
>>> df_train, df_test = read_data(path='.', dataset='PJM', begin_test_date='01-01-2016', end_test_date='01-02-2016')
Test datasets: 2016-01-01 00:00:00 - 2016-02-01 23:00:00
>>> df_train.tail()
                         Price  Exogenous 1  Exogenous 2
Date
2015-12-31 19:00:00  29.513832     100700.0      13015.0
2015-12-31 20:00:00  28.440134      99832.0      12858.0
2015-12-31 21:00:00  26.701700      97033.0      12626.0
2015-12-31 22:00:00  23.262253      92022.0      12176.0
2015-12-31 23:00:00  22.262431      86295.0      11434.0
>>> df_test.head()
                         Price  Exogenous 1  Exogenous 2
Date
2016-01-01 00:00:00  20.341321      76840.0      10406.0
2016-01-01 01:00:00  19.462741      74819.0      10075.0
2016-01-01 02:00:00  17.172706      73182.0       9795.0
2016-01-01 03:00:00  16.963876      72300.0       9632.0
2016-01-01 04:00:00  17.403722      72535.0       9566.0
>>> Xtrain = df_train.values
>>> Xtest = df_test.values
>>> scaler = DataScaler('Norm')
>>> Xtrain_scaled = scaler.fit_transform(Xtrain)
>>> Xtest_scaled = scaler.transform(Xtest)
>>> Xtrain_inverse = scaler.inverse_transform(Xtrain_scaled)
>>> Xtest_inverse = scaler.inverse_transform(Xtest_scaled)
>>> Xtrain[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtrain_scaled[:3,:]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> Xtrain_inverse[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtest[:3,:]
array([[2.0341321e+01, 7.6840000e+04, 1.0406000e+04],
       [1.9462741e+01, 7.4819000e+04, 1.0075000e+04],
       [1.7172706e+01, 7.3182000e+04, 9.7950000e+03]])
>>> Xtest_inverse[:3,:]
array([[2.0341321e+01, 7.6840000e+04, 1.0406000e+04],
       [1.9462741e+01, 7.4819000e+04, 1.0075000e+04],
       [1.7172706e+01, 7.3182000e+04, 9.7950000e+03]])

Methods

fit_transform(dataset) Method that estimates a scaler object using the data in dataset and scales that data
inverse_transform(dataset) Method that inverse-scales the data in dataset
transform(dataset) Method that scales the data in dataset

fit_transform(dataset)[source]

Method that estimates a scaler object using the data in dataset and scales that data.

Parameters: dataset (numpy.array) – Dataset used to estimate the scaler and to be scaled
Returns: Scaled data
Return type: numpy.array
inverse_transform(dataset)[source]

Method that inverse-scales the data in dataset.

It must be called after calling the fit_transform method to estimate the scaler.

Parameters: dataset (numpy.array) – Dataset to be inverse-scaled
Returns: Inverse-scaled data
Return type: numpy.array
transform(dataset)[source]

Method that scales the data in dataset.

It must be called after calling the fit_transform method to estimate the scaler.

Parameters: dataset (numpy.array) – Dataset to be scaled

Returns:Scaled data
Return type:numpy.array

epftoolbox.data.scaling(datasets, normalize)[source]

Function that scales data and returns the scaled data and the DataScaler used for scaling.

It rescales all the datasets contained in the list datasets using the first dataset as reference. For example, if datasets=[X_1, X_2, X_3], the function estimates a DataScaler object using the array X_1 and transforms X_1, X_2, and X_3 with it.

Each dataset must be a numpy.array, and all datasets must have the same column dimension. For example, if datasets=[X_1, X_2, X_3], X_1 must be a numpy.array of size [n_1, m], X_2 of size [n_2, m], and X_3 of size [n_3, m], where n_1, n_2, and n_3 can be different.
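This shape requirement can be validated up front. The helper below is a hypothetical sketch (not part of epftoolbox), written for plain nested lists so it stays self-contained:

```python
def check_same_columns(datasets):
    # Every dataset must have the same number of columns m;
    # the number of rows may differ between datasets.
    m = len(datasets[0][0])
    for i, d in enumerate(datasets):
        if any(len(row) != m for row in d):
            raise ValueError(f"dataset {i} does not have {m} columns")
    return m

check_same_columns([[[1, 2]], [[3, 4], [5, 6]]])  # OK: both have m = 2 columns
```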

The scaling technique is defined by the normalize parameter, which takes one of the following values:

  • 'Norm' for normalizing the data to the interval [0, 1].
  • 'Norm1' for normalizing the data to the interval [-1, 1].
  • 'Std' for standardizing the data to follow a normal distribution.
  • 'Median' for normalizing the data based on the median, as defined here.
  • 'Invariant' for scaling the data based on the asinh transformation (a variance-stabilizing transformation), as defined here.

The function returns the scaled data together with a DataScaler object representing the scaling. This object can be used to scale other datasets using the same rules or to inverse-transform the data.

Parameters:
  • datasets (list) – List of numpy.array objects to be scaled.
  • normalize (str) – Type of scaling to be performed. Possible values are 'Norm', 'Norm1', 'Std', 'Median', or 'Invariant'.
Returns: List of scaled datasets and the DataScaler object used for scaling. Each dataset in the list is a numpy.array.
Return type: list, DataScaler

Example

>>> from epftoolbox.data import read_data
>>> from epftoolbox.data import scaling
>>> df_train, df_test = read_data(path='.', dataset='PJM', begin_test_date='01-01-2016', end_test_date='01-02-2016')
Test datasets: 2016-01-01 00:00:00 - 2016-02-01 23:00:00
>>> df_train.tail()
                         Price  Exogenous 1  Exogenous 2
Date
2015-12-31 19:00:00  29.513832     100700.0      13015.0
2015-12-31 20:00:00  28.440134      99832.0      12858.0
2015-12-31 21:00:00  26.701700      97033.0      12626.0
2015-12-31 22:00:00  23.262253      92022.0      12176.0
2015-12-31 23:00:00  22.262431      86295.0      11434.0
>>> df_test.head()
                         Price  Exogenous 1  Exogenous 2
Date
2016-01-01 00:00:00  20.341321      76840.0      10406.0
2016-01-01 01:00:00  19.462741      74819.0      10075.0
2016-01-01 02:00:00  17.172706      73182.0       9795.0
2016-01-01 03:00:00  16.963876      72300.0       9632.0
2016-01-01 04:00:00  17.403722      72535.0       9566.0
>>> Xtrain = df_train.values
>>> Xtest = df_test.values
>>> [Xtrain_scaled, Xtest_scaled], scaler = scaling([Xtrain, Xtest], 'Norm')
>>> Xtrain[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtrain_scaled[:3,:]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> Xtest[:3,:]
array([[2.0341321e+01, 7.6840000e+04, 1.0406000e+04],
       [1.9462741e+01, 7.4819000e+04, 1.0075000e+04],
       [1.7172706e+01, 7.3182000e+04, 9.7950000e+03]])
>>> type(scaler)
<class 'epftoolbox.data._wrangling.DataScaler'>