Data wrangling
This module is intended for transforming data into a format that can be read and processed by the prediction models of the epftoolbox library. At the moment, the module is limited to scaling operations.
The module is composed of two components:
- The DataScaler class.
- The scaling function.
The class DataScaler is the main building block for performing scaling operations. The class is based on the syntax of the scalers defined in the sklearn.preprocessing module of the scikit-learn library. It implements some of the standard scaling algorithms used in electricity price forecasting.
Besides the class, the module also provides a function scaling that scales a list of datasets while estimating the scaler from only one of them. This function is useful when scaling the training, validation, and test datasets: for a realistic evaluation, one would estimate the scaler using the training dataset and simply transform the other two.
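The idea of estimating the scaler on the training data only can be sketched with plain NumPy. The snippet below is an illustration under assumptions, not the library's implementation: the arrays are synthetic, and standardization (zero mean, unit variance) is used as the example scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three splits (same number of columns).
X_train = rng.normal(loc=30.0, scale=5.0, size=(200, 3))
X_val = rng.normal(loc=32.0, scale=5.0, size=(50, 3))
X_test = rng.normal(loc=35.0, scale=5.0, size=(50, 3))

# Estimate the scaling statistics on the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same statistics to every split.
X_train_s = (X_train - mean) / std
X_val_s = (X_val - mean) / std
X_test_s = (X_test - mean) / std
```

Only the training split ends up with exactly zero mean and unit variance. The validation and test splits are expressed in the training split's units, which is what a model deployed on unseen data would experience.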
class epftoolbox.data.DataScaler(normalize)
Class to perform data scaling operations.
The scaling technique is defined by the normalize parameter, which takes one of the following values:
- 'Norm' for normalizing the data to the interval [0, 1].
- 'Norm1' for normalizing the data to the interval [-1, 1].
- 'Std' for standardizing the data to zero mean and unit variance.
- 'Median' for normalizing the data based on the median, as defined here.
- 'Invariant' for scaling the data based on the asinh transformation (a variance-stabilizing transformation), as defined here.
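To give a feel for the median-based and asinh options, here is a minimal NumPy sketch. It assumes that the data are centered by the column median, divided by the median absolute deviation (MAD), and then passed through asinh. The exact definition used by the library (for instance, whether the MAD is rescaled by a constant) is not stated in this section, so the helper names and formulas below are hypothetical.

```python
import numpy as np

def fit_median_mad(X):
    """Column-wise median and median absolute deviation (assumed definition)."""
    median = np.median(X, axis=0)
    mad = np.median(np.abs(X - median), axis=0)
    return median, mad

def asinh_scale(X, median, mad):
    """Median/MAD normalization followed by the asinh transformation."""
    return np.arcsinh((X - median) / mad)

def asinh_unscale(Z, median, mad):
    """Inverse of asinh_scale: sinh, then undo the median/MAD normalization."""
    return np.sinh(Z) * mad + median

# Toy data: the first column contains a price spike (400.0).
X = np.array([[20.0, 70000.0],
              [25.0, 80000.0],
              [30.0, 85000.0],
              [400.0, 90000.0]])

median, mad = fit_median_mad(X)
Z = asinh_scale(X, median, mad)
X_back = asinh_unscale(Z, median, mad)
```

Because asinh grows logarithmically for large arguments, the spike is compressed to a moderate value in Z while values near the median are left almost linear, and the transformation remains exactly invertible.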
This class follows the same syntax as the scalers defined in the sklearn.preprocessing module of the scikit-learn library.
Parameters: normalize (str) – Type of scaling to be performed. Possible values are 'Norm', 'Norm1', 'Std', 'Median', or 'Invariant'.
Example
>>> from epftoolbox.data import read_data
>>> from epftoolbox.data import DataScaler
>>> df_train, df_test = read_data(path='.', dataset='PJM', begin_test_date='01-01-2016',
...                               end_test_date='01-02-2016')
Test datasets: 2016-01-01 00:00:00 - 2016-02-01 23:00:00
>>> df_train.tail()
                         Price  Exogenous 1  Exogenous 2
Date
2015-12-31 19:00:00  29.513832     100700.0      13015.0
2015-12-31 20:00:00  28.440134      99832.0      12858.0
2015-12-31 21:00:00  26.701700      97033.0      12626.0
2015-12-31 22:00:00  23.262253      92022.0      12176.0
2015-12-31 23:00:00  22.262431      86295.0      11434.0
>>> df_test.head()
                         Price  Exogenous 1  Exogenous 2
Date
2016-01-01 00:00:00  20.341321      76840.0      10406.0
2016-01-01 01:00:00  19.462741      74819.0      10075.0
2016-01-01 02:00:00  17.172706      73182.0       9795.0
2016-01-01 03:00:00  16.963876      72300.0       9632.0
2016-01-01 04:00:00  17.403722      72535.0       9566.0
>>> Xtrain = df_train.values
>>> # Note: df_train is reused here, so the Xtest outputs below match the Xtrain ones;
>>> # use df_test.values for a held-out test set.
>>> Xtest = df_train.values
>>> scaler = DataScaler('Norm')
>>> Xtrain_scaled = scaler.fit_transform(Xtrain)
>>> Xtest_scaled = scaler.transform(Xtest)
>>> Xtrain_inverse = scaler.inverse_transform(Xtrain_scaled)
>>> Xtest_inverse = scaler.inverse_transform(Xtest_scaled)
>>> Xtrain[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtrain_scaled[:3,:]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> Xtrain_inverse[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtest[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtest_scaled[:3,:]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> Xtest_inverse[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
Methods
fit_transform(dataset) – Method that estimates a scaler object using the data in dataset and scales the data in dataset.
inverse_transform(dataset) – Method that inverse-scales the data in dataset.
transform(dataset) – Method that scales the data in dataset.

fit_transform(dataset)
Method that estimates a scaler object using the data in dataset and scales the data in dataset.
Parameters: dataset (numpy.array) – Dataset used to estimate the scaler.
Returns: Scaled data.
Return type: numpy.array
inverse_transform(dataset)
Method that inverse-scales the data in dataset.
It must be called after calling the fit_transform method for estimating the scaler.
Parameters: dataset (numpy.array) – Dataset to be inverse-scaled.
Returns: Inverse-scaled data.
Return type: numpy.array
transform(dataset)
Method that scales the data in dataset.
It must be called after calling the fit_transform method for estimating the scaler.
Parameters: dataset (numpy.array) – Dataset to be scaled.
Returns: Scaled data.
Return type: numpy.array
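The calling order of the three methods can be illustrated with a small stand-in class. MinMaxSketch below is hypothetical and is not the library's DataScaler: it scales to [-1, 1] (similar in spirit to 'Norm1'), and raising an error when transform is called before fit_transform is a design choice made for this sketch, not behavior confirmed by the documentation.

```python
import numpy as np

class MinMaxSketch:
    """Hypothetical stand-in with the same three-method interface."""

    def __init__(self, low=-1.0, high=1.0):
        self.low, self.high = low, high
        self._min = None
        self._max = None

    def _check_fitted(self):
        if self._min is None:
            raise RuntimeError("call fit_transform before transform/inverse_transform")

    def fit_transform(self, dataset):
        # Estimate column-wise minimum and maximum, then scale the same data.
        self._min = dataset.min(axis=0)
        self._max = dataset.max(axis=0)
        return self.transform(dataset)

    def transform(self, dataset):
        # Scale using the statistics estimated in fit_transform.
        self._check_fitted()
        unit = (dataset - self._min) / (self._max - self._min)
        return unit * (self.high - self.low) + self.low

    def inverse_transform(self, dataset):
        # Map scaled values back to the original units.
        self._check_fitted()
        unit = (dataset - self.low) / (self.high - self.low)
        return unit * (self._max - self._min) + self._min

X_train = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 200.0]])
X_test = np.array([[40.0, 150.0]])

scaler = MinMaxSketch()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_back = scaler.inverse_transform(X_test_scaled)
```

In this toy example the first test value (40.0) exceeds the training maximum, so its scaled value falls outside [-1, 1]. This is expected whenever the scaler is estimated on one dataset and applied to another.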
epftoolbox.data.scaling(datasets, normalize)
Function that scales data and returns the scaled data and the DataScaler used for scaling.
It rescales all the datasets contained in the list datasets using the first dataset as reference. For example, if datasets=[X_1, X_2, X_3], the function estimates a DataScaler object using the array X_1, and transforms X_1, X_2, and X_3 using the DataScaler object.
Each dataset must be a numpy.array, and all datasets must have the same number of columns. For example, if datasets=[X_1, X_2, X_3], X_1 must be a numpy.array of size [n_1, m], X_2 of size [n_2, m], and X_3 of size [n_3, m], where n_1, n_2, n_3 can be different.
The scaling technique is defined by the normalize parameter, which takes one of the following values:
- 'Norm' for normalizing the data to the interval [0, 1].
- 'Norm1' for normalizing the data to the interval [-1, 1].
- 'Std' for standardizing the data to zero mean and unit variance.
- 'Median' for normalizing the data based on the median, as defined here.
- 'Invariant' for scaling the data based on the asinh transformation (a variance-stabilizing transformation), as defined here.
The function returns the scaled data together with a DataScaler object representing the scaling. This object can be used to scale other datasets using the same rules or to inverse-transform the data.
Parameters:
- datasets (list) – List of numpy.array objects to be scaled.
- normalize (str) – Type of scaling to be performed. Possible values are 'Norm', 'Norm1', 'Std', 'Median', or 'Invariant'.
Returns: List of scaled datasets and the DataScaler object used for scaling. Each dataset in the list is a numpy.array.
Return type: List, DataScaler
Example
>>> from epftoolbox.data import read_data
>>> from epftoolbox.data import scaling
>>> df_train, df_test = read_data(path='.', dataset='PJM', begin_test_date='01-01-2016',
...                               end_test_date='01-02-2016')
Test datasets: 2016-01-01 00:00:00 - 2016-02-01 23:00:00
>>> df_train.tail()
                         Price  Exogenous 1  Exogenous 2
Date
2015-12-31 19:00:00  29.513832     100700.0      13015.0
2015-12-31 20:00:00  28.440134      99832.0      12858.0
2015-12-31 21:00:00  26.701700      97033.0      12626.0
2015-12-31 22:00:00  23.262253      92022.0      12176.0
2015-12-31 23:00:00  22.262431      86295.0      11434.0
>>> df_test.head()
                         Price  Exogenous 1  Exogenous 2
Date
2016-01-01 00:00:00  20.341321      76840.0      10406.0
2016-01-01 01:00:00  19.462741      74819.0      10075.0
2016-01-01 02:00:00  17.172706      73182.0       9795.0
2016-01-01 03:00:00  16.963876      72300.0       9632.0
2016-01-01 04:00:00  17.403722      72535.0       9566.0
>>> Xtrain = df_train.values
>>> # Note: df_train is reused here, so the Xtest outputs below match the Xtrain ones;
>>> # use df_test.values for a held-out test set.
>>> Xtest = df_train.values
>>> [Xtrain_scaled, Xtest_scaled], scaler = scaling([Xtrain,Xtest],'Norm')
>>> Xtrain[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtrain_scaled[:3,:]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> Xtest[:3,:]
array([[2.5464211e+01, 8.5049000e+04, 1.1509000e+04],
       [2.3554578e+01, 8.2128000e+04, 1.0942000e+04],
       [2.2122277e+01, 8.0729000e+04, 1.0639000e+04]])
>>> Xtest_scaled[:3,:]
array([[0.03833877, 0.2736787 , 0.28415155],
       [0.03608228, 0.24425597, 0.24633138],
       [0.03438982, 0.23016409, 0.2261206 ]])
>>> type(scaler)
<class 'epftoolbox.data._wrangling.DataScaler'>
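The "first dataset as reference" behavior and the column-dimension requirement can be sketched without the library. The function scaling_sketch below is a hypothetical NumPy-only illustration restricted to [0, 1] normalization; it returns the estimated parameters in a plain dictionary rather than a DataScaler object, and the shape check is an assumption about how such a function might validate its input.

```python
import numpy as np

def scaling_sketch(datasets):
    """Hypothetical helper: min-max scale every array using the first one as reference."""
    reference = datasets[0]
    n_cols = reference.shape[1]

    # All datasets must share the number of columns; the number of rows may differ.
    for i, data in enumerate(datasets):
        if data.ndim != 2 or data.shape[1] != n_cols:
            raise ValueError(f"dataset {i} must have shape [n, {n_cols}]")

    # Estimate the scaling parameters on the reference dataset only.
    col_min = reference.min(axis=0)
    col_range = reference.max(axis=0) - col_min

    scaled = [(data - col_min) / col_range for data in datasets]
    return scaled, {"min": col_min, "range": col_range}

X_1 = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_2 = np.array([[2.5, 40.0]])

[X_1_scaled, X_2_scaled], params = scaling_sketch([X_1, X_2])
```

The returned parameters play the role that the DataScaler object plays in the real function: they are what allows further datasets to be scaled with the same rules, or scaled values to be mapped back to the original units.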