Preprocessing API

Preprocessing Module

This module provides data preprocessing utilities for financial data analysis. It has no pandas dependency and relies only on NumPy and SciPy.

clip_outliers(data, lower_percentile=1.0, upper_percentile=99.0)

Clip values outside specified percentiles.

Parameters:

    data : array-like (required)
        Input data array.
    lower_percentile : float, default 1.0
        Lower percentile (between 0 and 100).
    upper_percentile : float, default 99.0
        Upper percentile (between 0 and 100).

Returns:

    ndarray
        Data with outliers clipped. Special cases:
        - Empty array: returns an empty array.
        - Single value: returned unchanged.
        - Constant values: returned unchanged.
        - All NaN: returns an array of NaN.

Raises:

    ValueError
        If a percentile is not between 0 and 100, or if lower_percentile > upper_percentile.
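A minimal usage sketch; the `preprocessing` import path is an assumption, since this reference does not name the module:

```python
import numpy as np
from preprocessing import clip_outliers  # assumed import path

prices = np.array([100.0, 101.5, 99.8, 250.0, 100.2, 2.0])
# Values below the 5th and above the 95th percentile are clipped to those bounds.
clipped = clip_outliers(prices, lower_percentile=5.0, upper_percentile=95.0)
print(clipped)
```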

difference(data, order=1)

Calculate differences between consecutive elements of a time series.

Parameters:

    data : array-like (required)
        Input data array.
    order : int, default 1
        The order of differencing. Must be non-negative.

Returns:

    ndarray
        Array of differences with length n - order, where n is the length of the input array.
        For order=0, the original array is returned. NaN values in the input produce NaN
        differences only where a NaN is involved.

Raises:

    ValueError
        If order is negative or larger than the length of the data.
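A short example (same assumed `preprocessing` import as above):

```python
import numpy as np
from preprocessing import difference  # assumed import path

series = np.array([10.0, 12.0, 11.0, 15.0])
d1 = difference(series, order=1)  # consecutive changes, length n - 1
d2 = difference(series, order=2)  # difference of the differences, length n - 2
print(d1, d2)
```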

discretize(data, n_bins=5, strategy='uniform')

Discretize continuous data into bins using efficient vectorized operations.

Parameters:

    data : array-like (required)
        Input data array to be discretized. Will be flattened if multi-dimensional.
    n_bins : int, default 5
        Number of bins to create. Must be positive.
    strategy : str, default 'uniform'
        Strategy used to create the bins:
        - 'uniform': equal-width bins
        - 'quantile': equal-frequency bins
        - 'kmeans': bins based on k-means clustering

Returns:

    ndarray
        Array of bin labels (1 to n_bins). NaN values in the input remain NaN in the output.
        Special cases:
        - Empty array: returns an empty array.
        - Single value or constant array: returns an array filled with 1.0 (NaN preserved).
        - All NaN: returns an array of NaN.

Notes
  • Uses efficient vectorized operations for binning
  • Handles NaN values gracefully
  • Memory-efficient implementation that avoids unnecessary copies
  • Bin labels are 1-based (1 to n_bins)

Raises:

    ValueError
        • If strategy is not one of 'uniform', 'quantile', or 'kmeans'
        • If n_bins is not positive
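A usage sketch under the same assumed import path:

```python
import numpy as np
from preprocessing import discretize  # assumed import path

returns = np.array([-0.02, 0.01, 0.03, np.nan, -0.01, 0.05])
# Bin labels run from 1 to n_bins; the NaN entry stays NaN.
bins = discretize(returns, n_bins=3, strategy='quantile')
print(bins)
```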

dynamic_tanh(data, alpha=1.0)

Apply the Dynamic Tanh (DyT) transformation, which normalizes data while preserving relative differences and handling outliers well.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    alpha : float, default 1.0
        Scaling factor that controls the transformation intensity. Higher values lead to
        more aggressive normalization (less extreme values).

Returns:

    ndarray
        DyT-transformed data with values in the range (-1, 1). Special cases:
        - Empty array: returns an empty array.
        - Single value: returns an array of zeros.
        - Constant values: returns an array of zeros.
        - All NaN: returns an array of NaN.

Notes

The Dynamic Tanh (DyT) transformation follows these steps:
1. Center the data by subtracting the median.
2. Scale the data by dividing by (MAD * alpha), where MAD is the Median Absolute Deviation.
   A higher alpha means more scaling (division by a larger value) before the tanh.
3. Apply the tanh transformation to the scaled data.

This transformation is particularly useful for financial data because it:
- Is robust to outliers (uses median and MAD instead of mean and std)
- Maps all values to the range (-1, 1) without clipping extreme values
- Preserves the shape of the distribution better than min-max scaling
- Handles multi-modal distributions better than standard normalization
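A short sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import dynamic_tanh  # assumed import path

data = np.array([1.0, 1.2, 0.9, 25.0, 1.1])  # contains one extreme value
# Values are centered on the median, scaled by MAD * alpha, then squashed into (-1, 1).
transformed = dynamic_tanh(data, alpha=2.0)
print(transformed)
```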

fill_missing(data, method='mean', value=None)

Fill missing values in data using efficient NumPy operations.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    method : str, default 'mean'
        Method used to fill missing values: 'mean', 'median', 'mode', 'forward',
        'backward', or 'value'.
    value : float, default None
        Value to use when method='value'.

Returns:

    ndarray
        Data with missing values filled. For all-NaN input:
        - Statistical methods (mean, median, mode) return an all-NaN array.
        - Forward/backward fill returns an all-NaN array.
        - Value fill returns an array filled with the specified value.

Raises:

    ValueError
        • If method is not recognized
        • If method='value' but no value is provided
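A usage sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import fill_missing  # assumed import path

prices = np.array([100.0, np.nan, 102.0, np.nan, 105.0])
filled_mean = fill_missing(prices, method='mean')       # gaps replaced by the series mean
filled_ffill = fill_missing(prices, method='forward')   # gaps carry the previous value forward
filled_const = fill_missing(prices, method='value', value=0.0)
print(filled_mean, filled_ffill, filled_const)
```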

interpolate_missing(data, method='linear')

Interpolate missing values in data using efficient NumPy operations.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    method : str, default 'linear'
        Interpolation method: 'linear', 'nearest', 'zero', 'slinear', 'quadratic', or 'cubic'.

Returns:

    ndarray
        Data with missing values interpolated. For missing values at the start or end:
        - 'nearest' and 'zero' use the nearest valid value.
        - Other methods leave them as NaN.
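A short example (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import interpolate_missing  # assumed import path

series = np.array([np.nan, 1.0, np.nan, 3.0, np.nan])
linear = interpolate_missing(series, method='linear')    # interior gap filled; leading/trailing NaN kept
nearest = interpolate_missing(series, method='nearest')  # edge gaps take the nearest valid value
print(linear, nearest)
```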

lag_features(data, lags)

Create lagged features from data using vectorized operations.

Parameters:

    data : array-like (required)
        Input data array.
    lags : list of int (required)
        List of lag values. A zero lag returns the original values, negative lags are
        ignored, and lags larger than the data length result in all-NaN columns.

Returns:

    ndarray
        Array with the original and lagged features as columns. The first column is the
        original data, followed by one column per lag. NaN values are used for undefined
        lag positions.
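A minimal sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import lag_features  # assumed import path

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Columns: original series, lag-1, lag-2; undefined lag positions are NaN.
features = lag_features(series, lags=[1, 2])
print(features.shape)  # (5, 3) per the documented column layout
```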

log_transform(data, base=None, offset=0.0)

Apply logarithmic transformation to data.

Parameters:

    data : array-like (required)
        Input data array; can be a list or numpy array.
    base : float, default None
        Base of the logarithm. If None, the natural logarithm is used. Common bases:
        None (natural log), 2, 10.
    offset : float, default 0.0
        Offset added to the data before taking the logarithm (useful for non-positive data).

Returns:

    ndarray
        Log-transformed data. Special cases:
        - Empty array: returns an empty array.
        - Single value: returns the log of (value + offset).
        - All NaN: returns an array of NaN.
        - Non-positive values: raises ValueError if any (value + offset) <= 0.

Raises:

    ValueError
        If any value after the offset is non-positive.
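A short example (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import log_transform  # assumed import path

volumes = np.array([0.0, 9.0, 99.0, 999.0])
# An offset of 1.0 keeps the zero entry positive before the log is taken.
logged = log_transform(volumes, base=10, offset=1.0)
print(logged)  # log10 of (value + 1)
```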

min_max_scale(data, feature_range=(0, 1))

Scale features to a given range using min-max scaling.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    feature_range : tuple, default (0, 1)
        Desired range of the transformed data.

Returns:

    ndarray
        Scaled data. Special cases:
        - Empty array: returns an empty array.
        - Single value: returns an array filled with feature_range[0].
        - Constant values: returns an array filled with feature_range[0].
        - All NaN: returns an array of NaN.

Raises:

    ValueError
        If feature_range[0] >= feature_range[1].
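A usage sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import min_max_scale  # assumed import path

prices = np.array([10.0, 15.0, 20.0])
scaled = min_max_scale(prices, feature_range=(-1, 1))  # the minimum maps to -1, the maximum to 1
print(scaled)
```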

normalize(data, method='l2')

Normalize data using vector normalization methods.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    method : str, default 'l2'
        Normalization method:
        - 'l1': L1 normalization (Manhattan norm)
        - 'l2': L2 normalization (Euclidean norm)

Returns:

    ndarray
        Normalized data. Special cases:
        - Empty array: returns an empty array.
        - All zeros: returns an array of zeros.
        - Single value: returns an array with 1.0.
        - All NaN: returns an array of NaN.

Notes

L1 normalization: X' = X / sum(|X|)
L2 normalization: X' = X / sqrt(sum(X^2))

Raises:

    ValueError
        If method is not 'l1' or 'l2'.
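A short example (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import normalize  # assumed import path

weights = np.array([3.0, 4.0])
l2 = normalize(weights, method='l2')  # divided by sqrt(3^2 + 4^2) = 5 -> [0.6, 0.8]
l1 = normalize(weights, method='l1')  # divided by |3| + |4| = 7
print(l1, l2)
```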

polynomial_features(data, degree=2)

Generate polynomial features up to specified degree.

Parameters:

    data : array-like (required)
        Input data array. Will be flattened if multi-dimensional.
    degree : int, default 2
        Maximum degree of the polynomial features. Must be a positive integer.

Returns:

    ndarray
        Array with polynomial features as columns:
        - The first column contains 1s (bias term).
        - Subsequent columns contain increasing powers (x, x², ..., x^degree).
        The shape will be (n_samples, degree + 1).

Notes
  • Uses efficient vectorized operations for the polynomial computation
  • Handles NaN values gracefully (they propagate through the powers)
  • Memory-efficient implementation that avoids unnecessary copies

Raises:

    ValueError
        If degree is not a positive integer.
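A minimal sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import polynomial_features  # assumed import path

x = np.array([1.0, 2.0, 3.0])
poly = polynomial_features(x, degree=3)
print(poly.shape)  # (3, 4): columns are 1, x, x^2, x^3 per the documented layout
```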

power_transform(data, method='yeo-johnson', standardize=True)

Apply power transformation (Box-Cox or Yeo-Johnson) to make data more Gaussian-like.

Parameters:

    data : array-like (required)
        Input data array; can be a list or numpy array.
    method : str, default 'yeo-johnson'
        The power transform method:
        - 'box-cox': only works with positive values
        - 'yeo-johnson': works with both positive and negative values
    standardize : bool, default True
        Whether to standardize the data after the transformation.

Returns:

    ndarray
        Power-transformed data. NaN values in the input remain NaN in the output.
        Special cases:
        - Empty array: returns an empty array.
        - Single value or constant values: returns an array of zeros if standardize=True,
          or the log1p of the constant if standardize=False.
        - All NaN: returns an array of NaN.

Raises:

    ValueError
        If method is not one of 'box-cox' or 'yeo-johnson'.
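A usage sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import power_transform  # assumed import path

skewed = np.array([0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4])
yj = power_transform(skewed, method='yeo-johnson', standardize=True)
bc = power_transform(skewed, method='box-cox', standardize=False)  # box-cox requires positive values
print(yj, bc)
```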

quantile_transform(data, n_quantiles=1000, output_distribution='uniform')

Transform features using quantile information.

Parameters:

    data : array-like (required)
        Input data array.
    n_quantiles : int, default 1000
        Number of quantiles to use.
    output_distribution : str, default 'uniform'
        Target distribution: 'uniform' or 'normal'.

Returns:

    ndarray
        Quantile-transformed data.
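A short example (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import quantile_transform  # assumed import path

returns = np.random.default_rng(0).standard_t(df=3, size=500)  # heavy-tailed sample
uniform = quantile_transform(returns, n_quantiles=100, output_distribution='uniform')
gaussian = quantile_transform(returns, n_quantiles=100, output_distribution='normal')
print(uniform.min(), uniform.max())
```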

remove_outliers(data, method='zscore', threshold=2.0)

Remove outliers from data by replacing them with NaN.

Parameters:

    data : array-like (required)
        Input data array.
    method : str, default 'zscore'
        Method used to detect outliers: 'zscore', 'iqr', or 'mad'.
    threshold : float, default 2.0
        Threshold for outlier detection:
        - For 'zscore': number of standard deviations
        - For 'iqr': multiplier of the IQR
        - For 'mad': multiplier of the MAD

Returns:

    ndarray
        Data with outliers replaced by NaN. Original NaN values remain NaN.
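A minimal sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import remove_outliers  # assumed import path

data = np.array([1.0, 1.1, 0.9, 1.2, 50.0])
# Per the documented behavior, detected outliers are replaced with NaN rather than removed.
cleaned = remove_outliers(data, method='iqr', threshold=1.5)
print(cleaned)
```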

resample(data, factor, method='mean')

Resample data by aggregating values using efficient NumPy operations.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    factor : int (required)
        Resampling factor (e.g., 5 means aggregate every 5 points).
    method : str, default 'mean'
        Aggregation method: 'mean', 'median', 'sum', 'min', or 'max'.
        Note: groups containing only NaN values return NaN for every method.

Returns:

    ndarray
        Resampled data. The length will be floor(len(data)/factor); trailing data points
        that do not fill a complete group are discarded.
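A short example (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import resample  # assumed import path

minute_prices = np.arange(1.0, 14.0)          # 13 points
five_min = resample(minute_prices, factor=5)  # 2 complete groups of 5; the last 3 points are discarded
print(five_min)
```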

robust_scale(data, method='iqr', quantile_range=(25.0, 75.0))

Scale features using statistics that are robust to outliers.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    method : str, default 'iqr'
        Method used for scaling:
        - 'iqr': use the Interquartile Range
        - 'mad': use the Median Absolute Deviation
    quantile_range : tuple, default (25.0, 75.0)
        Quantile range used to calculate the scale when method='iqr'.

Returns:

    ndarray
        Robustly scaled data. Special cases:
        - Empty array: returns an empty array.
        - Single value: returns an array of zeros.
        - Constant values: returns an array of zeros.
        - All NaN: returns an array of NaN.

Notes

For the IQR method: (X - median) / IQR
For the MAD method: (X - median) / (MAD * 1.4826)
The factor 1.4826 makes the MAD consistent with the standard deviation for normally distributed data.

Raises:

    ValueError
        If method is not recognized or if the quantile range is invalid.
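A usage sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import robust_scale  # assumed import path

data = np.array([10.0, 11.0, 9.0, 10.5, 200.0])  # one large outlier
iqr_scaled = robust_scale(data, method='iqr')    # (x - median) / IQR
mad_scaled = robust_scale(data, method='mad')    # (x - median) / (MAD * 1.4826)
print(iqr_scaled, mad_scaled)
```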

rolling_window(data, window_size, step=1)

Create rolling windows of data using efficient NumPy striding.

Parameters:

    data : array-like (required)
        Input data array.
    window_size : int (required)
        Size of the rolling window.
    step : int, default 1
        Step size between windows.

Returns:

    ndarray
        Array of rolling windows. The shape will be (n_windows, window_size), where
        n_windows = max(0, (len(data) - window_size) // step + 1).

Raises:

    ValueError
        If window_size or step is not positive.
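A minimal sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import rolling_window  # assumed import path

prices = np.arange(10.0)
windows = rolling_window(prices, window_size=4, step=2)
print(windows.shape)  # (4, 4): max(0, (10 - 4) // 2 + 1) = 4 windows of length 4
```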

scale_to_range(data, feature_range=(0.0, 1.0))

Scale data to a specific range while preserving relative distances.

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.
    feature_range : tuple, default (0.0, 1.0)
        Desired range for the transformed data (min, max).

Returns:

    ndarray
        Data scaled to the target range. Special cases:
        - Empty array: returns an empty array.
        - Single value: returns an array filled with feature_range[0].
        - Constant values: returns an array filled with feature_range[0].
        - All NaN: returns an array of NaN.

Raises:

    ValueError
        If feature_range[0] >= feature_range[1].
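A short example (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import scale_to_range  # assumed import path

signal = np.array([2.0, 4.0, 6.0, 8.0])
rescaled = scale_to_range(signal, feature_range=(0.0, 100.0))  # relative spacing is preserved
print(rescaled)
```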

standardize(data)

Standardize data to have mean 0 and standard deviation 1 (Z-score normalization).

Parameters:

    data : array-like (required)
        Input data array; may contain None or np.nan as missing values.

Returns:

    ndarray
        Standardized data with zero mean and unit variance. Special cases:
        - Empty array: returns an empty array.
        - Single value: returns an array of zeros.
        - Constant values: returns an array of zeros.
        - All NaN: returns an array of NaN.
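A minimal sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import standardize  # assumed import path

returns = np.array([0.01, -0.02, 0.03, 0.00, -0.01])
z = standardize(returns)  # z-score: (x - mean) / std, giving mean 0 and standard deviation 1
print(z.mean(), z.std())
```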

winsorize(data, limits=0.05)

Limit extreme values in data.

Parameters:

    data : array-like (required)
        Input data array.
    limits : float or tuple, default 0.05
        If a float, the proportion to cut on each side. If a tuple of two floats, the
        proportions to cut from the lower and upper ends, respectively.

Returns:

    ndarray
        Winsorized data.
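A usage sketch (assumed `preprocessing` import path):

```python
import numpy as np
from preprocessing import winsorize  # assumed import path

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, -50.0] * 5)
capped_sym = winsorize(data, limits=0.05)           # limit the extreme 5% on each side
capped_asym = winsorize(data, limits=(0.01, 0.10))  # 1% from the lower end, 10% from the upper end
print(capped_sym.max(), capped_asym.max())
```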