Preprocessing API
Preprocessing Module
This module provides data preprocessing utilities for financial data analysis using only NumPy and SciPy, with no dependency on pandas.
clip_outliers(data, lower_percentile=1.0, upper_percentile=99.0)
Clip values outside specified percentiles.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array | required
lower_percentile | float | Lower percentile (between 0 and 100) | 1.0
upper_percentile | float | Upper percentile (between 0 and 100) | 99.0

Returns:

Type | Description
---|---
ndarray | Data with outliers clipped. Special cases: an empty array returns an empty array; a single value or constant array is returned unchanged; an all-NaN array returns all NaN.

Raises:

Type | Description
---|---
ValueError | If the percentiles are not between 0 and 100, or if lower_percentile > upper_percentile
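The module itself isn't importable here, so the behavior documented above can be sketched with plain NumPy; `clip_outliers_sketch` is an illustrative stand-in, not the module's actual implementation.

```python
import numpy as np

def clip_outliers_sketch(data, lower_percentile=1.0, upper_percentile=99.0):
    """Clip values outside the given percentiles (NaN-aware sketch)."""
    arr = np.asarray(data, dtype=float)
    if arr.size == 0 or np.all(np.isnan(arr)):
        return arr  # empty and all-NaN inputs pass through unchanged
    lo = np.nanpercentile(arr, lower_percentile)
    hi = np.nanpercentile(arr, upper_percentile)
    return np.clip(arr, lo, hi)

# The extreme value 100.0 is pulled in to the 90th-percentile bound
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = clip_outliers_sketch(data, lower_percentile=10, upper_percentile=90)
```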
difference(data, order=1)
Calculate differences between consecutive elements of a time series.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array | required
order | int | The order of differencing. Must be non-negative. | 1

Returns:

Type | Description
---|---
ndarray | Array of differences with length n - order, where n is the length of the input array. For order=0, returns the original array. NaN values in the input produce NaN differences only where a NaN is involved.

Raises:

Type | Description
---|---
ValueError | If order is negative or larger than the length of the data
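As a sketch of the semantics above, repeated differencing maps directly onto `np.diff`; the helper name below is hypothetical.

```python
import numpy as np

def difference_sketch(data, order=1):
    """Order-n differences of a 1-D series (sketch of the documented behavior)."""
    arr = np.asarray(data, dtype=float)
    if order < 0 or order > arr.size:
        raise ValueError("order must be between 0 and len(data)")
    # order=0 returns the original array; otherwise np.diff applies
    # first differences `order` times, shortening the array each time.
    return arr if order == 0 else np.diff(arr, n=order)

d1 = difference_sketch([1.0, 3.0, 6.0, 10.0])           # first differences
d2 = difference_sketch([1.0, 3.0, 6.0, 10.0], order=2)  # second differences
```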
discretize(data, n_bins=5, strategy='uniform')
Discretize continuous data into bins using efficient vectorized operations.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array to be discretized. Flattened if multi-dimensional. | required
n_bins | int | Number of bins to create. Must be positive. | 5
strategy | str | Binning strategy: 'uniform' (equal-width bins), 'quantile' (equal-frequency bins), or 'kmeans' (bins based on k-means clustering) | 'uniform'

Returns:

Type | Description
---|---
ndarray | Array of bin labels (1 to n_bins). NaN values in the input remain NaN in the output. Special cases: an empty array returns an empty array; a single value or constant array returns an array filled with 1.0 (NaN preserved); an all-NaN array returns all NaN.
Notes
- Uses efficient vectorized operations for binning
- Handles NaN values gracefully
- Memory efficient implementation avoiding unnecessary copies
- Bin labels are 1-based (1 to n_bins)
Raises:

Type | Description
---|---
ValueError | If n_bins is not positive or strategy is not one of the supported options
dynamic_tanh(data, alpha=1.0)
Apply Dynamic Tanh (DyT) transformation to data, which helps normalize data while preserving relative differences and handling outliers well.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
alpha | float | Scaling factor that controls the transformation intensity. Higher values lead to more aggressive normalization (less extreme values). | 1.0

Returns:

Type | Description
---|---
ndarray | DyT-transformed data with values in the range (-1, 1). Special cases: an empty array returns an empty array; a single value or constant array returns zeros; an all-NaN array returns all NaN.
Notes
The Dynamic Tanh (DyT) transformation follows these steps:

1. Center the data by subtracting the median.
2. Scale the data by dividing by (MAD * alpha), where MAD is the Median Absolute Deviation. Higher alpha means division by a larger value before tanh.
3. Apply the tanh transformation to the scaled data.

This transformation is particularly useful for financial data because it:

- Is robust to outliers (uses median and MAD instead of mean and std)
- Maps all values to the range (-1, 1) without clipping extreme values
- Preserves the shape of the distribution better than min-max scaling
- Handles multi-modal distributions better than standard normalization
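The three steps above can be sketched directly in NumPy; `dynamic_tanh_sketch` is an illustrative stand-in for the documented function, not its actual implementation.

```python
import numpy as np

def dynamic_tanh_sketch(data, alpha=1.0):
    """Center by median, scale by MAD * alpha, then apply tanh (sketch)."""
    arr = np.asarray(data, dtype=float)
    med = np.nanmedian(arr)
    mad = np.nanmedian(np.abs(arr - med))
    if mad == 0:
        # Constant input: documented behavior is an array of zeros (NaN kept)
        return np.where(np.isnan(arr), np.nan, 0.0)
    return np.tanh((arr - med) / (mad * alpha))

# The outlier 10.0 is mapped close to, but strictly inside, the (-1, 1) range
out = dynamic_tanh_sketch([1.0, 2.0, 3.0, 4.0, 10.0])
```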
fill_missing(data, method='mean', value=None)
Fill missing values in data using efficient NumPy operations.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
method | str | Method to fill missing values: 'mean', 'median', 'mode', 'forward', 'backward', 'value' | 'mean'
value | float | Value to use when method='value' | None

Returns:

Type | Description
---|---
ndarray | Data with missing values filled. For all-NaN input: statistical methods (mean, median, mode) and forward/backward fill return an all-NaN array; value fill returns an array filled with the specified value.
Raises:

Type | Description
---|---
ValueError | If method is not recognized, or if method='value' and no value is provided
interpolate_missing(data, method='linear')
Interpolate missing values in data using efficient NumPy operations.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
method | str | Interpolation method: 'linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic' | 'linear'

Returns:

Type | Description
---|---
ndarray | Data with missing values interpolated. For missing values at the start or end: 'nearest' and 'zero' use the nearest valid value; other methods leave them as NaN.
lag_features(data, lags)
Create lagged features from data using vectorized operations.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array | required
lags | list of int | List of lag values. A lag of zero returns the original values, negative lags are ignored, and lags larger than the data length produce all-NaN columns. | required

Returns:

Type | Description
---|---
ndarray | Array with the original data and lagged features as columns: the first column is the original data, followed by one column per lag. NaN marks positions where a lag is undefined.
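The column layout described above can be sketched with slicing and `np.column_stack`; `lag_features_sketch` is an illustrative stand-in.

```python
import numpy as np

def lag_features_sketch(data, lags):
    """Stack the original series with lagged copies as columns (sketch)."""
    arr = np.asarray(data, dtype=float)
    n = arr.size
    cols = [arr]  # first column: the original data
    for lag in lags:
        if lag < 0:
            continue  # negative lags are ignored, per the docs
        col = np.full(n, np.nan)
        if lag == 0:
            col = arr.copy()
        elif lag < n:
            col[lag:] = arr[:-lag]  # first `lag` positions stay NaN
        # lag >= n leaves an all-NaN column
        cols.append(col)
    return np.column_stack(cols)

X = lag_features_sketch([1.0, 2.0, 3.0, 4.0], lags=[1, 2])
```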
log_transform(data, base=None, offset=0.0)
Apply logarithmic transformation to data.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; can be a list or numpy array | required
base | float | Base of the logarithm. If None, the natural logarithm is used. Common choices: None (natural log), 2, 10 | None
offset | float | Offset added to the data before taking the logarithm (useful for non-positive data) | 0.0

Returns:

Type | Description
---|---
ndarray | Log-transformed data. Special cases: an empty array returns an empty array; a single value returns the log of (value + offset); an all-NaN array returns all NaN. A ValueError is raised if any (value + offset) <= 0.

Raises:

Type | Description
---|---
ValueError | If any value after the offset is non-positive
min_max_scale(data, feature_range=(0, 1))
Scale features to a given range using min-max scaling.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
feature_range | tuple | Desired range of transformed data | (0, 1)

Returns:

Type | Description
---|---
ndarray | Scaled data. Special cases: an empty array returns an empty array; a single value or constant array returns an array filled with feature_range[0]; an all-NaN array returns all NaN.

Raises:

Type | Description
---|---
ValueError | If feature_range[0] >= feature_range[1]
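The scaling formula is the standard min-max map; this NumPy sketch follows the special cases documented above, with a hypothetical helper name.

```python
import numpy as np

def min_max_scale_sketch(data, feature_range=(0, 1)):
    """Affinely map the data's [min, max] onto feature_range (sketch)."""
    lo, hi = feature_range
    if lo >= hi:
        raise ValueError("feature_range[0] must be < feature_range[1]")
    arr = np.asarray(data, dtype=float)
    dmin, dmax = np.nanmin(arr), np.nanmax(arr)
    if dmin == dmax:
        # Constant input maps to the lower bound (NaN preserved)
        return np.where(np.isnan(arr), np.nan, lo)
    return lo + (arr - dmin) * (hi - lo) / (dmax - dmin)

scaled = min_max_scale_sketch([10.0, 20.0, 30.0], feature_range=(0.0, 1.0))
```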
normalize(data, method='l2')
Normalize data using vector normalization methods.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
method | str | Normalization method: 'l1' (Manhattan norm) or 'l2' (Euclidean norm) | 'l2'

Returns:

Type | Description
---|---
ndarray | Normalized data. Special cases: an empty array returns an empty array; an all-zero array returns zeros; a single value returns an array containing 1.0; an all-NaN array returns all NaN.
Notes
L1 normalization: X' = X / sum(|X|)
L2 normalization: X' = X / sqrt(sum(X^2))
Raises:

Type | Description
---|---
ValueError | If method is not 'l1' or 'l2'
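The two formulas above can be sketched directly; `normalize_sketch` is an illustrative stand-in for the documented function.

```python
import numpy as np

def normalize_sketch(data, method='l2'):
    """Divide the vector by its L1 or L2 norm (NaN-aware sketch)."""
    arr = np.asarray(data, dtype=float)
    if method == 'l1':
        norm = np.nansum(np.abs(arr))          # X' = X / sum(|X|)
    elif method == 'l2':
        norm = np.sqrt(np.nansum(arr ** 2))    # X' = X / sqrt(sum(X^2))
    else:
        raise ValueError("method must be 'l1' or 'l2'")
    return arr if norm == 0 else arr / norm    # all-zero input stays zeros

v = normalize_sketch([3.0, 4.0])  # L2 norm of the input is 5
```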
polynomial_features(data, degree=2)
Generate polynomial features up to specified degree.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array. Flattened if multi-dimensional. | required
degree | int | Maximum degree of polynomial features. Must be a positive integer. | 2

Returns:

Type | Description
---|---
ndarray | Array with polynomial features as columns: the first column is all 1s (bias term); subsequent columns hold increasing powers (x, x², ..., x^degree). Shape is (n_samples, degree + 1).
Notes
- Uses efficient vectorized operations for polynomial computation
- Handles NaN values gracefully (propagates through powers)
- Memory efficient implementation avoiding unnecessary copies
Raises:

Type | Description
---|---
ValueError | If degree is not a positive integer
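The column layout described above (bias term plus increasing powers) falls out of a single `np.power.outer` call; the helper name below is hypothetical.

```python
import numpy as np

def polynomial_features_sketch(data, degree=2):
    """Build the (n_samples, degree + 1) matrix [1, x, x^2, ...] (sketch)."""
    if degree < 1:
        raise ValueError("degree must be a positive integer")
    arr = np.asarray(data, dtype=float).ravel()
    # Raise each sample to every exponent 0..degree in one vectorized call;
    # exponent 0 yields the bias column of 1s, and NaN propagates through powers.
    return np.power.outer(arr, np.arange(degree + 1))

X = polynomial_features_sketch([1.0, 2.0, 3.0], degree=2)
```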
power_transform(data, method='yeo-johnson', standardize=True)
Apply power transformation (Box-Cox or Yeo-Johnson) to make data more Gaussian-like.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; can be a list or numpy array | required
method | str | The power transform method: 'box-cox' (only works with positive values) or 'yeo-johnson' (works with both positive and negative values) | 'yeo-johnson'
standardize | bool | Whether to standardize the data after transformation | True

Returns:

Type | Description
---|---
ndarray | Power-transformed data. NaN values in the input remain NaN in the output. Special cases: an empty array returns an empty array; a single value or constant array returns zeros if standardize=True, or the log1p of the constant if standardize=False; an all-NaN array returns all NaN.
Raises:

Type | Description
---|---
ValueError | If method='box-cox' and the data contains non-positive values, or if method is not recognized
quantile_transform(data, n_quantiles=1000, output_distribution='uniform')
Transform features using quantile information.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array | required
n_quantiles | int | Number of quantiles to use | 1000
output_distribution | str | 'uniform' or 'normal' | 'uniform'

Returns:

Type | Description
---|---
ndarray | Quantile-transformed data
remove_outliers(data, method='zscore', threshold=2.0)
Remove outliers from data by replacing them with NaN.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array | required
method | str | Method to detect outliers: 'zscore', 'iqr', or 'mad' | 'zscore'
threshold | float | Threshold for outlier detection: for 'zscore', the number of standard deviations; for 'iqr', a multiplier of the IQR; for 'mad', a multiplier of the MAD | 2.0

Returns:

Type | Description
---|---
ndarray | Data with outliers replaced by NaN. Original NaN values remain NaN.
resample(data, factor, method='mean')
Resample data by aggregating values using efficient NumPy operations.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
factor | int | Resampling factor (e.g., 5 means aggregate every 5 points) | required
method | str | Aggregation method: 'mean', 'median', 'sum', 'min', 'max'. For groups containing only NaN values, all methods (including sum) return NaN. | 'mean'

Returns:

Type | Description
---|---
ndarray | Resampled data of length floor(len(data) / factor). Trailing points that do not fill a complete group are discarded.
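The reshape-then-aggregate pattern described above can be sketched as follows; note that plain `np.nansum` returns 0.0 for an all-NaN group, whereas the documented behavior is NaN, so a real implementation would need an extra mask for 'sum'. The helper name is hypothetical.

```python
import numpy as np

def resample_sketch(data, factor, method='mean'):
    """Aggregate every `factor` consecutive points (sketch)."""
    arr = np.asarray(data, dtype=float)
    n_groups = arr.size // factor
    # Trailing points that don't fill a complete group are discarded
    groups = arr[:n_groups * factor].reshape(n_groups, factor)
    agg = {'mean': np.nanmean, 'median': np.nanmedian, 'sum': np.nansum,
           'min': np.nanmin, 'max': np.nanmax}[method]
    return agg(groups, axis=1)

# Five points, factor 2: two complete groups, the fifth point is dropped
r = resample_sketch([1.0, 2.0, 3.0, 4.0, 5.0], factor=2)
```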
robust_scale(data, method='iqr', quantile_range=(25.0, 75.0))
Scale features using statistics that are robust to outliers.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
method | str | Method to use for scaling: 'iqr' (Interquartile Range) or 'mad' (Median Absolute Deviation) | 'iqr'
quantile_range | tuple | Quantile range used to calculate the scale when method='iqr' | (25.0, 75.0)

Returns:

Type | Description
---|---
ndarray | Robustly scaled data. Special cases: an empty array returns an empty array; a single value or constant array returns zeros; an all-NaN array returns all NaN.
Notes
For the IQR method: (X - median) / IQR
For the MAD method: (X - median) / (MAD * 1.4826)
The factor 1.4826 makes the MAD consistent with the standard deviation for normally distributed data.
Raises:

Type | Description
---|---
ValueError | If method is not recognized or if the quantile range is invalid
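Both formulas from the notes can be sketched in a few lines of NumPy; `robust_scale_sketch` is an illustrative stand-in for the documented function.

```python
import numpy as np

def robust_scale_sketch(data, method='iqr', quantile_range=(25.0, 75.0)):
    """Center by median, scale by IQR or consistency-adjusted MAD (sketch)."""
    arr = np.asarray(data, dtype=float)
    med = np.nanmedian(arr)
    if method == 'iqr':
        q_lo, q_hi = np.nanpercentile(arr, quantile_range)
        scale = q_hi - q_lo
    elif method == 'mad':
        # 1.4826 makes the MAD consistent with the std dev for normal data
        scale = np.nanmedian(np.abs(arr - med)) * 1.4826
    else:
        raise ValueError(f"unknown method: {method!r}")
    if scale == 0:
        return np.where(np.isnan(arr), np.nan, 0.0)  # constant input -> zeros
    return (arr - med) / scale

# The outlier barely affects median and IQR, so the bulk scales cleanly
z = robust_scale_sketch([1.0, 2.0, 3.0, 4.0, 100.0])
```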
rolling_window(data, window_size, step=1)
Create rolling windows of data using efficient NumPy striding.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array | required
window_size | int | Size of the rolling window | required
step | int | Step size between windows | 1

Returns:

Type | Description
---|---
ndarray | Array of rolling windows with shape (n_windows, window_size), where n_windows = max(0, (len(data) - window_size) // step + 1)

Raises:

Type | Description
---|---
ValueError | If window_size or step is not positive
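The striding approach mentioned above maps onto NumPy's `sliding_window_view`, which returns a view rather than copies; the helper name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_window_sketch(data, window_size, step=1):
    """Strided rolling windows of shape (n_windows, window_size) (sketch)."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    arr = np.asarray(data, dtype=float)
    if arr.size < window_size:
        return np.empty((0, window_size))  # too short: zero windows
    # All stride-1 windows, then subsample by `step` without copying data
    return sliding_window_view(arr, window_size)[::step]

# (5 - 3) // 2 + 1 = 2 windows: [1, 2, 3] and [3, 4, 5]
W = rolling_window_sketch([1.0, 2.0, 3.0, 4.0, 5.0], window_size=3, step=2)
```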
scale_to_range(data, feature_range=(0.0, 1.0))
Scale data to a specific range while preserving relative distances.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required
feature_range | tuple | Desired range for the transformed data (min, max) | (0.0, 1.0)

Returns:

Type | Description
---|---
ndarray | Data scaled to the target range. Special cases: an empty array returns an empty array; a single value or constant array returns an array filled with feature_range[0]; an all-NaN array returns all NaN.

Raises:

Type | Description
---|---
ValueError | If feature_range[0] >= feature_range[1]
standardize(data)
Standardize data to have mean 0 and standard deviation 1 (Z-score normalization).
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array; may contain None or np.nan as missing values | required

Returns:

Type | Description
---|---
ndarray | Standardized data with zero mean and unit variance. Special cases: an empty array returns an empty array; a single value or constant array returns zeros; an all-NaN array returns all NaN.
winsorize(data, limits=0.05)
Limit extreme values in data.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | array-like | Input data array | required
limits | float or tuple | If a float, the proportion to cut on each side. If a tuple of two floats, the proportions to cut from the lower and upper ends. | 0.05

Returns:

Type | Description
---|---
ndarray | Winsorized data
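Winsorization clips the tails rather than removing them. One common percentile-based approach is sketched below; this is an assumption about the implementation (scipy's `mstats.winsorize`, for instance, uses order statistics instead of interpolated percentiles), and the helper name is hypothetical.

```python
import numpy as np

def winsorize_sketch(data, limits=0.05):
    """Clip the lowest/highest `limits` proportion of values (sketch)."""
    arr = np.asarray(data, dtype=float)
    # A scalar limit applies to both tails; a tuple gives (lower, upper)
    lo_prop, hi_prop = (limits, limits) if np.isscalar(limits) else limits
    lo = np.nanpercentile(arr, lo_prop * 100)
    hi = np.nanpercentile(arr, 100 - hi_prop * 100)
    return np.clip(arr, lo, hi)

# Cutting 20% from each tail of 1..10 caps the values at the
# interpolated 20th and 80th percentiles
w = winsorize_sketch(np.arange(1.0, 11.0), limits=0.2)
```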