Statistics API
Statistics Module
This module provides statistical utilities for analyzing financial data.
arch_test(returns, lags=5)
Perform Engle's ARCH test for heteroskedasticity.
The ARCH test examines whether there is autoregressive conditional heteroskedasticity (ARCH) in the residuals. The null hypothesis is that there is no ARCH effect.
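As a rough illustration of the idea, the single-lag case can be written in plain Python: regress the squared residuals on their own first lag and use LM = n * R^2, which is chi-squared with 1 degree of freedom under the null. The helper name `arch_lm_one_lag` is hypothetical and this is only a sketch; it is not this module's implementation and it does not handle NaN values or more than one lag.

```python
import math


def arch_lm_one_lag(residuals):
    """Engle's LM statistic and p-value for one lag (hypothetical helper)."""
    sq = [r * r for r in residuals]       # squared residuals
    y, x = sq[1:], sq[:-1]                # regress e_t^2 on e_{t-1}^2
    n = len(y)
    if n < 3:
        return math.nan, math.nan         # too short to say anything
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return math.nan, math.nan         # constant squared residuals
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r_squared = sxy * sxy / (sxx * syy)   # R^2 of a one-regressor OLS fit
    lm = n * r_squared                    # LM statistic
    p_value = math.erfc(math.sqrt(lm / 2.0))  # chi-squared(1) upper tail
    return lm, p_value
```

A large LM value with a small p-value points to volatility clustering in the series.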
Parameters:
Name | Type | Description | Default |
---|---|---|---|
returns | array-like | Input returns or residuals array. Can contain NaN values. | required |
lags | int | Number of lags to test for ARCH effects. | 5 |
Returns:
Type | Description |
---|---|
tuple | A tuple (test_statistic, p_value). test_statistic (float): the LM test statistic; larger values indicate stronger evidence against the null hypothesis of no ARCH effects. p_value (float): the p-value for the test; small p-values (e.g., < 0.05) suggest rejecting the null hypothesis, indicating the presence of ARCH effects. Returns (np.nan, np.nan) if the data is insufficient, all values are NaN, the series is constant, or the series has zero variance. |
Raises:
Type | Description |
---|---|
ValueError | If lags is not positive. |
Warning | If sample size is less than 30. |
augmented_dickey_fuller_test(data, regression='c', max_lag=None)
Perform Augmented Dickey-Fuller test for stationarity.
The ADF test evaluates the null hypothesis that a unit root is present in a time series. The alternative hypothesis is stationarity or trend-stationarity, depending on the specified regression type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values, which are removed before calculation. | required |
regression | str | Regression type: 'c' includes a constant (test for level stationarity); 'ct' includes a constant and trend (test for trend stationarity); 'n' includes neither constant nor trend (test for zero-mean stationarity). | 'c' |
max_lag | int | Maximum lag order. If None, it is calculated using the rule max_lag = int(ceil(12 * (n/100)^0.25)), where n is the sample size. | None |
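The default lag rule quoted for max_lag is easy to check by hand. The sketch below simply transcribes the stated formula; the function name `default_max_lag` is hypothetical and not part of this module.

```python
import math


def default_max_lag(n):
    """Rule-of-thumb lag order: int(ceil(12 * (n / 100) ** 0.25))."""
    return int(math.ceil(12 * (n / 100) ** 0.25))


# The rule grows slowly with the sample size n.
for n in (50, 100, 250):
    print(n, default_max_lag(n))
```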
Returns:
Type | Description |
---|---|
tuple | A tuple (test_statistic, p_value). test_statistic (float): the ADF test statistic; more negative values indicate stronger rejection of the null hypothesis (presence of a unit root). p_value (float): the p-value for the test; small p-values (e.g., < 0.05) suggest rejecting the null hypothesis, indicating stationarity. Returns np.nan if there is insufficient data. |
Raises:
Type | Description |
---|---|
ValueError | If regression is not one of 'n', 'c', 'ct'. |
Warning | If sample size is less than 50. |
Warning | If max_lag is negative. |
Warning | If max_lag is greater than or equal to the sample size. |
Warning | If max_lag is greater than or equal to the sample size minus 2. |
autocorrelation(data, max_lag=20)
Calculate autocorrelation function (ACF) for time series data.
The autocorrelation function measures the correlation between observations at different time lags. It helps identify patterns and seasonality in time series data.
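A minimal sketch of the sample ACF in plain Python follows. It uses the common biased estimator (every lag divided by the lag-0 sum of squares) and assumes the input is non-empty with NaN values already removed; whether this module uses the same estimator is an assumption, and the name `acf` is illustrative.

```python
def acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]          # deviations from the mean
    denom = sum(d * d for d in dev)           # lag-0 sum of squares
    if denom == 0.0:
        return [float("nan")] * (max_lag + 1)  # constant series
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]
```

With this estimator the lag-0 value is exactly 1.0 for any non-constant series, matching the description of the return value.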
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | ArrayLike | Input data array. Can contain NaN values, which are removed before calculation. | required |
max_lag | int | Maximum lag to calculate autocorrelation. Must be positive and less than the length of the series after removing NaN values. | 20 |
Returns:
Type | Description |
---|---|
NDArray[float64] | Autocorrelation values for lags 0 to max_lag. The first value (lag 0) is always 1.0 for non-constant series. Values range from -1 to 1: 1.0 indicates perfect positive correlation, -1.0 indicates perfect negative correlation, and 0.0 indicates no correlation. np.nan indicates insufficient data or a constant series. |
Raises:
Type | Description |
---|---|
ValueError | If max_lag is negative. |
correlation_matrix(data, method='pearson', min_periods=1)
Calculate the correlation matrix for a 2D array.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | ArrayLike | 2D array of shape (n_samples, n_features). | required |
method | str | The correlation method to use: 'pearson', 'spearman', or 'kendall'. | 'pearson' |
min_periods | int | Minimum number of valid observations required for each pair of columns. | 1 |
Returns:
Type | Description |
---|---|
NDArray[float64] | Correlation matrix of shape (n_features, n_features). |
Raises:
Type | Description |
---|---|
ValueError | If input data is not a 2D array, or if method is not one of 'pearson', 'spearman', 'kendall'. |
covariance_matrix(data, ddof=1)
Calculate covariance matrix for multivariate data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | ArrayLike | Input data array with shape (n_samples, n_features). Can contain None or np.nan as missing values. | required |
ddof | int | Delta degrees of freedom for the covariance calculation. The divisor used in calculations is N - ddof, where N represents the number of non-missing elements. | 1 |
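To make the role of ddof concrete, here is a small plain-Python sketch of a covariance matrix with divisor N - ddof. It assumes complete data (no None or NaN entries), so it does not reproduce this module's missing-value handling; the name `covariance_matrix_sketch` is hypothetical.

```python
def covariance_matrix_sketch(rows, ddof=1):
    """Covariance of the columns of `rows`, shape (n_samples, n_features)."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_features)]
    divisor = n - ddof                      # N - ddof, as described above
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / divisor
            for j in range(n_features)
        ]
        for i in range(n_features)
    ]
```

With ddof=1 the result is the usual unbiased sample covariance; ddof=0 gives the population form.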
Returns:
Type | Description |
---|---|
NDArray[float64] | Covariance matrix with shape (n_features, n_features). For features i and j, the element [i,j] represents their covariance. The matrix is symmetric, with variances on the diagonal. |
Raises:
Type | Description |
---|---|
ValueError | If input data is not a 2D array. |
descriptive_stats(data)
Calculate descriptive statistics for data efficiently.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Must be 1-dimensional. | required |
Returns:
Type | Description |
---|---|
dict | Dictionary containing the following statistics: count (number of non-NaN values), mean (arithmetic mean), std (standard deviation with N-1 divisor), min (minimum value), q1 (first quartile, 25th percentile), median (50th percentile), q3 (third quartile, 75th percentile), max (maximum value), skewness (sample skewness), kurtosis (sample excess kurtosis). |
Raises:
Type | Description |
---|---|
ValueError | If input data is not a 1D array. |
durbin_watson_test(residuals)
Perform Durbin-Watson test for autocorrelation in regression residuals.
The Durbin-Watson test examines whether there is autocorrelation in the residuals from a regression analysis. The test statistic ranges from 0 to 4:
- Values around 2 suggest no autocorrelation
- Values < 2 suggest positive autocorrelation
- Values > 2 suggest negative autocorrelation
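The statistic itself is the sum of squared successive differences divided by the sum of squared residuals. A minimal plain-Python sketch, assuming NaN values have already been removed (the name `durbin_watson` is illustrative, not this module's function):

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e * e for e in residuals)
    return num / den if den else float("nan")
```

Residuals that flip sign at every step push the value toward 4, while slowly drifting residuals push it toward 0.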
Parameters:
Name | Type | Description | Default |
---|---|---|---|
residuals | array-like | Input residuals array. Can contain NaN values, which are removed. | required |
Returns:
Type | Description |
---|---|
float | The Durbin-Watson test statistic, ranging from 0 to 4. Returns np.nan if the data is insufficient (fewer than 2 points), all values are NaN, all values are constant (zero variance), or the sum of squared residuals is zero. |
Raises:
Type | Description |
---|---|
ValueError | If residuals is not array-like. |
Warning | If sample size is less than 30. |
granger_causality_test(x, y, max_lag=1)
Perform Granger causality test to determine if x Granger-causes y.
The Granger causality test examines whether past values of x help predict future values of y beyond what past values of y alone can predict. The null hypothesis is that x does not Granger-cause y.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | array-like | First time series (potential cause). Can contain NaN values. | required |
y | array-like | Second time series (potential effect). Can contain NaN values. | required |
max_lag | int | Maximum number of lags to include in the test. Must be positive and less than half the length of the shortest series after removing NaN values. | 1 |
Returns:
Type | Description |
---|---|
tuple | A tuple (f_statistic, p_value). f_statistic (float): the F-statistic of the test; larger values indicate stronger evidence against the null hypothesis. p_value (float): the p-value for the test; small p-values (e.g., < 0.05) suggest rejecting the null hypothesis, indicating that x Granger-causes y. Returns np.nan if there is insufficient data or a numerical issue. |
Raises:
Type | Description |
---|---|
ValueError | If max_lag is not positive. |
ValueError | If x and y have different shapes. |
Warning | If sample size is less than 30 + 2 * max_lag. |
Warning | If max_lag is negative. |
Warning | If max_lag is greater than or equal to the sample size. |
hurst_exponent(data, max_lag=None)
Calculate the Hurst exponent for time series data.
The Hurst exponent measures the long-term memory of a time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases.
Values:
- H = 0.5: Random walk (Brownian motion)
- 0 ≤ H < 0.5: Mean-reverting series (negative autocorrelation)
- 0.5 < H ≤ 1: Trending series (positive autocorrelation)
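One simple way to see these regimes is to measure how the spread of lagged differences grows with the lag: for a random walk it grows like lag^0.5, so the slope of log(spread) against log(lag) estimates H. The sketch below uses that scaling estimator in plain Python. This is an assumption chosen for illustration; the source does not say which estimator this module uses, and `hurst_sketch` is a hypothetical name. It assumes a non-constant series longer than max_lag.

```python
import math
import random


def hurst_sketch(series, max_lag=20):
    """Estimate H from the scaling of lagged-difference spread with lag."""
    log_lags, log_spreads = [], []
    for lag in range(2, max_lag + 1):
        diffs = [series[t + lag] - series[t] for t in range(len(series) - lag)]
        mean = sum(diffs) / len(diffs)
        spread = math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))
        log_lags.append(math.log(lag))
        log_spreads.append(math.log(spread))
    # Least-squares slope of log(spread) on log(lag) is the estimate of H.
    mx = sum(log_lags) / len(log_lags)
    my = sum(log_spreads) / len(log_spreads)
    num = sum((x - mx) * (y - my) for x, y in zip(log_lags, log_spreads))
    den = sum((x - mx) ** 2 for x in log_lags)
    return num / den


# A seeded Gaussian random walk should give an estimate near 0.5.
random.seed(0)
walk = [0.0]
for _ in range(5000):
    walk.append(walk[-1] + random.gauss(0.0, 1.0))
print(round(hurst_sketch(walk), 2))
```

Applied to white noise instead of its cumulative sum, the same estimator returns a value near 0, the mean-reverting end of the scale.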
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values, which are removed. | required |
max_lag | int | Maximum lag to use in calculation. If None, uses n/4, where n is the sample size after removing NaN values. | None |
Returns:
Type | Description |
---|---|
float | The Hurst exponent, a value between 0 and 1. Returns np.nan if there is insufficient data or a numerical issue. |
Raises:
Type | Description |
---|---|
ValueError | If max_lag is negative. |
Warning | If sample size is less than 10. |
Warning | If sample size is less than 100. |
jarque_bera_test(data)
Perform Jarque-Bera test for normality.
The Jarque-Bera test is a goodness-of-fit test that determines whether sample data have the skewness and kurtosis matching a normal distribution. The test statistic is always non-negative, with a larger value indicating a greater deviation from normality.
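For reference, the textbook statistic is JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K the sample excess kurtosis; with 2 degrees of freedom the chi-squared upper tail has the closed form exp(-JB/2). A plain-Python sketch under those definitions (population moments, NaN values assumed removed; `jarque_bera_sketch` is a hypothetical name, not this module's function):

```python
import math


def jarque_bera_sketch(data):
    """Return (JB statistic, p-value) using central moments with divisor n."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    if m2 == 0.0:
        return math.nan, math.nan          # constant series
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)          # chi-squared(2) upper tail
    return jb, p_value
```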
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values, which are removed before calculation. | required |
Returns:
Type | Description |
---|---|
tuple | A tuple (test_statistic, p_value). test_statistic (float): the JB test statistic; a value close to 0 indicates normality. The statistic follows a chi-squared distribution with 2 degrees of freedom under the null hypothesis of normality. p_value (float): the p-value for the test; a small p-value (e.g., < 0.05) suggests rejecting normality, and values close to 1 suggest normality. Returns np.nan if there is insufficient data. |
Raises:
Type | Description |
---|---|
Warning | If sample size is less than 3. |
Warning | If max_lag is negative. |
Warning | If max_lag is greater than or equal to the sample size. |
kolmogorov_smirnov_test(data, dist='norm', params=None)
Perform Kolmogorov-Smirnov test for distribution fitting.
The KS test examines whether a sample comes from a specified continuous distribution. The null hypothesis is that the sample is drawn from the reference distribution.
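The KS statistic is the largest vertical gap between the empirical CDF of the sample and the reference CDF. The sketch below computes it against a normal distribution using only the standard library; it omits the p-value, and the name `ks_statistic_normal` and its loc/scale arguments are illustrative assumptions rather than this module's API.

```python
from statistics import NormalDist


def ks_statistic_normal(data, loc=0.0, scale=1.0):
    """One-sample KS statistic against Normal(loc, scale)."""
    cdf = NormalDist(mu=loc, sigma=scale).cdf
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```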
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values, which are removed. | required |
dist | str | The reference distribution to test against: 'norm' (normal distribution), 'uniform' (uniform distribution), or 'expon' (exponential distribution). | 'norm' |
params | dict | Parameters for the reference distribution. If None, they are estimated from the data. For 'norm': {'loc': mean, 'scale': std}; for 'uniform': {'loc': min, 'scale': max-min}; for 'expon': {'loc': min, 'scale': mean}. | None |
Returns:
Type | Description |
---|---|
tuple | A tuple (test_statistic, p_value). test_statistic (float): the KS test statistic; larger values indicate stronger evidence against the null hypothesis. p_value (float): the p-value for the test; small p-values (e.g., < 0.05) suggest rejecting the null hypothesis, indicating that the data does not follow the specified distribution. Returns np.nan if there is insufficient data. |
Raises:
Type | Description |
---|---|
ValueError | If dist is not one of 'norm', 'uniform', 'expon'. |
Warning | If sample size is less than 3. |
Warning | If sample size is less than 30. |
kpss_test(data, regression='c', lags=None)
Perform KPSS test for stationarity.
The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test evaluates the null hypothesis that a time series is stationary around a constant level or a deterministic trend, depending on the regression type. It complements the ADF test because its null hypothesis is stationarity, whereas the ADF null hypothesis is a unit root.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values, which are removed before calculation. | required |
regression | str | The null hypothesis: 'c' means the series is stationary around a constant (level); 'ct' means the series is stationary around a trend. | 'c' |
lags | int | Number of lags to use for the Newey-West estimator. If None, uses automatic selection based on Schwert's rule: [12 * (n/100)^(1/4)]. | None |
Returns:
Type | Description |
---|---|
tuple | A tuple (test_statistic, p_value). test_statistic (float): the KPSS test statistic; larger values indicate stronger evidence against the null hypothesis of stationarity. p_value (float): the p-value for the test; small p-values (e.g., < 0.05) suggest rejecting the null hypothesis, indicating non-stationarity. Returns np.nan if there is insufficient data. |
Raises:
Type | Description |
---|---|
ValueError | If regression is not one of 'c', 'ct'. |
Warning | If sample size is less than 3. |
Warning | If sample size is less than 30. |
Warning | If max_lag is negative. |
Warning | If max_lag is greater than or equal to the sample size. |
ljung_box_test(data, lags=10, boxpierce=False)
Perform Ljung-Box test for autocorrelation in time series residuals.
The Ljung-Box test examines whether there is significant autocorrelation in the residuals of a time series. The null hypothesis is that the data is independently distributed (no autocorrelation). The alternative hypothesis is that the data exhibits serial correlation.
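Both Q-statistics are built from the squared sample autocorrelations: Box-Pierce uses n * sum(rho_k^2), while Ljung-Box uses n * (n + 2) * sum(rho_k^2 / (n - k)). A plain-Python sketch of the statistic only (no p-value, NaN values assumed removed, series assumed non-constant; `q_statistic` is a hypothetical name):

```python
def q_statistic(data, lags=10, boxpierce=False):
    """Ljung-Box (default) or Box-Pierce Q-statistic over lags 1..lags."""
    n = len(data)
    mean = sum(data) / n
    dev = [x - mean for x in data]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, lags + 1):
        rho_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        if boxpierce:
            total += rho_k ** 2              # unweighted squared ACF
        else:
            total += rho_k ** 2 / (n - k)    # Ljung-Box small-sample weight
    return n * total if boxpierce else n * (n + 2) * total
```

Under the null hypothesis either statistic is compared against a chi-squared distribution with `lags` degrees of freedom.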
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array (typically residuals). Can contain NaN values. | required |
lags | int | Number of lags to test. Must be positive and less than the sample size. | 10 |
boxpierce | bool | If True, compute the Box-Pierce statistic instead of the Ljung-Box statistic. The Box-Pierce statistic is a simpler version but is less powerful for small samples. | False |
Returns:
Type | Description |
---|---|
tuple | A tuple (test_statistic, p_value). test_statistic (float): the Q-statistic (Ljung-Box or Box-Pierce); larger values indicate stronger evidence against the null hypothesis of no autocorrelation. p_value (float): the p-value for the test; small p-values (e.g., < 0.05) suggest rejecting the null hypothesis, indicating the presence of autocorrelation. Returns np.nan if there is insufficient data or a numerical issue. |
Raises:
Type | Description |
---|---|
ValueError | If lags is not positive. |
Warning | If sample size is less than 3 times the number of lags. |
Warning | If max_lag is negative. |
Warning | If max_lag is greater than or equal to the sample size. |
partial_autocorrelation(data, max_lag=20)
Calculate partial autocorrelation function (PACF) for time series data.
The partial autocorrelation function measures the correlation between observations at different time lags after removing the effects of intermediate observations. It is particularly useful for identifying the order of an AR(p) process.
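One standard way to obtain the PACF from the ACF is the Durbin-Levinson recursion, sketched below in plain Python. Whether this module uses this recursion or a regression-based estimate is not stated in the source, so treat it as an illustration; `pacf_sketch` is a hypothetical name and the input is assumed to be non-constant with NaN values removed.

```python
def pacf_sketch(data, max_lag):
    """Partial autocorrelations for lags 0..max_lag via Durbin-Levinson."""
    n = len(data)
    mean = sum(data) / n
    dev = [x - mean for x in data]
    denom = sum(d * d for d in dev)
    rho = [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

    pacf = [1.0]
    prev = []  # AR coefficients of order k-1: phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf
```

The lag-1 partial autocorrelation always equals the lag-1 autocorrelation; the two functions only diverge from lag 2 onward.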
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values, which are removed before calculation. | required |
max_lag | int | Maximum lag to calculate partial autocorrelation. Must be positive and less than the length of the series after removing NaN values. | 20 |
Returns:
Type | Description |
---|---|
ndarray | Partial autocorrelation values for lags 0 to max_lag. Values range from -1 to 1: values close to ±1 indicate strong partial correlation, and values close to 0 indicate weak partial correlation. np.nan indicates insufficient data or a constant series. |
Raises:
Type | Description |
---|---|
ValueError | If max_lag is negative. |
rolling_statistics(data, window, statistics=['mean', 'std'])
Calculate rolling statistics for time series data.
Computes various statistics over a rolling window of specified size. Missing values (NaN) at the start of the output array correspond to the first window-1 observations.
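The output alignment described above (same length as the input, with the first window-1 positions set to NaN) can be sketched for the rolling mean in plain Python. The name `rolling_mean` is illustrative and the sketch assumes the input itself contains no NaN values.

```python
import math


def rolling_mean(data, window):
    """Rolling mean aligned to the end of each window; NaN until it fills."""
    out = [math.nan] * len(data)
    for end in range(window, len(data) + 1):
        chunk = data[end - window:end]       # the `window` most recent values
        out[end - 1] = sum(chunk) / window
    return out
```

If the series is shorter than the window, every position stays NaN.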
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values. | required |
window | int | Size of the rolling window. Must be positive. | required |
statistics | list of str | List of statistics to compute. Options: 'mean' (rolling mean), 'std' (rolling standard deviation), 'min' (rolling minimum), 'max' (rolling maximum), 'median' (rolling median), 'skew' (rolling skewness), 'kurt' (rolling kurtosis). | ['mean', 'std'] |
Returns:
Type | Description |
---|---|
dict | Dictionary with statistic names as keys and numpy arrays as values. Each array has the same length as the input data, with the first window-1 elements being NaN. |
Raises:
Type | Description |
---|---|
ValueError | If window is not positive. |
ValueError | If statistics is not a list. |
ValueError | If any statistic in statistics is not in ['mean', 'std', 'min', 'max', 'median', 'skew', 'kurt']. |
Warning | If sample size is less than window. |
variance_ratio_test(data, periods=None, robust=True)
Perform Variance Ratio test for random walk hypothesis.
The Variance Ratio test examines whether a time series follows a random walk by comparing variances at different sampling intervals. The null hypothesis is that the series follows a random walk.
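The quantity behind the test is VR(q) = Var(q-period log return) / (q * Var(1-period log return)), which is close to 1 for a random walk. The plain-Python sketch below computes only this ratio from overlapping returns, without bias corrections or the robust test statistic; it assumes strictly positive, non-constant prices, and `variance_ratio` is a hypothetical name rather than this module's function.

```python
import math


def variance_ratio(prices, q):
    """Ratio of q-period to q times 1-period log-return variance."""
    logp = [math.log(p) for p in prices]
    n = len(logp) - 1                       # number of one-period returns
    mu = (logp[-1] - logp[0]) / n           # mean one-period log return
    var_1 = sum((logp[t] - logp[t - 1] - mu) ** 2 for t in range(1, n + 1)) / n
    m = n - q + 1                           # number of overlapping q-period returns
    var_q = sum((logp[t] - logp[t - q] - q * mu) ** 2 for t in range(q, n + 1)) / m
    return var_q / (q * var_1)
```

A ratio well below 1 suggests mean reversion, and a ratio well above 1 suggests trending behavior.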
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | array-like | Input data array. Can contain NaN values, which are removed before calculation. Must be strictly positive for log returns calculation. | required |
periods | list of int | List of periods for variance ratio calculations. | [2, 4, 8, 16] |
robust | bool | If True, use heteroskedasticity-robust standard errors. | True |
Returns:
Type | Description |
---|---|
dict | Dictionary with periods as keys and tuples of (test_statistic, p_value) as values. test_statistic (float): the VR test statistic; values far from 1 indicate deviation from a random walk. p_value (float): the p-value for the test; small p-values (e.g., < 0.05) suggest rejecting the null hypothesis of a random walk. Returns (0, 1) for constant series (perfect random walk). |
Raises:
Type | Description |
---|---|
ValueError | If periods is None. |
ValueError | If data is not strictly positive for log returns calculation. |
ValueError | If all periods are not positive integers. |
ValueError | If any period is larger than the sample size. |