Statistics API

Statistics Module

This module provides statistical utilities for financial data analysis.

arch_test(returns, lags=5)

Perform Engle's ARCH test for heteroskedasticity.

The ARCH test examines whether there is autoregressive conditional heteroskedasticity (ARCH) in the residuals. The null hypothesis is that there is no ARCH effect.

Parameters:

Name | Type | Description | Default
returns | array-like | Input returns or residuals array. Can contain NaN values. | required
lags | int | Number of lags to test for ARCH effects. | 5

Returns:

Type Description
tuple

A tuple containing:
- test_statistic (float): The LM test statistic. Larger values indicate stronger evidence against the null hypothesis of no ARCH effects.
- p_value (float): The p-value for the test. Small p-values (e.g., < 0.05) suggest rejection of the null hypothesis, indicating presence of ARCH effects.

Returns (np.nan, np.nan) if:
- Insufficient data
- All values are NaN
- Constant series
- Zero variance series

Raises:

Type Description
ValueError: If lags is not positive
Warning: If sample size is less than 30

augmented_dickey_fuller_test(data, regression='c', max_lag=None)

Perform Augmented Dickey-Fuller test for stationarity.

The ADF test evaluates the null hypothesis that a unit root is present in a time series. The alternative hypothesis is stationarity or trend-stationarity, depending on the specified regression type.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values which will be removed before calculation. | required
regression | str | Regression type: 'c' includes a constant (test for level stationarity); 'ct' includes a constant and trend (test for trend stationarity); 'n' includes no constant or trend (test for zero-mean stationarity). | 'c'
max_lag | int | Maximum lag order. If None, it is calculated using the rule max_lag = int(ceil(12 * (n/100)^0.25)), where n is the sample size. | None

Returns:

Type Description
tuple

A tuple containing:
- test_statistic (float): The ADF test statistic. More negative values indicate stronger rejection of the null hypothesis (presence of a unit root).
- p_value (float): The p-value for the test. Small p-values (e.g., < 0.05) suggest rejection of the null hypothesis, indicating stationarity.

Returns np.nan if insufficient data.

Raises:

Type Description
ValueError: If regression is not one of 'n', 'c', 'ct'
Warning: If sample size is less than 50
Warning: If max_lag is negative
Warning: If max_lag is greater than or equal to the sample size
Warning: If max_lag is greater than or equal to the sample size minus 2
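
The default lag rule quoted for max_lag can be checked by hand. The following is a minimal sketch of that formula only; the helper name schwert_max_lag is hypothetical and is not part of this module.

```python
import math


def schwert_max_lag(n: int) -> int:
    """Default lag rule from the parameter description:
    int(ceil(12 * (n / 100) ** 0.25)), where n is the sample size."""
    if n <= 0:
        raise ValueError("n must be positive")
    return int(math.ceil(12 * (n / 100) ** 0.25))


# Show how slowly the default lag order grows with the sample size.
for n in (50, 100, 250, 1000):
    print(n, schwert_max_lag(n))
```

Because the rule grows with the fourth root of n, a tenfold increase in sample size only roughly doubles the default lag order.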

autocorrelation(data, max_lag=20)

Calculate autocorrelation function (ACF) for time series data.

The autocorrelation function measures the correlation between observations at different time lags. It helps identify patterns and seasonality in time series data.

Parameters:

Name | Type | Description | Default
data | ArrayLike | Input data array. Can contain NaN values which will be removed before calculation. | required
max_lag | int | Maximum lag to calculate autocorrelation. Must be positive and less than the length of the series after removing NaN values. | 20

Returns:

Type Description
NDArray[float64]

Autocorrelation values for lags 0 to max_lag. The first value (lag 0) is always 1.0 for non-constant series. Values range from -1 to 1, where:
- 1.0 indicates perfect positive correlation
- -1.0 indicates perfect negative correlation
- 0.0 indicates no correlation
- np.nan indicates insufficient data or constant series

Raises:

Type Description
ValueError: If max_lag is negative
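
To make the return values concrete, here is a pure-Python sketch of one common sample ACF estimator (lagged cross-products of deviations divided by the total sum of squared deviations). It is an assumption that this matches the module's estimator; the name acf_sketch is hypothetical.

```python
import math


def acf_sketch(data, max_lag=20):
    """Sample autocorrelation for lags 0..max_lag, with NaN values dropped first."""
    if max_lag < 0:
        raise ValueError("max_lag must be non-negative")
    x = [v for v in data if not math.isnan(v)]
    n = len(x)
    # Not enough observations for the requested lags.
    if n < 2 or n <= max_lag:
        return [math.nan] * (max_lag + 1)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    # A constant series has zero variance, so the ACF is undefined.
    if denom == 0.0:
        return [math.nan] * (max_lag + 1)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


print(acf_sketch([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2))
```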

correlation_matrix(data, method='pearson', min_periods=1)

Calculate the correlation matrix for a 2D array.

Parameters:

Name | Type | Description | Default
data | ArrayLike | 2D array of shape (n_samples, n_features). | required
method | str | The correlation method to use ('pearson', 'spearman', or 'kendall'). Default is 'pearson'. | 'pearson'
min_periods | int | Minimum number of valid observations required for each pair of columns. Default is 1. | 1

Returns:

Type Description
NDArray[float64]

Correlation matrix of shape (n_features, n_features)

Raises:

Type Description
ValueError: If input data is not a 2D array or method is not one of 'pearson', 'spearman', 'kendall'

covariance_matrix(data, ddof=1)

Calculate covariance matrix for multivariate data.

Parameters:

Name | Type | Description | Default
data | ArrayLike | Input data array with shape (n_samples, n_features). Can contain None or np.nan as missing values. | required
ddof | int | Delta degrees of freedom for the covariance calculation. The divisor used in calculations is N - ddof, where N represents the number of non-missing elements. | 1

Returns:

Type Description
NDArray[float64]

Covariance matrix with shape (n_features, n_features). For features i and j, the element [i,j] represents their covariance. The matrix is symmetric, with variances on the diagonal.

Raises:

Type Description
ValueError: If input data is not a 2D array
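
The effect of ddof on the divisor N - ddof can be seen in a small pure-Python sketch. It assumes complete data (no missing values), so it does not reproduce however this module treats None or np.nan; the name covariance_matrix_sketch is hypothetical.

```python
def covariance_matrix_sketch(rows, ddof=1):
    """Covariance matrix for complete data of shape (n_samples, n_features)."""
    n = len(rows)
    p = len(rows[0])
    if n - ddof <= 0:
        raise ValueError("need more samples than ddof")
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            s = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            # The divisor is N - ddof; fill both halves since the matrix is symmetric.
            cov[i][j] = cov[j][i] = s / (n - ddof)
    return cov


rows = [[1, 2], [2, 4], [3, 6]]
print(covariance_matrix_sketch(rows, ddof=1))  # sample covariance (divisor N - 1)
print(covariance_matrix_sketch(rows, ddof=0))  # population covariance (divisor N)
```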

descriptive_stats(data)

Calculate descriptive statistics for a 1-dimensional data array.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Must be 1-dimensional. | required

Returns:

Type Description
dict

Dictionary containing the following statistics:
- count: Number of non-NaN values
- mean: Arithmetic mean
- std: Standard deviation (N-1)
- min: Minimum value
- q1: First quartile (25th percentile)
- median: Median (50th percentile)
- q3: Third quartile (75th percentile)
- max: Maximum value
- skewness: Sample skewness
- kurtosis: Sample excess kurtosis

Raises:

Type Description
ValueError: If input data is not a 1D array

durbin_watson_test(residuals)

Perform Durbin-Watson test for autocorrelation in regression residuals.

The Durbin-Watson test examines whether there is autocorrelation in the residuals from a regression analysis. The test statistic ranges from 0 to 4:
- Values around 2 suggest no autocorrelation
- Values < 2 suggest positive autocorrelation
- Values > 2 suggest negative autocorrelation

Parameters:

Name | Type | Description | Default
residuals | array-like | Input residuals array. Can contain NaN values which will be removed. | required

Returns:

Type Description
float

The Durbin-Watson test statistic, ranging from 0 to 4. Returns np.nan if:
- Insufficient data (less than 2 points)
- All values are NaN
- All values are constant (zero variance)
- Sum of squared residuals is zero

Raises:

Type Description
ValueError: If residuals is not array-like
Warning: If sample size is less than 30
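
The statistic itself is the sum of squared successive differences divided by the sum of squared residuals. A minimal pure-Python sketch, mirroring the np.nan cases listed above with math.nan (the name durbin_watson_sketch is hypothetical, and warnings are omitted):

```python
import math


def durbin_watson_sketch(residuals):
    """Durbin-Watson statistic: sum((e[t] - e[t-1])^2) / sum(e[t]^2)."""
    e = [v for v in residuals if not math.isnan(v)]
    if len(e) < 2:
        return math.nan  # insufficient data (also covers all-NaN input)
    if max(e) == min(e):
        return math.nan  # constant series
    ss = sum(v * v for v in e)
    if ss == 0.0:
        return math.nan  # zero sum of squared residuals
    return sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / ss


# Sign-alternating residuals: negative autocorrelation, statistic above 2.
print(durbin_watson_sketch([1.0, -1.0, 1.0, -1.0]))
# Residuals that change sign slowly: positive autocorrelation, statistic below 2.
print(durbin_watson_sketch([1.0, 1.0, -1.0, -1.0]))
```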

granger_causality_test(x, y, max_lag=1)

Perform Granger causality test to determine if x Granger-causes y.

The Granger causality test examines whether past values of x help predict future values of y beyond what past values of y alone can predict. The null hypothesis is that x does not Granger-cause y.

Parameters:

Name | Type | Description | Default
x | array-like | First time series (potential cause). Can contain NaN values. | required
y | array-like | Second time series (potential effect). Can contain NaN values. | required
max_lag | int | Maximum number of lags to include in the test. Must be positive and less than half the length of the shortest series after removing NaN values. | 1

Returns:

Type Description
tuple

A tuple containing:
- f_statistic (float): The F-statistic of the test. Larger values indicate stronger evidence against the null hypothesis.
- p_value (float): The p-value for the test. Small p-values (e.g., < 0.05) suggest rejection of the null hypothesis, indicating x Granger-causes y.

Returns np.nan if insufficient data or numerical issues.

Raises:

Type Description
ValueError: If max_lag is not positive
ValueError: If x and y have different shapes
Warning: If sample size is less than 30 + 2 * max_lag
Warning: If max_lag is negative
Warning: If max_lag is greater than or equal to the sample size

hurst_exponent(data, max_lag=None)

Calculate the Hurst exponent for time series data.

The Hurst exponent measures the long-term memory of a time series. It relates to the autocorrelations of the time series, and the rate at which these decrease as the lag between pairs of values increases.

Values:
- H = 0.5: Random walk (Brownian motion)
- 0 ≤ H < 0.5: Mean-reverting series (negative autocorrelation)
- 0.5 < H ≤ 1: Trending series (positive autocorrelation)

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values which will be removed. | required
max_lag | int | Maximum lag to use in calculation. If None, uses n/4 where n is the sample size after removing NaN values. | None

Returns:

Type Description
float

The Hurst exponent, a value between 0 and 1. Returns np.nan if insufficient data or numerical issues.

Raises:

Type Description
ValueError: If max_lag is negative
Warning: If sample size is less than 10
Warning: If sample size is less than 100

jarque_bera_test(data)

Perform Jarque-Bera test for normality.

The Jarque-Bera test is a goodness-of-fit test that determines whether sample data have the skewness and kurtosis matching a normal distribution. The test statistic is always non-negative, with a larger value indicating a greater deviation from normality.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values which will be removed before calculation. | required

Returns:

Type Description
tuple

A tuple containing:
- test_statistic (float): The JB test statistic. A value close to 0 indicates normality. The statistic follows a chi-squared distribution with 2 degrees of freedom under the null hypothesis of normality.
- p_value (float): The p-value for the test. A small p-value (e.g., < 0.05) suggests rejection of normality. Values close to 1 suggest normality.

Returns np.nan if insufficient data.

Raises:

Type Description
Warning: If sample size is less than 3
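
As an illustration of how skewness and kurtosis enter the statistic, the sketch below uses the textbook form JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K the excess kurtosis computed from biased (divide-by-n) moments. Whether this module uses exactly these moment estimators is an assumption; the name jarque_bera_sketch is hypothetical. The p-value uses the fact that the chi-squared survival function with 2 degrees of freedom is exp(-x/2).

```python
import math


def jarque_bera_sketch(data):
    """Return (JB statistic, p-value) using biased sample moments."""
    x = [v for v in data if not math.isnan(v)]
    n = len(x)
    if n < 3:
        return (math.nan, math.nan)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    if m2 == 0.0:
        return (math.nan, math.nan)  # constant series
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Chi-squared with 2 degrees of freedom: P(X > jb) = exp(-jb / 2).
    return (jb, math.exp(-jb / 2.0))


print(jarque_bera_sketch([-2.0, -1.0, 0.0, 1.0, 2.0]))
```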

kolmogorov_smirnov_test(data, dist='norm', params=None)

Perform Kolmogorov-Smirnov test for distribution fitting.

The KS test examines whether a sample comes from a specified continuous distribution. The null hypothesis is that the sample is drawn from the reference distribution.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values which will be removed. | required
dist | str | The reference distribution to test against. Options: 'norm' (normal distribution), 'uniform' (uniform distribution), 'expon' (exponential distribution). | 'norm'
params | dict | Parameters for the reference distribution. If None, estimated from data. For 'norm': {'loc': mean, 'scale': std}. For 'uniform': {'loc': min, 'scale': max-min}. For 'expon': {'loc': min, 'scale': mean}. | None

Returns:

Type Description
tuple

A tuple containing:
- test_statistic (float): The KS test statistic. Larger values indicate stronger evidence against the null hypothesis.
- p_value (float): The p-value for the test. Small p-values (e.g., < 0.05) suggest rejection of the null hypothesis, indicating the data does not follow the specified distribution.

Returns np.nan if insufficient data.

Raises:

Type Description
ValueError: If dist is not one of 'norm', 'uniform', 'expon'
Warning: If sample size is less than 3
Warning: If sample size is less than 30
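
The KS statistic is the largest vertical gap between the empirical CDF of the sample and the reference CDF. The sketch below computes it for the 'norm' case with explicitly supplied loc and scale; it does not estimate parameters from the data or compute a p-value, and the name ks_statistic_norm is hypothetical.

```python
import math


def ks_statistic_norm(data, loc=0.0, scale=1.0):
    """One-sample KS statistic against a normal(loc, scale) reference."""
    x = sorted(v for v in data if not math.isnan(v))
    n = len(x)
    if n == 0:
        return math.nan

    def cdf(v):
        # Normal CDF expressed through the error function.
        return 0.5 * (1.0 + math.erf((v - loc) / (scale * math.sqrt(2.0))))

    d = 0.0
    for i, v in enumerate(x, start=1):
        f = cdf(v)
        # Compare the reference CDF with the empirical CDF just after
        # and just before the jump at the i-th ordered observation.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


print(ks_statistic_norm([-1.2, -0.4, 0.1, 0.5, 1.3]))
```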

kpss_test(data, regression='c', lags=None)

Perform KPSS test for stationarity.

The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test evaluates the null hypothesis that a time series is stationary around a constant level or a deterministic trend, depending on the regression argument. This test complements the ADF test, as the null hypothesis is stationarity (opposite to ADF).

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values which will be removed before calculation. | required
regression | str | The null hypothesis: 'c' means the series is stationary around a constant (level); 'ct' means the series is stationary around a trend. | 'c'
lags | int | Number of lags to use for the Newey-West estimator. If None, uses automatic selection based on Schwert's rule: [12 * (n/100)^(1/4)]. | None

Returns:

Type Description
tuple

A tuple containing:
- test_statistic (float): The KPSS test statistic. Larger values indicate stronger evidence against the null hypothesis of stationarity.
- p_value (float): The p-value for the test. Small p-values (e.g., < 0.05) suggest rejection of the null hypothesis, indicating non-stationarity.

Returns np.nan if insufficient data.

Raises:

Type Description
ValueError: If regression is not one of 'c', 'ct'
Warning: If sample size is less than 3
Warning: If sample size is less than 30
Warning: If max_lag is negative
Warning: If max_lag is greater than or equal to the sample size

ljung_box_test(data, lags=10, boxpierce=False)

Perform Ljung-Box test for autocorrelation in time series residuals.

The Ljung-Box test examines whether there is significant autocorrelation in the residuals of a time series. The null hypothesis is that the data is independently distributed (no autocorrelation). The alternative hypothesis is that the data exhibits serial correlation.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array (typically residuals). Can contain NaN values. | required
lags | int | Number of lags to test. Must be positive and less than the sample size. | 10
boxpierce | bool | If True, compute the Box-Pierce statistic instead of the Ljung-Box statistic. The Box-Pierce statistic is a simpler version but is less powerful for small samples. | False

Returns:

Type Description
tuple

A tuple containing:
- test_statistic (float): The Q-statistic (Ljung-Box or Box-Pierce). Larger values indicate stronger evidence against the null hypothesis of no autocorrelation.
- p_value (float): The p-value for the test. Small p-values (e.g., < 0.05) suggest rejection of the null hypothesis, indicating presence of autocorrelation.

Returns np.nan if insufficient data or numerical issues.

Raises:

Type Description
ValueError: If lags is not positive
Warning: If sample size is less than 3 times the number of lags
Warning: If max_lag is negative
Warning: If max_lag is greater than or equal to the sample size
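
The difference between the two Q-statistics is only in how each squared autocorrelation is weighted: Box-Pierce uses n * sum(r_k^2), while Ljung-Box uses n * (n + 2) * sum(r_k^2 / (n - k)). The pure-Python sketch below computes the statistic only; the p-value (a chi-squared tail probability) is left out, and the name q_statistic_sketch is hypothetical.

```python
import math


def q_statistic_sketch(data, lags=10, boxpierce=False):
    """Ljung-Box (default) or Box-Pierce Q-statistic over lags 1..lags."""
    if lags <= 0:
        raise ValueError("lags must be positive")
    x = [v for v in data if not math.isnan(v)]
    n = len(x)
    if n <= lags:
        return math.nan  # insufficient data
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    if denom == 0.0:
        return math.nan  # constant series
    total = 0.0
    for k in range(1, lags + 1):
        r_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        # Ljung-Box rescales each term by 1 / (n - k).
        total += r_k ** 2 if boxpierce else r_k ** 2 / (n - k)
    return n * total if boxpierce else n * (n + 2) * total


series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(q_statistic_sketch(series, lags=2))
print(q_statistic_sketch(series, lags=2, boxpierce=True))
```

For the same data the Ljung-Box value is always at least as large as the Box-Pierce value, which is the small-sample correction at work.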

partial_autocorrelation(data, max_lag=20)

Calculate partial autocorrelation function (PACF) for time series data.

The partial autocorrelation function measures the correlation between observations at different time lags after removing the effects of intermediate observations. It is particularly useful for identifying the order of an AR(p) process.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values which will be removed before calculation. | required
max_lag | int | Maximum lag to calculate partial autocorrelation. Must be positive and less than the length of the series after removing NaN values. | 20

Returns:

Type Description
ndarray

Partial autocorrelation values for lags 0 to max_lag. Values range from -1 to 1, where:
- Values close to ±1 indicate strong partial correlation
- Values close to 0 indicate weak partial correlation
- np.nan indicates insufficient data or constant series

Raises:

Type Description
ValueError: If max_lag is negative
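
One standard way to obtain partial autocorrelations is the Durbin-Levinson recursion applied to the autocorrelations. Whether this module uses that recursion is not stated here, so the sketch below is an illustration of the technique only; the name pacf_from_acf is hypothetical, and the lag-0 value of 1.0 is a convention assumed for the example.

```python
def pacf_from_acf(rho):
    """Durbin-Levinson recursion.

    rho: autocorrelations for lags 0..m with rho[0] == 1.
    Returns partial autocorrelations for lags 0..m.
    """
    pacf = [1.0]
    phi_prev = []  # AR coefficients of the order-(k-1) fit
    for k in range(1, len(rho)):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.5: rho_k = 0.5 ** k.
print(pacf_from_acf([0.5 ** k for k in range(4)]))
```

For an AR(1) autocorrelation sequence the partial autocorrelations vanish beyond lag 1, which is the cut-off pattern used to identify the order of an AR(p) process.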

rolling_statistics(data, window, statistics=['mean', 'std'])

Calculate rolling statistics for time series data.

Computes various statistics over a rolling window of specified size. Missing values (NaN) at the start of the output array correspond to the first window-1 observations.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values. | required
window | int | Size of the rolling window. Must be positive. | required
statistics | list of str | List of statistics to compute. Options: 'mean' (rolling mean), 'std' (rolling standard deviation), 'min' (rolling minimum), 'max' (rolling maximum), 'median' (rolling median), 'skew' (rolling skewness), 'kurt' (rolling kurtosis). | ['mean', 'std']

Returns:

Type Description
dict

Dictionary with statistic names as keys and numpy arrays as values. Each array has the same length as the input data, with the first window-1 elements being NaN.

Raises:

Type Description
ValueError: If window is not positive
ValueError: If statistics is not a list
ValueError: If any statistic in statistics is not in ['mean', 'std', 'min', 'max', 'median', 'skew', 'kurt']
Warning: If sample size is less than window
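
The output layout (one array per statistic, same length as the input, first window-1 entries NaN) can be sketched in pure Python. This example covers only a subset of the statistics, assumes an N-1 divisor for 'std', and does not reproduce this module's NaN handling inside a window; the name rolling_sketch and its names argument are hypothetical.

```python
import math
import statistics


def rolling_sketch(data, window, names=("mean", "std")):
    """Rolling statistics; entries before the first full window are NaN."""
    if window <= 0:
        raise ValueError("window must be positive")
    funcs = {
        "mean": statistics.fmean,
        # Assumed sample standard deviation (N-1); undefined for a single point.
        "std": lambda c: statistics.stdev(c) if len(c) > 1 else math.nan,
        "min": min,
        "max": max,
        "median": statistics.median,
    }
    unknown = [s for s in names if s not in funcs]
    if unknown:
        raise ValueError(f"unsupported statistics: {unknown}")
    n = len(data)
    out = {name: [math.nan] * n for name in names}
    for end in range(window, n + 1):
        chunk = data[end - window:end]
        for name in names:
            out[name][end - 1] = funcs[name](chunk)
    return out


result = rolling_sketch([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(result["mean"])
print(result["std"])
```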

variance_ratio_test(data, periods=None, robust=True)

Perform Variance Ratio test for random walk hypothesis.

The Variance Ratio test examines whether a time series follows a random walk by comparing variances at different sampling intervals. The null hypothesis is that the series follows a random walk.

Parameters:

Name | Type | Description | Default
data | array-like | Input data array. Can contain NaN values which will be removed before calculation. Must be strictly positive for log returns calculation. | required
periods | list of int | List of periods for variance ratio calculations. | [2, 4, 8, 16]
robust | bool | If True, use heteroskedasticity-robust standard errors. | True

Returns:

Type Description
dict

Dictionary with periods as keys and tuples of (test_statistic, p_value) as values:
- test_statistic (float): The VR test statistic. Values far from 1 indicate deviation from random walk.
- p_value (float): The p-value for the test. Small p-values (e.g., < 0.05) suggest rejection of the null hypothesis of random walk.

Returns (0, 1) for constant series (perfect random walk).

Raises:

Type Description
ValueError: If periods is None
ValueError: If data is not strictly positive for log returns calculation
ValueError: If any period is not a positive integer
ValueError: If any period is larger than the sample size
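
The idea of comparing variances at different sampling intervals can be sketched as the ratio of the variance of overlapping q-period log returns to q times the variance of 1-period log returns. This simplified version has no small-sample bias correction, no robust standard errors and no p-value, and it assumes NaN-free input, so it is not a reproduction of this module's test; the name variance_ratio_sketch is hypothetical.

```python
import math


def variance_ratio_sketch(prices, q):
    """Simplified variance ratio VR(q) from strictly positive prices."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be strictly positive")
    logp = [math.log(p) for p in prices]
    n = len(logp) - 1  # number of 1-period log returns
    if q <= 0 or q > n:
        raise ValueError("q must be between 1 and the number of returns")
    mu = (logp[-1] - logp[0]) / n  # mean 1-period log return
    var1 = sum((logp[t] - logp[t - 1] - mu) ** 2 for t in range(1, n + 1)) / n
    if var1 == 0.0:
        return math.nan  # constant returns: ratio undefined in this sketch
    varq = sum(
        (logp[t] - logp[t - q] - q * mu) ** 2 for t in range(q, n + 1)
    ) / (n - q + 1)
    return varq / (q * var1)


# A strictly alternating (mean-reverting) price path gives a ratio below 1.
prices = [100.0, 110.0, 100.0, 110.0, 100.0, 110.0, 100.0]
print(variance_ratio_sketch(prices, 1))
print(variance_ratio_sketch(prices, 2))
```

Under a random walk the ratio is expected to be close to 1 for every q; values below 1 point toward mean reversion and values above 1 toward trending behaviour.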