nntm.model_selection package

Module contents

class nntm.model_selection.PurgedKFold(n_splits=5, target_days=20, embargo=None)

Bases: BaseCrossValidator

Purged K-Folds cross-validator

Provides train/test indices to split data into train/test sets. Splits the dataset into k consecutive folds. Training observations that overlap in time with test observations are purged.

Optionally, the eras that immediately follow the test set can be eliminated using the embargo argument.

Data is assumed to be contiguous (shuffle=False).

Parameters:
  • n_splits (int, default=5) – Number of folds. Must be at least 2.

  • target_days (int, default=20) – Days between the observation of samples and the target.

  • embargo (float between 0.0 and 1.0, default=None) – Relative number of eras to be purged after every test set. (embargo * total_era_count) eras are embargoed.
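The purge and embargo arithmetic described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name purged_kfold_eras and the purge_eras argument (a stand-in for the number of eras overlapped by the target horizon, which the library presumably derives from target_days) are assumptions made for the example, as is purging on both sides of the test fold.

```python
def purged_kfold_eras(n_eras, n_splits=5, purge_eras=0, embargo=None):
    """Yield (train_eras, test_eras) as lists of era positions 0..n_eras-1.

    purge_eras is a hypothetical stand-in for the number of eras whose
    targets overlap the test fold; it is not a parameter of PurgedKFold.
    """
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    # embargo is a fraction of the total era count, as documented above
    n_embargo = int(embargo * n_eras) if embargo else 0
    # Consecutive folds; the first n_eras % n_splits folds get one extra era
    fold_sizes = [
        n_eras // n_splits + (1 if i < n_eras % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        # Drop eras just before the test fold (purge) and just after it
        # (purge plus embargo); everything else is kept for training
        lo = start - purge_eras
        hi = stop + purge_eras + n_embargo
        train = [e for e in range(n_eras) if e < lo or e >= hi]
        yield train, test
        start = stop


splits = list(purged_kfold_eras(20, n_splits=5, purge_eras=1, embargo=0.1))
train, test = splits[1]
print(test)   # eras 4..7
print(train)  # eras 0..2 and 11..19
```

With 20 eras and embargo=0.1, two eras are embargoed after each test fold, in addition to the one purged era on either side.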

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Parameters:
  • X (object) – Always ignored, exists for sklearn compatibility.

  • y (object) – Always ignored, exists for sklearn compatibility.

  • groups (object) – Always ignored, exists for sklearn compatibility.

Returns:

n_splits – The number of splitting iterations in the cross-validator.

Return type:

int

split(X, y=None, groups=None)

Generate indices to split data into training and test sets.

Parameters:
  • X – Training data, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,), default=None) – The target variable for supervised learning problems.

  • groups (array-like of shape (n_samples,), default=None) – Eras for the samples used while splitting the dataset into train/test set. This parameter is not required when X is a pandas DataFrame containing an era column.

Yields:
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.
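To make the role of groups concrete, the sketch below shows one way era labels could be turned into per-sample train/test indices so that whole eras stay together. It is an assumption-laden illustration: split_by_era is a hypothetical helper, purging and embargo are omitted for brevity, and plain lists are yielded where the method above yields ndarrays.

```python
def split_by_era(groups, n_splits=5):
    """Yield (train_idx, test_idx) lists; samples of one era never straddle folds."""
    # Unique eras in order of first appearance (data is assumed contiguous)
    eras = list(dict.fromkeys(groups))
    n_eras = len(eras)
    # Consecutive folds of eras; early folds absorb the remainder
    fold_sizes = [
        n_eras // n_splits + (1 if i < n_eras % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_eras = set(eras[start:start + size])
        test_idx = [i for i, g in enumerate(groups) if g in test_eras]
        train_idx = [i for i, g in enumerate(groups) if g not in test_eras]
        yield train_idx, test_idx
        start += size


groups = ["e1", "e1", "e2", "e2", "e2", "e3", "e4", "e4"]
era_splits = list(split_by_era(groups, n_splits=2))
for train_idx, test_idx in era_splits:
    print(train_idx, test_idx)
```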

nntm.model_selection.check_cv(cv=5, *, target_days=20, embargo=None)

Input checker utility for building a cross-validator.

Parameters:
  • cv (default=5) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 5-fold purged cross validation,

    • integer, to specify the number of folds for purged cross validation,

    • An iterable yielding (train, test) splits as arrays of indices.

  • target_days (int, default=20) – Days between the observation of samples and the target.

  • embargo (float between 0.0 and 1.0, default=None) – Relative number of eras to be purged after every test set. (embargo * total_era_count) eras are embargoed.

Returns:

checked_cv – The return value is a cross-validator which generates the train/test splits via the split method.

Return type:

a cross-validator instance.
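The three accepted kinds of cv input can be pictured as a simple dispatch. The sketch below is a hypothetical stand-in, not the body of check_cv: resolve_cv and the tuples it returns are invented for illustration and only describe what would be built for each input.

```python
def resolve_cv(cv=5):
    """Hypothetical stand-in mirroring the three documented kinds of cv input."""
    if cv is None:
        # None falls back to the default 5-fold purged cross validation
        return ("purged_kfold", 5)
    if isinstance(cv, int) and not isinstance(cv, bool):
        if cv < 2:
            raise ValueError("the number of folds must be at least 2")
        # An integer sets the number of purged folds
        return ("purged_kfold", cv)
    # Anything else is treated as an iterable of (train, test) index pairs
    return ("predefined", [(list(train), list(test)) for train, test in cv])


print(resolve_cv(None))
print(resolve_cv(3))
print(resolve_cv([([0, 1], [2])]))
```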

nntm.model_selection.validation_curve(estimator, X, y, *, param_name, param_range, groups=None, cv=None, target_days=20, embargo=None, scoring='corr', n_jobs=None, pre_dispatch='all', verbose=0, error_score=nan, fit_params=None)

Validation curve.

Determine training and test scores for varying parameter values.

Compute scores for an estimator with different values of a specified parameter. This is similar to grid search with one parameter. However, this will also compute training scores and is merely a utility for plotting the results.

Parameters:
  • estimator (object type that implements the “fit” and “predict” methods) – An object of that type which is cloned for each validation.

  • X – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs) or None) – Target relative to X for classification or regression; None for unsupervised learning.

  • param_name (str) – Name of the parameter that will be varied.

  • param_range (array-like of shape (n_values,)) – The values of the parameter that will be evaluated.

  • groups (array-like of shape (n_samples,), default=None) – Eras for the samples used while splitting the dataset into train/test set. Also required for some scoring functions.

  • cv (int, cross-validation generator or an iterable, default=None) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 5-fold purged cross validation,

    • integer, to specify the number of folds for purged cross validation,

    • An iterable yielding (train, test) splits as arrays of indices.

  • target_days (int, default=20) – Days between the observation of samples and the target.

  • embargo (float between 0.0 and 1.0, default=None) – Relative number of eras to be purged after every test set. (embargo * total_era_count) eras are embargoed.

  • scoring (str or callable, default='corr') – A str or a scorer callable object / function with signature scorer(estimator, X, y).

  • n_jobs (int, default=None) – Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the combinations of each parameter value and each cross-validation split. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

  • pre_dispatch (int or str, default='all') – Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The str can be an expression like ‘2*n_jobs’.

  • verbose (int, default=0) – Controls the verbosity: the higher, the more messages.

  • fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

  • error_score ('raise' or numeric, default=np.nan) – Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised.

Returns:

  • train_scores (array of shape (n_ticks, n_cv_folds)) – Scores on training sets.

  • test_scores (array of shape (n_ticks, n_cv_folds)) – Scores on test sets.
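The layout of the two returned arrays, one row per parameter value and one column per fold, can be illustrated with a toy loop. This is only a sketch under stated assumptions: toy_validation_curve, its constant-prediction "model" and the negative mean squared error score are all invented for the example (the function documented above defaults to scoring='corr'), and nested lists stand in for arrays.

```python
def toy_validation_curve(y, param_range, splits):
    """Return (train_scores, test_scores), each with n_ticks rows and n_folds columns."""

    def fit(y_train, shrink):
        # Toy "model": a constant prediction, the training mean scaled by shrink
        return shrink * sum(y_train) / len(y_train)

    def score(pred, y_true):
        # Negative mean squared error, so that higher is better
        return -sum((v - pred) ** 2 for v in y_true) / len(y_true)

    train_scores, test_scores = [], []
    for value in param_range:
        row_train, row_test = [], []
        for train_idx, test_idx in splits:
            y_tr = [y[i] for i in train_idx]
            y_te = [y[i] for i in test_idx]
            pred = fit(y_tr, value)
            row_train.append(score(pred, y_tr))
            row_test.append(score(pred, y_te))
        train_scores.append(row_train)
        test_scores.append(row_test)
    return train_scores, test_scores


y = [1.0, 2.0, 3.0, 4.0]
cv_splits = [([0, 1], [2, 3]), ([2, 3], [0, 1])]
train_scores, test_scores = toy_validation_curve(y, [0.0, 0.5, 1.0], cv_splits)
print(len(train_scores), len(train_scores[0]))  # rows = parameter values, columns = folds
```

Averaging each row over its columns gives one point per parameter value, which is the usual way such scores are plotted.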