quasildr.structdr

This module implements nonparametric ridge estimation and relevant functions. The main class is Scms which provides ridge estimation and bootstrap functions.

Classes:

Scms(data, bw[, min_radius])

An implementation of nonparametric ridge estimation.

Functions:

bootstrap_resample(X[, n])

Bootstrap resample an array_like

multilevel_compression(data[, n, threshold, k])

Iteratively combine nearest neighbor pairs below distance threshold.

vdist(a, b)

This function is similar to cdist but returning vectors rather than norm of vectors implemented via virtual copies of a and b

class quasildr.structdr.Scms(data, bw, min_radius=10)[source]

Bases: object

An implementation of nonparametric ridge estimation. The main algorithm implemented is subspace constrained mean-shift (SCMS).

Parameters
  • data (array or list) –

    • if data is a 2D array of size (N, D). Regular KDE is used.

    • if data is a list. data is interpreted as compressed data as given by

    ridge_estimation.multilevel_compression function. Multi-level KDE is used. Different levels one data point corresponds to 1, 2, 4, 8, 16, … regular datapoints.

  • bw (float) – Gaussian kernel bandwidth for KDE.

  • min_radius (int) – Set data-point specific minimum bandwidth to be the distance to its min_radius-th nearest neighbor.

bw
Type

float

adaptive_bw

Kernel bandwidth specified for each data point. This is the bandwidth directly used in the algorithm.

Type

array

params

Save parameters used in running Scms.scms(…) for bootstrap.

Type

dict

Methods:

boostrap_trajectory([n_bootstrap, n_jobs, ...])

Construct confidence set of density ridge positions through bootstrap.

bootstrap_landmark(X[, n_bootstrap, method, ...])

Experimental function.

inverse_density_sampling(X[, n_samples, ...])

Extract a subset of data points to represent the whole data.

reset_bw([bw, min_radius])

Reset bandwidth.

scms(X[, n_iterations, threshold, tol, ...])

Performs subspace constrained mean shift on the entire data set using a Gaussian kernel with bandwidth of bw.

scms_update(X[, method, stepsize, ...])

Compute the constrained mean shift update of the point at x.

boostrap_trajectory(n_bootstrap=25, n_jobs=1, copy_bw=True)[source]

Construct confidence set of density ridge positions through bootstrap. For each bootstrap sample, distance to its nearest neighbors in the original ridge estimate is recorded. The bootstrap runs will copy the parameters of the previous scms call, therefore scms has to be executed before this function.

Parameters
  • n_bootstrap (int, optional) – Number of bootstrap runs. Default is 25.

  • n_jobs (int, optional) – Number of jobs to run in parallel. Default is 1. Only specify n_jobs>1 if n_jobs=1 is used when running scms.

  • copy_bw (bool, optional) – If true, the point-specific kernel bandwidths are unchanged after resampling. This option only has an effect is self.min_radius > 0.

Returns

  • boostrap_n (array, shape (n_samples, n_bootstrap)) – 2D array. Bootstrap nearest neighbor distances for each ridge position.

  • boostrap_s (array, shape (n_samples, n_bootstrap)) – 2D array. Bootstrap point-wise distances for each ridge position. Experimental.

  • boostrap_ifilters (array, shape (n_samples, n_bootstrap)) – 2D array. Bootstrap resample indices.

bootstrap_landmark(X, n_bootstrap=20, method='LocInv', smooth_bootstrap=False, copy_bw=True, n_jobs=4)[source]

Experimental function. Work in progress.

inverse_density_sampling(X, n_samples=2000, n_jobs=1, batch_size=16)[source]

Extract a subset of data points to represent the whole data. Data points from high density areas are most likely to be subsampled and those from low density areas are preserved. This is efficiently implemented with a trick of sorting the data by log density + Gumbel noise.

Parameters
  • X (2D array) – Input data points

  • n_samples (int, optional) – Number of sub-sampled data points. Default is 2000.

  • n_jobs (int, optional) – Number of jobs to run in parallel. Default is 1.

  • batch_size (int, optional) – Batch size. Set this to a number number to reduce memory usage. Default is 16

Returns

indices – Subsample indices corresponding to input X.

Return type

1D array

reset_bw(bw=None, min_radius=None)[source]

Reset bandwidth.

Parameters
  • bw (float) – Gaussian kernel bandwidth for KDE.

  • min_radius (int) – Set data-point specific minimum bandwidth to be the distance to its min_radius-th nearest neighbor.

Returns

Return type

None

scms(X, n_iterations=50, threshold=0, tol=1e-07, method='LocInv', stepsize=0.5, ridge_dimensionality=1, relaxation=0, n_jobs=1, batch_size=16)[source]

Performs subspace constrained mean shift on the entire data set using a Gaussian kernel with bandwidth of bw.

Parameters
  • X (array) – 2D array. Input data points.

  • n_iterations (int, optional) – The number of iterations for running the algorithm

  • threshold (float, optional) – Filter out data points below a probability density threshold. The threshold value is specified relative to the maximum density.

  • tol (float, optional) – Tolerance for assessing convergence, stop if max(abs(new_pos - pos)) < tol.

  • method (str {'GradientP'|'GradientLogP'|'MSP'|'MSLogP'|'InvCov'}, optional) –

    • GradientP

      Projected gradient update based on Hessian of probability density function p. Update is specified by projected gradient of p * step size.

    • GradientLogP

      Projected gradient update based on Hessian of log probability density function log(p). Update is specified by projected gradient of log(p) * step size.

    • MSP

      Projected mean-shift update based on Hessian of probability density function p. Update is specified by projected mean shift update * step size.

    • MSLogP or InvCov (equivalent)

      Projected mean-shift update based on Hessian of probability density function log p. Update is specified by projected mean shift update * step size.

    Default is ‘InvCov’.

  • stepsize (float, optional) – Step size. Default is 0.5

  • ridge_dimensionality (int, optional) – Ridge dimensionality. Default is 1.

  • relaxation (float [0,1], optional) – Relax the ridge dimensionality based on eigen gap corresponding to the ridge_dimensionality. Typically used with ridge_dimensionality=1. All point are projected to one-dimensional ridges (trajectories) when relaxation=0. All points are projected to two-dimensional ridges (surfaces) when relaxation=1. Default is 0.

  • n_jobs (int, optional) – Number of jobs to run in parallel. Default is 1.

  • batch_size (int, optional) – Number of data points to process in a batch. Reduce this number can decrease memory usage. Default is 500.

Returns

  • pos (array) – Updated data points postions.

  • ifilter (array) – Indices of returned positions (useful when threshold > 0 so part of the input are filtered).

scms_update(X, method='LocInv', stepsize=0.01, ridge_dimensionality=1, relaxation=0)[source]

Compute the constrained mean shift update of the point at x.

Parameters
  • X (array) – 2D array. Input data points.

  • method (str {'Gradient'|'GradientLogP'|'SuRF'|'LocInv'}, optional) –

    • Gradient

      Projected gradient update based on Hessian of probability density function p. Update is specified by projected gradient of p * step size.

    • GradientLogP

      Projected gradient update based on Hessian of log probability density function log(p). Update is specified by projected gradient of log(p) * step size.

    • SuRF

      Projected mean-shift update based on Hessian of probability density function p. Update is specified by projected mean shift update * step size.

    • LocInv

      Projected mean-shift update based on Hessian of probability density function log p. Update is specified by projected mean shift update * step size.

    Default is ‘LocInv’.

  • stepsize (float, optional) – Step size.

  • ridge_dimensionality (int, optional) – Ridge dimensionality.

  • relaxation (float [0,1], optional) – Relax the ridge dimensionality based on eigen gap corresponding to the ridge_dimensionality. Typically used with ridge_dimensionality=1. All point are projected to one-dimensional ridges (trajectories) when relaxation=0. All points are projected to two-dimensional ridges (surfaces) when relaxation=1.

Returns

  • update (2D array) – SCMS update.

  • relax_bool (1D array) – Indicate whether the ridge dimensionality of a particular point was increased by 1 relative to what ridge_dimensionality specified. The amount of relaxation is controlled by relaxation argument.

quasildr.structdr.bootstrap_resample(X, n=None)[source]

Bootstrap resample an array_like

Parameters
  • X (array_like) – data to resample

  • n (int, optional) – length of resampled array, equal to len(X) if n==None

Returns

  • Bootstrap resampled array

  • 1D array of indices corresponding to input.

Return type

array_like, numpy.array

quasildr.structdr.multilevel_compression(data, n=5000, threshold=None, k=20)[source]

Iteratively combine nearest neighbor pairs below distance threshold. This function can be used to reduce the number of data points to represent the whole dataset without greatly affect the density estimation quality. Most of the data points in low density areas will be unaffected and data points in high density areas will be combined and averaged.

Parameters
  • data (array) – 2D array of size (N, D)

  • n (int, optional) – Choose the distance threshold by the n-th largest k-NN distance; must be smaller than N x k.

  • threshold (float or None, optional) – The distance threshold for combining nearest neighbor pairs. Override N when given.

  • k (int, optional) – Number of nearesest neigbors for KNN graph. Choose larger k for better approximation, smaller for better speed. Use at least 2.

Returns

  • list of level 0, level 1, level 2, … data points. As each level a data point corresponds to

    2^0, 2^1, 2^2, … original data points.

  • list of input data indices corresponding to level-0,1,2,… data points.

Return type

tuple(list, list)

quasildr.structdr.vdist(a, b)[source]

This function is similar to cdist but returning vectors rather than norm of vectors implemented via virtual copies of a and b

Parameters
  • a (np.ndarray) – 2D arrays of size (M,D), (N,D) for which differences for all pairwise combinations of rows in the two arrays will be computed. The number of columns have to the same between a and b.

  • b (np.ndarray) – 2D arrays of size (M,D), (N,D) for which differences for all pairwise combinations of rows in the two arrays will be computed. The number of columns have to the same between a and b.

Returns

3D array of size (N, M, D), containing all pairwise differences of rows in a and b.

Return type

numpy.array