quasildr.structdr¶
This module implements nonparametric ridge estimation and relevant functions. The main class is Scms which provides ridge estimation and bootstrap functions.
Classes:
|
An implementation of nonparametric ridge estimation. |
Functions:
|
Bootstrap resample an array_like |
|
Iteratively combine nearest neighbor pairs below distance threshold. |
|
This function is similar to cdist but returning vectors rather than norm of vectors implemented via virtual copies of a and b |
- class quasildr.structdr.Scms(data, bw, min_radius=10)[source]¶
Bases:
objectAn implementation of nonparametric ridge estimation. The main algorithm implemented is subspace constrained mean-shift (SCMS).
- Parameters
data (array or list) –
if data is a 2D array of size (N, D). Regular KDE is used.
if data is a list. data is interpreted as compressed data as given by
ridge_estimation.multilevel_compression function. Multi-level KDE is used. Different levels one data point corresponds to 1, 2, 4, 8, 16, … regular datapoints.
bw (float) – Gaussian kernel bandwidth for KDE.
min_radius (int) – Set data-point specific minimum bandwidth to be the distance to its min_radius-th nearest neighbor.
- bw¶
- Type
float
- adaptive_bw¶
Kernel bandwidth specified for each data point. This is the bandwidth directly used in the algorithm.
- Type
array
- params¶
Save parameters used in running Scms.scms(…) for bootstrap.
- Type
dict
Methods:
boostrap_trajectory([n_bootstrap, n_jobs, ...])Construct confidence set of density ridge positions through bootstrap.
bootstrap_landmark(X[, n_bootstrap, method, ...])Experimental function.
inverse_density_sampling(X[, n_samples, ...])Extract a subset of data points to represent the whole data.
reset_bw([bw, min_radius])Reset bandwidth.
scms(X[, n_iterations, threshold, tol, ...])Performs subspace constrained mean shift on the entire data set using a Gaussian kernel with bandwidth of bw.
scms_update(X[, method, stepsize, ...])Compute the constrained mean shift update of the point at x.
- boostrap_trajectory(n_bootstrap=25, n_jobs=1, copy_bw=True)[source]¶
Construct confidence set of density ridge positions through bootstrap. For each bootstrap sample, distance to its nearest neighbors in the original ridge estimate is recorded. The bootstrap runs will copy the parameters of the previous scms call, therefore scms has to be executed before this function.
- Parameters
n_bootstrap (int, optional) – Number of bootstrap runs. Default is 25.
n_jobs (int, optional) – Number of jobs to run in parallel. Default is 1. Only specify n_jobs>1 if n_jobs=1 is used when running scms.
copy_bw (bool, optional) – If true, the point-specific kernel bandwidths are unchanged after resampling. This option only has an effect is self.min_radius > 0.
- Returns
boostrap_n (array, shape (n_samples, n_bootstrap)) – 2D array. Bootstrap nearest neighbor distances for each ridge position.
boostrap_s (array, shape (n_samples, n_bootstrap)) – 2D array. Bootstrap point-wise distances for each ridge position. Experimental.
boostrap_ifilters (array, shape (n_samples, n_bootstrap)) – 2D array. Bootstrap resample indices.
- bootstrap_landmark(X, n_bootstrap=20, method='LocInv', smooth_bootstrap=False, copy_bw=True, n_jobs=4)[source]¶
Experimental function. Work in progress.
- inverse_density_sampling(X, n_samples=2000, n_jobs=1, batch_size=16)[source]¶
Extract a subset of data points to represent the whole data. Data points from high density areas are most likely to be subsampled and those from low density areas are preserved. This is efficiently implemented with a trick of sorting the data by log density + Gumbel noise.
- Parameters
X (2D array) – Input data points
n_samples (int, optional) – Number of sub-sampled data points. Default is 2000.
n_jobs (int, optional) – Number of jobs to run in parallel. Default is 1.
batch_size (int, optional) – Batch size. Set this to a number number to reduce memory usage. Default is 16
- Returns
indices – Subsample indices corresponding to input X.
- Return type
1D array
- reset_bw(bw=None, min_radius=None)[source]¶
Reset bandwidth.
- Parameters
bw (float) – Gaussian kernel bandwidth for KDE.
min_radius (int) – Set data-point specific minimum bandwidth to be the distance to its min_radius-th nearest neighbor.
- Returns
- Return type
None
- scms(X, n_iterations=50, threshold=0, tol=1e-07, method='LocInv', stepsize=0.5, ridge_dimensionality=1, relaxation=0, n_jobs=1, batch_size=16)[source]¶
Performs subspace constrained mean shift on the entire data set using a Gaussian kernel with bandwidth of bw.
- Parameters
X (array) – 2D array. Input data points.
n_iterations (int, optional) – The number of iterations for running the algorithm
threshold (float, optional) – Filter out data points below a probability density threshold. The threshold value is specified relative to the maximum density.
tol (float, optional) – Tolerance for assessing convergence, stop if max(abs(new_pos - pos)) < tol.
method (str {'GradientP'|'GradientLogP'|'MSP'|'MSLogP'|'InvCov'}, optional) –
- GradientP
Projected gradient update based on Hessian of probability density function p. Update is specified by projected gradient of p * step size.
- GradientLogP
Projected gradient update based on Hessian of log probability density function log(p). Update is specified by projected gradient of log(p) * step size.
- MSP
Projected mean-shift update based on Hessian of probability density function p. Update is specified by projected mean shift update * step size.
- MSLogP or InvCov (equivalent)
Projected mean-shift update based on Hessian of probability density function log p. Update is specified by projected mean shift update * step size.
Default is ‘InvCov’.
stepsize (float, optional) – Step size. Default is 0.5
ridge_dimensionality (int, optional) – Ridge dimensionality. Default is 1.
relaxation (float [0,1], optional) – Relax the ridge dimensionality based on eigen gap corresponding to the ridge_dimensionality. Typically used with ridge_dimensionality=1. All point are projected to one-dimensional ridges (trajectories) when relaxation=0. All points are projected to two-dimensional ridges (surfaces) when relaxation=1. Default is 0.
n_jobs (int, optional) – Number of jobs to run in parallel. Default is 1.
batch_size (int, optional) – Number of data points to process in a batch. Reduce this number can decrease memory usage. Default is 500.
- Returns
pos (array) – Updated data points postions.
ifilter (array) – Indices of returned positions (useful when threshold > 0 so part of the input are filtered).
- scms_update(X, method='LocInv', stepsize=0.01, ridge_dimensionality=1, relaxation=0)[source]¶
Compute the constrained mean shift update of the point at x.
- Parameters
X (array) – 2D array. Input data points.
method (str {'Gradient'|'GradientLogP'|'SuRF'|'LocInv'}, optional) –
- Gradient
Projected gradient update based on Hessian of probability density function p. Update is specified by projected gradient of p * step size.
- GradientLogP
Projected gradient update based on Hessian of log probability density function log(p). Update is specified by projected gradient of log(p) * step size.
- SuRF
Projected mean-shift update based on Hessian of probability density function p. Update is specified by projected mean shift update * step size.
- LocInv
Projected mean-shift update based on Hessian of probability density function log p. Update is specified by projected mean shift update * step size.
Default is ‘LocInv’.
stepsize (float, optional) – Step size.
ridge_dimensionality (int, optional) – Ridge dimensionality.
relaxation (float [0,1], optional) – Relax the ridge dimensionality based on eigen gap corresponding to the ridge_dimensionality. Typically used with ridge_dimensionality=1. All point are projected to one-dimensional ridges (trajectories) when relaxation=0. All points are projected to two-dimensional ridges (surfaces) when relaxation=1.
- Returns
update (2D array) – SCMS update.
relax_bool (1D array) – Indicate whether the ridge dimensionality of a particular point was increased by 1 relative to what ridge_dimensionality specified. The amount of relaxation is controlled by relaxation argument.
- quasildr.structdr.bootstrap_resample(X, n=None)[source]¶
Bootstrap resample an array_like
- Parameters
X (array_like) – data to resample
n (int, optional) – length of resampled array, equal to len(X) if n==None
- Returns
Bootstrap resampled array
1D array of indices corresponding to input.
- Return type
array_like, numpy.array
- quasildr.structdr.multilevel_compression(data, n=5000, threshold=None, k=20)[source]¶
Iteratively combine nearest neighbor pairs below distance threshold. This function can be used to reduce the number of data points to represent the whole dataset without greatly affect the density estimation quality. Most of the data points in low density areas will be unaffected and data points in high density areas will be combined and averaged.
- Parameters
data (array) – 2D array of size (N, D)
n (int, optional) – Choose the distance threshold by the n-th largest k-NN distance; must be smaller than N x k.
threshold (float or None, optional) – The distance threshold for combining nearest neighbor pairs. Override N when given.
k (int, optional) – Number of nearesest neigbors for KNN graph. Choose larger k for better approximation, smaller for better speed. Use at least 2.
- Returns
- list of level 0, level 1, level 2, … data points. As each level a data point corresponds to
2^0, 2^1, 2^2, … original data points.
list of input data indices corresponding to level-0,1,2,… data points.
- Return type
tuple(list, list)
- quasildr.structdr.vdist(a, b)[source]¶
This function is similar to cdist but returning vectors rather than norm of vectors implemented via virtual copies of a and b
- Parameters
a (np.ndarray) – 2D arrays of size (M,D), (N,D) for which differences for all pairwise combinations of rows in the two arrays will be computed. The number of columns have to the same between a and b.
b (np.ndarray) – 2D arrays of size (M,D), (N,D) for which differences for all pairwise combinations of rows in the two arrays will be computed. The number of columns have to the same between a and b.
- Returns
3D array of size (N, M, D), containing all pairwise differences of rows in a and b.
- Return type
numpy.array