quasildr.graphdr¶
This module implements the quaslinear representation method GraphDR and relevant functions. The method is implemented in the graphdr function. A scikit-learn Estimator-like interface to graphdr is also provided through the GraphDR class.
Classes:
|
Quasilinear data representation that preserves interpretability of a linear space. |
Functions:
|
Solve X in linear equation AX = B when A is sparse with conjugate gradient method. |
|
Quasilinear data representation that preserves interpretability of a linear space. |
|
Construct an approximate K-nearest neighbors graph with nmslib. |
- class quasildr.graphdr.GraphDR(n_neighbors=10, regularization=1000, n_components=None, no_rotation=False, metric='l2', metric_params={}, method='auto', _lambda=None, symmetrize=True, rescale=False, refine_iter=0, refine_threshold=12, _refine_threshold=None, tol=1e-07, atol=1e-15, nmslib_params={'post': 2}, n_jobs=1, n_jobs_nmslib=8, use_cuda=False, verbose=False)[source]¶
Bases:
BaseEstimatorQuasilinear data representation that preserves interpretability of a linear space.
This class provides a scikit-learn type interface that wraps around the graphdr function.
- Parameters
n_neighbors (int, optional) – Number of nearest neigbors to compute per sample. Default is 10.
regularization (float, optional) – Regularization parameter determines the weight on the graph shrinkage term relative to the reconstruction error term. A smaller regularization parameter leads to output that are more similar to a linear representation. A larger regularization parameter provides more contraction based on the graph. Default is 100.
n_components (int or None, optional) – Number of dimensions in the output. Specify a n_components smaller than the input dimension when no_rotation=True can reduce the amount of computation. Default is None.
no_rotation (bool, optional) – \(W\) is fixed to identity matrix so that the output is not rotated relative to input. no_rotation can also speed up the computation for large input when n_components is also specified, because only the required top n_components are computed. If no_rotation=True, you may consider preprocessing the input with another linear dimensionality reduction method like PCA. Default is False.
metric (str, optional) –
distance metric for constructing nearest_neighbors graph. Note that here metric can be non-metric distance function. The allowed values depend on the method specified. As three nearest neighbors algorithms are supported, and these algorithms support different distance metrics. Default is euclidean. * ‘small’ :
Use sklearn.neighbors.kneighbors_graph Supported values are
euclidean
manhattan
chebyshev
minkowski
wminkowski
seuclidean
mahalanobis
hamming
canberra
braycurtis
See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
- ’big_nndescent’ or ‘big_nnumap’ :
- umap.nndescent. Supported values are
euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
mahalanobis
wminkowski
seuclidean
cosine
correlation
haversine
hamming
jaccard
dice
russelrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
’big_nmslib’ :
- nmslib. See https://github.com/nmslib/nmslib/blob/master/manual/manual.pdf for details:
bit_hamming
l1
l1_sparse
l2, euclidean
l2_sparse
linf
linf_sparse
lp:p=…
lp_sparse:p=…
angulardist, angulardist sparse, angulardist sparse fast
jsmetrslow, jsmetrfast, jsmetrfastapprox
leven
sqfd minus func, sqfd heuristic func:alpha=…, sqfd gaussian func:alpha=…
jsdivslow, jsdivfast, jsdivfastapprox
cosinesimil, cosine, cosinesimil sparse
normleven
kldivfast
kldivgenslow, kldivgenfast, kldivgenfastrq
itakurasaitoslow, itakurasaitofast, itakurasaitofastrq
negdotprod_sparse
querynorm_negdotprod_sparse
renyi_diverg
ab_diverg
metric_params (dict, optional) – Some distance metrics requires additional arguments which can be provided as a dictionary. Default is {}.
method (str, optional {'auto', 'small', 'big_nmslib', 'big_umap', 'big_nndescent'}) –
Method to use. Different algorithms strongly affect the computation speed. Optimal choices are dependent on the input size. Default is auto. * ‘auto’
Choose automatically based on input size. Recommended for most uses.
- ’small’
For input of small size, it is affordable to compute nearest-neigbors exactly and use matrix inversion in the algorithm.
- ’big_nmslib’
For input of large size. If NMSlib is installed, it is the fastest algorithm that is supported.
- ’big_nndescent’
For input of large size. The NN-Descent algorithm implemented in the the umap package. Do not require external libraries.
- ’big_nnumap’
For input of large size. Do not require external libraries.
_lambda (float or None, optional) – Regularization parameter \(\lambda\). If not specified \(\lambda\) is computed from regularization, specifying regularization instead of _lambda is generally recommended as it automatically adjusts based on graph density. Default is None.
symmetrize (bool, optional) – Symmetrize the nearest-neighbors graph by averaging with its transpose. Default is True.
rescale (bool, optional) – Rescale each dimension in the output to align with the input. It works with and without no_rotation. Default is False
refine_iter (int, optional) – Number of refinement iterations. Default is 3.
refine_threshold (float, optional) – Specify the cutoff for refinement. Unit is IQR (distance between the 25% quantile and the 75% quantile) of overall “stress” distribution across all edges. Default is 12.
_refine_threshold (float or None, optional) – Used internally. Default is None.
tol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.
atol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.
nmslib_params (dict, optional) – Parameters given to nmslib createIndex function. See https://github.com/nmslib/nmslib/blob/master/manual/methods.md for details. Default is {‘post’: 2}.
n_jobs (int, optional) – Number of jobs to run in parallel. Specifying n_jobs > 1 can speed up computation when method is one of big_nmslib, big_nndescent, or big_nnumap. Note that when use_cuda=True specifying n_jobs does not run the jobs in parallel but can reduce amount of GPU memory required by splitting the problem and run sequentially. Default is 1.
n_jobs_nmslib (int, optional) – Number of threads used by nmslib. Default is 8.
use_cuda (bool, optional) – Use CUDA GPU acceleration. CuPy has to be installed (https://github.com/cupy/cupy) and a CUDA-enabled GPU is required. Default is False.
verbose (bool, optional) – Print additional information. Default is False.
- embedding_¶
- Type
array, shape (n_samples, n_components)
- W_¶
Right linear operator in the transformation Z = KXW. W_ is None when no_rotation=True.
- Type
array, shape (n_features, n_components)
Methods:
fit(X[, y, custom_graph, refine_protected_graph])Fit the model from data in X.
fit_transform(X[, y, custom_graph, ...])Fit the model from data in X and transform X.
transform(X[, y, custom_graph, ...])Apply fitted model to transform X.
- fit(X, y=None, custom_graph=None, refine_protected_graph=None)[source]¶
Fit the model from data in X.
- Parameters
X (array, shape (n_samples, n_features).) – X may be a sparse matrix. If self.skip_pca is true, X must be orthogonal.
y (Ignored) –
custom_graph (sparse matrix, shape (n_samples, n_samples), optional) – If provided, use it instead of the graph constructed from X.
refine_protected_graph (sparse matrix) –
- Returns
self – Returns the instance itself.
- Return type
object
- fit_transform(X, y=None, custom_graph=None, refine_protected_graph=None)[source]¶
Fit the model from data in X and transform X. :param X: :type X: array, shape (n_samples, n_features) :param y: :type y: Ignored
- Returns
Z – Embedding of the input data.
- Return type
array, shape (n_samples, n_components)
- quasildr.graphdr.cg(A, B, x0=None, tol=1e-07, atol=1e-15, use_cuda=False)[source]¶
Solve X in linear equation AX = B when A is sparse with conjugate gradient method. Support GPU with use_cuda=True.
- Parameters
A (sparse matrix or array) – A N-by-N matrix of the linear system.
B (array) – Right hand side of the linear system.
x0 (array or None, optional) – If specified, initialize the solution with x0.
tol (float, optional) – Tolerances for convergence norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.
atol (float, optional) – Tolerances for convergence norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.
use_cuda (bool, optional) – Use GPU acceleration with CUDA. A GPU supporting CUDA is needed and CuPy package need to be installed. Default is False.
- Returns
The solution to X.
- Return type
array
- quasildr.graphdr.graphdr(X, n_neighbors=10, regularization=100, n_components=None, no_rotation=False, metric='euclidean', metric_params={}, method='auto', _lambda=None, init=None, symmetrize=True, custom_graph=None, rescale=False, refine_iter=0, refine_threshold=12, _refine_threshold=None, refine_protected_graph=None, tol=1e-07, atol=1e-15, nmslib_params={'post': 2}, n_jobs=1, n_jobs_nmslib=8, use_cuda=False, verbose=False, return_all=False)[source]¶
Quasilinear data representation that preserves interpretability of a linear space.
- Parameters
X (array (n_samples, n_features)) – The input array.
n_neighbors (int, optional) – Number of nearest neigbors to compute per sample. Default is 10.
regularization (float, optional) – Regularization parameter determines the weight on the graph shrinkage term relative to the reconstruction error term. A smaller regularization parameter leads to output that are more similar to a linear representation. A larger regularization parameter provides more contraction based on the graph. Default is 100.
n_components (int or None, optional) – Number of dimensions in the output. Specify a n_components smaller than the input dimension when no_rotation=True can reduce the amount of computation. Default is None.
no_rotation (bool, optional) –
- param refine_protected_graph
\(W\) is fixed to identity matrix so that the output is not rotated relative to input. no_rotation can also speed up the computation for large input when n_components is also specified, because only the required top n_components are computed. If no_rotation=True, you may consider preprocessing the input with another linear dimensionality reduction method like PCA. Default is False.
metric (str, optional) –
distance metric for constructing nearest_neighbors graph. Note that here metric can be non-metric distance function. The allowed values depend on the method specified. As three nearest neighbors algorithms are supported, and these algorithms support different distance metrics. Default is euclidean. * ‘small’:
Use sklearn.neighbors.kneighbors_graph Supported values are
euclidean
manhattan
chebyshev
minkowski
wminkowski
seuclidean
mahalanobis
hamming
canberra
braycurtis
See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
- ’big_nndescent’ or ‘big_nnumap’:
- umap.nndescent. Supported values are
euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
mahalanobis
wminkowski
seuclidean
cosine
correlation
haversine
hamming
jaccard
dice
russelrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
’big_nmslib’
- nmslib. See https://github.com/nmslib/nmslib/blob/master/manual/manual.pdf for details:
bit_hamming
l1
l1_sparse
l2, euclidean
l2_sparse
linf
linf_sparse
lp:p=…
lp_sparse:p=…
angulardist, angulardist sparse, angulardist sparse fast
jsmetrslow, jsmetrfast, jsmetrfastapprox
leven
sqfd minus func, sqfd heuristic func:alpha=…, sqfd gaussian func:alpha=…
jsdivslow, jsdivfast, jsdivfastapprox
cosinesimil, cosine, cosinesimil sparse
normleven
kldivfast
kldivgenslow, kldivgenfast, kldivgenfastrq
itakurasaitoslow, itakurasaitofast, itakurasaitofastrq
negdotprod_sparse
querynorm_negdotprod_sparse
renyi_diverg
ab_diverg
metric_params (dict, optional) – Some distance metrics requires additional arguments which can be provided as a dictionary. Default is {}.
method (str, optional {'auto', 'small', 'big_nmslib', 'big_umap', 'big_nndescent'}) –
Method to use. Different algorithms strongly affect the computation speed. Optimal choices are dependent on the input size. Default is auto. * ‘auto’
Choose automatically based on input size. Recommended for most uses.
- ’small’
For input of small size, it is affordable to compute nearest-neigbors exactly and use matrix inversion in the algorithm.
- ’big_nmslib’
For input of large size. If NMSlib is installed, it is the fastest algorithm that is supported.
- ’big_nndescent’
For input of large size. The NN-Descent algorithm implemented in the the umap package. Do not require external libraries.
- ’big_nnumap’
For input of large size. Do not require external libraries.
_lambda (float or None, optional) – Regularization parameter \(\lambda\). If not specified \(\lambda\) is computed from regularization, specifying regularization instead of _lambda is generally recommended as it automatically adjusts based on graph density. Default is None.
init (array or None, optional) – Initialize output representation Z with this array if solved through iterative solver. init is only used when no_rotation=True and method is not small. Default is None.
symmetrize (bool, optional) – Symmetrize the nearest-neighbors graph by averaging with its transpose. Default is True.
custom_graph (sparse matrix or None, optional) – If specified, use this user-provided graph instead of constructing nearest-neighbors graph from X. Default is None.
rescale (bool, optional) – Rescale each dimension in the output to align with the input. It works with and without no_rotation. Default is False
refine_iter (int, optional) – Number of refinement iterations. Default is 3.
refine_threshold (float, optional) – Specify the cutoff for refinement. Unit is IQR (distance between the 25% quantile and the 75% quantile) of overall “stress” distribution across all edges. Default is 12.
_refine_threshold (float or None, optional) – Used internally. Default is None.
refine_protected_graph (sparse matrix or None, optional) – Sparse matrix of size (n_samples, n_samples). Specify edges (any non-zero edges in refine_protected_graph) that will not be removed during refinement. Useful for graph construction for complex experimental designs like batch design so that edges connecting batches will not be removed. Default is None.
tol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.
atol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.
nmslib_params (dict, optional) – Parameters given to nmslib createIndex function. See https://github.com/nmslib/nmslib/blob/master/manual/methods.md for details. Default is {‘post’: 2}.
n_jobs (int, optional) – Number of jobs to run in parallel. Specifying n_jobs > 1 can speed up computation when method is one of big_nmslib, big_nndescent, or big_nnumap. Note that when use_cuda=True specifying n_jobs does not run the jobs in parallel but can reduce amount of GPU memory required by splitting the problem and run sequentially. Default is 1.
n_jobs_nmslib (int, optional) – Number of threads used by nmslib. Default is 8.
use_cuda (bool, optional) – Use CUDA GPU acceleration. CuPy has to be installed (https://github.com/cupy/cupy) and a CUDA-enabled GPU is required. Default is False.
verbose (bool, optional) – Print additional information. Default is False.
return_all (bool, optional) – Return a tuple of (Z, W, graph). Default is False.
- Returns
Z (array, (optional) W: array, (optional) Graph: sparse matrix)
* Z (Embedding of the input data.)
* W (Linear operator on the feature space, returned if return_all==True.)
* graph (The graph used for GraphDR embedding, returned if return_all==True.)
- quasildr.graphdr.knn_graph(X, n_neighbors=15, space='l2', space_params=None, method='hnsw', num_threads=8, params={'post': 2}, verbose=False)[source]¶
Construct an approximate K-nearest neighbors graph with nmslib.
- Parameters
X (array, shape (n_samples, n_features)) – A input 2D array to construct KNN graph on. Each row represent a sample.
n_neighbors (int, optional) – Default is 15. The number of nearest neighbors per sample.
space (string, optional) –
Default is l2. The metric/non-metric distance functions to use to compute distances. Note that l2 corresponds to Euclidean distance. see https://github.com/nmslib/nmslib/blob/master/manual/manual.pdf
bit_hamming
l1
l1_sparse
l2, euclidean
l2_sparse
linf
linf_sparse
lp:p=…
lp_sparse:p=…
angulardist, angulardist sparse, angulardist sparse fast
jsmetrslow, jsmetrfast, jsmetrfastapprox
leven
sqfd minus func, sqfd heuristic func:alpha=…, sqfd gaussian func:alpha=…
jsdivslow, jsdivfast, jsdivfastapprox
cosinesimil, cosine, cosinesimil sparse
normleven
kldivfast
kldivgenslow, kldivgenfast, kldivgenfastrq
itakurasaitoslow, itakurasaitofast, itakurasaitofastrq
negdotprod_sparse
querynorm_negdotprod_sparse
renyi_diverg
ab_diverg
space_params (dict or None, optional) – Parameters for configuring the space.
method (str) – The nmslib method to use. Default is hnsw.
num_threads (int, optional) – Default is 8. Number of threads to run nmslib with.
params (dict, optional) – Default is {‘post’: 2}. Input parameters to nmslib index construction. See https://github.com/nmslib/nmslib/blob/master/manual/methods.md for details.
verbose (bool, optional) – Default is False. Print extra information.
- Returns
Sparse matrix representing the constructed KNN graph.
- Return type
scipy.sparse.csr_matrix