quasildr.graphdr

This module implements the quaslinear representation method GraphDR and relevant functions. The method is implemented in the graphdr function. A scikit-learn Estimator-like interface to graphdr is also provided through the GraphDR class.

Classes:

GraphDR([n_neighbors, regularization, ...])

Quasilinear data representation that preserves interpretability of a linear space.

Functions:

cg(A, B[, x0, tol, atol, use_cuda])

Solve X in linear equation AX = B when A is sparse with conjugate gradient method.

graphdr(X[, n_neighbors, regularization, ...])

Quasilinear data representation that preserves interpretability of a linear space.

knn_graph(X[, n_neighbors, space, ...])

Construct an approximate K-nearest neighbors graph with nmslib.

class quasildr.graphdr.GraphDR(n_neighbors=10, regularization=1000, n_components=None, no_rotation=False, metric='l2', metric_params={}, method='auto', _lambda=None, symmetrize=True, rescale=False, refine_iter=0, refine_threshold=12, _refine_threshold=None, tol=1e-07, atol=1e-15, nmslib_params={'post': 2}, n_jobs=1, n_jobs_nmslib=8, use_cuda=False, verbose=False)[source]

Bases: BaseEstimator

Quasilinear data representation that preserves interpretability of a linear space.

This class provides a scikit-learn type interface that wraps around the graphdr function.

Parameters
  • n_neighbors (int, optional) – Number of nearest neigbors to compute per sample. Default is 10.

  • regularization (float, optional) – Regularization parameter determines the weight on the graph shrinkage term relative to the reconstruction error term. A smaller regularization parameter leads to output that are more similar to a linear representation. A larger regularization parameter provides more contraction based on the graph. Default is 100.

  • n_components (int or None, optional) – Number of dimensions in the output. Specify a n_components smaller than the input dimension when no_rotation=True can reduce the amount of computation. Default is None.

  • no_rotation (bool, optional) – \(W\) is fixed to identity matrix so that the output is not rotated relative to input. no_rotation can also speed up the computation for large input when n_components is also specified, because only the required top n_components are computed. If no_rotation=True, you may consider preprocessing the input with another linear dimensionality reduction method like PCA. Default is False.

  • metric (str, optional) –

    distance metric for constructing nearest_neighbors graph. Note that here metric can be non-metric distance function. The allowed values depend on the method specified. As three nearest neighbors algorithms are supported, and these algorithms support different distance metrics. Default is euclidean. * ‘small’ :

    Use sklearn.neighbors.kneighbors_graph Supported values are

    • euclidean

    • manhattan

    • chebyshev

    • minkowski

    • wminkowski

    • seuclidean

    • mahalanobis

    • hamming

    • canberra

    • braycurtis

    See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

    • ’big_nndescent’ or ‘big_nnumap’ :
      umap.nndescent. Supported values are
      • euclidean

      • manhattan

      • chebyshev

      • minkowski

      • canberra

      • braycurtis

      • mahalanobis

      • wminkowski

      • seuclidean

      • cosine

      • correlation

      • haversine

      • hamming

      • jaccard

      • dice

      • russelrao

      • kulsinski

      • rogerstanimoto

      • sokalmichener

      • sokalsneath

      • yule

    • ’big_nmslib’ :

    nmslib. See https://github.com/nmslib/nmslib/blob/master/manual/manual.pdf for details:
    • bit_hamming

    • l1

    • l1_sparse

    • l2, euclidean

    • l2_sparse

    • linf

    • linf_sparse

    • lp:p=…

    • lp_sparse:p=…

    • angulardist, angulardist sparse, angulardist sparse fast

    • jsmetrslow, jsmetrfast, jsmetrfastapprox

    • leven

    • sqfd minus func, sqfd heuristic func:alpha=…, sqfd gaussian func:alpha=…

    • jsdivslow, jsdivfast, jsdivfastapprox

    • cosinesimil, cosine, cosinesimil sparse

    • normleven

    • kldivfast

    • kldivgenslow, kldivgenfast, kldivgenfastrq

    • itakurasaitoslow, itakurasaitofast, itakurasaitofastrq

    • negdotprod_sparse

    • querynorm_negdotprod_sparse

    • renyi_diverg

    • ab_diverg

  • metric_params (dict, optional) – Some distance metrics requires additional arguments which can be provided as a dictionary. Default is {}.

  • method (str, optional {'auto', 'small', 'big_nmslib', 'big_umap', 'big_nndescent'}) –

    Method to use. Different algorithms strongly affect the computation speed. Optimal choices are dependent on the input size. Default is auto. * ‘auto’

    Choose automatically based on input size. Recommended for most uses.

    • ’small’

      For input of small size, it is affordable to compute nearest-neigbors exactly and use matrix inversion in the algorithm.

    • ’big_nmslib’

      For input of large size. If NMSlib is installed, it is the fastest algorithm that is supported.

    • ’big_nndescent’

      For input of large size. The NN-Descent algorithm implemented in the the umap package. Do not require external libraries.

    • ’big_nnumap’

      For input of large size. Do not require external libraries.

  • _lambda (float or None, optional) – Regularization parameter \(\lambda\). If not specified \(\lambda\) is computed from regularization, specifying regularization instead of _lambda is generally recommended as it automatically adjusts based on graph density. Default is None.

  • symmetrize (bool, optional) – Symmetrize the nearest-neighbors graph by averaging with its transpose. Default is True.

  • rescale (bool, optional) – Rescale each dimension in the output to align with the input. It works with and without no_rotation. Default is False

  • refine_iter (int, optional) – Number of refinement iterations. Default is 3.

  • refine_threshold (float, optional) – Specify the cutoff for refinement. Unit is IQR (distance between the 25% quantile and the 75% quantile) of overall “stress” distribution across all edges. Default is 12.

  • _refine_threshold (float or None, optional) – Used internally. Default is None.

  • tol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.

  • atol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.

  • nmslib_params (dict, optional) – Parameters given to nmslib createIndex function. See https://github.com/nmslib/nmslib/blob/master/manual/methods.md for details. Default is {‘post’: 2}.

  • n_jobs (int, optional) – Number of jobs to run in parallel. Specifying n_jobs > 1 can speed up computation when method is one of big_nmslib, big_nndescent, or big_nnumap. Note that when use_cuda=True specifying n_jobs does not run the jobs in parallel but can reduce amount of GPU memory required by splitting the problem and run sequentially. Default is 1.

  • n_jobs_nmslib (int, optional) – Number of threads used by nmslib. Default is 8.

  • use_cuda (bool, optional) – Use CUDA GPU acceleration. CuPy has to be installed (https://github.com/cupy/cupy) and a CUDA-enabled GPU is required. Default is False.

  • verbose (bool, optional) – Print additional information. Default is False.

embedding_
Type

array, shape (n_samples, n_components)

W_

Right linear operator in the transformation Z = KXW. W_ is None when no_rotation=True.

Type

array, shape (n_features, n_components)

Methods:

fit(X[, y, custom_graph, refine_protected_graph])

Fit the model from data in X.

fit_transform(X[, y, custom_graph, ...])

Fit the model from data in X and transform X.

transform(X[, y, custom_graph, ...])

Apply fitted model to transform X.

fit(X, y=None, custom_graph=None, refine_protected_graph=None)[source]

Fit the model from data in X.

Parameters
  • X (array, shape (n_samples, n_features).) – X may be a sparse matrix. If self.skip_pca is true, X must be orthogonal.

  • y (Ignored) –

  • custom_graph (sparse matrix, shape (n_samples, n_samples), optional) – If provided, use it instead of the graph constructed from X.

  • refine_protected_graph (sparse matrix) –

Returns

self – Returns the instance itself.

Return type

object

fit_transform(X, y=None, custom_graph=None, refine_protected_graph=None)[source]

Fit the model from data in X and transform X. :param X: :type X: array, shape (n_samples, n_features) :param y: :type y: Ignored

Returns

Z – Embedding of the input data.

Return type

array, shape (n_samples, n_components)

transform(X, y=None, custom_graph=None, refine_protected_graph=None)[source]

Apply fitted model to transform X. :param X: :type X: array, shape (n_samples, n_features) :param y: :type y: Ignored

Returns

Z – Embedding of the input data.

Return type

array, shape (n_samples, n_components)

quasildr.graphdr.cg(A, B, x0=None, tol=1e-07, atol=1e-15, use_cuda=False)[source]

Solve X in linear equation AX = B when A is sparse with conjugate gradient method. Support GPU with use_cuda=True.

Parameters
  • A (sparse matrix or array) – A N-by-N matrix of the linear system.

  • B (array) – Right hand side of the linear system.

  • x0 (array or None, optional) – If specified, initialize the solution with x0.

  • tol (float, optional) – Tolerances for convergence norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.

  • atol (float, optional) – Tolerances for convergence norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.

  • use_cuda (bool, optional) – Use GPU acceleration with CUDA. A GPU supporting CUDA is needed and CuPy package need to be installed. Default is False.

Returns

The solution to X.

Return type

array

quasildr.graphdr.graphdr(X, n_neighbors=10, regularization=100, n_components=None, no_rotation=False, metric='euclidean', metric_params={}, method='auto', _lambda=None, init=None, symmetrize=True, custom_graph=None, rescale=False, refine_iter=0, refine_threshold=12, _refine_threshold=None, refine_protected_graph=None, tol=1e-07, atol=1e-15, nmslib_params={'post': 2}, n_jobs=1, n_jobs_nmslib=8, use_cuda=False, verbose=False, return_all=False)[source]

Quasilinear data representation that preserves interpretability of a linear space.

Parameters
  • X (array (n_samples, n_features)) – The input array.

  • n_neighbors (int, optional) – Number of nearest neigbors to compute per sample. Default is 10.

  • regularization (float, optional) – Regularization parameter determines the weight on the graph shrinkage term relative to the reconstruction error term. A smaller regularization parameter leads to output that are more similar to a linear representation. A larger regularization parameter provides more contraction based on the graph. Default is 100.

  • n_components (int or None, optional) – Number of dimensions in the output. Specify a n_components smaller than the input dimension when no_rotation=True can reduce the amount of computation. Default is None.

  • no_rotation (bool, optional) –

    param refine_protected_graph

    \(W\) is fixed to identity matrix so that the output is not rotated relative to input. no_rotation can also speed up the computation for large input when n_components is also specified, because only the required top n_components are computed. If no_rotation=True, you may consider preprocessing the input with another linear dimensionality reduction method like PCA. Default is False.

  • metric (str, optional) –

    distance metric for constructing nearest_neighbors graph. Note that here metric can be non-metric distance function. The allowed values depend on the method specified. As three nearest neighbors algorithms are supported, and these algorithms support different distance metrics. Default is euclidean. * ‘small’:

    Use sklearn.neighbors.kneighbors_graph Supported values are

    • euclidean

    • manhattan

    • chebyshev

    • minkowski

    • wminkowski

    • seuclidean

    • mahalanobis

    • hamming

    • canberra

    • braycurtis

    See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html

    • ’big_nndescent’ or ‘big_nnumap’:
      umap.nndescent. Supported values are
      • euclidean

      • manhattan

      • chebyshev

      • minkowski

      • canberra

      • braycurtis

      • mahalanobis

      • wminkowski

      • seuclidean

      • cosine

      • correlation

      • haversine

      • hamming

      • jaccard

      • dice

      • russelrao

      • kulsinski

      • rogerstanimoto

      • sokalmichener

      • sokalsneath

      • yule

    • ’big_nmslib’

    nmslib. See https://github.com/nmslib/nmslib/blob/master/manual/manual.pdf for details:
    • bit_hamming

    • l1

    • l1_sparse

    • l2, euclidean

    • l2_sparse

    • linf

    • linf_sparse

    • lp:p=…

    • lp_sparse:p=…

    • angulardist, angulardist sparse, angulardist sparse fast

    • jsmetrslow, jsmetrfast, jsmetrfastapprox

    • leven

    • sqfd minus func, sqfd heuristic func:alpha=…, sqfd gaussian func:alpha=…

    • jsdivslow, jsdivfast, jsdivfastapprox

    • cosinesimil, cosine, cosinesimil sparse

    • normleven

    • kldivfast

    • kldivgenslow, kldivgenfast, kldivgenfastrq

    • itakurasaitoslow, itakurasaitofast, itakurasaitofastrq

    • negdotprod_sparse

    • querynorm_negdotprod_sparse

    • renyi_diverg

    • ab_diverg

  • metric_params (dict, optional) – Some distance metrics requires additional arguments which can be provided as a dictionary. Default is {}.

  • method (str, optional {'auto', 'small', 'big_nmslib', 'big_umap', 'big_nndescent'}) –

    Method to use. Different algorithms strongly affect the computation speed. Optimal choices are dependent on the input size. Default is auto. * ‘auto’

    Choose automatically based on input size. Recommended for most uses.

    • ’small’

      For input of small size, it is affordable to compute nearest-neigbors exactly and use matrix inversion in the algorithm.

    • ’big_nmslib’

      For input of large size. If NMSlib is installed, it is the fastest algorithm that is supported.

    • ’big_nndescent’

      For input of large size. The NN-Descent algorithm implemented in the the umap package. Do not require external libraries.

    • ’big_nnumap’

      For input of large size. Do not require external libraries.

  • _lambda (float or None, optional) – Regularization parameter \(\lambda\). If not specified \(\lambda\) is computed from regularization, specifying regularization instead of _lambda is generally recommended as it automatically adjusts based on graph density. Default is None.

  • init (array or None, optional) – Initialize output representation Z with this array if solved through iterative solver. init is only used when no_rotation=True and method is not small. Default is None.

  • symmetrize (bool, optional) – Symmetrize the nearest-neighbors graph by averaging with its transpose. Default is True.

  • custom_graph (sparse matrix or None, optional) – If specified, use this user-provided graph instead of constructing nearest-neighbors graph from X. Default is None.

  • rescale (bool, optional) – Rescale each dimension in the output to align with the input. It works with and without no_rotation. Default is False

  • refine_iter (int, optional) – Number of refinement iterations. Default is 3.

  • refine_threshold (float, optional) – Specify the cutoff for refinement. Unit is IQR (distance between the 25% quantile and the 75% quantile) of overall “stress” distribution across all edges. Default is 12.

  • _refine_threshold (float or None, optional) – Used internally. Default is None.

  • refine_protected_graph (sparse matrix or None, optional) – Sparse matrix of size (n_samples, n_samples). Specify edges (any non-zero edges in refine_protected_graph) that will not be removed during refinement. Useful for graph construction for complex experimental designs like batch design so that edges connecting batches will not be removed. Default is None.

  • tol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.

  • atol (float, optional) – Tolerances for iterative linear solver norm(residual) <= max(tol*norm(b), atol). Default is tol=1e-7, atol=1e-15.

  • nmslib_params (dict, optional) – Parameters given to nmslib createIndex function. See https://github.com/nmslib/nmslib/blob/master/manual/methods.md for details. Default is {‘post’: 2}.

  • n_jobs (int, optional) – Number of jobs to run in parallel. Specifying n_jobs > 1 can speed up computation when method is one of big_nmslib, big_nndescent, or big_nnumap. Note that when use_cuda=True specifying n_jobs does not run the jobs in parallel but can reduce amount of GPU memory required by splitting the problem and run sequentially. Default is 1.

  • n_jobs_nmslib (int, optional) – Number of threads used by nmslib. Default is 8.

  • use_cuda (bool, optional) – Use CUDA GPU acceleration. CuPy has to be installed (https://github.com/cupy/cupy) and a CUDA-enabled GPU is required. Default is False.

  • verbose (bool, optional) – Print additional information. Default is False.

  • return_all (bool, optional) – Return a tuple of (Z, W, graph). Default is False.

Returns

  • Z (array, (optional) W: array, (optional) Graph: sparse matrix)

  • * Z (Embedding of the input data.)

  • * W (Linear operator on the feature space, returned if return_all==True.)

  • * graph (The graph used for GraphDR embedding, returned if return_all==True.)

quasildr.graphdr.knn_graph(X, n_neighbors=15, space='l2', space_params=None, method='hnsw', num_threads=8, params={'post': 2}, verbose=False)[source]

Construct an approximate K-nearest neighbors graph with nmslib.

Parameters
  • X (array, shape (n_samples, n_features)) – A input 2D array to construct KNN graph on. Each row represent a sample.

  • n_neighbors (int, optional) – Default is 15. The number of nearest neighbors per sample.

  • space (string, optional) –

    Default is l2. The metric/non-metric distance functions to use to compute distances. Note that l2 corresponds to Euclidean distance. see https://github.com/nmslib/nmslib/blob/master/manual/manual.pdf

    • bit_hamming

    • l1

    • l1_sparse

    • l2, euclidean

    • l2_sparse

    • linf

    • linf_sparse

    • lp:p=…

    • lp_sparse:p=…

    • angulardist, angulardist sparse, angulardist sparse fast

    • jsmetrslow, jsmetrfast, jsmetrfastapprox

    • leven

    • sqfd minus func, sqfd heuristic func:alpha=…, sqfd gaussian func:alpha=…

    • jsdivslow, jsdivfast, jsdivfastapprox

    • cosinesimil, cosine, cosinesimil sparse

    • normleven

    • kldivfast

    • kldivgenslow, kldivgenfast, kldivgenfastrq

    • itakurasaitoslow, itakurasaitofast, itakurasaitofastrq

    • negdotprod_sparse

    • querynorm_negdotprod_sparse

    • renyi_diverg

    • ab_diverg

  • space_params (dict or None, optional) – Parameters for configuring the space.

  • method (str) – The nmslib method to use. Default is hnsw.

  • num_threads (int, optional) – Default is 8. Number of threads to run nmslib with.

  • params (dict, optional) – Default is {‘post’: 2}. Input parameters to nmslib index construction. See https://github.com/nmslib/nmslib/blob/master/manual/methods.md for details.

  • verbose (bool, optional) – Default is False. Print extra information.

Returns

Sparse matrix representing the constructed KNN graph.

Return type

scipy.sparse.csr_matrix