| Title: | Bioinformatic Distances |
|---|---|
| Description: | A selection of distance measures for bioinformatics data. Other important distance measures for bioinformatics data are selected from the R package 'parallelDist'. A special distance measure for the Gene Ontology is available. |
| Authors: | Quirin Stier [aut, rev, ctb, cre] (ORCID: <https://orcid.org/0000-0002-7896-4737>), Michael Thrun [aut] (ORCID: <https://orcid.org/0000-0001-9542-5543>), Luca Brinkmann [ctb] |
| Maintainer: | Quirin Stier <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.3 |
| Built: | 2026-06-05 09:34:04 UTC |
| Source: | https://github.com/cran/BIDistances |
Calculates the cosine distance
CosinusDistance(Data)
Data |
[1:n,1:d] matrix with n cases, d variables |
https://en.wikipedia.org/wiki/Cosine_similarity
Distance |
[1:n,1:n] symmetric matrix, containing the distances of the cases (rows) for the given data |
The cosine distance $d$ is calculated from the cosine similarity $s(x, y) = \frac{\langle x, y \rangle}{\lVert x \rVert \, \lVert y \rVert}$ between two cases (rows) $x$ and $y$.
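As a rough illustration of the relation between similarity and distance, the following base R sketch computes the cosine similarity between all rows of a matrix. The conversion `1 - similarity` is an assumption (a common convention), and the helper names are hypothetical; this is not the package's `CosinusDistance()` implementation.

```r
# Cosine similarity between all pairs of rows of a numeric matrix.
cosine_similarity <- function(Data) {
  Data <- as.matrix(Data)
  norms <- sqrt(rowSums(Data^2))              # Euclidean norm of every row
  tcrossprod(Data) / outer(norms, norms)      # <x, y> / (||x|| * ||y||)
}

# ASSUMPTION: the distance is taken as 1 - similarity.
cosine_distance_sketch <- function(Data) {
  d <- 1 - cosine_similarity(Data)
  diag(d) <- 0                                # remove rounding noise on the diagonal
  d
}

X <- rbind(c(1, 0), c(0, 1), c(2, 0))
D <- cosine_distance_sketch(X)
D
```

Rows 1 and 3 point in the same direction, so their distance is 0 even though their lengths differ. Rows consisting only of zeros have no direction and would yield `NaN` in this sketch.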
Michael Thrun
data(Hepta)
distMatrix = CosinusDistance(Hepta$Data)
Calculates all distances from a given vector to the rows of a matrix.
Dist2All(X, Data, SelectFeatures, method = "euclidean", p = 2, knn = 1)
X |
A vector containing the data point to be compared to data. |
Data |
A matrix containing the data points to be compared with X. |
SelectFeatures |
A vector of the same length as X and the rows of Data, containing TRUE for all columns of Data to be compared and any other value for columns to be discarded. |
method |
(Optional) String specifying which distance measure is to be used. Euclidean by default. |
p |
(Optional) Scalar, the exponent of the Minkowski distance: the p-th root of the sum of the p-th powers of the differences of the components. Default is 2 |
knn |
(Optional) Scalar, the number k of nearest neighbors whose indices are returned. Default is 1 |
List with
distToAll |
A vector containing the distances from x to all rows of data. |
KNN |
Numeric vector, containing the indices of the k nearest neighbors (rows) to the given points |
This function is very inefficient for large Data.
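The idea of comparing one vector with every row of a matrix can be sketched in base R as follows. The helper `dist_to_all_sketch` is hypothetical, covers only the Minkowski family, and ignores `SelectFeatures`; it is not the package's `Dist2All()`.

```r
# Hypothetical helper: Minkowski distance of order p from x to every row of Data,
# plus the indices of the knn closest rows.
dist_to_all_sketch <- function(x, Data, p = 2, knn = 1) {
  Data <- as.matrix(Data)
  stopifnot(length(x) == ncol(Data))
  diffs <- abs(sweep(Data, 2, x))             # |Data[i, ] - x| for every row i
  d <- rowSums(diffs^p)^(1 / p)               # p-th root of the sum of p-th powers
  list(distToAll = d, KNN = order(d)[seq_len(knn)])
}

Data <- rbind(c(0, 0), c(3, 4), c(1, 0))
res <- dist_to_all_sketch(c(0, 0), Data, knn = 2)
res$distToAll
res$KNN
```

If the query point is itself a row of `Data`, it appears as its own nearest neighbor with distance 0, as in this example.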
Michael Thrun
data(Hepta)
Dist2All(Hepta$Data[1,],Hepta$Data)
Calculates the distribution of the distances between the data points
DistanceDistributions(Data,
  DistanceMethods = c('bhjattacharyya', 'bray', 'canberra', 'chord',
    'divergence', 'euclidean', 'minkowski', 'geodesic', 'hellinger',
    'kullback', 'manhattan', 'maximum', 'soergel', 'wave', 'whittaker'),
  CosineNonParallel = TRUE, CorrelationDist = TRUE, Mahalanobis = FALSE,
  Podani = FALSE, PlotIt = FALSE, PlotSampleSize = 5e3)
Data |
[1:n, 1:m] A matrix, containing data as rows. |
DistanceMethods |
Character vector stating all distance methods such as 'euclidean'. |
CosineNonParallel |
Boolean stating if cosine should be computed in parallel. |
CorrelationDist |
Boolean stating if CorrelationDist should be computed. |
Mahalanobis |
Boolean stating if Mahalanobis should be computed. |
Podani |
Boolean stating if Podani should be computed. |
PlotIt |
Boolean: TRUE => create plot. FALSE => no plot. |
PlotSampleSize |
Integer stating the number of samples for plotting. |
List with elements
DistanceMatrix |
[1:n, 1:n] numeric matrix containing the distance matrix |
DistanceChoice |
[1:n, 1:n] numeric matrix containing the distance matrix |
OrderedDistances |
[1:n, 1:n] numeric matrix containing the distance matrix |
ggobject |
ggplot object |
Michael Thrun
iris = datasets::iris
if (requireNamespace("DataVisualizations", quietly = TRUE)) {
  library(DataVisualizations)
  DistanceDistributions(as.matrix(iris[, 1:4]), c("euclidean"), PlotIt = FALSE)
}
Computes the distance between objects in the data matrix X, using the distance measure specified by method.
DistanceMatrix(X, method = 'euclidean', dim = 2, outputisvector = FALSE)
X |
data matrix [1:n,1:d], n cases d variables |
method |
Optional, method specified by distance string: 'binary', 'canberra', 'cityblock', 'euclidean', 'sqEuclidean', 'maximum', 'cosine', 'chebychev', 'jaccard', 'kendallM', 'kendallD', 'mahalanobis', 'minkowski', 'manhattan', 'braycur', 'wasserstein', 'pearsonD', 'spearmanD', 'pearsonM', 'spearmanM' |
dim |
Optional: scalar used if method is 'minkowski' or 'wasserstein'. For 'minkowski', the exponent p (the p-th root of the sum of the p-th powers of the differences of the components). For 'wasserstein', the order, which should then be set to 1. |
outputisvector |
Optional: should the output be converted to a vector |
Where possible, the implementation parallelized by the parallelDist package is used. Otherwise R implementations are used, except for Euclidean, for which a GPU implementation is provided.
'binary' (aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are 'on' and zero elements are 'off'. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
'cityblock': identical to 'manhattan'.
'maximum': Maximum distance between two components of x and y (supremum norm)
'cosine': calculates a similarity matrix sim between all column vectors of a matrix x. This matrix might be a document-term matrix, so columns would be expected to be documents and rows to be terms. The distance is then defined as D = max(sim) - sim.
'jaccard' Jaccard index is computed as 2B/(1+B), where B is Bray-Curtis dissimilarity: the number of items which occur in both elements divided by the total number of items in the elements (Sneath, 1957). This measure is often also called: binary, asymmetric binary, etc.
'mahalanobis': the squared generalized Mahalanobis distance between all pairs of rows in a data frame with respect to a covariance matrix $\Sigma$. The element of the i-th row and j-th column of the distance matrix is defined as $D_{ij}^2 = (x_i - x_j)^\top \Sigma^{-1} (x_i - x_j)$.
'minkowski': The p-norm, the p-th root of the sum of the p-th powers of the differences of the components.
'manhattan': Absolute distance between the two vectors (1-norm, aka L_1).
'chebychev': max(abs(x - y)).
'canberra': sum(abs(x - y) / abs(x + y)). Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing. This is intended for non-negative values (e.g., counts): taking the absolute value of the denominator is a 1998 R modification to avoid negative distances.
'braycur': sum(abs(x - y)) / sum(abs(x + y)).
'pearsonM': metric, see [Legendre, 1986] or [Bock, 1974, pp. 77-79]: sqrt((1 - r)+1)/2), with r being Pearson's correlation coefficient.
'spearmanM': metric, see [Legendre, 1986] or [Bock, 1974, pp. 77-79]: sqrt((1 - r)+1)/2), with r being Spearman's correlation coefficient.
'kendallM': metric, see [Legendre, 1986] or [Bock, 1974, pp. 77-79]: sqrt((1 - tau)+1)/2), with tau being Kendall's correlation coefficient.
'pearsonD': dissimilarity 1 - r, with r being Pearson's correlation coefficient.
'spearmanD': dissimilarity 1 - r, with r being Spearman's correlation coefficient.
'kendallD': dissimilarity 1 - tau, with tau being Kendall's correlation coefficient.
'cosine': see Wikipedia for the similarity-to-distance conversion: D(i,j) = max(S) - S(i,j).
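Several of the definitions above can be checked by hand for a single pair of vectors. The sketch below writes them out in base R and cross-checks the results against `stats::dist()`, which offers the same measures under the names 'manhattan', 'maximum', 'minkowski' and 'canberra'. It only illustrates the formulas and does not call `DistanceMatrix()`; the example vectors are made up.

```r
x <- c(1, 4, 2, 8)
y <- c(3, 1, 2, 5)

manhattan <- sum(abs(x - y))                  # 1-norm ('cityblock')
maximum   <- max(abs(x - y))                  # supremum norm ('chebychev')
minkowski <- sum(abs(x - y)^3)^(1 / 3)        # p-norm with p = 3
canberra  <- sum(abs(x - y) / abs(x + y))     # every denominator is non-zero here
pearsonD  <- 1 - cor(x, y)                    # correlation dissimilarity

# Cross-check against base R for the measures that stats::dist() provides.
xy <- rbind(x, y)
checks <- c(
  isTRUE(all.equal(manhattan, as.numeric(dist(xy, method = "manhattan")))),
  isTRUE(all.equal(maximum,   as.numeric(dist(xy, method = "maximum")))),
  isTRUE(all.equal(minkowski, as.numeric(dist(xy, method = "minkowski", p = 3)))),
  isTRUE(all.equal(canberra,  as.numeric(dist(xy, method = "canberra"))))
)
checks
```

For non-negative data such as these vectors, `abs(x + y)` and `abs(x) + abs(y)` coincide, so the different ways of writing the Canberra denominator agree.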
Dmatrix |
[1:n,1:n] Distance Matrix: Pairwise distance between pairs of objects |
Michael Thrun
Sneath, P. H. A. (1957) Some thoughts on bacterial classification. Journal of General Microbiology 17, pages 184-200.
Leydesdorff, L. (2005) Similarity Measures, Author Cocitation Analysis, and Information Theory. In: JASIST 56(7), pp. 769-772.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979) Multivariate Analysis. Academic Press.
Borg, I. and Groenen, P. (1997) Modern Multidimensional Scaling. Theory and Applications. Springer.
Mahalanobis, P. C. (1936) On the generalized distance in statistics. Proceedings of The National Institute of Sciences of India, 12:49-55.
data(Hepta)
Dmatrix = DistanceMatrix(Hepta$Data,method='euclidean')
Calculates pairwise Euclidean distances
fastPdist(X)
X |
[1:n,1:m] data to calculate distances to |
dist[1:n,1:n] distances
Michael Thrun
fastPdist(as.matrix(iris[,1:4]))
Calculates pairwise Euclidean distances
fastPdistC(Ar,Br)
Ar |
[1:n,1:m] data to calculate distances to |
Br |
[1:n,1:m] data to calculate distances to |
dist[1:n,1:n] distances
Felix Riedel
https://blog.felixriedel.com/2013/05/pairwise-distances-in-r/
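Fast pairwise Euclidean distances are commonly obtained from the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b>, which replaces explicit loops with one matrix product. The base R sketch below illustrates this identity; it is an assumption that the compiled function follows the same idea, and `pairwise_euclidean_sketch` is a hypothetical name.

```r
# Pairwise Euclidean distances between the rows of A and the rows of B.
pairwise_euclidean_sketch <- function(A, B = A) {
  A <- as.matrix(A)
  B <- as.matrix(B)
  # Squared distances via ||a||^2 + ||b||^2 - 2 * <a, b>
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B)
  sq[sq < 0] <- 0                             # guard against tiny negative rounding errors
  sqrt(sq)
}

X <- rbind(c(0, 0), c(3, 4), c(6, 8))
D <- pairwise_euclidean_sketch(X)
D
```

The clamping step matters because the subtraction can produce values slightly below zero for identical or nearly identical rows, which would otherwise turn into `NaN` under the square root.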
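Calculates the distance matrix using the fractional distance $\mathrm{dist}(x, y) = \left( \sum_{k=1}^{d} |x_k - y_k|^p \right)^{1/p}$.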
FractionalDistance(Data, p)
Data |
[1:n,1:d] Matrix, with n cases, d variables |
p |
Scalar, value for p |
Values of p < 1 can be used, which can be useful for high-dimensional data, see references.
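A minimal base R sketch of a fractional distance, assuming the usual Minkowski-type form with an exponent p that may be smaller than 1. The helper name is hypothetical and the loop is written for clarity rather than speed; it is not the package's `FractionalDistance()`.

```r
# Pairwise distances (sum_k |x_k - y_k|^p)^(1/p) between all rows of Data.
fractional_distance_sketch <- function(Data, p) {
  Data <- as.matrix(Data)
  n <- nrow(Data)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) {
    diffs <- abs(sweep(Data, 2, Data[i, ]))   # |Data[j, ] - Data[i, ]| for all j
    D[i, ] <- rowSums(diffs^p)^(1 / p)
  }
  D
}

X <- rbind(c(0, 0), c(1, 1), c(4, 0))
D_half <- fractional_distance_sketch(X, p = 1 / 2)
D_half
```

With p = 1/2, the rows (0, 0) and (1, 1) are at distance (1 + 1)^2 = 4, the same as (0, 0) and (4, 0). For p < 1 the triangle inequality no longer holds in general, so the result is a dissimilarity rather than a metric.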
DistanceMatrix |
[1:n,1:n] symmetric Matrix, containing the distances between the cases (rows) of the input matrix |
Michael Thrun
Aggarwal, C. C., Hinneburg, A., Keim, D. (2001), On the Surprising Behavior of Distance Metrics in High Dimensional Space.
data(Hepta)
distMatrix = FractionalDistance(Hepta$Data, p = 1/2)
Calculates pairwise Gini distances
GiniDist(Data)
Data |
[1:n,1:d] data to calculate distances to |
dist[1:n,1:n] distances
Michael Thrun
GiniDist(as.matrix(iris[,1:4]))
Hearingloss data, with Gene2GoTerm matrix.
data('Hearingloss_N109')
FeatureMatrix_Gene2Term contains the dataset, NCBI contains the row names for the genes, and GoTerm_Header contains the column names for the GO terms. The data matrix has 109 rows (genes) and 829 columns (GO terms).
NCBI OtoGenome Test for Hearing Loss, accessed 24 June 2022.
GeneTestingRegistry (2018). OtoGenome Test for Hearing Loss Retrieved 2017
data(Hearingloss_N109)
str(Hearingloss_N109)
Clearly defined clusters, different variances. Detailed description of dataset and its clustering challenge is provided in [Thrun/Ultsch, 2020].
data('Hepta')
Size 212, Dimensions 3, stored in Hepta$Data
Classes 7, stored in Hepta$Cls
[Ultsch, 2003] Ultsch, A.: Maps for the visualization of high-dimensional data spaces, Proc. Workshop on Self organizing Maps (WSOM), pp. 225-230, Kyushu, Japan, 2003.
[Thrun/Ultsch, 2020] Thrun, M. C., & Ultsch, A.: Clustering Benchmark Datasets Exploiting the Fundamental Clustering Problems, Data in Brief, Vol. 30(C), pp. 105501, doi:10.1016/j.dib.2020.105501, 2020.
data(Hepta)
str(Hepta)
Computes the Jaccard dissimilarity index, which is calculated as 2B/(1+B), where B is the Bray-Curtis dissimilarity.
jaccard(X)
X |
Distance Matrix |
Cosine distance of the two vectors x, y
MT
jaccard(as.matrix(iris[,1:4]))
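The relation 2B/(1+B) from the description can be verified for a single pair of non-negative vectors. The sketch below assumes the standard Bray-Curtis form sum(|x - y|) / sum(x + y) and compares the result with the quantitative Jaccard (Ruzicka) dissimilarity computed directly from minima and maxima. The helper names are hypothetical and this is not the package's `jaccard()` implementation.

```r
# Bray-Curtis dissimilarity for non-negative vectors (assumed standard form).
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Jaccard dissimilarity derived from Bray-Curtis as 2B / (1 + B).
jaccard_from_bray <- function(x, y) {
  B <- bray_curtis(x, y)
  2 * B / (1 + B)
}

x <- c(1, 2, 3)
y <- c(2, 2, 1)

J <- jaccard_from_bray(x, y)
# Direct computation: 1 - sum of minima / sum of maxima.
J_direct <- 1 - sum(pmin(x, y)) / sum(pmax(x, y))
c(J, J_direct)
```

Both routes give 3/7 for these vectors, since B = 3/11 and 2B/(1+B) = (6/11)/(14/11).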
Function to calculate the squared generalized Mahalanobis distance between all pairs of rows in a data frame with respect to a covariance matrix $\Sigma$. The element of the i-th row and j-th column of the distance matrix is defined as $D_{ij}^2 = (x_i - x_j)^\top \Sigma^{-1} (x_i - x_j)$.
Mahalanobis(X, cov, inverted = FALSE)
X |
a matrix of data (n x d) n cases, d variables |
cov |
a variance-covariance matrix (p x p). |
inverted |
logical. If |
Distances[1:n,1:n]
Copy of a function from the biotools package, because that package does not work under macOS.
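The pairwise squared Mahalanobis distance can be sketched with base R's `stats::mahalanobis()`, which computes (x - center)' S^-1 (x - center) for every row of a matrix. The helper below is hypothetical, and it assumes that `inverted = TRUE` means `cov` already holds the inverse covariance matrix, as in `stats::mahalanobis()`. The data are the values from the usage example.

```r
# Squared Mahalanobis distance between all pairs of rows of X.
mahalanobis_pairs_sketch <- function(X, cov, inverted = FALSE) {
  X <- as.matrix(X)
  cov_inv <- if (inverted) cov else solve(cov)   # ASSUMPTION about 'inverted'
  n <- nrow(X)
  D2 <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # Distances of every row to row i, using the precomputed inverse.
    D2[i, ] <- stats::mahalanobis(X, center = X[i, ], cov = cov_inv, inverted = TRUE)
  }
  D2
}

x <- cbind(x1 = c(131.37, 132.37, 134.47, 135.50, 136.17),
           x2 = c(133.60, 132.70, 133.80, 132.30, 130.33),
           x3 = c(99.17, 99.07, 96.03, 94.53, 93.50),
           x4 = c(50.53, 50.23, 50.57, 51.97, 51.37))
Cov <- matrix(c(21.112, 0.038, 0.078, 2.01,
                0.038, 23.486, 5.2, 2.844,
                0.078, 5.2, 24.18, 1.134,
                2.01, 2.844, 1.134, 10.154), 4, 4)
D2 <- mahalanobis_pairs_sketch(x, Cov)
round(D2, 4)
```

Inverting the covariance matrix once and reusing it avoids a repeated `solve()` call inside the loop.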
Anderson Rodrigo da Silva <[email protected]>
Mahalanobis, P. C. (1936) On the generalized distance in statistics. Proceedings of The National Institute of Sciences of India, 12:49-55.
# Manly (2004, p.65-66)
x1 <- c(131.37, 132.37, 134.47, 135.50, 136.17)
x2 <- c(133.60, 132.70, 133.80, 132.30, 130.33)
x3 <- c(99.17, 99.07, 96.03, 94.53, 93.50)
x4 <- c(50.53, 50.23, 50.57, 51.97, 51.37)
x <- cbind(x1, x2, x3, x4)
Cov <- matrix(c(21.112,0.038,0.078,2.01,
                0.038,23.486,5.2,2.844,
                0.078,5.2,24.18,1.134,
                2.01,2.844,1.134,10.154), 4, 4)
Mahalanobis(x, Cov)
# End (not run)
msmd
msmd(Values1, Values2, ParameterC)
Values1 |
[1:N1] Numeric vector with values of the first time series. |
Values2 |
[1:N1] Numeric vector with values of the second time series. |
ParameterC |
Numeric vector with time stamps of the first time series. |
List with elements
Value |
Distance measure |
Quirin Stier
msmd(1:10, 1:10)
Returns the index of the nearest neighbour of a given data point.
nearest(Data, i, defined)
Data |
A matrix holding n data points as row vectors. |
i |
The index of the data point whose nearest neighbour is sought. |
defined |
A row vector with 1 for all columns of data that are used for the computation. If missing, all columns are used. |
nNInd |
The index of the nearest neighbour of data[i, ] |
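A nearest-neighbour lookup restricted to selected columns can be sketched in base R as follows. The use of the Euclidean distance is an assumption, and `nearest_sketch` is a hypothetical helper rather than the package's `nearest()`.

```r
# Index of the row closest to row i, using only the columns flagged with 1.
nearest_sketch <- function(Data, i, defined = rep(1, ncol(Data))) {
  Data <- as.matrix(Data)
  cols <- which(defined == 1)
  diffs <- sweep(Data[, cols, drop = FALSE], 2, Data[i, cols])
  d <- sqrt(rowSums(diffs^2))                 # ASSUMPTION: Euclidean distance
  d[i] <- Inf                                 # a point is not its own neighbour
  which.min(d)
}

Data <- rbind(c(0, 0), c(10, 0.1), c(0.5, 5))
nearest_sketch(Data, i = 1)                       # uses both columns
nearest_sketch(Data, i = 1, defined = c(0, 1))    # uses the second column only
```

With both columns, row 3 is closest to row 1. When only the second column is used, row 2 becomes the nearest neighbour, which shows how the column mask can change the result.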
Michael Thrun, Raphael Paebst
nearest(Data = as.matrix(iris[,1:4]), i = 1)
Computes the term frequency inverse document frequency (tfidf) distance for a FeatureMatrix_Gene2GoTerm. In case of genes with annotated GOterms from gene ontology genes can be interpreted as documents and GOterms as terms.
Tfidf_dist(FeatureMatrix_Gene2GoTerm, tf_fun = mean)
FeatureMatrix_Gene2GoTerm |
[1:n,1:d] Matrix, with n genes and d GO-Terms. |
tf_fun |
Function, defining the numerator value in the normalized Term-frequency. The default is the mean of the not 0 values. |
For the FeatureMatrix_Gene2GoTerm it is:
FeatureMatrix_Gene2GoTerm[i,j] > 0 iff GO term j is relevant for gene i. FeatureMatrix_Gene2GoTerm[i,j] > 1 if gene i is annotated to GO term j with more than one evidence code. In document terms, FeatureMatrix_Gene2GoTerm[i,j] is the frequency of term j's occurrence in document i.
List with
dist |
Numeric vector containing the tfidf distances between the documents, i.e. the absolute differences of the TfidfWeights |
TfidfWeights |
[1:n] Numeric vector containing the term frequency inverse document frequency weights used for the distance, given as term frequency * inverse document frequency |
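The last step described above, turning one weight per document into pairwise distances, can be sketched in base R. The weights below are made up for illustration, and arranging the result as a matrix is an assumption made for readability; the layout returned by `Tfidf_dist()` may differ.

```r
# Made-up tf-idf weights, one per gene (document).
w <- c(geneA = 0.2, geneB = 0.5, geneC = 0.9)

# Pairwise distance = absolute difference of the weights.
D <- abs(outer(w, w, "-"))
D
```

Because each gene is summarized by a single number, genes with similar weights end up close to each other regardless of which GO terms produced the weight.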
Michael Thrun
Stier, Q. and Thrun, M. C.: Deriving homogeneous subsets from gene sets by exploiting the Gene Ontology, Informatica, in review, 2023
data(Hearingloss_N109)
V = Tfidf_dist(Hearingloss_N109$FeatureMatrix_Gene2Term)
dist = V$dist
TfidfWeights = V$TfidfWeights
Calculate toroid Euclidean Distances
positionxy |
One datapoint |
AllPositions[1:n, 1:2] |
All other data points |
Lines, Columns |
Size of planar grid |
Dist2All[1:n, 1:n]: distance(s) between positionxy and AllPositions
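On a toroidal grid the borders wrap around, so the distance along each axis is the shorter of the direct path and the path across the border. The base R sketch below assumes that the first coordinate wraps over `Lines` and the second over `Columns`; `toroid_dist_sketch` is a hypothetical helper, not the package's `ToroidDist2All()`.

```r
# Euclidean distance on a grid whose opposite borders are glued together.
toroid_dist_sketch <- function(positionxy, AllPositions, Lines, Columns) {
  AllPositions <- matrix(AllPositions, ncol = 2)
  dx <- abs(AllPositions[, 1] - positionxy[1])
  dy <- abs(AllPositions[, 2] - positionxy[2])
  dx <- pmin(dx, Lines - dx)                  # shorter way around along the lines
  dy <- pmin(dy, Columns - dy)                # shorter way around along the columns
  sqrt(dx^2 + dy^2)
}

positionxy <- c(1, 1)
AllPositions <- rbind(c(2, 3), c(5, 2), c(40, 80))
d <- toroid_dist_sketch(positionxy, AllPositions, Lines = 40, Columns = 80)
d
```

The third point lies in the opposite corner of the planar grid, yet on the torus it is only one step away along each axis, giving a distance of sqrt(2).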
MT
positionxy = c(1,1)
AllPositions = rbind(c(2,3), c(5,2))
Lines = 40
Columns = 80
ToroidDist2All(positionxy, AllPositions, Lines, Columns)
TransformSimilarity2MetricDistance
TransformSimilarity2MetricDistance(Similarity)
Similarity |
Similarity |
Similarity
Michael Thrun
Data_S = fastPdist(as.matrix(iris[,1:4]))
Data_S = Data_S-min(Data_S)
Data_S = Data_S/max(Data_S)
diag(Data_S) = 1
TransformSimilarity2MetricDistance(Data_S)
twed
twed(Values1, Values2, Time1, Time2, Nu = 1, Lambda = 1, Degree = 2)
Values1 |
[1:N1] Numeric vector with values of the first time series. |
Values2 |
[1:N1] Numeric vector with values of the second time series. |
Time1 |
[1:N1] Numeric vector with time stamps of the first time series. |
Time2 |
[1:N1] Numeric vector with time stamps of the second time series. |
Nu |
Optional, Numeric: Elasticity parameter - nu >=0 needed for distance measure. |
Lambda |
Optional, Numeric: Penalty for deletion operation. |
Degree |
Optional, Integer: Degree of the p norm for local cost. |
List with elements
TWED |
TWED distance between time series Values1 (Time1) and Values2 (Time2) |
DPMatrix |
[1:n, 1:m] Numeric matrix |
Quirin Stier
twed(1:10, 1:10, 1:10, 1:10)
Computes the variable precision
VariablePrecision(Variable)
Variable |
Numeric vector [1:n] or matrix [1:n, 1:d] |
MinAbsDiff, MinAbsNZDiff, MinExpo
Michael Thrun
data(Hepta)
distMat = VariablePrecision(as.matrix(iris[, 1]))
distMat = VariablePrecision(as.matrix(iris[, 1:4]))
Computes the Wasserstein distance for a data matrix
WassersteinDist(Data, p = 1, InverseWeighting = FALSE)
Data |
data matrix of n cases and d features |
p |
scalar of at least one, the power to which the Euclidean distance between points is taken in order to compute transportation costs. Default is 1 |
InverseWeighting |
weighting per row can be either 1 (FALSE) or 1/n (TRUE) |
Wasserstein distance, also known as Earth Mover’s Distance (EMD) is the distance between two probability distributions over a region D. The Wasserstein distance of order p is defined as the p-th root of the total cost incurred when transporting measure a to measure b in an optimal way, where the cost of transporting a unit of mass from x to y is given as the p-th power of the Euclidean distance.
It is claimed to be useful for distributions that do not align well with traditional measures like Euclidean distance.
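For intuition, the definition simplifies considerably in one dimension: for two samples of equal size with uniform weights, the optimal transport plan matches the sorted values, so the order-p distance is the p-th root of the mean p-th power of the sorted differences. The base R sketch below covers only this special case and is not how `WassersteinDist()` treats the rows of a data matrix; the helper name is hypothetical.

```r
# Wasserstein distance of order p between two equally sized 1-D samples.
wasserstein_1d_sketch <- function(a, b, p = 1) {
  stopifnot(length(a) == length(b), p >= 1)
  mean(abs(sort(a) - sort(b))^p)^(1 / p)      # match sorted values pairwise
}

a <- c(0, 1, 3)
b <- c(8, 5, 6)                               # a shifted by 5, in a different order
w1 <- wasserstein_1d_sketch(a, b, p = 1)
w2 <- wasserstein_1d_sketch(a, b, p = 2)
c(w1, w2)
```

Since `b` is a pure shift of `a`, every unit of mass travels the same distance of 5, and the result is 5 for any order p.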
matrix of distances, symmetric
Michael Thrun
...
data(Hepta)
distMat=WassersteinDist(Hepta$Data)