Title: | Time Series Analysis Toolkit Based on Symbolic Aggregate Discretization, i.e. SAX |
---|---|
Description: | Implements time series z-normalization, SAX, HOT-SAX, VSM, SAX-VSM, RePair, and RRA algorithms facilitating time series motif (i.e., recurrent pattern), discord (i.e., anomaly), and characteristic pattern discovery along with interpretable time series classification. |
Authors: | Pavel Senin [aut, cre] |
Maintainer: | Pavel Senin <[email protected]> |
License: | GPL-2 |
Version: | 1.1.1 |
Built: | 2025-02-10 04:43:53 UTC |
Source: | https://github.com/jmotif/jmotif-r |
Translates an alphabet size into the array of corresponding SAX cut-lines built using the Normal distribution.
alphabet_to_cuts(a_size)
a_size |
the alphabet size, a value between 2 and 20 (inclusive). |
Lonardi, S., Lin, J., Keogh, E., Patel, P., Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (pp. 53-68). (2002)
alphabet_to_cuts(5)
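As a rough illustration of how cut-lines can be derived from the Normal distribution, the base-R sketch below places the breakpoints at equiprobable quantiles of N(0, 1). The helper name `cuts_sketch` is hypothetical and not part of the package; whether the package's own result also carries a leading `-Inf` sentinel is not stated here, so the sketch returns the interior breakpoints only.

```r
# Hypothetical helper: interior SAX breakpoints for a given alphabet size.
# Assumption: breakpoints split N(0, 1) into a_size equiprobable regions.
cuts_sketch <- function(a_size) {
  stopifnot(a_size >= 2, a_size <= 20)
  qnorm(seq_len(a_size - 1) / a_size)
}

cuts_sketch(5)  # four breakpoints, symmetric around zero
```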
Computes TF-IDF weight vectors for a set of word bags.
bags_to_tfidf(data)
data |
the list containing the input word bags. |
Senin, P. and Malinchik, S., SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1175-1180, 7-10 Dec. 2013.
Salton, G., Wong, A., Yang, C., A vector space model for automatic indexing. Commun. ACM 18, 11, 613-620, 1975.
bag1 = data.frame(
  "words" = c("this", "is", "a", "sample"),
  "counts" = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)
bag2 = data.frame(
  "words" = c("this", "is", "another", "example"),
  "counts" = c(1, 1, 2, 3),
  stringsAsFactors = FALSE
)
ll = list("bag1" = bag1, "bag2" = bag2)
tfidf = bags_to_tfidf(ll)
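To make the weighting idea concrete, here is a minimal base-R sketch of TF-IDF over word bags of the same shape as in the example. It uses one common log-scaled variant, `log(1 + tf) * log(N / df)`; this formula and the helper name `tfidf_sketch` are assumptions for illustration, and the package's exact weighting and output layout may differ.

```r
# Hypothetical helper: TF-IDF weights, one row per word, one column per bag.
# Assumption: weight = log(1 + tf) * log(N / df), natural logarithm.
tfidf_sketch <- function(bags) {
  vocab <- sort(unique(unlist(lapply(bags, function(b) b$words))))
  # term-frequency matrix; words missing from a bag get a count of 0
  tf <- sapply(bags, function(b) {
    counts <- setNames(b$counts, b$words)
    out <- counts[vocab]
    out[is.na(out)] <- 0
    unname(out)
  })
  rownames(tf) <- vocab
  df <- rowSums(tf > 0)             # number of bags containing each word
  idf <- log(length(bags) / df)     # words present in every bag get 0
  log(1 + tf) * idf                 # idf is recycled down the rows
}

bag1 <- data.frame(words = c("this", "is", "a", "sample"),
                   counts = c(1, 1, 2, 1), stringsAsFactors = FALSE)
bag2 <- data.frame(words = c("this", "is", "another", "example"),
                   counts = c(1, 1, 2, 3), stringsAsFactors = FALSE)
weights <- tfidf_sketch(list(bag1 = bag1, bag2 = bag2))
```

Words shared by every bag ("this", "is") receive zero weight, which is what makes the remaining words characteristic of their class.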
A standard UCR Cylinder-Bell-Funnel dataset from http://www.cs.ucr.edu/~eamonn/time_series_data
CBF
A four-element list containing the train and test data along with their labels:
labels_train: the training data labels, correspond to data matrix rows
data_train: the training data matrix, each row is a time series instance
labels_test: the test data labels, correspond to data matrix rows
data_test: the test data matrix, each row is a time series instance
Computes the cosine similarity between numeric vectors
cosine_dist(m)
m |
the data matrix |
Returns the cosine similarity
a <- c(2, 1, 0, 2, 0, 1, 1, 1)
b <- c(2, 1, 1, 1, 1, 0, 1, 1)
sim <- cosine_dist(rbind(a,b))
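For reference, the quantities involved can be computed by hand in base R. The function is named as a distance but documented as returning a similarity, so the sketch below shows both values for the two vectors from the example; which of them the package returns is not asserted here.

```r
a <- c(2, 1, 0, 2, 0, 1, 1, 1)
b <- c(2, 1, 1, 1, 1, 0, 1, 1)

# cosine similarity: dot product divided by the product of the vector norms
cos_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# cosine distance: the complement of the similarity
cos_dist <- 1 - cos_sim
```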
Computes the cosine distance value between a bag of words and a set of TF-IDF weight vectors.
cosine_sim(data)
data |
the list containing a word-bag and the TF-IDF object. |
Senin, P. and Malinchik, S., SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1175-1180, 7-10 Dec. 2013.
Salton, G., Wong, A., Yang, C., A vector space model for automatic indexing. Commun. ACM 18, 11, 613-620, 1975.
Finds the Euclidean distance between points. If the distance exceeds the threshold, the computation is abandoned and NaN is returned.
early_abandoned_dist(seq1, seq2, upper_limit)
seq1 |
the array 1. |
seq2 |
the array 2. |
upper_limit |
the upper limit; once the distance exceeds this value, the computation stops and NaN is returned. |
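The early-abandoning idea can be sketched in base R as follows. The helper name `early_abandoned_sketch` is hypothetical, and comparing the running sum of squares against the squared limit is an assumption about how the cut-off is applied, not a statement about the package internals.

```r
# Hypothetical helper: Euclidean distance with early abandoning.
early_abandoned_sketch <- function(seq1, seq2, upper_limit) {
  stopifnot(length(seq1) == length(seq2))
  limit_sq <- upper_limit^2
  acc <- 0
  for (i in seq_along(seq1)) {
    acc <- acc + (seq1[i] - seq2[i])^2
    # the sum of squares only grows, so the limit can never be met again
    if (acc > limit_sq) return(NaN)
  }
  sqrt(acc)
}

early_abandoned_sketch(c(0, 0), c(3, 4), 10)  # distance is within the limit
early_abandoned_sketch(c(0, 0), c(3, 4), 4)   # abandoned
```

Abandoning early pays off in discord search, where most candidate pairs are rejected long before the full window has been compared.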
A PHYSIONET dataset
ecg0606
A vector of numeric values
Finds the Euclidean distance between points.
euclidean_dist(seq1, seq2)
seq1 |
the array 1. |
seq2 |
the array 2. |
Finds discords using a brute-force algorithm.
find_discords_brute_force(ts, w_size, discords_num)
ts |
the input timeseries. |
w_size |
the sliding window size. |
discords_num |
the number of discords to report. |
Keogh, E., Lin, J., Fu, A., HOT SAX: Efficiently finding the most unusual time series subsequence. Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM '05).
discords = find_discords_brute_force(ecg0606[1:600], 100, 1)
plot(ecg0606[1:600], type = "l", col = "cornflowerblue", main = "ECG 0606")
lines(x=c(discords[1,2]:(discords[1,2]+100)),
  y=ecg0606[discords[1,2]:(discords[1,2]+100)], col="red")
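The brute-force search itself is simple to state: for every window, find the distance to its nearest non-overlapping neighbour, then report the window for which that distance is largest. The base-R sketch below illustrates this for a single discord. The helper name `brute_force_discord_sketch` is hypothetical, windows are compared on raw values (no z-normalization), and the returned layout is an assumption that need not match the package's result.

```r
# Hypothetical helper: single top discord by exhaustive search.
brute_force_discord_sketch <- function(ts, w_size) {
  n <- length(ts) - w_size + 1
  best_pos <- NA_integer_
  best_dist <- -Inf
  for (i in seq_len(n)) {
    nn <- Inf                                # nearest-neighbour distance of window i
    for (j in seq_len(n)) {
      if (abs(i - j) < w_size) next          # skip overlapping (trivial) matches
      d <- sqrt(sum((ts[i:(i + w_size - 1)] - ts[j:(j + w_size - 1)])^2))
      if (d < nn) nn <- d
    }
    if (is.finite(nn) && nn > best_dist) {
      best_dist <- nn
      best_pos <- i
    }
  }
  c(position = best_pos, nn_distance = best_dist)
}

# periodic signal with a planted bump at positions 200..210
demo <- sin(2 * pi * (0:399) / 40)
demo[200:210] <- demo[200:210] + 3
brute_force_discord_sketch(demo, 20)
```

The quadratic number of window comparisons is what HOT-SAX is designed to avoid.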
Finds a discord (i.e., time series anomaly) with HOT-SAX. Usually works best with smaller discretization parameters (i.e., PAA and alphabet sizes).
find_discords_hotsax(ts, w_size, paa_size, a_size, n_threshold, discords_num)
ts |
the input timeseries. |
w_size |
the sliding window size. |
paa_size |
the PAA size. |
a_size |
the alphabet size. |
n_threshold |
the normalization threshold. |
discords_num |
the number of discords to report. |
Keogh, E., Lin, J., Fu, A., HOT SAX: Efficiently finding the most unusual time series subsequence. Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM '05).
discords = find_discords_hotsax(ecg0606, 100, 3, 3, 0.01, 1)
plot(ecg0606, type = "l", col = "cornflowerblue", main = "ECG 0606")
lines(x=c(discords[1,2]:(discords[1,2]+100)),
  y=ecg0606[discords[1,2]:(discords[1,2]+100)], col="red")
Finds a discord with the RRA (Rare Rule Anomaly) algorithm. Usually works best with larger discretization parameters (i.e., PAA and alphabet sizes) than those used for HOT-SAX.
find_discords_rra(series, w_size, paa_size, a_size, nr_strategy, n_threshold, discords_num)
series |
the input timeseries. |
w_size |
the sliding window size. |
paa_size |
the PAA size. |
a_size |
the alphabet size. |
nr_strategy |
the numerosity reduction strategy ("none", "exact", "mindist"). |
n_threshold |
the normalization threshold. |
discords_num |
the number of discords to report. |
Senin, P. and Malinchik, S., SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. Data Mining (ICDM), 2013 IEEE 13th International Conference on.
discords = find_discords_rra(ecg0606, 100, 4, 4, "none", 0.01, 1)
plot(ecg0606, type = "l", col = "cornflowerblue", main = "ECG 0606")
lines(x=c(discords[1,2]:(discords[1,2]+100)),
  y=ecg0606[discords[1,2]:(discords[1,2]+100)], col="red")
A standard UCR Gun Point dataset from http://www.cs.ucr.edu/~eamonn/time_series_data
Gun_Point
A four-element list containing the train and test data along with their labels:
labels_train: the training data labels, correspond to data matrix rows
data_train: the training data matrix, each row is a time series instance
labels_test: the test data labels, correspond to data matrix rows
data_test: the test data matrix, each row is a time series instance
Get the ASCII letter by an index.
idx_to_letter(idx)
idx |
the index. |
# letter 'b'
idx_to_letter(2)
Compares two strings using mindist.
is_equal_mindist(a, b)
a |
the string a. |
b |
the string b. |
is_equal_mindist("aaa", "bbb") # true
is_equal_mindist("aaa", "ccc") # false
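The outcomes in the example follow from a property of the SAX MINDIST lookup table: identical and adjacent letters are at distance zero. A minimal base-R sketch of that comparison, with the hypothetical helper name `mindist_equal_sketch` and assuming lowercase ASCII words of equal length, could look like this:

```r
# Hypothetical helper: TRUE when no pair of aligned letters is more than
# one alphabet position apart, i.e. when the MINDIST between the words is 0.
mindist_equal_sketch <- function(a, b) {
  ia <- utf8ToInt(a)
  ib <- utf8ToInt(b)
  length(ia) == length(ib) && all(abs(ia - ib) <= 1)
}

mindist_equal_sketch("aaa", "bbb")  # adjacent letters only
mindist_equal_sketch("aaa", "ccc")  # letters two positions apart
```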
Compares two strings using natural letter ordering.
is_equal_str(a, b)
a |
the string a. |
b |
the string b. |
is_equal_str("aaa", "bbb")
is_equal_str("ccc", "ccc")
Get the index for an ASCII letter.
letter_to_idx(letter)
letter |
the letter. |
# letter 'b' translates to 2
letter_to_idx('b')
Get the sequence of ASCII letter indexes for a given character array.
letters_to_idx(str)
str |
the character array. |
letters_to_idx(c('a','b','c','a'))
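The letter and index conversions correspond to lookups in R's built-in `letters` constant. The sketch below uses hypothetical helper names and assumes 1-based indexes over lowercase letters, in line with the 'b' to 2 mapping shown in the examples.

```r
# Hypothetical helpers mirroring the index and letter conversions.
idx_to_letter_sketch <- function(idx) letters[idx]
letters_to_idx_sketch <- function(chars) match(chars, letters)

idx_to_letter_sketch(2)
letters_to_idx_sketch(c("a", "b", "c", "a"))
```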
Converts a set of time-series into a single bag of words.
manyseries_to_wordbag(data, w_size, paa_size, a_size, nr_strategy, n_threshold)
data |
the timeseries data, row-wise. |
w_size |
the sliding window size. |
paa_size |
the PAA size. |
a_size |
the alphabet size. |
nr_strategy |
the NR strategy. |
n_threshold |
the normalization threshold. |
Senin, P. and Malinchik, S., SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1175-1180, 7-10 Dec. 2013.
Salton, G., Wong, A., Yang, C., A vector space model for automatic indexing. Commun. ACM 18, 11, 613-620, 1975.
Computes the mindist value for two strings
min_dist(str1, str2, alphabet_size, compression_ratio = 1)
str1 |
the first string |
str2 |
the second string |
alphabet_size |
the used alphabet size |
compression_ratio |
the distance compression ratio |
Returns the distance between strings
Lonardi, S., Lin, J., Keogh, E., Patel, P., Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (pp. 53-68). (2002)
str1 <- c('a', 'b', 'c')
str2 <- c('c', 'b', 'a')
min_dist(str1, str2, 3)
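A base-R sketch of the MINDIST computation for the example above is given below. It follows the published formula, in which letters at most one position apart contribute zero and other pairs contribute the gap between their breakpoints. The helper name `min_dist_sketch` is hypothetical, and treating `compression_ratio` as the n/w factor under the square root is an assumption about how the package uses that parameter.

```r
# Hypothetical helper: MINDIST between two equal-length character arrays.
min_dist_sketch <- function(str1, str2, alphabet_size, compression_ratio = 1) {
  stopifnot(length(str1) == length(str2))
  cuts <- qnorm(seq_len(alphabet_size - 1) / alphabet_size)
  i1 <- match(str1, letters)
  i2 <- match(str2, letters)
  lo <- pmin(i1, i2)
  hi <- pmax(i1, i2)
  # identical or adjacent letters cost 0; otherwise the breakpoint gap.
  # The pmax/pmin clamps only keep the unused ifelse branch in bounds.
  cell <- ifelse(hi - lo <= 1, 0,
                 cuts[pmax(hi - 1, 1)] - cuts[pmin(lo, alphabet_size - 1)])
  sqrt(compression_ratio) * sqrt(sum(cell^2))
}

min_dist_sketch(c("a", "b", "c"), c("c", "b", "a"), 3)
```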
Computes a Piecewise Aggregate Approximation (PAA) for a time series.
paa(ts, paa_num)
ts |
a timeseries to compute the PAA for. |
paa_num |
the desired PAA size. |
Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S., Dimensionality reduction for fast similarity search in large time series databases. Knowledge and information Systems, 3(3), 263-286. (2001)
x = c(-1, -2, -1, 0, 2, 1, 1, 0)
x_paa3 = paa(x, 3)
#
plot(x, type = "l", main = c("8-points time series and its PAA transform into three points",
  "PAA shown schematically in blue"))
points(x, pch = 16, lwd = 5)
#
paa_bounds = c(1, 1+7/3, 1+7/3*2, 8)
abline(v = paa_bounds, lty = 3, lwd = 2, col = "cornflowerblue")
segments(paa_bounds[1:3], x_paa3, paa_bounds[2:4], x_paa3, col = "cornflowerblue", lwd = 2)
points(x = c(1, 1+7/3, 1+7/3*2) + (7/3)/2, y = x_paa3, pch = 15, lwd = 5, col = "cornflowerblue")
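When the series length is not a multiple of the PAA size, as with 8 points and 3 segments above, points that straddle a segment boundary have to contribute fractionally to both neighbours. One well-known way to achieve this, sketched below in base R, is to repeat every point `paa_num` times and average equal-sized chunks of the stretched series. The helper name `paa_sketch` is hypothetical, and this is an illustration of the technique rather than the package's implementation.

```r
# Hypothetical helper: PAA that also handles non-divisible lengths.
paa_sketch <- function(ts, paa_num) {
  n <- length(ts)
  if (n == paa_num) return(ts)
  # stretch to n * paa_num points so it splits into paa_num chunks of n
  stretched <- rep(ts, each = paa_num)
  colMeans(matrix(stretched, nrow = n, ncol = paa_num))
}

x <- c(-1, -2, -1, 0, 2, 1, 1, 0)
paa_sketch(x, 3)  # -1.375  0.750  0.625
```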
Discretize a time series with SAX using chunking (no sliding window).
sax_by_chunking(ts, paa_size, a_size, n_threshold)
ts |
the input time series. |
paa_size |
the PAA size. |
a_size |
the alphabet size. |
n_threshold |
the normalization threshold. |
Lonardi, S., Lin, J., Keogh, E., Patel, P., Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (pp. 53-68). (2002)
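Discretization by chunking chains three steps: z-normalize the whole series, reduce it to `paa_size` segment means, and map each mean to a letter using the Normal-distribution breakpoints. The base-R sketch below strings these steps together under stated assumptions: the helper name `sax_chunking_sketch` is hypothetical, a series whose standard deviation is below the threshold is left unnormalized, and a value exactly on a breakpoint goes to the upper letter.

```r
# Hypothetical helper: SAX word for a whole series, no sliding window.
sax_chunking_sketch <- function(ts, paa_size, a_size, n_threshold = 0.01) {
  # 1. z-normalize, unless the series is nearly flat
  s <- sd(ts)
  z <- if (s < n_threshold) ts else (ts - mean(ts)) / s
  # 2. PAA via the repeat-and-average trick (works for any length)
  means <- colMeans(matrix(rep(z, each = paa_size), nrow = length(z)))
  # 3. letters from equiprobable N(0, 1) breakpoints
  cuts <- qnorm(seq_len(a_size - 1) / a_size)
  paste(letters[findInterval(means, cuts) + 1], collapse = "")
}

sax_chunking_sketch(1:8, paa_size = 4, a_size = 2)  # "aabb"
```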
Generates a SAX MinDist distance matrix (i.e. the "lookup table") for a given alphabet size.
sax_distance_matrix(a_size)
a_size |
the desired alphabet size (a value between 2 and 20, inclusive) |
Returns a distance matrix (for SAX minDist) for a specified alphabet size
Lonardi, S., Lin, J., Keogh, E., Patel, P., Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (pp. 53-68). (2002)
sax_distance_matrix(5)
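The structure of such a lookup table can be sketched in base R as follows: cells for identical or adjacent letters are zero, and every other cell holds the gap between the breakpoints that separate the two letters. The helper name `sax_dist_matrix_sketch` is hypothetical, and whether the package stores these gaps or their squares is not stated in this section.

```r
# Hypothetical helper: MINDIST lookup table for a given alphabet size.
sax_dist_matrix_sketch <- function(a_size) {
  cuts <- qnorm(seq_len(a_size - 1) / a_size)
  m <- matrix(0, nrow = a_size, ncol = a_size)
  for (row_i in seq_len(a_size)) {
    for (col_i in seq_len(a_size)) {
      if (abs(row_i - col_i) > 1) {
        # gap between the lower edge of the higher letter
        # and the upper edge of the lower letter
        m[row_i, col_i] <- cuts[max(row_i, col_i) - 1] - cuts[min(row_i, col_i)]
      }
    }
  }
  m
}

round(sax_dist_matrix_sketch(4), 3)
```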
Discretizes a time series with SAX via sliding window.
sax_via_window(ts, w_size, paa_size, a_size, nr_strategy, n_threshold)
ts |
the input timeseries. |
w_size |
the sliding window size. |
paa_size |
the PAA size. |
a_size |
the alphabet size. |
nr_strategy |
the Numerosity Reduction strategy, acceptable values are "exact" and "mindist" – any other value triggers no numerosity reduction. |
n_threshold |
the normalization threshold. |
Lonardi, S., Lin, J., Keogh, E., Patel, P., Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (pp. 53-68). (2002)
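Numerosity reduction thins out the runs of near-identical words that a sliding window produces over smooth regions. The base-R sketch below applies the three behaviours described for `nr_strategy` to a vector of already-computed words. The helper name `numerosity_reduce_sketch` is hypothetical, and comparing each word against the last kept word (rather than the immediately preceding one) is an assumption.

```r
# Hypothetical helper: keep a word only if it differs from the last kept one.
numerosity_reduce_sketch <- function(words, nr_strategy = "exact") {
  same <- switch(nr_strategy,
    exact   = function(a, b) a == b,
    mindist = function(a, b) all(abs(utf8ToInt(a) - utf8ToInt(b)) <= 1),
    function(a, b) FALSE            # any other value: no reduction
  )
  kept <- integer(0)
  for (i in seq_along(words)) {
    if (length(kept) == 0 || !same(words[kept[length(kept)]], words[i])) {
      kept <- c(kept, i)
    }
  }
  setNames(words[kept], kept)       # names record the window positions
}

words <- c("abc", "abc", "abd", "bbd", "ccc", "ccc")
numerosity_reduce_sketch(words, "exact")
numerosity_reduce_sketch(words, "mindist")
```

Keeping the original window positions as names preserves the link from each surviving word back to its place in the series.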
Transforms a time series into a character array using SAX and the normal alphabet.
series_to_chars(ts, a_size)
ts |
the timeseries. |
a_size |
the alphabet size. |
Lonardi, S., Lin, J., Keogh, E., Patel, P., Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (pp. 53-68). (2002)
y = c(-1, -2, -1, 0, 2, 1, 1, 0)
y_paa3 = paa(y, 3)
series_to_chars(y_paa3, 3)
Transforms a time series into a string.
series_to_string(ts, a_size)
ts |
the timeseries. |
a_size |
the alphabet size. |
Lonardi, S., Lin, J., Keogh, E., Patel, P., Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (pp. 53-68). (2002)
y = c(-1, -2, -1, 0, 2, 1, 1, 0)
y_paa3 = paa(y, 3)
series_to_string(y_paa3, 3)
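The mapping from values to letters amounts to locating each value among the breakpoints. A base-R sketch with `findInterval()` is shown below for the three PAA values of the series used in the example. The helper name `series_to_string_sketch` is hypothetical, and sending a value that lies exactly on a breakpoint to the upper letter is an assumption.

```r
# Hypothetical helper: values to a SAX string.
series_to_string_sketch <- function(ts, a_size) {
  cuts <- qnorm(seq_len(a_size - 1) / a_size)
  chars <- letters[findInterval(ts, cuts) + 1]  # the character-array form
  paste(chars, collapse = "")                   # the string form
}

y_paa3 <- c(-1.375, 0.75, 0.625)  # PAA of c(-1, -2, -1, 0, 2, 1, 1, 0) into 3 points
series_to_string_sketch(y_paa3, 3)  # "acc"
```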
Converts a single time series into a bag of words.
series_to_wordbag(ts, w_size, paa_size, a_size, nr_strategy, n_threshold)
ts |
the timeseries. |
w_size |
the sliding window size. |
paa_size |
the PAA size. |
a_size |
the alphabet size. |
nr_strategy |
the NR strategy. |
n_threshold |
the normalization threshold. |
Senin, P. and Malinchik, S., SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1175-1180, 7-10 Dec. 2013.
Salton, G., Wong, A., Yang, C., A vector space model for automatic indexing. Commun. ACM 18, 11, 613-620, 1975.
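Once a series has been turned into SAX words, building the bag is a matter of counting occurrences. The base-R sketch below does this for a hand-written word vector; the `words` and `counts` column names are an assumption that mirrors the bag layout used in the `bags_to_tfidf` example.

```r
# SAX words as they might come out of a sliding-window discretization
sax_words <- c("abc", "abd", "abc", "ccc", "abc")

# count occurrences and arrange them as a two-column bag
tab <- table(sax_words)
bag <- data.frame(
  words = names(tab),
  counts = as.integer(tab),
  stringsAsFactors = FALSE
)
bag
```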
Runs the RePair algorithm on a string.
str_to_repair_grammar(str)
str |
the input string. |
N.J. Larsson and A. Moffat. Offline dictionary-based compression. In Data Compression Conference, 1999.
str_to_repair_grammar("abc abc cba cba bac xxx abc abc cba cba bac")
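The core of RePair is a loop that replaces the most frequent pair of adjacent tokens with a new rule until no pair occurs twice. The base-R sketch below applies that loop to space-separated tokens. The helper name `repair_sketch`, the `R1`, `R2`, ... rule labels, and the returned list layout are all hypothetical, and pair counting is naive (overlapping occurrences are counted), so this illustrates the idea rather than the package's implementation.

```r
# Hypothetical helper: greedy pair replacement over space-separated tokens.
repair_sketch <- function(str) {
  tokens <- strsplit(str, " ", fixed = TRUE)[[1]]
  rules <- list()
  repeat {
    if (length(tokens) < 2) break
    pairs <- paste(head(tokens, -1), tail(tokens, -1))
    counts <- table(pairs)
    if (max(counts) < 2) break                 # no repeated pair is left
    best <- names(counts)[which.max(counts)]
    parts <- strsplit(best, " ", fixed = TRUE)[[1]]
    rule <- paste0("R", length(rules) + 1)
    rules[[rule]] <- parts
    # left-to-right scan replacing each occurrence of the pair
    out <- character(0)
    i <- 1
    while (i <= length(tokens)) {
      if (i < length(tokens) && tokens[i] == parts[1] && tokens[i + 1] == parts[2]) {
        out <- c(out, rule)
        i <- i + 2
      } else {
        out <- c(out, tokens[i])
        i <- i + 1
      }
    }
    tokens <- out
  }
  list(R0 = tokens, rules = rules)
}

grammar <- repair_sketch("abc abc cba cba bac xxx abc abc cba cba bac")
grammar$R0  # the repeated five-word phrase collapses to one rule on each side of "xxx"
```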
Extracts a subseries.
subseries(ts, start, end)
ts |
the input timeseries (0-based, left inclusive). |
start |
the interval start. |
end |
the interval end. |
y = c(-1, -2, -1, 0, 2, 1, 1, 0)
subseries(y, 0, 3)
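Because R vectors are 1-based, the 0-based, left-inclusive convention needs a small index shift. The base-R sketch below shows one reading of it, assuming the interval is exclusive on the right; the helper name `subseries_sketch` is hypothetical.

```r
# Hypothetical helper: elements at 0-based positions start .. end - 1.
subseries_sketch <- function(ts, start, end) {
  ts[seq_len(end - start) + start]
}

y <- c(-1, -2, -1, 0, 2, 1, 1, 0)
subseries_sketch(y, 0, 3)  # the first three points
```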
Z-normalizes a time series by subtracting its mean and dividing by the standard deviation.
znorm(ts, threshold = 0.01)
ts |
the input time series. |
threshold |
the z-normalization threshold; if the standard deviation of the input time series is below this value, normalization is not applied, so that under-threshold noise is not amplified. |
Dina Goldin and Paris Kanellakis, On similarity queries for time-series data: Constraint specification and implementation. In Principles and Practice of Constraint Programming (CP 1995), pages 137-153. (1995)
x = seq(0, pi*4, 0.02)
y = sin(x) * 5 + rnorm(length(x))
plot(x, y, type="l", col="blue")
lines(x, znorm(y, 0.01), type="l", col="red")
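The threshold guard can be sketched in base R as follows. The helper name `znorm_sketch` is hypothetical; returning the input unchanged below the threshold and using the sample standard deviation from `sd()` are assumptions, and the package may differ on both points.

```r
# Hypothetical helper: z-normalization with a flat-series guard.
znorm_sketch <- function(ts, threshold = 0.01) {
  s <- sd(ts)
  # a nearly flat series is returned as-is so that noise is not blown up
  if (is.na(s) || s < threshold) return(ts)
  (ts - mean(ts)) / s
}

set.seed(1)
noisy <- sin(seq(0, pi * 4, 0.02)) * 5 + rnorm(629)
z <- znorm_sketch(noisy)
c(mean = mean(z), sd = sd(z))       # approximately 0 and 1
znorm_sketch(c(1, 1.001, 1))        # below the threshold, left untouched
```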