Graph-Constrained Functional Pruning Optimal Partitioning.
gfpop Vignette
Vincent Runge
LaMME, Evry University
March 2, 2022
Quick Start
we present a basic use of the main functions of the gfpop
package. More details about the theory of graph-cosntrained multiple change-point detection can be found here
We install the package from Github:
#devtools::install_github("vrunge/gfpop")
library(gfpop)
We simulate some univariate gaussian data (n = 1000
points) with relative change-point positions 0.1, 0.3, 0.5, 0.8, 1
and means 1, 2, 1, 3, 1
with a variance equal to 1
.
n <- 1000
myData <- dataGenerator(n, c(0.1,0.3,0.5,0.8,1), c(1,2,1,3,1), sigma = 1)
We define the graph of constraints to use for the dynamic programming algorithm. A simple case is the up-down constraint with a penalty here equal to a classic 2 log(n)
.
myGraph <- graph(penalty = 2*log(n), type = "updown")
The gfpop function gives the result of the segmentation using myData
and myGraph
as parameters. We choose a gaussian cost.
gfpop(data = myData, mygraph = myGraph, type = "mean")
## $changepoints
## [1] 100 298 500 800 1000
##
## $states
## [1] "Dw" "Up" "Dw" "Up" "Dw"
##
## $forced
## [1] FALSE FALSE FALSE FALSE
##
## $parameters
## [1] 1.0317856 1.9865414 0.9697470 3.0359938 0.8527903
##
## $globalCost
## [1] 1010.133
##
## attr(,"class")
## [1] "gfpop" "mean"
The vector changepoints
gives the last index of each segment. It always ends with the length of the vector vectData
.
The vector states
contains the states in which lies each mean. The length of this vector is the same as the length of changepoint
.
The vector forced
is a boolean vector. A forced element means that two consecutive means have been forced to satisfy the constraint. For example, the "up" edge with parameter c is forced if m(i+1) - m(i) = c.
The vector parameters
contains the inferred means/parameters of the successive segments.
The number globalCost
is equal to the non-penalized cost, that is the value of the fit to the data ignoring the penalties for adding changes.
Some examples
Isotonic regression
The isotonic regression infers a sequence of nondecreasing means.
n <- 1000
mydata <- dataGenerator(n, c(0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1), c(0, 0.5, 1, 1.5, 2, 2.5, 3), sigma = 1)
myGraphIso <- graph(penalty = 2*log(n), type = "isotonic")
gfpop(data = mydata, mygraph = myGraphIso, type = "mean")
## $changepoints
## [1] 186 361 713 1000
##
## $states
## [1] "Iso" "Iso" "Iso" "Iso"
##
## $forced
## [1] FALSE FALSE FALSE
##
## $parameters
## [1] 0.1250638 1.2483475 2.1161519 2.8636266
##
## $globalCost
## [1] 992.7441
##
## attr(,"class")
## [1] "gfpop" "mean"
In this example, we use in gfpop
function a robust biweight gaussian cost with K = 1
and the min
parameter in order to infer means greater than 0.5
.
Fixed number of change-points
This algorithm is called segment neighborhood in the change-point litterature. In this example, we fixed the number of segments at 3 with an isotonic constraint. The graph contains two "up" edges with no cycling.
n <- 1000
mydata <- dataGenerator(n, c(0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 1), c(0, 0.5, 1, 1.5, 2, 2.5, 3), sigma = 1)
beta <- 0
myGraph <- graph(
Edge(0, 1,"up", beta),
Edge(1, 2, "up", beta),
Edge(0, 0, "null"),
Edge(1, 1, "null"),
Edge(2, 2, "null"),
StartEnd(start = 0, end = 2))
gfpop(data = mydata, mygraph = myGraph, type = "mean")
## $changepoints
## [1] 323 705 1000
##
## $states
## [1] "0" "1" "2"
##
## $forced
## [1] FALSE FALSE
##
## $parameters
## [1] 0.5031993 2.1549691 2.9441510
##
## $globalCost
## [1] 1102.695
##
## attr(,"class")
## [1] "gfpop" "mean"
Robust up-down with constrained starting and ending states
In presence of outliers we need a robust loss (biweight). We can also force the starting and ending state and a minimal gap between the means (here equal to 1
)
n <- 1000
chgtpt <- c(0.1, 0.3, 0.5, 0.8, 1)
myData <- dataGenerator(n, chgtpt, c(0, 1, 0, 1, 0), sigma = 1)
myData <- myData + 5 * rbinom(n, 1, 0.05) - 5 * rbinom(n, 1, 0.05)
beta <- 2 * log(n)
myGraph <- graph(
Edge("Dw", "Up", type = "up", penalty = beta, gap = 1, K = 3),
Edge("Up", "Dw", type = "down", penalty = beta, gap = 1, K = 3),
Edge("Dw", "Dw", type = "null", K = 3),
Edge("Up", "Up", type = "null", K = 3),
StartEnd(start = "Dw", end = "Dw"))
gfpop(data = myData, mygraph = myGraph, type = "mean")
## $changepoints
## [1] 100 312 500 800 1000
##
## $states
## [1] "Dw" "Up" "Dw" "Up" "Dw"
##
## $forced
## [1] TRUE FALSE FALSE FALSE
##
## $parameters
## [1] 0.0456763883 1.0456763883 -0.0603308658 1.0383495117 -0.0003792976
##
## $globalCost
## [1] 1110.176
##
## attr(,"class")
## [1] "gfpop" "mean"
Robust up-down with constrained starting and ending states
In presence of outliers we need a robust loss (biweight). We can also force the starting and ending state and a minimal gap between the means (here equal to 1
)
n <- 1000
chgtpt <- c(0.1, 0.3, 0.5, 0.8, 1)
myData <- dataGenerator(n, chgtpt, c(0, 1, 0, 1, 0), sigma = 1)
myData <- myData + 5 * rbinom(n, 1, 0.05) - 5 * rbinom(n, 1, 0.05)
beta <- 2 * log(n)
myGraph <- graph(
Edge("Dw", "Up", type = "up", penalty = beta, gap = 1, K = 3),
Edge("Up", "Dw", type = "down", penalty = beta, gap = 1, K = 3),
Edge("Dw", "Dw", type = "null", K = 3),
Edge("Up", "Up", type = "null", K = 3),
StartEnd(start = "Dw", end = "Dw"))
gfpop(data = myData, mygraph = myGraph, type = "mean")
## $changepoints
## [1] 113 300 500 796 1000
##
## $states
## [1] "Dw" "Up" "Dw" "Up" "Dw"
##
## $forced
## [1] FALSE FALSE TRUE FALSE
##
## $parameters
## [1] -0.17825898 1.09464865 0.00177342 1.00177342 -0.20725183
##
## $globalCost
## [1] 1065.075
##
## attr(,"class")
## [1] "gfpop" "mean"
If we skip all these constraints and use a standard fpop algorithm, the result is the following
myGraphStd <- graph(penalty = 2*log(n), type = "std")
gfpop(data = myData, mygraph = myGraphStd, type = "mean")
## $changepoints
## [1] 29 30 51 56 58 65 66 99 101 143 144 145 197 199 207
## [16] 208 242 245 246 282 283 306 307 350 351 356 357 378 379 392
## [31] 393 426 427 429 430 446 447 494 496 509 510 529 530 556 557
## [46] 570 571 575 577 605 606 616 617 620 621 676 677 718 722 723
## [61] 746 747 769 770 780 781 821 822 829 830 892 893 898 899 908
## [76] 909 912 913 914 921 926 930 931 975 977 998 1000
##
## $states
## [1] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [14] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [27] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [40] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [53] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [66] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
## [79] "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std" "Std"
##
## $forced
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [40] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [53] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [66] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [79] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##
## $parameters
## [1] 0.013157272 5.852235983 0.397219746 -2.487036978 4.306750060 -0.459086261
## [7] -6.020547264 -0.580351710 5.041691711 0.767738682 5.961869168 -4.888515188
## [13] 1.104751718 6.573977997 0.319794566 -4.738167431 0.833793130 -2.899282191
## [19] 6.688410838 1.228639284 -4.269924074 0.404707460 -6.793861272 0.003291056
## [25] 6.506591287 0.629007228 -5.756058094 -0.259927168 -5.736196028 0.362430172
## [31] 5.983851635 -0.083938587 -6.728110389 -0.201933380 -5.596212715 0.515240779
## [37] -5.724921497 -0.233094949 -3.954707046 1.376006649 6.850232664 0.807237359
## [43] -4.772661389 1.572871812 7.044477776 0.419697210 6.576539303 0.668004319
## [49] -4.089779401 1.165036227 6.569345168 0.950959210 6.681891964 1.106775879
## [55] 7.174884098 0.932498620 7.593847560 0.593401906 3.024204730 -4.575301756
## [61] 1.347643390 -4.167946713 0.779419604 6.188344490 1.295509948 6.468646144
## [67] 0.008155765 5.288100272 -0.317542994 5.547977558 0.073240269 5.784884068
## [73] 0.018864107 -5.515696794 0.218333250 -6.987266782 0.513760015 -5.093780701
## [79] 5.643352084 -0.154778961 -4.128204915 -0.544777860 5.729519458 -0.404145572
## [85] 3.440665204 0.088765491 2.860110512
##
## $globalCost
## [1] 1467.21
##
## attr(,"class")
## [1] "gfpop" "mean"
abs edge
With a unique "abs"
edge, we impose a difference between the means of size at least 1.
n <- 10000
myData <- dataGenerator(n, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1), c(0, 1, 0, 2, 1, 2, 0, 1, 0, 1), sigma = 0.5)
beta <- 2*log(n)
myGraph <- graph(
Edge(0, 0,"abs", penalty = beta, gap = 1),
Edge(0, 0,"null"))
gfpop(data = myData, mygraph = myGraph, type = "mean")
## $changepoints
## [1] 999 2000 3000 4000 4999 6000 7000 8000 8999 10000
##
## $states
## [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
##
## $forced
## [1] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
##
## $parameters
## [1] -0.001621797 0.998378203 -0.001621797 2.018687659 1.000371240 2.000371240
## [7] 0.007557667 1.039364965 -0.007114006 0.992885994
##
## $globalCost
## [1] 2417.046
##
## attr(,"class")
## [1] "gfpop" "mean"
Notice that some of the edges are forced, the vector forced
contains non-zero values.
Exponential decay
The null edge corresponds to an exponential decay state if its parameter is not equal to 1.
n <- 1000
mydata <- dataGenerator(n, c(0.2, 0.5, 0.8, 1), c(5, 10, 15, 20), sigma = 1, gamma = 0.966)
beta <- 2*log(n)
myGraphDecay <- graph(
Edge(0, 0, "up", penalty = beta),
Edge(0, 0, "null", 0, decay = 0.966)
)
g <- gfpop(data = mydata, mygraph = myGraphDecay, type = "mean")
g
## $changepoints
## [1] 200 322 500 800 1000
##
## $states
## [1] "0" "0" "0" "0" "0"
##
## $forced
## [1] FALSE FALSE FALSE FALSE
##
## $parameters
## [1] 0.0049454559 0.1510457319 0.0028802127 0.0004721947 0.0194129929
##
## $globalCost
## [1] 976.4308
##
## attr(,"class")
## [1] "gfpop" "mean"
and we plot the result
gamma <- 0.966
len <- diff(c(0, g$changepoints))
signal <- NULL
for(i in length(len):1)
{signal <- c(signal, g$parameters[i]*c(1, cumprod(rep(1/gamma,len[i]-1))))}
signal <- rev(signal)
ylimits <- c(min(mydata), max(mydata))
plot(mydata, type ='p', pch ='+', ylim = ylimits)
par(new = TRUE)
plot(signal, type ='l', col = 4, ylim = ylimits, lwd = 3)
Graph construction
In the gfpop
package, graphs are represented by a dataframe with 9 features and build with the R functions Edge
, Node
, StartEnd
and graph
.
emptyGraph <- graph()
emptyGraph
## [1] state1 state2 type parameter penalty K a
## [8] min max
## <0 lignes> (ou 'row.names' de longueur nulle)
state1
is the starting node of an edge, state2
its ending node. type
is one of the available edge type ("null"
, "std"
, "up"
, "down"
, "abs"
). penalty
is a nonnegative parameter: the additional cost $\beta_i$ to consider when we move within the graph using a edge (or stay on the same node). parameter
is annother nonnegative parameter, a characteristics of the edge, depending of its type (it is a decay if type is "null" and a gap otherwise). K
and a
are robust parameters. min
and max
are used to constrain the rang of value for the node parameter.
We add edges into a graph as follows
myGraph <- graph(
Edge("E1", "E1", "null"),
Edge("E1", "E2", "down", 3.1415, gap = 1.5)
)
myGraph
## state1 state2 type parameter penalty K a min max
## 1 E1 E1 null 1.0 0 Inf 0 NA NA
## 2 E1 E2 down 1.5 0 Inf 0 NA NA
we can only add edges to this dataframe using the object Edge
.
The graph can contain information on the starting and/or ending edge to use with the StartEnd
function.
beta <- 2 * log(1000)
myGraph <- graph(
Edge("Dw", "Dw", "null"),
Edge("Up", "Up", "null"),
Edge("Dw", "Up", "up", penalty = beta, gap = 1),
Edge("Dw", "Dw", "down", penalty = beta),
Edge("Up", "Dw", "down", penalty = beta),
StartEnd(start = "Dw", end = "Dw"))
myGraph
## state1 state2 type parameter penalty K a min max
## 1 Dw Dw null 1 0.00000 Inf 0 NA NA
## 2 Up Up null 1 0.00000 Inf 0 NA NA
## 3 Dw Up up 1 13.81551 Inf 0 NA NA
## 4 Dw Dw down 0 13.81551 Inf 0 NA NA
## 5 Up Dw down 0 13.81551 Inf 0 NA NA
## 6 Dw <NA> start NA NA NA NA NA NA
## 7 Dw <NA> end NA NA NA NA NA NA
Some graphs are often used: they are defined by default in the graph
function. To use these graphs, we specify a string type
equal to "std"
, "isotonic"
, "updown"
or "relevant"
. For example,
myGraphIso <- graph(penalty = 12, type = "isotonic")
myGraphIso
## state1 state2 type parameter penalty K a min max
## 1 Iso Iso null 1 0 Inf 0 NA NA
## 2 Iso Iso up 0 12 Inf 0 NA NA
The function Node
can be used to restrict the range of value for parameter associated to a node (called also a vertex). For example the following graph is an isotonic graph with inferred parameters between 0 et 1 only.
myGraph <- graph(
Edge("Up", "Up", "up", penalty = 3.1415),
Edge("Up", "Up"),
Node("Up", min = 0, max = 1)
)
myGraph
## state1 state2 type parameter penalty K a min max
## 1 Up Up up 0 3.1415 Inf 0 NA NA
## 2 Up Up null 1 0.0000 Inf 0 NA NA
## 3 Up Up node NA NA NA NA 0 1
Supplementary R functions
Data generator function
the dataGenerator
function is used to simulate n
data-points from a distribution of type
equal to "mean"
, "poisson"
, "exp"
, "variance"
or "negbin"
. Standard deviation parameter sigma
and decay gamma
are specific to the Gaussian mean model. size
is linked to the R rnbinom
function from R stats package.
Standard deviation estimation
We often need to estimate the standard deviation from the observed data to normalize the data or choose the edge penalties. The sdDiff
returns such an estimation with the default HALL method [Hall et al., 1990] well suited for time series with change-points.