-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtest_flexclust.Rmd
204 lines (144 loc) · 4.68 KB
/
test_flexclust.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
---
title: "Flexclust functionality"
author: "Hans Van Calster"
date: "5 maart 2018"
output: html_document
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE)
library(tidyverse)
library(flexclust)
library(vegan)
options(stringsAsFactors = FALSE)
```
```{r fc-reorder}
# function to reorder clusters in order of descending size
fc_reorder <- function(x, orderby = "decending size") {
ko <- x
cl_map <- order(ko@clusinfo$size, decreasing = TRUE)
ko@second <- cl_map[ko@second]
ko@clsim <- ko@clsim[cl_map, cl_map]
ko@centers <- ko@centers[cl_map, ]
ko@cluster <- cl_map[ko@cluster]
ko@clusinfo <- ko@clusinfo[cl_map, ]
# ko@reorder <- cl_map add slot with reorder mapping
return(ko)
}
```
```{r fc-reclist}
fc_reclist <- function(len){
if (length(len) == 1) {
vector("list", len)
} else {
lapply(1:len[1], function(...) rec.list(len[-1]))
}
}
```
# The problem
Find a clustering method that is able to:
- cluster items where some items are forced to end up together in one cluster (group constraints)
- a method that is able to predict in which cluster a new item would end up
- use any distance measure
# The flexclust::kcca() function
## k-means with group constraints
```{r}
set.seed(123456)
data(iris)
iris <- iris %>%
tbl_df() %>%
mutate(Species = as.character(Species),
species_subset = ifelse(sample(c(TRUE, FALSE),
size = n(),
prob = c(0.5, 0.5),
replace = TRUE) == TRUE,
Species,
1:n()))
```
```{r}
iris %>%
mutate(in_subset = ifelse(species_subset %in% c("setosa", "virginica", "versicolor"),
species_subset, NA)) %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(colour = Species, fill = in_subset),
shape = 21, size = 4)
```
```{r}
means_5 <- kcca(x = iris[,1:4],
k = 5, family = kccaFamily("kmeans"),
group = iris$species_subset)
```
```{r}
table(clusters(means_5), iris$species_subset)
```
```{r}
table(clusters(means_5), iris$Species)
```
## Visualisation
The plot-method for the S4 class kcca does not work properly when group constraints present. Here is a solution using ggplot.
Solid line encloses 50% of relevés in cluster; dotted 95%. Encircled number gives centroid of each cluster. A thin line to other centroid indicates better separation (in real problem space). Each relevé is plotted against the first two principal components of data. Color is cluster assignment.
```{r}
iris_pca <- prcomp(iris[,1:4])
```
```{r}
iris %>%
mutate(cluster = as.factor(clusters(means_5)),
pc1 = scores(iris_pca, display = "sites")[,1],
pc2 = scores(iris_pca, display = "sites")[,2]) %>%
ggplot(aes(x = pc1, y = pc2)) +
geom_point(aes(colour = cluster, shape = Species)) +
stat_ellipse(aes(colour = cluster)) +
coord_equal()
```
Header: segment #,Count, & % total
Bar: proportion of response in cluster. Red line/dot: overall proportion
Greyed out when response not important to differentiate from other clusters. BUT, can still be an important characteristic of cluster
```{r}
barchart(means_5, strip.prefix = "#", shade = TRUE, layout = c(means_5@k, 1))
```
## Find best k and stable solution
```{r}
result <- stepFlexclust(x = iris[,1:4],
k = 2:5,
nrep = 20,
FUN = kcca,
group = iris$species_subset)
```
```{r}
ks <- 2:5
nreps <- 1:20
models <- fc_reclist(c(length(ks), length(nreps)))
for (k in ks) {
for (nrep in nreps) {
models[[k]][[nrep]] <- kcca(x = iris[,1:4],
k = k, family = kccaFamily("kmeans"),
group = iris$species_subset)
}
}
```
```{r}
#If a cluster gets empty during the iterations it is removed, so you
#can end up with less clusters than you asked for. For grouped
#clustering this happens more often than for regular kmeans because of
#the re-assignement of group members.
#A working example:
set.seed(12)
## same as above
nums <- sample(1:300,70)
x <- matrix(nums,10,7)
## Rows 1, 3 and 4 are in group 1, all other groups contain
## only one observation
mygroups <- c(1,2,1,1,3,4,5,6,7,8)
myfam <- kccaFamily("kmeans", groupFun = "minSumClusters")
clres <- kcca(x, k = 3, myfam, group = mygroups)
clres
table(clusters(clres),mygroups)
```
```{r}
result <- stepFlexclust(x = x,
k = 2:5,
nrep = 2,
FUN = kcca
#, group = mygroups # adding group constraints results in error subscript out of bounds
)
```