test_flexclust.Rmd

---
title: "Flexclust functionality"
author: "Hans Van Calster"
date: "5 maart 2018"
output: html_document
---

```{r setup, include=FALSE}
library(knitr)

opts_chunk$set(echo = TRUE)

library(tidyverse)
library(flexclust)
library(vegan)
options(stringsAsFactors = FALSE)
```

```{r fc-reorder}
# function to reorder clusters in order of descending size
fc_reorder <- function(x, orderby = "decending size") {
  ko <- x
  cl_map <- order(ko@clusinfo$size, decreasing = TRUE)
  ko@second <- cl_map[ko@second]
  ko@clsim <- ko@clsim[cl_map, cl_map]
  ko@centers <- ko@centers[cl_map, ]
  ko@cluster <- cl_map[ko@cluster]
  ko@clusinfo <- ko@clusinfo[cl_map, ]
  # ko@reorder <- cl_map                   add slot with reorder mapping
  return(ko)
}

```


```{r fc-reclist}
fc_reclist <- function(len){
    if (length(len) == 1) {
        vector("list", len)
    } else {
        lapply(1:len[1], function(...) rec.list(len[-1]))
    }
}
```


# The problem

Find a clustering method that is able to:

- cluster items where some items are forced to end up together in one cluster (group constraints)
- a method that is able to predict in which cluster a new item would end up
- use any distance measure


# The flexclust::kcca() function

## k-means with group constraints 

```{r}
set.seed(123456)
data(iris)
iris <- iris %>% 
  tbl_df() %>%
  mutate(Species = as.character(Species), 
         species_subset = ifelse(sample(c(TRUE, FALSE), 
                                        size = n(), 
                                        prob = c(0.5, 0.5), 
                                        replace = TRUE) == TRUE, 
                                 Species, 
                                 1:n()))
```


```{r}
iris %>%
  mutate(in_subset = ifelse(species_subset %in% c("setosa", "virginica", "versicolor"),
                            species_subset, NA)) %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point(aes(colour = Species, fill = in_subset), 
             shape = 21, size = 4)
```


```{r}
means_5 <- kcca(x = iris[,1:4], 
              k = 5, family = kccaFamily("kmeans"),
              group = iris$species_subset)

```

```{r}
table(clusters(means_5), iris$species_subset)
```


```{r}
table(clusters(means_5), iris$Species)
```


## Visualisation

The plot-method for the S4 class kcca does not work properly when group constraints present. Here is a solution using ggplot.

Solid line encloses 50% of relevés in cluster; dotted 95%. Encircled number gives centroid of each cluster. A thin line to other centroid indicates better separation (in real problem space). Each relevé is plotted against the first two principal components of data. Color is cluster assignment.

```{r}
iris_pca <- prcomp(iris[,1:4])
```


```{r}
iris %>% 
  mutate(cluster = as.factor(clusters(means_5)),
         pc1 = scores(iris_pca, display = "sites")[,1],
         pc2 = scores(iris_pca, display = "sites")[,2]) %>%
  ggplot(aes(x = pc1, y = pc2)) +
  geom_point(aes(colour = cluster, shape = Species)) + 
  stat_ellipse(aes(colour = cluster)) + 
  coord_equal()

```

Header: segment #,Count, & % total
Bar: proportion of response in cluster. Red line/dot: overall proportion
Greyed out when response not important to differentiate from other clusters. BUT, can still be an important characteristic of cluster

```{r}
barchart(means_5, strip.prefix = "#", shade = TRUE, layout = c(means_5@k, 1))
```


## Find best k and stable solution

```{r}
result <- stepFlexclust(x = iris[,1:4], 
              k = 2:5, 
              nrep = 20, 
              FUN = kcca,
              group = iris$species_subset)

```


```{r}


ks <- 2:5
nreps <- 1:20
models <- fc_reclist(c(length(ks), length(nreps)))

for (k in ks) {
  for (nrep in nreps) {
    models[[k]][[nrep]] <- kcca(x = iris[,1:4], 
              k = k, family = kccaFamily("kmeans"),
              group = iris$species_subset)
    
  }
}
```


```{r}
#If a cluster gets empty during the iterations it is removed, so you
#can end up with less clusters than you asked for. For grouped
#clustering this happens more often than for regular kmeans because of
#the re-assignement of group members.

#A working example:

set.seed(12)
## same as above
nums <- sample(1:300,70)
x <- matrix(nums,10,7)

## Rows 1, 3 and 4 are in group 1, all other groups contain
## only one observation
mygroups <- c(1,2,1,1,3,4,5,6,7,8)

myfam <- kccaFamily("kmeans", groupFun = "minSumClusters")
clres <- kcca(x, k = 3, myfam, group = mygroups)

clres

table(clusters(clres),mygroups)

```

```{r}
result <- stepFlexclust(x = x, 
              k = 2:5, 
              nrep = 2, 
              FUN = kcca
              #,              group = mygroups # adding group constraints results in error subscript out of bounds
              )

```