As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant (https://www.contributor-covenant.org), version 1.0.0, available at https://contributor-covenant.org/version/1/0/0/.
Please DO NOT use any confidential data when submitting an issue. For Roche internal data, use code.roche.com.

Please briefly describe your problem and what output you expect. If you have a question, please don’t use this form. Instead, ask in the Discussions.

When possible, please include a minimal reproducible example (AKA a reprex). If you’ve never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.

Brief description of the problem

# insert reprex here
YEAR: 2022
COPYRIGHT HOLDER: Iakov I. Davydov, Juliane Siebourg-Polster, Guido Steiner, Balazs Banfai, F. Hoffman-La Roche
Copyright (c) 2022 Iakov I. Davydov, Juliane Siebourg-Polster, Guido Steiner, Balazs Banfai, F. Hoffman-La Roche
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
vignettes/NCS22_talk.Rmd

Examples in this vignette were used in our presentation. It uses a subset of the longitudinal_subject_samples dataset.
data("longitudinal_subject_samples")

dat <- longitudinal_subject_samples |>
  filter(Group %in% 1:5, Week %in% c(1, 4)) |>
  select(SampleID, SubjectID, Group, Sex, Week)

# for simplicity: remove two subjects that don't have both visits
dat <- dat |>
  filter(SubjectID %in%
    (dat |> count(SubjectID) |> filter(n == 2) |> pull(SubjectID)))

subject_data <- dat |>
  select(SubjectID, Group, Sex) |>
  unique()
Here’s an example of a plate effect, where both the top and bottom rows of the plate are used as controls.

This is the experiment design:

These are the readouts:

Due to the plate effect, the control rows are affected differently. It is virtually impossible to normalize the readouts in a meaningful way.

Gone wrong: a random distribution of 31 grouped subjects into 3 batches turns out unbalanced:

“Block what you can and randomize what you cannot.” (G. Box, 1978)
The BatchContainer class

optimize_design()
bc <- BatchContainer$new(
  dimensions = list("batch" = 3, "location" = 11)
) |>
  assign_random(subject_data)
Batch composition before optimization
bc$get_samples()
batch | location | SubjectID | Group | Sex
---|---|---|---|---
1 | 1 | NA | NA | NA
1 | 2 | P32 | 5 | M
1 | 3 | P10 | 3 | F
... | ... | ... | ... | ...
3 | 9 | P31 | 3 | F
3 | 10 | P33 | 5 | M
3 | 11 | P24 | 5 | F
bc <- optimize_design(
  bc,
  scoring = list(
    group = osat_score_generator(
      batch_vars = "batch",
      feature_vars = "Group"
    ),
    sex = osat_score_generator(
      batch_vars = "batch",
      feature_vars = "Sex"
    )
  ),
  n_shuffle = 1,
  acceptance_func =
    ~ accept_leftmost_improvement(..., tolerance = 0.01),
  max_iter = 150,
  quiet = TRUE
)
Batch composition after optimization
batch | location | SubjectID | Group | Sex
---|---|---|---|---
1 | 1 | NA | NA | NA
1 | 2 | P01 | 1 | F
1 | 3 | P10 | 3 | F
... | ... | ... | ... | ...
3 | 9 | P29 | 5 | F
3 | 10 | P33 | 5 | M
3 | 11 | P12 | 3 | F
Assays are often performed in well plates (24, 96, or 384 wells).

Observed effects

Since plate effects often cannot be avoided, we aim to distribute the sample groups of interest evenly across the plate and adjust for the effect computationally.
set.seed(4)

bc <- BatchContainer$new(
  dimensions = list("plate" = 3, "row" = 4, "col" = 6)
) |>
  assign_in_order(dat)

plot_plate(bc,
  plate = plate, row = row, column = col,
  .color = Group, title = "Initial layout by Group"
)

plot_plate(bc,
  plate = plate, row = row, column = col,
  .color = Sex, title = "Initial layout by Sex"
)
bc1 <- optimize_design(
  bc,
  scoring = list(
    group = osat_score_generator(
      batch_vars = "plate",
      feature_vars = "Group"
    ),
    sex = osat_score_generator(
      batch_vars = "plate",
      feature_vars = "Sex"
    )
  ),
  n_shuffle = 1,
  acceptance_func =
    ~ accept_leftmost_improvement(..., tolerance = 0.01),
  max_iter = 150,
  quiet = TRUE
)

bc2 <- optimize_design(
  bc1,
  scoring = mk_plate_scoring_functions(
    bc1,
    plate = "plate", row = "row", column = "col",
    group = "Group"
  ),
  shuffle_proposal_func = shuffle_with_constraints(dst = plate == .src$plate),
  max_iter = 150,
  quiet = TRUE
)
optimize_multi_plate_design()

We are performing the same optimization as before, but using the optimize_multi_plate_design() function to combine the two steps.
bc <- optimize_multi_plate_design(
  bc,
  across_plates_variables = c("Group", "Sex"),
  within_plate_variables = c("Group"),
  plate = "plate", row = "row", column = "col",
  n_shuffle = 2,
  max_iter = 500 # 2000
)
#> 1 ... 2 ... 3 ...
#> Warning: Removed 4509 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 4509 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Goal:

Constraints:

See the vignette invivo_study_design for the full story.
Acknowledgements

library(designit)
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(tidyr)
Samples of a 2-condition in-vivo experiment are to be placed on 48-well plates.

These are the conditions:
# conditions to use
conditions <- data.frame(
  group = c(1, 2, 3, 4, 5),
  treatment = c(
    "vehicle", "TRT1", "TRT2",
    "TRT1", "TRT2"
  ),
  dose = c(0, 25, 25, 50, 50)
)

gt::gt(conditions)
group | treatment | dose
---|---|---
1 | vehicle | 0
2 | TRT1 | 25
3 | TRT2 | 25
4 | TRT1 | 50
5 | TRT2 | 50
We will have 3 animals per group with 4 replicates each.

# sample table (3 animals per group with 4 replicates)
n_reps <- 4
n_animals <- 3
animals <- bind_rows(replicate(n_animals, conditions, simplify = FALSE),
  .id = "animal"
)
samples <- bind_rows(replicate(n_reps, animals, simplify = FALSE),
  .id = "replicate"
) %>%
  mutate(
    SampleID = paste0(treatment, "_", animal, "_", replicate),
    AnimalID = paste0(treatment, "_", animal)
  ) %>%
  mutate(dose = factor(dose))

samples %>%
  head(10) %>%
  gt::gt()
replicate | animal | group | treatment | dose | SampleID | AnimalID
---|---|---|---|---|---|---
1 | 1 | 1 | vehicle | 0 | vehicle_1_1 | vehicle_1
1 | 1 | 2 | TRT1 | 25 | TRT1_1_1 | TRT1_1
1 | 1 | 3 | TRT2 | 25 | TRT2_1_1 | TRT2_1
1 | 1 | 4 | TRT1 | 50 | TRT1_1_1 | TRT1_1
1 | 1 | 5 | TRT2 | 50 | TRT2_1_1 | TRT2_1
1 | 2 | 1 | vehicle | 0 | vehicle_2_1 | vehicle_2
1 | 2 | 2 | TRT1 | 25 | TRT1_2_1 | TRT1_2
1 | 2 | 3 | TRT2 | 25 | TRT2_2_1 | TRT2_2
1 | 2 | 4 | TRT1 | 50 | TRT1_2_1 | TRT1_2
1 | 2 | 5 | TRT2 | 50 | TRT2_2_1 | TRT2_2
Corner wells of the plates should be left empty. This means that on a 48-well plate we can place 44 samples. Since we have 60 samples, they will fit on 2 plates.
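The setup chunk defining n_plates and exclude_wells is not echoed in this rendering. A minimal sketch that is consistent with the 8-column × 6-row plate dimensions used below (the exact corner coordinates are an assumption):

```r
# Sketch (setup chunk not echoed): two 48-well plates, excluding the
# four corner wells of each 8-column x 6-row plate
library(tidyr)

n_plates <- 2

# all combinations of plate, corner column, and corner row
exclude_wells <- expand_grid(
  plate = seq_len(n_plates),
  column = c(1, 8), # leftmost and rightmost columns
  row = c(1, 6) # top and bottom rows
)
```

With 2 × 48 wells minus 8 excluded corners, this leaves the 88 locations reported by the batch container below.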
Create a BatchContainer object that provides all possible locations:
bc <- BatchContainer$new(
  dimensions = c("plate" = n_plates, "column" = 8, "row" = 6),
  exclude = exclude_wells
)
bc
#> Batch container with 88 locations.
#> Dimensions: plate, column, row

bc$n_locations
#> [1] 88
bc$exclude
#> NULL
bc$get_locations() %>% head()
#> # A tibble: 6 × 3
#> plate column row
#> <int> <int> <int>
#> 1 1 1 2
#> 2 1 1 3
#> 3 1 1 4
#> 4 1 1 5
#> 5 1 2 1
#> 6 1 2 2
Use the random assignment function to place samples at plate locations:
bc <- assign_random(bc, samples)

bc$get_samples()
#> # A tibble: 88 × 10
#> plate column row replicate animal group treatment dose SampleID AnimalID
#> <int> <int> <int> <chr> <chr> <dbl> <chr> <fct> <chr> <chr>
#> 1 1 1 2 2 3 1 vehicle 0 vehicle_3… vehicle…
#> 2 1 1 3 NA NA NA NA NA NA NA
#> 3 1 1 4 3 1 3 TRT2 25 TRT2_1_3 TRT2_1
#> 4 1 1 5 3 1 1 vehicle 0 vehicle_1… vehicle…
#> 5 1 2 1 2 3 5 TRT2 50 TRT2_3_2 TRT2_3
#> 6 1 2 2 1 3 3 TRT2 25 TRT2_3_1 TRT2_3
#> 7 1 2 3 3 3 3 TRT2 25 TRT2_3_3 TRT2_3
#> 8 1 2 4 4 2 4 TRT1 50 TRT1_2_4 TRT1_2
#> 9 1 2 5 NA NA NA NA NA NA NA
#> 10 1 2 6 2 1 3 TRT2 25 TRT2_1_2 TRT2_1
#> # ℹ 78 more rows
bc$get_samples(remove_empty_locations = TRUE)
#> # A tibble: 60 × 10
#> plate column row replicate animal group treatment dose SampleID AnimalID
#> <int> <int> <int> <chr> <chr> <dbl> <chr> <fct> <chr> <chr>
#> 1 1 1 2 2 3 1 vehicle 0 vehicle_3… vehicle…
#> 2 1 1 4 3 1 3 TRT2 25 TRT2_1_3 TRT2_1
#> 3 1 1 5 3 1 1 vehicle 0 vehicle_1… vehicle…
#> 4 1 2 1 2 3 5 TRT2 50 TRT2_3_2 TRT2_3
#> 5 1 2 2 1 3 3 TRT2 25 TRT2_3_1 TRT2_3
#> 6 1 2 3 3 3 3 TRT2 25 TRT2_3_3 TRT2_3
#> 7 1 2 4 4 2 4 TRT1 50 TRT1_2_4 TRT1_2
#> 8 1 2 6 2 1 3 TRT2 25 TRT2_1_2 TRT2_1
#> 9 1 3 1 4 1 4 TRT1 50 TRT1_1_4 TRT1_1
#> 10 1 3 2 1 1 3 TRT2 25 TRT2_1_1 TRT2_1
#> # ℹ 50 more rows
Plot the result using the plot_plate function:

plot_plate(bc,
  plate = plate, column = column, row = row,
  .color = treatment, .alpha = dose
)
To avoid showing empty wells, we can plot the sample table directly instead:

plot_plate(bc$get_samples(remove_empty_locations = TRUE),
  plate = plate, column = column, row = row,
  .color = treatment, .alpha = dose
)
To move individual samples or to manually assign all locations, we can use the BatchContainer$move_samples() method.

To swap two or more samples, use:

Warning: This will change your BatchContainer in-place.
bc$move_samples(src = c(1L, 2L), dst = c(2L, 1L))

plot_plate(bc$get_samples(remove_empty_locations = TRUE),
  plate = plate, column = column, row = row,
  .color = treatment, .alpha = dose
)
To assign all samples in one go, use the option location_assignment.

Warning: This will change your BatchContainer in-place.

The example below orders samples by ID and adds the empty locations afterwards.
bc$move_samples(
  location_assignment = c(
    1:nrow(samples),
    rep(NA, (bc$n_locations - nrow(samples)))
  )
)

plot_plate(bc$get_samples(remove_empty_locations = TRUE, include_id = TRUE),
  plate = plate, column = column, row = row,
  .color = .sample_id
)
The optimization procedure is invoked with, e.g., optimize_design. Here we use a simple shuffling schedule: swap 10 samples 200 times, then swap 2 samples 400 times.

To evaluate how good a layout is, we need a scoring function. This function will assess how well treatment and dose are balanced across the two plates.
bc <- optimize_design(bc,
  scoring = osat_score_generator(
    batch_vars = "plate",
    feature_vars = c("treatment", "dose")
  ),
  # shuffling schedule
  n_shuffle = c(rep(10, 200), rep(2, 400))
)
#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
#> : NAs in features / batch columns; they will be excluded from scoring
#> Checking variances of 1-dim. score vector.
#> ... (266.138) - OK
#> Initial score: 80
#> Achieved score: 54 at iteration 1
#> Achieved score: 44 at iteration 3
#> Achieved score: 42 at iteration 4
#> Achieved score: 30 at iteration 5
#> Achieved score: 6 at iteration 6
#> Achieved score: 2 at iteration 7
#> Achieved score: 0 at iteration 240
The development of the score can be viewed with:

bc$plot_trace()
After plate batching, the layout looks as follows:

plot_plate(bc$get_samples(remove_empty_locations = TRUE),
  plate = plate, column = column, row = row,
  .color = treatment, .alpha = dose
)
Looking at treatment, we see it’s evenly distributed across the plates:

ggplot(
  bc$get_samples(remove_empty_locations = TRUE),
  aes(x = treatment, fill = treatment)
) +
  geom_bar() +
  facet_wrap(~plate)
To properly distinguish between empty and excluded locations, call plot_plate with add_excluded = TRUE, set rename_empty = TRUE, and assign a color to the excluded wells via na.value.
color_palette <- c(
  TRT1 = "blue", TRT2 = "purple",
  vehicle = "orange", empty = "white"
)

plot_plate(bc,
  plate = plate, column = column, row = row,
  .color = treatment, .alpha = dose,
  add_excluded = TRUE, rename_empty = TRUE
) +
  scale_fill_manual(values = color_palette, na.value = "darkgray")
To remove all empty wells from the plot, hand the pruned sample list to plot_plate rather than the whole BatchContainer. You can still assign your own colors.

plot_plate(bc$get_samples(remove_empty_locations = TRUE),
  plate = plate, column = column, row = row,
  .color = treatment, .alpha = dose
) +
  scale_fill_viridis_d()
Note: removing all empty and excluded wells will lead to completely empty rows or columns being omitted!

plot_plate(
  bc$get_samples(remove_empty_locations = TRUE) %>%
    filter(column != 2),
  plate = plate, column = column, row = row,
  .color = treatment, .alpha = dose
) +
  scale_fill_viridis_d()
library(designit)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In this example we would like to distribute animals among cages, with constraints:
set.seed(43)
samples <- tibble(
  id = 1:params$n_samples,
  sex = sample(c("F", "M"), params$n_samples, replace = TRUE, prob = c(0.8, 0.2)),
  group = sample(c("treatment", "control"), params$n_samples, replace = TRUE),
  weight = runif(params$n_samples, 20, 30)
)
samples %>%
  head()
#> # A tibble: 6 × 4
#> id sex group weight
#> <int> <chr> <chr> <dbl>
#> 1 1 F treatment 26.2
#> 2 2 M treatment 26.1
#> 3 3 F control 26.6
#> 4 4 F control 25.4
#> 5 5 F treatment 24.3
#> 6 6 F treatment 21.5
We create a BatchContainer with 11 cages and 5 positions per cage. Note that positions do not actually matter; this is just to limit the number of animals per cage.

We start by assigning samples randomly.
set.seed(42)
bc <- BatchContainer$new(
  dimensions = c("cage" = 11, "position" = 5)
) %>%
  assign_random(samples)
bc
#> Batch container with 55 locations and 50 samples (assigned).
#> Dimensions: cage, position
Functions to plot the number of males per cage, the weights per cage, and the treatment/control ratios:
males_per_cage <- function(bc) {
  bc$get_samples() %>%
    filter(sex == "M") %>%
    count(cage) %>%
    ggplot(aes(cage, n)) +
    geom_col()
}

weight_d <- function(bc) {
  bc$get_samples() %>%
    ggplot(aes(factor(cage), weight)) +
    geom_violin() +
    geom_point() +
    stat_summary(fun = mean, geom = "point", size = 2, shape = 23, color = "red")
}

group_d <- function(bc) {
  bc$get_samples(remove_empty_locations = TRUE) %>%
    ggplot(aes(factor(cage), fill = group)) +
    geom_bar(position = "fill")
}

males_per_cage(bc)
weight_d(bc)
#> Warning: Removed 5 rows containing non-finite outside the scale range
#> (`stat_ydensity()`).
#> Warning: Removed 5 rows containing non-finite outside the scale range
#> (`stat_summary()`).
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_point()`).
group_d(bc)
First, we use the OSAT scoring function to ensure an even distribution of males among cages. Only cage and sex interactions are considered in the scoring function. We use only 10 iterations, since shuffling is limited to locations with males and enforces a change of cage on every iteration.
set.seed(10)

bc <- optimize_design(
  bc,
  scoring = osat_score_generator(
    "cage",
    "sex"
  ),
  shuffle_proposal_func = shuffle_with_constraints(
    sex == "M",
    cage != .src$cage
  ),
  max_iter = 10
)
#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
#> : NAs in features / batch columns; they will be excluded from scoring
#> Checking variances of 1-dim. score vector.
#> ... (21.451) - OK
#> Initial score: 15.818
#> Achieved score: 11.818 at iteration 1
#> Achieved score: 9.818 at iteration 10

bc$plot_trace()
We expect the distribution of males to become even, while other variables are not significantly affected.
males_per_cage(bc)
weight_d(bc)
#> Warning: Removed 5 rows containing non-finite outside the scale range
#> (`stat_ydensity()`).
#> Warning: Removed 5 rows containing non-finite outside the scale range
#> (`stat_summary()`).
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_point()`).
group_d(bc)
Here we define a custom scoring function which ensures an even distribution of weights and treatment/control groups. Only female samples are shuffled; male samples are kept in their locations. We also ensure that the cage number changes on every iteration; we do this because the position dimension does not affect the actual animal allocation.
scoring_f <- function(bc) {
  samples <- bc$get_samples(include_id = TRUE, as_tibble = FALSE)
  avg_w <- samples[, mean(weight, na.rm = TRUE)]
  avg_w_per_cage <- samples[!is.na(weight), mean(weight), by = cage]$V1
  trt_per_cage <- samples[!is.na(group), sum(group == "treatment") / .N, by = cage]$V1

  w_score <- mean((avg_w - avg_w_per_cage)**2)
  trt_score <- mean((trt_per_cage - 0.5)**2)
  w_score + 10 * trt_score
}

set.seed(12)
bc <- optimize_design(bc,
  scoring = scoring_f,
  shuffle_proposal = shuffle_with_constraints(
    sex == "F",
    cage != .src$cage & (is.na(sex) | sex != "M")
  ),
  n_shuffle = c(rep(10, 20), rep(5, 20), rep(3, 20), rep(1, 140)),
  max_iter = 200
)
#> Checking variances of 1-dim. score vector.
#> ... (0.457) - OK
#> Initial score: 2.657
#> Achieved score: 1.916 at iteration 2
#> Achieved score: 1.835 at iteration 5
#> Achieved score: 1.289 at iteration 9
#> Achieved score: 1.225 at iteration 10
#> Achieved score: 1.089 at iteration 11
#> Achieved score: 1.074 at iteration 16
#> Achieved score: 1.033 at iteration 18
#> Achieved score: 0.875 at iteration 19
#> Achieved score: 0.574 at iteration 22
#> Achieved score: 0.523 at iteration 23
#> Achieved score: 0.522 at iteration 32
#> Achieved score: 0.483 at iteration 41
#> Achieved score: 0.394 at iteration 49
#> Achieved score: 0.379 at iteration 67
#> Achieved score: 0.372 at iteration 73
#> Achieved score: 0.365 at iteration 84
#> Achieved score: 0.361 at iteration 90
#> Achieved score: 0.348 at iteration 96
#> Achieved score: 0.322 at iteration 102
#> Achieved score: 0.302 at iteration 131
#> Achieved score: 0.302 at iteration 141
#> Achieved score: 0.246 at iteration 165
#> Achieved score: 0.245 at iteration 166
#> Achieved score: 0.241 at iteration 184
#> Achieved score: 0.241 at iteration 191
#> Achieved score: 0.237 at iteration 196
bc$plot_trace()
scoring_f(bc)
#> [1] 0.2370109
Now we have a much more even distribution of weights and treatment/control balance.
males_per_cage(bc)
weight_d(bc)
#> Warning: Removed 5 rows containing non-finite outside the scale range
#> (`stat_ydensity()`).
#> Warning: Removed 5 rows containing non-finite outside the scale range
#> (`stat_summary()`).
#> Warning: Removed 5 rows containing missing values or values outside the scale range
#> (`geom_point()`).
group_d(bc)
vignettes/invivo_study_design.Rmd

This example demonstrates how recurring complex design problems of similar structure may be handled by writing dedicated wrappers that use designit functionality in the background while presenting a simplified interface to the user. These wrapper functions may completely hide the construction of batch containers, scoring functions, and other fundamental package concepts from the user, allowing them to focus on the correct specification of the concrete design task at hand.

We are using the very specific design constraints of certain in vivo studies as an example. The implementation of the respective wrapper functions won’t be discussed here, but the code may be inspected in the .Rmd file of the vignette if desired.
We would like to assign 3 treatment conditions to a cohort of 59 animals, representing 2 relevant strains. There are a few concrete user-specified constraints for the study; on top of that, we have to avoid confounding by common variables such as animal sex, body weight, and age.

The animal information is provided in a sample sheet; the treatment list has to be stored separately. The example data we are looking at is included in the package.
str(invivo_study_samples)
#> 'data.frame': 59 obs. of 8 variables:
#> $ AnimalID : chr "F1" "F2" "F3" "F4" ...
#> $ Strain : chr "Strain B" "Strain B" "Strain B" "Strain B" ...
#> $ Sex : chr "F" "F" "F" "F" ...
#> $ BirthDate : Date, format: "2021-05-24" "2021-03-01" ...
#> $ Earmark : chr "R" "2L" "2L1R" "L" ...
#> $ ArrivalWeight : num 19.4 26.5 20.8 22.1 22.9 ...
#> $ Arrival.weight.Unit: chr "g" "g" "g" "g" ...
#> $ Litter : chr "Litter 1" "Litter 2" "Litter 2" "Litter 2" ...

invivo_study_samples %>%
  dplyr::count(Strain, Sex, BirthDate) %>%
  gt::gt()
Strain | Sex | BirthDate | n
---|---|---|---
Strain A | F | NA | 7
Strain A | M | NA | 22
Strain B | F | 2021-03-01 | 4
Strain B | F | 2021-04-12 | 2
Strain B | F | 2021-05-24 | 1
Strain B | M | 2021-02-22 | 4
Strain B | M | 2021-03-15 | 8
Strain B | M | 2021-04-12 | 5
Strain B | M | 2021-05-17 | 3
Strain B | M | 2021-05-24 | 3
A simple data summary reveals that the cohort is almost equally composed of Strains A and B. There are male and female animals in quite different proportions, with a noticeable excess of males. Birth dates are available for Strain B, but missing completely for Strain A.

Initial body weights (arrival weights), identifying ear marks, and litter information are available for all animals. The litter is nested within the strain, and all individuals within one litter naturally share one birth date.
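The litter summary below comes from a chunk that is not echoed in the rendering; a count along these lines (an assumption based on the columns shown) reproduces it:

```r
# Litter is nested within strain; all animals of a litter share one birth date
invivo_study_samples %>%
  dplyr::count(Strain, Litter, BirthDate) %>%
  gt::gt()
```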
Strain | Litter | BirthDate | n
---|---|---|---
Strain A | Litter 10 | NA | 5
Strain A | Litter 11 | NA | 7
Strain A | Litter 12 | NA | 7
Strain A | Litter 13 | NA | 4
Strain A | Litter 14 | NA | 6
Strain B | Litter 1 | 2021-05-24 | 4
Strain B | Litter 2 | 2021-03-01 | 4
Strain B | Litter 3 | 2021-04-12 | 7
Strain B | Litter 4 | 2021-03-15 | 4
Strain B | Litter 5 | 2021-02-22 | 4
Strain B | Litter 6 | 2021-03-15 | 4
Strain B | Litter 7 | 2021-05-17 | 3
str(invivo_study_treatments)
#> 'data.frame': 59 obs. of 3 variables:
#> $ Treatment: chr "Treatment 1" "Treatment 1" "Treatment 1" "Treatment 1" ...
#> $ Strain : chr "Strain A" "Strain A" "Strain A" "Strain A" ...
#> $ Sex : chr "M" "M" "M" "M" ...

invivo_study_treatments %>%
  dplyr::count(Treatment, Strain, Sex) %>%
  gt::gt()
Treatment | Strain | Sex | n
---|---|---|---
Treatment 1 | Strain A | M | 10
Treatment 1 | Strain B | M | 10
Treatment 2 | Strain A | F | 5
Treatment 2 | Strain A | M | 5
Treatment 2 | Strain B | F | 5
Treatment 2 | Strain B | M | 5
Treatment 3 | Strain A | M | 6
Treatment 3 | Strain B | M | 6
untreated | Strain A | F | 2
untreated | Strain A | M | 1
untreated | Strain B | F | 2
untreated | Strain B | M | 2
We have 3 treatments that should each be administered to a defined number of animals. In addition, some satellite animals of either strain will not receive any treatment at all, which is specified by a fourth (‘untreated’) condition.

In most cases the treatment list could be reduced to the first column, i.e. repeating each label the right number of times so that the total length matches the sample sheet.

However, additional study-specific constraints may be specified by adding columns that also appear in the animal list and indicate how the treatments should be assigned to subgroups of the cohort. In this example, a different number of animals is used for each of the treatments, balanced across strains. However, female animals are only to be used for treatment 2.
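To illustrate the first point, such a reduced treatment list (hypothetical, not part of the package data) would simply repeat each label, using the per-treatment totals from the table above:

```r
# Hypothetical minimal treatment list: one row per animal, no constraint columns
treatments_simple <- data.frame(
  Treatment = rep(
    c("Treatment 1", "Treatment 2", "Treatment 3", "untreated"),
    times = c(20, 20, 12, 7) # totals from the counts above; sums to 59
  )
)
```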
The specific constraints for our type of in vivo study may be summarized as follows:

The very special and intricate nature of these requirements motivates the creation of dedicated functionality on top of this package, as demonstrated by this vignette.
Before using these functions, we add two auxiliary columns to the sample sheet:

invivo_study_samples <- dplyr::mutate(invivo_study_samples,
  AgeGroup = as.integer(factor(BirthDate, exclude = NULL)),
  Litter_combine_females = ifelse(Sex == "F", "female_all", Litter)
)

invivo_study_samples %>%
  dplyr::count(Strain, Litter_combine_females, BirthDate, AgeGroup) %>%
  gt::gt()
Strain | Litter_combine_females | BirthDate | AgeGroup | n
---|---|---|---|---
Strain A | Litter 10 | NA | 7 | 3
Strain A | Litter 11 | NA | 7 | 4
Strain A | Litter 12 | NA | 7 | 5
Strain A | Litter 13 | NA | 7 | 4
Strain A | Litter 14 | NA | 7 | 6
Strain A | female_all | NA | 7 | 7
Strain B | Litter 1 | 2021-05-24 | 6 | 3
Strain B | Litter 3 | 2021-04-12 | 4 | 5
Strain B | Litter 4 | 2021-03-15 | 3 | 4
Strain B | Litter 5 | 2021-02-22 | 1 | 4
Strain B | Litter 6 | 2021-03-15 | 3 | 4
Strain B | Litter 7 | 2021-05-17 | 5 | 3
Strain B | female_all | 2021-03-01 | 2 | 4
Strain B | female_all | 2021-04-12 | 4 | 2
Strain B | female_all | 2021-05-24 | 6 | 1
The process of solving the design problem can be divided into 3 successive steps, each of which is addressed by a dedicated in vivo wrapper function:

1. Assign treatments to individual animals (function InVivo_assignTreatments())
2. Allocate animals to cages (function Invivo_assignCages())
3. Arrange cages in one or more racks of given dimension (function Invivo_arrangeCages())

Dedicated constraints have to be handled at each step, as is reflected in the interface of those wrappers.

As stated above, implementation details are beyond the scope of this example. We will instead just show the interfaces of the three wrappers, run the example case, and visualize the resulting design.
InVivo_assignTreatments <- function(animal_list, treatments,
                                    balance_treatment_vars = c(),
                                    form_cages_by = c(),
                                    n_shuffle = c(rep(5, 100), rep(3, 200), rep(2, 500), rep(1, 20000)),
                                    quiet_process = FALSE, quiet_optimize = TRUE) {
  (...)
}
The function works with the initial animal and treatment lists.

Most importantly, balance_treatment_vars lists the variables that should be balanced across treatments (e.g. strain, sex, body weight, age, litter). Different scoring functions will be created for categorical and numerical covariates.

form_cages_by is not mandatory, but gives important clues regarding the variables that will later be homogeneous within each cage (e.g. strain, sex, litter). Providing this may be crucial for finding good solutions with a low number of single-housed animals that don’t fit into any other cage.

It is also possible to modify the shuffling protocol and to toggle messaging at the level of processing steps as well as optimization iterations.
Invivo_assignCages <- function(design_trt,
                               cagegroup_vars,
                               unique_vars = c(),
                               balance_cage_vars = c(),
                               n_min = 2, n_max = 5, n_ideal = 2, prefer_big_groups = TRUE, strict = TRUE,
                               maxiter = 5e3,
                               quiet_process = FALSE, quiet_optimize = TRUE) {
  (...)
}
This wrapper takes the output of the previous step (‘design_trt’) as input.

Invivo_arrangeCages <- function(design_cage,
                                distribute_cagerack_vars = "Treatment",
                                rack_size_x = 4,
                                rack_size_y = 4,
                                n_shuffle = c(rep(5, 100), rep(3, 400), rep(2, 500), rep(1, 4000)),
                                quiet_process = FALSE, quiet_optimize = TRUE) {
  (...)
}
This wrapper takes the output of the previous step (‘design_cage’) as input.

distribute_cagerack_vars is a list of variables that should be evenly spaced out across the rows and columns of a rack (or several racks, if needed). Typical cases may include the treatment, strain, and sex of the animals.

rack_size_x and rack_size_y specify the number of cages that fit into the rows and columns of a grid-like rack, respectively. Depending on the actual number of cages, one or more racks are automatically assigned. Only rectangular sub-grids of a rack may be used to accommodate the cages.
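As a quick sanity check on the rack sizing (a sketch using the cage count from the run below, not a package function): with 27 cages and 7 cages per rack row, the smallest rectangular sub-grid spans 4 full rows of 7, leaving one empty position:

```r
# Sketch: rectangular sub-grid needed for 27 cages with 7 cages per row
n_cages <- 27
rack_size_x <- 7
rows_needed <- ceiling(n_cages / rack_size_x) # 4
rows_needed * rack_size_x - n_cages # 1 empty position
```

This matches the "grid of 4 x 7 cages" and "1 empty position" messages reported by the run below.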
A full run of the three wrapper functions is executed below, printing messages on the level of processing steps, but not the iterations within every optimization.
+set.seed(44)
+
+# Assign treatments to animals, respecting user provided as well as passed constraints
+design_trt <- InVivo_assignTreatments(invivo_study_samples, invivo_study_treatments,
+ form_cages_by = c("Strain", "Sex", "Litter_combine_females"),
+ balance_treatment_vars = c("Strain", "Sex", "ArrivalWeight", "AgeGroup"),
+ n_shuffle = c(rep(5, 200), rep(3, 300), rep(2, 500), rep(1, 3000)),
+ quiet_process = FALSE,
+ quiet_optimize = TRUE
+)
+#> Performing treatment assignment with constrained animal selection.
+#> Using constraints in variables: Strain, Sex
+#> Checking if solution is possible:
+#> ... Yes!
+#> Setting up batch container.
+#> Constructing scoring functions:
+#> ... user specified treatment allocation constraint (Treatment-Strain-Sex)
+#> ... facilitating homogeneity of treatment in cages (CageGroup)
+#> ... ANOVA -logP for numerical variables balanced across treatment (ArrivalWeight, AgeGroup)
+#> Success. User provided constraints could be fully met.
+
+# Form cages with reasonable animal numbers and compliant with all constraints
+design_cage <- Invivo_assignCages(design_trt,
+ cagegroup_vars = c("Treatment", "Strain", "Sex", "Litter_combine_females"),
+ unique_vars = c("Earmark"),
+ balance_cage_vars = c("ArrivalWeight", "AgeGroup"),
+  n_min = 2, n_max = 5, n_ideal = 2, prefer_big_groups = TRUE, strict = FALSE,
+ maxiter = 1000,
+ quiet_process = FALSE,
+ quiet_optimize = TRUE
+)
+#> Setting up batch container.
+#>
+#> Formed 22 homogeneous groups using 59 samples.
+#> 27 subgroups needed to satisfy size constraints.
+#>
+#> Finding possible ways to allocate variable of interest with 1 levels ...
+#>
+#> Finished with 27 recursive calls.
+#> 1 allocations found.
+#>
+#> Expecting 27 cages to be created and 4 single-housed animals.
+#> Constructing scoring functions:
+#> ... ANOVA -logP for numerical variables balanced across cages (ArrivalWeight, AgeGroup)
+#> Adding 4 attributes to samples.
+
+# Arrange cages in sub-grid of one rack (or several racks), avoiding spatial clusters
+design_rack <- Invivo_arrangeCages(design_cage,
+ distribute_cagerack_vars = c("Treatment", "Strain", "Sex"),
+ rack_size_x = 7,
+ rack_size_y = 10,
+ n_shuffle = c(rep(5, 100), rep(3, 200), rep(2, 300), rep(1, 500)),
+ quiet_process = FALSE,
+ quiet_optimize = TRUE
+)
+#> Needing 1 rack with a grid of 4 x 7 cages.
+#> There will be 1 empty position overall.
+#> Setting up batch container.
+#>
+#> Distributing target variables (Treatment, Strain, Sex) within rack
+#> ... Rack 1
+#> ... Performing simple mean/stddev adjustment.
+#> ... final scores: Plate_Treatment: 5.12, Plate_Strain: 5.48, Plate_Sex: 5.72
+There are 27 cages in total.
+Strains and age groups should be evenly split (balanced) across the treatments. Also, in each cage there should be only animals with the same treatment, strain and sex.
+Females are exclusively used for treatment 2, as was specified in the treatment list.
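The cage homogeneity constraint can be verified with a simple grouping check; the snippet below sketches the idea on a toy data frame (in a real design, the assignment table would come from the batch container's get_samples() method):

```r
# Toy check: every cage should contain only one combination of
# treatment, strain and sex.
toy <- data.frame(
  Cage      = c(1, 1, 2, 2, 3),
  Treatment = c("T1", "T1", "T2", "T2", "T1"),
  Strain    = c("B6", "B6", "B6", "B6", "BALBC"),
  Sex       = c("M", "M", "F", "F", "M")
)
combos_per_cage <- tapply(
  paste(toy$Treatment, toy$Strain, toy$Sex),
  toy$Cage,
  function(x) length(unique(x))
)
all(combos_per_cage == 1)  # TRUE: all cages are homogeneous
```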
++
+Body weights should be balanced across treatments as well as possible.
+The plot illustrates that this is true for the overall weight distribution (box plots). Interestingly, as there are females (associated with considerably less body weight) involved in treatment 2, the optimization favored the selection of heavier males in this group to compensate, achieving better cross-treatment balance of this factor.
+Red diamonds mark the mean values for a specific sex within each treatment group.
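The ‘ANOVA -logP’ criterion reported in the scoring output above can be pictured as follows; this is a simplified sketch of the idea, not the package's internal implementation:

```r
# Balance-score sketch: an ANOVA of body weight vs. treatment yields a
# p-value; -log10(p) is large when groups differ, so the optimizer
# tries to keep it small.
set.seed(1)
weight    <- c(rnorm(10, mean = 26), rnorm(10, mean = 25))
treatment <- factor(rep(c("T1", "T2"), each = 10))
p <- anova(lm(weight ~ treatment))[["Pr(>F)"]][1]
balance_score <- -log10(p)
```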
++
+vignettes/nested_dimensions_examples.Rmd
+library(designit)
+library(tidyverse)
+#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
+#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
+#> ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
+#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
+#> ✔ purrr 1.0.2
+#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ✖ dplyr::filter() masks stats::filter()
+#> ✖ dplyr::lag() masks stats::lag()
+#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+data("multi_trt_day_samples")
Samples are grouped by Treatment and Collection time with the following group sizes:
Time | Treatment | n
---|---|---
4 | CTRL | 3
4 | SOC_TRT1 | 4
4 | SOC_TRT1_TRT2 | 3
4 | SOC_TRT1_TRT3 | 4
4 | SOC_TRT1_TRT3_TRT2 | 4
8 | SOC_TRT1 | 3
8 | SOC_TRT1_TRT2 | 3
8 | SOC_TRT1_TRT3 | 4
8 | SOC_TRT1_TRT3_TRT2 | 4
Total number of samples is: 32
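A table like the one above is a plain cross-tabulation; sketched here in base R on toy data (assuming, as the table suggests, that the samples carry Time and Treatment columns):

```r
# Toy sketch: count samples per Time x Treatment combination.
toy <- data.frame(
  Time      = c(4, 4, 4, 8, 8),
  Treatment = c("CTRL", "CTRL", "SOC_TRT1", "SOC_TRT1", "SOC_TRT1")
)
tab <- table(toy$Time, toy$Treatment)
sum(tab)  # total number of samples: 5
```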
+Samples are to be blocked in batches for scRNA-seq.
+
+# Setting up the batch container
+bc <- BatchContainer$new(
+ dimensions = c(
+ batch = ceiling(nrow(multi_trt_day_samples) / 8),
+ run = 2, position = 4
+ )
+)
+
+# Initial random assignment
+bc <- assign_in_order(bc, multi_trt_day_samples)
+bc
+#> Batch container with 32 locations and 32 samples (assigned).
+#> Dimensions: batch, run, position
The samples are distributed to 4 batches (processing days). This is done using OSAT scoring on sample Treatment and Time, optimizing by shuffling.
+n_shuffle <- rep(c(32, 10, 2), c(100, 80, 20))
+n_iterations <- length(n_shuffle)
+
+set.seed(42) # should we have conventions for this?
+
+scoring_f <- osat_score_generator(c("batch"), c("Treatment", "Time"))
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = n_shuffle,
+ max_iter = n_iterations
+) # default is 10000
+#> Re-defined number of swaps to 16 in swapping function.
+#> Checking variances of 1-dim. score vector.
+#> ... (29.235) - OK
+#> Initial score: 75
+#> Achieved score: 25 at iteration 1
+#> Achieved score: 21 at iteration 2
+#> Achieved score: 15 at iteration 4
+#> Achieved score: 13 at iteration 5
+#> Achieved score: 11 at iteration 95
+#> Achieved score: 9 at iteration 112
+#> Achieved score: 5 at iteration 188
NOTE: Here the shuffling procedure is short, as it was optimized for this vignette. In practice, you will have to run a much higher number of iterations.
+
+qplot(
+ x = bc$trace$scores[[1]]$step,
+ y = bc$trace$scores[[1]]$score_1,
+ color = factor(c(32, n_shuffle)),
+ main = str_glue("Final score={bc$score(scoring_f)}"), geom = "point"
+)
+#> Warning: `qplot()` was deprecated in ggplot2 3.4.0.
+#> This warning is displayed once every 8 hours.
+#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+#> generated.
+# copy batch container for second optimization
+bc2 <- assign_in_order(bc)
+
+n_iterations <- 200
+
+set.seed(42) # should we have conventions for this?
+
+bc2 <- optimize_design(
+ bc2,
+ scoring = scoring_f,
+ shuffle_proposal = shuffle_with_constraints(
+ src = TRUE,
+ # batch needs to change for shuffle to be accepted
+ dst = .src$batch != batch
+ ),
+ max_iter = n_iterations
+)
+#> Checking variances of 1-dim. score vector.
+#> ... (26.382) - OK
+#> Initial score: 75
+#> Achieved score: 67 at iteration 1
+#> Achieved score: 63 at iteration 2
+#> Achieved score: 53 at iteration 3
+#> Achieved score: 49 at iteration 4
+#> Achieved score: 47 at iteration 5
+#> Achieved score: 37 at iteration 6
+#> Achieved score: 35 at iteration 9
+#> Achieved score: 33 at iteration 11
+#> Achieved score: 29 at iteration 13
+#> Achieved score: 27 at iteration 14
+#> Achieved score: 25 at iteration 19
+#> Achieved score: 19 at iteration 20
+#> Achieved score: 17 at iteration 25
+#> Achieved score: 15 at iteration 36
+#> Achieved score: 13 at iteration 40
+#> Achieved score: 11 at iteration 49
+#> Achieved score: 9 at iteration 57
+#> Achieved score: 7 at iteration 79
+#> Achieved score: 5 at iteration 126
+#> Achieved score: 3 at iteration 180
+
+qplot(
+ x = bc2$trace$scores[[1]]$step,
+ y = bc2$trace$scores[[1]]$score_1,
+ main = str_glue("Final score={bc2$score(scoring_f)}"), geom = "point"
+)
+
+bc2$get_samples(assignment = TRUE) %>%
+ mutate(batch = factor(batch)) %>%
+ ggplot(aes(x = batch, fill = Treatment, alpha = factor(Time))) +
+ geom_bar()
+#> Warning: Using alpha for a discrete variable is not advised.
+NOTE: It is not possible to calculate the theoretically minimal osat score, right?
+Using shuffle with constraints
+Within each day there will be 2 runs (samples processed together)
+with 4 samples each. For this we keep the optimized batch
+and now only optimize run
with constraint.
+n_iterations <- 100
+
+# new optimization function
+scoring_f <- osat_score_generator(c("run"), c("Treatment", "Time"))
+# like this the optimization score is wrong because it tries to optimize across Batches.
+# Possible ways to go:
+# - we'd need something like c("batch", batch/run") for optimize by batch and run within batch.
+# - or we add "batch/run" to the constraints somehow.
+bc$score(scoring_f)
+#> score_1
+#> 16
+
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ shuffle_proposal = shuffle_with_constraints(
+ src = TRUE,
+ # batch remains the same and run needs to change
+ dst = batch == .src$batch & run != .src$run
+ ),
+ max_iter = n_iterations
+)
+#> Checking variances of 1-dim. score vector.
+#> ... (36.071) - OK
+#> Initial score: 16
+#> Achieved score: 14 at iteration 2
+#> Achieved score: 8 at iteration 4
+#> Achieved score: 6 at iteration 10
+#> Achieved score: 4 at iteration 11
+#> Achieved score: 2 at iteration 16
+qplot(
+ x = bc$trace$scores[[1]]$step,
+ y = bc$trace$scores[[1]]$score_1,
+ color = factor(n_iterations),
+ main = str_glue("Final score={bc$score(scoring_f)}"), geom = "point"
+)
+data("multi_trt_day_samples")
Samples are grouped by Treatment and Collection time with the following group sizes:
Time | Treatment | n
---|---|---
4 | CTRL | 3
4 | SOC_TRT1 | 4
4 | SOC_TRT1_TRT2 | 3
4 | SOC_TRT1_TRT3 | 4
4 | SOC_TRT1_TRT3_TRT2 | 4
8 | SOC_TRT1 | 3
8 | SOC_TRT1_TRT2 | 3
8 | SOC_TRT1_TRT3 | 4
8 | SOC_TRT1_TRT3_TRT2 | 4
Total number of samples is: 32
+Samples are to be blocked in batches for scRNA-seq.
+This data set is also used in the nested dimensions example. Here, we focus on using different methods for the optimization.
+We allocate surplus positions in the batch container and some excluded positions to check that all optimization methods support empty container positions.
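The capacity arithmetic behind this container works out as follows:

```r
# 4 batches x 2 runs x 5 positions = 40 slots; excluding 2 positions
# leaves 38 locations for the 32 samples, i.e. 6 empty positions.
total_slots <- 4 * 2 * 5
free_slots  <- total_slots - 2
free_slots - 32  # 6 empty positions
```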
+
+# Setting up the batch container
+bc <- BatchContainer$new(
+ dimensions = c(
+ batch = ceiling(nrow(multi_trt_day_samples) / 8),
+ run = 2, position = 5
+ ),
+ exclude = tibble::tibble(batch = 4, run = c(1, 2), position = c(5, 5))
+) %>%
+ # Add samples to container
+ assign_in_order(samples = multi_trt_day_samples)
+
+bc
+#> Batch container with 38 locations and 32 samples (assigned).
+#> Dimensions: batch, run, position
The samples are distributed to 4 batches (processing days). We use the OSAT scoring on sample Treatment and Time, first using a shuffling protocol with a fixed number of sample swaps on each iteration.
Note that doing 32 swaps on 38 free container positions does not make sense, since each swapping operation affects two different positions anyway. The upper limit is therefore reduced on the fly to the maximum number of meaningful swaps (19).
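The cap of 19 follows directly from the fact that one swap involves two positions:

```r
# At most floor(38 / 2) = 19 disjoint swaps are possible among 38 free
# container positions, since each swap touches two positions.
max_swaps <- floor(38 / 2)
max_swaps  # 19
```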
+Optimization finishes after the list of permutations is exhausted.
+
+n_shuffle <- rep(c(32, 10, 5, 2, 1), c(20, 40, 40, 50, 50))
+
+scoring_f <- osat_score_generator(c("batch"), c("Treatment", "Time"))
+
+bc1 <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = n_shuffle # will implicitly generate a shuffling function according to the provided schedule
+)
+#> Re-defined number of swaps to 19 in swapping function.
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 1-dim. score vector.
+#> ... (25.909) - OK
+#> Initial score: 73.03
+#> Achieved score: 23.452 at iteration 1
+#> Achieved score: 23.136 at iteration 2
+#> Achieved score: 19.136 at iteration 7
+#> Achieved score: 16.399 at iteration 10
+#> Achieved score: 13.662 at iteration 21
+#> Achieved score: 10.188 at iteration 44
+#> Achieved score: 10.083 at iteration 87
+#> Achieved score: 10.083 at iteration 117
+#> Achieved score: 8.083 at iteration 130
+#> Achieved score: 6.083 at iteration 136
+
+bc1$trace$elapsed
+#> Time difference of 0.891397 secs
Custom plot with some colours:
+
+bc1$scores_table() %>%
+ dplyr::mutate(
+ n_shuffle = c(NA, n_shuffle)
+ ) %>%
+ ggplot2::ggplot(
+ ggplot2::aes(step, value, color = factor(n_shuffle))
+ ) +
+ ggplot2::geom_point() +
+ ggplot2::labs(
+ title = "Score 1 tracing",
+ subtitle = stringr::str_glue("Final score = {bc1$score(scoring_f)}"),
+ x = "Iteration",
+ y = "Score",
+ color = "n_shuffle"
+ )
Using the internal method…
+
+bc1$plot_trace()
The batch container methods get_samples() and score() can safely be used after running the new optimization code as well.
+
+bc1$score(scoring_f)
+#> score_1
+#> 6.083102
+
+bc1$get_samples(assignment = TRUE) %>%
+ dplyr::filter(!is.na(Treatment)) %>%
+ dplyr::mutate(anno = stringr::str_c(Time, " hr")) %>%
+ ggplot2::ggplot(ggplot2::aes(x = batch, y = interaction(position, run), fill = Treatment)) +
+ ggplot2::geom_tile(color = "white") +
+ ggplot2::geom_hline(yintercept = 5.5, size = 1) +
+ ggplot2::geom_text(ggplot2::aes(label = anno)) +
+ ggplot2::labs(x = "Batch", y = "Position . Run")
+#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+#> ℹ Please use `linewidth` instead.
+#> This warning is displayed once every 8 hours.
+#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
+#> generated.
Further optimization (perhaps using a different shuffling protocol) can be done immediately on the same batch container.
+
+n_shuffle <- rep(c(5, 2, 1), c(30, 30, 30))
+
+bc1 <- optimize_design(
+ bc1,
+ scoring = scoring_f,
+ n_shuffle = n_shuffle
+)
+#> Checking variances of 1-dim. score vector.
+#> ... (27.236) - OK
+#> Initial score: 6.083
+Starting optimization from scratch, we now pass some stopping criteria that may terminate the optimization before the shuffling protocol has been exhausted.
+For demonstration, we now use a shuffling function that performs 3 sample (position) swaps per iteration and can be called an arbitrary number of times. Iteration therefore has to be stopped either by the max_iter criterion or by reaching a minimum delta threshold (the score improvement from one selected solution to the next).
+
+bc2 <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = 3, # will implicitly generate a shuffling function that will do 3 swaps at each iteration
+ max_iter = 2000,
+ min_delta = 0.1
+)
+#> Checking variances of 1-dim. score vector.
+#> ... (32.811) - OK
+#> Initial score: 73.03
+#> Achieved score: 57.767 at iteration 1
+#> Achieved score: 49.767 at iteration 2
+#> Achieved score: 35.767 at iteration 3
+#> Achieved score: 30.188 at iteration 4
+#> Achieved score: 20.188 at iteration 5
+#> Achieved score: 16.188 at iteration 8
+#> Achieved score: 9.978 at iteration 9
+#> Achieved score: 8.083 at iteration 38
+#> Achieved score: 6.399 at iteration 43
+#> Achieved score: 4.504 at iteration 104
+#> Achieved score: 4.188 at iteration 235
+#> Achieved score: 3.873 at iteration 738
+#> Achieved score: 3.873 at iteration 965
+#> Reached min delta in 965 iterations.
Instead of passing a single scoring function, a list of multiple scoring functions can be passed to the optimizer, each of which must return a scalar value on evaluation.
+By default, a strict improvement rule is applied for classifying a potential solution as “better”: each of the individual scores has to be smaller than or equal to its previous value, and at least one of the scores has to strictly improve.
+However, the user could specify other methods for aggregating the scores or defining the acceptance criterion. See later examples.
+The second scoring function used here is in fact rather redundant and serves only for illustration.
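The default acceptance rule described above can be sketched as a small predicate (an illustration of the logic, not the package's actual internal code):

```r
# Strict improvement: no individual score may get worse, and at least
# one must strictly improve.
accept_default <- function(new_scores, old_scores) {
  all(new_scores <= old_scores) && any(new_scores < old_scores)
}
accept_default(c(5, 3), c(6, 3))  # TRUE: first score improved
accept_default(c(5, 4), c(6, 3))  # FALSE: second score got worse
```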
+
+multi_scoring_f <- list(
+ osat_score_generator(c("batch"), c("Treatment", "Time")),
+ osat_score_generator(c("batch"), c("Treatment"))
+)
+
+
+bc3 <- optimize_design(
+ bc,
+ scoring = multi_scoring_f,
+ n_shuffle = 3,
+ max_iter = 200,
+ min_delta = 0.1
+)
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 2-dim. score vector.
+#> ... (26.175, 64.376) - OK
+#> Initial score: c(73.03, 44.803)
+#> Achieved score: c(57.03, 32.803) at iteration 1
+#> Achieved score: c(46.925, 28.593) at iteration 2
+#> Achieved score: c(45.346, 27.33) at iteration 3
+#> Achieved score: c(33.346, 23.33) at iteration 4
+#> Achieved score: c(33.346, 21.33) at iteration 5
+#> Achieved score: c(29.346, 21.33) at iteration 6
+#> Achieved score: c(19.767, 14.066) at iteration 7
+#> Achieved score: c(14.083, 10.488) at iteration 13
+#> Achieved score: c(12.504, 9.224) at iteration 22
+#> Achieved score: c(8.399, 9.224) at iteration 50
+#> Achieved score: c(8.083, 8.909) at iteration 66
+#> Achieved score: c(6.083, 4.909) at iteration 99
+#> Achieved score: c(2.609, 1.856) at iteration 150
Note that the first score tends to yield higher values than the second one. This could be a problem when trying to select a solution based on an aggregated, overall score. We repeat the same optimization now using the autoscaling functionality of the optimizer.
+We’re just adding the autoscale_scores option here to estimate the distribution of individual scores on a number of completely random sample assignments (200 in this case) and then apply a transformation to rescale each score to a standard normal.
Note that by ‘normalizing’ the distribution of the scores we obtain values centered around zero, so the optimized scores are likely to be negative. We may also want to decrease the min_delta parameter to match the new numerical range.
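The rescaling can be pictured as a z-transform against scores observed on random assignments (a simplified sketch; as the log output shows, the optimizer additionally performs a Box-Cox lambda estimation):

```r
# Simplified autoscaling: scores of random permutations estimate a null
# distribution; new scores are centered and scaled against it.
null_scores <- c(60, 70, 75, 80, 85)  # made-up scores of random designs
autoscale <- function(s) (s - mean(null_scores)) / sd(null_scores)
autoscale(73)  # close to 0: typical of a random assignment
autoscale(6)   # strongly negative: much better than random
```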
+
+bc3_as <- optimize_design(
+ bc,
+ scoring = multi_scoring_f,
+ n_shuffle = 3,
+ max_iter = 200,
+ min_delta = 0.01,
+  autoscale_scores = TRUE,
+ autoscaling_permutations = 200
+)
+#> Checking variances of 2-dim. score vector.
+#> ... (24.866, 47.009) - OK
+#> Creating autoscaling function for 2-dim. score vector. (200 random permutations)
+#> ... Performing boxcox lambda estimation.
+#> Initial score: c(6.232, 2.641)
+#> Achieved score: c(4.724, 1.231) at iteration 1
+#> Achieved score: c(2.903, -0.295) at iteration 2
+#> Achieved score: c(1.865, -1.311) at iteration 4
+#> Achieved score: c(0.967, -1.311) at iteration 5
+#> Achieved score: c(0.024, -1.548) at iteration 6
+#> Achieved score: c(-0.348, -1.548) at iteration 7
+#> Achieved score: c(-0.368, -1.593) at iteration 12
+#> Achieved score: c(-0.614, -1.78) at iteration 19
+#> Achieved score: c(-0.915, -2.475) at iteration 20
+#> Achieved score: c(-1.347, -2.475) at iteration 39
+#> Achieved score: c(-1.813, -2.475) at iteration 69
+#> Achieved score: c(-2.982, -3.353) at iteration 92
+#> Achieved score: c(-3.638, -3.353) at iteration 135
+#> Achieved score: c(-4.601, -3.664) at iteration 147
With directly comparable scores, it may now be reasonable to use a function that aggregates the scores to decide on the best iteration (instead of looking at the scores individually).
+An easy way to do this is to use the built-in worst_score function. This simply sets the aggregated score to whichever of the individual scores is larger (i.e. ‘worse’ in terms of the optimization).
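The semantics of worst_score can be sketched in one line (an illustration of the idea; the package's own function may differ in signature):

```r
# Aggregate several scores by taking the maximum, i.e. the 'worst' one.
worst_of <- function(scores) max(scores)
worst_of(c(-1.5, 0.3))  # 0.3
```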
+
+bc4 <- optimize_design(
+ bc,
+ scoring = multi_scoring_f,
+ n_shuffle = 3,
+ aggregate_scores_func = worst_score,
+ max_iter = 200,
+ autoscale_scores = TRUE,
+ autoscaling_permutations = 200
+)
+#> Checking variances of 2-dim. score vector.
+#> ... (24.094, 59.118) - OK
+#> Creating autoscaling function for 2-dim. score vector. (200 random permutations)
+#> ... Performing boxcox lambda estimation.
+#> Initial score: c(6.38, 2.259)
+#> Aggregated: 6.38
+#> Achieved score: c(5.005, 2.005) at iteration 1
+#> Aggregated: 5.005
+#> Achieved score: c(4.13, 1.727) at iteration 2
+#> Aggregated: 4.13
+#> Achieved score: c(3.699, 1.283) at iteration 3
+#> Aggregated: 3.699
+#> Achieved score: c(2.688, 1.75) at iteration 4
+#> Aggregated: 2.688
+#> Achieved score: c(1.529, 1.594) at iteration 5
+#> Aggregated: 1.594
+#> Achieved score: c(0.889, 1.1) at iteration 7
+#> Aggregated: 1.1
+#> Achieved score: c(-0.986, 0.517) at iteration 8
+#> Aggregated: 0.517
+#> Achieved score: c(-0.917, 0.082) at iteration 11
+#> Aggregated: 0.082
+#> Achieved score: c(-1.641, -0.548) at iteration 16
+#> Aggregated: -0.548
+#> Achieved score: c(-1.149, -1.192) at iteration 18
+#> Aggregated: -1.149
+#> Achieved score: c(-2.149, -1.635) at iteration 19
+#> Aggregated: -1.635
+#> Achieved score: c(-2.149, -2.077) at iteration 31
+#> Aggregated: -2.077
+#> Achieved score: c(-2.234, -2.298) at iteration 57
+#> Aggregated: -2.234
+#> Achieved score: c(-2.803, -2.298) at iteration 59
+#> Aggregated: -2.298
+#> Achieved score: c(-4.062, -2.891) at iteration 111
+#> Aggregated: -2.891
+#> Achieved score: c(-4.104, -2.933) at iteration 127
+#> Aggregated: -2.933
+#> Achieved score: c(-4.785, -3.457) at iteration 192
+#> Aggregated: -3.457
Another - more interesting - option would be to aggregate the two scores by taking their sum. This way both scores will influence the optimization at every step.
+For illustration, we omit the n_shuffle parameter here, which by default leads to pairwise sample swaps being done on each iteration.
+bc5 <- optimize_design(
+ bc,
+ scoring = multi_scoring_f,
+ aggregate_scores_func = sum_scores,
+ max_iter = 200,
+ autoscale_scores = TRUE,
+ autoscaling_permutations = 200
+)
As a final example, we calculate the (squared) L2 norm to aggregate the two scores. Note that this choice is not really motivated in this case, but it could be used if optimization was carried out on meaningful distance vectors or normalized n-tuples.
+We don’t use the auto-scaling in this case, as the L2-norm based optimization would force both normalized scores towards zero rather than towards the minimal (negative) values that would be desired then.
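The (squared) L2 aggregation amounts to summing the squared scores; a sketch of the idea (the package's L2s_norm may differ in detail):

```r
# Squared L2 norm of the score vector. Minimizing it drives all scores
# toward zero, which is why it is not combined with autoscaling here.
l2_squared <- function(scores) sum(scores^2)
l2_squared(c(3, 4))  # 25
```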
+
+bc5_2 <- optimize_design(
+ bc,
+ scoring = multi_scoring_f,
+ aggregate_scores_func = L2s_norm,
+ max_iter = 200,
+)
It is recommended to use the n_shuffle parameter to steer the optimization protocol. However, you may also provide a dedicated shuffling function that on each call returns either a shuffling order (as an integer vector) or a list with the source and destination positions (src and dst) of the samples to be swapped.
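A custom proposal could look like the following sketch (hypothetical code; the exact arguments that optimize_design passes to a shuffle function may differ):

```r
# Hypothetical shuffle proposal: pick two distinct locations at random
# and propose swapping them, returned as src/dst index vectors.
random_pair_swap <- function(n_locations) {
  pos <- sample(seq_len(n_locations), 2)
  list(src = pos, dst = rev(pos))
}
set.seed(1)
p <- random_pair_swap(38)
```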
The following example uses a template for creating complete random shuffles across all available positions in the batch container. Note that this is usually not a good strategy for converging to a solution.
+
+bc6 <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ shuffle_proposal_func = complete_random_shuffling,
+ max_iter = 200
+)
+#> Checking variances of 1-dim. score vector.
+#> ... (29.765) - OK
+#> Initial score: 73.03
+#> Achieved score: 18.294 at iteration 1
+#> Achieved score: 16.188 at iteration 3
+#> Achieved score: 15.03 at iteration 7
+#> Achieved score: 14.925 at iteration 15
+#> Achieved score: 12.083 at iteration 21
+#> Achieved score: 8.925 at iteration 101
Especially for very large search spaces, better solutions can often be obtained with a simulated annealing (SA) protocol, which allows the optimizer to jump over ‘energy barriers’ and thus more likely converge at lower local minima.
+The optimizer usually remembers the permutation with the best overall score to start with, but this behavior can be changed by supplying a simulated annealing protocol, most simply by generating a ready-made function template.
+It is generally recommended for SA to make small changes at each step, like allowing just 1 sample swap per iteration.
+Currently the simulated annealing protocol requires a single double-value score to be optimized. Choose an appropriate aggregation function if you happen to have multiple scores initially.
+
+bc7 <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = 1,
+ acceptance_func = mk_simanneal_acceptance_func(),
+ max_iter = 200
+)
+#> Checking variances of 1-dim. score vector.
+#> ... (30.893) - OK
+#> Initial score: 73.03
+#> Achieved score: 71.346 at iteration 1
+#> Achieved score: 71.346 at iteration 2
+#> Achieved score: 67.662 at iteration 3
+#> Achieved score: 61.767 at iteration 4
+#> Achieved score: 65.873 at iteration 5
+#> Achieved score: 55.873 at iteration 6
+#> Achieved score: 55.873 at iteration 7
+#> Achieved score: 51.873 at iteration 8
+#> Achieved score: 45.873 at iteration 9
+#> Achieved score: 41.452 at iteration 10
+#> Achieved score: 41.452 at iteration 11
+#> Achieved score: 41.452 at iteration 12
+#> Achieved score: 33.452 at iteration 13
+#> Achieved score: 29.452 at iteration 15
+#> Achieved score: 29.452 at iteration 16
+#> Achieved score: 31.452 at iteration 17
+#> Achieved score: 31.452 at iteration 18
+#> Achieved score: 29.452 at iteration 19
+#> Achieved score: 25.452 at iteration 20
+#> Achieved score: 25.452 at iteration 22
+#> Achieved score: 25.452 at iteration 23
+#> Achieved score: 21.452 at iteration 24
+#> Achieved score: 19.767 at iteration 26
+#> Achieved score: 19.767 at iteration 27
+#> Achieved score: 19.767 at iteration 28
+#> Achieved score: 19.767 at iteration 29
+#> Achieved score: 17.767 at iteration 30
+#> Achieved score: 18.188 at iteration 31
+#> Achieved score: 16.188 at iteration 32
+#> Achieved score: 14.188 at iteration 33
+#> Achieved score: 14.083 at iteration 34
+#> Achieved score: 12.083 at iteration 35
+#> Achieved score: 10.083 at iteration 36
+#> Achieved score: 10.083 at iteration 37
+#> Achieved score: 10.083 at iteration 38
+#> Achieved score: 10.083 at iteration 39
+#> Achieved score: 10.083 at iteration 40
+#> Achieved score: 10.188 at iteration 41
+#> Achieved score: 10.188 at iteration 43
+#> Achieved score: 10.188 at iteration 49
+#> Achieved score: 8.504 at iteration 51
+#> Achieved score: 8.399 at iteration 53
+#> Achieved score: 8.399 at iteration 57
+#> Achieved score: 8.399 at iteration 61
+#> Achieved score: 8.504 at iteration 67
+#> Achieved score: 8.504 at iteration 69
+#> Achieved score: 8.504 at iteration 72
+#> Achieved score: 8.504 at iteration 75
+#> Achieved score: 8.399 at iteration 77
+#> Achieved score: 8.399 at iteration 85
+#> Achieved score: 8.399 at iteration 86
+#> Achieved score: 8.399 at iteration 87
+#> Achieved score: 8.399 at iteration 88
+#> Achieved score: 8.399 at iteration 102
+#> Achieved score: 8.399 at iteration 103
+#> Achieved score: 8.399 at iteration 106
+#> Achieved score: 8.399 at iteration 107
+#> Achieved score: 7.978 at iteration 114
+#> Achieved score: 6.399 at iteration 120
+#> Achieved score: 6.399 at iteration 127
+#> Achieved score: 4.82 at iteration 174
+#> Achieved score: 3.241 at iteration 178
The trace may now show behavior that is not strictly monotonic, reflecting the SA protocol at work.
+
+bc7$plot_trace()
In a specific case, better results and quicker convergence may be achieved by tuning the starting temperature (T0) and the cooling speed (alpha).
+
+bc8 <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = 1,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 100, alpha = 2)),
+ max_iter = 150
+)
+#> Checking variances of 1-dim. score vector.
+#> ... (21.955) - OK
+#> Initial score: 73.03
+#> Achieved score: 69.03 at iteration 1
+#> Achieved score: 59.03 at iteration 2
+#> Achieved score: 53.452 at iteration 3
+#> Achieved score: 45.452 at iteration 4
+#> Achieved score: 45.452 at iteration 5
+#> Achieved score: 45.452 at iteration 7
+#> Achieved score: 45.452 at iteration 8
+#> Achieved score: 41.452 at iteration 9
+#> Achieved score: 35.873 at iteration 10
+#> Achieved score: 35.873 at iteration 11
+#> Achieved score: 33.767 at iteration 12
+#> Achieved score: 27.767 at iteration 13
+#> Achieved score: 25.767 at iteration 14
+#> Achieved score: 21.767 at iteration 15
+#> Achieved score: 20.083 at iteration 21
+#> Achieved score: 16.083 at iteration 25
+#> Achieved score: 14.083 at iteration 30
+#> Achieved score: 12.504 at iteration 39
+#> Achieved score: 8.504 at iteration 42
+#> Achieved score: 6.504 at iteration 62
+#> Achieved score: 4.504 at iteration 116
+
+bc8$plot_trace()
The following example puts together all possible options to illustrate the flexibility of the optimization.
+
+n_shuffle <- rep(c(3, 2, 1), c(20, 20, 200))
+
+bc9 <- optimize_design(
+ bc,
+ scoring = list(
+ osat_score_generator(c("batch"), c("Treatment", "Time")),
+ osat_score_generator(c("batch"), c("Treatment")),
+ osat_score_generator(c("batch"), c("Time"))
+ ),
+ n_shuffle = n_shuffle,
+ aggregate_scores_func = sum_scores,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 500, alpha = 1)),
+ max_iter = 200,
+ min_delta = 1e-8,
+  autoscale_scores = TRUE
+)
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 3-dim. score vector.
+#> ... (31.751, 74.52, 97.769) - OK
+#> Creating autoscaling function for 3-dim. score vector. (100 random permutations)
+#> ... Performing boxcox lambda estimation.
+#> Initial score: c(5.011, 2.287, 4.641)
+#> Aggregated: 11.939
+#> Achieved score: c(4.055, 1.288, 4.294) at iteration 1
+#> Aggregated: 9.637
+#> Achieved score: c(3.087, 0.785, 3.121) at iteration 2
+#> Aggregated: 6.993
+#> Achieved score: c(1.868, 0.346, 1.673) at iteration 3
+#> Aggregated: 3.887
+#> Achieved score: c(2.513, 0.785, 1.673) at iteration 4
+#> Aggregated: 4.971
+#> Achieved score: c(1.868, 0.785, 2.154) at iteration 5
+#> Aggregated: 4.807
+#> Achieved score: c(0.857, 1.196, 0.917) at iteration 6
+#> Aggregated: 2.969
+#> Achieved score: c(-0.348, -0.011, 0.248) at iteration 7
+#> Aggregated: -0.112
+#> Achieved score: c(-1.141, 0.245, -1.098) at iteration 8
+#> Aggregated: -1.994
+#> Achieved score: c(-0.368, 0.18, -1.015) at iteration 9
+#> Aggregated: -1.203
+#> Achieved score: c(-0.368, 0.14, -0.326) at iteration 10
+#> Aggregated: -0.553
+#> Achieved score: c(-0.665, -0.025, -1.036) at iteration 12
+#> Aggregated: -1.726
+#> Achieved score: c(-1.498, -0.878, -0.194) at iteration 14
+#> Aggregated: -2.57
+#> Achieved score: c(-1.05, -0.917, -0.144) at iteration 15
+#> Aggregated: -2.111
+#> Achieved score: c(0.071, -1.336, 0.122) at iteration 16
+#> Aggregated: -1.143
+#> Achieved score: c(-0.273, 0.003, -0.62) at iteration 18
+#> Aggregated: -0.89
+#> Achieved score: c(-0.273, 0.003, -1.486) at iteration 19
+#> Aggregated: -1.755
+#> Achieved score: c(0.071, -0.277, -2.153) at iteration 22
+#> Aggregated: -2.359
+#> Achieved score: c(-0.645, -1.336, -0.686) at iteration 24
+#> Aggregated: -2.666
+#> Achieved score: c(-0.665, -0.976, -0.686) at iteration 28
+#> Aggregated: -2.326
+#> Achieved score: c(-1.073, -0.976, -0.686) at iteration 29
+#> Aggregated: -2.734
+#> Achieved score: c(-1.073, -0.976, -0.686) at iteration 32
+#> Aggregated: -2.734
+#> Reached min delta in 32 iterations.
+
+bc9$plot_trace()
In this vignette, we demonstrate how to use the OSAT score (Yan et al. (2012)).
+
+library(designit)
+library(tidyverse)
+#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
+#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
+#> ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
+#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
+#> ✔ purrr 1.0.2
+#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ✖ dplyr::filter() masks stats::filter()
+#> ✖ dplyr::lag() masks stats::lag()
+#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+if (!requireNamespace("OSAT")) {
+ print("This vignette can only be rendered if `OSAT` package is installed.")
+ knitr::knit_exit()
+}
+#> Loading required namespace: OSAT
Loading samples. We add two dummy columns to demonstrate how to choose batch columns of interest.
+
+osat_data_path <- system.file("extdata", package = "OSAT")
+samples <- read_tsv(file.path(osat_data_path, "samples.txt"),
+ col_types = cols(SampleType = col_factor(), Race = col_factor(), AgeGrp = col_factor())
+) %>%
+ mutate(dummy_var1 = rnorm(n()), dummy_var2 = str_c(SampleType, Race, sep = " "))
Here we use OSAT to optimize the setup.
+
+gs <- OSAT::setup.sample(samples, optimal = c("SampleType", "Race", "AgeGrp"))
+
+gc <- OSAT::setup.container(OSAT::IlluminaBeadChip96Plate, 7, batch = "plates")
+
+set.seed(1234)
+
+bench::system_time(
+ g_setup <- OSAT::create.optimized.setup(sample = gs, container = gc, nSim = params$iterations)
+)
+#> Warning in OSAT::create.optimized.setup(sample = gs, container = gc, nSim =
+#> params$iterations): Using default optimization method: optimal.shuffle
+#> process real
+#> 423ms 420ms
+
+OSAT::QC(g_setup)
+#>
+#> Test independence between "plates" and sample variables
+#>
+#> Pearson's Chi-squared test
+#> Var X-squared df p.value
+#> 1 SampleType 0.1126719 6 0.9999714
+#> 2 Race 0.3062402 6 0.9994663
+#> 3 AgeGrp 1.8220606 24 1.0000000
+#>
Saving the starting point of the optimization
+
+set.seed(1234)
+
+g_setup_start <- OSAT::create.optimized.setup(sample = gs, container = gc, nSim = 1) %>%
+ OSAT::get.experiment.setup()
+#> Warning in OSAT::create.optimized.setup(sample = gs, container = gc, nSim = 1):
+#> Using default optimization method: optimal.shuffle
Visualize various batch factors. The OSAT score is optimized only for
+plates
in this case.
+OSAT::get.experiment.setup(g_setup) %>%
+ select(AgeGrp, plates, chipRows, chipColumns, chips, rows, columns, wells) %>%
+ pivot_longer(-AgeGrp) %>%
+ count(AgeGrp, value, name) %>%
+ ggplot(aes(AgeGrp, n, fill = factor(value))) +
+ geom_col(position = "dodge") +
+ facet_wrap(~name, scales = "free_y")
+plot_batch <- function(df) {
+ df %>%
+ select(plates, SampleType, Race, AgeGrp) %>%
+ pivot_longer(c(SampleType, Race, AgeGrp), names_to = "variable", values_to = "level") %>%
+ count(plates, variable, level) %>%
+ ggplot(aes(level, n, fill = factor(plates))) +
+ geom_col(position = "dodge") +
+ facet_wrap(~variable, scales = "free", ncol = 1)
+}
Before the optimization.
+
+g_setup_start %>% plot_batch()
After the optimization.
+
+OSAT::get.experiment.setup(g_setup) %>%
+ plot_batch()
Compare the OSAT score computed with designit to the one reported by OSAT.
+
+OSAT::getLayout(gc) %>%
+ left_join(OSAT::get.experiment.setup(g_setup)) %>%
+ data.table::data.table() %>%
+ osat_score("plates", c("SampleType", "Race", "AgeGrp")) %>%
+ .$score
+#> Joining with `by = join_by(plates, chipRows, chipColumns, chips, rows, columns,
+#> wells)`
+#> Warning in osat_score(., "plates", c("SampleType", "Race", "AgeGrp")): NAs in
+#> features / batch columns; they will be excluded from scoring
+#> [1] 26.85714
+
+# score using OSAT
+g_setup@metadata$optValue %>% tail(1)
+#> [1] 28.85714
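The scores being compared are OSAT scores, which essentially sum, over batches and factor levels, the squared deviation of the observed counts from the perfectly balanced expectation (Yan et al. 2012). A minimal sketch of this idea, with purely illustrative numbers not taken from this experiment:

```r
# Hedged sketch of the OSAT score idea: sum of squared deviations of the
# observed per-batch counts of a factor level from the balanced expectation.
osat_like_score <- function(observed, expected) {
  sum((observed - expected)^2)
}

observed <- c(10, 14, 12) # illustrative counts of one level across 3 batches
expected <- rep(12, 3)    # a perfectly balanced allocation
osat_like_score(observed, expected)
#> [1] 8
```

A fully balanced design would score 0, so the optimizer minimizes this quantity.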
First let’s create a BatchContainer with the same dimensions.
+
+bc <- BatchContainer$new(
+ dimensions = c(plates = 7, chips = 8, rows = 6, columns = 2)
+)
+bc
+#> Batch container with 672 locations.
+#> Dimensions: plates, chips, rows, columns
+
+bc$n_locations
+#> [1] 672
Assign samples and get initial setup.
+
+bc <- assign_in_order(bc, samples)
+
+starting_assignment <- bc$get_locations() %>%
+ left_join(g_setup_start) %>%
+ pull(ID) %>%
+ as.integer()
+#> Joining with `by = join_by(plates, chips, rows, columns)`
+
+bc$move_samples(location_assignment = starting_assignment)
+
+bc$get_samples(remove_empty_locations = TRUE) %>%
+ plot_batch()
+scoring_f <- osat_score_generator("plates", c("SampleType", "Race", "AgeGrp"))
+
+bc$score(scoring_f)
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+#> score_1
+#> 360.8571
+g_setup@metadata$optValue %>% head(1)
+#> [1] 360.8571
+# should be identical
+
+bench::system_time({
+ set.seed(123)
+ bc_reference <- optimize_design(bc, scoring = scoring_f, max_iter = params$iterations)
+})
+#> Checking variances of 1-dim. score vector.
+#> ... (6749.04) - OK
+#> Initial score: 360.857
+#> Achieved score: 348.857 at iteration 7
+#> Achieved score: 344.857 at iteration 8
+#> Achieved score: 338.857 at iteration 10
+#> Achieved score: 326.857 at iteration 19
+#> Achieved score: 322.857 at iteration 22
+#> Achieved score: 320.857 at iteration 24
+#> Achieved score: 318.857 at iteration 26
+#> Achieved score: 314.857 at iteration 27
+#> Achieved score: 308.857 at iteration 28
+#> Achieved score: 302.857 at iteration 32
+#> Achieved score: 300.857 at iteration 33
+#> Achieved score: 292.857 at iteration 36
+#> Achieved score: 290.857 at iteration 38
+#> Achieved score: 290.857 at iteration 39
+#> Achieved score: 284.857 at iteration 40
+#> Achieved score: 274.857 at iteration 45
+#> Achieved score: 270.857 at iteration 50
+#> Achieved score: 264.857 at iteration 54
+#> Achieved score: 256.857 at iteration 55
+#> Achieved score: 246.857 at iteration 60
+#> Achieved score: 240.857 at iteration 62
+#> Achieved score: 222.857 at iteration 71
+#> Achieved score: 220.857 at iteration 72
+#> Achieved score: 214.857 at iteration 73
+#> Achieved score: 202.857 at iteration 78
+#> Achieved score: 200.857 at iteration 79
+#> Achieved score: 192.857 at iteration 80
+#> Achieved score: 184.857 at iteration 84
+#> Achieved score: 178.857 at iteration 85
+#> Achieved score: 174.857 at iteration 86
+#> Achieved score: 172.857 at iteration 95
+#> Achieved score: 166.857 at iteration 96
+#> Achieved score: 156.857 at iteration 98
+#> Achieved score: 150.857 at iteration 99
+#> Achieved score: 146.857 at iteration 103
+#> Achieved score: 144.857 at iteration 108
+#> Achieved score: 142.857 at iteration 111
+#> Achieved score: 140.857 at iteration 113
+#> Achieved score: 136.857 at iteration 114
+#> Achieved score: 134.857 at iteration 128
+#> Achieved score: 130.857 at iteration 133
+#> Achieved score: 120.857 at iteration 138
+#> Achieved score: 116.857 at iteration 140
+#> Achieved score: 108.857 at iteration 149
+#> Achieved score: 106.857 at iteration 152
+#> Achieved score: 106.857 at iteration 164
+#> Achieved score: 104.857 at iteration 171
+#> Achieved score: 102.857 at iteration 175
+#> Achieved score: 100.857 at iteration 176
+#> Achieved score: 96.857 at iteration 179
+#> Achieved score: 92.857 at iteration 190
+#> Achieved score: 90.857 at iteration 191
+#> Achieved score: 88.857 at iteration 200
+#> Achieved score: 84.857 at iteration 205
+#> Achieved score: 78.857 at iteration 209
+#> Achieved score: 76.857 at iteration 223
+#> Achieved score: 74.857 at iteration 250
+#> Achieved score: 70.857 at iteration 265
+#> Achieved score: 68.857 at iteration 268
+#> Achieved score: 68.857 at iteration 283
+#> Achieved score: 64.857 at iteration 295
+#> Achieved score: 62.857 at iteration 336
+#> Achieved score: 60.857 at iteration 350
+#> Achieved score: 58.857 at iteration 368
+#> Achieved score: 56.857 at iteration 389
+#> Achieved score: 54.857 at iteration 406
+#> Achieved score: 52.857 at iteration 418
+#> Achieved score: 50.857 at iteration 428
+#> Achieved score: 50.857 at iteration 431
+#> Achieved score: 48.857 at iteration 433
+#> Achieved score: 46.857 at iteration 452
+#> Achieved score: 44.857 at iteration 496
+#> Achieved score: 42.857 at iteration 553
+#> Achieved score: 40.857 at iteration 562
+#> Achieved score: 38.857 at iteration 612
+#> Achieved score: 36.857 at iteration 717
+#> Achieved score: 34.857 at iteration 727
+#> Achieved score: 32.857 at iteration 828
+#> Achieved score: 30.857 at iteration 853
+#> Achieved score: 30.857 at iteration 859
+#> Achieved score: 28.857 at iteration 904
+#> Achieved score: 26.857 at iteration 970
+#> process real
+#> 3.47s 3.45s
+# final score
+bc_reference$score(scoring_f)
+#> score_1
+#> 26.85714
+bc_reference$plot_trace() +
+ ggtitle(str_glue("Final score={bc$score(scoring_f)}"))
+bc$get_samples(remove_empty_locations = TRUE) %>%
+ plot_batch()
data.table
+Instead of relying on BatchContainer
, here we implement a
+manual optimization process using data.table
.
+fast_osat_optimize <- function(bc, batch_vars, feature_vars, iterations) {
+ bc <- bc$copy()
+ ldf <- data.table::data.table(bc$get_locations())[, c("plates")][, ".sample_id" := bc$assignment]
+ fcols <- c(".sample_id", feature_vars)
+ smp <- data.table::data.table(bc$samples)[, ..fcols]
+ df <- smp[ldf, on = ".sample_id"]
+
+ v <- osat_score(df, batch_vars, feature_vars)
+ edf <- v$expected_dt
+ current_score <- v$score
+ scores <- numeric(length = iterations)
+ n_avail <- nrow(df)
+
+ for (i in 1:iterations) {
+ repeat {
+ pos <- sample(n_avail, 2)
+
+ # does not make sense to shuffle NAs
+ if (any(!is.na(df[pos, feature_vars[1]]))) {
+ break
+ }
+ }
+
+ val <- df[c(pos[2], pos[1]), fcols, with = FALSE]
+ df[c(pos[1], pos[2]), (fcols) := val]
+
+ new_score <- osat_score(df, batch_vars, feature_vars, edf)$score
+ if (new_score <= current_score) {
+ current_score <- new_score
+ } else {
+ df[c(pos[2], pos[1]), (fcols) := val]
+ }
+
+ scores[i] <- current_score
+ }
+
+ bc$assignment <- df$.sample_id
+
+ list(bc=bc, scores=scores)
+}
+
+bench::system_time({
+ set.seed(123)
+ opt_res <- fast_osat_optimize(bc, "plates", c("SampleType", "Race", "AgeGrp"), iterations = params$iterations)
+})
+#> Warning in osat_score(df, batch_vars, feature_vars): NAs in features / batch
+#> columns; they will be excluded from scoring
+#> Warning in (function (assignment) : this field might become read-only in the
+#> future, please use $move_samples() instead
+#> process real
+#> 2.22s 2.21s
+scoring_f <- osat_score_generator("plates", c("SampleType", "Race", "AgeGrp"))
+
+burn_in_it <- floor(params$iterations * 0.1)
+burn_in_it
+#> [1] 100
+
+bench::system_time({
+ set.seed(123)
+ bc_burn_in <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = c(
+ rep(20, burn_in_it),
+ rep(
+ 2,
+ params$iterations - burn_in_it
+ )
+ ),
+ max_iter = params$iterations
+ )
+})
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 1-dim. score vector.
+#> ... (5018.697) - OK
+#> Initial score: 360.857
+#> Achieved score: 322.857 at iteration 1
+#> Achieved score: 290.857 at iteration 5
+#> Achieved score: 278.857 at iteration 7
+#> Achieved score: 254.857 at iteration 8
+#> Achieved score: 238.857 at iteration 15
+#> Achieved score: 224.857 at iteration 28
+#> Achieved score: 216.857 at iteration 32
+#> Achieved score: 198.857 at iteration 34
+#> Achieved score: 196.857 at iteration 50
+#> Achieved score: 184.857 at iteration 54
+#> Achieved score: 172.857 at iteration 59
+#> Achieved score: 162.857 at iteration 75
+#> Achieved score: 158.857 at iteration 80
+#> Achieved score: 158.857 at iteration 93
+#> Achieved score: 156.857 at iteration 101
+#> Achieved score: 148.857 at iteration 102
+#> Achieved score: 144.857 at iteration 119
+#> Achieved score: 136.857 at iteration 125
+#> Achieved score: 130.857 at iteration 136
+#> Achieved score: 126.857 at iteration 139
+#> Achieved score: 124.857 at iteration 140
+#> Achieved score: 124.857 at iteration 141
+#> Achieved score: 118.857 at iteration 145
+#> Achieved score: 114.857 at iteration 147
+#> Achieved score: 112.857 at iteration 150
+#> Achieved score: 108.857 at iteration 161
+#> Achieved score: 100.857 at iteration 166
+#> Achieved score: 94.857 at iteration 168
+#> Achieved score: 92.857 at iteration 169
+#> Achieved score: 90.857 at iteration 176
+#> Achieved score: 88.857 at iteration 186
+#> Achieved score: 86.857 at iteration 188
+#> Achieved score: 82.857 at iteration 191
+#> Achieved score: 80.857 at iteration 194
+#> Achieved score: 78.857 at iteration 195
+#> Achieved score: 78.857 at iteration 199
+#> Achieved score: 76.857 at iteration 206
+#> Achieved score: 76.857 at iteration 213
+#> Achieved score: 72.857 at iteration 226
+#> Achieved score: 66.857 at iteration 231
+#> Achieved score: 62.857 at iteration 233
+#> Achieved score: 60.857 at iteration 237
+#> Achieved score: 58.857 at iteration 245
+#> Achieved score: 58.857 at iteration 249
+#> Achieved score: 56.857 at iteration 252
+#> Achieved score: 56.857 at iteration 256
+#> Achieved score: 54.857 at iteration 267
+#> Achieved score: 52.857 at iteration 290
+#> Achieved score: 50.857 at iteration 296
+#> Achieved score: 48.857 at iteration 300
+#> Achieved score: 42.857 at iteration 318
+#> Achieved score: 40.857 at iteration 348
+#> Achieved score: 40.857 at iteration 357
+#> Achieved score: 38.857 at iteration 384
+#> Achieved score: 36.857 at iteration 420
+#> Achieved score: 34.857 at iteration 485
+#> Achieved score: 32.857 at iteration 636
+#> Achieved score: 30.857 at iteration 783
+#> process real
+#> 3.46s 3.45s
+bc$score(scoring_f)
+#> score_1
+#> 360.8571
+assign_random(bc)
+#> Batch container with 672 locations and 576 samples (assigned).
+#> Dimensions: plates, chips, rows, columns
+
+bc$get_samples()
+#> # A tibble: 672 × 10
+#> plates chips rows columns ID SampleType Race AgeGrp dummy_var1
+#> <int> <int> <int> <int> <dbl> <fct> <fct> <fct> <dbl>
+#> 1 1 1 1 1 563 Control Hispanic (30,40] -0.0000588
+#> 2 1 1 1 2 488 Control European (30,40] 1.37
+#> 3 1 1 2 1 215 Case European (50,60] 0.0980
+#> 4 1 1 2 2 268 Case Hispanic (40,50] -0.0492
+#> 5 1 1 3 1 165 Case European (60,100] -1.24
+#> 6 1 1 3 2 86 Case Hispanic (60,100] -0.813
+#> 7 1 1 4 1 264 Case European (60,100] 0.127
+#> 8 1 1 4 2 451 Control European (30,40] 0.877
+#> 9 1 1 5 1 250 Case Hispanic (60,100] -1.96
+#> 10 1 1 5 2 NA NA NA NA NA
+#> # ℹ 662 more rows
+#> # ℹ 1 more variable: dummy_var2 <chr>
+bc$get_samples(remove_empty_locations = TRUE)
+#> # A tibble: 576 × 10
+#> plates chips rows columns ID SampleType Race AgeGrp dummy_var1
+#> <int> <int> <int> <int> <dbl> <fct> <fct> <fct> <dbl>
+#> 1 1 1 1 1 563 Control Hispanic (30,40] -0.0000588
+#> 2 1 1 1 2 488 Control European (30,40] 1.37
+#> 3 1 1 2 1 215 Case European (50,60] 0.0980
+#> 4 1 1 2 2 268 Case Hispanic (40,50] -0.0492
+#> 5 1 1 3 1 165 Case European (60,100] -1.24
+#> 6 1 1 3 2 86 Case Hispanic (60,100] -0.813
+#> 7 1 1 4 1 264 Case European (60,100] 0.127
+#> 8 1 1 4 2 451 Control European (30,40] 0.877
+#> 9 1 1 5 1 250 Case Hispanic (60,100] -1.96
+#> 10 1 1 6 1 191 Case Hispanic (60,100] 0.282
+#> # ℹ 566 more rows
+#> # ℹ 1 more variable: dummy_var2 <chr>
+
+scoring_f <- list(
+ fc0 = function(samples) rnorm(1) + 2 * rexp(1),
+ fc1 = function(samples) rnorm(1, 100),
+ fc2 = function(samples) -7
+)
+
+bc$score(scoring_f)
+#> fc0 fc1 fc2
+#> 3.64428 100.78084 -7.00000
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
+#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
+#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
+#> ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
+#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
+#> ✔ purrr 1.0.2
+#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ✖ dplyr::filter() masks stats::filter()
+#> ✖ dplyr::lag() masks stats::lag()
+#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
+Distributing samples to multi-well plates for experimental procedures is a
+very common task. In the following we use a data set of longitudinal
+subject samples that are to be spread across several n-well plates,
+balanced for treatment group and time point, provided in
+longitudinal_subject_samples
.
+data("longitudinal_subject_samples")
+head(longitudinal_subject_samples)
+#> # A tibble: 6 × 9
+#> SampleID SampleType SubjectID Group Week Sex Age BMI SamplesPerSubject
+#> <chr> <fct> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
+#> 1 P01W1 Sample P01 1 1 F 71 28.1 7
+#> 2 P01W4 Sample P01 1 4 F 71 28.1 7
+#> 3 P01W6 Sample P01 1 6 F 71 28.1 7
+#> 4 P01W10 Sample P01 1 10 F 71 28.1 7
+#> 5 P01W14 Sample P01 1 14 F 71 28.1 7
+#> 6 P01W18 Sample P01 1 18 F 71 28.1 7
In the first example we’ll use a subset of samples to demonstrate the process.
+For placing samples on plates and optimizing the across-plate
+distribution of factors as well as the within-plate spatial
+distribution, the optimize_multi_plate_design()
 function can be
+used. It focuses first on assigning samples to plates and then optimizes
+the layout within plates.
To place all 66 samples on 24-well plates we create a
+batch_container
with 3 plates.
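As a quick sanity check of the capacity (a sketch, using the numbers stated above):

```r
# 66 samples on three 24-well plates.
n_samples <- 66
n_wells <- 3 * 24   # 72 available locations
n_wells - n_samples # 6 locations will stay empty
#> [1] 6
```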
We do the initial assignment of samples to plates and plot it
+
+set.seed(42)
+
+bc <- BatchContainer$new(
+ dimensions = list("plate" = 3, "row" = 4, "col" = 6),
+)
+
+bc <- assign_in_order(bc, dat)
+
+head(bc$get_samples()) %>% gt::gt()
plate | +row | +col | +SampleID | +SubjectID | +Group | +Sex | +Week | +
---|---|---|---|---|---|---|---|
1 | +1 | +1 | +P01W1 | +P01 | +1 | +F | +1 | +
1 | +1 | +2 | +P01W4 | +P01 | +1 | +F | +4 | +
1 | +1 | +3 | +P02W1 | +P02 | +1 | +M | +1 | +
1 | +1 | +4 | +P02W4 | +P02 | +1 | +M | +4 | +
1 | +1 | +5 | +P03W1 | +P03 | +1 | +M | +1 | +
1 | +1 | +6 | +P03W4 | +P03 | +1 | +M | +4 | +
We can view the initial assignment with plot_plate
+cowplot::plot_grid(
+ plotlist = list(
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Initial layout by Group"
+ ),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Sex,
+ title = "Initial layout by Sex"
+ )
+ ),
+ nrow = 2
+)
For optimization optimize_multi_plate_design()
+iteratively calls optimize_design()
for different steps of
+the experiment. For across-plate optimization, OSAT scoring is used; for
+within-plate optimization, spatial scoring is used. The order of the
+factors indicates their relative importance. In this case we prioritize
+Group over Sex.
+bc <- optimize_multi_plate_design(bc,
+ across_plates_variables = c("Group", "Sex"),
+ within_plate_variables = c("Group"),
+ plate = "plate",
+ row = "row",
+ column = "col",
+ n_shuffle = 2,
+ max_iter = 700,
+ quiet = TRUE
+)
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+cowplot::plot_grid(
+ plotlist = list(
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Initial layout by Group"
+ ),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Sex,
+ title = "Initial layout by Sex"
+ )
+ ),
+ nrow = 2
+)
We can look at the trace objects for each internal
+optimize_design
run, returned from the wrapper
+function.
+bc$scores_table() |>
+ ggplot(aes(step, value, color = score)) +
+ geom_line() +
+ geom_point() +
+ facet_wrap(~ optimization_index, scales = "free_y")
+#> Warning: Removed 6309 rows containing missing values or values outside the scale range
+#> (`geom_line()`).
+#> Warning: Removed 6309 rows containing missing values or values outside the scale range
+#> (`geom_point()`).
Note that internally the wrapper function sets up plate-specific scoring functions, which could be set up manually in the following way.
+
+scoring_f <- c(
+ Group = mk_plate_scoring_functions(bc,
+ plate = "plate", row = "row", column = "col",
+ group = "Group", penalize_lines = "hard"
+ ),
+ Sex = mk_plate_scoring_functions(bc,
+ plate = "plate", row = "row", column = "col",
+ group = "Sex", penalize_lines = "hard"
+ )
+)
For more information on customized plate scoring see vignette
+Plate scoring examples
.
Sometimes layout requests can be more complicated. Assume we want to keep the two samples of a subject on the same 24-well plate.
+Now we need to customize the across-plate optimization more, so we split the process into two steps.
+There are 31 subjects with 2 time points each, i.e. we need ~11 subjects per plate, and we want to balance by treatment and sex.
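The arithmetic behind this estimate can be checked in a line or two (numbers from the text above):

```r
# 31 subjects, 2 samples each, distributed over 3 batches of 24-well plates.
ceiling(31 / 3) # ~11 subjects per batch
#> [1] 11
24 %/% 2        # at most 12 subjects (2 samples each) fit on one plate
#> [1] 12
```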
+First we create a batch container with 3 batches that each fit 11
+subjects, i.e. have 11 virtual locations
.
For layout scoring we use OSAT score on Group
and
+Sex
variables.
Then we assign the samples randomly to the batches and look at their +initial distribution.
+
+set.seed(17) # gives `bad` random assignment
+
+bc <- BatchContainer$new(
+ dimensions = list("batch" = 3, "location" = 11)
+)
+
+scoring_f <- list(
+ group = osat_score_generator(batch_vars = "batch", feature_vars = "Group"),
+ sex = osat_score_generator(batch_vars = "batch", feature_vars = "Sex")
+)
+
+bc <- assign_random(
+ bc,
+ dat %>% select(SubjectID, Group, Sex) %>% distinct()
+)
+
+bc$get_samples() %>%
+ head() %>%
+ gt::gt()
batch | +location | +SubjectID | +Group | +Sex | +
---|---|---|---|---|
1 | +1 | +NA | +NA | +NA | +
1 | +2 | +P32 | +5 | +M | +
1 | +3 | +P10 | +3 | +F | +
1 | +4 | +P26 | +3 | +M | +
1 | +5 | +P17 | +5 | +M | +
1 | +6 | +P07 | +2 | +F | +
+cowplot::plot_grid(
+ plotlist = list(
+ bc$get_samples() %>% ggplot(aes(x = batch, fill = Group)) +
+ geom_bar() +
+ labs(y = "subject count"),
+ bc$get_samples() %>% ggplot(aes(x = batch, fill = Sex)) +
+ geom_bar() +
+ labs(y = "subject count")
+ ),
+ nrow = 1
+)
Optimizing the layout with optimize_design()
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = 1,
+ acceptance_func = ~ accept_leftmost_improvement(..., tolerance = 0.01),
+ max_iter = 150,
+ quiet = TRUE
+)
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
+
+#> Warning in osat_score(bc, batch_vars = batch_vars, feature_vars = feature_vars,
+#> : NAs in features / batch columns; they will be excluded from scoring
After optimization the group and sex of samples are equally distributed across all batches. The rightmost panel shows the optimization trace of the scores.
+
+cowplot::plot_grid(
+ plotlist = list(
+ bc$get_samples() %>% ggplot(aes(x = batch, fill = Group)) +
+ geom_bar() +
+ labs(y = "subject count"),
+ bc$get_samples() %>% ggplot(aes(x = batch, fill = Sex)) +
+ geom_bar() +
+ labs(y = "subject count"),
+ bc$plot_trace(include_aggregated = TRUE)
+ ),
+ ncol = 3
+)
Using the result from step 1 we now optimize the layout within +plates. For this we still need to add empty wells to each batch and +assign the pre-allocated sample sheet in the right way to the new batch +container.
+
+dat <- dat %>%
+ left_join(bc$get_samples() %>%
+ select(SubjectID, batch))
+#> Joining with `by = join_by(SubjectID)`
+
+# add empty wells depending on how full the batch plate is
+dat <- dat %>%
+ bind_rows(data.frame(
+ SubjectID = "empty",
+ SampleID = paste("empty", 1:(3 * 24 - nrow(dat))),
+ batch = rep(1:3, 24 - (dat %>% count(batch) %>% .$n))
+ ))
+
+bc <- BatchContainer$new(
+ dimensions = list("plate" = 3, "row" = 4, "col" = 6)
+)
+
+# initial assignment such that the original plate assignment stays the same
+bc <- assign_in_order(
+ bc,
+ dat %>% arrange(batch)
+)
+
+
+cowplot::plot_grid(
+ plotlist = list(
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Initial layout by Group"
+ ),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Sex,
+ title = "Initial layout by Sex"
+ )
+ ),
+ nrow = 2
+)
As we have already assigned samples to plates, the across-plate optimization can be skipped in the wrapper. For distributing samples within each plate, we use the variables Group and Sex again. The order of the factors indicates their relative importance.
+
+bc <- optimize_multi_plate_design(bc,
+ within_plate_variables = c("Group", "Sex"),
+ plate = "plate",
+ row = "row",
+ column = "col",
+ n_shuffle = 2,
+ max_iter = 1000,
+ quiet = TRUE
+)
+cowplot::plot_grid(
+ plotlist = list(
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Final layout by Group"
+ ),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Sex,
+ title = "Final layout by Sex"
+ )
+ ),
+ nrow = 2
+)
+bc$scores_table() |>
+ ggplot(aes(step, value, color = score)) +
+ geom_line() +
+ geom_point() +
+ facet_wrap(~ optimization_index)
In the following we use the full data set of longitudinal subject
+samples that are to be spread across several n-well plates, balanced for
+treatment group and time point, provided in
+longitudinal_subject_samples
.
In addition to the normal samples there are also controls to be +placed on each plate. These are added after the sample batching +step.
+The following control samples are available for placement on the plates:
+
+longitudinal_subject_samples %>%
+ filter(SampleType != "Sample") %>%
+ count(SampleType, Group) %>%
+ gt::gt()
SampleType | +Group | +n | +
---|---|---|
Control | +Pool | +6 | +
Standard | +SpikeIn | +15 | +
Again we want to keep all samples of a subject on the same plate. A first step could be grouping subjects into 3 batches, blocking by treatment, sex and age. There are 34 subjects with 3 to 8 time points each, i.e. we need ~11 subjects per plate.
+We first create a ‘subjects’ dataset.
+
+# get subject data for batching
+subjects <- longitudinal_subject_samples %>%
+ filter(SampleType == "Sample") %>%
+ count(SubjectID, Group, Sex, Age, name = "nTimePoints") %>%
+ distinct()
+
+subjects %>%
+ select(-nTimePoints) %>%
+ slice(1:5) %>%
+ gt::gt() %>%
+ gt::tab_options()
SubjectID | +Group | +Sex | +Age | +
---|---|---|---|
P01 | +1 | +F | +71 | +
P02 | +1 | +M | +74 | +
P03 | +1 | +M | +76 | +
P04 | +1 | +F | +83 | +
P05 | +2 | +M | +79 | +
Then we create a batch container for the samples with 3 batches
+called plate
that each fit 11 subjects, i.e. have 11 virtual
+locations
.
For layout scoring we use OSAT score on Group
and
+Sex
variables. We initially assign the samples randomly to
+the batches and check the layout.
+set.seed(42)
+
+bc <- BatchContainer$new(
+ dimensions = list("plate" = 3, "locations" = 11)
+)
+
+scoring_f <- list(
+ group = osat_score_generator(batch_vars = "plate", feature_vars = c("Group")),
+ sex = osat_score_generator(batch_vars = "plate", feature_vars = "Sex")
+)
+
+bc <- assign_random(bc, subjects)
+cowplot::plot_grid(
+ plotlist = list(
+ bc$get_samples() %>% ggplot(aes(x = plate, fill = Group)) +
+ geom_bar() +
+ labs(y = "subject count"),
+ bc$get_samples() %>% ggplot(aes(x = plate, fill = Sex)) +
+ geom_bar() +
+ labs(y = "subject count"),
+ bc$get_samples() %>% ggplot(aes(x = factor(plate), y = Age)) +
+ geom_boxplot() +
+ geom_point()
+ ),
+ nrow = 1
+)
Optimizing the layout with optimize_design()
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ n_shuffle = 1,
+ acceptance_func = ~ accept_leftmost_improvement(..., tolerance = 0.1),
+ max_iter = 150,
+ quiet = TRUE
+)
After optimization the group and sex of samples are equally +distributed across all plates. The lower right panel shows the +optimization trace of the scores.
+
+cowplot::plot_grid(
+ plotlist = list(
+ bc$get_samples() %>% ggplot(aes(x = plate, fill = Group)) +
+ geom_bar() +
+ labs(y = "subject count"),
+ bc$get_samples() %>% ggplot(aes(x = plate, fill = Sex)) +
+ geom_bar() +
+ labs(y = "subject count"),
+ bc$get_samples() %>% ggplot(aes(x = factor(plate), y = Age)) +
+ geom_boxplot() +
+ geom_point(),
+ bc$plot_trace(include_aggregated = TRUE)
+ ),
+ nrow = 2
+)
We start here by creating the batch container for all samples and making an initial assignment. Note there will be empty positions on the plates, which we have to add before we assign the samples to the batch container in order.
+
+samples_with_plate <- longitudinal_subject_samples %>%
+ left_join(bc$get_samples() %>%
+ select(-locations)) %>%
+ mutate(plate = ifelse(SampleType == "Sample", plate, str_extract(SampleID, ".$")))
+#> Joining with `by = join_by(SubjectID, Group, Sex, Age)`
+
+# not all plates have the same number of samples
+samples_with_plate %>% count(plate)
+#> # A tibble: 3 × 2
+#> plate n
+#> <chr> <int>
+#> 1 1 74
+#> 2 2 80
+#> 3 3 76
+
+# add empty wells depending on how full the batch plate is
+# column 11 and 12 are left empty: 96 - 16 = 80 samples per plate
+samples_with_plate <- samples_with_plate %>%
+ bind_rows(data.frame(
+ SubjectID = "empty",
+ SampleID = paste("empty", 1:(3 * 80 - nrow(samples_with_plate))),
+ plate = rep(1:3, 80 - (samples_with_plate %>% count(plate) %>% .$n)) %>%
+ as.character()
+ ))
+
+
+# new batch container for step 2
+bc <- BatchContainer$new(
+ dimensions = list(plate = 3, row = 8, col = 12),
+ exclude = crossing(plate = 1:3, row = 1:8, col = 11:12)
+)
+
+# assign samples in order of plate
+bc <- assign_in_order(
+ bc,
+ samples_with_plate %>%
+ arrange(plate) %>%
+ rename(orig_plate = plate)
+)
+
+# check if plate assignment is still correct
+bc$get_samples() %>%
+ summarize(all(plate == orig_plate)) %>%
+ unlist()
+#> all(plate == orig_plate)
+#> TRUE
+cowplot::plot_grid(
+ plotlist = list(
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Initial layout by Group"
+ ),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = SubjectID,
+ title = "Initial layout by SubjectID"
+ ) +
+ theme(legend.key.size = unit(.25, "cm")),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Sex,
+ title = "Initial layout by Sex"
+ )
+ ),
+ nrow = 3
+)
As we have already assigned samples to plates, the across-plate optimization can be skipped in the wrapper. For distributing samples within each plate, we use the variables Group, SubjectID and Sex. The order of the factors indicates their relative importance.
+
+bc <- optimize_multi_plate_design(bc,
+ within_plate_variables = c("Group", "SubjectID", "Sex"),
+ plate = "plate",
+ row = "row",
+ column = "col",
+ n_shuffle = 2,
+ max_iter = 1000,
+ quiet = TRUE
+)
+cowplot::plot_grid(
+ plotlist = list(
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Final layout by Group"
+ ),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color =
+ SubjectID, title = "Final layout by SubjectID"
+ ) +
+ theme(legend.key.size = unit(.25, "cm")),
+ plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Sex,
+ title = "Final layout by Sex"
+ )
+ ),
+ nrow = 3
+)
+bc$scores_table() |>
+ ggplot(aes(step, value, color = score)) +
+ geom_line() +
+ geom_point() +
+ facet_wrap(~ optimization_index)
vignettes/plate_scoring_examples.Rmd
+ plate_scoring_examples.Rmd
library(designit)
+library(ggplot2)
+library(dplyr)
+#>
+#> Attaching package: 'dplyr'
+#> The following objects are masked from 'package:stats':
+#>
+#> filter, lag
+#> The following objects are masked from 'package:base':
+#>
+#> intersect, setdiff, setequal, union
+library(tidyr)
++(a Latin square should give the best score) +
++First using a combination of two OSAT scores (for row and column). +
++This usually produces a Latin square when using the squared L2 norm (L2s) for aggregation of the two scores. +
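The squared L2 norm aggregation can be sketched in one line; the helper below (a hypothetical stand-in for the package's L2s_norm) reproduces the aggregated values reported in the optimization trace, e.g. the score vector c(48, 0) aggregates to 2304:

```r
# Hedged sketch: the L2s aggregation is the sum of squared component scores.
l2s <- function(scores) sum(scores^2)
l2s(c(48, 0)) # 48^2 + 0^2
#> [1] 2304
l2s(c(36, 4)) # 36^2 + 4^2
#> [1] 1312
```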
+# Setting up the batch container
+example1 <- BatchContainer$new(
+ dimensions = c(
+ plate = 1,
+ row = 4, col = 4
+ )
+)
+
+# Add samples to container
+# Need unique Sample ID. Can we drop this constraint?
+example1 <- assign_in_order(example1,
+ samples = tibble::tibble(
+ Group = rep(c("Grp 1", "Grp 2", "Grp 3", "Grp 4"), each = 4),
+ ID = 1:16
+ )
+)
+
+# The following does not work (and gives a constant score of 144!)
+# scoring_f <- osat_score_generator(batch_vars = c("row","col"), feature_vars = c("Group"))
+# A first analysis of the problem indicates that osat_score generates a full row*col vector of 'ideal scores'
+# which are in fact all the same value, implying an identical overall result: each position can be
+# occupied by either 1 sample or 0 samples, the sum of 1's being the sample count.
+# --> don't use osat_score if there's a lack of samples as compared to possible positioning
+
+# Set scoring function
+scoring_f <- list(
+ Row.Score = osat_score_generator(batch_vars = c("row"), feature_vars = c("Group")),
+ Column.Score = osat_score_generator(batch_vars = c("col"), feature_vars = c("Group"))
+)
+set.seed(41)
+
+bc <- optimize_design(
+ example1,
+ scoring = scoring_f,
+ max_iter = 300, # this is set to shorten vignette run-time based on known random seed, normally we don't know.
+ n_shuffle = 2,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 1000, alpha = 0.5)),
+ aggregate_scores_func = L2s_norm
+)
+#> Checking variances of 2-dim. score vector.
+#> ... (11.385, 16.687) - OK
+#> Initial score: c(48, 0)
+#> Aggregated: 2304
+#> Achieved score: c(36, 4) at iteration 1
+#> Aggregated: 1312
+#> Achieved score: c(24, 4) at iteration 2
+#> Aggregated: 592
+#> Achieved score: c(24, 12) at iteration 3
+#> Aggregated: 720
+#> Achieved score: c(24, 8) at iteration 4
+#> Aggregated: 640
+#> Achieved score: c(16, 10) at iteration 5
+#> Aggregated: 356
+#> Achieved score: c(12, 10) at iteration 6
+#> Aggregated: 244
+#> Achieved score: c(10, 10) at iteration 7
+#> Aggregated: 200
+#> Achieved score: c(10, 6) at iteration 9
+#> Aggregated: 136
+#> Achieved score: c(10, 4) at iteration 10
+#> Aggregated: 116
+#> Achieved score: c(8, 10) at iteration 11
+#> Aggregated: 164
+#> Achieved score: c(8, 8) at iteration 12
+#> Aggregated: 128
+#> Achieved score: c(8, 8) at iteration 14
+#> Aggregated: 128
+#> Achieved score: c(4, 10) at iteration 19
+#> Aggregated: 116
+#> Achieved score: c(4, 6) at iteration 20
+#> Aggregated: 52
+#> Achieved score: c(4, 4) at iteration 22
+#> Aggregated: 32
+#> Achieved score: c(4, 4) at iteration 27
+#> Aggregated: 32
+#> Achieved score: c(4, 4) at iteration 36
+#> Aggregated: 32
+#> Achieved score: c(4, 4) at iteration 38
+#> Aggregated: 32
+#> Achieved score: c(0, 4) at iteration 39
+#> Aggregated: 16
+#> Achieved score: c(0, 4) at iteration 50
+#> Aggregated: 16
+#> Achieved score: c(0, 4) at iteration 54
+#> Aggregated: 16
+#> Achieved score: c(0, 4) at iteration 82
+#> Aggregated: 16
+#> Achieved score: c(0, 4) at iteration 87
+#> Aggregated: 16
+#> Achieved score: c(0, 4) at iteration 119
+#> Aggregated: 16
+#> Achieved score: c(0, 4) at iteration 125
+#> Aggregated: 16
+#> Achieved score: c(0, 4) at iteration 200
+#> Aggregated: 16
+#> Achieved score: c(4, 0) at iteration 215
+#> Aggregated: 16
+#> Achieved score: c(0, 0) at iteration 284
+#> Aggregated: 0
+bc$trace$elapsed
+#> Time difference of 5.483629 secs
+
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex1: Using OSAT scores for plate design\n(not the recommended way!)"
+)
++
++Now using a dedicated scoring for the group distances on a plate. +
++This should reliably lead to a nice symmetry-bearing latin square design +with only a one-dimensional score to look at. +
+scoring_f <- mk_plate_scoring_functions(bc, row = "row", column = "col", group = "Group")
+set.seed(42)
+bc <- optimize_design(
+ example1,
+ scoring = scoring_f,
+  max_iter = 1000, # kept small to shorten vignette run-time (sufficient for this random seed); normally the required number of iterations is not known in advance.
+ n_shuffle = 2,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 1000, alpha = 0.5)),
+ quiet = TRUE
+)
+bc$trace$elapsed
+#> Time difference of 6.182886 secs
+
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex1: Using a dedicated plate scoring function:\nThis should show a latin square!"
+)
++
+library(designit)
+library(ggplot2)
+library(dplyr)
+#>
+#> Attaching package: 'dplyr'
+#> The following objects are masked from 'package:stats':
+#>
+#> filter, lag
+#> The following objects are masked from 'package:base':
+#>
+#> intersect, setdiff, setequal, union
+library(tidyr)
++(latin square for each plate should give the best score) +
+In one go, we set up 2 plate scoring functions, each acting locally
+on a specific plate, plus one osat score to guarantee uniform
+distribution of groups across plates.
+
++The initial sample allocation (by assign_in_order) leads to a poor +starting point since each plate has only 2 of the 4 groups represented. +
++This is not a problem as long as we make sure that initial permutations +are likely to remedy the situation. That’s why we ensure 10 pairwise +sample swaps for the first iterations. +
+# Setting up the batch container
+example2 <- BatchContainer$new(
+ dimensions = c(
+ plate = 2,
+ row = 4, col = 4
+ )
+)
+
+# Add samples to container
+example2 <- assign_in_order(example2, samples = tibble::tibble(
+ Group = c(rep(c("Grp 1", "Grp 2", "Grp 3", "Grp 4"), each = 8)),
+ ID = 1:32
+))
+
+scoring_f <- c(mk_plate_scoring_functions(example2, plate = "plate", row = "row", column = "col", group = "Group"),
+ osat_plate = osat_score_generator(batch_vars = c("plate"), feature_vars = c("Group"))
+)
+
+plot_plate(example2,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex2: Initial sample arrangement"
+)
++
+
+example2$score(scoring_f)
+#> Plate 1 Plate 2 osat_plate
+#> 23.89265 23.89265 128.00000
+set.seed(41)
+bc <- optimize_design(
+ example2,
+ scoring = scoring_f,
+ n_shuffle = c(rep(10, 10), rep(3, 90), rep(2, 100), rep(1, 1400)),
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 10000, alpha = 0.5)),
+ aggregate_scores_func = worst_score,
+ quiet = TRUE
+)
+bc$trace$elapsed
+#> Time difference of 16.73602 secs
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2 osat_plate
+#> 6.127258 6.094080 0.000000
+
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex2: Design created by swapping samples 'globally' across the plates"
+)
++
+While this ‘global’ optimization is possible, it will probably not
+converge to a (near-)ideal solution in acceptable time when more
+samples are involved, since much of the sample swapping happens
+unproductively across the plates.
+
++One way to address this: we may split the optimization into two cycles, +first assigning samples to plates (balancing groups), then improving the +positions of the samples within each plate. This motivates the use of a +dedicated sample permutation function which takes the plate structure +into account and only shuffles samples around within one plate. +
+scoring_f <- osat_score_generator(batch_vars = c("plate"), feature_vars = c("Group"))
+
+set.seed(42)
+bc <- optimize_design(
+ example2,
+ scoring = scoring_f,
+ quiet = TRUE,
+  max_iter = 200, # kept small to shorten vignette run-time; normally the required number of iterations is not known in advance.
+ n_shuffle = 2,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 10000, alpha = 0.5)),
+)
+bc$trace$elapsed
+#> Time difference of 2.140794 secs
+
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex2: 'Plate wise' design\nStep 1: after allocating samples to plates"
+)
++
+
+scoring_f <- mk_plate_scoring_functions(bc, plate = "plate", row = "row", column = "col", group = "Group")
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2
+#> 12.77527 13.63704
+set.seed(42)
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ max_iter = 400,
+ shuffle_proposal_func = mk_subgroup_shuffling_function(subgroup_vars = c("plate")),
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 1000, alpha = 0.5)),
+ aggregate_scores_func = L2s_norm,
+ quiet = TRUE
+)
+bc$trace$elapsed
+#> Time differences in secs
+#> [1] 2.140794 3.270934
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2
+#> 6.854748 6.309297
+
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex2: 'Plate wise' design\nStep 2: after arranging samples within plates"
+)
++
++In this case, the shuffling function exchanges 1 pair of sample +assignments every time (the default). However, any number of constant +swaps or a swapping protocol (formally a vector of integers) can be +supplied as well. +
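For illustration, a constant swap count or a full swapping protocol could be supplied like this (a sketch only, not run here; the concrete numbers are arbitrary):

```r
library(designit)

# Constant: always exchange 3 pairs of assignments within a plate per iteration
mk_subgroup_shuffling_function(subgroup_vars = c("plate"), n_swaps = 3)

# Protocol (vector of integers): 5 pairs for the first 100 iterations,
# then 2 pairs for 200 iterations, then single swaps
mk_subgroup_shuffling_function(
  subgroup_vars = c("plate"),
  n_swaps = c(rep(5, 100), rep(2, 200), rep(1, 700))
)
```

Example 3 below uses exactly such a protocol to start with coarse moves and refine later.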
++Now for the most efficient solution: we start again by first assigning +samples to plates (balancing groups), then making use of the +independence of the two within-plate optimizations and improving them +one after the other. +
+This is possible by passing the restrain_on_subgroup_levels argument to
+the function that generates the permutations. It enforces that shuffling
+first happens only within plate 1, then only within plate 2, so that the
+two scores can be optimized in successive runs.
+
+scoring_f <- osat_score_generator(batch_vars = c("plate"), feature_vars = c("Group"))
+set.seed(42)
+bc <- optimize_design(
+ example2,
+ scoring = scoring_f,
+ quiet = TRUE,
+  max_iter = 150, # kept small to shorten vignette run-time; normally the required number of iterations is not known in advance.
+ n_shuffle = 2,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 10000, alpha = 0.5)),
+)
+bc$trace$elapsed
+#> Time difference of 1.791624 secs
+
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex2: 'Serial plate' design\nStep 1: after allocating samples to plates"
+)
++
+
+scoring_f <- mk_plate_scoring_functions(bc, plate = "plate", row = "row", column = "col", group = "Group")
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2
+#> 10.57482 26.16613
+set.seed(42)
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ max_iter = 150,
+ quiet = TRUE,
+ shuffle_proposal_func = mk_subgroup_shuffling_function(subgroup_vars = c("plate"), restrain_on_subgroup_levels = c(1)),
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 1000, alpha = 0.5)),
+ aggregate_scores_func = L2s_norm
+)
+bc$trace$elapsed
+#> Time differences in secs
+#> [1] 1.791624 1.591002
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2
+#> 6.416193 26.166134
+set.seed(42)
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ max_iter = 550,
+ quiet = TRUE,
+ shuffle_proposal_func = mk_subgroup_shuffling_function(subgroup_vars = c("plate"), restrain_on_subgroup_levels = c(2)),
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 1000, alpha = 0.5)),
+ aggregate_scores_func = L2s_norm
+)
+bc$trace$elapsed
+#> Time differences in secs
+#> [1] 1.791624 1.591002 4.226352
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2
+#> 6.416193 6.581966
+
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex2: 'Serial plate' design\nStep 2: after optimizing each plate in turn"
+)
++
+library(designit)
+library(ggplot2)
+library(dplyr)
+#>
+#> Attaching package: 'dplyr'
+#> The following objects are masked from 'package:stats':
+#>
+#> filter, lag
+#> The following objects are masked from 'package:base':
+#>
+#> intersect, setdiff, setequal, union
+library(tidyr)
+We simulate one ordinary 96-well plate and two smaller ones with sizes
+6x8 and 4x6, respectively. There are also 3 experimental groups, with
+sample sizes of 69, 30 and 69. This example demonstrates that an
+empirically determined normalization of the scores yields 3 comparable
+numerical values, independent of the different plate and group sizes.
+
+Again, a first optimization aims to achieve the same group allocation on
+each plate, while the second run takes care of the sample distribution
+on each plate.
+
++We use assign_random in this example to start from a more balanced +initial position as compared to example 2. +
+Aggregation of scores is done by the L2s method (the squared L2 norm).
+Because the scores have a comparable numerical range, the
+worst_score method could also be used for aggregation. However, the L2s
+method always takes all individual scores into account, providing more
+stability in case the 3 plate scores are not exactly normalized.
+
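The difference between the two aggregations can be illustrated with plain base R. These stand-ins are an assumption, not the package functions themselves, but the L2s one reproduces the aggregated values printed in the optimizer trace of example 1:

```r
# Squared L2 norm ("L2s"): sum of squared score components
l2s_aggregate <- function(scores) sum(scores^2)

# Worst-score aggregation: only the largest (worst) component counts
worst_aggregate <- function(scores) max(scores)

l2s_aggregate(c(48, 0))   # 2304 (cf. the first aggregated value in example 1)
l2s_aggregate(c(36, 4))   # 1312
worst_aggregate(c(36, 4)) # 36
```

L2s reacts to every component, while worst_score ignores all but the largest one, which is why L2s is more robust to imperfect normalization.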
+# Setting up the batch container
+
+example3 <- BatchContainer$new(
+ dimensions = c(
+ plate = 3,
+ row = 8, col = 12
+ ),
+ exclude = dplyr::bind_rows(
+ tidyr::crossing(plate = 2, row = 1:8, col = 1:12) %>% dplyr::filter(row > 6 | col > 8),
+ tidyr::crossing(plate = 3, row = 1:8, col = 1:12) %>% dplyr::filter(row > 4 | col > 6)
+ )
+)
+
+
+# Assign samples randomly to start from a better initial state
+example3 <- assign_random(example3,
+ samples = tibble::tibble(
+ Group = rep.int(c("Grp 1", "Grp 2", "Grp3"),
+ times = c(69, 30, 69)
+ ),
+ ID = 1:168
+ )
+)
+
+scoring_f <- osat_score_generator(batch_vars = c("plate"), feature_vars = c("Group"))
+set.seed(42)
+bc <- optimize_design(
+ example3,
+ scoring = scoring_f,
+ quiet = TRUE,
+ max_iter = 150,
+ n_shuffle = 2,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 1000, alpha = 0.5)),
+)
+bc$trace$elapsed
+#> Time difference of 2.03003 secs
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex3: Dealing with plates of different size\nStep 1: after distributing groups across plates"
+)
++
+scoring_f <- mk_plate_scoring_functions(bc,
+ plate = "plate", row = "row",
+ column = "col", group = "Group"
+)
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2 Plate 3
+#> 9.387071 10.302690 9.826243
+set.seed(42)
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ max_iter = 300,
+ shuffle_proposal_func = mk_subgroup_shuffling_function(
+ subgroup_vars = c("plate"),
+ n_swaps = c(rep(5, 500), rep(3, 1500), rep(2, 3000), rep(1, 5000))
+ ),
+ # acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 10000, alpha = 1)),
+ aggregate_scores_func = L2s_norm,
+ quiet = TRUE
+)
+bc$trace$elapsed
+#> Time differences in secs
+#> [1] 2.030030 4.778121
+
+bc$score(scoring_f)
+#> Plate 1 Plate 2 Plate 3
+#> 8.809205 8.553802 8.185525
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group,
+ title = "Ex3: Dealing with plates of different size\nStep 2: after swapping samples within plates"
+)
++
+library(designit)
+library(ggplot2)
+library(dplyr)
+#>
+#> Attaching package: 'dplyr'
+#> The following objects are masked from 'package:stats':
+#>
+#> filter, lag
+#> The following objects are masked from 'package:base':
+#>
+#> intersect, setdiff, setequal, union
+library(tidyr)
++In this example, we have 2 factors to distribute across one plate: +Treatment and (animal) sex. +
+To indicate that balancing the treatment is considered more important
+than balancing animal sex, we assign a custom aggregation function that
+gives more weight to the treatment variable. (A better aggregation
+mechanism remains to be implemented.)
+
+There can be fewer samples than available positions on the plate(s). In
+this case, we simulate 20 animal-derived samples distributed on a plate
+with 24 locations.
+
+# Setting up the batch container
+example4 <- BatchContainer$new(
+ dimensions = c(
+ plate = 1, row = 6, col = 4
+ )
+)
+
+
+# Assign samples in order; note that there are fewer samples (20) than plate locations (24)
+example4 <- assign_in_order(example4, samples = tibble::tibble(
+ Group = rep.int(c("Treatment 1", "Treatment 2"), times = c(10, 10)),
+ Sex = c(rep(c("M", "F", "F", "M"), times = 4), "M", NA, NA, "F"), ID = 1:20
+))
+
+cowplot::plot_grid(
+ plot_plate(example4, plate = plate, row = row, column = col, .color = Group, title = "Initial layout by Group"),
+ plot_plate(example4, plate = plate, row = row, column = col, .color = Sex, title = "Initial layout by Sex"),
+ ncol = 2
+)
++
+
+
+scoring_f <- c(
+ Group = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Group"),
+ Sex = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Sex")
+)
+
+example4$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 83.63858 239.20748
+set.seed(42)
+bc <- optimize_design(
+ example4,
+ scoring = scoring_f,
+ max_iter = 750,
+ n_shuffle = 1,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 10000, alpha = 1)),
+ aggregate_scores_func = function(scores, ...) {
+ 2 * scores["Group.Plate"] + scores["Sex.Plate"]
+ },
+ quiet = TRUE
+)
+bc$trace$elapsed
+#> Time difference of 6.251454 secs
+
+bc$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 8.019656 7.608810
+
+cowplot::plot_grid(
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Group, title = "Final layout by Group"),
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Sex, title = "Final layout by Sex"),
+ ncol = 2
+)
++
+We repeat the same example with auto-scaling, weighted scoring and
+simulated annealing (SA), to have a reference.
+
+set.seed(42)
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ max_iter = 500,
+ n_shuffle = 1,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 10000, alpha = 1)),
+ aggregate_scores_func = function(scores, ...) {
+ purrr::set_names(2 * scores["Group.Plate"] + scores["Sex.Plate"], nm = "Weighted.Score")
+ },
+  autoscale_scores = TRUE,
+ quiet = TRUE
+)
+#> ... Performing simple mean/stddev adjustment.
+bc$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 8.080860 7.458345
+
+cowplot::plot_grid(
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Group, title = "Final layout by Group"),
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Sex, title = "Final layout by Sex"),
+ ncol = 2
+)
++
+We now run the same example with auto-scaling and position-dependent
+scoring, not aggregating the score vector. This is more effective even
+when using the default acceptance function. In addition, we strictly
+prioritize the leftmost score to reflect its relevance for the design.
+
+scoring_f <- c(
+ Group = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Group"),
+ Sex = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Sex")
+)
+
+example4$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 83.63858 239.20748
+
+set.seed(42)
+bc <- optimize_design(
+ example4,
+ scoring = scoring_f,
+ max_iter = 5000,
+ n_shuffle = 1,
+ acceptance_func = accept_leftmost_improvement,
+ autoscale_scores = TRUE,
+ quiet = TRUE
+)
+#> ... Performing simple mean/stddev adjustment.
+bc$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 7.619846 7.473524
+
+cowplot::plot_grid(
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Group, title = "Final layout by Group"),
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Sex, title = "Final layout by Sex"),
+ ncol = 2
+)
++
++Using a tolerance value to accept slightly worse solutions in the +leftmost relevant score if overcompensated by other scores: +
+scoring_f <- c(
+ Group = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Group"),
+ Sex = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Sex")
+)
+
+
+set.seed(42)
+bc <- optimize_design(
+ example4,
+ scoring = scoring_f,
+ max_iter = 5000,
+ n_shuffle = 1,
+ acceptance_func = ~ accept_leftmost_improvement(..., tolerance = 0.1),
+ autoscale_scores = TRUE,
+ quiet = TRUE
+)
+#> ... Performing simple mean/stddev adjustment.
+
+bc$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 7.366667 7.323324
+cowplot::plot_grid(
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Group, title = "Final layout by Group"),
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Sex, title = "Final layout by Sex"),
+ ncol = 2
+)
++
+Testing an alternative left-to-right weighting of scores, based on
+exponential down-weighting of the respective score differences at
+position \(p\) with factor \(\kappa^p\), \(0 < \kappa
+< 1\). We choose a \(\kappa\) of 0.5, i.e. the second
+score’s improvement counts half as much as that of the first one.
+
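Assuming the acceptance function combines the (auto-scaled) score differences \(\Delta s_p\) as just described, the rule can be sketched (leaving aside the simulated-annealing randomness) as accepting a proposal whenever

```latex
\sum_{p} \kappa^{p} \, \Delta s_p < 0 , \qquad 0 < \kappa < 1 .
```

With \(\kappa = 0.5\), an improvement in the second score thus carries half the weight of the same improvement in the first.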
+scoring_f <- c(
+ Group = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Group"),
+ Sex = mk_plate_scoring_functions(example4, row = "row", column = "col", group = "Sex")
+)
+
+bc$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 7.366667 7.323324
+
+set.seed(42)
+bc <- optimize_design(
+ example4,
+ scoring = scoring_f,
+ max_iter = 1000,
+ n_shuffle = 1,
+  acceptance_func = mk_exponentially_weighted_acceptance_func(kappa = 0.5, simulated_annealing = TRUE),
+ autoscale_scores = TRUE,
+ quiet = TRUE
+)
+#> ... Performing simple mean/stddev adjustment.
+bc$score(scoring_f)
+#> Group.Plate Sex.Plate
+#> 7.630367 7.616179
+
+cowplot::plot_grid(
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Group, title = "Final layout by Group"),
+ plot_plate(bc, plate = plate, row = row, column = col, .color = Sex, title = "Final layout by Sex"),
+ ncol = 2
+)
++
+library(designit)
+library(ggplot2)
+library(dplyr)
+#>
+#> Attaching package: 'dplyr'
+#> The following objects are masked from 'package:stats':
+#>
+#> filter, lag
+#> The following objects are masked from 'package:base':
+#>
+#> intersect, setdiff, setequal, union
+library(tidyr)
++In some cases it may be essential to avoid samples of the same group +being put into the same row or column on a plate, i.e. these variables +are regarded as factor levels of their own, in addition to the spatial +relation of the samples. +
+Plate scoring functions can use a specific penalty for these ‘linearly
+arranged’ samples on top of the chosen distance metrics.
+
+# Setting up the batch container
+example5 <- BatchContainer$new(
+ dimensions = c(
+ plate = 1, row = 8, col = 12
+ )
+)
+
+# Assign samples randomly to start from lower score (avoid `Inf` values when doing the 'hard' penalization)
+example5 <- assign_random(example5, samples = tibble::tibble(
+ Group = rep.int(paste("Group", 1:5), times = c(8, 8, 8, 8, 64)),
+ ID = 1:96
+))
+
+penalize_lines <- "hard"
+
+scoring_f <- c(
+ Group = mk_plate_scoring_functions(example5, row = "row", column = "col", group = "Group", p = 2, penalize_lines = penalize_lines)
+)
+
+example5$score(scoring_f)
+#> Group.Plate
+#> 11.08608
+set.seed(42)
+bc <- optimize_design(
+ example5,
+ scoring = scoring_f,
+ max_iter = 5000,
+ n_shuffle = 1,
+ acceptance_func = mk_simanneal_acceptance_func(mk_simanneal_temp_func(T0 = 500, alpha = 0.1)),
+  quiet = TRUE
+)
+bc$trace$elapsed
+#> Time difference of 29.47413 secs
+
+bc$score(scoring_f)
+#> Group.Plate
+#> 8.785693
+
+plot_plate(bc, plate = plate, row = row, column = col, .color = Group, title = stringr::str_c("Line penalization: ", penalize_lines))
++
+vignettes/shuffling_with_constraints.Rmd
This example demonstrates that, by using customized shuffling
+functions, it is possible to restrict the design optimization to
+scoring only sample arrangements that fit the given constraints.
+The key idea is that every reshuffling produces a ‘valid’ sample
+permutation that does not violate those constraints, even if the
+suggested solution may be quite bad. During optimization, we pick the
+best design solution from the valid ones by appropriate scoring.
+The user is free to implement custom shuffling functions and pass
+those to the optimizer. However, some knowledge is required regarding
+the internal workings of the optimization and batch container setup.
+Therefore the package provides a small generic constructor for
+customized shufflings, shuffle_grouped_data, in which
+certain types of constraints relating to grouped samples can be
+specified via the passed parameters.
+We refer to a simplified version of the in vivo example,
+which is examined in more depth in a dedicated vignette.
+
+data("invivo_study_samples")
+
+invivo_study_samples <- dplyr::mutate(invivo_study_samples,
+ Litter_combine_females = ifelse(Sex == "F", "female_all", Litter)
+)
+str(invivo_study_samples)
+#> 'data.frame': 59 obs. of 9 variables:
+#> $ AnimalID : chr "F1" "F2" "F3" "F4" ...
+#> $ Strain : chr "Strain B" "Strain B" "Strain B" "Strain B" ...
+#> $ Sex : chr "F" "F" "F" "F" ...
+#> $ BirthDate : Date, format: "2021-05-24" "2021-03-01" ...
+#> $ Earmark : chr "R" "2L" "2L1R" "L" ...
+#> $ ArrivalWeight : num 19.4 26.5 20.8 22.1 22.9 ...
+#> $ Arrival.weight.Unit : chr "g" "g" "g" "g" ...
+#> $ Litter : chr "Litter 1" "Litter 2" "Litter 2" "Litter 2" ...
+#> $ Litter_combine_females: chr "female_all" "female_all" "female_all" "female_all" ...
+
+invivo_study_samples %>%
+ dplyr::count(Strain, Sex, Litter_combine_females) %>%
+ gt::gt()
Strain | +Sex | +Litter_combine_females | +n | +
---|---|---|---|
Strain A | +F | +female_all | +7 | +
Strain A | +M | +Litter 10 | +3 | +
Strain A | +M | +Litter 11 | +4 | +
Strain A | +M | +Litter 12 | +5 | +
Strain A | +M | +Litter 13 | +4 | +
Strain A | +M | +Litter 14 | +6 | +
Strain B | +F | +female_all | +7 | +
Strain B | +M | +Litter 1 | +3 | +
Strain B | +M | +Litter 3 | +5 | +
Strain B | +M | +Litter 4 | +4 | +
Strain B | +M | +Litter 5 | +4 | +
Strain B | +M | +Litter 6 | +4 | +
Strain B | +M | +Litter 7 | +3 | +
We will use the litter as a factor to form cages in our design.
+However, in order to indicate the compatibility of female animals (see
+in vivo study vignette), a pseudo-litter
+female_all
is created here to group all the females
+together, marking them as interchangeable for the subgroup (i.e. cage)
+allocation.
In the simplified setup we want to assign two treatments to those +animals, balancing for strain and sex as the primary suspected +confounders. The batch container is prepared as follows:
+
+treatments <- factor(rep(c("Treatment A", "Treatment B"), c(30, 29)))
+table(treatments)
+#> treatments
+#> Treatment A Treatment B
+#> 30 29
+
+bc <- BatchContainer$new(locations_table = data.frame(Treatment = treatments, Position = seq_along(treatments)))
+
+bc <- assign_in_order(bc, invivo_study_samples)
+
+scoring_f <- osat_score_generator(batch_vars = "Treatment", feature_vars = c("Strain", "Sex"))
+
+bc
+#> Batch container with 59 locations and 59 samples (assigned).
+#> Dimensions: Treatment, Position
As noted, we have to assign animals to cages in this example. The
+cage is thus acting as the grouping factor for the samples (animals) on
+which we may want to put further constraints. Concretely, cages should
+be homogeneous in strain and sex, hold 2 to 5 animals each, and ideally
+avoid repeated earmarks.
+We will tackle the usual factor balancing (using the osat score) and
+the additional constraints at the same time, combined in one
+conceptual framework.
+As stated, the main idea is to provide a customized shuffling +function that ensures that only ‘suitable’ design proposals are +generated and passed to the scoring function which will then identify a +good one.
+Also keep in mind that what is the cage here could be any subgroup +into which samples have to be partitioned.
+The wrapper shuffle_grouped_data
allows the construction of a
+shuffling function that satisfies all the constraints defined above at the
+same time. It can be passed to the optimizer together with other
+user-defined options such as the scoring or acceptance functions.
+shuffling function that satisfies all constraints defined above at the
+same time. It can be passed to the optimizer together with other user
+defined options such as the scoring or acceptance functions.
+bc2 <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ shuffle_proposal_func = shuffle_grouped_data(bc,
+ allocate_var = "Treatment",
+ keep_together_vars = c("Strain", "Sex"),
+ keep_separate_vars = c("Earmark"),
+ subgroup_var_name = "Cage",
+ n_min = 2, n_ideal = 3, n_max = 5,
+ strict = TRUE,
+ report_grouping_as_attribute = TRUE
+ ),
+ max_iter = 600
+)
+#>
+#> Formed 4 homogeneous groups using 59 samples.
+#> 22 subgroups needed to satisfy size constraints.
+#>
+#> Finding possible ways to allocate variable of interest with 2 levels ...
+#>
+#> Aborted after 1000010 recursive calls (maxCalls limit).
+#> 176364 allocations found.
+#> Usage of sample attributes --> executing first shuffling call.
+#> Checking variances of 1-dim. score vector.
+#> ... (495.302) - OK
+#> Initial score: 453.202
+#> Adding 4 attributes to samples.
+#> Achieved score: 355.575 at iteration 2
+#> Achieved score: 305.134 at iteration 19
+#> Achieved score: 272.592 at iteration 20
+#> Achieved score: 231.507 at iteration 34
+#> No permutations fulfilling the 'keep_separate' constraints in 1000 iters!
+#> Increasing number of tolerated violations to 1
+#> Achieved score: 181.473 at iteration 52
+#> Achieved score: 143.032 at iteration 116
+#> Achieved score: 122.49 at iteration 123
+#> Achieved score: 105.405 at iteration 165
+#> Achieved score: 104.999 at iteration 356
+#> Achieved score: 79.371 at iteration 384
+#> Achieved score: 52.931 at iteration 525
+#> Achieved score: 44.388 at iteration 546
+
+design <- bc2$get_samples()
allocate_var
is the batch container variable that should
+be primarily assigned to individual samples.
keep_together_vars
is a list of variables that must be
+homogeneous within a subgroup (here: cage).
keep_separate_vars
lists variables which should have
+different values within a subgroup (here: cage), if at all possible.
+This is a soft constraint and will be relaxed in a stepwise way until
+solutions can be found.
subgroup_var_name
allows us to give the generated subgroup
+variable a useful name.
n_min
, n_max
and n_ideal
+specify the minimal, maximal and ideal group sizes, respectively. It is
+often necessary to relax the strict
criterion to find any
+solution at all that satisfies those size criteria.
report_grouping_as_attribute
makes it possible, if TRUE, to add the
+updated grouping variable to the batch container at each iteration, so
+that scoring functions can make use of this variable (here: cage).
Following the output of the optimizer, we see that a solution was +identified that satisfies all constraints, with the exception of +tolerating one violation of earmark-uniqueness within a cage.
+The following cages (homogeneous in strain, sex and treatment) have +been generated in the process:
+ +Cage | +Strain | +Sex | +Treatment | +n | +
---|---|---|---|---|
1_1 | +Strain A | +F | +Treatment A | +3 | +
1_2 | +Strain A | +F | +Treatment A | +2 | +
1_3 | +Strain A | +F | +Treatment A | +2 | +
2_1 | +Strain A | +M | +Treatment A | +3 | +
2_2 | +Strain A | +M | +Treatment A | +3 | +
2_3 | +Strain A | +M | +Treatment A | +3 | +
2_4 | +Strain A | +M | +Treatment A | +3 | +
2_5 | +Strain A | +M | +Treatment B | +3 | +
2_6 | +Strain A | +M | +Treatment B | +3 | +
2_7 | +Strain A | +M | +Treatment B | +2 | +
2_8 | +Strain A | +M | +Treatment B | +2 | +
3_1 | +Strain B | +F | +Treatment B | +3 | +
3_2 | +Strain B | +F | +Treatment A | +2 | +
3_3 | +Strain B | +F | +Treatment B | +2 | +
4_1 | +Strain B | +M | +Treatment A | +3 | +
4_2 | +Strain B | +M | +Treatment A | +3 | +
4_3 | +Strain B | +M | +Treatment A | +3 | +
4_4 | +Strain B | +M | +Treatment B | +3 | +
4_5 | +Strain B | +M | +Treatment B | +3 | +
4_6 | +Strain B | +M | +Treatment B | +3 | +
4_7 | +Strain B | +M | +Treatment B | +3 | +
4_8 | +Strain B | +M | +Treatment B | +2 | +
shuffle_grouped_data
is a wrapper that consecutively
+calls other helper functions. As an addendum, let us break the whole
+procedure down into parts that show what is happening internally at each
+step.
We have to divide our animal cohort into subgroups with the same strain
+and sex, meeting size constraints as stated above. Since 2-5 animals
+should go into one cage, we specify n_min
and
+n_max
accordingly. n_ideal
would be selected by
+default as the mean of those two, but we specify it explicitly here,
+too.
The homogeneity of subgroups regarding strain and sex is achieved by
+listing those two parameters as keep_together_vars
.
Assignment of treatments should be performed as well at some point.
+We thus specify Treatment as the allocation variable
.
Note that the Treatment
variable is technically a batch
+container location and not a part of the sample list. This distinction
+does not matter at this point. However, all required variables must
+exist in the batch container object.
The following call to form_homogeneous_subgroups()
+produces an object that holds all relevant information about the
+samples, the allocation variable and the sizes of the subgroups that
+have to be formed. It is NOT decided, however, which animal will end up
+in which subgroup. This will be a matter of optimization later on.
+subg <- form_homogeneous_subgroups(
+ batch_container = bc, allocate_var = "Treatment",
+ keep_together_vars = c("Strain", "Sex", "Litter_combine_females"),
+ subgroup_var_name = "Cage",
+ n_min = 2, n_ideal = 3, n_max = 5
+)
+#>
+#> Formed 13 homogeneous groups using 59 samples.
+#> 18 subgroups needed to satisfy size constraints.
In this example, 18 subgroups have to be formed to meet all +constraints.
+It is possible to obtain more information from the returned list
+object. Inspection of element Subgroup_Sizes
tells us that
+13 ‘animal pools’ have to be formed which are homogeneous in the
+relevant parameters (here: strain and sex). Each of those groups happens
+to be split into subgroups of between 2 and 4 animals, which
+will later constitute the individual cages.
+subg$Subgroup_Sizes
+#> $`Strain A/F/female_all`
+#> [1] 3 4
+#>
+#> $`Strain A/M/Litter 10`
+#> [1] 3
+#>
+#> $`Strain A/M/Litter 11`
+#> [1] 4
+#>
+#> $`Strain A/M/Litter 12`
+#> [1] 3 2
+#>
+#> $`Strain A/M/Litter 13`
+#> [1] 4
+#>
+#> $`Strain A/M/Litter 14`
+#> [1] 3 3
+#>
+#> $`Strain B/F/female_all`
+#> [1] 3 4
+#>
+#> $`Strain B/M/Litter 1`
+#> [1] 3
+#>
+#> $`Strain B/M/Litter 3`
+#> [1] 3 2
+#>
+#> $`Strain B/M/Litter 4`
+#> [1] 4
+#>
+#> $`Strain B/M/Litter 5`
+#> [1] 4
+#>
+#> $`Strain B/M/Litter 6`
+#> [1] 4
+#>
+#> $`Strain B/M/Litter 7`
+#> [1] 3
Each subgroup of animals receives one specific treatment. More
+generally: subgroups have to be homogeneous regarding the allocation
+variable.
+This introduces another type of constraint, since numbers have to add
+up to 10 ‘Control’ and 10 ‘Compound’ cases, as given by the treatments
+variable. As a next step, we have to find all possible combinations of
+subgroups which produce valid options for the treatment allocation.
+That’s done with the next call.
This will find a large number of different ways to assign treatments
+to subgroups that lead to the correct overall number of treated
+animals.
+
+possible <- compile_possible_subgroup_allocation(subg)
+#>
+#> Finding possible ways to allocate variable of interest with 2 levels ...
+#>
+#> Finished with 108296 recursive calls.
+#> 14660 allocations found.
So far we only know the sizes of subgroups (i.e. cages). Thus, in a
+last step we have to assign specific animals to the various subgroups.
+Ideally each group of ‘equivalent animals’ (in terms of strain and sex)
+is split up into more than one subgroup, so there are many potential
+ways to assign animals to those.
+To allow optimization as usual, we want to generate a shuffling
+function that produces only valid solutions in terms of our
+constraints, so that optimization can iterate efficiently over this
+solution space. The function can be generated by calling
+shuffle_with_subgroup_formation() with the previously created
+subgrouping object and the list of possible treatment allocations.
+Every call to this shuffling function will return a permutation index
+(of the original samples) that constitutes a valid solution to be
+scored.
+The permutation function also constructs a ‘Cage’ variable (see
+parameter subgroup_var_name in the call to
+form_homogeneous_subgroups()). To make this variable available and join
+it to the samples in the batch container, use the flag
+report_grouping_as_attribute in the construction of the permutation
+function.
+shuffle_proposal <- shuffle_with_subgroup_formation(subg, possible, report_grouping_as_attribute = TRUE)
+
+shuffle_proposal()
+#> $location_assignment
+#> [1] 8 9 12 10 11 13 14 38 39 40 41 42 43 44 45 47 48 46 49 50 51 52 53 54 56
+#> [26] 59 1 4 5 7 55 57 58 2 3 6 35 36 37 19 26 27 20 21 15 16 17 18 22 23
+#> [51] 24 25 28 29 30 31 32 33 34
+#>
+#> $samples_attr
+#> # A tibble: 59 × 4
+#> alloc_var_level group subgroup Cage
+#> <chr> <int> <int> <chr>
+#> 1 Treatment A 7 2 7_2
+#> 2 Treatment B 7 1 7_1
+#> 3 Treatment B 7 1 7_1
+#> 4 Treatment A 7 2 7_2
+#> 5 Treatment A 7 2 7_2
+#> 6 Treatment B 7 1 7_1
+#> 7 Treatment A 7 2 7_2
+#> 8 Treatment A 1 1 1_1
+#> 9 Treatment A 1 1 1_1
+#> 10 Treatment A 1 2 1_2
+#> # ℹ 49 more rows
Calling the shuffle proposal function repeatedly produces a valid
+(constraint-aware) sample arrangement each time, with the grouping
+variable (here: Cage) reported alongside. (The optimizer will merge the
+‘Cage’ variable into the batch container after each iteration, so that
+it can be used for scoring as if it had been in the container from the
+beginning!)
+We can finally use the customized shuffling function in the
+optimization.
+
+bc3 <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ shuffle_proposal_func = shuffle_proposal,
+ max_iter = 300
+)
+#> Usage of sample attributes --> executing first shuffling call.
+#> Checking variances of 1-dim. score vector.
+#> ... (416.772) - OK
+#> Initial score: 289.541
+#> Adding 4 attributes to samples.
+#> Achieved score: 189.066 at iteration 9
+#> Achieved score: 139.439 at iteration 15
+#> Achieved score: 105.405 at iteration 34
+#> Achieved score: 52.931 at iteration 55
+#> Achieved score: 51.304 at iteration 70
+#> Achieved score: 32.863 at iteration 206
+
+design <- bc3$get_samples()
+
+# Obeying all constraints does not lead to a very balanced sample allocation:
+dplyr::count(design, Treatment, Strain) %>% gt::gt()
Treatment | +Strain | +n | +
---|---|---|
Treatment A | +Strain A | +17 | +
Treatment A | +Strain B | +13 | +
Treatment B | +Strain A | +12 | +
Treatment B | +Strain B | +17 | +
Treatment | +Sex | +n | +
---|---|---|
Treatment A | +F | +10 | +
Treatment A | +M | +20 | +
Treatment B | +F | +4 | +
Treatment B | +M | +25 | +
The goal of designit is to generate optimal sample allocations for experimental designs.
+You can install the development version from GitHub with:
+
+# install.packages("devtools")
+devtools::install_github("BEDApub/designit")
The main class used is BatchContainer, which holds the dimensions for
+sample allocation. After creating such a container, a list of samples
+can be allocated in it using a given assignment function.
+library(tidyverse)
+library(designit)
+
+data("longitudinal_subject_samples")
+
+# we use a subset of longitudinal_subject_samples data
+subject_data <- longitudinal_subject_samples %>%
+ filter(Group %in% 1:5, Week %in% c(1,4)) %>%
+ select(SampleID, SubjectID, Group, Sex, Week) %>%
+ # with two observations per patient
+ group_by(SubjectID) %>%
+ filter(n() == 2) %>%
+ ungroup() %>%
+ select(SubjectID, Group, Sex) %>%
+ distinct()
+
+head(subject_data)
+#> # A tibble: 6 × 3
+#> SubjectID Group Sex
+#> <chr> <chr> <chr>
+#> 1 P01 1 F
+#> 2 P02 1 M
+#> 3 P03 1 M
+#> 4 P04 1 F
+#> 5 P19 1 M
+#> 6 P20 1 F
BatchContainer
and assigning samples
+
+# a batch container with 3 batches and 11 locations per batch
+bc <- BatchContainer$new(
+ dimensions = list("batch" = 3, "location" = 11),
+)
+
+# assign samples randomly
+set.seed(17)
+bc <- assign_random(bc, subject_data)
+
+bc$get_samples() %>%
+ ggplot() +
+ aes(x = batch, fill = Group) +
+ geom_bar()
Random assignment of samples to batches produced an uneven distribution.
+
+# set scoring functions
+scoring_f <- list(
+ # first priority, groups are evenly distributed
+ group = osat_score_generator(batch_vars = "batch",
+ feature_vars = "Group"),
+ # second priority, sexes are evenly distributed
+ sex = osat_score_generator(batch_vars = "batch",
+ feature_vars = "Sex")
+)
+
+bc <- optimize_design(
+ bc, scoring = scoring_f, max_iter = 150, quiet = TRUE
+)
+
+bc$get_samples() %>%
+ ggplot() +
+ aes(x = batch, fill = Group) +
+ geom_bar()
+
+# show optimization trace
+bc$plot_trace()
See the vignette vignette("basic_examples").
The logo was inspired by DALL-E 2; the pipette icon is by gsagri04.
+NEWS.md
+ BatchContainer
stores locations table (dimensions & excluded)BatchContainer$new()
accepts locations tableBatchContainer
with samples can be created using batch_container_from_table
+bc$n_available
was removed (use bc$n_locations
instead)print(bc)
+optimize_design()
+Describes container dimensions and samples to container location assignment.
+A typical workflow starts with creating a BatchContainer. Then
+samples can be assigned to locations in that container.
trace
Optimization trace, a tibble::tibble()
scoring_f
Scoring functions used for optimization. This is a list of functions;
+upon assignment, a single function will be automatically converted to a
+list. Each scoring function should receive a BatchContainer and return
+a floating-point score value for the assignment. Each function in the
+list is called during scoring.
has_samples
Returns TRUE if BatchContainer
has samples.
has_samples_attr
Returns TRUE if BatchContainer
has sample attributes assigned.
n_locations
Returns number of locations in a BatchContainer
.
n_dimensions
Returns number of dimensions in a BatchContainer
.
+This field cannot be assigned.
dimension_names
character vector with dimension names. +This field cannot be assigned.
samples
Samples in the batch container.
+When assigning, the data.frame should not have a column named .sample_id.
samples_attr
Extra attributes of samples. If set, this is included into
+BatchContainer$get_samples()
output.
assignment
Sample assignment vector. Should contain NAs for empty locations.
+Assigning this field is deprecated, please use $move_samples()
instead.
new()
Create a new BatchContainer object.
BatchContainer$new(locations_table, dimensions, exclude = NULL)
locations_table
A table with available locations.
dimensions
A vector or list of dimensions. Every dimension
+should have a name. Could be an integer vector of dimensions or
+a named list. Every value of a list could be either dimension size
+or parameters for
+BatchContainerDimension$new().
+Can be used as an alternative to passing locations_table
.
exclude
data.frame with excluded locations of a container. +Only used together with dimensions.
bc <- BatchContainer$new(
+ dimensions = list(
+ "plate" = 3,
+ "row" = list(values = letters[1:3]),
+ "column" = list(values = c(1, 3))
+ ),
+ exclude = data.frame(plate = 1, row = "a", column = c(1, 3), stringsAsFactors = FALSE)
+)
+
+bc
get_samples()
Return table with samples and sample assignment.
BatchContainer$get_samples(
+ assignment = TRUE,
+ include_id = FALSE,
+ remove_empty_locations = FALSE,
+ as_tibble = TRUE
+)
assignment
Return sample assignment. If FALSE, only the
+samples table is returned, without batch assignment.
include_id
Keep .sample_id in the table. Use TRUE
for
+lower overhead.
remove_empty_locations
Removes empty locations +from the result tibble.
as_tibble
Return tibble
.
+If FALSE
returns data.table
. This should have
+lower overhead, as internally there is a cached data.table
.
get_locations()
Get a table with all the locations in a BatchContainer
.
A tibble
with all the available locations.
move_samples()
Move samples between locations
+This method can receive either src and dst, or location_assignment.
src
integer vector of source locations
dst
integer vector of destination locations (the same length as src
).
location_assignment
integer vector with location assignment.
+The length of the vector should match the number of locations;
+NA should be used for empty locations.
score()
Score current sample assignment.
+ + + + +print()
Prints information about BatchContainer
.
scores_table()
Return a table with scores from an optimization.
+ +index
optimization index, all by default
include_aggregated
include aggregated scores
a tibble::tibble()
with scores
plot_trace()
Plot trace
+ +index
optimization index, all by default
include_aggregated
include aggregated scores
...
not used.
a ggplot2::ggplot()
object
+List of scoring functions.
+Tibble with batch container locations.
+Tibble with sample information and sample ids.
+Sample attributes, a data.table.
+Vector with assignment of sample ids to locations.
+Cached data.table with samples assignment.
+Validate sample assignment.
+## ------------------------------------------------
+## Method `BatchContainer$new`
+## ------------------------------------------------
+
+bc <- BatchContainer$new(
+ dimensions = list(
+ "plate" = 3,
+ "row" = list(values = letters[1:3]),
+ "column" = list(values = c(1, 3))
+ ),
+ exclude = data.frame(plate = 1, row = "a", column = c(1, 3), stringsAsFactors = FALSE)
+)
+
+bc
+#> Batch container with 16 locations.
+#> Dimensions: plate, row, column
+
R/batch_container_dimension.R
+ BatchContainerDimension.Rd
R6 Class representing a batch container dimension.
+R6 Class representing a batch container dimension.
+name
dimension name.
values
vector of dimension values.
size
Returns size of a dimension.
short_info
Returns a string summarizing the dimension. +E.g., "mydim<size=10>".
new()
Create a new BatchContainerDimension object.
+This is usually used implicitly via BatchContainer$new()
.
BatchContainerDimension$new(name, size = NULL, values = NULL)
name
Dimension name, a character string. Required.
size
Dimension size. Setting this implies that dimension values are 1:size
.
values
Explicit list of dimension values. Could be numeric, character or factor.
+It is required to provide a dimension name and either size or values.
plate_dimension <- BatchContainerDimension$new("plate", size=3)
+row_dimension <- BatchContainerDimension$new("row", values = letters[1:3])
+column_dimension <- BatchContainerDimension$new("column", values = 1:3)
+
+bc <- BatchContainer$new(
+ dimensions = list(plate_dimension, row_dimension, column_dimension),
+ exclude = data.frame(plate = 1, row = "a", column = c(1, 3), stringsAsFactors = FALSE)
+)
+
+bc
+## ------------------------------------------------
+## Method `BatchContainerDimension$new`
+## ------------------------------------------------
+
+plate_dimension <- BatchContainerDimension$new("plate", size=3)
+row_dimension <- BatchContainerDimension$new("row", values = letters[1:3])
+column_dimension <- BatchContainerDimension$new("column", values = 1:3)
+
+bc <- BatchContainer$new(
+ dimensions = list(plate_dimension, row_dimension, column_dimension),
+ exclude = data.frame(plate = 1, row = "a", column = c(1, 3), stringsAsFactors = FALSE)
+)
+
+bc
+#> Batch container with 25 locations.
+#> Dimensions: plate, row, column
+
This function enables comparison of the results of two scoring functions by calculating +an L1 norm (Manhattan distance from origin).
+L1_norm(scores, ...)
A score or multiple component score vector
Parameters to be ignored by this aggregation function
The L1 norm as an aggregated score.
+L1_norm(c(2, 2))
+#> [1] 4
+
This function enables comparison of the results of two scoring functions by calculating +an L2 norm (euclidean distance from origin). Since this is only used for ranking solutions, +the squared L2 norm is returned.
+L2s_norm(scores, ...)
A score or multiple component score vector
Parameters to be ignored by this aggregation function
The squared L2 norm as an aggregated score.
+L2s_norm(c(2, 2))
+#> [1] 8
+
R/score_accept.R
+ accept_leftmost_improvement.Rd
Alternative acceptance function for multi-dimensional scores in which order (left to right, e.g. first to last) denotes relevance.
+accept_leftmost_improvement(current_score, best_score, ..., tolerance = 0)
One- or multi-dimensional score from the current optimizing iteration (double or vector of doubles)
Best one- or multi-dimensional score found so far (double or vector of doubles)
Ignored arguments that may be used by alternative acceptance functions
Tolerance value: When comparing score vectors from left to right, differences within +/- tolerance won't immediately
shortcut the comparison at this point, allowing improvement in a less important score to exhibit some influence
Boolean, TRUE if current score should be taken as the new optimal score, FALSE otherwise
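The left-to-right comparison with tolerance can be sketched in plain base R (a hypothetical illustration only; the helper name accept_leftmost is made up and this is not the package's implementation):

```r
# Hypothetical base-R sketch of the left-to-right acceptance rule described
# above (helper name is made up for illustration).
accept_leftmost <- function(current_score, best_score, tolerance = 0) {
  for (i in seq_along(current_score)) {
    d <- current_score[i] - best_score[i]
    if (d < -tolerance) return(TRUE)   # clear improvement at position i
    if (d > tolerance) return(FALSE)   # clear deterioration at position i
    # |d| <= tolerance: treat as a tie and inspect the next score
  }
  FALSE  # all positions tie: keep the current best
}

accept_leftmost(c(5, 1), c(5, 2))                     # tie, then improvement -> TRUE
accept_leftmost(c(5.1, 9), c(5, 1), tolerance = 0.2)  # tie within tolerance, next worse -> FALSE
```

With tolerance = 0 this degenerates to a plain lexicographic comparison; a positive tolerance lets a less important score break near-ties in the more important ones.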
+R/score_accept.R
+ accept_strict_improvement.Rd
Default acceptance function. Accept the current score if and only if all elements are less than or equal to
those in the best score and there is at least one improvement.
+accept_strict_improvement(current_score, best_score, ...)
One- or multi-dimensional score from the current optimizing iteration (double or vector of doubles)
Best one- or multi-dimensional score found so far (double or vector of doubles)
Ignored arguments that may be used by alternative acceptance functions
Boolean, TRUE if current score should be taken as the new optimal score, FALSE otherwise
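The strict-improvement rule described above amounts to a one-line condition, sketched here in base R (hypothetical helper name, for illustration only):

```r
# Hypothetical base-R sketch of the strict-improvement rule: accept iff
# no score component gets worse and at least one improves.
accept_strict <- function(current_score, best_score) {
  all(current_score <= best_score) && any(current_score < best_score)
}

accept_strict(c(1, 2), c(1, 3))  # no element worse, one better -> TRUE
accept_strict(c(1, 4), c(2, 3))  # trade-off between elements  -> FALSE
```

Note that a trade-off (one component better, another worse) is always rejected, which is what distinguishes this default from the leftmost-improvement variant.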
+This will convert factors to characters and disregard
row and column order
+all_equal_df(df1, df2)
first data.frame()
to compare
second data.frame()
to compare
TRUE
or FALSE
in case differences are present
Distributes samples based on a sample sheet.
+assign_from_table(batch_container, samples)
Instance of BatchContainer class
data.frame
with samples (a sample sheet). This data.frame
+(or tibble::tibble()) should contain samples together with their
+locations. No .sample_id column can be present in the sample sheet. If
+batch_container already has samples assigned, the function will check
+if samples in batch_container are identical to the ones in the samples
+argument.
Returns a new BatchContainer
.
bc <- BatchContainer$new(
+ dimensions = list(
+ plate = 2,
+ column = list(values = letters[1:3]),
+ row = 3
+ )
+)
+
+sample_sheet <- tibble::tribble(
+ ~plate, ~column, ~row, ~sampleID, ~group,
+ 1, "a", 1, 1, "TRT",
+ 1, "b", 2, 2, "CNTRL",
+ 2, "a", 1, 3, "TRT",
+ 2, "b", 2, 4, "CNTRL",
+ 2, "a", 3, 5, "TRT",
+)
+# assign samples from the sample sheet
+bc <- assign_from_table(bc, sample_sheet)
+
+bc$get_samples(remove_empty_locations = TRUE)
+#> # A tibble: 5 × 5
+#> plate column row sampleID group
+#> <int> <fct> <int> <dbl> <chr>
+#> 1 1 a 1 1 TRT
+#> 2 1 b 2 2 CNTRL
+#> 3 2 a 1 3 TRT
+#> 4 2 a 3 5 TRT
+#> 5 2 b 2 4 CNTRL
+
+
The first sample is assigned to the first location, the second
sample is assigned to the second location, etc.
+assign_in_order(batch_container, samples = NULL)
Instance of BatchContainer class
data.frame with samples.
Returns a new BatchContainer
.
samples <- data.frame(sampId = 1:3, sampName = letters[1:3])
+samples
+#> sampId sampName
+#> 1 1 a
+#> 2 2 b
+#> 3 3 c
+
+bc <- BatchContainer$new(dimensions = c("row" = 3, "column" = 2))
+bc
+#> Batch container with 6 locations.
+#> Dimensions: row, column
+
+set.seed(42)
+# assigns samples randomly
+bc <- assign_random(bc, samples)
+bc$get_samples()
+#> # A tibble: 6 × 4
+#> row column sampId sampName
+#> <int> <int> <int> <chr>
+#> 1 1 1 1 a
+#> 2 1 2 NA NA
+#> 3 2 1 NA NA
+#> 4 2 2 NA NA
+#> 5 3 1 2 b
+#> 6 3 2 3 c
+
+# assigns samples in order
+bc <- assign_in_order(bc)
+bc$get_samples()
+#> # A tibble: 6 × 4
+#> row column sampId sampName
+#> <int> <int> <int> <chr>
+#> 1 1 1 1 a
+#> 2 1 2 2 b
+#> 3 2 1 3 c
+#> 4 2 2 NA NA
+#> 5 3 1 NA NA
+#> 6 3 2 NA NA
+
R/assignment.R
+ assign_random.Rd
Assignment function which distributes samples randomly.
+assign_random(batch_container, samples = NULL)
Instance of BatchContainer class
data.frame with samples.
Returns a new BatchContainer
.
samples <- data.frame(sampId = 1:3, sampName = letters[1:3])
+samples
+#> sampId sampName
+#> 1 1 a
+#> 2 2 b
+#> 3 3 c
+
+bc <- BatchContainer$new(dimensions = c("row" = 3, "column" = 2))
+bc
+#> Batch container with 6 locations.
+#> Dimensions: row, column
+
+set.seed(42)
+# assigns samples randomly
+bc <- assign_random(bc, samples)
+bc$get_samples()
+#> # A tibble: 6 × 4
+#> row column sampId sampName
+#> <int> <int> <int> <chr>
+#> 1 1 1 1 a
+#> 2 1 2 NA NA
+#> 3 2 1 NA NA
+#> 4 2 2 NA NA
+#> 5 3 1 2 b
+#> 6 3 2 3 c
+
+# assigns samples in order
+bc <- assign_in_order(bc)
+bc$get_samples()
+#> # A tibble: 6 × 4
+#> row column sampId sampName
+#> <int> <int> <int> <chr>
+#> 1 1 1 1 a
+#> 2 1 2 2 b
+#> 3 2 1 3 c
+#> 4 2 2 NA NA
+#> 5 3 1 NA NA
+#> 6 3 2 NA NA
+
R/batch_container_from_table.R
+ batch_container_from_table.Rd
Creates a BatchContainer from a table
(data.frame/tibble::tibble) containing sample and location information.
+batch_container_from_table(tab, location_cols)
A table with location and sample information.
+Table rows with all NAs in sample information columns are treated as
+empty locations.
Names of columns containing information about locations.
A BatchContainer with assigned samples.
+tab <- data.frame(
+ row = rep(1:3, each = 3),
+ column = rep(1:3, 3),
+ sample_id = c(1, 2, 3, NA, 5, 6, 7, NA, 9)
+)
+bc <- batch_container_from_table(tab, location_cols = c("row", "column"))
+
R/permute_subgroups.R
+ compile_possible_subgroup_allocation.Rd
All information needed by this function (primarily the number and size of subgroups plus the levels of the
allocation variable) is contained in and extracted from the subgroup object.
+compile_possible_subgroup_allocation(
+ subgroup_object,
+ fullTree = FALSE,
+ maxCalls = 1e+06
+)
A subgrouping object as returned by form_homogeneous_subgroups()
Boolean: Enforce full search of the possibility tree, independent of the value of maxCalls
Maximum number of recursive calls in the search tree, to avoid long run times with very large trees
List of possible allocations; Each allocation is an integer vector of allocation levels that are assigned in that order to the subgroups with given sizes
+R/shuffle_samples.R
+ complete_random_shuffling.Rd
This function was added early on to test the ability of optimize_design() to accept a
permutation vector rather than a list with src and dst indices.
+complete_random_shuffling(batch_container, ...)
The batch-container.
Other params that are passed to a generic shuffling function (like the iteration number).
A random permutation of the sample assignment in the container.
+data("invivo_study_samples")
+bc <- BatchContainer$new(
+ dimensions = c("plate" = 2, "column" = 5, "row" = 6)
+)
+scoring_f <- osat_score_generator("plate", "Sex")
+bc <- optimize_design(
+ bc, scoring = scoring_f, invivo_study_samples,
+ max_iter = 100,
+ shuffle_proposal_func = complete_random_shuffling
+)
+#> Warning: NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 1-dim. score vector.
+#> ... (142.337) - OK
+#> Initial score: 182.5
+#> Achieved score: 42.5 at iteration 1
+#> Achieved score: 0.5 at iteration 2
+
Gini index on a factor vector of counts, using the Gini function
from the package ineq
+countGini(m)
a factor vector
the value of the Gini index
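As an illustration of what is computed, the Gini index on level counts can be sketched in base R (the package itself delegates to ineq::Gini; the helper name gini_on_counts is made up):

```r
# Hypothetical base-R sketch of a Gini index on the level counts of a
# factor: 0 for perfectly even counts, approaching 1 for extreme imbalance.
gini_on_counts <- function(f) {
  x <- as.numeric(table(f))  # counts per factor level
  n <- length(x)
  # mean absolute difference over all pairs, scaled by 2 * mean
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

gini_on_counts(factor(c("a", "a", "b", "b")))       # even counts -> 0
gini_on_counts(factor(c("a", "a", "a", "a", "b")))  # uneven counts -> 0.3
```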
+R/designit-package.R
+ designit-package.Rd
+Reducing batch effect by intelligently assigning samples to batches. Batch effects have a strong influence on data analysis, especially when the samples-to-batches assignment coincides with the contrast groups. By defining your batch container and a scoring function reflecting the contrasts, you can assign samples in such a way that the potential batch effect has minimal impact on the comparison of interest.
+Useful links:
R/utils-datatable.R
+ dot-datatable.aware.Rd
See data.table::let()
and
+vignette("datatable-importing", "data.table")
+section "data.table in Imports but nothing imported".
Drop highest order interactions
+drop_order(.terms, m = -1)
order of interaction (highest available if -1)
R/optimize.R
+ extract_shuffle_params.Rd
Any shuffling function should return one of the following:
atomic index vector for a direct location assignment
a list with src and dst vector
a list with locations vector (for location assignment) and optional sample_attr data frame/tibble
extract_shuffle_params(shuffle, attributes_expected)
Return value of a shuffle function
Logical; if TRUE, sample attributes are expected from the shuffling result and the
function fails if they are not provided.
A list with components src, dst, location_assignment and samples_attr, depending on the output +of the specific shuffling function
+This function parses the output, performs a few checks and returns results in a simple-to-use list.
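The normalization of the three accepted return shapes can be sketched in base R (a hypothetical illustration; the helper name and the assumed input element names location and sample_attr are not the package's actual code):

```r
# Hypothetical sketch: map the three shuffle return shapes onto one list
# with the components named above (src, dst, location_assignment, samples_attr).
normalize_shuffle <- function(shuffle) {
  if (is.atomic(shuffle)) {
    # shape 1: atomic index vector -> direct location assignment
    return(list(src = NULL, dst = NULL,
                location_assignment = shuffle, samples_attr = NULL))
  }
  list(
    src = shuffle$src,                       # shape 2: src/dst swap vectors
    dst = shuffle$dst,
    location_assignment = shuffle$location,  # shape 3: full assignment vector
    samples_attr = shuffle$sample_attr       # optional attributes table
  )
}

normalize_shuffle(c(2, 1, 3))$location_assignment
normalize_shuffle(list(src = 1L, dst = 2L))$src
```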
+R/permute_subgroups.R
+ find_possible_block_allocations.Rd
Internal function to generate possible subgroup combinations that add up to specific levels of an allocation variable
+find_possible_block_allocations(
+ block_sizes,
+ group_nums,
+ fullTree = FALSE,
+ maxCalls = 1e+06
+)
(Integer) vector of sizes of the various subgroups that can be combined to form groups
Vector of sizes of the different groups to be formed
Boolean: Enforce full search of the possibility tree, independent of the value of maxCalls
Maximum number of recursive calls in the search tree, to avoid long run times with very large trees
List of possible allocations; Each allocation is an integer vector of allocation levels that are assigned in that order to the subgroups with sizes given by block sizes
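The recursive search can be sketched in base R (a hypothetical helper for illustration; the real implementation additionally honors fullTree and maxCalls):

```r
# Base-R sketch: assign one allocation level to each subgroup so that the
# subgroup sizes add up exactly to the required group totals.
enumerate_allocations <- function(block_sizes, group_nums) {
  results <- list()
  recurse <- function(i, remaining, acc) {
    if (i > length(block_sizes)) {
      if (all(remaining == 0)) results[[length(results) + 1]] <<- acc
      return(invisible(NULL))
    }
    for (lev in seq_along(remaining)) {
      if (remaining[lev] >= block_sizes[i]) {  # prune impossible branches
        recurse(i + 1,
                replace(remaining, lev, remaining[lev] - block_sizes[i]),
                c(acc, lev))
      }
    }
  }
  recurse(1, group_nums, integer(0))
  results
}

# Subgroups of sizes 3, 3, 2, 2 split between two levels of 5 samples each:
length(enumerate_allocations(c(3, 3, 2, 2), c(5, 5)))  # 4 valid allocations
```

Each result is an integer vector of allocation levels in subgroup order, matching the return format described above.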
R/score_aggregation.R
+ first_score_only.Rd
This function enables comparison of the results of two scoring functions by just basing
the decision on the first element. This reflects the original behavior of the optimization
function, just evaluating the 'auxiliary' scores for the user's information.
+first_score_only(scores, ...)
A score or multiple component score vector
Parameters to be ignored by this aggregation function
The aggregated score, i.e. the first element of a multiple-component score vector.
+first_score_only(c(1, 2, 3))
+#> [1] 1
+
R/permute_subgroups.R
+ form_homogeneous_subgroups.Rd
Form groups and subgroups of 'homogeneous' samples as defined by certain variables and size constraints
+form_homogeneous_subgroups(
+ batch_container,
+ allocate_var,
+ keep_together_vars = c(),
+ n_min = NA,
+ n_max = NA,
+ n_ideal = NA,
+ subgroup_var_name = NULL,
+ prefer_big_groups = TRUE,
+ strict = TRUE
+)
Batch container with all samples assigned that are to be grouped and sub-grouped
Name of a variable in the samples
table to inform possible groupings, as (sub)group sizes must add up to the correct totals
Vector of column names in sample table; groups are formed by pooling samples with identical values of all those variables
Minimal number of samples in one sub(!)group; by default 1
Maximal number of samples in one sub(!)group; by default the size of the biggest group
Ideal number of samples in one sub(!)group; by default the floor or ceiling of mean(n_min,n_max)
, depending on the setting of prefer_big_groups
An optional column name for the subgroups which are formed (or NULL)
Boolean; indicating whether or not bigger subgroups should be preferred in case of several possibilities
Boolean; if TRUE, subgroup size constraints have to be met strictly, implying the possibility of finding no solution at all
Subgroup object to be used in subsequent calls to compile_possible_subgroup_allocation()
Generate terms.object
(formula with attributes)
generate_terms(.tbl, ...)
data
columns to skip (unquoted)
Summary score for an experimental layout based on the distribution of levels
of the factors to be balanced.
+a data.frame with the factors to be balanced and their current positions
a vector with the names of experimental conditions to be balanced
list of dimension name groups to be used jointly for scoring
list of tests to use for each scoring group
named vector of weights for the factors to be balanced (default all 1)
named vector of weights for the dimensions (default all 1)
the summarized penalty score
+Get highest order interaction
+get_order(.terms)
highest order (numeric).
+
+ All functions+ + |
+ |
---|---|
+ + | +R6 Class representing a batch container. |
+
+ + | +R6 Class representing a batch container dimension. |
+
+ + | +Aggregation of scores: L1 norm |
+
+ + | +Aggregation of scores: L2 norm squared |
+
+ + | +Alternative acceptance function for multi-dimensional scores in which order (left to right, e.g. first to last) denotes relevance. |
+
+ + | +Distributes samples based on a sample sheet. |
+
+ + | +Distributes samples in order. |
+
+ + | +Assignment function which distributes samples randomly. |
+
+ + | +Creates a BatchContainer from a table +(data.frame/tibble::tibble) containing sample and location information. |
+
+ + | +Compile list of all possible ways to assign levels of the allocation variable to a given set of subgroups |
+
+ + | +Reshuffle sample indices completely randomly |
+
+ + | +Gini index on counts |
+
+ + | +Drop highest order interactions |
+
+ + | +Aggregation of scores: take first (primary) score only |
+
+ + | +Form groups and subgroups of 'homogeneous' samples as defined by certain variables and size constraints |
+
+ + | +Generate |
+
+ + | +get Score |
+
+ + | +Get highest order interaction |
+
+ + | +A sample list from an in vivo experiment with multiple treatments and 2 strains |
+
+ + | +A treatment list together with additional constraints on the strain and sex of animals |
+
+ + | +Kruskal Wallis on Layout dimension |
+
+ + | +Create locations table from dimensions and exclude table |
+
+ + | +Subject sample list with group and time plus controls |
+
+ + | +Mean difference |
+
+ + | +Alternative acceptance function for multi-dimensional scores with exponentially downweighted score improvements from left to right |
+
+ + | +Create a list of scoring functions (one per plate) that quantify the spatially homogeneous distribution of conditions across the plate |
+
+ + | +Generate acceptance function for an optimization protocol based on simulated annealing |
+
+ + | +Create a temperature function that returns the annealing temperature at a given step (iteration) |
+
+ + | +Created a shuffling function that permutes samples within certain subgroups of the container locations |
+
+ + | +Create function to propose swaps of samples on each call, either with a constant number of swaps or following +a user defined protocol |
+
+ + | +Unbalanced treatment and time sample list |
+
+ + | +Penalty for neighbors with same level |
+
+ + | +Generic optimizer that can be customized by user provided functions for generating shuffles and progressing towards the minimal score |
+
+ + | +Convenience wrapper to optimize a typical multi-plate design |
+
+ + | +Compute OSAT score for sample assignment. |
+
+ + | +Convenience wrapper for the OSAT score |
+
+ + | +Example dataset with a plate effect |
+
+ + | +Plot plate layouts |
+
+ + | +randomize experimental layout |
+
+ + | +Generate in one go a shuffling function that produces permutations with specific constraints on multiple sample variables and group sizes fitting one specific allocation variable |
+
+ + | +Shuffling proposal function with constraints. |
+
+ + | +Compose shuffling function based on already available subgrouping and allocation information |
+
+ + | +Aggregation of scores: sum up all individual scores |
+
+ + | +Validates sample data.frame. |
+
+ + | +Aggregation of scores: take the maximum (i.e. worst score only) |
+
R/data.R
+ invivo_study_samples.Rd
This sample list is intended to be used in connection with the "invivo_study_treatments"
data object
data(invivo_study_samples)
An object of class "tibble"
The animal IDs, i.e. unique identifiers for each animal
Strain (A or B)
Female (F) or Male (M)
Date of birth, not available for all the animals
Markings to distinguish individual animals, applied on the left (L), right (R) or both (B) ears
Initial body weight of the animal
Unit of the body weight, here: grams
The litter IDs, grouping offspring from one set of parents
R/data.R
+ invivo_study_treatments.Rd
This treatment list is intended to be used in connection with the "invivo_study_samples"
data object
data(invivo_study_treatments)
An object of class "tibble"
The treatment to be given to an individual animal (1-3, plus a few untreated cases)
Strain (A or B) - a constraint which kind of animal may receive the respective treatment
Female (F) or Male (M) - a constraint which kind of animal may receive the respective treatment
Kruskal-Wallis test statistic from the ranking of a factor vector.
TODO: This is currently not used; it does not optimize for the correct layout
+kruskal(m)
a factor vector
the Kruskal-Wallis rank sum statistic
+R/batch_container.R
+ locations_table_from_dimensions.Rd
Create locations table from dimensions and exclude table
+locations_table_from_dimensions(dimensions, exclude)
A vector or list of dimensions. Every dimension +should have a name. Could be an integer vector of dimensions or +a named list. Every value of a list could be either dimension size +or parameters for BatchContainerDimension$new().
data.frame with excluded locations of a container.
a tibble::tibble()
with all the available locations.
R/data.R
+ longitudinal_subject_samples.Rd
A sample list with 9 columns as described below.
+There are 3 types of records (rows), indicated by the SampleType
+variable: patient samples, controls and spike-in standards.
+Patient samples were collected over up to 7 time points.
+Controls and spike-ins are QC samples for the distribution of the
+samples on 96-well plates.
data(longitudinal_subject_samples)
An object of class "tibble"
A unique sample identifier.
Indicates whether the sample is a patient sample, control or spike-in.
The subject identifier.
Indicates the treatment group of a subject.
Sampling time points in weeks of study.
Subject Sex, Female (F) or Male (M).
Subject age.
Subject Body Mass Index.
Look-up variable for the number of samples per subject. +This varies, as not all subjects have samples from all weeks.
Mean difference of a vector to score continuous variables
+meanDiff(m)
a numeric vector
the mean difference
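One plausible reading of this score (an assumption, since the formula is not documented here) is the mean of absolute successive differences, which is small when similar values sit next to each other:

```r
# Assumed interpretation, not taken from the package source:
m <- c(1, 2, 2, 5, 7)
mean(abs(diff(m)))
#> [1] 1.5
```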
+R/score_autoscaling.R
+ mk_autoscale_function.Rd
Create a function that transforms a current (multi-dimensional) score into a boxcox normalized one
+mk_autoscale_function(
+ batch_container,
+ scoring,
+ random_perm,
+ use_boxcox = TRUE,
+ sample_attributes_fixed = FALSE
+)
An instance of BatchContainer.
A named list()
of scoring functions. Each function should
+return a vector of non-zero length.
Number of random sample permutations for the estimation of the scaling params.
Logical; if TRUE and the bestNormalize
package is available, boxcox transformations will be used to
+normalize individual scores. If not possible, scores will just be transformed to a zero mean and unit standard deviation.
Logical; if FALSE, simulate a shuffle function that alters sample attributes at each iteration.
The transformation function for a new score vector
+R/shuffle_samples.R
+ mk_constant_swapping_function.Rd
This internal function is wrapped by mk_swapping_function()
+mk_constant_swapping_function(n_swaps, quiet = FALSE)
Number of swaps to be proposed (valid range is 1..floor(n_samples/2))
Do not warn if number of swaps is too big.
Function accepting batch container & iteration number. +Returns a list with length-n vectors 'src' and 'dst', denoting source and destination indices for +the swap operation on each call
+R/score_plates.R
+ mk_dist_matrix.Rd
Internal helper function to set up an (n*m) x (n*m) pairwise distance matrix for a plate with n rows and m columns
+mk_dist_matrix(
+ plate_x = 12,
+ plate_y = 8,
+ dist = "minkowski",
+ p = 2,
+ penalize_lines = "soft"
+)
Dimension of plate in x direction (i.e. number of columns)
Dimension of plate in y direction (i.e. number of rows)
Distance function as understood by stats::dist()
p parameter, used only if distance metric is 'minkowski'. Special cases: p=1 - Manhattan distance; p=2 - Euclidean distance
How to penalize samples of the same group in one row or column of the plate. Valid options are: +'none' - there is no penalty and the pure distance metric counts, 'soft' - penalty will depend on the well distance within the +shared plate row or column, 'hard' - samples in the same row/column will score a zero distance
The matrix with pairwise distances between any wells on the plate
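Without the line penalties, the distance part can be sketched with stats::dist() over the well coordinates of the default 12 x 8 plate:

```r
# Pairwise Euclidean distances between all 96 wells (no line penalty)
wells <- expand.grid(x = 1:12, y = 1:8)
d <- as.matrix(stats::dist(wells, method = "minkowski", p = 2))
dim(d)
#> [1] 96 96
```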
+R/score_accept.R
+ mk_exponentially_weighted_acceptance_func.Rd
Alternative acceptance function for multi-dimensional scores with exponentially downweighted score improvements from left to right
+mk_exponentially_weighted_acceptance_func(
+ kappa = 0.5,
+ simulated_annealing = FALSE,
+ temp_function = mk_simanneal_temp_func(T0 = 500, alpha = 0.8)
+)
Coefficient that determines how quickly the weights for the individual score improvements drop when going from left to right +(i.e. first to last score). The weight for the first score's delta is 1; the delta of the p-th score is multiplied by kappa^(p-1)
Boolean; if TRUE, simulated annealing (SA) will be used to minimize the weighted improved score
In case SA is used, a temperature function that returns the annealing temperature for a certain iteration number
Acceptance function which returns TRUE if current score should be taken as the new optimal score, FALSE otherwise
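The down-weighting itself can be illustrated in a few lines (a sketch of the weighting scheme only, not the full acceptance function):

```r
kappa <- 0.5
deltas <- c(-2, 4, -8)                    # score deltas, first to last score
weights <- kappa^(seq_along(deltas) - 1)  # 1, kappa, kappa^2, ...
sum(deltas * weights)                     # weighted overall improvement
#> [1] -2
```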
+R/score_plates.R
+ mk_plate_scoring_functions.Rd
Create a list of scoring functions (one per plate) that quantify the spatially homogeneous distribution of conditions across the plate
+mk_plate_scoring_functions(
+ batch_container,
+ plate = NULL,
+ row,
+ column,
+ group,
+ p = 2,
+ penalize_lines = "soft"
+)
Batch container (bc) with all columns that denote plate related information
Name of the bc column that holds the plate identifier (may be missing or NULL in case just one plate is used)
Name of the bc column that holds the plate row number (integer values starting at 1)
Name of the bc column that holds the plate column number (integer values starting at 1)
Name of the bc column that denotes a group/condition that should be distributed on the plate
p parameter for minkowski type of distance metrics. Special cases: p=1 - Manhattan distance; p=2 - Euclidean distance
How to penalize samples of the same group in one row or column of the plate. Valid options are: +'none' - there is no penalty and the pure distance metric counts, 'soft' - penalty will depend on the well distance within the +shared plate row or column, 'hard' - samples in the same row/column will score a zero distance
List of scoring functions, one per plate, that calculate a real valued measure for the quality of the group distribution (the lower the better).
+data("invivo_study_samples")
+bc <- BatchContainer$new(
+ dimensions = c("column" = 6, "row" = 10)
+)
+bc <- assign_random(bc, invivo_study_samples)
+scoring_f <- mk_plate_scoring_functions(
+ bc,
+ row = "row", column = "column", group = "Sex"
+)
+bc <- optimize_design(bc, scoring = scoring_f, max_iter = 100)
+#> Checking variances of 1-dim. score vector.
+#> ... (0.15) - OK
+#> Initial score: 9.413
+#> Achieved score: 9.364 at iteration 5
+#> Achieved score: 9.351 at iteration 7
+#> Achieved score: 9.312 at iteration 15
+#> Achieved score: 9.266 at iteration 37
+#> Achieved score: 9.257 at iteration 38
+#> Achieved score: 9.257 at iteration 42
+#> Achieved score: 9.24 at iteration 47
+#> Achieved score: 9.235 at iteration 83
+plot_plate(bc$get_samples(), .col = Sex)
+
+
+
R/sim_annealing.R
+ mk_simanneal_acceptance_func.Rd
Generate acceptance function for an optimization protocol based on simulated annealing
+mk_simanneal_acceptance_func(
+ temp_function = mk_simanneal_temp_func(T0 = 500, alpha = 0.8)
+)
A temperature function that returns the annealing temperature for a certain cycle k
A function that takes parameters (current_score
, best_score
, iteration
) for an optimization step and returns a Boolean indicating whether the current solution should be accepted or dismissed. The acceptance probability of a worse solution decreases with the annealing temperature.
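The underlying idea is the classic Metropolis criterion; a minimal sketch (the package implementation may differ in detail):

```r
# Accept an equal-or-better score always; accept a worse one with
# probability exp(-delta / temperature).
metropolis_accept <- function(current_score, best_score, temp) {
  delta <- current_score - best_score
  delta <= 0 || stats::runif(1) < exp(-delta / temp)
}
set.seed(1)
metropolis_accept(current_score = 10.2, best_score = 10.0, temp = 500)
#> [1] TRUE
```

At high temperatures almost any step is accepted; as the temperature drops, only improvements pass.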
R/sim_annealing.R
+ mk_simanneal_temp_func.Rd
Supported annealing types are currently "Exponential multiplicative", "Logarithmic multiplicative", "Quadratic multiplicative" and "Linear multiplicative", each with dedicated constraints on alpha. For information, see http://what-when-how.com/artificial-intelligence/a-comparison-of-cooling-schedules-for-simulated-annealing-artificial-intelligence/
+mk_simanneal_temp_func(T0, alpha, type = "Quadratic multiplicative")
Initial temperature at step 1 (when k=0)
Rate of cooling
Type of annealing protocol. Defaults to the quadratic multiplicative method which seems to perform well.
Temperature at cycle k.
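The four schedules follow the standard multiplicative cooling formulas; a base-R sketch of the default quadratic variant (formulas taken from the referenced article, not the package source):

```r
# Quadratic multiplicative cooling: T_k = T0 / (1 + alpha * k^2)
quad_temp <- function(T0, alpha, k) T0 / (1 + alpha * k^2)

# Temperatures over the first cycles with T0 = 500, alpha = 0.8
round(sapply(0:4, function(k) quad_temp(T0 = 500, alpha = 0.8, k)), 1)
#> [1] 500.0 277.8 119.0  61.0  36.2
```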
R/shuffle_samples.R
+ mk_subgroup_shuffling_function.Rd
If length(n_swaps)==1, the returned function may be called an arbitrary number of times. +If length(n_swaps)>1, the returned function may be called length(n_swaps) times before returning NULL, which serves as the stopping criterion once all requested swaps have been exhausted.
+mk_subgroup_shuffling_function(
+ subgroup_vars,
+ restrain_on_subgroup_levels = c(),
+ n_swaps = 1
+)
Column names of the variables that together define the relevant subgroups
Permutations can be forced to take place only within a level of the factor of the subgrouping variable. In this case, the user must pass only one subgrouping variable and a number of levels that together define the permuted subgroup.
Vector with number of swaps to be proposed in successive calls to the returned function (each value should be in valid range from 1..floor(n_locations/2))
Function to return a list with length n vectors src
and dst
, denoting source and destination index for the swap operation, or NULL
if the user provided a defined protocol for the number of swaps and the last iteration has been reached
set.seed(42)
+
+bc <- BatchContainer$new(
+ dimensions = c(
+ plate = 2,
+ row = 4, col = 4
+ )
+)
+
+bc <- assign_in_order(bc, samples = tibble::tibble(
+ Group = c(rep(c("Grp 1", "Grp 2", "Grp 3", "Grp 4"), each = 8)),
+ ID = 1:32
+))
+
+# here we use a 2-step approach:
+# 1. Assign samples to plates.
+# 2. Arrange samples within plates.
+
+# overview of sample assignment before optimization
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group
+)
+
+
+# Step 1, assign samples to plates
+scoring_f <- osat_score_generator(
+ batch_vars = c("plate"), feature_vars = c("Group")
+)
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ max_iter = 10, # the real number of iterations should be bigger
+ n_shuffle = 2,
+ quiet = TRUE
+)
+plot_plate(
+ bc,
+ plate = plate, row = row, column = col, .color = Group
+)
+
+
+# Step 2, distribute samples within plates
+scoring_f <- mk_plate_scoring_functions(
+ bc,
+ plate = "plate", row = "row", column = "col", group = "Group"
+)
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ max_iter = 50,
+ shuffle_proposal_func = mk_subgroup_shuffling_function(subgroup_vars = c("plate")),
+ aggregate_scores_func = L2s_norm,
+ quiet = TRUE
+)
+plot_plate(bc,
+ plate = plate, row = row, column = col, .color = Group
+)
+
+
R/shuffle_samples.R
+ mk_swapping_function.Rd
If length(n_swaps)==1
, the returned function may be called an arbitrary number of times.
+If length(n_swaps)>1
and called without argument, the returned function may be called length(n_swaps) times before returning NULL, which serves as the stopping criterion once all requested swaps have been exhausted. Alternatively, the function may be called with an iteration number as the only argument, giving the user some freedom in how to iterate over the sample swapping protocol.
mk_swapping_function(n_swaps = 1)
Vector with number of swaps to be proposed in successive calls to the returned function (each value should be in valid range from 1..floor(n_samples/2)
)
Function to return a list with length n vectors src
and dst
, denoting source and destination index for the swap operation, or NULL if the user provided a defined protocol for the number of swaps and the last iteration has been reached.
data("invivo_study_samples")
+bc <- BatchContainer$new(
+ dimensions = c("plate" = 2, "column" = 5, "row" = 6)
+)
+scoring_f <- osat_score_generator("plate", "Sex")
+optimize_design(
+ bc, scoring = scoring_f, invivo_study_samples,
+ max_iter = 100,
+ shuffle_proposal_func = mk_swapping_function(1)
+)
+#> Warning: NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 1-dim. score vector.
+#> ... (418.374) - OK
+#> Initial score: 182.5
+#> Achieved score: 132.5 at iteration 1
+#> Achieved score: 90.5 at iteration 5
+#> Achieved score: 56.5 at iteration 11
+#> Achieved score: 30.5 at iteration 23
+#> Achieved score: 12.5 at iteration 24
+#> Achieved score: 2.5 at iteration 26
+#> Achieved score: 0.5 at iteration 36
+#> Batch container with 60 locations and 59 samples (assigned).
+#> Dimensions: plate, column, row
+
A sample list with 4 columns SampleName, Well, Time and Treatment. +Not all treatments are available at all time points. +All samples are placed on the same plate.
+data(multi_trt_day_samples)
An object of class "tibble"
Penalty score for the number of neighboring pairs with the same level of a factor. +The number of such pairs is summed.
+neighbors(m)
a factor vector
the number of pairs
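The counting can be reproduced in one line of base R (an equivalent sketch, not necessarily the package implementation):

```r
# Count adjacent positions sharing the same factor level
m <- factor(c("A", "A", "B", "A", "B", "B", "B"))
sum(head(m, -1) == tail(m, -1))
#> [1] 3
```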
+R/optimize.R
+ optimize_design.Rd
Generic optimizer that can be customized by user provided functions for generating shuffles and progressing towards the minimal score
+optimize_design(
+ batch_container,
+ samples = NULL,
+ scoring = NULL,
+ n_shuffle = NULL,
+ shuffle_proposal_func = NULL,
+ acceptance_func = accept_strict_improvement,
+ aggregate_scores_func = identity,
+ check_score_variance = TRUE,
+ autoscale_scores = FALSE,
+ autoscaling_permutations = 100,
+ autoscale_useboxcox = TRUE,
+ sample_attributes_fixed = FALSE,
+ max_iter = 10000,
+ min_delta = NA,
+ quiet = FALSE
+)
An instance of BatchContainer
.
A data.frame
with sample information.
+Should be NULL
if the BatchContainer
already has samples in it.
Scoring function or a named list()
of scoring functions.
Vector of length 1 or larger, defining how many random sample
+swaps should be performed in each iteration. If length(n_shuffle)==1
,
+this sets no limit to the number of iterations. Otherwise, the optimization
+stops if the swapping protocol is exhausted.
A user defined function to propose the next shuffling of samples.
+Takes priority over n_shuffle if both are provided. The function is called with
+a BatchContainer bc
and an integer parameter iteration
for the current iteration number,
+allowing very flexible shuffling strategies.
+Mapper syntax is supported (see purrr::as_mapper()
).
+The returned function must either return a list with fields src
and dst
(for pairwise sample swapping)
+or a numeric vector with a complete re-assigned sample order.
Alternative function to select a new score as the best one.
+Defaults to strict improvement rule, i.e. all elements of a score have to be smaller or equal in order to accept the solution as better.
+This may be replaced with an alternative acceptance function included in the package
+(e.g. mk_simanneal_acceptance_func()
) or a user provided function.
+Mapper syntax is supported (see purrr::as_mapper()
).
A function to aggregate multiple scores AFTER (potential) auto-scaling and BEFORE acceptance evaluation.
+If a function is passed, (multi-dimensional) scores will be transformed (often to a single double value) before calling the acceptance function.
+E.g., see first_score_only()
or worst_score()
.
+Note that particular acceptance functions may require aggregation of a score to a single scalar in order to work, see for example those
+generated by mk_simanneal_acceptance_func()
.
+Mapper syntax is supported (see purrr::as_mapper()
).
Logical: if TRUE, scores will be checked for variability under sample permutation +and the optimization is not performed if at least one subscore appears to have a zero variance.
Logical: if TRUE, perform a transformation on the fly to equally scale scores +to a standard normal. This makes scores more directly comparable and easier to aggregate.
How many random sample permutations should be done to estimate autoscaling parameters. +(Note: minimum will be 20, regardless of the specified value)
Logical; if TRUE, use a boxcox transformation for the autoscaling if possible at all.
+Requires installation of the bestNormalize
package.
Logical; if TRUE, sample shuffle function may generate altered sample attributes at each iteration. +This affects estimation of score distributions. (Parameter only relevant if shuffle function does introduce attributes!)
Stop optimization after a maximum number of iterations, +independent from other stopping criteria (user defined shuffle proposal or min_delta).
If not NA, optimization is stopped as soon as successive improvement (i.e. euclidean distance between score vectors +from current best and previously best solution) drops below min_delta.
If TRUE, suppress non-critical warnings or messages.
A trace object
+data("invivo_study_samples")
+bc <- BatchContainer$new(
+ dimensions = c("plate" = 2, "column" = 5, "row" = 6)
+)
+bc <- optimize_design(bc, invivo_study_samples,
+ scoring = osat_score_generator("plate", "Sex"),
+ max_iter = 100
+)
+#> Warning: NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 1-dim. score vector.
+#> ... (141.154) - OK
+#> Initial score: 182.5
+#> Achieved score: 132.5 at iteration 1
+#> Achieved score: 90.5 at iteration 2
+#> Achieved score: 56.5 at iteration 6
+#> Achieved score: 30.5 at iteration 11
+#> Achieved score: 12.5 at iteration 15
+#> Achieved score: 2.5 at iteration 16
+#> Achieved score: 0.5 at iteration 31
+plot_plate(bc$get_samples(), .col = Sex)
+
+
R/score_plates.R
+ optimize_multi_plate_design.Rd
In the end, the batch container will contain the updated experimental layout
+optimize_multi_plate_design(
+ batch_container,
+ across_plates_variables = NULL,
+ within_plate_variables = NULL,
+ plate = "plate",
+ row = "row",
+ column = "column",
+ n_shuffle = 1,
+ max_iter = 1000,
+ quiet = FALSE
+)
Batch container (bc) with all columns that denote plate related information
Vector with bc column name(s) that denote(s) groups/conditions to be balanced across plates, +sorted by relative importance of the factors
Vector with bc column name(s) that denote(s) groups/conditions to be spaced out within each plate, +sorted by relative importance of the factors
Name of the bc column that holds the plate identifier
Name of the bc column that holds the plate row number (integer values starting at 1)
Name of the bc column that holds the plate column number (integer values starting at 1)
Vector of length 1 or larger, defining how many random sample
+swaps should be performed in each iteration. See optimize_design()
.
Stop any of the optimization runs after this maximum number of iterations. See optimize_design()
.
If TRUE, suppress informative messages.
A list with named traces, one for each optimization step
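A minimal usage sketch, assuming the packaged invivo_study_samples data (with its Strain and Sex columns) and illustrative settings; per the value section, the traces are returned while the batch container holds the final layout:

```r
data("invivo_study_samples")
bc <- BatchContainer$new(
  dimensions = c("plate" = 2, "row" = 5, "column" = 6)
)
bc <- assign_random(bc, invivo_study_samples)

# balance Strain and Sex across the two plates,
# then space Sex out within each plate
traces <- optimize_multi_plate_design(
  bc,
  across_plates_variables = c("Strain", "Sex"),
  within_plate_variables = c("Sex"),
  max_iter = 200,
  quiet = TRUE
)
```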
+The OSAT score is intended to ensure even distribution of samples across +batches and is closely related to the chi-square test on a contingency table +(Yan et al. (2012) doi:10.1186/1471-2164-13-689 +).
+osat_score(bc, batch_vars, feature_vars, expected_dt = NULL, quiet = FALSE)
BatchContainer with samples
+or data.table
/data.frame where every row is a location
+in a container and a sample in this location.
character vector with batch variable names to take into account for the +score computation.
character vector with sample variable names to take into account for +score computation.
A data.table
with the expected number of samples for each combination of sample
+variables and batch variables. This is not required; however, since it does not change
+during the optimization process, it is a good idea to cache this value.
Do not warn about NA
s in feature columns.
a list with two attributes: $score
(numeric score value), $expected_dt
(expected counts data.table
for reuse)
sample_assignment <- tibble::tribble(
+ ~ID, ~SampleType, ~Sex, ~plate,
+ 1, "Case", "Female", 1,
+ 2, "Case", "Female", 1,
+ 3, "Case", "Male", 2,
+ 4, "Control", "Female", 2,
+ 5, "Control", "Female", 1,
+ 6, "Control", "Male", 2,
+ NA, NA, NA, 1,
+ NA, NA, NA, 2,
+)
+
+osat_score(sample_assignment,
+ batch_vars = "plate",
+ feature_vars = c("SampleType", "Sex")
+)
+#> Warning: NAs in features / batch columns; they will be excluded from scoring
+#> $score
+#> [1] 3
+#>
+#> $expected_dt
+#> Key: <plate, SampleType, Sex>
+#> plate SampleType Sex .n_expected
+#> <num> <char> <char> <num>
+#> 1: 1 Case Female 1.0
+#> 2: 1 Case Male 0.5
+#> 3: 1 Control Female 1.0
+#> 4: 1 Control Male 0.5
+#> 5: 2 Case Female 1.0
+#> 6: 2 Case Male 0.5
+#> 7: 2 Control Female 1.0
+#> 8: 2 Control Male 0.5
+#>
+
This function wraps osat_score()
in order to take full advantage of the speed gain without
+managing the buffered objects in the user code.
osat_score_generator(batch_vars, feature_vars, quiet = FALSE)
A function that returns the OSAT score for a specific sample arrangement
+sample_assignment <- tibble::tribble(
+ ~ID, ~SampleType, ~Sex, ~plate,
+ 1, "Case", "Female", 1,
+ 2, "Case", "Female", 1,
+ 3, "Case", "Male", 2,
+ 4, "Control", "Female", 2,
+ 5, "Control", "Female", 1,
+ 6, "Control", "Male", 2,
+ NA, NA, NA, 1,
+ NA, NA, NA, 2,
+)
+
+osat_scoring_function <- osat_score_generator(
+ batch_vars = "plate",
+ feature_vars = c("SampleType", "Sex")
+)
+
+osat_scoring_function(sample_assignment)
+#> Warning: NAs in features / batch columns; they will be excluded from scoring
+#> [1] 3
+
R/shuffle_samples.R
+ pairwise_swapping.Rd
This function will ensure that one of the locations is always non-empty. It should not
+return trivial permutations (e.g., src=c(1,2)
and dst=c(1,2)
).
pairwise_swapping(batch_container, iteration)
The batch-container.
The current iteration number.
Function accepting batch container & iteration number. It returns a list with length 1 vectors 'src' and 'dst', denoting source and destination index for the swap operation
+See magrittr::%>%
for details.
lhs %>% rhs
Here top and bottom row were both used as controls (in dilutions). The top +row however was affected differently than the bottom one. This makes +normalization virtually impossible.
+data(plate_effect_example)
An object of class "tibble"
Plate row
Plate column
Sample concentration
Logarithm of sample concentration
Sample treatment
Readout from experiment
Plot plate layouts
+a tibble
(or data.frame
) with the samples assigned to locations.
+Alternatively, a BatchContainer with samples can be supplied here.
optional dimension variable used for the plate ids
the dimension variable used for the row ids
the dimension variable used for the column ids
the continuous or discrete variable to color by
a continuous variable encoding transparency
a discrete variable encoding tile pattern (needs ggpattern)
string for the plot title
flag to add excluded wells (in bc$exclude) to the plot. +A BatchContainer must be provided for this.
whether NA entries in the sample table should be renamed to 'empty'.
the ggplot object
+nPlate <- 3
+nColumn <- 4
+nRow <- 6
+
+treatments <- c("CTRL", "TRT1", "TRT2")
+timepoints <- c(1, 2, 3)
+
+
+bc <- BatchContainer$new(
+ dimensions = list(
+ plate = nPlate,
+ column = list(values = letters[1:nColumn]),
+ row = nRow
+ )
+)
+
+sample_sheet <- tibble::tibble(
+ sampleID = 1:(nPlate * nColumn * nRow),
+ Treatment = rep(treatments, each = floor(nPlate * nColumn * nRow / length(treatments))),
+ Timepoint = rep(timepoints, floor(nPlate * nColumn * nRow / length(timepoints)))
+)
+
+# assign samples from the sample sheet
+bc <- assign_random(bc, samples = sample_sheet)
+
+plot_plate(bc$get_samples(),
+ plate = plate, column = column, row = row,
+ .color = Treatment, .alpha = Timepoint
+)
+
+
+if (FALSE) {
+plot_plate(bc$get_samples(),
+ plate = plate, column = column, row = row,
+ .color = Treatment, .pattern = Timepoint
+)
+}
+
R/score_autoscaling.R
+ random_score_variances.Rd
Estimate the variance of individual scores by a series of completely random sample permutations
+random_score_variances(
+ batch_container,
+ scoring,
+ random_perm,
+ sample_attributes_fixed
+)
An instance of BatchContainer
.
A named list()
of scoring function. Each function should
+return a vector of non-zero length.
Number of random sample permutations to be done.
Logical; if FALSE, simulate a shuffle function that alters sample attributes at each iteration.
Vector of length m (=dimensionality of score) with estimated variances of each subscore
+This function generates a randomized experimental layout based on given +experimental dimensions and factors to be balanced
+randomize(
+ design,
+ report,
+ layout_dim,
+ balance,
+ scoring_groups = as.list(names(layout_dim)),
+ scoring_tests = as.list(rep("countGini", length(layout_dim))),
+ burnin = 100,
+ annealprotocol,
+ scoring_weights = rep(1, length(scoring_groups)),
+ balance_weights = rep(1, length(balance)),
+ distribute = 1:(prod(layout_dim))
+)
a data.frame with the sample ids, experimental conditions and +information about fixed samples (columns with 'Fix' prefix and then the dimension name)
a string with the sample identifier
a named vector with the experimental dimensions
a vector with the names of experimental conditions to be balanced
list of dimension name groups to be used jointly for scoring
list of scoring tests to be applied to each scoring group (default 'countGini')
a number of initial burnin runs (default=100)
a vector with the number of pairs to swap in each annealing step
named vector of weights for the dimensions (default all 1)
named vector of weights for the factors to be balanced (default all 1)
a starting distribution
a list with the optimized design ($design) and the optimization trace ($opti)
+if (FALSE) {
+# samples to use
+samples <- data.frame(
+ Group = c(1, 2, 3, 4, 5),
+ Treatment = c(
+ "vehicle", "RTR24", "RTR25",
+ "RTR26", "RTR27"
+ ),
+ Dose = c(0, 25, 25, 25, 25),
+ animals = c(2, 2, 2, 2, 2)
+)
+
+# generate initial sample table (2 animals per group with 3 replicates)
+samples <- dplyr::bind_rows(samples %>% dplyr::mutate(animals = 1), samples) %>%
+ dplyr::rename(animal = animals)
+samples <- dplyr::bind_rows(list(samples, samples, samples)) %>%
+ dplyr::mutate(
+ replicate = rep(1:3, each = 10),
+ SampleID = paste0(Treatment, "_", animal, "_", replicate),
+ AnimalID = paste0(Treatment, "_", animal)
+ )
+
+# to be put on a 96 well plate
+# (empty wells: first and last column plus two more wells)
+empty <- 8 * 4 - nrow(samples) # n locations - n samples
+emptydf <- data.frame(
+ Treatment = "empty",
+ FixColumn = 4, FixRow = 8 + 1 - (1:empty)
+)
+
+# final sample table
+design <- dplyr::full_join(samples, emptydf)
+
+# set parameters
+layout_dim <- c(Row = 8, Column = 4)
+scoring_groups <- list(c("Row"), c("Column"))
+scoring_tests <- list("countGini", "countGini")
+scoring_weights <- c(Row = 1, Column = 2)
+balance <- c("AnimalID")
+balance_weights <- c(1, 1)
+names(balance_weights) <- balance
+report <- "SampleID" # column with a unique ID
+# annealprotocol <- rep(c(10,5,2,1), c(500,1000,2000,5000))
+annealprotocol <- rep(c(10, 5, 2, 1), c(50, 100, 200, 300))
+
+# run randomization
+result <- randomize(design, report, layout_dim, balance,
+ scoring_groups = scoring_groups,
+ scoring_tests = scoring_tests,
+ burnin = 200, annealprotocol = annealprotocol,
+ scoring_weights = scoring_weights,
+ balance_weights = balance_weights,
+ distribute = sample(1:(prod(layout_dim)))
+)
+
+final_design <- result$design %>%
+ dplyr::mutate(Column = Column + 1) # first column empty
+
+# plot
+library(ggplot2)
+ggplot(
+ final_design,
+ aes(x = Column, y = Row, fill = Treatment, alpha = factor(animal))
+) +
+ theme_minimal() +
+ geom_tile() +
+ scale_x_continuous(breaks = unique(final_design$Column)) +
+ scale_y_reverse(breaks = rev(unique(final_design$Row))) +
+ scale_alpha_manual(values = c(1, 0.5))
+
+knitr::kable(table(final_design$Treatment, final_design$Column),
+ digits = 0,
+ caption = "Treatment distribution across Columns."
+)
+
+# optimization curve
+plot(score ~ iteration,
+ data = result$opti,
+ log = "x", ylab = "penalty", type = "b"
+)
+}
+
+
R/optimize.R
+ report_scores.Rd
Helper function to print out one set of scores plus (if needed) aggregated values
+report_scores(score, agg_score, iteration)
Vector of (non-aggregated) scores (may be a single value as well)
Vector of aggregated scores (may be a single value as well)
Iteration number
R/score_autoscaling.R
+ sample_random_scores.Rd
Sample scores from a number of completely random sample permutations
+sample_random_scores(
+ batch_container,
+ scoring,
+ random_perm,
+ sample_attributes_fixed
+)
An instance of BatchContainer.
A named list()
of scoring function. Each function should
+return a vector of non-zero length.
Number of random sample permutations to be done.
Logical; if FALSE
, simulate a shuffle function that alters sample attributes at each iteration.
A score matrix with n (# of permutations) rows and m (dimensionality of score) columns.
+Shrinks a matrix with scores and adds an iteration column.
+shrink_mat(m, last_iteration)
input matrix
last iteration
a tibble::tibble()
wrapped in a list
R/permute_subgroups.R
+ shuffle_grouped_data.Rd
Generate in one go a shuffling function that produces permutations with specific constraints on multiple sample variables and group sizes fitting one specific allocation variable
+shuffle_grouped_data(
+ batch_container,
+ allocate_var,
+ keep_together_vars = c(),
+ keep_separate_vars = c(),
+ n_min = NA,
+ n_max = NA,
+ n_ideal = NA,
+ subgroup_var_name = NULL,
+ report_grouping_as_attribute = FALSE,
+ prefer_big_groups = FALSE,
+ strict = TRUE,
+ fullTree = FALSE,
+ maxCalls = 1e+06
+)
Batch container with all samples assigned that are to be grouped and sub-grouped
Name of a variable in the samples
table to inform possible groupings, as (sub)group sizes must add up to the correct totals
Vector of column names in sample table; groups are formed by pooling samples with identical values of all those variables
Vector of column names in sample table; items with identical values in those variables will not be put into the same subgroup if at all possible
Minimal number of samples in one sub(!)group; by default 1
Maximal number of samples in one sub(!)group; by default the size of the biggest group
Ideal number of samples in one sub(!)group; by default the floor or ceiling of mean(n_min,n_max)
, depending on the setting of prefer_big_groups
An optional column name for the subgroups which are formed (or NULL)
Boolean, if TRUE, add an attribute table to the permutation functions' output, to be used in scoring during the design optimization
Boolean; indicating whether or not bigger subgroups should be preferred in case of several possibilities
Boolean; if TRUE, subgroup size constraints have to be met strictly, implying the possibility of finding no solution at all
Boolean: Enforce full search of the possibility tree, independent of the value of maxCalls
Maximum number of recursive calls in the search tree, to avoid long run times with very large trees
Shuffling function that on each call returns an index vector for a valid sample permutation
+R/assignment.R
+ shuffle_with_constraints.Rd
Can be used with optimize_design
to improve convergence speed.
shuffle_with_constraints(src = TRUE, dst = TRUE)
Expression to define possible source locations in the samples/locations
+table. Usually evaluated based on
+BatchContainer$get_samples(include_id = TRUE, as_tibble = FALSE)
as an environment
+(see also with()
). A single source location is selected from rows where the
+expression evaluates to TRUE
.
Expression to define possible destination locations in the
+samples/locations table. Usually evaluated based on BatchContainer$get_samples()
as an
+environment.
+Additionally a special variable .src
is available in this environment which
+describes the selected source row from the table.
Returns a function which accepts a BatchContainer
and an iteration
+number (i
). This function returns a list with two names: src
vector of length
+2 and dst
vector of length two. See BatchContainer$move_samples()
.
set.seed(43)
+
+samples <- data.frame(
+ id = 1:100,
+ sex = sample(c("F", "M"), 100, replace = TRUE),
+ group = sample(c("treatment", "control"), 100, replace = TRUE)
+)
+
+bc <- BatchContainer$new(
+ dimensions = c("plate" = 5, "position" = 25)
+)
+
+scoring_f <- function(samples) {
+ osat_score(
+ samples,
+ "plate",
+ c("sex", "group")
+ )$score
+}
+
+# in this example we treat all the positions in the plate as equal.
+# when shuffling we enforce that source location is non-empty,
+# and destination location has a different plate number
+bc <- optimize_design(
+ bc,
+ scoring = scoring_f,
+ samples,
+ shuffle_proposal = shuffle_with_constraints(
+ # source is non-empty location
+ !is.na(.sample_id),
+ # destination has a different plate
+ plate != .src$plate
+ ),
+ max_iter = 10
+)
+#> Warning: NAs in features / batch columns; they will be excluded from scoring
+#> Checking variances of 1-dim. score vector.
+#> ... (455.222) - OK
+#> Initial score: 172.8
+#> Achieved score: 158.8 at iteration 3
+#> Achieved score: 138.8 at iteration 4
+#> Achieved score: 136.8 at iteration 8
+
R/permute_subgroups.R
shuffle_with_subgroup_formation.Rd
Compose a shuffling function based on already available subgrouping and allocation information.
shuffle_with_subgroup_formation(
  subgroup_object,
  subgroup_allocations,
  keep_separate_vars = c(),
  report_grouping_as_attribute = FALSE
)
subgroup_object: A subgrouping object as returned by form_homogeneous_subgroups().
subgroup_allocations: A list of possible assignments of the allocation variable, as returned by compile_possible_subgroup_allocation().
keep_separate_vars: Vector of column names in the sample table; items with identical values in those variables will not be put into the same subgroup, if at all possible.
report_grouping_as_attribute: Boolean; if TRUE, add an attribute table to the permutation function's output, to be used in scoring during the design optimization.
Returns a shuffling function that on each call returns an index vector for a valid sample permutation.
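To illustrate the general contract (a closure that, on each call, yields an index vector for a valid permutation), here is a toy stand-in; make_random_shuffle is a hypothetical helper, not part of the package:

```r
# Toy illustration of the shuffling-function contract (NOT the
# package implementation): a closure that, on each call, returns
# an index vector describing a valid permutation of n samples.
make_random_shuffle <- function(n) {
  function() sample.int(n)
}

shuffle <- make_random_shuffle(5)
idx <- shuffle()
stopifnot(identical(sort(idx), 1:5))  # always a permutation of 1..5
```

The real function additionally constrains permutations to respect the precomputed subgroup structure.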
R/sim_annealing.R
simanneal_acceptance_prob.Rd
A solution is always accepted if it leads to a lower score. Worse solutions are accepted with a probability given by this function.
simanneal_acceptance_prob(current_score, best_score, temp, eps = 0.1)
current_score: Score from the current optimizing iteration (scalar value, double)
best_score: Best score achieved in previous iterations (scalar value, double)
temp: Current value of the annealing temperature
eps: Small parameter eps(ilon) ensuring that the new solution is not always taken when scores are exactly equal
Returns the probability with which to accept the new score as the best one.
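A minimal sketch of a Metropolis-style acceptance rule of this kind (an assumed form for illustration; the exact formula used by the package may differ). accept_prob is a hypothetical name, not the package function:

```r
# Hypothetical sketch of a simulated-annealing acceptance rule
# (assumed form, NOT the exact package implementation):
# improvements are always accepted; worse solutions are accepted
# with a probability that decays with the score gap and as the
# temperature drops. eps ensures equal scores are not always taken.
accept_prob <- function(current_score, best_score, temp, eps = 0.1) {
  delta <- current_score - best_score
  if (delta < 0) return(1)        # strictly better: always accept
  exp(-(delta + eps) / temp)      # worse or equal: decaying probability
}

accept_prob(8, 10, temp = 5)    # improvement: probability 1
accept_prob(12, 10, temp = 5)   # worse by 2: some chance of acceptance
accept_prob(12, 10, temp = 0.5) # same gap, colder: far less likely
```

As the temperature is lowered over iterations, the optimizer increasingly refuses score-worsening moves, which is the core annealing idea.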
R/score_aggregation.R
sum_scores.Rd
Aggregation of scores: sum up all individual scores.
sum_scores(scores, na.rm = FALSE, ...)
scores: A score or multiple component score vector
na.rm: Boolean. Should NA values be ignored when computing the sum? FALSE by default, as ignoring NA values may render the sum meaningless.
...: Parameters to be ignored by this aggregation function
Returns the aggregated score, i.e. the sum of all individual scores.
sum_scores(c(3, 2, 1))
#> [1] 6
R/utils-datatable.R
testxxx.Rd
See data.table::let() and vignette("datatable-importing", "data.table"), section "data.table in Imports but nothing imported".
R/optimize.R
update_batchcontainer.Rd
As a post-condition, the batch container is in a different state.
update_batchcontainer(bc, shuffle_params)
bc: The batch container to operate on.
shuffle_params: Shuffling parameters as returned by extract_shuffle_params().
Returns TRUE if sample attributes have been assigned, FALSE otherwise.
Validates the sample data.frame.
validate_samples(samples)
samples: A data.frame with one sample annotation per row.
R/permute_subgroups.R
validate_subgrouping_object.Rd
Validate a subgrouping object and stop with an error message if not all required fields are present.
validate_subgrouping_object(subgroup_object)
subgroup_object: Subgrouping object as returned by form_homogeneous_subgroups().
R/score_aggregation.R
worst_score.Rd
This function enables comparison of the results of two scoring functions by basing the decision on the largest element alone. This corresponds to the infinity norm in ML terms.
worst_score(scores, na.rm = FALSE, ...)
scores: A score or multiple component score vector
na.rm: Boolean. Should NA values be ignored when obtaining the maximum? FALSE by default, as ignoring NA values may hide issues with the provided scoring functions, and the aggregated value could no longer be seen as a proper infinity norm.
...: Parameters to be ignored by this aggregation function
Returns the aggregated score, i.e. the value of the largest element in a multiple-component score vector.
worst_score(c(3, 2, 1))
#> [1] 3
+