rebuild the website to fix some formatting issues
zachary-foster committed Nov 2, 2023
1 parent 863dfdd commit 2b63494
Showing 20 changed files with 2,269 additions and 2,584 deletions.
@@ -0,0 +1,73 @@



```{r}
library(tidyverse)
library(readxl)
library(metacoder)
```

Martha sent us this data (email "RPS10 database", Sep 6, 2022), and Ricardo needs it in the format of the old database.


## Read in input data

Data from Martha:

```{r}
raw_data <- read_xlsx('rps10database_updates_july22_mas.xlsx')
```

The first version of the database:

```{r}
old_seqs <- read_fasta('../../data/releases/release_1.fa')
# Parse each 'key=value|key=value' header into a one-row tibble and stack the rows
old_data <- map_dfr(names(old_seqs), function(header) {
  parts <- strsplit(header, split = '|', fixed = TRUE)[[1]]
  parts <- strsplit(parts, split = '=', fixed = TRUE)
  map_dfc(parts, function(p) tibble(!!p[1] := p[2]))
})
```
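
As a minimal sketch of what the header parsing above does to a single record, the chunk below splits one header by hand using base R only. The header string is hypothetical and merely follows the `key=value|key=value` layout that the parsing code assumes; the real release headers may carry different fields.

```{r}
# Hypothetical header in the 'key=value|key=value' layout (not taken from the release file)
header <- "name=Phytophthora_infestans|strain=na|oodb_id=1|taxonomy=Oomycetes;Phytophthora"

# Split into fields on the literal '|', then split each field on the literal '='
fields <- strsplit(header, split = "|", fixed = TRUE)[[1]]
pairs <- strsplit(fields, split = "=", fixed = TRUE)

# Keys become names and values become elements of a named character vector
keys <- vapply(pairs, function(p) p[1], character(1))
values <- vapply(pairs, function(p) p[2], character(1))
record <- setNames(values, keys)
print(record)
```

Only the second piece of each field is kept, so a value that itself contained `=` would be cut off at that character.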

Add a column for genus:

```{r}
raw_data$genus <- str_match(raw_data$name, "^([^ ]+) ")[, 2]
old_data$genus <- str_match(old_data$name, "^([^_]+)_")[, 2]
```
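
A small illustration of the two genus patterns, using made-up names (the real `name` columns are assumed to use a space separator in the new data and an underscore in the old data, as the patterns above imply):

```{r}
library(stringr)

# Hypothetical names: space-separated in the new data, underscore-separated in the old
new_names <- c("Phytophthora infestans", "Pythium")
old_names <- c("Phytophthora_infestans")

# Column 2 of the str_match() result holds the first capture group (the genus)
new_genus <- str_match(new_names, "^([^ ]+) ")[, 2]
old_genus <- str_match(old_names, "^([^_]+)_")[, 2]

print(new_genus)
print(old_genus)
```

Because both patterns require a separator after the genus, a name with no species part (such as `"Pythium"` here) yields `NA` rather than the genus.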

Check that all genera are in the old database so that we can reuse their taxonomy:

```{r}
stopifnot(all(raw_data$genus %in% old_data$genus))
```

Get the higher-level taxonomy for each genus from the old format:

```{r}
parts <- old_data$taxonomy[!duplicated(old_data$genus)] %>%
  str_match(pattern = '(.+;.+;.+;.+;.+;.+);(.[^_]+)')
tax_key <- setNames(parts[,2], parts[,3])
```
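
To make the regular expression above easier to follow, here is a sketch that applies it to two hypothetical taxonomy strings. The rank names and the number of ranks are assumptions for illustration; the strings in the actual release may be structured differently.

```{r}
library(stringr)

# Hypothetical taxonomy strings with six ranks followed by 'Genus_species'
taxonomy <- c(
  "Eukaryota;Stramenopila;Oomycota;Oomycetes;Peronosporales;Peronosporaceae;Phytophthora_infestans",
  "Eukaryota;Stramenopila;Oomycota;Oomycetes;Pythiales;Pythiaceae;Pythium_ultimum"
)

# Group 1: everything before the last ';', group 2: text after it up to the first '_'
parts <- str_match(taxonomy, pattern = "(.+;.+;.+;.+;.+;.+);(.[^_]+)")

# Lookup table: genus name -> classification string
tax_key <- setNames(parts[, 2], parts[, 3])
print(tax_key[["Pythium"]])
```

Since each `.+` is greedy and can itself span semicolons, a string with more semicolons than the pattern expects would shift what the first group captures, so it is worth spot-checking `tax_key` against a few known genera.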

Apply the taxonomy to the new data:

```{r}
raw_data$taxonomy <- paste0(tax_key[raw_data$genus], ';', raw_data$name) %>%
  gsub(pattern = ' ', replacement = '_')
```

Make the FASTA output file:

```{r}
raw_data <- mutate(raw_data, header = paste0('name=', gsub(name, pattern = ' ', replacement = '_'),
                                             '|strain=', "na",
                                             '|ncbi_acc=', "na",
                                             '|ncbi_taxid=', "na",
                                             '|oodb_id=', 9000 + seq_len(nrow(raw_data)),
                                             '|taxonomy=', taxonomy))
paste0('>', raw_data$header, '\n', raw_data$sequence) %>%
  write_lines(file = 'new_seqs_formatted.fasta')
```
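
The output step can be checked on a toy example. The sketch below uses base R and two invented records (the headers and sequences are placeholders, not real data) to show that each record becomes a `>`-prefixed header line followed by one sequence line.

```{r}
# Hypothetical records standing in for raw_data$header and raw_data$sequence
records <- data.frame(
  header = c("name=Genus_one|oodb_id=9001", "name=Genus_two|oodb_id=9002"),
  sequence = c("ACGTACGT", "GGCCTTAA")
)

# Each element holds 'header\nsequence', so it spans two lines once written
fasta_records <- paste0(">", records$header, "\n", records$sequence)

# Write to a temporary file and read it back to inspect the layout
out_file <- tempfile(fileext = ".fasta")
writeLines(fasta_records, out_file)
fasta_lines <- readLines(out_file)
print(fasta_lines)
```

Reading the file back gives alternating header and sequence lines, which is the layout the embedded `\n` in the original chunk relies on.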


786 changes: 786 additions & 0 deletions misc_analyses/convert_new_seqs_to_old_format/new_seqs_formatted.fasta

Large diffs are not rendered by default.

Binary file not shown.
6 changes: 6 additions & 0 deletions notes/database_format_and_update_ideas.Rmd
@@ -1,3 +1,9 @@
---
output:
pdf_document: default
html_document: default
word_document: default
---

## Google Drive folder structure

1,618 changes: 0 additions & 1,618 deletions website/2020-10-22_release_1_rps10.fasta

This file was deleted.
