Simulated data and real data workflow have diverged too far and that affects testing #114

Open
famulare opened this issue Oct 16, 2019 · 1 comment
@famulare
Member

@tinghf alerted me that this block of code breaks on the simulated data because sample isn't a valid column.

# filter out nested PCR targets to retain high-level target only
# Flu A
keepTargetList <- unique(db$sample[db$pathogen %in% c("Flu_A_H1","Flu_A_H3")])
dropTargetList <- unique(db$sample[db$pathogen %in% c("Flu_A_pan")])
dropSampleList <- intersect(dropTargetList,keepTargetList)
db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("Flu_A_pan")))
# enterovirus
keepTargetList <- unique(db$sample[db$pathogen %in% c("EV_D68")])
dropTargetList <- unique(db$sample[db$pathogen %in% c("EV_pan")])
dropSampleList <- intersect(dropTargetList,keepTargetList)
db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("EV_pan")))

The short-term fix is to wrap this block in an if(source == 'production') check, as in:

if(source == 'production'){
  # filter out nested PCR targets to retain high-level target only
  # Flu A
  keepTargetList <- unique(db$sample[db$pathogen %in% c("Flu_A_H1","Flu_A_H3")])
  dropTargetList <- unique(db$sample[db$pathogen %in% c("Flu_A_pan")])
  
  dropSampleList <- intersect(dropTargetList,keepTargetList)
  
  db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("Flu_A_pan")))
  
  # enterovirus
  keepTargetList <- unique(db$sample[db$pathogen %in% c("EV_D68")])
  dropTargetList <- unique(db$sample[db$pathogen %in% c("EV_pan")])
  
  dropSampleList <- intersect(dropTargetList,keepTargetList)
  
  db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("EV_pan")))
}
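An alternative to keying on source is to key on the column itself, so the filter runs wherever sample exists and is skipped wherever it does not. This is a minimal sketch in base R, not the fix adopted in the repository. It assumes db is a data frame with a pathogen column, and the helper name dropNestedTargets and the toy data are hypothetical:

```r
# Hypothetical helper (not in the repository): drop the pan-target rows for
# samples that also have a more specific target, but only when the data
# actually has a `sample` column.
dropNestedTargets <- function(db, panTarget, specificTargets) {
  # Simulated data may lack `sample`; return it untouched in that case.
  if (!("sample" %in% names(db))) {
    return(db)
  }
  keepTargetList <- unique(db$sample[db$pathogen %in% specificTargets])
  dropTargetList <- unique(db$sample[db$pathogen %in% panTarget])
  dropSampleList <- intersect(dropTargetList, keepTargetList)
  db[!(db$sample %in% dropSampleList & db$pathogen %in% panTarget), , drop = FALSE]
}

# Toy data, made up for illustration.
toy <- data.frame(
  sample   = c("s1", "s1", "s2", "s3", "s3"),
  pathogen = c("Flu_A_pan", "Flu_A_H1", "Flu_A_pan", "EV_pan", "EV_D68"),
  stringsAsFactors = FALSE
)
toy <- dropNestedTargets(toy, "Flu_A_pan", c("Flu_A_H1", "Flu_A_H3"))
toy <- dropNestedTargets(toy, "EV_pan", "EV_D68")
print(toy)
```

Sample s2 keeps its Flu_A_pan row because it has no more specific Flu A target. The same helper covers both the Flu A and enterovirus cases, which removes the duplicated block.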

Long term, we should keep the simulated data synchronized with the necessary test cases. You can see the workflow pattern to do that in commits to the simulated-data repo: https://github.com/seattleflu/simulated-data/commits/master.

  • introduce a script that makes a specific format change to the data without breaking other columns (unless this is on purpose!)
  • change the data
  • commit both together explaining the change.
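The steps above could look roughly like the following for the missing sample column. This is only a sketch under stated assumptions: the function name addSampleColumn, the ID format, the one-sample-per-row mapping, and the site column in the toy data are all invented for illustration and do not come from the simulated-data repo.

```r
# Hypothetical format-change script for the simulated data: add a `sample`
# column and confirm every pre-existing column is unchanged.
addSampleColumn <- function(simData) {
  stopifnot(is.data.frame(simData), !("sample" %in% names(simData)))
  out <- simData
  # Assumption: one simulated sample per row; the ID format is invented.
  out$sample <- sprintf("sim_%05d", seq_len(nrow(simData)))
  # Guard against accidentally breaking other columns.
  unchanged <- vapply(
    names(simData),
    function(col) identical(out[[col]], simData[[col]]),
    logical(1)
  )
  stopifnot(all(unchanged))
  out
}

# Toy stand-in for the simulated data (columns are made up).
before <- data.frame(
  pathogen = c("Flu_A_H1", "EV_D68"),
  site     = c("A", "B"),
  stringsAsFactors = FALSE
)
after <- addSampleColumn(before)
print(names(after))
```

Committing a script like this together with the regenerated data keeps the change reproducible and documents why the format moved.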
@famulare famulare added the test label Oct 16, 2019
@tinghf
Collaborator

tinghf commented Nov 14, 2019

In Bamboo, this manifests as an error like the following:

Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
Calls: expandDB ... as.data.frame -> filter -> filter.tbl_df -> filter_impl -> %in%
Execution halted

It fails at the following line in selectFromDB.R:

db <- db %>% filter( !(sample %in% dropSampleList & db$pathogen %in% c("EV_pan")))
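A likely explanation for this particular message: when the data has no sample column, the bare name sample inside filter() is not found in the data and falls through to the base R function sample(), which %in% (via match) cannot handle. The sketch below reproduces the message in base R without dplyr; the simulated data frame and the contents of dropSampleList are stand-ins made up for illustration.

```r
# Stand-in for the simulated data: there is no `sample` column.
simulated <- data.frame(
  pathogen = c("EV_pan", "EV_D68"),
  stringsAsFactors = FALSE
)
dropSampleList <- c("s1")

# With no `sample` column, the name `sample` refers to the base R function.
stopifnot(is.function(sample), !("sample" %in% names(simulated)))

# Passing a function to %in% raises the error reported in the build log.
msg <- tryCatch(
  sample %in% dropSampleList,
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this reading is right, any check that confirms the column exists before filtering would avoid the failure, regardless of the data source.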
