Fixing sequence clipping bug in tokenizer (#46)
justin-barton authored Dec 21, 2023
1 parent 0e48cff commit c92083b
Showing 1 changed file (protein_lm/tokenizer/tokenizer.py) with 2 additions and 0 deletions.
@@ -51,6 +51,8 @@ def batch_encode(
         output = []
         if max_sequence_length is None and return_tensors:
             max_sequence_length = max([len(sequence) for sequence in sequences])
+            if add_special_tokens:
+                max_sequence_length += 2
         if max_sequence_length is not None:
             sequences = [
                 sequence[:(max_sequence_length - 2) if add_special_tokens else max_sequence_length]
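To see why the two added lines matter, here is a minimal sketch of the clipping logic around this fix. It is a simplified, hypothetical stand-in for `batch_encode` (the real method also converts tokens to ids and builds tensors); the function name `clip_sequences` is invented for illustration. Before the fix, `max_sequence_length` was set to the length of the longest input, so slicing to `max_sequence_length - 2` silently dropped the last two tokens of the longest sequence whenever special tokens were added.

```python
def clip_sequences(sequences, max_sequence_length=None,
                   add_special_tokens=True, return_tensors=True):
    """Simplified sketch of the truncation step in batch_encode."""
    if max_sequence_length is None and return_tensors:
        max_sequence_length = max(len(sequence) for sequence in sequences)
        # The fix: reserve room for the two special tokens so the
        # longest sequence is not silently truncated by the slice below.
        if add_special_tokens:
            max_sequence_length += 2
    if max_sequence_length is not None:
        sequences = [
            sequence[:(max_sequence_length - 2) if add_special_tokens
                     else max_sequence_length]
            for sequence in sequences
        ]
    return sequences

# With the fix, the longest sequence survives intact:
print(clip_sequences(["MKV", "MKVLA"]))
```

Without the `+= 2`, the call above would return `['MKV', 'MKV']`: the slice bound would be `5 - 2 = 3`, clipping `"MKVLA"` to its first three characters.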
