Tokenizer Links

Some links and assorted observations on tokenisation, gathered over the past week.
nlp
balochi-language-model
tokenisation
links
Author

Alex Strick van Linschoten

Published

June 4, 2023

This is a collection of links and observations that I came across while learning about tokenisation over the past week, and that would otherwise have no home.

More Questions

And some other questions (beyond my larger questions around how to evaluate tokenisers):

  • How useful (or not) is data augmentation when it comes to training a tokenizer?
  • Is a list of dictionary words useful for training a tokenizer?
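
One consideration for the dictionary question can be sketched in a few lines. Frequency-based trainers such as BPE choose merges by how often adjacent symbols co-occur, and a plain word list carries no frequency information, so every word counts once. The snippet below is only a toy illustration of the first merge step of a character-level BPE. The words and counts are made up for the example, `most_frequent_pair` is a hypothetical helper rather than any library's API, and it doesn't answer the question for a real language.

```python
from collections import Counter


def most_frequent_pair(word_counts):
    """Return the adjacent character pair with the highest count-weighted frequency."""
    pairs = Counter()
    for word, count in word_counts.items():
        # Each adjacent pair in the word is weighted by how often the word occurs.
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += count
    return pairs.most_common(1)[0][0]


# Hypothetical counts, as if taken from running text (invented for illustration).
corpus_counts = {"the": 50, "this": 20, "zebra": 1, "zero": 1, "zest": 1}

# The same vocabulary as a dictionary-style list: every word appears exactly once.
dictionary_counts = {word: 1 for word in corpus_counts}

corpus_first_merge = most_frequent_pair(corpus_counts)
dictionary_first_merge = most_frequent_pair(dictionary_counts)

print("first merge from running text:", corpus_first_merge)
print("first merge from word list:  ", dictionary_first_merge)
```

With these invented counts, the running text favours merging `t` and `h`, while the flat word list favours `z` and `e`, simply because three rare words happen to share that prefix. Whether that difference matters in practice is exactly the open question.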