Diffstat (limited to '.venv/lib/python3.12/site-packages/tokenizers-0.19.0.dist-info/METADATA')
-rw-r--r--  .venv/lib/python3.12/site-packages/tokenizers-0.19.0.dist-info/METADATA  209
1 file changed, 209 insertions(+), 0 deletions(-)
diff --git a/.venv/lib/python3.12/site-packages/tokenizers-0.19.0.dist-info/METADATA b/.venv/lib/python3.12/site-packages/tokenizers-0.19.0.dist-info/METADATA
new file mode 100644
index 00000000..92ed843e
--- /dev/null
+++ b/.venv/lib/python3.12/site-packages/tokenizers-0.19.0.dist-info/METADATA
@@ -0,0 +1,209 @@
+Metadata-Version: 2.3
+Name: tokenizers
+Version: 0.19.0
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Education
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.7
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Dist: huggingface-hub >=0.16.4, <1.0
+Requires-Dist: pytest ; extra == 'testing'
+Requires-Dist: requests ; extra == 'testing'
+Requires-Dist: numpy ; extra == 'testing'
+Requires-Dist: datasets ; extra == 'testing'
+Requires-Dist: black ==22.3 ; extra == 'testing'
+Requires-Dist: ruff ; extra == 'testing'
+Requires-Dist: sphinx ; extra == 'docs'
+Requires-Dist: sphinx-rtd-theme ; extra == 'docs'
+Requires-Dist: setuptools-rust ; extra == 'docs'
+Requires-Dist: tokenizers[testing] ; extra == 'dev'
+Provides-Extra: testing
+Provides-Extra: docs
+Provides-Extra: dev
+Keywords: NLP,tokenizer,BPE,transformer,deep learning
+Author: Anthony MOI <m.anthony.moi@gmail.com>
+Author-email: Nicolas Patry <patry.nicolas@protonmail.com>, Anthony Moi <anthony@huggingface.co>
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+Project-URL: Homepage, https://github.com/huggingface/tokenizers
+Project-URL: Source, https://github.com/huggingface/tokenizers
+
+<p align="center">
+ <br>
+ <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
+ <br>
+</p>
+<p align="center">
+ <a href="https://badge.fury.io/py/tokenizers">
+ <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
+ </a>
+ <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
+ <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
+ </a>
+</p>
+<br>
+
+# Tokenizers
+
+Provides an implementation of today's most used tokenizers, with a focus on performance and
+versatility.
+
+These are bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
+If you are interested in the high-level design, you can check it out there.
+
+Otherwise, let's dive in!
+
+## Main features:
+
+ - Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3
+   most common BPE versions).
+ - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
+   less than 20 seconds to tokenize a GB of text on a server's CPU.
+ - Easy to use, but also extremely versatile.
+ - Designed for research and production.
+ - Normalization comes with alignment tracking. It is always possible to get the part of the
+   original sentence that corresponds to a given token.
+ - Does all the pre-processing: truncates, pads, and adds the special tokens your model needs
+   (see the sketch right after this list).
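+
+The last two points are easy to see in practice. Here is a minimal sketch, assuming the
+`bert-base-cased` tokenizer can be fetched from the Hugging Face Hub (loading from the Hub is
+covered below):
+
+```python
+from tokenizers import Tokenizer
+
+# Assumes network access to the Hugging Face Hub
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+
+# Truncation and padding are part of the pre-processing handled by the tokenizer
+tokenizer.enable_truncation(max_length=16)
+tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=16)
+
+encoding = tokenizer.encode("I can feel the magic, can you?")
+print(encoding.tokens)   # wordpieces, including the special tokens added for the model
+print(encoding.offsets)  # (start, end) character spans into the original sentence
+```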
+
+### Installation
+
+#### With pip:
+
+```bash
+pip install tokenizers
+```
+
+#### From sources:
+
+To use this method, you need to have Rust installed:
+
+```bash
+# Install with:
+curl https://sh.rustup.rs -sSf | sh -s -- -y
+export PATH="$HOME/.cargo/bin:$PATH"
+```
+
+Once Rust is installed, you can compile by running the following:
+
+```bash
+git clone https://github.com/huggingface/tokenizers
+cd tokenizers/bindings/python
+
+# Create a virtual env (you can use yours as well)
+python -m venv .env
+source .env/bin/activate
+
+# Install `tokenizers` in the current virtual env
+pip install -e .
+```
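+
+To check that the build succeeded, you can import the package from the same virtual env. A
+quick sanity check, assuming the editable install above completed without errors:
+
+```python
+import tokenizers
+
+# A freshly built editable install should be importable and report its version
+print(tokenizers.__version__)
+```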
+
+### Load a pretrained tokenizer from the Hub
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_pretrained("bert-base-cased")
+```
+
+### Using the provided Tokenizers
+
+We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
+these using some `vocab.json` and `merges.txt` files:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+vocab = "./path/to/vocab.json"
+merges = "./path/to/merges.txt"
+tokenizer = CharBPETokenizer(vocab, merges)
+
+# And then encode:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+print(encoded.ids)
+print(encoded.tokens)
+```
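+
+Batches of sentences can be encoded in a single call as well. A minimal sketch, reusing the
+same (hypothetical) vocabulary files as above:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Same (hypothetical) vocabulary files as in the example above
+tokenizer = CharBPETokenizer("./path/to/vocab.json", "./path/to/merges.txt")
+
+# Encode several sentences at once
+encodings = tokenizer.encode_batch(["I can feel the magic, can you?", "So magical!"])
+print([encoding.tokens for encoding in encodings])
+```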
+
+And you can train them just as simply:
+
+```python
+from tokenizers import CharBPETokenizer
+
+# Initialize a tokenizer
+tokenizer = CharBPETokenizer()
+
+# Then train it!
+tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
+
+# Now, let's use it:
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# And finally save it somewhere
+tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
+```
+
+#### Provided Tokenizers
+
+ - `CharBPETokenizer`: The original BPE
+ - `ByteLevelBPETokenizer`: The byte level version of the BPE
+ - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
+ - `BertWordPieceTokenizer`: The famous BERT tokenizer, using WordPiece
+
+All of these can be used and trained as explained above!
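+
+For instance, the byte-level variant is trained and saved in exactly the same way. A minimal
+sketch, assuming the same (hypothetical) training files as above:
+
+```python
+from tokenizers import ByteLevelBPETokenizer
+
+# Initialize a byte-level BPE tokenizer
+tokenizer = ByteLevelBPETokenizer()
+
+# Train it on the same (hypothetical) files as above
+tokenizer.train(
+    ["./path/to/files/1.txt", "./path/to/files/2.txt"],
+    vocab_size=20000,
+    min_frequency=2,
+)
+
+# And save the full tokenizer definition to a single JSON file
+tokenizer.save("./path/to/directory/my-byte-level-bpe.tokenizer.json")
+```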
+
+### Build your own
+
+Whenever these provided tokenizers don't give you enough freedom, you can build your own
+tokenizer by putting together all the different parts you need.
+You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and easily adapt them to your own needs.
+
+#### Building a byte-level BPE
+
+Here is an example showing how to build your own byte-level BPE by putting all the different pieces
+together, and then saving it to a single file:
+
+```python
+from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
+
+# Initialize a tokenizer
+tokenizer = Tokenizer(models.BPE())
+
+# Customize pre-tokenization and decoding
+tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
+tokenizer.decoder = decoders.ByteLevel()
+tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
+
+# And then train
+trainer = trainers.BpeTrainer(
+ vocab_size=20000,
+ min_frequency=2,
+ initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
+)
+tokenizer.train([
+ "./path/to/dataset/1.txt",
+ "./path/to/dataset/2.txt",
+ "./path/to/dataset/3.txt"
+], trainer=trainer)
+
+# And save it
+tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
+```
+
+Now, using this tokenizer is as simple as:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+```
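+
+And because a `ByteLevel` decoder was attached before saving, mapping the ids back to text works
+out of the box. A small, self-contained sketch using the file saved above:
+
+```python
+from tokenizers import Tokenizer
+
+tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
+
+encoded = tokenizer.encode("I can feel the magic, can you?")
+
+# The ByteLevel decoder attached earlier turns the ids back into a readable string
+print(tokenizer.decode(encoded.ids))
+```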
+