Hugging Face: Powerful tokenizer API
This post on Hugging Face is based on the official Hugging Face website, and in particular on the Hugging Face course, which explains how to use the library in a friendly way.
In this post, we'll see just how high-level the tokenizer API is. 🤗
1. Multiple sentences
We’ve explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.
The 🤗 Transformers API can handle all of this for us with a high-level function that we’ll dive into here. When you call your tokenizer directly on a sentence, you get back inputs that are ready to pass through your model:
(Once you load the tokenizer that matches your model, it automatically performs every tokenization step the model needs on the raw text.)
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
Here, the model_inputs variable contains everything that’s necessary for a model to operate well. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the tokenizer object.
- Handles single and multiple sequences
- Supports padding, truncation, and attention masking
- Supports diverse return types
- Supports special tokens
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"So have I!"
]
# Will pad the sequences up to the maximum sequence length in the batch
model_inputs = tokenizer(sequences, padding="longest")
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
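Conceptually, padding just appends the pad token ID until every sequence reaches a target length, and marks the padded positions with 0 in the attention mask. A minimal pure-Python sketch of the three strategies above (toy token IDs and an assumed pad ID of 0, not the real tokenizer internals):

```python
# Toy illustration of the padding strategies; PAD_ID = 0 is an assumption
PAD_ID = 0

def pad(batch, strategy="longest", model_max=512, max_length=None):
    """Pad lists of token IDs and build the matching attention masks."""
    if strategy == "longest":
        target = max(len(ids) for ids in batch)
    else:  # "max_length": use max_length if given, else the model max
        target = max_length if max_length is not None else model_max
    input_ids = [ids + [PAD_ID] * (target - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 146, 112, 1396, 102], [101, 1573, 102]]
out = pad(batch, strategy="longest")
# both rows are now as long as the longest sequence, and the attention
# mask is 0 exactly where padding was inserted
```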
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"So have I!"
]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
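Truncation is the mirror operation: anything past the limit is cut off. A toy sketch under the same assumptions as before (the real tokenizer also takes care of keeping the special tokens intact, which this sketch ignores):

```python
def truncate(batch, max_length=512):
    """Cut every token-ID list down to at most max_length items."""
    return [ids[:max_length] for ids in batch]

batch = [[101, 5, 6, 7, 8, 9, 10, 11, 12, 102], [101, 3, 102]]
short = truncate(batch, max_length=8)
# the long sequence is cut to 8 IDs; the short one is unchanged
```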
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"So have I!"
]
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
# Decoding the single-sentence inputs from earlier shows the special
# tokens ([CLS]/[SEP]) that the tokenizer added automatically
model_inputs = tokenizer(sequence)
print(tokenizer.decode(model_inputs["input_ids"]))
--------------------------------------------------
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
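The `[CLS]` and `[SEP]` in the decoded string are the special tokens the tokenizer adds when `add_special_tokens=True` (the default). Conceptually it is just wrapping the token list, as in this toy sketch with BERT-style markers:

```python
def add_special_tokens(tokens, cls="[CLS]", sep="[SEP]"):
    """Wrap a token list with BERT-style [CLS]/[SEP] markers (toy sketch)."""
    return [cls] + tokens + [sep]

wrapped = add_special_tokens(["i", "'", "ve", "been", "waiting"])
# the model sees [CLS] at the start and [SEP] at the end
```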
🔔 How can you find out which parameters exist?
This part is really a Python question. Out of personal curiosity I looked it up: the signature function from the inspect library supports this. Let's jump straight to an example.
from inspect import signature

signature(tokenizer)
---------------------------------------
<Signature (text: Union[str, List[str], List[List[str]]], text_pair: Union[str, List[str], List[List[str]], NoneType] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Union[int, NoneType] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Union[int, NoneType] = None, return_tensors: Union[str, transformers.file_utils.TensorType, NoneType] = None, return_token_type_ids: Union[bool, NoneType] = None, return_attention_mask: Union[bool, NoneType] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs) -> transformers.tokenization_utils_base.BatchEncoding>
# A prettier way to print the parameters, one per line
for param in signature(tokenizer).parameters.values():
    print(param)
-----------------------------------------------------------
text: Union[str, List[str], List[List[str]]]
text_pair: Union[str, List[str], List[List[str]], NoneType] = None
add_special_tokens: bool = True
padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = False
truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False
max_length: Union[int, NoneType] = None
stride: int = 0
is_split_into_words: bool = False
pad_to_multiple_of: Union[int, NoneType] = None
return_tensors: Union[str, transformers.file_utils.TensorType, NoneType] = None
return_token_type_ids: Union[bool, NoneType] = None
return_attention_mask: Union[bool, NoneType] = None
return_overflowing_tokens: bool = False
return_special_tokens_mask: bool = False
return_offsets_mapping: bool = False
return_length: bool = False
verbose: bool = True
**kwargs
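`inspect.signature` works on any Python callable, not just tokenizers, so you can verify the pattern on a plain function of your own:

```python
from inspect import signature

def greet(name: str, excited: bool = False) -> str:
    """A trivial function just to demonstrate inspect.signature."""
    return f"Hello, {name}{'!' if excited else '.'}"

sig = signature(greet)
print(sig)  # parameter names, annotations, and defaults
for param in sig.parameters.values():
    print(param)
```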