Getting Started with Hugging Face
1. Calling the Hugging Face API from your local machine
After logging in, click your avatar and create an Access Token from the Access Tokens page. Enabling all permissions is recommended.
Then you can try a simple call locally:
import requests

# model hosted on the Hugging Face Inference API
API_URL = 'https://api-inference.huggingface.co/models/uer/gpt2-chinese-cluecorpussmall'
API_TOKEN = 'the token obtained above'
headers = {'Authorization': f'Bearer {API_TOKEN}'}

# send the input text and print the model's response
response = requests.post(API_URL, headers=headers, json={'inputs': 'hello, hugging face'})
print(response.json())
Network fluctuations may make the response slow or cause the request to fail outright.
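Since such failures are usually transient, wrapping the request in a simple retry loop helps. Below is a minimal sketch reusing API_URL and headers from above; the retry count, delay, and timeout are arbitrary choices, not part of the original post:

import time
import requests

def query_with_retry(url, headers, payload, retries=3, delay=2):
    # retry a few times to ride out transient network errors or model loading delays
    for attempt in range(retries):
        try:
            response = requests.post(url, headers=headers, json=payload, timeout=30)
            if response.status_code == 200:
                return response.json()
        except requests.RequestException:
            pass  # network hiccup, fall through and retry
        time.sleep(delay)
    raise RuntimeError('request failed after several retries')

print(query_with_retry(API_URL, headers, {'inputs': 'hello, hugging face'}))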
2. Downloading the model and tokenizer locally
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'uer/gpt2-chinese-cluecorpussmall'
cache_dir = 'model/uer/gpt2-chinese-cluecorpussmall'

# download the model weights and tokenizer files into cache_dir
AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
print(f'Everything has been downloaded to {cache_dir}')
Looking at the downloaded files, the model weights are stored as .safetensors (or possibly .pt). config.json holds the model configuration, tokenizer_config.json holds tokenizer settings such as the maximum input length, and vocab.txt defines the vocabulary.
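To see these files for yourself, you can walk the cache directory. This is just a quick inspection sketch using the cache_dir from the step above:

import os

# print every file that was downloaded into the cache directory
for root, _, files in os.walk('model/uer/gpt2-chinese-cluecorpussmall'):
    for name in files:
        print(os.path.join(root, name))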
3. Running a generator with the local model
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# absolute path to the downloaded model
model_dir = r"C:\absolute\path\to\the\model"

model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# build a text-generation pipeline running on CPU
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cpu')
output = generator('hello, i am a', max_length=50, num_return_sequences=1)
print(output)
Note that this absolute path must be the directory that actually contains config.json.
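A quick sanity check before loading can catch a wrong path early. A minimal sketch, reusing the model_dir variable from above:

import os

# from_pretrained needs the directory that holds config.json, not its parent
assert os.path.isfile(os.path.join(model_dir, 'config.json')), \
    f'config.json not found in {model_dir}'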
4. Getting data
URL: https://huggingface.co/datasets
If no cache directory is specified, datasets are stored under ~/.cache/huggingface/datasets by default.
For example:
from datasets import load_dataset

# load the dataset from the Hub
dataset = load_dataset(path='lansinuote/ChnSentiCorp')
print(dataset)

# optionally export a split to CSV (to_csv works on a single split, not the whole DatasetDict)
dataset['train'].to_csv(path_or_buf=r'absolute_path.csv')  # on Linux omit the .csv extension, on Windows add it
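If you want the data somewhere other than the default cache, load_dataset also accepts a cache_dir argument. The path below is just an illustrative placeholder:

from datasets import load_dataset

# store the downloaded dataset under a custom directory instead of ~/.cache
dataset = load_dataset(path='lansinuote/ChnSentiCorp', cache_dir='data/ChnSentiCorp')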
After the download finishes, you can see Hugging Face's own .arrow dataset files.
from datasets import load_dataset, load_from_disk

# online loading (commented out)
# dataset = load_dataset(path='lansinuote/ChnSentiCorp')
# print(dataset)

# load the cached data from disk
dataset = load_from_disk(r'absolute_path')
print(dataset)  # splits: train, validation, test

train_data = dataset['train']
for data in train_data:
    print(data)
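One caveat: load_from_disk expects a directory written by save_to_disk, not the raw download cache. A minimal round-trip sketch, with an assumed illustrative path:

from datasets import load_dataset, load_from_disk

dataset = load_dataset(path='lansinuote/ChnSentiCorp')

# write the dataset in the on-disk format that load_from_disk understands
dataset.save_to_disk('data/ChnSentiCorp_saved')
reloaded = load_from_disk('data/ChnSentiCorp_saved')
print(reloaded)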