Hugging Face released local-gemma, a framework built on top of Transformers and bitsandbytes to run Gemma 2 locally.
It makes it easy to set up a local instance of Gemma 2, with three memory presets that trade off speed and accuracy for memory.
This is achieved simply by combining two techniques for reducing GPU memory consumption (sketched in plain Transformers terms after this list):
- 4-bit quantization with bitsandbytes
- A device map to offload parts of the model to the CPU
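
For reference, here is what these two techniques look like with plain Transformers and bitsandbytes, independent of local-gemma. This is a minimal sketch; the model ID and dtype choice are illustrative:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Technique 1: 4-bit quantization with bitsandbytes (NF4 weights,
# dequantized on the fly at compute time).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Technique 2: device_map="auto" lets Accelerate fill the GPU first and
# offload the remaining layers to CPU RAM when VRAM runs out.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=quantization_config,
    device_map="auto",
)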
Moreover, local-gemma also offers different “mode” presets for inference depending on your target task: “chat”, “factual”, or “creative”.
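
These modes are, in effect, generation presets. As a rough illustration of the idea (the values below are hypothetical, not local-gemma’s actual settings):

# Hypothetical per-mode generation settings; local-gemma's real values may differ.
MODE_KWARGS = {
    "chat": {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "factual": {"do_sample": False},  # greedy decoding for reproducible answers
    "creative": {"do_sample": True, "temperature": 1.0, "top_p": 0.95},
}
# Applied at generation time, e.g.: model.generate(input_ids, **MODE_KWARGS["chat"])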
There’s a CLI, but you might prefer code for more flexibility (code example published by Hugging Face):
from local_gemma import LocalGemma2ForCausalLM
from transformers import AutoTokenizer
model = …
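
The published snippet is truncated above. Based on the project’s README, the full example plausibly continues along these lines; the preset name, prompt, and generation parameters here are assumptions:

from local_gemma import LocalGemma2ForCausalLM
from transformers import AutoTokenizer

# "memory" preset (assumed name): lower GPU memory use at some cost in speed
model = LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-9b-it", preset="memory")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

# Build a chat prompt with the tokenizer's chat template
messages = [{"role": "user", "content": "What is a good recipe for pesto?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])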