
exporter

torch_to_nnef.llm_tract.exporter

LLMExporter

LLMExporter(hf_model_causal: nn.Module, tokenizer: AutoTokenizer, local_dir: T.Optional[Path] = None, force_module_dtype: T.Optional[DtypeStr] = None, force_inputs_dtype: T.Optional[DtypeStr] = None, num_logits_to_keep: int = 1)

Init LLMExporter.

Parameters:

    hf_model_causal (Module, required): any causal model from the transformers library.
    tokenizer (AutoTokenizer, required): any tokenizer from the transformers library.
    local_dir (Optional[Path], default None): if set, the local directory the model was loaded from.
    force_module_dtype (Optional[DtypeStr], default None): force the PyTorch dtype of the model parameters.
    force_inputs_dtype (Optional[DtypeStr], default None): force the PyTorch dtype of the model inputs.
    num_logits_to_keep (int, default 1): number of tokens whose logits are kept (0 keeps all). For classical inference 1 is fine; speculative decoding may need more (typically 2 or 3).
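
A minimal construction sketch based on the signature above; "gpt2" is only a placeholder model slug, and optional dtype arguments are left at their defaults:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    from torch_to_nnef.llm_tract.exporter import LLMExporter

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    exporter = LLMExporter(
        hf_model_causal=model,
        tokenizer=tokenizer,
        num_logits_to_keep=1,  # classical autoregressive decoding
    )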
apply_half_precision_fixes
apply_half_precision_fixes()

Align float dtype arguments in a few graph ops.

All LLMs are trained on GPU/TPU, and the PyTorch backends for those devices support the f16 dtype in operators where the PyTorch CPU inference backend does not (as of 2024-09-09).

To work around this, this CLI monkey-patches a few functional APIs.
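
A sketch of the kind of patch applied; illustrative only, since the exact set of patched functional ops is not listed on this page:

    import torch.nn.functional as F

    _ORIG_LAYER_NORM = F.layer_norm

    def _f32_layer_norm(inp, normalized_shape, weight=None, bias=None, eps=1e-5):
        # Upcast to f32 so a CPU kernel exists, then cast back to the input dtype.
        out = _ORIG_LAYER_NORM(
            inp.float(),
            normalized_shape,
            weight.float() if weight is not None else None,
            bias.float() if bias is not None else None,
            eps,
        )
        return out.to(inp.dtype)

    F.layer_norm = _f32_layer_norm  # what apply_half_precision_fixes() does in spirit
    # reassigning F.layer_norm = _ORIG_LAYER_NORM restores it (see reset_torch_fns)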

check_wrapper_io
check_wrapper_io()

Check that the wrapper gives the same outputs as the vanilla model.

dump
dump(**kwargs)

Prepare and export model to NNEF.

dump_all_io_npz_kind
dump_all_io_npz_kind(io_npz_dirpath: Path, size: int = 6) -> T.List[Path]

Realistic dump of IOs.
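
Hedged usage sketch based on the signature above (the output directory is a placeholder):

    from pathlib import Path

    # Dumps `size` realistic input/output samples as .npz files for later IO checks.
    npz_paths = exporter.dump_all_io_npz_kind(Path("./io_samples"), size=6)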

export_model
export_model(export_dirpath: Path, inference_target: TractNNEF, naming_scheme: VariableNamingScheme = LM_VAR_SCHEME, log_level=logging.INFO, dump_with_tokenizer_and_conf: bool = False, check_inference_modes: bool = True, sample_generation_total_size: int = 0, ignore_already_exist_dir: bool = False, export_dir_struct: ExportDirStruct = ExportDirStruct.DEEP, debug_bundle_path: T.Optional[Path] = None)

Export the model as it currently stands in self.hf_model_causal, and dump some npz tests to check IO later on.
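
Hedged usage sketch; `tract_nnef` stands for a TractNNEF inference target assumed to have been built elsewhere (its construction is not covered on this page), and the export path is a placeholder:

    from pathlib import Path

    exporter.export_model(
        export_dirpath=Path("./exported_llm"),
        inference_target=tract_nnef,
        dump_with_tokenizer_and_conf=True,  # per the flag name: also dump tokenizer/config
        check_inference_modes=True,
    )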

load staticmethod
load(model_slug: T.Optional[str] = None, local_dir: T.Optional[Path] = None, **kwargs)

Load from either a Hugging Face Hub model slug or local_dir.
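
Usage sketch (the slug and path below are placeholders):

    from pathlib import Path

    exporter = LLMExporter.load(model_slug="gpt2")
    # or, from a local checkout:
    exporter = LLMExporter.load(local_dir=Path("/path/to/model_dir"))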

prepare
prepare(compression_method: T.Optional[str] = None, compression_registry: str = DEFAULT_COMPRESSION_REGISTRY, test_display_token_gens: bool = False, wrapper_io_check: bool = True, export_dirpath: T.Optional[Path] = None, log_level: int = logging.INFO)

Prepare the model for export (f16/compression/checks...).
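
Usage sketch; valid compression method names come from the compression registry and are not listed on this page:

    exporter.prepare(
        compression_method=None,        # or a key registered in the compression registry
        wrapper_io_check=True,          # compare wrapper outputs against the vanilla model
        test_display_token_gens=False,
    )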

reset_torch_fns
reset_torch_fns()

Clean up any torch behavior alterations.

StateLessF32LayerNorm

Bases: Module

forward
forward(input: torch.Tensor, normalized_shape: T.List[int], weight: T.Optional[torch.Tensor] = None, bias: T.Optional[torch.Tensor] = None, eps: float = 1e-05)

Upcast and apply layer norm in f32.

This is because f16 LayerNorm is not implemented on CPU in PyTorch (only on GPU) as of torch 2.2.2 (2024-09-10):

RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
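
A minimal sketch of the contract described above, assuming forward simply upcasts, calls F.layer_norm, and casts back (not necessarily the exact implementation):

    import torch.nn.functional as F
    from torch import nn

    class StateLessF32LayerNormSketch(nn.Module):
        def forward(self, input, normalized_shape, weight=None, bias=None, eps=1e-5):
            out = F.layer_norm(
                input.float(),
                normalized_shape,
                weight.float() if weight is not None else None,
                bias.float() if bias is not None else None,
                eps,
            )
            # Cast back so the surrounding f16 graph keeps its dtype.
            return out.to(input.dtype)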

dump_llm

dump_llm(model_slug: T.Optional[str] = None, local_dir: T.Optional[Path] = None, force_module_dtype: T.Optional[DtypeStr] = None, force_inputs_dtype: T.Optional[DtypeStr] = None, merge_peft: T.Optional[bool] = None, num_logits_to_keep: int = 1, device_map: TYPE_OPTIONAL_DEVICE_MAP = None, **kwargs) -> T.Tuple[T.Union[Path, None], LLMExporter]

Utility to export an LLM model.
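
Usage sketch ("gpt2" is a placeholder slug; export-related **kwargs are omitted here):

    from torch_to_nnef.llm_tract.exporter import dump_llm

    export_path, exporter = dump_llm(
        model_slug="gpt2",
        num_logits_to_keep=1,
    )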

find_subdir_with_filename_in

find_subdir_with_filename_in(dirpath: Path, filename: str) -> Path

Find a subdirectory containing filename.
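
A plausible implementation sketch (an assumption based on the signature, not the library's actual code):

    from pathlib import Path

    def find_subdir_with_filename_in(dirpath: Path, filename: str) -> Path:
        matches = sorted(dirpath.rglob(filename))
        if not matches:
            raise FileNotFoundError(f"no '{filename}' found under '{dirpath}'")
        return matches[0].parent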

load_peft_model

load_peft_model(local_dir, kwargs)

Load PEFT-adapted models.

Tries to avoid direct references to the tokenizer object/config to limit the function's dependencies, while also trying to be robust to 'wrong' key/values.
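
Illustrative sketch only, assuming the standard peft API (AutoPeftModelForCausalLM, merge_and_unload); the real helper additionally guards against missing or 'wrong' key/values:

    from peft import AutoPeftModelForCausalLM

    def load_peft_model_sketch(local_dir, merge_adapters: bool = True):
        model = AutoPeftModelForCausalLM.from_pretrained(local_dir)
        if merge_adapters:
            # Fold adapter weights into the base model so the export sees plain weights.
            model = model.merge_and_unload()
        return model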