{"author":"chenyu","author_email":"chenyu@fastmail.com","author_time":1711145737,"commit_time":1711145737,"committer":"GitHub","committer_email":"noreply@github.com","hash":"f7f67e0cc5283c9743d7ce903bfe2d78225c50f5","message":"simple fix llama shard with quantize (#3882)\n\ncopy scale on all device for now. naive sharding does not work because scale needs expand to really save memory.\r\n\r\n70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.\r\n\r\n`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt \"Hello.\" --count 10 --temperature 0 --timing --quantize`\r\n\r\n13B on 6 gpus uses 47 GB v.s. 34 GB quantized","parents":["ee502c805578c2035abffae8d7159788121094b5"],"tree_hash":"080068ba472166ff04fb1ae21cb4e229fe523371"}