{"author":"Yixiang Gao","author_email":"yixiangg310573@gmail.com","author_time":1705019462,"commit_time":1705019462,"committer":"GitHub","committer_email":"noreply@github.com","hash":"13e872b53f467c97a8f505901a7b553c52fa5547","message":"add multigpu support for llama attention (#3064)\n\n* add llama attention test for multigpu\r\n\r\n* test fails\r\n\r\n* kv cache trying to shrink on sharded axis\r\n\r\n* mask None works for scaled dot product\r\n\r\n* kv cache seems to be working but scaled dot product breaks\r\n\r\n* scaled dot product works, but the last linear layer failed\r\n\r\n* running into the reshape case where it could be wrong for multigpu\r\n\r\n* making sure it was the reshape\r\n\r\n* adding contiguous doesn't solve it\r\n\r\n* need to shard more properly\r\n\r\n* remove reshape test\r\n\r\n* minor adjustment to scaled dot product attention test\r\n\r\n* weights are sharded wrong\r\n\r\n* continue fixing new weight sharding\r\n\r\n* clean up\r\n\r\n* fix attention when start_pos is 0\r\n\r\n* remove print\r\n\r\n* add TODOs for the best multigpu interface","parents":["dcf7ecaaffe3c896a9a866083b92eda192dd59a8"],"tree_hash":"5e60fcd3edafd3dbbcd27c4a1b8b5260cc50365e"}