{"author":"David Hou","author_email":"david.hou314@gmail.com","author_time":1710392021,"commit_time":1710392021,"committer":"GitHub","committer_email":"noreply@github.com","hash":"199f7c43424e05be7ff901d3dcff485920ba0fd6","message":"MLPerf Resnet (cleaned up) (#3573)\n\n* this is a lot of stuff\r\n\r\nTEST_TRAIN env for less data\r\n\r\ndon't diskcache get_train_files\r\n\r\ndebug message\r\n\r\nno lr_scaler for fp32\r\n\r\ncomment, typo\r\n\r\ntype stuff\r\n\r\ndon't destructure proc\r\n\r\nmake batchnorm parameters float\r\n\r\nmake batchnorm parameters float\r\n\r\nresnet18, checkpointing\r\n\r\nhack up checkpointing to keep the names in there\r\n\r\noops\r\n\r\nwandb_resume\r\n\r\nlower lr\r\n\r\neval/ckpt use e+1\r\n\r\nlars\r\n\r\nreport top_1_acc\r\n\r\nsome wandb stuff\r\n\r\nsplit fw and bw steps to save memory\r\n\r\noops\r\n\r\nsave model when reach target\r\n\r\nformatting\r\n\r\nmake sgd hparams consistent\r\n\r\njust always write the cats tag...\r\n\r\npass X and Y into backward_step to trigger input replace\r\n\r\nshuffle eval set to fix batchnorm eval\r\n\r\ndataset is sorted by class, so the means and variances are all wrong\r\n\r\nsmall cleanup\r\n\r\nhack restore only one copy of each tensor\r\n\r\ndo bufs from lin after cache check (lru should handle it fine)\r\n\r\nrecord epoch in wandb\r\n\r\nmore digits for topk in eval\r\n\r\nmore env vars\r\n\r\nsmall cleanup\r\n\r\ncleanup hack tricks\r\n\r\ncleanup hack tricks\r\n\r\ndon't save ckpt for testeval\r\n\r\ncleanup\r\n\r\ndiskcache train file glob\r\n\r\nclean up a little\r\n\r\ndevice_str\r\n\r\nSCE into tensor\r\n\r\nsmall\r\n\r\nsmall\r\n\r\nlog_softmax out of resnet.py\r\n\r\noops\r\n\r\nhack :(\r\n\r\ncomments\r\n\r\nHeNormal, track gradient norm\r\n\r\noops\r\n\r\nlog SYNCBN to wandb\r\n\r\nreal truncnorm\r\n\r\nless samples for truncated normal\r\n\r\ncustom init for Linear\r\n\r\nlog layer stats\r\n\r\nsmall\r\n\r\nRevert \"small\"\r\n\r\nThis reverts commit 
988f4c1cf35ca4be6c31facafccdd1e177469f2f.\r\n\r\nRevert \"log layer stats\"\r\n\r\nThis reverts commit 9d9822458524c514939adeee34b88356cd191cb0.\r\n\r\nrename BNSYNC to SYNCBN to be consistent with cifar\r\n\r\noptional TRACK_NORMS\r\n\r\nfix label smoothing :/\r\n\r\nlars skip list\r\n\r\nonly weight decay if not in skip list\r\n\r\ncomment\r\n\r\ndefault 0 TRACK_NORMS\r\n\r\ndon't allocate beam scratch buffers if in cache\r\n\r\nclean up data pipeline, unsplit train/test, put back a hack\r\n\r\nremove print\r\n\r\nrun test_indexing on remu (#3404)\r\n\r\n* emulated ops_hip infra\r\n\r\n* add int4\r\n\r\n* include test_indexing in remu\r\n\r\n* Revert \"Merge branch 'remu-dev-mac'\"\r\n\r\nThis reverts commit 6870457e57dc5fa70169189fd33b24dbbee99c40, reversing\r\nchanges made to 3c4c8c9e16d87b291d05e1cab558124cc339ac46.\r\n\r\nfix bad seeding\r\n\r\nUnsyncBatchNorm2d but with synced trainable weights\r\n\r\nlabel downsample batchnorm in Bottleneck\r\n\r\n:/\r\n\r\n:/\r\n\r\ni mean... it runs... it hits the acc... 
it's fast...\r\n\r\nnew unsyncbatchnorm for resnet\r\n\r\nsmall fix\r\n\r\ndon't do assign buffer reuse for axis change\r\n\r\n* remove changes\r\n\r\n* remove changes\r\n\r\n* move LARS out of tinygrad/\r\n\r\n* rand_truncn rename\r\n\r\n* whitespace\r\n\r\n* stray whitespace\r\n\r\n* no more gnorms\r\n\r\n* delete some dataloading stuff\r\n\r\n* remove comment\r\n\r\n* clean up train script\r\n\r\n* small comments\r\n\r\n* move checkpointing stuff to mlperf helpers\r\n\r\n* if WANDB\r\n\r\n* small comments\r\n\r\n* remove whitespace change\r\n\r\n* new unsynced bn\r\n\r\n* clean up prints / loop vars\r\n\r\n* whitespace\r\n\r\n* undo nn changes\r\n\r\n* clean up loops\r\n\r\n* rearrange getenvs\r\n\r\n* cpu_count()\r\n\r\n* PolynomialLR whitespace\r\n\r\n* move he_normal out\r\n\r\n* cap warmup in polylr\r\n\r\n* rearrange wandb log\r\n\r\n* realize both x and y in data_get\r\n\r\n* use double quotes\r\n\r\n* combine prints in ckpts resume\r\n\r\n* take UBN from cifar\r\n\r\n* running_var\r\n\r\n* whitespace\r\n\r\n* whitespace\r\n\r\n* typo\r\n\r\n* if instead of ternary for resnet downsample\r\n\r\n* clean up dataloader cleanup a little?\r\n\r\n* separate rng for shuffle\r\n\r\n* clean up imports in model_train\r\n\r\n* clean up imports\r\n\r\n* don't realize copyin in data_get\r\n\r\n* remove TESTEVAL (train dataloader didn't get freed every loop)\r\n\r\n* adjust wandb_config entries a little\r\n\r\n* clean up wandb config dict\r\n\r\n* reduce lines\r\n\r\n* whitespace\r\n\r\n* shorter lines\r\n\r\n* put shm unlink back, but it doesn't seem to do anything\r\n\r\n* don't pass seed per task\r\n\r\n* monkeypatch batchnorm\r\n\r\n* the reseed was wrong\r\n\r\n* add epoch number to desc\r\n\r\n* don't use unsyncedbatchnorm if syncbn=1\r\n\r\n* put back downsample name\r\n\r\n* eval every epoch\r\n\r\n* Revert \"the reseed was wrong\"\r\n\r\nThis reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.\r\n\r\n* cast lr in onecycle\r\n\r\n* support fp16\r\n\r\n* cut off 
kernel if expand after reduce\r\n\r\n* test polynomial lr\r\n\r\n* move polynomiallr to examples/mlperf\r\n\r\n* working PolynomialDecayWithWarmup + tests.......\r\n\r\nadd lars_util.py, oops\r\n\r\n* keep lars_util.py as intact as possible, simplify our interface\r\n\r\n* no more half\r\n\r\n* polylr and lars were merged\r\n\r\n* undo search change\r\n\r\n* override Linear init\r\n\r\n* remove half stuff from model_train\r\n\r\n* update scheduler init with new args\r\n\r\n* don't divide by input mean\r\n\r\n* mistake in resnet.py\r\n\r\n* restore whitespace in resnet.py\r\n\r\n* add test_data_parallel_resnet_train_step\r\n\r\n* move initializers out of resnet.py\r\n\r\n* unused imports\r\n\r\n* log_softmax to model output in test to fix precision flakiness\r\n\r\n* log_softmax to model output in test to fix precision flakiness\r\n\r\n* oops, don't realize here\r\n\r\n* is None\r\n\r\n* realize initializations in order for determinism\r\n\r\n* BENCHMARK flag for number of steps\r\n\r\n* add resnet to benchmark.yml\r\n\r\n* return instead of break\r\n\r\n* missing return\r\n\r\n* cpu_count, rearrange benchmark.yml\r\n\r\n* unused variable\r\n\r\n* disable tqdm if BENCHMARK\r\n\r\n* getenv WARMUP_EPOCHS\r\n\r\n* unlink disktensor shm file if exists\r\n\r\n* terminate instead of join\r\n\r\n* properly shut down queues\r\n\r\n* use hip in benchmark for now\r\n\r\n---------\r\n\r\nCo-authored-by: George Hotz <72895+geohot@users.noreply.github.com>","parents":["0f050b10283af66f08e34cc91cc1192f53966883"],"tree_hash":"d1ccd0c3fe3a495bb8cab8013e92786d5ba4c52f"}