{"author":"qazal","author_email":"77887910+Qazalin@users.noreply.github.com","author_time":1701639949,"commit_time":1701639949,"committer":"GitHub","committer_email":"noreply@github.com","hash":"4380ccb1699e32849819f2c043c15579b160de39","message":"Non fp32 math (#2264)\r\n\r\n* `global_load` and `global_store` using buffer dtype\r\n\r\n* `UOps.PHI` in all dtypes\r\n\r\n* `UOps.ALU` in all dtypes\r\n\r\n* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes\r\n\r\n* -- end of implementation --\r\n+tiny lint changes\r\n\r\n* these tests require the fp16 extension\r\n\r\nyou can run them locally to confirm they're green (the GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)):\r\n\r\n`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`\r\n\r\nskip the new test_linearizer_failures in CI GPU because of the fp16 extension\r\n\r\nThis passes on a real GPU since the extension is available:\r\n`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`\r\n\r\nsee CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)\r\n\r\n* these tests fail in CI due to segfaults and CPU crashes\r\n\r\nTo confirm they're green locally, you can run the following commands:\r\n\r\n1. 
For the tests skipped in test_ops.py (note: CLANG is very slow)\r\n\r\n`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`\r\n\r\n2. For the ONNX tests skipped in CLANG:\r\n\r\n```\r\nCLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu 
\\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu 
\\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \\\r\ntest/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu\r\n```\r\n\r\n3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186); I just made it more specific\r\n\r\n`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`\r\n\r\n* Revert \"these tests fail in CI due to segfaults and CPU crashes\"\r\n\r\nThis reverts commit 15db57014381a4449d563526ac6c870e36257658.\r\n\r\n* merge with cleanup-vectorized-hip-renders\r\n\r\n* barely working HIP P1, ALU ops need a refactor?\r\n\r\n* manage the fact that in HIP [half2 is actually an unsigned int vec](https://github.com/argonne-lcf/THAPI/blob/f921880387730e75f04b97a8b07d191a8d28bc28/hip/include/hip/amd_detail/amd_hip_fp16.h#L59) and half is a totally different __half that [has an unsigned int element in it](https://github.com/argonne-lcf/THAPI/blob/f921880387730e75f04b97a8b07d191a8d28bc28/hip/include/hip/amd_detail/amd_hip_fp16.h#L50), which can't be accessed [because it's private](https://github.com/argonne-lcf/THAPI/blob/f921880387730e75f04b97a8b07d191a8d28bc28/hip/include/hip/amd_detail/amd_hip_fp16.h#L86). 
If you just do this:\r\n\r\n```\r\nhalf2 val0 = // ...\r\nhalf val1 = // ...\r\n```\r\nthen you can't do:\r\n```\r\nval0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))\r\n```\r\n\r\n* update the sign definition to avoid division by zero in all dtypes\r\n\r\n* diff cleanup p1: why were these in the diff anyways\r\n\r\n* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!\r\n\r\nadd ALU ops overloads for HIP\r\n\r\nthis will make HIP max work\r\n\r\nhandle mod\r\n\r\nRevert \"handle mod\"\r\n\r\nThis reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.\r\n\r\nupdate max to use hmax\r\n\r\nadd HIP GEP render logic\r\n\r\nenable CIFAR fp16 benchmark\r\n\r\ntest ops for HIP\r\n\r\nback to store as float because this only works for float4 grouping right now\r\n\r\ntest_ops for hip!!\r\n\r\nalways sign\r\n\r\n* back to the sign we had before because we can't do a backward pass on a Less node\r\n\r\n* remove old hacks\r\n\r\nHIP compiling test_ops in CI takes ~9 mins, not doing it for now\r\n\r\nnew HIP ALUs\r\n\r\n* reduce accs done right\r\n\r\n* refactor to function\r\n\r\n* no device hacks\r\n\r\nhacks p2\r\n\r\nthe other way\r\n\r\n* LLVM ALU ops\r\n\r\nhalf, float and double are all float\r\n\r\nupdate max\r\n\r\n* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool\r\n\r\n* cleanup LLVM wrong code\r\n\r\n* dummy change for the CUDA install glitch\r\n\r\n---------\r\n\r\nCo-authored-by: George Hotz <72895+geohot@users.noreply.github.com>","parents":["1ac958a058c8f84dfedc1f75bdafe156c429a824"],"tree_hash":"eccfee1e8afce5399d2024f4f2503b828a1197c5"}